Thinking through how to plot data on a log-scaled y-axis with ggplot2 in R


Hey folks,

One of the benefits of sending out these newsletters and making my YouTube videos is that I get a ton of practice. I can’t emphasize how much practice has paid off in learning to use dplyr, ggplot2, and other packages. Reproducing published figures has really helped me to dive into parts of ggplot2 that I wouldn’t normally use because I make plots that use the features of ggplot2 that I know. By expanding my knowledge of ggplot2, I’m finding that the plots I make from scratch are more varied and sophisticated than they would normally be. I hope I’m not bragging here - my point is that there’s no reason you couldn’t be doing the same thing. Grab your favorite journal, find a figure, think through what’s standard, what’s novel, and how you would approach the figure. Then go through the process of creating it yourself. Finally, change what you don’t like in a new version of the plot. I’d love it if you shared with me any of your (re)creations!


This week, a figure by Alessandro Lo Sciuto and colleagues caught my eye in their paper, “A molecular comparison of [Fe-S] cluster-based homeostasis in Escherichia coli and Pseudomonas aeruginosa”. The paper was published in mBio. The paper looks at the iron limitation physiology of E. coli and P. aeruginosa and how they use iron-sulfur clusters. The figure that I was interested in was Figure 2C

This set of figures describes the sensitivity of P. aeruginosa to stress, specifically antibiotic stress. The wild type strain (PAO1) was sensitive to these drugs. They created a mutant that had the iscU gene deleted but added back with an arabinose-inducible promoter. When there was no arabinose in the media (ΔiscU ParaiscU) the strain was also sensitive. But when they added arabinose to the media, the strain was resistant (ΔiscU ParaiscU+ (+0.5% ARA)). That’s enough biology :)

What type of figure is this? What do you think is interesting about how it was created? Is there any overlap with what these authors did with what you try to do with your figures? How much of this figure do you think you could figure out on your own? What would you be interested in learning to do?

For discussion, let’s assume the data comes as a data frame with a column for the name of the drug, the strain, the replicate, the time, and the density. At the end of this email is some simulated code to get you thinking.

First off, these four plots are all line plots with time in hours across the x-axis and cell density (CFU/mL) on the y-axis. They tested four antibiotics including gentamicin, ofloxacin, meropenem, and colistin. The legend tells us that the experiments were done in triplicate and that the mean is presented with error bars representing the standard deviation. For each plot in this panel, I’d create the general appearance using geom_line() and geom_errorbar() to generate the lines and error bars. I’d probably need to start by using dplyr’s summarize() function grouping the data by drug, strain, and time. To get geom_errorbar() to work, I’ll need to add the ylim and ymax aesthetics to the aes() function.

Second, I notice that the y-axis is on a log-10 scale. We can get a log scale by using scale_y_log10. Something that is sure to throw a wrench in this is that we have zero value at later time points for drugs like gentamicin. Try running log10(0). You’ll get the value of negative infinity. You’ll get a scary warning message and those points will be removed from the plot. My solution to this is either to add a small value to the data in density or add the small value to the mean value from the summarize function. The number will be so small that it won’t make a difference to big values like 500 or 5,000,000. This is called a “pseudocount”. Fiddle with the value of the pseudocount to get the warnings to go away. Be sure to use a seed for your random number generator so you don’t end up chasing your tail.

Third, normally I would draw a line to indicate where the limit of detection was using geom_hline(). In this case, I suspect the limit of detection was around 500. Sometimes people will code this as 490 (just under 500) or 0 like I have. Regardless, we’ll want to set the limits on our y-axis. If we set them in scale_y_log10() then we’ll get another scary warning message about data getting removed. That’s because the scale functions remove the data outside the limit before plotting it. You’ll also lose the segment of the line that goes from above to below the limit of detection. The better alternative is to use the ylim argument in coord_cartesian().

Fourth, sticking with that y-axis, I noticed that the y-axis labels are 5 times ten to a power (e.g., 5x106) rather than the typical 1 times ten to a power (e.g., 106). They also have minor tick marks. The minor tick marks are a common feature in log scaled axes, but I realized I rarely see these in R plots. They certainly haven’t ever been on any of my plots. We can adjust both of these in scale_y_log10(). We can use the breaks argument to show 5 times ten to a power and the labels argument to style it as these authors did. We’ll also set axis.text.y to ggtext’s element_markdown() to get the nice superscript font for the power. For those minor tick marks we’ll have to feed scale_y_log10() the values to the minor_breaks argument. This will get us the grid lines, which we’ll want to remove in theme(), but not the tick marks. For those, we’ll need to add guide=guide_axis(minor.ticks = TRUE) - this is a relatively new feature in ggplot2. I’m really not sure why they are “counting by 5s” and would have expected the tick marks to be a bit more screwy than they are. Hmmm.

Finally, they appear to have generated these plots as four separate plots. This gets us redundant x and y-axis titles and text. Regardless, we can use our friend the patchwork package to assemble the four plots into a single figure. It is interesting that they have a common legend for the four plots at the bottom of the figure. I know there’s a vignette on the patchwork website showing how to share legends across figures. I’d check that out to duplicate their legend.

Of course, there’s a number of small details that I’m skipping over here. Things like how they have a greek letter, subscripts, italics in their strain names or how they title of each plot is set off to the right side of the plot. All of these things can be modified adjusting the theme() function.

library(tidyverse)

set.seed(19760620)

tibble(
drug =rep(c("Gentamicin","Ofloxacin","Meromenem","Colistin"), each =36),
strain =rep(rep(c("PAO1","mutant","mutant_arabinose"), each =12), times =4),rep=rep(1:3, times =48),
time =rep(rep(c(0,1,2,3), each =3), times =12))%>%
mutate(density =c(rnorm(3, mean =5e6, sd =1e6),
#Gentamicin
rnorm(3, mean =1e3, sd =2e2),c(0,0,0,0,0,0),
rnorm(12, mean =5e6, sd =1e6),
rnorm(3, mean =5e6, sd =1e6),
rnorm(3, mean =1e3, sd =5e2),c(0,0,0,0,0,0),

rnorm(3, mean =5e6, sd =1e6),
#Ofloxacin
rnorm(3, mean =1e3, sd =2e2),c(0,0,0,0,0,0),
rnorm(12, mean =5e6, sd =1e6),
rnorm(3, mean =5e6, sd =1e6),c(0,0,0,0,0,0,0,0,0),

rnorm(3, mean =5e6, sd =1e6),
#Meromenem
rnorm(3, mean =5e5, sd =1e5),
rnorm(3, mean =1e5, sd =2e4),
rnorm(3, mean =6e4, sd =1e4),
rnorm(12, mean =5e6, sd =1e6),
rnorm(3, mean =5e6, sd =1e6),
rnorm(3, mean =5e5, sd =1e5),
rnorm(3, mean =1e5, sd =2e4),
rnorm(3, mean =6e4, sd =1e4),

rnorm(3, mean =5e6, sd =1e6),
#Colistin
rnorm(3, mean =5e3, sd =1e3),
rnorm(3, mean =1e3, sd =2e2),
rnorm(3, mean =8e2, sd =1e2),
rnorm(12, mean =5e6, sd =1e6),
rnorm(3, mean =5e6, sd =1e6),
rnorm(3, mean =5e3, sd =1e3),
rnorm(3, mean =1e3, sd =2e2),
rnorm(3, mean =8e2, sd =1e2)))

Workshops

I'm pleased to be able to offer you one of three recent workshops! With each you'll get access to 18 hours of video content, my code, and other materials. Click the buttons below to learn more

In case you missed it…

Here are some videos that I published this week that relate to previous content from these newsletters. Enjoy!

video previewvideo preview

Finally, if you would like to support the Riffomonas project financially, please consider becoming a patron through Patreon! There are multiple tiers and fun gifts for each. By no means do I expect people to become patrons, but if you need to be asked, there you go :)

I’ll talk to you more next week!

Pat

Riffomonas Professional Development

Read more from Riffomonas Professional Development

Hey folks, I hope you have enjoyed the current series of newsletters and videos recreating “data portraits” from the WEB DuBois collection of visuals he showed at the 1900 Paris Exhibition. You can find the entire collection of “data portraits” in a book assembled by Whitney Battle-Baptiste and Britt Rusert (here) or as a collection of plates through the Library of Congress (here). I’ve really appreciated the positive feedback! These figures are pretty different from what we do in modern data...

Hey folks, I hope you enjoyed thinking last week about how you would recreate Plate 12 from the WEB DuBois collection of visuals he showed at the 1900 Paris Exhibition using ggplot2 and related R tools. You can find the entire collection of “data portraits” in a book assembled by Whitney Battle-Baptiste and Britt Rusert (here) or as a collection of plates through the Library of Congress (here). I won’t reshare all the resources describing the collection, but do encourage you to check out last...

Hey folks, I’m at the end of a day after I pulled an all-nighter trying to hit a grant proposal deadline. I don’t recall ever doing this in college. I seem to pull an all-nighter every five years or so. I’m too old for this! Anyway, the proposal is in and now I’m ready to move on to fun things… like talking to you about visualizing data! A few years back Whitney Battle-Baptiste and Britt Rusert put together an amazing collection of visualizations by WEB DuBois that he presented at the 1900...