Thinking through how to plot data on a log-scaled y-axis with ggplot2 in R


Hey folks,

One of the benefits of sending out these newsletters and making my YouTube videos is that I get a ton of practice. I can’t emphasize how much practice has paid off in learning to use dplyr, ggplot2, and other packages. Reproducing published figures has really helped me to dive into parts of ggplot2 that I wouldn’t normally use because I make plots that use the features of ggplot2 that I know. By expanding my knowledge of ggplot2, I’m finding that the plots I make from scratch are more varied and sophisticated than they would normally be. I hope I’m not bragging here - my point is that there’s no reason you couldn’t be doing the same thing. Grab your favorite journal, find a figure, think through what’s standard, what’s novel, and how you would approach the figure. Then go through the process of creating it yourself. Finally, change what you don’t like in a new version of the plot. I’d love it if you shared with me any of your (re)creations!


This week, a figure by Alessandro Lo Sciuto and colleagues caught my eye in their paper, “A molecular comparison of [Fe-S] cluster-based homeostasis in Escherichia coli and Pseudomonas aeruginosa”. The paper was published in mBio. The paper looks at the iron limitation physiology of E. coli and P. aeruginosa and how they use iron-sulfur clusters. The figure that I was interested in was Figure 2C

This set of figures describes the sensitivity of P. aeruginosa to stress, specifically antibiotic stress. The wild type strain (PAO1) was sensitive to these drugs. They created a mutant that had the iscU gene deleted but added back with an arabinose-inducible promoter. When there was no arabinose in the media (ΔiscU ParaiscU) the strain was also sensitive. But when they added arabinose to the media, the strain was resistant (ΔiscU ParaiscU+ (+0.5% ARA)). That’s enough biology :)

What type of figure is this? What do you think is interesting about how it was created? Is there any overlap with what these authors did with what you try to do with your figures? How much of this figure do you think you could figure out on your own? What would you be interested in learning to do?

For discussion, let’s assume the data comes as a data frame with a column for the name of the drug, the strain, the replicate, the time, and the density. At the end of this email is some simulated code to get you thinking.

First off, these four plots are all line plots with time in hours across the x-axis and cell density (CFU/mL) on the y-axis. They tested four antibiotics including gentamicin, ofloxacin, meropenem, and colistin. The legend tells us that the experiments were done in triplicate and that the mean is presented with error bars representing the standard deviation. For each plot in this panel, I’d create the general appearance using geom_line() and geom_errorbar() to generate the lines and error bars. I’d probably need to start by using dplyr’s summarize() function grouping the data by drug, strain, and time. To get geom_errorbar() to work, I’ll need to add the ylim and ymax aesthetics to the aes() function.

Second, I notice that the y-axis is on a log-10 scale. We can get a log scale by using scale_y_log10. Something that is sure to throw a wrench in this is that we have zero value at later time points for drugs like gentamicin. Try running log10(0). You’ll get the value of negative infinity. You’ll get a scary warning message and those points will be removed from the plot. My solution to this is either to add a small value to the data in density or add the small value to the mean value from the summarize function. The number will be so small that it won’t make a difference to big values like 500 or 5,000,000. This is called a “pseudocount”. Fiddle with the value of the pseudocount to get the warnings to go away. Be sure to use a seed for your random number generator so you don’t end up chasing your tail.

Third, normally I would draw a line to indicate where the limit of detection was using geom_hline(). In this case, I suspect the limit of detection was around 500. Sometimes people will code this as 490 (just under 500) or 0 like I have. Regardless, we’ll want to set the limits on our y-axis. If we set them in scale_y_log10() then we’ll get another scary warning message about data getting removed. That’s because the scale functions remove the data outside the limit before plotting it. You’ll also lose the segment of the line that goes from above to below the limit of detection. The better alternative is to use the ylim argument in coord_cartesian().

Fourth, sticking with that y-axis, I noticed that the y-axis labels are 5 times ten to a power (e.g., 5x106) rather than the typical 1 times ten to a power (e.g., 106). They also have minor tick marks. The minor tick marks are a common feature in log scaled axes, but I realized I rarely see these in R plots. They certainly haven’t ever been on any of my plots. We can adjust both of these in scale_y_log10(). We can use the breaks argument to show 5 times ten to a power and the labels argument to style it as these authors did. We’ll also set axis.text.y to ggtext’s element_markdown() to get the nice superscript font for the power. For those minor tick marks we’ll have to feed scale_y_log10() the values to the minor_breaks argument. This will get us the grid lines, which we’ll want to remove in theme(), but not the tick marks. For those, we’ll need to add guide=guide_axis(minor.ticks = TRUE) - this is a relatively new feature in ggplot2. I’m really not sure why they are “counting by 5s” and would have expected the tick marks to be a bit more screwy than they are. Hmmm.

Finally, they appear to have generated these plots as four separate plots. This gets us redundant x and y-axis titles and text. Regardless, we can use our friend the patchwork package to assemble the four plots into a single figure. It is interesting that they have a common legend for the four plots at the bottom of the figure. I know there’s a vignette on the patchwork website showing how to share legends across figures. I’d check that out to duplicate their legend.

Of course, there’s a number of small details that I’m skipping over here. Things like how they have a greek letter, subscripts, italics in their strain names or how they title of each plot is set off to the right side of the plot. All of these things can be modified adjusting the theme() function.

library(tidyverse)

set.seed(19760620)

tibble(
drug =rep(c("Gentamicin","Ofloxacin","Meromenem","Colistin"), each =36),
strain =rep(rep(c("PAO1","mutant","mutant_arabinose"), each =12), times =4),rep=rep(1:3, times =48),
time =rep(rep(c(0,1,2,3), each =3), times =12))%>%
mutate(density =c(rnorm(3, mean =5e6, sd =1e6),
#Gentamicin
rnorm(3, mean =1e3, sd =2e2),c(0,0,0,0,0,0),
rnorm(12, mean =5e6, sd =1e6),
rnorm(3, mean =5e6, sd =1e6),
rnorm(3, mean =1e3, sd =5e2),c(0,0,0,0,0,0),

rnorm(3, mean =5e6, sd =1e6),
#Ofloxacin
rnorm(3, mean =1e3, sd =2e2),c(0,0,0,0,0,0),
rnorm(12, mean =5e6, sd =1e6),
rnorm(3, mean =5e6, sd =1e6),c(0,0,0,0,0,0,0,0,0),

rnorm(3, mean =5e6, sd =1e6),
#Meromenem
rnorm(3, mean =5e5, sd =1e5),
rnorm(3, mean =1e5, sd =2e4),
rnorm(3, mean =6e4, sd =1e4),
rnorm(12, mean =5e6, sd =1e6),
rnorm(3, mean =5e6, sd =1e6),
rnorm(3, mean =5e5, sd =1e5),
rnorm(3, mean =1e5, sd =2e4),
rnorm(3, mean =6e4, sd =1e4),

rnorm(3, mean =5e6, sd =1e6),
#Colistin
rnorm(3, mean =5e3, sd =1e3),
rnorm(3, mean =1e3, sd =2e2),
rnorm(3, mean =8e2, sd =1e2),
rnorm(12, mean =5e6, sd =1e6),
rnorm(3, mean =5e6, sd =1e6),
rnorm(3, mean =5e3, sd =1e3),
rnorm(3, mean =1e3, sd =2e2),
rnorm(3, mean =8e2, sd =1e2)))

Workshops

I'm pleased to be able to offer you one of three recent workshops! With each you'll get access to 18 hours of video content, my code, and other materials. Click the buttons below to learn more

In case you missed it…

Here are some videos that I published this week that relate to previous content from these newsletters. Enjoy!

video previewvideo preview

Finally, if you would like to support the Riffomonas project financially, please consider becoming a patron through Patreon! There are multiple tiers and fun gifts for each. By no means do I expect people to become patrons, but if you need to be asked, there you go :)

I’ll talk to you more next week!

Pat

Riffomonas Professional Development

Read more from Riffomonas Professional Development

In case you missed it, I have nine kids ranging in age from 23 to 7 that my wife homeschools. They’re a riot. Each of them has to find a way to be different from all of the others. This makes for some real characters. Let me introduce you to Peter. This week, Peter, who is 11, has been working on a times table. You may remember these from when you were a kid. Say you want to know what 7 times 8 is (this was always my hardest “times” to remember). You take your finger down the rows to the...

Hey folks, I’m really enjoying sharing with you my 30,000 foot view of how I would go about making figures that I find in the “wild”. Following up on these emails with a couple of related YouTube videos has been a lot of fun for me. Of course if you find any figures you like, send them my way - I love seeing what interests you all. I was reminded recently though that not everyone feels enough confidence with their R and tidyverse skills to keep up. Sorry! Towards the bottom of this email I...

Hey folks, We’re still slogging our way through Thanksgiving leftovers. As time passes from last Thursday, there’s a fine line between setting a good example about not wasting food and setting a bad example by getting every food poisoning! Speaking of eating, our teeth are pretty important, don’t you think? In the US, Trump’s expected head for the Department of Health and Human Services has a number of interesting views about health. One example is that its a bad idea to spike our drinking...