Thinking through how to plot data on a log-scaled y-axis with ggplot2 in R


Hey folks,

One of the benefits of sending out these newsletters and making my YouTube videos is that I get a ton of practice. I can’t emphasize how much practice has paid off in learning to use dplyr, ggplot2, and other packages. Reproducing published figures has really helped me to dive into parts of ggplot2 that I wouldn’t normally use because I make plots that use the features of ggplot2 that I know. By expanding my knowledge of ggplot2, I’m finding that the plots I make from scratch are more varied and sophisticated than they would normally be. I hope I’m not bragging here - my point is that there’s no reason you couldn’t be doing the same thing. Grab your favorite journal, find a figure, think through what’s standard, what’s novel, and how you would approach the figure. Then go through the process of creating it yourself. Finally, change what you don’t like in a new version of the plot. I’d love it if you shared with me any of your (re)creations!


This week, a figure by Alessandro Lo Sciuto and colleagues caught my eye in their paper, “A molecular comparison of [Fe-S] cluster-based homeostasis in Escherichia coli and Pseudomonas aeruginosa”. The paper was published in mBio. The paper looks at the iron limitation physiology of E. coli and P. aeruginosa and how they use iron-sulfur clusters. The figure that I was interested in was Figure 2C

This set of figures describes the sensitivity of P. aeruginosa to stress, specifically antibiotic stress. The wild type strain (PAO1) was sensitive to these drugs. They created a mutant that had the iscU gene deleted but added back with an arabinose-inducible promoter. When there was no arabinose in the media (ΔiscU ParaiscU) the strain was also sensitive. But when they added arabinose to the media, the strain was resistant (ΔiscU ParaiscU+ (+0.5% ARA)). That’s enough biology :)

What type of figure is this? What do you think is interesting about how it was created? Is there any overlap with what these authors did with what you try to do with your figures? How much of this figure do you think you could figure out on your own? What would you be interested in learning to do?

For discussion, let’s assume the data comes as a data frame with a column for the name of the drug, the strain, the replicate, the time, and the density. At the end of this email is some simulated code to get you thinking.

First off, these four plots are all line plots with time in hours across the x-axis and cell density (CFU/mL) on the y-axis. They tested four antibiotics including gentamicin, ofloxacin, meropenem, and colistin. The legend tells us that the experiments were done in triplicate and that the mean is presented with error bars representing the standard deviation. For each plot in this panel, I’d create the general appearance using geom_line() and geom_errorbar() to generate the lines and error bars. I’d probably need to start by using dplyr’s summarize() function grouping the data by drug, strain, and time. To get geom_errorbar() to work, I’ll need to add the ylim and ymax aesthetics to the aes() function.

Second, I notice that the y-axis is on a log-10 scale. We can get a log scale by using scale_y_log10. Something that is sure to throw a wrench in this is that we have zero value at later time points for drugs like gentamicin. Try running log10(0). You’ll get the value of negative infinity. You’ll get a scary warning message and those points will be removed from the plot. My solution to this is either to add a small value to the data in density or add the small value to the mean value from the summarize function. The number will be so small that it won’t make a difference to big values like 500 or 5,000,000. This is called a “pseudocount”. Fiddle with the value of the pseudocount to get the warnings to go away. Be sure to use a seed for your random number generator so you don’t end up chasing your tail.

Third, normally I would draw a line to indicate where the limit of detection was using geom_hline(). In this case, I suspect the limit of detection was around 500. Sometimes people will code this as 490 (just under 500) or 0 like I have. Regardless, we’ll want to set the limits on our y-axis. If we set them in scale_y_log10() then we’ll get another scary warning message about data getting removed. That’s because the scale functions remove the data outside the limit before plotting it. You’ll also lose the segment of the line that goes from above to below the limit of detection. The better alternative is to use the ylim argument in coord_cartesian().

Fourth, sticking with that y-axis, I noticed that the y-axis labels are 5 times ten to a power (e.g., 5x106) rather than the typical 1 times ten to a power (e.g., 106). They also have minor tick marks. The minor tick marks are a common feature in log scaled axes, but I realized I rarely see these in R plots. They certainly haven’t ever been on any of my plots. We can adjust both of these in scale_y_log10(). We can use the breaks argument to show 5 times ten to a power and the labels argument to style it as these authors did. We’ll also set axis.text.y to ggtext’s element_markdown() to get the nice superscript font for the power. For those minor tick marks we’ll have to feed scale_y_log10() the values to the minor_breaks argument. This will get us the grid lines, which we’ll want to remove in theme(), but not the tick marks. For those, we’ll need to add guide=guide_axis(minor.ticks = TRUE) - this is a relatively new feature in ggplot2. I’m really not sure why they are “counting by 5s” and would have expected the tick marks to be a bit more screwy than they are. Hmmm.

Finally, they appear to have generated these plots as four separate plots. This gets us redundant x and y-axis titles and text. Regardless, we can use our friend the patchwork package to assemble the four plots into a single figure. It is interesting that they have a common legend for the four plots at the bottom of the figure. I know there’s a vignette on the patchwork website showing how to share legends across figures. I’d check that out to duplicate their legend.

Of course, there’s a number of small details that I’m skipping over here. Things like how they have a greek letter, subscripts, italics in their strain names or how they title of each plot is set off to the right side of the plot. All of these things can be modified adjusting the theme() function.

library(tidyverse)

set.seed(19760620)

tibble(
drug =rep(c("Gentamicin","Ofloxacin","Meromenem","Colistin"), each =36),
strain =rep(rep(c("PAO1","mutant","mutant_arabinose"), each =12), times =4),rep=rep(1:3, times =48),
time =rep(rep(c(0,1,2,3), each =3), times =12))%>%
mutate(density =c(rnorm(3, mean =5e6, sd =1e6),
#Gentamicin
rnorm(3, mean =1e3, sd =2e2),c(0,0,0,0,0,0),
rnorm(12, mean =5e6, sd =1e6),
rnorm(3, mean =5e6, sd =1e6),
rnorm(3, mean =1e3, sd =5e2),c(0,0,0,0,0,0),

rnorm(3, mean =5e6, sd =1e6),
#Ofloxacin
rnorm(3, mean =1e3, sd =2e2),c(0,0,0,0,0,0),
rnorm(12, mean =5e6, sd =1e6),
rnorm(3, mean =5e6, sd =1e6),c(0,0,0,0,0,0,0,0,0),

rnorm(3, mean =5e6, sd =1e6),
#Meromenem
rnorm(3, mean =5e5, sd =1e5),
rnorm(3, mean =1e5, sd =2e4),
rnorm(3, mean =6e4, sd =1e4),
rnorm(12, mean =5e6, sd =1e6),
rnorm(3, mean =5e6, sd =1e6),
rnorm(3, mean =5e5, sd =1e5),
rnorm(3, mean =1e5, sd =2e4),
rnorm(3, mean =6e4, sd =1e4),

rnorm(3, mean =5e6, sd =1e6),
#Colistin
rnorm(3, mean =5e3, sd =1e3),
rnorm(3, mean =1e3, sd =2e2),
rnorm(3, mean =8e2, sd =1e2),
rnorm(12, mean =5e6, sd =1e6),
rnorm(3, mean =5e6, sd =1e6),
rnorm(3, mean =5e3, sd =1e3),
rnorm(3, mean =1e3, sd =2e2),
rnorm(3, mean =8e2, sd =1e2)))

Workshops

I'm pleased to be able to offer you one of three recent workshops! With each you'll get access to 18 hours of video content, my code, and other materials. Click the buttons below to learn more

In case you missed it…

Here are some videos that I published this week that relate to previous content from these newsletters. Enjoy!

video previewvideo preview

Finally, if you would like to support the Riffomonas project financially, please consider becoming a patron through Patreon! There are multiple tiers and fun gifts for each. By no means do I expect people to become patrons, but if you need to be asked, there you go :)

I’ll talk to you more next week!

Pat

Riffomonas Professional Development

Read more from Riffomonas Professional Development
man floating holding on orange stick white people watching on the street

Hey folks, I have long since given up trying to anticipate what types of videos will resonate with people on YouTube. One of my most popular videos shows people how to make stacked bar plots. Throughout it, I tell people that these are a horrible way to visualize data. It’s my third most viewed video. I thought a video on slope plots would be popular. Nope. People panned last week’s episode. But Venn diagrams - holy cats! People are really geeking out about this week’s episodes on Venn...

Hey folks, I’m really grateful for the people who have emailed me recently to thank me for making the recreation and makeover videos. I’ve been excited to see the types of figures some of you are trying to make. It’s really been a great part of this work for me. Thank you! Eric Hill is a loyal Riffomonas Channel viewer who recently sent me an animation he made using the p5.js platform. The animation shows his son’s performance relative to other runners in the prestigious Nike Cross Nationals...

Hey folks, I hope you’re enjoying my new approach of integrating the newsletter with my YouTube videos. The feedback I’ve gotten has been very positive. Thank you! I’d love it if you were to reply to this email with a link to the most recent figure you found in your reading of the literature or popular media. This week, I’m sharing with you Figure 5D from a paper recently published in mSystems by Charlie Bayne and colleagues where they looked at the effect of interactions between tryptophan...