Reverse engineering a dilution series of box plots


Hey folks,

I hope you’re enjoying my new approach of integrating the newsletter with my YouTube videos. The feedback I’ve gotten has been very positive. Thank you! I’d love it if you were to reply to this email with a link to the most recent figure you found in your reading of the literature or popular media.

This week, I’m sharing with you Figure 5D from a paper recently published in mSystems by Charlie Bayne and colleagues where they looked at the effect of interactions between tryptophan and copper on the toxicity of colibactin. This toxin is produced by a strain of E. coli that has been associated with colorectal cancer. This specific panel uses a fluorescence-based assay to show that the ClbP enzyme is inhibited by increasing concentrations of copper; I think the 7H4M is a control to see if copper affects fluorescence on its own.

Anyway, I want to encourage you to ask some questions about any plot you find to help you develop your taste and think through how you would recreate elements of a plot. What type of plot is this? Aside from the data story, what is interesting about this figure? What do you like about it? What don’t you like about it? Can you outline the steps you would take to generate the figure? What are some of the steps you aren’t sure about and would like to learn?

First off, the figure is made up of box plots for two treatments depicting the amount of fluorescence at different dilutions of copper. I think this plot was made in R because of the styling of the legend and the other figures in the paper. It appears to me that the box plots are evenly spaced, which suggests that the authors didn’t map the copper concentration to the x-axis and then dodge the box plots by treatment. I’d likely recreate that spacing by creating a column of concentration-treatment combinations and mapping that to the x-aesthetic and the percent fluorescence to the y-aesthetic. I’d also map the treatment to the color of the box plot.
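Here's a minimal sketch of that idea using the practice data from the end of this newsletter. The `combo` column name is made up for illustration, and `copper` is kept numeric here (rather than character) so the combinations sort by concentration:

```r
library(dplyr)
library(ggplot2)

# Practice data from this newsletter, with copper kept numeric (an assumption)
cu_fluor <- tibble(
  treatment = rep(c("7H4M", "ClbP-17"), each = 21),
  copper = rep(rep(c(0, 0.003, 0.01, 0.03, 0.3, 30, 300), each = 3), 2),
  fluorescence = c(101, 100, 99, 100, 98, 97, 101, 100, 95, 98, 97, 99,
                   100, 98, 97, 98, 97, 96, 98, 97, 95,
                   102, 101, 98, 92, 88, 85, 88, 85, 85, 85, 85, 84,
                   71, 65, 64, 63, 60, 58, 63, 62, 61)
)

# One x-axis level per concentration-treatment pair, ordered by
# concentration and then treatment so each pair sits side by side
cu_combo <- cu_fluor %>%
  arrange(copper, treatment) %>%
  mutate(combo = paste(copper, treatment, sep = "_"),
         combo = factor(combo, levels = unique(combo)))

# Evenly spaced boxes: x is the combination, color is the treatment
p <- ggplot(cu_combo, aes(x = combo, y = fluorescence, color = treatment)) +
  geom_boxplot()

p
```

Because every combination is its own level on a discrete axis, the boxes land at equal intervals without any dodging.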

Second, assuming I’m correct about how they fashioned the x-axis, they likely treated each concentration-treatment combination as a unique treatment. They then re-labelled the x-axis with the concentration. I think I would do this with scale_x_continuous() or scale_x_discrete(). Two other things stand out to me about the x-axis. First, the x-axis title, “Cu”, is in line with the axis text, but is in the lower left corner of the figure. I’d likely do this using annotate() and setting clip = "off" in coord_cartesian(), or I’d use the caption or y argument in labs() and then modify some theme() arguments including the margin argument of element_text() to move the label to the desired location. As I think about it, I think I’d prefer the annotate() approach. Second, the tick marks are between the concentrations rather than centered on each concentration. Since I’ve seen this in a few figures lately, I’m starting to think this is an increasingly common approach to placing tick marks! As I’ve done this in the past, I’d again use annotate() and clip = "off" to draw segments between the pairs of box plots. It also looks like they made the x- and y-axis ticks thicker than normal, so we’d want to modify the linewidth argument in annotate() and element_line(). Isn’t it cool how we can recycle concepts to get different effects?!
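A rough sketch of the annotate() route, taking the scale_x_continuous() option so that one label can sit under each pair of boxes. The `x_pos` column, the axis limits, and the tick and label coordinates are all guesses to tune by eye, not values taken from the paper, and `copper` is kept numeric as an assumption:

```r
library(dplyr)
library(ggplot2)

# Practice data from this newsletter, with copper kept numeric (an assumption)
cu_fluor <- tibble(
  treatment = rep(c("7H4M", "ClbP-17"), each = 21),
  copper = rep(rep(c(0, 0.003, 0.01, 0.03, 0.3, 30, 300), each = 3), 2),
  fluorescence = c(101, 100, 99, 100, 98, 97, 101, 100, 95, 98, 97, 99,
                   100, 98, 97, 98, 97, 96, 98, 97, 95,
                   102, 101, 98, 92, 88, 85, 88, 85, 85, 85, 85, 84,
                   71, 65, 64, 63, 60, 58, 63, 62, 61)
)

# Give each concentration-treatment pair an integer position from 1 to 14
cu_pos <- cu_fluor %>%
  arrange(copper, treatment) %>%
  mutate(combo = paste(copper, treatment, sep = "_"),
         x_pos = match(combo, unique(combo)))

conc <- sort(unique(cu_pos$copper))
label_x <- seq(1.5, by = 2, length.out = length(conc))     # middle of each pair
tick_x <- seq(2.5, by = 2, length.out = length(conc) - 1)  # gaps between pairs

p <- ggplot(cu_pos, aes(x = x_pos, y = fluorescence,
                        color = treatment, group = combo)) +
  geom_boxplot() +
  # One concentration label centered under each pair of boxes
  scale_x_continuous(breaks = label_x, labels = as.character(conc)) +
  # Hand-drawn, thicker ticks between pairs; they extend below the panel
  annotate("segment", x = tick_x, xend = tick_x, y = 50, yend = 48.5,
           linewidth = 1, color = "black") +
  # "Cu" placed outside the panel, roughly level with the axis text
  annotate("text", x = 0.5, y = 49, label = "Cu", hjust = 1, vjust = 1) +
  # clip = "off" lets the annotations draw outside the plotting window
  coord_cartesian(xlim = c(0.5, 14.5), ylim = c(50, 105),
                  expand = FALSE, clip = "off") +
  labs(x = NULL, y = "Fluorescence (%)") +
  theme_classic() +
  theme(axis.ticks.x = element_blank(),              # drop the default ticks
        axis.ticks.y = element_line(linewidth = 1),  # thicker y-axis ticks
        axis.line = element_line(linewidth = 1))

p
```

The same annotate() plus clip = "off" pairing handles both the relocated title and the between-pair ticks, which is the recycling of concepts described above.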

Third, on top of the box plots they have overlaid their triplicate data for each condition as jittered points. As an aside, I feel like the figure probably should have picked one geom and run with it. As you can see, the middle of the three points falls on the median line and the other two points fall on the ends of the box plots’ whiskers. The box plot doesn’t really add much. Anyway, I’d use geom_jitter() to randomly place the points along the x-axis within each concentration-treatment combination. Interestingly, the points are all black. So, I’d likely use color = "black" within geom_jitter() without using the aes() function.
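As a sketch of that layering, again on the practice data from this newsletter (the `combo` column and the jitter width are illustrative choices, and `copper` is kept numeric as an assumption):

```r
library(dplyr)
library(ggplot2)
set.seed(19760620)  # makes the random jitter repeatable

# Practice data from this newsletter, with copper kept numeric (an assumption)
cu_fluor <- tibble(
  treatment = rep(c("7H4M", "ClbP-17"), each = 21),
  copper = rep(rep(c(0, 0.003, 0.01, 0.03, 0.3, 30, 300), each = 3), 2),
  fluorescence = c(101, 100, 99, 100, 98, 97, 101, 100, 95, 98, 97, 99,
                   100, 98, 97, 98, 97, 96, 98, 97, 95,
                   102, 101, 98, 92, 88, 85, 88, 85, 85, 85, 85, 84,
                   71, 65, 64, 63, 60, 58, 63, 62, 61)
)

cu_combo <- cu_fluor %>%
  arrange(copper, treatment) %>%
  mutate(combo = paste(copper, treatment, sep = "_"),
         combo = factor(combo, levels = unique(combo)))

p <- ggplot(cu_combo, aes(x = combo, y = fluorescence, color = treatment)) +
  # Hide box plot outliers so no observation is drawn twice
  geom_boxplot(outlier.shape = NA) +
  # color is set outside aes(), so every point is black regardless of
  # treatment; height = 0 jitters along x only and leaves y untouched
  geom_jitter(color = "black", width = 0.2, height = 0)

p
```

Setting color outside of aes() overrides the treatment mapping for this one layer while the boxes keep their treatment colors.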

Finally, they moved the legend inside the plotting window and put a black border around the legend. I like that approach since it frees up room in the plot by getting rid of the right margin where the legend normally sits. By putting a black border around the legend, it says “this is the legend, these box plots are legend glyphs and not data”.
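One way to sketch that legend placement, assuming ggplot2 3.5.0 or later, where legend.position = "inside" and legend.position.inside are available (older versions took the coordinates directly in legend.position). The coordinates are a guess at an empty corner, not values read from the figure:

```r
library(dplyr)
library(ggplot2)

# Practice data from this newsletter, with copper kept numeric (an assumption)
cu_fluor <- tibble(
  treatment = rep(c("7H4M", "ClbP-17"), each = 21),
  copper = rep(rep(c(0, 0.003, 0.01, 0.03, 0.3, 30, 300), each = 3), 2),
  fluorescence = c(101, 100, 99, 100, 98, 97, 101, 100, 95, 98, 97, 99,
                   100, 98, 97, 98, 97, 96, 98, 97, 95,
                   102, 101, 98, 92, 88, 85, 88, 85, 85, 85, 85, 84,
                   71, 65, 64, 63, 60, 58, 63, 62, 61)
)

cu_combo <- cu_fluor %>%
  arrange(copper, treatment) %>%
  mutate(combo = paste(copper, treatment, sep = "_"),
         combo = factor(combo, levels = unique(combo)))

# Legend inside the panel with a black border (requires ggplot2 >= 3.5.0)
legend_theme <- theme(
  legend.position = "inside",
  legend.position.inside = c(0.15, 0.2),  # fractions of the panel width/height
  legend.background = element_rect(color = "black", fill = "white")
)

p <- ggplot(cu_combo, aes(x = combo, y = fluorescence, color = treatment)) +
  geom_boxplot() +
  theme_classic() +
  legend_theme

p
```

The lower left works for this data because both treatments sit near 100% at the low copper concentrations, leaving that corner empty.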

Aside from questioning whether we really need the box plots with the raw data, I have some other thoughts about this figure that I’d like to try. First, I’d be interested in trying to plot a line through the mean of the three points for each concentration-treatment combination. I’d color the points and the two lines by the treatment. Second, I’d like to try putting the x-axis on a log scale. That’s basically what it is, right? The one problem would be the zero since you can’t have zero on a log scale.

If you want to give these ideas a try before I get to them in December, here’s some code to give you a data frame that you could use to play with:


library(tidyverse)
set.seed(19760620)

cu_fluor <- tibble(
  treatment = c(rep("7H4M", 21), rep("ClbP-17", 21)),
  copper = as.character(rep(rep(c(0, 0.003, 0.01, 0.03, 0.3, 30, 300), each = 3), 2)),
  fluorescence = c(101, 100, 99,
                  100, 98, 97,
                  101, 100, 95,
                  98, 97, 99,
                  100, 98, 97,
                  98, 97, 96,
                  98, 97, 95,
                  102, 101, 98,
                  92, 88, 85,
                  88, 85, 85,
                  85, 85, 84,
                  71, 65, 64,
                  63, 60, 58,
                  63, 62, 61)
  )
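If you'd like a starting point for the line-through-the-means and log-scale ideas, here is one possible sketch. It keeps `copper` numeric, and it sidesteps the zero problem by plotting the zero-copper samples at a small stand-in value and labelling that break "0". The stand-in (`zero_stand_in`) is an arbitrary choice made for illustration, not something from the paper, and other workarounds are possible:

```r
library(dplyr)
library(ggplot2)

# Same values as the data frame above, with copper kept numeric
cu_fluor <- tibble(
  treatment = rep(c("7H4M", "ClbP-17"), each = 21),
  copper = rep(rep(c(0, 0.003, 0.01, 0.03, 0.3, 30, 300), each = 3), 2),
  fluorescence = c(101, 100, 99, 100, 98, 97, 101, 100, 95, 98, 97, 99,
                   100, 98, 97, 98, 97, 96, 98, 97, 95,
                   102, 101, 98, 92, 88, 85, 88, 85, 85, 85, 85, 84,
                   71, 65, 64, 63, 60, 58, 63, 62, 61)
)

# Arbitrary stand-in position for zero, since log10(0) is undefined
zero_stand_in <- 0.001

cu_log <- cu_fluor %>%
  mutate(copper_plot = if_else(copper == 0, zero_stand_in, copper))

# Mean of the three replicates for each concentration-treatment combination
cu_means <- cu_log %>%
  group_by(treatment, copper, copper_plot) %>%
  summarize(mean_fluor = mean(fluorescence), .groups = "drop")

p <- ggplot(cu_log, aes(x = copper_plot, y = fluorescence, color = treatment)) +
  geom_point() +
  # One line per treatment, drawn through the replicate means
  geom_line(data = cu_means, aes(y = mean_fluor)) +
  # Breaks at the plotted positions, labelled with the true concentrations
  scale_x_log10(breaks = sort(unique(cu_log$copper_plot)),
                labels = as.character(sort(unique(cu_log$copper)))) +
  labs(x = "Cu", y = "Fluorescence (%)")

p
```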

Workshops

I'm pleased to be able to offer you one of three recent workshops! With each, you'll get access to 18 hours of video content, my code, and other materials. Click the buttons below to learn more.

In case you missed it…

Here are some videos that I published this week that relate to previous content from these newsletters. Enjoy!


Finally, if you would like to support the Riffomonas project financially, please consider becoming a patron through Patreon! There are multiple tiers and fun gifts for each. By no means do I expect people to become patrons, but if you need to be asked, there you go :)

I’ll talk to you more next week!

Pat

Riffomonas Professional Development
