Fun with boxplots!

Hey folks,

Earlier this week I sent out an email announcing a new interactive training opportunity. The goal is to provide greater opportunities to hone your skills in a social setting. My experience with leading this approach has been excellent. I can’t wait to have you give it a try with me. Please let me know if you have any questions.

Let’s continue on with our efforts to develop intuition about how to recreate plots that we see out in the wild! This week, I found an interesting box and whisker plot in the paper, “Unveiling the importance of heterotrophy for coral symbiosis under heat stress”, published in the journal mBio by Stephane Martinez and colleagues. Their Figures 1 and 2 are the same type of figure. Let’s look at Figure 1 together. I’ll let you wrestle with Figure 2 on your own. Here’s Figure 1:

What’s going on in this plot? As we can see, the figure has two panels, A and B. These panels are analogous: they’re both box and whisker plots. These plots are great for displaying data that are not normally distributed. For those of you unfamiliar with this type of plot, the black horizontal line across each rectangle (i.e., the “box”) represents the median (i.e., the 50th percentile), and the bottom and top edges of each box represent the 25th and 75th percentiles. The difference between the 25th and 75th percentiles is the inter-quartile range (IQR). The bars extending above and below the boxes (i.e., the “whiskers”) reach the most extreme observed point that is within 1.5 times the IQR of the box edge. In the bottom panel, the “32 light” data has a point above the very short whisker. This is because there was a point just above the 75th percentile, where the whisker ends, and another point more than 1.5 times the IQR above the box at about 5 on the y-axis. That second point is an outlier. The stars between pairs of treatments tell us that there was a statistically significant difference between the treatments indicated by the brackets under each star.
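To make those definitions concrete, here’s a small sketch in base R that computes the pieces of a boxplot by hand. The flux vector is made up for illustration (it is not data from the paper), and the variable names are my own.

```r
# Made-up values for illustration only; not data from the paper
flux <- c(1.8, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 5.0)

# Box: 25th, 50th (median), and 75th percentiles
q <- quantile(flux, probs = c(0.25, 0.50, 0.75), names = FALSE)
iqr <- q[3] - q[1]

# Fences sit 1.5 * IQR beyond the bottom and top edges of the box
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[3] + 1.5 * iqr

# Whiskers stop at the most extreme observed points inside the fences
whisker_low  <- min(flux[flux >= lower_fence])
whisker_high <- max(flux[flux <= upper_fence])

# Anything beyond the fences is drawn as an individual outlier point
outliers <- flux[flux < lower_fence | flux > upper_fence]
```

With these toy values the upper whisker ends at 2.5, just above the box, and the point at 5 is flagged as an outlier, which is the same pattern as the “32 light” box.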

My suspicion is that the researchers started with a data frame, coral_physiology, that had 5 columns: a fragment column indicating which of the 72 coral fragments they used in the experiment, a temperature column indicating the temperature treatment (25 or 32°C), a density column indicating the number of symbionts per square centimeter, a photosynthesis column indicating the oxygen flux in the light, and a respiration column for the oxygen flux in the dark.

How would we go about taking this data to generate the two panels? Let’s make them as two separate figures. The easiest way to make the box and whisker plot (or just “boxplot”) is to use geom_boxplot() from {ggplot2}. For panel A, I’d map the density column to the y aesthetic and the temperature column to the x aesthetic. For panel B, I’d select() the fragment, temperature, photosynthesis, and respiration columns from coral_physiology. Then I’d use pivot_longer() to collapse photosynthesis and respiration so that the column headings are in a column called process and the values are in a column called flux. I’d then use mutate() along with if_else() and paste() to combine the temperature and process columns to make a pretty_treatment column that looks like what is on the x-axis (e.g. “25 dark”). By default the x-axis labels will be in alphanumeric order, which isn’t what we want. So, I’d use factor() to set the order of the four pretty_treatment values to follow what is on the x-axis.
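Here’s a rough sketch of those steps. It uses simulated stand-in data with the same columns as coral_physiology. The condition helper column and the order of the factor levels are my assumptions, so check the level order against the published x-axis.

```r
library(tidyverse)

set.seed(1)  # simulated stand-in data, not the authors' measurements
coral_physiology <- tibble(
  fragment = 1:72,
  temperature = rep(c(25, 32), each = 36),
  density = c(rnorm(36, 32, 10), rnorm(36, 5, 3)),
  photosynthesis = c(rnorm(36, 4.5, 1), rnorm(36, 2, 1)),
  respiration = c(rnorm(36, -2, 0.75), rnorm(36, -1, 0.75))
)

# Panel A: temperature is numeric, so make it discrete to get one box each
a <- coral_physiology %>%
  ggplot(aes(x = factor(temperature), y = density)) +
  geom_boxplot()

# Panel B: collapse the two flux columns, then build ordered x-axis labels
flux_long <- coral_physiology %>%
  select(fragment, temperature, photosynthesis, respiration) %>%
  pivot_longer(c(photosynthesis, respiration),
               names_to = "process", values_to = "flux") %>%
  mutate(
    # Hypothetical helper column: photosynthesis was measured in the light
    condition = if_else(process == "photosynthesis", "light", "dark"),
    pretty_treatment = paste(temperature, condition),
    # Assumed order; change it to match the published figure
    pretty_treatment = factor(pretty_treatment,
      levels = c("25 light", "32 light", "25 dark", "32 dark"))
  )

b <- flux_long %>%
  ggplot(aes(x = pretty_treatment, y = flux)) +
  geom_boxplot()
```

Wrapping temperature in factor() for panel A matters because a numeric x would otherwise be treated as continuous and you’d get a warning about a missing group aesthetic and a single box.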

One thing to note is that the default fill for the boxes will be white. To get them to be gray, we need to pass fill = "gray" as an argument to geom_boxplot(). Depending on how faithful you want to be to their original figure, you might need to build the boxplot in two steps. That’s because the border of the box and the whiskers will be black by default. So, I’d likely set the color argument in geom_boxplot() to "gray" as well. Then, I’d use stat_summary() to re-plot the median line using color = "black".
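A minimal sketch of that two-step approach, using toy data. Drawing the median with a "crossbar" geom whose minimum and maximum are both the median is just one way to get a flat line; the width of 0.75 is my choice to match the default box width.

```r
library(ggplot2)

set.seed(1)  # toy data for illustration only
toy <- data.frame(
  temperature = factor(rep(c(25, 32), each = 36)),
  density = c(rnorm(36, 32, 10), rnorm(36, 5, 3))
)

a <- ggplot(toy, aes(x = temperature, y = density)) +
  # Step 1: gray fill and gray outline (this also hides the median line)
  geom_boxplot(fill = "gray", color = "gray") +
  # Step 2: redraw the median as a black line on top of each box
  stat_summary(fun = median, fun.min = median, fun.max = median,
               geom = "crossbar", width = 0.75, color = "black")
```

Note that color = "gray" also turns any outlier points gray. If you want them to stay black, geom_boxplot() has an outlier.color argument.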

Of course you can set the labels on the x and y-axes using the labs() function. Both panels have superscript text and panel B has a subscript. Thankfully, there is a great package, {ggtext}, that allows you to write text in markdown or HTML and then use element_markdown() in the theme function for the axis labels. It’s a pretty slick tool! One other issue with the axis labels is the Greek “mu” in the y-axis label for panel B. You can write that in Unicode: “μ”. This would give you something like labs(y = "μmol O<sub>2</sub> cm<sup>-2</sup> h<sup>-1</sup>", x = "Treatments") that would render correctly if you use axis.title.y = element_markdown() after installing and loading {ggtext}. One other trick I’d try with labs() is to use the subtitle argument on panel A with the value "x10<sup>5</sup>". I think that should get you close to the right position. Of course, you’ll need to use element_markdown() again, but for plot.subtitle, the theme argument that controls the subtitle.
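Here’s a small sketch of the labeling, assuming {ggtext} is installed. The data are toy values. For brevity I’ve put the panel B y-axis label and the panel A style subtitle on the same plot; in the real figure they’d live on separate panels.

```r
library(ggplot2)
library(ggtext)

set.seed(1)  # toy data for illustration only
toy <- data.frame(
  pretty_treatment = rep(c("25 light", "32 light"), each = 20),
  flux = c(rnorm(20, 4.5, 1), rnorm(20, 2, 1))
)

# HTML tags give the subscript and superscripts; "μ" is typed as Unicode
y_label <- "μmol O<sub>2</sub> cm<sup>-2</sup> h<sup>-1</sup>"

b <- ggplot(toy, aes(x = pretty_treatment, y = flux)) +
  geom_boxplot() +
  labs(x = "Treatments", y = y_label, subtitle = "x10<sup>5</sup>") +
  theme(
    # Without these, the tags would be printed literally
    axis.title.y = element_markdown(),
    plot.subtitle = element_markdown()
  )
```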

There are a couple of other theme() related adjustments to make. First, the axis labels and text are all the same size. You could modify the size arguments in element_markdown() (and element_text()) for the titles and texts to be the same value. One other thing that stands out to me is the tick marks. On the x-axis there aren’t any tick marks and on the y-axis they’re a light gray color. The x-axis ticks can be removed by using axis.ticks.x = element_blank(). The element_blank() function is great for removing things you don’t want, perhaps like the axis line on the top and right side. The color of the y-axis ticks can be set using axis.ticks.y = element_line(color = "lightgray").
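Those theme() tweaks might look something like this. Starting from theme_classic() and using a size of 12 are my assumptions; pick whatever gets you closest to the published figure.

```r
library(ggplot2)

set.seed(1)  # toy data for illustration only
toy <- data.frame(
  temperature = factor(rep(c(25, 32), each = 36)),
  density = c(rnorm(36, 32, 10), rnorm(36, 5, 3))
)

p <- ggplot(toy, aes(x = temperature, y = density)) +
  geom_boxplot(fill = "gray") +
  theme_classic() +  # assumed starting point: no top or right border
  theme(
    # Same size for axis titles and tick labels (12 is an arbitrary choice)
    axis.title = element_text(size = 12),
    axis.text = element_text(size = 12),
    # No ticks on the x-axis, light gray ticks on the y-axis
    axis.ticks.x = element_blank(),
    axis.ticks.y = element_line(color = "lightgray")
  )
```

If the axis titles are rendered with element_markdown() from {ggtext}, the size argument would go there instead of in element_text().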

To complete each of the panels, we now need to put the comparison brackets and stars on each comparison. Most people would hunt for a package to do this for them. Because I’m stubborn and like to practice using {ggplot2}, I’d draw them myself. I’d create the brackets using geom_segment(), feeding it the x and y-axis positions for each line through a separate data frame that would be used specifically for this purpose. Similarly, I’d create a separate data frame for the position of the stars and would use geom_text() to place the stars. It sounds harder than it is. I make these types of annotations so rarely that I figure it is easier to build the bars and stars manually like this than to re-learn how to use the specialized package. Here’s a video I made long ago showing how to build these types of annotations #YourMileageMayVary
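Here’s a sketch of the hand-drawn approach with toy data. The brackets and stars data frames and their positions are hand-picked for this example. No statistical test is run here; the star is simply placed where I tell it to go.

```r
library(ggplot2)

set.seed(1)  # toy data for illustration only
toy <- data.frame(
  temperature = factor(rep(c(25, 32), each = 36)),
  density = c(rnorm(36, 32, 10), rnorm(36, 5, 3))
)

# Discrete x-axis levels sit at positions 1, 2, ... so plain numbers work
bracket_y <- max(toy$density) + 4

# One row per line segment: the horizontal bar plus two short vertical legs
brackets <- data.frame(
  x    = c(1, 1, 2),
  xend = c(2, 1, 2),
  y    = c(bracket_y, bracket_y, bracket_y),
  yend = c(bracket_y, bracket_y - 2, bracket_y - 2)
)

# One row per star, centered above its bracket
stars <- data.frame(x = 1.5, y = bracket_y + 1, label = "*")

a <- ggplot(toy, aes(x = temperature, y = density)) +
  geom_boxplot(fill = "gray") +
  geom_segment(data = brackets,
               aes(x = x, xend = xend, y = y, yend = yend),
               inherit.aes = FALSE) +
  geom_text(data = stars, aes(x = x, y = y, label = label),
            inherit.aes = FALSE, size = 6)
```

For panel B, with four treatments and several comparisons, you’d add more rows to each data frame rather than more layers to the plot.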

Finally, most people would stop here and assemble the two figures in PowerPoint or some other monstrosity that is hostile to reproducible research. But we are not most people. Are we?! After saving each figure to its own variable name (e.g., a and b), we can easily assemble these figures using {patchwork} or {cowplot}. Here are two videos showing how I’ve used {patchwork} (one and two). Here’s one on how to use {cowplot}. It’s probably worth being familiar with both packages so you can avoid monstrosities.
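With {patchwork}, stacking the two panels could look like this sketch. The plots are toy stand-ins for a and b, and the output file name and dimensions in the commented-out ggsave() line are placeholders.

```r
library(ggplot2)
library(patchwork)

set.seed(1)  # toy data for illustration only
toy <- data.frame(
  temperature = factor(rep(c(25, 32), each = 36)),
  density = c(rnorm(36, 32, 10), rnorm(36, 5, 3)),
  flux = c(rnorm(36, 4.5, 1), rnorm(36, 2, 1))
)

a <- ggplot(toy, aes(x = temperature, y = density)) +
  geom_boxplot(fill = "gray")
b <- ggplot(toy, aes(x = temperature, y = flux)) +
  geom_boxplot(fill = "gray")

# "/" stacks a on top of b; tag_levels = "A" labels the panels A and B
figure <- a / b + plot_annotation(tag_levels = "A")

# Placeholder file name and size; adjust to taste
# ggsave("figure_1.png", figure, width = 4, height = 7)
```

Using | instead of / would place the panels side by side.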

Here’s some code to generate coral_physiology that should get you going trying to recreate the figure in R:

library(tidyverse)

# Simulated data: 72 fragments, 36 at each temperature
coral_physiology <- tibble(
  fragment = 1:72,
  temperature = rep(c(25, 32), each = 36),
  density = c(rnorm(n = 36, mean = 32, sd = 10),
              rnorm(n = 36, mean = 5, sd = 3)),
  photosynthesis = c(rnorm(n = 36, mean = 4.5, sd = 1),
                     rnorm(n = 36, mean = 2, sd = 1)),
  respiration = c(rnorm(n = 36, mean = -2, sd = 0.75),
                  rnorm(n = 36, mean = -1, sd = 0.75))
)

Workshops

I'm pleased to be able to offer you one of three recent workshops! With each you'll get access to 18 hours of video content, my code, and other materials. Click the buttons below to learn more.

minimalR Workshop

generalR Workshop

mothur Workshop

In case you missed it…

Here is a livestream that I published this week that relates to previous content from these newsletters. Enjoy!

Finally, if you would like to support the Riffomonas project financially, please consider becoming a patron through Patreon! There are multiple tiers and fun gifts for each. By no means do I expect people to become patrons, but if you need to be asked, there you go :)

I’ll talk to you more next week!

Pat

Riffomonas Professional Development
