Pat's rules of data visualization


Hey folks!

I just got back from a seminar. I’m still trying to stretch out my eyes from straining to see the small text on each slide! If you don’t know why I’m bringing this up, then you must have missed the videos I posted earlier this week. I was discussing the factors we should consider when converting figures designed for papers into figures designed for a slide deck. You can see me critique a figure from my own lab here and the livestream where I refactor the figure can be found here. I’d love to hear about your experience with customizing figures for your research talks.


As I’ve been going through these weekly critiques, I’ve settled into a set of “rules” that I think a lot about.

  • Be sure you know what the question is that you’re trying to answer
  • Put the most important variables on the axes
  • Put things you want the audience to compare as close as possible to each other

With these in mind, take a look at this panel from Figure 4 of a paper that was just published in the journal Nature Microbiology titled, “Weaning drives microbiome-mediated epigenetic regulation to shape immune memory in mice”. If you look through that paper you’ll see a number of figures - stacked bar plots and box and whisker plots - that I think violate one or both of the latter two “rules”. These plots are a bit different because they are presenting time series data, which I think is pretty cool and worth spending more time thinking about.

Hopefully the first rule is obvious, even if it’s often forgotten. In this case, the authors want to make the case that “LDP [low-dose penicillin] selectively reduced Gram-positive taxa without affecting overall microbial richness or expansion. In particular, the relative abundance of Gram-positive bacteria, including Bacillota and Actinomycetota, decreased in LDP-treated mice, whereas Gram-negative populations such as Bacteroides increased”. There are a few variables mentioned here: relative abundance (albeit not explicitly), time (expressed as increasing and decreasing), phyla and whether they are Gram-positive or -negative, and whether the animals received LDP. The authors want us to see that bacterial phyla have mixed responses over time in animals that received LDP relative to animals that didn’t.

What are the most important variables? Of course, they’re all important! If you tell me that time is one of your variables and you have multiple time points, I’m going to expect it to be on the x-axis. That’s so ingrained in our brains that doing anything else would be confusing to your audience. Next, we want to compare the abundances of things, so that should go on the y-axis. The authors have done pretty well on this rule with one point of critique. You’ll notice that there are two or three replicates for each time point. Those are represented next to each other. The authors are expecting me to visually calculate the mean or median of those and get a sense of the variation in their values. Oof. That’s pretty hard.
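One way to take that mental math off the reader is to collapse the replicates before plotting. Here is a minimal sketch in Python, assuming a simple table of (week, replicate, value) rows. The values are made up for illustration and the names are hypothetical; nothing here comes from the paper’s data.

```python
from collections import defaultdict
from statistics import median

# Made-up relative abundances for illustration only: (week, replicate, value)
observations = [
    (1, "a", 0.10), (1, "b", 0.14), (1, "c", 0.12),
    (2, "a", 0.30), (2, "b", 0.26),
]

def summarize_replicates(rows):
    """Return {week: (median, lowest, highest)} across replicates."""
    by_week = defaultdict(list)
    for week, _replicate, value in rows:
        by_week[week].append(value)
    return {
        week: (median(values), min(values), max(values))
        for week, values in sorted(by_week.items())
    }

summary = summarize_replicates(observations)
for week, (mid, low, high) in summary.items():
    print(f"week {week}: median={mid:.2f} range={low:.2f}-{high:.2f}")
```

With only two or three replicates, I’d probably plot the median as a point or line and still show the individual values around it rather than hide them behind an error bar.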

We now have two of the variables accounted for. What about the phylum and the LDP condition? Currently, the phylum is mapped to the color and the LDP condition to the two facets. But the way the authors worded their text in the paper was kind of like “population X decreased in LDP-treated mice whereas population Y increased”. I read the increase/decrease as being relative to both time and the LDP treatment. This brings us to the third rule. For each population, we need to put these data as close to each other as possible to see the change relative to time and the condition. With this in mind, I’d map the LDP treatment to color and facet by the populations. I imagine a set of line plots where the x-axis is time, the y-axis is relative abundance and each facet is for a different population. There would be two lines in each facet, one for each treatment and they’d be different colors. We could consider ordering the facets by whether their population is Gram-positive or not.
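To make that layout concrete, here is a rough sketch in Python. It assumes made-up records of (phylum, treatment, week, relative abundance), time measured in weeks, and arbitrary colors; the function names are hypothetical. The plotting function uses matplotlib and is only defined here, not called, so the data reshaping runs without it.

```python
from collections import defaultdict

# Made-up records for illustration only: (phylum, treatment, week, relative abundance)
records = [
    ("Bacillota", "LDP", 1, 0.40), ("Bacillota", "LDP", 2, 0.25),
    ("Bacillota", "w/o LDP", 1, 0.42), ("Bacillota", "w/o LDP", 2, 0.45),
    ("Bacteroidota", "LDP", 1, 0.30), ("Bacteroidota", "LDP", 2, 0.50),
    ("Bacteroidota", "w/o LDP", 1, 0.31), ("Bacteroidota", "w/o LDP", 2, 0.33),
]

def group_for_facets(rows):
    """Nest rows as {phylum: {treatment: [(week, value), ...]}}, sorted by week."""
    nested = defaultdict(lambda: defaultdict(list))
    for phylum, treatment, week, value in rows:
        nested[phylum][treatment].append((week, value))
    return {
        phylum: {trt: sorted(points) for trt, points in by_trt.items()}
        for phylum, by_trt in nested.items()
    }

def plot_facets(facets, path="facets.png"):
    """Draw one panel per phylum with one colored line per treatment."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file; no display needed
    import matplotlib.pyplot as plt

    colors = {"LDP": "tab:red", "w/o LDP": "tab:gray"}  # arbitrary choice
    fig, axes = plt.subplots(1, len(facets), sharex=True, sharey=True, squeeze=False)
    for ax, (phylum, by_trt) in zip(axes[0], facets.items()):
        for treatment, points in by_trt.items():
            weeks, values = zip(*points)
            ax.plot(weeks, values, marker="o", color=colors[treatment], label=treatment)
        ax.set_title(phylum)
        ax.set_xlabel("Week")
    axes[0][0].set_ylabel("Relative abundance")
    axes[0][0].legend(frameon=False)
    fig.savefig(path)
    plt.close(fig)

facets = group_for_facets(records)
# plot_facets(facets)  # uncomment to write facets.png (requires matplotlib)
```

Sharing the y-axis across facets keeps a common anchor for comparing phyla; if the rarer phyla get squashed, letting each facet have its own y-axis is a reasonable trade.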

Let’s think about this another way. What is easiest to see in this panel as prepared by the authors? I’d argue that it is very easy to compare the heights of the red bars and how they change with time. They start at a low relative abundance and increase after 2 weeks. I’m pretty confident that they’re taller in the “LDP” than the “w/o LDP” condition. Still, making that comparison is asking a lot of me and isn’t straightforward since there are two (or three) replicates for each time point and I have to scan back and forth between the facets to compare the LDP conditions. This task gets a lot harder for the other bacterial phyla. This is because we are very good at comparing the position of things relative to an anchor point. We are better at this than comparing colors, lengths, or sizes. Natural positional anchors are the top and bottom edge of the y-axis or the left and right edge of the x-axis. Looking at this panel, because the red bars are anchored to the bottom of the panel, it is simple to compare the heights of the bars within a facet.

Next, what does that red bar represent? Whether it’s the Spirochaetota or the Bacteroidota is really hard for me to determine. I suppose if it’s Spirochaetota the mice were probably living a rough existence. Mapping the green (Pseudomonadota) or the dark blue (Actinomycetota) bars to their population is easier. But how well can I track their relative changes across time within and between the LDP conditions? It’s a lot harder because they don’t have a common anchor point. To make the cross-condition comparison, I still have to scan back and forth. Faceting by the phylum instead of the condition makes the comparisons the authors want me to see much easier.

I can hear an objection already… “There are 9 phyla and that would require 9 facets. That’s way too many!” I agree. Most of these phyla are pretty rare and are only included so the stacked bars add up to 1. I would likely include the Bacteroidota, Bacillota, Actinomycetota, Pseudomonadota, and maybe the Campylobacterota. If I only used the first four phyla that would conveniently leave me with 2 Gram-positive phyla and 2 Gram-negative phyla. I’m pretty sure we could fit a 4-facet panel into the same space as the authors used for their stacked bar plot. More importantly, the results would be far clearer to the audience!
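Picking the facets and their order is a small bookkeeping step. Here is a sketch in Python, assuming the four-phylum version and a hypothetical helper name; the input list is just an example of phyla that might appear in the data.

```python
# Gram status of the four phyla kept for the figure
GRAM_STATUS = {
    "Bacillota": "positive",
    "Actinomycetota": "positive",
    "Bacteroidota": "negative",
    "Pseudomonadota": "negative",
}

def facet_order(phyla_in_data):
    """Keep only the selected phyla, Gram-positive first, then alphabetically."""
    kept = [p for p in phyla_in_data if p in GRAM_STATUS]
    return sorted(kept, key=lambda p: (GRAM_STATUS[p] != "positive", p))

order = facet_order(
    ["Bacteroidota", "Spirochaetota", "Bacillota", "Pseudomonadota", "Actinomycetota"]
)
print(order)
```

Once the rare phyla are dropped, the plotted values no longer add up to 1. That’s fine for line plots, since each line is read against the y-axis rather than against a stack.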

Do you want some homework? Check out panel b in Figure 4 or any of the panels in Figure 5. It’s clear to me that the authors struggled with how to show both the change over time and between LDP treatments. Here’s part of panel b from Figure 5, which shows they were getting closer, but didn’t quite get all the way there…

In next week’s livestream, I’ll look at refactoring these panels to make it easier for the audience to see what the authors intend. Be sure to tune in at a special day and time: Tuesday (3/31) at 2 PM Eastern!

Workshops

I'm pleased to be able to offer you one of three recent workshops! With each you'll get access to 18 hours of video content, my code, and other materials. Click the buttons below to learn more.

In case you missed it…

Here is a livestream that I published this week that relates to previous content from these newsletters. Enjoy!


Finally, if you would like to support the Riffomonas project financially, please consider becoming a patron through Patreon! There are multiple tiers and fun gifts for each. By no means do I expect people to become patrons, but if you need to be asked, there you go :)

I’ll talk to you more next week!

Pat

Riffomonas Professional Development
