What's the most important part of your plot? Make sure it's mapped to the axes

Hey folks,

I’ve really enjoyed the flow of combining these newsletters with a Monday critique video, a Wednesday recreation video, and occasionally a Friday remake video. A few weeks in, I feel pretty good about our ability to engage in constructive critiques. Of course, we have to train ourselves (myself included) to use those tools and not just resort to immediate and emotional responses - “I hate that plot”. We need to engage, get in the head of the original creator, and try to understand their intentions. Then we can offer suggestions for what would make the figure easier to understand.

When I look at the scientific literature I’m increasingly frustrated by what I see in the figures. Over the past 5 (?!) years that I’ve been putting out videos, I’ve really wanted to directly take on figures that we find in the scientific literature. Recreating figures from the NY Times, Washington Post, Our World in Data, Pew Research, and other venues is a lot of fun and the topics are broadly interesting. But the odds are good that most of you aren’t doing data journalism or creating plots for the public. I get the sense that most of you are scientists struggling to make plots because your training programs aren’t providing the information I offer.

In the past, I’ve tried to show how to make different types of plots in the hopes that people would adopt the practices that I show them. For example, I did a series on various ways to generate plots that are relevant to the microbiome field. Many of those videos have been popular, but I don’t think the strategy has worked. For example, after my video on connecting RStudio to GitHub, my next most popular video shows how to create a stacked bar plot in R. This is despite my admonition in the video that a stacked bar plot is never the right answer! I’m confident that I effectively teach people to use R and the tidyverse, but leading by example has not helped people develop a sense of style in making plots. People are clearly successful in making plots with R and other tools. However, that doesn’t mean that the plots are effective or work well with the other plots in the paper. This is where the critiques come in.

I would like to take a few weeks to focus more on examples from the scientific literature in the content I am producing. My plan is to still provide the 30,000-foot overviews of how to recreate a figure, with follow-up critique and recreation/refactor videos. I’ll draw examples from open access papers published in so-called “high impact journals” like Cell, Nature, Science, and Nature Microbiology. Many of these journals publish papers with the data for each figure included as supplementary data. Also, I would never want to be seen as “punching down” and being overly critical of work done by trainees. Aside from trying to stay constructive, by focusing on papers published in these high impact journals, I’ll be aiming my critiques at papers that are, by definition, highly regarded. If you have a figure from one of these journals that you’d like me to consider, please send it my way. Even if you are here for the New York Times recreations, I think you’ll still get a lot out of this content!

Since I’ve already said too much, I’ll try to keep this week’s example brief. Here’s Figure 1 from a recently published paper in Nature Microbiology, titled “A widespread hydrogenase supports fermentative growth of gut bacteria in healthy people” by a large group of authors led by Caitlin Welsh from Australia. The title of the figure is “Abundance, transcription and distribution of hydrogenase genes and H2-related metabolic genes throughout the human gut”.

The details of the figure aren’t super important right now. What I find interesting about this figure is that there are three sets of faceted panels that all show data with a similar structure. The first panel shows abundance using a heatmap. The second shows (relative) abundance using a stacked bar plot. The third depicts abundance using a bubble plot. The only thing they didn’t include was something like a box and whisker plot or jittered plot (stay tuned!).

To me, this has always been a strength of ggplot2 - the ability to quickly iterate between various ways of showing data. Let’s assume we have a data frame (gene_abundance) that has columns for the sample, the gene, and the abundance. To create a heatmap, I would do something like this:

ggplot(gene_abundance, aes(x = sample, y = gene, fill = abundance)) + geom_tile()

Here’s one way to generate a stacked bar plot:

ggplot(gene_abundance, aes(x = sample, fill = gene, y = abundance)) + geom_col(position = "fill")

I’d make a bubble plot like this:

ggplot(gene_abundance, aes(x = sample, y = gene, size = abundance)) + geom_point()

Note the similarities in these three code chunks. They use the same data columns. The sample column is mapped to x in each plot. But they differ in whether gene and abundance are mapped to the fill, y, or size aesthetic. Of course, they also differ in the geom_* that they use to draw the data. Which is “correct”?
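If you want to try the three versions yourself before deciding, here is a minimal sketch. The gene_abundance data frame below is simulated: the sample names, gene names, and random abundances are all made up for illustration and have nothing to do with the values in the paper.

```r
library(ggplot2)

# Simulated stand-in for the real data: 20 samples x 5 genes.
# Every name and value here is invented for illustration.
set.seed(19760620)
gene_abundance <- expand.grid(
  sample = paste0("sample_", 1:20),
  gene = paste0("gene_", LETTERS[1:5]),
  stringsAsFactors = FALSE
)
gene_abundance$abundance <- rlnorm(nrow(gene_abundance))

# Same data and same x mapping in all three plots; only the other
# aesthetics and the geom change
heatmap <- ggplot(gene_abundance, aes(x = sample, y = gene, fill = abundance)) +
  geom_tile()

stacked_bar <- ggplot(gene_abundance, aes(x = sample, y = abundance, fill = gene)) +
  geom_col(position = "fill")

bubble <- ggplot(gene_abundance, aes(x = sample, y = gene, size = abundance)) +
  geom_point()
```

Typing heatmap, stacked_bar, or bubble at the console will draw each plot so you can compare how easy it is to read the abundances in each one.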

To answer that question we need to think about the hierarchy of pre-attentive attributes. These are the things in a plot you notice first when you look at it. Humans are very good at comparing positions of objects. In other words, arraying things across the x or y-axes makes them very easy to compare. Although colors are good for differentiating between categories, assuming there aren’t too many categories, they can’t be easily relied upon for making quantitative comparisons. In a pinch, different sizes can be used for making quantitative comparisons, but as with color, we aren’t very good at making those comparisons. I tell people to put the most important variables on the x and y axes and then map less important variables to shape, color, and size.

For the data shown in the first two panels, I’d say the sample isn’t important. In fact, there are so many samples that I can’t differentiate them. Aside from the type of data, which is used to define the facets, the genes and their abundances are the most important. This would lead me to suggest making either a jittered plot or a box and whisker plot or, if you insist, overlaying them:

ggplot(gene_abundance, aes(x = gene, y = abundance)) +
  geom_boxplot() + # or geom_violin()
  geom_jitter()
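Here is a slightly fuller sketch of that idea that you can run on its own. Again, the data are simulated. The data_type column and its two values are hypothetical placeholders for whatever defines the facets, and the jitter width and transparency are just starting points to tweak.

```r
library(ggplot2)

# Simulated stand-in for the real data; every name and value is invented.
# data_type is a hypothetical column for whatever defines the facets.
set.seed(1)
gene_abundance <- expand.grid(
  sample = paste0("sample_", 1:30),
  gene = paste0("gene_", LETTERS[1:5]),
  data_type = c("metagenome", "metatranscriptome"),
  stringsAsFactors = FALSE
)
gene_abundance$abundance <- rlnorm(nrow(gene_abundance))

gene_boxplot <- ggplot(gene_abundance, aes(x = gene, y = abundance)) +
  # outlier.shape = NA hides the box plot's own outlier points so they
  # aren't drawn a second time by geom_jitter()
  geom_boxplot(outlier.shape = NA) +
  # width limits the horizontal jitter; height = 0 keeps every point
  # at its true abundance
  geom_jitter(width = 0.2, height = 0, alpha = 0.5) +
  facet_wrap(~data_type)
```

With the gene on the x-axis and the abundance on the y-axis, the two variables I care most about are the ones encoded by position, and the sample is no longer mapped to anything.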

Of course, there are always other constraints to consider when making complicated figures like this one. Without going through the process of recreating the entire figure, I can’t fully anticipate those constraints, which may have pushed the authors to the design they chose. As an author (or reader) I would have no problem with picking one rather than three approaches to displaying these data. What do you think?

Workshops

I'm pleased to be able to offer you one of three recent workshops! With each you'll get access to 18 hours of video content, my code, and other materials. Click the buttons below to learn more.

minimalR Workshop

generalR Workshop

mothur Workshop

In case you missed it…

Here is a livestream that I published this week that relates to previous content from these newsletters. Enjoy!

Finally, if you would like to support the Riffomonas project financially, please consider becoming a patron through Patreon! There are multiple tiers and fun gifts for each. By no means do I expect people to become patrons, but if you need to be asked, there you go :)

I’ll talk to you more next week!

Pat

Riffomonas Professional Development
