What's the most important part of your plot? Make sure it's mapped to the axes


Hey folks,

I’ve really enjoyed the flow of combining these newsletters with a Monday critique video, a Wednesday recreation video, and occasionally a Friday remake video. A few weeks in, I feel pretty good about our ability to engage in constructive critiques. Of course, we have to train ourselves (myself included) to use those tools and not just resort to immediate and emotional responses - “I hate that plot”. We need to engage, get in the head of the original creator, and try to understand their intentions. Then we can offer suggestions for what would make the figure easier to understand.

When I look at the scientific literature I’m increasingly frustrated by what I see in the figures. Over the past 5 (?!) years that I’ve been putting out videos, I’ve really wanted to directly take on figures that we find in the scientific literature. Recreating figures from the NY Times, Washington Post, Our World in Data, Pew Research, and other venues is a lot of fun and the topics are broadly interesting. But the odds are good that most of you aren’t doing data journalism or creating plots for the public. I get the sense that most of you are scientists struggling to make plots because your training programs aren’t providing the information I offer.

In the past, I’ve tried to show how to make different types of plots in the hopes that people would adopt the practices that I show them. For example, I did a series on various ways to generate plots that are relevant to the microbiome field. Many of those videos have been successful, but I don’t think the strategy has been. My second most popular video, after the one on connecting RStudio to GitHub, shows how to create a stacked bar plot in R. This is despite my admonition in the video that a stacked bar plot is never the right answer! I’m confident that I effectively teach people to use R and the tidyverse, but leading by example has not helped people develop a sense of style in making plots. People are clearly successful in making plots with R and other tools. However, that doesn’t mean that the plots are effective or work well with the other plots in the paper. This is where the critiques come in.

I would like to take a few weeks to focus more on examples from the scientific literature in the content I am producing. My plan is to still provide the 30,000-foot overviews of how to recreate a figure with follow-up critique and recreation/refactor videos. I’ll draw examples from open access papers published in so-called “high impact journals” like Cell, Nature, Science, and Nature Microbiology. Many of these journals publish papers with the data for each figure included as supplementary data. Also, I would never want to be seen as “punching down” and being overly critical of work done by trainees. Aside from trying to stay constructive, by focusing on papers published in these high impact journals, I’ll be aiming my critiques at papers that are, by definition, highly regarded. If you have a figure from one of these journals that you’d like me to consider, please send it my way. Even if you are here for the New York Times recreations, I think you’ll still get a lot out of this content!


Since I already said too much, I’ll try to keep this week’s example brief. Here’s Figure 1 from a recently published paper in Nature Microbiology, titled “A widespread hydrogenase supports fermentative growth of gut bacteria in healthy people” by a large group of authors led by Caitlin Welsh from Australia. The title of the figure is “Abundance, transcription and distribution of hydrogenase genes and H2-related metabolic genes throughout the human gut”.

The details of the figure aren’t super important right now. What I find interesting about this figure is that there are three sets of faceted panels that all show data with a similar structure. The first panel shows abundance using a heatmap. The second shows (relative) abundance using a stacked bar plot. The third depicts abundance using a bubble plot. The only thing they didn’t include was something like a box and whisker plot or jittered plot (stay tuned!).

To me, this has always been a strength of ggplot2 - the ability to quickly iterate between various ways of showing data. Let’s assume we have a data frame (gene_abundance) that has columns for the sample, the gene, and the abundance. To create a heatmap, I would do something like this:


ggplot(gene_abundance, aes(x = sample, y = gene, fill = abundance)) +
  geom_tile()

Here’s one way to generate a stacked bar plot:


ggplot(gene_abundance, aes(x = sample, y = abundance, fill = gene)) +
  geom_col(position = "fill")

I’d make a bubble plot like this:


ggplot(gene_abundance, aes(x = sample, y = gene, size = abundance)) +
  geom_point()

Note the similarities in these three code chunks. They use the same data columns. The sample column is mapped to x in each plot. But they differ in whether gene and abundance are mapped to the fill, y, or size aesthetic. Of course, they also differ in the geom_* that they use to draw the data. Which is “correct”?
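If you want to try the three chunks above side by side, here is a minimal sketch. The toy data frame is made up (six samples, three genes, random abundances) and the object names base_plot, heatmap, stacked_bar, and bubble are my own for illustration; none of this comes from the paper. It just makes the shared pieces explicit: the data and the x mapping are set once, and each plot only changes the remaining aesthetics and the geom.

```r
library(ggplot2)

# Toy data, invented for illustration: one abundance value for every
# sample/gene combination, using the three columns described above
set.seed(2024)
gene_abundance <- expand.grid(
  sample = paste0("sample_", 1:6),
  gene = c("gene_a", "gene_b", "gene_c"),
  stringsAsFactors = FALSE
)
gene_abundance$abundance <- runif(nrow(gene_abundance), min = 0, max = 100)

# The data and the x mapping are common to all three plots
base_plot <- ggplot(gene_abundance, aes(x = sample))

# Only the other aesthetics and the geom change
heatmap <- base_plot +
  geom_tile(aes(y = gene, fill = abundance))

stacked_bar <- base_plot +
  geom_col(aes(y = abundance, fill = gene), position = "fill")

bubble <- base_plot +
  geom_point(aes(y = gene, size = abundance))
```

Typing heatmap, stacked_bar, or bubble at the console draws the corresponding plot. With that in hand, the question still stands: which of the three is “correct”?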

To answer that question we need to think about the hierarchy of pre-attentive attributes. These are the things in a plot you notice first when you look at it. Humans are very good at comparing positions of objects. In other words, arraying things along the x- or y-axis makes them very easy to compare. Although colors are good for differentiating between categories, assuming there aren’t too many categories, they can’t be easily relied upon for making quantitative comparisons. In a pinch, different sizes can be used for making quantitative comparisons, but as with color, we aren’t very good at making those comparisons. I tell people to put the most important variables on the x- and y-axes and then map less important variables to shape, color, and size.

For the data shown in the first two panels, I’d say the sample isn’t important. In fact, there are so many samples that I can’t differentiate them. Aside from the type of data, which is used to define the facets, the genes and their abundances are the most important. This would lead me to suggest making either a jittered plot or a box and whisker plot or, if you insist, overlaying them:


ggplot(gene_abundance, aes(x = gene, y = abundance)) +
  geom_boxplot() + # or geom_violin()
  geom_jitter()
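Since the type of data defines the facets in the original figure, a faceted version of that suggestion might look something like the sketch below. Everything about the data here is invented: the data_type column, its dataset_1 and dataset_2 labels, and the log-normal abundances are assumptions for illustration, not the structure of the paper's supplementary data. I've also hidden the boxplot's own outlier points, because the jittered layer already draws every observation, and restricted the jitter to the horizontal direction so the abundance values aren't shifted.

```r
library(ggplot2)

# Made-up data: 20 samples x 3 genes x 2 data types (all names hypothetical)
set.seed(1)
gene_abundance <- expand.grid(
  sample = paste0("sample_", 1:20),
  gene = c("gene_a", "gene_b", "gene_c"),
  data_type = c("dataset_1", "dataset_2"),
  stringsAsFactors = FALSE
)
gene_abundance$abundance <- rlnorm(nrow(gene_abundance), meanlog = 2, sdlog = 1)

faceted <- ggplot(gene_abundance, aes(x = gene, y = abundance)) +
  # outlier.shape = NA hides the boxplot outliers so no point is drawn twice
  geom_boxplot(outlier.shape = NA) +
  # height = 0 keeps each point at its true abundance; width spreads them out
  geom_jitter(width = 0.2, height = 0, alpha = 0.5) +
  # one panel per type of data
  facet_wrap(~data_type)
```

Here gene and abundance get the two positional aesthetics, and sample isn't mapped to anything.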

Of course, there are always other constraints to consider when making complicated figures like this one. Without going through the process of recreating the entire figure, I can’t fully anticipate those constraints, which may have pushed the authors to the design they chose. As an author (or reader), I would have no problem with picking one rather than three approaches to displaying these data. What do you think?

Workshops

I'm pleased to be able to offer you one of three recent workshops! With each you'll get access to 18 hours of video content, my code, and other materials. Click the buttons below to learn more.

In case you missed it…

Here is a livestream that I published this week that relates to previous content from these newsletters. Enjoy!


Finally, if you would like to support the Riffomonas project financially, please consider becoming a patron through Patreon! There are multiple tiers and fun gifts for each. By no means do I expect people to become patrons, but if you need to be asked, there you go :)

I’ll talk to you more next week!

Pat

Riffomonas Professional Development
