How would you recreate this popular genomics data viz meme?

Hey folks,

This week I want to share with you a type of figure that I see in a lot of genomics papers. I’d consider it a data visualization meme - kind of like how you’re “required” to have a stacked bar plot if you’re doing microbiome research or a dynamite plot if you’re publishing in Nature :)

This figure was included in the paper, “Impact of intensive control on malaria population genomics under elimination settings in Southeast Asia”, which was published earlier this week in Nature Microbiology. I’ve been wanting to look at this type of figure for a while, but I haven’t because authors rarely make the underlying data available. This group of authors did, so we’ll be able to recreate it in a livestream next Wednesday. Yippee! I often feel bad that I only seem to put people’s figures under a microscope if they Do The Right Thing and make their papers open and data accessible.

The basic structure of these figures is a tree (i.e., a dendrogram) linked to one or more heat maps. In the figure above, you can see there’s a dendrogram on the left and the structure of the tree roughly matches the red blocks along the diagonal of the matrix to its right. Red indicates the malaria parasite genomes are more similar and yellow that they are quite different. There are nearly 3 million data points in that heat map. On top of the large heat map are three strips indicating whether the genome was sampled from a location before or after using “mass drug administration” (MDA) or not using it at all; the year; and the genotype of the kelch13 gene, which can confer resistance to artemisinin. Those three strips are actually heat maps - they only have one value on the y-axis and 1700 on the x-axis. Often I see the three horizontal bands drawn as vertical columns instead, but it’s the same idea.

How would we build this? I see this as 5 figures - the dendrogram, the large red/yellow heat map, and the three horizontal bands. Similar to last week, we can compose this figure using the {patchwork} package. These authors report using Cytoscape and {pheatmap}, but I’m going to try to use more conventional {ggplot2}-friendly tools.

Let’s start with the dendrogram. Using the relatedness matrix, we can generate the data for a dendrogram using the base R hclust() function. The authors don’t report the clustering algorithm they used, so we’ll have to do some experimentation. To extract the tree information from the hclust object, we can use the {ggdendro} package. This will give us the starting and ending coordinates of every line in the tree. We can use geom_segment() to plot those lines and generate the dendrogram. I believe {ggdendro} also has a ggdendrogram() function that I’ll have to play with a bit. I’ve never made a tree in R, so this will be an adventure.
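Here's a rough sketch of how I imagine the dendrogram step going. I don't have the authors' data loaded yet, so the relatedness matrix below is simulated and the object names are made up for illustration. The "average" linkage method is purely a guess since the paper doesn't say what was used. The helper package is on CRAN as {ggdendro}.

```r
library(ggplot2)
library(ggdendro)

# Simulated stand-in for the relatedness matrix (hypothetical data):
# 20 "genomes" with pairwise similarity scaled between 0 and 1
set.seed(1)
n <- 20
ids <- paste0("genome_", seq_len(n))
profiles <- matrix(rnorm(n * 5), nrow = n, dimnames = list(ids, NULL))
distances <- dist(profiles)
relatedness <- 1 - as.matrix(distances) / max(distances)

# hclust() wants dissimilarities, so convert similarity back to a distance.
# The linkage method is an assumption, not the authors' choice
tree <- hclust(as.dist(1 - relatedness), method = "average")

# dendro_data() pulls the tree apart into plain data frames
tree_data <- dendro_data(tree, type = "rectangle")
tree_segments <- segment(tree_data) # columns: x, y, xend, yend
tree_labels <- label(tree_data)     # columns: x, y, label

# Swap x and y so the tree lies on its side, then reverse the height
# axis so the tips point right, toward where the heat map will go
dendrogram <- ggplot(tree_segments) +
  geom_segment(aes(x = y, y = x, xend = yend, yend = xend)) +
  scale_x_reverse() +
  theme_void()
```

Printing dendrogram should draw the tree. Swapping "average" for "complete" or "ward.D2" in hclust() is the experimentation I expect to do to match the published figure.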

Next, let’s consider the large heat map. Heat maps can be created using the geom_tile() function from {ggplot2}. The color gradient can be generated using scale_fill_gradient(). Something we noticed last week with {patchwork} is that to “collect” the axes, the axes need to have the same values and formatting. Naturally, the labels at the tips of the tree will be in a different order than those in the heat map. Again, we can use {ggdendro} to extract the label information from the tree. I foresee using the order of the labels to set the order of the rows and columns of the heat map using factor() on the genome identifiers plotted on the x- and y-axes of the heat map.
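Here's a minimal sketch of that ordering idea. Again, the matrix is simulated and the column names (genome_a, genome_b, relatedness) are my own placeholders rather than anything from the paper's data files. The point is that the tip order from the tree becomes the factor levels for both axes of the heat map.

```r
library(ggplot2)
library(ggdendro)

# Simulated relatedness matrix (hypothetical data)
set.seed(1)
n <- 20
ids <- paste0("genome_", seq_len(n))
profiles <- matrix(rnorm(n * 5), nrow = n, dimnames = list(ids, NULL))
distances <- dist(profiles)
relatedness <- 1 - as.matrix(distances) / max(distances)

# Cluster (linkage method is a guess) and pull out the order of the tips
tree <- hclust(as.dist(1 - relatedness), method = "average")
tip_order <- as.character(label(dendro_data(tree))$label)

# Reshape the square matrix into one row per pair of genomes
long <- as.data.frame(as.table(relatedness),
                      responseName = "relatedness",
                      stringsAsFactors = FALSE)
names(long)[1:2] <- c("genome_a", "genome_b")

# Use the tip order as the factor levels on both axes so the rows and
# columns of the heat map line up with the dendrogram
long$genome_a <- factor(long$genome_a, levels = tip_order)
long$genome_b <- factor(long$genome_b, levels = tip_order)

heat_map <- ggplot(long, aes(x = genome_a, y = genome_b, fill = relatedness)) +
  geom_tile() +
  scale_fill_gradient(low = "yellow", high = "red") +
  theme_void()
```

With the real data this would be about 1700 x 1700 tiles, so I expect rendering to be slow. If it is, geom_raster() is worth trying in place of geom_tile().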

Similar to the large heat map, we can use geom_tile() to generate the three horizontal bands at the top. Again, we need to make sure that the data in those bands are in the same order as the genomes in the dendrogram and the heat map. I’d use factor() for this. A unique element of these bands is that the y-axis title is on the right side of the heat maps. In this week’s livestream, I showed how we could use the sec.axis argument in scale_y_continuous() to put the title on the right side. I think I’ll do that again here.
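A sketch of one of those bands might look like the following. The metadata here is invented, and tip_order is a shuffled stand-in for the order that would really come from the dendrogram. The trick is to plot a constant on the y-axis, drop the title on the left, and let dup_axis() carry the title on the right.

```r
library(ggplot2)

set.seed(1)
ids <- paste0("genome_", 1:20)

# Stand-in for the tip order that would be taken from the dendrogram
tip_order <- sample(ids)

# Hypothetical metadata; the real columns would come from the paper's data
metadata <- data.frame(
  genome = factor(ids, levels = tip_order),
  year = factor(sample(2014:2017, 20, replace = TRUE))
)

# One row of tiles: every genome gets the same y value
year_band <- ggplot(metadata, aes(x = genome, y = 1, fill = year)) +
  geom_tile() +
  # No title on the left; the duplicated axis puts "Year" on the right
  scale_y_continuous(name = NULL, expand = c(0, 0),
                     sec.axis = dup_axis(name = "Year")) +
  labs(x = NULL) +
  theme(
    axis.text = element_blank(),
    axis.ticks = element_blank(),
    panel.grid = element_blank(),
    axis.title.y.right = element_text(angle = 0, vjust = 0.5)
  )
```

The MDA and kelch13 bands would follow the same pattern with a different column mapped to fill.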

One thing I notice about the legends in this panel is that they’re organized in a somewhat haphazard manner. The gradient legend is not vertically aligned with the legends for the three bands. To me, this looks weird. Using {patchwork}’s functionality, we will be able to gather together the legends and align them in a more reasonable manner.
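To make the assembly concrete, here's a small sketch of how I'd expect the pieces to come together with {patchwork}. Everything is toy data: the heat map and band are random values, and an empty plot stands in for the dendrogram. The widths and heights are guesses that would need tuning against the real figure.

```r
library(ggplot2)
library(patchwork)

set.seed(1)
ids <- paste0("genome_", 1:10)

# Toy data standing in for the relatedness values and one metadata band
pairs <- expand.grid(a = ids, b = ids)
pairs$relatedness <- runif(nrow(pairs))
bands <- data.frame(genome = ids,
                    year = factor(sample(2014:2017, 10, replace = TRUE)))

heat_map <- ggplot(pairs, aes(x = a, y = b, fill = relatedness)) +
  geom_tile() +
  scale_fill_gradient(low = "yellow", high = "red") +
  theme_void()

year_band <- ggplot(bands, aes(x = genome, y = 1, fill = year)) +
  geom_tile() +
  theme_void()

# Empty plot marking where the dendrogram would go
tree_placeholder <- ggplot() + theme_void()

# "#" leaves the top-left cell empty: band above the heat map, tree to its left
design <- "#B\nAC"

figure <- wrap_plots(
  A = tree_placeholder, B = year_band, C = heat_map,
  design = design,
  widths = c(1, 4),   # narrow tree, wide heat map (a guess)
  heights = c(1, 10), # thin band, tall heat map (a guess)
  guides = "collect"  # gather the legends into one aligned column
)
```

Setting guides = "collect" is what pulls the gradient legend and the band legends into a single stack to the side of the figure, instead of leaving each one next to its own panel.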

On Monday, I’ll present a critique of this figure. Something I already foresee doing is moving those horizontal bands to be vertical and on the right side. What else would you do to improve the appearance of this figure?

Workshops

I'm pleased to be able to offer you one of three recent workshops! With each you'll get access to 18 hours of video content, my code, and other materials. Click the buttons below to learn more.

minimalR Workshop

generalR Workshop

mothur Workshop

In case you missed it…

Here is a livestream that I published this week that relates to previous content from these newsletters. Enjoy!

Finally, if you would like to support the Riffomonas project financially, please consider becoming a patron through Patreon! There are multiple tiers and fun gifts for each. By no means do I expect people to become patrons, but if you need to be asked, there you go :)

I’ll talk to you more next week!

Pat

Riffomonas Professional Development
