New opportunity / What to do when you can't reproduce someone's work


Hey folks,

It has been great to see the high level of engagement with my weekly critique videos on YouTube. I have really enjoyed making them and have learned a lot about current practices in data visualization. The one problem with these videos is that they’re a bit like an autopsy. We can figure out what worked well and what didn’t in a published figure, but we can’t do much to improve it.

What if we could do critiques before submitting our papers, preparing a presentation, or printing a poster? I would love to help you with that. I am offering one-on-one sessions (for a fee) focusing on improving your data visualizations. If you are interested in learning more about what I can provide you, please sign up for a complimentary 30-minute exploratory meeting.


If you watched last week’s livestream, you’ll recall that I ran into a hiccup trying to figure out what clustering algorithm the authors used to create the dendrogram on their heatmap. My dendrogram had very long branches, and the order of the clusters was quite a bit different from theirs. But, I wasn’t sure why we were getting such different results. What would you do in this situation? The answer is what I want to share with you in today’s newsletter.

I went back to the paper to see if I could find anything about the methods the authors used. They mentioned using the {pheatmap} package. I could use the pheatmap() function from the package on their data and see if I got the same result. As many of you know, I really don’t like learning new packages if I don’t have to. So, I looked through the arguments for pheatmap() to see if I could find the default clustering algorithm. There it was: clustering_method = "complete". They used the complete-linkage clustering algorithm, which was the same algorithm I was using in the base R hclust() function.
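For reference, here’s a minimal sketch of the base R approach I’ve been describing. The object `rel` stands in for the paper’s matrix of relatedness values; here it’s random, made-up data just so the code runs:

```r
# Made-up stand-in for the paper's matrix of relatedness values
set.seed(19760620)
rel <- matrix(runif(25), nrow = 5)
rel <- (rel + t(rel)) / 2   # make it symmetric, like a distance matrix
diag(rel) <- 0

d <- as.dist(rel)                     # treat the published values as distances
hc <- hclust(d, method = "complete")  # complete linkage is also hclust()'s default
plot(hc)
```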

We seemed to be using the same clustering method on the same data, but we were getting different results. Perhaps we didn’t have the same data after all. I needed to go ahead and install {pheatmap} and run the function on the data the authors published with their paper. Thankfully, running pheatmap() was pretty straightforward. I used all the default values and got the same result the authors published. Hmmm. At this point, I could confirm that we had the same data and that the authors seemed to be using the defaults in pheatmap(). But why were we getting different results? I needed to take a step back.
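That step looked something like this, with `rel` again standing in for the published matrix:

```r
# Run pheatmap() with every argument left at its default value
library(pheatmap)
pheatmap(rel)
```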

When I looked at the heatmaps, there were clear clusters along the diagonal that were very different from the off-diagonal values. This told me that the long branch lengths I was getting seemed pretty reasonable and that the short branch lengths generated by pheatmap() were suspect. But why was pheatmap() giving short branch lengths?

I decided to take a look at the pheatmap() code. My plan was to go through the function and see where our results diverged. This is not as hard as it sounds - trust me :) I started by loading all the functions from the package’s script and running pheatmap() directly with the authors’ data. That worked. Then I wanted to look at the specific function. I loaded all the arguments with their default values from the function definition (this is line 885 of the script). The one argument without a default was mat, which is the “numeric matrix of the values to be plotted”. I set mat to my matrix of distance values from the Excel workbook provided with the paper. I then ran the lines of code in the body of the function and got the same result as the authors.
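If copying a function’s body into your script sounds tedious, base R’s debugger gets you to the same place. This sketch isn’t what I did on the livestream, but it’s a handy alternative:

```r
# An alternative: flag pheatmap() for a single debugging session, then call it.
# R stops at the first expression in the body and lets you step through the
# code one line at a time with the real arguments in scope
library(pheatmap)
debugonce(pheatmap)
pheatmap(rel)   # `rel` from the earlier sketch
```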

The next step was to run each line of code in succession, trying to understand what was going on in the function. I got down to line 946 before things started to look different. I noticed that the different if...else statements led me to their function cluster_mat(), which took the matrix, the clustering algorithm, and a distance argument. But, I wasn’t sure what the distance argument was for. So, I found the cluster_mat() function on line 535.

A few lines into the function, I found this statement…


if(!(distance[1] %in% c("correlation", "euclidean", "maximum", "manhattan",
                        "canberra", "binary", "minkowski")) &
   !(inherits(distance, "dist")))

Basically, this says that if the distance argument isn’t in this list of methods (e.g., correlation, euclidean, etc.) and the value of the distance argument isn’t a distance matrix, then do the next thing, which was to throw an error. Later, the code generates the distances using the user-supplied method or uses their distance matrix directly. Because we had been using the default clustering_distance_rows = "euclidean" when calling pheatmap(), it was calculating Euclidean distances between the rows of our distance matrix. In other words, the authors were calculating distances of their distances. Oops. Once I figured out that I needed to give the same data to mat and, as a dist object, to clustering_distance_rows, I got the same result from pheatmap() that I got from running my own code.
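Sketched out, the fix looks like this (assuming the columns should be clustered the same way as the rows):

```r
# Give pheatmap() the matrix for the heatmap's colors and the same values as
# a `dist` object, so it clusters on them directly instead of calculating
# Euclidean distances of values that are already distances
library(pheatmap)
pheatmap(rel,
         clustering_distance_rows = as.dist(rel),
         clustering_distance_cols = as.dist(rel))
```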

I don’t think the authors intended to use distances of distances. Perhaps they were like me and assumed that the code would be smart enough to know that if a distance matrix, rather than a simple matrix, was provided, then it didn’t need to recalculate distances. Perhaps they didn’t have any intuition and just used the function as a black box. All I know for sure is that neither the calculation of Euclidean distances from genomic relatedness values nor the clustering method was described in the paper. But, the data from this clustering were used in subsequent analyses, so the method probably should have been more fully described.

I don’t fully fault the authors for neglecting these details. I don’t think the difference in results wildly changes the interpretation of the study. But, Nature Microbiology and other journals are placing a huge premium on reproducibility (this is the argument for why we need P-values with 6 significant digits). These types of details are important. Ultimately, I think I was able to reproduce their work, but, trust me, I spent a lot of time figuring it out.

I apologize for this long-winded story, which perhaps teaches us a bit about troubleshooting and the importance of reproducibility. However, I don’t think this story is a one-off. This week I’ve gone down two rabbit holes trying to find a visualization to present to you. Each time, I was foiled by not having enough information about what the authors did to be able to recreate the figure.

I think the real lesson here is to be wary of default values. Just because they are defaults doesn’t mean they aren’t important or that we don’t need to document them. A second important lesson is the value of open software, where the code itself provides a type of documentation, like the pheatmap() source did in my example. Finally, I’ll add that by recreating a figure I didn’t really like, I was able to learn something more about the data and methods used to generate it.

As for those two rabbit holes, I emailed one of the authors with a question about the first, and I think I’ve got the second figured out. Because I’ve already written waaaay too much today, I’ll leave you with the figure I think I figured out, which was published this week in Nature in a paper titled, “A pro-carcinogenic bacterial toxin binds claudin-4 to cleave E-cadherin”.

The paper indicates that this plot was made in Prism. Prism appears to have a bunch of different models for fitting this type of logistic dose-response curve. Which did the authors use? Because I don’t use Prism, I don’t have a direct way to answer this question. But I can try to recreate the figure in R using different approaches. So, I’ll have to figure out how to fit a non-linear model to the data in R using geom_smooth(). If you want to see my solution, you’ll have to tune in to next week’s livestream!
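As a preview, here’s one possible approach, a sketch only and not necessarily what the authors did in Prism: fit a self-starting logistic model with nls() through geom_smooth(). The data here are made up purely for illustration:

```r
library(ggplot2)

# Made-up dose-response data for illustration (not the paper's values):
# response rises with log dose along a logistic curve, plus noise
set.seed(1)
dose_response <- data.frame(dose = rep(10^(-3:3), each = 3))
dose_response$response <- with(dose_response,
  100 / (1 + exp((0 - log10(dose)) / 0.5)) + rnorm(21, sd = 3))

ggplot(dose_response, aes(x = dose, y = response)) +
  geom_point() +
  geom_smooth(method = "nls",
              formula = y ~ SSlogis(x, Asym, xmid, scal),
              se = FALSE) +  # nls() doesn't give geom_smooth() the standard
                             # errors it needs for a confidence ribbon
  scale_x_log10()            # the logistic is fit on the log10(dose) scale
```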

Workshops

I'm pleased to be able to offer you one of three recent workshops! With each, you'll get access to 18 hours of video content, my code, and other materials. Click the buttons below to learn more.

In case you missed it…

Here is a livestream that I published this week that relates to previous content from these newsletters. Enjoy!


Finally, if you would like to support the Riffomonas project financially, please consider becoming a patron through Patreon! There are multiple tiers and fun gifts for each. By no means do I expect people to become patrons, but if you need to be asked, there you go :)

I’ll talk to you more next week!

Pat
