Hey folks, It has been great to see the high level of engagement with my weekly critique videos on YouTube. I have really enjoyed making them and have learned a lot about current practices in data visualization. The one problem with these videos is that they’re a bit like an autopsy: we can figure out what did or didn’t work in a published figure, but we can’t do much to improve it. What if we could do critiques before submitting our papers, preparing a presentation, or printing a poster? I would love to help you with that. I am offering one-on-one sessions (for a fee) focused on improving your data visualizations. If you are interested in learning more about what I can provide, please sign up for a complimentary 30-minute exploratory meeting.

If you watched last week’s livestream, you’ll recall that I ran into a hiccup trying to figure out what clustering algorithm the authors used to create the dendrogram on their heatmap. My dendrogram had very long branches, and the order of the clusters was quite a bit different from theirs. But I wasn’t sure why we were getting such different results. What would you do in this situation? The answer is what I want to share with you in today’s newsletter.

I went back to the paper to see if I could find anything about the methods the authors used. They mentioned the software they used. We seemed to be using the same clustering method on the same data, but we were getting different results. Perhaps we didn’t have the same data after all. I needed to go ahead and install the software and run it myself.

When I looked at the heatmaps, there were clear clusters along the diagonal that were very different from the off-diagonal values. This told me that the long branch lengths I was getting seemed pretty reasonable and that the short branch lengths in the published dendrogram were the ones to question. I decided to take a look at the function’s source code. The next step was to run each line of code in succession, trying to understand what was going on in the function.
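If you want to do this kind of spelunking yourself, one generic way in R (a sketch of the general technique, not the exact commands from the livestream) is to flag the function with debug(), which drops you into the interactive browser on its next call so you can execute the body one line at a time:

```r
# Generic sketch: stepping through an unfamiliar function in R.
# debug() flags a function so that its next call opens the interactive
# browser(), where you can run the function's body one line at a time.
debug(stats::hclust)
isdebugged(stats::hclust)            # TRUE while the flag is set
# stats::hclust(dist(mtcars))        # would now step through interactively
undebug(stats::hclust)               # turn the flag off again
```

Typing a function’s name without parentheses prints its source, which is how you can read your way down to a specific line number.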
I got down to line 946 before things started to look different, and I noticed where the discrepancy was coming from. A few lines into the function, I found this statement…
Basically, this is saying that if the distance argument isn’t in this list of methods (e.g., correlation, euclidean, etc.) and the value of the distance argument isn’t a distance matrix, then do the next thing, which was to throw an error. Later, the code generates the distances using the user-supplied method or their distance matrix. But because the default method had been used, the function went ahead and calculated Euclidean distances from values that were already distances. I don’t think the authors intended to use distances of distances.

Perhaps they were like me and assumed that the code would be smart enough to know that if a distance matrix, rather than a simple matrix, was provided, then it didn’t need to recalculate distances. Perhaps they didn’t have any intuition and just used the function as a black box. All I know for sure is that calculating Euclidean distances from genomic relatedness values wasn’t described in the paper, nor was the clustering method. But the data from this clustering was used in subsequent analyses, so the method probably should have been more fully described.

I don’t fully fault the authors for neglecting these details. I don’t think the difference in results wildly changes the interpretation of the study. But Nature Microbiology and other journals are placing a huge premium on reproducibility (this is the argument for why we need P-values with 6 significant digits). These types of details are important. Ultimately, I think I was able to reproduce their work, but trust me, I spent a lot of time figuring this out.

I apologize for this long-winded story that perhaps teaches us a bit about troubleshooting and the importance of reproducibility. However, I don’t think this story is a one-off. This week I’ve gone down two rabbit holes trying to find a visualization to present to you. Each time, I was foiled by not having enough information about what the authors did to be able to recreate the figure. I think the real lesson here is to be careful of default values.
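To make the failure mode concrete, here is a toy example (my own made-up matrix, not the paper’s data) showing that clustering a distance matrix as if it were raw data gives a different result than clustering the distances themselves:

```r
# Toy example: "distances of distances" vs. clustering distances directly.
set.seed(1)
m <- matrix(rnorm(40), nrow = 8)        # 8 observations of made-up data
d <- dist(m)                            # pairwise Euclidean distances

hc_direct <- hclust(d)                  # cluster the distances themselves
hc_double <- hclust(dist(as.matrix(d))) # Euclidean distances *of* the distances

# The merge heights (i.e., the branch lengths) come out on different scales
range(hc_direct$height)
range(hc_double$height)
```

Plotting the two dendrograms side by side shows how easily a default distance argument can silently change both the branch lengths and the ordering of the clusters.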
Just because they are defaults doesn’t mean they aren’t important or that we don’t need to document them. A second important lesson is the value of open software, where the code itself provides a type of documentation that the methods section lacked.

As for those two rabbit holes I went down, I emailed one of the authors with a question, and the other I think I’ve got figured out. Because I’ve already written waaaay too much today, I’ll leave you with the figure I think I’ve figured out, published this week in Nature in a paper titled “A pro-carcinogenic bacterial toxin binds claudin-4 to cleave E-cadherin”. The paper indicates that this plot was made in Prism. Prism appears to have a bunch of different models for fitting this type of logistic dose-response curve. Which did the authors use? Because I don’t use Prism, I don’t have a direct way to answer that question. But I can try to recreate the figure in R using different approaches. So, I’ll have to figure out how to fit a non-linear model to the data in R.
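As a starting point, here is one possible approach (a sketch with made-up numbers, and nls() is only one of several options; the drc package is another) for fitting a four-parameter logistic dose-response curve in base R:

```r
# Sketch: four-parameter logistic dose-response fit with base R's nls().
# The dose/response values below are made up for illustration.
dose     <- c(0.01, 0.1, 1, 10, 100)
response <- c(2, 10, 48, 88, 97)

# bottom/top are the lower/upper plateaus, ec50 the midpoint dose,
# and hill the slope of the curve at the midpoint.
fit <- nls(response ~ bottom + (top - bottom) / (1 + (ec50 / dose)^hill),
           start = list(bottom = 0, top = 100, ec50 = 1, hill = 1))
round(coef(fit), 2)
```

Whether this matches Prism’s output will depend on which of Prism’s models the authors picked, which is exactly the detail the figure legend doesn’t tell us.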