Hey folks,

It has been great to see the high level of engagement with my weekly critique videos on YouTube. I have really enjoyed making them and have learned a lot about current practices in data visualization. The one problem with these videos is that they're a bit like an autopsy: we can figure out what went well or what didn't work in a published figure, but we can't do much to improve it. What if we could do critiques before submitting our papers, preparing a presentation, or printing a poster? I would love to help you with that. I am offering one-on-one sessions (for a fee) focused on improving your data visualizations. If you are interested in learning more about what I can provide, please sign up for a complimentary 30-minute exploratory meeting.

If you watched last week's livestream, you'll recall that I ran into a hiccup trying to figure out what clustering algorithm the authors used to create the dendrogram on their heatmap. My dendrogram had very long branches, and the order of the clusters was quite a bit different from theirs. But I wasn't sure why we were getting such different results. What would you do in this situation? The answer is what I want to share with you in today's newsletter.

I went back to the paper to see if I could find anything about the methods the authors used. They mentioned using the … We seemed to be using the same clustering method on the same data, but we were getting different results. Perhaps we didn't have the same data after all. I needed to go ahead and install …

When I looked at the heatmaps, there were clear clusters along the diagonal that were very different from the off-diagonal values. This told me that the long branch lengths I was getting seemed pretty reasonable and that the short branch lengths generated by … I decided to take a look at the … The next step was to run each line of code in succession, trying to understand what was going on in the function.
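One reason two dendrograms built from "the same" data can disagree is the choice of linkage method: the same distances clustered with single, complete, average, or Ward linkage can yield different branch lengths and leaf orders. The newsletter's actual work was in R, but here's a minimal sketch of that effect in Python with scipy, on made-up data (the matrix and seed are invented for illustration):

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, leaves_list

# Made-up data standing in for the paper's matrix.
rng = np.random.default_rng(7)
data = rng.random((10, 4))
d = pdist(data, metric="euclidean")  # pairwise distances, condensed form

# The same distances clustered with different linkage methods can
# produce different branch heights and leaf orders.
for method in ["single", "complete", "average", "ward"]:
    tree = linkage(d, method=method)
    print(method,
          "max branch height:", round(tree[:, 2].max(), 2),
          "leaf order:", list(leaves_list(tree)))
```

This is why "we used hierarchical clustering" without naming the linkage method (or the distance) isn't enough to reproduce a heatmap.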
I got down to line 946 before things started to look different. I noticed that the different … A few lines into the function, I found this statement…
Basically, this is saying that if the distance argument isn't in this list of methods (e.g., correlation, euclidean, etc.) and the value of the distance argument isn't a distance matrix, then do the next thing, which was to throw an error. Later, the code generates the distances using the user-supplied method or their distance matrix. But because we had been using the default distance method, the function was calculating Euclidean distances from our already-computed distance matrix.

I don't think the authors intended to use distances of distances. Perhaps they were like me and assumed that the code would be smart enough to know that if a distance matrix, rather than a simple matrix, was provided, then it didn't need to recalculate distances. Perhaps they didn't have any intuition and just used the function as a black box. All I know for sure is that calculating Euclidean distances from genomic relatedness values wasn't described in the paper, and neither was the clustering method. But the data from this clustering were used in subsequent analyses, so the method probably should have been more fully described.

I don't fully fault the authors for neglecting these details, and I don't think the difference in results wildly changes the interpretation of the study. But Nature Microbiology and other journals are placing a huge premium on reproducibility (this is the argument for why we need P-values with six significant digits). These types of details are important. Ultimately, I think I was able to reproduce their work, but trust me, I spent a lot of time trying to figure this out.

I apologize for this long-winded story, which perhaps teaches us a bit about troubleshooting and the importance of reproducibility. However, I don't think this story is a one-off. This week I've gone down two rabbit holes trying to find a visualization to present to you. Each time, I got foiled by not having enough information to understand what the authors did so that I would be able to recreate the figure. I think the real lesson here is to be careful of default values.
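To convince yourself what "distances of distances" actually does to a dendrogram, here is a minimal sketch, in Python with scipy as a stand-in for the R workflow (the data are made up; this is not the authors' code):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage

# Made-up stand-in for a genomic relatedness/distance matrix.
rng = np.random.default_rng(42)
samples = rng.random((8, 5))
condensed = pdist(samples, metric="euclidean")  # precomputed distances
square = squareform(condensed)                  # same values, square form

# Intended: cluster the precomputed distances directly.
direct = linkage(condensed, method="complete")

# The pitfall: leave the distance argument at its default, so the
# square distance matrix gets treated as raw data and Euclidean
# distances *of* the distances are computed before clustering.
of_distances = linkage(pdist(square, metric="euclidean"), method="complete")

# The merge heights (branch lengths) now differ between the two trees.
print(direct[:, 2])
print(of_distances[:, 2])
```

The two trees can even disagree on leaf order, which is exactly the symptom described above: long branches in one reconstruction, short branches in the other.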
Just because they are defaults doesn't mean they aren't important and that we don't need to document them. A second important lesson is the value of having open software, where the code itself provides a type of documentation, like the code for …

As for those two rabbit holes I went down, I emailed one of the authors with a question; the other I think I've got figured out. Because I've already written waaaay too much today, I'll leave you with the figure I think I figured out, which was published this week in Nature in a paper titled "A pro-carcinogenic bacterial toxin binds claudin-4 to cleave E-cadherin". The paper indicates that this plot was made in Prism. Prism appears to have a bunch of different models for fitting this type of logistic dose-response curve. Which did the authors use? Because I don't use Prism, I don't have a direct way to answer this question. But I can try to recreate it in R using different approaches. So, I'll have to figure out how to fit a non-linear model to data in R using …
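I can't know which of Prism's models the authors picked, but a common form for this kind of curve is a four-parameter logistic (the same family as Prism's variable-slope dose-response model). As a sketch of what the fit involves, here it is in Python with scipy.optimize.curve_fit on made-up data (the doses, responses, and parameter values are all invented):

```python
import numpy as np
from scipy.optimize import curve_fit

def four_pl(x, bottom, top, log_ec50, hill):
    """Four-parameter logistic dose-response curve:
    x is log10(dose); bottom/top are the plateaus."""
    return bottom + (top - bottom) / (1 + 10 ** ((log_ec50 - x) * hill))

# Made-up dose-response data: x is log10(dose in M), y is percent response.
rng = np.random.default_rng(0)
log_dose = np.linspace(-9, -4, 11)
response = four_pl(log_dose, 0, 100, -6.5, 1.2) \
    + rng.normal(scale=3, size=log_dose.size)

# Fit with rough starting guesses for the four parameters.
p0 = [response.min(), response.max(), np.median(log_dose), 1.0]
params, _ = curve_fit(four_pl, log_dose, response, p0=p0)
bottom, top, log_ec50, hill = params
print(f"EC50 ~ 1e{log_ec50:.2f} M, Hill slope ~ {hill:.2f}")
```

Trying a few different model forms like this (fixed vs. variable slope, constrained vs. free plateaus) is one way to reverse-engineer which model a published Prism figure used.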