Hey folks,

If you missed it, on Wednesday I did a livestream where I made a stacked barplot and pronounced it good. No, I wasn’t drinking anything! But it’s a reminder to think about the question before finding the best data visualization strategy. I think this highlights the value of the constructive approach I’ve been trying to take to critiquing data visualizations. The first step is to figure out the question the figure is trying to answer. If you aren’t a “regular”, I think you’re really missing out by skipping the Monday critique videos. I’d love to do visualizations that are relevant to your work, so feel free to send me things that catch your eye - either good or bad - in your reading of the literature.

This week, I’m going to be critiquing, recreating, and refactoring panels c through j of Figure 1 from the paper “Rising atmospheric CO2 reduces nitrogen availability in boreal forests”, which was recently published in Nature. I’ll have more to say in the videos, but for now, I’d like to focus on the statistical information in the upper right corner of each panel.
How did they generate that information? Many beginners (and more
advanced users too!) would have a single data frame that they filter to
the particular combination of variables they are analyzing. In this
case, the region and the tree species. Then they would run the code to
generate the statistical information eight times. That definitely works,
but it isn’t DRY (Don’t Repeat Yourself) and
leads to cumbersome code. I’d like to lead you through something you’ve
likely seen me do if you watch my videos and which has often left people
scratching their heads. It’s a powerful tidyverse idiom.

For discussion, assume that we’re working with the
Let’s think about the beginner approach. I’d filter
The data frame
or it could be written like this
Regardless,
For most people that’s enough to generate the plot. They’d generate
two additional data frames for 6 and 8 cylinders and run
But let’s keep going to see if we can streamline the code some more.
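Before streamlining anything, here is a minimal sketch of that two-step beginner approach. I'm assuming the built-in `mtcars` data frame (it has 4, 6, and 8 cylinder cars) and using a correlation between `mpg` and `wt` as a stand-in for whatever statistic you actually need. The object names `four_cyl` and `four_cyl_test` are made up for illustration.

```r
library(dplyr)
library(broom)

# Assumption: mtcars and a Pearson correlation between mpg and wt stand in
# for the data and the statistic used in the figure.

# Step 1: filter to a single group (here, the 4 cylinder cars)
four_cyl <- mtcars %>%
  filter(cyl == 4)

# Step 2: run the test on the filtered data and tidy the result into a
# one-row tibble with columns like estimate and p.value
four_cyl_test <- cor.test(four_cyl$mpg, four_cyl$wt) %>%
  tidy()

four_cyl_test
```

To cover every group this way you would copy both steps and change `cyl == 4` to `cyl == 6` and `cyl == 8`, which is the repetition we are trying to avoid.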
First, we’ll join the two steps. We can pipe the output of
With the
This gives us the following output:
Again, we could repeat this with the other numbers of cylinders. Or
we could use the confusing idiom. Instead of using
This gets us a strange looking data frame with 3 rows and 2 columns.
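If you want to see that strange looking data frame for yourself, here is a small sketch. It assumes the built-in `mtcars` data frame and nests everything except the `cyl` column with `tidyr::nest()`. The name `nested` is just a placeholder.

```r
library(dplyr)
library(tidyr)

# Assumption: mtcars stands in for the data frame being discussed.
# nest() packs every column except cyl into a list-column called data,
# leaving one row per unique value of cyl.
nested <- mtcars %>%
  nest(data = -cyl)

nested

# One row for each of the three cylinder values, two columns (cyl and data)
dim(nested)
```

Each entry in the `data` column is itself a data frame holding the rows for that number of cylinders.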
The
or
But those give a variety of errors that are frustrating. What we need
to do is iterate over each value of
Let me break down that
We’ve added a column. Of course, that’s what
To remove the
Finally, we get this beaut…
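Here is one way the whole nest, map, unnest sequence could look end to end. Treat it as a sketch under assumptions: the built-in `mtcars` data frame, a correlation between `mpg` and `wt` as the statistic, and a hypothetical helper I've called `run_cor_test()`. Swap in your own data, grouping columns, and test.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

# Hypothetical helper: run the test on one group's data frame and return
# a one-row tibble of results
run_cor_test <- function(df) {
  cor.test(df$mpg, df$wt) %>%
    tidy()
}

cyl_tests <- mtcars %>%
  # one row per value of cyl, with that group's rows in a list-column
  nest(data = -cyl) %>%
  # iterate over the list-column, running the test once per group
  mutate(test = map(data, run_cor_test)) %>%
  # expand the tidied results into regular columns
  unnest(test) %>%
  # drop the nested data now that we have the statistics
  select(-data)

cyl_tests
```

The result has one row per number of cylinders with the estimate, P-value, and confidence interval as columns, so it is ready to join onto a plot as annotation text.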
We could have repeated the same 3 or 4 lines using
Got it?! Let me know if this did or didn’t make sense. Feel free to ask any questions that might help you understand this better. I suspect that if you can figure out this powerful tidyverse idiom you’ll be among about 5% of R users. I think it’s worth figuring out to unlock the door to not only tidy output, but tidy code as well!

I would love your feedback on this type of newsletter content. Do you like seeing code in the newsletter or do you prefer the higher level discussion I often provide?