Learn to make your data analysis "tidy" and your code "tidy" too!

Hey folks,

If you missed it, on Wednesday I did a livestream where I made a stacked barplot and pronounced it good. No, I wasn’t drinking anything! But it’s a reminder to think about the question before finding the best data visualization strategy. I think this highlights the value of the constructive approach I’ve been trying to take to critiquing data visualizations. The first steps are to establish the question and figure out the question. If you aren’t a “regular”, I think you’re really missing out by skipping the Monday critique videos. I’d love to do visualizations that are relevant to your work, so feel free to send me things that catch your eye - either good or bad - in your reading of the literature.

This week, I’m going to be critiquing, recreating, and refactoring panels c through j of Figure 1 from the paper “Rising atmospheric CO2 reduces nitrogen availability in boreal forests”, which was recently publishing in Nature.

I’ll have more to say in the videos, but for now, I’d like to focus on the statistical information in the upper right corner of each panel. How did they generate that information? Many beginners (and more advanced users too!) would have a single data frame that they filter to the particular combination of variables they are analyzing. In this case, the region and the tree species. Then they would run the code to generate the statistical information eight times. That definitely works, but it isn’t DRY and leads to cumbersome code. I’d like to lead you through something you’ve likely seen me do if you watch my videos and which has often left people scratching their head. It’s a powerful {tidyverse} “idiom” that is pretty powerful, albeit a bit confusing to beginners.

For discussion, assume that we’re working with the mtcars dataset in R and we want to know the correlation coefficient between the miles per gallon (mpg) and the engine displacement (disp) for cars with 4, 6, and 8 cylinders (cyl).

Let’s think about the beginner approach. I’d filter mtcars to get the data from cars with 4 cylinders…

four <- mtcars %>%
  select(mpg, disp, cyl) %>%
  filter(cyl == 4)

The data frame four can then be used to run a correlation test with cor.test(). One way to write the function would be like this

cor.test(~four$mpg + four$disp)

or it could be written like this

cor.test(~mpg + disp, data = four)

Regardless, cor.test() will us the correlation coefficient (cor) is -0.8052 and the P-value is 0.0028. As we see in the following output:

    Pearson's product-moment correlation
data:  mpg and disp
t = -4.074, df = 9, p-value = 0.002783
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.9474526 -0.3972483
sample estimates:
       cor 
-0.8052361

For most people that’s enough to generate the plot. They’d generate two additional data frames for 6 and 8 cylinders and run cor.test() on those to get the corresponding values.

But let’s keep going to see if we can streamline the code some more. First, we’ll join the two steps. We can pipe the output of filter() into cor.test() like this…

mtcars %>%
  select(mpg, disp, cyl) %>%
  filter(cyl == 4) %>%
  cor.test(~mpg + disp, data = .)

With the %>% operator, the data frame at this point of the pipeline gets inserted either to the first argument of the next function (this is what happens between select() and filter()) or to where we put a . as the value for an argument. Here we’re sending the data to the data argument of cor.test(). Let’s take another step and see if we can’t convert the output of cor.test() to a data frame. This is easily done using the tidy() function from {broom}.

mtcars %>%
  select(mpg, disp, cyl) %>%
  filter(cyl == 4) %>%
  cor.test(~mpg + disp, data = .) %>%
  tidy()

This gives us the following output:

# A tibble: 1 × 8
  estimate statistic p.value parameter conf.low conf.high method      alternative

Again, we could repeat this with the other numbers of cylinders. Or we could use the confusing idiom. Instead of using filter() let’s use nest()

mtcars %>%
  select(mpg, disp, cyl) %>%
  nest(mpg_disp = - cyl)

This gets us a strange looking data frame with 3 rows and 2 columns.

# A tibble: 3 × 2
    cyl mpg_disp

The nest() function put all of the columns except cyl into a column I’m calling mpg_disp. It automagically knows to make this a 3 row data frame because the cyl column wasn’t included in mpg_disp and cyl only has 3 unique values. Now we have a column that we can work with using mutate(). We might be tempted to do something like

mtcars %>%
  select(mpg, disp, cyl) %>%
  nest(mpg_disp = - cyl) %>%
  mutate(test = cor.test(~mpg + disp, data = .)) %>%
  tidy()

mtcars %>%
  select(mpg, disp, cyl) %>%
  nest(mpg_disp = - cyl) %>%
  mutate(test = cor.test(~mpg + disp, data = mpg_disp)) %>%
  tidy()

But those give a variety of errors that are frustrating. What we need to do is iterate over each value of mgp_disp and send the individual nested tibbles to cor.test(). We can do that with map() from {purrr}, which comes with {tidyverse}. The syntax here gets a bit trickier, so hang on…

mtcars %>%
  select(mpg, disp, cyl) %>%
  nest(mpg_disp = - cyl) %>%
  mutate(test = map(mpg_disp, ~cor.test(~mpg + disp, data = .x) %>% tidy()))

Let me break down that map() syntax. First, we give it the mpg_disp column, that we are going to iterate over. The map() function will take each value of mpg_disp and give it to cor.test(). An important thing to note is that the argument that mpg_disp is assigned to is .x. Now if you look to the cor.test() function you’ll see I’ve replaced the . with .x because the data are coming from those individual tibbles. A peculiarity of map() is that if we write the function to take arguments, we need to put a tilde (~) before the first function name. If you count the parentheses you’ll see that tidy() is getting the output from cor.test() before going to the next value in mpg_cyl. Here’s the output…

# A tibble: 3 × 3
    cyl mpg_disp          test

We’ve added a column. Of course, that’s what mutate() does - it modifies or adds columns. We see that each row of test is a 1 row by 8 column data frame. If you look up a bit to where I introduced tidy(), you’ll notice that it is a 1 row by 8 column data frame. Right? Well, we want those values, right? To unnest the data in test we need to use the unnest() function like this…

mtcars %>%
  select(mpg, disp, cyl) %>%
  nest(mpg_disp = - cyl) %>%
  mutate(test = map(mpg_disp, ~cor.test(~mpg + disp, data = .x) %>% tidy())) %>%
  unnest(test)

To remove the mpg_disp column we can use select…

mtcars %>%
  select(mpg, disp, cyl) %>%
  nest(mpg_disp = - cyl) %>%
  mutate(test = map(mpg_disp, ~cor.test(~mpg + disp, data = .x) %>% tidy())) %>%
  unnest(test) %>%
  select(-mpg_disp)

Finally, we get this beaut…

# A tibble: 3 × 9
    cyl estimate statistic p.value parameter conf.low conf.high method  alternative

We could have repeated the same 3 or 4 lines using filter() for each of the three cylinders and gotten the same type of data. Or, we could use 5 or 6 lines once and gotten the data we’re interested in succinctly and in a tibble format. This might seem unnecessary for three cyl values. But it is invaluable when you have many conditions you want to test. For example, I can run a Wilcoxon-Test comparing the relative abundances of hundreds of bacterial taxa between two treatment groups with not much more code than I’ve shown you here. In comparison, the task would be horribly painful by using the filter approach.

Got it?! Let me know if this did or didn’t make sense. Feel free to ask any questions that might help you understand this better. I suspect that if you can figure out this powerful tidyverse idiom you’ll be among about 5% of R users. I think it’s worth figuring it out to unlock the door to not only tidy output, but tidy code as well!

I would love your feedback on this type of newsletter content. Do you like seeing code in the newsletter or do you prefer the higher level discussion I often provide?

Workshops

I'm pleased to be able to offer you one of three recent workshops! With each you'll get access to 18 hours of video content, my code, and other materials. Click the buttons below to learn more

minimalR Workshop

generalR Workshop

mothur Workshop

In case you missed it…

Here is a livestream that I published this week that relate to previous content from these newsletters. Enjoy!

Finally, if you would like to support the Riffomonas project financially, please consider becoming a patron through Patreon! There are multiple tiers and fun gifts for each. By no means do I expect people to become patrons, but if you need to be asked, there you go :)

I’ll talk to you more next week!

Pat

Riffomonas Professional Development

Learn to make your data analysis "tidy" and your code "tidy" too!

Workshops

In case you missed it…

New opportunity / What to do when you can reproduce someone's work

How would you recreate this popular genomics data viz meme?

Are you ready to scrap using Illustrator to compose your multi-panel figures?