Did you know you can do statistics in R? 🤣


Hey folks,

Did you know that you can do statistics in R? HA! Of course you can. As the first sentence of its Wikipedia entry says, “R is a programming language for statistical computing and data visualization”. I rarely discuss using R for statistical analysis and focus far more attention on the data visualization power of R.

This week, I’d like to share a set of panels from a figure in a paper recently published in Nature, “Lymph node environment drives FSP1 targetability in metastasizing melanoma”.

This figure actually has 12 panels. One is a picture of the mouse model that was used (a) and another is an immunoblot (d). Panels i through l are the same style as e through h. I suspect that if you can figure out how to make the scatter plot in panel b, you can create the one in panel c. Similarly, if you can do the bar plots in panels e and f, you can do those in g through l. Really, if you can do e, you should be able to do f. I’ll have things to say in a critique video that I’ll post on Monday. But let’s say you want to recreate these panels: how would you go about doing that in R? Before I forget, you can download the data as an MS Excel workbook from the Nature site.

Let’s think about the scatter plot first. If you look at that workbook, you’ll notice that the data are very much not tidy! How would we get the data tidy? Well, first we need to read it in. We can use the read_excel() function from the {readxl} package that installs with the {tidyverse} metapackage. That will allow us to specify the sheet we want from the workbook as well as the range. In this case, we want sheet “Fig. 1b” and cells “A4:F13”. We’ll have to use the read_excel arguments to give the data column names. Because some cells (e.g., E8) are blank, they’ll be imported as NA values. Once we pivot the data we’ll have a column with the “LN#” data (i.e., ln), the values (i.e., tpm), and the column names, which we really don’t need after we pivot the data. To create a generation column from the ln column we could use str_remove() and convert the output to a numeric with as.numeric(). So that we can color the points by the stage, we’ll want to create a Stage column using mutate() and case_when(). I’ll break my capitalization rules for this case since the column name will become the title of the legend. I’d also use capitalization for the values in this column.
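Here’s a minimal sketch of those import-and-tidy steps. The sheet name and cell range come from the workbook, but the file name, the column names I assign, the “LN#” label format, and the stage labels in case_when() are all my placeholders, so you’ll need to adjust them to match the actual data.

```r
library(tidyverse)
library(readxl)

fig_1b <- read_excel("fig_1b_source_data.xlsx",   # hypothetical file name
                     sheet = "Fig. 1b",
                     range = "A4:F13",
                     # column names are my guesses; first is the "LN#" column
                     col_names = c("ln", "m1", "m2", "m3", "m4", "m5")) |>
  pivot_longer(-ln, names_to = "column", values_to = "tpm") |>
  drop_na(tpm) |>                                  # blank cells came in as NA
  mutate(generation = as.numeric(str_remove(ln, "LN")),
         # Stage labels below are placeholders -- match them to the figure legend
         Stage = case_when(generation == 0 ~ "Primary",
                           TRUE ~ "Metastasis")) |>
  select(-column)                                  # column names aren't needed
```

The key idea is that pivot_longer() stacks the value columns into a single tpm column so that each row is one observation.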

Now to plot the data! We can generate the scatter plot using geom_point() by mapping generation to the x aesthetic, tpm to the y, and Stage to color. We can get our axes to start at zero and end at the desired end points in coord_cartesian() by setting the xlim and ylim arguments as well as expand = FALSE and clip = "off". This will all hopefully be straightforward if you’ve spent much time with {ggplot2}, though the coord_cartesian() step is a bit more advanced. There are three more sophisticated things about this plot that are a bit different from the norm.
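As a starting point, a bare-bones version of that scatter plot might look like the sketch below. It assumes the tidy fig_1b data frame described above, and the axis limits are placeholders to be matched to the published panel.

```r
library(tidyverse)

fig_1b |>
  ggplot(aes(x = generation, y = tpm, color = Stage)) +
  geom_point() +
  # start both axes at zero; limits here are placeholders
  coord_cartesian(xlim = c(0, 5), ylim = c(0, 150),
                  expand = FALSE, clip = "off")
```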

First is the fit and the confidence interval. To fit a line through data, geom_smooth() is a great tool. That will give us the straight line - assuming we use method = "lm" - but it will also give us a “cloud” around the line to indicate the standard error. For this plot, we don’t want the cloud, we want lines at the edges of where the cloud would be. This is the hard part. We can make our own cloud as a geom = "ribbon" using stat_smooth(). From there we can turn off the fill and set the line color and linetype arguments to match their figure. Nice, eh? If you want the actual model fit information, you’ll need to use the glm() function along with summary(). Something like this should work…

summary(glm(tpm ~ generation, data = fig_1b))
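Putting the ribbon trick together, here’s a hedged sketch of the smooth layers. The colors and linetype are placeholders to be matched to the figure, and fig_1b is the assumed tidy data frame.

```r
library(tidyverse)

fig_1b |>
  ggplot(aes(x = generation, y = tpm)) +
  geom_point() +
  # the fitted line itself, without the shaded standard-error cloud
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  # redraw the cloud as an unfilled ribbon so only its borders show
  stat_smooth(geom = "ribbon", method = "lm",
              fill = NA, color = "black", linetype = "dashed")
```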

Next, we can set the color of the points. I do this a lot with scale_color_manual(). The challenging part here is that the legend leaves out the point for the zero generation. I don’t know if that was intentional or not. To get the same appearance in R, we would again use scale_color_manual() to set the breaks and values arguments. But to drop the zero generation, we would use the limits argument to indicate which stages we want to appear in the legend. This will cause the zero generation point to turn gray. Why? Well, setting the limits argument will cause that point to become NA and the default color for a NA value is gray. See if you can find the argument in scale_color_manual() to set the color of NA to black.
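A sketch of that scale, with the answer to the puzzle included as the last argument (skip that line if you’d rather find it yourself). The stage labels and colors are my placeholders, not the paper’s actual values.

```r
library(tidyverse)

fig_1b |>
  ggplot(aes(x = generation, y = tpm, color = Stage)) +
  geom_point() +
  scale_color_manual(
    breaks = c("Early", "Late"),            # placeholder stage labels
    values = c("dodgerblue", "firebrick"),  # placeholder colors
    limits = c("Early", "Late"),            # drops the zero generation -> NA
    na.value = "black")                     # recolor that NA point black
```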

The third more sophisticated element is the R^2^ value in the lower left corner of the plot. We can calculate the correlation coefficient, R, using cor() by giving it the values in generation and tpm. We’ll square it using the ^ operator (e.g., 2^2 = 4). Because we would need to use a superscript for the 2 and italicize the R, I would use geom_richtext() from {ggtext}. We could assemble the label using paste0() or glue() depending on your preference.
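Something along these lines should get the label onto the plot; the coordinates for the label placement are placeholders, and fig_1b is again the assumed tidy data frame.

```r
library(tidyverse)
library(ggtext)
library(glue)

r2 <- cor(fig_1b$generation, fig_1b$tpm)^2
r2_label <- glue("*R*<sup>2</sup> = {round(r2, 2)}")

fig_1b |>
  ggplot(aes(x = generation, y = tpm)) +
  geom_point() +
  # one-row data frame positions the label; x/y values are placeholders
  geom_richtext(data = tibble(generation = 0.5, tpm = 5, label = r2_label),
                aes(label = label),
                fill = NA, label.color = NA)   # no box around the text
```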

Of course to finish replicating the original plot there will be a fair amount of styling to do to the axis titles, the legend, and the legend placement.

This newsletter is already getting long, but a lot of the things we’d do for the scatter plot we could do here as well. If you look at the “Fig. 1e” sheet you’ll see it’s formatted a bit better than “Fig. 1b”. We’ll still need to tidy the data, add a stage column, and take care of a couple of other bits before we can make the plot. To make the plots there are a few geoms that we’ll need. First, the bars can be generated using stat_summary(). We can set the geom for the bars as "col" and use the mean() function as the value of the fun argument. Instead of scale_color_manual() we’ll want to use scale_fill_manual() to set the fill color to a less saturated shade of the points. Second, we can use another stat_summary() function call to add the error bars using the "errorbar" value for the geom argument. The figure caption says they used the standard deviation of the data to set the limits on the error bars. Third, we can use geom_jitter() to add the points on top of the bars. These three elements will get us pretty close to the basic bar plot they have in these panels.
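Those three layers can be sketched like this. I’m assuming a tidy fig_1e data frame with stage and value columns (the names are mine, not the workbook’s), and I define a small mean ± SD helper so we don’t need any extra packages.

```r
library(tidyverse)

# helper returning mean +/- one standard deviation for the error bars
mean_sd <- function(x) {
  tibble(y = mean(x), ymin = mean(x) - sd(x), ymax = mean(x) + sd(x))
}

fig_1e |>
  ggplot(aes(x = stage, y = value, fill = stage)) +
  stat_summary(fun = mean, geom = "col") +
  stat_summary(fun.data = mean_sd, geom = "errorbar", width = 0.3) +
  geom_jitter(width = 0.15) +
  # placeholder colors -- swap in less saturated shades of the point colors
  scale_fill_manual(values = c("gray80", "lightblue", "pink"))
```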

Finally, how would we calculate and add the P-values to the plots? I’ll have more to say about this in my critique video and why I’m not a fan. Regardless, we can calculate the overall P-value (e.g., P < 1x10^-15^) using aov() much like I showed you above with glm(). If you use the str() function, you’ll see that the output of the aov() function is a list. From that we can figure out how to extract the P-value. The individual comparisons were done using Dunnett’s test, which can be performed using the DunnettTest() function from the {DescTools} package. Again, we’ll use str() to dissect the resulting list and extract the P-values. Once again, because we have the italicized “P” and superscripts, I’d use geom_richtext() to place the P-values. We’ll likely have to run that twice - once for the experiment-wide P-value and once for each separate test. The latter will likely need to use the angle argument to make the text run vertically. Finally, we’d use annotate() with geom = "segment" to add the horizontal line under the experiment-wide P-value.
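Here’s a sketch of those statistics, using the same assumed fig_1e data frame and column names as above. Note that DunnettTest() compares everything against a control group, which defaults to the first factor level, so make sure your stage column is ordered the way you intend.

```r
library(tidyverse)
library(DescTools)

# overall test across the stages
overall <- aov(value ~ stage, data = fig_1e)
str(summary(overall))                      # poke around the structure first
summary(overall)[[1]][["Pr(>F)"]][1]       # then pull out the overall P-value

# Dunnett's test: each stage versus the control (first factor level)
dunnett <- DunnettTest(value ~ stage, data = fig_1e)
str(dunnett)                               # a list holding a matrix of results
dunnett[[1]][, "pval"]                     # the per-comparison P-values
```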

Now that you have the data and the roadmap, see if you can figure out how to create these panels on your own. Also, before watching my critique of the panels, go through the DAIJ process on your own. Let me know what you come up with!

Workshops

I'm pleased to be able to offer you one of three recent workshops! With each you'll get access to 18 hours of video content, my code, and other materials. Click the buttons below to learn more.


In case you missed it…

Here is a livestream that I published this week that relates to previous content from these newsletters. Enjoy!


Finally, if you would like to support the Riffomonas project financially, please consider becoming a patron through Patreon! There are multiple tiers and fun gifts for each. By no means do I expect people to become patrons, but if you need to be asked, there you go :)

I’ll talk to you more next week!

Pat


Riffomonas Professional Development
