Building data visualization intuition


Hey folks,

In last week’s newsletter, I introduced a new approach that I plan on taking in these emails to help you develop your intuition with visualizing data in R (or any language). I asked you to consider a random figure that I found in the most recent issue of the journal mSphere. It’s Figure 1A from the paper, “Exploring novel microbial metabolites and drugs for inhibiting Clostridioides difficile” by Ahmed Abouelkhair and Mohamed Seleem.

The figure shows the level of inhibition of bacterial growth by 527 compounds; 63 of the compounds were deemed “strong hits” because they inhibited growth by at least 90%. Without worrying about actual code, I encouraged you to think about the data and functions you’d need to generate this figure.

Here were my random thoughts: This is a scatter plot in which compounds giving more than 90% inhibition are a burgundy color and those giving less are a green color. There’s also a dashed line indicating the 90% threshold. It took me a minute or two to notice that the x-axis is meaningless. It’s likely the order of the compounds in their database (there seems to be a non-random pattern to the data about three quarters of the way across the axis). I also noticed that there’s no axis line along the bottom of the plot where the x-axis would normally be, but there is a horizontal line at zero.

Those are the parts of the figure, described in a way that you could probably use to make a similar-looking figure with any tool. Now, how would we do this in R?

Let’s start with the data. I assume that the data will be a data frame with two columns, one for the compound name (compound) and one for the level of inhibition (percent_inhibition). I’d likely use mutate to create a column of logicals (trues and falses) that I’d call strong_hit. Values greater than 90 would be TRUE.
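Here is a minimal sketch of that data step. The six inhibition values are made up purely for illustration, and the data frame name inhibition is my own placeholder; only the column names come from the description above.

```r
library(dplyr)

# Made-up stand-in data with the two columns described above
inhibition <- tibble(
  compound = 1:6,
  percent_inhibition = c(95, 12, -40, 90, 99, 3)
)

# Flag strong hits: TRUE when inhibition is greater than 90
# (switch to >= if a value of exactly 90 should also count)
inhibition <- inhibition %>%
  mutate(strong_hit = percent_inhibition > 90)

# How many compounds fall on each side of the threshold?
count(inhibition, strong_hit)
```

Because strong_hit is a logical column, it can be mapped straight onto a color aesthetic later without any recoding.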

I do everything in ggplot2 nowadays, so I start thinking about what geom I’ll use. Probably geom_point. I could map the index (perhaps the compound column) of the database on the x-axis and the level of inhibition (percent_inhibition) on the y-axis. But the fact that the x-axis doesn’t really mean anything makes me wonder if I could use geom_jitter instead. geom_jitter adds a small amount of random noise to the position of each point. By default it jitters in both directions, so I’d set height = 0 to leave the inhibition values alone and randomize only the x-axis position. I wonder what such a figure would look like. I suspect I’d get random clumps of points and that things wouldn’t be evenly distributed across the x-axis. Alternatively, what if I were to randomize the order of the database using something like slice_sample and then use geom_point? That would get rid of the clumping and clear pattern in the data. Those are some ideas to get us going on developing the scatter plot component of the figure.
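To make those three ideas concrete, here is a rough sketch of each. The data are simulated placeholders (200 compounds, with the strong hits deliberately stacked at the end of the "database" so the ordering pattern is visible), and the names sim, shuffled, and plot_order are my own inventions for illustration.

```r
library(dplyr)
library(ggplot2)

# Simulated stand-in data; the strong hits sit at the end on purpose
set.seed(19760620)
sim <- tibble(
  compound = 1:200,
  percent_inhibition = c(rnorm(176, mean = 0, sd = 30), runif(24, 91, 100))
)

# Option 1: database order on the x-axis
p_ordered <- ggplot(sim, aes(x = compound, y = percent_inhibition)) +
  geom_point()

# Option 2: give every point the same x value and spread them with jitter;
# height = 0 keeps the inhibition values exactly where they are
p_jitter <- ggplot(sim, aes(x = 1, y = percent_inhibition)) +
  geom_jitter(width = 0.4, height = 0)

# Option 3: shuffle the rows, then plot against the new row number
shuffled <- sim %>%
  slice_sample(prop = 1) %>%
  mutate(plot_order = row_number())

p_shuffled <- ggplot(shuffled, aes(x = plot_order, y = percent_inhibition)) +
  geom_point()
```

Printing each plot object (for example, p_shuffled) draws it, so you can compare the three side by side and decide which x-axis treatment you prefer.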

Next, I’d think about the colors. I’d use scale_color_manual to map colors onto the logical values in strong_hit. Those compounds with the value of TRUE would get the burgundy color and those with FALSE would get that greenish color. Whatever geom I end up with, I’d use show.legend = FALSE to hide the legend since it is unnecessary for this figure.
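A small sketch of the color mapping might look like this. The hex codes are my guesses at "burgundy" and "greenish", not values taken from the paper, and the simulated data are placeholders.

```r
library(dplyr)
library(ggplot2)

# Simulated stand-in data with the strong_hit flag already added
set.seed(19760620)
sim <- tibble(
  compound = 1:200,
  percent_inhibition = sample(c(rnorm(176, mean = 0, sd = 30),
                                runif(24, 91, 100)))
) %>%
  mutate(strong_hit = percent_inhibition > 90)

# Guessed colors; the names match the logical values as character strings
hit_colors <- c("TRUE" = "#800020", "FALSE" = "#6B8E23")

p <- ggplot(sim, aes(x = compound, y = percent_inhibition,
                     color = strong_hit)) +
  geom_point(show.legend = FALSE) +       # no legend needed here
  scale_color_manual(values = hit_colors)
```

Naming the elements of the values vector means the colors stay attached to TRUE and FALSE regardless of the order in which they are listed.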

Let’s move on to the x-axis and the two lines. First, I’d use the axis.title.x, axis.text.x, axis.line.x, and axis.ticks.x arguments in theme to remove the x-axis. I think I can do this by setting those arguments to element_blank(). To generate the line that looks like an x-axis that hits the y-axis at zero, I’d use geom_hline() with yintercept = 0. I’d use the linewidth argument to get the thickness of this line to match the thickness of the y-axis. For the annotation line at 90%, I’d again use geom_hline() with yintercept = 90 and linetype = "dashed". I’d experiment with that value of linetype to get the right dashing. I would call both instances of geom_hline() before geom_point() so that the points are on top of the lines, and I would include show.legend = FALSE for both so a legend doesn’t appear.
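Put together, the axis removal and the two lines could be sketched like this. I'm assuming theme_classic() as the starting theme because it draws axis lines, and the linewidth of 0.5 is just a starting guess to be tuned by eye; the data are simulated placeholders. The linewidth argument needs ggplot2 3.4.0 or later.

```r
library(ggplot2)

# Simulated stand-in data
set.seed(19760620)
sim <- data.frame(
  compound = 1:200,
  percent_inhibition = sample(c(rnorm(176, mean = 0, sd = 30),
                                runif(24, 91, 100)))
)

p <- ggplot(sim, aes(x = compound, y = percent_inhibition)) +
  # Lines go first so that the points are drawn on top of them
  geom_hline(yintercept = 0, linewidth = 0.5, show.legend = FALSE) +
  geom_hline(yintercept = 90, linetype = "dashed", show.legend = FALSE) +
  geom_point() +
  theme_classic() +
  # Strip every piece of the real x-axis
  theme(
    axis.title.x = element_blank(),
    axis.text.x = element_blank(),
    axis.line.x = element_blank(),
    axis.ticks.x = element_blank()
  )
```

Layers are drawn in the order they are added, which is why the order of the geom_hline() and geom_point() calls matters.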

Now let’s think about the y-axis. By default we might get the values on the y-axis that the figure already has. But to be safe, we can use scale_y_continuous to specify the breaks argument by giving it the values of c(100, 50, 0, -50, -100). Just in case, we should probably also use limits = c(-100, 100) and expand = c(0, 0) to make sure that the y-axis starts and ends at -100 and 100. The y-axis title and text appear to be bold. Since they’re the only text on the figure, we could probably use text = element_text(face = "bold") in theme. Alternatively, we could use the same element_text() syntax as the value to axis.title.y and axis.text.y.
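Here is one way the y-axis pieces might fit together, again on simulated placeholder data. The object name y_scale and the choice of theme_classic() are my own assumptions for the sketch.

```r
library(ggplot2)

# Simulated stand-in data
set.seed(19760620)
sim <- data.frame(
  compound = 1:200,
  percent_inhibition = sample(c(rnorm(176, mean = 0, sd = 30),
                                runif(24, 91, 100)))
)

# Fixed breaks and limits, with no padding beyond -100 and 100
y_scale <- scale_y_continuous(
  breaks = c(100, 50, 0, -50, -100),
  limits = c(-100, 100),
  expand = c(0, 0)
)

p <- ggplot(sim, aes(x = compound, y = percent_inhibition)) +
  geom_point() +
  y_scale +
  theme_classic() +
  # Bold the y-axis title and tick labels
  theme(
    axis.title.y = element_text(face = "bold"),
    axis.text.y = element_text(face = "bold")
  )
```

One thing to watch for: setting limits on the scale removes any points that fall outside -100 to 100 (ggplot2 warns when this happens), so check the warnings if your data can stray beyond that range.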

I think that’s everything, right? I’d encourage you to go back through that narrative and assess what you do and don’t understand. Then look at online R resources, including my Riffomonas materials (MinimalR and generalR) and the R Graphics Cookbook for examples of how to use the new concepts. Finally, see if you can generate the figure yourself using some simulated data. The code below should be close enough to what you need:

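library(tidyverse)  # provides tibble()

set.seed(19760620)
# 464 ordinary compounds centered on 0% inhibition plus 63 strong hits,
# shuffled with sample() so the hits are scattered through the database order
sim_data <- tibble(
  compound = 1:527,
  percent_inhibition = sample(c(rnorm(n = 464, mean = 0, sd = 30),
                                runif(n = 63, min = 91, max = 100)))
)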

Please let me know how this works out for you! Also, if you have a favorite figure that you'd love to see me break down, reply to this email and I'll see about using it in a future newsletter.

Workshops

I'm pleased to be able to offer you one of three recent workshops! With each you'll get access to 18 hours of video content, my code, and other materials. Click the buttons below to learn more.

In case you missed it…

Here is a livestream that I published this week that relates to previous content from these newsletters. Enjoy!

Finally, if you would like to support the Riffomonas project financially, please consider becoming a patron through Patreon! There are multiple tiers and fun gifts for each. By no means do I expect people to become patrons, but if you need to be asked, there you go :)

I’ll talk to you more next week!

Pat

Riffomonas Professional Development
