Now that I know how to use annotate(), I can see uses for it everywhere!


Hey folks,

I’d love to hear your experiences trying to recreate the figures I’ve been discussing in recent newsletters. Does a “verbal” description of my thought process for each figure help? Can you pick a figure and do it yourself? What are the biggest obstacles to translating between the verbal description and actual code? Feel free to reply to this email to let me know how you like this approach. Also, if you have a figure you’d like me to walk through, I’d love that too!


This week I want you to look at Figures 3E and 3F from “Preexisting cell state rather than stochastic noise confers high or low infection susceptibility of human lung epithelial cells to adenovirus” by Anthony Petkidis and colleagues, which was recently published in mSphere.

There isn’t anything super special about this set of figures. What I want to draw your attention to is their general style. You’ll notice that the x-axis in both figures is broken. As I go looking through papers, I find broken axes like these a lot. Often they are broken y-axes, but as this case shows, people break the x-axis too.

Why do people break an axis? In this case, we see that there was a jump in the data between weeks 4 and 8. Instead of having 3 empty positions in their figures, they cut that out. When people break the y-axis, they often have a big difference between the values for different treatments. Perhaps A is super high and B and C are much lower, but C is greater than B. The author wants to call attention to the relationship between all three treatments.

Why shouldn’t you break an axis? The honest truth is that breaking axes is generally considered a poor data visualization practice. That’s because although the axis is clearly indicated as broken, the human eye and mind will quickly forget that and make comparisons between points based on their distance to each other.

How would you get around the need for a break? One thing these authors could have done would have been to make facets of the early and later time points and draw boxes around both groups. They’d share a y-axis, but there’d be a stronger indication that there’s a jump in the data. For a break in the y-axis, you might consider a log scaled axis or some other transformation that compresses the difference in the data. Alternatively, you might set the limits on the y-axis to highlight the difference between B and C. The value for A would then either be hidden or, for a line or bar plot, extend outside the plot with an annotation indicating the value of A. You might also ask if A is so much larger than B and C, does the difference between B and C really matter?
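Here is a rough sketch of the faceting idea, not a polished figure. It assumes a small made-up data frame with weeks 1 to 4 and weeks 8 and 9; the period column, the axis labels, and the object names are my own placeholders rather than anything from the paper.

```r
library(dplyr)
library(ggplot2)

# Made-up data: six time points with a gap between weeks 4 and 8
set.seed(19760620)
f_data <- tibble(
  weeks = c(1, 2, 3, 4, 8, 9),
  rel_infection = runif(6, min = 0.75, max = 2),
  ri_high = rel_infection + 0.25,
  ri_low = rel_infection - 0.25
)

# Label each row as belonging to the early or late block of weeks
facet_data <- f_data %>%
  mutate(period = if_else(weeks <= 4, "Weeks 1-4", "Weeks 8-9"))

p_facet <- ggplot(facet_data,
                  aes(x = weeks, y = rel_infection,
                      ymin = ri_low, ymax = ri_high)) +
  geom_point() +
  geom_errorbar(width = 0.2) +
  # One panel per period; the panels share a y-axis, and each panel's
  # width is proportional to the range of weeks it covers
  facet_grid(cols = vars(period), scales = "free_x", space = "free_x") +
  scale_x_continuous(breaks = 1:9) +
  labs(x = "Weeks", y = "Relative infection") +
  theme_classic() +
  # Draw a box around each panel to make the jump in time obvious
  theme(panel.border = element_rect(fill = NA, color = "black"))

p_facet
```

The boxes and the separate panel titles do the job of the axis break: nobody will mistake week 8 for the week after week 4.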

But what do I know? :) How would I go about creating a break in an axis?

First of all, I’m so sure there’s a package out there to do this for you that I’m not going to bother with the google search. Again, my goal with these discussions isn’t to solve specific problems, but to help you think more generally about how to solve problems with R.

Let’s start by assuming we have a data frame that looks something like this…


library(tidyverse)

# Simulated data: six time points with a gap between weeks 4 and 8
set.seed(19760620)
f_data <- tibble(
  weeks = c(1, 2, 3, 4, 8, 9),
  rel_infection = runif(6, min = 0.75, max = 2),
  ri_high = rel_infection + 0.25,
  ri_low = rel_infection - 0.25)

Of course, we would plot this with geom_point() and to get the error bars we’d use something like geom_errorbar(). You might recall from last week’s newsletter that you can have a plotting symbol with one color on the border and a different one on the inside. Hopefully by now, you’ve learned how to play with labs() to alter the x and y-axis labels. Let’s think about that break.
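If you want a starting point before adding the break, here is a minimal sketch of the unbroken plot. It rebuilds the simulated data so it runs on its own. The axis label text, the point size, and the white fill are my guesses, not values taken from the published figure.

```r
library(dplyr)
library(ggplot2)

# Same simulated data as in this newsletter
set.seed(19760620)
f_data <- tibble(
  weeks = c(1, 2, 3, 4, 8, 9),
  rel_infection = runif(6, min = 0.75, max = 2),
  ri_high = rel_infection + 0.25,
  ri_low = rel_infection - 0.25
)

p_base <- ggplot(f_data,
                 aes(x = weeks, y = rel_infection,
                     ymin = ri_low, ymax = ri_high)) +
  # Error bars go first so the points are drawn on top of them
  geom_errorbar(width = 0.2) +
  # Shape 21 takes a border color (color) and an interior color (fill)
  geom_point(shape = 21, color = "black", fill = "white", size = 3) +
  labs(x = "Weeks", y = "Relative infection")

p_base
```

Because weeks is still numeric here, you’ll see the three empty positions between weeks 4 and 8 that the authors cut out.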

My general idea is that I need to pull the data together on the x-axis with space for one piece of missing data. I can pull everything together by recasting the weeks column as a character. Basically, I’m turning a numerical variable into a categorical variable. That compresses everything together. But how do we put a gap in between weeks 4 and 8? Let’s insert a fake week, week 5. To do this, I’d probably use bind_rows() to add a row to f_data where the weeks value is 5 and everything else is -1. Something like bind_rows(f_data, c(weeks = 5, rel_infection = -1, ri_high = -1, ri_low = -1)). Keep the 5 numeric so that it matches the type of the weeks column, and recast weeks as a character after adding the row. Remember that you’ll want to set your y-axis limit to hide the value for week 5. The coord_cartesian() function should help you to set the limits on the axes. I’d probably set the minimum value on the y-axis to be 0.4 and the maximum to be 2.5. By default, the axes are drawn with some padding on both ends. To make things easier, I’d turn this off by setting expand = c(0, 0) as an argument to scale_y_continuous().
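Here is a sketch of just this step, assuming the simulated data from above. I’ve used a one-row tibble() for the fake week instead of a named vector, which is a small swap on my part; gap_data and p_gap are placeholder names.

```r
library(dplyr)
library(ggplot2)

# Same simulated data as in this newsletter
set.seed(19760620)
f_data <- tibble(
  weeks = c(1, 2, 3, 4, 8, 9),
  rel_infection = runif(6, min = 0.75, max = 2),
  ri_high = rel_infection + 0.25,
  ri_low = rel_infection - 0.25
)

gap_data <- f_data %>%
  # Fake week 5 with values that will fall below the y-axis limit
  bind_rows(tibble(weeks = 5, rel_infection = -1,
                   ri_high = -1, ri_low = -1)) %>%
  # Recast weeks as a character so the x-axis becomes categorical
  mutate(weeks = as.character(weeks))

p_gap <- ggplot(gap_data,
                aes(x = weeks, y = rel_infection,
                    ymin = ri_low, ymax = ri_high)) +
  geom_point() +
  geom_errorbar() +
  # Zoom the y-axis so the fake week 5 values are out of view
  coord_cartesian(ylim = c(0.4, 2.5)) +
  # Remove the default padding so the axis starts exactly at 0.4
  scale_y_continuous(expand = c(0, 0))

p_gap
```

The categories sort as 1, 2, 3, 4, 5, 8, 9, so week 5 lands in the fifth position and leaves an empty slot between weeks 4 and 8.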

Now we have pulled weeks 8 and 9 back towards week 4 with a gap in between. We’d like to get rid of the 5 on the x-axis. We can probably do that with scale_x_discrete(). The trick here would be to duplicate the 4 in the vector given to the breaks argument so that it appears in position 5. Then in the labels argument put "" in position 5. This will get rid of the tick mark for week 5.

Next, I’d apply theme_classic(), which is pretty similar to what these authors did. This will give us a white background that is easier to work with. Now let’s draw that break on the x-axis. Last week, I mentioned the annotate() function. We’ll use that again here to draw the diagonal lines. The annotate() function takes the name of a geom as its first argument. I’d like to draw line segments, which I’d normally draw with geom_segment(). But with annotate(), I’ll use geom = "segment". Then I can give the function values for x, y, xend, and yend. Let’s keep in mind that our x-axis crosses the y-axis at 0.4. For the first slash, I’d use annotate(geom = "segment", x = 4.7, xend = 4.9, y = 0.35, yend = 0.45). When you draw that slash, you’ll only find half of it. The part that goes below the axis is missing. To solve that problem, we can use clip = "off" as an argument to coord_cartesian(). Nice, eh? Now see if you can figure out how to make a second slash that is parallel to the first, but over to the right a smidge.

Now we have the break, but we haven’t removed the axis line between the slashes. One thought I had was to add another annotation segment between the two slashes. We could again do that with annotate(geom = "segment") using y = 0.4 and yend = 0.4 and x = 4.8 and an appropriate value for xend. Try making linewidth = 3 and color = "red" so it’s easier to see where the segment is.
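To see the clipping behavior for yourself, here is a sketch that draws only the first slash, assuming the simulated data and the fake week 5 described above. The object names are placeholders of mine. The second slash and the red segment are left for you to try.

```r
library(dplyr)
library(ggplot2)

# Same simulated data as in this newsletter, plus the fake week 5
set.seed(19760620)
gap_data <- tibble(
  weeks = c(1, 2, 3, 4, 8, 9),
  rel_infection = runif(6, min = 0.75, max = 2),
  ri_high = rel_infection + 0.25,
  ri_low = rel_infection - 0.25
) %>%
  bind_rows(tibble(weeks = 5, rel_infection = -1,
                   ri_high = -1, ri_low = -1)) %>%
  mutate(weeks = as.character(weeks))

p_slash <- ggplot(gap_data,
                  aes(x = weeks, y = rel_infection,
                      ymin = ri_low, ymax = ri_high)) +
  geom_point() +
  geom_errorbar() +
  # First slash: a short diagonal that straddles the x-axis at y = 0.4.
  # On a discrete axis the categories sit at positions 1, 2, 3, ...
  annotate(geom = "segment", x = 4.7, xend = 4.9, y = 0.35, yend = 0.45) +
  # clip = "off" lets the lower half of the slash draw below the axis
  coord_cartesian(ylim = c(0.4, 2.5), clip = "off") +
  scale_y_continuous(expand = c(0, 0)) +
  theme_classic()

p_slash
```

Try switching clip = "off" back to the default of "on" to watch the bottom half of the slash disappear.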

If you get that to work right, you’ll notice a problem. Do you see the problem? It appears that the data are plotted under the axis rather than on top. What I wanted to do would be to make that line white so that it masks the axis. But because it is under the axis, that strategy won’t work.

Now we need another solution. That other solution would be to remove the x-axis entirely and draw a new one made up of two segments. To remove the x-axis you can use the theme() function to modify axis.line.x by passing it element_blank(). Now to replace that axis. See if you can figure out how to modify the code you wrote for the thick red line to make a normal thickness black line that goes across the x-axis without including the break in the axis.

Got it? To be honest, before last week I didn’t know about annotate()! I think it’s a pretty handy tool for adding customized elements to a plot. If you had a hard time keeping up, here’s some unpolished code of my own minimum example of how to add a break to the x-axis (it will look much better on a laptop than on your phone). If you’re feeling pretty good about your skills, see if you can now put a break in the y-axis. While you’re at it, see if you can try one of the other strategies I mentioned for avoiding breaking your axes!


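f_data %>%
  # add a fake week 5 whose values fall below the y-axis limit
  bind_rows(c(weeks = 5, rel_infection = -1, ri_high = -1, ri_low = -1)) %>%
  # recast weeks as a character so the x-axis is categorical
  ggplot(aes(x = as.character(weeks), y = rel_infection,
             ymin = ri_low, ymax = ri_high)) +
  geom_point() +
  geom_errorbar() +
  # the two slashes that mark the break
  annotate(geom = "segment", x = 4.7, xend = 4.9, y = 0.35, yend = 0.45) +
  annotate(geom = "segment", x = 5.1, xend = 5.3, y = 0.35, yend = 0.45) +
  # the replacement x-axis, drawn as two segments with a gap at the break
  annotate(geom = "segment", x = c(0.5, 5.2), xend = c(4.8, 7.5),
           y = 0.4, yend = 0.4) +
  # hide week 5 and let the slashes extend below the axis
  coord_cartesian(ylim = c(0.4, 2.5), clip = "off") +
  scale_y_continuous(expand = c(0, 0)) +
  # blank out the label in position 5
  scale_x_discrete(breaks = c(1, 2, 3, 4, 4, 8, 9),
                   labels = c(1, 2, 3, 4, "", 8, 9)) +
  theme_classic() +
  # remove the default x-axis line
  theme(axis.line.x = element_blank())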

Workshops

I'm pleased to be able to offer you one of three recent workshops! With each you'll get access to 18 hours of video content, my code, and other materials. Click the buttons below to learn more.

In case you missed it…

Here is a livestream that I published this week that relates to previous content from these newsletters. Enjoy!


Finally, if you would like to support the Riffomonas project financially, please consider becoming a patron through Patreon! There are multiple tiers and fun gifts for each. By no means do I expect people to become patrons, but if you need to be asked, there you go :)

I’ll talk to you more next week!

Pat

Riffomonas Professional Development
