Plotting the baby BOOM with a ridgeline plot


Hey folks,

I’m gearing up to teach a 1-day (6 hours) data visualization workshop on May 9th. This workshop will cover an introduction to the ggplot2 package and will assume no prior R knowledge. My goal is to help you understand the ggplot2 framework and begin to apply it to make some interesting and compelling visualizations. After this workshop, I hope that you will be able to go off on your own journey learning more advanced topics. You can learn more and register by clicking the button below. Feel free to email me if you have any questions.

If a full day is still too much time, let me know. I could schedule a 6-hour workshop over two days. I can also make an even shorter workshop!


We’re coming into the month of May, which is when we celebrate Mother’s Day in the US (May 11th). In recent months there’s also been a lot of discussion here about the birth rate and reasons for its decline over the past decades. I thought this would be a great time to think about a mom-related data visualization. If nothing else, it will be a reminder to pick up a card for all the moms in your life or to start dropping subtle hints to those who call you mom :)

I found this interesting data visualization in an article about the baby boom on the “Our World in Data” website. It combines fascinating data with a quirky plot called a “ridgeline plot” that has an infographic feel.

What do you think of this figure? I initially thought it was pretty and it really drew me in. But it took me a bit to wrap my head around what was going on in the plot. Each histogram represents the women born in a different year, and the shape of the histogram shows their fertility rate at each age. These data are kind of like the drug overdose heatmap I described a few weeks ago.

Upon further reflection, I had a few negative critiques of this figure. First, the women who gave birth to the “boomers” were largely born in the 1920s. So, to notice their fertility trends you need to look at the top of the figure and not the middle. Second, I find it really challenging to connect a histogram to a year. I think this is because the height of the histogram bleeds over into the previous years. Finally, my wife was born in 1977, her birth cohort is at or past the end of their child-bearing years, and I wanted to see what their distribution looks like. Why not have the data come to 1980 rather than 1970? I think these critiques could largely be addressed by a heatmap like the drug overdose heatmap. Maybe I’ll circle back and make that heatmap. But for now, I want to make this as a ridgeline plot.

Ridgeline plots generated a lot of excitement a few years back. Initially they were called “joy plots” because of the cover art on Joy Division’s “Unknown Pleasures” album. Later, Claus Wilke released the {ggridges} package for generating this type of visualization. I agree with the developer of the fertility rate visualization that this made for an interesting application of ridgeline plots.

Saloni Dattani, the developer of the figure I’m interested in, posted her code to GitHub for others to work with. I haven’t looked too closely at her code, but when I ran it I noticed that it doesn’t really look like the visual I included above. I suspect that she handed the visual off to a graphics team that polished the figure to make it look like an infographic.

I think a lot of people use this strategy. But why?! Probably because of the short term ease of placing text, drawing arrows, manipulating colors. The downside is that when the data are updated, it is much harder to bring those data into the polished infographic. Why not try to generate the infographic directly in R? I don’t think it’s that big of a leap to do this.

First off, to recreate the ridgeline plot that Dattani included, I would use the geom_ridgeline_gradient() function from the {ggridges} package. This allows you to alter the fill color to be a gradient across the histogram. To do this, we’d map the age on the x-axis, the birth year on the y-axis, and the fertility rate to the height and the fill color. There’s also a scale aesthetic that we can use to adjust the height of the histograms.
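
As a rough sketch of that mapping, here is how the layer might look. The fertility data frame is made up purely for illustration (it is not the real fertility data), and the column names age, cohort, and rate, along with scale = 100, are placeholder choices of mine rather than anything taken from Dattani's code.

```r
library(ggplot2)
library(ggridges)

# Made-up stand-in data: one bell-shaped curve of fertility rate by age
# for each birth cohort. These are NOT the real fertility data.
fertility <- expand.grid(age = 15:49, cohort = seq(1900, 1970, by = 5))
peak <- 0.15 + 0.10 * exp(-(fertility$cohort - 1930)^2 / (2 * 15^2))
fertility$rate <- peak * exp(-(fertility$age - 26)^2 / (2 * 5^2))

# Age on x, birth year on y, fertility rate as both height and fill.
# group = cohort keeps each birth year as its own ridgeline.
p <- ggplot(fertility,
            aes(x = age, y = cohort, height = rate, fill = rate,
                group = cohort)) +
  # scale converts rate units into y-axis (year) units; 100 is a guess
  geom_ridgeline_gradient(scale = 100, color = "white")
```

Printing p draws the plot. A larger scale value makes the ridgelines taller, so they overlap more of the neighboring birth years.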

Second, there are two general colors used in the visual: blue and gold. The entire plot picks blues from a gradient between a dark blue, which dominates the background and the low fertility rate values, and a light blue, which is used for the high fertility rate values and most of the text in the figure. A more “medium blue” is used for some of the text and the gridlines. The title is a golden color. To get the full background to be the same dark blue color, I’d use plot.background = element_rect(fill = "darkblue") and panel.background = element_rect(fill = NA) within theme(). We could also use scale_fill_gradient() to set the gradient colors and values.
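
Here is a minimal sketch of how that color scheme could come together. The data are made up, and the named colors ("darkblue", "lightblue", "goldenrod") are stand-ins I picked; the published figure's exact colors would need to be sampled from the image.

```r
library(ggplot2)
library(ggridges)

# Made-up stand-in data (NOT the real fertility data)
fertility <- expand.grid(age = 15:49, cohort = seq(1900, 1970, by = 5))
peak <- 0.15 + 0.10 * exp(-(fertility$cohort - 1930)^2 / (2 * 15^2))
fertility$rate <- peak * exp(-(fertility$age - 26)^2 / (2 * 5^2))

styled <- ggplot(fertility,
                 aes(x = age, y = cohort, height = rate, fill = rate,
                     group = cohort)) +
  geom_ridgeline_gradient(scale = 100, color = "lightblue") +
  # Low rates blend into the dark background; high rates are light blue
  scale_fill_gradient(low = "darkblue", high = "lightblue") +
  labs(title = "Placeholder title") +
  theme(
    # One dark blue behind everything, with a see-through panel on top
    plot.background = element_rect(fill = "darkblue", color = NA),
    panel.background = element_rect(fill = NA),
    legend.background = element_rect(fill = NA),
    # Light blue for most text, gold for the title
    text = element_text(color = "lightblue"),
    axis.text = element_text(color = "lightblue"),
    plot.title = element_text(color = "goldenrod")
  )
```

Setting the panel fill to NA is what lets the plot background show through, so the whole figure reads as a single dark blue canvas.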

Third, the axes have some nice styling to them that is not found in the GitHub version. The axis lines, ticks, and grid lines are all the same thickness and color. Their appearance can be controlled within theme(). Typically, our x-axis labels and titles are at the bottom of the plot. Here they’re at the top. I’m pretty sure we can do this within scale_x_continuous() using the position = "top" argument-value combination. While we can put the x-axis title on the top in bold using labs() and theme(), the y-axis title will require a different strategy. Since it’s located at the top of the y-axis, I’d likely position that using the annotate(geom = "text") function.
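
A sketch of the axis styling might look something like this. The data are made up, the label text and the "steelblue" color are placeholders, and the coordinates for the y-axis title are guesses that would need tuning against the real figure. The linewidth argument assumes ggplot2 3.4.0 or later.

```r
library(ggplot2)
library(ggridges)

# Made-up stand-in data (NOT the real fertility data)
fertility <- expand.grid(age = 15:49, cohort = seq(1900, 1970, by = 5))
peak <- 0.15 + 0.10 * exp(-(fertility$cohort - 1930)^2 / (2 * 15^2))
fertility$rate <- peak * exp(-(fertility$age - 26)^2 / (2 * 5^2))

axes <- ggplot(fertility,
               aes(x = age, y = cohort, height = rate, fill = rate,
                   group = cohort)) +
  geom_ridgeline_gradient(scale = 100, color = "white") +
  # Move the x-axis labels and title to the top of the panel
  scale_x_continuous(position = "top") +
  # Drop the default y-axis title; it is replaced by the annotation below
  labs(x = "Age (placeholder title)", y = NULL) +
  # Stand-in for the y-axis title, placed near the top of the y-axis
  annotate(geom = "text", x = 15, y = 1998,
           label = "Birth year (placeholder title)",
           hjust = 0, fontface = "bold") +
  theme(
    # Same color and thickness for axis lines, ticks, and grid lines
    axis.line = element_line(color = "steelblue", linewidth = 0.3),
    axis.ticks = element_line(color = "steelblue", linewidth = 0.3),
    panel.grid.major = element_line(color = "steelblue", linewidth = 0.3),
    panel.grid.minor = element_blank(),
    axis.title.x.top = element_text(face = "bold", hjust = 0)
  )
```

Because the annotation sits in data coordinates, its x and y values have to be adjusted whenever the axis limits change.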

Fourth, the legend is in the bottom right corner of the figure. I think we can move this using legend.position within theme() by giving it coordinates to place the legend. In ggplot2 3.5.0 and later, that means setting legend.position = "inside" and passing the coordinates to legend.position.inside. The tick marks on the legend leave a bit to be desired. I’d try to clean those up a bit using the legend.ticks argument of theme(). I like how they have a text blurb next to the legend describing what the numbers in the legend mean. We could do that with another annotate() function call. That blurb is part of what I think gives it an “infographic” feel.
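
Here is one way the legend placement could be sketched, assuming ggplot2 3.5.0 or later (which is when legend.position.inside and legend.ticks became available). The data are made up, and the coordinates, tick styling, and blurb text are all placeholders.

```r
library(ggplot2)
library(ggridges)

# Made-up stand-in data (NOT the real fertility data)
fertility <- expand.grid(age = 15:49, cohort = seq(1900, 1970, by = 5))
peak <- 0.15 + 0.10 * exp(-(fertility$cohort - 1930)^2 / (2 * 15^2))
fertility$rate <- peak * exp(-(fertility$age - 26)^2 / (2 * 5^2))

with_legend <- ggplot(fertility,
                      aes(x = age, y = cohort, height = rate, fill = rate,
                          group = cohort)) +
  geom_ridgeline_gradient(scale = 100, color = "white") +
  # name = NULL drops the legend title; the blurb below explains it instead
  scale_fill_gradient(low = "darkblue", high = "lightblue", name = NULL) +
  # Placeholder blurb describing what the legend numbers mean
  annotate(geom = "text", x = 40, y = 1905,
           label = "Placeholder blurb\nexplaining the legend",
           hjust = 1, size = 3, color = "white") +
  theme(
    # Coordinates run from 0 to 1 across the panel; c(1, 0) justification
    # anchors the legend by its bottom right corner
    legend.position = "inside",
    legend.position.inside = c(0.98, 0.02),
    legend.justification = c(1, 0),
    # Thin, light ticks on the color bar
    legend.ticks = element_line(color = "white", linewidth = 0.3)
  )
```

Using element_blank() for legend.ticks instead would remove the tick marks from the color bar entirely.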

Finally, aside from the blurb next to the legend, what makes this figure feel like an infographic? I think it’s the text. There’s a fair amount of text in the right-hand margin. This text - and the associated fonts - are really what separate Dattani’s GitHub version from the published version. I’d probably insert those blurbs using annotate(geom = "text"). There are a few arrows and line annotations that are used to point to data. These could mostly be drawn, again using annotate(), but with either the "segment" or "curve" geoms along with the arrow argument of each. The two titles have a serif font (Google Fonts’ Domine?) and the other text is a sans serif font (Franklin?).
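
To give a feel for those annotations, here is a small sketch. The data are made up, every coordinate and label is a placeholder, and I've used R's built-in "serif" and "sans" families as stand-ins since I haven't confirmed the actual fonts. Placing the text at x = 52 simply stretches the x-axis to make room on the right, which is a shortcut rather than a true margin.

```r
library(ggplot2)
library(ggridges)

# Made-up stand-in data (NOT the real fertility data)
fertility <- expand.grid(age = 15:49, cohort = seq(1900, 1970, by = 5))
peak <- 0.15 + 0.10 * exp(-(fertility$cohort - 1930)^2 / (2 * 15^2))
fertility$rate <- peak * exp(-(fertility$age - 26)^2 / (2 * 5^2))

annotated <- ggplot(fertility,
                    aes(x = age, y = cohort, height = rate, fill = rate,
                        group = cohort)) +
  geom_ridgeline_gradient(scale = 100, color = "white") +
  labs(title = "Placeholder title") +
  # Text blurb to the right of the data
  annotate(geom = "text", x = 52, y = 1945,
           label = "Placeholder note about\na group of birth years",
           hjust = 0, size = 3, family = "sans") +
  # Straight arrow from the blurb toward the data
  annotate(geom = "segment", x = 51, xend = 40, y = 1945, yend = 1945,
           arrow = arrow(length = unit(0.1, "inches"))) +
  # Curved arrow pointing at a different spot
  annotate(geom = "curve", x = 51, xend = 30, y = 1925, yend = 1935,
           curvature = 0.3,
           arrow = arrow(length = unit(0.1, "inches"))) +
  # Serif title, as a stand-in for the published figure's title font
  theme(plot.title = element_text(family = "serif", face = "bold"))
```

The sign of curvature controls which way the curve bends, so flipping it is often the quickest way to keep an arrow from crossing the text.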

Although I think this is all completely doable in R, I suspect it will take a bit of work - especially for those arrows! Regardless, I think it will be worth it to see how we can make an infographic in R. Are there any cool infographics you've seen that you'd like to see how to create in R? Let me know.

Workshops

I'm pleased to be able to offer you one of three recent workshops! With each you'll get access to 18 hours of video content, my code, and other materials. Click the buttons below to learn more.

In case you missed it…

Here are some videos that I published this week that relate to previous content from these newsletters. Enjoy!

Finally, if you would like to support the Riffomonas project financially, please consider becoming a patron through Patreon! There are multiple tiers and fun gifts for each. By no means do I expect people to become patrons, but if you need to be asked, there you go :)

I’ll talk to you more next week!

Pat
