Visualizing bias in polling data with a dumbbell plot

Hey folks,

Next week is Thanksgiving here in the US and I’ll skip sending you another newsletter. In exchange, you’ll get three videos on YouTube inspired by a newsletter post from October talking about a descending bar plot with a pattern in one of the bars. Before you thank me, you might want to check out today’s newsletter🤣!

I’ve always enjoyed articles from the old 538 and appreciated the data-centric point of view of its founder, Nate Silver. He has a Substack newsletter, “Silver Bulletin”, that is very good. I’m too cheap to pay for a subscription, so I settle for the breadcrumbs he includes with the free subscription.

Last night I received his latest article, “Hopium comes at a high price”. The article is part of a debrief on the election and the state of polling and predictive models like his. His contention is that polls continue to underestimate Trump’s numbers, but within the margin of error of those polls.

Regardless of what you think of Trump or Silver’s analysis, I was captivated by the visual that he included in the newsletter.

As always, I encourage you to ask some questions about any plot you find to help you develop your taste and think through how you would recreate elements of a plot. What type of plot is this? Aside from the data story, what is interesting about this figure? What do you like about it? What don’t you like about it? Can you outline the steps you would take to generate the figure? What are some of the steps you aren’t sure about and would like to learn?

This plot was eerily reminiscent of a plot that I made back in 2021 showing the likelihood of people getting the COVID-19 vaccine at different times by country. I called this plot a “dumbbell” or “barbell” plot because for each entity (e.g., state or country) there are two balls connected by a line - it looks like a dumbbell. You might recall another set of videos I made recently based on paired data where I made a scatter plot and a slope plot inspired by sentiments of farmers and non-farmers in Sweden. A dumbbell plot is another way to show paired data for a handful of entities.

If I were asked to recreate Silver’s figure, I’d expect to get a data frame with three columns - state, polling, and actual. I’ll assume that positive values in the two margin columns would be for Trump and the negative values would be for Harris. Something like this:

library(tidyverse)

# Positive margins favor Trump; negative margins favor Harris
margins <- tibble(
  state = c("Arizona", "North Carolina", "Nevada", "Georgia",
            "Pennsylvania", "Wisconsin", "Michigan", "National"),
  polling = c(2.4, 1.1, 0.6, 1.0, 0.2, -1.0, -1.2, -1.0),
  actual = c(5.5, 3.2, 3.1, 2.2, 1.7, 0.9, 1.4, 1.5)
)

At a basic level, a dumbbell plot can be made with a combination of geom_point(), geom_segment(), and geom_text(). But when I start thinking about how I would map the data to each aesthetic, things get a bit more complicated.

Let’s start with the handles. Using my margins data frame, I could generate the handles of the dumbbells using geom_segment() by mapping state to y and yend, polling to x, and actual to xend. The thing that first caught my eye about this plot was the gradient in the handles. I’ve never made a gradiented line like this in R before. But in preparing next week’s videos I saw that the latest ggplot2 release allows you to use gradients and I have to think that we can use a similar approach to making a gradient in a line.
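Here is a minimal sketch of that handle mapping, assuming the margins data frame from above. The line width and colors are my guesses. The second plot is only a stand-in for a true gradient: it chops each handle into short pieces with uncount() and colors the pieces along a scale, which is not the gradient feature in the latest ggplot2 release.

```r
library(tidyverse)

margins <- tibble(
  state = c("Arizona", "North Carolina", "Nevada", "Georgia",
            "Pennsylvania", "Wisconsin", "Michigan", "National"),
  polling = c(2.4, 1.1, 0.6, 1.0, 0.2, -1.0, -1.2, -1.0),
  actual = c(5.5, 3.2, 3.1, 2.2, 1.7, 0.9, 1.4, 1.5)
)

# Plain handles: one segment per state running from polling to actual
handles <- ggplot(margins,
                  aes(x = polling, xend = actual, y = state, yend = state)) +
  geom_segment(linewidth = 2, color = "gray70")

# Approximate gradient (assumption: 50 pieces per handle looks smooth enough)
n_pieces <- 50

pieces <- margins |>
  uncount(n_pieces, .id = "piece") |>   # repeat each state n_pieces times
  mutate(
    x = polling + (actual - polling) * (piece - 1) / n_pieces,
    xend = polling + (actual - polling) * piece / n_pieces
  )

gradient_handles <- ggplot(pieces,
                           aes(x = x, xend = xend, y = state, yend = state,
                               color = piece)) +
  geom_segment(linewidth = 2, show.legend = FALSE) +
  scale_color_gradient(low = "gray85", high = "darkgreen")
```

Printing handles or gradient_handles draws the plot. The pieces tile each handle end to end, so the first piece starts at the polling margin and the last one ends at the actual margin.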

What about the “bells”? For those, I need all of the polling data in a single column. I’d need to generate a second data frame using pivot_longer() where I’d have state, margin, and percentage columns. Then I could map state to y, percentage to x, and margin to fill. I say fill, because I’d use plotting symbol 21 which is a bordered circle. The border color would be gray (“polling”) or black (“actual”) and the fill would either be white (“polling”) or green (“actual”). I’d customize the colors using scale_color_manual() and scale_fill_manual().
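A minimal sketch of the bells, assuming the same margins data frame. I map margin to both color and fill so that the border and the inside of each point can be set separately. The point size and the exact shades of gray and green are guesses.

```r
library(tidyverse)

margins <- tibble(
  state = c("Arizona", "North Carolina", "Nevada", "Georgia",
            "Pennsylvania", "Wisconsin", "Michigan", "National"),
  polling = c(2.4, 1.1, 0.6, 1.0, 0.2, -1.0, -1.2, -1.0),
  actual = c(5.5, 3.2, 3.1, 2.2, 1.7, 0.9, 1.4, 1.5)
)

# One row per state and margin type, with the values in a single column
margins_long <- margins |>
  pivot_longer(cols = c(polling, actual),
               names_to = "margin", values_to = "percentage")

# Shape 21 is a bordered circle: color sets the border, fill sets the inside
bells <- ggplot(margins_long,
                aes(x = percentage, y = state, color = margin, fill = margin)) +
  geom_point(shape = 21, size = 4) +
  scale_color_manual(values = c(polling = "gray50", actual = "black")) +
  scale_fill_manual(values = c(polling = "white", actual = "darkgreen"))
```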

The labels are a bit more tricky. I’d use geom_text() in an approach similar to what I did with geom_point(), except I’d need to create a new column indicating a label. To do this, I’d probably use glue() or paste() to indicate whether the margin was for Harris or Trump based on the sign of the margin. I notice that the label for the actual margin is bolded, but it is a plain font for the polling margin. I had to check, but sure enough there is a fontface aesthetic that works with geom_text(). ggplot2 doesn’t have a scale_fontface_manual() function, but I’d have to think there’s a way to set the font faces manually, perhaps with scale_discrete_manual(). I’d use one of the “nudge” arguments in geom_text() to move the labels down a uniform amount for each of the states. I have to admit that I feel like the individual labels on each point make the plot a bit cluttered.
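A rough sketch of the labels, assuming the long version of the margins data frame. Rather than a manual scale, I store the font face ("plain" or "bold") in a column and map it straight to the fontface aesthetic. The leader, label, and face column names, the nudge amount, and the text size are all made up for illustration.

```r
library(tidyverse)

margins <- tibble(
  state = c("Arizona", "North Carolina", "Nevada", "Georgia",
            "Pennsylvania", "Wisconsin", "Michigan", "National"),
  polling = c(2.4, 1.1, 0.6, 1.0, 0.2, -1.0, -1.2, -1.0),
  actual = c(5.5, 3.2, 3.1, 2.2, 1.7, 0.9, 1.4, 1.5)
)

margins_labelled <- margins |>
  pivot_longer(cols = c(polling, actual),
               names_to = "margin", values_to = "percentage") |>
  mutate(
    # The sign of the margin decides whose name goes in the label
    leader = if_else(percentage >= 0, "Trump", "Harris"),
    label = paste0(leader, " +", sprintf("%.1f", abs(percentage))),
    # Bold for the actual margin, plain for the polling margin
    face = if_else(margin == "actual", "bold", "plain")
  )

labels_plot <- ggplot(margins_labelled,
                      aes(x = percentage, y = state,
                          label = label, fontface = face)) +
  geom_text(nudge_y = -0.3, size = 3)   # shift every label down by the same amount
```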

I was also struck by the “legend” across the top indicating the white point is the polling average margin and the green the actual margin. I’d probably use a few annotate() statements to add the text as well as the short line segments. Since these fall outside of the plotting area, I’ll likely have to turn off the clipping in coord_cartesian().
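Here is one way the hand-built legend could look, as a sketch only. The label wording and every x and y position are guesses that would need tuning by eye. With the default alphabetical ordering, the eight states sit at y positions 1 to 8, so anything above 8.5 lands outside the panel and only shows up because clipping is off and the top margin is enlarged.

```r
library(tidyverse)

margins <- tibble(
  state = c("Arizona", "North Carolina", "Nevada", "Georgia",
            "Pennsylvania", "Wisconsin", "Michigan", "National"),
  polling = c(2.4, 1.1, 0.6, 1.0, 0.2, -1.0, -1.2, -1.0),
  actual = c(5.5, 3.2, 3.1, 2.2, 1.7, 0.9, 1.4, 1.5)
)

margins_long <- margins |>
  pivot_longer(cols = c(polling, actual),
               names_to = "margin", values_to = "percentage")

legend_plot <- ggplot(margins_long, aes(x = percentage, y = state)) +
  geom_point(aes(fill = margin), shape = 21, size = 4, show.legend = FALSE) +
  scale_fill_manual(values = c(polling = "white", actual = "darkgreen")) +
  # Text and short pointer lines above the top row (Wisconsin: -1.0 and 0.9)
  annotate("text", x = -1.0, y = 9.3, label = "Polling average margin",
           hjust = 0.5, size = 3) +
  annotate("segment", x = -1.0, xend = -1.0, y = 9.1, yend = 8.7,
           color = "gray50") +
  annotate("text", x = 0.9, y = 9.3, label = "Actual margin",
           hjust = 0.5, size = 3) +
  annotate("segment", x = 0.9, xend = 0.9, y = 9.1, yend = 8.7,
           color = "black") +
  # Keep the panel limits on the eight states, but let annotations spill out
  coord_cartesian(ylim = c(0.5, 8.5), clip = "off") +
  theme(plot.margin = margin(t = 40, r = 10, b = 10, l = 10))
```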

The axis labels also have some cool things going on. The x-axis text is a pretty slick way of embedding who was favored to the left and right of the black line at zero. I’d use scale_x_continuous() to add those labels. On the y-axis text, I notice that “National” is bolded while the other states are a plain font face. I’d likely use element_markdown() from the {ggtext} package to pull off that effect.
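A sketch of both axis tricks, assuming the {ggtext} package is installed. The helper names label_margin() and bold_national() are mine, and the “Even” label at zero is a guess at the wording.

```r
library(tidyverse)
library(ggtext)

margins <- tibble(
  state = c("Arizona", "North Carolina", "Nevada", "Georgia",
            "Pennsylvania", "Wisconsin", "Michigan", "National"),
  polling = c(2.4, 1.1, 0.6, 1.0, 0.2, -1.0, -1.2, -1.0),
  actual = c(5.5, 3.2, 3.1, 2.2, 1.7, 0.9, 1.4, 1.5)
)

# Hypothetical helper: turn signed margins into "Harris +2" / "Trump +2" labels
label_margin <- function(x) {
  case_when(
    x < 0 ~ paste0("Harris +", abs(x)),
    x > 0 ~ paste0("Trump +", x),
    TRUE ~ "Even"
  )
}

# Hypothetical helper: wrap "National" in markdown bold, leave states alone
bold_national <- function(state) {
  if_else(state == "National", "**National**", state)
}

axis_plot <- ggplot(margins, aes(x = actual, y = state)) +
  geom_point() +
  scale_x_continuous(labels = label_margin) +
  scale_y_discrete(labels = bold_national) +
  theme(axis.text.y = element_markdown())   # render the ** ** as bold text
```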

Finally, the plot has vertical grid lines that are grey. There’s also one that is black at zero. We could do one or the other in the theme() function, but not both (that I know of). I’d probably make the grid lines grey in theme() and then add the solid black line using geom_vline(). A very small detail I noticed was that there is a “Trump +0.2” label for Pennsylvania that crosses the zero line and causes an apparent break in the line. This initially made me think about using geom_label() because you can use a white background to hide any overlapping features, but that isn’t seen with other cases where labels overlap grid lines. For this one label, I’d likely add a white segment using annotate().
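The grid line approach could be sketched like this. The colors, line widths, and the position of the white patch are guesses. With the default alphabetical ordering, Pennsylvania sits at y position 7, so a label nudged below it lands around 6.7.

```r
library(tidyverse)

margins <- tibble(
  state = c("Arizona", "North Carolina", "Nevada", "Georgia",
            "Pennsylvania", "Wisconsin", "Michigan", "National"),
  polling = c(2.4, 1.1, 0.6, 1.0, 0.2, -1.0, -1.2, -1.0),
  actual = c(5.5, 3.2, 3.1, 2.2, 1.7, 0.9, 1.4, 1.5)
)

grid_plot <- ggplot(margins, aes(x = actual, y = state)) +
  # Solid black line at zero, drawn first so the points sit on top of it
  geom_vline(xintercept = 0, color = "black") +
  geom_point() +
  # Short white segment to break the zero line behind the Pennsylvania label
  annotate("segment", x = 0, xend = 0, y = 6.55, yend = 6.85,
           color = "white", linewidth = 1.5) +
  annotate("text", x = 0.2, y = 6.7, label = "Trump +0.2", size = 3) +
  theme(
    panel.background = element_rect(fill = "white"),
    panel.grid.major.x = element_line(color = "gray80"),  # grey vertical lines
    panel.grid.minor.x = element_blank(),
    panel.grid.major.y = element_blank()
  )
```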

This figure shows the 7 “battleground” states from the election. Because of how our elections work, it’s the states one wins that matter, not the number of votes they get overall. So, although Harris won California by 21%, she still got the same number of electoral votes as if she had only won it by 1%. Ditto for Trump and Texas. Regardless, it would be interesting to see these types of data for the 43 other states. Beyond being more complete, I’m interested in this to see whether the same ~2.5 percentage point difference holds up regardless of the state. Maybe I’ll see if I can track that data down between now and when I produce the remake video, likely in January. I’ll award bonus points if anyone does that for me :)

Workshops

I'm pleased to be able to offer you one of three recent workshops! With each you'll get access to 18 hours of video content, my code, and other materials. Click the buttons below to learn more.

minimalR Workshop

generalR Workshop

mothur Workshop

In case you missed it…

Here is a livestream that I published this week that relates to previous content from these newsletters. Enjoy!

Finally, if you would like to support the Riffomonas project financially, please consider becoming a patron through Patreon! There are multiple tiers and fun gifts for each. By no means do I expect people to become patrons, but if you need to be asked, there you go :)

I’ll talk to you more next week!

Pat

Riffomonas Professional Development
