How do you even pronounce "bivariate choropleth"? Here's an alternative


Hey folks,

Before digging into this week’s data visualization, I wanted to give you all a heads-up about some learning activities I’m currently developing. First, in the next month or so I will be hosting a one-day, online workshop on the basics of {ggplot2}. If you feel that the things I talk about in this newsletter or on my YouTube channel are a bit beyond your grasp, this would be perfect for you. Second, I’ve gotten great feedback about a group coaching format that I’ve been developing for helping people develop their R skills through repeated and spaced practice. Third, I have some availability for one-on-one coaching if you need help with implementing reproducible analysis practices for any of your projects. If you have interest in any of these, please reply to this email and let me know.


For this week, I’d like to share with you a newsletter that I subscribe to from Philip Bump of the Washington Post. Each week he explores some data using a variety of visualizations and also provides critiques about visualizations he’s found online. Earlier this week his newsletter talked about the success of different seeds in the men’s NCAA basketball tournament. Further into the newsletter he described a “bivariate choropleth plot” showing the correlation between drinking and smoking levels within each county across the US. Those maps are cool. But I (and I think Bump also) find them a bit difficult to decipher. Choropleth plots suffer from the instinct to assume that counties with large land area also have many people. This is very much not true.

For his take, he used data from the University of Wisconsin’s Population Health Institute (UWPHI). The UWPHI has data on many variables related to health - including smoking and drinking - for each county. Here was his version of the bivariate choropleth plot rendered as a scatter plot:

The five-digit numbers aren’t zip codes; they are FIPS codes. The first two digits indicate the state and the remaining three digits indicate the county within that state.

What stands out to you about this visualization? What correlation do you see between the two variables? What questions do you have about the relationship? About how to make the plot? What would you like to learn to implement in R? What do you already feel comfortable executing?

First, this is a scatter plot. My assumption is that when we download the data from UWPHI we’ll get a data frame that we can simplify to three columns - the FIPS code, the percent of smokers, and the percent of drinkers or the number of drinks in some time period. Once we have it in this format we can map the drinking metric to the x aesthetic and the smoking metric to the y aesthetic. We can generate the scatter plot using geom_point().
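Here is a minimal sketch of that starting point. I haven't downloaded the UWPHI data yet, so the data frame below is simulated and the column names (fips, pct_drinkers, pct_smokers) are my own placeholders, not the names in the real file.

```r
library(ggplot2)

# Simulated stand-in for the UWPHI download. The column names and values
# are hypothetical and only mimic the three-column shape described above.
set.seed(19760620)
n <- 500
county_health <- data.frame(
  fips = sprintf("%05d", sample(1001:56045, n)),
  pct_drinkers = round(rnorm(n, mean = 18, sd = 3)),
  pct_smokers = round(rnorm(n, mean = 20, sd = 4))
)

# Drinking goes on the x-axis and smoking on the y-axis
scatter <- ggplot(county_health, aes(x = pct_drinkers, y = pct_smokers)) +
  geom_point()

scatter
```

Rounding the simulated values imitates the discretized look of the original figure.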

Second, there are clearly a finite number of possible values since the data already appear to be discretized. This causes a lot of overplotting. I notice that the points are different shades of green. No doubt the more intense shade is where there’s more overlap. We could pull this off by using the alpha aesthetic in geom_point(). This aesthetic controls the transparency of the point. If we use alpha = 0.20 then a single plotting symbol will be 80% transparent. Each additional point stacked on the same spot darkens the symbol, so a stack of 5 points is about two-thirds opaque and larger stacks approach fully opaque.
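To get a feel for how quickly stacked points darken, here is a small sketch. It assumes standard alpha blending, where k overlapping points at a given alpha have a combined opacity of 1 - (1 - alpha)^k, so a stack gets close to opaque without ever quite reaching it. The helper stacked_opacity() and the simulated data are my own inventions for illustration.

```r
library(ggplot2)

# Combined opacity of k stacked points, assuming standard alpha blending.
# stacked_opacity() is a hypothetical helper, not part of ggplot2.
stacked_opacity <- function(k, alpha = 0.2) 1 - (1 - alpha)^k
round(stacked_opacity(c(1, 5, 10)), 2)

# Simulated, discretized data with hypothetical column names
set.seed(1)
county_health <- data.frame(
  pct_drinkers = round(rnorm(500, mean = 18, sd = 3)),
  pct_smokers = round(rnorm(500, mean = 20, sd = 4))
)

# alpha is set outside aes() because it is constant for every point
overplotted <- ggplot(county_health, aes(x = pct_drinkers, y = pct_smokers)) +
  geom_point(alpha = 0.2, color = "darkgreen", size = 3)

overplotted
```

With alpha = 0.2 the combined opacity works out to about 0.67 for five points and 0.89 for ten, which is why the busiest spots look nearly solid.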

Third, there is a fitted line through the data. We can pull this off using geom_smooth(). By default this function fits a smoothed curve, but it can also fit a polynomial or a straight line. To get the straight line, we’ll use method = "lm" as an argument. We’ll also want to turn off the shaded band showing the confidence interval around the line by using se = FALSE. He doesn’t provide the correlation coefficient and I wonder if it is significantly different from zero. We could easily calculate and test the coefficient using cor.test().
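A rough sketch of the fitted line and the correlation test might look like this. The data are simulated with an arbitrary negative relationship that I made up so there is something to fit, so don't read anything into the direction or size of the correlation.

```r
library(ggplot2)

# Simulated data with hypothetical column names. The slope of -0.5 is
# arbitrary and says nothing about the real UWPHI data.
set.seed(2)
pct_drinkers <- round(rnorm(500, mean = 18, sd = 3))
county_health <- data.frame(
  pct_drinkers = pct_drinkers,
  pct_smokers = round(30 - 0.5 * pct_drinkers + rnorm(500, mean = 0, sd = 3))
)

# method = "lm" gives a straight line and se = FALSE drops the shaded band
fitted_plot <- ggplot(county_health, aes(x = pct_drinkers, y = pct_smokers)) +
  geom_point(alpha = 0.2, color = "darkgreen") +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "black")

# Pearson correlation with a test of whether it differs from zero
cor_result <- cor.test(~ pct_drinkers + pct_smokers, data = county_health)
cor_result$estimate
cor_result$p.value
```

cor.test() uses Pearson's correlation by default. Because the values are discretized and full of ties, method = "spearman" could be worth comparing.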

Fourth, he creates four quadrants that roughly align with the bivariate nature of the choropleth plot. We could achieve this by drawing light gray horizontal and vertical lines that intersect at the median smoking and drinking levels. Then we could use annotate() to add the bolded descriptions that he included.
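Here is one way the quadrants could be sketched. The data are simulated, and the label text is a placeholder of my own rather than the wording from the original figure. Only two corners are labeled to keep it short; the other two follow the same pattern.

```r
library(ggplot2)

# Simulated stand-in for the UWPHI data; column names are hypothetical
set.seed(3)
county_health <- data.frame(
  pct_drinkers = round(rnorm(500, mean = 18, sd = 3)),
  pct_smokers = round(rnorm(500, mean = 20, sd = 4))
)

median_drinking <- median(county_health$pct_drinkers)
median_smoking <- median(county_health$pct_smokers)

quadrant_plot <- ggplot(county_health, aes(x = pct_drinkers, y = pct_smokers)) +
  # Draw the guide lines first so the points sit on top of them
  geom_vline(xintercept = median_drinking, color = "gray80") +
  geom_hline(yintercept = median_smoking, color = "gray80") +
  geom_point(alpha = 0.2, color = "darkgreen") +
  # Placeholder labels pinned to the panel corners with Inf and -Inf
  annotate("text", x = -Inf, y = Inf, hjust = -0.1, vjust = 1.5,
           label = "Less drinking, more smoking", fontface = "bold") +
  annotate("text", x = Inf, y = -Inf, hjust = 1.1, vjust = -0.5,
           label = "More drinking, less smoking", fontface = "bold")

quadrant_plot
```

Using Inf and -Inf for the label positions keeps the text in the corners even if the axis limits change.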

Finally, around the outside of the cloud of points he includes 20 solid green points with their FIPS codes. I would implement this in three steps. First, I’d use filter() to get the desired FIPS codes. Second, I’d use the filtered data with geom_point() using alpha = 1 to get the solid symbols. I notice that those symbols have a black edge telling me that we’d want to use shape = 21. Third, I’d use geom_text() with the filtered data to label the points. I’d add nudge_x and nudge_y columns to the filtered data to indicate which way to bump the label for each of these points. You might forgive me if instead of using his points of interest I were to denote the data for counties in Michigan (FIPS codes starting with 26).
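Those three steps could be sketched like this. The data are simulated, with ten made-up rows whose FIPS codes start with 26 standing in for Michigan. In geom_text(), nudge_x and nudge_y are arguments rather than aesthetics, so this sketch adds the per-county offset columns to the coordinates inside aes() instead. The offsets themselves are arbitrary.

```r
library(dplyr)
library(ggplot2)

# Simulated data: 10 hypothetical Michigan rows plus 490 other counties
set.seed(4)
county_health <- data.frame(
  fips = c(sprintf("26%03d", seq(1, 19, by = 2)),
           sprintf("%05d", sample(27001:56045, 490))),
  pct_drinkers = round(rnorm(500, mean = 18, sd = 3)),
  pct_smokers = round(rnorm(500, mean = 20, sd = 4)),
  stringsAsFactors = FALSE
)

# Step 1: keep the counties of interest and decide how to bump each label
michigan <- county_health %>%
  filter(startsWith(fips, "26")) %>%
  mutate(
    nudge_x = ifelse(pct_drinkers >= median(county_health$pct_drinkers), 0.6, -0.6),
    nudge_y = 0.4
  )

michigan_plot <- ggplot(county_health, aes(x = pct_drinkers, y = pct_smokers)) +
  geom_point(alpha = 0.2, color = "darkgreen") +
  # Step 2: solid symbols, where shape 21 takes a fill and a black outline
  geom_point(data = michigan, shape = 21, fill = "darkgreen",
             color = "black", alpha = 1, size = 2) +
  # Step 3: labels shifted by the per-county offsets
  geom_text(data = michigan, size = 3,
            aes(x = pct_drinkers + nudge_x, y = pct_smokers + nudge_y, label = fips))

michigan_plot
```

With real data the labels will likely still collide in places, so the offsets would need some hand tuning.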

Along the way there are other interesting things we’d want to implement in this figure. Those include removing the x- and y-axis text and ticks and customizing the placement of the x- and y-axis titles. A more advanced move would be to think about how we might make a function to automatically generate this type of figure for any pair of data columns in the UWPHI dataset.
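As a first pass at that function, here is a sketch that takes column names as strings and applies the axis cleanup. The function name plot_pair(), the column names, and the simulated data are all hypothetical, and a full version would also need the fitted line, quadrants, and labels.

```r
library(ggplot2)

# Simulated data with three hypothetical health measures
set.seed(5)
county_health <- data.frame(
  pct_drinkers = round(rnorm(500, mean = 18, sd = 3)),
  pct_smokers = round(rnorm(500, mean = 20, sd = 4)),
  pct_obese = round(rnorm(500, mean = 33, sd = 5))
)

# plot_pair() is a hypothetical helper. The .data pronoun lets us look up
# columns by the names passed in as strings.
plot_pair <- function(data, x_col, y_col) {
  ggplot(data, aes(x = .data[[x_col]], y = .data[[y_col]])) +
    geom_point(alpha = 0.2, color = "darkgreen") +
    labs(x = x_col, y = y_col) +
    theme(
      # Drop the axis text and ticks
      axis.text = element_blank(),
      axis.ticks = element_blank(),
      # hjust = 0 moves each title to the start of its axis
      axis.title.x = element_text(hjust = 0),
      axis.title.y = element_text(hjust = 0)
    )
}

drinking_vs_obesity <- plot_pair(county_health, "pct_drinkers", "pct_obese")
drinking_vs_obesity
```

Passing column names as strings keeps the call simple when looping over many pairs of variables.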

Let me know what interests you about this figure! I’ll be sure to work your feedback into the video when I post it to YouTube in a few weeks.

Workshops

I'm pleased to be able to offer you one of three recent workshops! With each you'll get access to 18 hours of video content, my code, and other materials. Click the buttons below to learn more

In case you missed it…

Here are some videos that I published this week that relate to previous content from these newsletters. Enjoy!


Finally, if you would like to support the Riffomonas project financially, please consider becoming a patron through Patreon! There are multiple tiers and fun gifts for each. By no means do I expect people to become patrons, but if you need to be asked, there you go :)

I’ll talk to you more next week!

Pat

Riffomonas Professional Development
