How do you even pronounce "bivariate choropleth"? Here's an alternative


Hey folks,

Before digging into this week’s data visualization, I wanted to give you all a heads-up about some learning activities I’m currently developing. First, in the next month or so I will be hosting a one-day, online workshop on the basics of {ggplot2}. If you feel that the things I talk about in this newsletter or on my YouTube channel are a bit beyond your grasp, this would be perfect for you. Second, I’ve gotten great feedback about a group coaching format that I’ve been developing for helping people develop their R skills through repeated and spaced practice. Third, I have some availability for one-on-one coaching if you need help with implementing reproducible analysis practices for any of your projects. If you have interest in any of these, please reply to this email and let me know.


For this week, I’d like to share with you a newsletter that I subscribe to from Philip Bump of the Washington Post. Each week he explores some data using a variety of visualizations and also provides critiques about visualizations he’s found online. Earlier this week his newsletter talked about the success of different seeds in the men’s NCAA basketball tournament. Further into the newsletter he described a “bivariate choropleth plot” showing the correlation between drinking and smoking levels within each county across the US. Those maps are cool. But I (and I think Bump also) find them a bit difficult to decipher. Choropleth plots suffer from the instinct to assume that counties with large land area also have many people. This is very much not true.

For his take, he used data from the University of Wisconsin’s Population Health Institute (UWPHI). The UWPHI has data on many variables related to health - including smoking and drinking - for each county. Here was his version of the bivariate choropleth plot rendered as a scatter plot:

The five-digit numbers aren’t zip codes; they are FIPS codes. The first two digits indicate the state and the remaining three digits indicate the county within the state.
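If you want to see how those codes break apart in R, here is a minimal sketch. The three codes are just illustrative values I typed in, and the variable names are my own. The main thing to notice is that I keep the codes as character strings so that any leading zeros survive.

```r
# Illustrative FIPS codes, stored as character strings to keep leading zeros
fips <- c("26161", "01001", "55025")

# The first two characters are the state, the last three are the county
state_code <- substr(fips, 1, 2)
county_code <- substr(fips, 3, 5)

# Flag the codes that belong to Michigan (state code 26)
is_michigan <- startsWith(fips, "26")

data.frame(fips, state_code, county_code, is_michigan)
```

If the codes get read in as numbers, the leading zero on a code like 01001 will be dropped, so it is worth checking the column type right after importing the data.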

What stands out to you about this visualization? What correlation do you see between the two variables? What questions do you have about the relationship? About how to make the plot? What would you like to learn to implement in R? What do you already feel comfortable executing?

First, this is a scatter plot. My assumption is that when we download the data from UWPHI we’ll get a data frame that we can simplify to three columns: the FIPS code, the percent of smokers, and the percent of drinkers or the number of drinks in some time period. Once we have it in this format we can map the drinking metric to the x aesthetic and the smoking metric to the y aesthetic. We can generate the scatter plot using geom_point().
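Here is a rough sketch of that starting point. I haven't downloaded the UWPHI data yet, so the data frame below is simulated and the column names (fips, pct_drinking, pct_smoking) are placeholders I made up for the three columns described above.

```r
library(ggplot2)

set.seed(19)

# Simulated stand-in for the UWPHI data; the column names are hypothetical
county_health <- data.frame(
  fips = sprintf("%05d", sample(1001:56045, 500)),
  pct_drinking = round(rnorm(500, mean = 18, sd = 3)),
  pct_smoking = round(rnorm(500, mean = 20, sd = 4))
)

# Drinking goes on the x-axis and smoking on the y-axis
scatter <- ggplot(county_health, aes(x = pct_drinking, y = pct_smoking)) +
  geom_point()

scatter
```

Rounding the simulated values mimics the discretized look of the original figure, which is what sets up the overplotting problem.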

Second, there are clearly a finite number of possible values since the data already appear to be discretized. This causes a lot of overplotting. I notice that the points are different shades of green. No doubt the more intense shade is where there’s more overlap. We could pull this off by using the alpha aesthetic in geom_point(). This aesthetic controls the transparency of the point. If we use alpha = 0.20 then the plotting symbol will be 80% transparent. If there are 5 points on top of each other, then the symbol will look much darker, although it won’t be fully opaque: each added layer only covers 20% of what still shows through, so 5 stacked points come out about 67% opaque.
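To get a feel for how quickly stacked points darken, here is a small back-of-the-envelope calculation. It assumes the graphics device blends each semi-transparent point over whatever is underneath it, which is the usual behavior, so treat the numbers as an approximation rather than a guarantee for every device.

```r
# Transparency setting we would hand to geom_point()
alpha <- 0.20

# Number of points stacked on the same spot
n_overlapping <- 1:10

# Each point lets 80% of what is beneath it show through,
# so the combined opacity is 1 minus the fraction still showing
opacity <- 1 - (1 - alpha)^n_overlapping

round(data.frame(n_overlapping, opacity), 3)
```

With alpha = 0.20, a single point is 20% opaque and five stacked points work out to about 67% opaque. The color keeps deepening as more points pile up, which is what produces the different shades of green.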

Third, there is a fitted line through the data. We can pull this off using geom_smooth(). We can get a polynomial or straight line with this function. To get the straight line, we’ll use method = "lm" as an argument. We’ll also want to turn off the cloud indicating the standard error with se = FALSE. He doesn’t provide the correlation coefficient and I wonder if it is significantly different from zero. We could easily calculate and test the coefficient using cor.test().
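A minimal sketch of the fitted line and the correlation test might look like this. The data are simulated with an arbitrary relationship that I built in, so the slope and p-value here say nothing about the real UWPHI data. The column names are placeholders.

```r
library(ggplot2)

set.seed(21)

# Simulated data with an arbitrary built-in relationship (not the real data)
drinking <- rnorm(300, mean = 18, sd = 3)
county_health <- data.frame(
  pct_drinking = drinking,
  pct_smoking = 30 - 0.5 * drinking + rnorm(300, sd = 3)
)

# Straight-line fit with the standard error ribbon turned off
fit_plot <- ggplot(county_health, aes(x = pct_drinking, y = pct_smoking)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Pearson correlation and a test of whether it differs from zero
cor_result <- cor.test(county_health$pct_drinking, county_health$pct_smoking)
cor_result$estimate
cor_result$p.value
```

cor.test() uses Pearson's correlation by default. Because the real values look discretized with lots of ties, it might be worth trying method = "spearman" too.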

Fourth, he creates four quadrants that roughly align with the bivariate nature of the choropleth plot. We could achieve this by drawing light gray horizontal and vertical lines that intersect at the median smoking and drinking levels. Then we could use annotate() to add the bolded descriptions that he included.
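Here is one way the quadrants could be sketched. The data are simulated again, and the two labels are placeholder wording of my own, not the text from Bump's figure. I've only labeled two corners to keep it short.

```r
library(ggplot2)

set.seed(23)

# Simulated stand-in data; the column names are hypothetical
county_health <- data.frame(
  pct_drinking = rnorm(300, mean = 18, sd = 3),
  pct_smoking = rnorm(300, mean = 20, sd = 4)
)

# The quadrant boundaries sit at the median of each variable
med_drinking <- median(county_health$pct_drinking)
med_smoking <- median(county_health$pct_smoking)

quadrant_plot <- ggplot(county_health, aes(x = pct_drinking, y = pct_smoking)) +
  # Draw the guide lines first so the points sit on top of them
  geom_vline(xintercept = med_drinking, color = "gray80") +
  geom_hline(yintercept = med_smoking, color = "gray80") +
  geom_point(alpha = 0.2) +
  # Placeholder labels anchored to the top-right and bottom-left corners
  annotate("text", x = max(county_health$pct_drinking),
           y = max(county_health$pct_smoking),
           label = "More drinking,\nmore smoking",
           hjust = 1, vjust = 1, fontface = "bold") +
  annotate("text", x = min(county_health$pct_drinking),
           y = min(county_health$pct_smoking),
           label = "Less drinking,\nless smoking",
           hjust = 0, vjust = 0, fontface = "bold")

quadrant_plot
```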

Finally, around the outside of the cloud of points he includes 20 solid green points with their FIPS codes. I would implement this in three steps. First, I’d use filter() to get the desired FIPS codes. Second, I’d use the filtered data with geom_point() using alpha = 1 to get the solid symbols. I notice that those symbols have a black edge telling me that we’d want to use shape = 21. Third, I’d use geom_text() with the filtered data to label the points. I’d add nudge_x and nudge_y columns to the filtered data to indicate which way to bump the label for each of these points. You might forgive me if instead of using his points of interest I were to denote the data for counties in Michigan (FIPS codes starting with 26).
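Those three steps could be sketched roughly as follows. The data and FIPS codes are simulated, the column names are placeholders, and the rule for which way to bump each label is something I made up for illustration. Note that I add the nudge columns to the coordinates inside aes() instead of passing them to geom_text()'s nudge_x and nudge_y arguments, which is a slightly different route to the same result.

```r
library(dplyr)
library(ggplot2)

set.seed(25)

# Simulated counties: 10 codes starting with 26 (Michigan) and 90 starting with 39
county_health <- data.frame(
  fips = c(sprintf("26%03d", seq(1, 19, by = 2)),
           sprintf("39%03d", seq(1, 179, by = 2))),
  pct_drinking = rnorm(100, mean = 18, sd = 3),
  pct_smoking = rnorm(100, mean = 20, sd = 4)
)

# Step 1: keep the Michigan counties and add per-point label offsets
michigan <- county_health %>%
  filter(startsWith(fips, "26")) %>%
  mutate(
    nudge_x = if_else(pct_drinking > median(county_health$pct_drinking), 0.4, -0.4),
    nudge_y = 0.4
  )

highlight_plot <- ggplot(county_health, aes(x = pct_drinking, y = pct_smoking)) +
  geom_point(alpha = 0.2, color = "darkgreen") +
  # Step 2: solid symbols with a black edge for the highlighted counties
  geom_point(data = michigan, shape = 21, fill = "darkgreen",
             color = "black", alpha = 1, size = 2) +
  # Step 3: labels shifted by the nudge columns
  geom_text(data = michigan,
            aes(x = pct_drinking + nudge_x, y = pct_smoking + nudge_y, label = fips),
            size = 3)

highlight_plot
```

With shape = 21 the fill aesthetic controls the interior color and color controls the outline, which is how we get the black edge.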

Along the way there are other interesting things we’d want to implement in this figure. Those include removing the x and y-axis text and ticks and customizing the placement of the x and y-axis titles. A more advanced move would be to think about how we might make a function to automatically generate this type of figure for any pair of data columns in the UWPHI dataset.
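As a first pass at that function idea, here is a hedged sketch. The function name plot_pair() and the column names in the example are hypothetical, and the theme choices are only my guess at a reasonable starting point, not a match to the original figure.

```r
library(ggplot2)

# Hypothetical helper: scatter plot for any two numeric columns in a data frame
plot_pair <- function(data, x_col, y_col) {
  ggplot(data, aes(x = .data[[x_col]], y = .data[[y_col]])) +
    geom_point(alpha = 0.2) +
    labs(x = x_col, y = y_col) +
    theme_classic() +
    theme(
      # Drop the axis text and ticks
      axis.text = element_blank(),
      axis.ticks = element_blank(),
      # Push the axis titles toward the origin rather than centering them
      axis.title.x = element_text(hjust = 0),
      axis.title.y = element_text(hjust = 0)
    )
}

set.seed(27)

# Simulated stand-in data with made-up column names
county_health <- data.frame(
  pct_drinking = rnorm(200, mean = 18, sd = 3),
  pct_smoking = rnorm(200, mean = 20, sd = 4)
)

pair_plot <- plot_pair(county_health, "pct_drinking", "pct_smoking")
pair_plot
```

Passing the column names as strings and looking them up with .data[[ ]] keeps the function easy to call in a loop over many pairs of columns.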

Let me know what interests you about this figure! I’ll be sure to work your feedback into the video when I post it to YouTube in a few weeks.

Workshops

I'm pleased to be able to offer you one of three recent workshops! With each you'll get access to 18 hours of video content, my code, and other materials.

In case you missed it…

Here is a livestream that I published this week that relates to previous content from these newsletters. Enjoy!


Finally, if you would like to support the Riffomonas project financially, please consider becoming a patron through Patreon! There are multiple tiers and fun gifts for each. By no means do I expect people to become patrons, but if you need to be asked, there you go :)

I’ll talk to you more next week!

Pat

Riffomonas Professional Development
