Visualizing Men's and Women's March Madness with ggplot2 and rvest


Hey folks,

It’s March! That means the days are getting longer, the weather is pretty bonkers, the Cubs season has already started, and it’s time for March Madness.

For the uninitiated, that’s the roughly month-long period starting last week when men’s and women’s college basketball teams compete for their conference championships and then the National Championship. After falling apart at the end of the regular season, the University of Michigan men’s team won their conference tournament and received one of four 5 seeds. Our women’s team also had a strong season, falling just short in the semifinals of the conference tournament. They received one of four 6 seeds. To be completely honest, I doubt either team will get far in the NCAA tournament, though I’d be happy to be wrong. Regardless, it’s a fun time of year when nearly anything can happen.

Earlier this week the New York Times newsletter recalled that last year was the first time that the Women’s NCAA Championship game had more TV viewers than the men’s game. The women’s game had 18.9 million viewers and the men’s game had 14.8 million viewers. Much of that is being credited to Caitlin Clark. Basketball insiders have noticed that there is greater parity in both the men’s and women’s game. There are more upsets and the traditional powers have lost much of their power. That parity makes for more viewers.

Here’s the visualization that the New York Times included in their newsletter:

It’s a pretty straightforward line plot. Within {ggplot2} we’d make this with geom_line(). I’d expect a data frame with columns for the year, the gender of the players, and the n_viewers (number of viewers). We could then map year to the x aesthetic, n_viewers to the y aesthetic, and gender to the color of the line.
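A minimal sketch of that mapping might look like the following. The data frame is a stand-in: only the 2024 values (18.9 and 14.8 million) come from the numbers quoted above, and the 2022 and 2023 values are invented purely to give the lines a shape.

```r
library(ggplot2)

# Placeholder data: only the 2024 values are taken from the text above.
# The 2022 and 2023 numbers are invented for illustration.
viewers <- data.frame(
  year = rep(2022:2024, times = 2),
  gender = rep(c("women", "men"), each = 3),
  n_viewers = c(5, 10, 18.9, 18, 15, 14.8)
)

# year -> x, n_viewers -> y, gender -> line color
p <- ggplot(viewers, aes(x = year, y = n_viewers, color = gender)) +
  geom_line()

p
```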

A few things about the plot stand out.

First, as we’ve seen in previous videos recreating NYT visuals, the y-axis text often sits on the horizontal grid lines and the topmost value includes the unit of the y-axis.
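One way to approximate that axis styling is sketched below. The helper `label_top_unit()` is a hypothetical function written for this example, not part of {ggplot2}, and the `vjust` and margin values are rough starting points that would need tuning by eye. The data values are placeholders, apart from the 2024 numbers quoted above.

```r
library(ggplot2)

# Hypothetical helper: append the unit only to the largest axis break.
label_top_unit <- function(breaks, unit = "million") {
  top <- max(breaks, na.rm = TRUE)
  ifelse(!is.na(breaks) & breaks == top,
         paste(breaks, unit),
         as.character(breaks))
}

# Placeholder data: only the 2024 values come from the text above.
viewers <- data.frame(
  year = rep(2022:2024, times = 2),
  gender = rep(c("women", "men"), each = 3),
  n_viewers = c(5, 10, 18.9, 18, 15, 14.8)
)

p <- ggplot(viewers, aes(x = year, y = n_viewers, color = gender)) +
  geom_line() +
  scale_y_continuous(
    breaks = seq(0, 20, by = 5),
    limits = c(0, 20),
    labels = label_top_unit
  ) +
  theme(
    # Nudge the labels up and pull them into the panel so they sit on the
    # grid lines; these numbers are guesses to adjust by eye.
    axis.text.y = element_text(vjust = -0.5, hjust = 0,
                               margin = margin(r = -45)),
    axis.ticks.y = element_blank()
  )

p
```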

Second, on the right side of the plot they have a single point for the 2024 data along with a text annotation indicating the number of 2024 viewers for both finals broadcasts. The point and text are the same color as the line. I’d likely add this by taking the full data frame and filtering for the 2024 data. Then I’d add geom_point() and geom_text() layers that take their data from the filtered data frame. I’d probably use annotate() to add the “2024 finals” annotation in black above the text with the viewership numbers.
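Here is a rough sketch of those three additions, assuming the same stand-in data frame layout (year, gender, n_viewers). Only the 2024 values come from the text above. The nudges, the annotation position, and the extra room on the right of the x-axis are assumptions to tune by eye.

```r
library(ggplot2)
library(dplyr)

# Placeholder data: only the 2024 values are taken from the text above.
viewers <- data.frame(
  year = rep(2022:2024, times = 2),
  gender = rep(c("women", "men"), each = 3),
  n_viewers = c(5, 10, 18.9, 18, 15, 14.8)
)

# Keep only the rows that get a point and a label
latest <- viewers |>
  filter(year == 2024)

p <- ggplot(viewers, aes(x = year, y = n_viewers, color = gender)) +
  geom_line() +
  # The point and label inherit the color mapping, so they match their line
  geom_point(data = latest) +
  geom_text(
    data = latest,
    aes(label = paste0(n_viewers, " million")),
    hjust = 0, nudge_x = 0.05
  ) +
  # A fixed black annotation that is not tied to either group
  annotate("text", x = 2024.05, y = 21, label = "2024 finals",
           color = "black", hjust = 0) +
  # Leave room on the right for the text
  scale_x_continuous(expand = expansion(mult = c(0.05, 0.3)))

p
```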

Third, I really like the use of color. The line for the men’s data is gray and the line for the women’s data is orange. The article was about the rise in popularity of the women’s game so it makes sense to highlight those data. Labelling the 2024 data serves as a legend for the figure.
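A sketch of that color scheme, assuming the same stand-in data layout. The specific shades ("gray60" and "darkorange") are my guesses rather than the colors the NYT used, and the legend is dropped on the assumption that direct labels replace it.

```r
library(ggplot2)

# Placeholder data: only the 2024 values are taken from the text above.
viewers <- data.frame(
  year = rep(2022:2024, times = 2),
  gender = rep(c("women", "men"), each = 3),
  n_viewers = c(5, 10, 18.9, 18, 15, 14.8)
)

p <- ggplot(viewers, aes(x = year, y = n_viewers, color = gender)) +
  geom_line() +
  # Mute the men's line and highlight the women's line
  scale_color_manual(values = c(men = "gray60", women = "darkorange")) +
  # Direct labels on the 2024 points would stand in for the legend
  theme(legend.position = "none")

p
```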

I have a couple of small critiques about the plot. First, I’m not sure that the “2024 finals” annotation was necessary. Instead, the title of the plot could have been “N.C.A.A. basketball championship game viewers” - inserting “game” and removing the years. This would have highlighted what the numbers mean. Also, the years are obvious from the x-axis. Of course, they also could have made the title more declarative, with something about how there has been a downward trend in viewership of the men’s game and a meteoric rise for the women’s game. Finally, the lines make it appear that there was a tournament in 2020. There was not. Ideally, these lines would have a gap in them to indicate that a year was lost to the pandemic.
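One way to get that gap is to give 2020 an explicit row with a missing value for each line, since geom_line() breaks a line where it meets an NA in the middle. In the sketch below every viewership value is an invented placeholder. Only the title text and the missing 2020 tournament come from the discussion above.

```r
library(ggplot2)
library(dplyr)

# Invented placeholder values; note that there are no rows for 2020
viewers <- data.frame(
  year = rep(c(2018, 2019, 2021, 2022), times = 2),
  gender = rep(c("women", "men"), each = 4),
  n_viewers = c(3, 4, 4, 5, 16, 19, 17, 18)
)

# Explicit NA rows for the cancelled 2020 tournament
gap_2020 <- data.frame(
  year = 2020,
  gender = c("women", "men"),
  n_viewers = NA_real_
)

viewers_gap <- bind_rows(viewers, gap_2020) |>
  arrange(gender, year)

p <- ggplot(viewers_gap, aes(x = year, y = n_viewers, color = gender)) +
  geom_line() +  # the NA in 2020 breaks each line
  labs(title = "N.C.A.A. basketball championship game viewers")

p
```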

Finally, I tried to track down the data that the authors used. Considering this was a big story last year, I figured Nielsen or someone would have the data gathered together. Nope. Nielsen has numbers for the women’s game going back to 1995. Sports Media Watch has numbers for the men’s game going back to 1975. After a few attempts at manually transcribing numbers, I’m going to take a different approach. This looks like a great opportunity to focus on using the {rvest} package to extract the HTML-based tables from these pages. We can then use {dplyr} to clean up the data and join the men’s and women’s data together.
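To give a flavor of that workflow, here is a small sketch that runs offline. The two "pages" are built with minimal_html() as stand-ins. With the real sites you would start from read_html() on each page's address and would likely need a more specific selector than "table". The table layout, the column names, the helper `read_viewers()`, and the 2023 values are all assumptions made for illustration. Only the 2024 values come from the numbers quoted above. I also pull in {tidyr} to reshape the joined table, which goes a step beyond the {dplyr} plan.

```r
library(rvest)
library(dplyr)
library(tidyr)

# Stand-in pages; the 2023 rows are invented placeholders
women_page <- minimal_html('
  <table>
    <tr><th>Year</th><th>Viewers (millions)</th></tr>
    <tr><td>2023</td><td>10.0</td></tr>
    <tr><td>2024</td><td>18.9</td></tr>
  </table>')

men_page <- minimal_html('
  <table>
    <tr><th>Year</th><th>Viewers (millions)</th></tr>
    <tr><td>2023</td><td>15.0</td></tr>
    <tr><td>2024</td><td>14.8</td></tr>
  </table>')

# Hypothetical helper: take the first table on a page and tidy its names
read_viewers <- function(page) {
  tbl <- page |>
    html_element("table") |>
    html_table()
  names(tbl) <- c("year", "n_viewers")
  tbl
}

women <- read_viewers(women_page)
men <- read_viewers(men_page)

# Join on year, then reshape to one row per year and gender for plotting
viewers <- full_join(women, men, by = "year",
                     suffix = c("_women", "_men")) |>
  pivot_longer(-year,
               names_to = "gender",
               names_prefix = "n_viewers_",
               values_to = "n_viewers")

viewers
```

The result has the year, gender, and n_viewers columns that the line plot expects.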

Give this a try on your own and let me know how it goes. I hope that both University of Michigan teams are still playing by the time I get a chance to share my approach to recreating this visualization. Do you think the women’s game will have more viewers than the men’s game again this year? We’ll know in a few weeks :)

Workshops

I'm pleased to be able to offer you one of three recent workshops! With each you'll get access to 18 hours of video content, my code, and other materials.

In case you missed it…

Here are some videos that I published this week that relate to previous content from these newsletters. Enjoy!


Finally, if you would like to support the Riffomonas project financially, please consider becoming a patron through Patreon! There are multiple tiers and fun gifts for each. By no means do I expect people to become patrons, but if you need to be asked, there you go :)

I’ll talk to you more next week!

Pat
