Visualizing Men's and Women's March Madness with ggplot2 and rvest

Hey folks,

It’s March! That means the days are getting longer, the weather is pretty bonkers, the Cubs season has already started, and it’s time for March Madness.

For the uninitiated, that’s the roughly month-long period starting last week when men’s and women’s college basketball teams compete for their conference championship and then the National Championship. After falling apart at the end of the regular season, the University of Michigan men’s team won their conference tournament and received one of four 5 seeds. Our women’s team also had a strong season, falling just short in the semifinals of the conference tournament. They received one of four 6 seeds. To be completely honest, and while I’d be happy to be wrong, I doubt either team will get far in the NCAA tournament. Regardless, it’s a fun time of year when nearly anything can happen.

Earlier this week the New York Times newsletter recalled that last year was the first time that the Women’s NCAA Championship game had more TV viewers than the men’s game. The women’s game had 18.9 million viewers and the men’s game had 14.8 million viewers. Much of that is being credited to Caitlin Clark. Basketball insiders have noticed that there is greater parity in both the men’s and women’s game. There are more upsets and the traditional powers have lost much of their power. That parity makes for more viewers.

Here’s the visualization that the New York Times included in their newsletter:

It’s a pretty straightforward line plot. Within {ggplot2} we’d make this with geom_line(). I’d expect a data frame with columns for the year, the gender of the players, and the n_viewers (number of viewers). We could then map year to the x aesthetic, n_viewers to the y aesthetic, and gender to the color of the line.
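Here's a minimal sketch of that setup. The column names follow the description above. Only the 2024 values (14.8 and 18.9 million) come from the newsletter; the 2022 and 2023 values are made-up placeholders so the lines have something to connect.

```r
library(ggplot2)
library(tibble)

# Placeholder data: only the 2024 rows use numbers from the text,
# the 2022 and 2023 values are invented for illustration
viewers <- tribble(
  ~year, ~gender, ~n_viewers,
  2022,  "men",   12.0,
  2023,  "men",   10.0,
  2024,  "men",   14.8,
  2022,  "women",  2.0,
  2023,  "women",  4.0,
  2024,  "women", 18.9
)

# One line per gender: year on x, viewers on y, gender as line color
p <- ggplot(viewers, aes(x = year, y = n_viewers, color = gender)) +
  geom_line()

p
```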

A few things about the plot stand out.

First, as we’ve seen in previous videos recreating NYT visuals, the y-axis text often sits on the horizontal grid lines and the topmost value includes the unit of the y-axis.
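One possible way to get that look is sketched below. The helper `label_top_unit()` is a hypothetical name, the unit string "million viewers" is my guess at the wording, and the `vjust` and `margin` values are rough starting points that would need tuning by eye. The data are placeholders apart from the 18.9 million figure.

```r
library(ggplot2)

# Hypothetical helper: append the unit to the largest axis break only
label_top_unit <- function(breaks) {
  labels <- as.character(breaks)
  is_top <- !is.na(breaks) & breaks == max(breaks, na.rm = TRUE)
  labels[is_top] <- paste(labels[is_top], "million viewers")
  labels
}

# Placeholder data, not the real viewership numbers
toy <- data.frame(year = c(2022, 2023, 2024), n_viewers = c(2, 4, 18.9))

ggplot(toy, aes(x = year, y = n_viewers)) +
  geom_line() +
  scale_y_continuous(
    breaks = c(0, 5, 10, 15, 20),
    limits = c(0, 20),
    labels = label_top_unit
  ) +
  theme_minimal() +
  theme(
    panel.grid.major.x = element_blank(),
    panel.grid.minor = element_blank(),
    # Left-align the labels, lift them above their grid line, and use a
    # negative right margin to slide them into the panel
    axis.text.y = element_text(hjust = 0, vjust = -0.5,
                               margin = margin(r = -60))
  )
```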

Second, on the right side of the plot they have a single point for the 2024 data along with a text annotation indicating the number of 2024 viewers for both finals broadcasts. The point and text are the same color as the line. I’d likely add this by taking the full data frame and filtering for the 2024 data. Then I’d add a geom_point() and geom_text() statement with data coming from the filtered data. I’d probably use annotate() to add the “2024 finals” annotation in black above the text with the viewership numbers.
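A rough sketch of that plan follows. Again, only the 2024 numbers are from the text and the earlier values are placeholders. The label wording, the nudges, and the position of the "2024 finals" annotation are assumptions picked to suit this toy data.

```r
library(ggplot2)
library(dplyr)

# Placeholder data: only the 2024 values come from the newsletter
viewers <- tibble(
  year = rep(2022:2024, times = 2),
  gender = rep(c("men", "women"), each = 3),
  n_viewers = c(12, 10, 14.8, 2, 4, 18.9)
)

# Keep only the rows to highlight at the right edge of the plot
finals_2024 <- viewers %>% filter(year == 2024)

ggplot(viewers, aes(x = year, y = n_viewers, color = gender)) +
  geom_line() +
  # The point and label inherit the color mapping, so they match their line
  geom_point(data = finals_2024) +
  geom_text(
    data = finals_2024,
    aes(label = paste0(n_viewers, " million")),
    hjust = 0, nudge_x = 0.05, show.legend = FALSE
  ) +
  # A single black annotation above the viewership labels
  annotate("text", x = 2024.05, y = 21, label = "2024 finals",
           hjust = 0, color = "black") +
  # Extra room on the right so the labels are not clipped
  scale_x_continuous(expand = expansion(mult = c(0.05, 0.4)))
```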

Third, I really like the use of color. The line for the men’s data is gray and the line for the women’s data is orange. The article was about the rise in popularity of the women’s game so it makes sense to highlight those data. Labelling the 2024 data serves as a legend for the figure.
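The color choice could be sketched with `scale_color_manual()`. The named colors "gray" and "orange" are stand-ins, since I don't know the exact shades the Times used, and the data are placeholders apart from the 2024 values.

```r
library(ggplot2)

# Placeholder data: only the 2024 values come from the newsletter
viewers <- data.frame(
  year = rep(2022:2024, times = 2),
  gender = rep(c("men", "women"), each = 3),
  n_viewers = c(12, 10, 14.8, 2, 4, 18.9)
)

# Assumed colors: muted gray for the men's line, orange to highlight women's
line_colors <- c(men = "gray", women = "orange")

ggplot(viewers, aes(x = year, y = n_viewers, color = gender)) +
  geom_line() +
  scale_color_manual(values = line_colors) +
  # Direct labels on the 2024 data would replace the legend
  theme(legend.position = "none")
```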

I have a couple of small critiques about the plot. First, I’m not sure that the “2024 finals” annotation was necessary. Instead, the title of the plot could have been “N.C.A.A. basketball championship game viewers” - inserting “game” and removing the years. This would have highlighted what the numbers mean. Also, the years are obvious from the x-axis. Of course, they also could have made the title more declarative, with something about the downward trend in viewership of the men’s game and the meteoric rise for the women’s game. Finally, the lines make it appear that there was a tournament in 2020. There was not. Ideally, these lines would have a gap in them to indicate that a year was lost to the pandemic.
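One way to get that gap, sketched with made-up numbers: add an explicit 2020 row with a missing value for each line. `geom_line()` breaks a line where it meets an `NA` in the middle (it will also warn about the missing values). Here `tidyr::complete()` fills in the missing year.

```r
library(ggplot2)
library(dplyr)
library(tidyr)

# All values here are invented; note that 2020 is absent
viewers <- tibble(
  year = rep(c(2018, 2019, 2021, 2022), times = 2),
  gender = rep(c("men", "women"), each = 4),
  n_viewers = c(16, 19, 17, 12, 3, 4, 4, 2)
)

# Add a row for every year/gender combination; the new 2020 rows
# get NA for n_viewers
with_gap <- viewers %>%
  complete(year = full_seq(year, period = 1), gender)

# The NA in the middle of each series splits its line in two
ggplot(with_gap, aes(x = year, y = n_viewers, color = gender)) +
  geom_line()
```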

Finally, I tried to track down the data that the authors used. Considering this was a big story last year, I figured Nielsen or someone would have the data gathered together. Nope. Nielsen has numbers for the women’s game going back to 1995. Sports Media Watch has numbers for the men’s game going back to 1975. After a few attempts at manually transcribing the numbers, I’m going to take a different approach. This looks like a great opportunity to focus on using the {rvest} package to extract the HTML-based tables from these pages. We can then use {dplyr} to clean up the data and join the men’s and women’s data together.
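To give a feel for the workflow, here is a sketch that runs without a network connection. The HTML is a stand-in built with `minimal_html()`; with the real pages you would pass each page's URL to `read_html()` instead. The table layout, the column names `Year` and `Viewers`, the "M" suffix, and the helper `tidy_table()` are all assumptions, and the real tables will almost certainly need different cleanup. The 2023 values are made up. I stack the two tables with `bind_rows()` to get the long format described earlier; a join by year would give one column per gender instead.

```r
library(rvest)
library(dplyr)

# Stand-in pages; the real ones would come from read_html("<page URL>")
mens_page <- minimal_html("
  <table>
    <tr><th>Year</th><th>Viewers</th></tr>
    <tr><td>2023</td><td>10.0M</td></tr>
    <tr><td>2024</td><td>14.8M</td></tr>
  </table>")

womens_page <- minimal_html("
  <table>
    <tr><th>Year</th><th>Viewers</th></tr>
    <tr><td>2023</td><td>4.0M</td></tr>
    <tr><td>2024</td><td>18.9M</td></tr>
  </table>")

# Hypothetical helper: pull the first table and tidy its columns
tidy_table <- function(page, label) {
  page %>%
    html_element("table") %>%
    html_table() %>%
    transmute(
      year = as.integer(Year),
      gender = label,
      # Drop the assumed trailing "M" before converting to a number
      n_viewers = as.numeric(sub("M$", "", Viewers))
    )
}

viewers <- bind_rows(
  tidy_table(mens_page, "men"),
  tidy_table(womens_page, "women")
)

viewers
```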

Give this a try on your own and let me know how it goes. I hope that both University of Michigan teams are still playing by the time I get a chance to share my approach to recreating this visualization. Do you think the women’s game will have more viewers than the men again this year? We’ll know in a few weeks :)

Workshops

I'm pleased to be able to offer you one of three recent workshops! With each you'll get access to 18 hours of video content, my code, and other materials. Click the buttons below to learn more.

minimalR Workshop

generalR Workshop

mothur Workshop

In case you missed it…

Here are some videos that I published this week that relate to previous content from these newsletters. Enjoy!

Finally, if you would like to support the Riffomonas project financially, please consider becoming a patron through Patreon! There are multiple tiers and fun gifts for each. By no means do I expect people to become patrons, but if you need to be asked, there you go :)

I’ll talk to you more next week!

Pat

Riffomonas Professional Development