Visualizing Men's and Women's March Madness with ggplot2 and rvest


Hey folks,

It’s March! That means the days are getting longer, the weather is pretty bonkers, the Cubs season has already started, and it’s time for March Madness.

For the uninitiated, that’s the roughly month-long period starting last week when men’s and women’s college basketball teams compete for their conference championships and then the National Championship. After falling apart at the end of the regular season, the University of Michigan men’s team won their conference tournament and received one of four 5 seeds. Our women’s team also had a strong season, falling just short in the semifinals of the conference tournament. They received one of four 6 seeds. To be completely honest, and while I’d be happy to be wrong, I doubt either team will get far in the NCAA tournament. Regardless, it’s a fun time of year when nearly anything can happen.

Earlier this week the New York Times newsletter recalled that last year was the first time that the Women’s NCAA Championship game had more TV viewers than the men’s game. The women’s game had 18.9 million viewers and the men’s game had 14.8 million viewers. Much of that is being credited to Caitlin Clark. Basketball insiders have noticed that there is greater parity in both the men’s and women’s game. There are more upsets and the traditional powers have lost much of their power. That parity makes for more viewers.

Here’s the visualization that the New York Times included in their newsletter:

It’s a pretty straightforward line plot. Within {ggplot2} we’d make this with geom_line(). I’d expect a data frame with columns for the year, the gender of the players, and the n_viewers (number of viewers). We could then map year to the x aesthetic, n_viewers to the y aesthetic, and gender to the color of the line.
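Here is a minimal sketch of that mapping. The data frame is a toy: only the 2024 values come from the numbers quoted above, and the 2023 values are made-up placeholders so there is a line to draw.

```r
library(ggplot2)

# Toy data in the expected shape: one row per year and gender.
# Only the 2024 values are from the text; 2023 values are placeholders.
viewers <- data.frame(
  year = c(2023, 2024, 2023, 2024),
  gender = c("men", "men", "women", "women"),
  n_viewers = c(15, 14.8, 10, 18.9) # millions of viewers
)

# year on x, viewers on y, one colored line per gender
p <- ggplot(viewers, aes(x = year, y = n_viewers, color = gender)) +
  geom_line()
```

Printing `p` should draw two lines, one per value of `gender`.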

A few things about the plot stand out.

First, as we’ve seen in previous videos recreating NYT visuals, the y-axis text often sits on the horizontal grid lines and the topmost value includes the unit of the y-axis.
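One way to get the unit onto only the topmost y-axis label is a small labelling function passed to scale_y_continuous(). This is a sketch under assumptions: label_top_unit() is a hypothetical helper, the breaks and limits are guesses, and the toy data reuse the 2024 values from the text alongside placeholder 2023 values. The vjust setting only nudges the labels upward; getting them to sit inside the panel on the grid lines would take more tinkering.

```r
library(ggplot2)

# Hypothetical helper: append the unit to the largest break only
label_top_unit <- function(breaks, unit = "million") {
  top <- max(breaks, na.rm = TRUE)
  ifelse(!is.na(breaks) & breaks == top,
         paste(breaks, unit),
         as.character(breaks))
}

# Toy data: only the 2024 values are from the text; 2023 values are placeholders
viewers <- data.frame(
  year = c(2023, 2024, 2023, 2024),
  gender = c("men", "men", "women", "women"),
  n_viewers = c(15, 14.8, 10, 18.9)
)

p <- ggplot(viewers, aes(x = year, y = n_viewers, color = gender)) +
  geom_line() +
  scale_y_continuous(
    breaks = c(0, 5, 10, 15, 20), # assumed breaks
    limits = c(0, 20),
    labels = label_top_unit
  ) +
  # Nudge the axis text upward so it sits just above each grid line
  theme(axis.text.y = element_text(vjust = -0.5))
```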

Second, on the right side of the plot they have a single point for the 2024 data along with a text annotation indicating the number of 2024 viewers for both finals broadcasts. The point and text are the same color as the line. I’d likely add this by taking the full data frame and filtering for the 2024 data. Then I’d add a geom_point() and geom_text() statement with data coming from the filtered data. I’d probably use annotate() to add the “2024 finals” annotation in black above the text with the viewership numbers.
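A rough sketch of those layers, assuming the same year, gender, and n_viewers columns. The 2023 values in the toy data are placeholders, and the nudges and annotation coordinates are guesses that would need adjusting against the real data.

```r
library(ggplot2)
library(dplyr)

# Toy data: only the 2024 values are from the text; 2023 values are placeholders
viewers <- data.frame(
  year = c(2023, 2024, 2023, 2024),
  gender = c("men", "men", "women", "women"),
  n_viewers = c(15, 14.8, 10, 18.9)
)

# Keep only the most recent finals for the point and label layers
latest <- filter(viewers, year == 2024)

p <- ggplot(viewers, aes(x = year, y = n_viewers, color = gender)) +
  geom_line() +
  # Point and label inherit the color mapping, so they match their line
  geom_point(data = latest) +
  geom_text(data = latest, aes(label = paste(n_viewers, "million")),
            hjust = 0, nudge_x = 0.05) +
  # Black heading above the labels; x and y are eyeballed
  annotate("text", x = 2024.05, y = 20.5, label = "2024 finals",
           color = "black", hjust = 0) +
  # Let text spill past the right edge of the panel instead of being clipped
  coord_cartesian(clip = "off")
```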

Third, I really like the use of color. The line for the men’s data is gray and the line for the women’s data is orange. The article was about the rise in popularity of the women’s game so it makes sense to highlight those data. Labelling the 2024 data serves as a legend for the figure.
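The coloring could be sketched with scale_color_manual(). The exact shades the Times used are not known here, so "gray60" and "darkorange" are stand-ins, and the 2023 values in the toy data are placeholders. Dropping the legend assumes the direct labels on the 2024 points do that job.

```r
library(ggplot2)

# Toy data: only the 2024 values are from the text; 2023 values are placeholders
viewers <- data.frame(
  year = c(2023, 2024, 2023, 2024),
  gender = c("men", "men", "women", "women"),
  n_viewers = c(15, 14.8, 10, 18.9)
)

# Assumed colors: muted gray for men, orange to highlight women
line_colors <- c(men = "gray60", women = "darkorange")

p <- ggplot(viewers, aes(x = year, y = n_viewers, color = gender)) +
  geom_line() +
  # guide = "none" removes the legend
  scale_color_manual(values = line_colors, guide = "none")
```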

I have a couple of small critiques about the plot. First, I’m not sure that the “2024 finals” annotation was necessary. Instead, the title of the plot could have been “N.C.A.A. basketball championship game viewers” - inserting “game” and removing the years. This would have highlighted what the numbers mean. Also, the years are obvious from the x-axis. Of course, they also could have made the title more declarative. Something about how there has been a downward trend in viewership of the men’s game and a meteoric rise for the women’s game. Finally, the lines make it appear that there was a tournament in 2020. There was not. Ideally, these lines would have a gap in them to indicate that there was a lost year to the pandemic.
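One way to get that gap is to give 2020 an explicit row with a missing value, since geom_line() breaks a line where an NA falls in the middle of it. This is a sketch assuming the data simply lack a 2020 row; the viewer numbers below are all made-up placeholders, and tidyr::complete() is just one option for adding the missing rows.

```r
library(ggplot2)
library(tidyr)

# Placeholder values only; note there is no row for 2020
viewers <- data.frame(
  year = c(2019, 2021, 2019, 2021),
  gender = c("men", "men", "women", "women"),
  n_viewers = c(19, 16, 4, 4)
)

# Add a row for every year in the range, filling n_viewers with NA for 2020
with_gap <- complete(viewers, gender, year = full_seq(year, 1))

# The NA in the middle of each series splits the line at 2020
p <- ggplot(with_gap, aes(x = year, y = n_viewers, color = gender)) +
  geom_line()
```

With only one observed year on either side of 2020 this toy plot has nothing left to connect, so the effect is easier to see with the full series.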

Finally, I tried to track down the data that the authors used. Considering this was a big story last year, I figured Nielsen or someone would have the data gathered together. Nope. Nielsen has numbers for the women’s game going back to 1995. Sports Media Watch has numbers for the men’s game going back to 1975. After a few efforts of manually transcribing numbers, I’m going to take a different approach. This looks like a great opportunity to focus on using the {rvest} package to extract the HTML-based tables from these pages. We can then use {dplyr} to clean up the data and join the men’s and women’s data together.
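To give a feel for the workflow, here is a sketch that runs on stand-in HTML rather than the real pages. The table layout, the Year and Viewers column names, and the read_viewers() helper are all assumptions; the only real values are the 2024 numbers quoted above. With live pages, read_html() with the page URL would take the place of minimal_html().

```r
library(rvest)
library(dplyr)

# Stand-in pages; the real sites' table structure is unknown here
men_page <- minimal_html("
  <table>
    <tr><th>Year</th><th>Viewers</th></tr>
    <tr><td>2024</td><td>14.8 million</td></tr>
  </table>")
women_page <- minimal_html("
  <table>
    <tr><th>Year</th><th>Viewers</th></tr>
    <tr><td>2024</td><td>18.9 million</td></tr>
  </table>")

# Hypothetical helper: pull the first table and strip text from the numbers
read_viewers <- function(page) {
  tbl <- page |>
    html_element("table") |>
    html_table()
  tibble(
    year = tbl$Year,
    n_viewers = as.numeric(gsub("[^0-9.]", "", tbl$Viewers))
  )
}

men <- read_viewers(men_page) |> rename(men = n_viewers)
women <- read_viewers(women_page) |> rename(women = n_viewers)

# One row per year, with a column for each game
viewers_wide <- full_join(men, women, by = "year")
```

The join gives a wide table; reshaping it with something like tidyr::pivot_longer() would get it back to the year, gender, and n_viewers layout that the plot expects.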

Give this a try on your own and let me know how it goes. I hope that both University of Michigan teams are still playing by the time I get a chance to share my approach to recreating this visualization. Do you think the women’s game will have more viewers than the men again this year? We’ll know in a few weeks :)

Workshops

I'm pleased to be able to offer you one of three recent workshops! With each you'll get access to 18 hours of video content, my code, and other materials.

In case you missed it…

Here are some videos that I published this week that relate to previous content from these newsletters. Enjoy!


Finally, if you would like to support the Riffomonas project financially, please consider becoming a patron through Patreon! There are multiple tiers and fun gifts for each. By no means do I expect people to become patrons, but if you need to be asked, there you go :)

I’ll talk to you more next week!

Pat

Riffomonas Professional Development
