Thinking about how to extract and visualize data from a PDF (why!?!?!)


Hey folks!

The summer is nearly over - where did it go?! Many of us are getting ready to send our kids off to school and start a new academic year. If you’re subscribed to this newsletter, I suspect you are interested in improving your data visualization skills. You can certainly continue to receive this newsletter and watch my weekly livestreams on YouTube for free to help increase those skills. If you want a more concentrated or personalized opportunity to develop your data visualization chops, I want to remind you of a few opportunities. First, starting in September I am going to be teaching a 5-part workshop that meets weekly to discuss and apply concepts of data visualization. Second, I have pre-recorded workshops teaching the fundamentals of the tidyverse using microbiome data and data of interest to a more general audience. Finally, I would love to work one-on-one with you or your research team to develop custom learning solutions. If any of these opportunities interest you, please click on the links above or reply to this email and let’s start talking.


This week reader and livestream viewer Mike Parrott from the UK forwarded a plot to me from the Pew Research Center. The plot was part of Pew’s overall effort to look at US media consumption by sex, age, race, politics, and education. Mike was happy to see that The Guardian and BBC News are relatively popular among college-educated people living in the US.

This plot reports results from a survey of 9,482 US adults that Pew conducted back in March 2025. Part of the survey asked respondents where they get their news and what their level of education is. One of my first questions when trying to recreate a plot is whether I can get the data from somewhere that will allow me to bring it into R easily. Yes, all of the names and numbers are in the plot, but manually typing them would be a pain. I did find the data, but sadly, the data are embedded in a PDF. Why do people do this? It seems they want to be perceived as being transparent without actually having to be transparent.

Someone on a recent livestream mentioned that there are R packages to extract tables from PDFs. I forget which package they mentioned. A quick Google search found a few options. First is {tabulizer}. Sadly, it appears to have been removed from CRAN back in 2021 because the package had problems. Next, I tried {pdftools}. That got me the raw text. I think I could parse the text to create a tibble using tools from {stringr}. Finally, I tried using {tabulapdf}. But after trying for half an hour to install it (and its Java dependencies), I gave up. I think I’d either use {pdftools} or just copy and paste from the PDF. Ultimately, I’d like to have a column for the media outlet and a column for the percent of college graduates who get their news from each outlet. Looking at the PDF, I’m intrigued by the prospect of including the “high school or less” and “some college” categories. But that would require designing a different visualization.
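Here is a minimal sketch of the {pdftools} plus {stringr} route. I'm assuming each table row comes out of the PDF as one line of text with the outlet name followed by three whole-number percentages. The stand-in text, the column order, and the file name in the comment are all hypothetical, and every number is a placeholder except the 62% for The Atlantic that appears in the plot. The real layout would need to be checked against the actual PDF.

```r
library(stringr)
library(tibble)

# With the real file, the raw text would come from something like:
# raw_text <- pdftools::pdf_text("pew_report.pdf")  # hypothetical file name
# This stand-in mimics one page of text. Numbers are placeholders,
# except The Atlantic's 62% for college graduates.
raw_text <- "Outlet            HS or less   Some college   College grad
The Atlantic          11           27           62
BBC News              20           30           50
Fox News              35           33           32"

# Split the page into individual lines
lines <- str_split(raw_text, "\n")[[1]]

# Match lines that end in three whitespace-separated integers;
# everything before those integers is treated as the outlet name
pattern <- "^(.+?)\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)\\s*$"
parts <- str_match(lines, pattern)

# Drop lines that did not match (e.g. the header row)
parts <- parts[!is.na(parts[, 1]), , drop = FALSE]

news <- tibble(
  outlet = str_trim(parts[, 2]),
  hs_or_less = as.numeric(parts[, 3]),
  some_college = as.numeric(parts[, 4]),
  college_grad = as.numeric(parts[, 5])
)
```

The regular expression is the fragile part. Outlet names that wrap onto two lines or percentages written as "<1" would need extra handling.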

Back to the plot…

Clearly this is a bar plot with the axes switched from what we traditionally see. This is helpful because it allows us to read the names of the news outlets more easily than if the names were along the x-axis and rotated to prevent them from overlapping. I would use geom_col(), mapping the media outlet to the y aesthetic and the percent of college graduates to the x aesthetic.
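As a rough sketch of that mapping, assuming a tibble with one row per outlet: the column names are my own invention, and the values are placeholders except for The Atlantic's 62%.

```r
library(ggplot2)
library(tibble)

# Placeholder values, except The Atlantic (62%, from the plot)
news <- tribble(
  ~outlet,        ~college_grad,
  "The Atlantic", 62,
  "BBC News",     50,
  "Fox News",     30
)

# Outlet on the y-axis, percent on the x-axis; reorder() sorts the
# outlets so the largest percentage ends up at the top
p <- ggplot(news, aes(x = college_grad, y = reorder(outlet, college_grad))) +
  geom_col() +
  labs(x = "College graduates (%)", y = NULL)
```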

The next notable element of the plot is the percentages of college graduates. I’d use geom_text() to incorporate the percentages as the label. The x-axis placement would be interesting. I’d likely add a nudge factor depending on whether the percentage was greater or less than the 36% threshold of US adults who are college graduates. In addition, I’d change the font color depending on whether it is above or below the threshold.
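One way this might look is to compute the label position, justification, and color group in the data frame before plotting. The one-point offset, the inside-versus-outside placement, and the white and black colors are my assumptions, not something I measured from the Pew plot. The values are placeholders except for The Atlantic's 62% and the 36% threshold.

```r
library(ggplot2)
library(dplyr)
library(tibble)

threshold <- 36  # percent of US adults who are college graduates

# Placeholder values, except The Atlantic (62%, from the plot)
news <- tribble(
  ~outlet,        ~college_grad,
  "The Atlantic", 62,
  "BBC News",     50,
  "Fox News",     30
)

news <- mutate(
  news,
  above = college_grad > threshold,
  # Above the threshold: label sits just inside the end of the bar.
  # Below the threshold: label sits just outside the end of the bar.
  label_x = if_else(above, college_grad - 1, college_grad + 1),
  label_hjust = if_else(above, 1, 0)
)

p <- ggplot(news, aes(x = college_grad, y = reorder(outlet, college_grad))) +
  geom_col() +
  geom_text(
    aes(x = label_x, label = paste0(college_grad, "%"),
        hjust = label_hjust, color = above),
    show.legend = FALSE
  ) +
  # White text inside the dark bars, black text outside them
  scale_color_manual(values = c(`TRUE` = "white", `FALSE` = "black")) +
  labs(x = "College graduates (%)", y = NULL)
```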

Let’s think about the text elements for a moment. There are two bits of text that help orient the reader to the plot. The first is the “62% of people regularly…” blurb that helps us interpret the first bar. I think that’s pretty helpful. There’s a downward pointing triangle there to connect the text to the bar for “The Atlantic”. I’d probably put the text and the triangle in with annotate() using geom = "text" and geom = "point". The second blurb indicates that “36% of US adults are college graduates”. I’d place this one with the same annotate() approach I’d use for the first blurb. I think these blurbs are pretty helpful, but I do wonder if they’re too wordy or if the blurbs are too close to each other.
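Here is a sketch of the first blurb and its triangle. The coordinates are guesses that would need tuning, the data are placeholders except for The Atlantic's 62%, and the blurb text is left truncated because that is all I have quoted above.

```r
library(ggplot2)
library(tibble)

# Placeholder values, except The Atlantic (62%, from the plot)
news <- tribble(
  ~outlet,        ~college_grad,
  "The Atlantic", 62,
  "BBC News",     50,
  "Fox News",     30
)

# Reverse the levels so the first row (The Atlantic) is drawn at the top
news$outlet <- factor(news$outlet, levels = rev(news$outlet))

p <- ggplot(news, aes(x = college_grad, y = outlet)) +
  geom_col() +
  # Blurb above the top bar; discrete positions are 1, 2, 3 from the bottom
  annotate(geom = "text", x = 62, y = 3.8,
           label = "62% of people regularly…", hjust = 1, size = 3) +
  # shape = 25 is a filled, downward-pointing triangle
  annotate(geom = "point", x = 62, y = 3.55, shape = 25, fill = "black") +
  # Extra room above the top bar for the annotation
  scale_y_discrete(expand = expansion(add = c(0.6, 1))) +
  labs(x = "College graduates (%)", y = NULL)
```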

Thinking about that second blurb, we see that the authors put a pink point at 36% on the x-axis for each media outlet. We could place that with geom_point() by setting the x aesthetic to 36. Alternatively, I would perhaps use geom_vline() to draw a vertical line at 36%. I would then move the corresponding blurb to the right of the line and move it down to the bottom of the plot near “Newsmax” or “Fox News”.
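Both options could be sketched like this. The data are placeholders except for The Atlantic's 62%, and the blurb position, line type, and text size are guesses on my part.

```r
library(ggplot2)
library(tibble)

threshold <- 36  # percent of US adults who are college graduates

# Placeholder values, except The Atlantic (62%, from the plot)
news <- tribble(
  ~outlet,        ~college_grad,
  "The Atlantic", 62,
  "BBC News",     50,
  "Fox News",     30
)
news$outlet <- factor(news$outlet, levels = rev(news$outlet))

base <- ggplot(news, aes(x = college_grad, y = outlet)) +
  geom_col() +
  labs(x = "College graduates (%)", y = NULL)

# Option 1: a pink point at 36% on every bar, like the original
p_points <- base +
  geom_point(x = threshold, color = "pink", size = 2)

# Option 2: one vertical line at 36%, with the blurb to its right
# next to the bottom outlet
p_line <- base +
  geom_vline(xintercept = threshold, linetype = "dashed") +
  annotate(geom = "text", x = threshold + 1, y = 1,
           label = "36% of US adults are college graduates",
           hjust = 0, size = 3)
```

The blurb only has a clean background in option 2 if the bottom bars stop short of 36%, which is why it goes near the bottom of the plot.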

Now that I’ve started thinking about things I would change, let’s think more about the data being displayed. The story makes a point that the visual is basically flipped for people with a high school diploma or less. For example, “Univision” and “Telemundo” are most popular among these folks and “The Atlantic” is not popular with them. I could imagine changing the plot to be a dot plot instead of a bar plot. For each media outlet, I’d place a different colored point for each of the three education categories across the x-axis. I’d like to put a vertical line to show the total percentage of US adults in each category where the color matches the color of the point. Maybe that would be too busy? If so, we could drop the “some college” population to focus on the extremes. What do you think?
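A rough sketch of that dot plot idea, assuming the data have been reshaped to one row per outlet and education category. All of the percentages are placeholders except The Atlantic's 62% for college graduates. I only know the overall value for college graduates (36%), so that is the only vertical line drawn here. The other two would need to be looked up and added to the hypothetical `overall` tibble.

```r
library(ggplot2)
library(tibble)

# Placeholder values, except The Atlantic / College graduate (62%)
news_long <- tribble(
  ~outlet,        ~education,            ~pct,
  "The Atlantic", "High school or less", 11,
  "The Atlantic", "Some college",        27,
  "The Atlantic", "College graduate",    62,
  "Univision",    "High school or less", 60,
  "Univision",    "Some college",        25,
  "Univision",    "College graduate",    15
)

# Overall share of US adults in each category; only the college
# graduate value (36%) is known from the plot
overall <- tibble(education = "College graduate", pct = 36)

p <- ggplot(news_long, aes(x = pct, y = outlet, color = education)) +
  # Sharing the color aesthetic makes each line match its points
  geom_vline(data = overall,
             aes(xintercept = pct, color = education),
             linetype = "dashed") +
  geom_point(size = 3) +
  labs(x = "Percent", y = NULL, color = NULL)
```

Dropping the “some college” rows from `news_long` would give the simpler two-point version.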

Workshops

I'm pleased to be able to offer you one of three recent workshops! With each you'll get access to 18 hours of video content, my code, and other materials. Click the buttons below to learn more.

In case you missed it…

Here is a livestream that I published this week that relates to previous content from these newsletters. Enjoy!


Finally, if you would like to support the Riffomonas project financially, please consider becoming a patron through Patreon! There are multiple tiers and fun gifts for each. By no means do I expect people to become patrons, but if you need to be asked, there you go :)

I’ll talk to you more next week!

Pat

Riffomonas Professional Development
