Chartjunk in plain sight & I need your feedback!

Hey folks,

I need your feedback on an idea! Don’t worry, there’s some visualization stuff at the bottom.

I had a video nearly ready to post this week using a ridgeline plot to show the baby boom. I think I did a great job of recreating the plot. But through a series of unfortunate events, I lost the video. I actually recorded the video three times because my computer kept crashing as I was recording it. This was on top of increasing busyness on my part with teaching, proposal writing, mentoring, parenting, etc. I need to try something different with the YouTube channel.

I absolutely love making and posting videos. Recreating plots and showing how to do different things with R and the tidyverse is a lot of fun for me. I can tell that people enjoy them. But they take a lot of effort on my part. I usually spend an hour or two finding visuals and figuring out how to recreate them. Then I usually spend two hours recording the videos and another four hours editing and posting them. Basically, I spend a day a week to make one video.

I could do shorter videos, but then I wouldn’t be able to go as deep on each one. I also wouldn’t get the personal satisfaction out of doing them. I could also go to posting videos every other week, but I worry that at that cadence we’d lose interest in keeping it going.

So, here’s an idea. What would you think of a livestream of the content that is typically in each video? I’d still make the same videos I’d normally make and if you couldn’t be on the livestream, the recording would be posted.

The benefit to those who can watch live is that you could ask questions throughout the show about what I’m doing or make suggestions. The downside to that is I edit out a fair amount of dead space and goofs. Perhaps that’s a good thing because then you’d get a more realistic idea of what I’m doing.

The benefit to me is that I wouldn’t have to edit, cutting my effort by more than half. The downside is that I’d have to commit to a time each week to be recording. There may be weeks where I might move the recording time, but I’d try to announce that in advance.

I think the biggest benefit for all of us is the opportunity to engage with one another. I’m impressed by the reach of the channel and newsletter. That people watch from around the world - east/west, north/south - is pretty cool. Being able to interact in real time would make it even better.

Here’s my idea. I’ll livestream every Wednesday at 9 AM Eastern. I’ll block out two hours for each session. Because I still need to learn the streaming software better, I might not be able to start streaming until June 4th. Would you watch this live? Would you watch a recording? Is there a better time or day?

I really appreciate your feedback! You can reply directly to this email.

~~~

Here is a plot that is representative of a lot of plots I’m seeing these days in the literature. Actually, I went to my favorite journal, opened five papers, and found something like this in the second paper I looked at. That paper had two of these figures. Like I said, that plot and the problems I see in it are pretty common. No need to single out individuals directly. So here’s a recreated version of the figure.

Take a few minutes and jot down some thoughts about the plot. What do you think I don’t like?

My main pet peeve with this type of plot is that it tries to show too much. It has the actual data points, bars to indicate the mean, and error bars showing the mean plus and minus one standard deviation.

There are three points. Three, it’s a magic number.

In most examples of this plot that I find in the wild, the minimum and maximum bounds on the error bars go to the point with the smallest and largest values, respectively. I’ve tried a few sets of values and even if the data have a pretty skewed distribution the error bars (e.g., groups B, C, and D) are pretty close to the min and max values.
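If you want to check this for yourself, here’s a minimal sketch in base R. It uses the values for groups B, C, and D from the R code at the end of this newsletter; the helper name compare_bounds is just something I made up for illustration.

```r
# Values for groups B, C, and D, copied from the R code at the end of the newsletter
values <- list(
  B = c(1.1, 1.5, 1.8),
  C = c(0.8, 0.6, 1.2),
  D = c(2.5, 2.4, 2.8)
)

# Hypothetical helper: put mean -/+ 1 sd next to the observed min and max
compare_bounds <- function(x) {
  c(
    lower = mean(x) - sd(x),
    min   = min(x),
    max   = max(x),
    upper = mean(x) + sd(x)
  )
}

# One row per group, one column per statistic
bounds <- t(sapply(values, compare_bounds))
round(bounds, 2)
```

For each group, compare the lower and min columns and the max and upper columns. With only three points, the error bars should land close to the extreme points.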

I also don’t think the bar plot provides any information. If the points are close in value like they are here, then you can “eyeball” the mean, at least to the level of precision you would have in calculating the mean. The bar is what Tufte would probably call “chartjunk”. It’s extraneous and decreases your data-to-pixels ratio.

Here’s an alternative version of the same data that I also often find in the wild…

Again, there are three points. The midline on a box plot is the median. Each whisker extends to the most extreme point that falls within 1.5 times the interquartile range of the edge of the box. Here the boxplot provides nothing over the three points on their own. I like their thinking of using non-parametric summary statistics, but the box plot with the data is another example of chartjunk.
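To see what a box plot actually computes from three points, here’s a rough sketch in base R using the group A values from the R code at the end of this newsletter. I’m assuming the default quantile method in R (type 7), which, as far as I know, is also what geom_boxplot() uses by default. The variable names are my own.

```r
# Group A values, copied from the R code at the end of the newsletter
x <- c(2.0, 2.2, 2.4)

# Box edges and midline: quartiles using R's default method (type 7)
q <- quantile(x, probs = c(0.25, 0.5, 0.75), names = FALSE)
iqr <- q[3] - q[1]

# Whiskers: the most extreme points within 1.5 * IQR of the box edges
lower_whisker <- min(x[x >= q[1] - 1.5 * iqr])
upper_whisker <- max(x[x <= q[3] + 1.5 * iqr])

c(
  lower_whisker = lower_whisker,
  q1 = q[1],
  median = q[2],
  q3 = q[3],
  upper_whisker = upper_whisker
)
```

The median is the middle point and the whiskers end at the smallest and largest points, so the box plot is just redrawing the three values that are already on the plot.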

What would I do instead? How about this?

If your second reviewer insists on some indication of the mean (again, the median will always be the point in the middle), then how about this?

Lest you think I’m still in a bad mood about losing my ridgeline plot video, here’s a positive about these plots. They show the data. Showing the bar with the error bars or the box plot without the data would likely be very misleading. For the barplot example you’d get the sense that the data are evenly distributed when they often aren’t and for the boxplot example you’d think they had more data than they did. You can read more about why barplots are almost always not the right choice in this great article. I encourage you to present this paper at the next journal club session you lead. Let me know how the discussion goes!

Here’s some R code to recreate the plots in this newsletter…

library(tidyverse)

# Four groups with three replicate values each
d <- tibble(
  group = rep(c("A", "B", "C", "D"), each = 3),
  rep = rep(c(1, 2, 3), 4),
  value = c(2.0, 2.2, 2.4, 1.1, 1.5, 1.8, 0.8, 0.6, 1.2, 2.5, 2.4, 2.8)
)

# Compare mean +/- 1 sd with the observed min and max for each group
d %>%
  summarize(mean = mean(value),
            sd = sd(value),
            lower = mean - sd,
            upper = mean + sd,
            min = min(value),
            max = max(value),
            .by = group)

# Bars at the mean, error bars at mean +/- 1 sd, and the data points
# (mean_sdl needs the Hmisc package to be installed)
ggplot(d, aes(x = group, y = value)) +
  stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1),
               geom = "col", width = 0.25, fill = "gray50") +
  stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1),
               geom = "errorbar", width = 0.2) +
  geom_jitter(width = 0.1, height = 0)

# Box plots with the data points
ggplot(d, aes(x = group, y = value)) +
  geom_boxplot() +
  geom_jitter(width = 0.1, height = 0)

# Just the data points
ggplot(d, aes(x = group, y = value)) +
  geom_jitter(width = 0.1, height = 0)

# Data points with a line at the mean (mult = 0 collapses the crossbar to the mean)
ggplot(d, aes(x = group, y = value)) +
  geom_jitter(width = 0.1, height = 0) +
  stat_summary(fun.data = mean_sdl, fun.args = list(mult = 0),
               geom = "crossbar", width = 0.5)

Workshops

I'm pleased to be able to offer you one of three recent workshops! With each you'll get access to 18 hours of video content, my code, and other materials. Click the buttons below to learn more.

minimalR Workshop

generalR Workshop

mothur Workshop

In case you missed it…

Here is a livestream that I published this week that relates to previous content from these newsletters. Enjoy!

Finally, if you would like to support the Riffomonas project financially, please consider becoming a patron through Patreon! There are multiple tiers and fun gifts for each. By no means do I expect people to become patrons, but if you need to be asked, there you go :)

I’ll talk to you more next week!

Pat

Riffomonas Professional Development
