Drawing Venn diagrams with ggplot2


Hey folks,

As you know, I’ve been encouraging folks to strengthen their intuition of how to make different types of plots in R. Some of that is through this newsletter. Some of that is by engaging in mob programming with your research group and others. I received a great email this week from Devin Drown, a professor at the University of Alaska in Fairbanks, that just made my day:

I want to thank you for sending out this newsletter and the past few with visualizations. This past week, I used this box plot challenge for the students in my lab group. Most hadn’t seen your newsletter post, so it was a fresh challenge for them. We even tried our hand at the mob programming (Driver/Navigator) approach that you mentioned before. What a really fun way to engage lots of learners in programming. We had a wide variety of experience levels, and I was really impressed at how this method can help.

This unsolicited endorsement really speaks to what I am trying to do with Riffomonas. If you don’t have a local community that you can lean on to do mob programming, I’m offering sessions throughout October for anyone who is interested. The goal is to provide greater opportunities to hone your skills in a social setting. Of course, it is best to do it with people who are on a journey with you at your institution. Doing it via Zoom is the next best thing.


Each week when I search for figures to show you for this newsletter, I come across a number of things that aren’t what we typically think of as plots we’d make with R. One of the more common types of figures I come across is the Venn diagram. Here is one of many possible examples, taken from Figure 1 of “Exploring pangenomic diversity and CRISPR-Cas evasion potential in jumbo phages: a comparative genomics study” by Sharayu Magar and colleagues (Who doesn’t love jumbo phages?!).

As with anything you might do with R, there are many ways to make this figure. Before reading on, how would you make a Venn diagram in R?

Three general approaches come to mind immediately. First, there’s probably a package out there to draw geometric objects or even make Venn diagrams. But where’s the fun in that and what would we learn? Second, we could represent each circle as a line plot. The points along the lines could be generated using the equation of a circle and stored in a data frame. For those of you who just reached for PowerPoint, don’t! There’s a third approach. Instead of thinking of each circle in a Venn diagram as a line or ribbon plot, think about a scatter plot.

Let’s think about a data frame with three columns and two rows. The first, x contains the x-axis position. Let’s make those values 1 and 2. The second, y, contains the y-axis position. Let’s make both of those values 0. The third, label, contains the labels we find above each circle. What would happen if we use this data frame with the {ggplot2} function geom_point()? Believe it or not, this is the start of a Venn diagram.
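Here’s a minimal sketch of that starting point. The label text is a placeholder I made up, not the wording from the published figure.

```r
library(ggplot2)

# Two rows and three columns, as described above.
# The label values are placeholders (an assumption).
venn_data <- data.frame(
  x = c(1, 2),
  y = c(0, 0),
  label = c("Group A", "Group B")
)

# A plain scatter plot with one point per circle
p <- ggplot(venn_data, aes(x = x, y = y)) +
  geom_point()

p
```

At this point you should see two small black dots. That’s fine. Everything else is a matter of adjusting aesthetics.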

Let’s think about the various aesthetics that are available to us with geom_point(). I want to call your attention to six: size, color, shape, fill, stroke, and alpha. We don’t need a legend, so see if you can figure out how to get rid of that.

Once we generate our “scatter plot”, add size=10 as the argument to geom_point(). Because we are using the size argument outside of the aes() function, it will be applied to all of the points. Now increase or decrease the value going to size to get the circles to be the right size. You might also want to adjust the x-axis position to get the right amount of overlap. Unfortunately, we’re likely to end up with a plotting window that is entirely black. We need to zoom out. But how? You can change the x-axis limits using coord_cartesian(). You will likely notice that as we adjust things like the axis limits, the size of the circles will change. Setting expand = FALSE within coord_cartesian() can help some.
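As a rough sketch of this step, here is what the size and axis limit adjustments might look like. The size of 60 and the limits of 0 to 3 are guesses on my part. The right values depend on the dimensions of your plotting window, so expect to fiddle with them.

```r
library(ggplot2)

venn_data <- data.frame(
  x = c(1, 2),
  y = c(0, 0),
  label = c("Group A", "Group B")  # placeholder labels (an assumption)
)

p <- ggplot(venn_data, aes(x = x, y = y)) +
  # size is outside of aes(), so it applies to every point
  geom_point(size = 60) +
  # zoom out on the x-axis; expand = FALSE drops the extra axis padding
  coord_cartesian(xlim = c(0, 3), expand = FALSE)

p
```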

Next, we want to adjust the color so we don’t have overlapping black circles. Do you recall how we can map a variable to a color? We can put the variable, like label, in aes(). Now we get two different colors - success! If you don’t like the colors, you can change them to what you want with scale_color_manual(). Unfortunately, the right hand circle masks the overlapping region with the left hand circle. This is where alpha comes in handy. The alpha aesthetic modifies the level of transparency of the plotting symbols. Try using alpha=0.5 as an argument to geom_point() and adjust the value between 0 and 1 to find a value you like. Looking closer at the published figure, I notice that the left hand circle is solid and the right hand circle has some transparency. How would you use two different alpha values? What does your intuition tell you the function might be called that lets you pick specific alpha values?
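Here’s a sketch of the color and transparency steps with a single alpha value. The color names are arbitrary choices of mine, and I’ve used show.legend = FALSE as one way to drop the legend. I’ll leave the two-alpha version for you to work out.

```r
library(ggplot2)

venn_data <- data.frame(
  x = c(1, 2),
  y = c(0, 0),
  label = c("Group A", "Group B")  # placeholder labels (an assumption)
)

p <- ggplot(venn_data, aes(x = x, y = y, color = label)) +
  # alpha is outside of aes(), so both circles get the same transparency
  geom_point(size = 60, alpha = 0.5, show.legend = FALSE) +
  # arbitrary colors (an assumption); swap in whatever you like
  scale_color_manual(values = c("Group A" = "gold", "Group B" = "steelblue")) +
  coord_cartesian(xlim = c(0, 3), expand = FALSE)

p
```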

Another subtle difference between our version of the Venn diagram and the published version is that the published version has a solid black border around the circles. That gives a pretty cool look that I like. If you’ve ever wondered what plotting character values between 21 and 25 are for, here’s a use case. Those plotting symbols allow you to use one color for the interior of the symbol and another for the border. Let’s use shape = 21 in geom_point(). You’ll notice that now we have only colored the border, not the interior. As we’ve seen, the color aesthetic colors the border. The fill aesthetic colors the interior. How would you get colored symbols with black borders? Once you get that to work, we might want that border to be a bit thicker. We made the symbol larger, but we didn’t alter the line thickness. That can be done using the stroke argument. Start with stroke = 2 in geom_point() and adjust the value to get a thickness you like.
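If you want to check your answer, here is one possible way to do it: map label to fill and set color to black outside of aes(). As before, the sizes and colors are my own guesses.

```r
library(ggplot2)

venn_data <- data.frame(
  x = c(1, 2),
  y = c(0, 0),
  label = c("Group A", "Group B")  # placeholder labels (an assumption)
)

p <- ggplot(venn_data, aes(x = x, y = y, fill = label)) +
  geom_point(
    shape = 21,        # a symbol with separate border and interior colors
    color = "black",   # border color, the same for both circles
    stroke = 2,        # border thickness
    size = 60,
    alpha = 0.5,
    show.legend = FALSE
  ) +
  # fill now controls the interior color, so we need the fill scale
  scale_fill_manual(values = c("Group A" = "gold", "Group B" = "steelblue")) +
  coord_cartesian(xlim = c(0, 3), expand = FALSE)

p
```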

If your Venn diagram is looking like mine, you’ll notice that the black line is actually gray. Why is that? It appears that alpha also makes the border line transparent. Oof. How would you go about making the border a solid, dark, black line? My solution would be to plot plotting symbols on top of the originals. The new plotting symbols would have no fill color. If you use color = NA or fill = NA, that NA is effectively 100% transparency. But now you might notice that the left hand circle has a solid black line all the way around. It isn’t “under” the right hand circle. This is getting long, so I’ll let you see if you can figure that out.
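Here’s a sketch of the overlay idea: a second geom_point() layer with the same shape, size, and stroke, but with fill = NA and no alpha. It doesn’t solve the “under” problem, so that part is still yours to figure out.

```r
library(ggplot2)

venn_data <- data.frame(
  x = c(1, 2),
  y = c(0, 0),
  label = c("Group A", "Group B")  # placeholder labels (an assumption)
)

p <- ggplot(venn_data, aes(x = x, y = y)) +
  # first layer: transparent filled circles
  geom_point(
    aes(fill = label),
    shape = 21, color = "black", stroke = 2, size = 60,
    alpha = 0.5, show.legend = FALSE
  ) +
  # second layer: empty circles with a fully opaque black border
  geom_point(shape = 21, color = "black", fill = NA, stroke = 2, size = 60) +
  scale_fill_manual(values = c("Group A" = "gold", "Group B" = "steelblue")) +
  coord_cartesian(xlim = c(0, 3), expand = FALSE)

p
```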

Let’s add those labels! Let’s start by adding the labels above the two circles using geom_text(). When I make the data frame, I’d probably put the actual label in the label column. If you need a line break, you can use "\n" in the string. Of course, it will put the labels on the x-axis because our y column has 0 for both positions. You can easily modify the y-axis position for both circles by including y = 0.75. But, it now looks like the labels are on the x-axis still. The problem is that our y-axis limits haven’t been set. Return to coord_cartesian() and see if you can set those limits and then adjust where the labels go. Of course, you can also modify the text-related arguments in geom_text() to change the size of the font and whether it is bolded.
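A sketch of the labeling step might look like this. The label text, the y-axis limits of -1 to 1, and the font settings are all assumptions that you’ll want to change.

```r
library(ggplot2)

venn_data <- data.frame(
  x = c(1, 2),
  y = c(0, 0),
  # placeholder labels (an assumption); "\n" inserts a line break
  label = c("Group\nA", "Group\nB")
)

p <- ggplot(venn_data, aes(x = x, y = y)) +
  geom_point(size = 60, alpha = 0.5) +
  # y = 0.75 is outside of aes(), so it overrides the y column for the text
  geom_text(aes(label = label), y = 0.75, size = 5, fontface = "bold") +
  # set the y-axis limits so there is room above the circles
  coord_cartesian(xlim = c(0, 3), ylim = c(-1, 1), expand = FALSE)

p
```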

Next, let’s insert the numbers in those circles. We can do that by creating a second data frame that we add to the plot with a second geom_text() layer using its data, mapping, and inherit.aes arguments. Let’s try something different. We can use annotate() from {ggplot2}. This function will take the geom and all of its aesthetics to position and modify the text. Here’s a starting point that you can use to modify:


annotate(
  geom = "text",
  x = c(0.25, 1.50, 2.5),
  y = c(0, 0, 0),
  label = c("A", "B", "C")
) +

This is a really cool function to add annotations to your figure. Once you get this to work, see if you can use it to replace the earlier geom_text() function.
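To see the annotate() call in context, here is a sketch that slots it into a complete plot. The x positions and the "A", "B", "C" labels come from the starting point above. The circle size and axis limits are my guesses.

```r
library(ggplot2)

venn_data <- data.frame(
  x = c(1, 2),
  y = c(0, 0),
  label = c("Group A", "Group B")  # placeholder labels (an assumption)
)

p <- ggplot(venn_data, aes(x = x, y = y)) +
  geom_point(size = 60, alpha = 0.5) +
  # one text label each for the left-only, overlapping, and right-only regions
  annotate(
    geom = "text",
    x = c(0.25, 1.50, 2.5),
    y = c(0, 0, 0),
    label = c("A", "B", "C")
  ) +
  coord_cartesian(xlim = c(0, 3), ylim = c(-1, 1), expand = FALSE)

p
```

You’ll probably need to nudge the x values so that each label sits inside its region.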

The last thing to modify is the background. It still looks like a plot. There’s a special theme_*() function that will come in handy here: theme_void(). Voilà. Your Venn diagram. Now you can adjust the sizes and positions to get things to look how you want them. I’d encourage you to use ggsave() to save your figure to a specific format and size so things don’t move between the “Plots” tab and your final file.
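Here’s a sketch of those last two steps. The file name, the PDF format, and the 5 x 4 inch dimensions are assumptions. I’m writing to a temporary directory here, so change out_file to wherever you keep your figures.

```r
library(ggplot2)

venn_data <- data.frame(
  x = c(1, 2),
  y = c(0, 0),
  label = c("Group A", "Group B")  # placeholder labels (an assumption)
)

p <- ggplot(venn_data, aes(x = x, y = y, fill = label)) +
  geom_point(shape = 21, color = "black", stroke = 2, size = 60,
             alpha = 0.5, show.legend = FALSE) +
  coord_cartesian(xlim = c(0, 3), ylim = c(-1, 1), expand = FALSE) +
  # remove the axes, grid lines, and background
  theme_void()

# Hypothetical output path; fixing the width and height keeps the layout
# from shifting when the plotting window is resized
out_file <- file.path(tempdir(), "venn_diagram.pdf")
ggsave(out_file, plot = p, width = 5, height = 4, units = "in")
```

Once the dimensions are fixed, adjust the circle size and positions while looking at the saved file rather than the “Plots” tab.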

You might be asking, why would anyone go through all of this when you could just use Micro$oft PowerPoint? The most important reason is that you could include a script to generate a Venn diagram that takes the label and number values from your data. The figure will automatically get updated if your upstream analysis changes. Scripting a figure like this is also valuable if you need to generate a bunch of Venn diagrams. Of course, think of all of the awesome things you just learned!
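To make that concrete, here’s a small sketch that computes the three numbers from the data instead of typing them in. The two sets and their names are hypothetical. In practice they would come from your upstream analysis.

```r
library(ggplot2)

# Hypothetical sets of items found in each group (an assumption)
set_a <- c("gene1", "gene2", "gene3", "gene4")
set_b <- c("gene3", "gene4", "gene5")

# Counts for the left-only, overlapping, and right-only regions
counts <- c(
  length(setdiff(set_a, set_b)),
  length(intersect(set_a, set_b)),
  length(setdiff(set_b, set_a))
)

venn_data <- data.frame(
  x = c(1, 2),
  y = c(0, 0),
  label = c("Group A", "Group B")  # placeholder labels (an assumption)
)

p <- ggplot(venn_data, aes(x = x, y = y)) +
  geom_point(size = 60, alpha = 0.5) +
  # the labels are the computed counts, so they update with the data
  annotate(geom = "text", x = c(0.5, 1.5, 2.5), y = 0, label = counts) +
  coord_cartesian(xlim = c(0, 3), ylim = c(-1, 1), expand = FALSE) +
  theme_void()

p
```

If set_a or set_b changes, rerunning the script updates the numbers in the figure without any manual editing.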

Finally, I’d encourage you to make a three circle Venn diagram using what you’ve learned from making a two circle diagram. I think diagrams with more groups would require using the equation of an ellipse. While I’m handing out homework that I don’t have to grade, see if you can make a two and three circle diagram using the equation of a circle!
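If you’d like a nudge on that last bit of homework, here’s a sketch of one way to generate the points for a circle using its parametric form, x = x0 + r cos(θ) and y = y0 + r sin(θ). The circle_points() helper, the radius of 0.75, and the labels are all made up for illustration. Coloring, filling, and labeling are still up to you.

```r
library(ggplot2)

# Hypothetical helper: n points along a circle centered at (x0, y0) with radius r
circle_points <- function(x0, y0, r, n = 200) {
  theta <- seq(0, 2 * pi, length.out = n)
  data.frame(
    x = x0 + r * cos(theta),
    y = y0 + r * sin(theta)
  )
}

left <- circle_points(x0 = 1, y0 = 0, r = 0.75)
left$label <- "Group A"   # placeholder label (an assumption)

right <- circle_points(x0 = 2, y0 = 0, r = 0.75)
right$label <- "Group B"  # placeholder label (an assumption)

circles <- rbind(left, right)

p <- ggplot(circles, aes(x = x, y = y, group = label)) +
  # connect the points in order to trace each circle
  geom_path() +
  # equal scaling on both axes so the circles aren't squashed into ellipses
  coord_fixed()

p
```

One nice thing about this approach is that the circles are in data units, so they don’t change size when you resize the plotting window.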

As I come to the end of the current YouTube channel series building an R package, let me know whether you’d like me to take this verbal analysis of figures and translate it to real R code that I develop in video form. I’m always interested in the types of figures you’d like to see how to make in R - feel free to email me with ideas!

Workshops

I'm pleased to be able to offer you one of three recent workshops! With each you'll get access to 18 hours of video content, my code, and other materials.

In case you missed it…

Here are some videos that I published this week that relate to previous content from these newsletters. Enjoy!


Finally, if you would like to support the Riffomonas project financially, please consider becoming a patron through Patreon! There are multiple tiers and fun gifts for each. By no means do I expect people to become patrons, but if you need to be asked, there you go :)

I’ll talk to you more next week!

Pat
