Maintaining package hygiene in your R scripts


Hey there,

I’ve enjoyed marching through some of the more common code smells that I see when looking at people’s data analysis work. Let me know if you’re eager to get my thoughts on any smells you frequently see or have heard about. This week I’m taking on “package hygiene”.

What do I mean by package hygiene? This has a few aspects. Of course, as I discussed a few weeks ago, it means putting all of your library() calls at the top of your R script. Co-locating these calls makes it easier for someone to see what needs to be installed to get your code to run.

Once you have put all of the library() calls at the top of the script, it also becomes easy to see how many packages you are using. This week, the bit of hygiene I want to focus on is the number of library() calls in your R script.

I once worked with someone who easily had 25 library calls. There were a few reasons for this.

First, he had been googling for a package to do X, tried one of the results out in his script and then tried another. Then another. In the final script he may have only used one of those packages and forgotten to remove the other two library calls. Then he googled for a package to do Y, etc. When this was compounded across his entire script he had a pretty hefty number of library calls.

Second, as suggested by all of his googling, he was an R beginner. Several of the packages were being used to add features onto ggplot2 or to do trivial things that probably didn’t require a package. There’s nothing necessarily wrong with these packages, but they weren’t strictly necessary. This brings up a long-running debate over how many dependencies your code should have. Once upon a time, I wrote all of my code with no dependencies. That really made life hard on me as a coder, but easier on someone using my code. In time, I appreciated that I was often the only person using my R code! So it was worth adding a few packages to make my life easier as a coder. Now I basically start every script with library(tidyverse) and move on with life.

Third, his script was waaaaay too long. It was hundreds of lines long. Maybe even more than 1,000 lines long. Long scripts are their own smell, which I’ll take on another time. Because his script did so many things, he needed multiple specialized packages for different types of analysis. His large number of library calls was an indicator that he needed to break up his analysis into different parts.

Part of the reason this case is so memorable to me was because he was loading packages that conflicted with other packages. Whenever you do library(tidyverse) you’ll get a sense of this…


> library(tidyverse)
── Attaching core tidyverse packages ─────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2
── Conflicts ───────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

See that section at the bottom with “Conflicts”? That tells me which functions conflict with other functions I’m already using. There are only so many words to describe filtering. It turns out that filter() in the stats package filters time series data, while filter() in the dplyr package removes rows from data frames. Pretty different applications. The person I was working with was loading a package that had a select() function. Whenever he would try running select() with a data frame as input, it would throw an error. I don’t recall what the package was or what its select() did, but we wanted the dplyr version. It was really annoying.

When you use more and more packages, the likelihood of having conflicts like these increases. What do you do when you can’t avoid having conflicts?

Let’s consider the project I was helping with where conflicting select functions were getting loaded. One option would be to load the package with the select function you want after the other package. If I only have two or three packages and there’s only one function that I’m worried about, I might take this approach. But it isn’t ideal.
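To sketch the load-order approach: I don’t recall which package was masking select() in his script, so take MASS as a stand-in here, since it also exports a select() function. Whichever package is attached last wins the conflict.

```r
library(MASS)    # MASS also exports a select() function
library(dplyr)   # attached last, so its select() wins the conflict

# this now calls dplyr::select(), not MASS::select()
result <- mtcars %>% select(mpg, cyl)
```

The fragility is obvious: reorder the two library() calls and the script breaks, which is why this approach only makes sense for small scripts with one or two known conflicts.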

To show a better approach, let’s say you’re working with time series data and want to use filter() from both stats and dplyr. The output I have above from running library(tidyverse) gives you a hint. If you want to use a function from a specific package, use the package name with two colons followed by the function name. For example,


mtcars %>%
  dplyr::select(mpg, cyl)
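The same double-colon pattern lets you use both filters side by side in one script. Here’s a short sketch using the built-in AirPassengers time series:

```r
library(dplyr)

# stats::filter() smooths a time series; here, a 12-month moving average
smoothed <- stats::filter(AirPassengers, rep(1 / 12, 12))

# dplyr::filter() keeps the rows of a data frame that match a condition
heavy <- dplyr::filter(mtcars, wt > 3.5)
```

With the package names spelled out, a reader never has to guess which filter() is being called, no matter what order the packages were loaded in.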

Some people will encourage you to always use this approach for any function you call from a package. I’m trying to be better about doing this myself, but old habits die hard. Plus, if I only have one package (tidyverse), then who cares?

The other benefit of writing dplyr::select() is that you won’t actually need to run library(dplyr) to use select(), as long as dplyr is already installed. Using the double-colon syntax keeps me from having to load everything from a big package. Be sure to remember that tidyverse is a “metapackage” made up of other packages, including dplyr; it doesn’t export select() itself, so running tidyverse::select() actually gives an error. I would still encourage you to put library() calls at the top of your script so you know which packages need to be installed. The double-colon syntax will still be helpful if there are conflicts, and it helps you see where different functions are coming from if you ever end up with 25 library() calls in your script!
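A quick sketch of both points, assuming dplyr is installed but not attached:

```r
# no library() call needed; dplyr just has to be installed
two_cols <- dplyr::select(mtcars, mpg, cyl)

# tidyverse is a metapackage and doesn't export select() itself,
# so this line would error if uncommented:
# tidyverse::select(mtcars, mpg, cyl)
```

This is also why a reader scanning only your library() calls could miss a dependency, which is one more reason to keep those calls at the top even when you use the double-colon syntax.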

This week, look at some of your recent R code and count the number of library calls. What does that number tell you? Are you sure you are using functions from each of those packages? Do you know where each function is coming from? Is your script trying to do too much? Are you unnecessarily trying to be a dependency minimalist?
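If you want a quick way to count the library() calls in a script, a sketch like this works; the inline character vector stands in for your own file, which you’d read with readLines():

```r
# a quick audit: count library() calls in a script
# (replace this inline vector with readLines("your_script.R"))
script <- c(
  "library(tidyverse)",
  "library(broom)",
  "  library(janitor)  # indented calls count too",
  "mtcars %>% select(mpg)"
)
n_libs <- sum(grepl("^\\s*library\\(", script))
n_libs   # 3
```

If that number surprises you, the questions above are a good place to start.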

Workshops

I'm pleased to be able to offer you one of three recent workshops! With each you'll get access to 18 hours of video content, my code, and other materials. Click the buttons below to learn more.

In case you missed it…

I’m in the process of building an R package to implement the Naive Bayesian Classifier that used to be found at the Ribosomal Database Project. There were two videos this week. Click on the thumbnails below to go to each video.

Finally, if you would like to support the Riffomonas project financially, please consider becoming a patron through Patreon! There are multiple tiers and fun gifts for each. By no means do I expect people to become patrons, but if you need to be asked, there you go :)

I’ll talk to you more next week!

Pat

Riffomonas Professional Development
