Congratulations! You have the one script to rule them all. Now fix it.

Hey folks,

Hopefully, you’ve been enjoying my march through some of the more common code smells that I see as I look over people’s shoulders when they ask for help with their R code. My goal in this series is to give both the person developing code (aka John) and their supervisor (aka Peggy) a useful set of things to look for when reviewing code.

Last week, I discussed the smell of having too many packages in your R script. Having a bunch of packages may be a symptom of having an overly long R script. That’s this week’s code smell!

How long is too long? Well... Can we start by saying that a script with more than 10,000 lines of code is too long and work back from there?!

I think “how long is too long” is probably the wrong question. Let’s try out these questions instead:

* What does this R script do?
* How many R scripts do you have for your analysis?
* How long does it take you to find something in your script?

These are three highly related questions. Let me share my side of a conversation I once had:

“Tell me what your most recent R script does” … “Oh, I see, it builds all of the figures for your paper.” … “Oh, really - it also fits some statistical models.” … “Really? It also cleans the data?” … “How long does it take to run?” … “You don’t know because you’ve never actually run the whole thing?” … “Can you show me where it cleans the data?”

This person is likely falling into the trap of thinking that they are developing the “one script to rule them all”. This is almost always a bad idea. You are unlikely to be the exception to this rule. More than likely, you (and I) are the reason this is a bad idea!

Scripts that go on for hundreds of lines of code usually do many things. They also tend to be poorly organized so that you can’t find the code you need to find. If you try to run the script with R’s source function, it may throw an error or warning message. Then you’re stuck trying to wade through all that code to find where things went wrong. If it doesn’t throw an error message, it may take a while to execute the script. If you only wanted to change the color of the lines in Figure 3, then you’ll need to rerun the entire script multiple times to get just the right color. What a pain.

Your script should do one thing with clear inputs and a single output. Anything else it changes along the way, such as extra files it writes, is called a “side effect”. As with medications, side effects in programming are almost always a bad idea.

If you were to look at a directory for one of my projects, you would likely see a separate R script for each figure I am generating, where the output of each script is a single TIFF file. Where is the code that I ran to generate Figure 3? Oh right there, it’s in code/build_figure_3.R. More than likely, Figure 3 is based on several pieces of input data that were each generated by other scripts. It’s also likely that Figures 2 and 5 reuse some of that data. I’ll have separate scripts to generate those input files, hopefully with descriptive names that tell me what each script does.
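To make the "clear inputs, single output" idea concrete, here is a minimal sketch of what a figure script along these lines might contain. It is only an illustration: the function name, the column names (`day`, `abundance`), and the file paths are all hypothetical, and it uses base R graphics to stay self-contained.

```r
# Sketch of a one-figure script: one input file in, one TIFF file out.
# Hypothetical: the function name, the columns `day` and `abundance`,
# and the paths in the commented call at the bottom.

build_figure_3 <- function(input_csv, output_tiff) {
  dat <- read.csv(input_csv)

  # Open the TIFF device, and make sure it gets closed even on error
  tiff(output_tiff, width = 6, height = 4, units = "in", res = 300)
  on.exit(dev.off())

  plot(dat$day, dat$abundance, type = "l", col = "dodgerblue",
       xlab = "Day", ylab = "Relative abundance")

  # Return the path of the single output, invisibly
  invisible(output_tiff)
}

# In a real project the script would end with a single call such as:
# build_figure_3("data/processed/figure_3_input.csv", "figures/figure_3.tiff")
```

With this layout, changing the line color means editing one argument and rerunning one short script, not the whole analysis.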

By the end of the analysis, I may have 10 or more R scripts in my code/ directory. How do you keep track of all the inputs to those scripts? One option would be to have the “one script to rule them all”. Or perhaps, the “one script to run them all”. This driver script can be very helpful to organize your pipeline.
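A driver script can be as simple as a list of the other scripts and a loop that calls R's `source` on each one in order. The sketch below is one possible shape for it, not a prescription. The helper name `run_pipeline` and the script names in `steps` are hypothetical.

```r
# Sketch of a "one script to run them all" driver.
# Hypothetical: the helper name and the script paths listed in `steps`.

run_pipeline <- function(scripts) {
  # Fail fast if any step is missing, before running anything
  missing_scripts <- scripts[!file.exists(scripts)]
  if (length(missing_scripts) > 0) {
    stop("Missing scripts: ", paste(missing_scripts, collapse = ", "))
  }

  for (script in scripts) {
    message("Running ", script)
    # Give each script a fresh environment so steps can only
    # communicate through the files they read and write
    source(script, local = new.env(parent = globalenv()))
  }

  invisible(scripts)
}

# Order matters: each step reads files written by the steps before it
steps <- c(
  "code/clean_data.R",
  "code/fit_models.R",
  "code/build_figure_3.R"
)

# run_pipeline(steps)
```

Running each script in its own environment is a design choice in this sketch: it keeps one step from quietly depending on objects another step left behind in memory.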

An alternative is to use a tool like Snakemake or R’s targets package to keep track of dependencies. With these types of tools, if I want to change a color in a figure, the tooling will tell me that I only need to regenerate the figure. If I change the URL where the raw data comes from, then it will know that the entire pipeline needs to be rerun. I’ve become a HUGE fan of Snakemake for managing the data and code dependencies in my projects.
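The core idea behind these tools can be illustrated in a few lines of base R: rebuild an output only if it is missing or older than something it depends on. To be clear, this is not how Snakemake or targets is implemented, and it is no substitute for either. It is a toy sketch of timestamp-based dependency checking, and the function names are hypothetical.

```r
# Toy sketch of dependency tracking with file timestamps.
# Hypothetical helpers; this is NOT the Snakemake or targets API.

# TRUE if the output is missing or older than any of its inputs
needs_rebuild <- function(output, inputs) {
  if (!file.exists(output)) {
    return(TRUE)
  }
  any(file.mtime(inputs) > file.mtime(output))
}

# Run `script` only when `output` is stale. The script itself counts as
# an input, so editing the code (say, a line color) triggers a rebuild.
build_if_stale <- function(output, inputs, script) {
  if (needs_rebuild(output, c(script, inputs))) {
    message("Rebuilding ", output)
    source(script, local = new.env(parent = globalenv()))
  } else {
    message(output, " is up to date")
  }
  invisible(output)
}

# Example call with hypothetical paths:
# build_if_stale(
#   output = "figures/figure_3.tiff",
#   inputs = "data/processed/figure_3_input.csv",
#   script = "code/build_figure_3.R"
# )
```

Real workflow tools go well beyond this sketch. They work out the order of steps from the declared dependencies and rerun everything downstream of a change.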

This approach to breaking up a single script into smaller chunks of code is part of the project-based approach I discussed a few weeks ago when chiding you for using setwd.

In conclusion, if you take the approach of having each script generate a single output, you will tend to create more scripts that are each shorter. You’ll find that your code is better organized and that you’ll minimize duplication of your code.

This week, take a look at your most recent project. How many R scripts do you have? How long are they? How many outputs are they generating? Take one of those scripts and see if you can break it up into more manageable units.

Workshops

I'm pleased to be able to offer you one of three recent workshops! With each you'll get access to 18 hours of video content, my code, and other materials. Click the buttons below to learn more.

minimalR Workshop

generalR Workshop

mothur Workshop

In case you missed it…

I’m in the process of building an R package to implement the Naive Bayesian Classifier that used to be found at the Ribosomal Database Project. This week we got started on the package.

**The Team, The Team, The Team: Reductionism vs holism in microbiome research (CC274)**

**Evaluating the performance of various methods for generating vectors in R (CC275)**

Finally, if you would like to support the Riffomonas project financially, please consider becoming a patron through Patreon! There are multiple tiers and fun gifts for each. By no means do I expect people to become patrons, but if you need to be asked, there you go :)

I’ll talk to you more next week!

Pat

Riffomonas Professional Development
