Congratulations! You have the one script to rule them all. Now fix it.


Hey folks,

Hopefully, you’ve been enjoying my march through some of the more common code smells that I see as I look over people’s shoulders when they ask for help with their R code. My goal in this series is to give both the person developing code (aka John) and their supervisor (aka Peggy) a useful set of things to look for when reviewing code.

Last week, I discussed the smell of having too many packages in your R script. Having a bunch of packages may be a symptom of having an overly long R script. That’s this week’s code smell!

How long is too long? Well... Can we start by saying that a script with more than 10,000 lines of code is too long and work back from there?!

I think “how long is too long” is probably the wrong question. Let’s try out these questions instead:

* What does this R script do?
* How many R scripts do you have for your analysis?
* How long does it take you to find something in your script?

These are three highly related questions. Let me share my side of a conversation I once had:

“Tell me what your most recent R script does” … “Oh, I see, it builds all of the figures for your paper.” … “Oh, really - it also fits some statistical models.” … “Really? It also cleans the data?” … “How long does it take to run?” … “You don’t know because you’ve never actually run the whole thing?” … “Can you show me where it cleans the data?”

This person is likely falling into the trap of thinking that they are developing the “one script to rule them all”. This is almost always a bad idea. You are unlikely to be the exception to this rule. More than likely, you (and I) are the reason this is a bad idea!

Scripts that go on for hundreds of lines of code usually do many things. They also tend to be poorly organized so that you can’t find the code you need to find. If you try to run the script with R’s source function, it may throw an error or warning message. Then you’re stuck trying to wade through all that code to find where things went wrong. If it doesn’t throw an error message, it may take a while to execute the script. If you only wanted to change the color of the lines in Figure 3, then you’ll need to rerun the entire script multiple times to get just the right color. What a pain.

Your script should do one thing with clear inputs and a single output. Extra outputs beyond that one are called “side effects”. As with medications, side effects in programming are almost always a bad idea.

If you were to look at a directory for one of my projects, you would likely see a separate R script for each figure I am generating, where the output of each script is a single TIFF file. Where is the code that I ran to generate Figure 3? Oh right there, it’s in code/build_figure_3.R. More than likely, Figure 3 is based on several pieces of input data that were each generated by other scripts. It’s also likely that Figures 2 and 5 reuse some of that data. I’ll have separate scripts to generate those input files, hopefully with descriptive names that tell me what each script does.
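Here is a rough sketch of what a single-purpose figure script might look like. Only the name code/build_figure_3.R comes from my description above. The input file, its column names (day, od600), and the plot itself are made up for illustration, so treat this as one possible shape rather than my actual script.

```r
# code/build_figure_3.R (hypothetical contents)
# Input:  one CSV file made by an upstream script (file name and columns are assumed)
# Output: one TIFF file

build_figure_3 <- function(input_csv, output_tiff) {
  growth <- read.csv(input_csv)

  # Make sure the folder for the figure exists before opening the device
  dir.create(dirname(output_tiff), showWarnings = FALSE, recursive = TRUE)

  # Open the TIFF device and close it when the function exits, even on error
  tiff(output_tiff, width = 6, height = 4, units = "in", res = 300)
  on.exit(dev.off())

  plot(growth$day, growth$od600, type = "l", col = "dodgerblue",
       xlab = "Day", ylab = "OD600")

  invisible(output_tiff)
}

# Build the figure only if the (hypothetical) input file is present
if (file.exists("data/processed/growth_curves.csv")) {
  build_figure_3("data/processed/growth_curves.csv", "figures/figure_3.tiff")
}
```

With a layout like this, changing the line color means editing one argument and rerunning one short script, not the whole analysis.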

By the end of the analysis, I may have 10 or more R scripts in my code/ directory. How do you keep track of all the inputs to those scripts? One option would be to have the “one script to rule them all”. Or perhaps, the “one script to run them all”. This driver script can be very helpful to organize your pipeline.
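A driver script can be as simple as a list of scripts that get run in order. The sketch below assumes hypothetical script names (only code/build_figure_3.R was mentioned earlier) and a hypothetical helper called run_pipeline; the real order would depend on which scripts produce the inputs that others need.

```r
# code/run_all.R (hypothetical driver script)

run_pipeline <- function(scripts) {
  for (script in scripts) {
    message("Running ", script)
    # Give each script a fresh environment so objects do not leak between steps
    source(script, local = new.env())
  }
  invisible(scripts)
}

# Scripts listed in the order their outputs are needed (names are made up)
pipeline <- c(
  "code/download_raw_data.R",
  "code/clean_data.R",
  "code/fit_models.R",
  "code/build_figure_3.R"
)

# Only run when every script is present
if (all(file.exists(pipeline))) {
  run_pipeline(pipeline)
}
```

Sourcing each script into its own environment is a design choice, not a requirement. It forces every step to communicate through files on disk, which keeps the inputs and outputs of each script explicit.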

An alternative is to use a tool like Snakemake or R’s targets package to keep track of dependencies. With these types of tools, if I want to change a color in a figure, the tooling will tell me that I only need to regenerate the figure. If I change the URL where the raw data comes from, then it will know that the entire pipeline needs to be rerun. I’ve become a HUGE fan of Snakemake for managing the data and code dependencies in my projects.
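To get a feel for the core idea behind dependency tracking, here is a deliberately simplified base R sketch. It is not how Snakemake or targets are implemented, and it does none of the bookkeeping those tools do for you. It only asks whether an output file is missing or older than any of its inputs. The function name needs_rebuild and the file names are hypothetical.

```r
# Returns TRUE when `output` is missing or older than any of its inputs.
# Assumes the input files exist whenever the output file exists.
needs_rebuild <- function(output, inputs) {
  if (!file.exists(output)) {
    return(TRUE)
  }
  any(file.mtime(inputs) > file.mtime(output))
}

# Hypothetical file names: the figure depends on its script and its input data
needs_rebuild(
  output = "figures/figure_3.tiff",
  inputs = c("code/build_figure_3.R", "data/processed/growth_curves.csv")
)
```

If I edit the color in the figure script, the script becomes newer than the TIFF and only that figure is flagged. If an upstream data file changes, everything that lists it as an input is flagged as well.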

This approach to breaking up a single script into smaller chunks of code is part of the project-based approach I discussed a few weeks ago when chiding you for using setwd.

In conclusion, if you take the approach of having each script generate a single output, you will tend to create more scripts that are each shorter. You’ll find that your code is better organized and that you’ll minimize duplication of your code.

This week, take a look at your most recent project. How many R scripts do you have? How long are they? How many outputs are they generating? Take one of those scripts and see if you can break it up into more manageable units.

Workshops

I'm pleased to be able to offer you one of three recent workshops! With each you'll get access to 18 hours of video content, my code, and other materials.

In case you missed it…

I’m in the process of building an R package to implement the Naive Bayesian Classifier that used to be found at the Ribosomal Database Project. This week we got started on the package.

Finally, if you would like to support the Riffomonas project financially, please consider becoming a patron through Patreon! There are multiple tiers and fun gifts for each. By no means do I expect people to become patrons, but if you need to be asked, there you go :)

I’ll talk to you more next week!

Pat

Riffomonas Professional Development
