What is the smelliest function in R?


Hey there,

I’m still glowing from the positive feedback to restarting the YouTube channel. Thank you for all the warm fuzzies and encouragement :) In case you missed it, I’m going to be developing an R package over the coming weeks. My goals are to teach myself how to build an R package, to teach people how a commonly used algorithm in microbial data analysis works, and to give more exposure to programming with R.

In last week’s newsletter I mentioned the need to develop a sense of “taste” when we look at code. Even if we don’t know how to code, I believe there are things you can see in code that tell you there might be problems. The technical jargon for these problems is “code smells”.

This week, the code smell I want to share with you is the use of the setwd function in R code. The setwd function probably should have never been created. Well-meaning people use setwd to set the working directory for their script, but there are better approaches to doing this.

There are at least three problems I’ve seen with using setwd.

The first problem with setwd is that it isn’t portable. If you have an R script with setwd you likely have something like this:

setwd("/Users/pschloss/Desktop/phylotyper")

What happens if you need to run your script on a high performance computer (HPC) cluster? What happens if you give your script to a colleague and encourage them to run it on their computer? So many error messages…

Why would this cause problems? An HPC is very unlikely to have a Desktop directory. Your colleague, most likely, doesn’t share your user id (e.g., pschloss). Again, setwd is not portable.
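Here is a minimal sketch of what that failure looks like. The path is a made-up stand-in, built under R's temporary directory so that it should not exist on your machine, and the tryCatch wrapper is only there so the sketch keeps running after the error:

```r
# Hypothetical path standing in for someone else's Desktop directory.
# It is built under tempdir() so that it should not exist on this machine.
someone_elses_dir <- file.path(tempdir(), "Users", "pschloss", "Desktop", "phylotyper")

# setwd() throws an error when the directory does not exist.
# tryCatch() captures the message instead of halting the script.
outcome <- tryCatch(
  {
    setwd(someone_elses_dir)
    "changed directory"
  },
  error = function(e) conditionMessage(e)
)

print(outcome)
```

In a real script there is usually no tryCatch, so the script stops at the setwd line and nothing after it runs.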

The second problem with setwd is that people will often use it to “move” their analysis to the data. They may have a single script or a pipeline with multiple setwd commands as they go in and out of directories. This gets confusing. It gets exceptionally confusing if your pipeline crashes midway through, because you are left not knowing where R was in the analysis.
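To make the crash scenario concrete, here is a small sketch. The run_step function and the simulated error are invented for illustration, and the directory lives under tempdir() so nothing on your computer is touched:

```r
start_dir <- getwd()

# Temporary stand-in for a data directory
scratch <- file.path(tempdir(), "raw_data")
dir.create(scratch, showWarnings = FALSE)

# Hypothetical pipeline step that moves into the data directory,
# crashes, and never reaches the line that moves back
run_step <- function() {
  setwd(scratch)
  stop("simulated crash")
  setwd(start_dir)  # never reached
}

try(run_step(), silent = TRUE)

# After the crash, R is still sitting in the data directory
stranded <- normalizePath(getwd()) == normalizePath(scratch)
print(stranded)

# Put things back so the rest of the session is unaffected
setwd(start_dir)
```

Every later relative path in the session would now resolve against the wrong directory, which is exactly the confusion described above.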

A related problem with using setwd is that it indicates other problems might be lurking with how you are doing your analysis. In my experience, many setwd users have their data and code in very different places on their computer. This can cause confusion about “where” they are in the computer and which files they are using.

What is the alternative? I would strongly encourage housing all of your code, data, and outputs within a single directory that I call the “project root directory”. Within the project root directory you would then have separate directories for code, data, and outputs. All R code would be run from the project root directory.
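As a rough sketch, a layout like this could be set up from R. The project lives under tempdir() here only so the example is safe to run. The subdirectory name code is my assumption, while raw_data and proc_data match the names used later in this newsletter:

```r
# Stand-in location so the sketch does not touch your real files
project_root <- file.path(tempdir(), "phylotyper")

# Assumed layout: code for scripts, raw_data for inputs, proc_data for outputs
subdirs <- file.path(project_root, c("code", "raw_data", "proc_data"))

for (d in subdirs) {
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}

# List what now sits directly inside the project root
print(list.files(project_root))
```

The exact names matter less than the habit: everything for the project sits under one directory, and nothing reaches outside of it.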

Using a project directory like this, you will need to give the paths to your data and code relative to the project root directory. If you ever see a path start with a / (or, on Windows, with a drive letter such as C:), you have an absolute path. The path I gave above with setwd is an example. But, if I’m in phylotyper, I could instead do something like this using paths relative to the project root directory:


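library(tidyverse)

# Read the raw data, compute mpg for each passenger count, and write the summary
read_tsv("raw_data/my_car_data.tsv") %>%
  group_by(passengers) %>%
  summarize(mpg = sum(total_miles) / sum(total_gallons)) %>%
  write_tsv("proc_data/my_summary_car_data.tsv")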

No need to move in and out of directories. You can read from raw_data and write to proc_data. If I give you my phylotyper directory, then you would have everything self-contained within that directory.

But how do we get to phylotyper in the first place? If you’re using the command line, you can navigate there using cd and then launch R from the command prompt. Most likely, however, you’re using RStudio. In RStudio, you can create a project file that will end with .Rproj. This will live in your project root directory. When you double-click on this file, RStudio will launch and put you directly into the project root directory.

Brilliant, eh? If you ever wonder where you are, you can either run getwd() within R or look at the upper left corner of your Console window in RStudio. You should see your path.
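If you want a script to check this for you, one option is a small guard at the top. This is only a sketch: check_project_root is a hypothetical helper, not a function from base R or RStudio, and the required directory names are assumptions taken from the example above.

```r
# Hypothetical helper: stop early if the expected project directories
# are not visible from the directory R is working in
check_project_root <- function(root = getwd(),
                               required = c("raw_data", "proc_data")) {
  absent <- required[!dir.exists(file.path(root, required))]
  if (length(absent) > 0) {
    stop(
      "Expected to find ", paste(absent, collapse = ", "), " in ", root,
      ". Launch R from the project root directory.",
      call. = FALSE
    )
  }
  invisible(TRUE)
}

# Demonstration against a temporary stand-in project
demo_root <- file.path(tempdir(), "phylotyper_demo")
dir.create(file.path(demo_root, "raw_data"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(demo_root, "proc_data"), showWarnings = FALSE)
check_project_root(demo_root)
```

In a real script you would call check_project_root() with no arguments as the first line, so a script launched from the wrong place fails with a clear message instead of a confusing "file not found" error later on.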

Working relative to your project root directory is a very different way of approaching data analysis. But, it is tremendously powerful. Furthermore, each of the problem cases I described above vanishes once you view your entire project as running from a single working directory.

This week, look at your code. Can you find a script that has multiple setwd calls? Supervisors, look at your people’s code. Do you see any setwd calls? If your lab is posting their code to GitHub, you can do a search for setwd across your lab’s repositories.
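If your scripts are on your own computer, a search like that can be sketched in base R. The find_setwd function is a hypothetical helper I made up for this example. It assumes your scripts end in .R or .r, and it will also flag setwd calls that sit inside comments:

```r
# Hypothetical helper: report every line in the .R files under root
# that appears to contain a call to setwd
find_setwd <- function(root = ".") {
  scripts <- list.files(root, pattern = "\\.[Rr]$",
                        recursive = TRUE, full.names = TRUE)

  hits <- lapply(scripts, function(path) {
    lines <- readLines(path, warn = FALSE)
    idx <- grep("setwd[[:space:]]*\\(", lines, useBytes = TRUE)
    if (length(idx) == 0) {
      return(NULL)
    }
    data.frame(file = path, line = idx, code = trimws(lines[idx]),
               stringsAsFactors = FALSE)
  })

  # Returns NULL when no script contains setwd
  do.call(rbind, hits)
}
```

Running find_setwd() from a project root directory returns a data frame with one row per suspicious line, or NULL if the project is clean.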

Workshops

I'm pleased to be able to offer you one of three recent workshops! With each you'll get access to 18 hours of video content, my code, and other materials.

In case you missed it…

I’m in the process of building an R package to implement the Naive Bayesian Classifier that used to be found at the Ribosomal Database Project. This week we got started on the package.

Finally, if you would like to support the Riffomonas project financially, please consider becoming a patron through Patreon! There are multiple tiers and fun gifts for each. By no means do I expect people to become patrons, but if you need to be asked, there you go :)

I’ll talk to you more next week!

Pat

Riffomonas Professional Development
