The folderfun package makes it easy to manage files on disk for your R project. The name is short for folder functions, but you’ll soon discover that it’s fun as well.
In a basic R project, you’ll probably want to read in data and write out plots or results. By default, the reading, plotting, and writing functions will read or write files in your current working directory. That might be fine for small, simple projects, but it breaks down for many real-world use cases.
For example, what if you save your R project as a git repository (which is a good idea)? You don’t want to store large, compressed input files in the same folder, nor do you want to commit your plot outputs. Instead, you’ll want to store the data and results in other folders. Large projects can also require multiple folders for both input and output – for example, you may load some shared data resource that lives in a group folder as well as some of your own project-specific resources. What if you want to work on a project with multiple people? These distributed folders can be organized in different ways and reside on different file systems in different computing environments. It can become a nightmare to keep track of the locations of all the folders on disk where different data and results are stored. And if you start hard-coding paths inside your R script, you make your code less portable, because it will only run in that one computing environment. What if the data change locations? Your code breaks.
folderfun solves all these issues by making it dead simple to use wrapper folder functions to point to different data sources. Instead of pointing to input or output files with absolute file names, we define a function that remembers a root folder, and then use relative filenames with that function to identify individual files. Coupled with environment variables that define parent folder locations, you can easily maintain project-level subfolders with code that works across individuals and computing environments with almost no effort. This makes your code more portable and sharable and enables multiple users to work together on complex projects in different compute environments while sharing a single code base. Are you convinced yet?
Let’s say we have a project that needs to read data from one folder (let’s call it data) and write results to another folder (let’s call it results). Here’s how you might start this analysis naively:
    # Load our data:
    input1 = read.table("/long/and/annoying/hard/coded/path/data.txt")
    input2 = read.table("/long/and/annoying/hard/coded/path/data2.txt")
    output1 = processData(input1)
    output2 = processData2(input2)
    # Run other analysis...
    # Now write results (data first, then the file name):
    write.table(output1, "/different/long/annoying/hard/coded/path/result.txt")
    write.table(output2, "/different/long/annoying/hard/coded/path/result2.txt")
OK, that works… but it has problems. First, you repeat the paths, which makes them harder to change if the data move. Second, if you want to refer to these same locations in a different script, you’d have to repeat the paths yet again. Third, this script won’t work in a different compute environment, since file paths may differ.
We can solve the first problem by defining a path variable, and then using it in multiple places:
    inputDir = "/long/and/annoying/hard/coded/path"
    outputDir = "/different/long/annoying/hard/coded/path"
    input1 = read.table(file.path(inputDir, "data.txt"))
    input2 = read.table(file.path(inputDir, "data2.txt"))
    output1 = processData(input1)
    output2 = processData2(input2)
    # Run other analysis...
    write.table(output1, file.path(outputDir, "result.txt"))
    write.table(output2, file.path(outputDir, "result2.txt"))
That’s much nicer; it limits the hard-coded folders to a single variable per folder, making them easier to maintain. Plus, now someone else could re-use this script by just adjusting the variables at the top. But we still haven’t solved the problems of using these variables in another script or using this script in another environment. And besides, that file.path(...) syntax is really annoying! With folderfun we can do better.
With folderfun, we’ll use a function called setff to create functions, each of which will provide a path to a folder of interest. This is analogous to what we’re trying to do with inputDir and outputDir above; we just use a function call instead of a variable. We assign each folder function a name (In and Out in this example) and provide the location of the folder:
    library(folderfun)
    setff("In", "/long/and/annoying/hard/coded/path/")
    ## Created folder function ffIn(): /long/and/annoying/hard/coded/path/
    setff("Out", "/different/long/annoying/hard/coded/path/")
    ## Created folder function ffOut(): /different/long/annoying/hard/coded/path/
These calls have created new functions, named by prepending the text ff (for folder function) to the names we gave. These functions allow us to build paths to files inside those folders by simply passing a relative path (filename), like this:
    ffIn("data.txt")
    ## [1] "/long/and/annoying/hard/coded/path/data.txt"
    ffOut("result.txt")
    ## [1] "/different/long/annoying/hard/coded/path/result.txt"
So our original analysis would look something like this:
    input1 = read.table(ffIn("data.txt"))
    input2 = read.table(ffIn("data2.txt"))
    output1 = processData(input1)
    output2 = processData2(input2)
    # Run other analysis...
    write.table(output1, ffOut("result.txt"))
    write.table(output2, ffOut("result2.txt"))
So, to reiterate: setff("In", ...) creates a folder function called ffIn that will prepend the inputDir path to its argument, giving you easy access to files in the directory referenced in the setff call. You can have as many folder functions as you want, with whatever names you like. Creating a function with a name already in use will overwrite the older function with that name.
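One way to picture what a folder function does is as a closure that remembers its root folder. The following is a minimal base R sketch under that assumption, not folderfun’s actual implementation; makeFolderFunction and ffDemo are hypothetical names used only for illustration.

```r
# Hypothetical helper (not part of folderfun): returns a function that
# remembers `root` and joins any relative path onto it.
makeFolderFunction = function(root) {
  force(root)  # evaluate `root` now so the closure captures its value
  function(...) file.path(root, ...)
}

ffDemo = makeFolderFunction("/tmp/demo")
ffDemo()            # no argument: just the root folder
## [1] "/tmp/demo"
ffDemo("data.txt")  # relative filename appended to the root
## [1] "/tmp/demo/data.txt"
```

The root here has no trailing slash because file.path inserts the separator itself.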
So far, so good – the folderfun syntax is much nicer than what we had before. But we still haven’t solved the problem of referring to these same folders from multiple scripts, or sharing scripts across computing environments. What if there was a way to share folder functions across scripts and servers? This is where folderfun becomes very useful. By using environment variables (or R options), we eliminate the step of hard-coding anything in the R script.
For example, say we put this code into our .profile to define the locations for a particular server:
    export INDIR="/long/and/annoying/hard/coded/path/"
    export OUTDIR="/different/long/annoying/hard/coded/path/"
Or, from within R, we could set the same environment variables with Sys.setenv.
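As a short illustration of setting the variables from inside R, the sketch below uses base R’s Sys.setenv with the same two paths as the shell example. The values only last for the current R session.

```r
# Set both environment variables for the current R session only.
Sys.setenv(
  INDIR = "/long/and/annoying/hard/coded/path/",
  OUTDIR = "/different/long/annoying/hard/coded/path/"
)

# Read one back to confirm it is visible to R code.
Sys.getenv("INDIR")
## [1] "/long/and/annoying/hard/coded/path/"
```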
Or perhaps our locations are R specific, and so we store them as R options in our .Rprofile.
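For the R-specific route, a sketch of what such a startup file might contain is shown here, using base R’s options. The option names mirror the environment variable names, and the paths are the ones that appear in the option-based output further down.

```r
# Lines like these could live in a startup file such as ~/.Rprofile.
options(
  INDIR = "/long/and/annoying/path/to/hard/coded/file/",
  OUTDIR = "/different/long/annoying/hard/coded/path/"
)

# Any script in the session can now read the option by name.
getOption("INDIR")
## [1] "/long/and/annoying/path/to/hard/coded/file/"
```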
Either approach creates global variables that can be read by any R script. Furthermore, we could define variables with the same names on different systems. We have effectively outsourced the specification of the root directories to our .profile (or .Rprofile). Now, all we need to do is use the global variables to build our folder functions. We could do this like so:
    setff("In", Sys.getenv("INDIR"))
    ## Created folder function ffIn(): /long/and/annoying/hard/coded/path/
    setff("Out", Sys.getenv("OUTDIR"))
    ## Created folder function ffOut(): /different/long/annoying/hard/coded/path/
    ffIn()
    ## [1] "/long/and/annoying/hard/coded/path/"
    ffIn("data.txt")
    ## [1] "/long/and/annoying/hard/coded/path/data.txt"
    # Or, if the locations are stored as R options:
    setff("In", getOption("INDIR"))
    ## Created folder function ffIn(): /long/and/annoying/path/to/hard/coded/file/
    setff("Out", getOption("OUTDIR"))
    ## Created folder function ffOut(): /different/long/annoying/hard/coded/path/
    ffIn()
    ## [1] "/long/and/annoying/path/to/hard/coded/file/"
    ffIn("data.txt")
    ## [1] "/long/and/annoying/path/to/hard/coded/file/data.txt"
That code is now portable across scripts and servers because it uses the global folders. But it gets even easier: we’ve wrapped the Sys.getenv and getOption calls into setff, so you just need to pass the name of the global variable to the pathVar argument, as in setff("In", pathVar="INDIR"). When you pass the pathVar argument, setff will look first for an R option with that name, and then for an environment variable with that name. So, this has the same effect as above, but no longer requires specifying the path directly in any particular R script. That one line of code, then, is all you need in your script to get a universal folder function.
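The option-then-environment lookup described above can be sketched in base R as follows. This is only an illustration of the stated priority, not the package’s code; lookupPathVar and FF_SKETCH_DIR are hypothetical names.

```r
# Hypothetical helper (not folderfun's code): return the R option called
# `name` if it is set and nonempty, else the environment variable, else NULL.
lookupPathVar = function(name) {
  opt = getOption(name)
  if (!is.null(opt) && nzchar(opt)) return(opt)
  env = Sys.getenv(name, unset = "")
  if (nzchar(env)) return(env)
  NULL
}

Sys.setenv(FF_SKETCH_DIR = "/from/environment")
lookupPathVar("FF_SKETCH_DIR")   # only the environment variable exists
## [1] "/from/environment"

options(FF_SKETCH_DIR = "/from/option")
lookupPathVar("FF_SKETCH_DIR")   # the option now takes priority
## [1] "/from/option"
```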
But wait, there’s more! Now, here’s the ultimate syntactic sugar to make it dead simple to create portable folder functions. If your folder function name matches the name of the pathVar, then you don’t even need to provide the pathVar. For example, say we wanted to name our folder function ffIndir instead of just ffIn. In that case, you’d get the same result with setff("Indir").
The name provided exactly determines the function name (ffIndir), and it also specifies a priority of places to search for a pathVar variable: it favors R options over environment variables, and it first looks for a name exactly as given, then tries an all-caps and then an all-lowercase version of the name until a nonempty value (neither NULL nor "") is found. If no match is found, the setff call will result in an error.
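To make the name matching concrete, here is a rough base R sketch of such a search. The exact order in which setff interleaves spellings with options and environment variables is an assumption here, and resolveByName and FFSKETCHIN are hypothetical names.

```r
# Hypothetical sketch; the real search order inside setff may differ.
resolveByName = function(name) {
  # Spellings to try: as given, then all caps, then all lowercase.
  candidates = unique(c(name, toupper(name), tolower(name)))
  for (candidate in candidates) {
    # Assumption: for each spelling, try the R option before the env var.
    found = list(getOption(candidate), Sys.getenv(candidate, unset = ""))
    for (value in found) {
      if (!is.null(value) && nzchar(value)) return(value)
    }
  }
  stop("No option or environment variable found for: ", name)
}

Sys.setenv(FFSKETCHIN = "/shared/input")
resolveByName("Ffsketchin")   # matched through the all-caps spelling
## [1] "/shared/input"
```

A name with no matching option or environment variable raises an error, mirroring the behavior described for setff.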
So far we’ve addressed how to create universal folder functions. We’ve solved the main problems with the traditional approach. Using folder functions combined with R options or environment variables allows us to: 1) avoid repeating paths either within a script or across scripts, because they are stored globally; 2) let the exact same script work in two different computing environments. We can do all of this with a simple, easy-to-understand call to setff, followed by wrapping all our references to disk resources with the appropriate folder function.
But let’s go one step further: what if we want more than just a set of global folders? What if we also want to specify project-specific folders? We might want input and output subfolders that reside in our parent INDIR and OUTDIR folders, but give us a separate space for each project. This is possible with another setff argument: postpend. The postpend argument allows you to append additional text (e.g. subfolders) to the folder function. For example, here’s some code that will give you a subfolder named by projectName at the location specified by your $DATA environment variable:
    projectName = "myproject"
    setff("Data", pathVar="DATA", postpend=projectName)
    ## Created folder function ffData(): /long/and/annoying/path/to/hard/coded/file/myproject
Remember, you could also take advantage of folderfun’s smart matching in this case by leaving off the pathVar argument:
    projectName = "myproject"
    setff("Data", postpend=projectName)
    ## Created folder function ffData(): /long/and/annoying/path/to/hard/coded/file/myproject
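The effect of postpend can be pictured as joining the parent folder and the project subfolder once, up front. The base R sketch below illustrates that idea only; makeProjectFolderFunction, ffSketchData, and the /shared/data path are hypothetical and not part of folderfun.

```r
# Hypothetical helper (not folderfun's API): join a parent folder and an
# optional project subfolder once, then reuse the result for every file.
makeProjectFolderFunction = function(parent, postpend = NULL) {
  root = if (is.null(postpend)) parent else file.path(parent, postpend)
  function(...) file.path(root, ...)
}

projectName = "myproject"
ffSketchData = makeProjectFolderFunction("/shared/data", postpend = projectName)
ffSketchData("counts.txt")
## [1] "/shared/data/myproject/counts.txt"
```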
There you have it! A single line gives you portable, project-specific input and output folder functions, making it easier for you to manage your data and results.
You can get a list of all your loaded folder functions with the listff function:
    listff()
    ##         funcNames pdirOptVals
    ## FF_Data "ffData"  "/long/and/annoying/path/to/hard/coded/file/myproject"
    ## FF_In   "ffIn"    "/long/and/annoying/path/to/hard/coded/file/"
    ## FF_Out  "ffOut"   "/different/long/annoying/hard/coded/path/"
Now let’s see how this fits into a real-world system. In our lab, we have set aside a few locations on our primary server where we store both raw and processed data, and we store the folder locations in shell environment variables called $RAW and $PROCESSED. We also have a few other variables that point to shared resources, like $GENOMES. Our server uses an environment modules system, so we have set up a lab environment module that populates these variables. If we ever need to move anything to a new file system, it’s as simple as updating the environment module, and all lab members’ pointers will automatically point to the new folder.
We use folderfun to access these folders in R. By convention, we assign a subfolder for each project in each of the RAW and PROCESSED folders. Then, we simply need to have this code in each script:
    projectName = "myproject"
    setff("Raw", postpend=projectName)
    setff("Processed", postpend=projectName)
Because every project is set up the same way, we’ve wrapped this capability into another function called projectInit, so we need only put projectInit(projectName) at the beginning of each script, and it will have access to the folder functions it needs. The beautiful thing about this approach is that these scripts are now automatically functional in any computing environment and are robust to data moves, as long as the environment variables are kept up-to-date.
setff attempts to find a path value from either an R option or a shell environment variable. To do so, it uses a function in this package called folderfun::optOrEnvVar. This prioritized name resolution function may be useful in other contexts, so it’s independently available:
    name = "DUMMYTESTVAR"
    value = "test_value"
    optOrEnvVar(name)   # NULL
    # Sys.setenv needs named arguments, so build the call from the variables:
    do.call(Sys.setenv, setNames(list(value), name))
    optOrEnvVar(name)   # Now resolves
    Sys.unsetenv(name)
    optOrEnvVar(name)   # NULL
    optArg = list(value)
    names(optArg) = name
    options(optArg)
    optOrEnvVar(name)   # Now resolves
    do.call(Sys.setenv, setNames(list("new?"), name))
    optOrEnvVar(name)   # On name collision, the option trumps the environment variable.