Microsoft has been making daily copies of the entire CRAN website of R packages since 2014. This archive, named MRAN, allows installing older versions of packages, which is valuable for reproducibility. The 15,000+ R packages on CRAN are incessantly updated. For example, the package tidyverse depends on 109 packages; these packages have accumulated 63 updates just since 2022 (R Code). Package updates are sometimes backwards incompatible, breaking other packages and/or scripts that depend on them. To combat this reproducibility problem, both groundhog and checkpoint allow version-controlled package loading: they load the same version of each package every time a given R script is run. But there is a plot twist: this backbone of R's reproducibility is itself threatening reproducibility. MRAN is dying. Microsoft stopped making copies of CRAN in January 2023, and will shut down access to all of MRAN in July (announcement).
I created and currently maintain groundhog, so when I learned that MRAN was shutting down, I needed to either find a replacement or shut down groundhog too. I chose the former, and designed and created an MRAN replacement: GRAN.
GRAN: Groundhog's R Archive Neighbor
Groundhog makes version-control of packages in R as easy as it gets.
Instead of: library(pkg)
One does this: groundhog.library(pkg, date)
That is it.
Groundhog ensures that every time the script is run, the same package version is loaded: the one current on date.
That's all you need to change to make your code vastly more reproducible.
Groundhog works in two steps: (1) figure out which package version to get based on the date, and (2) if that version needs to be installed: find it, get it, and install it. Groundhog does (1) on its own, but it used to rely on MRAN for (2). Since v3.0.0, however, groundhog no longer relies on MRAN; it relies on GRAN instead.
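Step (1) can be illustrated with a toy lookup. This is only a sketch of the idea, not groundhog's actual code: the release table is made up, and `version_on_date()` is a hypothetical helper name. It assumes all we have is each version's publication date, and it picks the latest version published on or before the requested date.

```r
# Made-up release history for a hypothetical package; not real CRAN data
releases <- data.frame(
  version   = c("1.0.0", "1.1.0", "2.0.0"),
  published = as.Date(c("2019-03-01", "2019-11-15", "2020-06-30")),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the version that was current on `date`,
# i.e. the most recent one published on or before that date
version_on_date <- function(releases, date) {
  date <- as.Date(date)
  eligible <- releases[releases$published <= date, ]
  if (nrow(eligible) == 0) {
    stop("No version had been published by ", format(date))
  }
  eligible$version[which.max(as.numeric(eligible$published))]
}

version_on_date(releases, "2020-01-01")  # "1.1.0"
```

With this table, any date from 2019-11-15 through 2020-06-29 maps to version 1.1.0, which is why a script pinned to a date keeps getting the same version no matter when it is run.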
[Image courtesy of DALL·E 2]
Like MRAN, GRAN also has every package published on CRAN since 2014. But GRAN is based on a different approach to archiving. GRAN has a small subset of the gazillion files MRAN was archiving; the useful subset.
All of GRAN, from 2014 to last night, uses about 800 GB of storage.
This adds up to several dozen dollars a year in storage costs (dozens with a 'd') [1].
MRAN fills up that amount of storage space in about a week, instead of in 9 years [2].
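For the curious, the back-of-the-envelope arithmetic behind these figures can be sketched in R. The inputs come from this post (800 GB, about $6 a month per terabyte, one week versus 9 years); the pro-rated cost line is my own assumption about billing, added only for comparison, and the variable names are made up.

```r
price_per_tb_month <- 6    # dollars per terabyte per month (footnote 1)
gran_size_tb       <- 0.8  # GRAN is about 800 GB

# Paying for a full terabyte all year: the figure quoted in footnote 1
cost_full_tb <- price_per_tb_month * 12
cost_full_tb  # 72

# Assumption: if billing were pro-rated to the space actually used
cost_prorated <- gran_size_tb * price_per_tb_month * 12
round(cost_prorated, 2)  # 57.6

# MRAN needs about 1 week for what took GRAN about 9 years
weeks_in_9_years <- 9 * 52
weeks_in_9_years  # 468
```

Either way the cost is "dozens with a 'd'", and MRAN was accumulating data a few hundred times faster than GRAN does.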
MRAN was also slow for some reason. GRAN is not.
Installing the tidyverse and its 69 dependencies on my laptop, with groundhog.day = '2020-01-01', took 9 minutes with groundhog 2.2, which relies on MRAN, and less than 2 minutes with groundhog 3.0, which relies on GRAN [3].
Indeed, groundhog v3.0 is even faster than R's native install.packages() [4].
Further reading:
- R's reproducibility problem and how groundhog helps: Colada[95] and Colada[100]
- How groundhog works: http://groundhogr.com
- How GRAN works: http://groundhogr.com/gran
Get groundhog 3.0 from CRAN: install.packages('groundhog')
footnotes
- GRAN is hosted by Wasabi; it charges about $6 a month for a terabyte of data, so currently GRAN's storage costs are about $72 a year, six dozen dollars [↩]
- This is a guesstimate, obtained by ballparking how many current packages there are and assuming that MRAN copies all binaries for them for the current and next release of R, the last available binary of every package for all past R versions, and all source files… every single day! [↩]
- Earlier in this post I said tidyverse had 109 dependencies; that's the current version. Tidyverse 'only' had 69 dependencies back in 2020 [↩]
- Groundhog is faster than install.packages() for both binary and source packages. For binaries it is faster because it downloads packages in parallel rather than sequentially. For source packages, the slow-to-install ones, groundhog.library() defaults to parallel installation while install.packages() does not, so groundhog.library() is faster than the default, but setting the option 'Ncpus' in install.packages() can make it as fast as groundhog. The difference for binaries is only perceptible when installing many packages, say >60 [↩]
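A minimal sketch of the 'Ncpus' setting mentioned in the last footnote, for those who want parallel source installs with plain install.packages(). The choice of "all cores but one" and the package names in the commented-out call are my own assumptions for illustration; nothing is installed when this runs.

```r
# Number of cores on this machine; detectCores() can return NA, so fall back to 1
n_cores    <- parallel::detectCores()
n_parallel <- if (is.na(n_cores)) 1L else max(1L, n_cores - 1L)

# Set once per session: install.packages() uses getOption("Ncpus") as its default
options(Ncpus = n_parallel)

# Hypothetical package names; uncomment to install source packages in parallel
# install.packages(c("pkgA", "pkgB"), type = "source", Ncpus = n_parallel)
```

The setting only matters when more than one source package is being installed; a single package still installs on one process.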