Data Colada

[108] MRAN is Dead, long live GRAN


Posted on April 28, 2023 by Uri Simonsohn

Microsoft has been making daily copies of the entire CRAN website of R packages since 2014. This archive, named MRAN, allows installing older versions of packages, which is valuable for reproducibility purposes. The 15,000+ R packages on CRAN are incessantly updated. For example, the package tidyverse depends on 109 packages; these packages have accumulated 63 updates just since 2022 (R Code). Package updates are sometimes backwards incompatible, breaking other packages and/or scripts that depend on them. To combat this reproducibility problem, both groundhog and checkpoint allow version-controlled package loading: they enable loading the same version of each package every time a given R script is run.

But there is a plot twist: this backbone of R's reproducibility is itself threatening reproducibility. MRAN is dying. Microsoft stopped making copies of CRAN in January 2023, and will shut down access to all of MRAN in July (announcement).

I created and currently maintain groundhog, so when I learned that MRAN was shutting down, I needed to either find a replacement or shut down groundhog too. I chose the former, and designed and created an MRAN replacement: GRAN.

GRAN: Groundhog's R Archive Neighbor
Groundhog makes version-control of packages in R as easy as it gets. 

Instead of:          library(pkg)
One does this:   groundhog.library(pkg, date)
That is it.

Groundhog will ensure that every time the script is run, the same package version is loaded: the one current on date.
That's all you need to change to make your code vastly more reproducible.

Groundhog works in two steps: (1) figure out which package version to get based on the date, and (2) if that version needs to be installed: find it, get it, and install it. Groundhog does (1) on its own, but it used to rely on MRAN for (2). Since v3.0.0, however, groundhog no longer relies on MRAN; instead, it relies on GRAN.
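To make step (1) concrete, here is a minimal sketch in base R of how a date can be mapped to a package version. It is not groundhog's actual code: the release table and the function version_on_date() are hypothetical, made up for illustration, and the versions and dates are invented.

```r
# Hypothetical release history for one package (invented versions and dates).
releases <- data.frame(
  version   = c("1.0.0", "1.1.0", "2.0.0"),
  published = as.Date(c("2019-03-10", "2019-11-02", "2020-06-15")),
  stringsAsFactors = FALSE
)

# Return the version that was current on `date`: the most recent release
# published on or before that date. Stops if nothing had been published yet.
version_on_date <- function(releases, date) {
  date <- as.Date(date)
  eligible <- releases[releases$published <= date, ]
  if (nrow(eligible) == 0) {
    stop("no version had been published by ", format(date))
  }
  eligible$version[which.max(as.numeric(eligible$published))]
}

version_on_date(releases, "2020-01-01")  # "1.1.0"
```

With this toy table, any date from 2019-11-02 through 2020-06-14 resolves to the same version, which is why a script pinned to a date keeps loading the same package version no matter when it is run.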

[Image courtesy of DALL-E 2]

Like MRAN, GRAN also has every package published on CRAN since 2014. But GRAN is based on a different approach to archiving. GRAN has a small subset of the gazillion files MRAN was archiving; the useful subset.

All of GRAN, from 2014 to last night, uses about 800 GB of storage.
This adds up to several dozen dollars a year in storage costs (dozens with a 'd') [1].
MRAN fills up that amount of storage space in about a week, instead of in 9 years [2].
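The figures above and in footnote [1] can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the variable names are mine, and billing for one full terabyte is an assumption I make to match the $72 in the footnote, not something the post states.

```r
# Figures from the post and its footnote [1].
price_per_tb_month <- 6   # dollars per terabyte per month (approximate)
billed_tb          <- 1   # ASSUMPTION: ~800 GB billed as one full terabyte

yearly_cost <- price_per_tb_month * billed_tb * 12
yearly_cost               # 72 dollars a year, i.e. six dozen

# GRAN needed about 9 years to reach a size MRAN reached in about a week.
weeks_in_9_years <- 9 * 52
weeks_in_9_years          # 468
```

Taken at face value, the guesstimate in footnote [2] implies MRAN accumulated data several hundred times faster than GRAN does.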


The weeds on the smaller footprint of GRAN vs MRAN
MRAN used to save the entirety of CRAN every day, sometimes saving hundreds of identical copies of files that had not changed over time. Moreover, most of these files do not need even a single archived copy, because CRAN already keeps one. Specifically, CRAN already archives source files, those slow-to-install (but usable on any operating system) versions of packages. Only binaries, the versions of packages that install quickly on Mac and Windows computers, are deleted from CRAN when new package versions are released. So only binaries need to be archived by GRAN. And even for binaries we don't need daily copies; we largely need just one copy of each. So GRAN has a copy of all binaries for all CRAN package versions dating back to 2014, but nothing else. GRAN is updated daily: all package binaries posted to CRAN today will be permanently archived by GRAN this evening.

There are exceptions, and some binaries are saved multiple times for justified technical reasons. To get further into the weeds, check out: https://groundhogr.com/gran
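The core idea of this section, keeping one copy per package version rather than one copy per day, can be sketched in base R. This is a toy illustration under my own assumptions, not how GRAN is actually built: the snapshot table, package names, and versions are all invented, and it ignores the exceptions mentioned above.

```r
# Hypothetical daily snapshots: each row is one binary seen on one day.
snapshots <- data.frame(
  day     = as.Date(c("2023-01-01", "2023-01-01",
                      "2023-01-02", "2023-01-02",
                      "2023-01-03", "2023-01-03")),
  package = c("pkgA", "pkgB", "pkgA", "pkgB", "pkgA", "pkgB"),
  version = c("1.0",  "0.3",  "1.0",  "0.4",  "1.0",  "0.4"),
  stringsAsFactors = FALSE
)

# Daily-copy archiving keeps every row.
# One-copy archiving keeps only the first appearance of each package-version.
first_seen <- snapshots[!duplicated(snapshots[, c("package", "version")]), ]

nrow(snapshots)   # 6 files stored with daily copies
nrow(first_seen)  # 3 files stored with one copy per version
```

In this toy example the saving is only twofold, but it grows with every additional day on which nothing changes.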

MRAN was also slow for some reason. GRAN is not.
Installing the tidyverse and its 69 dependencies on my laptop, with groundhog.day = '2020-01-01', took 9 minutes with groundhog 2.2, which relies on MRAN, and less than 2 minutes with groundhog 3.0, which relies on GRAN [3].

Indeed, groundhog v3.0 is faster even than R's native install.packages() [4].
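Footnote [4] mentions that the 'Ncpus' option can speed up install.packages() for source packages. A minimal sketch of setting it is below. The choice of "all cores but one" is my own assumption rather than a recommendation from the post, and the install call is left commented out because it needs a network connection.

```r
# Use all but one core; fall back to 1 if the core count cannot be detected.
n_cores <- max(1L, parallel::detectCores() - 1L, na.rm = TRUE)

# install.packages() takes its default for the `Ncpus` argument from this option.
options(Ncpus = n_cores)

# A later install of several source packages would then run in parallel, e.g.:
# install.packages("tidyverse", type = "source")
```

This only affects source installs of more than one package; as the footnote explains, groundhog's speed advantage for binaries comes from parallel downloading instead.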

Further reading:

  1. R's reproducibility problem and how groundhog helps: Colada[95]  and Colada[100]
  2. How groundhog works: http://groundhogr.com
  3. How GRAN works: http://groundhogr.com/gran

Get groundhog 3.0 from CRAN: install.packages('groundhog')


footnotes

  1. GRAN is hosted by Wasabi, which charges about $6 a month for a terabyte of data, so GRAN's storage costs are currently about $72 a year: six dozen dollars.
  2. This is a guesstimate, obtained by ballparking how many current packages there are and assuming that MRAN copies all binaries for them for the current and next release of R, that it copies the last available binary for all packages for all past R versions, and that it copies all source files… every single day!
  3. Earlier in this post I said tidyverse had 109 dependencies; that's the current version. Tidyverse 'only' had 69 dependencies back in 2020.
  4. Groundhog is faster than install.packages() for both binaries and source packages. For binaries, it is faster because it downloads binary packages in parallel rather than sequentially. For source packages, those slow-to-install ones, groundhog.library() defaults to parallel installation while install.packages() does not, so groundhog.library() is faster than the default, but setting the option 'Ncpus' in install.packages() can make it as fast as groundhog. The difference for binaries is only perceptible when installing many packages, say >60.

© 2021, Uri Simonsohn, Leif Nelson, and Joseph Simmons. For permission to reprint individual blog posts on DataColada please contact us via email.