Many Models in R

Presenter(s): Liam Wright


This resource shares an approach for running many models in the programming language R using the tidyverse collection of packages.

Over the past thirty years, the processing power of the average desktop computer has dramatically increased. This astonishing growth has opened up an array of new possibilities in quantitative research. One such possibility is the ability to quickly run many regression models: this includes repeating a model across different datasets or at different ages (as in cross-context or life course research), examining associations between an exposure and multiple different outcome variables or vice versa (as in the outcome-wide and exposure-wide epidemiological designs), or examining the robustness of results across a large set of plausible model specifications (as in specification curve analysis, multiverse analysis, and pattern mixture modelling).

Yet learning to write the code to run many models is not a standard component of statistical training. Writing out each model separately can quickly become infeasible. Such code is, moreover, difficult to read (making errors harder to spot) and difficult to amend when requirements change (Wickham and Grolemund, 2017). Using for loops may be little better: as the number of parameters to be looped over grows, nested for loops become onerous to write and tricky to read, and can require multiple if conditions to skip unnecessary executions. Here I will introduce an alternative approach, using the tidyverse packages within the programming language R, that is much more efficient, clear and adaptable.

A Thriller in Three Parts

The approach comprises three main steps. First, we enumerate the specifications of the model within a data.frame. I’ll call this data.frame the specification grid. (For readers who are new to R, a data.frame is a rectangular data structure with columns [variables] and rows [observations], similar to a spreadsheet in Excel or a dataset in Stata.) Each row of the specification grid will contain the information to define a model that we want to run, with the information stored in the columns. For instance, one column may contain the name of the outcome variable and another column may contain information on the observations we want to subset to (e.g., specifying that we run the regression in females only).
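As a rough illustration of what a specification grid can look like, the sketch below builds one with expand_grid() from the tidyr package. The column names and values (bmi_age_25, maths_score and so on) are hypothetical and are not taken from the videos or the worksheet.

```r
library(tidyr)

# Hypothetical variable names, chosen only for illustration
spec_grid <- expand_grid(
  outcome  = c("bmi_age_25", "bmi_age_45"),
  exposure = c("maths_score", "verbal_score"),
  sex      = c("all", "female", "male")
)

# One row per model: 2 outcomes x 2 exposures x 3 subsets = 12 rows
nrow(spec_grid)
spec_grid
```

expand_grid() returns every combination of the supplied values, so adding a new dimension to the analysis only requires adding one more argument.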

Second, we create a function written to be general enough that it can take any row from the specification grid and perform the steps required to run the particular model it defines. This may include obtaining the required data (e.g., subsetting the data to females only), creating new variables, running the regression, and extracting pertinent results (e.g., coefficients of interest).
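A minimal sketch of such a general function is given below, assuming a small simulated dataset. The function name run_model(), its arguments and the variable names are all hypothetical; the function in the videos may be structured differently.

```r
library(dplyr)
library(broom)

set.seed(1)
# Simulated toy data; every variable name here is hypothetical
df <- tibble(
  sex         = sample(c("female", "male"), 200, replace = TRUE),
  maths_score = rnorm(200),
  bmi_age_25  = 25 + 0.5 * maths_score + rnorm(200)
)

run_model <- function(outcome, exposure, sex_group, data = df) {
  # Subset to the requested group ("all" keeps every observation)
  if (sex_group != "all") {
    data <- filter(data, sex == sex_group)
  }
  # Build the formula from the variable names held in the specification
  model_formula <- reformulate(exposure, response = outcome)
  fit <- lm(model_formula, data = data)
  # Return only the coefficient of interest, with a confidence interval
  tidy(fit, conf.int = TRUE) %>%
    filter(term == exposure)
}

run_model("bmi_age_25", "maths_score", "female")
```

Because the outcome, exposure and subset are all passed in as arguments, the same function can serve every row of a specification grid.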

Third, we use a tidyverse function, map() (from the purrr package), to run each of the models by passing the rows of the specification grid to the general function in turn. The output of this is another data.frame containing the results of the regression models.
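The third step might be sketched as follows, again on simulated data. Here map() is given a row identifier for each specification, and unnest() from tidyr flattens the list of results into a single data.frame. The names spec_id, run_model() and the toy variables are assumptions made for this example only.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

set.seed(1)
# Simulated toy data and a two-row specification grid (hypothetical names)
df <- tibble(x1 = rnorm(100), x2 = rnorm(100), y = x1 + rnorm(100))

spec_grid <- expand_grid(outcome = "y", exposure = c("x1", "x2")) %>%
  mutate(spec_id = row_number())

# General function: look up one row of the grid and run the model it defines
run_model <- function(id) {
  spec <- spec_grid[id, ]
  fit <- lm(reformulate(spec$exposure, response = spec$outcome), data = df)
  tidy(fit) %>%
    filter(term == spec$exposure)
}

# Run every specification in turn and combine the results with the grid
results <- spec_grid %>%
  mutate(res = map(spec_id, run_model)) %>%
  unnest(res)

results
```

Keeping the results attached to the grid means each estimate sits alongside the specification that produced it, which makes later plotting and filtering straightforward.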

In what follows, we’ll see videos on each of these steps applied to an example using simulated data from four pretend cohort studies. We use this data to examine the robustness of the association between childhood cognitive ability and adult body mass index (BMI), with BMI measured at different ages, cognitive ability measured in multiple ways, and the set of control variables varying between regressions, too (96 regressions in total).

> Download html worksheet - the worksheet provides (edited) code from the videos. I recommend reading this worksheet alongside the videos.


Step 1: Making the Specification Grid



Step 2: Writing a General Model Function



Step 3: Running the Models



Coda

The three videos have demonstrated a relatively simple process for running many models in R. This process is general purpose enough to be applied to many different types of analysis, including producing descriptive statistics. There is also an R package, specr, which can achieve similar aims (Masur and Scharkow, 2020). specr offers excellent out-of-the-box functionality for running many regression models and includes helpful functions for plotting and parsing results. However, in complex situations involving multiple pre- or post-processing steps (e.g., creating new variables or pooling imputed results), following the process outlined in the videos may be necessary.

When performing your own analyses, there are two other factors that are important to consider. First, if the number of models to be run is large, strategies to reduce computation time may be necessary. This can include running only a random sample of models, using the future_map() function from the package furrr for Step 3 (future_map() is similar to map() but runs in parallel, not serial), or running the models in the cloud (Aslett, 2020).
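Two of these strategies can be sketched together: drawing a random sample of specifications with slice_sample() from dplyr, and running them in parallel with future_map() from furrr. This is an illustrative sketch on simulated data; the function run_model(), the number of workers and the variable names are assumptions. The data and grid are passed to the function explicitly so that each parallel worker receives what it needs.

```r
library(dplyr)
library(tidyr)
library(future)
library(furrr)

set.seed(1)
# Simulated toy data and specification grid (hypothetical names)
df <- tibble(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100),
             y = x1 + rnorm(100))
spec_grid <- expand_grid(outcome = "y", exposure = c("x1", "x2", "x3"))

# Strategy 1: run only a random sample of the specifications
spec_sample <- slice_sample(spec_grid, n = 2)

# General function; data and grid are arguments rather than globals
run_model <- function(id, grid, data) {
  spec <- grid[id, ]
  fit <- lm(reformulate(spec$exposure, response = spec$outcome), data = data)
  out <- broom::tidy(fit)
  out[out$term == spec$exposure, ]
}

# Strategy 2: run the models in parallel across two background R sessions
plan(multisession, workers = 2)
results <- future_map(seq_len(nrow(spec_sample)), run_model,
                      grid = spec_sample, data = df) %>%
  bind_rows()
plan(sequential)  # return to ordinary serial execution

results
```

For a handful of quick models the overhead of starting parallel workers can outweigh the gain, so parallel execution tends to pay off mainly when individual models are slow or the grid is large.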

Second, thought may need to be given to using your computer’s memory efficiently. Returning a full lm() model object for each model may quickly use up all the available RAM - this is why we used tidy() to return only the model coefficients. Another way that memory can be used up is by doing all the data pre-processing in advance - for instance, creating all the variables you’ll need outside the general function. In terms of memory, it may be more efficient to create variables on-the-fly, rather than make the dataset excessively large (see Wright, 2023, for an example).
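The difference in memory use can be checked directly with object.size(). The sketch below uses simulated data with hypothetical variable names; exact sizes will vary with the data and R version, so none are quoted here.

```r
library(broom)

set.seed(1)
# Simulated data: 10,000 observations of one predictor and one outcome
df <- data.frame(x = rnorm(10000))
df$y <- df$x + rnorm(10000)

fit   <- lm(y ~ x, data = df)  # full model object
coefs <- tidy(fit)             # coefficients table only

# The lm object stores residuals, fitted values and the model frame,
# so it grows with the number of observations; the tidy table does not
print(object.size(fit), units = "Kb")
print(object.size(coefs), units = "Kb")
```

Multiplied across hundreds or thousands of models, returning only the tidy table can keep a many-models analysis within the available RAM.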




About the author

Dr. Liam Wright is a lecturer in Statistics and Survey Methodology at the Social Research Institute, University College London. Previously he was a Research Fellow on the COVID-19 Social Study at UCL, where he researched compliance with public health measures (such as mask wearing). Dr. Wright was also an NIHR Research Methods Fellow at the University of Sheffield, a Digital Analyst at the BBC, and an Economist at Monitor (now NHS Improvement).
