Many Models in R: NCRM Worksheet

Author

Liam Wright

Published

2024

Introduction

This worksheet provides (edited) code from the videos on running many models in R. I recommend reading this worksheet alongside the videos.

The example used in the video uses simulated data from four cohort studies to examine cohort level changes in the association between childhood cognitive ability and adulthood body mass index (BMI). BMI is measured in each cohort at different ages, cognitive ability is measured in multiple ways, and the set of control variables to be used varies between regressions, too. We want to run a regression for each combination of these (96 regressions in total).

Step 1: Making the Specification Grid

First, we load the tidyverse package. Uncomment the first line if you need to install it.

# install.packages("tidyverse")
library(tidyverse)

The tidyverse is a package of packages. library(tidyverse) loads a set of “core” packages. We use several of these below. We also use some non-core packages which we’ll need to load explicitly.

Next, we load the dataset, which is in .Rds format, directly from the internet, assigning it to the object df.

df <- url("https://osf.io/download/ep86k/") %>% readRDS()

str(df)
tibble [80,000 × 11] (S3: tbl_df/tbl/data.frame)
 $ cohort       : Factor w/ 4 levels "Baby Boomers",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ id           : int [1:80000] 1 1 1 1 1 1 2 2 2 2 ...
 $ fup          : int [1:80000] 15 25 35 45 55 65 15 25 35 45 ...
 $ age          : num [1:80000] 15.4 25.3 35.5 45.5 55.6 ...
 $ bmi          : num [1:80000] 24.3 25.8 27.3 28.7 30.2 ...
 $ cog_nonverbal: num [1:80000] -0.282 -0.282 -0.282 -0.282 -0.282 ...
 $ cog_verbal   : num [1:80000] -0.341 -0.341 -0.341 -0.341 -0.341 ...
 $ cog_vocab    : num [1:80000] 0.983 0.983 0.983 0.983 0.983 ...
 $ female       : num [1:80000] 0 0 0 0 0 0 0 0 0 0 ...
 $ family_class : Factor w/ 2 levels "Manual","Non-Manual": 1 1 1 1 1 1 2 2 2 2 ...
 $ child_bmi    : num [1:80000] 18.1 18.1 18.1 18.1 18.1 ...
df %>% sample_n(20) # Print a random sample of 20 rows
# A tibble: 20 × 11
   cohort         id   fup   age   bmi cog_nonverbal cog_verbal cog_vocab female
   <fct>       <int> <int> <dbl> <dbl>         <dbl>      <dbl>     <dbl>  <dbl>
 1 Baby Boome…  2056    35  35.6  22.5       -0.137      -0.301   -0.313       0
 2 Gen X        4541    55  55.3  31.9       -0.317       0.310    0.530       1
 3 Millennials  2231    15  15.2  18.1        1.20        2.20     1.84        1
 4 Gen Z        1736    15  15.5  23.7       -0.706      -0.133   -1.30        0
 5 Gen X        2830    25  25.0  21.1       -1.35       -0.965   -1.60        0
 6 Millennials  4575    15  15.0  21.0        0.0601     -0.258   -0.671       0
 7 Baby Boome…  1433    25  25.4  24.0       -0.475      -0.552   -0.244       1
 8 Gen X        4350    25  25.7  18.8        0.225       0.411   -0.725       1
 9 Gen Z        1735    25  25.3  24.7       -0.0853     -0.897    0.817       0
10 Baby Boome…  2191    45  45.5  27.6       -0.225      -0.949    0.159       1
11 Baby Boome…   801    35  35.7  21.7       -0.850      -0.179   -1.12        1
12 Gen X        2614    25  25.6  17.2        0.231       0.433    1.46        0
13 Gen X        4361    25  25.2  26.5       -0.683      -0.725    0.0583      0
14 Baby Boome…  2033    15  15.8  20.6       -0.622      -0.885    0.0510      1
15 Baby Boome…  2205    25  25.6  16.7       -1.75       -0.660   -0.960       0
16 Gen X        4267    45  45.3  33.0       -0.590       0.106    0.266       0
17 Baby Boome…  4675    55  55.5  30.1        0.198      -0.246    0.0690      0
18 Gen Z        3336    15  15.8  25.4        0.912       0.186   -0.605       0
19 Baby Boome…  2143    45  45.8  22.8       -0.0987     -0.366    0.535       0
20 Gen Z        3684    15  15.5  16.1       -0.340       0.121    0.0760      1
# ℹ 2 more variables: family_class <fct>, child_bmi <dbl>

The dataset is a tibble - a type of data.frame used extensively in the tidyverse. The dataset is in long format with an observation for each cohort, participant and follow-up (cohort x id x fup) combination. The variables bmi and age are time-varying, denoting information on body mass index (BMI) and age of BMI measurement, respectively. The other variables are time-invariant and comprise three measures of childhood cognitive ability (cog_*) and variables for sex (female), family social class (family_class) and childhood BMI (child_bmi). Sex, social class, age, and childhood BMI will be used as control variables in the regressions.

Our aim is to perform a regression for each combination of:

  1. Cohort
  2. Follow-up at which BMI was measured
  3. Measure of cognitive ability
  4. Set of control variable used (including or not including child_bmi as a control variable).

We have included multiple cognitive ability variables as these are measured slightly differently in each cohort, so we want to check how robust results are given this. Note, using a consistent naming scheme for similar variables (e.g., prepending with cog_) simplifies grouping and selecting these variables, as we’ll see shortly. Regressions will be repeated controlling and not controlling for childhood BMI to examine the possibility of reverse causality.

We’ll be putting these specifications into a data.frame, which I’ll term the specification grid. Before creating the specification grid, we’ll create two objects, one which contains the names of the cognitive ability variables and another which specifies the model covariates to be added to the particular regression.

cog_vars <- str_subset( 
  # Returns the names of the variables in df which begin "cog_"
  names(df),
  "^cog_"
)
cog_vars
[1] "cog_nonverbal" "cog_verbal"    "cog_vocab"    
mod_covars <- list( 
  # Creates a list naming the models and...
  # ...giving the variables to be added as controls
  basic = "age + female + family_class",
  child_bmi = "age + female + family_class + child_bmi"
)
mod_covars
$basic
[1] "age + female + family_class"

$child_bmi
[1] "age + female + family_class + child_bmi"
mod_covars[["basic"]] # Code to extract covariates for "basic" model
[1] "age + female + family_class"

Next, we want to get the unique combinations of cohort and follow-up. As each cohort was tracked at different ages (e.g., the Baby Boomers were followed to age 65, and Gen Z to age 25), we need to make sure the combinations of cohort and fup actually appear in the data. We can do this using distinct().

df %>%
  distinct(cohort, fup)
# A tibble: 16 × 2
   cohort         fup
   <fct>        <int>
 1 Baby Boomers    15
 2 Baby Boomers    25
 3 Baby Boomers    35
 4 Baby Boomers    45
 5 Baby Boomers    55
 6 Baby Boomers    65
 7 Gen X           15
 8 Gen X           25
 9 Gen X           35
10 Gen X           45
11 Gen X           55
12 Millennials     15
13 Millennials     25
14 Millennials     35
15 Gen Z           15
16 Gen Z           25

Below, we use distinct() and follow this with expand_grid(). expand_grid() creates a data.frame with all the unique combinations of cohort, fup and cog_vars and the models in mod_covars. We also add a further column, spec_id, which records the row. spec_id will be passed to the general function to be used to identify a particular model specification; it is simpler passing this number to a general function, rather than all the other columns in the specification grid as (a) it is just one value and (b) it reduces the number of changes we need to make if we add more complexity to our specification grid later (i.e., by adding further columns).

mod_specs <- df %>%
  distinct(cohort, fup) %>%
  expand_grid( # Enumerates all combinations of cohort, fup, cog_vars and models in mod_covars.
    cog_var = cog_vars,
    covars = names(mod_covars)
  ) %>%
  mutate(
    spec_id = row_number(), # Adds a new variable, spec_id, which records the row
    .before = 1 # Specifies that spec_id should be the first column of the data.frame
  )

mod_specs
# A tibble: 96 × 5
   spec_id cohort         fup cog_var       covars   
     <int> <fct>        <int> <chr>         <chr>    
 1       1 Baby Boomers    15 cog_nonverbal basic    
 2       2 Baby Boomers    15 cog_nonverbal child_bmi
 3       3 Baby Boomers    15 cog_verbal    basic    
 4       4 Baby Boomers    15 cog_verbal    child_bmi
 5       5 Baby Boomers    15 cog_vocab     basic    
 6       6 Baby Boomers    15 cog_vocab     child_bmi
 7       7 Baby Boomers    25 cog_nonverbal basic    
 8       8 Baby Boomers    25 cog_nonverbal child_bmi
 9       9 Baby Boomers    25 cog_verbal    basic    
10      10 Baby Boomers    25 cog_verbal    child_bmi
# ℹ 86 more rows

And that’s the Step 1 done! In a few lines of code, we have created a grid of 96 model specifications.

Step 2: Writing a General Model Function

Next we need to write a function that takes a spec_id and runs the model defined in that particular row of the specification grid. In the video, I recommend initially taking a specific spec_id and writing out the code required to run that model, writing the code in as general a way as possible, and ensuring it works, before saving that code as a function. Here I will write the code as a function directly.

# Load packages used by the function. 
# These are installed with the tidyverse but need to be loaded explicitly
library(broom)
library(glue)

get_lm <- function(spec_id){
  spec <- mod_specs %>%
    filter(  
      # Keeps the specification where the column spec_id...
      # ...is equal to spec_id inputted to get_lm()
      spec_id == !!spec_id
      )
  
  df_mod <- df %>% # Keeps data for the specifc cohort...
    # ...and follow-up defined in the spec
    filter(cohort == !!spec$cohort,
           fup == !!spec$fup)
  
  mod_formula <- glue(
    # Creates a model formula including the correct...
    # ...cog_var and model covariates
    "bmi ~ {spec$cog_var} + {mod_covars[[spec$covars]]}"
    ) %>% 
    # Converts from a string to a formula object...
    # ...so it works with lm()
    as.formula()
  
  mod <- lm(mod_formula, data = df_mod) # Runs the model
  
  ret <- tidy(mod, conf.int = TRUE) %>% # Extracts the model coefficients
    filter(term == !!spec$cog_var) %>%
    select(estimate, std.error, p.value, conf.low, conf.high)
  
  return(ret)
}

The results of the 1st and 15th specifications are:

get_lm(1) # Run for the first specification
# A tibble: 1 × 5
  estimate std.error   p.value conf.low conf.high
     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
1   -0.169    0.0422 0.0000629   -0.252   -0.0864
get_lm(15) # Run for the fifteenth specification
# A tibble: 1 × 5
  estimate std.error  p.value conf.low conf.high
     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1   -0.378    0.0422 5.09e-19   -0.461    -0.295

And that’s Step 2 completed!

Step 3: Running the Models

Next, we need to run the models. It does not make sense to write get_lm() out 96 times. Instead, we can use the map() function from the purrr() package (part of the tidyverse). map() is an iterator - it takes an object and applies a function to each element of it, returning the results as a list. For instance, below we take the vector 2 to 5 and add 16 to each element in turn (i.e., obtaining 2+16=18,3+16=19, …, 5+16=21). In this code, we use what is known as lambda notation: .x is a placeholder for the current value of the object being iterated over (in the first instance, 2; in the second instance, 3; and so on).

map(2:5, ~ .x + 16)
[[1]]
[1] 18

[[2]]
[1] 19

[[3]]
[1] 20

[[4]]
[1] 21

We can pass the spec_id vector from our specification grid to map(). Importantly, we can do this within a mutate() function call so the results appear as a new column in the mod_specs data.frame. Note, as map() returns a list, the new column is a list-column - in this case, a list of tibbles (remember, a tibble is a type of data.frame used in the tidyverse).

mod_specs %>%
  mutate(res = map(spec_id, ~ get_lm(.x))) # Add a new column "res"
# A tibble: 96 × 6
   spec_id cohort         fup cog_var       covars    res             
     <int> <fct>        <int> <chr>         <chr>     <list>          
 1       1 Baby Boomers    15 cog_nonverbal basic     <tibble [1 × 5]>
 2       2 Baby Boomers    15 cog_nonverbal child_bmi <tibble [1 × 5]>
 3       3 Baby Boomers    15 cog_verbal    basic     <tibble [1 × 5]>
 4       4 Baby Boomers    15 cog_verbal    child_bmi <tibble [1 × 5]>
 5       5 Baby Boomers    15 cog_vocab     basic     <tibble [1 × 5]>
 6       6 Baby Boomers    15 cog_vocab     child_bmi <tibble [1 × 5]>
 7       7 Baby Boomers    25 cog_nonverbal basic     <tibble [1 × 5]>
 8       8 Baby Boomers    25 cog_nonverbal child_bmi <tibble [1 × 5]>
 9       9 Baby Boomers    25 cog_verbal    basic     <tibble [1 × 5]>
10      10 Baby Boomers    25 cog_verbal    child_bmi <tibble [1 × 5]>
# ℹ 86 more rows

This is not so helpful as we cannot see or immediately access the results. Instead, we can use the unnest() function to explode the list-column out into the wider data.frame - we’ll have one row per model and five extra variables as the tibbles in the list-column were each of dimension 1 (rows) x 5 (columns).

mod_res <- mod_specs %>%
  mutate(res = map(spec_id, ~ get_lm(.x))) %>%
  unnest(res)
mod_res
# A tibble: 96 × 10
   spec_id cohort        fup cog_var covars estimate std.error  p.value conf.low
     <int> <fct>       <int> <chr>   <chr>     <dbl>     <dbl>    <dbl>    <dbl>
 1       1 Baby Boome…    15 cog_no… basic    -0.169    0.0422 6.29e- 5   -0.252
 2       2 Baby Boome…    15 cog_no… child…   -0.200    0.0346 8.46e- 9   -0.268
 3       3 Baby Boome…    15 cog_ve… basic    -0.214    0.0422 3.94e- 7   -0.297
 4       4 Baby Boome…    15 cog_ve… child…   -0.242    0.0346 3.02e-12   -0.310
 5       5 Baby Boome…    15 cog_vo… basic    -0.117    0.0424 5.95e- 3   -0.200
 6       6 Baby Boome…    15 cog_vo… child…   -0.185    0.0348 1.09e- 7   -0.253
 7       7 Baby Boome…    25 cog_no… basic    -0.231    0.0423 4.97e- 8   -0.314
 8       8 Baby Boome…    25 cog_no… child…   -0.261    0.0347 6.39e-14   -0.329
 9       9 Baby Boome…    25 cog_ve… basic    -0.296    0.0422 2.53e-12   -0.379
10      10 Baby Boome…    25 cog_ve… child…   -0.323    0.0346 1.60e-20   -0.391
# ℹ 86 more rows
# ℹ 1 more variable: conf.high <dbl>

And that’s Step 3 done! We now have the results of 96 regression models. Each row of the mod_res data.frame is the coefficient from a linear regression model of BMI upon cognitive ability and a set of covariates, with the cohort, age of BMI collection, measurement of cognitive ability, and covariates used differing between models.

Exercise

To demonstrate the flexibility of this approach, as an exercise, try amending the above code to also run regressions among male and female participants only. Please attempt the exercise before looking at the solution below.

Hint: you can add an if(){} statement along with the filter() function to subset the data to males or females depending on whether the specification requires it. You should also add a new column to the mod_specs data.frame to specify whether the given specification is to be run among females, males or all participants. You may want to use a similar trick to the one we used to specify mod_covars - i.e., using an external named list contains the set of sexes the model should be run among.

Solution

mod_sex <- list(all = 0:1, male = 0, female = 1)

sex_specs <- mod_specs %>%
  expand_grid(sexes = names(mod_sex)) %>%
  mutate(spec_id = row_number(), .before = 1)

get_sex_lm <- function(spec_id){
    spec <- sex_specs %>%
    filter(spec_id == !!spec_id)
  
  df_mod <- df %>%
    filter(cohort == !!spec$cohort,
           fup == !!spec$fup,
           female %in% mod_sex[[spec$sexes]])
  
  mod_formula <- glue("bmi ~ {spec$cog_var} + {mod_covars[[spec$covars]]}") %>%
    as.formula()
  # Add condition that removes female covariate if not required
  if (spec$sex != "all") mod_formula <-  update(mod_formula, ~ -female)
  
  mod <- lm(mod_formula, data = df_mod) # Runs the model
  
  ret <- tidy(mod, conf.int = TRUE) %>% # Extracts the model coefficients
    filter(term == !!spec$cog_var) %>%
    select(estimate, std.error, p.value, conf.low, conf.high)
  
  return(ret)
}

Learning the tidyverse

To learn more about the tidyverse, read Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund’s book R for Data Science. The tidyverse website is also excellent and includes many “vignettes” to show you how to use specific functions. RStudio also produce quick-reference cheatsheets for many popular packages and the R for Data Science Online Learning Community (R4DS) is a very helpful hub for advice.