# install.packages("tidyverse")
library(tidyverse)
Many Models in R: NCRM Worksheet
Introduction
This worksheet provides (edited) code from the videos on running many models in R. I recommend reading this worksheet alongside the videos.
The example used in the video uses simulated data from four cohort studies to examine cohort level changes in the association between childhood cognitive ability and adulthood body mass index (BMI). BMI is measured in each cohort at different ages, cognitive ability is measured in multiple ways, and the set of control variables to be used varies between regressions, too. We want to run a regression for each combination of these (96 regressions in total).
Step 1: Making the Specification Grid
First, we load the tidyverse
package. Uncomment the first line if you need to install it.
The tidyverse
is a package of packages. library(tidyverse)
loads a set of “core” packages. We use several of these below. We also use some non-core packages which we’ll need to load explicitly.
Next, we load the dataset, which is in .Rds
format, directly from the internet, assigning it to the object df
.
<- url("https://osf.io/download/ep86k/") %>% readRDS()
df
str(df)
tibble [80,000 × 11] (S3: tbl_df/tbl/data.frame)
$ cohort : Factor w/ 4 levels "Baby Boomers",..: 1 1 1 1 1 1 1 1 1 1 ...
$ id : int [1:80000] 1 1 1 1 1 1 2 2 2 2 ...
$ fup : int [1:80000] 15 25 35 45 55 65 15 25 35 45 ...
$ age : num [1:80000] 15.4 25.3 35.5 45.5 55.6 ...
$ bmi : num [1:80000] 24.3 25.8 27.3 28.7 30.2 ...
$ cog_nonverbal: num [1:80000] -0.282 -0.282 -0.282 -0.282 -0.282 ...
$ cog_verbal : num [1:80000] -0.341 -0.341 -0.341 -0.341 -0.341 ...
$ cog_vocab : num [1:80000] 0.983 0.983 0.983 0.983 0.983 ...
$ female : num [1:80000] 0 0 0 0 0 0 0 0 0 0 ...
$ family_class : Factor w/ 2 levels "Manual","Non-Manual": 1 1 1 1 1 1 2 2 2 2 ...
$ child_bmi : num [1:80000] 18.1 18.1 18.1 18.1 18.1 ...
%>% sample_n(20) # Print a random sample of 20 rows df
# A tibble: 20 × 11
cohort id fup age bmi cog_nonverbal cog_verbal cog_vocab female
<fct> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Baby Boome… 2056 35 35.6 22.5 -0.137 -0.301 -0.313 0
2 Gen X 4541 55 55.3 31.9 -0.317 0.310 0.530 1
3 Millennials 2231 15 15.2 18.1 1.20 2.20 1.84 1
4 Gen Z 1736 15 15.5 23.7 -0.706 -0.133 -1.30 0
5 Gen X 2830 25 25.0 21.1 -1.35 -0.965 -1.60 0
6 Millennials 4575 15 15.0 21.0 0.0601 -0.258 -0.671 0
7 Baby Boome… 1433 25 25.4 24.0 -0.475 -0.552 -0.244 1
8 Gen X 4350 25 25.7 18.8 0.225 0.411 -0.725 1
9 Gen Z 1735 25 25.3 24.7 -0.0853 -0.897 0.817 0
10 Baby Boome… 2191 45 45.5 27.6 -0.225 -0.949 0.159 1
11 Baby Boome… 801 35 35.7 21.7 -0.850 -0.179 -1.12 1
12 Gen X 2614 25 25.6 17.2 0.231 0.433 1.46 0
13 Gen X 4361 25 25.2 26.5 -0.683 -0.725 0.0583 0
14 Baby Boome… 2033 15 15.8 20.6 -0.622 -0.885 0.0510 1
15 Baby Boome… 2205 25 25.6 16.7 -1.75 -0.660 -0.960 0
16 Gen X 4267 45 45.3 33.0 -0.590 0.106 0.266 0
17 Baby Boome… 4675 55 55.5 30.1 0.198 -0.246 0.0690 0
18 Gen Z 3336 15 15.8 25.4 0.912 0.186 -0.605 0
19 Baby Boome… 2143 45 45.8 22.8 -0.0987 -0.366 0.535 0
20 Gen Z 3684 15 15.5 16.1 -0.340 0.121 0.0760 1
# ℹ 2 more variables: family_class <fct>, child_bmi <dbl>
The dataset is a tibble
- a type of data.frame
used extensively in the tidyverse
. The dataset is in long format with an observation for each cohort, participant and follow-up (cohort x id x fup
) combination. The variables bmi
and age
are time-varying, denoting information on body mass index (BMI) and age of BMI measurement, respectively. The other variables are time-invariant and comprise three measures of childhood cognitive ability (cog_*
) and variables for sex (female
), family social class (family_class
) and childhood BMI (child_bmi
). Sex, social class, age, and childhood BMI will be used as control variables in the regressions.
Our aim is to perform a regression for each combination of:
- Cohort
- Follow-up at which BMI was measured
- Measure of cognitive ability
- Set of control variable used (including or not including
child_bmi
as a control variable).
We have included multiple cognitive ability variables as these are measured slightly differently in each cohort, so we want to check how robust results are given this. Note, using a consistent naming scheme for similar variables (e.g., prepending with cog_
) simplifies grouping and selecting these variables, as we’ll see shortly. Regressions will be repeated controlling and not controlling for childhood BMI to examine the possibility of reverse causality.
We’ll be putting these specifications into a data.frame
, which I’ll term the specification grid. Before creating the specification grid, we’ll create two objects, one which contains the names of the cognitive ability variables and another which specifies the model covariates to be added to the particular regression.
<- str_subset(
cog_vars # Returns the names of the variables in df which begin "cog_"
names(df),
"^cog_"
) cog_vars
[1] "cog_nonverbal" "cog_verbal" "cog_vocab"
<- list(
mod_covars # Creates a list naming the models and...
# ...giving the variables to be added as controls
basic = "age + female + family_class",
child_bmi = "age + female + family_class + child_bmi"
) mod_covars
$basic
[1] "age + female + family_class"
$child_bmi
[1] "age + female + family_class + child_bmi"
"basic"]] # Code to extract covariates for "basic" model mod_covars[[
[1] "age + female + family_class"
Next, we want to get the unique combinations of cohort and follow-up. As each cohort was tracked at different ages (e.g., the Baby Boomers were followed to age 65, and Gen Z to age 25), we need to make sure the combinations of cohort
and fup
actually appear in the data. We can do this using distinct()
.
%>%
df distinct(cohort, fup)
# A tibble: 16 × 2
cohort fup
<fct> <int>
1 Baby Boomers 15
2 Baby Boomers 25
3 Baby Boomers 35
4 Baby Boomers 45
5 Baby Boomers 55
6 Baby Boomers 65
7 Gen X 15
8 Gen X 25
9 Gen X 35
10 Gen X 45
11 Gen X 55
12 Millennials 15
13 Millennials 25
14 Millennials 35
15 Gen Z 15
16 Gen Z 25
Below, we use distinct()
and follow this with expand_grid()
. expand_grid()
creates a data.frame
with all the unique combinations of cohort
, fup
and cog_vars
and the models in mod_covars
. We also add a further column, spec_id
, which records the row. spec_id
will be passed to the general function to be used to identify a particular model specification; it is simpler passing this number to a general function, rather than all the other columns in the specification grid as (a) it is just one value and (b) it reduces the number of changes we need to make if we add more complexity to our specification grid later (i.e., by adding further columns).
<- df %>%
mod_specs distinct(cohort, fup) %>%
expand_grid( # Enumerates all combinations of cohort, fup, cog_vars and models in mod_covars.
cog_var = cog_vars,
covars = names(mod_covars)
%>%
) mutate(
spec_id = row_number(), # Adds a new variable, spec_id, which records the row
.before = 1 # Specifies that spec_id should be the first column of the data.frame
)
mod_specs
# A tibble: 96 × 5
spec_id cohort fup cog_var covars
<int> <fct> <int> <chr> <chr>
1 1 Baby Boomers 15 cog_nonverbal basic
2 2 Baby Boomers 15 cog_nonverbal child_bmi
3 3 Baby Boomers 15 cog_verbal basic
4 4 Baby Boomers 15 cog_verbal child_bmi
5 5 Baby Boomers 15 cog_vocab basic
6 6 Baby Boomers 15 cog_vocab child_bmi
7 7 Baby Boomers 25 cog_nonverbal basic
8 8 Baby Boomers 25 cog_nonverbal child_bmi
9 9 Baby Boomers 25 cog_verbal basic
10 10 Baby Boomers 25 cog_verbal child_bmi
# ℹ 86 more rows
And that’s the Step 1 done! In a few lines of code, we have created a grid of 96 model specifications.
Step 2: Writing a General Model Function
Next we need to write a function that takes a spec_id
and runs the model defined in that particular row of the specification grid. In the video, I recommend initially taking a specific spec_id
and writing out the code required to run that model, writing the code in as general a way as possible, and ensuring it works, before saving that code as a function. Here I will write the code as a function directly.
# Load packages used by the function.
# These are installed with the tidyverse but need to be loaded explicitly
library(broom)
library(glue)
<- function(spec_id){
get_lm <- mod_specs %>%
spec filter(
# Keeps the specification where the column spec_id...
# ...is equal to spec_id inputted to get_lm()
== !!spec_id
spec_id
)
<- df %>% # Keeps data for the specifc cohort...
df_mod # ...and follow-up defined in the spec
filter(cohort == !!spec$cohort,
== !!spec$fup)
fup
<- glue(
mod_formula # Creates a model formula including the correct...
# ...cog_var and model covariates
"bmi ~ {spec$cog_var} + {mod_covars[[spec$covars]]}"
%>%
) # Converts from a string to a formula object...
# ...so it works with lm()
as.formula()
<- lm(mod_formula, data = df_mod) # Runs the model
mod
<- tidy(mod, conf.int = TRUE) %>% # Extracts the model coefficients
ret filter(term == !!spec$cog_var) %>%
select(estimate, std.error, p.value, conf.low, conf.high)
return(ret)
}
The results of the 1st and 15th specifications are:
get_lm(1) # Run for the first specification
# A tibble: 1 × 5
estimate std.error p.value conf.low conf.high
<dbl> <dbl> <dbl> <dbl> <dbl>
1 -0.169 0.0422 0.0000629 -0.252 -0.0864
get_lm(15) # Run for the fifteenth specification
# A tibble: 1 × 5
estimate std.error p.value conf.low conf.high
<dbl> <dbl> <dbl> <dbl> <dbl>
1 -0.378 0.0422 5.09e-19 -0.461 -0.295
And that’s Step 2 completed!
Step 3: Running the Models
Next, we need to run the models. It does not make sense to write get_lm()
out 96 times. Instead, we can use the map()
function from the purrr()
package (part of the tidyverse
). map()
is an iterator - it takes an object and applies a function to each element of it, returning the results as a list
. For instance, below we take the vector 2 to 5 and add 16 to each element in turn (i.e., obtaining 2+16=18
,3+16=19
, …, 5+16=21
). In this code, we use what is known as lambda notation: .x
is a placeholder for the current value of the object being iterated over (in the first instance, 2; in the second instance, 3; and so on).
map(2:5, ~ .x + 16)
[[1]]
[1] 18
[[2]]
[1] 19
[[3]]
[1] 20
[[4]]
[1] 21
We can pass the spec_id
vector from our specification grid to map()
. Importantly, we can do this within a mutate()
function call so the results appear as a new column in the mod_specs
data.frame
. Note, as map()
returns a list, the new column is a list-column - in this case, a list of tibbles
(remember, a tibble
is a type of data.frame
used in the tidyverse
).
%>%
mod_specs mutate(res = map(spec_id, ~ get_lm(.x))) # Add a new column "res"
# A tibble: 96 × 6
spec_id cohort fup cog_var covars res
<int> <fct> <int> <chr> <chr> <list>
1 1 Baby Boomers 15 cog_nonverbal basic <tibble [1 × 5]>
2 2 Baby Boomers 15 cog_nonverbal child_bmi <tibble [1 × 5]>
3 3 Baby Boomers 15 cog_verbal basic <tibble [1 × 5]>
4 4 Baby Boomers 15 cog_verbal child_bmi <tibble [1 × 5]>
5 5 Baby Boomers 15 cog_vocab basic <tibble [1 × 5]>
6 6 Baby Boomers 15 cog_vocab child_bmi <tibble [1 × 5]>
7 7 Baby Boomers 25 cog_nonverbal basic <tibble [1 × 5]>
8 8 Baby Boomers 25 cog_nonverbal child_bmi <tibble [1 × 5]>
9 9 Baby Boomers 25 cog_verbal basic <tibble [1 × 5]>
10 10 Baby Boomers 25 cog_verbal child_bmi <tibble [1 × 5]>
# ℹ 86 more rows
This is not so helpful as we cannot see or immediately access the results. Instead, we can use the unnest()
function to explode the list-column out into the wider data.frame
- we’ll have one row per model and five extra variables as the tibbles
in the list-column were each of dimension 1 (rows) x 5 (columns).
<- mod_specs %>%
mod_res mutate(res = map(spec_id, ~ get_lm(.x))) %>%
unnest(res)
mod_res
# A tibble: 96 × 10
spec_id cohort fup cog_var covars estimate std.error p.value conf.low
<int> <fct> <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 Baby Boome… 15 cog_no… basic -0.169 0.0422 6.29e- 5 -0.252
2 2 Baby Boome… 15 cog_no… child… -0.200 0.0346 8.46e- 9 -0.268
3 3 Baby Boome… 15 cog_ve… basic -0.214 0.0422 3.94e- 7 -0.297
4 4 Baby Boome… 15 cog_ve… child… -0.242 0.0346 3.02e-12 -0.310
5 5 Baby Boome… 15 cog_vo… basic -0.117 0.0424 5.95e- 3 -0.200
6 6 Baby Boome… 15 cog_vo… child… -0.185 0.0348 1.09e- 7 -0.253
7 7 Baby Boome… 25 cog_no… basic -0.231 0.0423 4.97e- 8 -0.314
8 8 Baby Boome… 25 cog_no… child… -0.261 0.0347 6.39e-14 -0.329
9 9 Baby Boome… 25 cog_ve… basic -0.296 0.0422 2.53e-12 -0.379
10 10 Baby Boome… 25 cog_ve… child… -0.323 0.0346 1.60e-20 -0.391
# ℹ 86 more rows
# ℹ 1 more variable: conf.high <dbl>
And that’s Step 3 done! We now have the results of 96 regression models. Each row of the mod_res
data.frame
is the coefficient from a linear regression model of BMI upon cognitive ability and a set of covariates, with the cohort, age of BMI collection, measurement of cognitive ability, and covariates used differing between models.
Exercise
To demonstrate the flexibility of this approach, as an exercise, try amending the above code to also run regressions among male and female participants only. Please attempt the exercise before looking at the solution below.
Hint: you can add an if(){}
statement along with the filter()
function to subset the data to males or females depending on whether the specification requires it. You should also add a new column to the mod_specs
data.frame
to specify whether the given specification is to be run among females, males or all participants. You may want to use a similar trick to the one we used to specify mod_covars
- i.e., using an external named list contains the set of sexes the model should be run among.
Solution
<- list(all = 0:1, male = 0, female = 1)
mod_sex
<- mod_specs %>%
sex_specs expand_grid(sexes = names(mod_sex)) %>%
mutate(spec_id = row_number(), .before = 1)
<- function(spec_id){
get_sex_lm <- sex_specs %>%
spec filter(spec_id == !!spec_id)
<- df %>%
df_mod filter(cohort == !!spec$cohort,
== !!spec$fup,
fup %in% mod_sex[[spec$sexes]])
female
<- glue("bmi ~ {spec$cog_var} + {mod_covars[[spec$covars]]}") %>%
mod_formula as.formula()
# Add condition that removes female covariate if not required
if (spec$sex != "all") mod_formula <- update(mod_formula, ~ -female)
<- lm(mod_formula, data = df_mod) # Runs the model
mod
<- tidy(mod, conf.int = TRUE) %>% # Extracts the model coefficients
ret filter(term == !!spec$cog_var) %>%
select(estimate, std.error, p.value, conf.low, conf.high)
return(ret)
}
Learning the tidyverse
To learn more about the tidyverse
, read Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund’s book R for Data Science. The tidyverse website is also excellent and includes many “vignettes” to show you how to use specific functions. RStudio also produce quick-reference cheatsheets for many popular packages and the R for Data Science Online Learning Community (R4DS) is a very helpful hub for advice.