Dr Chris Playford
This Jupyter notebook has been constructed as an online resource on Latent Variables for the National Centre for Research Methods.
The workbook demonstrates the estimation of 4 different types of latent variable model:
The dataset includes information on attainment in General Certificate of Secondary Education (GCSE) qualifications from young people age 15-16 in England and Wales in 1992. It was collected as part of the Youth Cohort Study (YCS), a series of surveys exploring the education and backgrounds of young people.
Courtenay, G. (1996). Youth Cohort Study of England and Wales, 1992-1994; Cohort Six, Sweep One to Three. [data collection]. UK Data Service. SN: 3532, DOI: http://doi.org/10.5255/UKDA-SN-3532-1
* Open the dataset
use ncrm_latent_vars_online_resource_ycs1992.dta, clear
Pupils in England and Wales study GCSE qualifications in a range of different subjects. GCSE subjects are assessed separately and a subject-specific GCSE is awarded. Each GCSE subject is awarded a grade, historically the highest being grade A and the lowest grade G. The grade A* was introduced in 1994, after the data presented in this notebook were collected. For more details on the application of latent class models to GCSE outcomes, see Playford and Gayle (2016).
Playford, C. J., and Gayle, V. (2016). The concealed middle? An exploration of ordinary young people and school GCSE subject area attainment. Journal of Youth Studies, 19(2), 149-168. https://doi.org/10.1080/13676261.2015.1052049
For simplicity, the individual GCSE subjects have been grouped into 5 broad subject groupings. The grade reported is the highest grade attained in each subject grouping.
The variables we will be looking at are the highest GCSE grade attained in the following subject groupings:
There are two versions of these variables described in the table below.
GCSE Grade | GCSE Points Score | GCSE A-C Indicator |
---|---|---|
U | 0 | 0 |
G | 1 | 0 |
F | 2 | 0 |
E | 3 | 0 |
D | 4 | 0 |
C | 5 | 1 |
B | 6 | 1 |
A | 7 | 1 |
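The recoding in the table above can be sketched outside Stata. A minimal Python illustration (the `recode` helper is hypothetical, written only to mirror the table; it is not part of the dataset or the Stata code):

```python
# Map each GCSE grade to its points score, following the table above.
# (A* postdates the 1992 data, so it does not appear.)
GRADE_POINTS = {"U": 0, "G": 1, "F": 2, "E": 3, "D": 4, "C": 5, "B": 6, "A": 7}

def recode(grade):
    """Return (points score, A-C indicator) for a GCSE grade."""
    points = GRADE_POINTS[grade]
    return points, int(points >= 5)  # grades A-C score 5 or more
```

For example, `recode("C")` returns `(5, 1)` and `recode("D")` returns `(4, 0)`, matching the two columns of the table.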
The distribution of the variables is described below.
tab english_sc english, missing
   Highest |
  Score in |
      GCSE |  Highest Grade in GCSE
   English |     English subjects
  subjects |         0          1 |     Total
-----------+----------------------+----------
         0 |        27          0 |        27
         1 |        50          0 |        50
         2 |       279          0 |       279
         3 |     1,238          0 |     1,238
         4 |     2,666          0 |     2,666
         5 |         0      4,768 |     4,768
         6 |         0      3,543 |     3,543
         7 |         0      2,149 |     2,149
-----------+----------------------+----------
     Total |     4,260     10,460 |    14,720
tab maths_sc maths, missing
   Highest |
  Score in |  Highest Grade in GCSE
GCSE Maths |      Maths subjects
  subjects |         0          1 |     Total
-----------+----------------------+----------
         0 |       223          0 |       223
         1 |       350          0 |       350
         2 |     1,126          0 |     1,126
         3 |     2,327          0 |     2,327
         4 |     2,393          0 |     2,393
         5 |         0      4,640 |     4,640
         6 |         0      2,013 |     2,013
         7 |         0      1,648 |     1,648
-----------+----------------------+----------
     Total |     6,419      8,301 |    14,720
tab science_sc science, missing
   Highest |
  Score in |
      GCSE |  Highest Grade in GCSE
   Science |     Science subjects
  subjects |         0          1 |     Total
-----------+----------------------+----------
         0 |       115          0 |       115
         1 |       383          0 |       383
         2 |     1,048          0 |     1,048
         3 |     2,195          0 |     2,195
         4 |     3,137          0 |     3,137
         5 |         0      3,776 |     3,776
         6 |         0      2,349 |     2,349
         7 |         0      1,717 |     1,717
-----------+----------------------+----------
     Total |     6,878      7,842 |    14,720
tab humanity_sc humanity, missing
   Highest |
  Score in |
      GCSE |  Highest Grade in GCSE
  Humanity |     Humanity subjects
  subjects |         0          1 |     Total
-----------+----------------------+----------
         0 |       133          0 |       133
         1 |       309          0 |       309
         2 |       812          0 |       812
         3 |     1,594          0 |     1,594
         4 |     2,661          0 |     2,661
         5 |         0      3,520 |     3,520
         6 |         0      3,033 |     3,033
         7 |         0      2,658 |     2,658
-----------+----------------------+----------
     Total |     5,509      9,211 |    14,720
tab othersub_sc othersub, missing
   Highest |
  Score in |  Highest Grade in GCSE
GCSE Other |      Other subjects
  subjects |         0          1 |     Total
-----------+----------------------+----------
         0 |        56          0 |        56
         1 |       165          0 |       165
         2 |       555          0 |       555
         3 |     1,209          0 |     1,209
         4 |     2,479          0 |     2,479
         5 |         0      3,499 |     3,499
         6 |         0      3,224 |     3,224
         7 |         0      3,533 |     3,533
-----------+----------------------+----------
     Total |     4,464     10,256 |    14,720
Latent class models are suitable when:
In the model below, we have 5 manifest variables which are binary (coded 1 or 0):
Before fitting the latent class model, we can look at the correlation between pairs of binary indicators with a tetrachoric correlation.
For example, young people who gain an A-C pass in GCSE Maths subjects also tend to gain an A-C pass in Science subjects (0.7922).
tetrachoric english maths science humanity othersub
(obs=14,720)

             |  english    maths  science humanity othersub
-------------+---------------------------------------------
     english |   1.0000
       maths |   0.7056   1.0000
     science |   0.6413   0.7922   1.0000
    humanity |   0.7334   0.7290   0.6995   1.0000
    othersub |   0.6240   0.6152   0.5854   0.6082   1.0000
It is helpful to explore all the possible combinations of GCSE subject group outcomes. In the table below, we can see that there are 32 combinations because we have 5 manifest variables with 2 possible outcomes for each variable (i.e. 2 ^ 5 = 32). Each of the 32 patterns is a unique combination of GCSE outcomes. For example, combination 11111 includes those young people who gain an A-C pass in all 5 GCSE subject groups.
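The counting argument can be checked outside Stata; a short Python sketch enumerating every 0/1 pattern across the five subject groupings:

```python
from itertools import product

# Enumerate every possible pattern of A-C passes across the 5 GCSE
# subject groupings (english, maths, science, humanity, othersub).
patterns = ["".join(str(b) for b in bits) for bits in product((0, 1), repeat=5)]

print(len(patterns))   # 32, i.e. 2 ** 5 distinct patterns
print(patterns[-1])    # 11111: an A-C pass in all five groupings
```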
egen gcse_att_pat = concat(english maths science humanity othersub), format(%9.0f)
label variable gcse_att_pat "GCSE subject outcome pattern"
tab gcse_att_pat, missing
       GCSE |
    subject |
    outcome |
    pattern |      Freq.     Percent        Cum.
------------+-----------------------------------
      00000 |      1,787       12.14       12.14
      00001 |        733        4.98       17.12
      00010 |        229        1.56       18.68
      00011 |        217        1.47       20.15
      00100 |        147        1.00       21.15
      00101 |        131        0.89       22.04
      00110 |         66        0.45       22.49
      00111 |        103        0.70       23.19
      01000 |        133        0.90       24.09
      01001 |        107        0.73       24.82
      01010 |         55        0.37       25.19
      01011 |        107        0.73       25.92
      01100 |         74        0.50       26.42
      01101 |        114        0.77       27.19
      01110 |         67        0.46       27.65
      01111 |        190        1.29       28.94
      10000 |        548        3.72       32.66
      10001 |        599        4.07       36.73
      10010 |        323        2.19       38.93
      10011 |        676        4.59       43.52
      10100 |         90        0.61       44.13
      10101 |        179        1.22       45.35
      10110 |        109        0.74       46.09
      10111 |        482        3.27       49.36
      11000 |        110        0.75       50.11
      11001 |        274        1.86       51.97
      11010 |        173        1.18       53.15
      11011 |        807        5.48       58.63
      11100 |        100        0.68       59.31
      11101 |        383        2.60       61.91
      11110 |        453        3.08       64.99
      11111 |      5,154       35.01      100.00
------------+-----------------------------------
      Total |     14,720      100.00
We can estimate latent class models to explore whether the 32 combinations might be represented by a latent variable with a smaller number of categories or classes.
The number of latent classes that the data can support is a function of the number of combinations available and the frequency of the particular combinations. It is usual for researchers to estimate several models, each with a different number of latent classes. The models can then be compared using goodness-of-fit statistics to identify the preferred model. For the purposes of illustration in this notebook, a 4 class model has been fitted (below) to show how this works in practice and how we might interpret the results.
NB: When using categorical indicators with Stata's GSEM commands, ensure that the binary outcome is coded 1 or 0 (not 1 or 2)
gsem (english maths science humanity othersub <- ), logit lclass(C 4)
estimates store class4
Fitting class model:
Iteration 0: (class) log likelihood = -20406.253
Iteration 1: (class) log likelihood = -20406.253

Fitting outcome model:
Iteration 0: (outcome) log likelihood = -25736.566
Iteration 1: (outcome) log likelihood = -24088.912
Iteration 2: (outcome) log likelihood = -23889.89
Iteration 3: (outcome) log likelihood = -23840.072
Iteration 4: (outcome) log likelihood = -23828.68
Iteration 5: (outcome) log likelihood = -23826.321
Iteration 6: (outcome) log likelihood = -23825.83
Iteration 7: (outcome) log likelihood = -23825.716
Iteration 8: (outcome) log likelihood = -23825.689
Iteration 9: (outcome) log likelihood = -23825.683
Iteration 10: (outcome) log likelihood = -23825.682

Refining starting values:
Iteration 0: (EM) log likelihood = -46955.636
Iteration 1: (EM) log likelihood = -47061.504
Iteration 2: (EM) log likelihood = -47118.702
Iteration 3: (EM) log likelihood = -47150.853
Iteration 4: (EM) log likelihood = -47170.055
Iteration 5: (EM) log likelihood = -47182.406
Iteration 6: (EM) log likelihood = -47190.959
Iteration 7: (EM) log likelihood = -47197.268
Iteration 8: (EM) log likelihood = -47202.149
Iteration 9: (EM) log likelihood = -47206.052
Iteration 10: (EM) log likelihood = -47209.243
Iteration 11: (EM) log likelihood = -47211.89
Iteration 12: (EM) log likelihood = -47214.105
Iteration 13: (EM) log likelihood = -47215.972
Iteration 14: (EM) log likelihood = -47217.551
Iteration 15: (EM) log likelihood = -47218.891
Iteration 16: (EM) log likelihood = -47220.032
Iteration 17: (EM) log likelihood = -47221.004
Iteration 18: (EM) log likelihood = -47221.833
Iteration 19: (EM) log likelihood = -47222.541
Iteration 20: (EM) log likelihood = -47223.147
note: EM algorithm reached maximum iterations.
Fitting full model:
Iteration 0:  Log likelihood = -38407.036  (not concave)
Iteration 1:  Log likelihood = -38356.743  (not concave)
Iteration 2:  Log likelihood = -38319.948  (not concave)
Iteration 3:  Log likelihood = -38308.467  (not concave)
Iteration 4:  Log likelihood = -38305.797  (not concave)
Iteration 5:  Log likelihood = -38304.484  (not concave)
Iteration 6:  Log likelihood = -38301.176  (not concave)
Iteration 7:  Log likelihood = -38299.958  (not concave)
Iteration 8:  Log likelihood = -38294.927  (not concave)
Iteration 9:  Log likelihood = -38292.965  (not concave)
Iteration 10: Log likelihood = -38288.061  (not concave)
Iteration 11: Log likelihood = -38279.074  (not concave)
Iteration 12: Log likelihood = -38273.615  (not concave)
Iteration 13: Log likelihood = -38273.109  (not concave)
Iteration 14: Log likelihood = -38268.851  (not concave)
Iteration 15: Log likelihood = -38259.636  (not concave)
Iteration 16: Log likelihood = -38256.432  (not concave)
Iteration 17: Log likelihood = -38253.041  (not concave)
Iteration 18: Log likelihood = -38249.164  (not concave)
Iteration 19: Log likelihood = -38245.711  (not concave)
Iteration 20: Log likelihood = -38243.22   (not concave)
Iteration 21: Log likelihood = -38241.272  (not concave)
Iteration 22: Log likelihood = -38239.835  (not concave)
Iteration 23: Log likelihood = -38239.431  (not concave)
Iteration 24: Log likelihood = -38237.974  (not concave)
Iteration 25: Log likelihood = -38237.614  (not concave)
Iteration 26: Log likelihood = -38235.685  (not concave)
Iteration 27: Log likelihood = -38233.537  (not concave)
Iteration 28: Log likelihood = -38231.497  (not concave)
Iteration 29: Log likelihood = -38230.743  (not concave)
Iteration 30: Log likelihood = -38227.409  (not concave)
Iteration 31: Log likelihood = -38224.932  (not concave)
Iteration 32: Log likelihood = -38223.166  (not concave)
Iteration 33: Log likelihood = -38221.619  (not concave)
Iteration 34: Log likelihood = -38221.466  (not concave)
Iteration 35: Log likelihood = -38221.416
Iteration 36: Log likelihood = -38219.494
Iteration 37: Log likelihood = -38216.388
Iteration 38: Log likelihood = -38215.387
Iteration 39: Log likelihood = -38215.158
Iteration 40: Log likelihood = -38215.047
Iteration 41: Log likelihood = -38214.959
Iteration 42: Log likelihood = -38214.957
Iteration 43: Log likelihood = -38214.957

Generalized structural equation model           Number of obs = 14,720
Log likelihood = -38214.957
------------------------------------------------------------------------------
             | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
1.C          |  (base outcome)
-------------+----------------------------------------------------------------
2.C          |
       _cons |  -.0573257   .1516287    -0.38   0.705    -.3545126    .2398611
-------------+----------------------------------------------------------------
3.C          |
       _cons |  -.8434204   .3248152    -2.60   0.009    -1.480046   -.2067944
-------------+----------------------------------------------------------------
4.C          |
       _cons |   .6991554   .0590171    11.85   0.000      .583484    .8148268
------------------------------------------------------------------------------

Class: 1
Response: english     Family: Bernoulli   Link: Logit
Response: maths       Family: Bernoulli   Link: Logit
Response: science     Family: Bernoulli   Link: Logit
Response: humanity    Family: Bernoulli   Link: Logit
Response: othersub    Family: Bernoulli   Link: Logit
------------------------------------------------------------------------------
             | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
english      |
       _cons |  -1.635319   .1144926   -14.28   0.000     -1.85972   -1.410918
-------------+----------------------------------------------------------------
maths        |
       _cons |  -2.954779   .1736526   -17.02   0.000    -3.295132   -2.614426
-------------+----------------------------------------------------------------
science      |
       _cons |  -2.982875   .2312944   -12.90   0.000    -3.436204   -2.529546
-------------+----------------------------------------------------------------
humanity     |
       _cons |  -2.409235   .1329394   -18.12   0.000    -2.669791   -2.148678
-------------+----------------------------------------------------------------
othersub     |
       _cons |  -1.054667   .0614057   -17.18   0.000     -1.17502   -.9343142
------------------------------------------------------------------------------

Class: 2
Response: english     Family: Bernoulli   Link: Logit
Response: maths       Family: Bernoulli   Link: Logit
Response: science     Family: Bernoulli   Link: Logit
Response: humanity    Family: Bernoulli   Link: Logit
Response: othersub    Family: Bernoulli   Link: Logit
------------------------------------------------------------------------------
             | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
english      |
       _cons |   1.475149   .2056347     7.17   0.000     1.072112    1.878186
-------------+----------------------------------------------------------------
maths        |
       _cons |  -.8149979   .1435114    -5.68   0.000    -1.096275   -.5337207
-------------+----------------------------------------------------------------
science      |
       _cons |  -1.812117   .4947032    -3.66   0.000    -2.781717   -.8425163
-------------+----------------------------------------------------------------
humanity     |
       _cons |    .315028   .0875916     3.60   0.000     .1433515    .4867044
-------------+----------------------------------------------------------------
othersub     |
       _cons |   .8563969   .0830042    10.32   0.000     .6937117    1.019082
------------------------------------------------------------------------------

Class: 3
Response: english     Family: Bernoulli   Link: Logit
Response: maths       Family: Bernoulli   Link: Logit
Response: science     Family: Bernoulli   Link: Logit
Response: humanity    Family: Bernoulli   Link: Logit
Response: othersub    Family: Bernoulli   Link: Logit
------------------------------------------------------------------------------
             | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
english      |
       _cons |  -.0653913   .3328607    -0.20   0.844    -.7177862    .5870036
-------------+----------------------------------------------------------------
maths        |
       _cons |   .2295466   .2064148     1.11   0.266    -.1750191    .6341122
-------------+----------------------------------------------------------------
science      |
       _cons |   1.163136    .646758     1.80   0.072    -.1044864    2.430758
-------------+----------------------------------------------------------------
humanity     |
       _cons |  -.1199283   .1953333    -0.61   0.539    -.5027746     .262918
-------------+----------------------------------------------------------------
othersub     |
       _cons |   .4703479   .1720786     2.73   0.006       .13308    .8076158
------------------------------------------------------------------------------

Class: 4
Response: english     Family: Bernoulli   Link: Logit
Response: maths       Family: Bernoulli   Link: Logit
Response: science     Family: Bernoulli   Link: Logit
Response: humanity    Family: Bernoulli   Link: Logit
Response: othersub    Family: Bernoulli   Link: Logit
------------------------------------------------------------------------------
             | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
english      |
       _cons |   4.044277   .2958173    13.67   0.000     3.464486    4.624068
-------------+----------------------------------------------------------------
maths        |
       _cons |   2.782206   .1169147    23.80   0.000     2.553057    3.011354
-------------+----------------------------------------------------------------
science      |
       _cons |   2.297907   .0958555    23.97   0.000     2.110034    2.485781
-------------+----------------------------------------------------------------
humanity     |
       _cons |    2.97444    .130719    22.75   0.000     2.718235    3.230644
-------------+----------------------------------------------------------------
othersub     |
       _cons |   2.580313   .0697189    37.01   0.000     2.443666    2.716959
------------------------------------------------------------------------------
The model estimated using the gsem command (above) uses a logistic regression model framework. The first part of the code specifies the 5 manifest variables (binary GCSE subject outcomes). The lclass option then specifies how many classes are to be estimated (4 in this example). The model reports a series of estimates:
As the log odds scale can be a little difficult to interpret, Stata provides further commands to re-express the model parameters as probabilities.
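The re-expression is simply the inverse-logit transform. As a quick check outside Stata, a short Python sketch applied to two of the class-specific constants from the gsem output above (the resulting values match the conditional probabilities that Stata reports):

```python
import math

def invlogit(x):
    """Convert a log-odds value to a probability."""
    return 1 / (1 + math.exp(-x))

# Class 1 English constant from the gsem output
print(round(invlogit(-1.635319), 4))  # 0.1631
# Class 4 English constant
print(round(invlogit(4.044277), 4))   # 0.9828
```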
To look at the latent class probabilities (prior probabilities), we run the following command.
estat lcprob
Latent class marginal probabilities             Number of obs = 14,720

--------------------------------------------------------------
             |            Delta-method
             |     Margin   std. err.     [95% conf. interval]
-------------+------------------------------------------------
           C |
          1  |   .2279683   .0095253      .2098391    .2471737
          2  |   .2152673   .0307114      .1611375    .2814792
          3  |   .0980802   .0314894      .0513473    .1793073
          4  |   .4586841   .0109111      .4373878    .4801325
--------------------------------------------------------------
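These marginal probabilities are the multinomial-logit (softmax) transform of the class-membership constants reported by gsem. A short Python check, with the constants copied from the gsem output above (class 1 is the base category, so its constant is 0):

```python
import math

# Class-membership constants from the gsem output (class 1 is the base)
cons = [0.0, -0.0573257, -0.8434204, 0.6991554]

# Softmax: exponentiate and normalise to obtain the marginal class probabilities
exps = [math.exp(c) for c in cons]
probs = [e / sum(exps) for e in exps]

print([round(p, 4) for p in probs])  # [0.228, 0.2153, 0.0981, 0.4587]
```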
In the example above, we can see that 4 latent classes have been estimated. The Margin column reports the (prior) probability that a randomly selected individual belongs to each latent class.
We can see that the probabilities of class membership are:
Now let's look at the conditional probabilities for the manifest (observed) variables. These are the probabilities of gaining an A-C pass in each GCSE subject given membership of a particular latent class.
estat lcmean
Latent class marginal means                     Number of obs = 14,720

--------------------------------------------------------------
             |            Delta-method
             |     Margin   std. err.     [95% conf. interval]
-------------+------------------------------------------------
1            |
     english |    .163103   .0156283      .1347356    .1960893
       maths |   .0495111   .0081721      .0357386    .0682157
     science |   .0482055   .0106122       .031183    .0738127
    humanity |   .0824712   .0100595      .0647796    .1044548
    othersub |   .2583299   .0117651      .2359488    .2820503
-------------+------------------------------------------------
2            |
     english |   .8138387   .0311547      .7449984    .8674026
       maths |   .3068265   .0305226      .2504385    .3696495
     science |   .1403825   .0596984      .0583202    .3010051
    humanity |   .5781121   .0213635      .5357766    .6193298
    othersub |   .7019073   .0173672      .6667921    .7347938
-------------+------------------------------------------------
3            |
     english |    .483658   .0831263      .3278807    .6426773
       maths |    .557136   .0509299      .4563566    .6534213
     science |   .7619021   .1173266      .4739021    .9191429
    humanity |   .4700538   .0486582      .3768888    .5653535
    othersub |   .6154661   .0407254       .533221    .6916012
-------------+------------------------------------------------
4            |
     english |   .9827794   .0050064      .9696602    .9902826
       maths |   .9417066   .0064181      .9277786    .9530844
     science |   .9087036   .0079523      .8918746     .923139
    humanity |    .951406   .0060435      .9380941    .9619713
    othersub |   .9295837   .0045636       .920097      .93802
--------------------------------------------------------------
Class 1 is characterised by the lowest levels of overall attainment (i.e. the probability of gaining an A-C in GCSE English is 16%, GCSE Maths 5%, GCSE Science 5%, and so on).
Class 2 has higher levels of attainment in GCSE English (the probability of an A-C is 81%) and GCSE Other Subjects (70%) but much lower probabilities of gaining an A-C pass in GCSE Maths (31%) or GCSE Science (14%).
Class 3 shows the inverse pattern to the conditional probabilities reported for members of class 2. Members of class 3 have higher probabilities of gaining an A-C in GCSE Maths (56%) or GCSE Science (76%) but lower probabilities of gaining an A-C in GCSE English (48%) or GCSE Other Subjects (62%).
Class 4 has the highest levels of attainment with the probability of gaining A-C in each of the 5 GCSE subject groups being greater than 90%.
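Putting the prior and conditional probabilities together, the posterior class membership for any response pattern follows from Bayes' rule: each class's prior is multiplied by the Bernoulli likelihood of the observed pattern, then normalised. A hedged Python sketch (the `posterior` helper is illustrative, not a Stata command; the values are copied from the estat lcprob and estat lcmean output above), applied to pattern 11111:

```python
# Prior class probabilities (estat lcprob) and conditional A-C probabilities
# (estat lcmean) copied from the output above.
priors = [0.2279683, 0.2152673, 0.0980802, 0.4586841]
cond = [  # rows: classes 1-4; cols: english, maths, science, humanity, othersub
    [0.163103, 0.0495111, 0.0482055, 0.0824712, 0.2583299],
    [0.8138387, 0.3068265, 0.1403825, 0.5781121, 0.7019073],
    [0.483658, 0.557136, 0.7619021, 0.4700538, 0.6154661],
    [0.9827794, 0.9417066, 0.9087036, 0.951406, 0.9295837],
]

def posterior(pattern):
    """Posterior class probabilities for a 0/1 response pattern (Bayes' rule)."""
    joint = []
    for prior, probs in zip(priors, cond):
        p = prior
        for y, pi in zip(pattern, probs):
            p *= pi if y == 1 else 1 - pi  # Bernoulli likelihood per subject
        joint.append(p)
    total = sum(joint)
    return [p / total for p in joint]

post = posterior([1, 1, 1, 1, 1])
print(post.index(max(post)) + 1)  # 4: class 4 is by far the most likely class
```

For pattern 11111 the posterior probability of class 4 exceeds 0.97, which is why this pattern is the archetype of the highest-attainment class.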
To see the overall goodness of fit for this model, we run the following command.
estat lcgof
----------------------------------------------------------------------------
Fit statistic        |      Value   Description
---------------------+------------------------------------------------------
Likelihood ratio     |
          chi2_ms(8) |      4.550   model vs. saturated
            p > chi2 |      0.804
---------------------+------------------------------------------------------
Information criteria |
                 AIC |  76475.913   Akaike's information criterion
                 BIC |  76650.643   Bayesian information criterion
----------------------------------------------------------------------------
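The information criteria follow directly from the log likelihood: AIC = 2k - 2LL and BIC = k ln(N) - 2LL, where k is the number of free parameters (here 23: 4 classes x 5 item constants, plus 3 class-membership constants). A quick Python check against the table above:

```python
import math

ll = -38214.957   # log likelihood from the gsem output
k = 23            # 4 classes x 5 item constants + 3 class-membership constants
n = 14720         # number of observations

aic = 2 * k - 2 * ll
bic = k * math.log(n) - 2 * ll

# Matches the reported AIC 76475.913 and BIC 76650.643 up to rounding of ll
print(round(aic, 3), round(bic, 3))
```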
Further options are available for comparing and reporting models. These are beyond the scope of this workbook of examples, but are covered in more depth elsewhere.
Latent trait models are suitable when:
In the model below, we have 5 manifest variables which are binary (coded 1 or 0):
These models are also referred to as item response models and can be estimated using the irt commands in Stata (see below).
In this model there is one continuous latent variable; the subject groupings share a common discrimination parameter and differ only in their degree of difficulty.
irt 1pl english maths science humanity othersub
Fitting fixed-effects model:
Iteration 0: Log likelihood = -47945.098
Iteration 1: Log likelihood = -47874.642
Iteration 2: Log likelihood = -47874.601
Iteration 3: Log likelihood = -47874.601

Fitting full model:
Iteration 0: Log likelihood = -42056.32
Iteration 1: Log likelihood = -38565.704
Iteration 2: Log likelihood = -38523.623
Iteration 3: Log likelihood = -38523.615
Iteration 4: Log likelihood = -38523.621
Iteration 5: Log likelihood = -38523.621

One-parameter logistic model                    Number of obs = 14,720
Log likelihood = -38523.621
------------------------------------------------------------------------------
             | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
     Discrim |   2.564579   .0312275    82.13   0.000     2.503375    2.625784
-------------+----------------------------------------------------------------
english      |
        Diff |  -.6775558   .0134272   -50.46   0.000    -.7038725     -.651239
-------------+----------------------------------------------------------------
maths        |
        Diff |  -.2063378   .0121772   -16.94   0.000    -.2302046    -.1824711
-------------+----------------------------------------------------------------
science      |
        Diff |  -.1118938   .0121377    -9.22   0.000    -.1356833    -.0881043
-------------+----------------------------------------------------------------
humanity     |
        Diff |   -.397438   .0124792   -31.85   0.000    -.4218968    -.3729792
-------------+----------------------------------------------------------------
othersub     |
        Diff |   -.629697   .0132248   -47.61   0.000     -.655617    -.6037769
------------------------------------------------------------------------------
In this example, the indicators are the binary categories for whether a young person gains an A-C GCSE pass or not for a given subject grouping. The latent variable is a continuous normal distribution. The cumulative distribution function of the normal distribution helps us to understand the overall distribution of "attainment". The individual subject groupings then have varying degrees of difficulty (i.e. the proportion of the sample at a given point on the overall latent distribution that can pass the subject).
We can use the following command to show the order of difficulty of these subjects.
estat report, sort(b) byparm
One-parameter logistic model                    Number of obs = 14,720
Log likelihood = -38523.621
------------------------------------------------------------------------------
             | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
     Discrim |   2.564579   .0312275    82.13   0.000     2.503375    2.625784
-------------+----------------------------------------------------------------
Diff         |
     english |  -.6775558   .0134272   -50.46   0.000    -.7038725     -.651239
    othersub |   -.629697   .0132248   -47.61   0.000     -.655617    -.6037769
    humanity |   -.397438   .0124792   -31.85   0.000    -.4218968    -.3729792
       maths |  -.2063378   .0121772   -16.94   0.000    -.2302046    -.1824711
     science |  -.1118938   .0121377    -9.22   0.000    -.1356833    -.0881043
------------------------------------------------------------------------------
According to this model, the subjects in order of difficulty from easiest to hardest are:

1. English
2. Other subjects
3. Humanities
4. Maths
5. Science
The parallel item characteristic curves are shown below. The plot appears very similar to one for an ordinal logistic regression model. The x-axis reports theta, the continuous latent variable. The probability of success differs across subjects: English is furthest to the left, suggesting that as we move along the latent "attainment" distribution (theta), it is the first subject that students tend to pass.
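Each curve is the 1PL response function P(pass) = invlogit(a(theta - b)), with a common discrimination a and item-specific difficulty b. A small Python sketch of this function, using the estimates above (note that at theta equal to an item's difficulty, the pass probability is exactly 0.5):

```python
import math

a = 2.564579  # common discrimination from the 1PL output
diff = {"english": -0.6775558, "science": -0.1118938}  # item difficulties

def icc(theta, b):
    """1PL probability of an A-C pass at latent attainment theta."""
    return 1 / (1 + math.exp(-a * (theta - b)))

# At theta equal to an item's difficulty, the pass probability is exactly 0.5
print(icc(diff["english"], diff["english"]))  # 0.5
# At theta = 0, the easier item (English) has the higher pass probability
print(icc(0, diff["english"]) > icc(0, diff["science"]))  # True
```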
%set user_graph_keywords irtgraph
irtgraph icc
graph export latent_trait1.png, as(png) replace
file C:/Users/cjp229/.stata_kernel_cache/graph0.svg saved as SVG format
file C:/Users/cjp229/.stata_kernel_cache/graph0.pdf saved as PDF format
file latent_trait1.png saved as PNG format
Now we fit a model where both discrimination and difficulty are allowed to vary across subject groupings.
irt 2pl english maths science humanity othersub
Fitting fixed-effects model:
Iteration 0: Log likelihood = -47945.098
Iteration 1: Log likelihood = -47874.642
Iteration 2: Log likelihood = -47874.601
Iteration 3: Log likelihood = -47874.601

Fitting full model:
Iteration 0: Log likelihood = -42047.203
Iteration 1: Log likelihood = -38418.105
Iteration 2: Log likelihood = -38325.25
Iteration 3: Log likelihood = -38321.816
Iteration 4: Log likelihood = -38321.892
Iteration 5: Log likelihood = -38321.899
Iteration 6: Log likelihood = -38321.9

Two-parameter logistic model                    Number of obs = 14,720
Log likelihood = -38321.9
------------------------------------------------------------------------------
             | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
english      |
     Discrim |   2.533943   .0647634    39.13   0.000     2.407009    2.660877
        Diff |   -.680093    .014198   -47.90   0.000    -.7079206    -.6522653
-------------+----------------------------------------------------------------
maths        |
     Discrim |    3.51309   .1065428    32.97   0.000      3.30427     3.72191
        Diff |  -.1940406   .0109287   -17.76   0.000    -.2154604    -.1726208
-------------+----------------------------------------------------------------
science      |
     Discrim |   2.791897    .071835    38.87   0.000     2.651103    2.932691
        Diff |  -.1119334   .0116335    -9.62   0.000    -.1347347    -.0891322
-------------+----------------------------------------------------------------
humanity     |
     Discrim |   2.779038   .0716841    38.77   0.000      2.63854    2.919537
        Diff |   -.389163    .012209   -31.88   0.000    -.4130921    -.3652338
-------------+----------------------------------------------------------------
othersub     |
     Discrim |    1.77048   .0425716    41.59   0.000     1.687041    1.853918
        Diff |  -.7238225   .0169885   -42.61   0.000    -.7571193    -.6905256
------------------------------------------------------------------------------
estat report, sort(b) byparm
Two-parameter logistic model                    Number of obs = 14,720
Log likelihood = -38321.9
------------------------------------------------------------------------------
             | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
Discrim      |
    othersub |    1.77048   .0425716    41.59   0.000     1.687041    1.853918
     english |   2.533943   .0647634    39.13   0.000     2.407009    2.660877
    humanity |   2.779038   .0716841    38.77   0.000      2.63854    2.919537
       maths |    3.51309   .1065428    32.97   0.000      3.30427     3.72191
     science |   2.791897    .071835    38.87   0.000     2.651103    2.932691
-------------+----------------------------------------------------------------
Diff         |
    othersub |  -.7238225   .0169885   -42.61   0.000    -.7571193    -.6905256
     english |   -.680093    .014198   -47.90   0.000    -.7079206    -.6522653
    humanity |   -.389163    .012209   -31.88   0.000    -.4130921    -.3652338
       maths |  -.1940406   .0109287   -17.76   0.000    -.2154604    -.1726208
     science |  -.1119334   .0116335    -9.62   0.000    -.1347347    -.0891322
------------------------------------------------------------------------------
In the chart below, the steeper the slope, the higher the discrimination of the item (i.e. GCSE subject grouping) — that is, how sharply a given indicator/item distinguishes between individuals at different points on the latent trait. For example, science has the highest level of difficulty (fewer people pass GCSE Science), while maths has the highest discrimination (it is the clearest differentiator between those with lower levels of attainment and those with higher levels).
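To see what a "steeper slope" means numerically, we can compare how much the pass probability changes over the same interval of theta for the most and least discriminating items (Maths, a = 3.51, versus Other subjects, a = 1.77). A short Python sketch using the 2PL estimates above:

```python
import math

def icc2pl(theta, a, b):
    """2PL probability of an A-C pass: invlogit(a * (theta - b))."""
    return 1 / (1 + math.exp(-a * (theta - b)))

maths = (3.51309, -0.1940406)      # (discrimination, difficulty) from 2PL output
othersub = (1.77048, -0.7238225)

# Change in pass probability between theta = -0.5 and theta = +0.5
spread_maths = icc2pl(0.5, *maths) - icc2pl(-0.5, *maths)
spread_other = icc2pl(0.5, *othersub) - icc2pl(-0.5, *othersub)

print(spread_maths > spread_other)  # True: Maths separates individuals more sharply
```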
irtgraph icc
graph export latent_trait2.png, as(png) replace
file C:/Users/cjp229/.stata_kernel_cache/graph1.svg saved as SVG format
file C:/Users/cjp229/.stata_kernel_cache/graph1.pdf saved as PDF format
file latent_trait2.png saved as PNG format
Latent profile models are suitable when:
In the model below, we have 5 manifest variables which are scores (0 - 7):
We can estimate the pairwise correlation between each of the GCSE subject score variables (see below). This can be worthwhile when inspecting your data prior to fitting a model.
corr english_sc maths_sc science_sc humanity_sc othersub_sc
(obs=14,720)

             | englis~c maths_sc scienc~c humani~c others~c
-------------+---------------------------------------------
  english_sc |   1.0000
    maths_sc |   0.6108   1.0000
  science_sc |   0.5888   0.7228   1.0000
 humanity_sc |   0.6724   0.6442   0.6369   1.0000
 othersub_sc |   0.5767   0.5473   0.5309   0.5449   1.0000
For comparison with the 4 class latent class model, a 4 'class' latent profile model has been estimated below.
The model estimated using the gsem command (below) uses a Generalised Linear Latent and Mixed Modelling (GLLAMM) framework. The first part of the code specifies the 5 manifest variables (continuous GCSE subject outcomes). The lclass option then specifies how many 'classes' are to be estimated (4 in this example). The model reports a series of estimates:
As the multinomial logit scale for class membership can be a little difficult to interpret, Stata provides further commands to re-express the model parameters as probabilities. The conditional means for the latent 'classes' are also easier to interpret.
To look at the latent 'class' probabilities (prior probabilities), we run the following command.
gsem (english_sc maths_sc science_sc humanity_sc othersub_sc <- ), lclass(C 4)
est store class4score
Fitting class model:
Iteration 0: (class) log likelihood = -20406.253
Iteration 1: (class) log likelihood = -20406.253

Fitting outcome model:
Iteration 0: (outcome) log likelihood = -99358.426
Iteration 1: (outcome) log likelihood = -99358.426

Refining starting values:
Iteration 0: (EM) log likelihood = -121898.48
Iteration 1: (EM) log likelihood = -121839.92
Iteration 2: (EM) log likelihood = -121559.11
Iteration 3: (EM) log likelihood = -121304.62
Iteration 4: (EM) log likelihood = -121097.9
Iteration 5: (EM) log likelihood = -120930.46
Iteration 6: (EM) log likelihood = -120793.23
Iteration 7: (EM) log likelihood = -120679.59
Iteration 8: (EM) log likelihood = -120584.74
Iteration 9: (EM) log likelihood = -120505.05
Iteration 10: (EM) log likelihood = -120437.65
Iteration 11: (EM) log likelihood = -120380.23
Iteration 12: (EM) log likelihood = -120330.95
Iteration 13: (EM) log likelihood = -120288.31
Iteration 14: (EM) log likelihood = -120251.09
Iteration 15: (EM) log likelihood = -120218.35
Iteration 16: (EM) log likelihood = -120189.3
Iteration 17: (EM) log likelihood = -120163.32
Iteration 18: (EM) log likelihood = -120139.92
Iteration 19: (EM) log likelihood = -120118.69
Iteration 20: (EM) log likelihood = -120099.31
note: EM algorithm reached maximum iterations.
Fitting full model:
Iteration 0: Log likelihood = -115433.69
Iteration 1: Log likelihood = -115383.6
Iteration 2: Log likelihood = -115382.62
Iteration 3: Log likelihood = -115382.62

Generalized structural equation model                  Number of obs = 14,720
Log likelihood = -115382.62

 ( 1)  [/]var(e.english_sc)#1bn.C - [/]var(e.english_sc)#4.C = 0
 ( 2)  [/]var(e.english_sc)#2.C - [/]var(e.english_sc)#4.C = 0
 ( 3)  [/]var(e.english_sc)#3.C - [/]var(e.english_sc)#4.C = 0
 ( 4)  [/]var(e.maths_sc)#1bn.C - [/]var(e.maths_sc)#4.C = 0
 ( 5)  [/]var(e.maths_sc)#2.C - [/]var(e.maths_sc)#4.C = 0
 ( 6)  [/]var(e.maths_sc)#3.C - [/]var(e.maths_sc)#4.C = 0
 ( 7)  [/]var(e.science_sc)#1bn.C - [/]var(e.science_sc)#4.C = 0
 ( 8)  [/]var(e.science_sc)#2.C - [/]var(e.science_sc)#4.C = 0
 ( 9)  [/]var(e.science_sc)#3.C - [/]var(e.science_sc)#4.C = 0
 (10)  [/]var(e.humanity_sc)#1bn.C - [/]var(e.humanity_sc)#4.C = 0
 (11)  [/]var(e.humanity_sc)#2.C - [/]var(e.humanity_sc)#4.C = 0
 (12)  [/]var(e.humanity_sc)#3.C - [/]var(e.humanity_sc)#4.C = 0
 (13)  [/]var(e.othersub_sc)#1bn.C - [/]var(e.othersub_sc)#4.C = 0
 (14)  [/]var(e.othersub_sc)#2.C - [/]var(e.othersub_sc)#4.C = 0
 (15)  [/]var(e.othersub_sc)#3.C - [/]var(e.othersub_sc)#4.C = 0

------------------------------------------------------------------------------
             | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
1.C          |  (base outcome)
-------------+----------------------------------------------------------------
2.C          |
       _cons |   1.141388   .0478518    23.85   0.000       1.0476   1.235176
-------------+----------------------------------------------------------------
3.C          |
       _cons |   1.432306   .0521877    27.45   0.000      1.33002   1.534592
-------------+----------------------------------------------------------------
4.C          |
       _cons |   1.098932   .0563038    19.52   0.000     .9885781   1.209285
------------------------------------------------------------------------------

Class: 1
Response: english_sc     Family: Gaussian   Link: Identity
Response: maths_sc       Family: Gaussian   Link: Identity
Response: science_sc     Family: Gaussian   Link: Identity
Response: humanity_sc    Family: Gaussian   Link: Identity
Response: othersub_sc    Family: Gaussian   Link: Identity

-------------------------------------------------------------------------------
              | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
--------------+----------------------------------------------------------------
english_sc    |
        _cons |   3.140857   .0323312    97.15   0.000     3.077489   3.204225
--------------+----------------------------------------------------------------
maths_sc      |
        _cons |   1.961922   .0356103    55.09   0.000     1.892127   2.031717
--------------+----------------------------------------------------------------
science_sc    |
        _cons |   2.151475    .037378    57.56   0.000     2.078216   2.224735
--------------+----------------------------------------------------------------
humanity_sc   |
        _cons |     2.1179   .0396661    53.39   0.000     2.040156   2.195644
--------------+----------------------------------------------------------------
othersub_sc   |
        _cons |   3.122194   .0426271    73.24   0.000     3.038647   3.205742
--------------+----------------------------------------------------------------
var(e.engli~c)|    .671014   .0095365                      .6525807    .689968
var(e.maths~c)|   .8851617    .013933                      .8582705   .9128955
var(e.scien~c)|   .9186259    .013553                      .8924429   .9455771
var(e.human~c)|   .9064522    .014311                      .8788328   .9349397
var(e.other~c)|   1.231776   .0161764                      1.200476   1.263893
-------------------------------------------------------------------------------

Class: 2
Response: english_sc     Family: Gaussian   Link: Identity
Response: maths_sc       Family: Gaussian   Link: Identity
Response: science_sc     Family: Gaussian   Link: Identity
Response: humanity_sc    Family: Gaussian   Link: Identity
Response: othersub_sc    Family: Gaussian   Link: Identity

-------------------------------------------------------------------------------
              | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
--------------+----------------------------------------------------------------
english_sc    |
        _cons |   4.308375    .021638   199.11   0.000     4.265965   4.350785
--------------+----------------------------------------------------------------
maths_sc      |
        _cons |   3.325618   .0286554   116.06   0.000     3.269454   3.381781
--------------+----------------------------------------------------------------
science_sc    |
        _cons |   3.451006   .0253821   135.96   0.000     3.401258   3.500754
--------------+----------------------------------------------------------------
humanity_sc   |
        _cons |   3.806207   .0296027   128.58   0.000     3.748187   3.864227
--------------+----------------------------------------------------------------
othersub_sc   |
        _cons |   4.339169   .0258624   167.78   0.000     4.288479   4.389858
--------------+----------------------------------------------------------------
var(e.engli~c)|    .671014   .0095365                      .6525807    .689968
var(e.maths~c)|   .8851617    .013933                      .8582705   .9128955
var(e.scien~c)|   .9186259    .013553                      .8924429   .9455771
var(e.human~c)|   .9064522    .014311                      .8788328   .9349397
var(e.other~c)|   1.231776   .0161764                      1.200476   1.263893
-------------------------------------------------------------------------------

Class: 3
Response: english_sc     Family: Gaussian   Link: Identity
Response: maths_sc       Family: Gaussian   Link: Identity
Response: science_sc     Family: Gaussian   Link: Identity
Response: humanity_sc    Family: Gaussian   Link: Identity
Response: othersub_sc    Family: Gaussian   Link: Identity

-------------------------------------------------------------------------------
              | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
--------------+----------------------------------------------------------------
english_sc    |
        _cons |    5.28238    .018694   282.57   0.000      5.24574    5.31902
--------------+----------------------------------------------------------------
maths_sc      |
        _cons |   4.773784   .0226449   210.81   0.000     4.729401   4.818167
--------------+----------------------------------------------------------------
science_sc    |
        _cons |   4.718916   .0231977   203.42   0.000     4.673449   4.764383
--------------+----------------------------------------------------------------
humanity_sc   |
        _cons |   5.211693   .0235324   221.47   0.000      5.16557   5.257815
--------------+----------------------------------------------------------------
othersub_sc   |
        _cons |   5.409425   .0230908   234.27   0.000     5.364168   5.454682
--------------+----------------------------------------------------------------
var(e.engli~c)|    .671014   .0095365                      .6525807    .689968
var(e.maths~c)|   .8851617    .013933                      .8582705   .9128955
var(e.scien~c)|   .9186259    .013553                      .8924429   .9455771
var(e.human~c)|   .9064522    .014311                      .8788328   .9349397
var(e.other~c)|   1.231776   .0161764                      1.200476   1.263893
-------------------------------------------------------------------------------

Class: 4
Response: english_sc     Family: Gaussian   Link: Identity
Response: maths_sc       Family: Gaussian   Link: Identity
Response: science_sc     Family: Gaussian   Link: Identity
Response: humanity_sc    Family: Gaussian   Link: Identity
Response: othersub_sc    Family: Gaussian   Link: Identity

-------------------------------------------------------------------------------
              | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
--------------+----------------------------------------------------------------
english_sc    |
        _cons |    6.33835   .0162578   389.86   0.000     6.306485   6.370215
--------------+----------------------------------------------------------------
maths_sc      |
        _cons |   6.119511   .0205543   297.72   0.000     6.079226   6.159797
--------------+----------------------------------------------------------------
science_sc    |
        _cons |   6.165085   .0207947   296.47   0.000     6.124328   6.205842
--------------+----------------------------------------------------------------
humanity_sc   |
        _cons |   6.439744   .0183939   350.10   0.000     6.403693   6.475795
--------------+----------------------------------------------------------------
othersub_sc   |
        _cons |   6.458281   .0203541   317.30   0.000     6.418388   6.498174
--------------+----------------------------------------------------------------
var(e.engli~c)|    .671014   .0095365                      .6525807    .689968
var(e.maths~c)|   .8851617    .013933                      .8582705   .9128955
var(e.scien~c)|   .9186259    .013553                      .8924429   .9455771
var(e.human~c)|   .9064522    .014311                      .8788328   .9349397
var(e.other~c)|   1.231776   .0161764                      1.200476   1.263893
-------------------------------------------------------------------------------
estat lcprob
Latent class marginal probabilities             Number of obs = 14,720

--------------------------------------------------------------
             |            Delta-method
             |     Margin   std. err.     [95% conf. interval]
-------------+------------------------------------------------
           C |
          1  |    .088336   .0039272      .0809356    .0963421
          2  |   .2765898   .0055229      .2658969    .2875442
          3  |   .3699817   .0057928      .3587016    .3814055
          4  |   .2650926   .0058544       .253778    .2767245
--------------------------------------------------------------
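These marginal probabilities are simply a softmax transformation of the multinomial-logit intercepts reported in the gsem output (class 1 is the base outcome with an intercept of 0). As a quick illustrative check (my own Python, not part of the Stata workflow), we can recover the estat lcprob margins from the intercepts:

```python
import math

# Intercepts copied from the gsem class model; class 1 is the base outcome (0)
cons = [0.0, 1.141388, 1.432306, 1.098932]

# Softmax: exponentiate each intercept and normalise by the total
exps = [math.exp(b) for b in cons]
probs = [e / sum(exps) for e in exps]

# probs ≈ [0.0883, 0.2766, 0.3700, 0.2651], matching the margins above
```

This also explains why the class with the largest intercept (class 3, 1.432306) is the most common profile.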
Now let's look at the conditional means for the manifest (observed) variables. These are the mean scores in each GCSE subject group given membership of a particular latent 'class'.
estat lcmean
Latent class marginal means                     Number of obs = 14,720

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
1            |
  english_sc |   3.140857   .0323312    97.15   0.000     3.077489   3.204225
    maths_sc |   1.961922   .0356103    55.09   0.000     1.892127   2.031717
  science_sc |   2.151475    .037378    57.56   0.000     2.078216   2.224735
 humanity_sc |     2.1179   .0396661    53.39   0.000     2.040156   2.195644
 othersub_sc |   3.122194   .0426271    73.24   0.000     3.038647   3.205742
-------------+----------------------------------------------------------------
2            |
  english_sc |   4.308375    .021638   199.11   0.000     4.265965   4.350785
    maths_sc |   3.325618   .0286554   116.06   0.000     3.269454   3.381781
  science_sc |   3.451006   .0253821   135.96   0.000     3.401258   3.500754
 humanity_sc |   3.806207   .0296027   128.58   0.000     3.748187   3.864227
 othersub_sc |   4.339169   .0258624   167.78   0.000     4.288479   4.389858
-------------+----------------------------------------------------------------
3            |
  english_sc |    5.28238    .018694   282.57   0.000      5.24574    5.31902
    maths_sc |   4.773784   .0226449   210.81   0.000     4.729401   4.818167
  science_sc |   4.718916   .0231977   203.42   0.000     4.673449   4.764383
 humanity_sc |   5.211693   .0235324   221.47   0.000      5.16557   5.257815
 othersub_sc |   5.409425   .0230908   234.27   0.000     5.364168   5.454682
-------------+----------------------------------------------------------------
4            |
  english_sc |    6.33835   .0162578   389.86   0.000     6.306485   6.370215
    maths_sc |   6.119511   .0205543   297.72   0.000     6.079226   6.159797
  science_sc |   6.165085   .0207947   296.47   0.000     6.124328   6.205842
 humanity_sc |   6.439744   .0183939   350.10   0.000     6.403693   6.475795
 othersub_sc |   6.458281   .0203541   317.30   0.000     6.418388   6.498174
------------------------------------------------------------------------------
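Given the class probabilities, class-specific means and (class-invariant) variances, Bayes' rule assigns posterior class-membership probabilities to any response pattern. The sketch below is my own illustration using estimates copied from the output above, applied to a hypothetical pupil; it is not part of the Stata workflow (Stata's predict with classposteriorpr does this for the estimation sample).

```python
import math

# Estimates copied from the model output above
pi = [0.088336, 0.2765898, 0.3699817, 0.2650926]   # class probabilities
mu = [                                             # class-specific means
    [3.140857, 1.961922, 2.151475, 2.117900, 3.122194],
    [4.308375, 3.325618, 3.451006, 3.806207, 4.339169],
    [5.282380, 4.773784, 4.718916, 5.211693, 5.409425],
    [6.338350, 6.119511, 6.165085, 6.439744, 6.458281],
]
var = [0.671014, 0.8851617, 0.9186259, 0.9064522, 1.231776]

def posterior(scores):
    """Posterior class-membership probabilities for one pupil (Bayes' rule):
    prior probability times the joint normal density, then normalised."""
    joint = []
    for c in range(len(pi)):
        dens = pi[c]
        for j, y in enumerate(scores):
            dens *= math.exp(-(y - mu[c][j]) ** 2 / (2 * var[j])) \
                    / math.sqrt(2 * math.pi * var[j])
        joint.append(dens)
    total = sum(joint)
    return [d / total for d in joint]

# A hypothetical pupil with uniformly high scores is assigned to class 4
post = posterior([6.3, 6.1, 6.2, 6.4, 6.5])
```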
Although the gsem command refers to them as 'classes', because the manifest variables are continuous we might also call them 'profiles'.
To plot the different profiles, we can use the margins command. The margins are estimated below with the quietly option to suppress the output, stored as separate sets of estimates, and then used in the graph below.
estimates restore class4score
quietly: margins, predict(outcome(english_sc) class(1)) ///
predict(outcome(maths_sc) class(1)) ///
predict(outcome(science_sc) class(1)) ///
predict(outcome(humanity_sc) class(1)) ///
predict(outcome(othersub_sc) class(1)) post
est store lpa4_1
(results class4score are active now)
estimates restore class4score
quietly: margins, predict(outcome(english_sc) class(2)) ///
predict(outcome(maths_sc) class(2)) ///
predict(outcome(science_sc) class(2)) ///
predict(outcome(humanity_sc) class(2)) ///
predict(outcome(othersub_sc) class(2)) post
est store lpa4_2
(results class4score are active now)
estimates restore class4score
quietly: margins, predict(outcome(english_sc) class(3)) ///
predict(outcome(maths_sc) class(3)) ///
predict(outcome(science_sc) class(3)) ///
predict(outcome(humanity_sc) class(3)) ///
predict(outcome(othersub_sc) class(3)) post
est store lpa4_3
(results class4score are active now)
estimates restore class4score
quietly: margins, predict(outcome(english_sc) class(4)) ///
predict(outcome(maths_sc) class(4)) ///
predict(outcome(science_sc) class(4)) ///
predict(outcome(humanity_sc) class(4)) ///
predict(outcome(othersub_sc) class(4)) post
est store lpa4_4
(results class4score are active now)
set scheme s1color
coefplot lpa4_1 ///
lpa4_2 ///
lpa4_3 ///
lpa4_4, connect(l) vertical ///
xlabel(1 "English" ///
2 "Maths" ///
3 "Science" ///
4 "Humanity" ///
5 "Other Subject", angle(45)) ///
ytitle("GCSE Points Score") ///
legend(order(2 "Latent Profile 1" ///
4 "Latent Profile 2" ///
6 "Latent Profile 3" ///
8 "Latent Profile 4" ))
graph export latent_profile1.png, as(png) replace
file latent_profile1.png saved as PNG format
The use of mean scores in the latent profile model presents a different perspective from the latent class model.
Mean scores will obscure patterns of response that would be visible using a latent class model. The analyst therefore has to decide what their research question is and which measures help them to answer it best.
Factor analysis is suitable when:
In the model below, we have 5 manifest variables which are scores (0 - 7):
Now we can fit a factor analysis with a continuous latent variable and a continuous set of manifest variables.
factor english_sc maths_sc science_sc humanity_sc othersub_sc
(obs=14,720)

Factor analysis/correlation                 Number of obs    = 14,720
Method: principal factors                   Retained factors =      2
Rotation: (unrotated)                       Number of params =      9

--------------------------------------------------------------------------
     Factor  |   Eigenvalue   Difference        Proportion   Cumulative
-------------+------------------------------------------------------------
    Factor1  |      2.98712      2.93301            1.0952       1.0952
    Factor2  |      0.05411      0.11042            0.0198       1.1150
    Factor3  |     -0.05631      0.06717           -0.0206       1.0943
    Factor4  |     -0.12348      0.01037           -0.0453       1.0491
    Factor5  |     -0.13385            .           -0.0491       1.0000
--------------------------------------------------------------------------
LR test: independent vs. saturated: chi2(10) = 3.9e+04 Prob>chi2 = 0.0000

Factor loadings (pattern matrix) and unique variances
-------------------------------------------------
    Variable |  Factor1   Factor2 |   Uniqueness
-------------+--------------------+--------------
  english_sc |   0.7750    0.1205 |       0.3849
    maths_sc |   0.8114   -0.1151 |       0.3283
  science_sc |   0.7959   -0.1277 |       0.3502
 humanity_sc |   0.7953    0.0586 |       0.3640
 othersub_sc |   0.6797    0.0811 |       0.5314
-------------------------------------------------
Two factors were identified. Factor 1 appears to correspond to "overall attainment". Factor 2 distinguishes between subject areas: it is positively loaded for English, Humanities and Other Subjects but negatively loaded for Maths and Science. This broadly fits with the findings from the 4-class latent class model.
estat factors
Factor analysis with different numbers of factors (maximum likelihood)

----------------------------------------------------------
#factors |    loglik     df_m   df_r       AIC        BIC
---------+------------------------------------------------
       1 |  -560.179        5      5  1130.358   1168.343
       2 | -1.448049        9      1   20.8961   89.26876
----------------------------------------------------------
no Heywood cases encountered
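The AIC and BIC columns follow the usual definitions, AIC = -2 log L + 2 df and BIC = -2 log L + df ln N. As an illustrative check (my own Python, not part of the Stata workflow), the one-factor row can be reproduced from its log-likelihood with N = 14,720:

```python
import math

def aic(loglik, df_m):
    """Akaike information criterion: -2 log-likelihood + 2 * model df."""
    return -2 * loglik + 2 * df_m

def bic(loglik, df_m, n):
    """Bayesian information criterion: -2 log-likelihood + model df * ln(N)."""
    return -2 * loglik + df_m * math.log(n)

# One-factor model from the table above:
# aic(-560.179, 5)        → 1130.358
# bic(-560.179, 5, 14720) → 1168.343
```

On both criteria the two-factor model is preferred here (smaller is better).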
estat structure
Structure matrix: correlations between variables and common factors

----------------------------------
    Variable |  Factor1   Factor2
-------------+--------------------
  english_sc |   0.7750    0.1205
    maths_sc |   0.8114   -0.1151
  science_sc |   0.7959   -0.1277
 humanity_sc |   0.7953    0.0586
 othersub_sc |   0.6797    0.0811
----------------------------------
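The loadings tie the pieces of the factor output together: each variable's uniqueness is 1 minus its communality (the sum of its squared loadings), and the model-implied correlation between two variables is the sum of the products of their loadings. A quick illustrative check in Python (my own, using loadings copied from the table above):

```python
# (Factor1, Factor2) loadings copied from the structure matrix above
loadings = {
    "english_sc":  (0.7750,  0.1205),
    "maths_sc":    (0.8114, -0.1151),
    "science_sc":  (0.7959, -0.1277),
    "humanity_sc": (0.7953,  0.0586),
    "othersub_sc": (0.6797,  0.0811),
}

def uniqueness(name):
    """1 - communality: the variance not explained by the common factors."""
    l1, l2 = loadings[name]
    return 1 - (l1 ** 2 + l2 ** 2)

def implied_corr(a, b):
    """Model-implied correlation between two manifest variables."""
    return sum(x * y for x, y in zip(loadings[a], loadings[b]))

# uniqueness("english_sc") ≈ 0.3849, matching the Uniqueness column;
# implied_corr("english_sc", "maths_sc") ≈ 0.615
```

Note how the opposite-signed Factor 2 loadings slightly pull down the implied English-Maths correlation relative to Factor 1 alone.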
The workbook above provides an illustration of the 4 main types of latent variable model using the same underlying data. Each model provides a slightly different 'view' of the association between manifest variables. Researchers should consider carefully how they conceptualise the latent variable(s) and how the manifest variables are measured.