NCRM Latent Variables Online Resource

Dr Chris Playford

This Jupyter notebook has been constructed as an online resource on Latent Variables for the National Centre for Research Methods.

The workbook demonstrates the estimation of 4 different types of latent variable model:

  • Latent Class Analysis
  • Latent Trait Analysis
  • Latent Profile Analysis
  • Factor Analysis

Data

The dataset includes information on attainment in General Certificate of Secondary Education (GCSE) qualifications among young people aged 15-16 in England and Wales in 1992. It was collected as part of the Youth Cohort Study (YCS), a series of surveys exploring the education and backgrounds of young people.

Courtenay, G. (1996). Youth Cohort Study of England and Wales, 1992-1994; Cohort Six, Sweep One to Three. [data collection]. UK Data Service. SN: 3532, DOI: http://doi.org/10.5255/UKDA-SN-3532-1

In [1]:
* Open the dataset

use ncrm_latent_vars_online_resource_ycs1992.dta, clear

GCSE Qualifications

Pupils in England and Wales study GCSE qualifications in a range of different subjects. GCSE subjects are assessed separately and a subject-specific GCSE is awarded. Each GCSE subject is awarded a grade, historically the highest being grade A and the lowest grade G. The grade A* was introduced in 1994, after the data presented in this notebook were collected. For more details on the application of latent class models to GCSE outcomes, see Playford and Gayle (2016).

Playford, C. J., and Gayle, V. (2016). The concealed middle? An exploration of ordinary young people and school GCSE subject area attainment. Journal of Youth Studies, 19(2), 149-168. https://doi.org/10.1080/13676261.2015.1052049

For simplicity, the individual GCSE subjects have been grouped into 5 broad subject groupings. The grade reported is the highest grade attained in each subject grouping.

The variables we will be looking at are the highest GCSE grade attained in the following subject groupings:

  • English
  • Mathematics
  • Science
  • Humanities
  • Other Subjects

There are two versions of these variables described in the table below.

  1. A GCSE score from A = 7 to U = 0 points
  2. A GCSE A-C binary indicator
    • 0 indicates a student gained a D-U grade in the subject grouping
    • 1 indicates an A-C grade
GCSE Grade | GCSE Points Score | GCSE A-C Indicator
-----------+-------------------+-------------------
     U     |         0         |          0
     G     |         1         |          0
     F     |         2         |          0
     E     |         3         |          0
     D     |         4         |          0
     C     |         5         |          1
     B     |         6         |          1
     A     |         7         |          1
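
As a quick check of this coding, the binary indicator can be reproduced from the points score. A minimal sketch (english_check is a temporary variable created only for this illustration):

    * an A-C pass corresponds to a points score of 5 or more
    generate english_check = (english_sc >= 5)
    assert english_check == english
    drop english_check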

The distribution of the variables is described below.

In [2]:
tab english_sc english, missing
   Highest |
  Score in |
      GCSE | Highest Grade in GCSE
   English |   English subjects
  subjects |         0          1 |     Total
-----------+----------------------+----------
         0 |        27          0 |        27 
         1 |        50          0 |        50 
         2 |       279          0 |       279 
         3 |     1,238          0 |     1,238 
         4 |     2,666          0 |     2,666 
         5 |         0      4,768 |     4,768 
         6 |         0      3,543 |     3,543 
         7 |         0      2,149 |     2,149 
-----------+----------------------+----------
     Total |     4,260     10,460 |    14,720 
In [3]:
tab maths_sc maths, missing
   Highest |
  Score in | Highest Grade in GCSE
GCSE Maths |    Maths subjects
  subjects |         0          1 |     Total
-----------+----------------------+----------
         0 |       223          0 |       223 
         1 |       350          0 |       350 
         2 |     1,126          0 |     1,126 
         3 |     2,327          0 |     2,327 
         4 |     2,393          0 |     2,393 
         5 |         0      4,640 |     4,640 
         6 |         0      2,013 |     2,013 
         7 |         0      1,648 |     1,648 
-----------+----------------------+----------
     Total |     6,419      8,301 |    14,720 
In [4]:
tab science_sc science, missing
   Highest |
  Score in |
      GCSE | Highest Grade in GCSE
   Science |   Science subjects
  subjects |         0          1 |     Total
-----------+----------------------+----------
         0 |       115          0 |       115 
         1 |       383          0 |       383 
         2 |     1,048          0 |     1,048 
         3 |     2,195          0 |     2,195 
         4 |     3,137          0 |     3,137 
         5 |         0      3,776 |     3,776 
         6 |         0      2,349 |     2,349 
         7 |         0      1,717 |     1,717 
-----------+----------------------+----------
     Total |     6,878      7,842 |    14,720 
In [5]:
tab humanity_sc humanity, missing
   Highest |
  Score in |
      GCSE | Highest Grade in GCSE
  Humanity |   Humanity subjects
  subjects |         0          1 |     Total
-----------+----------------------+----------
         0 |       133          0 |       133 
         1 |       309          0 |       309 
         2 |       812          0 |       812 
         3 |     1,594          0 |     1,594 
         4 |     2,661          0 |     2,661 
         5 |         0      3,520 |     3,520 
         6 |         0      3,033 |     3,033 
         7 |         0      2,658 |     2,658 
-----------+----------------------+----------
     Total |     5,509      9,211 |    14,720 
In [6]:
tab othersub_sc othersub, missing
   Highest |
  Score in | Highest Grade in GCSE
GCSE Other |    Other subjects
  subjects |         0          1 |     Total
-----------+----------------------+----------
         0 |        56          0 |        56 
         1 |       165          0 |       165 
         2 |       555          0 |       555 
         3 |     1,209          0 |     1,209 
         4 |     2,479          0 |     2,479 
         5 |         0      3,499 |     3,499 
         6 |         0      3,224 |     3,224 
         7 |         0      3,533 |     3,533 
-----------+----------------------+----------
     Total |     4,464     10,256 |    14,720 

Latent Class Model

Latent class models are suitable when:

  • The latent variable(s) is categorical
  • The manifest variables are categorical

In the model below, we have 5 manifest variables which are binary (coded 1 or 0):

  • English
  • Mathematics
  • Science
  • Humanities
  • Other Subjects

Before fitting the latent class model, we can look at the correlation between pairs of binary indicators with a tetrachoric correlation.

For example, young people who gain an A-C pass in GCSE Maths subjects also tend to gain an A-C pass in Science subjects (0.7922).

In [7]:
tetrachoric english maths science humanity othersub
(obs=14,720)

             |  english    maths  science humanity othersub
-------------+---------------------------------------------
     english |   1.0000 
       maths |   0.7056   1.0000 
     science |   0.6413   0.7922   1.0000 
    humanity |   0.7334   0.7290   0.6995   1.0000 
    othersub |   0.6240   0.6152   0.5854   0.6082   1.0000 

It is helpful to explore all the possible combinations of GCSE subject group outcomes. In the table below, we can see that there are 32 combinations because we have 5 manifest variables with 2 possible outcomes for each variable (i.e. 2 ^ 5 = 32). Each of the 32 patterns is a unique combination of GCSE outcomes. For example, combination 11111 includes those young people who gain an A-C pass in all 5 GCSE subject groups.

In [8]:
egen gcse_att_pat = concat(english maths science humanity othersub), format(%9.0f)

label variable gcse_att_pat 	"GCSE subject outcome pattern"

tab gcse_att_pat, missing



       GCSE |
    subject |
    outcome |
    pattern |      Freq.     Percent        Cum.
------------+-----------------------------------
      00000 |      1,787       12.14       12.14
      00001 |        733        4.98       17.12
      00010 |        229        1.56       18.68
      00011 |        217        1.47       20.15
      00100 |        147        1.00       21.15
      00101 |        131        0.89       22.04
      00110 |         66        0.45       22.49
      00111 |        103        0.70       23.19
      01000 |        133        0.90       24.09
      01001 |        107        0.73       24.82
      01010 |         55        0.37       25.19
      01011 |        107        0.73       25.92
      01100 |         74        0.50       26.42
      01101 |        114        0.77       27.19
      01110 |         67        0.46       27.65
      01111 |        190        1.29       28.94
      10000 |        548        3.72       32.66
      10001 |        599        4.07       36.73
      10010 |        323        2.19       38.93
      10011 |        676        4.59       43.52
      10100 |         90        0.61       44.13
      10101 |        179        1.22       45.35
      10110 |        109        0.74       46.09
      10111 |        482        3.27       49.36
      11000 |        110        0.75       50.11
      11001 |        274        1.86       51.97
      11010 |        173        1.18       53.15
      11011 |        807        5.48       58.63
      11100 |        100        0.68       59.31
      11101 |        383        2.60       61.91
      11110 |        453        3.08       64.99
      11111 |      5,154       35.01      100.00
------------+-----------------------------------
      Total |     14,720      100.00

We can estimate latent class models to explore whether the 32 combinations might be represented by a latent variable with a smaller number of categories or classes.

The number of latent classes that the data can support is a function of the number of combinations available and the frequency of the particular combinations. It is normal for researchers to estimate several models, each with a different number of latent classes. The models can then be compared using goodness of fit statistics to examine which is the preferred model. For the purposes of illustration in this notebook, a 4 class model has been fitted (below) to show how this works in practice and how we might interpret the results.
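
A minimal sketch of that comparison workflow (not run in this notebook; it assumes the 4-class model has been stored as class4, as is done below):

    * fit candidate models with 2 and 3 classes and store the estimates
    gsem (english maths science humanity othersub <- ), logit lclass(C 2)
    estimates store class2
    gsem (english maths science humanity othersub <- ), logit lclass(C 3)
    estimates store class3

    * compare information criteria (AIC, BIC) across the stored models
    estimates stats class2 class3 class4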

 

NB: When using categorical indicators with Stata's gsem command, ensure that the binary outcome is coded 1 or 0 (not 1 or 2).
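
For example, a 1/2-coded indicator could be recoded first (myvar is a hypothetical variable name):

    * recode a 1/2-coded binary indicator to 1/0 before fitting the model
    recode myvar (2 = 0)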

In [9]:
gsem (english maths science humanity othersub <- ), logit lclass(C 4)

estimates store class4

Fitting class model:

Iteration 0:  (class) log likelihood = -20406.253  
Iteration 1:  (class) log likelihood = -20406.253  

Fitting outcome model:

Iteration 0:  (outcome) log likelihood = -25736.566  
Iteration 1:  (outcome) log likelihood = -24088.912  
Iteration 2:  (outcome) log likelihood =  -23889.89  
Iteration 3:  (outcome) log likelihood = -23840.072  
Iteration 4:  (outcome) log likelihood =  -23828.68  
Iteration 5:  (outcome) log likelihood = -23826.321  
Iteration 6:  (outcome) log likelihood =  -23825.83  
Iteration 7:  (outcome) log likelihood = -23825.716  
Iteration 8:  (outcome) log likelihood = -23825.689  
Iteration 9:  (outcome) log likelihood = -23825.683  
Iteration 10: (outcome) log likelihood = -23825.682  

Refining starting values:

Iteration 0:  (EM) log likelihood = -46955.636
Iteration 1:  (EM) log likelihood = -47061.504
Iteration 2:  (EM) log likelihood = -47118.702
Iteration 3:  (EM) log likelihood = -47150.853
Iteration 4:  (EM) log likelihood = -47170.055
Iteration 5:  (EM) log likelihood = -47182.406
Iteration 6:  (EM) log likelihood = -47190.959
Iteration 7:  (EM) log likelihood = -47197.268
Iteration 8:  (EM) log likelihood = -47202.149
Iteration 9:  (EM) log likelihood = -47206.052
Iteration 10: (EM) log likelihood = -47209.243
Iteration 11: (EM) log likelihood =  -47211.89
Iteration 12: (EM) log likelihood = -47214.105
Iteration 13: (EM) log likelihood = -47215.972
Iteration 14: (EM) log likelihood = -47217.551
Iteration 15: (EM) log likelihood = -47218.891
Iteration 16: (EM) log likelihood = -47220.032
Iteration 17: (EM) log likelihood = -47221.004
Iteration 18: (EM) log likelihood = -47221.833
Iteration 19: (EM) log likelihood = -47222.541
Iteration 20: (EM) log likelihood = -47223.147
note: EM algorithm reached maximum iterations.

Fitting full model:

Iteration 0:  Log likelihood = -38407.036  (not concave)
Iteration 1:  Log likelihood = -38356.743  (not concave)
Iteration 2:  Log likelihood = -38319.948  (not concave)
Iteration 3:  Log likelihood = -38308.467  (not concave)
Iteration 4:  Log likelihood = -38305.797  (not concave)
Iteration 5:  Log likelihood = -38304.484  (not concave)
Iteration 6:  Log likelihood = -38301.176  (not concave)
Iteration 7:  Log likelihood = -38299.958  (not concave)
Iteration 8:  Log likelihood = -38294.927  (not concave)
Iteration 9:  Log likelihood = -38292.965  (not concave)
Iteration 10: Log likelihood = -38288.061  (not concave)
Iteration 11: Log likelihood = -38279.074  (not concave)
Iteration 12: Log likelihood = -38273.615  (not concave)
Iteration 13: Log likelihood = -38273.109  (not concave)
Iteration 14: Log likelihood = -38268.851  (not concave)
Iteration 15: Log likelihood = -38259.636  (not concave)
Iteration 16: Log likelihood = -38256.432  (not concave)
Iteration 17: Log likelihood = -38253.041  (not concave)
Iteration 18: Log likelihood = -38249.164  (not concave)
Iteration 19: Log likelihood = -38245.711  (not concave)
Iteration 20: Log likelihood =  -38243.22  (not concave)
Iteration 21: Log likelihood = -38241.272  (not concave)
Iteration 22: Log likelihood = -38239.835  (not concave)
Iteration 23: Log likelihood = -38239.431  (not concave)
Iteration 24: Log likelihood = -38237.974  (not concave)
Iteration 25: Log likelihood = -38237.614  (not concave)
Iteration 26: Log likelihood = -38235.685  (not concave)
Iteration 27: Log likelihood = -38233.537  (not concave)
Iteration 28: Log likelihood = -38231.497  (not concave)
Iteration 29: Log likelihood = -38230.743  (not concave)
Iteration 30: Log likelihood = -38227.409  (not concave)
Iteration 31: Log likelihood = -38224.932  (not concave)
Iteration 32: Log likelihood = -38223.166  (not concave)
Iteration 33: Log likelihood = -38221.619  (not concave)
Iteration 34: Log likelihood = -38221.466  (not concave)
Iteration 35: Log likelihood = -38221.416  
Iteration 36: Log likelihood = -38219.494  
Iteration 37: Log likelihood = -38216.388  
Iteration 38: Log likelihood = -38215.387  
Iteration 39: Log likelihood = -38215.158  
Iteration 40: Log likelihood = -38215.047  
Iteration 41: Log likelihood = -38214.959  
Iteration 42: Log likelihood = -38214.957  
Iteration 43: Log likelihood = -38214.957  

Generalized structural equation model                   Number of obs = 14,720
Log likelihood = -38214.957

------------------------------------------------------------------------------
             | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
1.C          |  (base outcome)
-------------+----------------------------------------------------------------
2.C          |
       _cons |  -.0573257   .1516287    -0.38   0.705    -.3545126    .2398611
-------------+----------------------------------------------------------------
3.C          |
       _cons |  -.8434204   .3248152    -2.60   0.009    -1.480046   -.2067944
-------------+----------------------------------------------------------------
4.C          |
       _cons |   .6991554   .0590171    11.85   0.000      .583484    .8148268
------------------------------------------------------------------------------

Class:    1        

Response: english  
Family:   Bernoulli
Link:     Logit    

Response: maths    
Family:   Bernoulli
Link:     Logit    

Response: science  
Family:   Bernoulli
Link:     Logit    

Response: humanity 
Family:   Bernoulli
Link:     Logit    

Response: othersub 
Family:   Bernoulli
Link:     Logit    

------------------------------------------------------------------------------
             | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
english      |
       _cons |  -1.635319   .1144926   -14.28   0.000     -1.85972   -1.410918
-------------+----------------------------------------------------------------
maths        |
       _cons |  -2.954779   .1736526   -17.02   0.000    -3.295132   -2.614426
-------------+----------------------------------------------------------------
science      |
       _cons |  -2.982875   .2312944   -12.90   0.000    -3.436204   -2.529546
-------------+----------------------------------------------------------------
humanity     |
       _cons |  -2.409235   .1329394   -18.12   0.000    -2.669791   -2.148678
-------------+----------------------------------------------------------------
othersub     |
       _cons |  -1.054667   .0614057   -17.18   0.000     -1.17502   -.9343142
------------------------------------------------------------------------------

Class:    2        

Response: english  
Family:   Bernoulli
Link:     Logit    

Response: maths    
Family:   Bernoulli
Link:     Logit    

Response: science  
Family:   Bernoulli
Link:     Logit    

Response: humanity 
Family:   Bernoulli
Link:     Logit    

Response: othersub 
Family:   Bernoulli
Link:     Logit    

------------------------------------------------------------------------------
             | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
english      |
       _cons |   1.475149   .2056347     7.17   0.000     1.072112    1.878186
-------------+----------------------------------------------------------------
maths        |
       _cons |  -.8149979   .1435114    -5.68   0.000    -1.096275   -.5337207
-------------+----------------------------------------------------------------
science      |
       _cons |  -1.812117   .4947032    -3.66   0.000    -2.781717   -.8425163
-------------+----------------------------------------------------------------
humanity     |
       _cons |    .315028   .0875916     3.60   0.000     .1433515    .4867044
-------------+----------------------------------------------------------------
othersub     |
       _cons |   .8563969   .0830042    10.32   0.000     .6937117    1.019082
------------------------------------------------------------------------------

Class:    3        

Response: english  
Family:   Bernoulli
Link:     Logit    

Response: maths    
Family:   Bernoulli
Link:     Logit    

Response: science  
Family:   Bernoulli
Link:     Logit    

Response: humanity 
Family:   Bernoulli
Link:     Logit    

Response: othersub 
Family:   Bernoulli
Link:     Logit    

------------------------------------------------------------------------------
             | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
english      |
       _cons |  -.0653913   .3328607    -0.20   0.844    -.7177862    .5870036
-------------+----------------------------------------------------------------
maths        |
       _cons |   .2295466   .2064148     1.11   0.266    -.1750191    .6341122
-------------+----------------------------------------------------------------
science      |
       _cons |   1.163136    .646758     1.80   0.072    -.1044864    2.430758
-------------+----------------------------------------------------------------
humanity     |
       _cons |  -.1199283   .1953333    -0.61   0.539    -.5027746     .262918
-------------+----------------------------------------------------------------
othersub     |
       _cons |   .4703479   .1720786     2.73   0.006       .13308    .8076158
------------------------------------------------------------------------------

Class:    4        

Response: english  
Family:   Bernoulli
Link:     Logit    

Response: maths    
Family:   Bernoulli
Link:     Logit    

Response: science  
Family:   Bernoulli
Link:     Logit    

Response: humanity 
Family:   Bernoulli
Link:     Logit    

Response: othersub 
Family:   Bernoulli
Link:     Logit    

------------------------------------------------------------------------------
             | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
english      |
       _cons |   4.044277   .2958173    13.67   0.000     3.464486    4.624068
-------------+----------------------------------------------------------------
maths        |
       _cons |   2.782206   .1169147    23.80   0.000     2.553057    3.011354
-------------+----------------------------------------------------------------
science      |
       _cons |   2.297907   .0958555    23.97   0.000     2.110034    2.485781
-------------+----------------------------------------------------------------
humanity     |
       _cons |    2.97444    .130719    22.75   0.000     2.718235    3.230644
-------------+----------------------------------------------------------------
othersub     |
       _cons |   2.580313   .0697189    37.01   0.000     2.443666    2.716959
------------------------------------------------------------------------------

The model estimated using the gsem command (above) uses a logistic regression model framework. The first part of the code specifies the 5 manifest variables (binary GCSE subject outcomes). The lclass option then specifies how many classes are to be estimated (4 in this example). The model reports a series of estimates:

  • The first part shows the log odds of latent class membership (compared to a base class, similar to a multinomial logistic regression model)
  • The second part shows, for members of a particular latent class, the conditional log odds of a positive response for each of the manifest variables (i.e. GCSE subject outcomes)

As the log odds scale can be a little difficult to interpret, Stata provides further commands to re-express the model parameters as probabilities.

To look at the latent class probabilities (prior probabilities), we run the following command.

In [10]:
estat lcprob
Latent class marginal probabilities                     Number of obs = 14,720

--------------------------------------------------------------
             |            Delta-method
             |     Margin   std. err.     [95% conf. interval]
-------------+------------------------------------------------
           C |
          1  |   .2279683   .0095253      .2098391    .2471737
          2  |   .2152673   .0307114      .1611375    .2814792
          3  |   .0980802   .0314894      .0513473    .1793073
          4  |   .4586841   .0109111      .4373878    .4801325
--------------------------------------------------------------

In the example above, we can see that 4 latent classes have been estimated. The Margin column reports the (prior) probability that a randomly selected individual belongs to each latent class.

We can see that the probabilities of class membership are approximately:

  • 23% for class 1
  • 22% for class 2
  • 10% for class 3
  • 46% for class 4

Now let's look at the conditional probabilities for the manifest (observed) variables. These are the probabilities of gaining an A-C pass in each GCSE subject given membership of a particular latent class.

In [11]:
estat lcmean
Latent class marginal means                             Number of obs = 14,720

--------------------------------------------------------------
             |            Delta-method
             |     Margin   std. err.     [95% conf. interval]
-------------+------------------------------------------------
1            |
     english |    .163103   .0156283      .1347356    .1960893
       maths |   .0495111   .0081721      .0357386    .0682157
     science |   .0482055   .0106122       .031183    .0738127
    humanity |   .0824712   .0100595      .0647796    .1044548
    othersub |   .2583299   .0117651      .2359488    .2820503
-------------+------------------------------------------------
2            |
     english |   .8138387   .0311547      .7449984    .8674026
       maths |   .3068265   .0305226      .2504385    .3696495
     science |   .1403825   .0596984      .0583202    .3010051
    humanity |   .5781121   .0213635      .5357766    .6193298
    othersub |   .7019073   .0173672      .6667921    .7347938
-------------+------------------------------------------------
3            |
     english |    .483658   .0831263      .3278807    .6426773
       maths |    .557136   .0509299      .4563566    .6534213
     science |   .7619021   .1173266      .4739021    .9191429
    humanity |   .4700538   .0486582      .3768888    .5653535
    othersub |   .6154661   .0407254       .533221    .6916012
-------------+------------------------------------------------
4            |
     english |   .9827794   .0050064      .9696602    .9902826
       maths |   .9417066   .0064181      .9277786    .9530844
     science |   .9087036   .0079523      .8918746     .923139
    humanity |    .951406   .0060435      .9380941    .9619713
    othersub |   .9295837   .0045636       .920097      .93802
--------------------------------------------------------------

Class 1 is characterised by the lowest levels of overall attainment (the probability of gaining an A-C is 16% in GCSE English, 5% in GCSE Maths, 5% in GCSE Science, and so on).

Members of class 2 have higher levels of attainment in GCSE English (probability of an A-C is 81%) and GCSE Other Subjects (70%) but much lower probabilities of gaining an A-C pass in GCSE Maths (31%) or GCSE Science (14%).

Class 3 demonstrates the inverse pattern to the conditional probabilities reported by members of class 2. Members of class 3 have higher probabilities of gaining an A-C in GCSE Maths (56%) or GCSE Science (76%) but lower probabilities of gaining an A-C in GCSE English (48%) or GCSE Other Subjects (62%).

Class 4 has the highest levels of attainment, with the probability of gaining an A-C in each of the 5 GCSE subject groups being greater than 90%.

To see the overall goodness of fit for this model, we run the following command.

In [12]:
estat lcgof
----------------------------------------------------------------------------
Fit statistic        |      Value   Description
---------------------+------------------------------------------------------
Likelihood ratio     |
          chi2_ms(8) |      4.550   model vs. saturated
            p > chi2 |      0.804
---------------------+------------------------------------------------------
Information criteria |
                 AIC |  76475.913   Akaike's information criterion
                 BIC |  76650.643   Bayesian information criterion
----------------------------------------------------------------------------

Further options are available for comparing and reporting models:

  • Estimates from models can be stored and then several models can be compared using the estimates stats command
  • Probabilities from the models can be plotted using the margins command
  • The posterior probabilities of an individual being assigned to a particular latent class can be predicted, recorded as a new variable and then used to explore latent class assignment under different assumptions (there is an extended literature on three-step modelling approaches covering this); a minimal sketch follows this list
  • A single-step model can be estimated whereby the latent class model is estimated simultaneously with a structural model predicting class membership using a set of explanatory variables

These are beyond the scope of this workbook of examples, but are covered in more depth elsewhere.
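
As a brief pointer on the third option, a minimal sketch of predicting posterior class membership from the stored 4-class model (the pr1-pr4 and modal_class variable names are chosen here purely for illustration):

    * posterior class-membership probabilities, one variable per class
    estimates restore class4
    predict pr1 pr2 pr3 pr4, classposteriorpr

    * assign each young person to their modal (most likely) class
    egen maxpr = rowmax(pr1 pr2 pr3 pr4)
    generate modal_class = .
    forvalues k = 1/4 {
        replace modal_class = `k' if pr`k' == maxpr
    }

A single-step model would instead add an equation predicting C, along the lines of gsem (english maths science humanity othersub <- ) (C <- i.female), logit lclass(C 4), where i.female stands in for whatever explanatory variables are of interest (female is a hypothetical variable here).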

Latent Trait Model

Latent trait models are suitable when:

  • The latent variable(s) is continuous
  • The manifest variables are categorical

In the model below, we have 5 manifest variables which are binary (coded 1 or 0):

  • English
  • Mathematics
  • Science
  • Humanities
  • Other Subjects

These models are also referred to as item response models and can be estimated using the irt commands in Stata (see below).

In this one-parameter logistic (1PL) model there is a single continuous latent variable. The discrimination parameter is constrained to be equal across items, so only the degree of difficulty varies by GCSE subject grouping.

In [13]:
irt 1pl english maths science humanity othersub
Fitting fixed-effects model:

Iteration 0:  Log likelihood = -47945.098  
Iteration 1:  Log likelihood = -47874.642  
Iteration 2:  Log likelihood = -47874.601  
Iteration 3:  Log likelihood = -47874.601  

Fitting full model:

Iteration 0:  Log likelihood =  -42056.32  
Iteration 1:  Log likelihood = -38565.704  
Iteration 2:  Log likelihood = -38523.623  
Iteration 3:  Log likelihood = -38523.615  
Iteration 4:  Log likelihood = -38523.621  
Iteration 5:  Log likelihood = -38523.621  

One-parameter logistic model                            Number of obs = 14,720
Log likelihood = -38523.621
------------------------------------------------------------------------------
             | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
     Discrim |   2.564579   .0312275    82.13   0.000     2.503375    2.625784
-------------+----------------------------------------------------------------
english      |
        Diff |  -.6775558   .0134272   -50.46   0.000    -.7038725    -.651239
-------------+----------------------------------------------------------------
maths        |
        Diff |  -.2063378   .0121772   -16.94   0.000    -.2302046   -.1824711
-------------+----------------------------------------------------------------
science      |
        Diff |  -.1118938   .0121377    -9.22   0.000    -.1356833   -.0881043
-------------+----------------------------------------------------------------
humanity     |
        Diff |   -.397438   .0124792   -31.85   0.000    -.4218968   -.3729792
-------------+----------------------------------------------------------------
othersub     |
        Diff |   -.629697   .0132248   -47.61   0.000     -.655617   -.6037769
------------------------------------------------------------------------------

In this example, the indicators are the binary categories for whether a young person gains an A-C GCSE pass or not for a given subject grouping. The latent variable is a continuous normal distribution. The cumulative distribution function of the normal distribution helps us to understand the overall distribution of "attainment". The individual subject groupings then have varying degrees of difficulty (the difficulty parameter being the point on the latent distribution at which a young person has a 50% probability of passing the subject).
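
To make this concrete: under the 1PL model, the probability of a pass is invlogit(Discrim * (theta - Diff)). A quick worked example for a young person at the mean of the latent distribution (theta = 0), using the estimates above:

    * Pr(A-C in GCSE English | theta = 0) under the fitted 1PL model
    display invlogit(2.564579*(0 - (-.6775558)))
    * returns approximately 0.85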

We can use the following command to show the order of difficulty of these subjects.

In [14]:
estat report, sort(b) byparm
One-parameter logistic model                            Number of obs = 14,720
Log likelihood = -38523.621
------------------------------------------------------------------------------
             | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
     Discrim |   2.564579   .0312275    82.13   0.000     2.503375    2.625784
-------------+----------------------------------------------------------------
Diff         |
     english |  -.6775558   .0134272   -50.46   0.000    -.7038725    -.651239
    othersub |   -.629697   .0132248   -47.61   0.000     -.655617   -.6037769
    humanity |   -.397438   .0124792   -31.85   0.000    -.4218968   -.3729792
       maths |  -.2063378   .0121772   -16.94   0.000    -.2302046   -.1824711
     science |  -.1118938   .0121377    -9.22   0.000    -.1356833   -.0881043
------------------------------------------------------------------------------

According to this model, the subjects in order of difficulty from easiest to hardest are:

  1. English
  2. Other subjects
  3. Humanities
  4. Maths
  5. Science

The parallel item characteristic curves are shown below. This appears very similar to a plot for an ordinal logistic regression model. The x-axis reports theta, the continuous latent variable. The probability of success differs across subjects. English is furthest to the left, suggesting that as we move along the latent "attainment" distribution (theta), it is the first subject that students tend to pass.

In [15]:
%set user_graph_keywords irtgraph
In [16]:
irtgraph icc
graph export latent_trait1.png, as(png) replace

file C:/Users/cjp229/.stata_kernel_cache/graph0.svg saved as SVG format
file C:/Users/cjp229/.stata_kernel_cache/graph0.pdf saved as PDF format

file latent_trait1.png saved as PNG format

[Figure: item characteristic curves from the 1PL model (latent_trait1.png)]

Now we fit a model where discrimination and difficulty are allowed to vary across subject groupings.

In [17]:
irt 2pl english maths science humanity othersub
Fitting fixed-effects model:

Iteration 0:  Log likelihood = -47945.098  
Iteration 1:  Log likelihood = -47874.642  
Iteration 2:  Log likelihood = -47874.601  
Iteration 3:  Log likelihood = -47874.601  

Fitting full model:

Iteration 0:  Log likelihood = -42047.203  
Iteration 1:  Log likelihood = -38418.105  
Iteration 2:  Log likelihood =  -38325.25  
Iteration 3:  Log likelihood = -38321.816  
Iteration 4:  Log likelihood = -38321.892  
Iteration 5:  Log likelihood = -38321.899  
Iteration 6:  Log likelihood =   -38321.9  

Two-parameter logistic model                            Number of obs = 14,720
Log likelihood = -38321.9
------------------------------------------------------------------------------
             | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
english      |
     Discrim |   2.533943   .0647634    39.13   0.000     2.407009    2.660877
        Diff |   -.680093    .014198   -47.90   0.000    -.7079206   -.6522653
-------------+----------------------------------------------------------------
maths        |
     Discrim |    3.51309   .1065428    32.97   0.000      3.30427     3.72191
        Diff |  -.1940406   .0109287   -17.76   0.000    -.2154604   -.1726208
-------------+----------------------------------------------------------------
science      |
     Discrim |   2.791897    .071835    38.87   0.000     2.651103    2.932691
        Diff |  -.1119334   .0116335    -9.62   0.000    -.1347347   -.0891322
-------------+----------------------------------------------------------------
humanity     |
     Discrim |   2.779038   .0716841    38.77   0.000      2.63854    2.919537
        Diff |   -.389163    .012209   -31.88   0.000    -.4130921   -.3652338
-------------+----------------------------------------------------------------
othersub     |
     Discrim |    1.77048   .0425716    41.59   0.000     1.687041    1.853918
        Diff |  -.7238225   .0169885   -42.61   0.000    -.7571193   -.6905256
------------------------------------------------------------------------------
In [18]:
estat report, sort(b) byparm
Two-parameter logistic model                            Number of obs = 14,720
Log likelihood = -38321.9
------------------------------------------------------------------------------
             | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
Discrim      |
    othersub |    1.77048   .0425716    41.59   0.000     1.687041    1.853918
     english |   2.533943   .0647634    39.13   0.000     2.407009    2.660877
    humanity |   2.779038   .0716841    38.77   0.000      2.63854    2.919537
       maths |    3.51309   .1065428    32.97   0.000      3.30427     3.72191
     science |   2.791897    .071835    38.87   0.000     2.651103    2.932691
-------------+----------------------------------------------------------------
Diff         |
    othersub |  -.7238225   .0169885   -42.61   0.000    -.7571193   -.6905256
     english |   -.680093    .014198   -47.90   0.000    -.7079206   -.6522653
    humanity |   -.389163    .012209   -31.88   0.000    -.4130921   -.3652338
       maths |  -.1940406   .0109287   -17.76   0.000    -.2154604   -.1726208
     science |  -.1119334   .0116335    -9.62   0.000    -.1347347   -.0891322
------------------------------------------------------------------------------

In the chart below, the steeper the slope, the higher the discrimination of the item (i.e. GCSE subject grouping): a more discriminating item separates more sharply those with lower levels of the latent trait from those with higher levels. In this model, maths has the highest discrimination (3.51), making it the clearest differentiator of overall attainment, while science has the highest difficulty (its difficulty parameter is the least negative, consistent with fewer young people passing GCSE Science).

In [19]:
irtgraph icc
graph export latent_trait2.png, as(png) replace

file C:/Users/cjp229/.stata_kernel_cache/graph1.svg saved as SVG format
file C:/Users/cjp229/.stata_kernel_cache/graph1.pdf saved as PDF format

file latent_trait2.png saved as PNG format

[Figure: item characteristic curves from the 2PL model (latent_trait2.png)]
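
One optional complement (a sketch, not run in this notebook) is to plot the item information functions, since more discriminating items carry more information about theta:

    * item information functions for the 2PL model
    irtgraph iif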

Latent Profile Model

Latent profile models are suitable when:

  • The latent variable(s) is categorical
  • The manifest variables are continuous

In the model below, we have 5 manifest variables which are scores (0 - 7):

  • English
  • Mathematics
  • Science
  • Humanities
  • Other Subjects

We can estimate the pairwise correlation between each of the GCSE subject score variables (see below). This can be worthwhile when inspecting your data prior to fitting a model.

In [20]:
corr english_sc maths_sc science_sc humanity_sc othersub_sc
(obs=14,720)

             | englis~c maths_sc scienc~c humani~c others~c
-------------+---------------------------------------------
  english_sc |   1.0000
    maths_sc |   0.6108   1.0000
  science_sc |   0.5888   0.7228   1.0000
 humanity_sc |   0.6724   0.6442   0.6369   1.0000
 othersub_sc |   0.5767   0.5473   0.5309   0.5449   1.0000

For comparison with the 4 class latent class model, a 4 'class' latent profile model has been estimated below.

The model estimated using the gsem command (below) uses a generalised linear latent and mixed modelling (GLLAMM) framework. The first part of the code specifies the 5 manifest variables (continuous GCSE subject outcomes). The lclass option then specifies how many 'classes' are to be estimated (4 in this example). The model reports a series of estimates:

  • The first part shows the log odds of latent 'class' membership (compared to a base class, similar to a multinomial logistic regression model)
  • The second part shows, for members of a particular latent 'class', the mean score for each of the manifest variables (i.e. GCSE subject outcomes)

As the log odds scale can be a little difficult to interpret, Stata provides further commands to re-express the model parameters as probabilities. The conditional scores for the latent 'classes' are easier to interpret.

We first fit the model (below); to look at the latent 'class' probabilities (prior probabilities), we then run the estat lcprob command.

In [21]:
gsem (english_sc maths_sc science_sc humanity_sc othersub_sc <- ), lclass(C 4)
est store class4score

Fitting class model:

Iteration 0:  (class) log likelihood = -20406.253  
Iteration 1:  (class) log likelihood = -20406.253  

Fitting outcome model:

Iteration 0:  (outcome) log likelihood = -99358.426  
Iteration 1:  (outcome) log likelihood = -99358.426  

Refining starting values:

Iteration 0:  (EM) log likelihood = -121898.48
Iteration 1:  (EM) log likelihood = -121839.92
Iteration 2:  (EM) log likelihood = -121559.11
Iteration 3:  (EM) log likelihood = -121304.62
Iteration 4:  (EM) log likelihood =  -121097.9
Iteration 5:  (EM) log likelihood = -120930.46
Iteration 6:  (EM) log likelihood = -120793.23
Iteration 7:  (EM) log likelihood = -120679.59
Iteration 8:  (EM) log likelihood = -120584.74
Iteration 9:  (EM) log likelihood = -120505.05
Iteration 10: (EM) log likelihood = -120437.65
Iteration 11: (EM) log likelihood = -120380.23
Iteration 12: (EM) log likelihood = -120330.95
Iteration 13: (EM) log likelihood = -120288.31
Iteration 14: (EM) log likelihood = -120251.09
Iteration 15: (EM) log likelihood = -120218.35
Iteration 16: (EM) log likelihood =  -120189.3
Iteration 17: (EM) log likelihood = -120163.32
Iteration 18: (EM) log likelihood = -120139.92
Iteration 19: (EM) log likelihood = -120118.69
Iteration 20: (EM) log likelihood = -120099.31
note: EM algorithm reached maximum iterations.

Fitting full model:

Iteration 0:  Log likelihood = -115433.69  
Iteration 1:  Log likelihood =  -115383.6  
Iteration 2:  Log likelihood = -115382.62  
Iteration 3:  Log likelihood = -115382.62  

Generalized structural equation model                   Number of obs = 14,720
Log likelihood = -115382.62

 ( 1)  [/]var(e.english_sc)#1bn.C - [/]var(e.english_sc)#4.C = 0
 ( 2)  [/]var(e.english_sc)#2.C - [/]var(e.english_sc)#4.C = 0
 ( 3)  [/]var(e.english_sc)#3.C - [/]var(e.english_sc)#4.C = 0
 ( 4)  [/]var(e.maths_sc)#1bn.C - [/]var(e.maths_sc)#4.C = 0
 ( 5)  [/]var(e.maths_sc)#2.C - [/]var(e.maths_sc)#4.C = 0
 ( 6)  [/]var(e.maths_sc)#3.C - [/]var(e.maths_sc)#4.C = 0
 ( 7)  [/]var(e.science_sc)#1bn.C - [/]var(e.science_sc)#4.C = 0
 ( 8)  [/]var(e.science_sc)#2.C - [/]var(e.science_sc)#4.C = 0
 ( 9)  [/]var(e.science_sc)#3.C - [/]var(e.science_sc)#4.C = 0
 (10)  [/]var(e.humanity_sc)#1bn.C - [/]var(e.humanity_sc)#4.C = 0
 (11)  [/]var(e.humanity_sc)#2.C - [/]var(e.humanity_sc)#4.C = 0
 (12)  [/]var(e.humanity_sc)#3.C - [/]var(e.humanity_sc)#4.C = 0
 (13)  [/]var(e.othersub_sc)#1bn.C - [/]var(e.othersub_sc)#4.C = 0
 (14)  [/]var(e.othersub_sc)#2.C - [/]var(e.othersub_sc)#4.C = 0
 (15)  [/]var(e.othersub_sc)#3.C - [/]var(e.othersub_sc)#4.C = 0

------------------------------------------------------------------------------
             | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
1.C          |  (base outcome)
-------------+----------------------------------------------------------------
2.C          |
       _cons |   1.141388   .0478518    23.85   0.000       1.0476    1.235176
-------------+----------------------------------------------------------------
3.C          |
       _cons |   1.432306   .0521877    27.45   0.000      1.33002    1.534592
-------------+----------------------------------------------------------------
4.C          |
       _cons |   1.098932   .0563038    19.52   0.000     .9885781    1.209285
------------------------------------------------------------------------------

Class:    1          

Response: english_sc 
Family:   Gaussian   
Link:     Identity   

Response: maths_sc   
Family:   Gaussian   
Link:     Identity   

Response: science_sc 
Family:   Gaussian   
Link:     Identity   

Response: humanity_sc
Family:   Gaussian   
Link:     Identity   

Response: othersub_sc
Family:   Gaussian   
Link:     Identity   

-------------------------------------------------------------------------------
              | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
--------------+----------------------------------------------------------------
english_sc    |
        _cons |   3.140857   .0323312    97.15   0.000     3.077489    3.204225
--------------+----------------------------------------------------------------
maths_sc      |
        _cons |   1.961922   .0356103    55.09   0.000     1.892127    2.031717
--------------+----------------------------------------------------------------
science_sc    |
        _cons |   2.151475    .037378    57.56   0.000     2.078216    2.224735
--------------+----------------------------------------------------------------
humanity_sc   |
        _cons |     2.1179   .0396661    53.39   0.000     2.040156    2.195644
--------------+----------------------------------------------------------------
othersub_sc   |
        _cons |   3.122194   .0426271    73.24   0.000     3.038647    3.205742
--------------+----------------------------------------------------------------
var(e.engli~c)|    .671014   .0095365                      .6525807     .689968
var(e.maths~c)|   .8851617    .013933                      .8582705    .9128955
var(e.scien~c)|   .9186259    .013553                      .8924429    .9455771
var(e.human~c)|   .9064522    .014311                      .8788328    .9349397
var(e.other~c)|   1.231776   .0161764                      1.200476    1.263893
-------------------------------------------------------------------------------

Class:    2          

Response: english_sc 
Family:   Gaussian   
Link:     Identity   

Response: maths_sc   
Family:   Gaussian   
Link:     Identity   

Response: science_sc 
Family:   Gaussian   
Link:     Identity   

Response: humanity_sc
Family:   Gaussian   
Link:     Identity   

Response: othersub_sc
Family:   Gaussian   
Link:     Identity   

-------------------------------------------------------------------------------
              | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
--------------+----------------------------------------------------------------
english_sc    |
        _cons |   4.308375    .021638   199.11   0.000     4.265965    4.350785
--------------+----------------------------------------------------------------
maths_sc      |
        _cons |   3.325618   .0286554   116.06   0.000     3.269454    3.381781
--------------+----------------------------------------------------------------
science_sc    |
        _cons |   3.451006   .0253821   135.96   0.000     3.401258    3.500754
--------------+----------------------------------------------------------------
humanity_sc   |
        _cons |   3.806207   .0296027   128.58   0.000     3.748187    3.864227
--------------+----------------------------------------------------------------
othersub_sc   |
        _cons |   4.339169   .0258624   167.78   0.000     4.288479    4.389858
--------------+----------------------------------------------------------------
var(e.engli~c)|    .671014   .0095365                      .6525807     .689968
var(e.maths~c)|   .8851617    .013933                      .8582705    .9128955
var(e.scien~c)|   .9186259    .013553                      .8924429    .9455771
var(e.human~c)|   .9064522    .014311                      .8788328    .9349397
var(e.other~c)|   1.231776   .0161764                      1.200476    1.263893
-------------------------------------------------------------------------------

Class:    3          

Response: english_sc 
Family:   Gaussian   
Link:     Identity   

Response: maths_sc   
Family:   Gaussian   
Link:     Identity   

Response: science_sc 
Family:   Gaussian   
Link:     Identity   

Response: humanity_sc
Family:   Gaussian   
Link:     Identity   

Response: othersub_sc
Family:   Gaussian   
Link:     Identity   

-------------------------------------------------------------------------------
              | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
--------------+----------------------------------------------------------------
english_sc    |
        _cons |    5.28238    .018694   282.57   0.000      5.24574     5.31902
--------------+----------------------------------------------------------------
maths_sc      |
        _cons |   4.773784   .0226449   210.81   0.000     4.729401    4.818167
--------------+----------------------------------------------------------------
science_sc    |
        _cons |   4.718916   .0231977   203.42   0.000     4.673449    4.764383
--------------+----------------------------------------------------------------
humanity_sc   |
        _cons |   5.211693   .0235324   221.47   0.000      5.16557    5.257815
--------------+----------------------------------------------------------------
othersub_sc   |
        _cons |   5.409425   .0230908   234.27   0.000     5.364168    5.454682
--------------+----------------------------------------------------------------
var(e.engli~c)|    .671014   .0095365                      .6525807     .689968
var(e.maths~c)|   .8851617    .013933                      .8582705    .9128955
var(e.scien~c)|   .9186259    .013553                      .8924429    .9455771
var(e.human~c)|   .9064522    .014311                      .8788328    .9349397
var(e.other~c)|   1.231776   .0161764                      1.200476    1.263893
-------------------------------------------------------------------------------

Class:    4          

Response: english_sc 
Family:   Gaussian   
Link:     Identity   

Response: maths_sc   
Family:   Gaussian   
Link:     Identity   

Response: science_sc 
Family:   Gaussian   
Link:     Identity   

Response: humanity_sc
Family:   Gaussian   
Link:     Identity   

Response: othersub_sc
Family:   Gaussian   
Link:     Identity   

-------------------------------------------------------------------------------
              | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
--------------+----------------------------------------------------------------
english_sc    |
        _cons |    6.33835   .0162578   389.86   0.000     6.306485    6.370215
--------------+----------------------------------------------------------------
maths_sc      |
        _cons |   6.119511   .0205543   297.72   0.000     6.079226    6.159797
--------------+----------------------------------------------------------------
science_sc    |
        _cons |   6.165085   .0207947   296.47   0.000     6.124328    6.205842
--------------+----------------------------------------------------------------
humanity_sc   |
        _cons |   6.439744   .0183939   350.10   0.000     6.403693    6.475795
--------------+----------------------------------------------------------------
othersub_sc   |
        _cons |   6.458281   .0203541   317.30   0.000     6.418388    6.498174
--------------+----------------------------------------------------------------
var(e.engli~c)|    .671014   .0095365                      .6525807     .689968
var(e.maths~c)|   .8851617    .013933                      .8582705    .9128955
var(e.scien~c)|   .9186259    .013553                      .8924429    .9455771
var(e.human~c)|   .9064522    .014311                      .8788328    .9349397
var(e.other~c)|   1.231776   .0161764                      1.200476    1.263893
-------------------------------------------------------------------------------

In [22]:
estat lcprob
Latent class marginal probabilities                     Number of obs = 14,720

--------------------------------------------------------------
             |            Delta-method
             |     Margin   std. err.     [95% conf. interval]
-------------+------------------------------------------------
           C |
          1  |    .088336   .0039272      .0809356    .0963421
          2  |   .2765898   .0055229      .2658969    .2875442
          3  |   .3699817   .0057928      .3587016    .3814055
          4  |   .2650926   .0058544       .253778    .2767245
--------------------------------------------------------------

Now let's look at the conditional means for the manifest (observed) variables. These are the mean scores in each GCSE subject group given membership of a particular latent 'class'.

In [23]:
estat lcmean
Latent class marginal means                             Number of obs = 14,720

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
1            |
  english_sc |   3.140857   .0323312    97.15   0.000     3.077489    3.204225
    maths_sc |   1.961922   .0356103    55.09   0.000     1.892127    2.031717
  science_sc |   2.151475    .037378    57.56   0.000     2.078216    2.224735
 humanity_sc |     2.1179   .0396661    53.39   0.000     2.040156    2.195644
 othersub_sc |   3.122194   .0426271    73.24   0.000     3.038647    3.205742
-------------+----------------------------------------------------------------
2            |
  english_sc |   4.308375    .021638   199.11   0.000     4.265965    4.350785
    maths_sc |   3.325618   .0286554   116.06   0.000     3.269454    3.381781
  science_sc |   3.451006   .0253821   135.96   0.000     3.401258    3.500754
 humanity_sc |   3.806207   .0296027   128.58   0.000     3.748187    3.864227
 othersub_sc |   4.339169   .0258624   167.78   0.000     4.288479    4.389858
-------------+----------------------------------------------------------------
3            |
  english_sc |    5.28238    .018694   282.57   0.000      5.24574     5.31902
    maths_sc |   4.773784   .0226449   210.81   0.000     4.729401    4.818167
  science_sc |   4.718916   .0231977   203.42   0.000     4.673449    4.764383
 humanity_sc |   5.211693   .0235324   221.47   0.000      5.16557    5.257815
 othersub_sc |   5.409425   .0230908   234.27   0.000     5.364168    5.454682
-------------+----------------------------------------------------------------
4            |
  english_sc |    6.33835   .0162578   389.86   0.000     6.306485    6.370215
    maths_sc |   6.119511   .0205543   297.72   0.000     6.079226    6.159797
  science_sc |   6.165085   .0207947   296.47   0.000     6.124328    6.205842
 humanity_sc |   6.439744   .0183939   350.10   0.000     6.403693    6.475795
 othersub_sc |   6.458281   .0203541   317.30   0.000     6.418388    6.498174
------------------------------------------------------------------------------

Although the gsem command refers to them as 'classes', we might also call them profiles.

To plot the different profiles, we can use the margins command. The margins have been estimated below using the quietly option, stored, and then used in the graph that follows.

In [24]:
estimates restore class4score

quietly: margins, 	predict(outcome(english_sc)  class(1)) ///
                    predict(outcome(maths_sc)    class(1)) ///
                    predict(outcome(science_sc)  class(1)) ///
                    predict(outcome(humanity_sc) class(1)) ///
                    predict(outcome(othersub_sc) class(1)) post
est store lpa4_1
(results class4score are active now)


In [25]:
estimates restore class4score

quietly: margins, 	predict(outcome(english_sc)  class(2)) ///
                    predict(outcome(maths_sc)    class(2)) ///
                    predict(outcome(science_sc)  class(2)) ///
                    predict(outcome(humanity_sc) class(2)) ///
                    predict(outcome(othersub_sc) class(2)) post
est store lpa4_2
(results class4score are active now)


In [26]:
estimates restore class4score

quietly: margins, 	predict(outcome(english_sc)  class(3)) ///
                    predict(outcome(maths_sc)    class(3)) ///
                    predict(outcome(science_sc)  class(3)) ///
                    predict(outcome(humanity_sc) class(3)) ///
                    predict(outcome(othersub_sc) class(3)) post
est store lpa4_3
(results class4score are active now)


In [27]:
estimates restore class4score

quietly: margins, 	predict(outcome(english_sc)  class(4)) ///
                    predict(outcome(maths_sc)    class(4)) ///
                    predict(outcome(science_sc)  class(4)) ///
                    predict(outcome(humanity_sc) class(4)) ///
                    predict(outcome(othersub_sc) class(4)) post
est store lpa4_4
(results class4score are active now)


In [28]:
set scheme s1color

coefplot lpa4_1 ///
		 lpa4_2 ///
		 lpa4_3 ///
		 lpa4_4, connect(l) vertical ///
				 xlabel(1 "English" ///
					    2 "Maths" ///
					    3 "Science" ///
					    4 "Humanity" ///
					    5 "Other Subject", angle(45)) ///
				 ytitle("GCSE Points Score") ///
				 legend(order(2 "Latent Profile 1" ///
				              4 "Latent Profile 2" ///
							  6 "Latent Profile 3" ///
							  8 "Latent Profile 4" ))
                              
graph export latent_profile1.png, as(png) replace


file latent_profile1.png saved as PNG format

[Figure: mean GCSE points score by latent profile (latent_profile1.png)]

The use of mean scores in the latent profile model presents a different perspective to the latent class model.

Mean scores will occlude patterns of response that would be visible using a latent class model. The analyst therefore has to decide what their research question is and which measure helps them to answer it best.

Factor Analysis

Factor analysis is suitable when:

  • The latent variable(s) is continuous
  • The manifest variables are continuous

In the model below, we have 5 manifest variables which are scores (0 - 7):

  • English
  • Mathematics
  • Science
  • Humanities
  • Other Subjects

Now we can fit a factor analysis with a continuous latent variable and a continuous set of manifest variables.

In [29]:
factor english_sc maths_sc science_sc humanity_sc othersub_sc
(obs=14,720)

Factor analysis/correlation                      Number of obs    =     14,720
    Method: principal factors                    Retained factors =          2
    Rotation: (unrotated)                        Number of params =          9

    --------------------------------------------------------------------------
         Factor  |   Eigenvalue   Difference        Proportion   Cumulative
    -------------+------------------------------------------------------------
        Factor1  |      2.98712      2.93301            1.0952       1.0952
        Factor2  |      0.05411      0.11042            0.0198       1.1150
        Factor3  |     -0.05631      0.06717           -0.0206       1.0943
        Factor4  |     -0.12348      0.01037           -0.0453       1.0491
        Factor5  |     -0.13385            .           -0.0491       1.0000
    --------------------------------------------------------------------------
    LR test: independent vs. saturated:  chi2(10) = 3.9e+04 Prob>chi2 = 0.0000

Factor loadings (pattern matrix) and unique variances

    -------------------------------------------------
        Variable |  Factor1   Factor2 |   Uniqueness 
    -------------+--------------------+--------------
      english_sc |   0.7750    0.1205 |      0.3849  
        maths_sc |   0.8114   -0.1151 |      0.3283  
      science_sc |   0.7959   -0.1277 |      0.3502  
     humanity_sc |   0.7953    0.0586 |      0.3640  
     othersub_sc |   0.6797    0.0811 |      0.5314  
    -------------------------------------------------

Two factors were identified. Factor 1 appears to correspond to "overall attainment". Factor 2 loads differently across subjects: it is positively loaded for English, Humanities and Other Subjects but negatively loaded for Maths and Science. This broadly fits with the findings from the 4 class latent class model.
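
A common follow-up (a sketch, not run in this notebook) is to rotate the retained factors to aid interpretation and, if needed, to save predicted factor scores (f1 and f2 are hypothetical variable names):

    * orthogonal varimax rotation of the two retained factors
    rotate, varimax

    * predicted (regression-scored) factor scores as new variables
    predict f1 f2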

In [30]:
estat factors
Factor analysis with different numbers of factors (maximum likelihood)

    ----------------------------------------------------------
    #factors |     loglik   df_m   df_r        AIC        BIC 
    ---------+------------------------------------------------
           1 |   -560.179      5      5   1130.358   1168.343 
           2 |  -1.448049      9      1    20.8961   89.26876 
    ----------------------------------------------------------
    no Heywood cases encountered
In [31]:
estat structure
Structure matrix: correlations between variables and common factors

    ----------------------------------
        Variable |  Factor1   Factor2 
    -------------+--------------------
      english_sc |   0.7750    0.1205 
        maths_sc |   0.8114   -0.1151 
      science_sc |   0.7959   -0.1277 
     humanity_sc |   0.7953    0.0586 
     othersub_sc |   0.6797    0.0811 
    ----------------------------------

Conclusions

The workbook above provides an illustration of the 4 main types of latent variable model using the same underlying data. Each model provides a slightly different 'view' of the association between manifest variables. Researchers should consider carefully how they conceptualise the latent variable(s) and how the manifest variables are measured.
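
As a summary, the four models differ only in the measurement level assumed for the latent and manifest variables:

                         Manifest categorical     Manifest continuous
  Latent categorical     Latent Class Analysis    Latent Profile Analysis
  Latent continuous      Latent Trait Analysis    Factor Analysis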
