Complex social survey data designs

Date: 5th June 2017

Category: NCRM news

Author(s): Roxanne Connelly, University of Warwick; Vernon Gayle, NCRM, University of Edinburgh

Social scientists now have unprecedented access to data. New social science data resources are becoming available, including new forms of ‘found’ data¹, such as administrative data² and various forms of ‘big data’. The richest and most research-valuable data resources available to social science researchers are large-scale multipurpose social surveys, such as Understanding Society and the British Cohort Studies.

A feature of modern social surveys is that their designs are complex³. Large-scale surveys generally do not collect data from simple random samples. This is partly due to the constraints of fieldwork: for example, costs and logistical problems are reduced if households or individuals are sampled from within smaller areas (i.e. primary sampling units or clusters). To ensure that certain smaller geographical areas (e.g. the devolved territories in the UK) have sufficient sample sizes to support independent analyses, the survey may over-sample these areas. Similarly, groups that tend to have low coverage in national samples (e.g. individuals living in poverty or ethnic minority groups) may also be over-sampled to provide enough cases for focussed independent analyses to be undertaken. These samples are often referred to as booster samples.

When analysing complex social surveys, researchers will need to take the design elements of the data into account, or their results will not represent the patterns found in the wider population. These adjustments can be undertaken in statistical data analysis programmes such as SPSS (i.e. the complex samples package), Stata (i.e. svy commands), or R (i.e. survey package). The analysis of complex survey samples is not always straightforward, however. There are some data analysis techniques that do not readily support adjustments for complex survey designs and some statistical measures cannot easily be calculated.

In most large-scale social surveys the respondents have unequal chances of being selected or have unequal chances of providing data. Most contemporary large-scale social survey datasets are therefore supplied with weights. The purpose of these weights is to allow researchers to adjust the data in some way, usually to better represent a target population⁴.
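To make the purpose of weights concrete, here is a minimal sketch in Python. It is not drawn from any real survey: the sample, the selection probabilities and the helper names `design_weight` and `weighted_mean` are all hypothetical. It assumes a main sample selected at 1 in 100 and a booster sample selected at 1 in 10, and weights each respondent by the inverse of their selection probability. Weights supplied with real surveys usually also include adjustments for non-response, so researchers should rely on the weights and documentation provided with the dataset rather than on a calculation like this one.

```python
def design_weight(selection_probability):
    """Inverse-probability weight: the number of population members
    that one sampled respondent stands in for."""
    return 1.0 / selection_probability


def weighted_mean(values, weights):
    """Mean of values in which each respondent counts in proportion to their weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical sample (all figures are invented for illustration):
# 90 main-sample respondents, each selected with probability 0.01, outcome 50
# 100 booster-sample respondents, each selected with probability 0.10, outcome 30
main_sample = [(50.0, 0.01)] * 90
booster_sample = [(30.0, 0.10)] * 100
sample = main_sample + booster_sample

values = [outcome for outcome, _ in sample]
weights = [design_weight(prob) for _, prob in sample]

# The naive mean treats every respondent equally, so the over-sampled
# booster group pulls the estimate towards its own outcome.
naive_mean = sum(values) / len(values)

# The weighted mean restores each group to its share of the population.
adjusted_mean = weighted_mean(values, weights)

print(f"Naive mean:    {naive_mean:.2f}")
print(f"Weighted mean: {adjusted_mean:.2f}")
```

In this invented example the sample represents 9,000 people from the main group and 1,000 from the boosted group, so the weighted mean matches the mean of the implied population, whereas the naive mean is pulled towards the over-sampled group.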

The respected and highly experienced data analysts Angrist and Pischke⁵ assert that ‘few things are as confusing to applied researchers as the role of sample weights.’ Gelman⁶ makes the bold assertion that ‘survey weighting is a mess’, because it is not always clear how to use weights in estimating anything other than simple descriptive statistics. We concur that there is little in the way of a clear prescription on when and how best to use weights in empirical analyses, and advice differs within the technical literature. The statistical literature on survey design, sampling and weighting is dense and the terminology and concepts that are used are often confusing for applied social science researchers. An aim of our ongoing work is to make these prescriptions more accessible⁷.

Using simpler standard data analysis techniques that fail to account for the complexity of surveys is a naive approach. In some analyses a survey design and selection strategy may be ignorable and a naive approach to data analysis will be satisfactory; however, making this assumption a priori is at best speculative and at worst may result in misleading inferences. Our advice is that researchers should take this issue seriously and begin by studying the design and selection strategies used to collect the data. Researchers should be open about their analytical decisions, and the choices that are made to operationalise the analyses. Whenever possible, researchers should compare results that attempt to take into account the complex survey designs and selection strategies with more naive analyses, and reflect upon whether or not the survey design is ignorable. The research process should be fully documented. This is often infeasible within the confines of a standard journal article, but through sharing code and the use of repositories, researchers can ensure that their analysis of complex survey data is transparent and reproducible⁸.
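The comparison between a naive and a design-adjusted analysis can be sketched as follows. This is an illustration under stated assumptions, not our recommended workflow: the data are invented, the respondents are equally weighted, there is no stratification, and the primary sampling units are treated as if sampled with replacement. The design-based standard error uses a standard linearisation estimator built from between-cluster variation. In practice this comparison would normally be run with dedicated survey software such as the Stata svy commands or the R survey package.

```python
import math

# Hypothetical data: five primary sampling units (PSUs), four respondents each.
# Respondents within the same PSU are deliberately similar to one another.
clusters = {
    "psu1": [10, 11, 12, 11],
    "psu2": [20, 21, 19, 20],
    "psu3": [30, 29, 31, 30],
    "psu4": [15, 16, 14, 15],
    "psu5": [25, 24, 26, 25],
}

values = [y for ys in clusters.values() for y in ys]
n = len(values)    # number of respondents
k = len(clusters)  # number of PSUs
mean = sum(values) / n

# Naive analysis: treat the respondents as a simple random sample.
sample_variance = sum((y - mean) ** 2 for y in values) / (n - 1)
naive_se = math.sqrt(sample_variance / n)

# Design-adjusted analysis: sum each PSU's deviations from the overall mean,
# then base the variance on how much those PSU totals differ from one another.
cluster_residual_totals = [sum(y - mean for y in ys) for ys in clusters.values()]
cluster_variance = (k / (k - 1)) * sum(t ** 2 for t in cluster_residual_totals) / n ** 2
cluster_se = math.sqrt(cluster_variance)

# Ratio of the two variances: values well above 1 suggest the design is not ignorable.
design_effect = (cluster_se / naive_se) ** 2

print(f"Mean: {mean:.2f}")
print(f"Naive standard error:           {naive_se:.3f}")
print(f"Cluster-adjusted standard error: {cluster_se:.3f}")
print(f"Approximate design effect:       {design_effect:.2f}")
```

Because respondents within each invented cluster resemble one another, the cluster-adjusted standard error is noticeably larger than the naive one. A gap of this kind is a sign that the survey design should not be treated as ignorable for that analysis.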

References

1 Connelly R, Playford CJ, Gayle V, et al. (2016) The role of administrative data in the big data revolution in social science research. Social Science Research 59: 1-12.

2 https://thedetectiveshandbook.wordpress.com/2015/10/13/administrative-data-is-a-bit-like-tinder-other-people-seem-to-be-using-it-are-you-missing-out/

3 https://thedetectiveshandbook.wordpress.com/2016/12/05/all-maps-are-inaccurate-but-some-have-very-useful-applications-thoughts-on-complex-social-surveys/

4 For an accessible introduction to survey weights see http://www.restore.ac.uk/PEAS/index.php

5 Angrist J and Pischke J. (2009) Mostly Harmless Econometrics: An Empiricist's Companion. Princeton: Princeton University Press.

6 Gelman A. (2007) Struggles with survey weighting and regression modeling. Statistical Science 22(2): 153-164.

7 For a fuller account of our recommendations see the training materials associated with our project ‘Have Socio-Economic Inequalities in Childhood Cognitive Test Scores Changed? A Secondary Analysis of Three British Birth Cohorts’ [Grant Number: ES/N011783/1], which are available here: http://www2.warwick.ac.uk/fac/soc/sociology/staff/connelly/cognitiveinequalities/training/.

8 Gayle V and Lambert P. (2017) The Workflow: A Practical Guide to Producing Accurate, Efficient, Transparent and Reproducible Social Survey Data Analysis. University of Southampton: National Centre for Research Methods.