Analysing Large Scale Social Surveys: the Statistical Data Analysis Workflow
Presenter(s): Vernon Gayle
Large scale social surveys provide rich sources of empirical data; however, analysing these resources often proves to be complicated. This NCRM resource outlines the stages that are common in the statistical data analysis workflow.
The research process usually begins with the development and formulation of clear and feasible theoretically informed research questions (e.g. a research hypothesis). Statistical methods can then be used to investigate complicated multivariate patterns and relationships in large scale social science datasets.
White, P., 2017. Developing research questions. Bloomsbury Publishing.
The size and the wide scope of large scale surveys are beyond the data collection capabilities of individual researchers. There is a broad portfolio of ‘omnibus’ surveys that are specifically designed to facilitate data analysis on a comprehensive range of topics. These surveys support research from different social science disciplines.
The UK Data Service is a nationally funded infrastructural resource that manages data and provides researchers with access to high-quality large scale social science datasets. Other countries have similar national archives that provide access to data resources.
Social survey data are accessible in computer-readable formats. The UK Data Service usually provides data files that can be read by the popular data analysis software packages SPSS and Stata. It also provides files in a more generic tab-delimited (TAB) format. In practice, TAB can be a difficult format in which to work with large scale and complex social science datasets.
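As a minimal sketch of what working with a tab-delimited file involves, the Python below reads a small TAB extract using only the standard library. The variable names (pidp, age, sex), the values and the missing-value code -9 are invented for illustration and are not taken from any particular dataset. A TAB file is plain text, so variable labels, value labels and missing-value definitions are not carried in the file itself, which is one likely reason the format is harder to work with.

```python
import csv
import io

# Invented extract standing in for a downloaded .tab file (hypothetical variables).
TAB_TEXT = "pidp\tage\tsex\n101\t34\t1\n102\t-9\t2\n103\t58\t2\n"


def read_tab(handle):
    """Read a tab-delimited file into a list of dicts, one per respondent."""
    reader = csv.DictReader(handle, delimiter="\t")
    return list(reader)


rows = read_tab(io.StringIO(TAB_TEXT))

# Every value arrives as text, so types must be converted explicitly.
for row in rows:
    row["age"] = int(row["age"])
    row["sex"] = int(row["sex"])

print(len(rows), "cases;", "variables:", list(rows[0].keys()))
```

With a real download, `io.StringIO(TAB_TEXT)` would be replaced by an opened file, for example `open(path, newline="")`, where `path` points to the TAB file.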
Understanding Society, the UK Household Longitudinal Study (Study Number 6614) comprises 15 files per wave of survey data collection. The download of Waves 1 – 12 (years 2009 – 2021) contains more than 180 files.
University of Essex, Institute for Social and Economic Research. (2022). Understanding Society: Waves 1-12, 2009-2021 and Harmonised BHPS: Waves 1-18, 1991-2009. [data collection]. 17th Edition. UK Data Service. SN: 6614, http://doi.org/10.5255/UKDA-SN-6614-18.
It is impractical for social researchers to expect to undertake any serious statistically orientated data analysis without using a computer and a data analysis software package or statistical programming language. Software can be operated in different ways, but graphical user interfaces (e.g. drop-down menus) do not provide a suitable record of the very large number of operations that are required to undertake comprehensive analyses of large scale social science survey datasets. Serious data analysts will write out the code required to organise and analyse data.
Gayle, V.J. and Lambert, P.S. 2017. The Workflow: A Practical Guide to Producing Accurate, Efficient, Transparent and Reproducible Social Survey Data Analysis. NCRM Working Paper. NCRM. https://eprints.ncrm.ac.uk/id/eprint/4000/
The four most popular statistical data analysis tools are SPSS, Stata, SAS and R. The majority of mainstream data analytical tasks and statistical techniques that are used in social science research can be undertaken using any of these four tools. The differences between them are minor and largely inconsequential. However, some researchers, academic departments, research institutes, research organisations and social science disciplines prefer specific software packages or programming languages.
- Ward, B.W., 2013. What’s better—R, SAS®, SPSS®, or Stata®? Thoughts for instructors of statistics and research methods courses. Journal of Applied Social Science, 7(1), pp.115-120.
Data wrangling is the process of organising and preparing a raw dataset to enable data analytical tasks. Social science data are frequently delivered in multiple files that have to be merged. Operations such as matching cases (e.g. spouses), selecting subsets of data (e.g. married couples), recoding existing variables, renaming variables, creating new variables and dealing with missing data are all common activities in the data wrangling phase. The data wrangling phase is an essential part of the statistical data analysis workflow; however, the length of time required to undertake it is frequently underestimated.
- Long, J.S., 2009. The workflow of data analysis using Stata. College Station, TX: Stata Press.
- MacInnes, J., 2016. An introduction to secondary data analysis with IBM SPSS statistics. Sage.
- Wickham, H., Çetinkaya-Rundel, M. and Grolemund, G., 2023. R for data science. O'Reilly Media, Inc.
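The wrangling operations described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a recommended implementation: the two files, the variable names (pid, hid, marstat) and the negative missing-value codes are all invented, and real survey files are far larger and are normally handled in a statistical package.

```python
# Two invented files that share a household identifier (hypothetical variables).
individual = [
    {"pid": 1, "hid": 10, "age": 34, "marstat": 1},
    {"pid": 2, "hid": 10, "age": 36, "marstat": 1},
    {"pid": 3, "hid": 11, "age": -9, "marstat": 2},
]
household = [
    {"hid": 10, "region": "Scotland"},
    {"hid": 11, "region": "Wales"},
]

MISSING_CODES = {-9, -8, -1}  # assumed negative missing-value codes

# Merge: attach household-level variables to each individual record.
household_by_id = {h["hid"]: h for h in household}
merged = [{**person, **household_by_id[person["hid"]]} for person in individual]

for row in merged:
    # Deal with missing data: convert the assumed missing codes to None.
    if row["age"] in MISSING_CODES:
        row["age"] = None
    # Rename a variable.
    row["marital_status"] = row.pop("marstat")
    # Create a new variable by recoding an existing one (1 is assumed to mean married).
    row["married"] = row["marital_status"] == 1

# Select a subset: married respondents only.
married = [row for row in merged if row["married"]]
print(len(merged), "merged cases;", len(married), "married")
```

Because each step is written out as code, the sequence of operations can be rerun and checked, which is not possible when the same steps are carried out through menus.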
Exploratory Data Analysis:
In the exploratory data analysis (EDA) phase, relationships in the datasets are investigated. The main characteristics of the data are explored and summary statistics are produced. Elementary univariate and bivariate relationships are explored, and graphical and visualisation methods are often employed.
- Tukey, J.W., 1977. Exploratory data analysis. Addison-Wesley.
- Elliott, M., 2009. Exploring data: an introduction to data analysis for social scientists. Polity.
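A minimal sketch of the kind of output produced in the exploratory phase is given below, using only the Python standard library. The records and variable names (sex, employed, hours) are invented; the example simply shows one univariate summary and one bivariate cross-tabulation.

```python
import statistics
from collections import Counter

# Invented records (hypothetical variables and values).
data = [
    {"sex": "female", "employed": "yes", "hours": 35},
    {"sex": "female", "employed": "no", "hours": 0},
    {"sex": "male", "employed": "yes", "hours": 40},
    {"sex": "male", "employed": "yes", "hours": 45},
    {"sex": "female", "employed": "yes", "hours": 30},
]

# Univariate summary of a continuous variable.
hours = [row["hours"] for row in data]
summary = {
    "n": len(hours),
    "mean": statistics.mean(hours),
    "median": statistics.median(hours),
    "sd": statistics.stdev(hours),
}

# Bivariate summary: a cross-tabulation of two categorical variables.
crosstab = Counter((row["sex"], row["employed"]) for row in data)

print(summary)
for (sex, employed), count in sorted(crosstab.items()):
    print(f"{sex:<8}{employed:<5}{count}")
```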
Multivariate Data Analysis:
A common aim in empirical social science research is to evaluate the relative influence that multiple explanatory variables might have on an outcome. Large scale datasets support multivariate data analysis techniques such as statistical models. A central attraction of using statistical models in social science research is that they help researchers to disentangle interrelated effects within social science datasets.
- Fogarty, B.J., 2018. Quantitative social science data with R: an introduction. Sage.
- Mehmetoglu, M. and Jakobsen, T.G., 2022. Applied statistics using Stata: a guide for the social sciences. Sage.
- Treiman, D.J., 2014. Quantitative data analysis: Doing social research to test ideas. John Wiley & Sons.
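To illustrate how a statistical model separates the contributions of several explanatory variables, the sketch below fits a linear regression by ordinary least squares using plain Python. It is a teaching illustration only: the data are invented and were constructed so that the outcome equals 5 + 2 × years of education + 0.5 × age exactly, the function names `solve` and `ols` are hypothetical, and no standard errors are computed. In practice, a statistical package would be used.

```python
def solve(a, b):
    """Solve the linear system a.x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def ols(y, predictors):
    """Ordinary least squares via the normal equations (X'X)b = X'y."""
    x = [[1.0] + list(row) for row in predictors]  # add an intercept column
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(x, y)) for i in range(k)]
    return solve(xtx, xty)


# Invented data: (years of education, age) for six hypothetical respondents.
predictors = [(10, 20), (12, 30), (16, 25), (11, 45), (18, 50), (14, 35)]
# Outcome constructed as 5 + 2 * education + 0.5 * age, with no random error.
outcome = [5 + 2 * edu + 0.5 * age for edu, age in predictors]

coefficients = ols(outcome, predictors)
print([round(b, 3) for b in coefficients])
```

The estimated coefficients should recover the constructed values (approximately 5, 2 and 0.5). Each slope describes the association between one explanatory variable and the outcome while the other explanatory variable is held constant.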
Presenting research at research meetings, seminars and conferences is a method of gaining rapid critical feedback from peers working in the research field. In order to address brickbats, researchers are often required to undertake more work. This might involve additional data wrangling and then further exploratory data analysis in order to improve the multivariate data analysis.
The ‘writing-up’ phase involves collating results and drafting social science text, and then editing and proofreading before producing the manuscript for submission. Outputs commonly include tables describing the dataset (including summary statistics), graphs illustrating relationships between variables, tables of statistical modelling results (e.g. regression tables) and graphical representations of modelling results.
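Tables of modelling results can be generated by code rather than typed by hand, which helps keep the manuscript consistent with the analysis. The sketch below formats a simple plain-text regression table. The function name `format_table`, the variable names and all of the numbers are invented for illustration.

```python
def format_table(rows, headers):
    """Return a plain-text table with aligned columns."""
    cells = [headers] + [
        [name] + [f"{value:.3f}" for value in values] for name, *values in rows
    ]
    widths = [max(len(row[i]) for row in cells) for i in range(len(headers))]
    lines = []
    for row in cells:
        first = row[0].ljust(widths[0])  # left-align the variable names
        rest = [cell.rjust(width) for cell, width in zip(row[1:], widths[1:])]
        lines.append("  ".join([first] + rest))  # right-align the numbers
    return "\n".join(lines)


# Invented modelling results (hypothetical variable names and values).
results = [
    ("Intercept", 5.0, 1.2),
    ("Years of education", 2.0, 0.25),
    ("Age", 0.5, 0.1),
]
table = format_table(results, ["Variable", "Coef.", "Std. Err."])
print(table)
```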
University regulations stipulate requirements related to the length, structure and format of undergraduate dissertations and postgraduate theses.
Academic journals publish high-quality research and have specific requirements about the content of papers that are submitted. Most journals provide detailed guidance on the production of material for submission and often provide step-by-step guides on the process. Electronic submission is now widespread, although journals use different software and online platforms.
The benefits of recording the entire research process are frequently overlooked. Archiving data and research materials (e.g. data analysis code) and documenting the complete statistical data analysis workflow are invaluable practices. There is an increasing movement to make research transparent and reproducible. Research materials should be rendered Findable and Accessible (by others), and they should also be Interoperable (i.e. they can be used by other people and other machines) and Reusable in order to maximise the potential for transparency and reproducibility. These are the FAIR principles.
Social scientists use different approaches to archive research, for example GitHub, OSF and university repositories. GitHub is a service that was originally intended for software development, but it has been successfully used as an accessible platform to host data, research code, documents and other objects related to the research process. OSF is a collaboration tool that is especially designed to help research teams work on projects and to make the entire project publicly accessible for broad dissemination. Many universities have Current Research Information Systems (CRIS), which hold data on projects and publications, and which link research inputs and outputs to provide a broad picture of research activities within the university. These systems can be used to archive research.
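One possible way to document what has been archived is to store a manifest of file checksums alongside the research materials, so that others can confirm they hold the same files. The sketch below is an assumption about good practice and is not prescribed by the FAIR principles or by any of the services named above. The function name `build_manifest` and the file names are invented.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def build_manifest(folder):
    """Map each file's relative path to its SHA-256 checksum."""
    folder = Path(folder)
    manifest = {}
    for path in sorted(folder.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(folder).as_posix()] = digest
    return manifest


# Demonstration on a throwaway folder with invented file names.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp)
    (project / "code").mkdir()
    (project / "code" / "01_wrangle.py").write_bytes(b"print('wrangle')\n")
    (project / "README.txt").write_bytes(b"Project notes\n")
    manifest = build_manifest(project)

print(json.dumps(manifest, indent=2))
```

The resulting JSON could be saved in the project folder before it is deposited, and regenerated later to check that no file has changed.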
About the author
Professor Vernon Gayle is Chair of Sociology and Social Statistics at the University of Edinburgh. His work involves the statistical analysis of large-scale and complex social science datasets. These datasets include both social surveys and administrative data resources.
- Published on: 29 September 2023
- Event hosted by: Edinburgh University
- Keywords: Statistical Data Analysis Workflow