Do we teach enough secondary data analysis?

Date: 22nd March 2017

Category: NCRM news

Author(s): John MacInnes, NCRM, University of Edinburgh

Research needs good data, whatever form it comes in. For quantitative data, that almost always means using data that has been collected by someone else, usually a government agency or professional social survey organisation. Only they have the resources to keep good sampling frames, develop robust survey instruments and carry out high quality fieldwork. While there will always be some scope for smaller scale bespoke surveys undertaken by an individual or group of academics, especially in developing new fields of research, testing unproven theoretical ideas or responding quickly to some event, the economics of data collection mean that such surveys will be the exception rather than the rule.

Given this, it is surprising how little attention we have paid to the mechanics of secondary data analysis when teaching research methods, compared to the statistical theory and techniques used in data analysis itself. Yet, when using secondary data, the bulk of the work comprises getting it into a form that allows such analysis to be undertaken in the first place. By the ‘mechanics’ I mean such tasks as locating and accessing suitable datasets, going through the data documentation to identify relevant variables, checking on the target population, understanding any weights used, examining the question routing in survey instruments, reorganising data files, dealing with missing values, recoding variables and so on. Researchers also need to learn the difference between ‘data exploration’ (examining the data without many specific hypotheses to see what some of the main patterns or associations seem to be: good!) and ‘data snooping’ or ‘data dredging’ (post hoc Texan sharpshooting that seizes on any ‘statistically significant’ association as ‘proof’ of a hypothesis devised to be consistent with it: bad!).
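To give a flavour of what these ‘mechanics’ can involve, here is a minimal sketch in Python of three of the tasks listed above: dealing with missing values, recoding a variable and applying weights. Everything in it is an assumption made for illustration: the variable names (age, trust, weight), the missing value codes (-9 and -8), the age bands and the four invented respondents do not come from any real survey, and real datasets document their own codes and weighting schemes.

```python
# Hypothetical survey extract. All names, codes and weights below are
# invented for illustration and do not come from any real dataset.
MISSING_CODES = {-9, -8}  # assumed codes for "refused" and "don't know"

respondents = [
    {"age": 23, "trust": 1, "weight": 1.5},
    {"age": 47, "trust": 0, "weight": 0.5},
    {"age": 68, "trust": -9, "weight": 1.0},  # missing answer
    {"age": 35, "trust": 1, "weight": 1.0},
]


def clean(value):
    """Replace an assumed missing-value code with None."""
    return None if value in MISSING_CODES else value


def age_band(age):
    """Recode age in years into three (hypothetical) bands."""
    if age < 30:
        return "18-29"
    if age < 60:
        return "30-59"
    return "60+"


def weighted_proportion(rows, variable):
    """Weighted mean of a 0/1 variable, skipping missing answers."""
    valid = [r for r in rows if r[variable] is not None]
    total_weight = sum(r["weight"] for r in valid)
    return sum(r["weight"] * r[variable] for r in valid) / total_weight


# Apply the missing-value rule and add the recoded age variable.
cleaned = [
    {**r, "trust": clean(r["trust"]), "age_band": age_band(r["age"])}
    for r in respondents
]

# Compare the estimate with and without the survey weights.
valid_answers = [r["trust"] for r in cleaned if r["trust"] is not None]
unweighted = sum(valid_answers) / len(valid_answers)
weighted = weighted_proportion(cleaned, "trust")

print(f"Unweighted proportion: {unweighted:.3f}")
print(f"Weighted proportion:   {weighted:.3f}")
```

Even in this toy case the weighted and unweighted figures differ, and treating the code -9 as a genuine answer would distort both, which is why the documentation has to be read before any analysis starts.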

Part of the blame for this lies in university teachers’ failure to keep their model of the statistical ‘problem solving cycle’ up to date. Most readers will be familiar with the idea of ‘formulating a problem, collecting data, analysing it, drawing a conclusion and then refining or reforming the original problem’. If we imagine ‘collecting data’ to comprise designing and fielding a survey instrument we are living in the past. It now means finding appropriate data that has been collected by others, and judging how far it suits our purposes. Teaching ought to reflect this. I’m unconvinced of the pedagogical benefits of having students design a questionnaire. While it potentially introduces them to issues of validity and reliability, to ensuring mutually exclusive and comprehensive categories, to question wording and to the whole business of trying to ensure some correspondence of meaning between data producer and respondent, how often is this realised in practice? Would we not be on firmer ground looking at good examples of survey instruments and asking ‘why these questions?’, and often ‘why so many?’, ‘why this order?’, and so on? Amongst other benefits, this helps open students’ eyes to the real difficulties of good measurement, and the need for some appropriate caution about the quality of even the best data. I doubt that many of our graduates ever end up designing a questionnaire, but I’m confident that most will be faced with using data produced by someone else.

I suspect that a corresponding weakness in our university teaching, especially at undergraduate level, is that we do not do enough to show students just how much useful data there is out there, and how accessible it is. Thirty years ago secondary data analysis was a tiresome business of ordering data on physical media (remember computer tapes, punch cards!) and arranging to get it onto a university mainframe. Today tens of thousands of high quality surveys are a couple of mouse clicks away. Online tools like Nesstar mean anyone can explore data. The social sciences are about evidence (unless you are Michael Gove or Donald Trump). Now that it is so much easier to access and explore, why do we not insist that students use it directly in their work, rather than always relying on its analysis and interpretation by others?

Students often find research methods boring. Yet secondary data analysis can be so exciting. I can think of few datasets that do not contain results that contradict students’ often dearly held misconceptions about the world, or offer opportunities for students to argue about what the data shows. And best of all (with apologies to the stronger variants of social constructionism) it is real. If anything good comes out of the events of 2016 it will be the rehabilitation of facts (without scare quotes) as something precious, as subversive and radical. Secondary data analysis gives students the skills to get their hands on some and do something with them. We should teach a lot more of it.

John MacInnes is the author of a recent textbook that aims to give students the skills they need to do online or offline secondary data analysis.

An Introduction to Secondary Data Analysis with IBM SPSS Statistics. Sage 2017.