Why is Big Data important and what are the challenges for social scientists?

Category: NCRM news
Author(s): Jane Elliott, Economic and Social Research Council

Largely thanks to Morten Tyldum’s film ‘The Imitation Game’, many people now know the story of Alan Turing and his contribution to the Second World War. The challenge for Turing was to develop a computer that could swiftly work through many permutations to decipher encrypted German messages about war tactics. The challenge for social scientists today is to make the best use of computing power, and of newly developed algorithms, to capitalise on the vast quantity and variety of data that are created at speed in our knowledge-driven and digitally connected society.

We are living at a time of great opportunity for social science. The digital revolution has led to the generation of a huge amount of evidence about people’s daily activities, including their social networks and communications. This evidence can be interrogated to help us understand more about individuals and the communities and institutions to which they belong. As has been argued by sociologists such as Mike Savage and Roger Burrows [1], a key advantage of much of this data is that it records actual transactions and activities rather than individuals’ reported activities.

Although there is undeniably now a ubiquity of data, some datasets can be seen as more intrinsically ‘valuable’ than others. There is growing appreciation of the huge potential of administrative datasets, often held by government departments as a result of the routine work of the department in interaction with the public. An example would be the Work and Pensions Longitudinal Study, which links information about individuals’ benefit records (held by DWP) to information from HMRC about employment and pension contributions. Gaining access to these datasets, in anonymised form and in safe settings, can be a challenge even for specially trained or ‘approved’ researchers. What makes this type of Big Data so valuable is that, although never perfect, it does not suffer from the biases inherent in survey data. Coverage is of the whole population, not a survey sample; the quality of the data depends on administrative processes rather than individuals’ memories. Given the research potential of these rich data, it is unsurprising that there is considerable frustration among academic researchers that the data resources they most need are often still tantalisingly just beyond their grasp. The ESRC-funded Administrative Data Research Network was set up in 2013 to help make this administrative data more accessible to researchers.

In this new landscape of Big Data there are perhaps three main challenges for social scientists. First, there is a methodological challenge. How can we develop the very best tools to help us interrogate, analyse and understand the vast quantities and varieties of data that now exist? For example, how can we ensure that the methods we use for analysing textual material fully exploit the potential of newly developed machine-learning techniques? Despite a small vanguard of individuals who are working productively with colleagues from computer science and mathematics to develop new techniques, this is still very much a niche area of work. Many academic researchers continue to use the methods and approaches with which they are familiar and comfortable, even though this can limit the scope of their analysis.

Second, there is the challenge of framing insightful research questions. Indeed, this can sometimes be seen as in tension with the first challenge. There is a danger that social scientists at the cutting edge of developing methodological techniques can get distracted by the fascinating ‘puzzles’ of how to interrogate a new corpus of data, rather than focussing energy on the substantive evidence that can be gleaned from the empirical material. At the ESRC we are particularly interested in how we can facilitate co-creation of research questions – bringing together practitioners, policy makers and academic researchers so that they can construct questions that are interesting, useful and tractable. There is also a need to foster interdisciplinary collaborations. There can be a productive iteration here between the interesting and the possible – as new technologies such as machine learning make it possible to ask different types of question of data, this will in turn fuel our imaginations to think of new sets of substantive research questions.

Third, we need to address the ethical questions that are raised by new forms of data and new approaches to analysis. For example, there are debates about whether consent is needed for access to these new sources of Big Data and, if so, what type. The revised ESRC framework on data ethics includes sections on internet-mediated research and other elements relevant to the use of Big Data. In addition, it was announced earlier this year that the Government is establishing a new council for data science ethics.

Arguably, if we can address these three main challenges we will be more successful in gaining the trust of data owners and of the individuals and the wider public who generate ‘Big Data’. The aptly named ‘Alan Turing Institute’ (ATI), based at the British Library, was launched in November 2015 to advance data science and foster interdisciplinary collaborations to tackle research questions that can ultimately have a positive impact. Social scientists are already contributing to the work of the ATI, but it is vital that more researchers across the social sciences develop an understanding of the potential of Big Data and of the need to engage with data science in order to address new substantive research problems.

Reference
1. Burrows, R. and Savage, M. (2014) ‘After the crisis? Big Data and the methodological challenges of empirical sociology’. Big Data & Society, April-June: 1-6. DOI: 10.1177/2053951714540280