Synthetic data: accelerating public policy research


Paul Calcraft, Behavioural Insights Team

Katie Harron, University College London Great Ormond Street Institute of Child Health

Dora Kokosi, University College London

What if society-level patterns in behaviour and outcomes could be easily analysed by researchers to inform policy and services, without risking the privacy of any individual citizen? An idea from a Harvard professor in 1993 may provide exactly that: synthetic data. Synthetic data is a new copy of a data set that is generated at random, but following the structure and (some) patterns of the original data. Each piece of information in the data set is plausible (e.g. an athlete's height is usually between 1.5 and 2.2 meters, never 1 kilometer), but it is chosen randomly from the range of possible values, not by pointing to any original individual in the data set. We will show how synthetic data is helping to expand the use of data in policy research, and outline our ambitions to further improve the efficiency and safety of public policy research.
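To make the idea concrete, here is a minimal sketch in Python of the simplest possible form of synthesis: each value is drawn at random from the range seen in the original column, rather than copied from any one original record. The function name `synthesise`, the column names, and the three example records are all hypothetical and chosen only for illustration. This is not the method used by any particular synthetic data tool, and it deliberately ignores relationships between columns, which real synthesisers try to preserve.

```python
import random


def synthesise(records, n, seed=0):
    """Generate n synthetic rows by sampling each column independently.

    Numeric columns are drawn uniformly between the observed minimum and
    maximum; other columns are drawn from the set of observed values.
    Illustrative assumption: columns are treated as unrelated to each other.
    """
    rng = random.Random(seed)
    columns = list(records[0].keys())
    synthetic = []
    for _ in range(n):
        row = {}
        for col in columns:
            values = [r[col] for r in records]
            if all(isinstance(v, (int, float)) for v in values):
                # Plausible value: somewhere within the observed range.
                row[col] = round(rng.uniform(min(values), max(values)), 2)
            else:
                # Plausible category: one that appears in the original data.
                row[col] = rng.choice(values)
        synthetic.append(row)
    return synthetic


# Hypothetical original data set (made up for this example).
original = [
    {"sport": "rowing", "height_m": 1.92},
    {"sport": "gymnastics", "height_m": 1.55},
    {"sport": "basketball", "height_m": 2.06},
]

synthetic = synthesise(original, n=5)
for row in synthetic:
    print(row)
```

Every synthetic row has a plausible sport and a plausible height, but no row is built by pointing to one original athlete. Because the columns are sampled independently, a pattern such as "basketball players tend to be taller" is lost in this sketch; keeping patterns like that is what more sophisticated synthesis methods aim for.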