
There’s a well-known saying in data science: “garbage in, garbage out”. It’s a blunt but accurate reminder that poor-quality data leads to poor-quality results, no matter how sophisticated the analysis. At ADR UK (Administrative Data Research UK), one of the ways we address this challenge is through our Data Explained publications.
ADR UK is a partnership dedicated to improving access to public sector data for research purposes. By providing secure, ethical access to de-identified, linked administrative datasets, ADR UK is helping to drive better, evidence-based policy decisions. But as any researcher working with administrative data knows, this kind of data wasn’t created with research in mind.
Instead, administrative data is generated during the everyday operations of government departments and public services. It’s practical and functional, but often messy, with inconsistencies in how information was recorded (or not). For researchers, that messiness can lead to misinterpretation, bias, or flawed conclusions.
That’s where ADR UK’s Data Explained publications come in.
Largely written by ADR UK Research Fellows, these publications distil real-world challenges and insights from using specific datasets for research. These reflections help other researchers understand the limitations of the data before diving in, improving the quality of analysis from the very start.
What researchers are learning through Data Explained publications
Take the work of Professor Tim McSweeney, who explored serious and organised crime (SOC) using data from the Crown Court and magistrates’ courts. His project encountered several major data issues:
- No dedicated flag or marker in the data to identify SOC cases
- Incomplete information about all offences being prosecuted
- No data on victims or complainants
- Missing details on factors influencing sentencing (for example, aggravating or mitigating circumstances).
These gaps meant he had to exclude some cases from his analysis altogether, such as those lacking a record of the most serious offence. He also made recommendations to the data owners, including exploring the creation of a dedicated SOC marker in future datasets. Without his Data Explained, other researchers might have walked into these same pitfalls unaware.
Another powerful example comes from Dr Xiaohui Zhang, who investigated how time spent in care impacts educational outcomes for young adults in England. She used the Growing Up in England dataset, which combines Census 2011 data with the National Pupil Database.
But despite its richness, the dataset had critical limitations:
- No school identifier, making it impossible to control for school-level influences like staff quality or facilities
- A mismatch in the format of key variables (such as adr_id) across different waves (time periods) of the data, preventing linkage beyond certain years.
Because of these issues, Dr Zhang had to limit her analysis to a shorter time frame than originally intended – just five years instead of a full long-term view. Her experience highlights how missing metadata and inconsistent formats can dramatically constrain a study’s scope.
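To see why a mismatch like this is so disruptive, here is a minimal sketch of the failure mode. It is not Dr Zhang’s actual code: the variable name adr_id comes from the dataset, but the ID formats, column names and values below are invented purely for illustration.

```python
# A minimal, hypothetical sketch of how an identifier format mismatch breaks linkage.
# The field name adr_id comes from the dataset; the formats and values are assumptions.
import pandas as pd

# Wave 1: adr_id recorded as a six-character, zero-padded string (assumed format)
wave1 = pd.DataFrame({"adr_id": ["000123", "000456"], "exam_score": [52, 61]})

# Wave 2: the same individuals, but adr_id recorded without the leading zeros (assumed format)
wave2 = pd.DataFrame({"adr_id": ["123", "456"], "months_in_care": [18, 7]})

# A naive join links nothing, because "000123" never equals "123"
naive = wave1.merge(wave2, on="adr_id", how="inner")
print(len(naive))  # 0 records linked

# Harmonising the identifier before linkage recovers the matches
wave2["adr_id"] = wave2["adr_id"].str.zfill(6)
linked = wave1.merge(wave2, on="adr_id", how="inner")
print(len(linked))  # 2 records linked
```

Unless the format difference is documented, a researcher may never realise the empty join is an artefact of the data rather than a genuine absence of matching records.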
Why producing a Data Explained matters
These aren’t just technical snags: they’re fundamental research constraints. Each Data Explained publication provides a way to surface them early, helping other researchers design better studies, avoid common traps and manage expectations.
This also helps the research achieve greater impact. When policymakers rely on research to inform decisions that affect people’s lives, data quality matters more than ever.
A final word: The consequences of bad data
To illustrate how crucial data quality is, I’ll end with a cautionary tale shared by Professor Peter Christen during a data quality workshop (originally reported by The New York Times, 30 September 1992 – thanks to Professor Xiao-Li Meng at the Harvard Data Science Review):
Court Computer Says All Hartford Is Dead
Court officials have figured out why Hartford residents were excluded from Federal grand jury pools over the past three years: The computer that selected names thought everyone in the city was dead … The city’s name had been listed in the wrong place on the computer records, forcing the “d” at the end of “Hartford” into the column used to describe the status of prospective jurors. “D” stands for dead.
It’s an extreme and calamitous example, but it underscores the stakes. When we don’t understand the structure or quirks of our datasets, errors creep in. Decisions are made on faulty foundations. People, and sometimes even entire cities, get left out. Professor Christen and his colleague Professor Rainer Schnell have identified more than 30 such assumptions and misconceptions about administrative data.
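For readers who want to see the mechanics, here is a hypothetical reconstruction of that failure mode in a few lines of Python. The fixed-width layout, column positions and status codes are all invented for illustration; the original court system’s records were almost certainly laid out differently.

```python
# Hypothetical fixed-width layout: name in columns 0-11, city in columns 12-20,
# juror status in column 21 ("A" = eligible, "D" = dead). These positions are invented.
records = [
    "SMITH J     HARTFORD A",  # city entered in the correct columns
    "JONES P       HARTFORD",  # city entered two columns too far to the right
]

CITY, STATUS = slice(12, 21), slice(21, 22)

for rec in records:
    city, status = rec[CITY].strip(), rec[STATUS]
    verdict = "excluded as dead" if status == "D" else "eligible"
    print(f"city={city!r:<11} status={status!r} -> {verdict}")
```

In the misaligned record, the trailing “d” of “Hartford” spills into the status column, so a perfectly healthy juror is read as dead and silently dropped, which is exactly the kind of structural quirk a Data Explained is meant to flag.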
Consequences like these could have been mitigated by producing a Data Explained, helping to make sure the right insights are drawn from the right data, for the ultimate benefit of the public.
View all ADR UK Data Explained publications
Browse ADR UK flagship datasets
About ADR UK
ADR UK (Administrative Data Research UK) is a partnership transforming the way researchers access the UK’s wealth of public sector data, to enable better informed policy decisions that improve lives. By linking together data held by different parts of government and facilitating safe and secure access for accredited researchers to these newly joined-up and de-identified datasets, ADR UK is creating a sustainable body of knowledge about how our society and economy function – tailored to give decision makers the answers they need to solve important policy questions. To find out more, visit adruk.org or follow us on X and LinkedIn.