QAMyData: a health check for your numeric data

Date: 19th June 2019

Category: NCRM news

Author(s): Louise Corti

Social science research benefits from accountability and transparency, which can usefully be underpinned by high quality and trustworthy data. Rigorous data curation practices are still sometimes viewed as a dark art, and easy-to-use tools to correct and clean numeric data are not widely used, despite awareness of the push to make data FAIR (Findable, Accessible, Interoperable and Reusable). For repository staff, the tasks of checking, cleaning and documenting data can be all too manual and time-consuming.

As part of an NCRM Collaborative Project award in 2018, the UK Data Service developed a free, easy-to-use, open source tool known as QAMyData that provides a health check for numeric data. The tool uses automated methods to detect and report on some of the most common problems in survey or numeric data, such as missingness, duplication, outliers and direct identifiers. Requirements were scoped through a series of engagements with the Service’s own data curation team, other data publishers, managers and quantitative researchers, producing a comprehensive list of ‘tests’ that are typically used when quality assessing numeric data files. Some of the tests included are shown in the table.
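To make the kinds of problem named above more concrete, the following is a minimal Python sketch of four such checks: missingness, duplication, outliers and one direct identifier (email addresses). It is not QAMyData's own code. The function names, the three-standard-deviation outlier rule and the email pattern are all assumptions chosen for illustration.

```python
import re
from statistics import mean, stdev

# Assumption: a deliberately simple pattern for things shaped like an email address.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def missing_fraction(values):
    """Share of values that are None or an empty string."""
    if not values:
        return 0.0
    missing = sum(1 for v in values if v is None or v == "")
    return missing / len(values)


def duplicate_rows(rows):
    """Indices of rows that exactly repeat an earlier row."""
    seen = set()
    duplicates = []
    for index, row in enumerate(rows):
        key = tuple(row)
        if key in seen:
            duplicates.append(index)
        else:
            seen.add(key)
    return duplicates


def outlier_indices(values, max_sd=3.0):
    """Indices of numbers more than max_sd standard deviations from the mean."""
    numbers = [(i, v) for i, v in enumerate(values) if isinstance(v, (int, float))]
    if len(numbers) < 2:
        return []
    centre = mean([v for _, v in numbers])
    spread = stdev([v for _, v in numbers])
    if spread == 0:
        return []
    return [i for i, v in numbers if abs(v - centre) > max_sd * spread]


def email_indices(values):
    """Indices of string values that contain something shaped like an email address."""
    return [
        i for i, v in enumerate(values)
        if isinstance(v, str) and EMAIL_PATTERN.search(v)
    ]
```

Each function returns either a proportion or the positions of the offending values, which is the information a report would need in order to point the user to the location of a problem.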

The tool offers a number of configurable tests that have been categorised into four types: file, metadata, data integrity and direct identifiers. The tool can be run on popular file formats, including SPSS, Stata, SAS and CSV. A standard config file is available with default settings for each test, such as the threshold for a pass or fail (e.g. when detecting value labels that are truncated, email addresses held as strings, or undefined missing values). The configuration feature allows these thresholds to be easily adapted to the user’s own requirements, and helps to define and create a unique Data Quality Profile. New tests can also easily be added.
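The idea of default thresholds that a user can override to form their own profile can be sketched in a few lines of Python. This is an illustration only and does not reproduce QAMyData's actual config file format. The test names, the default values and the rule that a test fails when the measured value exceeds its threshold are all hypothetical.

```python
# Hypothetical default thresholds; not QAMyData's real settings.
DEFAULT_THRESHOLDS = {
    "missing_percent": 10.0,  # fail if more than 10% of values are missing
    "duplicate_rows": 0,      # fail if any duplicate rows are found
    "email_strings": 0,       # fail if any email-like strings are found
}


def build_profile(overrides=None):
    """Merge user overrides (and any new tests) into a copy of the defaults."""
    profile = dict(DEFAULT_THRESHOLDS)
    profile.update(overrides or {})
    return profile


def evaluate(measured, profile):
    """Mark each test 'fail' when its measured value exceeds the threshold."""
    return {
        name: "fail" if measured.get(name, 0) > limit else "pass"
        for name, limit in profile.items()
    }


# A user who tolerates more missing data relaxes one threshold.
profile = build_profile({"missing_percent": 25.0})
results = evaluate({"missing_percent": 20.0, "duplicate_rows": 3}, profile)
```

With these inputs the relaxed missingness test passes while the duplicate-row test fails, so the same data file can receive a different verdict under a different profile.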

The software creates a ‘data health check’ that sets out errors and issues in both a summary and a detailed report, giving the location of each failed test. Data depositors and publishers can act on the results and resubmit the file until a clean bill of health is produced.

The choice of technology for the tool followed at least four months of research, experimenting with different open source programming languages and libraries of statistical functions, including R, Python and Clojure, and focussing initially on SPSS and Stata files. The agile programming language Rust was selected as the best choice, building on the established ReadStat library, which is gaining recognition in the statistical community. The QAMyData software is easily downloaded to a laptop or server and can be quickly used and integrated into data cleaning and processing pipelines. It is available to download from the UK Data Service GitHub pages under an open licence, and will be further developed as a web service.

The grant also delivered a training module on what makes a clean and well-documented numeric dataset. A user guide, a training exercise and a purposely erroneous dataset were produced and road-tested during training sessions, including one with the AQMEN training provider at Edinburgh.

Outputs

• Project outputs, including overview presentation, table of tests, Installation Guide and training materials: https://www.ukdataservice.ac.uk/about-us/our-rd/qamydata.aspx

• Full day hands-on NCRM workshop, Assessing Data Quality and Disclosure Risk in Numeric Data, LSE, 20 February 2019