Bite-sized
Day 1: Thursday, 12 September
Future Data Services 2
Session conveners: Mark Elliot, Jon Johnson
Session 2.A: Enhancing Data Accessibility and Security through Innovative Data Synthesis (EDASIDA)
One bottleneck in the data discoverability pipeline is the availability of teaching datasets. The problem is particularly acute for data held in a virtual research environment, where access restrictions make producing teaching datasets problematic and so raise the 'hurdle height' for potential new users. In principle, synthetic data produced by the data services themselves are an option, but two considerations conflict: risk and utility. A preliminary study conducted at Manchester in collaboration with Administrative Data Research UK demonstrated the feasibility of generating synthetic datasets with high utility and low risk (even achieving zero marginal risk). The essential idea is to start from output that a service has cleared for publication and to use the parameters of that output (model coefficients, sufficient statistics, etc.) to define the objective function of a genetic algorithm for data synthesis. In this session, we will describe the results of these initial studies before outlining an extension of the work that uses the resulting synthetic data to assess the disclosure risk of the output itself. This potentially addresses another issue with trusted research environments (TREs): the informality and inconsistency of output checking procedures. Again, our initial results are very promising. Building on earlier work (Elliot et al. 2023), we have found that risk measures comparable to those applied to microdata can potentially be employed. The envisioned goal is to modernise and automate the output checking process.
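To make the idea of using cleared output as the objective function concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the presenters' implementation: a mean vector and covariance matrix stand in for the published "sufficient statistics", and the function names (`fitness`, `synthesise`) and all parameter values are hypothetical. The genetic algorithm evolves candidate datasets so that their summary statistics approach the published ones; it never sees the original microdata.

```python
import numpy as np


def fitness(data, target_mean, target_cov):
    """Squared distance between a candidate's statistics and the published ones.

    Lower is better. Mean and covariance are an assumed stand-in for whatever
    parameters the cleared output actually contains.
    """
    mean_err = np.sum((data.mean(axis=0) - target_mean) ** 2)
    cov_err = np.sum((np.cov(data, rowvar=False) - target_cov) ** 2)
    return mean_err + cov_err


def synthesise(target_mean, target_cov, n_rows=50, pop_size=30,
               generations=200, mutation_rate=0.05, mutation_scale=0.1, seed=0):
    """Evolve a synthetic dataset whose statistics match the targets."""
    rng = np.random.default_rng(seed)
    n_cols = len(target_mean)

    # Initial population: random candidate datasets, unrelated to any real data.
    population = [rng.normal(size=(n_rows, n_cols)) for _ in range(pop_size)]
    history = []  # best fitness at the start of each generation

    for _ in range(generations):
        population.sort(key=lambda d: fitness(d, target_mean, target_cov))
        history.append(fitness(population[0], target_mean, target_cov))

        # Elitist selection: the better half survives unchanged.
        survivors = population[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            i, j = rng.choice(len(survivors), size=2, replace=False)
            # Uniform crossover: each row comes from one of the two parents.
            take_first = rng.random(n_rows) < 0.5
            child = np.where(take_first[:, None], survivors[i], survivors[j])
            # Mutation: perturb a small random subset of cells.
            mutate = rng.random(child.shape) < mutation_rate
            child = child + mutate * rng.normal(scale=mutation_scale, size=child.shape)
            children.append(child)
        population = survivors + children

    best = min(population, key=lambda d: fitness(d, target_mean, target_cov))
    return best, history
```

Calling `synthesise(np.array([1.0, -2.0]), np.array([[2.0, 0.6], [0.6, 1.0]]))` returns the best candidate dataset and the per-generation fitness trace. Because the best candidates are carried over unchanged, the trace never increases. A real application would replace `fitness` with a comparison against the actual cleared output, for example refitting a model and comparing coefficients.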
Session 2.B: Extraction and Utilisation of Metadata from Non-machine-actionable Documents to Improve Data Curation and Discovery
We will describe the rationale for our approach, the results of prior work, and how it relates to the development of new methods to improve metadata uplift in survey and biomedical instruments.