Generating Synthetic Data for Statistical Disclosure Control

Date:

02/12/2014 - 03/12/2014

Organised by:

University of Southampton/ADRC-E

Presenter:

Jörg Drechsler

Level:

Intermediate (some prior knowledge)

Contact:

adrce@soton.ac.uk

Map:

SO17 1BJ

Venue:

Southampton Statistical Sciences Research Institute, Building 39, University of Southampton, Highfield Campus, Southampton

Description:

NOTE: THIS COURSE IS NOW FULLY BOOKED AND THE ON-LINE BOOKING SYSTEM HAS BEEN CLOSED.

Course No. ADRCE-Training011 Drechsler

 

Short Summary of Course

This short course provides a detailed overview of the synthetic data approach and covers all of its important aspects. Starting with a short introduction to data confidentiality in general and synthetic data in particular, the workshop will discuss the different approaches to generating synthetic datasets in detail. Possible modeling strategies and methods for evaluating analytical validity will be assessed, and potential measures to quantify the remaining risk of disclosure will be presented. Finally, recent extensions of the synthetic data approach will be reviewed, and the opportunities and obstacles of the approach will be discussed. To give participants hands-on experience, all steps will be illustrated using simulated and real data examples in R.

Course Contents

The course covers:
• the fully synthetic data approach
• the partially synthetic data approach
• modelling strategies for generating synthetic data
• data utility evaluations
• disclosure risk assessment

Learning Outcomes
By the end of the course participants will:
• have a practical understanding of the concept of synthetic data 
• be able to judge in which situations the approach could be useful
• know how to generate synthetic data from their own data
• have a number of tools available to evaluate the analytical validity of the synthetic datasets
• know how to assess the disclosure risk of the generated data


Computer Software and Computer Workshops

This event includes computer workshops.

The practical implementation of the approach will be illustrated using the statistical software R.


The Presenter

Jörg Drechsler is the deputy head of the Department for Statistical Methods at the Institute for Employment Research in Nürnberg. He studied business administration in Nürnberg and obtained his Ph.D. from the University of Bamberg in 2009. During the winter term 2011/2012 he held an interim professorship at the Institute for Statistics at the Ludwig-Maximilians-University in Munich. His main research interests are data confidentiality and nonresponse in surveys. He has received several awards for his research on synthetic data and recently published a book on this topic.

Target Audience

The course aims to summarize the state of the art in synthetic data. The main focus will be on practical implementation rather than on motivating the underlying statistical theory. Participants may be academic researchers or practitioners from statistical agencies working in the area of data confidentiality and data access. Some background in Bayesian statistics is helpful but not obligatory.


Duration

This is a two-day course. On day one, registration opens at 9.30, and formal teaching starts at 10.00 and finishes at around 17.00. On day two, teaching starts at 9.00 and finishes at around 16.00.

Event Outline (Programme)

1. A Brief History of Data Confidentiality
   a. Information Reduction vs. Data Perturbation
   b. The Computer Science Approach vs. the SDC Approach to Confidentiality
2. Some Basics Regarding Multiply Imputed Synthetic Datasets
   a. Fully Synthetic Datasets
   b. Partially Synthetic Datasets
   c. Applications in Practice
3. Analyzing Synthetic Datasets
   a. Fully Synthetic Data Combining Rules
   b. Partially Synthetic Data Combining Rules
   c. Extensions to Missing Data
4. Generating Synthetic Datasets
   a. Two Approaches for Multiple Imputation (joint modeling vs. sequential regression)
   b. Imputation Models and Modeling Strategies (linear and generalized linear models, machine learning approaches)
   c. Evaluating the Analytical Validity
   d. Evaluating the Risk of Disclosure
5. Recent Extensions of the Synthetic Data Approach
   a. A Synthesis Approach for Census Data
   b. A Two Stage Approach to Balance Analytical Validity and Disclosure Risk
6. Opportunities and Obstacles of the Approach
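
As a rough illustration of items 3 and 4 of the outline, the sketch below generates several partially synthetic copies of one variable and then combines the estimates from those copies. It is not taken from the course materials: the course itself uses R, whereas this sketch uses Python's standard library, and the function names synthesize_y and combine, the simulated data, and the single-predictor normal linear model are all assumptions made for illustration. The combining rules are written as they are commonly stated in the synthetic data literature: for partially synthetic data the variance is the average within-dataset variance plus the between-dataset variance divided by m, and for fully synthetic data it is (1 + 1/m) times the between-dataset variance minus the average within-dataset variance.

```python
import random
import statistics


def synthesize_y(x, y, m, rng):
    """Draw m synthetic replacements for y from a normal linear regression on x.

    Hypothetical helper: model parameters are redrawn from their approximate
    posterior for every copy, so the synthetic values reflect parameter
    uncertainty as well as residual noise.
    """
    n = len(y)
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    df = n - 2
    s2 = sum(
        (yi - y_bar - slope_hat * (xi - x_bar)) ** 2 for xi, yi in zip(x, y)
    ) / df
    copies = []
    for _ in range(m):
        # Residual variance: scaled inverse chi-square draw.
        sigma2 = s2 * df / rng.gammavariate(df / 2, 2)
        # With x centred, the level and the slope are independent given sigma2.
        level = rng.gauss(y_bar, (sigma2 / n) ** 0.5)
        slope = rng.gauss(slope_hat, (sigma2 / sxx) ** 0.5)
        sd = sigma2 ** 0.5
        copies.append(
            [level + slope * (xi - x_bar) + rng.gauss(0, sd) for xi in x]
        )
    return copies


def combine(estimates, variances, fully_synthetic=False):
    """Combine point estimates and their variances from m synthetic datasets."""
    m = len(estimates)
    q_bar = statistics.fmean(estimates)      # final point estimate
    b_m = statistics.variance(estimates)     # between-dataset variance
    u_bar = statistics.fmean(variances)      # average within-dataset variance
    if fully_synthetic:
        total = (1 + 1 / m) * b_m - u_bar    # can turn negative for small m
    else:
        total = u_bar + b_m / m
    return q_bar, total


# Simulated "original" data (an assumption, not course data).
rng = random.Random(2014)
x = [rng.gauss(0, 1) for _ in range(500)]
y = [1.0 + 2.0 * xi + rng.gauss(0, 1) for xi in x]

# Replace y in m = 5 copies, estimate its mean in each copy, then combine.
synthetic = synthesize_y(x, y, m=5, rng=rng)
means = [statistics.fmean(s) for s in synthetic]
mean_vars = [statistics.variance(s) / len(s) for s in synthetic]
q_bar, total_var = combine(means, mean_vars)
print(f"mean of y: {q_bar:.3f}, standard error: {total_var ** 0.5:.3f}")
```

In this sketch only y is replaced while x is kept as observed, which is what makes the copies partially rather than fully synthetic. A sequential regression approach would repeat the same step variable by variable, each time conditioning on the variables already synthesized.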

Pre-requisites

Some background in general linear modelling is expected. Familiarity with the concepts of Bayesian statistics is helpful but not required. The statistical software R will be used to illustrate the implementation of the approach. Familiarity with the basics of R would be useful but is not required.

Preparatory Reading

Kinney, S. K., Reiter, J. P., Reznek, A. P., Miranda, J., Jarmin, R. S., and Abowd, J. M. (2011), Towards unrestricted public use business microdata: The synthetic Longitudinal Business Database, International Statistical Review, 79, 363-384.
Reiter, J. P. (2012), Statistical approaches to protecting confidentiality for microdata and their effects on the quality of statistical inferences, Public Opinion Quarterly, 76, 163-181.
Drechsler, J. (2011), Synthetic datasets for statistical disclosure control: Theory and implementation, Lecture Notes in Statistics, 201, New York: Springer.

Course Materials

Participants will receive written course notes.

 

Cost:

Thanks to ESRC funding we are able to offer this course at reduced rates as follows:
1) £30 per day for UK-registered students
2) £60 per day for staff at UK academic institutions, RCUK-funded researchers, UK public sector staff and staff in UK-registered charity organisations
3) £220 per day for all other participants
4) Free places for ADRC-E and ADRN/ADS staff

The course fee includes course materials, lunches and morning and afternoon refreshments. Travel and accommodation are to be arranged and paid for by the participant.

Website and registration:

Region:

South East

Keywords:

Analysis of administrative data, Statistical Disclosure Control, Confidentiality and Anonymity

