Data Linkage: From Theory to Practice

Date:

23/05/2016 - 25/05/2016

Organised by:

University of Southampton/ADRC-E

Presenter:

Professor Natalie Shlomo

Level:

Entry (no or almost no prior knowledge)

Contact:

adrce@southampton.ac.uk

Map:

SO17 1BJ

Venue:

Southampton Statistical Sciences Research Institute, Building 39, University of Southampton, Highfield, Southampton

Description:

Course places are limited and registration by 16 May 2016 is strongly recommended.

Course No. ADRCE-Training023 Shlomo

Summary of Course:

The course will introduce the basic concepts and methods of record linkage and will cover methodological and statistical aspects of this emerging area. It will provide the theory and practical applications of deterministic and probabilistic approaches to record linkage, including pre-matching processes, matching weights, types of classification error, evaluation of the quality of linkage procedures, implementation of the E-M algorithm and an introduction to the analysis of linked datasets. By the end of the course, participants should understand record linkage techniques and be able to implement and evaluate record linkage procedures. The course does not assume any prior knowledge of record linkage, and one session will be devoted to revising the basic concepts in probability theory needed to understand probabilistic record linkage. The course will have a strong practical emphasis and will include tutorials and a computer workshop in which participants put the taught methods into practice. The software used will be SAS, although no prior familiarity with SAS is required. (This course is more intensive than the course ‘Introduction to Data Linkage’.)
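As a rough illustration of the matching weights mentioned in the summary, the sketch below computes field agreement and disagreement weights in the Fellegi and Sunter style, as log2(m/u) and log2((1-m)/(1-u)), and sums them over fields. It is written in Python purely for illustration (the course itself uses SAS). The field names and the m- and u-probabilities are invented values, not course material.

```python
import math

# Hypothetical probabilities for three comparison fields (illustrative only).
# m = P(field agrees | the two records belong to the same person)
# u = P(field agrees | the two records belong to different people)
FIELD_PROBS = {
    "surname": (0.95, 0.01),
    "birth_year": (0.98, 0.05),
    "postcode": (0.90, 0.02),
}


def field_weight(m, u, agrees):
    """Log2 likelihood ratio contributed by one field."""
    if agrees:
        return math.log2(m / u)            # agreement weight (positive when m > u)
    return math.log2((1 - m) / (1 - u))    # disagreement weight (negative when m > u)


def match_weight(agreement):
    """Sum the field weights, assuming fields agree independently within each class."""
    return sum(
        field_weight(*FIELD_PROBS[field], agrees)
        for field, agrees in agreement.items()
    )


all_agree = match_weight({"surname": True, "birth_year": True, "postcode": True})
one_disagree = match_weight({"surname": True, "birth_year": True, "postcode": False})

print(f"all fields agree:   {all_agree:.2f}")
print(f"postcode disagrees: {one_disagree:.2f}")
```

In a probabilistic linkage, pairs with a total weight above an upper threshold would be declared links, pairs below a lower threshold non-links, and the remainder sent for clerical review. The thresholds themselves are a design choice and are not shown here.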

 

Course Objectives:

By the end of the course, students should have an understanding of data linkage techniques and be able to implement and evaluate data linkage procedures. There will be practical sessions with a computing lab and tutorials.

 

Course Content:

  • Introduction and types of record linkage methods
  • Sources for record linkage
  • Examples of record linkage applications
  • Pre-matching processes (data cleaning, standardizing and parsing of fields)
  • Revision of probability and odds, Bayes' theorem and hypothesis testing
  • Deterministic matching
  • Probabilistic matching
  • Field agreement weights and frequency-based weights
  • String comparators
  • Blocking variables
  • Evaluation of record linkage
  • Introduction to the E-M algorithm
  • Introduction to the analysis of linked datasets
  • Tutorials
  • Computing lab in SAS: applying record linkage to two datasets
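To give a flavour of the E-M algorithm item in the list above, here is a minimal sketch of how it can estimate the m- and u-probabilities from agreement patterns alone, assuming two latent classes (match and non-match) and conditional independence between fields. It uses Python rather than the course's SAS. The pattern counts, starting values and function names are all made up for illustration.

```python
def clamp(x, eps=1e-6):
    """Keep a probability strictly inside (0, 1) to avoid division by zero."""
    return min(max(x, eps), 1 - eps)


def em_match_probabilities(pattern_counts, n_iter=200, p=0.1, m_init=0.9, u_init=0.1):
    """Estimate p = P(match), m[j] and u[j] from counts of 0/1 agreement patterns."""
    k = len(next(iter(pattern_counts)))
    m = [m_init] * k
    u = [u_init] * k
    n = sum(pattern_counts.values())

    for _ in range(n_iter):
        # E-step: posterior probability that a pair with each pattern is a match.
        post = {}
        for gamma in pattern_counts:
            like_m, like_u = p, 1 - p
            for j, agree in enumerate(gamma):
                like_m *= m[j] if agree else 1 - m[j]
                like_u *= u[j] if agree else 1 - u[j]
            post[gamma] = like_m / (like_m + like_u)

        # M-step: re-estimate the parameters from the expected class memberships.
        exp_matches = sum(post[g] * c for g, c in pattern_counts.items())
        p = clamp(exp_matches / n)
        for j in range(k):
            agree_m = sum(post[g] * c for g, c in pattern_counts.items() if g[j])
            agree_u = sum((1 - post[g]) * c for g, c in pattern_counts.items() if g[j])
            m[j] = clamp(agree_m / exp_matches)
            u[j] = clamp(agree_u / (n - exp_matches))

    return p, m, u


# Hypothetical counts of record pairs by agreement pattern on three fields.
toy_counts = {
    (1, 1, 1): 90, (1, 1, 0): 8, (1, 0, 1): 6, (0, 1, 1): 5,
    (1, 0, 0): 60, (0, 1, 0): 70, (0, 0, 1): 50, (0, 0, 0): 711,
}

p_hat, m_hat, u_hat = em_match_probabilities(toy_counts)
print("estimated match proportion:", round(p_hat, 3))
print("estimated m-probabilities:", [round(x, 3) for x in m_hat])
print("estimated u-probabilities:", [round(x, 3) for x in u_hat])
```

The estimates depend on the starting values and on the conditional independence assumption, so in practice they would be checked against a clerically reviewed sample where one is available.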

 

Target Audience:

The course is aimed at researchers who need to gain an understanding of record linkage techniques. It emphasizes putting theory into practice for those who need to carry out record linkage in their own work. Participants may be academic researchers in the social and health sciences, or may work in government, survey agencies, official statistics, charities or the private sector.

 

Pre-requisites:

The course does not assume any prior knowledge of record linkage, and a special session will be devoted to revising the probability theory necessary to understand probabilistic record linkage. No familiarity with SAS will be assumed.

 

Course Materials:

Participants will receive course notes, tutorials and computing lab material.

 

Presenter:

Natalie Shlomo is a Professor in Social Statistics, School of Social Sciences at the University of Manchester. She has extensive knowledge of survey methods including data processing: record linkage, edit and imputation processes and statistical disclosure control.

 

Programme:

Monday, May 23rd     

 

09:30 – 10:00              Registration and coffee

 

10:00 – 10:15              Welcome and Introductions

 

10:15 – 11:00              Session 1: Introduction to data linkage

                                    Types of data linkage

                                    Sources for data linkage

 

11:00 – 11:20              Break

 

11:20 – 12:30              Session 2: Ethics and disclosure control

                                    Overview of methods

                                    Introductory example

 

12:30 – 13:30              Lunch

 

13:30 – 15:00              Session 3: Tutorial

                                    Exact matching

                                    Pre-matching processes  

                                                       

15:00 – 15:20             Break

                                


 

15:20 – 16:30              Session 4: Standardization/parsing

                                    Phonetic codes and string comparators

                                

 

Tuesday, May 24th     

 

09:30 – 10:00             Coffee

 

10:00 – 11:00             Session 5:  Revision: Probability and odds

 

11:00 – 11:20              Break

 

11:20 – 12:30              Session 6: Revision: hypothesis testing and types of errors

                                    Basic concepts of probabilistic record linkage

 

12:30 – 13:30              Lunch

 

13:30 – 15:00              Session 7:  Field agreement/disagreement weights

                                    Frequency and outcome specific weights    

                                    Blocking variables

                               

15:00 – 15:20             Break

                                

15:20 – 16:30              Session 8:  Tutorial

                                    Constraints on matching

                                    Post-linkage

                                    Fellegi and Sunter Theorem

                                

 

Wednesday, May 25th     

 

09:30 – 10:00              Coffee

 

10:00 – 11:00              Session 9:  Evaluation of data linkage procedures

                                    Introduction to the E-M Algorithm

                                    Introduction to the analysis of linked data

 

11:00 – 11:20              Break

 

11:20 – 12:30              Session 10:  Introduction to the analysis of linked data

                                    Some recent extensions and applications

 

12:30 – 13:30              Lunch

 

13:30 – 15:00             Session 11: Computing Lab

                               

15:00 – 15:20             Break

                                

15:20 – 16:30              Session 12: Computing Lab

                                    Wrap-up, evaluation and discussion

On the last day, there will be an opportunity for participants to ask questions on how to link their own datasets (you can bring your own data to the course if you wish).

 

Preparatory Reading:

Belin, T.R. and Rubin, D. B. (1995) A Method for Calibrating False-Match Rates in Record Linkage. Journal of the American Statistical Association, 90, 694-707.

 

Fellegi, I. P. and Sunter, A. B. (1969) A Theory for Record Linkage, Journal of the American Statistical Association, 64, 1183-1210.

 

Gill, L. (2001) Methods for Automatic Record Matching and Linkage and their use in National Statistics,  The National Statistics Methodology Series, ONS (available at http://www.ons.gov.uk/ons/guide-method/method-quality/specific/gss-methodology-series/index.html)

 

Harron, K., Goldstein, H. and Dibben, C. (2016) Methodological Developments in Data Linkage. Wiley Series in Probability and Statistics, Chichester: Wiley.

Herzog, T. N., Scheuren, F. J. and Winkler, W. E. (2007) Data Quality and Record Linkage Techniques. New York: Springer. ISBN 978-0-387-69502-0

 

Lahiri, P. and Larsen, M.D. (2005) Regression Analysis with Linked Data. Journal of the American Statistical Association, Vol. 100, No. 469, 222-230 (also at http://www.stat.iastate.edu/preprint/articles/2004-09.pdf)

 

Mason, C.A. and Tu, S. (2008) Data Linkage Using Probabilistic Decision Rules: A Primer, Birth Defects Research (Part A): Clinical and Molecular Teratology, 82, 812-821

 

Sadinle, M. and Fienberg, S.E. (2013) A Generalized Fellegi-Sunter Framework for Multiple Record Linkage With Application to Homicide Record Systems. http://arxiv.org/abs/1205.3217

 

Scheuren, F. and Winkler, W. E. (1993) Regression Analysis of Data Files that are Computer Matched, Survey Methodology, 19, 39-58

http://www.fcsm.gov/working-papers/scheuren_part1.pdf

 

Scheuren, F. and Winkler, W. E. (1997) Regression Analysis of Data Files that are Computer Matched II,  Survey Methodology, 23, 157-165

http://www.fcsm.gov/working-papers/scheuren_part2.pdf

 

Winglee, M., Valliant, R. and Scheuren, F. (2005) A Case Study in Record Linkage. Survey Methodology, Vol. 31, Number 1, 3-12.

 

Winkler, W. E. (1995) Matching and Record Linkage, in B.G. Cox et al. (ed) Business Survey Methods, New York: J. Wiley, 355-384

http://www.fcsm.gov/working-papers/wwinkler.pdf


Terms and conditions: 12 Cancellation and Refund of Events and Services

http://store.southampton.ac.uk/help/?HelpID=1

Cost:

The fee per day is:

1. £30 - For UK registered postgraduate students
2. £60 - For staff at UK academic institutions, Research Council UK funded researchers, UK public sector staff and staff at UK registered charity organisations
3. Free Place for ADRC/ADRN/ADS staff
4. £220 - For all other participants

All fees include event materials, lunch, morning and afternoon tea. They do not include travel and accommodation costs.

Website and registration:

Region:

South East

Keywords:

Analysis of administrative data, Data Quality and Data Management (other), Data linkage, Quantitative Approaches (other)

