Introduction to Data Linkage
Date:
22/10/2024 - 23/10/2024
Organised by:
NCRM, University of Southampton
Presenter:
Professor Katie Harron and Dr James Doidge
Level:
Entry (no or almost no prior knowledge)
Contact:
Jacqui Thorp
Training and Capacity Building Coordinator, National Centre for Research Methods, University of Southampton
Email: jmh6@soton.ac.uk
Description:
This short course gives participants a practical introduction to data linkage. It is aimed both at analysts who intend to link data themselves and at researchers who want to understand the linkage process and its implications for the analysis of linked data, particularly the implications of linkage error. Day 1 focuses on the methods and practicalities of data linkage (including deterministic and probabilistic approaches) using worked examples. Day 2 focuses on the analysis of linked data, including concepts of linkage error, how to assess linkage quality, and how to account for the resulting bias and uncertainty in analysis. Examples will be drawn predominantly from health data, but the concepts apply to many other areas. The course combines lectures with practical sessions that enable participants to put theory into practice.
The course covers:
Overview of data linkage (data linkage systems, benefits of data linkage, types of projects)
Overview of linkage methods (deterministic and probabilistic, privacy-preserving)
The linkage process (data preparation, blocking, classification)
Classifying linkage designs
Evaluating linkage quality and bias (types of error, analysis of linked data)
Reporting analysis of linked data
Practical sessions
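As a rough illustration of the linkage process listed above (data preparation, blocking, classification), the following minimal Python sketch applies a deterministic rule to two tiny files. The records, field names and the match rule are hypothetical and are not taken from the course materials; real linkage strategies are usually more elaborate.

```python
from collections import defaultdict


def prepare(record):
    """Data preparation: normalise case and whitespace in the matching fields."""
    return {
        "id": record["id"],
        "surname": record["surname"].strip().lower(),
        "dob": record["dob"].strip(),
        "postcode": record["postcode"].replace(" ", "").upper(),
    }


def link_deterministic(file_a, file_b):
    """Block on date of birth, then classify a pair as a link
    only if surname and postcode agree exactly (hypothetical rule)."""
    # Blocking: index file B by date of birth so that only records
    # sharing a date of birth are ever compared.
    blocks = defaultdict(list)
    for rec_b in map(prepare, file_b):
        blocks[rec_b["dob"]].append(rec_b)

    links = []
    for rec_a in map(prepare, file_a):
        for rec_b in blocks.get(rec_a["dob"], []):
            # Classification: exact agreement on the remaining fields.
            if (rec_a["surname"] == rec_b["surname"]
                    and rec_a["postcode"] == rec_b["postcode"]):
                links.append((rec_a["id"], rec_b["id"]))
    return links


# Made-up example records.
file_a = [
    {"id": "A1", "surname": "Smith ", "dob": "1980-01-02", "postcode": "so17 1bj"},
    {"id": "A2", "surname": "Jones", "dob": "1975-06-30", "postcode": "WC1E 6BT"},
]
file_b = [
    {"id": "B1", "surname": "smith", "dob": "1980-01-02", "postcode": "SO17 1BJ"},
    {"id": "B2", "surname": "Jones", "dob": "1975-06-03", "postcode": "WC1E 6BT"},
]

links = link_deterministic(file_a, file_b)
print(links)
```

In this made-up data, A2 and B2 differ only in the day of the date of birth, so they fall into different blocks and are never compared. If they belong to the same person, this is a missed match, one of the types of linkage error listed above.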
By the end of the course participants will:
Understand the background and theory of data linkage methods
Design deterministic and probabilistic linkage strategies
Evaluate the success of data linkage
Appropriately report analysis based on linked data
The course is aimed at researchers who need to understand data linkage techniques and how to analyse linked data. It provides an introduction to data linkage theory and methods for those who might use linked data in their own work. Participants may be academic researchers in the social and health sciences, or may work in government, survey agencies, official statistics, charities or the private sector. The course does not assume any prior knowledge of data linkage. Some experience of using Excel or other software will be useful for the practical sessions.
Preparatory Reading
Recommended (not required):
Doidge JC, Christen P and Harron K (2020). Quality assessment in data linkage. In: Joined up data in government: the future of data linking methods. https://www.gov.uk/government/publications/joined-up-data-in-government-the-future-of-data-linking-methods/quality-assessment-in-data-linkage
Harron K, Doidge JC & Goldstein H (2020) Assessing data linkage quality in cohort studies, Annals of Human Biology, 47:2, 218-226, DOI: 10.1080/03014460.2020.1742379
Harron KL, Doidge JC, Knight HE, et al. A guide to evaluating linkage quality for the analysis of linked data. Int J Epidemiol. 2017;46(5):1699–1710. doi:10.1093/ije/dyx177
Doidge JC, Harron K (2019). Reflections on modern methods: Linkage error bias. International Journal of Epidemiology. 48(6):2050-60. https://doi.org/10.1093/ije/dyz203
Sayers A, Ben-Shlomo Y, Blom AW, Steele F. Probabilistic record linkage. Int J Epidemiol. 2016;45(3):954–964. doi:10.1093/ije/dyv322
Doidge JC, Harron K. Demystifying probabilistic linkage: Common myths and misconceptions. Int J Popul Data Sci. 2018;3(1):410. doi:10.23889/ijpds.v3i1.410
The course will start with registration and coffee at 9:30. Formal teaching will run from 9:45 to 16:45.
Day 1
Overview
Deterministic linkage algorithms
Linkage error
Probabilistic linkage theory and practical demonstration
Practical considerations (including variable selection, handling missing data and managing processing requirements)
Overview of advanced topics including privacy preservation, string comparators and linkage of multiple files
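As a hedged illustration of the probabilistic linkage theory listed for Day 1, the sketch below computes match weights in the commonly used Fellegi-Sunter form: each field adds log2(m/u) when it agrees and log2((1-m)/(1-u)) when it disagrees, where m is the probability of agreement among true matches and u is the probability of agreement among non-matches. The m- and u-values, field names and function name here are invented for illustration and are not taken from the course.

```python
import math

# Hypothetical (m, u) probabilities per field:
#   m = P(field agrees | records belong to the same person)
#   u = P(field agrees | records belong to different people)
M_U = {
    "surname": (0.95, 0.01),
    "dob": (0.98, 0.003),
    "postcode": (0.90, 0.001),
}


def match_weight(agreement):
    """Sum the log2 agreement or disagreement weights over the compared fields.

    `agreement` maps a field name to True (agrees) or False (disagrees).
    """
    total = 0.0
    for field, agrees in agreement.items():
        m, u = M_U[field]
        if agrees:
            total += math.log2(m / u)              # positive: evidence for a match
        else:
            total += math.log2((1 - m) / (1 - u))  # negative: evidence against
    return total


all_agree = match_weight({"surname": True, "dob": True, "postcode": True})
dob_disagrees = match_weight({"surname": True, "dob": False, "postcode": True})

print(round(all_agree, 2), round(dob_disagrees, 2))
```

Record pairs with higher total weights are more likely to be true matches. Under this approach, pairs are typically classified by comparing the total weight with one or more thresholds, and the choice of thresholds trades off missed matches against false matches.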
Day 2
Recap: Common myths and misconceptions about probabilistic linkage
Linkage error bias
Linkage quality assessment
Handling linkage error in analysis
Reporting studies of linked data
Software demonstration: Splink – open-source toolkit for probabilistic record linkage and deduplication at scale
Cost:
The fee per teaching day is:
£35 per day for students registered at any University.
£75 per day for staff at academic institutions, Research Councils researchers, public sector staff and staff at registered charity organisations and recognised research institutions.
£250 per day for all other participants.
All fees include event materials and morning and afternoon refreshments. Fees do not include travel and accommodation costs.
In the event of cancellation by the delegate a full refund of the course fee is available up to two weeks prior to the course. NO refunds are available after this date.
If it is no longer possible to run a course due to circumstances beyond its control, NCRM reserves the right to cancel the course at its sole discretion at any time prior to the event. In this event every effort will be made to reschedule the course. If this is not possible or the new date is inconvenient a full refund of the course fee will be given. NCRM shall not be liable for any costs, losses or expenses that may be incurred as a result of its cancellation of a course, including but not limited to any travel or accommodation costs. The University of Southampton’s Online Store T&Cs also continue to apply.
Website and registration:
Region:
Greater London
Keywords:
Quantitative Data Handling and Data Analysis, Data Matching, Use of Administrative Sources, Data Quality and Data Management, Quality in Quantitative Research, Longitudinal Research, Data linkage
Related publications and presentations:
Quantitative Data Handling and Data Analysis