To share or not to share: code sharing in social science
In this blog we highlight the benefits and challenges of creating and sharing reusable code. It is based on a session held at the 2021 Research Methods e-Festival, run by the National Centre for Research Methods (NCRM) and methods@manchester, and draws on ideas arising from the ONS-UKDS #LoveYourCode2020 workshop.
Setting the scene
The drive towards transparency and accountability in research has seen community- and government-led policies developed for opening access to research resources. An emerging ‘reproducibility crisis’ in some disciplines has prompted calls for more data and code to be released so that published results can be substantiated. Indeed, code sharing is increasingly viewed as best practice in empirical scientific research.
Research reproducibility means that, for a published research article, the authors have provided all the data, code and processing instructions needed to rerun exactly the same analysis and obtain identical results.
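As a small illustration of this idea (a hypothetical sketch of ours, not an example from the session), an analysis script that fixes its random seed and records the software version lets anyone rerun it and obtain identical results:

```python
import random
import sys


def run_analysis(seed: int = 2021) -> float:
    """Toy analysis: mean of a simulated sample.

    Fixing the seed means anyone rerunning this script gets
    identical output -- the core of reproducibility.
    """
    random.seed(seed)
    sample = [random.gauss(0, 1) for _ in range(1000)]
    return sum(sample) / len(sample)


if __name__ == "__main__":
    # Record the environment alongside the result, so others
    # can match software versions when reproducing the run.
    print("Python:", sys.version.split()[0])
    print("Mean:", run_analysis())
```

The same principle applies whatever the language: state the inputs, pin down anything non-deterministic, and document the environment.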
Despite some disciplines leading the way in making underlying code available, sharing code in quantitative social science is still not widely done. In our NCRM discussion we highlighted the benefits of writing and sharing code, examined the barriers and how they can be overcome, and shared tips for how newcomers can get started.
Contributing code is valuable for the future progress of science. It demonstrates a willingness to follow open science principles, promotes a positive and collaborative approach and ensures that researchers are able to return to their own work later and remember what they have done.
Gold standard code can be submitted to an external service for validation and receipt of a certificate of reproducibility, for example, the cascad or CODECHECK services. And code can be published in a data repository such as ReShare or Zenodo, aiding citation and visibility (for example, with a DOI). Intellectual property for code can stay with the researcher, but working collaboratively with data owners might mean joint ownership.
As users, researchers can build their own analyses on ‘research-ready’ code. Access to the syntax data owners used to create derived variables can also help researchers derive new variables for analysis, and avoid recreating basic recoding routines or the derivation of complex histories.
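For instance (a hypothetical sketch; the variable and cut-points are our own illustration, not from any particular study), a shared recoding routine for a common derived variable saves every reuser from rewriting it and from drifting towards inconsistent definitions:

```python
def derive_age_band(age: int) -> str:
    """Recode exact age into the banded variable used in the analysis.

    Sharing routines like this alongside a dataset lets other
    researchers reuse the same derivation rather than recreating
    it (and risking different cut-points).
    """
    if age < 0:
        raise ValueError("age must be non-negative")
    if age < 16:
        return "0-15"
    if age < 65:
        return "16-64"
    return "65+"


ages = [12, 34, 70]
print([derive_age_band(a) for a in ages])  # ['0-15', '16-64', '65+']
```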
Challenges and solutions for sharing code
Some researchers worry about whether their code is good enough to share. Others share routinely, submitting analytic code with their publications and even publishing it on GitHub.
There are challenges in reproducing work undertaken in a safe haven or trusted research environment (TRE), where data access is restricted and data cannot be taken out. Code can be shared outside the secure environment, but it must first be reviewed for disclosure risk, such as results quoted in the comments. Code tracking and versioning tools such as R Markdown, Jupyter Notebooks and GitLab can also be used inside a TRE to manage and document code.
Tips for getting started
- Be consistent in the way you lay out your code, using clear commenting and carefully describing new derived variables.
- Modern programming languages allow very flexible layouts, with subroutines to help modularise your code.
- Use a standard where possible, and follow guidance on how to make your data and code readable and understandable by others.
- When working with collaborators on a project, especially when cleaning and preparing your analysis datasets, agree a high-level common approach.
- Publish code on GitHub under an open licence.
- Browse the Turing Way online handbook for guidance on reproducibility in data science.
- Start somewhere: each small step is better than not taking one!
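Putting the layout tips above into practice might look like this (a hypothetical sketch; the function names, variables and threshold are our own illustration):

```python
def load_data() -> list[dict]:
    """Load raw records (here, an inline toy dataset)."""
    return [{"income": 18000}, {"income": 32000}, {"income": 55000}]


def derive_high_earner(records: list[dict], threshold: int = 30000) -> list[dict]:
    """Derived variable: flag records with income above a threshold.

    The new variable is documented where it is created, so reusers
    can see exactly how the flag is defined.
    """
    for record in records:
        record["high_earner"] = record["income"] > threshold
    return records


def summarise(records: list[dict]) -> float:
    """Share of records flagged as high earners."""
    return sum(record["high_earner"] for record in records) / len(records)


if __name__ == "__main__":
    # Each step is a small, named subroutine: consistent layout and
    # modular structure make the pipeline easy to follow and reuse.
    data = derive_high_earner(load_data())
    print(f"High-earner share: {summarise(data):.2f}")
```

The specifics matter less than the habits: consistent layout, clear comments, and derived variables described at the point they are created.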
There are lots of great resources out there for learning how to write and publish good code. At the ONS Secure Research Service, in partnership with ADR UK, we are starting to work with accredited researchers who want to improve their code-writing skills and find opportunities to collaborate.
About the authors
Louise Corti is Head of Insights and Impact for the Integrated Data Programme and Service at the Office for National Statistics. Her team focus on tracking, measuring and showcasing impact from research undertaken in the ONS secure environment, the Secure Research Service (SRS). She is leading a pilot on promoting and facilitating reproducible code with the ADR UK.
Felix Ritchie is Professor of Applied Economics at the University of the West of England. Felix is a microeconomist who applies statistical analysis to record-level data. Earlier in his career he was a programmer for a company writing office management software.
Martin O’Reilly is Director of Research Engineering at the Alan Turing Institute, based at the British Library. He runs a team of software engineers and data scientists who support researchers in making their work more reproducible and reusable. He is interested in providing guidance on working safely with sensitive data.
Read a transcript of the session from the 2021 Research Methods e-Festival. This document includes links to resources highlighted in the session.