Cogstack Anonymisation Process
This is a summary of the steps we use to verify that Cogstack has successfully anonymised free text extracted from patient records as per the process we have agreed with the Information Commissioner's Office. As usual we are working toward the motivated intruder standard.
The main problem items that must be removed are names, DOB, address, postcode, MRN and NHS number. There may be others that crop up as the process matures. We allow age, gender and ethnicity to be included within structured datasets, so we are not concerned about these appearing within free text.
The model was initially trained and fine-tuned as described in this report and is retrained whenever a project identifies a new field to redact.
The model is currently trained to identify and remove the following data fields:
Data field | Guideline
Telephone Number | String that looks like a telephone number
Email | Email address
Hospital Number |
Name | Anything that looks like a name. E.g. 'Pt John Smith was seen…'
Initials | Initials
Address Line | Address of a place excluding the postcode
Postcode | Postcode
Date of Birth | A date of birth. Normally preceded by the string 'dob'
Nhs Number | NHS Number normally prefixed by "NHS Number: "
HCPC Number | HCPC Number. Normally prefixed by "HCPC No: "
Accession Number | Accession Number. Normally prefixed by "Accession No: "
Date | Any date other than Date of Birth
GMC Number | GMC Number. Normally prefixed by "GMC No: "
Hospital Name | Name of hospital that is not part of an address, e.g. University College London
Extension Number | Extension number
GP No | General Practitioner Identifier. Normally prefixed by "GP Identifier: "
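Several of the guidelines above describe identifiers by a fixed prefix. As an illustration only, a rule-based spot check for such prefixed identifiers surviving in already-redacted text might look like the sketch below. This is not the trained model: the function name, the pattern shapes, and the assumption that a redacted value no longer contains digits are all hypothetical.

```python
import re

# Illustrative patterns built from the prefixes quoted in the guidelines.
# Assumption: an unredacted value contains at least one digit, whereas a
# redacted placeholder (e.g. "[REDACTED]") does not.
_VALUE = r"\s*\S*\d\S*"
PREFIX_PATTERNS = {
    "Nhs Number": re.compile(r"NHS Number:" + _VALUE, re.IGNORECASE),
    "HCPC Number": re.compile(r"HCPC No:" + _VALUE, re.IGNORECASE),
    "Accession Number": re.compile(r"Accession No:" + _VALUE, re.IGNORECASE),
    "GMC Number": re.compile(r"GMC No:" + _VALUE, re.IGNORECASE),
    "GP No": re.compile(r"GP Identifier:" + _VALUE, re.IGNORECASE),
    "Date of Birth": re.compile(r"\bdob\b:?" + _VALUE, re.IGNORECASE),
}


def flag_residual_identifiers(text: str) -> list[tuple[str, str]]:
    """Return (field, matched text) pairs for prefixed values that still contain digits."""
    hits = []
    for field, pattern in PREFIX_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((field, match.group(0)))
    return hits


if __name__ == "__main__":
    sample = "Pt seen today. NHS Number: 0000000000. dob 01/02/1980. GMC No: [REDACTED]"
    for field, matched in flag_residual_identifiers(sample):
        print(field, "->", matched)
```

A check like this only catches identifiers that keep their usual prefix, so it could complement manual review but not replace it.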
Our internal technical solution needs to be able to achieve 95% effective anonymisation in order to be considered compliant, and for data to be safely released to researchers.
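The text does not define how the 95% figure is measured. One plausible reading, assumed here purely for illustration, is the share of manually identified identifiers that the model removed (recall). Under that assumption, and with hypothetical function names, the check could be sketched as:

```python
def anonymisation_effectiveness(total_identifiers: int, missed_identifiers: int) -> float:
    """Fraction of identifiers that were removed (assumed definition: recall)."""
    if total_identifiers <= 0:
        raise ValueError("total_identifiers must be positive")
    if not 0 <= missed_identifiers <= total_identifiers:
        raise ValueError("missed_identifiers must be between 0 and total_identifiers")
    return (total_identifiers - missed_identifiers) / total_identifiers


def meets_threshold(effectiveness: float, threshold: float = 0.95) -> bool:
    """Compare against the 95% compliance threshold stated in the text."""
    return effectiveness >= threshold


if __name__ == "__main__":
    # Hypothetical counts from a manually reviewed sample.
    score = anonymisation_effectiveness(total_identifiers=200, missed_identifiers=8)
    print(f"{score:.2%}", meets_threshold(score))
```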
In order to mitigate the residual 5% risk, we propose the following governance measures to wrap around the technical process:
Figure 1: Overview of research extract process
Manual sampling of 1% of requested records per project, with a minimum of 100 records and a maximum of 500 records to be sampled
If any identifiers are found, these should be documented, removed from the dataset and reported to RDAC/IG before being recorded in the Cogstack DPIA
A motivated intruder test could be performed every 6 months: we provide a sample to a UCLH employee with the appropriate technical skill and ask them to identify the patients from the sample without accessing Epic. Any lessons are then fed back into the process.
Principal research investigators will continue to sign the code of conduct, agreeing not to re-identify any patients, to keep the data within the secured and agreed environment, etc.
After a pilot period, we would like to ask the DTC for their opinion on measures we could take to further enhance the privacy of the de-identified free text data.
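The manual sampling rule in the measures above (1% of requested records, at least 100 and at most 500) can be expressed as a small calculation. The sketch below is an illustration under stated assumptions: the 1% is rounded up, a project with fewer records than the minimum has all of its records reviewed, and the function names are hypothetical.

```python
import random


def sample_size(n_records: int, percent: int = 1, minimum: int = 100, maximum: int = 500) -> int:
    """Number of records to review manually for a project of n_records."""
    if n_records < 0:
        raise ValueError("n_records must be non-negative")
    target = (n_records * percent + 99) // 100   # percent of records, rounded up (assumption)
    target = max(minimum, min(target, maximum))  # clamp to the 100..500 range
    return min(target, n_records)                # cannot review more records than exist (assumption)


def draw_sample(record_ids: list, seed=None) -> list:
    """Pick the records to review, uniformly at random without replacement."""
    rng = random.Random(seed)
    return rng.sample(record_ids, sample_size(len(record_ids)))


if __name__ == "__main__":
    for n in (60, 5_000, 20_000, 100_000):
        print(n, "->", sample_size(n))
```

With these assumptions, projects of up to 10,000 records are reviewed at the 100-record minimum, and projects of 50,000 records or more at the 500-record maximum.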
See more detail in the original guidance from the ICO on anonymisation of free text.
We are passionate about sharing our policies and best practices with other institutions. Please get in touch at uclh.safehr@nhs.net for more information.