The public release of synthetic data has been discussed for several years, but governance remains a central challenge. Where a synthetic data generator is trained directly on real, individual-level data, it can be difficult to quantify the residual disclosure risk in the outputs. A range of metrics has been proposed to assess whether synthetic data may retain information that could support re-identification, linkage, or inference attacks. However, these assessments are difficult to future-proof: the risk is likely to increase as new auxiliary datasets, analytical methods, and adversarial techniques become available.

The Data Matryoshka project at UCLH has taken a different approach. Rather than relying on post-hoc measurement of re-identification risk, we govern what information is made available to the synthetic data generation process in the first place. For synthetic datasets intended for public release, the generator is exposed only to properties of the real data that would themselves be acceptable for public disclosure, such as aggregate statistics, approved distributions, and selected correlations.
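To illustrate the general idea of a generator that sees only releasable aggregates, the following is a minimal sketch and not the project's actual implementation. It assumes a hypothetical set of approved summary statistics (the column names and all numeric values are invented for illustration) and, as a simplifying assumption, samples from a multivariate normal distribution defined by those statistics. No individual-level records are read at any point.

```python
import numpy as np

# Hypothetical, publicly releasable summary statistics (illustrative values only).
# These stand in for aggregates that have passed disclosure control.
PUBLIC_SUMMARY = {
    "columns": ["age_years", "length_of_stay_days"],
    "means": [62.0, 5.0],
    "stds": [15.0, 2.0],
    # Approved pairwise correlation between the two columns.
    "correlation": [[1.0, 0.3], [0.3, 1.0]],
}


def generate_from_summary(summary, n_rows, seed=0):
    """Sample synthetic rows using only the aggregate statistics in `summary`."""
    means = np.asarray(summary["means"], dtype=float)
    stds = np.asarray(summary["stds"], dtype=float)
    corr = np.asarray(summary["correlation"], dtype=float)
    # Covariance is the correlation matrix scaled by the standard deviations.
    cov = np.outer(stds, stds) * corr
    rng = np.random.default_rng(seed)
    # Gaussian sampling is an assumption of this sketch, not a claim about
    # the shape of the real data.
    return rng.multivariate_normal(means, cov, size=n_rows)


synthetic = generate_from_summary(PUBLIC_SUMMARY, n_rows=1000)
```

Because the function's only input is the summary dictionary, everything the synthetic rows can reveal is bounded by what that dictionary contains. A normal distribution can produce implausible values, such as a negative length of stay, so a realistic generator would need approved distributions that better match each variable.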

In developing this approach, we coined the term “publicly-sourced synthetic data”. Publicly-sourced synthetic data is generated from information that is already public, or that is acceptable for public release through disclosure-control processes, rather than from identifiable or individual-level patient data. This provides a transparent route for producing useful synthetic datasets while maintaining a clear separation from personal data under the UK GDPR. Developing this approach has involved cross-disciplinary collaboration between technical, clinical, and information governance experts.

The approach has been recognised by the UK Synthetic Data Community Group as an innovative model for supporting the responsible public release of synthetic data.

The Data Matryoshka Approach to Publicly-Sourced Synthetic Data

