Synthetic Data

On this page: synthetic data, fake data, artificial data, dummy data, fictitious data, reproducibility, reproduce, open science workflow
Date of last review: 2023-05-15

Synthetic data generation is the process of creating artificial data that can be used in place of real, possibly personal, data. Instead of adjusting an existing dataset to make it less identifiable, a completely new dataset is generated, populated with fictitious individuals. When creating synthetic data, sensitive values in the data (which can range from a few values to the entire dataset) are replaced by values generated from a statistical model. The intuition is to mimic drawing samples from a population, not with actual people, but with fictitious individuals that “look like” the people in the population. Synthetic data can be created in multiple ways, for example based on rules or using a trained machine learning model, and for different purposes, such as privacy protection, data enrichment, or software testing.
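
To make the idea of sampling from a fitted model concrete, the sketch below fits a very simple statistical model (a multivariate normal) to a purely numeric dataset and draws fictitious rows from it. This is a minimal illustration only: the column names and numbers are made up, and dedicated synthetic data tools use far richer models than this.

```python
# Minimal sketch: draw "fictitious individuals" from a statistical model
# fitted to the real data. Assumes a purely numeric pandas DataFrame and a
# multivariate normal model; real-world tools use much richer models.
import numpy as np
import pandas as pd

def synthesize_numeric(real: pd.DataFrame, n: int, seed: int = 42) -> pd.DataFrame:
    """Fit a multivariate normal to the real data and sample n synthetic rows."""
    rng = np.random.default_rng(seed)
    mean = real.mean().to_numpy()
    cov = real.cov().to_numpy()          # captures pairwise relationships
    samples = rng.multivariate_normal(mean, cov, size=n)
    return pd.DataFrame(samples, columns=real.columns)

# Illustrative, made-up "real" data (column names are hypothetical)
real = pd.DataFrame({
    "age": np.random.default_rng(1).normal(45, 12, 200),
    "income": np.random.default_rng(2).normal(35000, 8000, 200),
})
synthetic = synthesize_numeric(real, n=200)
print(real.describe().round(1))
print(synthetic.describe().round(1))
```

The synthetic rows reproduce the means and (co)variances of the original columns reasonably well, but no row corresponds to a real individual.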

When to use

Synthetic data can be generated for a variety of reasons, for example:

  • As an intermediate step in sharing (personal) data, before others gain access to (part of) the real dataset. This can be useful, for example, when data recipients still have to determine which variables or how many observations they need from the real dataset, or when the data request procedure takes a long time and recipients already want to explore the (synthetic) data.
  • To develop code without requiring access to real (personal) data. In this case, the synthetic data usually does not need to mirror the real data statistically, only its structure (e.g., the same column names and data types). Synthetic data that resembles the real dataset only in structure is also sometimes called “dummy data” (a minimal sketch follows after this list).
  • To adhere to an open science workflow, allowing others to evaluate and reproduce analyses. If you share a synthetic dataset together with your code, others can run the code on data that behaves like the real dataset, inspect the results, and test the reproducibility of your analysis.
  • In teaching, to prevent having to share (personal) data with students.
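
As a concrete illustration of the “dummy data” use case above, the sketch below generates rows that share only the column names and data types of a (hypothetical) template dataset. The values carry no statistical information about the real data, so this kind of dummy data is suitable for code development but not for analysis.

```python
# Minimal sketch: "dummy data" that mirrors only the structure of the real
# dataset (column names and data types), not its statistical properties.
# The template columns below are illustrative, not from any real dataset.
import numpy as np
import pandas as pd

def make_dummy(template: pd.DataFrame, n: int, seed: int = 0) -> pd.DataFrame:
    """Generate n rows with the same columns and dtypes as the template."""
    rng = np.random.default_rng(seed)
    columns = {}
    for col in template.columns:
        if pd.api.types.is_numeric_dtype(template[col]):
            columns[col] = rng.normal(size=n)              # arbitrary numbers
        else:
            columns[col] = [f"text_{i}" for i in range(n)]  # arbitrary strings
    return pd.DataFrame(columns).astype(template.dtypes.to_dict())

template = pd.DataFrame({"participant_id": [1], "diagnosis": ["A"], "age": [34]})
dummy = make_dummy(template, n=10)   # same columns and dtypes, meaningless values
```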

Implications for research

  • Although synthetic data are artificial, privacy risks may still remain. You can think of it like this: if the generating model is too good, you could reproduce the original data exactly, which protects no one’s privacy. At the other extreme, you could create a synthetic dataset that sets every value to “0”, which protects privacy very well but is useless in terms of quality. Usually, the result is somewhere in between: the synthetic data is less informative than the real data, while still leaking some information about the original sample (e.g., descriptive statistics, relationships between variables, plausible values in the data). Hence, synthetic data lies on a privacy-utility spectrum; the sketch after this list shows a simple way to probe the utility side of that spectrum. Whether a synthetic dataset still contains personal data needs to be considered on a case-by-case basis and will differ depending on the method used to create it. For a more detailed discussion of the privacy-utility spectrum, see the UK Office for National Statistics.
  • Synthetic data is sometimes generated in conjunction with Differential Privacy, which makes it possible to set the privacy/utility level numerically. Unfortunately, it is not straightforward to determine the disclosure risk directly from the synthetic data itself, so it is better to bound the privacy leakage during the process of generating the data.
  • The quality of the synthetic dataset is highly dependent on the input from which it is created. In general, a larger number of rows (individuals) results in better-quality synthetic data, because the statistical properties from which the synthetic data are generated can be estimated more reliably. Moreover, datasets that contain many outliers typically result in lower-quality synthetic data, because outliers can have a large influence on the statistical properties of the dataset (e.g., the mean) from which the synthetic dataset is created. This in turn can lead to a distorted or unrealistic distribution in the synthetic dataset.
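
As a rough illustration of the utility side of the privacy-utility spectrum, the sketch below compares descriptive statistics and correlations between a real and a synthetic dataset. The function names and the use of pandas are illustrative assumptions, not a standard evaluation method; note that close agreement indicates higher utility but also means more information about the original sample is preserved, and such a comparison says nothing about disclosure risk, which (as noted above) cannot easily be read off the synthetic data itself.

```python
# Minimal sketch of a simple utility check: how closely does the synthetic
# data reproduce descriptive statistics and correlations of the real data?
# This is an illustration only, not a disclosure-risk assessment.
import pandas as pd

def compare_utility(real: pd.DataFrame, synthetic: pd.DataFrame) -> pd.DataFrame:
    """Tabulate mean and standard deviation of shared numeric columns."""
    cols = real.select_dtypes("number").columns.intersection(synthetic.columns)
    return pd.DataFrame({
        "real_mean": real[cols].mean(),
        "synthetic_mean": synthetic[cols].mean(),
        "real_std": real[cols].std(),
        "synthetic_std": synthetic[cols].std(),
    })

def max_correlation_gap(real: pd.DataFrame, synthetic: pd.DataFrame) -> float:
    """Largest absolute difference between real and synthetic correlations."""
    cols = real.select_dtypes("number").columns.intersection(synthetic.columns)
    return float((real[cols].corr() - synthetic[cols].corr()).abs().max().max())
```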

Tools and resources