Differential privacy

On this page: differential privacy, dp, epsilon, quantifying privacy, privacy-utility spectrum
Date of last review: 2023-05-30

Differential privacy (DP) is a mathematical technique that prevents the disclosure of personal information by adding statistical noise to the results of queries. In its original form, the data are stored on a secure server. Researchers can then, without access to the data themselves, pose a query (i.e., an analysis request) and obtain the result with a certain amount of statistical noise added to the output. The amount of noise is calibrated such that adding or removing any single individual to or from the dataset alters the result of the query by at most a limited, predetermined amount.
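To make this concrete, below is a minimal sketch (in Python, with illustrative names) of the Laplace mechanism, the classic way to add noise calibrated to a query's sensitivity and the chosen epsilon:

```python
import numpy as np

def laplace_mechanism(true_result: float, sensitivity: float, epsilon: float) -> float:
    """Return the query result with Laplace noise scaled to sensitivity / epsilon."""
    scale = sensitivity / epsilon
    return true_result + np.random.laplace(loc=0.0, scale=scale)

# Example: a counting query. Adding or removing one person changes a count
# by at most 1, so the sensitivity is 1.
noisy_count = laplace_mechanism(true_result=1234, sensitivity=1.0, epsilon=0.5)
print(noisy_count)
```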


Differential privacy is generally not applied to the dataset itself, but to the result of a query (e.g., the average income in a dataset). Some variants do alter the dataset itself before release; this is called local differential privacy.
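As an illustration of the local variant, the classic randomized-response technique perturbs each individual's answer before it is ever stored. The sketch below (illustrative names, assuming a single yes/no attribute) satisfies epsilon-local differential privacy:

```python
import math
import random

def randomized_response(true_answer: bool, epsilon: float) -> bool:
    """Answer truthfully with probability e^eps / (e^eps + 1), otherwise lie.

    The ratio between the probabilities of any response under the two possible
    true answers is at most e^eps, which is exactly epsilon-local DP.
    """
    p_truth = math.exp(epsilon) / (math.exp(epsilon) + 1)
    return true_answer if random.random() < p_truth else not true_answer

# Each respondent perturbs their own answer before it reaches the server.
print(randomized_response(True, epsilon=1.0))
```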

How it works

Imagine a dataset with people’s incomes, and we want to know the average income in the data (i.e., our query). When applying differential privacy, we want to ensure that the result does not reveal whether any particular individual is in the dataset or not. In theory, we can check this by adding or removing any single individual to or from our dataset and recalculating the average income in the modified dataset. Importantly, the additions should cover all individuals that could be in the dataset, not only those that already are. Depending on the privacy requirements (i.e., what amount of privacy leakage is acceptable) and the sensitivity of the query (the largest effect that changing one individual can have on the average income), more or less statistical noise is added to the result.
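For the income example, a common recipe (sketched below with illustrative names; the income bounds are an assumption you would need to justify for your own data) is to clip values to a plausible range, derive the query's sensitivity from that range, and calibrate the Laplace noise to it:

```python
import numpy as np

def dp_mean_income(incomes: np.ndarray, lower: float, upper: float, epsilon: float) -> float:
    """Differentially private mean, assuming incomes lie within [lower, upper].

    Clipping bounds the query's sensitivity: substituting one person changes
    the mean by at most (upper - lower) / n, and the Laplace noise is
    calibrated to that.
    """
    clipped = np.clip(incomes, lower, upper)
    sensitivity = (upper - lower) / len(clipped)
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return clipped.mean() + noise

# Hypothetical incomes and an assumed plausible range of 0-200,000:
incomes = np.array([32_000, 45_000, 51_000, 120_000, 28_000])
print(dp_mean_income(incomes, lower=0, upper=200_000, epsilon=1.0))
```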

The most common definition of differential privacy is “epsilon-differential privacy”, epsilon (ε) being the “privacy budget”. In this definition, we consider two scenarios: the original dataset and a modified dataset, where we substitute, say, the person with the highest income with the person with the lowest income. A certain amount of noise is added, so that the probabilities of obtaining any given result from the original and the modified dataset differ by at most a factor of e^ε. Epsilon (i.e., the privacy budget) is a predetermined value: the lower epsilon, the higher the level of privacy, and therefore the more noise is added to the query result.
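Written out, the standard definition reads: a randomized mechanism M is ε-differentially private if, for every pair of datasets D and D′ differing in one individual and every set S of possible outputs,

```latex
\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[M(D') \in S]
```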

Differential privacy can also protect groups of people, by considering the addition or removal of whole groups instead of single individuals. This can be useful, for example, when you want to protect a household as a whole, and not only the individuals in it.
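In terms of the definition above, this follows from a standard group-privacy argument: if D and D′ differ in a group of k individuals (e.g., a household of size k), an ε-differentially private mechanism guarantees

```latex
\Pr[M(D) \in S] \;\le\; e^{k\varepsilon} \cdot \Pr[M(D') \in S]
```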

Setting a lower epsilon results in more noise in the results that you release, making it more difficult to infer whether individuals were or were not in your dataset. However, it does not follow that a certain level of epsilon will result in fully anonymous data: whether a dataset can be viewed as anonymous under the GDPR will always be context- and dataset-dependent.

Implications for research

  • The advantage of differential privacy is that the privacy risk can be mathematically quantified (through epsilon). With many other techniques, it is hard to determine exactly what the privacy guarantees are, especially in the presence of external (background) information.
  • As with other statistical approaches, there is a trade-off between privacy and utility: the lower the privacy budget, the more privacy is retained, but also the noisier (less accurate) the released results will be.
  • The privacy budget should be determined in advance and can only be spent once: every query lowers the available privacy budget by its epsilon, and when the budget reaches zero, no further queries can be answered (see the bookkeeping sketch after this list). Thus, you cannot indefinitely reuse the data and still preserve privacy.
  • A disadvantage of differential privacy is that the concept is not trivial for non-experts, and in some cases misunderstandings have led to violated privacy guarantees. There is no single right way to set the privacy budget, and no established framework for deciding which value of epsilon suits which kind of data. It can therefore be hard for researchers to justify using a particular value of epsilon.
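The budget bookkeeping in the third point can be made concrete. The sketch below is illustrative only and assumes simple sequential composition (the epsilons of successive queries add up); production-grade accountants exist in libraries such as OpenDP and Google's differential-privacy library:

```python
class PrivacyBudget:
    """Minimal bookkeeping for a per-dataset privacy budget.

    Assumes basic sequential composition: the epsilons of successive
    queries simply add up. Real accountants (e.g., in OpenDP or Google's
    differential-privacy library) are more sophisticated.
    """

    def __init__(self, total_epsilon: float):
        self.remaining = total_epsilon

    def spend(self, epsilon: float) -> None:
        if epsilon > self.remaining:
            raise RuntimeError("Privacy budget exhausted: query refused.")
        self.remaining -= epsilon

budget = PrivacyBudget(total_epsilon=1.0)
budget.spend(0.4)  # first query
budget.spend(0.4)  # second query
budget.spend(0.4)  # third query: raises RuntimeError, budget exhausted
```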

When to use

Implementing differential privacy is a very technical (and tricky) endeavour. Setting up a differentially private server that outputs only differentially private results for any query is currently practically infeasible. Thus, if you are not an expert on statistics and/or differential privacy, we recommend reaching out to someone who is, and using differential privacy in addition to other privacy-enhancing techniques, such as pseudonymisation.

For now, differential privacy is mostly used in combination with synthetic data, or with simple, repetitive queries. Widespread use of differential privacy may become safer and/or easier over time as implementations are tested more thoroughly on real-world datasets.
