privacy-engineering-tools

Python packages to create synthetic data

This is a list of Python packages that can help you generate synthetic data. It was compiled mainly by searching Google and GitHub, so some packages will inevitably be missing; feel free to suggest additions.

We divided the packages into three categories: tabular datasets, purely generative packages, and packages for specific use cases. This classification is loose, since a package may fall into more than one category.

Tabular datasets

The packages in this category are useful if you have a tabular dataset and want to create a new synthetic dataset with similar properties (listed in alphabetical order).

Name	Methods	Extra Features	Synthetic spectrum	More info	License	Maintenance	GitHub stars
DataSynthesizer	Bayesian Network	Web User Interface, intermediate file, differential privacy	Synthetic structural, synthetically-augmented plausible, synthetically-augmented multivariate plausible	Article, on PyPi	MIT	Active	100-500
Gretel Synthetics	LSTM	Differential Privacy	Synthetically-augmented multivariate plausible	Documentation, on PyPi	Apache-2.0	Active	100-500
metasyn	Scipy	intermediate file, extensible	Synthetically-augmented univariate plausible	Documentation, PyPi	MIT	Active	0-10
Synthetic Data Vault (SDV)	Copula, GAN, TVAE	Single table, relational database, time-series, de-identification	Synthetically-augmented replica	Article (pdf), documentation	Business Source License	Active	500-1000
synthcity	GAN, TVAE, LLM	Time series, static survival analysis, images	Synthetically-augmented multivariate plausible	Article, documentation, PyPi	Apache-2.0	Active	100-500
synthia	fPCA, Gaussian copula, vine copula	Supports xarray	Synthetically-augmented multivariate/univariate plausible	Article, article, documentation	MIT	Active	10-100
ydata-synthetic	GAN	Single table and time-series	Synthetically-augmented multivariate plausible	On PyPi	MIT	Active	500-1000
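
To make the idea of "a new synthetic dataset with similar properties" concrete, here is a minimal sketch that uses only the Python standard library. It does not use the API of any package listed above. It fits each column independently (mean and standard deviation for numeric columns, value frequencies for categorical ones) and samples new rows from those fits, so it preserves only per-column distributions and ignores relationships between columns. The helper names `fit_marginals` and `sample_rows` and the example rows are assumptions made up for this illustration.

```python
import random
import statistics
from collections import Counter


def fit_marginals(rows):
    """Summarise each column on its own: mean/stdev for numbers, value counts otherwise."""
    model = {}
    for column in rows[0]:
        values = [row[column] for row in rows]
        if all(isinstance(v, (int, float)) for v in values):
            model[column] = ("numeric", statistics.mean(values), statistics.stdev(values))
        else:
            counts = Counter(values)
            model[column] = ("categorical", list(counts), list(counts.values()))
    return model


def sample_rows(model, n, seed=0):
    """Draw n synthetic rows, sampling every column independently of the others."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        row = {}
        for column, spec in model.items():
            if spec[0] == "numeric":
                # Assumes a normal distribution, which may not suit every column.
                row[column] = rng.gauss(spec[1], spec[2])
            else:
                # Sample categories in proportion to their observed frequency.
                row[column] = rng.choices(spec[1], weights=spec[2], k=1)[0]
        rows.append(row)
    return rows


# Made-up "real" data, used only to demonstrate the two helpers.
real = [
    {"age": 34, "city": "Utrecht"},
    {"age": 51, "city": "Amsterdam"},
    {"age": 27, "city": "Utrecht"},
    {"age": 45, "city": "Utrecht"},
]

model = fit_marginals(real)
synthetic = sample_rows(model, 5)
for row in synthetic:
    print(row)
```

Because the columns are sampled independently, any correlation between `age` and `city` in the input is lost. The packages in the table use richer models (Bayesian networks, copulas, GANs) precisely to retain such multivariate structure.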

Generative

Generative methods generally do not need any input dataset. Instead, the user specifies the properties of the dataset and the package generates the synthetic data from this specification (listed in alphabetical order).

Name	Data type	Extra features	More info	License	Maintenance	GitHub stars
BlenderProc	Images for segmentation, distance estimation	Physics simulation, camera sampling	Article, documentation	GPL-v3	Active	1000+
faker	Many kinds of identifiers, e.g., name, address, phone number, etc.	Modular structure	Documentation, on PyPi	MIT	Active	1000+
google-semantic-location-history	Google semantic location histories	Uses GenSON, Faker, faker-schema	-	MIT	Active	0-10
mimesis	Many kinds of identifiers, e.g., name, address, phone number, etc. (databases, json, xml)	High performance, multilingual	Documentation, on PyPi	MIT	Active	1000+
physim-dataset-generator	Images of cluttered scenes to train object detection models	3D rendering	Article	BSD-2 Clause	Inactive	10-100
plait.py	Tabular data	From user-defined yaml template	-	MIT	Inactive	100-500
pydbgen	SQL databases and DataFrames with private information	Easy to use, uses Faker	Documentation	MIT	Inactive	100-500
Synthdet	Images to train object detection models	-	Article	Apache 2.0	Active	100-500
scikit-learn	Random matrices (regression problem)	-	-	BSD-3 Clause	Active	1000+
timeseries-generator	Time-series	GUI, built-in economics factors	-	Apache 2.0	Active	10-100
zpy	Images for machine learning applications	Uses blender to generate images	-	GPL-v3	Inactive	100-500
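
The specification-driven approach described at the top of this section can be sketched with the standard library alone. The specification format below (`SPEC`, with `type`, `min`, `max`, `values`, and `length` keys) and the helper names are hypothetical, invented for this illustration; each package in the table defines its own way of describing the desired data.

```python
import random
import string

# Hypothetical specification: field name -> description of the values to generate.
SPEC = {
    "customer_id": {"type": "int", "min": 1000, "max": 9999},
    "plan": {"type": "choice", "values": ["free", "basic", "premium"]},
    "monthly_spend": {"type": "float", "min": 0.0, "max": 250.0},
    "referral_code": {"type": "string", "length": 6},
}


def generate_value(field, rng):
    """Produce one value that satisfies a single field description."""
    kind = field["type"]
    if kind == "int":
        return rng.randint(field["min"], field["max"])
    if kind == "float":
        return round(rng.uniform(field["min"], field["max"]), 2)
    if kind == "choice":
        return rng.choice(field["values"])
    if kind == "string":
        return "".join(rng.choices(string.ascii_uppercase, k=field["length"]))
    raise ValueError(f"unknown field type: {kind}")


def generate_rows(spec, n, seed=0):
    """Generate n rows from the specification; no input dataset is needed."""
    rng = random.Random(seed)  # a fixed seed makes the output reproducible
    return [
        {name: generate_value(field, rng) for name, field in spec.items()}
        for _ in range(n)
    ]


rows = generate_rows(SPEC, 3)
for row in rows:
    print(row)
```

Since nothing is learned from real records, data generated this way carries no information about real individuals, but it is also only as realistic as the specification itself.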

Specific use-cases

These packages have more specific use-cases and are not purely generative (listed alphabetically).

Name	Data type	Extra features	More info	License	Maintenance	GitHub stars
augraphy	Distortion of paper documents	-	-	MIT	Active	10-100
doppelGANger	Time-series generation	Uses GANs, high customization	Article, presentation	BSD-3 Clause-Clear	Active	100-500
mtt-distillation	Images for machine learning	Distillation from bigger to smaller dataset	Article	MIT	Active	100-500
smogn	Improved tabular dataset	Synthetic Minority Over-Sampling, Pure Python	Article, on PyPi	GPL-v3	Active	100-500
syndata-generation	Synthetic images (scenes, bounding box annotations) to train object detection models	-	Article	MIT	Inactive	100-500
tofu	Synthetic UK Biobank data	-	Published version	MIT	Inactive	10-100
