privacy-engineering-tools

Python packages to create synthetic data

This is a list of Python packages that can help you generate synthetic data. The list was compiled mainly by searching Google and GitHub, so some packages will inevitably be missing; feel free to suggest additions.

We divided the packages into three categories: tabular datasets, purely generative datasets, and packages for specific use cases. This is a loose classification, since a package may fall into more than one category.

Tabular datasets

The packages in this category are useful if you have a tabular dataset and want to create a new synthetic dataset with similar properties (listed in alphabetical order).

| Name | Methods | Extra features | Synthetic spectrum | More info | License | Maintenance | GitHub stars |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DataSynthesizer | Bayesian Network | Web User Interface, intermediate file, differential privacy | Synthetic structural, synthetically-augmented plausible, synthetically-augmented multivariate plausible | Article, on PyPi | MIT | Active | 100-500 |
| Gretel Synthetics | LSTM | Differential Privacy | Synthetically-augmented multivariate plausible | Documentation, on PyPi | Apache-2.0 | Active | 100-500 |
| metasyn | Scipy | Intermediate file, extensible | Synthetically-augmented univariate plausible | Documentation, PyPi | MIT | Active | 0-10 |
| Synthetic Data Vault (SDV) | Copula, GAN, TVAE | Single table, relational database, time-series, de-identification | Synthetically-augmented replica | Article (pdf), documentation | Business Source License | Active | 500-1000 |
| synthcity | GAN, TVAE, LLM | Time series, static survival analysis, images | Synthetically-augmented multivariate plausible | Article, documentation, PyPi | Apache-2.0 | Active | 100-500 |
| synthia | fPCA, Gaussian copula, vine copula | Supports xarray | Synthetically-augmented multivariate/univariate plausible | Article, article, documentation | MIT | Active | 10-100 |
| ydata-synthetic | GAN | Single table and time-series | Synthetically-augmented multivariate plausible | On PyPi | MIT | Active | 500-1000 |
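
The packages in this category share a fit-then-sample workflow: learn a model from the real table, then draw new rows from it. The sketch below illustrates that workflow with the standard library only. It is not the API of any package listed here; the function names `fit_columns` and `sample_rows` and the example data are hypothetical. It also assumes that columns are independent, so it only reproduces per-column (univariate) properties. The multivariate methods in the table, such as Bayesian networks, copulas, and GANs, aim to preserve relationships between columns as well.

```python
import random
import statistics
from collections import Counter


def fit_columns(rows):
    """Fit an independent model to each column of a list of dict rows."""
    model = {}
    for name in rows[0]:
        values = [row[name] for row in rows]
        if all(isinstance(v, (int, float)) for v in values):
            # Numeric column: summarize by mean and sample standard deviation.
            model[name] = ("normal", statistics.mean(values), statistics.stdev(values))
        else:
            # Categorical column: keep the observed category counts.
            counts = Counter(values)
            model[name] = ("categorical", list(counts.keys()), list(counts.values()))
    return model


def sample_rows(model, n, seed=0):
    """Draw n synthetic rows, sampling every column independently."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n):
        row = {}
        for name, spec in model.items():
            if spec[0] == "normal":
                _, mean, std = spec
                row[name] = rng.gauss(mean, std)
            else:
                _, categories, weights = spec
                row[name] = rng.choices(categories, weights=weights, k=1)[0]
        synthetic.append(row)
    return synthetic


# Hypothetical "real" dataset, used only for illustration.
real = [
    {"age": 34, "city": "Utrecht"},
    {"age": 51, "city": "Leiden"},
    {"age": 27, "city": "Utrecht"},
    {"age": 45, "city": "Delft"},
]

model = fit_columns(real)
synthetic = sample_rows(model, n=5)
for row in synthetic:
    print(row)
```

Because only summary statistics are stored in `model`, the fitted model plays a role similar to the "intermediate file" feature mentioned in the table: it can be inspected before any synthetic rows are generated.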

Generative

Generative methods generally do not need any input dataset. Instead, the user specifies the properties of the dataset and the package generates the synthetic data from this specification (listed in alphabetical order).

| Name | Data type | Extra features | More info | License | Maintenance | GitHub stars |
| --- | --- | --- | --- | --- | --- | --- |
| BlenderProc | Images for segmentation, distance estimation | Physics simulation, camera sampling | Article, documentation | GPL-v3 | Active | 1000+ |
| faker | Many kinds of identifiers, e.g., name, address, phone number, etc. | Modular structure | Documentation, on PyPi | MIT | Active | 1000+ |
| google-semantic-location-history | Google semantic location histories | Uses GenSON, Faker, faker-schema | - | MIT | Active | 0-10 |
| mimesis | Many kinds of identifiers, e.g., name, address, phone number, etc. (databases, json, xml) | High performance, multilingual | Documentation, on PyPi | MIT | Active | 1000+ |
| physim-dataset-generator | Images of cluttered scenes to train object detection models | 3D rendering | Article | BSD-2 Clause | Inactive | 10-100 |
| plait.py | Tabular data | From user-defined yaml template | - | MIT | Inactive | 100-500 |
| pydbgen | SQL databases and DataFrames with private information | Easy to use, uses Faker | Documentation | MIT | Inactive | 100-500 |
| Synthdet | Images to train object detection models | - | Article | Apache 2.0 | Active | 100-500 |
| scikit-learn | Random matrices (regression problem) | - | - | BSD-3 Clause | Active | 1000+ |
| timeseries-generator | Time-series | GUI, built-in economics factors | - | Apache 2.0 | Active | 10-100 |
| zpy | Images for machine learning applications | Uses blender to generate images | - | GPL-v3 | Inactive | 100-500 |
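
To make the specification-driven approach concrete, here is a minimal sketch using only the standard library. The user describes each column by a rule for drawing one value, and the generator produces rows from that description without any input dataset. The names `make_spec` and `generate`, and the columns themselves, are hypothetical and do not correspond to the API of any package in the table. Packages such as faker and mimesis offer the same idea with ready-made generators for realistic names, addresses, and phone numbers.

```python
import random


def make_spec(rng):
    """Hypothetical specification: column name -> function that draws one value."""
    return {
        "customer_id": lambda: rng.randint(10_000, 99_999),
        "plan": lambda: rng.choice(["free", "basic", "premium"]),
        "monthly_usage_gb": lambda: round(rng.uniform(0.0, 50.0), 1),
    }


def generate(spec, n):
    """Generate n records by calling every column's rule once per record."""
    return [{name: draw() for name, draw in spec.items()} for _ in range(n)]


# A seeded generator makes the synthetic records reproducible.
rng = random.Random(42)
records = generate(make_spec(rng), n=3)
for record in records:
    print(record)
```

Keeping the specification separate from the generation loop means that new columns can be added by extending the dictionary, without touching `generate`.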

Specific use cases

These packages target more specific use cases and are not purely generative (listed in alphabetical order).

| Name | Data type | Extra features | More info | License | Maintenance | GitHub stars |
| --- | --- | --- | --- | --- | --- | --- |
| augraphy | Distortion of paper documents | - | - | MIT | Active | 10-100 |
| doppelGANger | Timeseries generation | Uses GANs, high customization | Article, presentation | BSD-3 Clause-Clear | Active | 100-500 |
| mtt-distilation | Images for Machine Learning | Distillation from bigger to smaller dataset | Article | MIT | Active | 100-500 |
| smogn | Improved tabular dataset | Synthetic Minority Over-Sampling, Pure Python | Article, on PyPi | GPL-v3 | Active | 100-500 |
| syndata-generation | Synthetic images (scenes, bounding box annotations) to train object detection models | - | Article | MIT | Inactive | 100-500 |
| tofu | Synthetic UK Biobank data | - | Published version | MIT | Inactive | 10-100 |