This is a list of Python packages for generating synthetic data. It was compiled mainly by searching Google and GitHub, so some packages are inevitably missing; feel free to suggest additions.
We divided the packages into three categories: packages for tabular datasets, purely generative packages, and packages for specific use cases. The classification is loose, since a package may fall into more than one category.
The packages in this category are useful if you have a tabular dataset and want to create a new synthetic dataset with similar properties (listed in alphabetical order).
Name | Methods | Extra Features | Synthetic spectrum | More info | License | Maintenance | GitHub stars |
---|---|---|---|---|---|---|---|
DataSynthesizer | Bayesian Network | Web User Interface, intermediate file, differential privacy | Synthetic structural, synthetically-augmented plausible, synthetically-augmented multivariate plausible | Article, on PyPi | MIT | Active | 100-500 |
Gretel Synthetics | LSTM | Differential Privacy | Synthetically-augmented multivariate plausible | Documentation, on PyPi | Apache-2.0 | Active | 100-500 |
metasyn | Scipy | intermediate file, extensible | Synthetically-augmented univariate plausible | Documentation, PyPi | MIT | Active | 0-10 |
Synthetic Data Vault (SDV) | Copula, GAN, TVAE | Single table, relational database, time-series, de-identification | Synthetically-augmented replica | Article (pdf), documentation | Business Source License | Active | 500-1000 |
synthcity | GAN, TVAE, LLM | Time series, static survival analysis, images | Synthetically-augmented multivariate plausible | Article, documentation, PyPi | Apache-2.0 | Active | 100-500 |
synthia | fPCA, Gaussian copula, vine copula | Supports xarray | Synthetically-augmented multivariate/univariate plausible | Article, article, documentation | MIT | Active | 10-100 |
ydata-synthetic | GAN | Single table and time-series | Synthetically-augmented multivariate plausible | On PyPi | MIT | Active | 500-1000 |
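
To make the idea of "a new synthetic dataset with similar properties" concrete, here is a minimal sketch using only the Python standard library. It is not the API of any package listed above: the function names `fit_columns` and `sample_rows` and the toy `real` table are hypothetical. It models each column independently (a normal distribution for numeric columns, resampling of observed values for categorical ones), so relationships between columns are not preserved. The listed packages use richer methods such as Bayesian networks, copulas, or GANs.

```python
import random
import statistics


def fit_columns(rows):
    """Summarize each column independently of the others."""
    model = {}
    for name in rows[0]:
        values = [row[name] for row in rows]
        if all(isinstance(v, (int, float)) for v in values):
            # Numeric column: keep only its mean and sample standard deviation.
            model[name] = ("numeric", statistics.mean(values), statistics.stdev(values))
        else:
            # Categorical column: keep the observed values so that sampling
            # reproduces their relative frequencies.
            model[name] = ("categorical", values)
    return model


def sample_rows(model, n, seed=None):
    """Draw n synthetic rows, sampling every column on its own."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        row = {}
        for name, spec in model.items():
            if spec[0] == "numeric":
                row[name] = rng.gauss(spec[1], spec[2])
            else:
                row[name] = rng.choice(spec[1])
        rows.append(row)
    return rows


# Hypothetical toy input table (assumption, for illustration only).
real = [
    {"age": 34, "city": "Utrecht"},
    {"age": 51, "city": "Leiden"},
    {"age": 27, "city": "Utrecht"},
    {"age": 45, "city": "Delft"},
]

synthetic = sample_rows(fit_columns(real), n=5, seed=0)
```

Because the numeric columns are drawn from a fitted normal distribution, synthetic values can fall outside the range seen in the input (for example, a non-integer age), which is one reason real packages offer more careful type and range handling.
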
Generative methods generally do not need any input dataset. Instead, the user specifies the properties of the dataset and the package generates the synthetic data from this specification (listed in alphabetical order).
Name | Data type | Extra features | More info | License | Maintenance | GitHub stars |
---|---|---|---|---|---|---|
BlenderProc | Images for segmentation, distance estimation | Physics simulation, camera sampling | Article, documentation | GPL-v3 | Active | 1000+ |
faker | Many kinds of identifiers, e.g., name, address, phone number, etc. | Modular structure | Documentation, on PyPi | MIT | Active | 1000+ |
google-semantic-location-history | Google semantic location histories | Uses GenSON, Faker, faker-schema | - | MIT | Active | 0-10 |
mimesis | Many kinds of identifiers, e.g., name, address, phone number, etc. (databases, json, xml) | High performance, multilingual | Documentation, on PyPi | MIT | Active | 1000+ |
physim-dataset-generator | Images of cluttered scenes to train object detection models | 3D rendering | Article | BSD-2 Clause | Inactive | 10-100 |
plait.py | Tabular data | From user-defined yaml template | - | MIT | Inactive | 100-500 |
pydbgen | SQL databases and DataFrames with private information | Easy to use, uses Faker | Documentation | MIT | Inactive | 100-500 |
Synthdet | Images to train object detection models | - | Article | Apache 2.0 | Active | 100-500 |
scikit-learn | Random matrices (regression problem) | - | - | BSD-3 Clause | Active | 1000+ |
timeseries-generator | Time-series | GUI, built-in economics factors | - | Apache 2.0 | Active | 10-100 |
zpy | Images for machine learning applications | Uses blender to generate images | - | GPL-v3 | Inactive | 100-500 |
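
The specification-driven approach described for this category can be sketched with the standard library alone. The example below is an illustration under stated assumptions, not the interface of any package in the table: the `SPEC` format, its field types (`int`, `float`, `choice`), and the `generate` function are all hypothetical. The point is that no input dataset is needed, only a description of each column.

```python
import random

# Hypothetical specification format (assumption): one entry per column.
SPEC = {
    "user_id": {"type": "int", "low": 1, "high": 9999},
    "score": {"type": "float", "mean": 50.0, "stdev": 10.0},
    "plan": {"type": "choice", "options": ["free", "pro", "team"]},
}


def generate(spec, n, seed=None):
    """Generate n rows from a column specification, without any input data."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        row = {}
        for name, field in spec.items():
            kind = field["type"]
            if kind == "int":
                # Uniform integer in the inclusive range [low, high].
                row[name] = rng.randint(field["low"], field["high"])
            elif kind == "float":
                # Normally distributed value with the given mean and stdev.
                row[name] = rng.gauss(field["mean"], field["stdev"])
            elif kind == "choice":
                # One of the listed options, chosen uniformly.
                row[name] = rng.choice(field["options"])
            else:
                raise ValueError(f"unknown field type: {kind!r}")
        rows.append(row)
    return rows


rows = generate(SPEC, n=3, seed=42)
```

Passing a fixed `seed` makes the output reproducible, which is convenient when the generated data is used in tests.
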
These packages have more specific use-cases and are not purely generative (listed alphabetically).
Name | Data type | Extra features | More info | License | Maintenance | GitHub stars |
---|---|---|---|---|---|---|
augraphy | Distortion of paper documents | - | - | MIT | Active | 10-100 |
doppelGANger | Timeseries generation | Uses GANs, high customization | Article, presentation | BSD-3 Clause-Clear | Active | 100-500 |
mtt-distillation | Images for Machine Learning | Distillation from bigger to smaller dataset | Article | MIT | Active | 100-500 |
smogn | Improved tabular dataset | Synthetic Minority Over-Sampling, Pure Python | Article, on PyPi | GPL-v3 | Active | 100-500 |
syndata-generation | Synthetic images (scenes, bounding box annotations) to train object detection models | - | Article | MIT | Inactive | 100-500 |
tofu | Synthetic UK Biobank data | - | Published version | MIT | Inactive | 10-100 |