privacy-engineering-tools

R Packages To Generate Synthetic Data & Simulated Data

Below is a selection of R packages for creating (mostly tabular) synthetic datasets. We divided the packages into three categories: purely generative packages, packages that are not purely generative (i.e., they are based on input data), and packages for specific use-cases. This is a loose classification, since a package might fall into multiple categories.

For more inspiration on how to create fake data in R, see this RStudio blog post.

Generative

The following packages can generate synthetic data without any input data (listed alphabetically):

Name	Description	More info	Maintenance	GitHub stars
charlatan	Generation of fake data, e.g., names, dates, addresses, etc.	Documentation, GitHub	Active	100-500
conjurer	A Parametric Method for Generating Synthetic Tabular Data	Documentation, GitHub	Active	0-10
fabricatr	Simulate hierarchical data structures and correlated data (tabular), either from random number generators or by resampling from existing data sources	Documentation, GitHub	Active	10-100
faux	Create datasets with factorial structure through simulation by specifying variable parameters	Documentation, Published version, GitHub	Active	10-100
simPop	Simulation of populations for surveys based on auxiliary data (model-based, calibration, combinatorial optimization)	Article, documentation, GitHub	Active	10-100
wakefield	Generates random dataframes, lists and vectors from a selection of variable types (e.g., age, sex, date, religion, zip code, etc.)	Documentation, GitHub	Inactive	100-500
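
To make the "purely generative" idea concrete, here is a minimal sketch in base R that builds a fake table from random number generators alone, without reading any input dataset. It does not use any of the packages in the table above; the column names, value ranges, and coefficients are arbitrary assumptions chosen for illustration. The listed packages offer richer generators (names, addresses, hierarchical structures, and so on) on top of this basic idea.

```r
# Purely generative sketch: no input dataset is read anywhere.
# All column names, ranges, and coefficients below are illustrative assumptions.
set.seed(42)  # fixed seed so the fake data are reproducible

n <- 200

fake <- data.frame(
  id     = seq_len(n),
  age    = sample(18:90, n, replace = TRUE),
  sex    = factor(sample(c("female", "male"), n, replace = TRUE)),
  signup = as.Date("2020-01-01") + sample(0:364, n, replace = TRUE)
)

# Add a variable that depends on another one: income rises with age, plus noise.
fake$income <- round(20000 + 400 * fake$age + rnorm(n, sd = 5000))

str(fake)
```

Because every value comes from a random draw with parameters you choose, the result contains no information about real individuals, but it is also only as realistic as the parameters you specify.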

Not purely generative

The following packages need an input dataset from which to generate the synthetic dataset (listed alphabetically):

Name	Description	More info	Maintenance	GitHub stars
mice	Extend imputation procedures for missing data to synthetic data, such that (part of) the observed data can be easily overimputed with fake, privacy-preserving, synthetic records	Documentation, article, GitHub	Active	100-500
synthesis	Generate synthetic time series from commonly used statistical models (e.g., linear, nonlinear, chaotic systems)	Documentation, article, GitHub	Active	0-10
synthpop	Produces synthetic versions of tabular microdata containing confidential information with minimal distortion of statistical information (using sequential modelling)	Documentation, article, GitHub	Active	10-100
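
The sequential modelling idea mentioned for synthpop can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not the implementation used by synthpop or any other package in the table: it uses the built-in `mtcars` data as a stand-in for confidential input data, plain linear models for each step, and a hypothetical helper `draw_from_lm()` introduced only for this example.

```r
# Sequential synthesis sketch: each variable is synthesized conditional on
# the variables synthesized before it. Linear models are an assumption made
# for brevity; real packages may use other models.
set.seed(1)

observed <- mtcars[, c("wt", "hp", "mpg")]  # stand-in for the input data
n <- nrow(observed)

# Step 1: synthesize the first variable by resampling its observed values.
syn <- data.frame(wt = sample(observed$wt, n, replace = TRUE))

# Step 2: fit a model on the observed data, then draw synthetic values from
# it using the already-synthesized predictors plus residual-scale noise.
# draw_from_lm() is a hypothetical helper, not part of any package.
draw_from_lm <- function(formula, observed, syn) {
  fit <- lm(formula, data = observed)
  predict(fit, newdata = syn) + rnorm(nrow(syn), sd = sigma(fit))
}

syn$hp  <- draw_from_lm(hp ~ wt, observed, syn)
syn$mpg <- draw_from_lm(mpg ~ wt + hp, observed, syn)

# Compare the correlation structure of the observed and synthetic data.
round(cor(observed), 2)
round(cor(syn), 2)
```

Unlike the purely generative approach, the synthetic table here inherits its structure from the input data, which is why the utility of the result and the disclosure risk both depend on the models used.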

Specific use-cases

The following packages were developed for specific use-cases and are not purely generative (listed alphabetically):

Name	Description	More info	Maintenance	GitHub stars
NestedCategBayesImpute	Modeling, Imputing and Generating Synthetic Versions of Nested Categorical Data in the Presence of Impossible Combinations	Documentation	-	-
synthACS	Access American Community Survey (ACS) data, build synthetic ACS microdata at any specified geographical level, add additional attributes to the synthetic dataset, and conduct spatial microsimulation modelling (SMSM)	Documentation, article, GitHub	Active	0-10
