privacy-engineering-tools

R Packages To Generate Synthetic Data & Simulated Data

Below is a selection of R packages for creating (mostly tabular) synthetic datasets. We divide the packages into three categories: purely generative packages, packages that are not purely generative (i.e., that rely on input data), and packages for specific use cases. The classification is loose, since a package may fall into more than one category.

For more inspiration on how to create fake data in R, see this RStudio blog post.

Generative

The following packages can generate synthetic data without any input data (listed alphabetically):

| Name | Description | More info | Maintenance | GitHub stars |
| --- | --- | --- | --- | --- |
| charlatan | Generation of fake data, e.g., names, dates, addresses, etc. | Documentation, GitHub | Active | 100-500 |
| conjurer | A Parametric Method for Generating Synthetic Tabular Data | Documentation, GitHub | Active | 0-10 |
| fabricatr | Simulate hierarchical data structures and correlated data (tabular), either from random number generators or by resampling from existing data sources | Documentation, GitHub | Active | 10-100 |
| faux | Create datasets with factorial structure through simulation by specifying variable parameters | Documentation, Published version, GitHub | Active | 10-100 |
| simPop | Simulation of populations for surveys based on auxiliary data (model-based, calibration, combinatorial optimization) | Article, documentation, GitHub | Active | 10-100 |
| wakefield | Generates random dataframes, lists and vectors from a selection of variable types (e.g., age, sex, date, religion, zip code, etc.) | Documentation, GitHub | Inactive | 100-500 |
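
To make the idea of purely generative data concrete, here is a minimal sketch in base R. It deliberately uses none of the packages listed above, so it says nothing about their APIs; the column names, value ranges, and distribution parameters are all assumptions chosen for illustration. Every column is drawn from a random number generator, and no input dataset is involved.

```r
# Purely generative fake data: every column comes from a random generator.
# All names and parameters below are illustrative assumptions.
set.seed(123)  # fix the seed so the fake dataset is reproducible
n <- 200       # number of fake records

fake <- data.frame(
  id          = seq_len(n),
  age         = sample(18:90, n, replace = TRUE),               # uniform integer ages
  sex         = sample(c("female", "male"), n, replace = TRUE), # two equally likely levels
  income      = round(rlnorm(n, meanlog = 10, sdlog = 0.5)),    # right-skewed incomes
  signup_date = as.Date("2020-01-01") + sample(0:364, n, replace = TRUE)
)

head(fake)
```

The packages in the table go further than this sketch, for example by offering ready-made generators for names and addresses or by helping to specify correlated and hierarchical structures.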

Not purely generative

The following packages require an input dataset from which to generate the synthetic dataset (listed alphabetically):

| Name | Description | More info | Maintenance | GitHub stars |
| --- | --- | --- | --- | --- |
| mice | Extend imputation procedures for missing data to synthetic data, such that (part of) the observed data can be easily overimputed with fake, privacy-preserving, synthetic records | Documentation, article, GitHub | Active | 100-500 |
| synthesis | Generate synthetic time series from commonly used statistical models (e.g., linear, nonlinear, chaotic systems) | Documentation, article, GitHub | Active | 0-10 |
| synthpop | Produces synthetic versions of tabular microdata containing confidential information with minimal distortion of statistical information (using sequential modelling) | Documentation, article, GitHub | Active | 10-100 |
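
The sequential modelling idea mentioned for input-based synthesis can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not the implementation of any package in the table: it uses the built-in `mtcars` data as a stand-in for a confidential input dataset, picks three columns and a fixed variable order arbitrarily, and uses plain linear models with normal noise as the conditional models.

```r
# Sequential synthesis sketch: each variable is generated conditional on the
# variables synthesised before it, using models fitted to the observed data.
# Choice of dataset, columns, order, and linear models are all assumptions.
set.seed(42)
observed <- mtcars[, c("wt", "hp", "mpg")]
n <- nrow(observed)

# Step 1: the first variable is resampled from its observed values.
synthetic <- data.frame(wt = sample(observed$wt, n, replace = TRUE))

# Step 2: model hp given wt on the observed data, then draw synthetic hp
# from the fitted model (prediction plus residual-scale noise).
fit_hp <- lm(hp ~ wt, data = observed)
synthetic$hp <- predict(fit_hp, newdata = synthetic) +
  rnorm(n, mean = 0, sd = sigma(fit_hp))

# Step 3: model mpg given wt and hp, then draw synthetic mpg the same way.
fit_mpg <- lm(mpg ~ wt + hp, data = observed)
synthetic$mpg <- predict(fit_mpg, newdata = synthetic) +
  rnorm(n, mean = 0, sd = sigma(fit_mpg))

# Compare how well the pairwise correlations are preserved.
round(cor(observed), 2)
round(cor(synthetic), 2)
```

Because each synthetic value is drawn from a fitted model rather than copied from a record, the output has the same shape as the input while approximating its statistical relationships. A sketch like this offers no formal privacy guarantee, and linear models can produce implausible values (for example, negative horsepower), which is one reason dedicated packages provide more flexible conditional models.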

Specific use-cases

The following packages were developed for specific use-cases and are not purely generative (listed alphabetically):

| Name | Description | More info | Maintenance | GitHub stars |
| --- | --- | --- | --- | --- |
| NestedCategBayesImpute | Modeling, Imputing and Generating Synthetic Versions of Nested Categorical Data in the Presence of Impossible Combinations | Documentation | - | - |
| synthACS | Access American Community Survey (ACS) data, build synthetic ACS microdata at any specified geographical level, add additional attributes to the synthetic dataset, and conduct spatial microsimulation modelling (SMSM) | Documentation, article, GitHub | Active | 0-10 |