Below is a selection of R packages for creating (mostly tabular) synthetic datasets. We divide the packages into three categories: purely generative packages, packages that are not purely generative (i.e., they rely on input data), and packages for specific use-cases. The classification is loose, since a package may fall into more than one category.
For more inspiration on how to create fake data in R, see this RStudio blog post.
The following packages can generate synthetic data without any input data (listed alphabetically):
Name | Description | More info | Maintenance | GitHub stars |
---|---|---|---|---|
charlatan | Generation of fake data, e.g., names, dates, addresses, etc. | Documentation, GitHub | Active | 100-500 |
conjurer | A Parametric Method for Generating Synthetic Tabular Data | Documentation, GitHub | Active | 0-10 |
fabricatr | Simulate hierarchical data structures and correlated data (tabular), either from random number generators or by resampling from existing data sources | Documentation, GitHub | Active | 10-100 |
faux | Create datasets with factorial structure through simulation by specifying variable parameters | Documentation, Published version, GitHub | Active | 10-100 |
simPop | Simulation of populations for surveys based on auxiliary data (model-based, calibration, combinatorial optimization) | Article, documentation, GitHub | Active | 10-100 |
wakefield | Generates random dataframes, lists and vectors from a selection of variable types (e.g., age, sex, date, religion, zip code, etc.) | Documentation, GitHub | Inactive | 100-500 |
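
To make the idea of purely generative data concrete, here is a minimal sketch in base R that builds a fake data frame from random number generators alone. It does not use any of the packages listed above; the column names, value ranges, and distribution parameters are illustrative assumptions.

```r
# Minimal base-R sketch: fake tabular data generated without any input dataset.
# All column names and distribution parameters are hypothetical choices.
set.seed(42)  # fixed seed so the example is reproducible

n <- 100  # number of fake records

fake <- data.frame(
  id     = seq_len(n),
  age    = sample(18:90, n, replace = TRUE),
  sex    = factor(sample(c("female", "male"), n, replace = TRUE)),
  income = round(rlnorm(n, meanlog = 10, sdlog = 0.5)),
  joined = as.Date("2020-01-01") + sample(0:364, n, replace = TRUE)
)

str(fake)
```

The packages in the table provide much richer generators (names, addresses, hierarchical and correlated structures) on top of this basic pattern.
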
The following packages require an input dataset from which to generate the synthetic data (listed alphabetically):
Name | Description | More info | Maintenance | GitHub stars |
---|---|---|---|---|
mice | Extend imputation procedures for missing data to synthetic data, such that (part of) the observed data can be easily overimputed with fake, privacy-preserving, synthetic records | Documentation, article, GitHub | Active | 100-500 |
synthesis | Generate synthetic time series from commonly used statistical models (e.g., linear, nonlinear, chaotic systems) | Documentation, article, GitHub | Active | 0-10 |
synthpop | Produces synthetic versions of tabular microdata containing confidential information with minimal distortion of statistical information (using sequential modelling) | Documentation, article, GitHub | Active | 10-100 |
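
The sequential modelling idea mentioned for synthpop can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not the package's API or its actual method: it uses the built-in `mtcars` data as the "confidential" input, plain linear models, and an arbitrary variable order (`wt`, then `hp`, then `mpg`).

```r
# Simplified sketch of sequential synthesis in base R (not synthpop's implementation).
# Assumptions: mtcars stands in for the input data; linear models with Gaussian noise.
set.seed(1)

observed <- mtcars[, c("wt", "hp", "mpg")]
n <- nrow(observed)

# Step 1: synthesise the first variable by sampling observed values with replacement.
synthetic <- data.frame(wt = sample(observed$wt, n, replace = TRUE))

# Step 2: model hp given wt on the observed data, then draw hp for the synthetic wt.
fit_hp <- lm(hp ~ wt, data = observed)
synthetic$hp <- predict(fit_hp, newdata = synthetic) +
  rnorm(n, mean = 0, sd = sigma(fit_hp))

# Step 3: model mpg given wt and hp, then draw mpg for the synthetic predictors.
fit_mpg <- lm(mpg ~ wt + hp, data = observed)
synthetic$mpg <- predict(fit_mpg, newdata = synthetic) +
  rnorm(n, mean = 0, sd = sigma(fit_mpg))

# Compare marginal summaries of the observed and synthetic data.
summary(observed)
summary(synthetic)
```

Each variable is generated conditional on the variables synthesised before it, which is how relationships between columns are carried over from the input data.
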
The following packages were developed for specific use-cases and are not purely generative (listed alphabetically):
Name | Description | More info | Maintenance | GitHub stars |
---|---|---|---|---|
NestedCategBayesImpute | Modeling, Imputing and Generating Synthetic Versions of Nested Categorical Data in the Presence of Impossible Combinations | Documentation | - | - |
synthACS | Access American Community Survey (ACS) data, build synthetic ACS microdata at any specified geographical level, add additional attributes to the synthetic dataset, and conduct spatial microsimulation modelling (SMSM) | Documentation, article, GitHub | Active | 0-10 |