Privacy-preserving synthetic data generation workflows
Collaboration and project management are handled in the QUIPP-collab repo.
Please do not open new issues in this repository; instead, open a new issue in QUIPP-collab and add the 'pipeline' label.
The QUiPP (Quantifying Utility and Preserving Privacy) project aims to produce a framework to facilitate the creation of synthetic population data where the privacy of individuals is quantified. In addition, QUiPP can assess utility in a variety of contexts: for example, does a model trained on the synthetic data generalize as well to the population as the same model trained on the confidential sample?
The proliferation of individual-level data sets has opened up new research opportunities. This individual information is tightly restricted in many contexts: in health and census records, for example. This creates difficulties in working openly and reproducibly, since full analyses cannot then be shared. Methods exist for creating synthetic populations that are representative of the existing relationships and attributes in the original data. However, understanding the utility of the synthetic data and simultaneously protecting individuals' privacy, such that these data can be released more openly, is challenging.
This repository contains a pipeline for synthetic population generation, using a variety of methods as implemented by several libraries. In addition, the pipeline emits measures of privacy and utility of the resulting data.
Note that a Docker image is provided with the dependencies pre-installed, as turinginst/quipp-env. More detail on setting this up can be found here.
- Clone the repository
git clone git@github.com:alan-turing-institute/QUIPP-pipeline.git
- Various parts of this code and its dependencies are written in Python, R, C++ and Bash.
- It has been tested with
- python 3.8
- R 3.6
- gcc 9.3
- bash 3.2
- It depends on the following libraries/tools:
To install all of the dependencies, ensure you're using the relevant versions of python (>=3.8) and R (>=3.6), then run the following commands in a terminal from the root of this repository:
python -m pip install -r env-configuration/requirements.txt
R
> source("env-configuration/install.R")
> q()
> Save workspace image? [y/n/c]: y
Another external dependency is the SGF implementation of plausible deniability:
- Download SGF here
- See the library's README file for how to compile the code. You will need a recent version of cmake (tested with version 3.17), either installed through your system's package manager, or from here.
- After compilation, the three executables of the SGF package (sgfinit, sgfgen and sgfextract) should have been built. Add their location to your PATH, or alternatively, assign the environment variable SGFROOT to point to this location. That is, in bash,
  - either export PATH=$PATH:/path/to/sgf/bin,
  - or export SGFROOT=/path/to/sgf/bin
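As a quick check that the SGF set-up above worked, the following minimal sketch looks for the three executables. The function name find_sgf_executables and the lookup order (SGFROOT first, then PATH) are assumptions made for illustration; the pipeline's own lookup logic may differ.

```python
import os
import shutil
from pathlib import Path

# Executable names as listed in the SGF instructions above.
SGF_EXECUTABLES = ("sgfinit", "sgfgen", "sgfextract")


def find_sgf_executables(env=None):
    """Map each SGF executable name to its path, or to None if it is not found.

    Assumption: SGFROOT, when set, is consulted before PATH.
    """
    env = os.environ if env is None else env
    sgf_root = env.get("SGFROOT")
    found = {}
    for name in SGF_EXECUTABLES:
        path = None
        if sgf_root:
            candidate = Path(sgf_root) / name
            if candidate.is_file() and os.access(candidate, os.X_OK):
                path = str(candidate)
        if path is None:
            # shutil.which searches the directories listed in the given PATH string.
            path = shutil.which(name, path=env.get("PATH"))
        found[name] = path
    return found
```

Calling find_sgf_executables() with no arguments inspects the current environment; any name mapped to None indicates that neither SGFROOT nor PATH points at that executable.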
The top-level directory structure mirrors the data pipeline.
- doc: The QUiPP report - a high-level overview of the project, our work and the methods we have considered so far.
- examples: Tutorial examples of using some of the methods (currently just CTGAN). These are independent of the pipeline.
- binder: Configuration files to set up the pipeline using Binder
- env-configuration: Set-up of the computational environment needed by the pipeline and its dependencies
- generators: Quickly generating input data for the pipeline from a few tunable and well-understood models
- datasets: Sample data that can be consumed by the pipeline.
- datasets-raw: A few (public, open) datasets that we have used are reproduced here where licence and size permit. They are not necessarily of the correct format to be consumed by the pipeline.
- synth-methods: One directory per library/tool, each of them implementing a complete synthesis method
- utility-metrics: Scripts relating to computing the utility metrics
- privacy-metrics: Scripts relating to computing the privacy metrics
- run-inputs: Parameter json files (see below), one for each run
When the pipeline is run, additional directories are created:
- generator-outputs: Sample generated input data (using generators)
- synth-output: Contains the result of each run (as specified in run-inputs), which will typically consist of the synthetic data itself and a selection of utility and privacy scores
The following indicates the full pipeline, as run on an input file called example.json. This input file has keywords dataset (the base of the filename to use for the original input data) and synth-method which refers to one of the synthesis methods. As output, the pipeline produces:
- synthetic data, in one or more files synthetic_data_1.csv, synthetic_data_2.csv, ...
- the disclosure risk privacy score, in disclosure_risk.json
- classification scores of utility, sklearn_classifiers.json
The files dataset.csv and dataset.json could be in a subdirectory of datasets, but this is not a requirement.
- Make a parameter json file, in run-inputs/, for each desired synthesis (see below for the structure of these files).
- Run make in the top level QUIPP-pipeline directory to run all syntheses (one per file). The output for run-inputs/example.json can be found in synth-output/example/. It will consist of:
  - one or more synthetic data sets, based on the original data (as specified in example.json), called synthetic_data_1.csv, synthetic_data_2.csv, ...
  - the file disclosure_risk.json, containing the disclosure risk scores
  - the file sklearn_classifiers.json, containing the classification scores
- make clean removes all synthetic output and generated data.
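The mapping from a parameter file to its output location can be sketched as follows. The helper names expected_outputs and synthetic_datasets are hypothetical and are not part of the pipeline; only the directory and file names are taken from the description above.

```python
import re
from pathlib import Path

# File-name pattern for the synthetic data sets, as described above.
_SYNTH_NAME = re.compile(r"synthetic_data_(\d+)\.csv")


def expected_outputs(run_input, repo_root="."):
    """Return the output directory for a parameter file and the score files expected in it.

    run-inputs/example.json maps to synth-output/example/.
    """
    out_dir = Path(repo_root) / "synth-output" / Path(run_input).stem
    scores = [out_dir / "disclosure_risk.json", out_dir / "sklearn_classifiers.json"]
    return out_dir, scores


def synthetic_datasets(out_dir):
    """List synthetic_data_<n>.csv files in out_dir, ordered by their numeric index."""
    matches = []
    for path in Path(out_dir).iterdir():
        m = _SYNTH_NAME.fullmatch(path.name)
        if m:
            matches.append((int(m.group(1)), path))
    return [path for _, path in sorted(matches)]
```

Sorting on the numeric index rather than on the file name keeps synthetic_data_10.csv after synthetic_data_2.csv when a run produces ten or more data sets.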
- Make a subdirectory in synth-methods having the name of the new method.
- This directory should contain an executable file run that, when called as run $input_json $dataset_base $outfile_prefix, runs the method with the input parameter json file on the dataset $dataset_base.{csv,json} (see data format, below), and puts its output files in the directory $outfile_prefix.
- In the parameter JSON file (a JSON file in run-inputs), the method can be used as the value of the "synth-method" name.
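To illustrate the calling convention for a new method, here is a minimal sketch of what a run executable could look like if written in Python. The "method" is a placeholder that resamples rows of the original data with replacement; it offers no privacy protection and only demonstrates the three-argument interface and the output naming. Accepting either random_seed or random_state as the seed name is an assumption, as is the default of one output data set.

```python
#!/usr/bin/env python3
import csv
import json
import random
import sys
from pathlib import Path


def main(argv):
    """Placeholder synthesis: run $input_json $dataset_base $outfile_prefix."""
    input_json, dataset_base, outfile_prefix = argv
    with open(input_json) as f:
        params = json.load(f)["parameters"]

    # The csv file must contain column headings followed by the data rows.
    with open(dataset_base + ".csv", newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)

    # Assumption: the seed may appear under either of these two names.
    rng = random.Random(params.get("random_seed", params.get("random_state")))

    # -1 means "produce as many records as there are input records".
    n_out = params.get("num_samples_to_synthesize", -1)
    if n_out == -1:
        n_out = len(rows)

    out_dir = Path(outfile_prefix)
    out_dir.mkdir(parents=True, exist_ok=True)
    for i in range(1, params.get("num_datasets_to_synthesize", 1) + 1):
        # Placeholder only: resampling real rows is NOT privacy-preserving.
        sample = rng.choices(rows, k=n_out)
        with open(out_dir / f"synthetic_data_{i}.csv", "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(sample)
    return 0


# Run only when invoked with exactly the three expected arguments.
if __name__ == "__main__" and len(sys.argv) == 4:
    sys.exit(main(sys.argv[1:]))
```

A real method would replace the resampling step with a call into the relevant synthesis library, keeping the argument handling and output layout unchanged.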
The input data should be present as two files with the same prefix: a csv file (with suffix .csv) which must contain column headings (along with the column data itself), and a json file (the "data json file") describing the types of the columns used for synthesis.
For example, see the Polish Social Diagnosis dataset. This contains the files
datasets/polish_data_2011/polish_data_2011.csv
datasets/polish_data_2011/polish_data_2011.json
and so has the prefix datasets/polish_data_2011/polish_data_2011
relative to the root of this repository.
The prefix of the data files (as an absolute path, or relative to the root of the repository) is given in the parameter json file (see the next section) as the top-level property dataset: there is no restriction on where these can be located, although a few examples can be found in datasets/.
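The prefix convention can be sketched in a few lines. The function name dataset_files and the repo_root argument are hypothetical names introduced for this illustration.

```python
from pathlib import Path


def dataset_files(prefix, repo_root="."):
    """Return the (csv, json) paths for a dataset prefix.

    A relative prefix is taken relative to the root of the repository;
    an absolute prefix is used as-is.
    """
    base = Path(prefix)
    if not base.is_absolute():
        base = Path(repo_root) / base
    # Append the suffixes as plain strings, so that a prefix which itself
    # contains a dot (e.g. "data/v1.2") is not truncated.
    return Path(str(base) + ".csv"), Path(str(base) + ".json")
```

For the Polish Social Diagnosis example, dataset_files("datasets/polish_data_2011/polish_data_2011") yields the two paths listed above, relative to the current directory.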
The pipeline takes a single json file, describing the data synthesis to perform, including any parameters the synthesis method might need, as well as any additional parameters for the privacy and utility methods. The json schema for this parameter file is here.
To be usable by the pipeline, the parameter input file must be located in the run-inputs directory.
The following example is in run-inputs/synthpop-example-2.json.
{
    "enabled": true,
    "dataset": "generator-outputs/odi-nhs-ae/hospital_ae_data_deidentify",
    "synth-method": "synthpop",
    "parameters": {
        "enabled": true,
        "num_samples_to_fit": -1,
        "num_samples_to_synthesize": -1,
        "num_datasets_to_synthesize": 5,
        "random_state": 12345,
        "vars_sequence": [5, 3, 8, 1],
        "synthesis_methods": ["cart", "", "cart", "", "cart", "", "", "cart"],
        "proper": true,
        "tree_minbucket": 1,
        "smoothing": {}
    },
    "parameters_disclosure_risk": {
        "enabled": true,
        "num_samples_intruder": 100,
        "vars_intruder": ["Treatment", "Gender", "Age bracket"]
    },
    "parameters_sklearn_utility": {
        "enabled": true,
        "input_columns": ["Time in A&E (mins)"],
        "label_column": "Age bracket",
        "test_train_ratio": 0.2,
        "num_leaked_rows": 0
    }
}
The JSON schema for the parameter json file is here.
The parameter JSON file must include the following names:
- enabled (boolean): Run this example?
- dataset (string): The prefix of the dataset (.csv and .json are appended to get the paths of the data files)
- synth-method (string): The synthesis method used by the run. It must correspond to a subdirectory of synth-methods.
- parameters (object): The parameters passed to the synthesis method. The contents of this object will depend on the synth-method used: the contents of this object are documented separately for each. The following names are common across each method:
  - enabled (boolean): Perform the synthesis step?
  - num_samples_to_fit (integer): How many samples from the input dataset should be used as input to the synthesis procedure? To use all of the input records, pass a value of -1.
  - num_samples_to_synthesize (integer): How many synthetic samples should be produced as output? To produce the same number of output records as input records, pass a value of -1.
  - num_datasets_to_synthesize (integer): How many entire synthetic datasets should be produced?
  - random_seed (integer): the seed for the random number generator (most methods require a PRNG: the seed can be explicitly passed to aid with the testability and reproducibility of the synthetic output)
  - Additional options for CTGAN, SGF and synthpop
- parameters_disclosure_risk (object): parameters needed to compute the disclosure risk privacy score
  - enabled (boolean): compute this score?
  - num_samples_intruder (integer): how many records corresponding to the original dataset exist in a dataset visible to an attacker.
  - vars_intruder (array):
    - items (string): names of the columns that are available in the attacker-visible dataset.
- parameters_sklearn_utility (object): parameters needed to compute the classification utility scores with scikit learn:
  - enabled (boolean): compute this score?
  - input_columns (array):
    - items (string): names of the columns to use as the explanatory variables for the classification
  - label_column (string): the column to use for the category labels
  - test_train_ratio (number): fraction of records to use in the test set for the classification
  - num_leaked_rows (integer): the number of additional records from the original dataset with which to augment the synthetic data set before training the classifiers. This is primarily an option to enable testing of the utility metric (i.e. the more rows we leak, the better the utility should become). It should be set to 0 during normal synthesis tasks.
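A lightweight pre-flight check of a parameter file against the top-level names listed above might look like the sketch below. It is not a substitute for validating against the JSON schema; the names REQUIRED, check_parameters and resolve_count are hypothetical and introduced only for this illustration.

```python
# Top-level names the parameter JSON file must include, with the Python type
# that each JSON type is loaded as by the standard json module.
REQUIRED = {
    "enabled": bool,
    "dataset": str,
    "synth-method": str,
    "parameters": dict,
    "parameters_disclosure_risk": dict,
    "parameters_sklearn_utility": dict,
}


def check_parameters(params):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    for name, expected in REQUIRED.items():
        if name not in params:
            problems.append(f"missing required name: {name}")
        elif not isinstance(params[name], expected):
            problems.append(f"{name} should be of type {expected.__name__}")
    return problems


def resolve_count(requested, available):
    """Apply the -1 convention used by num_samples_to_fit and num_samples_to_synthesize.

    -1 stands for the number of input records; any other value is used as given.
    """
    return available if requested == -1 else requested
```

Typical use would be check_parameters(json.load(f)) on an opened parameter file, and resolve_count(params["parameters"]["num_samples_to_fit"], n_input_records) once the input data has been read.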