PrivGen

RNN code used for "Privacy-Preserving Synthetic Educational Data Generation"

To cite the paper, please use:

@inproceedings{Vie2022,
  title={Privacy-Preserving Synthetic Educational Data Generation},
  author={Vie*, Jill-Jênn and Rigaux*, Tomas and Minn, Sein},
  booktitle={Proceedings of EC-TEL 2022},
  pages={in press},
  year={2022},
  url={https://hal.archives-ouvertes.fr/hal-03715416}
}

Data

The data should be stored in data/, with the following structure:

  • data/<data> is the folder containing the data for the dataset <data>
  • data/<data>/data.csv should be the raw data, which should contain the columns user, item, skill, and correct (see the loading sketch below)
  • data/<data>/coef0.npy can be generated automatically and should contain the IRT coefficients matching the current dataset
  • Generated datasets should be stored as data/<data>/gen-<generated-data>.csv, where <generated-data> is a name identifying the generation method

In the following, <data> and <generated-data> refer to these dataset and generation-method names.
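
As a quick sanity check of this layout, the raw file can be loaded and inspected with pandas. The sketch below assumes pandas is installed; the dataset name "assistments" is only a placeholder for your own <data> folder.

import pandas as pd

# Placeholder dataset name; replace with your own folder under data/
DATASET = "assistments"

df = pd.read_csv(f"data/{DATASET}/data.csv")

# The raw data is expected to expose these four columns
required = {"user", "item", "skill", "correct"}
missing = required - set(df.columns)
assert not missing, f"data.csv is missing columns: {missing}"

print(df.head())
print(df.user.nunique(), "users,", df.item.nunique(), "items")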

Training

The following command trains an RNN model on the dataset <data>. It writes the learned parameters to data/<data>/params-<model_name>.pt, where <model_name> encodes all the model hyperparameters, and the loss curve to data/<data>/loss-<model_name>.

python train.py <data>

Model hyperparameters can be adjusted; see python train.py -h for the available options.

bsl is a list of integers specifying how many segments the sequences are broken into during training; this lets the effective sequence length vary and works around vanishing-gradient problems. Once training has stabilized (the loss has not improved for 100 epochs), the next element of bsl is used.
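
The actual segmentation is implemented in train.py; the snippet below is only a minimal sketch of the idea, assuming each user's interactions form one sequence and using a made-up bsl schedule.

import numpy as np

def split_into_segments(sequence, n_segments):
    # Training on many short segments first limits the backpropagation
    # length, which mitigates vanishing gradients; later bsl values use
    # fewer, longer segments.
    return np.array_split(np.asarray(sequence), n_segments)

bsl = [8, 4, 2, 1]            # hypothetical schedule
sequence = list(range(100))   # placeholder for one user's responses
for n_segments in bsl:
    segments = split_into_segments(sequence, n_segments)
    print(n_segments, [len(s) for s in segments])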

Generation

To generate a dataset with the same number of users as <data>, use:

python gen.py <data>

If model hyperparameters were adjusted during training, the same adjustments must be passed here so that the correct model parameters are loaded.
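
Assuming the generated file follows the data/<data>/gen-<generated-data>.csv convention above, a quick consistency check between the real and synthetic data might look like the following; the dataset and generation-method names are placeholders.

import pandas as pd

DATASET = "assistments"   # placeholder <data>
GEN_NAME = "rnn"          # placeholder <generated-data>

real = pd.read_csv(f"data/{DATASET}/data.csv")
fake = pd.read_csv(f"data/{DATASET}/gen-{GEN_NAME}.csv")

# The synthetic data should expose the same columns as the raw data,
# and gen.py generates as many synthetic users as there are real users.
assert {"user", "item", "skill", "correct"} <= set(fake.columns)
print("real users:", real.user.nunique(), "synthetic users:", fake.user.nunique())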

Evaluation

Compare IRT coefficients

For the first evaluation, the ktm submodule must be initialized with

git submodule init
git submodule update

Then, the different error values are computed with

python eval_irt.py <data> <generated-data>
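
eval_irt.py (together with the ktm submodule) computes the actual error values. Purely as an illustration of the kind of comparison involved, two IRT coefficient vectors in the coef0.npy format could be compared as follows; the coef0-gen.npy file name is hypothetical.

import numpy as np

DATASET = "assistments"   # placeholder dataset name

real_coefs = np.load(f"data/{DATASET}/coef0.npy")      # fitted on real data
gen_coefs = np.load(f"data/{DATASET}/coef0-gen.npy")   # hypothetical: refitted on synthetic data

rmse = np.sqrt(np.mean((real_coefs - gen_coefs) ** 2))
corr = np.corrcoef(real_coefs.ravel(), gen_coefs.ravel())[0, 1]
print(f"RMSE: {rmse:.4f}, Pearson correlation: {corr:.4f}")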

Evaluate Reidentification AUC

TODO: See notebooks/Attack.ipynb for the current code
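
Until that notebook is turned into a script, note that the metric itself is presumably a standard ROC AUC over an attacker's reidentification scores. A minimal sketch with scikit-learn, using made-up scores and labels, is:

import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical attack output: one score per candidate pairing between a
# synthetic user and a real user, plus a label marking true matches.
scores = np.array([0.9, 0.2, 0.7, 0.4, 0.8, 0.1])
labels = np.array([1, 0, 1, 0, 1, 0])

# 0.5 means the attacker does no better than chance; values close to 1.0
# mean synthetic users can be reidentified easily.
print("Reidentification AUC:", roc_auc_score(labels, scores))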
