Add export functionality to TOML in addition to JSON #327

Merged
merged 25 commits into from
Oct 10, 2024
Changes from 19 commits
6 changes: 3 additions & 3 deletions .github/workflows/core-ci.yml
@@ -21,9 +21,9 @@ jobs:
- name: Install dependencies
run: |
python -m pip install --upgrade pip
-python -m pip install ".[check]"
+python -m pip install ".[check]" tomlkit
- name: Lint with Ruff
-run: ruff check
+run: ruff check metasyn
- name: Check types with MyPy
run: mypy metasyn

@@ -61,7 +61,7 @@ jobs:
run: |
pip install git+https://github.com/sodascience/metasyn-disclosure-control
pip install .
-metasyn create-meta metasyn/demo/demo_titanic.csv --config examples/example_config.toml
+metasyn create-meta metasyn/demo/demo_titanic.csv --config examples/config_files/example_config.toml

build-docs:
name: Build documentation
4 changes: 2 additions & 2 deletions README.md
@@ -17,7 +17,7 @@

__Generate synthetic tabular data__ in a transparent, understandable, and privacy-friendly way. Metasyn makes it possible for owners of sensitive data to create test data, do open science, improve code reproducibility, encourage data reuse, and enhance accessibility of their datasets, without worrying about leaking private information.

-With metasyn you can __fit__ a model to an existing dataframe, __export__ it to a transparent and auditable `.json` file, and __synthesize__ a dataframe that looks a lot like the real one. In contrast to most other synthetic data software, we make the explicit choice to strictly limit the statistical information in our model in order to adhere to the highest privacy standards.
+With metasyn you can __fit__ a model to an existing dataframe, __save__ it to a transparent and auditable `.json` file, and __synthesize__ a dataframe that looks a lot like the real one. In contrast to most other synthetic data software, we make the explicit choice to strictly limit the statistical information in our model in order to adhere to the highest privacy standards.

## Highlights
- 👋 __Accessible__. Metasyn is designed to be easy to use and understand, and we do our best to be welcoming to newcomers and novice users. [Let us know](https://github.com/sodascience/metasyn/issues/new) if we can improve!
@@ -71,7 +71,7 @@ mf = MetaFrame.fit_dataframe(df)
# Generate a new DataFrame with 5 rows from the MetaFrame.
df_synth = mf.synthesize(5)

-# This DataFrame can be exported to csv, parquet, excel and more.
+# This DataFrame can be saved to csv, parquet, excel and more.
df_synth.write_csv("output.csv")
```

8 changes: 4 additions & 4 deletions docs/paper/paper.md
@@ -46,7 +46,7 @@ These choices enable the software to generate synthetic data with __privacy and
At its core, `metasyn` has three main functions:

1. __Estimation__: Fit a generative model to a properly formatted tabular dataset, optionally with privacy guarantees.
-2. __(De)serialization__: Create an intermediate representation of the fitted model for auditing, editing, and exporting.
+2. __(De)serialization__: Create an intermediate representation of the fitted model for auditing, editing, and saving.
3. __Generation__: Synthesize new datasets based on a fitted model.

## Estimation
@@ -117,11 +117,11 @@ After fitting a model, `metasyn` can transparently store it in a human- and mach
}
```

-This `.json` can be manually audited, edited, and after exporting this file, an unlimited number of synthetic records can be created without incurring additional privacy risks. Serialization and deserialization with `metasyn` can be performed as follows:
+This `.json` can be manually audited, edited, and after saving this file, an unlimited number of synthetic records can be created without incurring additional privacy risks. Serialization and deserialization with `metasyn` can be performed as follows:

```python
-mf.export("fruits.json")
-mf_new = MetaFrame.from_json("fruits.json")
+mf.save("fruits.json")
+mf_new = MetaFrame.load("fruits.json")
```
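
The save/load round trip in the snippet above can be illustrated with a minimal standard-library sketch. `ToyMetaFrame` and `round_trip` are hypothetical stand-ins introduced only for illustration; they are not metasyn's `MetaFrame` and do not follow the GMF schema. The sketch only shows the idea that a fitted model written to a readable file can be rebuilt without access to the original data.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class ToyMetaFrame:
    """Hypothetical stand-in for a fitted model; not metasyn's MetaFrame."""

    n_rows: int
    columns: dict  # column name -> distribution parameters

    def save(self, path):
        # Serialize to indented JSON so the file can be audited or edited by hand.
        Path(path).write_text(json.dumps(asdict(self), indent=2), encoding="utf-8")

    @classmethod
    def load(cls, path):
        # Rebuild the model from the saved representation alone.
        return cls(**json.loads(Path(path).read_text(encoding="utf-8")))


def round_trip(model):
    # Save to a temporary file, then load it back into a new object.
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "fruits.json"
        model.save(path)
        return ToyMetaFrame.load(path)


toy = ToyMetaFrame(n_rows=5, columns={"fruit": {"type": "categorical"}})
restored = round_trip(toy)
```

Because the dataclass compares by field values, `restored == toy` holds after the round trip, which is the property the renamed `save` and `load` methods are expected to preserve.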

## Data generation
5 changes: 4 additions & 1 deletion docs/source/developer/GMF.rst
@@ -1,7 +1,10 @@
Generative Metadata Format (GMF)
================================

-At the core of ``metasyn`` lies its ability to :doc:`generate </usage/generating_metaframes>`, :doc:`export</usage/exporting_metaframes>` and :doc:`import</usage/exporting_metaframes>` statistical metadata for a given dataset, which can then be used to :doc:`generate synthetic datasets </usage/generating_synthetic_data>`. To achieve this, ``metasyn`` uses the Generative Metadata Format (GMF), an open source format (available on `GitHub <https://github.com/sodascience/generative_metadata_format>`_) designed to store statistical metadata for tabular datasets. The GMF standard is designed to be modular and extensible, with more distributions and privacy-enhancing mechanisms. Due to its open nature, GMF can be used by other software too.
+At the core of ``metasyn`` lies its ability to :doc:`generate </usage/generating_metaframes>`,
+:doc:`save</usage/saving_metaframes>` and :ref:`load<loading-a-metaframe>` statistical metadata
+for a given dataset, which can then be used to :doc:`generate synthetic datasets </usage/generating_synthetic_data>`. To achieve this, ``metasyn`` uses the Generative Metadata Format (GMF), an open source format (available on `GitHub <https://github.com/sodascience/generative_metadata_format>`_) designed to store statistical metadata for tabular datasets. The GMF standard is designed to be modular and extensible, with more distributions and privacy-enhancing mechanisms.
+Due to its open nature, GMF can be used by other software too.



2 changes: 1 addition & 1 deletion docs/source/developer/overview.rst
@@ -15,7 +15,7 @@ The :class:`~metasyn.MetaFrame` class is a core component of the ``metasyn`` pac
Essentially, a :obj:`~metasyn.MetaFrame` is a collection of :obj:`~metasyn.MetaVar` objects, each representing a column in a dataset. It contains methods that allow for the following:

- **Fitting to a DataFrame**: The :meth:`~metasyn.MetaFrame.fit_dataframe` method allows for fitting a Polars DataFrame to create a :obj:`~metasyn.MetaFrame` object. This method takes several parameters including the DataFrame, column specifications, distribution providers, privacy level, and a progress bar flag.
-- **Exporting and importing**: The :meth:`~metasyn.MetaFrame.export` method serializes and exports the :obj:`~metasyn.MetaFrame` to a JSON file, following the GMF format. The :meth:`~metasyn.MetaFrame.from_json` method reads a :obj:`~metasyn.MetaFrame` from a JSON file.
+- **Saving and loading**: The :meth:`~metasyn.MetaFrame.save` method serializes and saves the :obj:`~metasyn.MetaFrame` to a JSON or TOML file, following the GMF format. The :meth:`~metasyn.MetaFrame.load` method reads a :obj:`~metasyn.MetaFrame` from a JSON file.
- **Synthesizing to a DataFrame**: The :meth:`~metasyn.MetaFrame.synthesize` method creates a synthetic Polars DataFrame based on the :obj:`~metasyn.MetaFrame`.


2 changes: 1 addition & 1 deletion docs/source/faq.rst
@@ -14,7 +14,7 @@ A MetaFrame is a fitted model that describes the aggregate structure and charact

Key elements encapsulated in a MetaFrame include variable names, their data types, the proportion of missing values, and the parameters of the distributions that these variables follow in the dataset. This information is sufficient to understand the overall structure and attributes of the data, without divulging the exact data points.

-When a MetaFrame is created from an input dataset, it can be exported for auditing or manual editing.
+When a MetaFrame is created from an input dataset, it can be saved for auditing or manual editing.

In the ``metasyn`` workflow, once you have a MetaFrame, ``metasyn`` can generate synthetic data that aligns with the MetaFrame. This synthetic data shares the structural and distributional characteristics (as defined in the MetaFrame) with the original data but does not contain any actual data points from the original dataset, thus preserving privacy.
