fix: update package version 6 documentation #314

Merged
merged 9 commits into from
Dec 8, 2023
2 changes: 2 additions & 0 deletions README.md
@@ -91,6 +91,7 @@ Here you can find usage examples of the package and models to synthesize tabular
- Fast tabular data synthesis on adult census income dataset [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ydataai/ydata-synthetic/blob/master/examples/regular/models/Fast_Adult_Census_Income_Data.ipynb)
- Tabular synthetic data generation with CTGAN on adult census income dataset [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ydataai/ydata-synthetic/blob/master/examples/regular/models/CTGAN_Adult_Census_Income_Data.ipynb)
- Time Series synthetic data generation with TimeGAN on stock dataset [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ydataai/ydata-synthetic/blob/master/examples/timeseries/TimeGAN_Synthetic_stock_data.ipynb)
- Time Series synthetic data generation with DoppelGANger on FCC MBA dataset [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ydataai/ydata-synthetic/blob/master/examples/timeseries/DoppelGANger_FCC_MBA_Dataset.ipynb)
- More examples are continuously added and can be found in `/examples` directory.

### Datasets for you to experiment
@@ -102,6 +103,7 @@ Here are some example datasets for you to try with the synthesizers:

#### Sequential datasets
- [Stock data](https://github.com/ydataai/ydata-synthetic/tree/master/data)
- [FCC MBA data](https://github.com/ydataai/ydata-synthetic/tree/master/data)

## Project Resources

18 changes: 0 additions & 18 deletions docs/examples/ctgan_example.md

This file was deleted.

31 changes: 30 additions & 1 deletion docs/stylesheets/extra.css
@@ -5,6 +5,17 @@
transform: none;
}

.md-content {
--md-typeset-a-color: #002b9e;
}

@media {
.md-button--ydata {
--md-primary-fg-color: #E32212;
--md-primary-bg-color: #E32212;
}
}

:root {
/* Primary color shades */
--md-primary-fg-color: #040404;
@@ -19,4 +30,22 @@
--md-accent-fg-color--transparent: hsla(189, 100%, 37%, 0.1);
--md-accent-bg-color: hsla(0, 0%, 100%, 1);
--md-accent-bg-color--light: hsla(0, 0%, 100%, 0.7);
}
}

:root > * {
/* Code block color shades */
--md-code-bg-color: hsla(0, 0%, 96%, 1);
--md-code-fg-color: hsla(200, 18%, 26%, 1);

/* Footer */
--md-footer-bg-color: #040404;
--md-footer-bg-color--dark: hsla(0, 0%, 0%, 0.32);
--md-footer-fg-color: hsla(0, 0%, 100%, 1);
--md-footer-fg-color--light: hsla(0, 0%, 100%, 0.7);
--md-footer-fg-color--lighter: hsla(0, 0%, 100%, 0.3);
}

.youtube {
color: #EE0F0F;
}

File renamed without changes.
19 changes: 19 additions & 0 deletions docs/synthetic_data/index.md
@@ -0,0 +1,19 @@
# Synthetic data generation

[Synthetic data](https://ydata.ai/products/synthetic_data) is data created artificially, through computer simulation or generative algorithms, to
take the place of real-world data. It can be used as an alternative or supplement to real-world data when that
data is not readily available. It can also be used to boost Machine Learning performance.

The ydata-synthetic package is an open-source Python package developed by YData’s team that allows users to experiment
with several generative models for synthetic data generation. The main goal of the package is to serve as a way for data
scientists to get familiar with synthetic data and its applications in real-world domains, as well as the potential of **Generative AI**.

The *ydata-synthetic* package provides different methods for generating synthetic tabular and time-series data,
such as Variational Auto Encoders (VAE), [Gaussian Mixture Models (GMM)](single_table/gmm_example.md), and [Conditional Tabular Generative Adversarial Networks (CTGAN)](single_table/ctgan_example.md).
The package also includes a user-friendly UI that guides users through the steps and inputs needed to generate synthetic data
samples.

The package also aims to facilitate the exploration and understanding of synthetic data generation methods and their limitations.

### 📄<a href="single_table/ctgan_example.md"><u>Get started with synthetic data for tabular data with CTGAN</u></a>
### 📈 <a href="time_series/timegan_example.md"><u>Get started with synthetic data for time-series with TimeGAN</u></a>
31 changes: 31 additions & 0 deletions docs/synthetic_data/multi_table/fabric_multitable.md
@@ -0,0 +1,31 @@
# Multiple tables synthetic data generation **

!!! info "** YData's Enterprise feature"

This feature is only available for users of [YData Fabric](https://ydata.ai).

[Sign up for the Fabric community](http://ydata.ai/register?utm_source=ydata-profiling&utm_medium=documentation&utm_campaign=YData%20Fabric%20Community) and
try synthetic data generation from multiple tables, or [contact us](https://ydata.ai/contact-us) for more information.

Multi-table synthetic data enables the creation of large, diverse
datasets that are crucial for training robust machine learning models, testing algorithms, and addressing privacy concerns. It can
also be key to proper data democratization within an organization.

Nevertheless, generating a full database, or even several tables that share relations, can be particularly
challenging due to the necessity of preserving referential integrity across diverse tables and scale. This involves maintaining
realistic relationships between entities to mirror real-world scenarios accurately while being able to process large volumes
of data.
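
As a minimal sketch of what preserving referential integrity means in practice, the check below verifies that every foreign key in a child table points at an existing primary key in its parent table. The table contents and the names `find_orphans`, `customers`, and `orders` are hypothetical illustrations; they are not part of Fabric or ydata-synthetic.

```python
from typing import Any


def find_orphans(
    parent_rows: list[dict[str, Any]],
    child_rows: list[dict[str, Any]],
    parent_key: str,
    foreign_key: str,
) -> list[dict[str, Any]]:
    """Return child rows whose foreign key has no matching parent row."""
    parent_ids = {row[parent_key] for row in parent_rows}
    return [row for row in child_rows if row[foreign_key] not in parent_ids]


# Hypothetical synthetic tables: orders reference customers through customer_id.
customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 2},
    {"order_id": 12, "customer_id": 3},  # dangling reference: no customer 3
]

orphans = find_orphans(customers, orders, "customer_id", "customer_id")
print(orphans)
```

An empty result means the synthetic child table only references rows that exist in the synthetic parent table; any returned rows are the ones that break the relationship.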

[YData Fabric](https://ydata.ai/products/fabric) offers a synthetic data generation process that integrates with your existing relational databases.
By replicating the data's value and structure in a new target storage, Fabric supports a wide range of use cases.
These include reducing risk and improving compliance by substituting operational databases with synthetic databases for testing and development. It also enables QA teams to create more comprehensive and flexible testing scenarios.

Explore [Fabric](https://ydata.ai/register) multi-table synthesis capabilities:

### Which sources can I use to train a multi-table synthetic data generator?
- From a relational database
- From the upload of multiple files

### Related materials
- 📖 <a href="https://ydata.ai/resources/whitepaper-relational-databases-synthetic-data"><u>Read more about Fabric multi-table synthesis process with this whitepaper</u></a>
- :fontawesome-brands-youtube:{ .youtube } <a href="https://www.youtube.com/watch?v=9EupCg5YQLE&t=130s"><u>See Fabric multi-table synthesis in action</u></a>
57 changes: 57 additions & 0 deletions docs/synthetic_data/single_table/ctgan_example.md
@@ -0,0 +1,57 @@
# Synthesize tabular data

**Using *CTGAN* to generate tabular synthetic data:**

Real-world domains are often described by **tabular data** i.e., data that can be structured and organized in a table-like format, where **features/variables** are represented in **columns**, whereas **observations** correspond to the **rows**.

Additionally, real-world data usually comprises both **numeric** and **categorical** features. Numeric features are those that encode quantitative values, whereas categorical features represent qualitative measurements.
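
Tabular synthesizers typically need to be told which columns are numeric and which are categorical. The sketch below shows one simple way to make that split from the value types in a table. The helper `split_columns` and the sample rows are hypothetical, and the type-based rule is an assumption: integer-coded categories, for instance, would need to be listed by hand.

```python
from typing import Any


def split_columns(rows: list[dict[str, Any]]) -> tuple[list[str], list[str]]:
    """Split column names into numeric and categorical, based on value types."""
    num_cols, cat_cols = [], []
    for name in rows[0]:
        values = [row[name] for row in rows if row[name] is not None]
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_numeric = bool(values) and all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
        )
        (num_cols if is_numeric else cat_cols).append(name)
    return num_cols, cat_cols


# Hypothetical rows shaped like a census-style dataset.
rows = [
    {"age": 39, "hours-per-week": 40, "workclass": "State-gov", "income": "<=50K"},
    {"age": 50, "hours-per-week": 13, "workclass": "Self-emp", "income": ">50K"},
]
num_cols, cat_cols = split_columns(rows)
print(num_cols, cat_cols)
```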

CTGAN was specifically designed to deal with the challenges posed by tabular datasets, handling mixed (numeric and categorical) data:

- 📑 **Paper:** [Modeling Tabular Data using Conditional GAN](https://arxiv.org/pdf/1907.00503.pdf)

Here’s an example of how to synthesize tabular data with CTGAN using the [Adult Census Income](https://www.kaggle.com/datasets/uciml/adult-census-income?resource=download) dataset:

```python
--8<-- "examples/regular/models/adult_ctgan.py"
```

## Best practices & results optimization

!!! tip "Generate the best synthetic data quality"

If you are having a hard time getting CTGAN to deliver the synthetic data quality that you need for your use case,
give [YData Fabric Synthetic Data](https://ydata.ai/register) a try.
**Fabric Synthetic Data generation** is considered the best in terms of quality.
[Read more about it in this benchmark](https://www.linkedin.com/pulse/generative-ai-synthetic-data-vendor-comparison-best-vincent-granville).

**CTGAN**, like any other Machine Learning model, requires optimization both in data preparation and in
hyperparameter tuning. The following best practices and tips can help improve your synthetic data quality:

- **Understand Your Data:**
Thoroughly understand the characteristics and distribution of your original dataset before using CTGAN.
Identify important features, correlations, and patterns in the data.
Use [ydata-profiling](https://pypi.org/project/ydata-profiling/) to automate the process of understanding your data.

- **Preprocess Your Data:**
Clean and preprocess your data to handle missing values, outliers, and other anomalies before training CTGAN.
Standardize or normalize numerical features to ensure consistent scales.

- **Feature Engineering:**
Create additional meaningful features that could improve the quality of the synthetic data.

- **Optimize Model Parameters:**
Experiment with CTGAN hyperparameters such as *epochs*, *batch_size*, and *gen_dim* to find the values that work best
for your specific dataset.
Fine-tune the *learning rate* for better convergence.

- **Conditional Generation:**
Leverage the conditional generation capabilities of CTGAN by specifying conditions for certain features if applicable.
Adjust the conditioning mechanism to enhance the relevance of generated samples.

- **Handle Imbalanced Data:**
If your original dataset is imbalanced, ensure that CTGAN captures the distribution of minority classes effectively.
Adjust sampling strategies if needed.

- **Use Larger Datasets:**
Train CTGAN on larger datasets when possible to capture a more comprehensive representation of the underlying data distribution.
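
Two of the tips above, preprocessing numeric features and checking class balance, can be sketched with the standard library alone. The helper names `impute_and_standardize` and `class_shares` and the toy values are hypothetical, and median imputation followed by z-score scaling is only one possible choice, not the package's own preprocessing.

```python
from collections import Counter
from statistics import mean, median, pstdev
from typing import Optional


def impute_and_standardize(values: list[Optional[float]]) -> list[float]:
    """Fill missing entries with the median, then rescale to zero mean and unit variance."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    filled = [fill if v is None else v for v in values]
    mu, sigma = mean(filled), pstdev(filled)
    if sigma == 0:  # constant column: nothing to scale
        return [0.0 for _ in filled]
    return [(v - mu) / sigma for v in filled]


def class_shares(labels: list[str]) -> dict[str, float]:
    """Return the relative frequency of each class label."""
    counts = Counter(labels)
    return {label: count / len(labels) for label, count in counts.items()}


# Toy numeric column with one missing value.
ages = [20.0, None, 40.0, 30.0]
scaled = impute_and_standardize(ages)

# Toy labels: the synthetic sample has lost the minority class entirely.
real_labels = ["<=50K"] * 3 + [">50K"]
synthetic_labels = ["<=50K"] * 4
print(class_shares(real_labels))
print(class_shares(synthetic_labels))
```

Comparing the two share dictionaries is a quick way to spot a minority class that the synthesizer under-represents or drops.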
13 changes: 13 additions & 0 deletions docs/synthetic_data/single_table/gmm_example.md
@@ -0,0 +1,13 @@
# Synthesize tabular data

**Using *GMMs* to generate tabular synthetic data:**

Real-world domains are often described by **tabular data** i.e., data that can be structured and organized in a table-like
format, where **features/variables** are represented in **columns**, whereas **observations** correspond to the **rows**.

Gaussian Mixture Models (GMMs) are a type of probabilistic model, and probabilistic models can also be leveraged to generate
synthetic data. GMMs do so by learning the original data distribution as a mixture of Gaussian distributions
and then sampling new records from the fitted mixture.

- 📑 **Blogpost:** [Generate synthetic data with Gaussian Mixture models](https://ydata.ai/resources/synthetic-data-generation-with-gaussian-mixture-models)
- **Google Colab:** [Generate Adult census data with GMM](https://colab.research.google.com/github/ydataai/ydata-synthetic/blob/master/examples/regular/models/Fast_Adult_Census_Income_Data.ipynb)
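
To make the idea concrete, here is a minimal, self-contained sketch that fits a two-component mixture to a single numeric column with expectation-maximization and then samples synthetic values from it. It is an illustration under simplifying assumptions (one dimension, quantile-based initialization, a fixed number of iterations), not the implementation used by ydata-synthetic; the function names and the toy data are hypothetical.

```python
import math
import random


def normal_pdf(x: float, mu: float, sigma: float) -> float:
    """Density of a Gaussian with mean mu and standard deviation sigma at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def fit_gmm_1d(data: list[float], k: int = 2, iterations: int = 50):
    """Fit a k-component 1-D Gaussian mixture with expectation-maximization."""
    ordered = sorted(data)
    n = len(ordered)
    # Initialise means at evenly spaced quantiles, with a shared spread.
    means = [ordered[int((i + 0.5) * n / k)] for i in range(k)]
    spread = (ordered[-1] - ordered[0]) / (2 * k) or 1.0
    sigmas = [spread] * k
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sigmas)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and standard deviations.
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-300
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            sigmas[j] = max(math.sqrt(var), 1e-6)
    return weights, means, sigmas


def sample_gmm_1d(weights, means, sigmas, n: int, rng: random.Random) -> list[float]:
    """Draw n synthetic values: pick a component, then sample from its Gaussian."""
    components = rng.choices(range(len(weights)), weights=weights, k=n)
    return [rng.gauss(means[j], sigmas[j]) for j in components]


rng = random.Random(0)
# Toy "real" data: two clusters standing in for an original numeric column.
real = [rng.gauss(0.0, 1.0) for _ in range(300)] + [rng.gauss(10.0, 1.0) for _ in range(300)]
weights, means, sigmas = fit_gmm_1d(real)
synthetic = sample_gmm_1d(weights, means, sigmas, 500, rng)
print(weights, means, sigmas)
```

The fitted means should land close to the two cluster centers, and the synthetic values follow the same two-cluster shape without copying any original record.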
46 changes: 46 additions & 0 deletions docs/synthetic_data/streamlit_app.md
@@ -0,0 +1,46 @@
# The UI guided experience for Synthetic Data generation

`ydata-synthetic` offers a UI to guide you through the steps and inputs needed to generate structured tabular data.
The Streamlit app is available from *v1.0.0* onwards and supports the following flows:

- Train a synthesizer model for a single table dataset
- Generate & profile the generated synthetic samples

<p style="text-align:center;">
<iframe width="560" height="315" src="https://www.youtube.com/embed/ep0PhwsFx0A?si=a4UtCbetGdHb7py0" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
</p>

## Installation

pip install ydata-synthetic[streamlit]

## Quickstart

Use the code snippet below in a Python file:

!!! warning "Use Python scripts"

I know you probably love Jupyter Notebooks or Google Colab, but make sure that you start your
synthetic data generation Streamlit app from a Python script, as notebooks are not supported!

``` py
from ydata_synthetic import streamlit_app
streamlit_app.run()
```

Or use the file `streamlit_app.py`, which can be found in the [examples folder]().

``` bash
python -m streamlit_app
```

The following models are supported:

- [ydata-sdk Synthetic Data generator](https://docs.sdk.ydata.ai/0.6/examples/synthesize_tabular_data/)
- CGAN
- WGAN
- WGANGP
- DRAGAN
- CRAMER
- CTGAN
