Skip to content

Commit

Permalink
docs(README): polish EDA section
Browse files Browse the repository at this point in the history
  • Loading branch information
jinglinpeng committed Dec 14, 2020
1 parent 3c09669 commit fd5ef8c
Show file tree
Hide file tree
Showing 2 changed files with 26 additions and 24 deletions.
50 changes: 26 additions & 24 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,7 @@ DataPrep lets you prepare your data using a single library with a few lines of c
Currently, you can use DataPrep to:
* Collect data from common data sources (through `dataprep.connector`)
* Do your exploratory data analysis (through `dataprep.eda`)
* Clean and standardize data (through `dataprep.clean`)
* ...more modules are coming

## Releases
Expand Down Expand Up @@ -62,8 +63,7 @@ The following examples can give you an impression of what DataPrep can do:

* [Documentation: Connector](https://sfu-db.github.io/dataprep/user_guide/connector/connector.html)
* [Documentation: EDA](https://sfu-db.github.io/dataprep/user_guide/eda/introduction.html)
* [EDA Case Study: Titanic](https://sfu-db.github.io/dataprep/user_guide/eda/titanic.html)
* [EDA Case Study: House Price](https://sfu-db.github.io/dataprep/user_guide/eda/house_price.html)
* [Documentation: Clean](https://sfu-db.github.io/dataprep/user_guide/clean/introduction.html)

### Connector

Expand Down Expand Up @@ -109,40 +109,42 @@ If you want to connect with a different web API, Connector is designed to be eas

In the following link, you can see detailed examples of how to use Connector for retrieving data from DBLP, Spotify, Yelp, and other sites, without taking an in-depth look into the web APIs documentation!: [Examples.](https://github.com/sfu-db/dataprep/tree/develop/examples)

### EDA

There are common tasks during the exploratory data analysis stage,
like a quick look at the columnar distribution, or understanding the correlations
between columns.

The EDA (<em>Exploratory Data Analysis</em>) module categorizes these EDA tasks into functions helping you finish EDA
tasks with a single function call.

* Want to understand the distributions for each DataFrame column? Use `plot`.

<a href="https://sfu-db.github.io/dataprep/user_guide/eda/plot.html#Get-an-overview-of-the-dataset-with-plot(df)"><img src="https://github.com/sfu-db/dataprep/raw/develop/assets/plot(df).gif"/></a>
### EDA
DataPrep.EDA is the fastest and the easiest EDA (Exploratory Data Analysis) tool in Python. It allows you to understand a Pandas/Dask DataFrame with a few lines of code in seconds.

* Want to understand the correlation between columns? Use `plot_correlation`.
#### Create Profile Reports, Fast
You can create a beautiful profile report from a Pandas/Dask DataFrame with the `create_report` function. DataPrep.EDA has the following advantages compared to other tools:

<a href="https://sfu-db.github.io/dataprep/user_guide/eda/plot_correlation.html#Get-an-overview-of-the-correlations-with-plot_correlation(df)"><img src="https://github.com/sfu-db/dataprep/raw/develop/assets/plot_correlation(df).gif"/></a>
* **10-100X Faster**: DataPrep.EDA is 10-100X faster than Pandas-based profiling tools due to its highly optimized Dask-based computing module.
* **Interactive Visualization**: DataPrep.EDA generates interactive visualizations in a report, which makes the report look more appealing to end users.
* **Big Data Support**: DataPrep.EDA naturally supports big data stored in a Dask cluster by accepting a Dask dataframe as input.

* Or, if you want to understand the impact of the missing values for each column, use `plot_missing`.
The following code demonstrates how to use DataPrep.EDA to create a profile report for the titanic dataset.

<a href="https://sfu-db.github.io/dataprep/user_guide/eda/plot_missing.html#Get-an-overview-of-the-missing-values-with-plot_missing(df)"><img src="https://github.com/sfu-db/dataprep/raw/develop/assets/plot_missing(df).gif"/></a>
```python
import pandas as pd
from dataprep.eda import create_report
df = pd.read_csv("titanic.csv")
create_report(df).show_browser()
```

You can drill down to get more information by given `plot`, `plot_correlation` and `plot_missing` a column name.: E.g. for `plot_missing`
Click [here](https://sfu-db.github.io/dataprep/_downloads/c9bf292ac949ebcf9b65bb2a2bc5a149/titanic_dp.html) to see the generated report of the above code.

<a href="https://sfu-db.github.io/dataprep/user_guide/eda/plot_missing.html#Understand-the-impact-of-the-missing-values-in-column-x-with-plot_missing(df,-x)"><img src="https://github.com/sfu-db/dataprep/raw/develop/assets/plot_missing(df, x).gif"/></a>
#### Innovative System Design
DataPrep.EDA is the ***only*** task-centric EDA system in Python. It is carefully designed to improve usability.
* **Task-Centric API Design**: You can declaratively specify a wide range of EDA tasks in different granularities with a single function call. All needed visualizations will be automatically and intelligently generated for you.
* **Auto-Insights**: DataPrep.EDA automatically detects and highlights the insights (e.g., a column has many outliers) to facilitate pattern discovery about the data.
* **How-to Guide** (available soon): A how-to guide is provided to show the configuration of each plot function. With this feature, you can easily customize the generated visualizations.

&nbsp;&nbsp;&nbsp;&nbsp;for numerical column using`plot`:
#### Understand the Titanic dataset with Task-Centric API:
<a href="assets/eda_demo.gif"><img src="assets/eda_demo.gif"/></a>

<a href="https://sfu-db.github.io/dataprep/user_guide/eda/plot.html#Understand-a-column-with-plot(df,-x)"><img src="https://github.com/sfu-db/dataprep/raw/develop/assets/plot(df,x)_num.gif"/></a>
Click [here](https://sfu-db.github.io/dataprep/user_guide/eda/introduction.html) to check all the supported tasks.

&nbsp;&nbsp;&nbsp;&nbsp;for categorical column using`plot`:
Check [plot](https://sfu-db.github.io/dataprep/user_guide/eda/plot.html), [plot_correlation](https://sfu-db.github.io/dataprep/user_guide/eda/plot_correlation.html), [plot_missing](https://sfu-db.github.io/dataprep/user_guide/eda/plot_missing.html) and [create_report](https://sfu-db.github.io/dataprep/user_guide/eda/create_report.html) to see how each function works.

<a href="https://sfu-db.github.io/dataprep/user_guide/eda/plot.html#Understand-a-column-with-plot(df,-x)"><img src="https://github.com/sfu-db/dataprep/raw/develop/assets/plot(df,x)_cat.gif"/></a>

Don't forget to checkout the [examples] folder for detailed demonstration!

### Clean

Expand Down
Binary file added assets/eda_demo.gif
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.

0 comments on commit fd5ef8c

Please sign in to comment.