docs(datasets): add introduction for datasets
jinglinpeng committed Feb 4, 2021
1 parent a63403b commit 83d42ce
Showing 2 changed files with 142 additions and 0 deletions.
141 changes: 141 additions & 0 deletions docs/source/user_guide/datasets/introduction.ipynb
@@ -0,0 +1,141 @@
{
"metadata": {
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3-final"
},
"orig_nbformat": 2,
"kernelspec": {
"name": "python3",
"display_name": "Python 3",
"language": "python"
}
},
"nbformat": 4,
"nbformat_minor": 2,
"cells": [
{
"source": [
"# Datasets\n",
"\n",
"DataPrep provides a collection of datasets. You can load any of them with one line of code and use it to explore DataPrep's functionality."
],
"cell_type": "markdown",
"metadata": {}
},
{
"source": [
"## List Available Datasets\n",
"You can list the names of all available datasets by calling `get_dataset_names`, as shown below."
],
"cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"['iris', 'titanic']"
]
},
"metadata": {},
"execution_count": 1
}
],
"source": [
"from dataprep.datasets import get_dataset_names\n",
"get_dataset_names()"
]
},
{
"source": [
"## Load Dataset\n",
"\n",
"Once you know the available dataset names from `get_dataset_names`, you can load a dataset by calling `load_dataset`."
],
"cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" survived pclass sex age sibsp parch fare embarked class \\\n",
"0 0 3 male 22.0 1 0 7.2500 S Third \n",
"1 1 1 female 38.0 1 0 71.2833 C First \n",
"2 1 3 female 26.0 0 0 7.9250 S Third \n",
"3 1 1 female 35.0 1 0 53.1000 S First \n",
"4 0 3 male 35.0 0 0 8.0500 S Third \n",
".. ... ... ... ... ... ... ... ... ... \n",
"886 0 2 male 27.0 0 0 13.0000 S Second \n",
"887 1 1 female 19.0 0 0 30.0000 S First \n",
"888 0 3 female NaN 1 2 23.4500 S Third \n",
"889 1 1 male 26.0 0 0 30.0000 C First \n",
"890 0 3 male 32.0 0 0 7.7500 Q Third \n",
"\n",
" who adult_male deck embark_town alive alone \n",
"0 man True NaN Southampton no False \n",
"1 woman False C Cherbourg yes False \n",
"2 woman False NaN Southampton yes True \n",
"3 woman False C Southampton yes False \n",
"4 man True NaN Southampton no True \n",
".. ... ... ... ... ... ... \n",
"886 man True NaN Southampton no True \n",
"887 woman False B Southampton yes True \n",
"888 woman False NaN Southampton no False \n",
"889 man True C Cherbourg yes True \n",
"890 man True NaN Queenstown no True \n",
"\n",
"[891 rows x 15 columns]"
],
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>survived</th>\n <th>pclass</th>\n <th>sex</th>\n <th>age</th>\n <th>sibsp</th>\n <th>parch</th>\n <th>fare</th>\n <th>embarked</th>\n <th>class</th>\n <th>who</th>\n <th>adult_male</th>\n <th>deck</th>\n <th>embark_town</th>\n <th>alive</th>\n <th>alone</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>0</td>\n <td>3</td>\n <td>male</td>\n <td>22.0</td>\n <td>1</td>\n <td>0</td>\n <td>7.2500</td>\n <td>S</td>\n <td>Third</td>\n <td>man</td>\n <td>True</td>\n <td>NaN</td>\n <td>Southampton</td>\n <td>no</td>\n <td>False</td>\n </tr>\n <tr>\n <th>1</th>\n <td>1</td>\n <td>1</td>\n <td>female</td>\n <td>38.0</td>\n <td>1</td>\n <td>0</td>\n <td>71.2833</td>\n <td>C</td>\n <td>First</td>\n <td>woman</td>\n <td>False</td>\n <td>C</td>\n <td>Cherbourg</td>\n <td>yes</td>\n <td>False</td>\n </tr>\n <tr>\n <th>2</th>\n <td>1</td>\n <td>3</td>\n <td>female</td>\n <td>26.0</td>\n <td>0</td>\n <td>0</td>\n <td>7.9250</td>\n <td>S</td>\n <td>Third</td>\n <td>woman</td>\n <td>False</td>\n <td>NaN</td>\n <td>Southampton</td>\n <td>yes</td>\n <td>True</td>\n </tr>\n <tr>\n <th>3</th>\n <td>1</td>\n <td>1</td>\n <td>female</td>\n <td>35.0</td>\n <td>1</td>\n <td>0</td>\n <td>53.1000</td>\n <td>S</td>\n <td>First</td>\n <td>woman</td>\n <td>False</td>\n <td>C</td>\n <td>Southampton</td>\n <td>yes</td>\n <td>False</td>\n </tr>\n <tr>\n <th>4</th>\n <td>0</td>\n <td>3</td>\n <td>male</td>\n <td>35.0</td>\n <td>0</td>\n <td>0</td>\n <td>8.0500</td>\n <td>S</td>\n <td>Third</td>\n <td>man</td>\n <td>True</td>\n <td>NaN</td>\n <td>Southampton</td>\n <td>no</td>\n <td>True</td>\n </tr>\n <tr>\n <th>...</th>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n 
<td>...</td>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n </tr>\n <tr>\n <th>886</th>\n <td>0</td>\n <td>2</td>\n <td>male</td>\n <td>27.0</td>\n <td>0</td>\n <td>0</td>\n <td>13.0000</td>\n <td>S</td>\n <td>Second</td>\n <td>man</td>\n <td>True</td>\n <td>NaN</td>\n <td>Southampton</td>\n <td>no</td>\n <td>True</td>\n </tr>\n <tr>\n <th>887</th>\n <td>1</td>\n <td>1</td>\n <td>female</td>\n <td>19.0</td>\n <td>0</td>\n <td>0</td>\n <td>30.0000</td>\n <td>S</td>\n <td>First</td>\n <td>woman</td>\n <td>False</td>\n <td>B</td>\n <td>Southampton</td>\n <td>yes</td>\n <td>True</td>\n </tr>\n <tr>\n <th>888</th>\n <td>0</td>\n <td>3</td>\n <td>female</td>\n <td>NaN</td>\n <td>1</td>\n <td>2</td>\n <td>23.4500</td>\n <td>S</td>\n <td>Third</td>\n <td>woman</td>\n <td>False</td>\n <td>NaN</td>\n <td>Southampton</td>\n <td>no</td>\n <td>False</td>\n </tr>\n <tr>\n <th>889</th>\n <td>1</td>\n <td>1</td>\n <td>male</td>\n <td>26.0</td>\n <td>0</td>\n <td>0</td>\n <td>30.0000</td>\n <td>C</td>\n <td>First</td>\n <td>man</td>\n <td>True</td>\n <td>C</td>\n <td>Cherbourg</td>\n <td>yes</td>\n <td>True</td>\n </tr>\n <tr>\n <th>890</th>\n <td>0</td>\n <td>3</td>\n <td>male</td>\n <td>32.0</td>\n <td>0</td>\n <td>0</td>\n <td>7.7500</td>\n <td>Q</td>\n <td>Third</td>\n <td>man</td>\n <td>True</td>\n <td>NaN</td>\n <td>Queenstown</td>\n <td>no</td>\n <td>True</td>\n </tr>\n </tbody>\n</table>\n<p>891 rows × 15 columns</p>\n</div>"
},
"metadata": {},
"execution_count": 2
}
],
"source": [
"from dataprep.datasets import load_dataset\n",
"df = load_dataset(\"titanic\")\n",
"df"
]
},
{
"source": [
"## Analyze Dataset\n",
"Once the dataset is loaded, you can explore it with DataPrep. For example, you can create a profiling report of the dataset using `create_report` from `dataprep.eda`."
],
"cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from dataprep.eda import create_report\n",
"report = create_report(df)\n",
"report.show_browser()"
]
}
]
}
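The notebook above lists the dataset names and then loads one by name. A minimal sketch of combining the two steps, so that a misspelled name fails with a readable message, might look like the following. The helper `load_checked` is hypothetical and not part of DataPrep; it takes the listing and loading functions as parameters so the sketch runs without DataPrep installed.

```python
from typing import Any, Callable, Iterable


def load_checked(
    name: str,
    list_names: Callable[[], Iterable[str]],
    load: Callable[[str], Any],
) -> Any:
    """Load `name` only if the listing function reports it as available.

    Hypothetical helper for illustration; not a DataPrep API.
    """
    available = sorted(list_names())
    if name not in available:
        raise ValueError(f"unknown dataset {name!r}; available: {available}")
    return load(name)


if __name__ == "__main__":
    # Stand-ins so the sketch runs on its own. With DataPrep installed, one
    # would pass get_dataset_names and load_dataset from dataprep.datasets.
    fake_store = {"iris": "iris-frame", "titanic": "titanic-frame"}
    print(load_checked("titanic", fake_store.keys, fake_store.__getitem__))
```

With DataPrep available, the call would presumably be `load_checked("titanic", get_dataset_names, load_dataset)`, using the two functions the notebook imports.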
1 change: 1 addition & 0 deletions docs/source/user_guide/user_guide.rst
@@ -9,6 +9,7 @@ The User Guide introduces the components of the DataPrep library.
:maxdepth: 2
:titlesonly:

datasets/introduction
eda/introduction
connector/connector
clean/introduction
