docs(eda):replace x, y, z with col1, col2, col3
jinglinpeng committed May 19, 2021
1 parent 95074f5 commit 57f65b3
Showing 5 changed files with 41 additions and 41 deletions.
12 changes: 6 additions & 6 deletions docs/source/user_guide/eda/insights.ipynb

Large diffs are not rendered by default.

6 changes: 3 additions & 3 deletions docs/source/user_guide/eda/parameter_configurations.ipynb
@@ -22,8 +22,8 @@
"\n",
"| Global Parameter | Description |\n",
"| --- | --- | \n",
"| `width` | Change the plots' width in `plot(df, x)`, `plot(df, x, y)`, `plot(df, x, y, z)`, `plot_correlation()` and `plot_missing()`.\n",
"| `height` | Change the plots' height in `plot(df, x)`, `plot(df, x, y)` and `plot(df, x, y, z)`, `plot_correlation()` and `plot_missing()`.\n",
"| `width` | Change the plots' width in `plot(df, col1)`, `plot(df, col1, col2)`, `plot(df, col1, col2, col3)`, `plot_correlation()` and `plot_missing()`.\n",
"| `height` | Change the plots' height in `plot(df, col1)`, `plot(df, col1, col2)` and `plot(df, col1, col2, col3)`, `plot_correlation()` and `plot_missing()`.\n",
"| `bins` | Apply to `bins` for `Histogram`, `KDE Plot`, `Box Plot`, `Word Length`, `Line Chart`, `Spectrum`.\n",
"| `ngroups` | Apply to `bars` and `slices` for the `Bar Chart` and `Pie Chart`."
]
@@ -207,4 +207,4 @@
},
"nbformat": 4,
"nbformat_minor": 2
}
}
30 changes: 15 additions & 15 deletions docs/source/user_guide/eda/plot.ipynb
@@ -21,12 +21,12 @@
"The function `plot()` explores the distributions and statistics of the dataset. It generates a variety of visualizations and statistics which enables the user to achieve a comprehensive understanding of the column distributions and their relationships. The following describes the functionality of `plot()` for a given dataframe `df`.\n",
"\n",
"1. `plot(df)`: plots the distribution of each column and computes dataset statistics\n",
"2. `plot(df, x)`: plots the distribution of column `x` in various ways, and computes its statistics\n",
"3. `plot(df, x, y)`: generates plots depicting the relationship between columns `x` and `y`\n",
"2. `plot(df, col1)`: plots the distribution of column `col1` in various ways, and computes its statistics\n",
"3. `plot(df, col1, col2)`: generates plots depicting the relationship between columns `col1` and `col2`\n",
"\n",
"The generated plots are different for numerical, categorical and geography columns. The following table summarizes the output for the different column types.\n",
"\n",
"| `x` | `y` | Output |\n",
"| `col1` | `col2` | Output |\n",
"| --- | --- | --- |\n",
"| None | None | dataset statistics, [histogram](https://www.wikiwand.com/en/Histogram) or [bar chart](https://www.wikiwand.com/en/Bar_chart) for each column |\n",
"| Numerical | None | column statistics, histogram, [kde plot](https://www.wikiwand.com/en/Kernel_density_estimation), [qq-normal plot](https://www.wikiwand.com/en/Q%E2%80%93Q_plot), [box plot](https://www.wikiwand.com/en/Box_plot) |\n",
@@ -101,11 +101,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Understand a column with `plot(df, x)`\n",
"## Understand a column with `plot(df, col1)`\n",
"\n",
"After getting an overview of the dataset, we can thoroughly investigate a column of interest `x` using `plot(df, x)`. The output is of `plot(df, x)` is different for numerical and categorical columns.\n",
"After getting an overview of the dataset, we can thoroughly investigate a column of interest `col1` using `plot(df, col1)`. The output is of `plot(df, col1)` is different for numerical and categorical columns.\n",
"\n",
"When `x` is a numerical column, it computes column statistics, and generates a histogram, kde plot, box plot and qq-normal plot:"
"When `col1` is a numerical column, it computes column statistics, and generates a histogram, kde plot, box plot and qq-normal plot:"
]
},
{
@@ -164,11 +164,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Understand the relationship between two columns with `plot(df, x, y)`\n",
"## Understand the relationship between two columns with `plot(df, col1, col2)`\n",
"\n",
"Next, we can explore the relationship between columns `x` and `y` using `plot(df, x, y)`. The output depends on the types of the columns. \n",
"Next, we can explore the relationship between columns `col1` and `col2` using `plot(df, col1, col2)`. The output depends on the types of the columns. \n",
"\n",
"When `x` and `y` are both numerical columns, it generates a scatter plot, hexbin plot and box plot:"
"When `col1` and `col2` are both numerical columns, it generates a scatter plot, hexbin plot and box plot:"
]
},
{
@@ -189,7 +189,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"When `x` and `y` are both categorical columns, it plots a nested bar chart, stacked bar chart and heat map:"
"When `col1` and `col2` are both categorical columns, it plots a nested bar chart, stacked bar chart and heat map:"
]
},
{
@@ -210,7 +210,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"When `x` and `y` are one each of type numerical and categorical, it generates a box plot per category and a multi-line chart:"
"When `col1` and `col2` are one each of type numerical and categorical, it generates a box plot per category and a multi-line chart:"
]
},
{
@@ -230,7 +230,7 @@
},
{
"source": [
"When `x` and `y` are one each of type geopoint and categorical, or, geography and categorical, it generates a box plot per category and a multi-line chart:"
"When `col1` and `col2` are one each of type geopoint and categorical, or, geography and categorical, it generates a box plot per category and a multi-line chart:"
],
"cell_type": "markdown",
"metadata": {}
@@ -241,7 +241,7 @@
"metadata": {},
"outputs": [],
"source": [
"from dataprep.eda.dtypes import LatLong\n",
"from dataprep.eda.dtypes_v2 import LatLong\n",
"covid = load_dataset('covid19')\n",
"latlong = LatLong(\"Lat\", \"Long\") # create geopoint type using \"LatLong\" function by inputing two columns names\n",
"plot(covid, latlong, \"Country/Region\")\n",
@@ -253,7 +253,7 @@
},
{
"source": [
"When `x` and `y` are one each of type geography and numerical, it generates a box plot per category, a multi-line chart and a world map:"
"When `col1` and `col2` are one each of type geography and numerical, it generates a box plot per category, a multi-line chart and a world map:"
],
"cell_type": "markdown",
"metadata": {}
@@ -270,7 +270,7 @@
},
{
"source": [
"When `x` and `y` are one each of type geopoint and numerical, it generates a geo map:"
"When `col1` and `col2` are one each of type geopoint and numerical, it generates a geo map:"
],
"cell_type": "markdown",
"metadata": {}
16 changes: 8 additions & 8 deletions docs/source/user_guide/eda/plot_correlation.ipynb
@@ -16,12 +16,12 @@
"The function `plot_correlation()` explores the correlation between columns in various ways and using multiple correlation metrics. The following describes the functionality of `plot_correlation()` for a given dataframe `df`.\n",
"\n",
"1. `plot_correlation(df)`: plots correlation matrices (correlations between all pairs of columns)\n",
"2. `plot_correlation(df, x)`: plots the most correlated columns to column `x`\n",
"3. `plot_correlation(df, x, y)`: plots the joint distribution of column `x` and column `y` and computes a regression line\n",
"2. `plot_correlation(df, col1)`: plots the most correlated columns to column `col1`\n",
"3. `plot_correlation(df, col1, col2)`: plots the joint distribution of column `col1` and column `col2` and computes a regression line\n",
"\n",
"The following table summarizes the output plots for different settings of `x` and `y`.\n",
"The following table summarizes the output plots for different settings of `col1` and `col2`.\n",
"\n",
"| `x` | `y` | Output |\n",
"| `col1` | `col2` | Output |\n",
"| --- | --- | --- |\n",
"| None | None | *n*\\**n* correlation matrix, computed with [Person](https://www.wikiwand.com/en/Pearson_correlation_coefficien), [Spearman](https://www.wikiwand.com/en/Spearman%27s_rank_correlation_coefficient), and [KendallTau](https://www.wikiwand.com/en/Kendall_rank_correlation_coefficient) correlation coefficients | \n",
"| Numerical | None | *n*\\*1 correlation matrix, computed with Pearson, Spearman, and KendallTau correlation coefficients |\n",
@@ -86,7 +86,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Find the columns that are most correlated to column `x` with `plot_correlation(df, x)`\n",
"## Find the columns that are most correlated to column `col1` with `plot_correlation(df, col1)`\n",
"\n",
"After computing the correlation matrices, we can discover how other columns correlate to a specific column `x` using `plot_correlation(df, x)`. This function computes the correlation between column `x` and all other columns (using Pearson, Spearman, and KendallTau correlation coefficients), and sorts them in decreasing order. This enables easy determination of the columns that are most positively and negatively correlated with column `x`. The following shows an example:"
]
@@ -109,9 +109,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Explore the correlation between two columns with `plot_correlation(df, x, y)`\n",
"## Explore the correlation between two columns with `plot_correlation(df, col1, col2)`\n",
"\n",
"Furthermore, `plot_correlation(df, x, y)` provides detailed analysis of the correlation between two columns `x` and `y`. It plots the joint distribution of the columns `x` and `y` as a scatter plot, as well as a regression line. The following shows an example:"
"Furthermore, `plot_correlation(df, col1, col2)` provides detailed analysis of the correlation between two columns `col1` and `col2`. It plots the joint distribution of the columns `col1` and `col2` as a scatter plot, as well as a regression line. The following shows an example:"
]
},
{
@@ -193,4 +193,4 @@
},
"nbformat": 4,
"nbformat_minor": 4
}
}
18 changes: 9 additions & 9 deletions docs/source/user_guide/eda/plot_missing.ipynb
@@ -21,8 +21,8 @@
"The function `plot_missing()` enables thorough analysis of the missing values and their impact on the dataset. The *impact* is the change in the dataset's characteristics (e.g., the histogram of a numerical column or bar chart of a categorical column) after removing the rows with missing values from the dataset. The following describes the functionality of `plot_missing()` for a given dataframe `df`.\n",
"\n",
"1. `plot_missing(df)`: plots the amount and position of missing values, and their relationship between columns\n",
"2. `plot_missing(df, x)`: plots the impact of the missing values in column `x` on all other columns\n",
"3. `plot_missing(df, x, y)`: plots the impact of the missing values from column `x` on column `y` in various ways.\n",
"2. `plot_missing(df, col1)`: plots the impact of the missing values in column `col1` on all other columns\n",
"3. `plot_missing(df, col1, col2)`: plots the impact of the missing values from column `col1` on column `col2` in various ways.\n",
"\n",
"Next, we demonstrate the functionality of `plot_missing()`. "
]
@@ -95,9 +95,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Understand the *impact* of the missing values in column *x* with `plot_missing(df, x)`\n",
"## Understand the *impact* of the missing values in column *x* with `plot_missing(df, col1)`\n",
"\n",
"After getting an overview of the missing values with `plot_missing(df)`, we can analyze the impact of the missing values in a specific column `x` with `plot_missing(df, x)`. The *impact* of the missing values in column `x` is the change in the dataset's characteristics after removing the rows where column `x`'s values are missing. Here, we consider two types of characteristics: the histogram (for numerical columns) and the bar chart (for categorical columns). `plot_missing(df, x)` plots the histogram or bar chart (for appropriate column types) for each column before and after removing the rows that contain missing values in column `x`.\n",
"After getting an overview of the missing values with `plot_missing(df)`, we can analyze the impact of the missing values in a specific column `col1` with `plot_missing(df, col1)`. The *impact* of the missing values in column `col1` is the change in the dataset's characteristics after removing the rows where column `col1`'s values are missing. Here, we consider two types of characteristics: the histogram (for numerical columns) and the bar chart (for categorical columns). `plot_missing(df, col1)` plots the histogram or bar chart (for appropriate column types) for each column before and after removing the rows that contain missing values in column `col1`.\n",
"\n",
"The following shows an example:"
]
@@ -120,12 +120,12 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Understand the impact of the missing values in column `x` on column `y` with `plot_missing(df, x, y)`\n",
"## Understand the impact of the missing values in column `col1` on column `col2` with `plot_missing(df, col1, col2)`\n",
"\n",
"\n",
"`plot_missing(df, x)` only displays the frequency distribution of each column before and after removing the rows containing missing values in column `x`. If the user is specifically concerned with the impact of the missing values in one column `x` on another column `y`, she/he can call `plot_missing(df, x, y)`. `plot_missing(df, x, y)` plots the impact of the missing values in column `x` on column `y` in different ways depending on the type of column `y`.\n",
"`plot_missing(df, col1)` only displays the frequency distribution of each column before and after removing the rows containing missing values in column `col1`. If the user is specifically concerned with the impact of the missing values in one column `col1` on another column `col2`, she/he can call `plot_missing(df, col1, col2)`. `plot_missing(df, col1, col2)` plots the impact of the missing values in column `col1` on column `col2` in different ways depending on the type of column `col2`.\n",
"\n",
"If `y` is a numerical column, `plot_missing(df, x, y)` shows the impact as a histogram, pdf, cdf, and box plot. The following shows an example:"
"If `col2` is a numerical column, `plot_missing(df, col1, col2)` shows the impact as a histogram, pdf, cdf, and box plot. The following shows an example:"
]
},
{
@@ -146,7 +146,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"If `y` is a categorical column, `plot_missing(df, x, y)` shows the impact as a bar chart. The following shows an example:"
"If `y` is a categorical column, `plot_missing(df, col1, col2)` shows the impact as a bar chart. The following shows an example:"
]
},
{
@@ -228,4 +228,4 @@
},
"nbformat": 4,
"nbformat_minor": 4
}
}
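
The rename recorded in this commit (`x`, `y`, `z` to `col1`, `col2`, `col3` inside calls such as `plot(df, x, y)`) can be sketched as a small text transformation. This is only an illustration and not necessarily how the author made the change: the helper name `rename_placeholders` and the regular expression are assumptions introduced here, and the sketch only rewrites placeholder arguments inside `plot`, `plot_correlation`, and `plot_missing` calls.

```python
import re

# Mapping taken from the commit title: x, y, z -> col1, col2, col3.
RENAMES = {"x": "col1", "y": "col2", "z": "col3"}

# Hypothetical pattern: matches calls such as plot(df, x),
# plot_correlation(df, x, y) or plot_missing(df, x, y, z).
CALL = re.compile(r"\b(plot(?:_correlation|_missing)?)\(df((?:,\s*[xyz])+)\)")


def rename_placeholders(text: str) -> str:
    """Rewrite x/y/z placeholder arguments in plot-style calls to col1/col2/col3."""

    def fix_call(match: re.Match) -> str:
        func, args = match.group(1), match.group(2)
        # Replace each single-letter argument, keeping the original separators.
        new_args = re.sub(r"[xyz]", lambda m: RENAMES[m.group(0)], args)
        return f"{func}(df{new_args})"

    return CALL.sub(fix_call, text)


print(rename_placeholders("`plot(df, x, y)` and `plot_missing(df, x)`"))
```

Prose mentions such as "column `x`" are deliberately out of scope for this pattern, so a textual rename like this would still need a manual pass over the surrounding sentences.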
