DOCS-#3953: Add docs and notebook examples on running Modin with OmniSci #4001

Merged
merged 43 commits into from
Mar 3, 2022
2ce1c27
FEAT-#3953: Add examples for OmniSci in jupyter notebook
Rubtsowa Feb 18, 2022
0eeec92
change img & change example
Rubtsowa Feb 21, 2022
dfd2faf
change doc & example
Rubtsowa Feb 21, 2022
60eea01
change ci-notebooks
Rubtsowa Feb 21, 2022
a8e921a
change ci-notebooks
Rubtsowa Feb 21, 2022
8dd0c1a
change ci-notebooks
Rubtsowa Feb 21, 2022
aea5b25
change ci-notebooks
Rubtsowa Feb 21, 2022
ec0a62d
change ci-notebooks
Rubtsowa Feb 21, 2022
21408fb
change ci-notebooks
Rubtsowa Feb 21, 2022
04b81e0
change ci-notebooks
Rubtsowa Feb 21, 2022
527dd3f
change ci-notebooks
Rubtsowa Feb 22, 2022
74316ae
change ci-notebooks
Rubtsowa Feb 22, 2022
8e5f116
change ci-notebooks
Rubtsowa Feb 22, 2022
194ab19
change ci-notebooks
Rubtsowa Feb 22, 2022
728d906
change ci-notebooks
Rubtsowa Feb 22, 2022
17dc1ec
change ci-notebooks
Rubtsowa Feb 22, 2022
6cb66b0
formated tests
Rubtsowa Feb 22, 2022
7fc4e33
change ci-notebooks
Rubtsowa Feb 22, 2022
ed12dfd
change ci-notebooks
Rubtsowa Feb 22, 2022
c51ca31
add ci-env.yml & change ci-notebooks
Rubtsowa Feb 22, 2022
dcdd3d8
change ci-notebooks
Rubtsowa Feb 22, 2022
10843e6
change ci-notebooks
Rubtsowa Feb 22, 2022
b36b21f
change test
Rubtsowa Feb 23, 2022
8e501da
change test
Rubtsowa Feb 23, 2022
b5a8953
change exercise & tests & env & add new png
Rubtsowa Feb 25, 2022
7380537
change example
Rubtsowa Feb 25, 2022
5f99d58
delete table & change exercises
Rubtsowa Feb 25, 2022
5234d14
change exercise
Rubtsowa Feb 25, 2022
0a01344
change tests & ci-tests
Rubtsowa Feb 25, 2022
55a378e
change ci-tests
Rubtsowa Feb 25, 2022
0ac3f32
change ci-tests
Rubtsowa Feb 25, 2022
4ab029a
change ci-tests
Rubtsowa Feb 25, 2022
87e895f
change ci-tests
Rubtsowa Feb 25, 2022
ba4d6a2
change ci-tests
Rubtsowa Feb 25, 2022
aa95b7f
change examples & README & ci-test
Rubtsowa Mar 1, 2022
a8504fd
change README & ci-tests & examples
Rubtsowa Mar 1, 2022
6a5651c
change examples
Rubtsowa Mar 2, 2022
ff65a07
changed link
Rubtsowa Mar 2, 2022
b3e7311
delete sum
Rubtsowa Mar 3, 2022
a6ad995
change example
Rubtsowa Mar 3, 2022
f9bd462
change test
Rubtsowa Mar 3, 2022
d85c781
Update examples/tutorial/jupyter/execution/omnisci_on_native/local/ex…
YarShev Mar 3, 2022
59217f6
Update examples/tutorial/jupyter/execution/omnisci_on_native/local/ex…
YarShev Mar 3, 2022
31 changes: 27 additions & 4 deletions .github/workflows/ci-notebooks.yml
Expand Up @@ -9,11 +9,14 @@ on:
- setup.py
jobs:
test-tutorial-notebooks:
defaults:
run:
shell: bash -l {0}
name: test tutorial notebooks
runs-on: ubuntu-latest
strategy:
matrix:
execution: [pandas_on_ray, pandas_on_dask]
execution: [pandas_on_ray, pandas_on_dask, omnisci_on_native]
steps:
- uses: actions/checkout@v2
with:
Expand All @@ -22,20 +25,40 @@ jobs:
with:
python-version: "3.8.x"
architecture: "x64"
if: matrix.execution != 'omnisci_on_native'
- uses: conda-incubator/setup-miniconda@v2
with:
activate-environment: modin_on_omnisci
environment-file: requirements/env_omnisci.yml
python-version: 3.8
channel-priority: strict
if: matrix.execution == 'omnisci_on_native'
- name: Cache datasets
uses: actions/cache@v2
with:
path: taxi.csv
# update cache only if notebooks require it to be changed
key: ${{ hashFiles('examples/tutorial/jupyter/**') }}
# replace modin with . in the tutorial requirements file since we need
# Modin built from sources
# replace modin with . in the tutorial requirements file for `pandas_on_ray` and
# `pandas_on_dask` since we need Modin built from sources
- run: sed -i 's/modin/./g' examples/tutorial/jupyter/execution/${{ matrix.execution }}/requirements.txt
# install dependencies required for notebooks execution
if: matrix.execution != 'omnisci_on_native'
# install dependencies required for notebooks execution for `pandas_on_ray` and `pandas_on_dask`
- run: pip install -r examples/tutorial/jupyter/execution/${{ matrix.execution }}/requirements.txt
if: matrix.execution != 'omnisci_on_native'
# Build Modin from sources for `omnisci_on_native`
- run: pip install .
if: matrix.execution == 'omnisci_on_native'
# install test dependencies
- run: pip install pytest pytest-cov black flake8 flake8-print flake8-no-implicit-concat
if: matrix.execution != 'omnisci_on_native'
- run: conda install black flake8 flake8-print jupyter nbformat nbconvert -c conda-forge
if: matrix.execution == 'omnisci_on_native'
- run: pip list
- run: |
conda info
conda list
if: matrix.execution == 'omnisci_on_native'
- run: |
black --check --diff examples/tutorial/jupyter/execution/${{ matrix.execution }}/test/test_notebooks.py
black --check --diff examples/tutorial/jupyter/execution/test/utils.py
Expand Down
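Note on the `sed` step in this workflow: `s/modin/./g` replaces every literal occurrence of `modin` in the requirements file with `.`, so the following `pip install -r` picks up Modin from the checked-out sources. A minimal Python sketch of the same substitution is below; the function name `use_local_modin` and the sample file contents are hypothetical and not part of this PR.

```python
def use_local_modin(requirements_text: str) -> str:
    """Mimic `sed 's/modin/./g'`: swap every literal "modin" for "."."""
    return requirements_text.replace("modin", ".")


# Hypothetical requirements content, for illustration only.
sample = "jupyterlab\nmodin[ray]\n"
print(use_local_modin(sample), end="")
```

Because the substitution is global, any other requirement whose name contains `modin` would be rewritten as well, which is worth keeping in mind if such a package is ever added to these files.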
6 changes: 3 additions & 3 deletions docs/getting_started/examples.rst
Expand Up @@ -19,9 +19,9 @@ Tutorials

The following tutorials cover the basic usage of Modin. `Here <https://www.youtube.com/watch?v=NglkafEmbhE>`_ is a one hour video tutorial that walks through these basic exercises.

- Exercise 1: Introduction to Modin [`Source PandasOnRay <https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/execution/pandas_on_ray/local/exercise_1.ipynb>`_, `Source PandasOnDask <https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/execution/pandas_on_dask/local/exercise_1.ipynb>`_]
- Exercise 2: Speed Improvements with Modin [`Source PandasOnRay <https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/execution/pandas_on_ray/local/exercise_2.ipynb>`_, `Source PandasOnDask <https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/execution/pandas_on_dask/local/exercise_2.ipynb>`_]
- Exercise 3: Defaulting to pandas with Modin [`Source PandasOnRay <https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/execution/pandas_on_ray/local/exercise_3.ipynb>`_, `Source PandasOnDask <https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/execution/pandas_on_dask/local/exercise_3.ipynb>`_]
- Exercise 1: Introduction to Modin [`Source PandasOnRay <https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/execution/pandas_on_ray/local/exercise_1.ipynb>`_, `Source PandasOnDask <https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/execution/pandas_on_dask/local/exercise_1.ipynb>`_, `Source OmnisciOnNative <https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/execution/omnisci_on_native/local/exercise_1.ipynb>`_]
- Exercise 2: Speed Improvements with Modin [`Source PandasOnRay <https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/execution/pandas_on_ray/local/exercise_2.ipynb>`_, `Source PandasOnDask <https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/execution/pandas_on_dask/local/exercise_2.ipynb>`_, `Source OmnisciOnNative <https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/execution/omnisci_on_native/local/exercise_2.ipynb>`_]
- Exercise 3: Defaulting to pandas with Modin [`Source PandasOnRay <https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/execution/pandas_on_ray/local/exercise_3.ipynb>`_, `Source PandasOnDask <https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/execution/pandas_on_dask/local/exercise_3.ipynb>`_, `Source OmnisciOnNative <https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/execution/omnisci_on_native/local/exercise_3.ipynb>`_]

The following tutorials cover more advanced features in Modin:

Expand Down
6 changes: 3 additions & 3 deletions docs/usage_guide/examples/index.rst
Expand Up @@ -9,9 +9,9 @@ Tutorials

The following tutorials cover the basic usage of Modin. `Here <https://www.youtube.com/watch?v=NglkafEmbhE>`_ is a one hour video tutorial that walks through these basic exercises.

- Exercise 1: Introduction to Modin [`Source PandasOnRay <https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/execution/pandas_on_ray/local/exercise_1.ipynb>`_, `Source PandasOnDask <https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/execution/pandas_on_dask/local/exercise_1.ipynb>`_]
- Exercise 2: Speed Improvements with Modin [`Source PandasOnRay <https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/execution/pandas_on_ray/local/exercise_2.ipynb>`_, `Source PandasOnDask <https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/execution/pandas_on_dask/local/exercise_2.ipynb>`_]
- Exercise 3: Defaulting to pandas with Modin [`Source PandasOnRay <https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/execution/pandas_on_ray/local/exercise_3.ipynb>`_, `Source PandasOnDask <https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/execution/pandas_on_dask/local/exercise_3.ipynb>`_]
- Exercise 1: Introduction to Modin [`Source PandasOnRay <https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/execution/pandas_on_ray/local/exercise_1.ipynb>`_, `Source PandasOnDask <https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/execution/pandas_on_dask/local/exercise_1.ipynb>`_, `Source OmnisciOnNative <https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/execution/omnisci_on_native/local/exercise_1.ipynb>`_]
- Exercise 2: Speed Improvements with Modin [`Source PandasOnRay <https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/execution/pandas_on_ray/local/exercise_2.ipynb>`_, `Source PandasOnDask <https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/execution/pandas_on_dask/local/exercise_2.ipynb>`_, `Source OmnisciOnNative <https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/execution/omnisci_on_native/local/exercise_2.ipynb>`_]
- Exercise 3: Defaulting to pandas with Modin [`Source PandasOnRay <https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/execution/pandas_on_ray/local/exercise_3.ipynb>`_, `Source PandasOnDask <https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/execution/pandas_on_dask/local/exercise_3.ipynb>`_, `Source OmnisciOnNative <https://github.com/modin-project/modin/blob/master/examples/tutorial/jupyter/execution/omnisci_on_native/local/exercise_3.ipynb>`_]

The following tutorials cover more advanced features in Modin:

Expand Down
20 changes: 19 additions & 1 deletion examples/tutorial/jupyter/README.md
Expand Up @@ -4,10 +4,11 @@ Currently we provide tutorial notebooks for the following execution backends:

- [PandasOnRay](https://modin.readthedocs.io/en/latest/development/using_pandas_on_ray.html)
- [PandasOnDask](https://modin.readthedocs.io/en/latest/development/using_pandas_on_dask.html)
- [OmnisciOnNative](https://modin.readthedocs.io/en/latest/development/using_omnisci.html)

## Creating a development environment

To get required dependencies for these Jupyter Notebooks
To get the required dependencies for the `PandasOnRay` and `PandasOnDask` Jupyter Notebooks,
create a development environment with `pip`
using the `requirements.txt` file located in the respective directory:

Expand All @@ -26,6 +27,23 @@ to install dependencies needed to run notebooks with Modin on `PandasOnDask` exe
**Note:** Sometimes pip installs every version of a package. If you encounter this issue,
install each package listed in the `requirements.txt` file individually with `pip install <package>`.

To get the required dependencies for the `OmnisciOnNative` Jupyter Notebooks,
create a development environment with `conda`
using the `jupyter_omnisci_env.yml` file located in the respective directory:

```bash
conda config --set channel_priority strict
conda env create -f execution/omnisci_on_native/jupyter_omnisci_env.yml
```

After the environment is created, activate it:

```bash
conda activate jupyter_modin_on_omnisci
```

**Note:** The `Omnisci` engine is currently available on Linux only.

## Run Jupyter Notebooks

A Jupyter Notebook server can be run from the current directory as follows:
Expand Down
@@ -0,0 +1,6 @@
name: jupyter_modin_on_omnisci
channels:
- conda-forge
dependencies:
- modin-omnisci
- jupyter
@@ -0,0 +1,234 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![LOGO](../../../img/MODIN_ver2_hrz.png)\n",
"\n",
"<center><h2>Scale your pandas workflows by changing one line of code</h2>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exercise 1: How to use Modin\n",
"\n",
"**GOAL**: Learn how to import Modin to accelerate and scale pandas workflows."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Modin is a drop-in replacement for pandas that distributes the computation \n",
"across all of the cores in your machine or in a cluster.\n",
"In practical terms, this means that you can continue using the same pandas scripts\n",
"as before and expect the behavior and results to be the same. The only thing that needs\n",
"to change is the import statement. Normally, you would change:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"```\n",
"\n",
"to:\n",
"\n",
"```python\n",
"import modin.pandas as pd\n",
"```\n",
"\n",
"Changing this line of code will allow you to use all of the cores in your machine to do computation on your data. One of the major performance bottlenecks of pandas is that it only uses a single core for any given computation. Modin exposes an API that is identical to pandas, allowing you to continue interacting with your data as you would with pandas. There are no additional commands required to use Modin locally. Partitioning, scheduling, data transfer, and other related concerns are all handled by Modin under the hood."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p style=\"text-align:left;\">\n",
" <h1>pandas on a multicore laptop\n",
" <span style=\"float:right;\">\n",
" Modin on a multicore laptop\n",
" </span>\n",
"\n",
"<div>\n",
"<img align=\"left\" src=\"../../../img/pandas_multicore.png\"><img src=\"../../../img/modin_multicore.png\">\n",
"</div>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Concept for exercise: DataFrame constructor\n",
"\n",
"Often when playing around in pandas, it is useful to create a DataFrame with the constructor. That is where we will start.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"frame_data = np.random.randint(0, 100, size=(2**10, 2**5))\n",
"df = pd.DataFrame(frame_data)\n",
"```\n",
"\n",
"When creating a dataframe from a non-distributed object, it will take extra time to partition the data for Modin. While this is happening, you will see this message:\n",
"\n",
"```\n",
"UserWarning: Distributing <class 'numpy.ndarray'> object. This may take some time.\n",
"```\n",
"\n",
"Modin uses Ray as an execution engine by default. Since this notebook is about OmniSci, let's run the examples on the OmniSci engine. To do this, we need to activate OmniSci either via the Modin config or via a Modin environment variable. See more in the [OmniSci usage](https://github.com/modin-project/modin/blob/master/docs/development/using_omnisci.rst) section.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Note: Do not change this code!\n",
"import numpy as np\n",
"import pandas\n",
"import sys\n",
"import modin"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import modin.config as cfg\n",
"cfg.StorageFormat.put('omnisci')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pandas.__version__"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"modin.__version__"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Implement your answer here. You are also free to play with the size\n",
"# and shape of the DataFrame, but beware of exceeding your memory!\n",
"\n",
"import pandas as pd\n",
"\n",
"frame_data = np.random.randint(0, 100, size=(2**10, 2**5))\n",
"df = pd.DataFrame(frame_data)\n",
"\n",
"# ***** Do not change the code below! It verifies that \n",
"# ***** the exercise has been done correctly. *****\n",
"\n",
"try:\n",
" assert df is not None\n",
" assert frame_data is not None\n",
" assert isinstance(frame_data, np.ndarray)\n",
"except:\n",
" raise AssertionError(\"Don't change too much of the original code!\")\n",
"assert \"modin.pandas\" in sys.modules, \"Not quite correct. Remember the single line of code change (See above)\"\n",
"\n",
"import modin.pandas\n",
"assert pd == modin.pandas, \"Remember the single line of code change (See above)\"\n",
"assert hasattr(df, \"_query_compiler\"), \"Make sure that `df` is a modin.pandas DataFrame.\"\n",
"\n",
"print(\"Success! You only need to change one line of code!\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have created a toy example for playing around with the DataFrame, let's print it out in different ways."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Concept for Exercise: Data Interaction and Printing\n",
"\n",
"When interacting with data, it is very important to look at different parts of the data (e.g. `df.head()`). Here we will show that you can print the modin.pandas DataFrame in the same ways you would a pandas DataFrame."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# With non-string column labels, some backend logic may try to insert a column\n",
"# with a string name into the frame, so we call add_prefix() to make every label a string.\n",
"df = df.add_prefix(\"col\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Print the first 10 lines.\n",
"df.head(10)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df.count()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Please move on to [Exercise 2](./exercise_2.ipynb) when you are ready**"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.10"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
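The notebook above activates OmniSci through `cfg.StorageFormat.put('omnisci')`, and its text also mentions an environment-variable route. A minimal sketch of that alternative follows. It assumes the variable names `MODIN_ENGINE` and `MODIN_STORAGE_FORMAT` and the values `native` and `omnisci` as described in Modin's OmniSci usage document; the helper `request_omnisci` is hypothetical and not part of this PR. To be safe, set the variables before `modin.pandas` is imported so that Modin sees them when it first reads its configuration.

```python
import os
from typing import MutableMapping


def request_omnisci(environ: MutableMapping[str, str] = os.environ) -> None:
    """Ask Modin for the OmniSci storage format on the native engine.

    Assumption: Modin reads MODIN_ENGINE and MODIN_STORAGE_FORMAT from the
    environment; check the OmniSci usage docs for your Modin version.
    """
    environ["MODIN_ENGINE"] = "native"
    environ["MODIN_STORAGE_FORMAT"] = "omnisci"


request_omnisci()

# In an environment with modin-omnisci installed, import Modin afterwards:
# import modin.pandas as pd
```

Passing a plain dictionary instead of `os.environ` makes it possible to inspect the settings without touching the real process environment.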