FIX-#7272: Remove HDK engine #7275

Merged
merged 2 commits into from
May 16, 2024
26 changes: 10 additions & 16 deletions .github/workflows/ci-notebooks.yml
@@ -7,7 +7,6 @@ on:
- .github/workflows/ci-notebooks.yml
- setup.cfg
- setup.py
- requirements/env_hdk.yml
- requirements/env_unidist_linux.yml
concurrency:
# Cancel other jobs in the same branch. We don't care whether CI passes
@@ -25,16 +24,11 @@ jobs:
runs-on: ubuntu-latest
strategy:
matrix:
execution: [pandas_on_ray, pandas_on_dask, pandas_on_unidist, hdk_on_native]
execution: [pandas_on_ray, pandas_on_dask, pandas_on_unidist]
steps:
- uses: actions/checkout@v4
- uses: ./.github/actions/python-only
if: matrix.execution != 'hdk_on_native' && matrix.execution != 'pandas_on_unidist'
- uses: ./.github/actions/mamba-env
with:
environment-file: requirements/env_hdk.yml
activate-environment: modin_on_hdk
if: matrix.execution == 'hdk_on_native'
if: matrix.execution != 'pandas_on_unidist'
- uses: ./.github/actions/mamba-env
with:
environment-file: requirements/env_unidist_linux.yml
@@ -49,29 +43,29 @@ jobs:
# replace modin with . in the tutorial requirements file for `pandas_on_ray` and
# `pandas_on_dask` since we need Modin built from sources
- run: sed -i 's/modin/./g' examples/tutorial/jupyter/execution/${{ matrix.execution }}/requirements.txt
if: matrix.execution != 'hdk_on_native' && matrix.execution != 'pandas_on_unidist'
if: matrix.execution != 'pandas_on_unidist'
# install dependencies required for notebooks execution for `pandas_on_ray` and `pandas_on_dask`
# Override modin-spreadsheet install for now
- run: |
pip install -r examples/tutorial/jupyter/execution/${{ matrix.execution }}/requirements.txt
pip install git+https://github.com/modin-project/modin-spreadsheet.git@49ffd89f683f54c311867d602c55443fb11bf2a5
if: matrix.execution != 'hdk_on_native' && matrix.execution != 'pandas_on_unidist'
# Build Modin from sources for `hdk_on_native` and `pandas_on_unidist`
if: matrix.execution != 'pandas_on_unidist'
# Build Modin from sources for `pandas_on_unidist`
- run: pip install -e .
if: matrix.execution == 'hdk_on_native' || matrix.execution == 'pandas_on_unidist'
if: matrix.execution == 'pandas_on_unidist'
# install test dependencies
# NOTE: If you are changing the set of packages installed here, make sure that
# the dev requirements match them.
- run: pip install pytest pytest-cov black flake8 flake8-print flake8-no-implicit-concat
if: matrix.execution != 'hdk_on_native' && matrix.execution != 'pandas_on_unidist'
if: matrix.execution != 'pandas_on_unidist'
- run: pip install flake8-print jupyter nbformat nbconvert
if: matrix.execution == 'hdk_on_native' || matrix.execution == 'pandas_on_unidist'
if: matrix.execution == 'pandas_on_unidist'
- run: pip list
if: matrix.execution != 'hdk_on_native' && matrix.execution != 'pandas_on_unidist'
if: matrix.execution != 'pandas_on_unidist'
- run: |
conda info
conda list
if: matrix.execution == 'hdk_on_native' || matrix.execution == 'pandas_on_unidist'
if: matrix.execution == 'pandas_on_unidist'
# setup kernel configuration for `pandas_on_unidist` execution with mpi backend
- run: python examples/tutorial/jupyter/execution/${{ matrix.execution }}/setup_kernel.py
if: matrix.execution == 'pandas_on_unidist'
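
Review note: the shortened `if:` guards above should behave exactly like the old ones once `hdk_on_native` is gone from the matrix. The following is a minimal sketch in Python, not part of the PR; the function names are hypothetical, and only the matrix values and the conditions themselves are taken from the diff. It checks that the old and new guards agree for every remaining matrix entry.

```python
# Matrix values after this PR (copied from the diff above).
NEW_MATRIX = ["pandas_on_ray", "pandas_on_dask", "pandas_on_unidist"]


def old_pip_guard(execution: str) -> bool:
    # Old: matrix.execution != 'hdk_on_native' && matrix.execution != 'pandas_on_unidist'
    return execution != "hdk_on_native" and execution != "pandas_on_unidist"


def new_pip_guard(execution: str) -> bool:
    # New: matrix.execution != 'pandas_on_unidist'
    return execution != "pandas_on_unidist"


def old_conda_guard(execution: str) -> bool:
    # Old: matrix.execution == 'hdk_on_native' || matrix.execution == 'pandas_on_unidist'
    return execution == "hdk_on_native" or execution == "pandas_on_unidist"


def new_conda_guard(execution: str) -> bool:
    # New: matrix.execution == 'pandas_on_unidist'
    return execution == "pandas_on_unidist"


def guards_agree(matrix: list[str]) -> bool:
    """Return True if old and new guards give the same answer for each entry."""
    return all(
        old_pip_guard(e) == new_pip_guard(e)
        and old_conda_guard(e) == new_conda_guard(e)
        for e in matrix
    )


if __name__ == "__main__":
    print(guards_agree(NEW_MATRIX))
```

The guards only diverge for `hdk_on_native` itself, so the simplification is safe as long as that value is never reintroduced into the matrix.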
13 changes: 0 additions & 13 deletions .github/workflows/ci-required.yml
@@ -90,19 +90,6 @@ jobs:
modin/experimental/pandas/__init__.py
- run: python scripts/doc_checker.py modin/core/storage_formats/base
- run: python scripts/doc_checker.py modin/core/storage_formats/pandas
- run: |
python scripts/doc_checker.py \
modin/experimental/core/execution/native/implementations/hdk_on_native/dataframe \
modin/experimental/core/execution/native/implementations/hdk_on_native/io \
modin/experimental/core/execution/native/implementations/hdk_on_native/partitioning \
modin/experimental/core/execution/native/implementations/hdk_on_native/calcite_algebra.py \
modin/experimental/core/execution/native/implementations/hdk_on_native/calcite_builder.py \
modin/experimental/core/execution/native/implementations/hdk_on_native/calcite_serializer.py \
modin/experimental/core/execution/native/implementations/hdk_on_native/df_algebra.py \
modin/experimental/core/execution/native/implementations/hdk_on_native/expr.py \
modin/experimental/core/execution/native/implementations/hdk_on_native/hdk_worker.py \
- run: python scripts/doc_checker.py modin/experimental/core/storage_formats/hdk
- run: python scripts/doc_checker.py modin/experimental/core/execution/native/implementations/hdk_on_native/interchange/dataframe_protocol
- run: python scripts/doc_checker.py modin/experimental/batch/pipeline.py
- run: python scripts/doc_checker.py modin/logging

73 changes: 1 additions & 72 deletions .github/workflows/ci.yml
@@ -150,62 +150,6 @@ jobs:
runner: python -m pytest --execution=${{ matrix.execution }}
- uses: ./.github/actions/upload-coverage

test-hdk:
needs: [lint-flake8]
runs-on: ubuntu-latest
defaults:
run:
shell: bash -l {0}
env:
MODIN_EXPERIMENTAL: "True"
MODIN_ENGINE: "native"
MODIN_STORAGE_FORMAT: "hdk"
name: Test HDK storage format, Python 3.9
services:
moto:
image: motoserver/moto
ports:
- 5000:5000
env:
AWS_ACCESS_KEY_ID: foobar_key
AWS_SECRET_ACCESS_KEY: foobar_secret
steps:
- uses: actions/checkout@v4
- uses: ./.github/actions/mamba-env
with:
environment-file: requirements/env_hdk.yml
activate-environment: modin_on_hdk
- name: Install HDF5
run: sudo apt update && sudo apt install -y libhdf5-dev
- run: python -m pytest modin/tests/core/storage_formats/hdk/test_internals.py
- run: python -m pytest modin/tests/experimental/hdk_on_native/test_init.py
- run: python -m pytest modin/tests/experimental/hdk_on_native/test_dataframe.py
- run: python -m pytest modin/tests/experimental/hdk_on_native/test_utils.py
- run: python -m pytest modin/tests/pandas/test_io.py --verbose
- run: python -m pytest modin/tests/interchange/dataframe_protocol/test_general.py
- run: python -m pytest modin/tests/interchange/dataframe_protocol/hdk
- run: python -m pytest modin/tests/experimental/test_sql.py
- run: python -m pytest modin/tests/pandas/test_concat.py
- run: python -m pytest modin/tests/pandas/dataframe/test_binary.py
- run: python -m pytest modin/tests/pandas/dataframe/test_reduce.py
- run: python -m pytest modin/tests/pandas/dataframe/test_join_sort.py
- run: python -m pytest modin/tests/pandas/test_general.py
- run: python -m pytest modin/tests/pandas/dataframe/test_indexing.py
- run: python -m pytest modin/tests/pandas/test_series.py
- run: python -m pytest modin/tests/pandas/dataframe/test_map_metadata.py
- run: python -m pytest modin/tests/pandas/dataframe/test_window.py
- run: python -m pytest modin/tests/pandas/dataframe/test_default.py
- run: python examples/docker/modin-hdk/census-hdk.py examples/data/census_1k.csv -no-ml
- run: python examples/docker/modin-hdk/nyc-taxi-hdk.py examples/data/nyc-taxi_1k.csv
- run: |
python examples/docker/modin-hdk/plasticc-hdk.py \
examples/data/plasticc_training_set_1k.csv \
examples/data/plasticc_test_set_1k.csv \
examples/data/plasticc_training_set_metadata_1k.csv \
examples/data/plasticc_test_set_metadata_1k.csv \
-no-ml
- uses: ./.github/actions/upload-coverage

test-asv-benchmarks:
if: github.event_name == 'pull_request'
needs: [lint-flake8]
@@ -249,18 +193,6 @@ jobs:
# check pure pandas
MODIN_ASV_USE_IMPL=pandas asv run --quick --dry-run --python=same --strict --show-stderr --launch-method=spawn \
-b ^benchmarks -b ^io | tee benchmarks.log

# TODO: Remove manual environment creation after fix https://github.com/airspeed-velocity/asv/issues/1310
conda deactivate
mamba env create -f ../requirements/env_hdk.yml
conda activate modin_on_hdk
pip install asv==0.5.1
pip install ..

# check Modin on HDK
MODIN_ENGINE=native MODIN_STORAGE_FORMAT=hdk MODIN_EXPERIMENTAL=true asv run --quick --dry-run --python=same --strict --show-stderr \
--launch-method=forkserver --python=same --config asv.conf.hdk.json \
-b ^hdk | tee benchmarks.log
else
echo "Benchmarks did not run, no changes detected"
fi
@@ -374,7 +306,6 @@ jobs:
- run: |
mpiexec -n 1 -genv AWS_ACCESS_KEY_ID foobar_key -genv AWS_SECRET_ACCESS_KEY foobar_secret \
python -m pytest modin/tests/experimental/test_io_exp.py
- run: mpiexec -n 1 python -m pytest modin/tests/experimental/test_sql.py
- run: mpiexec -n 1 python -m pytest modin/tests/interchange/dataframe_protocol/test_general.py
- run: mpiexec -n 1 python -m pytest modin/tests/interchange/dataframe_protocol/pandas/test_protocol.py
- run: |
@@ -495,8 +426,6 @@ jobs:
if: matrix.engine == 'python' || matrix.test_task == 'group_4'
- run: python -m pytest modin/tests/experimental/test_io_exp.py
if: matrix.engine == 'python' || matrix.test_task == 'group_4'
- run: python -m pytest modin/tests/experimental/test_sql.py
if: matrix.os == 'ubuntu' && (matrix.engine == 'python' || matrix.test_task == 'group_4')
- run: python -m pytest modin/tests/interchange/dataframe_protocol/test_general.py
if: matrix.engine == 'python' || matrix.test_task == 'group_4'
- run: python -m pytest modin/tests/interchange/dataframe_protocol/pandas/test_protocol.py
@@ -703,7 +632,7 @@ jobs:
- run: python -m pytest modin/tests/experimental/spreadsheet/test_general.py

merge-coverage-artifacts:
needs: [test-internals, test-api-and-no-engine, test-defaults, test-hdk, test-all-unidist, test-all, test-experimental, test-sanity]
needs: [test-internals, test-api-and-no-engine, test-defaults, test-all-unidist, test-all, test-experimental, test-sanity]
if: always() # we need to run it regardless of some job being skipped, like in PR
runs-on: ubuntu-latest
defaults:
2 changes: 0 additions & 2 deletions .github/workflows/codeql/codeql-config.yml
@@ -2,5 +2,3 @@ name: "Modin CodeQL config"

paths:
- modin/**
paths-ignore:
- modin/tests/experimental/hdk_on_native/** # TODO: fix unhashable list error, see #5227
5 changes: 0 additions & 5 deletions CODEOWNERS
@@ -1,8 +1,3 @@
# These owners will be the default owners for everything in
# the repo unless a later match takes precedence,
* @modin-project/modin-core @devin-petersohn @mvashishtha @RehanSD @YarShev @vnlitvinov @anmyachev @dchigarev

# These owners will review everything in the HDK engine component
# of Modin.
/modin/experimental/core/storage_formats/hdk/** @modin-project/modin-hdk @aregm @gshimansky @ienkovich @Garra1980 @YarShev @vnlitvinov @anmyachev @dchigarev @AndreyPavlenko
/modin/experimental/core/execution/native/implementations/hdk_on_native/** @modin-project/modin-hdk @aregm @gshimansky @ienkovich @Garra1980 @YarShev @vnlitvinov @anmyachev @dchigarev @AndreyPavlenko
16 changes: 4 additions & 12 deletions README.md
@@ -85,8 +85,8 @@ Modin automatically detects which engine(s) you have installed and uses that for
#### From conda-forge

Installing from [conda forge](https://github.com/conda-forge/modin-feedstock) using `modin-all`
will install Modin and four engines: [Ray](https://github.com/ray-project/ray), [Dask](https://github.com/dask/dask),
[MPI through unidist](https://github.com/modin-project/unidist) and [HDK](https://github.com/intel-ai/hdk).
will install Modin and three engines: [Ray](https://github.com/ray-project/ray), [Dask](https://github.com/dask/dask) and
[MPI through unidist](https://github.com/modin-project/unidist).

```bash
conda install -c conda-forge modin-all
@@ -98,7 +98,6 @@ Each engine can also be installed individually (and also as a combination of sev
conda install -c conda-forge modin-ray # Install Modin dependencies and Ray.
conda install -c conda-forge modin-dask # Install Modin dependencies and Dask.
conda install -c conda-forge modin-mpi # Install Modin dependencies and MPI through unidist.
conda install -c conda-forge modin-hdk # Install Modin dependencies and HDK.
```

**Note:** Since Modin 0.30.0 we use a reduced set of Ray dependencies: `ray-core` instead of `ray-default`.
@@ -118,13 +117,13 @@ conda install -n base conda-libmamba-solver
and then use it during istallation either like:

```bash
conda install -c conda-forge modin-ray modin-hdk --experimental-solver=libmamba
conda install -c conda-forge modin-ray --experimental-solver=libmamba
```

or starting from conda 22.11 and libmamba solver 22.12 versions:

```bash
conda install -c conda-forge modin-ray modin-hdk --solver=libmamba
conda install -c conda-forge modin-ray --solver=libmamba
```

#### Choosing a Compute Engine
@@ -158,8 +157,6 @@ modin_cfg.Engine.put('unidist') # Modin will use Unidist
unidist_cfg.Backend.put('mpi') # Unidist will use MPI backend
```

Check [this Modin docs section](https://modin.readthedocs.io/en/latest/development/using_hdk.html) for HDK engine setup.

_Note: You should not change the engine after your first operation with Modin as it will result in undefined behavior._

#### Which engine should I use?
@@ -168,11 +165,6 @@ On Linux, MacOS, and Windows you can install and use either Ray, Dask or MPI thr
to use either of these engines as Modin abstracts away all of the complexity, so feel
free to pick either!

On Linux you also can choose [HDK](https://modin.readthedocs.io/en/latest/development/using_hdk.html), which is an experimental
engine based on [HDK](https://github.com/intel-ai/hdk) and included in the
[Intel® Distribution of Modin](https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/distribution-of-modin.html),
which is a part of [Intel® oneAPI AI Analytics Toolkit (AI Kit)](https://www.intel.com/content/www/us/en/developer/tools/oneapi/ai-analytics-toolkit.html).

### Pandas API Coverage

<p align="center">
60 changes: 0 additions & 60 deletions asv_bench/asv.conf.hdk.json

This file was deleted.

3 changes: 0 additions & 3 deletions asv_bench/benchmarks/benchmarks.py
@@ -36,7 +36,6 @@
random_columns,
random_string,
translator_groupby_ngroups,
trigger_import,
)


@@ -675,7 +674,6 @@ class TimeIndexing:

def setup(self, shape, indexer_type):
self.df = generate_dataframe("int", *shape, RAND_LOW, RAND_HIGH)
trigger_import(self.df)

self.indexer = self.indexer_getters[indexer_type](self.df)
if isinstance(self.indexer, (IMPL.Series, IMPL.DataFrame)):
@@ -701,7 +699,6 @@ class TimeIndexingColumns:

def setup(self, shape):
self.df = generate_dataframe("int", *shape, RAND_LOW, RAND_HIGH)
trigger_import(self.df)
self.numeric_indexer = [0, 1]
self.labels_indexer = self.df.columns[self.numeric_indexer].tolist()

14 changes: 0 additions & 14 deletions asv_bench/benchmarks/hdk/__init__.py

This file was deleted.
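
Review note: because this change removes HDK from workflows, docs, benchmarks and CODEOWNERS at once, a quick scan for leftover references can help catch anything that was missed. The script below is only a sketch under stated assumptions and is not part of the PR: the function name `find_hdk_references`, the list of file suffixes, and the case-insensitive `hdk` pattern are all illustrative choices.

```python
import re
from pathlib import Path

# Assumption: any case-insensitive "hdk" substring counts as a leftover reference.
HDK_PATTERN = re.compile(r"hdk", re.IGNORECASE)
# Assumption: these file types (plus CODEOWNERS) are the ones worth scanning.
TEXT_SUFFIXES = {".py", ".yml", ".yaml", ".md", ".rst", ".json", ".cfg", ".txt"}
TEXT_NAMES = {"CODEOWNERS"}


def find_hdk_references(root):
    """Return (relative path, line number, stripped line) for each match under root."""
    root = Path(root)
    hits = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(root)
        if ".git" in rel.parts:
            continue  # skip repository metadata
        if path.suffix not in TEXT_SUFFIXES and path.name not in TEXT_NAMES:
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # unreadable or binary file, ignore
        for lineno, line in enumerate(text.splitlines(), start=1):
            if HDK_PATTERN.search(line):
                hits.append((rel.as_posix(), lineno, line.strip()))
    return hits


if __name__ == "__main__":
    for rel_path, lineno, line in find_hdk_references("."):
        print(f"{rel_path}:{lineno}: {line}")
```

Run from the repository root, it prints one line per remaining match. Some hits may be legitimate (for example, release notes that mention the removal), so the output is a starting point for review rather than a pass/fail check.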
