RuntimeError: uint8 is not yet supported. #3368

anmyachev · 2021-08-24T10:39:36Z

System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 20.04
Modin version (modin.__version__): 0.10.2 (from conda-forge)
Python version: 3.9.6
Code we can use to reproduce:

conda install modin-omnisci -c conda-forge
MODIN_BACKEND=omnisci MODIN_EXPERIMENTAL=true python test.py

test.py:

import numpy as np
import modin.pandas as pd

test = np.array([1., 3., 5.])
pd.get_dummies(test).sum(axis=0)

Describe the problem

This issue was found while running the PlastiCC workload.

Source code / logs

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../modin_on_omnisci_conda_forge/lib/python3.9/site-packages/modin/pandas/dataframe.py", line 2070, in sum
    data._query_compiler.sum(
  File ".../modin_on_omnisci_conda_forge/lib/python3.9/site-packages/modin/experimental/backends/omnisci/query_compiler.py", line 96, in method_wrapper
    return method(self, *args, **kwargs)
  File ".../modin_on_omnisci_conda_forge/lib/python3.9/site-packages/modin/experimental/backends/omnisci/query_compiler.py", line 361, in sum
    return self._agg("sum", **kwargs)
  File ".../modin_on_omnisci_conda_forge/lib/python3.9/site-packages/modin/experimental/backends/omnisci/query_compiler.py", line 96, in method_wrapper
    return method(self, *args, **kwargs)
  File ".../modin_on_omnisci_conda_forge/lib/python3.9/site-packages/modin/experimental/backends/omnisci/query_compiler.py", line 413, in _agg
    new_frame = new_frame._set_index(
  File ".../modin_on_omnisci_conda_forge/lib/python3.9/site-packages/modin/experimental/engines/omnisci_on_ray/frame/data.py", line 1931, in _set_index
    self._execute()
  File ".../modin_on_omnisci_conda_forge/lib/python3.9/site-packages/modin/experimental/engines/omnisci_on_ray/frame/data.py", line 1708, in _execute
    new_partitions = self._partition_mgr_cls.run_exec_plan(
  File ".../modin_on_omnisci_conda_forge/lib/python3.9/site-packages/modin/experimental/engines/omnisci_on_ray/frame/partition_manager.py", line 248, in run_exec_plan
    p.frame_id = omniSession.put_arrow_to_omnisci(obj)
  File ".../modin_on_omnisci_conda_forge/lib/python3.9/site-packages/modin/experimental/engines/omnisci_on_ray/frame/omnisci_worker.py", line 197, in put_arrow_to_omnisci
    cls._server.importArrowTable(name, table, fragment_size=fragment_size)
  File "dbe.pyx", line 207, in omniscidbe.PyDbEngine.importArrowTable
RuntimeError: uint8 is not yet supported.


anmyachev · 2021-08-24T10:40:19Z

@dchigarev can you take a look?

YarShev · 2021-08-24T11:49:08Z

As far as I know, OmniSci doesn't support uints for most operations.

cc @ienkovich , @fexolm

dchigarev · 2021-08-24T11:51:29Z

It seems that PyDBE (not sure about OmniSci itself) does not support unsigned integers at all:

Reproducer

import sys

sys.setdlopenflags(1 | 256)  # RTLD_LAZY+RTLD_GLOBAL
from dbe import PyDbEngine

import pyarrow as pa

server = PyDbEngine()

data = {"a": [1, 2, 3]}

dtypes = (
    "uint8", "uint16", "uint32", "uint64",
    "int8", "int16", "int32", "int64"
)
tables = (
    pa.Table.from_pydict(data, schema=pa.schema({"a": getattr(pa, dtype)()}))
    for dtype in dtypes
)

for dtype, table in zip(dtypes, tables):
    try:
        server.importArrowTable(dtype, table)
    except Exception as e:
        print(f"{dtype} import has failed: {e}")
    else:
        print(f"{dtype} has been imported successfully")

Output:

uint8 import has failed: uint8 is not yet supported.
uint16 import has failed: uint16 is not yet supported.
uint32 import has failed: uint32 is not yet supported.
uint64 import has failed: uint64 is not yet supported.
int8 has been imported successfully
int16 has been imported successfully
int32 has been imported successfully
int64 has been imported successfully

The simplest workaround that I could find for this issue is the following:

import numpy as np
import modin.pandas as pd
import pandas

uint8_dummies = pd.get_dummies(np.array([1., 3., 5.]))
# Modin's astype will trigger the data import to OmniSci as well, so defaulting to pandas
valid_dummies = uint8_dummies._default_to_pandas(pandas.DataFrame.astype, "int8")
# Workaround for Modin issue #3370
valid_dummies.columns = valid_dummies.columns.astype("str")
result = valid_dummies.sum(axis=0)
print(result)
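
The int8 cast in this workaround is safe for one-hot columns because they hold only 0 and 1. For arbitrary unsigned data, a wider signed type would be needed to avoid overflow. A minimal sketch of such a mapping follows; `signed_replacement` is a hypothetical helper for illustration, not part of Modin, pandas, or OmniSci:

```python
# Hypothetical helper, not part of Modin, pandas, or OmniSci.
# Maps each unsigned dtype name to the narrowest signed dtype that can
# hold every value of the unsigned one.
_SIGNED_REPLACEMENTS = {
    "uint8": "int16",
    "uint16": "int32",
    "uint32": "int64",
}


def signed_replacement(dtype_name: str) -> str:
    """Return a signed dtype name that can losslessly hold `dtype_name` values."""
    if dtype_name in _SIGNED_REPLACEMENTS:
        return _SIGNED_REPLACEMENTS[dtype_name]
    if dtype_name == "uint64":
        # No signed type of 64 bits or fewer covers the full uint64 range.
        raise ValueError("uint64 has no lossless signed replacement up to int64")
    # Signed and non-integer dtype names are returned unchanged.
    return dtype_name


column_dtypes = {"a": "uint8", "b": "int32", "c": "uint32"}
safe_dtypes = {name: signed_replacement(dtype) for name, dtype in column_dtypes.items()}
print(safe_dtypes)
```

The resulting dict has the per-column shape that `pandas.DataFrame.astype` accepts, so it could be passed there before the data is handed to the backend.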

anmyachev · 2021-08-31T14:06:33Z

@dchigarev What about the performance of this workaround? Maybe it's faster to default to pandas with an appropriate warning?

gshimansky · 2021-09-09T17:06:20Z

For reference, this is a patch for the PlastiCC benchmark example that makes it work:

diff --git a/examples/docker/modin-omnisci/plasticc-omnisci.py b/examples/docker/modin-omnisci/plasticc-omnisci.py
index 958a524d..6a505b51 100644
--- a/examples/docker/modin-omnisci/plasticc-omnisci.py
+++ b/examples/docker/modin-omnisci/plasticc-omnisci.py
@@ -16,6 +16,7 @@ import time
 from collections import OrderedDict
 from functools import partial
 import modin.pandas as pd
+import pandas
 from modin.experimental.engines.omnisci_on_ray.frame.omnisci_worker import OmnisciServer

 import numpy as np
@@ -118,10 +119,12 @@ def multi_weighted_logloss(y_true, y_preds, classes, class_weights):
     """
     y_p = y_preds.reshape(y_true.shape[0], len(classes), order="F")
     y_ohe = pd.get_dummies(y_true)
+    valid_y_ohe = y_ohe._default_to_pandas(pandas.DataFrame.astype, "int8")
+    valid_y_ohe.columns = valid_y_ohe.columns.astype("str")
     y_p = np.clip(a=y_p, a_min=1e-15, a_max=1 - 1e-15)
     y_p_log = np.log(y_p)
     y_log_ones = np.sum(y_ohe.values * y_p_log, axis=0)
-    nb_pos = y_ohe.sum(axis=0).values.astype(float)
+    nb_pos = valid_y_ohe.sum(axis=0).values.astype(float)
     class_arr = np.array([class_weights[k] for k in sorted(class_weights.keys())])
     y_w = y_log_ones * class_arr / nb_pos

Garra1980 · 2021-10-01T06:43:01Z

YarShev wrote: "As far as I know, OmniSci doesn't support uints for most operations."

That's right.
@gshimansky How do the benchmark source code changes affect performance?

dchigarev · 2021-10-01T09:57:37Z

BTW, since get_dummies itself already falls back to pandas, we can put all the steps of the workaround above into a single default-to-pandas call. That avoids repeated, unnecessary conversions from Modin to pandas and back:

import numpy as np
import modin.pandas as pd
from modin.utils import try_cast_to_pandas
import pandas

test = np.array([1., 3., 5.])
# Calling `pandas.get_dummies` instead of modin to get pandas object in the result:
pandas_uint_dummies = pandas.get_dummies(try_cast_to_pandas(test))
pandas_uint_dummies = pandas_uint_dummies.astype("int8")
# Workaround for Modin issue #3370
pandas_uint_dummies.columns = pandas_uint_dummies.columns.astype("str")

# Converting the OmniSci-compatible result back to Modin
modin_valid_dummies = pd.DataFrame(pandas_uint_dummies)
print(modin_valid_dummies.sum())

This workaround appears to be about twice as fast as the one I suggested before.

Perf comparison of these workarounds

from modin.config import Backend, Engine

Backend.put("Omnisci")
Engine.put("Native")

import modin.pandas as pd
from modin.utils import try_cast_to_pandas
import pandas
import timeit


def uint_get_dummies_only_pandas_workaround(df): # timeit(number=10) -> 3.15s
    """This function uses only pandas."""
    df = try_cast_to_pandas(df)
    # Calling `pandas.get_dummies` instead of modin to get pandas object in the result:
    result = pandas.get_dummies(df).astype("int8")
    # Workaround for Modin issue #3370
    result.columns = result.columns.astype("str")
    # Converting the OmniSci-compatible result back to Modin:
    result = pd.DataFrame(result)
    return result

def uint_get_dummies_default2pandas_workaround(df): # timeit(number=10) -> 6.99s
    """This function uses only modin, but defaults to pandas for every operation in this workaround."""
    uint8_dummies = pd.get_dummies(df)
    # Modin's astype will trigger the data import to OmniSci as well, so defaulting to pandas
    valid_dummies = uint8_dummies._default_to_pandas(pandas.DataFrame.astype, "int8")
    # Workaround for Modin issue #3370
    valid_dummies.columns = valid_dummies.columns.astype("str")
    return valid_dummies

ser = pd.Series(range(4096))

t1 = timeit.timeit(lambda: uint_get_dummies_only_pandas_workaround(ser), number=10)
t2 = timeit.timeit(lambda: uint_get_dummies_default2pandas_workaround(ser), number=10)

print(f"use only pandas: {t1}")
print(f"use modin, but fallback to pandas: {t2}")
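
Since `pandas.get_dummies` accepts a `dtype` parameter, the separate `astype` step might be avoidable by requesting a signed type up front. The following is a minimal pandas-only sketch of that idea; it has not been checked against the OmniSci backend, and whether it helps the PlastiCC workload is an assumption:

```python
import pandas

values = pandas.Series([1.0, 3.0, 5.0])
# Ask for signed 8-bit indicator columns directly instead of casting afterwards.
dummies = pandas.get_dummies(values, dtype="int8")
# Same column-name workaround as in the snippets above (Modin issue #3370).
dummies.columns = dummies.columns.astype("str")

print(dummies.dtypes)
print(dummies.sum(axis=0))
```

As in the pandas-only workaround, the result would still have to be wrapped in a Modin `pd.DataFrame` before further Modin operations.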

gshimansky · 2021-10-03T02:30:07Z

Garra1980 wrote: "How do the benchmark source code changes affect performance?"

I cannot compare performance with and without the patch. Without the patch, the benchmark doesn't complete because of an exception.

Garra1980 · 2021-10-03T07:33:47Z

It didn’t work earlier at all?

gshimansky · 2021-10-03T14:36:04Z

I remember that it worked but we don't know which commit broke it.


anmyachev added the "bug 🦗" (Something isn't working) and "HDK" (Related to HDK (OmniSci successor) engine or backend) labels Aug 24, 2021

anmyachev mentioned this issue Aug 24, 2021

FEAT-#3266: update modin, omniscidbe4py in omnisci docker examples #3270

Merged

7 tasks

dchigarev added a commit to dchigarev/modin that referenced this issue Feb 23, 2022

FIX-modin-project#3368: support unsigned integers in OmniSci backend

8830b5f

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

dchigarev mentioned this issue Feb 23, 2022

FIX-#3368: support unsigned integers in OmniSci backend #4256

Merged

8 tasks

vnlitvinov closed this as completed in #4256 Mar 1, 2022

vnlitvinov pushed a commit that referenced this issue Mar 1, 2022

FIX-#3368: support unsigned integers in OmniSci backend (#4256)

241a46d

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com> Co-authored-by: Yaroslav Igoshev <Poolliver868@mail.ru>

gshimansky mentioned this issue Jan 18, 2023

Problem with running plasticc benchmark using HDK intel/hdk#168

Open
