Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RuntimeError: uint8 is not yet supported. #3368

Closed
anmyachev opened this issue Aug 24, 2021 · 10 comments · Fixed by #4256
Closed

RuntimeError: uint8 is not yet supported. #3368

anmyachev opened this issue Aug 24, 2021 · 10 comments · Fixed by #4256
Labels
bug 🦗 Something isn't working HDK Related to HDK (OmniSci successor) engine or backend

Comments

@anmyachev
Copy link
Collaborator

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 20.04
  • Modin version (modin.__version__): 0.10.2 (from conda-forge)
  • Python version: 3.9.6
  • Code we can use to reproduce:
conda install modin-omnisci -c conda-forge
MODIN_BACKEND=omnisci MODIN_EXPERIMENTAL=true python test.py

Test.py:

import numpy as np
import modin.pandas as pd

test = np.array([1., 3., 5.])
pd.get_dummies(np.array([1., 3., 5.])).sum(axis=0)

Describe the problem

This issue was found when running PlastiCC workload.

Source code / logs

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../modin_on_omnisci_conda_forge/lib/python3.9/site-packages/modin/pandas/dataframe.py", line 2070, in sum
    data._query_compiler.sum(
  File ".../modin_on_omnisci_conda_forge/lib/python3.9/site-packages/modin/experimental/backends/omnisci/query_compiler.py", line 96, in method_wrapper
    return method(self, *args, **kwargs)
  File ".../modin_on_omnisci_conda_forge/lib/python3.9/site-packages/modin/experimental/backends/omnisci/query_compiler.py", line 361, in sum
    return self._agg("sum", **kwargs)
  File ".../modin_on_omnisci_conda_forge/lib/python3.9/site-packages/modin/experimental/backends/omnisci/query_compiler.py", line 96, in method_wrapper
    return method(self, *args, **kwargs)
  File ".../modin_on_omnisci_conda_forge/lib/python3.9/site-packages/modin/experimental/backends/omnisci/query_compiler.py", line 413, in _agg
    new_frame = new_frame._set_index(
  File ".../modin_on_omnisci_conda_forge/lib/python3.9/site-packages/modin/experimental/engines/omnisci_on_ray/frame/data.py", line 1931, in _set_index
    self._execute()
  File ".../modin_on_omnisci_conda_forge/lib/python3.9/site-packages/modin/experimental/engines/omnisci_on_ray/frame/data.py", line 1708, in _execute
    new_partitions = self._partition_mgr_cls.run_exec_plan(
  File ".../modin_on_omnisci_conda_forge/lib/python3.9/site-packages/modin/experimental/engines/omnisci_on_ray/frame/partition_manager.py", line 248, in run_exec_plan
    p.frame_id = omniSession.put_arrow_to_omnisci(obj)
  File ".../modin_on_omnisci_conda_forge/lib/python3.9/site-packages/modin/experimental/engines/omnisci_on_ray/frame/omnisci_worker.py", line 197, in put_arrow_to_omnisci
    cls._server.importArrowTable(name, table, fragment_size=fragment_size)
  File "dbe.pyx", line 207, in omniscidbe.PyDbEngine.importArrowTable
RuntimeError: uint8 is not yet supported.
@anmyachev anmyachev added bug 🦗 Something isn't working HDK Related to HDK (OmniSci successor) engine or backend labels Aug 24, 2021
@anmyachev
Copy link
Collaborator Author

@dchigarev can you take a look?

@YarShev
Copy link
Collaborator

YarShev commented Aug 24, 2021

As far as I know, OmniSci doesn't support uints for most operations.

cc @ienkovich , @fexolm

@dchigarev
Copy link
Collaborator

dchigarev commented Aug 24, 2021

it seems that PyDBE (not sure about OmniSci itself) does not support unsigned integers at all:

Reproducer
import sys

sys.setdlopenflags(1 | 256)  # RTLD_LAZY+RTLD_GLOBAL
from dbe import PyDbEngine

import pyarrow as pa

server = PyDbEngine()

data = {"a": [1, 2, 3]}

dtypes = (
    "uint8", "uint16", "uint32", "uint64",
    "int8", "int16", "int32", "int64"
)
tables = (
    pa.Table.from_pydict(data, schema=pa.schema({"a": getattr(pa, dtype)()}))
    for dtype in dtypes
)

for dtype, table in zip(dtypes, tables):
    try:
        server.importArrowTable(dtype, table)
    except Exception as e:
        print(f"{dtype} import has failed: {e}")
    else:
        print(f"{dtype} has been imported successfully")

Output:

uint8 import has failed: uint8 is not yet supported.
uint16 import has failed: uint16 is not yet supported.
uint32 import has failed: uint32 is not yet supported.
uint64 import has failed: uint64 is not yet supported.
int8 has been imported successfully
int16 has been imported successfully
int32 has been imported successfully
int64 has been imported successfully

The simplest workaround that I could find for this issue is the following:

import numpy as np
import modin.pandas as pd
import pandas

uint8_dummies = pd.get_dummies(np.array([1., 3., 5.]))
# Modin's astype will trigger the data import to OmniSci as well, so defaulting to pandas
valid_dummies = uint8_dummies._default_to_pandas(pandas.DataFrame.astype, "int8")
# Workaround for Modin issue #3370
valid_dummies.columns = valid_dummies.columns.astype("str")
result = valid_dummies.sum(axis=0)
print(result)

@anmyachev
Copy link
Collaborator Author

@dchigarev What about workaround performance? Maybe it's faster to default to pandas with appropriate warning?

@gshimansky
Copy link
Collaborator

gshimansky commented Sep 9, 2021

For reference, this is a patch for plasticc benchmark example that makes it work:

diff --git a/examples/docker/modin-omnisci/plasticc-omnisci.py b/examples/docker/modin-omnisci/plasticc-omnisci.py
index 958a524d..6a505b51 100644
--- a/examples/docker/modin-omnisci/plasticc-omnisci.py
+++ b/examples/docker/modin-omnisci/plasticc-omnisci.py
@@ -16,6 +16,7 @@ import time
 from collections import OrderedDict
 from functools import partial
 import modin.pandas as pd
+import pandas
 from modin.experimental.engines.omnisci_on_ray.frame.omnisci_worker import OmnisciServer

 import numpy as np
@@ -118,10 +119,12 @@ def multi_weighted_logloss(y_true, y_preds, classes, class_weights):
     """
     y_p = y_preds.reshape(y_true.shape[0], len(classes), order="F")
     y_ohe = pd.get_dummies(y_true)
+    valid_y_ohe = y_ohe._default_to_pandas(pandas.DataFrame.astype, "int8")
+    valid_y_ohe.columns = valid_y_ohe.columns.astype("str")
     y_p = np.clip(a=y_p, a_min=1e-15, a_max=1 - 1e-15)
     y_p_log = np.log(y_p)
     y_log_ones = np.sum(y_ohe.values * y_p_log, axis=0)
-    nb_pos = y_ohe.sum(axis=0).values.astype(float)
+    nb_pos = valid_y_ohe.sum(axis=0).values.astype(float)
     class_arr = np.array([class_weights[k] for k in sorted(class_weights.keys())])
     y_w = y_log_ones * class_arr / nb_pos

@Garra1980
Copy link
Collaborator

As far as I know, OmniSci doesn't support uints for most operations.

cc @ienkovich , @fexolm

that's right.
@gshimansky How does benchmark source code changes affect performance?

@dchigarev
Copy link
Collaborator

BTW, since get_dummies itself already fallbacks to pandas, we can just put all of the steps of the presented workaround into a single default-to-pandas call instead of doing it multiple times and waste time on unnecessary conversions from modin to pandas and vice versa:

import numpy as np
import modin.pandas as pd
from modin.utils import try_cast_to_pandas
import pandas

test = np.array([1., 3., 5.])
# Calling `pandas.get_dummies` instead of modin to get pandas object in the result:
pandas_uint_dummies = pandas.get_dummies(try_cast_to_pandas(test))
pandas_uint_dummies = pandas_uint_dummies.astype("int8")
# Workaround for Modin issue #3370
pandas_uint_dummies.columns = pandas_uint_dummies.columns.astype("str")

# Converting valid for OmniSci result back to modin
modin_valid_dummies = pd.DataFrame(pandas_uint_dummies)
print(modin_valid_dummies.sum())

Using this workaround appears to be twice faster than the one suggested by me before.

Perf comparison of these workarounds
from modin.config import Backend, Engine

Backend.put("Omnisci")
Engine.put("Native")

import modin.pandas as pd
from modin.utils import try_cast_to_pandas
import pandas
import timeit


def uint_get_dummies_only_pandas_workaround(df): # timeit(number=10) -> 3.15s
    """This function uses only pandas."""
    df = try_cast_to_pandas(df)
    # Calling `pandas.get_dummies` instead of modin to get pandas object in the result:
    result = pandas.get_dummies(df).astype("int8")
    # Workaround for Modin issue #3370
    result.columns = result.columns.astype("str")
    # Converting valid for OmniSci result back to modin:
    result = pd.DataFrame(result)
    return result

def uint_get_dummies_default2pandas_workaround(df): # timeit(number=10) -> 6.99s
    """This function uses only modin, but defaults to pandas for every operation in this workaround."""
    uint8_dummies = pd.get_dummies(df)
    # Modin's astype will trigger the data import to OmniSci as well, so defaulting to pandas
    valid_dummies = uint8_dummies._default_to_pandas(pandas.DataFrame.astype, "int8")
    # Workaround for Modin issue #3370
    valid_dummies.columns = valid_dummies.columns.astype("str")
    return valid_dummies

ser = pd.Series(range(4096))

t1 = timeit.timeit(lambda: uint_get_dummies_only_pandas_workaround(ser), number=10)
t2 = timeit.timeit(lambda: uint_get_dummies_default2pandas_workaround(ser), number=10)

print(f"use only pandas: {t1}")
print(f"use modin, but fallback to pandas: {t2}")

@gshimansky
Copy link
Collaborator

that's right. @gshimansky How does benchmark source code changes affect performance?

I cannot compare performance with and without the patch. Without the patch benchmark doesn't complete because of an exception.

@Garra1980
Copy link
Collaborator

It didn’t work earlier at all?

@gshimansky
Copy link
Collaborator

I remember that it worked but we don't know which commit broke it.

dchigarev added a commit to dchigarev/modin that referenced this issue Feb 23, 2022
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
vnlitvinov pushed a commit that referenced this issue Mar 1, 2022
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
Co-authored-by: Yaroslav Igoshev <Poolliver868@mail.ru>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug 🦗 Something isn't working HDK Related to HDK (OmniSci successor) engine or backend
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants