RuntimeError: uint8 is not yet supported. #3368
Comments
@dchigarev can you take a look?
As far as I know, OmniSci doesn't support uints for most operations. cc @ienkovich, @fexolm
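If unsigned integer types are indeed rejected, one option is to convert each unsigned column to a signed type wide enough for its values before importing it. The helper below is only a sketch of that choice: `signed_replacement` and `signed_max` are hypothetical names, not part of Modin, PyDBE, or OmniSci, and the dtype names are plain strings.

```python
def signed_max(bits):
    """Largest value representable by a signed integer of the given width."""
    return 2 ** (bits - 1) - 1


def signed_replacement(dtype_name, max_value=None):
    """Return the name of a signed dtype able to hold the column's values.

    Hypothetical helper: `max_value` is the largest value actually present
    in the column, if known. Without it, the full unsigned range is assumed.
    """
    if not dtype_name.startswith("uint"):
        return dtype_name  # not an unsigned dtype, nothing to change
    bits = int(dtype_name[len("uint"):])
    if max_value is None:
        max_value = 2 ** bits - 1
    # Pick the narrowest signed width that still fits the largest value.
    for candidate in (8, 16, 32, 64):
        if max_value <= signed_max(candidate):
            return f"int{candidate}"
    raise OverflowError(f"{dtype_name} values up to {max_value} do not fit in int64")


for name in ("uint8", "uint16", "uint32", "int32"):
    print(name, "->", signed_replacement(name))
# One-hot columns only contain 0 and 1, so the narrowest signed type is enough.
print("uint8 holding only 0/1 ->", signed_replacement("uint8", max_value=1))
```

Note that a full-range `uint64` column has no lossless signed counterpart, which is why the sketch raises instead of silently truncating.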
It seems that PyDBE (not sure about OmniSci itself) does not support unsigned integers at all.
Reproducer:
import sys
sys.setdlopenflags(1 | 256)  # RTLD_LAZY+RTLD_GLOBAL
from dbe import PyDbEngine
import pyarrow as pa
server = PyDbEngine()
data = {"a": [1, 2, 3]}
dtypes = (
    "uint8", "uint16", "uint32", "uint64",
    "int8", "int16", "int32", "int64"
)
tables = (
    pa.Table.from_pydict(data, schema=pa.schema({"a": getattr(pa, dtype)()}))
    for dtype in dtypes
)
for dtype, table in zip(dtypes, tables):
    try:
        server.importArrowTable(dtype, table)
    except Exception as e:
        print(f"{dtype} import has failed: {e}")
    else:
        print(f"{dtype} has been imported successfully")
Output:
The simplest workaround that I could find for this issue is the following:
import numpy as np
import modin.pandas as pd
import pandas
uint8_dummies = pd.get_dummies(np.array([1., 3., 5.]))
# Modin's astype will trigger the data import to OmniSci as well, so defaulting to pandas
valid_dummies = uint8_dummies._default_to_pandas(pandas.DataFrame.astype, "int8")
# Workaround for Modin issue #3370
valid_dummies.columns = valid_dummies.columns.astype("str")
result = valid_dummies.sum(axis=0)
print(result)
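Another option that may avoid the separate cast is the `dtype` argument of pandas' `get_dummies`, which sets the dtype of the generated indicator columns. The following is a minimal pandas-only sketch; whether Modin's OmniSci backend accepts the same argument without hitting the uint8 error was not tested in this thread and is an assumption.

```python
import pandas

values = pandas.Series([1.0, 3.0, 5.0, 3.0])

# Ask for signed indicator columns directly instead of casting afterwards.
dummies = pandas.get_dummies(values, dtype="int8")

# The column labels are floats here; convert them to strings,
# mirroring the workaround for Modin issue #3370.
dummies.columns = dummies.columns.astype("str")

counts = dummies.sum(axis=0)
print(dummies.dtypes)
print(counts)
```

Passing `dtype` explicitly also keeps the result independent of the pandas version's default indicator dtype.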
@dchigarev What about the workaround's performance? Maybe it's faster to default to pandas with an appropriate warning?
For reference, this is a patch for the plasticc benchmark example that makes it work:
diff --git a/examples/docker/modin-omnisci/plasticc-omnisci.py b/examples/docker/modin-omnisci/plasticc-omnisci.py
index 958a524d..6a505b51 100644
--- a/examples/docker/modin-omnisci/plasticc-omnisci.py
+++ b/examples/docker/modin-omnisci/plasticc-omnisci.py
@@ -16,6 +16,7 @@ import time
 from collections import OrderedDict
 from functools import partial
 import modin.pandas as pd
+import pandas
 from modin.experimental.engines.omnisci_on_ray.frame.omnisci_worker import OmnisciServer
 import numpy as np
@@ -118,10 +119,12 @@ def multi_weighted_logloss(y_true, y_preds, classes, class_weights):
     """
     y_p = y_preds.reshape(y_true.shape[0], len(classes), order="F")
     y_ohe = pd.get_dummies(y_true)
+    valid_y_ohe = y_ohe._default_to_pandas(pandas.DataFrame.astype, "int8")
+    valid_y_ohe.columns = valid_y_ohe.columns.astype("str")
     y_p = np.clip(a=y_p, a_min=1e-15, a_max=1 - 1e-15)
     y_p_log = np.log(y_p)
     y_log_ones = np.sum(y_ohe.values * y_p_log, axis=0)
-    nb_pos = y_ohe.sum(axis=0).values.astype(float)
+    nb_pos = valid_y_ohe.sum(axis=0).values.astype(float)
     class_arr = np.array([class_weights[k] for k in sorted(class_weights.keys())])
     y_w = y_log_ones * class_arr / nb_pos
That's right.
BTW, since
import numpy as np
import modin.pandas as pd
from modin.utils import try_cast_to_pandas
import pandas
test = np.array([1., 3., 5.])
# Calling `pandas.get_dummies` instead of modin to get a pandas object as the result:
pandas_uint_dummies = pandas.get_dummies(try_cast_to_pandas(test))
pandas_uint_dummies = pandas_uint_dummies.astype("int8")
# Workaround for Modin issue #3370
pandas_uint_dummies.columns = pandas_uint_dummies.columns.astype("str")
# Converting the OmniSci-compatible result back to modin
modin_valid_dummies = pd.DataFrame(pandas_uint_dummies)
print(modin_valid_dummies.sum())
This workaround appears to be about twice as fast as the one I suggested before.
Perf comparison of these workarounds:
from modin.config import Backend, Engine
Backend.put("Omnisci")
Engine.put("Native")
import modin.pandas as pd
from modin.utils import try_cast_to_pandas
import pandas
import timeit

def uint_get_dummies_only_pandas_workaround(df):  # timeit(number=10) -> 3.15s
    """This function uses only pandas."""
    df = try_cast_to_pandas(df)
    # Calling `pandas.get_dummies` instead of modin to get a pandas object as the result:
    result = pandas.get_dummies(df).astype("int8")
    # Workaround for Modin issue #3370
    result.columns = result.columns.astype("str")
    # Converting the OmniSci-compatible result back to modin:
    result = pd.DataFrame(result)
    return result

def uint_get_dummies_default2pandas_workaround(df):  # timeit(number=10) -> 6.99s
    """This function uses only modin, but defaults to pandas for every operation in this workaround."""
    uint8_dummies = pd.get_dummies(df)
    # Modin's astype will trigger the data import to OmniSci as well, so defaulting to pandas
    valid_dummies = uint8_dummies._default_to_pandas(pandas.DataFrame.astype, "int8")
    # Workaround for Modin issue #3370
    valid_dummies.columns = valid_dummies.columns.astype("str")
    return valid_dummies

ser = pd.Series(range(4096))
t1 = timeit.timeit(lambda: uint_get_dummies_only_pandas_workaround(ser), number=10)
t2 = timeit.timeit(lambda: uint_get_dummies_default2pandas_workaround(ser), number=10)
print(f"use only pandas: {t1}")
print(f"use modin, but fallback to pandas: {t2}")
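The timings quoted in the comparison come from a single `timeit.timeit` call per function. A slightly more robust measurement could repeat each run and keep the best time, roughly as sketched below. `compare`, `cheap`, and `costly` are hypothetical names; the two stand-in workloads only make the sketch runnable without Modin and would be replaced by the two workaround functions.

```python
import timeit


def compare(candidates, repeat=5, number=10):
    """Return {name: best total seconds over `repeat` runs of `number` calls}."""
    results = {}
    for name, func in candidates.items():
        # The minimum over several repeats is less affected by background
        # noise than a single measurement.
        results[name] = min(timeit.repeat(func, repeat=repeat, number=number))
    return results


# Hypothetical stand-in workloads, not the functions from this thread.
def cheap():
    return sum(range(1_000))


def costly():
    return sum(range(100_000))


timings = compare({"cheap": cheap, "costly": costly})
for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f"{name}: {seconds:.4f}s")
```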
I cannot compare performance with and without the patch: without it, the benchmark doesn't complete because of an exception.
It didn’t work earlier at all? |
I remember that it worked but we don't know which commit broke it. |
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
System information
OS: Ubuntu 20.04
Modin version (modin.__version__): 0.10.2
Python version (from conda-forge): 3.9.6
Test.py:
Describe the problem
This issue was found when running the PlastiCC workload.
Source code / logs