BUG: can't create a new column with string which looks like a date #7371

Closed
2 of 3 tasks
MarcoGorelli opened this issue Aug 19, 2024 · 3 comments · Fixed by #7372
Labels
bug 🦗 Something isn't working P2 Minor bugs or low-priority feature requests

Comments

@MarcoGorelli

Modin version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest released version of Modin.

  • I have confirmed this bug exists on the main branch of Modin. (In order to do this you can follow this guide.)

Reproducible Example

In [1]: import modin.pandas as pd

In [2]: df = pd.DataFrame({'a': [0]})

In [3]: df['b'] = 'foobar'

In [4]: df['c'] = '2020-01-01'
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[4], line 1
----> 1 df['c'] = '2020-01-01'

File ~/polars-api-compat-dev/.venv/lib/python3.11/site-packages/modin/logging/logger_decorator.py:144, in enable_logging.<locals>.decorator.<locals>.run_and_log(*args, **kwargs)
    129 """
    130 Compute function with logging if Modin logging is enabled.
    131 
   (...)
    141 Any
    142 """
    143 if LogMode.get() == "disable":
--> 144     return obj(*args, **kwargs)
    146 logger = get_logger()
    147 logger.log(log_level, start_line)

File ~/polars-api-compat-dev/.venv/lib/python3.11/site-packages/modin/pandas/dataframe.py:2683, in DataFrame.__setitem__(self, key, value)
   2681         return
   2682     # Do new column assignment after error checks and possible value modifications
-> 2683     self.insert(loc=len(self.columns), column=key, value=value)
   2684     return
   2686 if not hashable(key):

File ~/polars-api-compat-dev/.venv/lib/python3.11/site-packages/modin/logging/logger_decorator.py:144, in enable_logging.<locals>.decorator.<locals>.run_and_log(*args, **kwargs)
    129 """
    130 Compute function with logging if Modin logging is enabled.
    131 
   (...)
    141 Any
    142 """
    143 if LogMode.get() == "disable":
--> 144     return obj(*args, **kwargs)
    146 logger = get_logger()
    147 logger.log(log_level, start_line)

File ~/polars-api-compat-dev/.venv/lib/python3.11/site-packages/modin/pandas/dataframe.py:1130, in DataFrame.insert(self, loc, column, value, allow_duplicates)
   1128     if isinstance(value, (Series, array)):
   1129         value = value._query_compiler
-> 1130     new_query_compiler = self._query_compiler.insert(loc, column, value)
   1132 self._update_inplace(new_query_compiler=new_query_compiler)

File ~/polars-api-compat-dev/.venv/lib/python3.11/site-packages/modin/logging/logger_decorator.py:144, in enable_logging.<locals>.decorator.<locals>.run_and_log(*args, **kwargs)
    129 """
    130 Compute function with logging if Modin logging is enabled.
    131 
   (...)
    141 Any
    142 """
    143 if LogMode.get() == "disable":
--> 144     return obj(*args, **kwargs)
    146 logger = get_logger()
    147 logger.log(log_level, start_line)

File ~/polars-api-compat-dev/.venv/lib/python3.11/site-packages/modin/core/storage_formats/pandas/query_compiler.py:3271, in PandasQueryCompiler.insert(self, loc, column, value)
   3268     df.insert(internal_idx, column, value)
   3269     return df
-> 3271 value_dtype = extract_dtype(value)
   3272 new_columns = self.columns.insert(loc, column)
   3273 new_dtypes = ModinDtypes.concat(
   3274     [
   3275         self._modin_frame._dtypes,
   (...)
   3279     new_columns
   3280 )  # get dtypes in a proper order

File ~/polars-api-compat-dev/.venv/lib/python3.11/site-packages/modin/core/dataframe/pandas/metadata/dtypes.py:1227, in extract_dtype(value)
   1215 """
   1216 Extract dtype(s) from the passed `value`.
   1217 
   (...)
   1224 DtypeObj or pandas.Series of DtypeObj
   1225 """
   1226 try:
-> 1227     dtype = pandas.api.types.pandas_dtype(value)
   1228 except TypeError:
   1229     dtype = pandas.Series(value).dtype

File ~/polars-api-compat-dev/.venv/lib/python3.11/site-packages/pandas/core/dtypes/common.py:1645, in pandas_dtype(dtype)
   1640     with warnings.catch_warnings():
   1641         # GH#51523 - Series.astype(np.integer) doesn't show
   1642         # numpy deprecation warning of np.integer
   1643         # Hence enabling DeprecationWarning
   1644         warnings.simplefilter("always", DeprecationWarning)
-> 1645         npdtype = np.dtype(dtype)
   1646 except SyntaxError as err:
   1647     # np.dtype uses `eval` which can raise SyntaxError
   1648     raise TypeError(f"data type '{dtype}' not understood") from err

File ~/polars-api-compat-dev/.venv/lib/python3.11/site-packages/numpy/_core/_internal.py:170, in _commastring(astr)
    168 mo = sep_re.match(astr, pos=startindex)
    169 if not mo:
--> 170     raise ValueError(
    171         'format number %d of "%s" is not recognized' %
    172         (len(result)+1, astr))
    173 startindex = mo.end()
    174 islist = True

ValueError: format number 1 of "2020-01-01" is not recognized

Issue Description

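Assigning a scalar string that looks like a date (e.g. '2020-01-01') to a new column raises a ValueError, even though assigning another string (e.g. 'foobar') works.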

Expected Behavior

A new column is created with the string '2020-01-01' as its value.

Installed Versions

In [5]: pd.show_versions()
UserWarning: Setuptools is replacing distutils. Support for replacing an already imported distutils is deprecated. In the future, this condition will fail. Register concerns at https://github.com/pypa/setuptools/issues/new?template=distutils-deprecation.yml

INSTALLED VERSIONS

commit : c8bbca8
python : 3.11.9.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.153.1-microsoft-standard-WSL2
Version : #1 SMP Fri Mar 29 23:14:13 UTC 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

Modin dependencies

modin : 0.31.0
ray : None
dask : 2024.8.0
distributed : 2024.8.0

pandas dependencies

pandas : 2.2.2
numpy : 2.0.1
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 72.1.0
pip : 24.2
Cython : None
pytest : 8.3.2
hypothesis : 6.111.0
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.4
IPython : 8.26.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.6.1
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 17.0.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.14.0
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

@MarcoGorelli MarcoGorelli added bug 🦗 Something isn't working Triage 🩹 Issues that need triage labels Aug 19, 2024
@noloerino
Collaborator

It looks like when Modin inserts a value into a frame, it first calls pandas.api.types.pandas_dtype on the actual value, and if that raises a TypeError, it falls back to pandas.Series(value).dtype to infer the dtype of the new column. Along the way, it hits np.dtype(value), which seems to be trying to parse a structured dtype out of the input string (https://github.com/numpy/numpy/blob/d35cd07ea997f033b2d89d349734c61f5de54b0d/numpy/core/_internal.py#L166; I'm not 100% sure what the regex is trying to match), and numpy raises a ValueError instead of a TypeError in this case, so the fallback is never reached.
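
To make the difference between the two strings concrete, here is a small sketch that only uses np.dtype. The helper name dtype_error_name is hypothetical and exists just for this illustration; the behavior for the date-like string is the one shown in the traceback above (numpy 2.0.1) and may differ on other numpy versions.

```python
import numpy as np


def dtype_error_name(value):
    """Return the name of the exception np.dtype raises for `value`, or None.

    Hypothetical helper for illustration only; not part of Modin, pandas or numpy.
    """
    try:
        np.dtype(value)
    except (TypeError, ValueError) as err:
        return type(err).__name__
    return None


# A plain word is rejected as an unknown dtype name.
print(dtype_error_name("foobar"))
# A date-like string ends up in numpy's _commastring parser (see the traceback).
print(dtype_error_name("2020-01-01"))
# A valid dtype name parses without error.
print(dtype_error_name("int64"))
```

Since the except clause in extract_dtype only lists TypeError, the first case falls through to the pandas.Series fallback while the second one propagates to the caller.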

@devin-petersohn (or other contributors) would you happen to know why we're trying to directly call pandas.api.types.pandas_dtype on the new value in the first place? If we can't remove that check, we should at least be able to fix this by catching ValueError in addition to TypeError here:
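
A minimal sketch of that idea, modeled on the extract_dtype lines visible in the traceback above. The name extract_dtype_sketch is hypothetical, the return statement is an assumption (the traceback cuts off before it), and this is not necessarily the change that was merged in #7372.

```python
import pandas


def extract_dtype_sketch(value):
    """Infer a dtype for `value`, tolerating strings numpy cannot parse.

    Hypothetical stand-in for Modin's extract_dtype; illustration only.
    """
    try:
        # Works when `value` is itself a dtype or a dtype-like string.
        dtype = pandas.api.types.pandas_dtype(value)
    except (TypeError, ValueError):
        # numpy may raise ValueError (not TypeError) for strings such as
        # "2020-01-01", so both are caught before falling back to inference.
        dtype = pandas.Series(value).dtype
    return dtype


print(extract_dtype_sketch("2020-01-01"))
print(extract_dtype_sketch(1.5))
```

With both exception types caught, the date-like string takes the same fallback path as any other scalar.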

@noloerino noloerino added P2 Minor bugs or low-priority feature requests and removed Triage 🩹 Issues that need triage labels Aug 20, 2024
@ramakrishna975

I'm experiencing the same issue. Is there a workaround for this?

@noloerino noloerino self-assigned this Aug 21, 2024
noloerino added a commit to noloerino/modin that referenced this issue Aug 21, 2024
Signed-off-by: Jonathan Shi <jhshi07@gmail.com>
noloerino added a commit to noloerino/modin that referenced this issue Aug 22, 2024
Signed-off-by: Jonathan Shi <jhshi07@gmail.com>
@noloerino
Collaborator

@ramakrishna975 You can insert a Series instead of a scalar value as a workaround.

>>> df = pd.DataFrame({'a': [0]})
>>> df["c"] = pd.Series(["2020-01-01"])
>>> df
   a           c
0  0  2020-01-01

noloerino added a commit to noloerino/modin that referenced this issue Aug 26, 2024
Signed-off-by: Jonathan Shi <jhshi07@gmail.com>
YarShev pushed a commit that referenced this issue Aug 28, 2024
Signed-off-by: Jonathan Shi <jhshi07@gmail.com>