Invalid file path or buffer when trying read_csv_glob with usecols parameter #4373

Closed
c3-cjazra opened this issue Apr 7, 2022 · 3 comments · Fixed by #4405

@c3-cjazra

System information

ray 1.9.2
modin 0.12
python 3.9.12

Describe the problem

import modin.experimental.pandas as modin_pd

modin_pd.read_csv_glob(
    filepath_or_buffer="somefile.csv",
    usecols=["someExistingColumn"],
)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File conda/envs/ray-env/lib/python3.9/site-packages/modin/experimental/pandas/io.py:183, in _make_parser_func.<locals>.parser_func(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, encoding_errors, dialect, error_bad_lines, warn_bad_lines, on_bad_lines, skipfooter, doublequote, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
    180     f_locals["sep"] = "\t"
    182 kwargs = {k: v for k, v in f_locals.items() if k in _pd_read_csv_signature}
--> 183 return _read(**kwargs)

File conda/envs/ray-env/lib/python3.9/site-packages/modin/experimental/pandas/io.py:208, in _read(**kwargs)
    205 Engine.subscribe(_update_engine)
    207 try:
--> 208     pd_obj = FactoryDispatcher.read_csv_glob(**kwargs)
    209 except AttributeError:
    210     raise AttributeError("read_csv_glob() is only implemented for pandas on Ray.")

File conda/envs/ray-env/lib/python3.9/site-packages/modin/core/execution/dispatching/factories/dispatcher.py:185, in FactoryDispatcher.read_csv_glob(cls, **kwargs)
    182 @classmethod
    183 @_inherit_docstrings(factories.ExperimentalPandasOnRayFactory._read_csv_glob)
    184 def read_csv_glob(cls, **kwargs):
--> 185     return cls.__factory._read_csv_glob(**kwargs)

File conda/envs/ray-env/lib/python3.9/site-packages/modin/core/execution/dispatching/factories/factories.py:513, in ExperimentalPandasOnRayFactory._read_csv_glob(cls, **kwargs)
    506 @classmethod
    507 @doc(
    508     _doc_io_method_raw_template,
   (...)
    511 )
    512 def _read_csv_glob(cls, **kwargs):
--> 513     return cls.io_cls.read_csv_glob(**kwargs)

File conda/envs/ray-env/lib/python3.9/site-packages/modin/core/io/text/csv_glob_dispatcher.py:135, in CSVGlobDispatcher._read(cls, filepath_or_buffer, **kwargs)
    133 if usecols is not None and usecols_md[1] != "integer":
    134     del kwargs["usecols"]
--> 135     all_cols = pandas.read_csv(
    136         OpenFile(filepath_or_buffer, "rb"),
    137         **dict(kwargs, nrows=0, skipfooter=0),
    138     ).columns
    139     usecols = all_cols.get_indexer_for(list(usecols_md[0]))
    140 parse_dates = kwargs.pop("parse_dates", False)

File conda/envs/ray-env/lib/python3.9/site-packages/pandas/util/_decorators.py:311, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    305 if len(args) > num_allow_args:
    306     warnings.warn(
    307         msg.format(arguments=arguments),
    308         FutureWarning,
    309         stacklevel=stacklevel,
    310     )
--> 311 return func(*args, **kwargs)

File conda/envs/ray-env/lib/python3.9/site-packages/pandas/io/parsers/readers.py:586, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, error_bad_lines, warn_bad_lines, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
    571 kwds_defaults = _refine_defaults_read(
    572     dialect,
    573     delimiter,
   (...)
    582     defaults={"delimiter": ","},
    583 )
    584 kwds.update(kwds_defaults)
--> 586 return _read(filepath_or_buffer, kwds)

File conda/envs/ray-env/lib/python3.9/site-packages/pandas/io/parsers/readers.py:482, in _read(filepath_or_buffer, kwds)
    479 _validate_names(kwds.get("names", None))
    481 # Create the parser.
--> 482 parser = TextFileReader(filepath_or_buffer, **kwds)
    484 if chunksize or iterator:
    485     return parser

File conda/envs/ray-env/lib/python3.9/site-packages/pandas/io/parsers/readers.py:811, in TextFileReader.__init__(self, f, engine, **kwds)
    808 if "has_index_names" in kwds:
    809     self.options["has_index_names"] = kwds["has_index_names"]
--> 811 self._engine = self._make_engine(self.engine)

File conda/envs/ray-env/lib/python3.9/site-packages/pandas/io/parsers/readers.py:1040, in TextFileReader._make_engine(self, engine)
   1036     raise ValueError(
   1037         f"Unknown engine: {engine} (valid options are {mapping.keys()})"
   1038     )
   1039 # error: Too many arguments for "ParserBase"
-> 1040 return mapping[engine](self.f, **self.options)

File conda/envs/ray-env/lib/python3.9/site-packages/pandas/io/parsers/c_parser_wrapper.py:51, in CParserWrapper.__init__(self, src, **kwds)
     48 kwds["usecols"] = self.usecols
     50 # open handles
---> 51 self._open_handles(src, kwds)
     52 assert self.handles is not None
     54 # Have to pass int, would break tests using TextReader directly otherwise :(

File conda/envs/ray-env/lib/python3.9/site-packages/pandas/io/parsers/base_parser.py:222, in ParserBase._open_handles(self, src, kwds)
    218 def _open_handles(self, src: FilePathOrBuffer, kwds: dict[str, Any]) -> None:
    219     """
    220     Let the readers open IOHandles after they are done with their potential raises.
    221     """
--> 222     self.handles = get_handle(
    223         src,
    224         "r",
    225         encoding=kwds.get("encoding", None),
    226         compression=kwds.get("compression", None),
    227         memory_map=kwds.get("memory_map", False),
    228         storage_options=kwds.get("storage_options", None),
    229         errors=kwds.get("encoding_errors", "strict"),
    230     )

File conda/envs/ray-env/lib/python3.9/site-packages/pandas/io/common.py:609, in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
    602     raise ValueError(
    603         f"Invalid value for `encoding_errors` ({errors}). Please see "
    604         + "https://docs.python.org/3/library/codecs.html#error-handlers "
    605         + "for valid values."
    606     )
    608 # open URLs
--> 609 ioargs = _get_filepath_or_buffer(
    610     path_or_buf,
    611     encoding=encoding,
    612     compression=compression,
    613     mode=mode,
    614     storage_options=storage_options,
    615 )
    617 handle = ioargs.filepath_or_buffer
    618 handles: list[Buffer]

File conda/envs/ray-env/lib/python3.9/site-packages/pandas/io/common.py:396, in _get_filepath_or_buffer(filepath_or_buffer, encoding, compression, mode, storage_options)
    394 if not is_file_like(filepath_or_buffer):
    395     msg = f"Invalid file path or buffer object type: {type(filepath_or_buffer)}"
--> 396     raise ValueError(msg)
    398 return IOArgs(
    399     filepath_or_buffer=filepath_or_buffer,
    400     encoding=encoding,
   (...)
    403     mode=mode,
    404 )

ValueError: Invalid file path or buffer object type: <class 'modin.core.io.file_dispatcher.OpenFile'>
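
The final error says pandas was handed the OpenFile wrapper object itself rather than a path or an opened handle. A minimal sketch of that failure mode follows, assuming OpenFile behaves like a context manager that only produces a readable handle once entered. LazyOpen and looks_file_like are hypothetical stand-ins written for illustration, not Modin or pandas code.

```python
import io


class LazyOpen:
    """Hypothetical stand-in for a wrapper that opens its file only inside `with`."""

    def __init__(self, text):
        self._text = text
        self._handle = None

    def __enter__(self):
        # The readable handle exists only after the context is entered.
        self._handle = io.StringIO(self._text)
        return self._handle

    def __exit__(self, exc_type, exc, tb):
        self._handle.close()
        return False


def looks_file_like(obj):
    # Rough approximation of a reader's check: readable and iterable.
    return hasattr(obj, "read") and hasattr(obj, "__iter__")


wrapper = LazyOpen("col0\n1\n")
# The wrapper itself has no read(), so a reader would reject it.
wrapper_is_file_like = looks_file_like(wrapper)

with wrapper as handle:
    # The handle returned by __enter__ is an ordinary file-like object.
    handle_is_file_like = looks_file_like(handle)
    header = handle.readline().strip()
```

Under that assumption, a reader would need the handle obtained inside the with block, not the wrapper.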
@mvashishtha
Collaborator

@c3-cjazra thank you for reporting this. I can reproduce it with this script at Modin version 0.14.0+2.gf41432c1:

import modin.experimental.pandas as pd

pd.DataFrame([[1]], columns=["col0"]).to_csv("/tmp/1x1.csv")

pd.read_csv_glob(
    filepath_or_buffer="/tmp/1x1.csv",
    usecols=["col0"],
)

We'll try to fix the bug soon.
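
Until a fix lands, a possible workaround (untested against the affected versions) follows from the traceback: the failing branch in csv_glob_dispatcher.py is only entered when usecols is not of integer type, so passing integer positions might avoid it. Below is a rough sketch that resolves column names to positions using only the standard library. column_positions is a hypothetical helper, and it assumes the first row of the file is the header.

```python
import csv
import os
import tempfile


def column_positions(path, names, **reader_kwargs):
    """Map column names to zero-based positions using only the header row."""
    with open(path, newline="") as f:
        header = next(csv.reader(f, **reader_kwargs))
    # list.index raises ValueError if a requested name is not in the header.
    return [header.index(name) for name in names]


# Self-contained demo file: an unnamed index column followed by two data columns.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(demo_path, "w", newline="") as f:
    csv.writer(f).writerows([["", "col0", "col1"], [0, 1, 2]])

positions = column_positions(demo_path, ["col1", "col0"])
```

The resulting list could then be passed as usecols=positions to read_csv_glob. With a glob that matches several files, this only makes sense if every file shares the same header layout.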

@mvashishtha mvashishtha added the bug 🦗 Something isn't working label Apr 8, 2022
@anmyachev anmyachev self-assigned this Apr 20, 2022
anmyachev added a commit to anmyachev/modin that referenced this issue Apr 20, 2022
…arameter

Signed-off-by: Anatoly Myachev <anatoliimyachev@mail.com>
YarShev pushed a commit that referenced this issue Apr 21, 2022
Signed-off-by: Anatoly Myachev <anatoliimyachev@mail.com>
@c3-cjazra
Author

Thanks @YarShev! Which release do you expect this to land in?

@YarShev
Collaborator

YarShev commented Apr 21, 2022

@c3-cjazra, we'll put this in the 0.14.1 release, which should happen very soon.

devin-petersohn pushed a commit that referenced this issue May 4, 2022
Signed-off-by: Anatoly Myachev <anatoliimyachev@mail.com>