Encoding error : UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 2: invalid continuation byte #976

Closed
ghsama opened this issue Jan 8, 2020 · 12 comments · Fixed by #980 or #2593

@ghsama

ghsama commented Jan 8, 2020

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): ubuntu 18.04
  • Modin installed from: pip install modin[ray]
  • Modin version: 0.6.3
  • Python version: 3.7.3

Describe the problem

Hello,
I'm trying to use Modin to reduce peak memory usage, given the volume of my data, so I replaced pandas with modin.pandas. I tried a simple read of a file encoded in 'latin-1' (French). With pandas everything goes smoothly, but with Modin I get the following encoding error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 2: invalid continuation byte

The script used (which works fine with pandas but not with Modin):
caract = pd.read_csv(path, sep="\t", encoding = "ISO-8859-1")

PS: I tried other encodings with the same result: they work with pandas but not with Modin (backed by Ray): ISO-8859-1, ISO-8859-9, latin-1.

Is there any solution?

thanks

Source code / logs

`RayTaskError: ray_worker (pid=10815, host=ubuntu)
File "pandas/_libs/parsers.pyx", line 1297, in pandas._libs.parsers.TextReader._string_convert
File "pandas/_libs/parsers.pyx", line 1520, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 2: invalid continuation byte

During handling of the above exception, another exception occurred:

ray_worker (pid=10815, host=ubuntu)
File "/home/lasngd/.conda/envs/pytorch/lib/python3.7/site-packages/modin/engines/ray/task_wrapper.py", line 8, in deploy_ray_func
return func(**args)
File "/home/lasngd/.conda/envs/pytorch/lib/python3.7/site-packages/modin/backends/pandas/parsers.py", line 69, in parse
pandas_df = pandas.read_csv(BytesIO(to_read), **kwargs)
File "/home/lasngd/.conda/envs/pytorch/lib/python3.7/site-packages/pandas/io/parsers.py", line 685, in parser_f
return _read(filepath_or_buffer, kwds)
File "/home/lasngd/.conda/envs/pytorch/lib/python3.7/site-packages/pandas/io/parsers.py", line 463, in _read
data = parser.read(nrows)
File "/home/lasngd/.conda/envs/pytorch/lib/python3.7/site-packages/pandas/io/parsers.py", line 1154, in read
ret = self._engine.read(nrows)
File "/home/lasngd/.conda/envs/pytorch/lib/python3.7/site-packages/pandas/io/parsers.py", line 2059, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 881, in pandas._libs.parsers.TextReader.read
File "pandas/_libs/parsers.pyx", line 896, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas/_libs/parsers.pyx", line 973, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 1105, in pandas._libs.parsers.TextReader._convert_column_data
File "pandas/_libs/parsers.pyx", line 1158, in pandas._libs.parsers.TextReader._convert_tokens
File "pandas/_libs/parsers.pyx", line 1281, in pandas._libs.parsers.TextReader._convert_with_dtype
File "pandas/_libs/parsers.pyx", line 1297, in pandas._libs.parsers.TextReader._string_convert
File "pandas/_libs/parsers.pyx", line 1520, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 2: invalid continuation byte`
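
For context, the error message itself can be reproduced without pandas or Modin. This is a minimal sketch using a made-up sample word, not data from the reporter's file: in ISO-8859-1, `é` is the single byte 0xe9, which a UTF-8 decoder treats as the start of a three-byte sequence and then rejects when the next byte is not a continuation byte.

```python
# Hypothetical sample text; the reporter's actual file is not shown in the thread.
raw = "créée".encode("iso-8859-1")  # b'cr\xe9\xe9e'

# Decoding with the encoding the bytes were written in works.
decoded = raw.decode("iso-8859-1")

# Decoding the same bytes as UTF-8 fails on the first 0xe9.
try:
    raw.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(decoded)
print(error)
```

This is why the problem only shows up when the `encoding` argument is lost somewhere between the user's call and the code that actually decodes the bytes.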

@devin-petersohn
Collaborator

Hi @ghsama, thanks for posting!

Would you mind sharing a couple of lines of synthetic or real data so I can reproduce this locally? Our test suite covers various encodings, but I haven't been able to reproduce the error either locally or there. If you can share some data that reproduces the issue for you, I will be able to diagnose and fix it. Thanks!

@hvardhan20

System information
OS Platform Windows 10 Home
Modin installed from: pip install modin[dask]
Modin version: 0.6.3
Python version: 3.7.3

Hi @devin-petersohn,

I am facing a similar issue to @ghsama's on Windows, with Modin using the Dask engine. With vanilla pandas this works just fine:
pd.read_csv('fires_50k.csv', encoding = "ISO-8859-1")
However, reading the same CSV with modin.pandas like this:
mpd.read_csv('fires_50k.csv', encoding = "ISO-8859-1")
throws this UnicodeDecodeError:

Traceback (most recent call last):
File "", line 1, in
File "D:\Users\hvard\Anaconda3\lib\site-packages\modin\pandas\io.py", line 97, in parser_func
return _read(**kwargs)
File "D:\Users\hvard\Anaconda3\lib\site-packages\modin\pandas\io.py", line 110, in _read
pd_obj = BaseFactory.read_csv(**kwargs)
File "D:\Users\hvard\Anaconda3\lib\site-packages\modin\data_management\factories.py", line 52, in read_csv
return cls._determine_engine()._read_csv(**kwargs)
File "D:\Users\hvard\Anaconda3\lib\site-packages\modin\data_management\factories.py", line 56, in _read_csv
return cls.io_cls.read_csv(**kwargs)
File "D:\Users\hvard\Anaconda3\lib\site-packages\modin\engines\base\io\text\csv_reader.py", line 197, in read
row_lengths = cls.materialize(index_ids)
File "D:\Users\hvard\Anaconda3\lib\site-packages\modin\engines\dask\task_wrapper.py", line 20, in materialize
return client.gather(future)
File "D:\Users\hvard\Anaconda3\lib\site-packages\distributed\client.py", line 1876, in gather
asynchronous=asynchronous,
File "D:\Users\hvard\Anaconda3\lib\site-packages\distributed\client.py", line 771, in sync
self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
File "D:\Users\hvard\Anaconda3\lib\site-packages\distributed\utils.py", line 334, in sync
raise exc.with_traceback(tb)
File "D:\Users\hvard\Anaconda3\lib\site-packages\distributed\utils.py", line 318, in f
result[0] = yield future
File "D:\Users\hvard\Anaconda3\lib\site-packages\tornado\gen.py", line 1133, in run
value = future.result()
File "D:\Users\hvard\Anaconda3\lib\site-packages\distributed\client.py", line 1732, in _gather
raise exception.with_traceback(traceback)
File "D:\Users\hvard\Anaconda3\lib\site-packages\modin\backends\pandas\parsers.py", line 69, in parse
pandas_df = pandas.read_csv(BytesIO(to_read), **kwargs)
File "D:\Users\hvard\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 685, in parser_f
return _read(filepath_or_buffer, kwds)
File "D:\Users\hvard\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 463, in _read
data = parser.read(nrows)
File "D:\Users\hvard\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1154, in read
ret = self._engine.read(nrows)
File "D:\Users\hvard\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 2059, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 881, in pandas._libs.parsers.TextReader.read
File "pandas/_libs/parsers.pyx", line 896, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas/_libs/parsers.pyx", line 973, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 1105, in pandas._libs.parsers.TextReader._convert_column_data
File "pandas/_libs/parsers.pyx", line 1158, in pandas._libs.parsers.TextReader._convert_tokens
File "pandas/_libs/parsers.pyx", line 1281, in pandas._libs.parsers.TextReader._convert_with_dtype
File "pandas/_libs/parsers.pyx", line 1297, in pandas._libs.parsers.TextReader._string_convert
File "pandas/_libs/parsers.pyx", line 1520, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 8: invalid start byte

This is the CSV I am trying to read. It is a 50K-row subset of a larger dataset with 1.88M rows. I think you should be able to reproduce the issue with this data. Please let me know if you cannot.

Thanks!

@ghsama
Author

ghsama commented Jan 9, 2020

Hi @devin-petersohn,
here is a part of the CSV file that causes the bug: csvFile
Thanks

@devin-petersohn
Collaborator

Thanks @ghsama and @hvardhan20! I was able to reproduce the issue with your files.

In short, this line was the culprit, preventing the CSV reader from seeing the encoding:

https://github.com/modin-project/modin/blob/master/modin/backends/pandas/parsers.py#L79

We should add more testing for a wider variety of encodings because this was missed by our tests.

I will get this resolved, thanks so much for reporting!
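
The failure mode can be illustrated outside Modin. This is a simplified sketch, not Modin's actual parser code: it assumes, based on the tracebacks above, that each worker parses a chunk of raw bytes with `pandas.read_csv(BytesIO(...), **kwargs)`. The sample data and the `parse_chunk` helper are hypothetical. If `encoding` does not reach that call, pandas falls back to UTF-8 and the chunk fails to decode.

```python
from io import BytesIO

import pandas

# Hypothetical tab-separated chunk, encoded the way the reporter's file was.
chunk = "id\tnom\n1\tcréée\n".encode("ISO-8859-1")


def parse_chunk(to_read, **kwargs):
    """Parse one chunk of raw bytes, as a worker would (simplified)."""
    return pandas.read_csv(BytesIO(to_read), **kwargs)


# Encoding forwarded: the chunk decodes correctly.
df = parse_chunk(chunk, sep="\t", encoding="ISO-8859-1")

# Encoding dropped: pandas assumes UTF-8 and raises.
try:
    parse_chunk(chunk, sep="\t")
    dropped_encoding_failed = False
except UnicodeDecodeError:
    dropped_encoding_failed = True
```

Both calls read exactly the same bytes; the only difference is whether the `encoding` keyword survives the trip to the worker.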

@devin-petersohn devin-petersohn added this to the 0.7.0 milestone Jan 10, 2020
@devin-petersohn devin-petersohn added the bug 🦗 Something isn't working label Jan 10, 2020
devin-petersohn added a commit to devin-petersohn/modin that referenced this issue Jan 10, 2020
* Resolves modin-project#976
* Change default value in `kwargs.get` to match pandas
* Add parametrized test for `encoding` with a variety of new encodings
devin-petersohn added a commit that referenced this issue Jan 10, 2020
* Resolves #976
* Change default value in `kwargs.get` to match pandas
* Add parametrized test for `encoding` with a variety of new encodings
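
A check in the spirit of that commit might look like the sketch below. It is not the test added to Modin: the `round_trips` helper, the sample strings, and the list of encodings are illustrative assumptions, and it is written as a plain loop rather than with a test framework's parametrization. The reader function is passed in so that the same check can be run against `pandas.read_csv` and, where Modin is installed, `modin.pandas.read_csv`.

```python
import os
import tempfile

import pandas

# Hypothetical samples: each string is representable in its encoding
# but is not valid UTF-8 once encoded.
SAMPLES = {
    "ISO-8859-1": "créée",
    "ISO-8859-9": "çağrı",
    "windows-1251": "Москва",
}


def round_trips(read_csv, encoding, text):
    """Write `text` to a CSV in `encoding` and read it back with `read_csv`."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(f"col\n{text}\n".encode(encoding))
        df = read_csv(path, encoding=encoding)
        return df["col"][0] == text
    finally:
        os.remove(path)


results = {
    enc: round_trips(pandas.read_csv, enc, text) for enc, text in SAMPLES.items()
}
```
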
@sector119

sector119 commented Dec 30, 2020

MacOS Big Sur 11.2

python 3.8.6

ray = https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-1.2.0.dev0-cp38-cp38-macosx_10_13_x86_64.whl
modin from git = https://github.com/modin-project/modin

Got the same error with windows-1251 encoding. CSV file and traceback are attached.

file.txt

Traceback (most recent call last):
File "/Users/sector119/.pyenv/versions/3.8.6/envs/unity-epsilon-3.8.6/bin/unity", line 5, in <module>
app()
File "/Users/sector119/.pyenv/versions/3.8.6/envs/unity-epsilon-3.8.6/lib/python3.8/site-packages/typer/main.py", line 214, in __call__
return get_command(self)(*args, **kwargs)
File "/Users/sector119/.pyenv/versions/3.8.6/envs/unity-epsilon-3.8.6/lib/python3.8/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/Users/sector119/.pyenv/versions/3.8.6/envs/unity-epsilon-3.8.6/lib/python3.8/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/Users/sector119/.pyenv/versions/3.8.6/envs/unity-epsilon-3.8.6/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/Users/sector119/.pyenv/versions/3.8.6/envs/unity-epsilon-3.8.6/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/Users/sector119/.pyenv/versions/3.8.6/envs/unity-epsilon-3.8.6/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/Users/sector119/.pyenv/versions/3.8.6/envs/unity-epsilon-3.8.6/lib/python3.8/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/Users/sector119/.pyenv/versions/3.8.6/envs/unity-epsilon-3.8.6/lib/python3.8/site-packages/typer/main.py", line 497, in wrapper
return callback(**use_params) # type: ignore
File "/Users/sector119/PycharmProjects/epsilon_unity/unity/lib/cmd/commands/db/commands/read.py", line 18, in read
adapter.read(cache)
File "/Users/sector119/PycharmProjects/epsilon/epsilon/lib/profiling.py", line 9, in timed
output = func(*args, **kw)
File "/Users/sector119/PycharmProjects/epsilon_unity/unity/lib/parser/adapter/plugins/base.py", line 87, in read
self.dataframe = source_type.dataframe()
File "/Users/sector119/PycharmProjects/epsilon_unity/unity/lib/parser/type/plugins/csv.py", line 10, in dataframe
df = pd.read_csv(self.filepath,
File "/Users/sector119/.pyenv/versions/3.8.6/envs/unity-epsilon-3.8.6/lib/python3.8/site-packages/modin/pandas/io.py", line 115, in parser_func
return _read(**kwargs)
File "/Users/sector119/.pyenv/versions/3.8.6/envs/unity-epsilon-3.8.6/lib/python3.8/site-packages/modin/pandas/io.py", line 133, in _read
pd_obj = EngineDispatcher.read_csv(**kwargs)
File "/Users/sector119/.pyenv/versions/3.8.6/envs/unity-epsilon-3.8.6/lib/python3.8/site-packages/modin/data_management/factories/dispatcher.py", line 104, in read_csv
return cls.__engine._read_csv(**kwargs)
File "/Users/sector119/.pyenv/versions/3.8.6/envs/unity-epsilon-3.8.6/lib/python3.8/site-packages/modin/data_management/factories/factories.py", line 87, in _read_csv
return cls.io_cls.read_csv(**kwargs)
File "/Users/sector119/.pyenv/versions/3.8.6/envs/unity-epsilon-3.8.6/lib/python3.8/site-packages/modin/engines/base/io/file_dispatcher.py", line 29, in read
query_compiler = cls._read(*args, **kwargs)
File "/Users/sector119/.pyenv/versions/3.8.6/envs/unity-epsilon-3.8.6/lib/python3.8/site-packages/modin/engines/base/io/text/csv_dispatcher.py", line 74, in _read
empty_pd_df = pandas.read_csv(filepath_or_buffer, nrows=0)
File "/Users/sector119/.pyenv/versions/3.8.6/envs/unity-epsilon-3.8.6/lib/python3.8/site-packages/pandas/io/parsers.py", line 688, in read_csv
return _read(filepath_or_buffer, kwds)
File "/Users/sector119/.pyenv/versions/3.8.6/envs/unity-epsilon-3.8.6/lib/python3.8/site-packages/pandas/io/parsers.py", line 454, in _read
parser = TextFileReader(fp_or_buf, **kwds)
File "/Users/sector119/.pyenv/versions/3.8.6/envs/unity-epsilon-3.8.6/lib/python3.8/site-packages/pandas/io/parsers.py", line 948, in __init__
self._make_engine(self.engine)
File "/Users/sector119/.pyenv/versions/3.8.6/envs/unity-epsilon-3.8.6/lib/python3.8/site-packages/pandas/io/parsers.py", line 1180, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/Users/sector119/.pyenv/versions/3.8.6/envs/unity-epsilon-3.8.6/lib/python3.8/site-packages/pandas/io/parsers.py", line 2010, in __init__
self._reader = parsers.TextReader(src, **kwds)
File "pandas/_libs/parsers.pyx", line 537, in pandas._libs.parsers.TextReader.__cinit__
File "pandas/_libs/parsers.pyx", line 740, in pandas._libs.parsers.TextReader._get_header
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcc in position 32: invalid continuation byte

@Garra1980
Collaborator

cc @anmyachev

@anmyachev anmyachev self-assigned this Jan 11, 2021
@anmyachev
Collaborator

@sector119 thanks for posting. I was able to reproduce the error, but only when using the names parameter in the read_csv call.

Can you share your read_csv call? Maybe there are other details.

@sector119

sector119 commented Jan 11, 2021

df = pd.read_csv( filepath, encoding='windows-1251', names=['field', 'names', 'here'], sep=';', skiprows=0, na_values='\\N', engine='c' )

Thank You

anmyachev added a commit to anmyachev/modin that referenced this issue Jan 12, 2021
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
amyskov pushed a commit that referenced this issue Jan 13, 2021
* FIX-#976: add failed test

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

* FIX-#976: add encoding parameter to read_csv call

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

* FIX-#976: fix test in experimental mode

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
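
The second failure can be sketched the same way. Judging from the traceback above, Modin probed the file with `pandas.read_csv(filepath_or_buffer, nrows=0)` before dispatching the real read, and that probe did not carry `encoding`. The snippet below reproduces the idea with plain pandas; the file contents and column names are made up for illustration, and the actual fix in Modin may differ in detail.

```python
import os
import tempfile

import pandas

# Hypothetical headerless, semicolon-separated file in windows-1251.
fd, path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "wb") as f:
    f.write("Москва;1;2\nКазань;3;4\n".encode("windows-1251"))

try:
    # Probe without the encoding: the first line is decoded as UTF-8 and fails.
    try:
        pandas.read_csv(path, nrows=0)
        probe_failed = False
    except UnicodeDecodeError:
        probe_failed = True

    # Probe with the encoding forwarded: succeeds and returns no rows.
    probe = pandas.read_csv(path, nrows=0, encoding="windows-1251")

    # The user's call shape: explicit names, so no header row is consumed.
    df = pandas.read_csv(
        path,
        encoding="windows-1251",
        names=["city", "a", "b"],
        sep=";",
    )
finally:
    os.remove(path)
```
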
aregm added a commit to aregm/modin that referenced this issue Feb 18, 2021

* FIX-modin-project#2374: remove extra code; add pandas way to handle duplicate values in reindex func for binary operations (modin-project#2378)

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

* TEST-modin-project#2297: Cover by tests Internal parameters of read_csv (modin-project#2502)

Signed-off-by: Alexander Myskov <alexander.myskov@intel.com>

* Ensure excel reader closes file if it is passed as path (modin-project#2514)

Signed-off-by: Vasilij Litvinov <vasilij.n.litvinov@intel.com>

* FEAT-modin-project#2375: implementation of multi-column groupby aggregation (modin-project#2461)

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

* FIX-modin-project#2442: fixed Series assignment with different indices (modin-project#2443)

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

* FEAT-modin-project#2013: merge_asof that is a little more efficient (modin-project#2510)

* FEAT-modin-project#2013: merge_asof that is a little more efficient.

Signed-off-by: Itamar Turner-Trauring <itamar@itamarst.org>

Signed-off-by: Devin Petersohn <devin.petersohn@gmail.com>

* DOCS-modin-project#2436: Explicit local / single node backend (modin-project#2483)

Signed-off-by: raphaelauv <raphaelauv@users.noreply.github.com>

* Fix indices when reading Excel files in parallel (modin-project#2526)

Signed-off-by: Vasilij Litvinov <vasilij.n.litvinov@intel.com>

* FIX-modin-project#2527: Use random name for hdf file test, clean file after testing (modin-project#2528)

Signed-off-by: Vasilij Litvinov <vasilij.n.litvinov@intel.com>

* FIX-modin-project#2524: Update pandas version to 1.1.5 (modin-project#2525)

Signed-off-by: Igoshev, Yaroslav <yaroslav.igoshev@intel.com>

* FIX-modin-project#2408: Fix read_csv and read_table args when used inside a decora… (modin-project#2486)

Signed-off-by: Weiwen Gu <gwengww@gmail.com>

* FIX-modin-project#2169: avoid unnecessary index access in groupby (modin-project#2469)

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

* FIX-modin-project#2313: improved handling non-numeric types at 'mean' when 'axis=1' (modin-project#2535)

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

* TEST-modin-project#2509: Io tests refactoring (modin-project#2523)

* TEST-modin-project#2509: refactor read_csv tests

Signed-off-by: Alexander Myskov <alexander.myskov@intel.com>

TEST-modin-project#2509: refactor tests with warnings

Signed-off-by: Alexander Myskov <alexander.myskov@intel.com>

TEST-modin-project#2509: read_parquet tests refactoring

Signed-off-by: Alexander Myskov <alexander.myskov@intel.com>

TEST-modin-project#2509: read_json tests refactoring

Signed-off-by: Alexander Myskov <alexander.myskov@intel.com>

TEST-modin-project#2509: read_excel tests refactoring

Signed-off-by: Alexander Myskov <alexander.myskov@intel.com>

TEST-modin-project#2509: read_hdf tests refactoring

Signed-off-by: Alexander Myskov <alexander.myskov@intel.com>

TEST-modin-project#2509: add html and sql tests

Signed-off-by: Alexander Myskov <alexander.myskov@intel.com>

TEST-modin-project#2509: fwf tests refactoring

Signed-off-by: Alexander Myskov <alexander.myskov@intel.com>

TEST-modin-project#2509: further tests refactoring

Signed-off-by: Alexander Myskov <alexander.myskov@intel.com>

TEST-modin-project#2509: mark xfailed tests and fix

Signed-off-by: Alexander Myskov <alexander.myskov@intel.com>

TEST-modin-project#2509: fix

Signed-off-by: Alexander Myskov <alexander.myskov@intel.com>

TEST-modin-project#2509: further refactoring

Signed-off-by: Alexander Myskov <alexander.myskov@intel.com>

TEST-modin-project#2509: correct teardown stage

Signed-off-by: Alexander Myskov <alexander.myskov@intel.com>

* TEST-modin-project#2509: mark failed tests

Signed-off-by: Alexander Myskov <alexander.myskov@intel.com>

* TEST-modin-project#2509: fix

Signed-off-by: Alexander Myskov <alexander.myskov@intel.com>

* TEST-modin-project#2509: correct test_HDFStore test

Signed-off-by: Alexander Myskov <alexander.myskov@intel.com>

* TEST-modin-project#2509: use common teardown function

Signed-off-by: Alexander Myskov <alexander.myskov@intel.com>

* TEST-modin-project#2509: typo fix

Signed-off-by: Alexander Myskov <alexander.myskov@intel.com>

* TEST-modin-project#2509: fix

Signed-off-by: Alexander Myskov <alexander.myskov@intel.com>

* TEST-modin-project#2509: addressing review comments

Co-authored-by: Anatoly Myachev <45976948+anmyachev@users.noreply.github.com>
Signed-off-by: Alexander Myskov <alexander.myskov@intel.com>

* TEST-modin-project#2509: addressing review comments

Signed-off-by: Alexander Myskov <alexander.myskov@intel.com>

Co-authored-by: Anatoly Myachev <45976948+anmyachev@users.noreply.github.com>

* FIX-modin-project#2540: add __iter__ implementation (modin-project#2541)

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

* FEAT-modin-project#2520: add most important operations for asv benchmarks (modin-project#2539)

* FEAT-modin-project#2520: add most important operations for asv benchmarks

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

* FEAT-modin-project#2520: add groupby microbenchmarks

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

* FEAT-modin-project#2520: address review comments

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

* FIX-modin-project#2498: Fix possible number of partitions for Dask engine (modin-project#2532)

Signed-off-by: Igoshev, Yaroslav <yaroslav.igoshev@intel.com>

* FIX-modin-project#2550: remove decorators usage for asv tested functions (modin-project#2551)

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

* FEAT-modin-project#2236: Handling of space limited Ray Plasma directories (modin-project#2547)

Signed-off-by: Alexander Myskov <alexander.myskov@intel.com>

* DOCS-modin-project#2518: add asv usage topic (modin-project#2549)

* DOCS-modin-project#2518: add asv usage topic

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

* DOCS-modin-project#2518: fix style

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

* DOCS-modin-project#2518: address review comments

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

* FEAT-modin-project#2491: optimized groupby dictionary aggregation (modin-project#2534)

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

* FEAT-modin-project#2553: add ability to run microbenchmarks for old Modin version (modin-project#2554)

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

* Fix .loc[] assignment for Modin Series (modin-project#2555)

Signed-off-by: Vasilij Litvinov <vasilij.n.litvinov@intel.com>

* FIX-modin-project#2482: improved handling non-str 'by' (modin-project#2548)

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

* Fix taxi-runner.py cluster example (modin-project#2557)

* Added regression test
* Fix modin package installation

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

* Fix loc/iloc assignments when columns are selected (modin-project#2536)

* FIX-modin-project#1620: Add test for reported issue

Signed-off-by: Vasilij Litvinov <vasilij.n.litvinov@intel.com>

* FIX-modin-project#1620: Use pandas.reindex() properly

Signed-off-by: Vasilij Litvinov <vasilij.n.litvinov@intel.com>

* FIX-modin-project#1620: Improve tests

Signed-off-by: Vasilij Litvinov <vasilij.n.litvinov@intel.com>

* FIX-modin-project#1620: Convert lookups to values for both indices and columns

Signed-off-by: Vasilij Litvinov <vasilij.n.litvinov@intel.com>

* FIX-modin-project#1620: Add test for .loc[] ordering

Signed-off-by: Vasilij Litvinov <vasilij.n.litvinov@intel.com>

* FIX-modin-project#1620: XFail a test that unearths internal sorting

Signed-off-by: Vasilij Litvinov <vasilij.n.litvinov@intel.com>

* FIX-modin-project#1620: Improve test robustness a bit per code review

Signed-off-by: Vasilij Litvinov <vasilij.n.litvinov@intel.com>

* FIX-modin-project#2559: Ignore files from /proc/ when detecting file leaks (modin-project#2560)

Signed-off-by: Vasilij Litvinov <vasilij.n.litvinov@intel.com>

* Switch to Ray from conda-forge (modin-project#2562)

* FIX-modin-project#2561: Switch to Ray from conda-forge, abandon pip caching

Signed-off-by: Vasilij Litvinov <vasilij.n.litvinov@intel.com>

* FIX-modin-project#2561: Remove pip caching from push CI actions

Signed-off-by: Vasilij Litvinov <vasilij.n.litvinov@intel.com>

* FIX-modin-project#2566: Ensure `Series.unique` does not return a scalar when there is only one unique value (modin-project#2567)

* FIX-modin-project#2566: Ensure unique doesn't return a scalar using np.atleast_1d

Signed-off-by: Richard Lin <richard.lin.047@berkeley.edu>

* FIX-modin-project#2566: Check array shapes match for test_unique

Signed-off-by: Richard Lin <richard.lin.047@berkeley.edu>

* FIX-modin-project#2566: Reduce unique dimensions using constructor instead

Signed-off-by: Richard Lin <richard.lin.047@berkeley.edu>

* FIX-modin-project#2572: fixed arrow version in OmniSci dependencies (modin-project#2571)

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

* DOCS-modin-project#2578: fix simple typo, parition -> partition (modin-project#2573)

There is a small typo in modin/engines/dask/pandas_on_dask/frame/partition.py, modin/engines/ray/pandas_on_ray/frame/partition.py.

Should read `partition` rather than `parition`.

Signed-off-by: Tim Gates <tim.gates@iress.com>

* FIX-#0000: pin xlrd<=1.2.0 (modin-project#2594)

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

* FIX-modin-project#2543: fixed handling 'as_index' at groupby dictionary renaming aggregation (modin-project#2592)

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

* Release commit for version 0.8.3 (modin-project#2597)

Signed-off-by: Devin Petersohn <devin.petersohn@gmail.com>

* REFACTOR-modin-project#2580: Move automatic engine init to after data ingestion (modin-project#2581)

* REFACTOR-modin-project#2580: Move automatic engine init to after data ingestion

* Resolves modin-project#2580

Instead of automatically starting the engine when Modin is imported,
we start it after the first time the user reads or creates a dataframe.
This is intended to help downstream libraries not need the engine to
check for typing, as well as clear up some transient errors that can
occur with certain engines on large machines.

I have also added a warning message that informs the user how to clear
the message. We will likely need a way to suppress these warnings, because
many users will not care about them and may want to suppress them.
We will probably also want to add a benchmarking page on best practices
for benchmarking because this change can give the impression of a
performance degradation on data ingestion even though nothing is
changing from that perspective.

Signed-off-by: Devin Petersohn <devin.petersohn@gmail.com>

* REFACTOR-modin-project#2580: Add to experimental API

Signed-off-by: Devin Petersohn <devin.petersohn@gmail.com>

* REFACTOR-modin-project#2580: Add `read_feather` and `read_clipboard`

Signed-off-by: Devin Petersohn <devin.petersohn@gmail.com>

* REFACTOR-modin-project#2580: Remove redundant error message

Signed-off-by: Devin Petersohn <devin.petersohn@gmail.com>

* TEST-modin-project#2598: Add test for clean install from source (modin-project#2599)

* TEST-modin-project#2598: Add test for clean install from source

* Resolves modin-project#2598

This change adds a test for installing Modin without all of the testing
dependencies.

It is intended to test how a user who does not have all of the test
dependencies will see a Modin import.

* TEST-modin-project#2598: Target Python3

Signed-off-by: Devin Petersohn <devin.petersohn@gmail.com>

* FIX-modin-project#976: add encoding parameter to read_csv call (modin-project#2593)

* FIX-modin-project#976: add failed test

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

* FIX-modin-project#976: add encoding parameter to read_csv call

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

* FIX-modin-project#976: fix test in experimental mode

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

* FEAT-modin-project#2342: Add axis partitions API (modin-project#2515)

Signed-off-by: Igoshev, Yaroslav <yaroslav.igoshev@intel.com>

Co-authored-by: Devin Petersohn <devin.petersohn@gmail.com>

* Fixed MultiIndex.from_frame implementation (modin-project#2587)

Signed-off-by: Gregory Shimansky <gregory.shimansky@intel.com>

* FIX-modin-project#2608: Disable proxy for commands running inside container (modin-project#2609)

Signed-off-by: Gregory Shimansky <gregory.shimansky@intel.com>

* FIX-modin-project#2601: reduce data size for some asv tests (modin-project#2602)

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

* FIX-modin-project#2611: Fixed crash and sklearn version (modin-project#2612)

Signed-off-by: Gregory Shimansky <gregory.shimansky@intel.com>

* FEAT-modin-project#2604: add docker file with plasticc benchmark on omnisci (modin-project#2605)

* FEAT-modin-project#2604: add docker file with plasticc benchmark on omnisci
* FEAT-modin-project#2604: change xgboost verbose_eval

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

* DOCS-modin-project#2618: Add code of conduct (modin-project#2619)

* Resolves modin-project#2618

Signed-off-by: Devin Petersohn <devin.petersohn@gmail.com>

* FEAT-modin-project#2373: Add distributed xgboost on Modin with Ray (modin-project#2545)

Signed-off-by: Alexey Prutskov <alexey.prutskov@intel.com>

Co-authored-by: Devin Petersohn <devin.petersohn@gmail.com>

* FEAT-2624: Improve performance of read_* methods when file handles are passed in (modin-project#2625)

Signed-off-by: Zain Patel <zain.patel06@gmail.com>

* FIX-modin-project#2616: Add config for num partitions, deprecate DEFAULT_NPARTITIONS (modin-project#2622)

Signed-off-by: Devin Petersohn <devin.petersohn@gmail.com>

* FEAT-modin-project#2091: add distributed dataframe compare (modin-project#2579)

Signed-off-by: Khang Vu <khangvu200391845@gmail.com>

* DOCS-modin-project#2649: Fix github pr template's dead link. (modin-project#2650)

Signed-off-by: William Ma <williamwma5@gmail.com>

* FEAT-modin-project#2606: Support creating DataFrame from remote partitions (modin-project#2613)

Signed-off-by: Igoshev, Yaroslav <yaroslav.igoshev@intel.com>

* FIX-modin-project#2637: Fix deprecation warnings due to invalid escape sequences. (modin-project#2641)

Signed-off-by: Karthikeyan Singaravelan <tir.karthi@gmail.com>

* REFACTOR-modin-project#2648: Correct uses of MapReduceFunction and metadata manipu… (modin-project#2655)

* REFACTOR-modin-project#2648: Correct uses of MapReduceFunction and metadata manipulation

Resolves modin-project#2648

Removes some code that is problematic for performance. There was a mix
of use cases for modifying the external metadata and internal metadata,
and some problematic components of these APIs that could hide bugs. The
implementation has been updated to ensure that these bugs do not
resurface.

Previously, the internal and external indices were compared, and then
updated according to some arguments that were passed in. This is not
scalable because collecting the indices is expensive. The possible bugs
hidden in this implementation decision could end up being very difficult
to detect: it implicitly updates the internal or external indices based
on a somewhat cryptic string pattern combined with a boolean flag.
Another very large issue is that sometimes external indices are updated
based on the partition lengths metadata. This was likely done to solve a
use case of not using the APIs properly.

This implementation has been removed and replaced with something more
explicit. If the internal indices need to be updated, they are updated
explicitly via existing APIs. Likewise if external indices need to be
updated, they are updated with a different API.

Several QueryCompiler APIs had to be reverted because they were misusing
the ReductionFunction or MapReduceFunction, thus the need for the
implicit modification of metadata. When this implicit modification was
removed, these APIs no longer worked, and so were reverted until they
can be reimplemented using correct APIs. The following APIs were
reverted as a part of this commit:

* `is_monotonic_increasing`
* `is_monotonic_decreasing`
* `value_counts`
* `searchsorted`
* `dt_tz`
* `dt_freq`

Signed-off-by: Devin Petersohn <devin.petersohn@gmail.com>

* REFACTOR-modin-project#2648: Remove debug code

Signed-off-by: Devin Petersohn <devin.petersohn@gmail.com>

* REFACTOR-modin-project#2648: Fix explicit rename

Signed-off-by: Devin Petersohn <devin.petersohn@gmail.com>

* DOCS-2653: Fix links in Modin's documentation (modin-project#2654)

Signed-off-by: Alexey Prutskov <alexey.prutskov@intel.com>

* FEAT-modin-project#2663: Add algebraic operator `from_labels` (modin-project#2665)

Resolves modin-project#2663

This operator is necessary for efficient `reset_index` operations. See
this paper for more information on the operator:
http://www.vldb.org/pvldb/vol13/p2033-petersohn.pdf

Co-authored-by: William Ma <12377941+williamma12@users.noreply.github.com>

Signed-off-by: Devin Petersohn <devin.petersohn@gmail.com>

* FIX-modin-project#2672: pin numpy>=1.16.5,<1.20  (modin-project#2673)

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

* FEAT-modin-project#2675: Added benchmark for sort_values (modin-project#2676)

Signed-off-by: Gregory Shimansky <gregory.shimansky@intel.com>

* FEAT-modin-project#2664: Add `to_labels` algebraic operator (modin-project#2666)

Resolves modin-project#2664

This adds the algebraic operator for `to_labels`, which enables Modin to
better optimize the movement of data to metadata. See more in the paper
about the algebraic operator:
http://www.vldb.org/pvldb/vol13/p2033-petersohn.pdf

Co-authored-by: William Ma <williamwma5@gmail.com>

Signed-off-by: Devin Petersohn <devin.petersohn@gmail.com>

* FIX-modin-project#1806: Resolved error when reverting to Pandas for Multiindex (modin-project#2660)

Signed-off-by: Todd Yu <toddyu@berkeley.edu>

* FIX-modin-project#2614: Up python version for test jobs (modin-project#2615)

Signed-off-by: Igoshev, Yaroslav <yaroslav.igoshev@intel.com>

* DOCS-2633: Add documentation for distributed XGBoost on Modin (modin-project#2640)

Signed-off-by: Alexey Prutskov <alexey.prutskov@intel.com>

* FIX-modin-project#2667: Change names of files for development env (modin-project#2668)

Signed-off-by: Alexey Prutskov <alexey.prutskov@intel.com>

* FIX-modin-project#2658: Move backend check in xgb to train/predict (modin-project#2659)

Signed-off-by: Alexey Prutskov <alexey.prutskov@intel.com>

* FEAT-modin-project#2451: Read multiple csv files simultaneously via glob paths (modin-project#2662)

Signed-off-by: William Ma <williamwma5@gmail.com>

* FIX-modin-project#2681: pin numpy<1.20.0 for docker containers with omnisci (modin-project#2682)

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

* TEST-modin-project#2670: some updates to improve asv tests stability (modin-project#2671)

* TEST-modin-project#2670: some updates to improve asv tests stability

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

* TEST-modin-project#2670: fixes

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

* TEST-modin-project#2670: data_size -> shape

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

* TEST-modin-project#2670: use dict approach

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

* TEST-modin-project#2670: use CpuCount when Npartitions isn't defined

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

* TEST-modin-project#2670: fix ASV_DATASET_SIZE

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

* TEST-modin-project#2670: update TimeSortValues

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

* TEST-modin-project#2670: modify asv tests for using with old modin version

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

* TEST-modin-project#2670: reply to review comments

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

* TEST-modin-project#2670: use env variables for default values

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

* TEST-modin-project#2686: add fillna benchmark (modin-project#2687)

* TEST-modin-project#2686: add fillna benchmark

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

* TEST-modin-project#2686: reply to review comments

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

* TEST-modin-project#2686: add inplace parameter

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

* TEST-modin-project#2692: add drop benchmark (modin-project#2693)

* TEST-modin-project#2692: add drop benchmark

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

* TEST-modin-project#2692: add one column case

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

* FIX-modin-project#2688: Update ray.ObjectID to ray.ObjectRef for Ray 2.0 (modin-project#2695)

* FIX-modin-project#2688: Update ray.ObjectID to ray.ObjectRef for Ray 2.0

Resolves modin-project#2688

Signed-off-by: Devin Petersohn <devin.petersohn@gmail.com>

* FIX-modin-project#2688: Address comments

Signed-off-by: Devin Petersohn <devin.petersohn@gmail.com>

* TEST-modin-project#2707: add lint check for ASV benchmarks (modin-project#2708)

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

* TEST-modin-project#2699: add append benchmark (modin-project#2700)

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

* FIX-modin-project#2684: Add method level docs for Modin XGBoost (modin-project#2685)

Signed-off-by: Alexey Prutskov <alexey.prutskov@intel.com>

* TEST-modin-project#2694: add head benchmark (modin-project#2696)

* TEST-modin-project#2694: add head benchmark

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

* TEST-modin-project#2694: add small number for head op

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

* TEST-modin-project#2705: add 'value_counts' benchmarks (modin-project#2706)

* TEST-modin-project#2705: add 'value_counts' benchmarks

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

* TEST-modin-project#2705: apply suggestions from review

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

* FIX-modin-project#2709: fixed typo in '_copartition' (modin-project#2710)

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

* FIX-modin-project#2596: Update pandas version to 1.2.1 (modin-project#2600)

Co-authored-by: Alexey Prutskov <alexey.prutskov@intel.com>
Co-authored-by: Devin Petersohn <devin.petersohn@gmail.com>
Co-authored-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
Co-authored-by: Devin Petersohn <devin-petersohn@users.noreply.github.com>
Signed-off-by: Igoshev, Yaroslav <yaroslav.igoshev@intel.com>

* TEST-modin-project#2690: add astype benchmark (modin-project#2691)

* TEST-modin-project#2690: add astype benchmark

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

* TEST-modin-project#2690: add category dtype; use df.types

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

* TEST-modin-project#2690: add case with one column

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

* TEST-modin-project#2702: add loc/iloc benchmark (modin-project#2703)

* TEST-modin-project#2702: add loc/iloc benchmark

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

* TEST-modin-project#2702: add multiindex loc bench

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

* TEST-modin-project#2702: add row_loc check

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

* TEST-modin-project#2716: add describe bench (modin-project#2718)

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

* DOCS-modin-project#2717: Fix version of Modin for building latest docs (modin-project#2719)

Signed-off-by: Alexey Prutskov <alexey.prutskov@intel.com>

* FEAT-modin-project#1611: Add mod operation (modin-project#2726)

Signed-off-by: Alina <alina.bykovskaya@intel.com>

* TEST-modin-project#2725: add index, columns, shape benchmarks (modin-project#2727)

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

* FIX-modin-project#2305: fix handling of renaming aggregation (modin-project#2732)

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

* FIX-modin-project#2362: fix key handling in 'Series.__setitem__' (modin-project#2731)

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

* TEST-modin-project#2722: add ASV read_csv skiprows benchmark (modin-project#2724)

* TEST-modin-project#2722: add ASV read_csv skiprows benchmark

Co-authored-by: Anatoly Myachev <45976948+anmyachev@users.noreply.github.com>
Signed-off-by: Alexander Myskov <alexander.myskov@intel.com>

* FIX-modin-project#2735: move '.reindex' logic about axis dispatching from the base class (modin-project#2736)

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

* TEST-modin-project#1496: add tests for setting new column with different from frame length (modin-project#2733)

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

* REFACTOR-modin-project#2739: io tests refactoring (modin-project#2740)

Signed-off-by: Alexander Myskov <alexander.myskov@intel.com>

* TEST-modin-project#2753: add GroupBy benchmarks with huge amount of groups (modin-project#2754)

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

* FIX-modin-project#2362: fix handling slices in 'DataFrame.__setitem__' (modin-project#2741)

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

* FIX-modin-project#2742: fix performance degradation for dictionary GroupBy aggregation (modin-project#2743)

* FIX-modin-project#2742: changed callable functions to their names in dict aggregation

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

* FIX-modin-project#2742: comments added

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

* FIX-modin-project#2737: fix handling of dates for read_csv with OmniSci backend (modin-project#2738)

Co-authored-by: Anatoly Myachev <45976948+anmyachev@users.noreply.github.com>
Signed-off-by: Alexander Myskov <alexander.myskov@intel.com>

* DOCS-modin-project#2584: Add CODEOWNERS file (modin-project#2759)

* Resolves modin-project#2584

Signed-off-by: Devin Petersohn <devin.petersohn@gmail.com>

Co-authored-by: Anatoly Myachev <45976948+anmyachev@users.noreply.github.com>
Co-authored-by: Dmitry Chigarev <62142979+dchigarev@users.noreply.github.com>
Co-authored-by: ienkovich <ilya.enkovich@intel.com>
Co-authored-by: Alexey Prutskov <alexey.prutskov@intel.com>
Co-authored-by: Devin Petersohn <devin-petersohn@users.noreply.github.com>
Co-authored-by: amyskov <55585026+amyskov@users.noreply.github.com>
Co-authored-by: YarShev <yaroslav.igoshev@intel.com>
Co-authored-by: Gregory Shimansky <gregory.shimansky@intel.com>
Co-authored-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
Co-authored-by: Gregory Shimansky <gshimansky@users.noreply.github.com>
Co-authored-by: Reshama Shaikh <reshama.stat@gmail.com>
Co-authored-by: vfdev <vfdev.5@gmail.com>
Co-authored-by: Mohammed Kashif <mohammed15035@iiitd.ac.in>
Co-authored-by: Mohammed Kashif <md.kashif.py93@gmail.com>
Co-authored-by: Abdulelah S. Al Mesfer <28743265+abdulelahsm@users.noreply.github.com>
Co-authored-by: Abolfazl Shahbazi <abolfazl.shahbazi@intel.com>
Co-authored-by: Vasily Litvinov <45396231+vnlitvinov@users.noreply.github.com>
Co-authored-by: Itamar Turner-Trauring <itamar@itamarst.org>
Co-authored-by: raphaelauv <raphaelauv@users.noreply.github.com>
Co-authored-by: Weiwen Gu <gwengww@gmail.com>
Co-authored-by: Richard Lin <35508487+richardlin047@users.noreply.github.com>
Co-authored-by: Tim Gates <tim.gates@iress.com>
Co-authored-by: Devin Petersohn <devin.petersohn@gmail.com>
Co-authored-by: Zain Patel <zain.patel06@gmail.com>
Co-authored-by: Khang Vu <khangvu200391845@gmail.com>
Co-authored-by: William Ma <12377941+williamma12@users.noreply.github.com>
Co-authored-by: Karthikeyan Singaravelan <tir.karthi@gmail.com>
Co-authored-by: Todd Yu <toddyu@berkeley.edu>
Co-authored-by: Alina Bykovskaya <alina.bykovskaya@intel.com>
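For readers landing here from the original report: the commit titled "FIX-modin-project#976: add encoding parameter to read_csv call" suggests the caller's `encoding` was not reaching the code that parses each chunk. The sketch below shows that failure mode using only the standard library. It is not Modin's implementation; `parse_chunk` and the sample data are hypothetical and exist only for illustration.

```python
import csv
import io


def parse_chunk(chunk: bytes, sep: str = "\t", encoding: str = "utf-8"):
    """Hypothetical worker: decode one byte chunk and split it into rows."""
    text = chunk.decode(encoding)
    return list(csv.reader(io.StringIO(text), delimiter=sep))


# Two tab-separated lines encoded as ISO-8859-1 ("é" is the single byte 0xE9).
data = "prénom\tville\nRené\tOrléans\n".encode("ISO-8859-1")

# Split on line boundaries, roughly as a parallel reader might.
first, second = data.splitlines(keepends=True)

# If the encoding is not forwarded, the worker falls back to UTF-8 and fails.
try:
    parse_chunk(first)
    failed = False
except UnicodeDecodeError as err:
    failed = True
    message = str(err)

# Forwarding the caller's encoding to every chunk parses cleanly.
rows = [
    row
    for chunk in (first, second)
    for row in parse_chunk(chunk, encoding="ISO-8859-1")
]
```

Here `message` reports byte 0xe9 at position 2 as an invalid continuation byte, the same complaint as in the issue title, because 0xE9 starts a multi-byte sequence in UTF-8 and the next byte does not continue it.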
@rchouhan170590

similar type of error:

Traceback (most recent call last):
File "", line 1, in
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xce in position 303: invalid continuation byte

@anmyachev
Collaborator

> similar type of error:
>
> Traceback (most recent call last): File "", line 1, in UnicodeDecodeError: 'utf-8' codec can't decode byte 0xce in position 303: invalid continuation byte

Hello @rchouhan170590, which Modin release are you using?

@rchouhan170590

I am not using Modin.
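Outside Modin, the same message usually means a file or byte string that is not UTF-8 is being decoded as UTF-8. A minimal sketch in plain Python follows. The sample text and the choice of `latin-1` are assumptions for illustration; the real encoding of the file behind the traceback above is unknown.

```python
import os
import tempfile

# Write a small Latin-1 file so the example is self-contained.
raw = "café crème\n".encode("latin-1")
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(raw)

# Reading it as UTF-8 raises UnicodeDecodeError.
try:
    with open(path, encoding="utf-8") as f:
        f.read()
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Naming the real encoding recovers the text.
with open(path, encoding="latin-1") as f:
    text = f.read()

# errors="replace" never raises, but substitutes U+FFFD for undecodable bytes,
# so it is better suited to diagnosis than to loading real data.
with open(path, encoding="utf-8", errors="replace") as f:
    lossy = f.read()

os.remove(path)
```

If the encoding is unknown, inspecting the raw bytes around the reported position (303 in the traceback above) is often the quickest way to guess it.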

@ikrammohamdi

I am facing this error.
