Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

to_pandas in PandasQueryCompiler loses meta information about indices and columns #1726

Closed
dchigarev opened this issue Jul 14, 2020 · 1 comment · Fixed by #1727
Closed
Assignees
Labels
bug 🦗 Something isn't working
Milestone

Comments

@dchigarev
Copy link
Collaborator

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10
  • Modin version (modin.__version__): 0.7.3+198.ge4e3fecb
  • Python version: 3.7.5
  • Code we can use to reproduce:
if __name__ == "__main__":
    import modin.pandas as pd
    import numpy as np

    data = {"A": np.arange(256)}

    index = pd.MultiIndex.from_tuples((i, i*2) for i in np.arange(257)).drop(0)

    md_df = pd.DataFrame(data, index=index)

    print("md_df.index:", len(md_df.index.levels[0]))  # md_df.index: 257 
    print("pd_df.index:", len(md_df._to_pandas().index.levels[0]))  # pd_df.index: 256

Describe the problem

to_pandas loses information about index levels which is useful for operations that operates with indices levels internally (like unstack)

@dchigarev dchigarev added the bug 🦗 Something isn't working label Jul 14, 2020
@dchigarev dchigarev added this to the 0.8.0 milestone Jul 14, 2020
@dchigarev dchigarev self-assigned this Jul 14, 2020
@dchigarev dchigarev linked a pull request Jul 14, 2020 that will close this issue
5 tasks
@dchigarev
Copy link
Collaborator Author

dchigarev commented Jul 16, 2020

I figured out that to_pandas does not consider indices and columns from modin_frame at all, only from partitions. For example if modin frame has columns = ['A', 'B'] but in partitions columns = [1, 2, 3] (different columns lengths could appear because of incorrect function implementation) then to_pandas will return us the DataFrame with columns = [1, 2, 3].

More explicit example why it is important:
Many tests in modin uses df_equals to compare modin_result with pandas_result. Under the hood df_equals translates modin_df to pandas and after that compares its like two pandas dataframes. Since to_pandas does not consider modin frame indices, we will never know if modin frame itself has correct indices/columns in such comparison.

For example, let's take nlargest function, its tests pass ok, but when we would like to use its result without converting it to pandas, we will see that it's incorrect (#1734).

Many tests started to fail after I fixed that to_pandas does not use modin frame indices in the conversion to pandas (#1729 #1731 #1732 #1733 #1734), I've fixed most of them in #1727.

@devin-petersohn please take a look, I guess that all the PRs with new pandas API functions implementation should be blocked for now, because it can't be properly tested

dchigarev added a commit to dchigarev/modin that referenced this issue Jul 23, 2020
This fix is also fixing inconsistent indices in some of operations (modin-project#1731, modin-project#1732 and modin-project#1734)

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
dchigarev added a commit to dchigarev/modin that referenced this issue Jul 23, 2020
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug 🦗 Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant