-
Notifications
You must be signed in to change notification settings - Fork 651
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FEAT-#1291 #1187: Add DataFrame.unstack
, Series.unstack
#1649
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1649 +/- ##
==========================================
+ Coverage 76.42% 79.94% +3.51%
==========================================
Files 79 79
Lines 9309 9352 +43
==========================================
+ Hits 7114 7476 +362
+ Misses 2195 1876 -319
Continue to review full report at Codecov.
|
5072522
to
26c001d
Compare
26c001d
to
94ef4d9
Compare
94ef4d9
to
f0b752b
Compare
modin/pandas/base.py
Outdated
@@ -3298,7 +3298,7 @@ def tz_localize( | |||
return self.set_axis(labels=new_labels, axis=axis, inplace=not copy) | |||
|
|||
def unstack(self, level=-1, fill_value=None): | |||
return self._default_to_pandas("unstack", level=level, fill_value=fill_value) | |||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the children both implement this, we can just delete it.
@@ -626,6 +626,20 @@ def sort_index_for_equal_values(result, ascending): | |||
|
|||
# END Reduction operations | |||
|
|||
def unstack(self, is_ser_out=False, is_ser_in=False, level=-1, fill_value=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The is_ser_out
and is_ser_in
should not be parameters. Query compiler should always be a dataframe and should not have this interface level detail. The API layer should handle the conversion with self._reduce_dimension
if it needs to change the dimension of the result. We try to avoid leaking these API details to the implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@devin-petersohn Query compiler has always been a dataframe in this function. Parameter is_ser_in
I used for determination for which object this unstack
function was called (in case of Series
need to apply squeeze()
for df
in map_func
). Parameter is_ser_out
I used for excluding of recomputing new_columns
in function apply_full_axis
(in case of Series
new_columns
was set to ["reduced"]).
self._reduce_dimension
doesn't solve my problem because this function calls constructor of Series
internally and result-object contains name = self._query_compiler.columns[0]
which should be None
in case of calling unstack
function in my case. Therefore, I added parameter new_columns
to unstack
in query_compiler
and set its into ["__reduced__"]
when result should be a Series
(line).
5081b15
to
c64706f
Compare
@prutskov it seems that I found a bug in your implementation. For that test case I get different from pandas behavior: code to reproduceif __name__ == "__main__":
import numpy as np
import modin.pandas as pd
import pandas
first_part = {
"A": ["one", "two", "three", "two"] * 64,
"B": ["foo", "bar", 1, 2, 3, 4, 5, 6] * 32,
}
middle_part = {f"sparse_column{i}": np.arange(256) for i in range(64)}
last_part = {
"C": [0, 1, 2, 3, 4, 5, 6, 7] * 32,
"D": [10, 11, 12, 13, 14, 15, 16, 17] * 32,
}
values = {**first_part, **middle_part, **last_part}
index = pd.MultiIndex.from_product(
[
["a", "b", "c", "d"],
["x", "y", "z", "last"],
["i", "j", "k", "index"],
[1, 2, 3, 4],
]
)
md_df, pd_df = (
pd.DataFrame(values, index=index),
pandas.DataFrame(values, index=index),
)
pd_res = pd_df.unstack(level=[0, 1, 2, 3])
print("PD RESULT:\n", pd_res)
md_res = md_df.unstack(level=[0, 1, 2, 3]) # IndexError: positional indexers are out-of-bounds
print("MD RESULT:\n", md_res) If we will set a breakpoint here, and look at resulting modin frame, we will see, that partitions at the same row have different indices, which does not suppose to be: long log
Why did that happened?
And then we will apply
code to reproduceif __name__ == "__main__":
import pandas as pd
values = {
"A": ["one", "two", "three", "one", "two", "three", "one", "two", "three"],
"B": ["foo", "bar", 1, 2, 3, 4, 5, 6, 7],
"C": [0, 1, 2, 3, 4, 5, 6, 7, 8],
"D": [10, 11, 12, 13, 14, 15, 16, 17, 18],
}
index = pd.MultiIndex.from_product([["a", "b", "c"], ["x", "y", "z"]])
df = pd.DataFrame(values)
df1 = df[["A", "B"]]
df2 = df[["C", "D"]]
print(df.unstack(level=[0, 1]))
print(df1.unstack(level=[0, 1]))
print(df2.unstack(level=[0, 1])) So it seems that P.S. behavior of that bug depends on splitting into partition, so I'll note that I've tested this on 4 cores CPU. |
@dchigarev Thank you! I added additional behavior for case |
unstack
function for DataFrame and Seriesunstack
function for DataFrame and Series
91ff00b
to
b2b3be8
Compare
modin/pandas/series.py
Outdated
return DataFrame( | ||
query_compiler=self._query_compiler.unstack( | ||
map_func=lambda df: pandas.DataFrame( | ||
df.squeeze(axis=1).unstack(level=level, fill_value=fill_value) | ||
), | ||
) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose that query_compiler::unstack
should repeat the interface of DataFrame::unstack
, because query_compiler
is backend-specific. Let's say another non-pandas backed already have unstack
implementation which expects level
and fill_value
params, but in that implementation, it will get some python callable (which actually could be not supported by non-pandas backends).
I've read your comment about the reasons you've implemented it this way, but do the results differ if we call unstack
on DataFrame
and DataFrame.squeeze()
? If it is, can we translate the result of unstack
from DataFrame_which_is_actually_series
to the correct answer? If no, maybe we should make two different implementations of unstack
in query_compiler
for Series
and for DataFrame
? @devin-petersohn, need your advice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I agree with @dchigarev. If possible, these details like map_func
should be kept in the query compiler because other engines may not be able to apply them.
unstack
is pivot
on a level of the index instead of a column. pivot
and unstack
are both a combination of groupby and transpose. It might be better to try to implement them this way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to fix my implementation using apply_full_axis
. But I get data with incorrect size of output when result is Series
. In reproducer below results should have 1089 rows, but I get only 1056 rows. Missed 33 rows are 33rd column of original DataFrame
. There are only two partitions (size of each is 528 rows) internally (in line ). Should there be third partition in this case, which will be consist 33rd column of original DataFrame
? @devin-petersohn Have you encountered such problem?
Code of test:
def test_unstack(self):
data = {f"Col{i}": np.arange(33) for i in range(33)}
modin_df = pd.DataFrame(data)
print(modin_df)
pandas_df = pandas.DataFrame(data)
res_p = pandas_df.unstack(level=0)
res_m = modin_df.unstack(level=0)
p_idx = res_p.index
m_idx = res_m.index
i = 0
for idx in p_idx:
print(idx, "_", m_idx[i])
i = i +1
df_equals(res_p, res_m)
Also I don't properly understand why ustack
is combination groupby
and transpose
. Can you explain what you mean?
b2b3be8
to
471585e
Compare
5955107
to
207bc74
Compare
bf659d3
to
7b072aa
Compare
@prutskov I've runned the performance measurements and that's what I got:
I believe that in a script below I reproduced the optimist path for (I think that I've previously runned this branch to test Script to measureimport numpy as np
import os
import pandas
from timeit import default_timer as timer
RAND_LOW = -100
RAND_HIGH = 100
N = 50000
M = 128
MULTILINE = False
TEST_FILENAME = os.path.abspath(
f"int_dataset-{N},{M},{RAND_LOW},{RAND_HIGH},{MULTILINE}.csv"
)
def generate_data_file(filename, row_n, col_n, multiline_rows=False):
data = {
f"col{i}": np.concatenate(
[
np.concatenate(
[
["some\nvery very very\nlong string\nwith many multilines"],
np.random.randint(RAND_LOW, RAND_HIGH, 9),
]
)
for _ in np.arange(row_n // 10)
]
)
if (i % 10 == 0 and multiline_rows)
else np.random.randint(RAND_LOW, RAND_HIGH, row_n)
for i in np.arange(col_n)
}
print("dict generated!")
df = pandas.DataFrame(data)
print("dataframe created!")
df.to_csv(filename)
print("csv ready!")
def multiIndex_generator(df, axis=0):
if axis == 0:
df.index = pandas.MultiIndex.from_tuples(
[(0, i) for i in np.arange(len(df.index))]
)
else:
df.columns = pandas.MultiIndex.from_tuples(
[(0, i) for i in np.arange(len(df.columns))]
)
return df
if __name__ == "__main__":
import modin.pandas as pd
if not os.path.exists(TEST_FILENAME):
generate_data_file(TEST_FILENAME, N, M, MULTILINE)
md_df = multiIndex_generator(pd.read_csv(TEST_FILENAME))
pd_df = multiIndex_generator(pandas.read_csv(TEST_FILENAME))
print(
f"DataFrame shape: ({N}, {M}) ~ {os.stat(TEST_FILENAME).st_size // (1024 * 1024)}MB, {pd.DEFAULT_NPARTITIONS} cores"
)
t1 = timer()
res = repr(pd_df.unstack())
print("PD unstack:", "{:.2f}".format(timer() - t1), "s")
t1 = timer()
res = repr(md_df.unstack())
print("MD unstack:", "{:.2f}".format(timer() - t1), "s") |
@dchigarev Thank you, I'll see on it. I didn't change implementation since last discussion (only rebasing on current master was applied). |
@dchigarev I changed making of df.index = pandas.MultiIndex.from_tuples(
[(j, i) for j in np.arange(10) for i in np.arange(len(df.index)/10)]
) In your case external level of MultiIndex is I made measurements using your script with my changes and get next results:
In all measured cases only |
d0c59e9
to
b663e0d
Compare
@dchigarev @devin-petersohn Could we merge this to master? This implementation is faster than |
@prutskov Please create an issue to tune performance further, will take a look and merge now if everything is okay. |
b663e0d
to
0684255
Compare
@devin-petersohn Issue #1975 was created. Thank you! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assuming tests pass, looks good as first pass. Thanks @prutskov!
0684255
to
88439ab
Compare
…`Series.unstack` Signed-off-by: Alexey Prutskov <alexey.prutskov@intel.com>
88439ab
to
fe7c935
Compare
@devin-petersohn All checks have passed after rebasing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@prutskov LGTM
…`Series.unstack` (modin-project#1649) Signed-off-by: Alexey Prutskov <alexey.prutskov@intel.com>
Signed-off-by: Alexey Prutskov alexey.prutskov@intel.com
What do these changes do?
flake8 modin
black --check modin
git commit -s