Add value_counts implementation for Series and as free function #1535
Conversation
Codecov Report

```
@@            Coverage Diff             @@
##           master    #1535      +/-   ##
==========================================
+ Coverage   88.12%   88.20%   +0.07%
==========================================
  Files          71       71
  Lines        7023     7104      +81
==========================================
+ Hits         6189     6266      +77
- Misses        834      838       +4
```

Continue to review full report at Codecov.
Force-pushed from e6fd2f2 to 3452dc1
Signed-off-by: Yaroslav Igoshev <yaroslav.igoshev@intel.com>
Force-pushed from 3452dc1 to b31375a
I think `value_counts` can be implemented with `groupby`, what do you think?

```python
self.groupby(self, as_index=True).count()
```
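For reference, the equivalence being suggested can be checked in plain pandas; this is a minimal sketch with illustrative data, not code from the PR:

```python
import pandas as pd

s = pd.Series([3, 1, 2, 3, 4, 3])

# Grouping a Series by its own values and counting group sizes...
via_groupby = s.groupby(s).count()

# ...gives the same counts as value_counts (ignoring sort order)
via_value_counts = s.value_counts(sort=False)

assert via_groupby.sort_index().tolist() == via_value_counts.sort_index().tolist()
```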
@devin-petersohn , if implement
Force-pushed from 5789e52 to 4689672

Force-pushed from 4689672 to 8b2c4aa
looks like the containerized tests failed because of a syntax error: `/localdisk/tc_agent/temp/agentTmp/custom_script12880017450402304134: line 16: syntax error: unexpected end of file`
Yes. The containerized tests didn't send logs into PR comments because the log file was created outside of the container. I tried to fix it but broke all test runs. I am working on it.
@gshimansky , ok, thanks.
I see. Is it possible to do without `_apply_full_axis`? It is quite expensive and should only be used when absolutely required. Something like this in the query compiler:

```python
value_counts = MapReduceFunction(lambda x: x.squeeze().value_counts(**kwargs), lambda x: x.groupby(x.index).sum())
```
If we'd try to do

```python
value_counts = MapReduceFunction.register(
    lambda x, *args, **kwargs: x.squeeze().value_counts(**kwargs),
    lambda y, *args, **kwargs: y.squeeze().groupby(y.squeeze().index, sort=False).sum()
)
```

here I see two problems. The first one is that it is required to recompute
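The two-phase shape being discussed can be emulated in plain pandas: run `value_counts` per partition at the map stage, then sum the partial counts by index at the reduce stage. This is a sketch with made-up data, not Modin's actual partitioning machinery:

```python
import pandas as pd

data = pd.Series([1, 2, 2, 3, 3, 3, 1, 2])

# Emulate two partitions of the series
chunks = [data.iloc[:4], data.iloc[4:]]

# Map stage: per-partition value_counts
mapped = [chunk.value_counts() for chunk in chunks]

# Reduce stage: combine partial counts by grouping on the index
reduced = pd.concat(mapped).groupby(level=0).sum()

assert reduced.sort_index().tolist() == data.value_counts().sort_index().tolist()
```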
@YarShev I see, thanks for pointing these out. The difficult challenge here is that this is very commonly used, and I will think on a potential fix for this. There should still be some way to do it in two phases.
@devin-petersohn , from my point of view,
I have a branch here where I have tried a quick and dirty implementation, let me know what you think: https://github.com/devin-petersohn/modin/tree/features/value_counts
@devin-petersohn , I looked through the changes in your branch. I guess we can't implement

```python
import modin.pandas as pd
import pandas
import numpy as np

NROWS = 2 ** 8
RAND_LOW = 0
RAND_HIGH = 100
random_state = np.random.RandomState(seed=42)
data = random_state.randint(RAND_LOW, RAND_HIGH, size=(NROWS))
ms = pd.Series(data)
ps = pandas.Series(data)
mr = ms.value_counts()
pr = ps.value_counts()
```

`mr[:10]` is sorted (by descending) on both values and, for equal values, indices:

```
61    12
1      8
87     6
43     6
14     6
89     5
88     5
59     5
58     5
2      5
```

`pr[:10]` is sorted (by descending) on values, but indices are not sorted for equal values:

```
61    12
1      8
43     6
14     6
87     6
88     5
2      5
59     5
58     5
89     5
```

Regarding the changes related to recomputing of indices, @devin-petersohn , what do you think about all this?
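One way to get the deterministic tie ordering shown above (counts descending, index descending within equal counts) is a two-pass sort relying on a stable sort kind. This is a sketch of the idea only, not the PR's actual implementation:

```python
import pandas as pd

counts = pd.Series({89: 5, 2: 5, 61: 12, 58: 5})

# First order by index, then stable-sort by count; the stable kind
# preserves the index ordering within groups of equal counts
deterministic = counts.sort_index(ascending=False).sort_values(
    ascending=False, kind="mergesort"
)

assert list(deterministic.index) == [61, 89, 58, 2]
```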
@YarShev I think it is okay if we warn the user that the output order may slightly differ from pandas. We can create an issue to revisit this, and if it becomes a huge problem we can revisit how to fix it, but scalability is more important. I think if users are relying on pandas order, they are using the
@devin-petersohn , okay, then can I take the changes from your branch and put them into this PR? Also, I will put in the changes regarding sorting of indices so that you can review this.
@YarShev Yes, feel free to cherry-pick those commits over.
Signed-off-by: Devin Petersohn <devin.petersohn@gmail.com>
Force-pushed from bd7b7f7 to bf90699

Force-pushed from bf90699 to d47b1d6
@devin-petersohn , I combined all changes. Please review. Also, I created issue #1650 on the difference of index order between pandas and Modin.
Left a couple of questions and documentation requests, thanks @YarShev!
```python
-------
PandasQueryCompiler
"""
if kwargs.get("bins", None) is not None:
```
Do bins not work with `MapReduce` style? How is the designator for the bin determined?
The designator for the bin is determined as follows:

```python
import pandas
import numpy as np

s = pandas.Series([3, 1, 2, 3, 4, np.nan])
r = s.value_counts(bins=3)
```

```
>>> r
(2.0, 3.0]      2
(0.996, 2.0]    2
(3.0, 4.0]      1
dtype: int64
>>> r.index
IntervalIndex([(2.0, 3.0], (0.996, 2.0], (3.0, 4.0]],
              closed='right',
              dtype='interval[float64]')
>>> r.index[0]
Interval(2.0, 3.0, closed='right')
```

`MapReduce` style doesn't work for bins because equal values may be put into different bins at the map stage (when we have more than one partition). Then, when we do `groupby` by index, we will not be able to group bins containing appropriate values.
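The incompatibility can be seen in plain pandas by binning partitions separately; the data here is illustrative, not from the PR:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 10, 11, 12])

# Global binning: edges are derived from the whole series' min/max
whole = s.value_counts(bins=2)

# Per-partition binning: each chunk derives its own, different edges
parts = [s.iloc[:3].value_counts(bins=2), s.iloc[3:].value_counts(bins=2)]

# The partition-local IntervalIndex edges do not match the global ones,
# so a reduce step cannot align the partial results by index
assert not whole.index.equals(parts[0].index)
```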
```python
is_end = True
break
if is_range:
new_index = np.concatenate(
```
It will probably be slightly more performant to concatenate at the end to avoid the multiple `malloc` calls.
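The point about deferring concatenation can be illustrated like this; a sketch with made-up array contents:

```python
import numpy as np

chunks = [np.arange(i, i + 3) for i in range(0, 9, 3)]

# Concatenating inside the loop reallocates the growing array every time
grown = np.array([], dtype=int)
for chunk in chunks:
    grown = np.concatenate([grown, chunk])

# Collecting the pieces and concatenating once allocates the result a single time
once = np.concatenate(chunks)

assert np.array_equal(grown, once)
```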
I fully got rid of `np.concatenate`.
```python
result = result.sort_values(ascending=ascending) if sort else result

def sort_index_for_identical_values(result, ascending):
```
Can you add a brief comment on why this is needed?
Of course, done. Also, I added a brief docstring for this function.
```python
return sort_index_for_identical_values(result, ascending)

return MapReduceFunction.register(map_func, reduce_func, preserve_index=False)(
```
Thinking on how to structure this to keep the query compiler clean of these low-level implementation details: what if we created a file `backends/pandas/kernels.py`, then added these functions there and imported them? I have been thinking about how to keep things clean; maybe it is better left to another PR, but what do you think about this?
Yes, this is a great thought. I think a file `backends/pandas/kernels.py` containing these implementation details is what we need to keep the query compiler clean. Importing functions from that file into the query compiler is a very nice solution.
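A sketch of what such a split might look like; the file path comes from the discussion above, but the function names and signatures here are hypothetical, not Modin's actual code:

```python
# Hypothetical contents of backends/pandas/kernels.py (illustrative only)
import pandas


def value_counts_map(partition, **kwargs):
    # Map phase: per-partition value_counts on the single-column partition
    return partition.squeeze(axis=1).value_counts(**kwargs).to_frame()


def value_counts_reduce(mapped, **kwargs):
    # Reduce phase: sum the partial counts by grouping on the index
    col = mapped.squeeze(axis=1)
    return col.groupby(col.index, sort=False).sum().to_frame()
```

The query compiler would then only import these names and pass them to `MapReduceFunction.register`, keeping the kernel bodies out of the compiler itself.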
I ran some tests, performance is good. Thanks @YarShev LGTM!
Signed-off-by: Yaroslav Igoshev yaroslav.igoshev@intel.com
What do these changes do?

- passes `flake8 modin`
- passes `black --check modin`
- commits signed off with `git commit -s`