
Cythonized GroupBy Quantile #20405

Merged
merged 65 commits into from
Feb 28, 2019
Changes from 12 commits
Commits
618ec99
Reorganized existing describe test
WillAyd Mar 15, 2018
74871d8
Added quantile tests and impl
WillAyd Mar 15, 2018
7b6ca68
Broken impl and doc updates
WillAyd Mar 15, 2018
31aff03
Working impl with non-missing; more tests
WillAyd Mar 16, 2018
4a43815
DOC: update the Index.isin docstring (#20249)
noemielteto Mar 18, 2018
eb18823
Working impl with NA data
WillAyd Mar 18, 2018
813da81
Merge remote-tracking branch 'upstream/master' into grp-desc-perf
WillAyd Mar 18, 2018
e152dd5
Added check_names arg to failing tests
WillAyd Mar 18, 2018
7a8fefb
Added tests for dt, object raises
WillAyd Mar 18, 2018
b4938ba
Added interpolation keyword support
WillAyd Mar 19, 2018
3f7d0a9
LINT fix
WillAyd Mar 19, 2018
d7aec3f
Updated benchmarks
WillAyd Mar 19, 2018
e712946
Merge remote-tracking branch 'upstream/master' into grp-desc-perf
WillAyd Aug 7, 2018
72cd30e
Removed errant git diff
WillAyd Aug 7, 2018
a3c4b11
Removed errant pd file
WillAyd Aug 7, 2018
ac96526
Fixed broken function tests
WillAyd Aug 7, 2018
7d439d8
Added check_names=False to tests
WillAyd Aug 7, 2018
3047eed
Py27 compat
WillAyd Aug 7, 2018
70bf89a
LINT fixup
WillAyd Aug 7, 2018
02eb336
Merge remote-tracking branch 'upstream/master' into grp-desc-perf
WillAyd Aug 7, 2018
7c3c349
Merge remote-tracking branch 'upstream/master' into grp-desc-perf
WillAyd Aug 13, 2018
3b9c7c4
Merge remote-tracking branch 'upstream/master' into grp-desc-perf
WillAyd Nov 13, 2018
ad8b184
Replaced double with float64
WillAyd Nov 13, 2018
b846bc2
Merge remote-tracking branch 'upstream/master' into grp-desc-perf
WillAyd Nov 15, 2018
93b122c
Merge remote-tracking branch 'upstream/master' into grp-desc-perf
WillAyd Nov 19, 2018
09308d4
Merge remote-tracking branch 'upstream/master' into grp-desc-perf
WillAyd Nov 24, 2018
1a718f2
Fixed segfault on all NA group
WillAyd Nov 24, 2018
ff062bd
Stylistic and idiomatic test updates
WillAyd Nov 24, 2018
bdb5089
LINT fixup
WillAyd Nov 24, 2018
9b55fb5
Added cast to remove build warning
WillAyd Nov 24, 2018
31e66fc
Used memoryview.shape instead of len
WillAyd Nov 24, 2018
41a734f
Use pytest.raises
WillAyd Nov 24, 2018
67e0f00
Better Cython types
WillAyd Nov 24, 2018
07b0c00
Merge remote-tracking branch 'upstream/master' into grp-desc-perf
WillAyd Nov 26, 2018
86aeb4a
Loosened test expectation on Windows
WillAyd Nov 26, 2018
86b9d8d
Used api types
WillAyd Nov 26, 2018
cfa1b45
Removed test hacks
WillAyd Nov 26, 2018
00085d0
Used is_object_dtype
WillAyd Nov 27, 2018
1f02532
Merge remote-tracking branch 'upstream/master' into grp-desc-perf
WillAyd Dec 25, 2018
3c64c1f
Removed loosened check on agg_result
WillAyd Dec 25, 2018
09695f5
Merge remote-tracking branch 'upstream/master' into grp-desc-perf
WillAyd Jan 9, 2019
68cfed9
isort fixup
WillAyd Jan 9, 2019
4ce1448
Merge remote-tracking branch 'upstream/master' into grp-desc-perf
WillAyd Jan 11, 2019
5e840da
Removed nonlocal variable usage
WillAyd Jan 11, 2019
7969fb6
Updated documentation
WillAyd Jan 11, 2019
f9a8317
LINT fixup
WillAyd Jan 11, 2019
464a831
Reverted errant whatsnew
WillAyd Jan 11, 2019
4b3f9be
Refactor processor signatures
WillAyd Jan 11, 2019
b996e1d
Documentation updates
WillAyd Jan 11, 2019
cdd8985
Merge remote-tracking branch 'upstream/master' into grp-desc-perf
WillAyd Jan 22, 2019
64f46a3
Added empty assignment for variable
WillAyd Jan 22, 2019
4d88e8a
Docstring fixup
WillAyd Jan 22, 2019
1cd93dd
Updated README
WillAyd Jan 22, 2019
9ae23c1
Pytest arg deprecation fix
WillAyd Jan 26, 2019
eb99f07
Removed test_describe test
WillAyd Jan 31, 2019
94d4892
Merge remote-tracking branch 'upstream/master' into grp-desc-perf
WillAyd Jan 31, 2019
0512f37
Moved whatsnew
WillAyd Jan 31, 2019
2370129
Merge remote-tracking branch 'upstream/master' into grp-desc-perf
WillAyd Feb 2, 2019
a018570
Merge remote-tracking branch 'upstream/master' into grp-desc-perf
WillAyd Feb 12, 2019
f41cd05
Merge remote-tracking branch 'upstream/master' into grp-desc-perf
WillAyd Feb 20, 2019
21691bb
Merge remote-tracking branch 'upstream/master' into grp-desc-perf
WillAyd Feb 27, 2019
082aea3
LINT fixup
WillAyd Feb 27, 2019
dc5877a
Merge remote-tracking branch 'upstream/master' into grp-desc-perf
WillAyd Feb 27, 2019
7496a9b
Merge remote-tracking branch 'upstream/master' into grp-desc-perf
WillAyd Feb 28, 2019
ec013bf
LINT fixup
WillAyd Feb 28, 2019
6 changes: 3 additions & 3 deletions asv_bench/benchmarks/groupby.py
Original file line number Diff line number Diff line change
@@ -14,7 +14,7 @@
method_blacklist = {
'object': {'median', 'prod', 'sem', 'cumsum', 'sum', 'cummin', 'mean',
'max', 'skew', 'cumprod', 'cummax', 'rank', 'pct_change', 'min',
'var', 'mad', 'describe', 'std'},
'var', 'mad', 'describe', 'std', 'quantile'},
'datetime': {'median', 'prod', 'sem', 'cumsum', 'sum', 'mean', 'skew',
'cumprod', 'cummax', 'pct_change', 'var', 'mad', 'describe',
'std'}
@@ -343,8 +343,8 @@ class GroupByMethods(object):
['all', 'any', 'bfill', 'count', 'cumcount', 'cummax', 'cummin',
'cumprod', 'cumsum', 'describe', 'ffill', 'first', 'head',
'last', 'mad', 'max', 'min', 'median', 'mean', 'nunique',
'pct_change', 'prod', 'rank', 'sem', 'shift', 'size', 'skew',
'std', 'sum', 'tail', 'unique', 'value_counts', 'var'],
'pct_change', 'prod', 'quantile', 'rank', 'sem', 'shift', 'size',
'skew', 'std', 'sum', 'tail', 'unique', 'value_counts', 'var'],
['direct', 'transformation']]

def setup(self, dtype, method, application):
6 changes: 6 additions & 0 deletions pandas/_libs/groupby.pxd
@@ -0,0 +1,6 @@
cdef enum InterpolationEnumType:
INTERPOLATION_LINEAR,
INTERPOLATION_LOWER,
INTERPOLATION_HIGHER,
INTERPOLATION_NEAREST,
INTERPOLATION_MIDPOINT
98 changes: 98 additions & 0 deletions pandas/_libs/groupby.pyx
@@ -379,5 +379,103 @@ def group_any_all(ndarray[uint8_t] out,
out[lab] = flag_val


@cython.boundscheck(False)
@cython.wraparound(False)
def group_quantile(ndarray[float64_t] out,
ndarray[int64_t] labels,
numeric[:] values,
ndarray[uint8_t] mask,
double_t q,
object interpolation):
"""
Calculate the quantile per group.

Parameters
----------
out : ndarray
Array of aggregated values that will be written to.
labels : ndarray
Array containing the unique group labels.
values : ndarray
Array containing the values to apply the function against.
q : double
The quantile value to search for.
mask : ndarray[uint8_t]
Array flagging which entries of `values` are missing (NA); these are
excluded from the quantile calculation.
interpolation : str
Method to use when the desired quantile lies between two points.

Notes
-----
Rather than explicitly returning a value, this function modifies the
provided `out` parameter.
"""
cdef:
Py_ssize_t i, N=len(labels)
int64_t lab, ngroups, grp_sz, non_na_sz, grp_start=0, idx=0
uint8_t interp, offset
numeric val, next_val
double_t q_idx, frac
ndarray[int64_t] counts, non_na_counts
ndarray[int64_t] sort_arr

inter_methods = {
'linear': INTERPOLATION_LINEAR,
'lower': INTERPOLATION_LOWER,
'higher': INTERPOLATION_HIGHER,
'nearest': INTERPOLATION_NEAREST,
'midpoint': INTERPOLATION_MIDPOINT,
}
interp = inter_methods[interpolation]

counts = np.zeros_like(out, dtype=np.int64)
non_na_counts = np.zeros_like(out, dtype=np.int64)
ngroups = len(counts)

# First figure out the size of every group
with nogil:
for i in range(N):
lab = labels[i]
counts[lab] += 1
if not mask[i]:
non_na_counts[lab] += 1

# Get an index of values sorted by labels and then values
assert len(values) == len(labels)
order = (values, labels)
sort_arr = np.lexsort(order).astype(np.int64, copy=False)

with nogil:
for i in range(ngroups):
# Figure out how many group elements there are
grp_sz = counts[i]
non_na_sz = non_na_counts[i]

# Calculate where to retrieve the desired value
# Casting to int will intentionally truncate result
idx = grp_start + <int64_t>(q * <double_t>(non_na_sz - 1))

val = values[sort_arr[idx]]
# If requested quantile falls evenly on a particular index
# then write that index's value out. Otherwise interpolate
q_idx = q * (non_na_sz - 1)
frac = q_idx % 1

if frac == 0.0 or interp == INTERPOLATION_LOWER:
out[i] = val
else:
next_val = values[sort_arr[idx + 1]]
if interp == INTERPOLATION_LINEAR:
out[i] = val + (next_val - val) * frac
elif interp == INTERPOLATION_HIGHER:
out[i] = next_val
elif interp == INTERPOLATION_MIDPOINT:
out[i] = (val + next_val) / 2.0
elif interp == INTERPOLATION_NEAREST:
if frac > .5 or (frac == .5 and q > .5): # Always safe?
out[i] = next_val
else:
out[i] = val

# Increment the index reference in sorted_arr for the next group
grp_start += grp_sz


# generated from template
include "groupby_helper.pxi"
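For readers following the algorithm, the count/lexsort/scan strategy of `group_quantile` above can be modeled in plain NumPy. This is a simplified sketch only — linear interpolation, no NA mask — and `group_quantile_sketch` is a hypothetical name for illustration, not part of pandas:

```python
import numpy as np

def group_quantile_sketch(values, labels, q):
    """Linear-interpolation quantile per group, mirroring the
    count -> lexsort -> per-group scan of the Cython kernel."""
    values = np.asarray(values, dtype=np.float64)
    labels = np.asarray(labels, dtype=np.int64)
    ngroups = labels.max() + 1
    # First figure out the size of every group
    counts = np.bincount(labels, minlength=ngroups)
    # Index that sorts by label first, then by value within each label
    sort_arr = np.lexsort((values, labels))
    out = np.empty(ngroups, dtype=np.float64)
    grp_start = 0
    for i in range(ngroups):
        grp_sz = counts[i]
        q_idx = q * (grp_sz - 1)
        idx = grp_start + int(q_idx)   # int() truncates, like the Cython cast
        frac = q_idx % 1
        val = values[sort_arr[idx]]
        if frac == 0.0:
            out[i] = val
        else:
            next_val = values[sort_arr[idx + 1]]
            out[i] = val + (next_val - val) * frac
        grp_start += grp_sz
    return out
```

With `values = [1, 2, 3, 4, 5, 5, 4, 3, 2, 1]` and labels splitting the halves into two groups, `q=0.5` yields the per-group medians `[3.0, 3.0]`, matching `np.percentile` applied groupwise.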
103 changes: 85 additions & 18 deletions pandas/core/groupby.py
@@ -1732,6 +1732,70 @@ def nth(self, n, dropna=None):

return result

def quantile(self, q=0.5, interpolation='linear'):
"""
Return group values at the given quantile, a la numpy.percentile.

Parameters
----------
q : float or array-like, default 0.5 (50% quantile)
Review comment: checking this should share code with Series.quantile. (or be in cython is ok)

0 <= q <= 1, the quantile(s) to compute
interpolation : str
Review comment: can you list the methods here

Method to use when the desired quantile falls between two points:
'linear', 'lower', 'higher', 'nearest' or 'midpoint'.

Returns
-------
Series or DataFrame
Return type determined by caller of GroupBy object.

See Also
--------
Series.quantile : Similar method for Series
DataFrame.quantile : Similar method for DataFrame
Review comment: can you add numpy.percentile


Examples
--------
>>> df = pd.DataFrame(
... [['foo'] * 5 + ['bar'] * 5,
... [1, 2, 3, 4, 5, 5, 4, 3, 2, 1]],
... columns=['key', 'val'])
>>> df
"""

is_dt = False
is_int = False

def pre_processor(vals):
if vals.dtype == np.object:
Review comment: we really really need to clean this up and simply put this in a class. I would be really happy to do this before this PR.

raise TypeError("'quantile' cannot be performed against "
"'object' dtypes!")
elif vals.dtype == np.int:
nonlocal is_int
is_int = True
elif vals.dtype == 'datetime64[ns]':
vals = vals.astype(np.float)
nonlocal is_dt
is_dt = True

return vals

def post_processor(vals):
if is_dt:
vals = vals.astype('datetime64[ns]')
elif is_int and interpolation in ['lower', 'higher', 'nearest']:
vals = vals.astype(np.int)

return vals

return self._get_cythonized_result('group_quantile', self.grouper,
aggregate=True,
needs_values=True,
needs_mask=True,
cython_dtype=np.float64,
pre_processing=pre_processor,
post_processing=post_processor,
q=q, interpolation=interpolation)
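The `interpolation` keyword selects one of five branches in the Cython kernel. A pure-Python model of that branch logic — `quantile_interp` is a hypothetical helper for illustration, operating on an already-sorted sequence:

```python
def quantile_interp(sorted_vals, q, interpolation='linear'):
    """Quantile of a sorted sequence, mirroring group_quantile's branches."""
    n = len(sorted_vals)
    q_idx = q * (n - 1)
    idx = int(q_idx)          # truncates, like the <int64_t> cast in Cython
    frac = q_idx % 1
    val = sorted_vals[idx]
    # Quantile falls exactly on an index, or caller wants the lower value
    if frac == 0.0 or interpolation == 'lower':
        return val
    next_val = sorted_vals[idx + 1]
    if interpolation == 'linear':
        return val + (next_val - val) * frac
    if interpolation == 'higher':
        return next_val
    if interpolation == 'midpoint':
        return (val + next_val) / 2.0
    if interpolation == 'nearest':
        return next_val if frac > .5 or (frac == .5 and q > .5) else val
    raise ValueError(interpolation)
```

For `[1, 2, 3, 4]` at `q=0.5` the index lands halfway between 2 and 3, so 'linear' and 'midpoint' give 2.5, 'lower' gives 2, 'higher' gives 3, and 'nearest' resolves the tie toward the lower value since `q` is not above 0.5.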

@Substitution(name='groupby')
def ngroup(self, ascending=True):
"""
@@ -1928,43 +1992,46 @@ def cummax(self, axis=0, **kwargs):
def _get_cythonized_result(self, how, grouper, aggregate=False,
cython_dtype=None, needs_values=False,
needs_mask=False, needs_ngroups=False,
result_is_index=False,
pre_processing=None, post_processing=None,
**kwargs):
"""Get result for Cythonized functions
result_is_index=False, pre_processing=None,
post_processing=None, **kwargs):
"""
Get result for Cythonized functions.

Parameters
----------
how : str, Cythonized function name to be called
grouper : Grouper object containing pertinent group info
how : str
Cythonized function name to be called.
grouper : pandas.Grouper
Grouper object containing pertinent group info.
aggregate : bool, default False
Whether the result should be aggregated to match the number of
groups
groups.
cython_dtype : default None
Type of the array that will be modified by the Cython call. If
`None`, the type will be inferred from the values of each slice
`None`, the type will be inferred from the values of each slice.
needs_values : bool, default False
Whether the values should be a part of the Cython call
signature
signature.
needs_mask : bool, default False
Whether boolean mask needs to be part of the Cython call
signature
signature.
needs_ngroups : bool, default False
Whether number of groups is part of the Cython call signature
Whether number of groups is part of the Cython call signature.
result_is_index : bool, default False
Whether the result of the Cython operation is an index of
values to be retrieved, instead of the actual values themselves
values to be retrieved, instead of the actual values themselves.
pre_processing : function, default None
Function to be applied to `values` prior to passing to Cython
Raises if `needs_values` is False
Function to be applied to `values` prior to passing to Cython.
Raises if `needs_values` is False.
post_processing : function, default None
Function to be applied to result of Cython function
**kwargs : dict
Extra arguments to be passed back to Cython funcs
Function to be applied to result of Cython function.
**kwargs
Extra arguments to be passed back to Cython funcs.

Returns
-------
`Series` or `DataFrame` with filled values
`Series` or `DataFrame`
Object type determined by caller of the ``GroupBy`` object.
"""
if result_is_index and aggregate:
raise ValueError("'result_is_index' and 'aggregate' cannot both "
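The `pre_processing`/`post_processing` hooks used by `quantile` rely on datetime64 values surviving a round trip through float64 (the kernel only operates on numeric data). A quick sketch of that assumption — note that day- and second-aligned timestamps are exactly representable in float64, while arbitrary nanosecond values can lose precision:

```python
import numpy as np

# Sketch of the datetime pre/post-processing round trip: cast
# datetime64[ns] to float64 for the Cython kernel, then back.
dates = np.array(['2000-03-11', '2000-03-12'], dtype='datetime64[ns]')
as_float = dates.astype(np.float64)            # ns since epoch, as float
round_trip = as_float.astype('datetime64[ns]')
assert (round_trip == dates).all()
```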
66 changes: 60 additions & 6 deletions pandas/core/indexes/base.py
@@ -3396,8 +3396,11 @@ def map(self, mapper, na_action=None):

def isin(self, values, level=None):
"""
Return a boolean array where the index values are in `values`.

Compute boolean array of whether each index value is found in the
passed set of values.
passed set of values. The length of the returned boolean array matches
the length of the index.

Parameters
----------
@@ -3406,23 +3409,74 @@ def isin(self, values, level=None):

.. versionadded:: 0.18.1

Support for values as a set
Support for values as a set.

level : str or int, optional
Name or position of the index level to use (if the index is a
MultiIndex).
`MultiIndex`).

Returns
-------
is_contained : ndarray
NumPy array of boolean values.

See also
--------
Series.isin : Same for Series.
DataFrame.isin : Same method for DataFrames.

Notes
-----
In the case of `MultiIndex` you must either specify `values` as a
list-like object containing tuples that are the same length as the
number of levels, or specify `level`. Otherwise it will raise a
``ValueError``.

If `level` is specified:

- if it is the name of one *and only one* index level, use that level;
- otherwise it should be a number indicating level position.

Returns
-------
is_contained : ndarray (boolean dtype)
Examples
--------
>>> idx = pd.Index([1,2,3])
>>> idx
Int64Index([1, 2, 3], dtype='int64')

Check whether each index value is in a list of values.

>>> idx.isin([1, 4])
array([ True, False, False])

>>> midx = pd.MultiIndex.from_arrays([[1,2,3],
... ['red', 'blue', 'green']],
... names=('number', 'color'))
>>> midx
MultiIndex(levels=[[1, 2, 3], ['blue', 'green', 'red']],
labels=[[0, 1, 2], [2, 0, 1]],
names=['number', 'color'])

Check whether the strings in the 'color' level of the MultiIndex
are in a list of colors.

>>> midx.isin(['red', 'orange', 'yellow'], level='color')
array([ True, False, False])

To check across the levels of a MultiIndex, pass a list of tuples:

>>> midx.isin([(1, 'red'), (3, 'red')])
array([ True, False, False])

For a DatetimeIndex, string values in `values` are converted to
Timestamps.

>>> dates = ['2000-03-11', '2000-03-12', '2000-03-13']
>>> dti = pd.to_datetime(dates)
>>> dti
DatetimeIndex(['2000-03-11', '2000-03-12', '2000-03-13'],
dtype='datetime64[ns]', freq=None)

>>> dti.isin(['2000-03-11'])
array([ True, False, False])
"""
if level is not None:
self._validate_index_level(level)
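On a flat index, the semantics documented above match NumPy's `np.isin`. A sketch — `index_isin_sketch` is a hypothetical helper; the real `Index.isin` additionally handles `MultiIndex` levels and datetime string coercion:

```python
import numpy as np

def index_isin_sketch(index_values, values):
    """Boolean array: True where each index entry appears in `values`."""
    return np.isin(np.asarray(index_values), list(values))
```

For example, `index_isin_sketch([1, 2, 3], [1, 4])` returns a boolean array the same length as the index, True only for the first entry.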