Grouping API consistency and improvements #1256
Comments
I did some research and managed to beat data.table in some groupby cases. Key areas to work on are string hashing and better parallelization. There may be a way to merge the work in FastGroupBy.jl into DataFrames by rewriting the grouping backend to be radix-sort based for bit types. See post. See code here. |
Interesting. Let's discuss this elsewhere, as this issue was supposed to deal with APIs, not implementations. |
A good solution to make |
Bumping this. I have been playing around with
I am not sure how contrived an example this is, but it's one we are more or less running into with our work in Stata, and Stata isn't handling it that well either. I think the error is that the
You can see an |
@nalimilan what is left from this issue to be done? |
I think several points still need to be addressed, e.g. regarding |
👍 Thank you! |
Just leaving a note that it could make sense to replace The only question is what we also need a replacement for |
With So maybe:
|
That's tempting, but both dplyr's
Good point, so indeed |
I am OK to do the way you want to go (I just wanted to hint that By Now I actually see that it can be even simpler like Edit by |
The last step to complete to close this issue is to deprecate |
Two comments:
@piever - regarding the second point. We need two kinds of transformation operations:
How do you handle this distinction in JuliaDB.jl (it seems that |
@piever This is my current conclusion, from the discussion with @nalimilan, on what would be best for DataFrames.jl. Please comment on what you think of it (point 1 is probably the most relevant to JuliaDB.jl):
In this way I think it would be great to settle this before DataFrames.jl 1.0 release (so hopefully soon). |
Among others, In JuliaDB working on the columns directly is IMO not as common as in DataFrames as it doesn't work well with the distributed case. The main function to do this are I suspect it'd be nicer, just like there is Would
I'm not sure I'm particularly qualified to comment on this one. On the one hand, if DataFrames don't iterate rows (you need
Even though JuliaDB has the syntactic sugar |
Thank you for the extensive and insightful comment. I will add two short notes (probably @nalimilan will have more to say on this):
|
One more tiny comment: I think that DataFrames actually uses something like |
Thanks. That's so hard! I think there are two different but related issues:
Actually the former issue has much deeper implications than the latter, so I've filed a separate issue to continue the discussion: #1952. The discussion about I must say I'm not really convinced by the proposal to use Regarding the standardization of the API between DataFrames and JuliaDB, there are several issues:
I think it would be fine to add |
I think it would be wise to check that we have a rough idea of where we want to go before 1.0, just in case it requires deprecating something (apart from |
Of course there needs to be consensus with the other JuliaDB maintainers, but, as far as I'm concerned, I agree that the name To me, the only big thing that needs deciding soonish to settle this is whether there is a good name for the equivalent of The name I had proposed Personally, I like that
|
I'm afraid this is a very JuliaDB-esque proposal. :-)
If we added a Also as I said I don't really like
What do you mean by "groupwise window functions"? |
I'm not completely sold on
I simply meant things like |
I'm not really sure
|
Let me add that pandas'

    In [1]: import pandas
       ...:
       ...: df = pandas.DataFrame(
       ...:     {
       ...:         "a": [1, 1, 2, 2, 3, 3],
       ...:         "b": [1, 2, 3, 4, 5, 6],
       ...:     }
       ...: )
       ...:
       ...: for key, group in df.groupby("a"):
       ...:     print("key =", key)
       ...:     print("group =")
       ...:     print(group)
       ...:     print()
    key = 1
    group =
       a  b
    0  1  1
    1  1  2

    key = 2
    group =
       a  b
    2  2  3
    3  2  4

    key = 3
    group =
       a  b
    4  3  5
    5  3  6

I find this useful sometimes. Of course, with DataFrames.jl, I can pull out the (I don't know DataFrames.jl well enough to say for sure there is no function that does this out-of-the-box. I hope my comment is not adding unnecessary noise.) |
The Query.jl way of doing this is to have a |
Yes, that would be equivalent and nice to have. |
Yes, exactly that. Thanks! |
I found Query.jl's performance on large group-by problems to be slower. In fact, I would say much slower. As long as we can keep the performance at a good level |
@xiaodaigh Your work on extensive benchmarking of group-by is valuable to many people. But I believe this thread is about API, not implementation or performance. If you have a strong concern about Query.jl's |
I am just not sure whether a certain API will lead to a slower implementation, e.g. Maybe an iteration-based API will be slower? |
Having a |
Additional to-do / to-decide before 1.0 release:
|
I am closing this as I do not see anything not covered elsewhere (or not fitting the current design). If you feel something is missing then please open a new issue (possibly giving a link to this one for past reference). |
Our grouping API is currently lacking: it does not allow performing the most common operations in the most efficient way (see this discussion and this one), and it is not very user-friendly in all cases.
- `groupby` is the essential building block which returns a `GroupedDataFrame` object, i.e. a per-group view on the original `DataFrame`. Calling `map` on the resulting object gives a `GroupApplied` object, which can then be turned into a standard `DataFrame` using `combine`.
- `by` is a convenience wrapper around these operations. It supports any function taking a `DataFrame` and returning either a single value, a vector or a `DataFrame`, and combines the result into a `DataFrame`, with as many rows per group as the function returned. It is very flexible, but also cumbersome (since you need to write `df[:col]` inside the function). It is also quite slow, since we can make almost no assumptions about the kind of data which is going to be returned, and inference cannot really help since the input type is always `DataFrame` and does not include any information on the column types.
- `aggregate` is a more specific function which applies an operation independently on each column. It is also quite flexible, since the function can return either a single value or a vector (in which case one row is created for each entry). This makes it slower than it could be, since knowing that the function returns a single value would allow for optimizations (allocating in advance and looping efficiently over each column). However, here inference could be used, since the function is applied separately to each column. See "Modify aggregate for efficiency and decide its future" (#1246).

Comparing with other software, our API is quite similar to Pandas, which provides three different functions (see also this summary):
- `groupby` is similar to our function. However, the result allows accessing its "columns" via indexing, and calling a reduction function like `mean` on them computes it for each group.
- `apply` for general transformations. This is similar to our `by`.
- `transform` returns a data frame of the same size as the original. The passed function is applied independently to each column, and must return either a vector of the same length as the original for each column, or a single value which is recycled/broadcasted. This is the equivalent of our `aggregate` except that the result is guaranteed to have as many rows as the input (while in ours the number of rows is arbitrary). In terms of performance, this restriction allows allocating the final data frame in advance since the number of rows is known, avoiding the need to store copies for all groups before concatenating them.
- `aggregate` returns a data frame with one row per group. It's a more restrictive variant of `transform` that requires the passed function to return a single value. Our `aggregate` is a bit more general since it also accepts vector returns, which can come with a performance impact unless we can find a way to use inference to distinguish the two cases (see above).

dplyr takes a quite different approach:
- `group_by` is similar to our `groupby`. However, the grouped data frame it returns behaves mostly like a standard data frame (contrary to our `GroupedDataFrame` type) except for aggregation operations, so it does not always need to be combined back to a data frame immediately.
- `mutate` applies to each group a function which can access any columns of a data frame, and returns a column with as many rows as the original (several such operations can be carried out at the same time by passing multiple arguments). It could be seen as a version of `by` with a restricted output type: since the number of rows must be equal to the original, operations simply add new columns to (a copy of) the original data frame. An essential difference is that a convenience syntax is provided so that one does not actually write a function accessing the data frame, but expressions like `x = mean(col)`. In terms of performance, this means the list of columns that each operation needs is known, which allows compiling specialized code (similar to what DataFramesMeta does with `@transform`). Finally, note that `mutate` returns a grouped data frame, but since its behavior is close to that of plain data frames, this has only limited consequences for usability (compared to what would happen with our current implementation).
- `summarise` is similar to `mutate`, but for functions returning a single value. It can benefit from the same performance advantages as `mutate`, but can be even more efficient since results can be saved directly into the final `DataFrame` without intermediate allocations.
- `mutate_all` and `summarise_all` are variants of the two previous functions which take a function and apply it to each column (similar to our `aggregate`). For the first one, the function must return either a single value, or a vector of the same length as the input. For the second one, the function must return a single value. The result is respectively a grouped data frame or a data frame.

Overall, our API is quite similar to Pandas, though we do not provide an equivalent of their `aggregate` for the simplest case where a function always returns a scalar. It is not clear whether we need it for performance, or whether we could use inference in our `aggregate` to automatically use a fast path when possible (see above).

By contrast, our API is quite different from dplyr, which provides convenient functions to create new columns based on per-group operations (aggregation or arbitrary transformations), using a syntax similar to what Query or DataFramesMeta provide. Of course our `by` allows this kind of operation, but it makes it hard or impossible to do this efficiently since inference and specialization are problematic (see above). This kind of problem can only be solved using macros, in order to have access to the names of the columns which need to be used (and specialized on). So it is probably better left to Query and DataFramesMeta. So I'd say we should concentrate on ensuring our limited Pandas-like API is consistent and efficient, and point to other packages in our documentation for use cases which cannot be efficiently handled through it.

One big limitation we currently have compared with both Pandas and dplyr is that a simple case like computing a grouped mean for a single column is inconvenient and slow. Pandas allows accessing the pseudo-column of a grouped data frame and calling `mean()` on it. dplyr allows a similar operation by calling `mutate(grouped_df, m = mean(col))`. By contrast, we require either using `by`, which is slow and not so convenient, or using `aggregate`, which applies the operation to all columns (and fails when some are not numeric, which should be fixed).

Ideas for possible improvements:
- Make `GroupedDataFrame` behave more like a plain `DataFrame` when possible. In particular, using a more compact printing would make it more useful on its own (showing grouping columns like normal ones, maybe highlighted).
- […] `GroupedDataFrame` object, for example by adding a `getindex` method to select specific columns before calling `colwise`. Allowing `mean` (or any other reduction) to be called on the resulting object could be convenient, but that would only work for the special case when there is a single column, while `colwise` is more general. We probably don't want to allow calling `mean` on a `DataFrame` to apply it column-wise.
- […] `aggregate`.
- Have `aggregate` silently skip (i.e. return as-is) columns for which the passed function does not apply (according to its signature). Pandas does this by default, but since this is potentially confusing, it could be enabled via an argument.
- Have `colwise` return a `DataFrame` so that when called on a `GroupedDataFrame` it would preserve grouping columns (currently `colwise` does not even report the names of columns to which numbers correspond).
- Make `aggregate` more efficient by using inference (WIP: Modify aggregate for efficiency DataTables.jl#65). Unfortunately, this sounds more difficult for `by`, even though there may be room for improvement.
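The performance argument made above about the Pandas-style `aggregate`, `transform` and `apply` contracts can be sketched in plain Python. This is only an illustration under simplifying assumptions: a single column stored as a list, and helper names (`group_indices`, `aggregate_col`, `transform_col`, `apply_col`) that are hypothetical and belong to neither pandas nor DataFrames.jl. It shows why knowing the output row count in advance allows preallocation, while the general case has to collect per-group results and concatenate them.

```python
from collections import defaultdict


def group_indices(keys):
    """Map each distinct key to the row positions holding it (first-seen order)."""
    groups = defaultdict(list)
    for i, k in enumerate(keys):
        groups[k].append(i)
    return dict(groups)


def aggregate_col(keys, column, fn):
    """Hypothetical 'aggregate': fn returns a scalar, so there is one row per group."""
    groups = group_indices(keys)
    out_vals = [None] * len(groups)  # output size is known before calling fn
    for j, idx in enumerate(groups.values()):
        out_vals[j] = fn([column[i] for i in idx])
    return list(groups), out_vals


def transform_col(keys, column, fn):
    """Hypothetical 'transform': the output has exactly as many rows as the input."""
    out = [None] * len(column)  # output size is known before calling fn
    for idx in group_indices(keys).values():
        res = fn([column[i] for i in idx])
        if not isinstance(res, list):
            res = [res] * len(idx)  # recycle a scalar across the whole group
        if len(res) != len(idx):
            raise ValueError("transform must preserve the group length")
        for i, v in zip(idx, res):
            out[i] = v
    return out


def apply_col(keys, column, fn):
    """Hypothetical 'apply'/'by': fn may return any number of values per group."""
    out_keys, out_vals = [], []  # size unknown: grow and concatenate
    for k, idx in group_indices(keys).items():
        res = fn([column[i] for i in idx])
        res = res if isinstance(res, list) else [res]
        out_keys.extend([k] * len(res))
        out_vals.extend(res)
    return out_keys, out_vals


def mean(xs):
    return sum(xs) / len(xs)


keys = ["a", "a", "b", "b", "b"]
x = [1.0, 2.0, 3.0, 4.0, 5.0]

agg_keys, agg_vals = aggregate_col(keys, x, mean)  # one value per group
centered = transform_col(keys, x, lambda g: [v - mean(g) for v in g])
broadcast = transform_col(keys, x, mean)  # scalar recycled per group
top_keys, top_vals = apply_col(keys, x, lambda g: sorted(g, reverse=True)[:2])
```

The two restricted variants write directly into a preallocated output, whereas `apply_col` cannot know the result length until every group has been processed; that is the trade-off discussed for our `by` and `aggregate`.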