-
Notifications
You must be signed in to change notification settings - Fork 370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Stop treating DataFrames like matrices #484
Conversation
Thoughts on this, @kmsquire, @simonster, @garborg? |
So, I applaud the effort, and I think the implications aren't exactly clear to me yet. On one hand, operations on individual columns are not affected, since columns are just On the other hand, while removing quirky, unused features is good, I'm wondering if any nice, convenient features are being removed by this PR? Anyway, since applying some operations to whole DataFrames (or even subsets of columns) simultaneously can still be useful, I think that we should provide |
Well, one argument that nothing worth keeping changed is that not a single line of any test file had to be changed except the test file filled with made up calls to these functions. It's a little hard to imagine anyone was writing
Implementing |
I don't think Thinking further, one operation I use frequently in pandas is So perhaps |
Well, an error, of course! :-) |
Actually, for I guess I would punt on |
Thinking about this more, I think As you note, |
One thought I had was to parametrize DataFrames by column type, so that when the column types are all the same, you have type stability. This makes it harder to add non-type conformant columns to a typed DataFrame though. |
That's very similar to attempts by @simonster and @tshort to make DataFrames have a tuple-like type. It seems to be a hard problem to get right from my reading of Simon's efforts along this line. |
I agree that the generalization to tuple types is hard. I was proposing handling the easier case of parametrizing by one type which is the supertype of all column types. Generally, this would be |
Ah. You're proposing that row operations generate results using the typejoin of the column types. Seems like a reasonable start for sure, although I bet some people will be worried that the resulting DataFrame offers worse performance than the input DataFrame. |
I'd like to put the Setting aside how we handle that issue, I still really want to make this change. @simonster, are you cool with that? I really don't think we need to support taking the square root of a DataFrame. |
This functionality seems convenient even on a heterogenous DataFrame (e.g. #to_round = ....
rounded = [string(s, "Rdd") for s in to_round]
d[rounded] = round(d[to_round])
#or
d[to_round] = round(d[to_round], 3) ). But regardless of how mapping will be handled, I'm for this pull request -- I'd rather loose the functionality than handle these functions differently than custom / other functions. |
I'm a little confused about your example, @garborg. Isn't the code you've written just an instance of standard column indexing rules? |
@@ -211,67 +78,9 @@ function isequal(df1::AbstractDataFrame, df2::AbstractDataFrame) | |||
return true | |||
end | |||
|
|||
for sf in scalar_comparison_operators | |||
for sf in [:(==), :(!=), :(>), :(>=), :(<), :(<=)] | |||
vf = symbol(".$sf") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be clearer as:
for vf in [:(.==), :(.!=), :(.>), :(.>=), :(.<), :(.<=)]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed.
I would be happy to see these go. Revisiting this code makes me realize that our binary operators aren't actually checking that column names were identical. I'll open a new issue for this. |
Ok. I'll make the changes you suggest and then merge this. |
Stop treating DataFrames like matrices
@johnmyleswhite The example that confused you -- it was a response to the comment that we don't need to support taking the square root of a DataFrame. The right hand sides are rounding an entire DataFrame even if the purpose of the full line of code is to update a subset of columns -- the code worked before this PR and doesn't now. I was just acknowledging that it enabled a clean, concise syntax (I like clean and concise) -- just not the right one. |
Ah. Sorry for not understanding your example. Is there something you'd us to offer to make up for the change? For me the trouble with the operators we've now removed is that they're conceptually murky: what really is the square root of a DataFrame? And, assuming you buy the old definition, why would |
No worries. And I agree with both of the troubles you raise, which is why I supported the PR.
|
I think we can make |
For reference, thanks to broadcasting, we can now support element-wise operations and interaction with matrix-like objects without having to reimplement an arbitrary subset of functions: #1840. |
A few weeks ago I had a very long conversation with Hadley Wickham where we debated the merits of allowing arithmetic operations on DataFrames and came to agree that it's not helpful to offer functions that work on some DataFrames (if they're all numeric), but fail on others.
This pull request attempts to push that logic to its final conclusion: DataFrames aren't matrices. They shouldn't be the objects of numeric computation and we shouldn't define operators on DataFrames that only really work for
DataMatrix{T <: Number}
.Removing this functionality goes a long way towards simplifying our codebase and removing quirky features that we probably never use and would never encourage.
NB: This depends upon my previous pull request.