-
Notifications
You must be signed in to change notification settings - Fork 370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implementation of a DataFrame row as a parameterized SubDataFrame (fixes #375) #474
Conversation
|
||
typealias DataFrameRow SubDataFrame{Int} | ||
|
||
Base.start(row::DataFrameRow) = 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do people use this kind of iteration? I kind of want to remove this idiom for iterating over an AbstractDataFrame.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I use it frequently in Pandas. While it's nice to write batch operations on DataFrames, some things are just easier to write with a for loop.
Setting aside the pretty superficial comments I made, this looks really promising. Thanks for taking it on, Kevin! |
@@ -914,11 +914,11 @@ end | |||
|
|||
# a SubDataFrame is a lightweight wrapper around a DataFrame used most frequently in | |||
# split/apply sorts of operations. | |||
type SubDataFrame <: AbstractDataFrame | |||
type SubDataFrame{T<:Union(Int,Vector{Int})} <: AbstractDataFrame |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason this shouldn't also accept Range1{Int}
and maybe Range{Int}
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I was wondering if you could use AbstractVector{Int}
safely here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No reason other than I haven't gotten to it yet
I don't think I like this change. Shouldn't a single-row SubDataFrame act like a single-row DataFrame? Wouldn't it be better to have EachRow return a different type that can be indexed this way? |
Can't please everyone. ;-) I actually implemented that separate type once, when working on the original sort code. It turned out that I didn't need it there, and was encouraged to remove it, so I did. Having implemented it this way, I can say that a Anyone else have thoughts on this? |
How would a single row DataFrame and a single row SubDataFrame differ in behavior? |
Sorry, it's a little unclear what you're asking. On master right now, the row iterator returns a single-row DataFrame. As pointed out in #375, even though we know there's only one row, we need to include the row number when indexing if we want to get at the actual value, e.g., ## On master
julia> df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"])
4x2 DataFrame
|-------|---|---|
| Row # | A | B |
| 1 | 1 | M |
| 2 | 2 | F |
| 3 | 3 | F |
| 4 | 4 | M |
julia> row = EachRow(df)[1]
1x2 DataFrame
|-------|---|---|
| Row # | A | B |
| 1 | 1 | M |
julia> row["A"]
1-element DataArray{Int64,1}:
1
julia> row[1,"A"]
1 With this pull request, If I understand Tom correctly, he's suggesting that
|
If understand correctly too, I agree with @tshort. Better have This is very similar to the question of dropping dimensions from an array: here, dropping dimensions means returning a row, just like dropping dimensions in a matrix means returning a vector. I think |
Changing My main concern that this change would be largely cosmetic, and that any changes made to |
@kmsquire correctly summed up my suggestion. I don't think the type DataFrameRow
df::AbstractDataFrame
row::Int
end
DataFrameRow(df::AbstractDataFrame) = DataFrameRow(df, 1)
Base.getindex(r::DataFrameRow, idx) = r.df[r.row, idx]
Base.setindex!(r::DataFrameRow, value, idx) = setindex!(r.df, value, r.row, idx)
Base.start(r::DataFrameRow) = 1
Base.done(r::DataFrameRow, i::Int) = i > size(r.df, 1)
Base.next(r::DataFrameRow, i::Int) = (DataFrameRow(r.df, i), i + 1)
eachrow(df::AbstractDataFrame) = DataFrameRow(df) A |
Sorry about that. I posted the above from the wrong account. |
True, that isn't much code. We actually still need |
Sorry, Tom, I didn't look closely enough at your code--your iterator does indeed replace However, I think that the iterator definition for |
Kevin, the code above is the iterator over the DataFrame. Here's an example: julia> df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"])
4x2 DataFrame
|-------|---|---|
| Row # | A | B |
| 1 | 1 | M |
| 2 | 2 | F |
| 3 | 3 | F |
| 4 | 4 | M |
julia> for row in eachrow(df) println(row["A"]) end
1
2
3
4
julia> for row in eachrow(df) println(row["B"]) end
M
F
F
M
julia> eachrow(df)
DataFrameRow(4x2 DataFrame
|-------|---|---|
| Row # | A | B |
| 1 | 1 | M |
| 2 | 2 | F |
| 3 | 3 | F |
| 4 | 4 | M |,1)
julia> for row in eachrow(df) row["B"] = row["B"] * "X" end
julia> df
4x2 DataFrame
|-------|---|----|
| Row # | A | B |
| 1 | 1 | MX |
| 2 | 2 | FX |
| 3 | 3 | FX |
| 4 | 4 | MX |
julia> for row in eachrow(df) println(row[2]) end
MX
FX
FX
MX
|
We're posting a the same time! See my comment above yours... |
Got it. I can understand why you'd want that. |
(And ignore my comment about the bug in |
Looks great! ;-) |
Unfortunately, the possibility of an iterator over the elements of the I think that a Anyway, I'll punt for now and open up a separate issue/pull request once this gets settled. |
Okay, I've refactored everything to use a separate One of the tests in
Iterating using Other comments/suggestions welcome. I can change it back if others dislike this change. ;-) |
I like this appoach. I'd prefer that |
Currently |
I agree with @tshort's last comment: let's have With that change made, would there be anything else standing in the way of merging this? |
So, changing that back broke column iteration on SubDataFrames, and I don't have time to fix that right now. Will update later. |
I've updated this to something close to what people have requested.
To me, these changes slightly complicate the code, and I'm not exactly happy with them. But they reflect the functionality people are asking for.
Feedback welcome. If this looks good, I'll squash and commit. |
I didn't run it, but I like the approach and scanned through the code. On Wed, Jan 22, 2014 at 12:57 AM, Kevin Squire notifications@git.luolix.topwrote:
|
|
||
Base.sub(D::DataFrame, r::Int) = SubDataFrame(D, [r]) | ||
Base.sub(D::DataFrame, rs::RowsType) = SubDataFrame(D, rs) | ||
Base.sub(D::SubDataFrame, r::Int) = SubDataFrame(D.parent, [D.rows[r]]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this behave different than doing sub(d::DataFrame, r::Int)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really. I think I can collapse them into one function using AbstractDataFrame
--I'll do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry: I think I must be confused. I thought that single row indexing was going to produce a DataFrameRow from now on, not a SubDataFrame. Is that right?
I like everything about this, except for being on the fence about the definition of single row indexing for |
Okay, so a little more information about that: If we include julia> sdf = sub(df, [1])
1x2 SubDataFrame{Array{Int64,1}}
|-------|---|---|
| Row # | A | B |
| 1 | 1 | M |
julia> sdf["A"]
1-element DataArray{Int64,1}:
1 If we include julia> sdf = sub(df, 1)
1x2 SubDataFrame{Int64}
|-------|---|---|
| Row # | A | B |
| 1 | 1 | M |
julia> sdf["A"]
1 (Which is exactly how a Anyway, I'll squash, and without further comments, commit. Cheers! |
Actually, I'll fix the failing tests before committing. ;-o |
I've figured out my own source of confusion. Sorry for the noise. Just for future reference: a bug in GitHub means that I only receive e-mails for about 75% of any given conversation, so I'm now perpetually confused about any complex conversation since I miss large chunks of it. |
LOL. ;-) No worries, John! |
Okay, I've made a few additional simplifying changes and fixes, and squashed the pull request. Changes:
Fixes:
Tests pass on my machine. Assuming they pass in Travis and there are no further comments today, I'll merge later. |
I'm happy with this and think it's ready for merging. One question for later work: couldn't a couple more of these methods apply to AbstractDataFrame? |
@@ -913,37 +913,25 @@ end | |||
|
|||
# a SubDataFrame is a lightweight wrapper around a DataFrame used most frequently in | |||
# split/apply sorts of operations. | |||
type SubDataFrame <: AbstractDataFrame | |||
|
|||
immutable SubDataFrame{T<:AbstractVector{Int}} <: AbstractDataFrame | |||
parent::DataFrame |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the future, I think we could plausibly get away with making this a parametric type that can match any kind of AbstractDataFrame.
I think depends on the the characteristics of Anyway, that's a simple change, so I think I'll merge this for now. |
* Parametrize SubDataFrames by T<:AbstractVector{Int} * Changed SubDataFrame to immutable * DRYed out sub() functions * Deprecated subset (alias for sub), on the theory that we should remove redundant methods like this * Add show() for DataFrameRow * Change EachRow, EachCol -> eachrow, eachcol, to better match Base convention for iterators * Return DataFrameRows from eachrow * Changed Sort.lt comparisons to compare DataFrameRows (for issorted) * Specialize collect(r::DataFrameRow), so that Dict(collect(r::DataFrameRow)) works * Start to use size(df, 1) for nrow(df), size(df, 2) for ncol(df)
Yes, please do. |
Implementation of a DataFrame row as a parameterized SubDataFrame (fixes #375)
Are we updating METADATA.jl, or holding off? |
Given that we have a bad track record for making carefully targeted releases, I'd say we're better off just doing the update. |
So, I'm using the new I can either bump REQUIRES, or provide an implementation for backward compatibility. I know you suggested not concerning ourselves with supporting v0.2, but I'd rather not put out a known-broken release, and this is an easy fix. |
Let's bump REQUIRES. We already need to require 0.3 to support the new formula syntax for making model matrices. |
Done, but see JuliaLang/METADATA.jl#538 |
As suggested by @StefanKarpinski here, this PR parameterizes
SubDataFrames
by the type ofrows
.At this point, I've only implemented
Int
(for individual rows) andVector{Int}
(for everything else). Parameterizing byRange1{Int}
, as Stefan suggested, would also be a good addition.This makes indexing within rows more natural, as requested in #375--i.e., the elements of each row can be accessed either by name (
Dict
-like), or numerical index (Array
-like).Some examples:
Note that this is a BREAKING change
Before:
After: