Implementation of a DataFrame row as a parameterized SubDataFrame (fixes #375) #474

kmsquire · 2014-01-13T23:56:31Z

As suggested by @StefanKarpinski here, this PR parameterizes SubDataFrames by the type of rows.

At this point, I've only implemented Int (for individual rows) and Vector{Int} (for everything else). Parameterizing by Range1{Int}, as Stefan suggested, would also be a good addition.

This makes indexing within rows more natural, as requested in #375--i.e., the elements of each row can be accessed either by name (Dict-like), or numerical index (Array-like).

Some examples:

julia> df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"])
4x2 DataFrame
|-------|---|---|
| Row # | A | B |
| 1     | 1 | M |
| 2     | 2 | F |
| 3     | 3 | F |
| 4     | 4 | M |

julia> for row in EachRow(df)
          println(row)
       end
1x2 SubDataFrame{Int64}
|-------|---|---|
| Row # | A | B |
| 1     | 1 | M |

1x2 SubDataFrame{Int64}
|-------|---|---|
| Row # | A | B |
| 1     | 2 | F |

1x2 SubDataFrame{Int64}
|-------|---|---|
| Row # | A | B |
| 1     | 3 | F |

1x2 SubDataFrame{Int64}
|-------|---|---|
| Row # | A | B |
| 1     | 4 | M |


julia> for row in EachRow(df)
          println(row["A"], " ", row[2])
       end
1 M
2 F
3 F
4 M

julia> row = sub(df, 2)
1x2 SubDataFrame{Int64}
|-------|---|---|
| Row # | A | B |
| 1     | 2 | F |

julia> row["A"] = 200
200

julia> df
4x2 DataFrame
|-------|-----|---|
| Row # | A   | B |
| 1     | 1   | M |
| 2     | 200 | F |
| 3     | 3   | F |
| 4     | 4   | M |

Note that this is a BREAKING change

Before:

julia> row = EachRow(df)[1]
1x2 DataFrame
|-------|---|---|
| Row # | A | B |
| 1     | 1 | M |

julia> row[1]
1-element DataArray{Int64,1}:
 1

julia> typeof(row)
DataFrame (constructor with 22 methods)

After:

julia> row = EachRow(df)[1]
1x2 SubDataFrame{Int64}
|-------|---|---|
| Row # | A | B |
| 1     | 1 | M |

julia> row[1]
1

julia> typeof(row)
SubDataFrame{Int64} (constructor with 1 method)

johnmyleswhite · 2014-01-14T00:11:54Z

src/dataframe.jl

+
+typealias DataFrameRow SubDataFrame{Int}
+
+Base.start(row::DataFrameRow) = 1


Do people use this kind of iteration? I kind of want to remove this idiom for iterating over an AbstractDataFrame.

I use it frequently in Pandas. While it's nice to write batch operations on DataFrames, some things are just easier to write with a for loop.

johnmyleswhite · 2014-01-14T00:13:49Z

Setting aside the pretty superficial comments I made, this looks really promising. Thanks for taking it on, Kevin!

simonster · 2014-01-14T00:15:38Z

src/dataframe.jl

@@ -914,11 +914,11 @@ end

 # a SubDataFrame is a lightweight wrapper around a DataFrame used most frequently in
 # split/apply sorts of operations.
-type SubDataFrame <: AbstractDataFrame
+type SubDataFrame{T<:Union(Int,Vector{Int})} <: AbstractDataFrame


Any reason this shouldn't also accept Range1{Int} and maybe Range{Int}?

Yeah, I was wondering if you could use AbstractVector{Int} safely here.

No reason other than I haven't gotten to it yet

tshort · 2014-01-14T01:22:17Z

I don't think I like this change. Shouldn't a single-row SubDataFrame act like a single-row DataFrame? Wouldn't it be better to have EachRow return a different type that can be indexed this way?

kmsquire · 2014-01-14T03:10:06Z

> I don't think I like this change. Shouldn't a single-row SubDataFrame act like a single-row DataFrame? Wouldn't it be better to have EachRow return a different type that can be indexed this way?

Can't please everyone. ;-)

I actually implemented that separate type once, when working on the original sort code. It turned out that I didn't need it there, and was encouraged to remove it, so I did.

Having implemented it this way, I can say that a DataFrameRow type would look exactly like a SubDataFrame{Int}, and would likely duplicate a bit of code. But maybe not too much.

Anyone else have thoughts on this?

kmsquire · 2014-01-14T03:11:33Z

@tshort, see also the discussion in #375.

johnmyleswhite · 2014-01-14T03:39:59Z

How would a single row DataFrame and a single row SubDataFrame differ in behavior?

kmsquire · 2014-01-14T04:38:20Z

> How would a single row DataFrame and a single row SubDataFrame differ in behavior?

Sorry, it's a little unclear what you're asking.

On master right now, the row iterator returns a single-row DataFrame. As pointed out in #375, even though we know there's only one row, we need to include the row number when indexing if we want to get at the actual value, e.g.,

## On master
julia> df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"])
4x2 DataFrame
|-------|---|---|
| Row # | A | B |
| 1     | 1 | M |
| 2     | 2 | F |
| 3     | 3 | F |
| 4     | 4 | M |

julia> row = EachRow(df)[1]
1x2 DataFrame
|-------|---|---|
| Row # | A | B |
| 1     | 1 | M |

julia> row["A"]
1-element DataArray{Int64,1}:
 1

julia> row[1,"A"]
1

With this pull request, row["A"] and row[1] do the same thing that row[1, "A"] and row[1, 1] do on master (row[1, "A"] still works, if you prefer).

If I understand Tom correctly, he's suggesting that

* changing the behavior of SubDataFrame{T} depending on the parameterization T is unintuitive
* instead, a separate DataFrameRow type should be created, with the same behavior as SubDataFrame{Int}

nalimilan · 2014-01-14T09:34:21Z

If I understand correctly too, I agree with @tshort. Better to have SubDataFrame always behave like a DataFrame, and have a DataFrameRow type which would act differently.

This is very similar to the question of dropping dimensions from an array: here, dropping dimensions means returning a row, just like dropping dimensions in a matrix means returning a vector. I think sub(df, 1) (and EachRow) should return a DataFrameRow, but sub(df, [1]) or sub(df, 1:1) should return a SubDataFrame.
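To make the array analogy concrete, here is a minimal sketch on a plain matrix. It assumes current Julia, where indexing with an integer drops the indexed dimension; it is not a transcript from the Julia version used in this thread, and no DataFrames code is involved.

```julia
# A plain 3x2 matrix standing in for a two-column table.
A = [1 10; 2 20; 3 30]

row_scalar = A[1, :]     # integer index: the row dimension is dropped
row_range  = A[1:1, :]   # range index: the result is still a matrix
row_vector = A[[1], :]   # vector index: likewise still a matrix

println(size(row_scalar))
println(size(row_range))
println(size(row_vector))
```

Under that assumption, the integer case plays the role of a DataFrameRow, and the range and vector cases play the role of a single-row SubDataFrame.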

kmsquire · 2014-01-14T12:45:25Z

Changing SubDataFrame{Int} to DataFrameRow is pretty trivial (it's already an alias, although printing obviously doesn't reveal that).

My main concern is that this change would be largely cosmetic, and that any changes made to SubDataFrame methods in the future will likely have to be duplicated for DataFrameRow methods. Not very DRY. (Of course, there's a bit of duplication in the code already, but I'd prefer the direction to be decreasing.)

powerdistribution · 2014-01-14T12:48:03Z

@kmsquire correctly summed up my suggestion. I don't think the DataFrameRow type is particularly complicated. Here's some minimally tested code that does what I think we want.

type DataFrameRow
    df::AbstractDataFrame
    row::Int
end

DataFrameRow(df::AbstractDataFrame) = DataFrameRow(df, 1)

Base.getindex(r::DataFrameRow, idx) = r.df[r.row, idx]

Base.setindex!(r::DataFrameRow, value, idx) = setindex!(r.df, value, r.row, idx)

Base.start(r::DataFrameRow) = 1
Base.done(r::DataFrameRow, i::Int) = i > size(r.df, 1)
Base.next(r::DataFrameRow, i::Int) = (DataFrameRow(r.df, i), i + 1)

eachrow(df::AbstractDataFrame) = DataFrameRow(df)

A show method would also be nice. Overall, it's not much extra code, especially considering that this replaces DFRowIterator.

tshort · 2014-01-14T12:51:05Z

Sorry about that. I posted the above from the wrong account.

kmsquire · 2014-01-14T13:00:58Z

> A show method would also be nice. Overall, it's not much extra code, especially considering that this replaces DFRowIterator.

True, that isn't much code. We actually still need DFRowIterator, though. The iteration here is for iterating over the elements of a row. DFRowIterator iterates over rows of a DataFrame.

kmsquire · 2014-01-14T13:08:20Z

Sorry, Tom, I didn't look closely enough at your code--your iterator does indeed replace DFRowIterator. (Your eachrow has a small bug.)

However, I think that the iterator definition for DataFrameRow should iterate over the elements of the row, not the DataFrame, and that DFRowIterator should iterate over the rows of the DataFrame.

tshort · 2014-01-14T13:08:53Z

Kevin, the code above is the iterator over the DataFrame. Here's an example:

julia> df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"])
4x2 DataFrame
|-------|---|---|
| Row # | A | B |
| 1     | 1 | M |
| 2     | 2 | F |
| 3     | 3 | F |
| 4     | 4 | M |

julia> for row in eachrow(df) println(row["A"]) end
1
2
3
4

julia> for row in eachrow(df) println(row["B"]) end
M
F
F
M

julia> eachrow(df)
DataFrameRow(4x2 DataFrame
|-------|---|---|
| Row # | A | B |
| 1     | 1 | M |
| 2     | 2 | F |
| 3     | 3 | F |
| 4     | 4 | M |,1)

julia> for row in eachrow(df) row["B"] = row["B"] * "X" end

julia> df
4x2 DataFrame
|-------|---|----|
| Row # | A | B  |
| 1     | 1 | MX |
| 2     | 2 | FX |
| 3     | 3 | FX |
| 4     | 4 | MX |

julia> for row in eachrow(df) println(row[2]) end
MX
FX
FX
MX

kmsquire · 2014-01-14T13:09:48Z

We're posting at the same time! See my comment above yours...

tshort · 2014-01-14T13:10:27Z

Got it. I can understand why you'd want that.

kmsquire · 2014-01-14T13:10:40Z

(And ignore my comment about the bug in eachrow)

nalimilan · 2014-01-14T13:11:09Z

Looks great! ;-)

kmsquire · 2014-01-14T13:26:25Z

Unfortunately, the possibility of an iterator over the elements of the DataFrame row opens up a can of worms: when iterating over the row, does one return the values or (key, value) pairs?

I think that a DataFrameRow should be a subtype of Associative, or at least it should be possible to easily create a Dict or OrderedDict from a DataFrameRow, which would allow either type of iteration.

Anyway, I'll punt for now and open up a separate issue/pull request once this gets settled.
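A rough sketch of the two iteration styles in question, for illustration only. ToyRow is a hypothetical stand-in (not the DataFrameRow from this PR), and the code uses the Base.iterate protocol from current Julia rather than the start/next/done methods shown elsewhere in this thread. It iterates (name, value) pairs and shows that plain values and a Dict can both be derived from them.

```julia
# Hypothetical stand-in for a row: column names plus one value per column.
struct ToyRow
    names::Vector{Symbol}
    values::Vector{Any}
end

# Iterate Dict-style, yielding (name, value) tuples.
function Base.iterate(r::ToyRow, i::Int = 1)
    i > length(r.names) && return nothing
    return ((r.names[i], r.values[i]), i + 1)
end
Base.length(r::ToyRow) = length(r.names)

row = ToyRow([:A, :B], Any[2, "F"])

pairs_seen  = collect(row)                  # the (name, value) tuples
values_seen = [v for (_, v) in row]         # values recovered from the tuples
as_dict     = Dict(k => v for (k, v) in row)
```

With pair iteration as the primitive, value-only iteration costs one comprehension, whereas the reverse would need the names supplied separately.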

kmsquire · 2014-01-14T14:23:58Z

Okay, I've refactored everything to use a separate DataFrameRow class, and DRYed out sub() as well.

One of the tests in test/iteration.jl requires iterating over the elements of a row, which is not yet implemented for DataFrameRows (see my comments above), so I've commented it out for now.

sub(df, i) now returns a DataFrameRow if i is an Int; otherwise it returns a SubDataFrame.

Iterating using eachrow returns a DataFrameRow for successive rows.

Other comments/suggestions welcome. I can change it back if others dislike this change. ;-)

tshort · 2014-01-14T14:31:35Z

I like this approach. I'd prefer that sub always returns a SubDataFrame. If you want to pick out one row, you can just directly use DataFrameRow(df, i).

nalimilan · 2014-01-14T18:01:25Z

Currently sub() for arrays only drops trailing dimensions when indexing with an integer. Since rows are the first dimension, and sub() selects all columns of the DataFrame (the second dimension), it would indeed be consistent to preserve dimensions. (My proposal above to drop dimensions makes sense only in a world where getindex() for arrays drops all dimensions, which I think would be right but is not what happens currently.)

johnmyleswhite · 2014-01-16T04:36:47Z

I agree with @tshort's last comment: let's have sub always return a SubDataFrame and eachrow produce the DataFrameRow type for each iteration.

With that change made, would there be anything else standing in the way of merging this?

kmsquire · 2014-01-16T15:01:16Z

So, changing that back broke column iteration on SubDataFrames, and I don't have time to fix that right now. Will update later.

kmsquire · 2014-01-22T05:57:42Z

I've updated this to something close to what people have requested.

SubDataFrames can still be specialized by Array{Int} or Ranges{Int}, but no longer by Int, since doing so pretty much makes them act like DataFrameRows, which isn't desirable. (This required special casing sub when using integer indexes.)
Because DataFrameRows are no longer SubDataFrames, they didn't have a show method, so one was added.

To me, these changes slightly complicate the code, and I'm not exactly happy with them. But they reflect the functionality people are asking for.

Iteration over a DataFrameRow returns (key, value) tuples.

Feedback welcome. If this looks good, I'll squash and commit.

tshort · 2014-01-22T12:43:20Z

I didn't run it, but I like the approach and scanned through the code.


johnmyleswhite · 2014-01-22T17:19:51Z

src/dataframe.jl

+
+Base.sub(D::DataFrame, r::Int) = SubDataFrame(D, [r])
+Base.sub(D::DataFrame, rs::RowsType) = SubDataFrame(D, rs)
+Base.sub(D::SubDataFrame, r::Int) = SubDataFrame(D.parent, [D.rows[r]])


Does this behave differently than doing sub(d::DataFrame, r::Int)?

Not really. I think I can collapse them into one function using AbstractDataFrame--I'll do that.

Sorry: I think I must be confused. I thought that single row indexing was going to produce a DataFrameRow from now on, not a SubDataFrame. Is that right?

johnmyleswhite · 2014-01-22T17:25:04Z

I like everything about this, except for being on the fence about the definition of single row indexing for SubDataFrame.

kmsquire · 2014-01-22T18:57:26Z

> I like everything about this, except for being on the fence about the definition of single row indexing for SubDataFrame.

Okay, so a little more information about that:

Including Int as a possible RowsType almost works, but the behavior seems to be contrary to what was requested. In particular, access to columns of a SubDataFrame{Vector{Int}} or SubDataFrame{T<:Ranges{Int}} always gives back a DataArray:

julia> sdf = sub(df, [1])
1x2 SubDataFrame{Array{Int64,1}}
|-------|---|---|
| Row # | A | B |
| 1     | 1 | M |

julia> sdf["A"]
1-element DataArray{Int64,1}:
 1

If we include Int in RowsType, access to a column of a SubDataFrame{Int}, with almost no change in definitions(*), gives the element itself:

julia> sdf = sub(df, 1)
1x2 SubDataFrame{Int64}
|-------|---|---|
| Row # | A | B |
| 1     | 1 | M |

julia> sdf["A"]
 1

(Which is exactly how a DataFrameRow works.) My interpretation of Tom's and Milan's preferences (and John's concordance) was that this was undesirable, so I removed Int from RowsType, and that necessitates special-casing the SubDataFrame constructor when using an Int (which is actually identical to some old constructor/sub functions).

Anyway, I'll squash, and without further comments, commit. Cheers!

kmsquire · 2014-01-22T18:58:23Z

Actually, I'll fix the failing tests before committing. ;-o

johnmyleswhite · 2014-01-22T19:02:17Z

I've figured out my own source of confusion. Sorry for the noise.

Just for future reference: a bug in GitHub means that I only receive e-mails for about 75% of any given conversation, so I'm now perpetually confused about any complex conversation since I miss large chunks of it.

kmsquire · 2014-01-22T19:03:05Z

LOL. ;-) No worries, John!

kmsquire · 2014-01-22T20:12:26Z

Okay, I've made a few additional simplifying changes and fixes, and squashed the pull request.

Changes:

I removed RowsType; SubDataFrames are now parametrized by T<:AbstractVector{Int}, which allows DataVectors{Int}, Vectors{Int}, and Ranges{Int} to all work
I added external functions/constructors which relax the eltype requirement to Integer (but convert to Int)
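The two changes above can be sketched roughly as follows. This is a minimal illustration in current Julia syntax (struct rather than the immutable keyword of the time). ToySubFrame, its Dict-of-columns parent, and the column helper are all hypothetical names, not the code in this PR, and the bound is taken to be AbstractVector{Int}.

```julia
# Hypothetical stand-in for the wrapper: `parent` is a plain Dict of columns,
# not a real DataFrame, and the type is parametrized by its row-index type.
struct ToySubFrame{T<:AbstractVector{Int}}
    parent::Dict{Symbol,Vector}
    rows::T
end

# Outer constructor that relaxes the element type to any Integer and converts
# to Int. Index vectors that already hold Int hit the default constructor.
ToySubFrame(parent::Dict{Symbol,Vector}, rows::AbstractVector{<:Integer}) =
    ToySubFrame(parent, convert(Vector{Int}, rows))

# Hypothetical helper: read one column through the row selection.
column(s::ToySubFrame, name::Symbol) = s.parent[name][s.rows]

cols = Dict{Symbol,Vector}(:A => [1, 2, 3, 4], :B => ["M", "F", "F", "M"])

by_range = ToySubFrame(cols, 2:3)          # keeps the range as the parameter
by_int32 = ToySubFrame(cols, Int32[1, 4])  # converted to Vector{Int}
```

Keeping the range as the type parameter avoids materializing an index vector for contiguous selections, which is presumably part of the appeal of parametrizing here.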

Fixes:

extrema doesn't like empty inputs
The sorting framework uses eachrow, and needed some minor updates to use DataFrameRows

Tests pass on my machine. Assuming they pass in Travis and there are no further comments today, I'll merge later.

johnmyleswhite · 2014-01-22T20:18:38Z

I'm happy with this and think it's ready for merging.

One question for later work: couldn't a couple more of these methods apply to AbstractDataFrame?

johnmyleswhite · 2014-01-22T20:19:38Z

src/dataframe.jl

@@ -913,37 +913,25 @@ end

 # a SubDataFrame is a lightweight wrapper around a DataFrame used most frequently in
 # split/apply sorts of operations.
-type SubDataFrame <: AbstractDataFrame
+
+immutable SubDataFrame{T<:AbstractVector{Int}} <: AbstractDataFrame
    parent::DataFrame


In the future, I think we could plausibly get away with making this a parametric type that can match any kind of AbstractDataFrame.

kmsquire · 2014-01-22T20:59:54Z

> One question for later work: couldn't a couple more of these methods apply to AbstractDataFrame?

I think that depends on the characteristics of AbstractDataFrames, and might run into some of the subtleties being debated around the Julia abstract array types.

Anyway, that's a simple change, so I think I'll merge this for now.

* Parametrize SubDataFrames by T<:AbstractVector{Int}
* Changed SubDataFrame to immutable
* DRYed out sub() functions
* Deprecated subset (alias for sub), on the theory that we should remove redundant methods like this
* Add show() for DataFrameRow
* Change EachRow, EachCol -> eachrow, eachcol, to better match Base convention for iterators
* Return DataFrameRows from eachrow
* Changed Sort.lt comparisons to compare DataFrameRows (for issorted)
* Specialize collect(r::DataFrameRow), so that Dict(collect(r::DataFrameRow)) works
* Start to use size(df, 1) for nrow(df), size(df, 2) for ncol(df)

johnmyleswhite · 2014-01-22T21:57:42Z

Yes, please do.


kmsquire · 2014-01-22T22:26:25Z

Are we updating METADATA.jl, or holding off?

johnmyleswhite · 2014-01-22T22:27:47Z

Given that we have a bad track record for making carefully targeted releases, I'd say we're better off just doing the update.

kmsquire · 2014-01-22T22:37:06Z

So, I'm using the new extrema function, which is not available in v0.2.

I can either bump REQUIRES, or provide an implementation for backward compatibility.

I know you suggested not concerning ourselves with supporting v0.2, but I'd rather not put out a known-broken release, and this is an easy fix.

johnmyleswhite · 2014-01-22T22:38:21Z

Let's bump REQUIRES. We already need to require 0.3 to support the new formula syntax for making model matrices.

kmsquire · 2014-01-22T22:53:30Z

Done, but see JuliaLang/METADATA.jl#538


kmsquire added a commit that referenced this pull request Jan 22, 2014

Merge pull request #474 from kmsquire/row_as_subdataframe

f2b0139

Implementation of a DataFrame row as a parameterized SubDataFrame (fixes #375)

kmsquire merged commit f2b0139 into JuliaData:master Jan 22, 2014

kmsquire deleted the row_as_subdataframe branch January 22, 2014 22:04

