Implementation of a DataFrame row as a parameterized SubDataFrame (fixes #375) #474

Merged 1 commit on Jan 22, 2014

kmsquire
Contributor

As suggested by @StefanKarpinski here, this PR parameterizes SubDataFrames by the type of rows.

At this point, I've only implemented Int (for individual rows) and Vector{Int} (for everything else). Parameterizing by Range1{Int}, as Stefan suggested, would also be a good addition.

This makes indexing within rows more natural, as requested in #375--i.e., the elements of each row can be accessed either by name (Dict-like), or numerical index (Array-like).

Some examples:

julia> df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"])
4x2 DataFrame
|-------|---|---|
| Row # | A | B |
| 1     | 1 | M |
| 2     | 2 | F |
| 3     | 3 | F |
| 4     | 4 | M |

julia> for row in EachRow(df)
          println(row)
       end
1x2 SubDataFrame{Int64}
|-------|---|---|
| Row # | A | B |
| 1     | 1 | M |

1x2 SubDataFrame{Int64}
|-------|---|---|
| Row # | A | B |
| 1     | 2 | F |

1x2 SubDataFrame{Int64}
|-------|---|---|
| Row # | A | B |
| 1     | 3 | F |

1x2 SubDataFrame{Int64}
|-------|---|---|
| Row # | A | B |
| 1     | 4 | M |


julia> for row in EachRow(df)
          println(row["A"], " ", row[2])
       end
1 M
2 F
3 F
4 M

julia> row = sub(df, 2)
1x2 SubDataFrame{Int64}
|-------|---|---|
| Row # | A | B |
| 1     | 2 | F |

julia> row["A"] = 200
200

julia> df
4x2 DataFrame
|-------|-----|---|
| Row # | A   | B |
| 1     | 1   | M |
| 2     | 200 | F |
| 3     | 3   | F |
| 4     | 4   | M |

Note that this is a BREAKING change

Before:

julia> row = EachRow(df)[1]
1x2 DataFrame
|-------|---|---|
| Row # | A | B |
| 1     | 1 | M |

julia> row[1]
1-element DataArray{Int64,1}:
 1

julia> typeof(row)
DataFrame (constructor with 22 methods)

After:

julia> row = EachRow(df)[1]
1x2 SubDataFrame{Int64}
|-------|---|---|
| Row # | A | B |
| 1     | 1 | M |

julia> row[1]
1

julia> typeof(row)
SubDataFrame{Int64} (constructor with 1 method)


typealias DataFrameRow SubDataFrame{Int}

Base.start(row::DataFrameRow) = 1
Contributor

Do people use this kind of iteration? I kind of want to remove this idiom for iterating over an AbstractDataFrame.

Contributor Author

I use it frequently in Pandas. While it's nice to write batch operations on DataFrames, some things are just easier to write with a for loop.

@johnmyleswhite
Contributor

Setting aside the pretty superficial comments I made, this looks really promising. Thanks for taking it on, Kevin!

@@ -914,11 +914,11 @@ end

# a SubDataFrame is a lightweight wrapper around a DataFrame used most frequently in
# split/apply sorts of operations.
-type SubDataFrame <: AbstractDataFrame
+type SubDataFrame{T<:Union(Int,Vector{Int})} <: AbstractDataFrame
Contributor

Any reason this shouldn't also accept Range1{Int} and maybe Range{Int}?

Contributor

Yeah, I was wondering if you could use AbstractVector{Int} safely here.

Contributor Author

No reason other than I haven't gotten to it yet

@tshort
Contributor

tshort commented Jan 14, 2014

I don't think I like this change. Shouldn't a single-row SubDataFrame act like a single-row DataFrame? Wouldn't it be better to have EachRow return a different type that can be indexed this way?

@kmsquire
Contributor Author

> I don't think I like this change. Shouldn't a single-row SubDataFrame act like a single-row DataFrame? Wouldn't it be better to have EachRow return a different type that can be indexed this way?

Can't please everyone. ;-)

I actually implemented that separate type once, when working on the original sort code. It turned out that I didn't need it there, and was encouraged to remove it, so I did.

Having implemented it this way, I can say that a DataFrameRow type would look exactly like a SubDataFrame{Int}, and would likely duplicate a bit of code. But maybe not too much.

Anyone else have thoughts on this?

@kmsquire
Contributor Author

@tshort, see also the discussion in #375.

@johnmyleswhite
Contributor

How would a single row DataFrame and a single row SubDataFrame differ in behavior?

@kmsquire
Contributor Author

> How would a single row DataFrame and a single row SubDataFrame differ in behavior?

Sorry, it's a little unclear what you're asking.

On master right now, the row iterator returns a single-row DataFrame. As pointed out in #375, even though we know there's only one row, we need to include the row number when indexing if we want to get at the actual value, e.g.,

## On master
julia> df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"])
4x2 DataFrame
|-------|---|---|
| Row # | A | B |
| 1     | 1 | M |
| 2     | 2 | F |
| 3     | 3 | F |
| 4     | 4 | M |

julia> row = EachRow(df)[1]
1x2 DataFrame
|-------|---|---|
| Row # | A | B |
| 1     | 1 | M |

julia> row["A"]
1-element DataArray{Int64,1}:
 1

julia> row[1,"A"]
1

With this pull request, row["A"] and row[1] do the same thing that row[1, "A"] and row[1, 1] do on master (row[1, "A"] still works, if you prefer).

If I understand Tom correctly, he's suggesting that

  1. changing the behavior of SubDataFrame{T} depending on the parameterization T is unintuitive, and
  2. instead, a separate DataFrameRow type should be created, with the same behavior as SubDataFrame{Int}
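
To make the distinction concrete, here is a minimal sketch of the two indexing behaviors under discussion. It is written in current Julia syntax rather than the 0.2-era syntax used elsewhere in this thread, and ToyFrame, ToyRow, and colindex are hypothetical stand-ins, not DataFrames code.

```julia
# Hypothetical stand-in for a data frame: named columns of equal length.
struct ToyFrame
    names::Vector{String}
    columns::Vector{Vector{Any}}
end

# Hypothetical helper: position of a column name.
colindex(df::ToyFrame, name::String) = findfirst(==(name), df.names)

# Two-argument indexing returns a single cell, like row[1, "A"] on master.
Base.getindex(df::ToyFrame, row::Int, col::Int) = df.columns[col][row]
Base.getindex(df::ToyFrame, row::Int, col::String) = df[row, colindex(df, col)]

# One-argument indexing on a frame returns the whole column,
# even when the frame has only one row.
Base.getindex(df::ToyFrame, col::String) = df.columns[colindex(df, col)]

# Hypothetical row view: remembers the parent frame and one row number.
struct ToyRow
    parent::ToyFrame
    row::Int
end

# One-argument indexing on a row view returns the cell itself.
Base.getindex(r::ToyRow, col) = r.parent[r.row, col]

# Writes go through to the parent, because the view copies no data.
Base.setindex!(r::ToyRow, value, col::String) =
    (r.parent.columns[colindex(r.parent, col)][r.row] = value)

df = ToyFrame(["A", "B"], [Any[1, 2, 3, 4], Any["M", "F", "F", "M"]])
row = ToyRow(df, 2)

println(df["A"])     # the whole column
println(row["A"])    # a single value
println(row[2])      # the same row, accessed by position

row["A"] = 200
println(df[2, "A"])  # the parent sees the write
```

Keeping the row view as its own type means the frame's one-argument indexing never has to change meaning based on how many rows are selected.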

@nalimilan
Member

If I understand correctly too, I agree with @tshort. Better to have SubDataFrame always behave like a DataFrame, and have a DataFrameRow type which would act differently.

This is very similar to the question of dropping dimensions from an array: here, dropping dimensions means returning a row, just like dropping dimensions in a matrix means returning a vector. I think sub(df, 1) (and EachRow) should return a DataFrameRow, but sub(df, [1]) or sub(df, 1:1) should return a SubDataFrame.

@kmsquire
Contributor Author

Changing SubDataFrame{Int} to DataFrameRow is pretty trivial (it's already an alias, although printing obviously doesn't reveal that).

My main concern is that this change would be largely cosmetic, and that any changes made to SubDataFrame methods in the future will likely have to be duplicated for DataFrameRow methods. Not very DRY. (Of course, there's a bit of duplication in the code already, but I'd prefer the direction to be decreasing.)

@powerdistribution
Contributor

@kmsquire correctly summed up my suggestion. I don't think the DataFrameRow type is particularly complicated. Here's some minimally tested code that does what I think we want.

type DataFrameRow
    df::AbstractDataFrame
    row::Int
end

DataFrameRow(df::AbstractDataFrame) = DataFrameRow(df, 1)

Base.getindex(r::DataFrameRow, idx) = r.df[r.row, idx]

Base.setindex!(r::DataFrameRow, value, idx) = setindex!(r.df, value, r.row, idx)

Base.start(r::DataFrameRow) = 1
Base.done(r::DataFrameRow, i::Int) = i > size(r.df, 1)
Base.next(r::DataFrameRow, i::Int) = (DataFrameRow(r.df, i), i + 1)

eachrow(df::AbstractDataFrame) = DataFrameRow(df)

A show method would also be nice. Overall, it's not much extra code, especially considering that this replaces DFRowIterator.

@tshort
Contributor

tshort commented Jan 14, 2014

Sorry about that. I posted the above from the wrong account.

@kmsquire
Contributor Author

> A show method would also be nice. Overall, it's not much extra code, especially considering that this replaces DFRowIterator.

True, that isn't much code. We actually still need DFRowIterator, though. The iteration here is for iterating over the elements of a row. DFRowIterator iterates over rows of a DataFrame.

@kmsquire
Contributor Author

Sorry, Tom, I didn't look closely enough at your code--your iterator does indeed replace DFRowIterator. (Your eachrow has a small bug.)

However, I think that the iterator definition for DataFrameRow should iterate over the elements of the row, not the DataFrame, and that DFRowIterator should iterate over the rows of the DataFrame.

@tshort
Contributor

tshort commented Jan 14, 2014

Kevin, the code above is the iterator over the DataFrame. Here's an example:

julia> df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"])
4x2 DataFrame
|-------|---|---|
| Row # | A | B |
| 1     | 1 | M |
| 2     | 2 | F |
| 3     | 3 | F |
| 4     | 4 | M |

julia> for row in eachrow(df) println(row["A"]) end
1
2
3
4

julia> for row in eachrow(df) println(row["B"]) end
M
F
F
M

julia> eachrow(df)
DataFrameRow(4x2 DataFrame
|-------|---|---|
| Row # | A | B |
| 1     | 1 | M |
| 2     | 2 | F |
| 3     | 3 | F |
| 4     | 4 | M |,1)

julia> for row in eachrow(df) row["B"] = row["B"] * "X" end

julia> df
4x2 DataFrame
|-------|---|----|
| Row # | A | B  |
| 1     | 1 | MX |
| 2     | 2 | FX |
| 3     | 3 | FX |
| 4     | 4 | MX |

julia> for row in eachrow(df) println(row[2]) end
MX
FX
FX
MX

@kmsquire
Contributor Author

We're posting at the same time! See my comment above yours...

@tshort
Contributor

tshort commented Jan 14, 2014

Got it. I can understand why you'd want that.

@kmsquire
Contributor Author

(And ignore my comment about the bug in eachrow)

@nalimilan
Member

Looks great! ;-)

@kmsquire
Contributor Author

Unfortunately, the possibility of an iterator over the elements of the DataFrame row opens up a can of worms: when iterating over the row, does one return the values or (key, value) pairs?

I think that a DataFrameRow should be a subtype of Associative, or at least it should be possible to easily create a Dict or OrderedDict from a DataFrameRow, which would allow either type of iteration.

Anyway, I'll punt for now and open up a separate issue/pull request once this gets settled.

@kmsquire
Contributor Author

Okay, I've refactored everything to use a separate DataFrameRow class, and DRYed out sub() as well.

One of the tests in test/iteration.jl requires iterating over the elements of a row, which is not yet implemented for DataFrameRows (see my comments above), so I've commented it out for now.

sub(df, i) now returns a DataFrameRow if i is an Int; otherwise it returns a SubDataFrame.

Iterating using eachrow returns a DataFrameRow for successive rows.

Other comments/suggestions welcome. I can change it back if others dislike this change. ;-)

@tshort
Contributor

tshort commented Jan 14, 2014

I like this approach. I'd prefer that sub always returns a SubDataFrame. If you want to pick out one row, you can just directly use DataFrameRow(df, i).

@nalimilan
Member

Currently sub() for arrays only drops trailing dimensions when indexing with an integer. Since rows are the first dimension, and sub() selects all columns of the DataFrame (the second dimension), it would indeed be consistent to preserve dimensions. (My proposal above to drop dimensions makes sense only in a world where getindex() for arrays drops all dimensions, which I think would be right but is not what happens currently.)

@johnmyleswhite
Contributor

I agree with @tshort's last comment: let's have sub always return a SubDataFrame and eachrow produce the DataFrameRow type for each iteration.

With that change made, would there be anything else standing in the way of merging this?

@kmsquire
Contributor Author

So, changing that back broke column iteration on SubDataFrames, and I don't have time to fix that right now. Will update later.

@kmsquire
Contributor Author

I've updated this to something close to what people have requested.

  • SubDataFrames can still be specialized by Array{Int} or Ranges{Int}, but no longer by Int, since doing so pretty much makes them act like DataFrameRows, which isn't desirable. (This required special casing sub when using integer indexes.)
  • Because DataFrameRows are no longer SubDataFrames, they didn't have a show method, so one was added.

To me, these changes slightly complicate the code, and I'm not exactly happy with them. But they reflect the functionality people are asking for.

  • Iteration over a DataFrameRow returns (key, value) tuples.
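
As an illustration of what iterating over (key, value) tuples allows, here is a minimal sketch in current Julia syntax. PairRow is a hypothetical stand-in for a row type, not the DataFrameRow from this pull request, and it uses the modern Base.iterate protocol instead of the start/done/next functions of this era.

```julia
# Hypothetical row type: the column names plus the values of one row.
struct PairRow
    names::Vector{String}
    values::Vector{Any}
end

# Iterating the row yields one (name, value) tuple per column.
function Base.iterate(r::PairRow, i::Int = 1)
    i > length(r.names) && return nothing
    return ((r.names[i], r.values[i]), i + 1)
end
Base.length(r::PairRow) = length(r.names)
Base.eltype(::Type{PairRow}) = Tuple{String,Any}

row = PairRow(["A", "B"], Any[2, "F"])

for (name, value) in row
    println(name, " => ", value)
end

# Each element is a 2-tuple, so a Dict can be built from the row...
d = Dict(row)

# ...and values-only iteration is still a one-liner.
values_only = [value for (_, value) in row]
```

Yielding pairs keeps both styles of iteration available: callers who only want the values can drop the key, while the reverse would not be possible.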

Feedback welcome. If this looks good, I'll squash and commit.

@tshort
Contributor

tshort commented Jan 22, 2014

I didn't run it, but I like the approach and scanned through the code.


Base.sub(D::DataFrame, r::Int) = SubDataFrame(D, [r])
Base.sub(D::DataFrame, rs::RowsType) = SubDataFrame(D, rs)
Base.sub(D::SubDataFrame, r::Int) = SubDataFrame(D.parent, [D.rows[r]])
Contributor

Does this behave differently from sub(d::DataFrame, r::Int)?

Contributor Author

Not really. I think I can collapse them into one function using AbstractDataFrame--I'll do that.

Contributor

Sorry: I think I must be confused. I thought that single row indexing was going to produce a DataFrameRow from now on, not a SubDataFrame. Is that right?

@johnmyleswhite
Contributor

I like everything about this, except for being on the fence about the definition of single row indexing for SubDataFrame.

@kmsquire
Contributor Author

> I like everything about this, except for being on the fence about the definition of single row indexing for SubDataFrame.

Okay, so a little more information about that:

Including Int as a possible RowsType almost works, but the behavior seems to be contrary to what was requested. In particular, access to columns of a SubDataFrame{Vector{Int}} or SubDataFrame{T<:Ranges{Int}} always gives back a DataArray:

julia> sdf = sub(df, [1])
1x2 SubDataFrame{Array{Int64,1}}
|-------|---|---|
| Row # | A | B |
| 1     | 1 | M |

julia> sdf["A"]
1-element DataArray{Int64,1}:
 1

If we include Int in RowsType, access to a column of a SubDataFrame{Int}, with almost no change in definitions(*), gives the element itself:

julia> sdf = sub(df, 1)
1x2 SubDataFrame{Int64}
|-------|---|---|
| Row # | A | B |
| 1     | 1 | M |

julia> sdf["A"]
 1

(Which is exactly how a DataFrameRow works.) My interpretation of Tom's and Milan's preferences (and John's concordance) was that this was undesirable, so I removed Int from RowsType, and that necessitates special-casing the SubDataFrame constructor when using an Int (which is actually identical to some old constructor/sub functions).

Anyway, I'll squash, and without further comments, commit. Cheers!

@kmsquire
Contributor Author

Actually, I'll fix the failing tests before committing. ;-o

@johnmyleswhite
Contributor

I've figured out my own source of confusion. Sorry for the noise.

Just for future reference: a bug in GitHub means that I only receive e-mails for about 75% of any given conversation, so I'm now perpetually confused about any complex conversation since I miss large chunks of it.

@kmsquire
Contributor Author

LOL. ;-) No worries, John!

@kmsquire
Contributor Author

Okay, I've made a few additional simplifying changes and fixes, and squashed the pull request.

Changes:

  • I removed RowsType; SubDataFrames are now parametrized by T<:AbstractArray{Int}, which allows DataVectors{Int}, Vectors{Int}, and Ranges{Int} to all work
  • I added external functions/constructors which relax the eltype requirement to Integer (but convert to Int)
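
Roughly, the pattern described in these two changes looks like the following minimal sketch. It uses current Julia syntax, and ToyView is a hypothetical stand-in whose parent is a plain vector, not the SubDataFrame definition from this pull request.

```julia
# Hypothetical view type, parametrized by the kind of row index it stores.
struct ToyView{T<:AbstractVector{Int}}
    parent::Vector{String}   # stand-in for the parent table
    rows::T
end

# Outer constructor that relaxes the element type to any Integer,
# converting the indices to Int before they are stored.
ToyView(parent::Vector{String}, rows::AbstractVector{<:Integer}) =
    ToyView(parent, convert(Vector{Int}, rows))

# A single integer becomes a one-element vector, so the result is still
# a view and not a separate row type.
ToyView(parent::Vector{String}, row::Integer) = ToyView(parent, [Int(row)])

Base.getindex(v::ToyView, i::Int) = v.parent[v.rows[i]]
Base.length(v::ToyView) = length(v.rows)

data = ["w", "x", "y", "z"]

v1 = ToyView(data, 2:3)           # the range is stored as-is
v2 = ToyView(data, UInt8[1, 4])   # indices are converted to Int
v3 = ToyView(data, 2)             # one-row view

println(typeof(v1))
println(typeof(v2))
println(v3[1])
```

Storing ranges unconverted avoids allocating an index vector for contiguous selections, while the conversion in the outer constructors keeps the type parameter restricted to Int-valued indices.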

Fixes:

  • extrema doesn't like empty inputs
  • The sorting framework uses eachrow, and needed some minor updates to use DataFrameRows

Tests pass on my machine. Assuming they pass in Travis and there are no further comments today, I'll merge later.

@johnmyleswhite
Contributor

I'm happy with this and think it's ready for merging.

One question for later work: couldn't a couple more of these methods apply to AbstractDataFrame?

@@ -913,37 +913,25 @@ end

# a SubDataFrame is a lightweight wrapper around a DataFrame used most frequently in
# split/apply sorts of operations.
-type SubDataFrame <: AbstractDataFrame
+immutable SubDataFrame{T<:AbstractVector{Int}} <: AbstractDataFrame
     parent::DataFrame
Contributor

In the future, I think we could plausibly get away with making this a parametric type that can match any kind of AbstractDataFrame.

@kmsquire
Contributor Author

> One question for later work: couldn't a couple more of these methods apply to AbstractDataFrame?

I think that depends on the characteristics of AbstractDataFrames, and might run into some of the subtleties being debated around the Julia abstract array types.

Anyway, that's a simple change, so I think I'll merge this for now.

* Parametrize SubDataFrames by T<:AbstractVector{Int}
* Changed SubDataFrame to immutable
* DRYed out sub() functions
* Deprecated subset (alias for sub), on the theory that we should
  remove redundant methods like this
* Add show() for DataFrameRow
* Change EachRow, EachCol -> eachrow, eachcol, to better match
  Base convention for iterators
* Return DataFrameRows from eachrow
* Changed Sort.lt comparisons to compare DataFrameRows (for issorted)
* Specialize collect(r::DataFrameRow), so that Dict(collect(r::DataFrameRow)) works
* Start to use size(df, 1) for nrow(df), size(df, 2) for ncol(df)
@johnmyleswhite
Contributor

Yes, please do.

kmsquire added a commit that referenced this pull request Jan 22, 2014
Implementation of a DataFrame row as a parameterized SubDataFrame (fixes #375)
@kmsquire kmsquire merged commit f2b0139 into JuliaData:master Jan 22, 2014
@kmsquire kmsquire deleted the row_as_subdataframe branch January 22, 2014 22:04
@kmsquire
Contributor Author

Are we updating METADATA.jl, or holding off?

@johnmyleswhite
Contributor

Given that we have a bad track record for making carefully targeted releases, I'd say we're better off just doing the update.

@kmsquire
Contributor Author

So, I'm using the new extrema function, which is not available in v0.2.

I can either bump REQUIRES, or provide an implementation for backward compatibility.

I know you suggested not concerning ourselves with supporting v0.2, but I'd rather not put out a known-broken release, and this is an easy fix.

@johnmyleswhite
Contributor

Let's bump REQUIRES. We already need to require 0.3 to support the new formula syntax for making model matrices.

@kmsquire
Contributor Author

Done, but see JuliaLang/METADATA.jl#538

6 participants