EachRow(df)[i] should be a vector, not a DataFrame #375

jvns · 2013-10-11T19:01:38Z

Right now EachRow(df)[i] is defined as a slice:

getindex(itr::DFRowIterator, i::Any) = itr.df[i, :]

This means that if you set

row = EachRow(df)[i]

then row[1] is an array, not an element. Not sure how to handle this.


johnmyleswhite · 2013-10-11T19:03:23Z

How else would this be defined?

I'm personally inclined to remove all of the iteration constructs for DataFrames.

jvns · 2013-10-11T19:19:00Z

Why? It's definitely important to be able to get a row of a DataFrame, in any case... is this possible right now?

johnmyleswhite · 2013-10-11T19:24:11Z

Right now the iterator returns one-row DataFrames. All of the standard indexing rules for DataFrames apply to that DataFrame. Does that make possible what you're trying to do?

jvns · 2013-10-11T19:50:47Z

It's not causing problems for me right now, but I don't think that's what the iterator should return.

What I meant was: if you removed the iteration constructs, how would you get rows from a dataframe?

edit: it makes anything possible, just kind of ugly. It means that you have to do something like

for row = EachRow(df)
    name = row["name"][1]
end

to access elements in a row.

I'm curious about how pandas handles this now.

johnmyleswhite · 2013-10-11T19:57:16Z

Just to be sure I understand: you think that a row of a DataFrame should not be a DataFrame, but an Array{Any}. Is that right?

You can always do iteration with explicit indexing. The virtue of the iterator seems restricted to composition with functions over iterators.

StefanKarpinski · 2013-10-11T20:03:28Z

I would argue that a tuple would be better for this than an Any array. Of course, you can't index by name into a tuple or an array, so that's a downside of either of these alternatives.

jvns · 2013-10-11T20:04:05Z

I think a row of a DataFrame should have the same type as a column of a DataFrame: a DataArray{Any} (or, if the types are all the same, a DataArray{Float64} or whatever).

Does that make sense?

johnmyleswhite · 2013-10-11T20:07:15Z

Ah. I think Stefan's point about indexing is the obvious reason why one might like to have a DataFrame returned, rather than a DataArray. That said, I can see reasonable arguments for many approaches.

StefanKarpinski · 2013-10-11T20:07:53Z

No, that doesn't really make sense: DataArrays are homogeneous whereas DataFrames are a heterogeneous bundle of homogeneous columns.

jvns · 2013-10-11T20:08:46Z

Ideally I'd like df["name"][1] to be the same as df[1]["name"], which you can (mostly) have in pandas.

johnmyleswhite · 2013-10-11T20:16:28Z

Hmmm. That seems like a non-trivial change to me. For me, the relevant equivalence is that a 1-row DataFrame, when indexed for its unique row, should return the DataFrame, not a separate entity.

For what it's worth, our approach is like a sane version of R's approach. (R's approach is nutty because a single row of a 1-column DataFrame is a vector, but a multi-column DataFrame gives a DataFrame for a single row.)

jvns · 2013-10-11T20:33:05Z

That makes sense. It seems like making df["name"][1] be the same as df[1]["name"] would be a pretty big change, and anything else would be worse.

kmsquire · 2013-10-11T23:46:21Z

I'm also missing the ability to access rows of a data frame and treat them
as entities. Pandas handles this by defining a Series type (as well as
going the other way, extending DataFrames into 3D Panels, etc.). I have made
good use of both of these features, which has kept me from using Julia's
DataFrames as much as I would like.

Would an implementation of "rows" using OrderedDicts() be worthwhile? I
have an implementation of OrderedDicts, in limbo as a submission to
mainline Julia, but possibly better off in a package anyway.

(I actually implemented a separate row type at one point, but it turned out
not to be needed as part of the patch I was submitting, so I dropped it.)
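
To make the OrderedDict idea concrete, here is a minimal sketch in Python (the language of the pandas snippets in this thread) rather than Julia. The column-dictionary layout and the helper `row_as_ordered_dict` are hypothetical and exist only for illustration; they are not part of DataFrames or pandas.

```python
from collections import OrderedDict

# Hypothetical column store: each column name maps to a homogeneous list.
columns = OrderedDict([
    ("name", ["alice", "bob"]),
    ("age", [31, 27]),
])


def row_as_ordered_dict(cols, i):
    """Copy row i into an OrderedDict keyed by column name, keeping column order."""
    return OrderedDict((name, values[i]) for name, values in cols.items())


row = row_as_ordered_dict(columns, 1)
print(row["name"])             # lookup by column name
print(list(row.values())[0])   # lookup by position, via the preserved order
```

A row built this way is a copy, so writing to it would not change the table it came from.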


StefanKarpinski · 2013-10-12T05:26:53Z

> I'm also missing the ability to access rows of a data frame and treat them
> as entities.

I'm confused by this. Doesn't returning a row as a single-value DataFrame do that? Or are you talking about having something like a DataRow type? That could be represented efficiently as a reference to the data frame plus a row index. Although that seems similar to what a SubDataFrame would be.
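
A rough sketch of the "reference plus row index" representation, written in Python for illustration only. The class name `DataRow` comes from the comment above, but its fields and methods here are assumptions, not an existing DataFrames API.

```python
class DataRow:
    """Hypothetical row type: a reference to the parent columns plus a row index."""

    def __init__(self, columns, index):
        self._columns = columns  # shared reference to the table, nothing is copied
        self._index = index

    def keys(self):
        # Column names, so the row can be used where a mapping is expected.
        return self._columns.keys()

    def __getitem__(self, name):
        # Index by column name only; the row position is already fixed.
        return self._columns[name][self._index]


columns = {"name": ["alice", "bob"], "age": [31, 27]}
row = DataRow(columns, 0)
label = "{name}-{age}".format(**row)  # a row can be unpacked like a dictionary
print(row["name"], label)
```

Because the row stores only a reference and an integer, creating one is cheap regardless of the number of columns.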

kmsquire · 2013-10-12T20:16:31Z

On Fri, Oct 11, 2013 at 10:26 PM, Stefan Karpinski wrote:

>> I'm also missing the ability to access rows of a data frame and treat them
>> as entities.
>
> I'm confused by this. Doesn't returning a row as a single-value DataFrame
> do that? Or are you talking about having something like a DataRow type?
> That could be represented efficiently as a reference to the data frame plus
> a row index. Although that seems similar to what a SubDataFrame would be.

That's true--Sorry, I was in a hurry and wasn't careful with my words.

My actual issue is partly aesthetic and partly semantic. Aesthetically,
x[1, "lib_id"] is not as nice as x["lib_id"], especially if it appears
frequently (as it does in my code--see below). Semantically, I find it
sometimes useful to think of a row as a dictionary, and the presence of an
extra index forces me to think about it as a table instead.

As an example, I have a data processing pipeline which makes extensive use
of DataFrames in pandas. Because I'm creating many intermediate files, I
often create filenames based on the values of other columns. Something like

lanes['bam_dir'] = (
    lanes.apply(
        lambda x: os.path.join(bam_dirs.ix[x['lib_id'], 'bam_dir'],
                               '_'.join([x['FCID'], str(x['lane'])])),
        axis=1))

Taking advantage of the fact that a row is a dictionary:

realigned_bams['target_intervals'] = (
    realigned_bams.apply(
        lambda x: os.path.join(x['realign_subdir'],
                               "{family_id}.{interval}.intervals".format(**x)),
        axis=1))

I've wanted to port this code to Julia for a while--there are (non-Pandas)
parts to the code which could really use some speed up. But many of the
Pandas features I use (not just those above) aren't sufficiently developed
yet in Julia's DataFrames, and I haven't had the time to work on them
myself.

I realize this probably isn't a common use case for DataFrames (and I use
them for more typical data processing as well), so it might not be the best
motivation for these types of features. But it is one of my motivations.
;-)

Kevin

StefanKarpinski · 2013-10-12T20:55:00Z

If SubDataFrame were parameterized on the type of the .rows field, then the current SubDataFrame would become SubDataFrame{Vector{Int}} and you could have SubDataFrame{Int} that references only a single row. You could also have SubDataFrame{Range1{Int}} that references a contiguous range of rows and so on. This could improve efficiency, but it could also allow indexing into a single-row data frame to behave differently from indexing into a multi-row data frame (even one that only happens to reference a single row).
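
The effect of varying the type of the rows field can be sketched as follows, in Python for illustration only. The class `SubFrame` and its layout are hypothetical, and the `isinstance` check stands in for what Julia would express by dispatching on the type parameter.

```python
class SubFrame:
    """Hypothetical view over shared columns; `rows` may be an int, a range, or a list."""

    def __init__(self, columns, rows):
        self.columns = columns
        self.rows = rows

    def __getitem__(self, name):
        column = self.columns[name]
        if isinstance(self.rows, int):
            # Single-row view: indexing by column yields the element itself.
            return column[self.rows]
        # Multi-row view (even one that selects a single row): yields a list.
        return [column[i] for i in self.rows]


columns = {"name": ["alice", "bob"], "age": [31, 27]}
single = SubFrame(columns, 1)           # analogous to SubDataFrame{Int}
one_of_many = SubFrame(columns, [1])    # analogous to SubDataFrame{Vector{Int}}
contiguous = SubFrame(columns, range(0, 2))
print(single["name"], one_of_many["name"], contiguous["age"])
```

The first two views reference the same row, yet only the integer-indexed one returns a bare element, which is the behavioral difference described above.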

kmsquire · 2013-10-12T21:10:27Z

That's a very interesting idea--thanks Stefan!


nalimilan · 2013-10-21T13:07:46Z

A related issue is that you can modify EachRow(df)[i] at will without any error, but the changes have absolutely no effect on the original DataFrame. Rather than a DataFrame, shouldn't EachRow(df)[i] be a SubDataFrame so that iterators can be used to modify data?
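
The difference between writing to a copy and writing through a view can be sketched like this, in Python for illustration only. The `RowView` class and the dictionary-of-lists table are assumptions made for the example, not the DataFrames implementation.

```python
columns = {"age": [31, 27]}

# One-row copy, analogous to the one-row DataFrame the iterator returns today.
row_copy = {name: [values[0]] for name, values in columns.items()}
row_copy["age"][0] = 99
after_copy_write = list(columns["age"])  # the original table is untouched


class RowView:
    """Hypothetical view: writes go through to the parent columns."""

    def __init__(self, cols, index):
        self.cols = cols
        self.index = index

    def __setitem__(self, name, value):
        self.cols[name][self.index] = value


RowView(columns, 0)["age"] = 99
after_view_write = list(columns["age"])  # the original table now reflects the write

print(after_copy_write, after_view_write)
```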

kmsquire mentioned this issue Jan 13, 2014

Implementation of a DataFrame row as a parameterized SubDataFrame (fixes #375) #474

Merged

kmsquire closed this as completed in f2b0139 Jan 22, 2014
