EachRow(df)[i] should be a vector, not a DataFrame #375

Closed
jvns opened this issue Oct 11, 2013 · 18 comments

Comments

@jvns
Contributor

jvns commented Oct 11, 2013

Right now EachRow(df)[i] is defined as a slice:

getindex(itr::DFRowIterator, i::Any) = itr.df[i, :]

This means that if you set

row = EachRow(df)[i]

then row[1] is an array, not an element. Not sure how to handle this.
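To make the behaviour concrete, here is a minimal Python sketch of the same situation. It is a toy model and not the DataFrames API: the table is assumed to be a plain dict of lists, and `row_slice` is a hypothetical stand-in for `itr.df[i, :]`.

```python
# Hypothetical toy model: a "data frame" is a dict mapping column name -> list.
def row_slice(table, i):
    """Mimic a row slice: return a one-row table, not a flat row."""
    return {name: [column[i]] for name, column in table.items()}


df = {"name": ["ann", "bob"], "age": [31, 45]}
row = row_slice(df, 1)

print(row["name"])     # a one-element list: ['bob']
print(row["name"][0])  # the element itself needs a second index: bob
```

Because the slice keeps the table shape, every cell lookup on the "row" still returns a container.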

@johnmyleswhite
Contributor

How else would this be defined?

I'm personally inclined to remove all of the iteration constructs for DataFrames.

@jvns
Contributor Author

jvns commented Oct 11, 2013

Why? It's definitely important to be able to get a row of a DataFrame, in any case... is this possible right now?

@johnmyleswhite
Contributor

Right now the iterator returns one-row DataFrames. All of the standard indexing rules for DataFrames apply to that DataFrame. Does that make possible what you're trying to do?

@jvns
Contributor Author

jvns commented Oct 11, 2013

It's not causing problems for me right now, but I don't think that's what the iterator should return.

What I meant was: if you removed the iteration constructs, how would you get rows from a dataframe?

edit: it makes anything possible, just kind of ugly. It means that you have to do something like

for row = EachRow(df)
    name = row["name"][1]
end

to access elements in a row.

I'm curious about how pandas handles this now.

@johnmyleswhite
Contributor

Just to be sure I understand: you think that a row of a DataFrame should not be a DataFrame, but an Array{Any}. Is that right?

You can always do iteration with explicit indexing. The virtue of the iterator seems restricted to composition with functions over iterators.

@StefanKarpinski
Member

I would argue that a tuple would be better for this than an Any array. Of course, you can't index by name into a tuple or an array, so that's a downside of either of these alternatives.
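For comparison only, some languages offer a tuple with named fields, which sidesteps this downside. The sketch below uses Python's `collections.namedtuple`; the `Row` type and its field names are made up for illustration, and nothing like it is being claimed for DataFrames.

```python
from collections import namedtuple

# Hypothetical row type; the field names stand in for column names.
Row = namedtuple("Row", ["name", "age"])
row = Row(name="bob", age=45)

print(row[0])                # positional access, like a plain tuple
print(row.name)              # access by field name
print(row._asdict()["age"])  # dictionary-style lookup after a conversion
```

The trade-off is that the field names are fixed when the type is created, so each distinct set of columns needs its own row type.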

@jvns
Contributor Author

jvns commented Oct 11, 2013

I think the row of a DataFrame should have the same type as a column of a DataFrame: a DataArray{Any} (or, if the types are all the same, a DataArray{Float64} or whatever).

Does that make sense?

@johnmyleswhite
Contributor

Ah. I think Stefan's point about indexing is the obvious reason why one might like to have a DataFrame returned, rather than a DataArray. That said, I can see reasonable arguments for many approaches.

@StefanKarpinski
Member

No, that doesn't really make sense: DataArrays are homogeneous whereas DataFrames are a heterogeneous bundle of homogeneous columns.

@jvns
Contributor Author

jvns commented Oct 11, 2013

Ideally I'd like df["name"][1] to be the same as df[1]["name"], which you can (mostly) have in pandas.
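A minimal sketch of that symmetry, written in plain Python rather than against pandas or DataFrames: `ToyFrame` is a hypothetical column store in which a string key selects a column and an integer key selects a row.

```python
class ToyFrame:
    """Hypothetical column store: str keys select a column, int keys a row."""

    def __init__(self, columns):
        self.columns = columns  # dict: column name -> list of values

    def __getitem__(self, key):
        if isinstance(key, str):
            return self.columns[key]  # whole column
        # Integer key: build a name -> value mapping for that row.
        return {name: col[key] for name, col in self.columns.items()}


df = ToyFrame({"name": ["ann", "bob"], "age": [31, 45]})

by_column = df["name"][1]  # column first, then row
by_row = df[1]["name"]     # row first, then column
print(by_column == by_row)
```

The sketch only works because rows and columns use different key types; a table with integer column names would make the two lookups ambiguous.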

@johnmyleswhite
Contributor

Hmmm. That seems like a non-trivial change to me. For me, the relevant equivalence is that a 1-row DataFrame, when indexed for its unique row, should return the DataFrame, not a separate entity.

For what it's worth, our approach is like a sane version of R's approach. (R's approach is nutty because a single row of a 1-column DataFrame is a vector, but a multi-column DataFrame gives a DataFrame for a single row.)

@jvns
Contributor Author

jvns commented Oct 11, 2013

That makes sense. It seems like making df["name"][1] be the same as df[1]["name"] would be a pretty big change, and anything else would be worse.

@kmsquire
Contributor

I'm also missing the ability to access rows of a data frame and treat them as entities. Pandas handles this by defining a Series type (as well as going the other way, extending DataFrames into 3D Panels, etc.). I have made good use of both of these features, which has kept me from using Julia's DataFrames as much as I would like.

Would an implementation of "rows" using OrderedDicts() be worthwhile? I
have an implementation of OrderedDicts, in limbo as a submission to
mainline Julia, but possibly better off in a package anyway.
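As a rough illustration of the idea (in Python, since the OrderedDict implementation mentioned here is not shown), a row could be copied out of a dict-of-lists table into an ordered mapping. The helper name `row_as_ordered_dict` and the sample values are assumptions made for this sketch.

```python
from collections import OrderedDict


def row_as_ordered_dict(columns, i):
    """Copy row i of a dict-of-lists table into an OrderedDict (hypothetical helper)."""
    return OrderedDict((name, col[i]) for name, col in columns.items())


columns = OrderedDict([("lib_id", ["L1", "L2"]), ("lane", [3, 4])])
row = row_as_ordered_dict(columns, 0)

print(list(row.keys()))  # column order is preserved
print(row["lane"])       # a single index by column name
```

Note that this copies the values, so the resulting row is detached from the table it came from.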

(I actually implemented a separate row type at one point, but it turned out
not to be needed as part of the patch I was submitting, so I dropped it.)


@StefanKarpinski
Member

> I'm also missing the ability to access rows of a data frame and treat them as entities.

I'm confused by this. Doesn't returning a row as a single-value DataFrame do that? Or are you talking about having something like a DataRow type? That could be represented efficiently as a reference to the data frame plus a row index. Although that seems similar to what a SubDataFrame would be.
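A minimal sketch of the "reference plus row index" representation, in Python and under the assumption that the table is a dict of lists. The `DataRow` class here is hypothetical and only mirrors the name floated above.

```python
class DataRow:
    """Hypothetical row handle: a reference to the table plus a row index."""

    def __init__(self, columns, index):
        self.columns = columns  # shared dict: column name -> list (not copied)
        self.index = index

    def __getitem__(self, name):
        # One index by column name; the row position is already stored.
        return self.columns[name][self.index]

    def keys(self):
        return list(self.columns)


columns = {"name": ["ann", "bob"], "age": [31, 45]}
row = DataRow(columns, 1)

print(row["name"])
print(row.keys())
```

Creating such a handle costs two references regardless of how many columns the table has, which is the efficiency argument made above.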

@kmsquire
Contributor

> I'm confused by this. Doesn't returning a row as a single-value DataFrame do that? Or are you talking about having something like a DataRow type? That could be represented efficiently as a reference to the data frame plus a row index. Although that seems similar to what a SubDataFrame would be.

That's true--Sorry, I was in a hurry and wasn't careful with my words.

My actual issue is partly aesthetic and partly semantic. Aesthetically, x[1, "lib_id"] is not as nice as x["lib_id"], especially if it appears frequently (as it does in my code; see below). Semantically, I find it sometimes useful to think of a row as a dictionary, and the presence of an extra index forces me to think about it as a table instead.

As an example, I have a data processing pipeline which makes extensive use
of DataFrames in pandas. Because I'm creating many intermediate files, I
often create filenames based on the values of other columns. Something like

lanes['bam_dir'] = lanes.apply(
    lambda x: os.path.join(
        bam_dirs.ix[x['lib_id'], 'bam_dir'],
        '_'.join([x['FCID'], str(x['lane'])])),
    axis=1)

Taking advantage of the fact that a row is a dictionary:

realigned_bams['target_intervals'] = realigned_bams.apply(
    lambda x: os.path.join(
        x['realign_subdir'],
        "{family_id}.{interval}.intervals".format(**x)),
    axis=1)
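The `format(**x)` call is the part that relies on the row behaving like a dictionary. A stand-alone sketch of just that step, using a plain dict in place of a pandas row (the values are made up for illustration):

```python
import os.path

# Hypothetical row as a plain dict; keys mirror the column names used above.
row = {"realign_subdir": "realigned", "family_id": "F01", "interval": "chr1"}

# str.format(**row) unpacks the mapping so each placeholder is filled by key.
filename = "{family_id}.{interval}.intervals".format(**row)
path = os.path.join(row["realign_subdir"], filename)

print(filename)
print(path)
```

Extra keys in the mapping (here `realign_subdir`) are ignored by `format`, so the whole row can be passed without selecting columns first.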

I've wanted to port this code to Julia for a while--there are (non-Pandas)
parts to the code which could really use some speed up. But many of the
Pandas features I use (not just those above) aren't sufficiently developed
yet in Julia's DataFrames, and I haven't had the time to work on them
myself.

I realize this probably isn't a common use case for DataFrames (and I use
them for more typical data processing as well), so it might not be the best
motivation for these types of features. But it is one of my motivations.
;-)

Kevin

@StefanKarpinski
Member

If SubDataFrame were parameterized on the type of the .rows field, then the current SubDataFrame would become SubDataFrame{Vector{Int}} and you could have SubDataFrame{Int} that references only a single row. You could also have SubDataFrame{Range1{Int}} that references a contiguous range of rows and so on. This could improve efficiency, but it could also allow indexing into a single-row data frame to behave differently from indexing into a multi-row data frame (even one that only happens to reference a single row).
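A rough sketch of how the row selector's type could change indexing behaviour. This is Python, so a runtime `isinstance` check stands in for a type parameter, and the `SubFrame` class is hypothetical rather than the actual SubDataFrame.

```python
class SubFrame:
    """Hypothetical view over a dict-of-lists table.

    `rows` may be a single int, a range, or a list of ints.
    """

    def __init__(self, columns, rows):
        self.columns = columns  # shared with the parent table
        self.rows = rows

    def __getitem__(self, name):
        column = self.columns[name]
        if isinstance(self.rows, int):
            return column[self.rows]           # single row: a scalar
        return [column[i] for i in self.rows]  # several rows: a list


columns = {"name": ["ann", "bob", "cy"]}
single = SubFrame(columns, 1)                # analogous to SubDataFrame{Int}
one_of_many = SubFrame(columns, [1])         # a list that happens to hold one row
contiguous = SubFrame(columns, range(0, 2))  # a contiguous range of rows

print(single["name"])
print(one_of_many["name"])
print(contiguous["name"])
```

The first two views reference the same row, yet one yields a scalar and the other a one-element list, which is the behavioural difference described above.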

@kmsquire
Contributor

That's a very interesting idea--thanks Stefan!

@nalimilan
Member

A related issue is that you can modify EachRow(df)[i] at will without any error, but the changes have absolutely no effect on the original DataFrame. Rather than a DataFrame, shouldn't EachRow(df)[i] be a SubDataFrame so that iterators can be used to modify data?
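The difference between the two behaviours can be sketched in Python with a dict-of-lists table. `RowView` is a hypothetical writable view, not the actual SubDataFrame, and the data is made up.

```python
columns = {"name": ["ann", "bob"], "age": [31, 45]}

# Copy semantics: the one-row table owns fresh lists, so edits stay local.
row_copy = {name: [col[1]] for name, col in columns.items()}
row_copy["age"][0] = 99
age_after_copy_edit = list(columns["age"])  # parent is unchanged


class RowView:
    """Hypothetical writable view: stores the parent table and a row index."""

    def __init__(self, columns, index):
        self.columns = columns
        self.index = index

    def __setitem__(self, name, value):
        self.columns[name][self.index] = value  # writes through to the parent


RowView(columns, 1)["age"] = 99
age_after_view_edit = list(columns["age"])  # parent now reflects the edit

print(age_after_copy_edit)
print(age_after_view_edit)
```

With copy semantics the assignment succeeds silently and is then lost, which matches the surprise described above.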
