concat produces incorrect output #3602

Closed
rhstanton opened this issue May 14, 2013 · 20 comments · Fixed by #3647
Labels
Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode

@rhstanton
Contributor

Under certain circumstances, concat seems to produce erroneous results. I haven't worked out what causes the problems to arise, but here's an example:

from pandas import DataFrame, concat

df1 = DataFrame({'firmNo': [0, 0, 0, 0], 'stringvar': ['rrr', 'rrr', 'rrr', 'rrr'], 'prc': [6, 6, 6, 6]})
df2 = DataFrame({'misc': [1, 2, 3, 4], 'prc': [6, 6, 6, 6], 'C': [9, 10, 11, 12]})
concat([df1, df2], axis=1)

produces as output:

  firmNo  prc  stringvar   C  misc  prc
0    rrr    0          6   9     1    6
1    rrr    0          6  10     2    6
2    rrr    0          6  11     3    6
3    rrr    0          6  12     4    6

@cpcloud
Member

cpcloud commented May 15, 2013

@rhstanton It's helpful if you can put your output in "```" so that it prints in a monospaced font. It's easier on the eyes. :) I can reproduce this on git master. What is the expected output?

@cpcloud
Member

cpcloud commented May 15, 2013

Ah. looks like there's a sorting problem here...

@rhstanton
Contributor Author

I agree it looks terrible! Does the output go in quotes in my notebook or when I upload to github? If you could give me a quick example of how to do this, I'd be more than happy to help others' eyesight in future.

@cpcloud
Member

cpcloud commented May 15, 2013

Surround anything you want in monospaced type with backquotes, i.e., the ` character. For example, `x = 1`. When you get a chance to get on GitHub (not replying from your email) you should click on the GitHub Flavored Markdown link. It's full of useful info.

@jreback
Contributor

jreback commented May 18, 2013

@cpcloud any luck with this?

@cpcloud
Member

cpcloud commented May 18, 2013

Nah not yet, but I haven't given it more than a cursory glance. I will look
into it this weekend.

@cpcloud
Member

cpcloud commented May 19, 2013

@jreback @rhstanton What is expected output? 2 prc columns with the same values? one of the merge behaviors?

Is this the expected output? (I will assume it is since it is the least magical thing concat could do here: just basically "join" [not in the database sense] the two frames together along the requested axis.)

   firmNo  prc stringvar   C  misc  prc
0       0    6       rrr   9     1    6
1       0    6       rrr  10     2    6
2       0    6       rrr  11     3    6
3       0    6       rrr  12     4    6

(heh i will try to parse this with read_html later...)
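
As a rough illustration of the "least magical" behaviour described above, here is a minimal sketch. The frames `left` and `right` are made up for this example (they are not from the report), and their `prc` columns deliberately differ so the two can be told apart. It assumes a pandas version in which this issue is already fixed.

```python
import pandas as pd

# Illustrative frames only; 'prc' differs between them on purpose.
left = pd.DataFrame({"firmNo": [0, 0], "prc": [6, 6], "stringvar": ["rrr", "rrr"]})
right = pd.DataFrame({"C": [9, 10], "misc": [1, 2], "prc": [7, 8]})

# Side-by-side concatenation: no merging or de-duplication of column labels.
result = pd.concat([left, right], axis=1)

# Both 'prc' columns should survive, each keeping its own values.
print(list(result.columns))

# Selecting a duplicated label returns a DataFrame with one column per match.
print(result["prc"])

# Positional access is unambiguous even when labels repeat.
print(result.iloc[:, 1].tolist(), result.iloc[:, 5].tolist())
```

Positional access via `iloc` is used here because a duplicated label cannot identify a single column on its own.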

@rhstanton
Contributor Author

Yes, I’d expect the output you show below, just with the right column headings (2 prc columns with the same values, but only because they were passed in with the same values. If they’d had different values in df1 and df2, I’d expect two prc columns with different contents).

Best,

Richard

On Saturday, May 18, 2013 6:42 PM, Phillip Cloud wrote:

@jreback @rhstanton What is expected output? 2 prc columns with the same values? one of the merge behaviors? if u do

# use dfs from above
df = concat([df1, df2], axis=1, ignore_index=True)
print df

   0  1    2   3  4  5
0  0  6  rrr   9  1  6
1  0  6  rrr  10  2  6
2  0  6  rrr  11  3  6
3  0  6  rrr  12  4  6

Is this what u want except with the original column indices?
(heh i will try to parse this with read_html later...)

@cpcloud
Member

cpcloud commented May 19, 2013

@jreback This is a strange beast. I went all the way into NDFrame.__init__ and back up to concat. The values attrs test the same, so it's prolly a repr bug now

@cpcloud
Member

cpcloud commented May 19, 2013

nvm something else...

@cpcloud
Member

cpcloud commented May 19, 2013

@jreback AH HA! The bug is that the _ref_locs attribute of BlockManager is not set when you concat the two dfs and keep track of the index. When you ignore the index and then set the columns, there is already an ordering (_ref_locs is already set), so you are good. The question remains, though, how you want to deal with this... seems like we might want to raise when trying to concat in this situation. Right now the exception thrown by get_indexer is caught and the assumption is made that the ordering is 0..n - 1, where n is the number of blocks, but that doesn't seem totally consistent. Not sure what the optimal approach here is.

@cpcloud
Member

cpcloud commented May 19, 2013

828f9f9 fixed the series version of this by just assigning the columns after the concat. Is that the correct fix here? I don't think so; maybe we can ignore the index if ignore_index is False and there are dup cols and axis is 1
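
For readers hitting this on an affected version, the "assign the columns after the concat" idea can be sketched at the user level roughly as follows. This is only an illustration of the workaround being discussed, not the change made in 828f9f9 or in the eventual fix. The frames are written with an explicit column order so the result does not depend on how a given pandas version orders dict keys.

```python
import pandas as pd

df1 = pd.DataFrame({"firmNo": [0, 0, 0, 0], "prc": [6, 6, 6, 6], "stringvar": ["rrr"] * 4})
df2 = pd.DataFrame({"C": [9, 10, 11, 12], "misc": [1, 2, 3, 4], "prc": [6, 6, 6, 6]})

# Step 1: concatenate while discarding the column labels (they become 0..5),
# so no duplicate labels are involved during the concat itself.
out = pd.concat([df1, df2], axis=1, ignore_index=True)

# Step 2: put the original labels back, duplicates included.
out.columns = list(df1.columns) + list(df2.columns)

print(out)
```

The trade-off is that the caller has to rebuild the labels by hand, which is why the thread goes on to look for a fix inside concat itself.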

@rhstanton
Contributor Author

Looks like that would work given the results of your earlier concat without column names.


@cpcloud
Member

cpcloud commented May 19, 2013

u might be right, need to see how to do this in a sane way...

@cpcloud
Member

cpcloud commented May 19, 2013

i wonder if an __eq__ on block and blkmgr might help things like this in the future. could compare items, values and ref_locs

@jreback
Contributor

jreback commented May 19, 2013

let me take a look

@jreback
Contributor

jreback commented May 19, 2013

should be fixed by #3647, once I figured out what was going on, fix was trivial

@cpcloud you were basically right: the newly created block has a non-unique index, so the block manager tries to create _ref_locs on each block, but this is wrong because it doesn't have an indexer map for the axes -> block locations (but of course we have one when we are creating the blocks in the first place, so just set it there)

this worked in <= 0.11, but not in master because of the changes in non-unique indexes

non-unique are a bit of an animal!
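
To make the "indexer map for the axes -> block locations" idea more concrete, here is a toy model in plain Python. It is a hypothetical sketch, not pandas' actual BlockManager: the function `build_ref_locs` and its return shape are invented for illustration, and the only assumption carried over from the thread is that columns are grouped into per-dtype blocks and that duplicate labels make label-based lookup ambiguous.

```python
from collections import defaultdict


def build_ref_locs(column_dtypes):
    """Toy stand-in for a block manager (hypothetical, not pandas code).

    Groups column positions into per-dtype 'blocks' and records, for every
    column position, which block holds it and at which offset in that block.
    """
    blocks = defaultdict(list)  # dtype -> column positions stored in that block
    ref_locs = []               # column position -> (block dtype, offset in block)
    for pos, dtype in enumerate(column_dtypes):
        ref_locs.append((dtype, len(blocks[dtype])))
        blocks[dtype].append(pos)
    return dict(blocks), ref_locs


labels = ["firmNo", "prc", "stringvar", "C", "misc", "prc"]
dtypes = ["int64", "int64", "object", "int64", "int64", "int64"]

# The positional map is known while the blocks are being built.
blocks, ref_locs = build_ref_locs(dtypes)

# Trying to recover the same map from labels alone loses information:
# the second 'prc' overwrites the first.
by_label = dict(zip(labels, ref_locs))

print(ref_locs)
print(len(labels), len(by_label))
```

In this model the map has to be keyed by position, which mirrors the point above: it is cheapest to record it at the moment the blocks are created.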

@jreback
Contributor

jreback commented May 19, 2013

In [3]: df1 = DataFrame({'firmNo' : [0,0,0,0], 'stringvar' : ['rrr', 'rrr', 'rrr', 'rrr'], 'prc' : [6,6,6,6] })

In [4]: df2 = DataFrame({'misc' : [1,2,3,4], 'prc' : [6,6,6,6], 'C' : [9,10,11,12]})

In [5]: df1
Out[5]: 
   firmNo  prc stringvar
0       0    6       rrr
1       0    6       rrr
2       0    6       rrr
3       0    6       rrr

In [6]: df2
Out[6]: 
    C  misc  prc
0   9     1    6
1  10     2    6
2  11     3    6
3  12     4    6

In [7]: pd.concat([df1,df2],axis=1)
Out[7]: 
   firmNo  prc stringvar   C  misc  prc
0       0    6       rrr   9     1    6
1       0    6       rrr  10     2    6
2       0    6       rrr  11     3    6
3       0    6       rrr  12     4    6

In [8]: pd.concat([df1,df2],axis=1).dtypes
Out[8]: 
firmNo        int64
prc           int64
stringvar    object
C             int64
misc          int64
prc           int64
dtype: object
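
A check along the following lines could guard against the values drifting under the wrong headers again. It is a sketch written for this thread, not the test added in #3647; it compares each column of the result, by position, with the column it should have come from.

```python
import pandas as pd

df1 = pd.DataFrame({"firmNo": [0, 0, 0, 0], "prc": [6, 6, 6, 6], "stringvar": ["rrr"] * 4})
df2 = pd.DataFrame({"C": [9, 10, 11, 12], "misc": [1, 2, 3, 4], "prc": [6, 6, 6, 6]})

result = pd.concat([df1, df2], axis=1)

# Expected source column for every position of the result, in order.
sources = [df1.iloc[:, i] for i in range(df1.shape[1])]
sources += [df2.iloc[:, i] for i in range(df2.shape[1])]

# Positions whose values or dtype differ from the source column.
mismatches = [
    pos for pos, src in enumerate(sources)
    if not result.iloc[:, pos].equals(src)
]
print(mismatches)
```

An empty list means every column kept both its values and its dtype. With the buggy output at the top of this issue, the first three positions would be expected to show up as mismatches.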

@rhstanton
Contributor Author

That looks a lot better. Thanks.


@cpcloud
Member

cpcloud commented May 19, 2013

This is the part that I missed: "but of course have one when we are creating the blocks in the first place, so just set it there" arg :) @jreback thanks.
