concat produces incorrect output #3602

rhstanton · 2013-05-14T15:28:02Z

Under certain circumstances, concat seems to produce erroneous results. I haven't worked out what causes the problems to arise, but here's an example:

df1 = DataFrame({'firmNo' : [0,0,0,0], 'stringvar' : ['rrr', 'rrr', 'rrr', 'rrr'], 'prc' : [6,6,6,6] })
df2 = DataFrame({'misc' : [1,2,3,4], 'prc' : [6,6,6,6], 'C' : [9,10,11,12]})
concat([df1,df2],axis=1)

produces as output:

   firmNo  prc  stringvar   C  misc  prc
0     rrr    0          6   9     1    6
1     rrr    0          6  10     2    6
2     rrr    0          6  11     3    6
3     rrr    0          6  12     4    6
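
For reference, the report above can be turned into a self-contained check along the following lines. This is a sketch, assuming a recent pandas imported as `pd`; the helper name `columns_match_sources` is made up for illustration. It avoids hard-coding a column order (which depends on the pandas version) and instead compares each output column, by position, with the input column it came from.

```python
import pandas as pd

df1 = pd.DataFrame({'firmNo': [0, 0, 0, 0],
                    'stringvar': ['rrr', 'rrr', 'rrr', 'rrr'],
                    'prc': [6, 6, 6, 6]})
df2 = pd.DataFrame({'misc': [1, 2, 3, 4],
                    'prc': [6, 6, 6, 6],
                    'C': [9, 10, 11, 12]})

result = pd.concat([df1, df2], axis=1)


def columns_match_sources(result, frames):
    """Return True if every output column, taken by position, carries the
    label and the values of the corresponding input column."""
    pos = 0
    for frame in frames:
        for name in frame.columns:
            same_label = result.columns[pos] == name
            same_values = result.iloc[:, pos].tolist() == frame[name].tolist()
            if not (same_label and same_values):
                return False
            pos += 1
    # Every output column must have been accounted for.
    return pos == result.shape[1]


print(columns_match_sources(result, [df1, df2]))
```

In the output reported above, labels and values are out of step (for example `rrr` appears under `firmNo`), which is the kind of mismatch this check is meant to catch.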

cpcloud · 2013-05-15T22:42:21Z

@rhstanton It's helpful if you can put your output in backquotes ("`") so that it prints in a monospaced font. It's easier on the eyes. :) I can reproduce this on git master. What is the expected output?

cpcloud · 2013-05-15T22:53:46Z

Ah. looks like there's a sorting problem here...

rhstanton · 2013-05-15T22:57:00Z

I agree it looks terrible! Does the output go in quotes in my notebook or when I upload to github? If you could give me a quick example of how to do this, I'd be more than happy to help others' eyesight in future.

cpcloud · 2013-05-15T22:59:36Z

Surround anything you want in monospaced type with backquotes, i.e., the ` character. For example, `x = 1`. When you get a chance to get on GitHub (not replying from ur email) you should click on the GitHub Flavored Markdown link. It's full of useful info.

jreback · 2013-05-18T01:02:19Z

@cpcloud any luck with this?

cpcloud · 2013-05-18T02:36:57Z

Nah not yet, but I haven't given it more than a cursory glance. I will look into it this weekend.

cpcloud · 2013-05-19T01:41:48Z

@jreback @rhstanton What is expected output? 2 prc columns with the same values? one of the merge behaviors?

Is this the expected output? (I will assume it is since it is the least magical thing concat could do here: just basically "join" [not in the database sense] the two frames together along the requested axis.)

	prc	stringvar	C	misc	prc
0	6	rrr	9	1	6
1	6	rrr	10	2	6
2	6	rrr	11	3	6
3	6	rrr	12	4	6

(heh i will try to parse this with read_html later...)

rhstanton · 2013-05-19T03:28:21Z

Yes, I’d expect the output you show below, just with the right column headings (2 prc columns with the same values, but only because they were passed in with the same values. If they’d had different values in df1 and df2, I’d expect two prc columns with different contents).

Best,

Richard

On Saturday, May 18, 2013 6:42 PM, Phillip Cloud wrote:

@jreback @rhstanton What is expected output? 2 prc columns with the same values? one of the merge behaviors? if u do

    # use dfs from above
    df = concat([df1, df2], axis=1, ignore_index=True)
    print df

       0  1    2   3  4  5
    0  0  6  rrr   9  1  6
    1  0  6  rrr  10  2  6
    2  0  6  rrr  11  3  6
    3  0  6  rrr  12  4  6

Is this what u want except with the original column indices?
(heh i will try to parse this with read_html later...)
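
The expectation stated above (two `prc` columns whose contents are independent) can be illustrated with frames whose `prc` values differ. This is a sketch, assuming a recent pandas imported as `pd`; the frames `left` and `right` and their values are invented for the example and are not from the report.

```python
import pandas as pd

left = pd.DataFrame({'firmNo': [0, 0, 0, 0], 'prc': [6, 6, 6, 6]})
right = pd.DataFrame({'misc': [1, 2, 3, 4], 'prc': [7, 8, 9, 10]})

combined = pd.concat([left, right], axis=1)

# Selecting a duplicated label returns a DataFrame with one column per match.
both_prc = combined['prc']

first_prc = both_prc.iloc[:, 0].tolist()   # values that came from `left`
second_prc = both_prc.iloc[:, 1].tolist()  # values that came from `right`

print(both_prc.shape)
print(first_prc, second_prc)
```

Selecting by position with `iloc` is what keeps the two columns distinguishable once their labels are identical.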

cpcloud · 2013-05-19T04:36:35Z

@jreback This is a strange beast. i went all the way into ndframe.__init__ and back up to concat. the values attrs test the same, so it's prolly a repr bug now

cpcloud · 2013-05-19T05:02:42Z

nvm something else...

cpcloud · 2013-05-19T05:44:58Z

@jreback AH HA! The bug is that the _ref_locs attribute of BlockManager is not set when u concat the two dfs and keep track of the index. When u ignore the index and then set the columns, there is already an ordering (_ref_locs is already set), so u r good. The question remains tho how u want to deal with this... seems like we might want to raise when trying to concat in this situation. Right now the exception thrown by get_indexer is caught and the assumption is made that the ordering is 0..n - 1, where n is the number of blocks, but that doesn't seem totally consistent. Not sure what the optimal approach is here.
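
Why `get_indexer` throws in this situation can be seen on a plain `Index` with a repeated label. The sketch below uses only the public `Index` API of a recent pandas; it does not touch `BlockManager` or `_ref_locs`, whose internals have changed since this thread, and the variable names are illustrative.

```python
import pandas as pd

# The column labels of the concatenated frame, with 'prc' appearing twice.
columns = pd.Index(['firmNo', 'prc', 'stringvar', 'C', 'misc', 'prc'])

# With a repeated label there is no single position for 'prc'.
print(columns.is_unique)

try:
    columns.get_indexer(['prc'])
    raised = False
except Exception as exc:  # the exact exception class varies across versions
    raised = True
    print(type(exc).__name__)

# The non-unique variant returns every matching position instead of raising.
positions, missing = columns.get_indexer_non_unique(['prc'])
print(positions.tolist(), missing.tolist())
```

A label-to-position lookup is ambiguous once labels repeat, so any code that swallows the exception and falls back to a default ordering is making a guess about placement.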

cpcloud · 2013-05-19T06:00:50Z

828f9f9 fixed the series version of this by just assigning the columns after the concat. is that the correct fix here? don't think so, maybe can ignore index if ignore_index is false and there are dup cols and axis is 1
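
From the user's side, the "assign the columns after the concat" idea can be sketched as a two-step workaround. This assumes a recent pandas imported as `pd` and is not the change made in 828f9f9 or in the eventual fix; it only shows the same idea at the DataFrame level.

```python
import pandas as pd

df1 = pd.DataFrame({'firmNo': [0, 0, 0, 0],
                    'stringvar': ['rrr', 'rrr', 'rrr', 'rrr'],
                    'prc': [6, 6, 6, 6]})
df2 = pd.DataFrame({'misc': [1, 2, 3, 4],
                    'prc': [6, 6, 6, 6],
                    'C': [9, 10, 11, 12]})

# Step 1: concatenate positionally, discarding the column labels.
result = pd.concat([df1, df2], axis=1, ignore_index=True)
positional_labels = list(result.columns)  # integer labels at this point

# Step 2: put the original labels back, duplicates included.
result.columns = list(df1.columns) + list(df2.columns)

print(positional_labels)
print(list(result.columns))
```

Because the data is placed before any duplicated label exists, the ordering is fixed by position and the labels are attached afterwards.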

rhstanton · 2013-05-19T06:29:20Z

Looks like that would work given the results of your earlier concat without column names.

cpcloud · 2013-05-19T06:32:17Z

u might be right, need to see how to do this in a sane way...

cpcloud · 2013-05-19T06:33:40Z

i wonder if an __eq__ on block and blkmgr might help things like this in the future. could compare items, values and ref_locs
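
The idea of an equality check on blocks can be sketched with a toy class. Nothing here is pandas code: `ToyBlock` and its three fields are hypothetical stand-ins for the items, values and ref_locs mentioned above, and plain lists replace the arrays a real block would hold.

```python
class ToyBlock:
    """Hypothetical stand-in for a block: labels, row data, column positions."""

    def __init__(self, items, values, ref_locs):
        self.items = list(items)
        self.values = [list(row) for row in values]
        self.ref_locs = list(ref_locs)

    def __eq__(self, other):
        if not isinstance(other, ToyBlock):
            return NotImplemented
        return (self.items == other.items
                and self.values == other.values
                and self.ref_locs == other.ref_locs)


data = [[6, 6, 6, 6], [6, 6, 6, 6]]

# a and b agree on everything; c holds the same labels and data but claims
# different positions in the parent frame.
a = ToyBlock(['prc', 'prc'], data, ref_locs=[1, 5])
b = ToyBlock(['prc', 'prc'], data, ref_locs=[1, 5])
c = ToyBlock(['prc', 'prc'], data, ref_locs=[0, 1])

print(a == b, a == c)
```

Including the positions in the comparison means two blocks with identical data but different placement compare unequal, which is the kind of discrepancy that was hard to spot while debugging this issue.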

jreback · 2013-05-19T10:52:14Z

let me take a look

jreback · 2013-05-19T14:09:56Z

should be fixed by #3647; once I figured out what was going on, the fix was trivial

@cpcloud you were basically right, the newly created block has a non-unique index, so the block manager tries to create _ref_locs on each block, but this is wrong because it doesn't have an indexer map for the axes -> block locations (but of course have one when we are creating the blocks in the first place, so just set it there)

this worked in <= 0.11, but not in master because of the changes in non-unique indexes

non-unique are a bit of an animal!

jreback · 2013-05-19T14:11:07Z

In [3]: df1 = DataFrame({'firmNo' : [0,0,0,0], 'stringvar' : ['rrr', 'rrr', 'rrr', 'rrr'], 'prc' : [6,6,6,6] })

In [4]: df2 = DataFrame({'misc' : [1,2,3,4], 'prc' : [6,6,6,6], 'C' : [9,10,11,12]})

In [5]: df1
Out[5]: 
   firmNo  prc stringvar
0       0    6       rrr
1       0    6       rrr
2       0    6       rrr
3       0    6       rrr

In [6]: df2
Out[6]: 
    C  misc  prc
0   9     1    6
1  10     2    6
2  11     3    6
3  12     4    6

In [7]: pd.concat([df1,df2],axis=1)
Out[7]: 
   firmNo  prc stringvar   C  misc  prc
0       0    6       rrr   9     1    6
1       0    6       rrr  10     2    6
2       0    6       rrr  11     3    6
3       0    6       rrr  12     4    6

In [8]: pd.concat([df1,df2],axis=1).dtypes
Out[8]: 
firmNo        int64
prc           int64
stringvar    object
C             int64
misc          int64
prc           int64
dtype: object
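
The dtype listing above suggests a second thing worth checking besides the values: each output column should keep the dtype of the input column it came from. A small sketch, assuming a recent pandas imported as `pd`; it compares against the source frames rather than hard-coding dtype names, since the dtype reported for string columns differs between pandas versions.

```python
import pandas as pd

df1 = pd.DataFrame({'firmNo': [0, 0, 0, 0],
                    'stringvar': ['rrr', 'rrr', 'rrr', 'rrr'],
                    'prc': [6, 6, 6, 6]})
df2 = pd.DataFrame({'misc': [1, 2, 3, 4],
                    'prc': [6, 6, 6, 6],
                    'C': [9, 10, 11, 12]})

result = pd.concat([df1, df2], axis=1)

# Dtypes in output order versus dtypes of the inputs laid side by side.
source_dtypes = list(df1.dtypes) + list(df2.dtypes)
result_dtypes = list(result.dtypes)
dtypes_preserved = result_dtypes == source_dtypes

print(dtypes_preserved)
```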

rhstanton · 2013-05-19T14:12:41Z

That looks a lot better. Thanks.

cpcloud · 2013-05-19T14:30:41Z

This is the part that I missed: "but of course have one when we are creating the blocks in the first place, so just set it there" arg :) @jreback thanks.

jreback mentioned this issue May 19, 2013

BUG: (GH3602) Concat to produce a non-unique columns when duplicates are across dtypes #3647

Merged

jreback closed this as completed in #3647 May 19, 2013
