Missing rows on DataFrame outer join with MultiIndex #1421

Closed
manuteleco opened this issue Jun 7, 2012 · 0 comments
@manuteleco

Hi,

I'm trying to compute an outer join on several columns across several DataFrame objects in one step. However, the result seems to force the join keys to be unique and, as a consequence, some rows are missing.

Here is some example code that shows the output from a join operation over 3 dataframes in one step and a merge operation (in 2 steps) over the same data. Comparing both, we see that the join operation doesn't include the row "1 1 10 100 1000".

from pandas import DataFrame, merge

def multiple_join():
    df1 = DataFrame({"a": [1,1], "b": [1,1], "c": [10,20]})
    df2 = DataFrame({"a": [1,1], "b": [1,2], "d": [100,200]})
    df3 = DataFrame({"a": [1,1], "b": [1,2], "e": [1000,2000]})
    df1.set_index(["a", "b"], inplace=True)
    df2.set_index(["a", "b"], inplace=True)
    df3.set_index(["a", "b"], inplace=True)
    df_joined = df1.join([df2, df3], how='outer')
    return df_joined

def cascade_merge():
    df1 = DataFrame({"a": [1,1], "b": [1,1], "c": [10,20]})
    df2 = DataFrame({"a": [1,1], "b": [1,2], "d": [100,200]})
    df3 = DataFrame({"a": [1,1], "b": [1,2], "e": [1000,2000]})
    df_partially_merged = merge(df1, df2, on=['a', 'b'], how='outer')
    df_merged = merge(df_partially_merged, df3, on=['a', 'b'], how='outer')
    return df_merged

if __name__ == "__main__":
    # print() with a single argument behaves the same on Python 2 and 3
    print(multiple_join())
    print(cascade_merge())

#         c    d     e
#   a b               
#   1 1  20  100  1000
#     2 NaN  200  2000
#
#      a  b   c    d     e
#   0  1  1  10  100  1000
#   1  1  1  20  100  1000
#   2  1  2 NaN  200  2000

However, this problem doesn't seem to arise when we specify "how={anything other than outer}" in the join operation.

So, either this is a bug or I'm missing something here. In either case, I would appreciate any comment on this issue. And, BTW, it would be really cool if "merge" could accept a list of DataFrames and join them efficiently in one step.
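
In the meantime, the two-step cascade above can be generalized to any number of frames by folding `merge` over a list. This is only a sketch: `merge_all` is a hypothetical helper name, not a pandas function, and it still performs one pairwise merge per frame rather than a true single-step join.

```python
from functools import reduce

import pandas as pd


def merge_all(frames, on, how="outer"):
    """Hypothetical helper: combine a list of DataFrames via successive pairwise merges."""
    return reduce(lambda left, right: pd.merge(left, right, on=on, how=how), frames)


df1 = pd.DataFrame({"a": [1, 1], "b": [1, 1], "c": [10, 20]})
df2 = pd.DataFrame({"a": [1, 1], "b": [1, 2], "d": [100, 200]})
df3 = pd.DataFrame({"a": [1, 1], "b": [1, 2], "e": [1000, 2000]})

# Equivalent to cascade_merge() above, but for an arbitrary list of frames.
merged = merge_all([df1, df2, df3], on=["a", "b"])
print(merged)
```

On the example data this should keep both rows for the key (1, 1), matching the cascaded merge output shown earlier.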

Thanks and regards.

@wesm closed this as completed in e11777e on Jun 11, 2012
yarikoptic added a commit to neurodebian/pandas that referenced this issue Jun 21, 2012
Version 0.8.0 beta 2

* tag 'v0.8.0b2': (37 commits)
  RLS: 0.8.0 beta 2
  BUG: bytes_to_str for read_csv
  BUG: import BytesIO for py3compat
  BUG: fix compat errors for yahoo data reader
  ENH: convert datetime.datetime ourselves, 15x speedup
  Make tox work across versions of Python from 2.5 to 3.2
  Reenable py31 and py32 in .travis.yml
  TST: test coverage
  TST: oops, delete stray line
  REF: factor out ujson extension into pandasjson for now
  TST: eliminate copies in datetime64 serialization; don't copy data in DatetimeIndex, close pandas-dev#1320
  DOC: refresh time zone docs close pandas-dev#1447
  BUG: always raise exception when concat keys aren't found in passed levels, close pandas-dev#1406
  ENH: implement passed quantile array to qcut and document that plus factors, close pandas-dev#1407
  ENH: clearer out of bounds error message in cut/qcut, close pandas-dev#1409
  ENH: allow renaming of index levels when concatenating, close pandas-dev#1419
  BUG: fix MultiIndex bugs described in pandas-dev#1401
  DOC: release notes
  BUG: implement multiple DataFrame.join / merge on non-unique indexes by multiple merges, close pandas-dev#1421
  REF: remove offset names from pandas namespace
  ...