Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: sorting with large float and multiple columns incorrect #14944

Closed
wants to merge 8 commits into from

Conversation

uweschmitt
Copy link
Contributor

@uweschmitt uweschmitt commented Dec 21, 2016

Fixes #14922

Having the int equivalent of NaT in an int64 column caused wrong sorting because this special value was considered as "missing value".

  • closes #xxxx
  • tests added / passed
  • passes git diff upstream/master | flake8 --diff
  • whatsnew entry

columns=("int", "float"))

df2 = df.sort_values(["int", "float"])
assert is_sorted(df2.loc[:, "int"])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use assert_frame_equal and construct the expected result.

@jreback jreback changed the title Fix GH 14922 BUG: sorting with large float and multiple columns incorrect Dec 21, 2016
@jreback jreback added Bug Numeric Operations Arithmetic, Comparison, and Logical operations labels Dec 21, 2016
    having the int equivalent of NaT in an int64 column caused wrong
    sorting because this special value was considered as "missing
    value".
    having the int equivalent of NaT in an int64 column caused wrong
    sorting because this special value was considered as "missing
    value".
@uweschmitt
Copy link
Contributor Author

@jreback please review 03699c6, I used rebase and now the changes do not show up in github as expected.

@codecov-io
Copy link

codecov-io commented Dec 21, 2016

Current coverage is 84.65% (diff: 100%)

Merging #14944 into master will increase coverage by <.01%

@@             master     #14944   diff @@
==========================================
  Files           144        144          
  Lines         51021      51031    +10   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43188      43199    +11   
+ Misses         7833       7832     -1   
  Partials          0          0          

Powered by Codecov. Last update f79bc7a...c244438


from pandas.util.testing import (assert_series_equal,
assert_frame_equal,
assertRaisesRegexp)
assertRaisesRegexp,
is_sorted)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need to import the is_sorted

@jreback
Copy link
Contributor

jreback commented Dec 21, 2016

can you add a similar tests which uses datetimes and a NaT as well (just to verify that it doesn't break).

further pls check both ascending and descending.

def test_sort_nat_values_in_int_column(self):

# GH 14922, sorting with large float and multiple columns incorrect
int_values = (2, int(NaT))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use np.iinfo(np.int64).min instead (and NaT) is already an int

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thought NaT is implicitly indicating what went wrong...

@jreback jreback added this to the 0.20.0 milestone Dec 22, 2016
@jreback
Copy link
Contributor

jreback commented Dec 22, 2016

can you add a whatsnew note (0.20.0, bug fixes)


# and now check if NaT is still considered as "na" for datetime64
# columns:
df = DataFrame(dict(int=int_values, float=float_values),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a not-used line

df = DataFrame(dict(datetime=[Timestamp("2016-01-01"), NaT],
float=float_values), columns=["datetime", "float"])

# check if the dtype is datetime64[ns]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove this line

df = DataFrame(dict(datetime=[Timestamp("2016-01-01"), NaT],
float=float_values), columns=["datetime", "float"])

assert df["datetime"].dtypes == np.dtype("datetime64[ns]"),\
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this line here is not needed (the assert of datetime)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@uweschmitt
Copy link
Contributor Author

Ok now ?

@jreback
Copy link
Contributor

jreback commented Dec 23, 2016

thanks!

@jreback jreback closed this in 6bea827 Dec 23, 2016
ShaharBental pushed a commit to ShaharBental/pandas that referenced this pull request Dec 26, 2016
closes pandas-dev#14922

Having the `int` equivalent of `NaT` in an `int64` column caused wrong
sorting because this special value was considered as "missing value".

Author: Uwe <uwe.schmitt@id.ethz.ch>

Closes pandas-dev#14944 from uweschmitt/fix-gh-14922 and squashes the following commits:

c244438 [Uwe] further cleanup tests
4f28026 [Uwe] fixed typo in whatsnew/v0.20.0.txt
60cca5d [Uwe] add fix of GH14922 to release notes for 0.20.0
04dcbe8 [Uwe] further test cleanup
21e610c [Uwe] extended tests + minor cleanup
358a31e [Uwe] Merge branch 'fix-pandas-devgh-14922' of github.com:uweschmitt/pandas into fix-pandas-devgh-14922
03699c6 [Uwe] Fix GH 14922
1afdbb8 [Uwe] Fix GH 14922
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Numeric Operations Arithmetic, Comparison, and Logical operations
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: sorting with large float and multiple columns incorrect
3 participants