Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: groupby.first/last with nans #8427

Closed
jreback opened this issue Sep 30, 2014 · 12 comments · Fixed by #46195
Closed

BUG: groupby.first/last with nans #8427

jreback opened this issue Sep 30, 2014 · 12 comments · Fixed by #46195
Assignees
Milestone

Comments

@jreback
Copy link
Contributor

jreback commented Sep 30, 2014

This is incorrect, as this is applied column by column (as they are different dtypes)
so _first_compat should first compute the mask then use it.

from SO

In [18]: df = pd.DataFrame([{'id':"a",'val':np.nan, 'val2':-1},{'id':"a",'val':'TREE','val2':15}])

In [19]: df
Out[19]: 
  id   val  val2
0  a   NaN    -1
1  a  TREE    15

In [20]: df.groupby('id').first()
Out[20]: 
     val  val2
id            
a   TREE    -1
@jreback jreback added this to the 0.15.1 milestone Sep 30, 2014
@behzadnouri
Copy link
Contributor

this is designed behavior, and does not depend on types:

>>> df
   jim  joe  jolie
0    0    1    NaN
1    0  NaN      2
>>> df.dtypes
jim      float64
joe      float64
jolie    float64
dtype: object
>>> df._data.nblocks
1
>>> df.groupby('jim').first()
     joe  jolie
jim            
0      1      2

@jreback
Copy link
Contributor Author

jreback commented Oct 1, 2014

I think it's incorrect for the frame case (series case ok)

@jorisvandenbossche
Copy link
Member

Related: #6732

@jreback
Copy link
Contributor Author

jreback commented Oct 1, 2014

I think this is confusing at the least. So you have multiple values on the first row, some of which are NaN. I guess .head(1) is the 'correct' answer here. And .first() does its thing. I think maybe this should NOT be the default behavior.

@jreback jreback added API Design and removed Bug labels Oct 1, 2014
@jreback jreback changed the title BUG: groupby.first/last multi-dtypes and a nan incorrectly selects values BUG: groupby.first/last with nans Oct 1, 2014
@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@TomAugspurger
Copy link
Contributor

So the documented behavior of Groupby.first / Groupby.last is the first / last non-NA value? Anything to do here?

@WillAyd
Copy link
Member

WillAyd commented Jul 7, 2018

There's some inconsistency between first and nth(0) that needs to be addressed here:

In [10]: df.groupby('id').nth(0)
Out[10]: 
    val  val2
id           
a   NaN    -1

@rhshadrach
Copy link
Member

@WillAyd - first is designed to work column by column whereas nth selects a single row. So it seems to me this inconsistency is purposeful (but agree it's confusing).

To me an ideal solution would allow a user to

  • get the nth value, either selecting a single row or column-by-column (note: this is different than using axis=1 in the grouby(...) call)
  • first/last as a shortcut for nth(0) and nth(-1) as these are very common, again by either selecting a single row or column-by-column

If we had this, then it'd be best to align nth/first so the default behavior of nth(0) is the same as first, and similarly with last.

@WillAyd
Copy link
Member

WillAyd commented Apr 16, 2021

I think your second bullet feels right, and might have the added bonus of simplifying our code paths

@NumberPiOso
Copy link
Contributor

take

@NumberPiOso
Copy link
Contributor

@rhshadrach , I understand that modifying first and last methods to point to nth would be the way to go. This would cause this behaviour.

In [10]: df.groupby('id').nth(0)
Out[10]: 
    val  val2
id           
a   NaN    -1

In [10]: df.groupby('id').first()
Out[10]: 
    val  val2
id           
a   NaN    -1

Then, we would include another parameter to decide if it is the nth value by row or column-by-column.

@jreback jreback removed this from the Someday milestone Mar 2, 2022
@jreback jreback added this to the 1.5 milestone Mar 2, 2022
@rhshadrach
Copy link
Member

@NumberPiOso - Certainly we should include the parameter first if going that route before changing the default behavior of first/last. However, I do not feel certain that is the correct approach. I would have to guess that first/last are common operations, at least more so than nth. Is changing the default of first to mean "first row" more useful than "first value in each column"? That is very unclear to me. It's at this point that I wonder if changing the behavior here is more disruptive than helpful.

As a tangential aside, I've always found treating nans differently in first/last/etc more surprising than beneficial, but perhaps that's just my use case. However this is pretty consistent across pandas and I certainly don't find it sever enough to merit a change to the default behavior (or at the very least, merit me arguing for a change!).

@NumberPiOso
Copy link
Contributor

This is true.
As a user, I tend to use first and last a lot more than nth and by themselves each one of this methods are consistent. So I would not change the methods either.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants