BUG: groupby.first/last with nans #8427

jreback · 2014-09-30T15:41:38Z

This is incorrect, as this is applied column by column (as they are different dtypes)
so _first_compat should first compute the mask then use it.

from SO

In [18]: df = pd.DataFrame([{'id':"a",'val':np.nan, 'val2':-1},{'id':"a",'val':'TREE','val2':15}])

In [19]: df
Out[19]: 
  id   val  val2
0  a   NaN    -1
1  a  TREE    15

In [20]: df.groupby('id').first()
Out[20]: 
     val  val2
id            
a   TREE    -1


behzadnouri · 2014-10-01T11:17:24Z

this is designed behavior, and does not depend on types:

>>> df
   jim  joe  jolie
0    0    1    NaN
1    0  NaN      2
>>> df.dtypes
jim      float64
joe      float64
jolie    float64
dtype: object
>>> df._data.nblocks
1
>>> df.groupby('jim').first()
     joe  jolie
jim            
0      1      2

jreback · 2014-10-01T11:42:36Z

I think it's incorrect for the frame case (series case ok)

jorisvandenbossche · 2014-10-01T12:28:37Z

Related: #6732

jreback · 2014-10-01T12:36:46Z

I think this is confusing at the least. So you have multiple values on the first row, some of which are NaN. I guess .head(1) is the 'correct' answer here. And .first() does its thing. I think maybe this should NOT be the default behavior.
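To make the contrast concrete, here is a minimal sketch using the frame from the original report. It only relies on `groupby(...).first()` and `groupby(...).head(1)`; the variable names `col_wise` and `row_wise` are illustrative and not from the thread.

```python
import numpy as np
import pandas as pd

# Same data as in the original report
df = pd.DataFrame([
    {"id": "a", "val": np.nan, "val2": -1},
    {"id": "a", "val": "TREE", "val2": 15},
])

# first(): picks the first non-null value in each column, per group,
# so the values in the result can come from different rows
col_wise = df.groupby("id").first()

# head(1): returns the first row of each group unchanged (NaN included),
# keeping the original row index and the grouping column
row_wise = df.groupby("id").head(1)

print(col_wise)
print(row_wise)
```

Note that `head(1)` acts as a filter rather than a reduction, so its result is indexed by the original row labels rather than by `id`.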

TomAugspurger · 2018-07-06T21:44:50Z

So the documented behavior of Groupby.first / Groupby.last is the first / last non-NA value? Anything to do here?

WillAyd · 2018-07-07T14:02:25Z

There's some inconsistency between first and nth(0) that needs to be addressed here:

In [10]: df.groupby('id').nth(0)
Out[10]: 
    val  val2
id           
a   NaN    -1

rhshadrach · 2021-04-16T03:31:40Z

@WillAyd - first is designed to work column by column whereas nth selects a single row. So it seems to me this inconsistency is purposeful (but agree it's confusing).

To me an ideal solution would allow a user to:

- get the nth value, either selecting a single row or column-by-column (note: this is different from using axis=1 in the groupby(...) call)
- use first/last as shortcuts for nth(0) and nth(-1), as these are very common, again either selecting a single row or column-by-column

If we had this, then it'd be best to align nth/first so the default behavior of nth(0) is the same as first, and similarly with last.
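As a rough illustration of what such an option could look like from the user's side, here is a sketch of a standalone helper. The function `nth_per_group` and its `columnwise` parameter are hypothetical and are not part of pandas; the sketch only uses existing calls (`groupby`, `agg`, `cumcount`, `set_index`) and assumes `by` is a single column name.

```python
import numpy as np
import pandas as pd


def nth_per_group(df, by, n, columnwise=True):
    """Hypothetical helper (not a pandas API): nth entry per group.

    columnwise=True  -> nth non-null value of each column (first()-like for n=0)
    columnwise=False -> nth row of each group, NaN included
    """
    if columnwise:
        def pick(s):
            s = s.dropna()
            # Return NaN when the group has fewer than the required non-null values
            return s.iloc[n] if -len(s) <= n < len(s) else np.nan
        return df.groupby(by).agg(pick)

    grouped = df.groupby(by)
    if n >= 0:
        mask = grouped.cumcount() == n
    else:
        # Count from the end of each group for negative n
        mask = grouped.cumcount(ascending=False) == -n - 1
    return df[mask].set_index(by)


df = pd.DataFrame([
    {"id": "a", "val": np.nan, "val2": -1},
    {"id": "a", "val": "TREE", "val2": 15},
])

cw = nth_per_group(df, "id", 0)                     # column-by-column
rw = nth_per_group(df, "id", 0, columnwise=False)   # single row
last_row = nth_per_group(df, "id", -1, columnwise=False)
```

The column-wise branch goes through a Python-level `agg`, so it would be much slower than the built-in `first`/`last`; it is only meant to show the two semantics side by side.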

WillAyd · 2021-04-16T19:43:01Z

I think your second bullet feels right, and might have the added bonus of simplifying our code paths

NumberPiOso · 2022-02-21T16:41:21Z

take

NumberPiOso · 2022-03-01T22:18:16Z

@rhshadrach, I understand that modifying the first and last methods to point to nth would be the way to go. This would produce the following behaviour:

In [10]: df.groupby('id').nth(0)
Out[10]: 
    val  val2
id           
a   NaN    -1

In [10]: df.groupby('id').first()
Out[10]: 
    val  val2
id           
a   NaN    -1

Then, we would include another parameter to decide if it is the nth value by row or column-by-column.

rhshadrach · 2022-03-02T22:19:53Z

@NumberPiOso - Certainly we should include the parameter first if going that route before changing the default behavior of first/last. However, I do not feel certain that is the correct approach. I would have to guess that first/last are common operations, at least more so than nth. Is changing the default of first to mean "first row" more useful than "first value in each column"? That is very unclear to me. It's at this point that I wonder if changing the behavior here is more disruptive than helpful.

As a tangential aside, I've always found treating nans differently in first/last/etc more surprising than beneficial, but perhaps that's just my use case. However, this is pretty consistent across pandas and I certainly don't find it severe enough to merit a change to the default behavior (or at the very least, merit me arguing for a change!).

NumberPiOso · 2022-03-08T22:54:55Z

This is true.
As a user, I tend to use first and last a lot more than nth, and each of these methods is consistent on its own. So I would not change the methods either.

jreback added Bug Groupby labels Sep 30, 2014

jreback added this to the 0.15.1 milestone Sep 30, 2014

jreback added API Design and removed Bug labels Oct 1, 2014

jreback changed the title ~~BUG: groupby.first/last multi-dtypes and a nan incorrectly selects values~~ BUG: groupby.first/last with nans Oct 1, 2014

jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015

nbonnotte mentioned this issue Jun 19, 2016

WIP: ENH: pivot/groupby index with nan #12607

Closed

36 tasks

sursu mentioned this issue Apr 14, 2018

.last() does not perform as expected #20657

Closed

datapythonista modified the milestones: Contributions Welcome, Someday Jul 8, 2018

WillAyd mentioned this issue Jul 25, 2019

groupby().first() docs should explain distinction between nth and first #27578

Open

mroeschke added Bug and removed API Design labels Apr 11, 2021

github-actions bot assigned NumberPiOso Feb 21, 2022

NumberPiOso mentioned this issue Mar 1, 2022

DOC: Clarify groupby.first does not use nulls #46195

Merged

2 tasks

jreback removed this from the Someday milestone Mar 2, 2022

jreback added this to the 1.5 milestone Mar 2, 2022

rhshadrach closed this as completed in #46195 Mar 10, 2022
