Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REGR: Fixes first_valid_index when DataFrame or Series has duplicate row index (GH21441) #21497

Merged
merged 21 commits into from
Jun 20, 2018
Merged

Conversation

KalyanGokhale
Copy link
Contributor

@KalyanGokhale KalyanGokhale commented Jun 15, 2018

@KalyanGokhale KalyanGokhale changed the title REGR: Fixes first_valid_index when dataframe has duplicate row index (GH21441) REGR/BUG: Fixes first_valid_index when dataframe has duplicate row index (GH21441) Jun 15, 2018
@@ -612,6 +612,16 @@ def test_pct_change(self, periods, fill_method, limit, exp):
else:
tm.assert_series_equal(res, Series(exp))

@pytest.mark.parametrize("DF,idx,first_idx,last_idx", [
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some renaming suggestions for readability:

  • DF --> data (match DataFrame constructor)
  • idx --> index (match DataFrame constructor)
  • first_idx --> expected_first (follow standard expected/result unit test setup)
  • last_idx --> expected_last (follow standard expected/result unit test setup)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - done

({'A': [1, 2, 3, 4]}, ['d', 'd', 'd', 'd'], 'd', 'd')])
def test_valid_index(self, DF, idx, first_idx, last_idx):
# GH 21441
df1 = pd.DataFrame(DF, index=idx)
Copy link
Member

@jschendel jschendel Jun 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can just call this df; there's no ambiguity since there's only one frame in the test. Also DataFrame is imported, so the pd. isn't needed.


**Other Fixes**

- Bug in :meth:`first_valid_index` that raised for row index with duplicate values (:issue:`21441`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be :meth:`DataFrame.first_valid_index`

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need a separate sub-section here, just list the issue

'raised for a row index'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - updated
Have left it as :meth:first_valid_index as this issue affects both DataFrame and Series (though the example and title of the original issue points only to DataFrame)

@jschendel jschendel added Indexing Related to indexing on series/frames, not to indexes themselves Regression Functionality that used to work in a prior pandas version labels Jun 15, 2018

**Other Fixes**

- Bug in :meth:`first_valid_index` that raised for row index with duplicate values (:issue:`21441`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need a separate sub-section here, just list the issue

'raised for a row index'

if not is_valid[i]:
return None
return i
i = is_valid.values[::].argmin()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just call this idxpos, no need for i any longer

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - done

return None
return i
i = is_valid.values[::].argmin()
idxpos = i

elif how == 'last':
# Last valid value case
i = is_valid.values[::-1].argmax()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make this idxpos

@@ -649,13 +649,6 @@ def test_asfreq_fill_value(self):
expected = frame.reindex(new_index, fill_value=4.0)
assert_frame_equal(result, expected)

def test_resample_interpolate(self):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed this test as it was failing - investigated and seems that this test is from a closed PR #12974 opened for issue #12925
Not sure if this is the right call....
probably there are other tests failing which I haven't investigated yet - my sense is those might be related to this one - will check the TravisCI and other checks for it again.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Restored this test now - this test along with others were failing due to error in interpolation, which is fixed now

@@ -16,7 +16,7 @@ and bug fixes. We recommend that all users upgrade to this version.
Fixed Regressions
~~~~~~~~~~~~~~~~~

-
- Bug in :meth:`first_valid_index` raised for a row index with duplicate values (:issue:`21441`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have left it as :meth:`first_valid_index` as this issue affects both DataFrame and Series

I don't think the :meth: will correctly link to anything as written since there's no global pd.first_valid_index. You can write something like "Bug in both :meth:`Series.first_valid_index` and :meth:`DataFrame.first_valid_index` ..." if you want to be explicit that both are affected, which would link to both Series and DataFrame separately.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - Done

@KalyanGokhale KalyanGokhale changed the title REGR/BUG: Fixes first_valid_index when dataframe has duplicate row index (GH21441) REGR: Fixes first_valid_index when DataFrame or Series has duplicate row index (GH21441) Jun 17, 2018
@codecov
Copy link

codecov bot commented Jun 18, 2018

Codecov Report

Merging #21497 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #21497      +/-   ##
==========================================
- Coverage   91.92%   91.91%   -0.01%     
==========================================
  Files         153      153              
  Lines       49570    49574       +4     
==========================================
+ Hits        45566    45568       +2     
- Misses       4004     4006       +2
Flag Coverage Δ
#multiple 90.31% <100%> (-0.01%) ⬇️
#single 41.8% <0%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/generic.py 96.13% <100%> (ø) ⬆️
pandas/util/testing.py 85.75% <0%> (-0.21%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c2da06c...751046d. Read the comment docs.

return None
return self.index[len(self) - i - 1]
idx = is_valid.idxmax()
if isinstance(is_valid[idx], ABCSeries):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what are you trying to do here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - this block is supposed to check that for multiple counts of same index, at least one is not NA.

However, while testing this with following data, the expected output is not being returned

x = pd.DataFrame({'b': [1,np.NaN,3]}, index=[1,1,2])

Expected 1, returned None

I'll rework this patch and commit again - Thanks again for the question prompt, it was fallacy of assumption on my part (had not checked explicitly for NaN value among the multiple index)

Copy link
Contributor Author

@KalyanGokhale KalyanGokhale Jun 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the loop was incorrect leading to an error, not sure what I was thinking earlier :) - fixed now and committing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed - rebased and committed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still not clear on the logic here, why can't this be a mirror of the 'last' logic?

({'A': [1, 2, 3]}, [1, 1, 2], 1, 2),
({'A': [1, 2, 3]}, [1, 2, 2], 1, 2),
({'A': [1, 2, 3, 4]}, ['d', 'd', 'd', 'd'], 'd', 'd')])
def test_valid_index(self, data, index, expected_first, expected_last):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we not already have some tests for this? pls put near the others. does this duplicate existing tests at all?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we not already have some tests for this? pls put near the others.

Thanks - The only test involving first_valid_index and last_valid_index is in ./pandas/tests/frame/test_timeseries.py - and does not specifically check for duplicate
first or last index values. Would you suggest I move this test there?

@KalyanGokhale
Copy link
Contributor Author

will rebase and resolve conflict with whatsnew file and push later today

({'A': [np.nan, np.nan, 3]}, [1, 1, 2], 2, 2),
({'A': [1, np.nan, 3]}, [1, 2, 2], 1, 2)])
def test_first_last_valid(self, data, index,
expected_first, expected_last):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved tests here from pandas/tests/generic/test_generic.py - all related tests for first_valid_index and last_valid_index are co-located

return None
return self.index[len(self) - i - 1]
idx = is_valid.idxmax()
if isinstance(is_valid[idx], ABCSeries):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still not clear on the logic here, why can't this be a mirror of the 'last' logic?

@jreback jreback added this to the 0.23.2 milestone Jun 20, 2018
@jreback jreback merged commit ec20207 into pandas-dev:master Jun 20, 2018
@jreback
Copy link
Contributor

jreback commented Jun 20, 2018

thanks @KalyanGokhale

@KalyanGokhale KalyanGokhale deleted the GH21441 branch June 20, 2018 10:56
jorisvandenbossche pushed a commit that referenced this pull request Jun 29, 2018
jorisvandenbossche pushed a commit that referenced this pull request Jul 2, 2018
Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging this pull request may close these issues.

first_valid_index fails when dataframe has non-unique row index
4 participants