
BUG: Fix initialization of DataFrame from dict with NaN as key #18600 (merged, 3 commits, Apr 1, 2018)

Conversation

toobaz (Member) commented Dec 2, 2017

This does not solve the MI example in #18455, but that should be included in #18485 .
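For context, ``NaN`` keys are exceptional because ``NaN`` compares unequal to itself, while ``dict`` lookup short-circuits on object identity. A plain-Python illustration (independent of pandas):

```python
# NaN is not equal to itself, but dict lookup checks identity
# before equality, so the *same* NaN object is found while a
# *distinct* NaN object is not.
nan = float('nan')
d = {nan: [1, 2], 'a': [3, 4]}

print(nan == nan)         # False
print(nan in d)           # True (identity check succeeds)
print(float('nan') in d)  # False (a different NaN object)
```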

@@ -416,44 +416,29 @@ def _init_dict(self, data, index, columns, dtype=None):
Needs to handle a lot of exceptional cases.
"""
if columns is not None:
columns = _ensure_index(columns)
arrays = Series(data, index=columns, dtype=object)
data_names = arrays.index
Contributor:

this will be a perf issue

Member Author (toobaz):

Maybe... but right now it seems to be worse...

     [d163de70]       [f7447b3f]
-      47.9±0.3ms       43.5±0.4ms     0.91  frame_ctor.FromDicts.time_frame_ctor_nested_dict
-      31.0±0.1ms       28.1±0.3ms     0.91  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('BusinessDay', 2)
-        31.3±1ms       28.2±0.2ms     0.90  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('BDay', 2)
-      31.8±0.3ms       28.0±0.4ms     0.88  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('CustomBusinessDay', 2)
-      32.8±0.3ms       28.2±0.2ms     0.86  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('Day', 1)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

Member Author (toobaz):

There does seem to be a performance loss on very small dfs. E.g. for pd.DataFrame(data) with data = {1: [2], 3: [4], 5: [6]} I get results around 530 µs per loop before and 570 µs after. So we are talking about a ~10% gain on large dfs vs. a ~7.5% loss on small dfs.

... or I can avoid that Series and sort manually, at the cost of a bit of added complexity, probably ~10 LoCs.

Member Author (toobaz):

uhm... those asv results also seem pretty unstable:

      before           after         ratio
     [d163de70]       [f7447b3f]
+      30.7±0.2ms         41.1±3ms     1.34  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('QuarterBegin', 1)
+      29.4±0.1ms         39.1±4ms     1.33  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('CBMonthBegin', 2)
+      30.7±0.7ms         40.4±3ms     1.32  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('BDay', 2)
+      31.1±0.7ms         39.1±5ms     1.26  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('SemiMonthEnd', 2)
+      30.4±0.1ms         38.0±4ms     1.25  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('Hour', 2)
-        33.8±1ms       30.4±0.8ms     0.90  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('Micro', 1)
-      48.3±0.6ms       42.6±0.8ms     0.88  frame_ctor.FromDicts.time_frame_ctor_nested_dict
-        23.8±1ms       20.5±0.2ms     0.86  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('FY5253Quarter_1', 2)
-      41.2±0.7ms         30.4±2ms     0.74  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('CustomBusinessHour', 2)
-      8.35±0.9ms      6.05±0.01ms     0.72  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('FY5253_2', 2)
-        42.4±2ms       30.5±0.5ms     0.72  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('BMonthEnd', 2)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

I'll try to sort manually and see how it goes.
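One way to "sort manually" in the presence of a NaN key (an illustrative sketch, not necessarily what the final commit does) is to push NaN to the end with a sort key, since NaN breaks ordinary comparison-based sorting:

```python
# NaN compares False against everything, so plain sorted() gives
# an order that depends on the input arrangement. A sort key that
# flags NaN (k != k is True only for NaN) makes the order stable.
nan = float('nan')
keys = [3.0, nan, 1.0]
ordered = sorted(keys, key=lambda k: (k != k, k if k == k else 0.0))
print(ordered)  # [1.0, 3.0, nan]
```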

Contributor:

I actually doubt we have good benchmarks on this; you are measuring the same benchmark here.

We need benchmarks that construct with different dtypes.

And reducing code complexity is paramount here (though of course we don't want to sacrifice perf).

@gfyoung gfyoung added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Dtype Conversions Unexpected or buggy dtype conversions labels Dec 3, 2017
jreback (Contributor) commented Jan 21, 2018

Please rebase if you can continue on this.

pep8speaks commented Feb 4, 2018

Hello @toobaz! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on April 1, 2018 at 15:27 UTC

@toobaz toobaz force-pushed the df_init_dict_nan branch 2 times, most recently from 6f9b502 to 9a65e8a Compare February 5, 2018 07:00
codecov bot commented Feb 5, 2018

Codecov Report

Merging #18600 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18600      +/-   ##
==========================================
+ Coverage   91.84%   91.84%   +<.01%     
==========================================
  Files         152      152              
  Lines       49265    49256       -9     
==========================================
- Hits        45247    45241       -6     
+ Misses       4018     4015       -3
Flag Coverage Δ
#multiple 90.23% <100%> (ø) ⬆️
#single 41.9% <93.33%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/generic.py 95.94% <ø> (+0.04%) ⬆️
pandas/core/internals.py 95.53% <100%> (ø) ⬆️
pandas/core/series.py 93.9% <100%> (+0.11%) ⬆️
pandas/core/frame.py 97.15% <100%> (-0.02%) ⬇️
pandas/core/dtypes/cast.py 87.85% <0%> (+0.16%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@@ -591,3 +591,5 @@ Other
^^^^^

- Improved error message when attempting to use a Python keyword as an identifier in a ``numexpr`` backed query (:issue:`18221`)
- Fixed construction of a :class:`DataFrame` from a ``dict`` containing ``NaN`` as key (:issue:`18455`)
Contributor:

move to reshaping


v.fill(np.nan)
# no obvious "empty" int column
if missing.any() and not (dtype is not None and
Contributor:

use is_integer_dtype

# no obvious "empty" int column
if missing.any() and not (dtype is not None and
issubclass(dtype.type, np.integer)):
if dtype is None or np.issubdtype(dtype, np.flexible):
Contributor:

use is_object_dtype

Member Author (toobaz):

It is not equivalent:

In [2]: a = np.array('abc'.split())

In [3]: pd.core.dtypes.common.is_object_dtype(a.dtype)
Out[3]: False

In [4]: np.issubdtype(a.dtype, np.flexible)
Out[4]: True
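The same check as a runnable snippet; ``np.flexible`` covers numpy's fixed-width string/bytes/void dtypes, which are distinct from ``object`` dtype:

```python
import numpy as np

# A fixed-width unicode dtype ('<U1') is a subclass of np.flexible
# but is not object dtype, so the two checks are not equivalent.
a = np.array('abc'.split())
print(np.issubdtype(a.dtype, np.flexible))  # True
print(a.dtype == np.dtype(object))          # False
```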

data_names.append(k)
arrays.append(v)
nan_dtype = dtype
v = np.empty(len(index), dtype=nan_dtype)
Contributor:

use construct_1d_arraylike_from_scalar
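For reference, a minimal numpy-only sketch of what such a helper does for plain dtypes (assumption: the actual pandas helper also handles extension and datetime-like dtypes, which this sketch omits):

```python
import numpy as np

def fill_1d(value, length, dtype):
    # Allocate a 1-D array of the requested length and fill it
    # with the scalar; here, a NaN-filled float64 column.
    arr = np.empty(length, dtype=dtype)
    arr.fill(value)
    return arr

col = fill_1d(np.nan, 4, np.float64)
print(col)  # [nan nan nan nan]
```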

subarr = np.array(subarr, dtype=dtype, copy=copy)
# Take care in creating object arrays (but generators are not
# supported, hence the __len__ check):
if dtype == 'object' and (hasattr(subarr, '__len__') and
Contributor:

use is_object_dtype

Contributor:

this should be a separate branch rather than a nested if

Member Author (toobaz), Feb 5, 2018:

Do you mean

if is_object_dtype(dtype) and (hasattr(subarr, '__len__') and
                          not isinstance(subarr, np.ndarray)):
    [...]
elif not is_extension_type(subarr):
    [...]

?

@@ -287,8 +287,49 @@ def test_constructor_dict(self):
with tm.assert_raises_regex(ValueError, msg):
DataFrame({'a': 0.7}, columns=['a'])

with tm.assert_raises_regex(ValueError, msg):
Contributor:

make a separate test (this change), with a comment

Member Author (toobaz):

(done)

cols = [1, value, 3]
idx = ['a', value]
values = [[0, 3], [1, 4], [2, 5]]
data = {cols[c]: pd.Series(values[c], index=idx) for c in range(3)}
Contributor:

Don't use pd. on anything.

result = (DataFrame(data)
.sort_values((11, 21))
.sort_values(('a', value), axis=1))
expected = pd.DataFrame(np.arange(6, dtype='int64').reshape(2, 3),
Contributor:

same

@@ -735,15 +776,15 @@ def test_constructor_corner(self):

# does not error but ends up float
df = DataFrame(index=lrange(10), columns=['a', 'b'], dtype=int)
assert df.values.dtype == np.object_
assert df.values.dtype == np.dtype('float64')
Contributor:

why is this changing?

Member Author (toobaz):

Because it was wrong: an int should not upcast to object (the passed dtype is currently not considered). Should this get an issue and a whatsnew entry?
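The float64 result follows from numpy's type system: integer dtypes have no NaN representation, so any column that must hold missing values is upcast to float. For instance:

```python
import numpy as np

# Introducing a missing value into integer data forces an upcast,
# because np.int64 cannot represent NaN.
arr = np.array([1, 2, 3], dtype=np.int64)
with_nan = np.where([True, False, False], np.nan, arr)
print(with_nan.dtype)  # float64
```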

Contributor:

hmm, yeah this looks suspect. I would make a new issue for this

@@ -511,7 +511,7 @@ def test_read_one_empty_col_with_header(self):
)
expected_header_none = DataFrame(pd.Series([0], dtype='int64'))
tm.assert_frame_equal(actual_header_none, expected_header_none)
expected_header_zero = DataFrame(columns=[0], dtype='int64')
expected_header_zero = DataFrame(columns=[0])
Contributor:

why is this changing?

Member Author (toobaz):

The test was wrong and worked by accident. The result is, and should be, of object dtype; the "expected" one happened to be object too, only because the passed dtype wasn't being considered (see above).

Contributor:

ok again add this as an example in a new issue

Member Author (toobaz):

Again #19646

TomAugspurger (Contributor):

@toobaz did you add a test case for #19497 to see if it's fixed?

toobaz (Member Author) commented Feb 5, 2018

@toobaz did you add a test case for #19497 to see if it's fixed?

See #19497 (comment)

toobaz (Member Author) commented Feb 8, 2018

@jreback ping. The new commit removes a workaround to #18455.

@@ -6468,7 +6468,6 @@ def _where(self, cond, other=np.nan, inplace=False, axis=None, level=None,
if not is_bool_dtype(dt):
raise ValueError(msg.format(dtype=dt))

cond = cond.astype(bool, copy=False)
Contributor:

what caused you to change this?

Member Author (toobaz):

It's useless (bool dtype is checked just above)... but it's admittedly unrelated to the rest of the PR (it just came out debugging it).

if not is_extension_type(subarr):
# Take care in creating object arrays (but generators are not
# supported, hence the __len__ check):
if is_object_dtype(dtype) and (hasattr(subarr, '__len__') and
Contributor:

aren't you just checking is_list_like?

Member Author (toobaz):

No, for instance

In [2]: pd.core.dtypes.common.is_list_like((x for x in range(3)))
Out[2]: True

I don't know whether there was a general discussion about iterators as input; we could either decide to drop support for them, or to centralize its handling where possible (e.g. at least for indexes and data), in which case I would change these two lines. I can open a new issue for this.
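A minimal illustration of the distinction (plain Python): generators are iterable, so a generic list-like check accepts them, but they have no ``__len__``, which is what the check above relies on.

```python
gen = (x for x in range(3))

print(hasattr(gen, '__iter__'))       # True: generators are iterable
print(hasattr(gen, '__len__'))        # False: but have no length
print(hasattr([1, 2, 3], '__len__'))  # True: a list does
```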

Contributor:

ok, then add a function in pandas.core.dtypes.inference to is_generator and use it here (similar to is_iterator)

Member Author (toobaz):

Replacing hasattr(subarr, '__len__') with is_list_like(subarr) and not is_iterator(subarr) should do; I don't think we need is_generator. But alternatively, I could add an argument is_list_like(iterators=True), and use is_list_like(iterators=False) here. I think it would come in handy at several other places.

Contributor:

just make a new function, much simpler that way

Contributor:

yes there is, we do this elsewhere.

Member Author (toobaz):

Please elaborate on what "new function" would make things "simpler".

@@ -762,3 +763,4 @@ Other
^^^^^

- Improved error message when attempting to use a Python keyword as an identifier in a ``numexpr`` backed query (:issue:`18221`)
- Suppressed error in the construction of a :class:`DataFrame` from a ``dict`` containing scalar values when the corresponding keys are not included in the passed index (:issue:`18600`)
Contributor:

move to reshaping

@@ -418,44 +419,28 @@ def _init_dict(self, data, index, columns, dtype=None):
Needs to handle a lot of exceptional cases.
"""
if columns is not None:
columns = _ensure_index(columns)
arrays = Series(data, index=columns, dtype=object)
Contributor:

do we have an asv that actually hits this path here, e.g. not-none columns and a dict as input? I am concerned that this Series conversion to object is going to cause issues (and an asv or 2 will determine this)

Member Author (toobaz):

Added some, see below

index = extract_index(list(data.values()))

# GH10856
# raise ValueError if only scalars in dict
Contributor:

do you need the .tolist()?

Member Author (toobaz):

(removed)

v.fill(np.nan)
# no obvious "empty" int column
if missing.any() and not is_integer_dtype(dtype):
if dtype is None or np.issubdtype(dtype, np.flexible):
Contributor:

why is the flexible needed here? is this actually hit by a test?

Contributor:

I would appreciate an actual explanation. We do not check for this dtype anywhere else in the codebase, so at the very least this needs a comment.

Member Author (toobaz):

Sure, I would also appreciate an explanation (on that code @ajcr wrote and you committed).

v = construct_1d_arraylike_from_scalar(np.nan, len(index),
nan_dtype)
arrays.loc[missing] = [v] * missing.sum()
arrays = arrays.tolist()
Contributor:

This is 2-D here, yes? Can you add a comment?

Contributor:

do you need to do this conversion?



toobaz (Member Author) commented Feb 12, 2018

ASV run:

       before           after         ratio
     [324379ce]       [ef2340f7]
-      33.3±0.9ms       29.9±0.1ms     0.90  frame_ctor.FromDicts.time_nested_dict_index
-      32.8±0.2ms      28.5±0.09ms     0.87  frame_ctor.FromDictwithTimestamp.time_dict_with_timestamp_offsets(<Hour>)
-      50.4±0.3ms       42.4±0.5ms     0.84  frame_ctor.FromDicts.time_nested_dict_columns
-         417±3μs          281±6μs     0.67  frame_ctor.FromRecords.time_frame_from_records_generator(None)
-     1.32±0.03ms          277±1μs     0.21  frame_ctor.FromRecords.time_frame_from_records_generator(1000)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

The Travis CI problem seems unrelated.

toobaz (Member Author) commented Mar 31, 2018

@jreback rebased, ready to merge if there are no further comments

jreback (Contributor) commented Apr 1, 2018

will look

jreback (Contributor) left a review:

if you can edit the whatsnew slightly as indicated, ok to merge (left another comment but can try to address in the future)

@@ -1135,6 +1135,9 @@ Reshaping
- Bug in :func:`DataFrame.unstack` which casts int to float if ``columns`` is a ``MultiIndex`` with unused levels (:issue:`17845`)
- Bug in :func:`DataFrame.unstack` which raises an error if ``index`` is a ``MultiIndex`` with unused labels on the unstacked level (:issue:`18562`)
- Fixed construction of a :class:`Series` from a ``dict`` containing ``NaN`` as key (:issue:`18480`)
- Fixed construction of a :class:`DataFrame` from a ``dict`` containing ``NaN`` as key (:issue:`18455`)
- Suppressed error in the construction of a :class:`DataFrame` from a ``dict`` containing scalar values when the corresponding keys are not included in the passed index (:issue:`18600`)
- Fixed (changed from ``object`` to ``float64``) dtype of :class:`DataFrame` initialized with ``dtype=int`` and without data (:issue:`19646`)
Contributor:

this 3rd one not super clear, see if you can reword a bit

if not is_extension_type(subarr):
# Take care in creating object arrays (but iterators are not
# supported):
if is_object_dtype(dtype) and (is_list_like(subarr) and
Contributor:

this is pretty hard to read, but ok for now, see if can simplify in the future

Member Author (toobaz):

Yes, for sure we will need some unified mechanism to process iterators

@jreback jreback added this to the 0.23.0 milestone Apr 1, 2018
@toobaz toobaz merged commit 4efb39f into pandas-dev:master Apr 1, 2018
@toobaz toobaz deleted the df_init_dict_nan branch April 1, 2018 17:48
Merging this pull request may close: Equality between DataFrames misbehaves if columns contain NaN