
BUG: Fix initialization of DataFrame from dict with NaN as key #18600 (merged, 3 commits, Apr 1, 2018)

Conversation

toobaz (Member) commented Dec 2, 2017

This does not solve the MI example in #18455, but that should be included in #18485 .
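For context, ``NaN`` keys are exceptional because ``NaN`` compares unequal to itself, while ``dict`` lookup short-circuits on object identity. A plain-Python illustration (independent of pandas):

```python
# NaN is not equal to itself, but dict lookup checks identity
# before equality, so the *same* NaN object is found while a
# *distinct* NaN object is not.
nan = float('nan')
d = {nan: [1, 2], 'a': [3, 4]}

print(nan == nan)         # False
print(nan in d)           # True (identity check succeeds)
print(float('nan') in d)  # False (a different NaN object)
```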

@@ -416,44 +416,29 @@ def _init_dict(self, data, index, columns, dtype=None):
Needs to handle a lot of exceptional cases.
"""
if columns is not None:
columns = _ensure_index(columns)
arrays = Series(data, index=columns, dtype=object)
data_names = arrays.index
Contributor:

this will be a perf issue

Member Author (toobaz):

Maybe... but right now it seems to be worse...

     [d163de70]       [f7447b3f]
-      47.9±0.3ms       43.5±0.4ms     0.91  frame_ctor.FromDicts.time_frame_ctor_nested_dict
-      31.0±0.1ms       28.1±0.3ms     0.91  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('BusinessDay', 2)
-        31.3±1ms       28.2±0.2ms     0.90  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('BDay', 2)
-      31.8±0.3ms       28.0±0.4ms     0.88  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('CustomBusinessDay', 2)
-      32.8±0.3ms       28.2±0.2ms     0.86  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('Day', 1)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

Member Author (toobaz):

There does seem to be a performance loss on very small dfs. E.g. for pd.DataFrame(data) with data = {1: [2], 3: [4], 5: [6]} I get results around 530 µs per loop before and 570 µs after. So we are talking about a ~10% gain on large dfs vs. a ~7.5% loss on small dfs.

... or I can avoid that Series and sort manually, at the cost of a bit of added complexity, probably ~10 LoCs.

Member Author (toobaz):

uhm... those asv results also seem pretty unstable:

      before           after         ratio
     [d163de70]       [f7447b3f]
+      30.7±0.2ms         41.1±3ms     1.34  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('QuarterBegin', 1)
+      29.4±0.1ms         39.1±4ms     1.33  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('CBMonthBegin', 2)
+      30.7±0.7ms         40.4±3ms     1.32  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('BDay', 2)
+      31.1±0.7ms         39.1±5ms     1.26  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('SemiMonthEnd', 2)
+      30.4±0.1ms         38.0±4ms     1.25  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('Hour', 2)
-        33.8±1ms       30.4±0.8ms     0.90  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('Micro', 1)
-      48.3±0.6ms       42.6±0.8ms     0.88  frame_ctor.FromDicts.time_frame_ctor_nested_dict
-        23.8±1ms       20.5±0.2ms     0.86  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('FY5253Quarter_1', 2)
-      41.2±0.7ms         30.4±2ms     0.74  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('CustomBusinessHour', 2)
-      8.35±0.9ms      6.05±0.01ms     0.72  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('FY5253_2', 2)
-        42.4±2ms       30.5±0.5ms     0.72  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('BMonthEnd', 2)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

I'll try to sort manually and see how it goes.
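One way to "sort manually" in the presence of a NaN key (an illustrative sketch, not necessarily what the final commit does) is to push NaN to the end with a sort key, since NaN breaks ordinary comparison-based sorting:

```python
# NaN compares False against everything, so plain sorted() gives
# an order that depends on the input arrangement. A sort key that
# flags NaN (k != k is True only for NaN) makes the order stable.
nan = float('nan')
keys = [3.0, nan, 1.0]
ordered = sorted(keys, key=lambda k: (k != k, k if k == k else 0.0))
print(ordered)  # [1.0, 3.0, nan]
```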

Contributor:

I actually doubt we have good benchmarks on this; you are measuring the same benchmark here.

We need benchmarks that construct with different dtypes.

And reducing code complexity is paramount here (though of course we don't want to sacrifice perf).

@gfyoung gfyoung added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Dtype Conversions Unexpected or buggy dtype conversions labels Dec 3, 2017
jreback (Contributor) commented Jan 21, 2018

Please rebase if you can continue on this.

pep8speaks commented Feb 4, 2018

Hello @toobaz! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on April 1, 2018 at 15:27 UTC

@toobaz toobaz force-pushed the df_init_dict_nan branch 2 times, most recently from 6f9b502 to 9a65e8a Compare February 5, 2018 07:00
codecov bot commented Feb 5, 2018

Codecov Report

Merging #18600 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18600      +/-   ##
==========================================
+ Coverage   91.84%   91.84%   +<.01%     
==========================================
  Files         152      152              
  Lines       49265    49256       -9     
==========================================
- Hits        45247    45241       -6     
+ Misses       4018     4015       -3
Flag Coverage Δ
#multiple 90.23% <100%> (ø) ⬆️
#single 41.9% <93.33%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/generic.py 95.94% <ø> (+0.04%) ⬆️
pandas/core/internals.py 95.53% <100%> (ø) ⬆️
pandas/core/series.py 93.9% <100%> (+0.11%) ⬆️
pandas/core/frame.py 97.15% <100%> (-0.02%) ⬇️
pandas/core/dtypes/cast.py 87.85% <0%> (+0.16%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@@ -591,3 +591,5 @@ Other
^^^^^

- Improved error message when attempting to use a Python keyword as an identifier in a ``numexpr`` backed query (:issue:`18221`)
- Fixed construction of a :class:`DataFrame` from a ``dict`` containing ``NaN`` as key (:issue:`18455`)
Contributor:

move to reshaping


v.fill(np.nan)
# no obvious "empty" int column
if missing.any() and not (dtype is not None and
Contributor:

use is_integer_dtype

# no obvious "empty" int column
if missing.any() and not (dtype is not None and
issubclass(dtype.type, np.integer)):
if dtype is None or np.issubdtype(dtype, np.flexible):
Contributor:

use is_object_dtype

Member Author (toobaz):

It is not equivalent:

In [2]: a = np.array('abc'.split())

In [3]: pd.core.dtypes.common.is_object_dtype(a.dtype)
Out[3]: False

In [4]: np.issubdtype(a.dtype, np.flexible)
Out[4]: True
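The same check as a runnable snippet; ``np.flexible`` covers numpy's fixed-width string/bytes/void dtypes, which are distinct from ``object`` dtype:

```python
import numpy as np

# A fixed-width unicode dtype ('<U1') is a subclass of np.flexible
# but is not object dtype, so the two checks are not equivalent.
a = np.array('abc'.split())
print(np.issubdtype(a.dtype, np.flexible))  # True
print(a.dtype == np.dtype(object))          # False
```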

data_names.append(k)
arrays.append(v)
nan_dtype = dtype
v = np.empty(len(index), dtype=nan_dtype)
Contributor:

use construct_1d_arraylike_from_scalar
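For reference, a minimal numpy-only sketch of what such a helper does for plain dtypes (assumption: the actual pandas helper also handles extension and datetime-like dtypes, which this sketch omits):

```python
import numpy as np

def fill_1d(value, length, dtype):
    # Allocate a 1-D array of the requested length and fill it
    # with the scalar; here, a NaN-filled float64 column.
    arr = np.empty(length, dtype=dtype)
    arr.fill(value)
    return arr

col = fill_1d(np.nan, 4, np.float64)
print(col)  # [nan nan nan nan]
```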

subarr = np.array(subarr, dtype=dtype, copy=copy)
# Take care in creating object arrays (but generators are not
# supported, hence the __len__ check):
if dtype == 'object' and (hasattr(subarr, '__len__') and
Contributor:

use is_object_dtype

Contributor:

this should be a separate branch rather than a nested if

Member Author (toobaz), Feb 5, 2018:

Do you mean

if is_object_dtype(dtype) and (hasattr(subarr, '__len__') and
                          not isinstance(subarr, np.ndarray)):
    [...]
elif not is_extension_type(subarr):
    [...]

?

@@ -287,8 +287,49 @@ def test_constructor_dict(self):
with tm.assert_raises_regex(ValueError, msg):
DataFrame({'a': 0.7}, columns=['a'])

with tm.assert_raises_regex(ValueError, msg):
Contributor:

make a separate test (this change), with a comment

Member Author (toobaz):

(done)

cols = [1, value, 3]
idx = ['a', value]
values = [[0, 3], [1, 4], [2, 5]]
data = {cols[c]: pd.Series(values[c], index=idx) for c in range(3)}
Contributor:

Don't use pd. on anything.

result = (DataFrame(data)
.sort_values((11, 21))
.sort_values(('a', value), axis=1))
expected = pd.DataFrame(np.arange(6, dtype='int64').reshape(2, 3),
Contributor:

same

@@ -735,15 +776,15 @@ def test_constructor_corner(self):

# does not error but ends up float
df = DataFrame(index=lrange(10), columns=['a', 'b'], dtype=int)
assert df.values.dtype == np.object_
assert df.values.dtype == np.dtype('float64')
Contributor:

why is this changing?

Member Author (toobaz):

Because it was wrong: an int should not upcast to object (the passed dtype is currently not considered). Should this get an issue and a whatsnew entry?
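The float64 result follows from numpy's type system: integer dtypes have no NaN representation, so any column that must hold missing values is upcast to float. For instance:

```python
import numpy as np

# Introducing a missing value into integer data forces an upcast,
# because np.int64 cannot represent NaN.
arr = np.array([1, 2, 3], dtype=np.int64)
with_nan = np.where([True, False, False], np.nan, arr)
print(with_nan.dtype)  # float64
```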

Contributor:

hmm, yeah this looks suspect. I would make a new issue for this

@@ -511,7 +511,7 @@ def test_read_one_empty_col_with_header(self):
)
expected_header_none = DataFrame(pd.Series([0], dtype='int64'))
tm.assert_frame_equal(actual_header_none, expected_header_none)
expected_header_zero = DataFrame(columns=[0], dtype='int64')
expected_header_zero = DataFrame(columns=[0])
Contributor:

why is this changing?

Member Author (toobaz):

The test was wrong and worked by accident. The result is, and should be, of object dtype; the "expected" one happened to be object too, only because the passed dtype wasn't being considered (see above).

Contributor:

ok again add this as an example in a new issue

Member Author (toobaz):

Again #19646

TomAugspurger (Contributor):

@toobaz did you add a test case for #19497 to see if it's fixed?

toobaz (Member Author) commented Feb 5, 2018

@toobaz did you add a test case for #19497 to see if it's fixed?

See #19497 (comment)

toobaz (Member Author) commented Feb 8, 2018

@jreback ping. The new commit removes a workaround to #18455.

@@ -6468,7 +6468,6 @@ def _where(self, cond, other=np.nan, inplace=False, axis=None, level=None,
if not is_bool_dtype(dt):
raise ValueError(msg.format(dtype=dt))

cond = cond.astype(bool, copy=False)
Contributor:

what caused you to change this?

Member Author (toobaz):

It's useless (bool dtype is checked just above)... but it's admittedly unrelated to the rest of the PR (it just came out debugging it).

if not is_extension_type(subarr):
# Take care in creating object arrays (but generators are not
# supported, hence the __len__ check):
if is_object_dtype(dtype) and (hasattr(subarr, '__len__') and
Contributor:

aren't you just checking is_list_like?

Member Author (toobaz):

No, for instance

In [2]: pd.core.dtypes.common.is_list_like((x for x in range(3)))
Out[2]: True

I don't know whether there was a general discussion about iterators as input; we could either decide to drop support for them, or to centralize its handling where possible (e.g. at least for indexes and data), in which case I would change these two lines. I can open a new issue for this.
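A minimal illustration of the distinction (plain Python): generators are iterable, so a generic list-like check accepts them, but they have no ``__len__``, which is what the check above relies on.

```python
gen = (x for x in range(3))

print(hasattr(gen, '__iter__'))       # True: generators are iterable
print(hasattr(gen, '__len__'))        # False: but have no length
print(hasattr([1, 2, 3], '__len__'))  # True: a list does
```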

Contributor:

ok, then add a function in pandas.core.dtypes.inference to is_generator and use it here (similar to is_iterator)

Member Author (toobaz):

Replacing hasattr(subarr, '__len__') with is_list_like(subarr) and not is_iterator(subarr) should do; I don't think we need is_generator. But alternatively, I could add an argument is_list_like(iterators=True), and use is_list_like(iterators=False) here. I think it would come in handy at several other places.

Contributor:

just make a new function, much simpler that way

Contributor:

yes there is, we do this elsewhere.

Member Author (toobaz):

Please elaborate on what "new function" would make things "simpler".

@@ -762,3 +763,4 @@ Other
^^^^^

- Improved error message when attempting to use a Python keyword as an identifier in a ``numexpr`` backed query (:issue:`18221`)
- Suppressed error in the construction of a :class:`DataFrame` from a ``dict`` containing scalar values when the corresponding keys are not included in the passed index (:issue:`18600`)
Contributor:

move to reshaping

@@ -418,44 +419,28 @@ def _init_dict(self, data, index, columns, dtype=None):
Needs to handle a lot of exceptional cases.
"""
if columns is not None:
columns = _ensure_index(columns)
arrays = Series(data, index=columns, dtype=object)
Contributor:

do we have an asv that actually hits this path here, e.g. not-none columns and a dict as input? I am concerned that this Series conversion to object is going to cause issues (and an asv or 2 will determine this)

Member Author (toobaz):

Added some, see below

index = extract_index(list(data.values()))

# GH10856
# raise ValueError if only scalars in dict
Contributor:

do you need the .tolist()?

Member Author (toobaz):

(removed)

v.fill(np.nan)
# no obvious "empty" int column
if missing.any() and not is_integer_dtype(dtype):
if dtype is None or np.issubdtype(dtype, np.flexible):
Contributor:

why is the flexible needed here? is this actually hit by a test?

Contributor:

I would appreciate an actual explanation. We do not check for this dtype anywhere else in the codebase, so at the very least this needs a comment.

Member Author (toobaz):

Sure, I would also appreciate an explanation (on that code @ajcr wrote and you committed).

v = construct_1d_arraylike_from_scalar(np.nan, len(index),
nan_dtype)
arrays.loc[missing] = [v] * missing.sum()
arrays = arrays.tolist()
Contributor:

This is 2-D here, yes? Can you add a comment?

Contributor:

do you need to do this conversion?



toobaz (Member Author) commented Feb 12, 2018

ASV run:

       before           after         ratio
     [324379ce]       [ef2340f7]
-      33.3±0.9ms       29.9±0.1ms     0.90  frame_ctor.FromDicts.time_nested_dict_index
-      32.8±0.2ms      28.5±0.09ms     0.87  frame_ctor.FromDictwithTimestamp.time_dict_with_timestamp_offsets(<Hour>)
-      50.4±0.3ms       42.4±0.5ms     0.84  frame_ctor.FromDicts.time_nested_dict_columns
-         417±3μs          281±6μs     0.67  frame_ctor.FromRecords.time_frame_from_records_generator(None)
-     1.32±0.03ms          277±1μs     0.21  frame_ctor.FromRecords.time_frame_from_records_generator(1000)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

The Travis CI problem seems unrelated.

toobaz (Member Author) commented Mar 31, 2018

@jreback rebased, ready to merge if there are no further comments

jreback (Contributor) commented Apr 1, 2018

will look

jreback (Contributor) left a review:

if you can edit the whatsnew slightly as indicated, ok to merge (left another comment but can try to address in the future)

@@ -1135,6 +1135,9 @@ Reshaping
- Bug in :func:`DataFrame.unstack` which casts int to float if ``columns`` is a ``MultiIndex`` with unused levels (:issue:`17845`)
- Bug in :func:`DataFrame.unstack` which raises an error if ``index`` is a ``MultiIndex`` with unused labels on the unstacked level (:issue:`18562`)
- Fixed construction of a :class:`Series` from a ``dict`` containing ``NaN`` as key (:issue:`18480`)
- Fixed construction of a :class:`DataFrame` from a ``dict`` containing ``NaN`` as key (:issue:`18455`)
- Suppressed error in the construction of a :class:`DataFrame` from a ``dict`` containing scalar values when the corresponding keys are not included in the passed index (:issue:`18600`)
- Fixed (changed from ``object`` to ``float64``) dtype of :class:`DataFrame` initialized with ``dtype=int`` and without data (:issue:`19646`)
Contributor:

this 3rd one not super clear, see if you can reword a bit

if not is_extension_type(subarr):
# Take care in creating object arrays (but iterators are not
# supported):
if is_object_dtype(dtype) and (is_list_like(subarr) and
Contributor:

this is pretty hard to read, but ok for now, see if can simplify in the future

Member Author (toobaz):

Yes, for sure we will need some unified mechanism to process iterators

@jreback jreback added this to the 0.23.0 milestone Apr 1, 2018
@toobaz toobaz merged commit 4efb39f into pandas-dev:master Apr 1, 2018
@toobaz toobaz deleted the df_init_dict_nan branch April 1, 2018 17:48
Merging this pull request may close: Equality between DataFrames misbehaves if columns contain NaN