BUG: Fix initialization of DataFrame from dict with NaN as key #18600
Conversation
@@ -416,44 +416,29 @@ def _init_dict(self, data, index, columns, dtype=None):
        Needs to handle a lot of exceptional cases.
        """
        if columns is not None:
            columns = _ensure_index(columns)
            arrays = Series(data, index=columns, dtype=object)
            data_names = arrays.index
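For context, a minimal reproduction of the behavior this PR targets (GH18455): constructing a DataFrame from a dict whose keys include NaN. A sketch of the intended post-fix behavior (my illustration, not code from the PR):

```python
import numpy as np
import pandas as pd

# NaN among the dict keys; before this fix, _init_dict could not align
# such keys, so construction failed (GH18455)
data = {np.nan: [0, 1], 'a': [2, 3]}
df = pd.DataFrame(data)
print(df.columns)  # expected: an object Index containing nan and 'a'
```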
this will be a perf issue
Maybe... but right now it seems to be worse...
before after ratio
[d163de70] [f7447b3f]
- 47.9±0.3ms 43.5±0.4ms 0.91 frame_ctor.FromDicts.time_frame_ctor_nested_dict
- 31.0±0.1ms 28.1±0.3ms 0.91 frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('BusinessDay', 2)
- 31.3±1ms 28.2±0.2ms 0.90 frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('BDay', 2)
- 31.8±0.3ms 28.0±0.4ms 0.88 frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('CustomBusinessDay', 2)
- 32.8±0.3ms 28.2±0.2ms 0.86 frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('Day', 1)
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
There does seem to be a performance loss on very small dfs. E.g. for pd.DataFrame(data) with data = {1: [2], 3: [4], 5: [6]} I get results around 530 µs per loop before and 570 µs after. So we are talking about a ~10% gain on large dfs vs. a ~7.5% loss on small dfs.
... or I can avoid that Series and sort manually, at the cost of a bit of added complexity, probably ~10 LoCs.
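For reference, a minimal sketch of the small-df micro-benchmark described above (my illustration; absolute timings depend on machine and pandas version):

```python
import timeit

import pandas as pd

# time the constructor on a tiny dict, as in the ~530 µs vs ~570 µs comparison
data = {1: [2], 3: [4], 5: [6]}
per_loop = min(timeit.repeat(lambda: pd.DataFrame(data), number=1000)) / 1000
print("{:.0f} µs per loop".format(per_loop * 1e6))
```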
uhm... those asv results also seem pretty unstable:
before after ratio
[d163de70] [f7447b3f]
+ 30.7±0.2ms 41.1±3ms 1.34 frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('QuarterBegin', 1)
+ 29.4±0.1ms 39.1±4ms 1.33 frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('CBMonthBegin', 2)
+ 30.7±0.7ms 40.4±3ms 1.32 frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('BDay', 2)
+ 31.1±0.7ms 39.1±5ms 1.26 frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('SemiMonthEnd', 2)
+ 30.4±0.1ms 38.0±4ms 1.25 frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('Hour', 2)
- 33.8±1ms 30.4±0.8ms 0.90 frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('Micro', 1)
- 48.3±0.6ms 42.6±0.8ms 0.88 frame_ctor.FromDicts.time_frame_ctor_nested_dict
- 23.8±1ms 20.5±0.2ms 0.86 frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('FY5253Quarter_1', 2)
- 41.2±0.7ms 30.4±2ms 0.74 frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('CustomBusinessHour', 2)
- 8.35±0.9ms 6.05±0.01ms 0.72 frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('FY5253_2', 2)
- 42.4±2ms 30.5±0.5ms 0.72 frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('BMonthEnd', 2)
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
I'll try to sort manually and see how it goes.
i actually doubt we have good benchmarks on this; you are measuring the same benchmark here
we need benchmarks that construct with different dtypes
and reducing code complexity is paramount here (though of course we don't want to sacrifice perf)
pls rebase if you can continue on this.
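As a sketch of the dtype-varied construction benchmarks being asked for here (hypothetical asv class; the names and parameters are mine, not from the pandas benchmark suite):

```python
import numpy as np
import pandas as pd

class FrameCtorFromDictByDtype:
    # Hypothetical asv benchmark: dict-of-Series construction,
    # parameterized over the dtype of the input columns.
    params = ['int64', 'float64', 'object', 'datetime64[ns]']
    param_names = ['dtype']

    def setup(self, dtype):
        N = 10000
        self.data = {i: pd.Series(np.arange(N), dtype=dtype)
                     for i in range(10)}

    def time_frame_ctor(self, dtype):
        pd.DataFrame(self.data)
```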
Hello @toobaz! Thanks for updating the PR. Cheers! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on April 01, 2018 at 15:27 UTC
Codecov Report
@@            Coverage Diff             @@
##           master   #18600      +/-   ##
==========================================
+ Coverage   91.84%   91.84%   +<.01%
==========================================
  Files         152      152
  Lines       49265    49256       -9
==========================================
- Hits        45247    45241       -6
+ Misses       4018     4015       -3

Continue to review full report at Codecov.
doc/source/whatsnew/v0.23.0.txt (Outdated)
@@ -591,3 +591,5 @@ Other
^^^^^

- Improved error message when attempting to use a Python keyword as an identifier in a ``numexpr`` backed query (:issue:`18221`)
- Fixed construction of a :class:`DataFrame` from a ``dict`` containing ``NaN`` as key (:issue:`18455`)
move to reshaping
pandas/core/frame.py (Outdated)
v.fill(np.nan)
# no obvious "empty" int column
if missing.any() and not (dtype is not None and
use is_integer_dtype
pandas/core/frame.py (Outdated)
# no obvious "empty" int column
if missing.any() and not (dtype is not None and
                          issubclass(dtype.type, np.integer)):
    if dtype is None or np.issubdtype(dtype, np.flexible):
use is_object_dtype
It is not equivalent:
In [2]: a = np.array('abc'.split())
In [3]: pd.core.dtypes.common.is_object_dtype(a.dtype)
Out[3]: False
In [4]: np.issubdtype(a.dtype, np.flexible)
Out[4]: True
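For reference (my addition, not from the thread): np.flexible is the parent of the variable-itemsize dtypes, which is why the two checks disagree:

```python
import numpy as np

# np.flexible covers str_ ('U'), bytes_ ('S') and void ('V') dtypes;
# object arrays (dtype('O')) sit on a separate branch of the hierarchy.
for code in ['U3', 'S3', 'V8', 'O', 'i8']:
    print(code, np.issubdtype(np.dtype(code), np.flexible))
# 'U3', 'S3', 'V8' -> True; 'O', 'i8' -> False
```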
pandas/core/frame.py
Outdated
data_names.append(k) | ||
arrays.append(v) | ||
nan_dtype = dtype | ||
v = np.empty(len(index), dtype=nan_dtype) |
use construct_1d_arraylike_from_scalar
pandas/core/series.py (Outdated)
subarr = np.array(subarr, dtype=dtype, copy=copy)
# Take care in creating object arrays (but generators are not
# supported, hence the __len__ check):
if dtype == 'object' and (hasattr(subarr, '__len__') and
use is_object_dtype
this should be a separate branch rather than a nested if
Do you mean

if is_object_dtype(dtype) and (hasattr(subarr, '__len__') and
                               not isinstance(subarr, np.ndarray)):
    [...]
elif not is_extension_type(subarr):
    [...]

?
@@ -287,8 +287,49 @@ def test_constructor_dict(self):
with tm.assert_raises_regex(ValueError, msg):
    DataFrame({'a': 0.7}, columns=['a'])

with tm.assert_raises_regex(ValueError, msg):
make a separate test (this change), with a comment
(done)
cols = [1, value, 3]
idx = ['a', value]
values = [[0, 3], [1, 4], [2, 5]]
data = {cols[c]: pd.Series(values[c], index=idx) for c in range(3)}
don't use pd. on anything
result = (DataFrame(data)
          .sort_values((11, 21))
          .sort_values(('a', value), axis=1))
expected = pd.DataFrame(np.arange(6, dtype='int64').reshape(2, 3),
same
@@ -735,15 +776,15 @@ def test_constructor_corner(self):

# does not error but ends up float
df = DataFrame(index=lrange(10), columns=['a', 'b'], dtype=int)
-assert df.values.dtype == np.object_
+assert df.values.dtype == np.dtype('float64')
why is this changing?
Because it was wrong: an int should not upcast to object (the passed dtype is currently not considered). Issue? Whatsnew?
hmm, yeah this looks suspect. I would make a new issue for this
@@ -511,7 +511,7 @@ def test_read_one_empty_col_with_header(self):
)
expected_header_none = DataFrame(pd.Series([0], dtype='int64'))
tm.assert_frame_equal(actual_header_none, expected_header_none)
-expected_header_zero = DataFrame(columns=[0], dtype='int64')
+expected_header_zero = DataFrame(columns=[0])
why is this changing?
The test was wrong and worked by accident. The result is, and should be, of object dtype; but the "expected" one was too, because the passed dtype wasn't being considered (see above).
ok again add this as an example in a new issue
Again #19646
See #19497 (comment)
@@ -6468,7 +6468,6 @@ def _where(self, cond, other=np.nan, inplace=False, axis=None, level=None,
if not is_bool_dtype(dt):
    raise ValueError(msg.format(dtype=dt))

-cond = cond.astype(bool, copy=False)
what caused you to change this?
It's useless (bool dtype is checked just above)... but it's admittedly unrelated to the rest of the PR (it just came out of debugging it).
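A quick illustration (mine) of why the line was redundant: once the dtype check above has passed, cond is already boolean, and astype to the same dtype with copy=False hands back the object unchanged:

```python
import numpy as np

cond = np.array([True, False, True])
out = cond.astype(bool, copy=False)
assert out is cond  # same object: the astype call was a no-op here
```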
pandas/core/series.py (Outdated)
if not is_extension_type(subarr):
    # Take care in creating object arrays (but generators are not
    # supported, hence the __len__ check):
    if is_object_dtype(dtype) and (hasattr(subarr, '__len__') and
aren't you just checking is_list_like?
No, for instance
In [2]: pd.core.dtypes.common.is_list_like((x for x in range(3)))
Out[2]: True
I don't know whether there was a general discussion about iterators as input; we could either decide to drop support for them, or to centralize their handling where possible (e.g. at least for indexes and data), in which case I would change these two lines. I can open a new issue for this.
ok, then add an is_generator function in pandas.core.dtypes.inference and use it here (similar to is_iterator)
Replacing hasattr(subarr, '__len__') with is_list_like(subarr) and not is_iterator(subarr) should do; I don't think we need is_generator. But alternatively, I could add an argument is_list_like(iterators=True), and use is_list_like(iterators=False) here. I think it would come in handy at several other places.
just make a new function, much simpler that way
yes there is, we do this elsewhere.
Please elaborate on what "new function" would make things "simpler".
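For reference, a minimal sketch of the is_generator helper under discussion (my assumption of its shape, modeled on is_iterator; not necessarily the final pandas implementation):

```python
import types

def is_generator(obj):
    # True only for generator objects; other iterators and
    # list-like containers return False
    return isinstance(obj, types.GeneratorType)

print(is_generator(x for x in range(3)))  # True
print(is_generator(iter([1, 2, 3])))      # False
print(is_generator([1, 2, 3]))            # False
```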
doc/source/whatsnew/v0.23.0.txt (Outdated)
@@ -762,3 +763,4 @@ Other
^^^^^

- Improved error message when attempting to use a Python keyword as an identifier in a ``numexpr`` backed query (:issue:`18221`)
- Suppressed error in the construction of a :class:`DataFrame` from a ``dict`` containing scalar values when the corresponding keys are not included in the passed index (:issue:`18600`)
move to reshaping
@@ -418,44 +419,28 @@ def _init_dict(self, data, index, columns, dtype=None):
        Needs to handle a lot of exceptional cases.
        """
        if columns is not None:
            columns = _ensure_index(columns)
            arrays = Series(data, index=columns, dtype=object)
do we have an asv that actually hits this path here, e.g. not-none columns and a dict as input? I am concerned that this Series conversion to object is going to cause issues (and an asv or 2 will determine this)
Added some, see below
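A sketch of what such a benchmark could look like: dict input combined with explicit, partially overlapping columns, so the Series-based path in _init_dict is exercised (hypothetical names, mine; not the benchmarks actually added in the PR):

```python
import numpy as np
import pandas as pd

class FrameCtorDictWithColumns:
    # Hypothetical asv benchmark: not-None columns plus a dict as input,
    # including requested columns that are absent from the dict.
    def setup(self):
        N = 1000
        self.data = {i: np.arange(N) for i in range(100)}
        self.columns = list(range(50, 150))  # half missing from the dict

    def time_frame_ctor(self):
        pd.DataFrame(self.data, columns=self.columns)
```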
index = extract_index(list(data.values()))

# GH10856
# raise ValueError if only scalars in dict
do you need the .tolist()?
(removed)
v.fill(np.nan)
# no obvious "empty" int column
if missing.any() and not is_integer_dtype(dtype):
    if dtype is None or np.issubdtype(dtype, np.flexible):
why is the flexible needed here? is this actually hit by a test?
i would appreciate an actual explanation. we do not check for this dtype anywhere else in the codebase. so at the very least this needs a comment
Sure, I would also appreciate an explanation (on that code @ajcr wrote and you committed).
pandas/core/frame.py (Outdated)
v = construct_1d_arraylike_from_scalar(np.nan, len(index),
                                       nan_dtype)
arrays.loc[missing] = [v] * missing.sum()
arrays = arrays.tolist()
this is 2-D here, yes? can you add a comment
do you need to do this conversion?
ASV run:
The Travis CI problem seems unrelated.
@jreback rebased, ready to merge if there are no further comments
will look
if you can edit the whatsnew slightly as indicated, ok to merge (left another comment but can try to address in the future)
doc/source/whatsnew/v0.23.0.txt (Outdated)
@@ -1135,6 +1135,9 @@ Reshaping
- Bug in :func:`DataFrame.unstack` which casts int to float if ``columns`` is a ``MultiIndex`` with unused levels (:issue:`17845`)
- Bug in :func:`DataFrame.unstack` which raises an error if ``index`` is a ``MultiIndex`` with unused labels on the unstacked level (:issue:`18562`)
- Fixed construction of a :class:`Series` from a ``dict`` containing ``NaN`` as key (:issue:`18480`)
- Fixed construction of a :class:`DataFrame` from a ``dict`` containing ``NaN`` as key (:issue:`18455`)
- Suppressed error in the construction of a :class:`DataFrame` from a ``dict`` containing scalar values when the corresponding keys are not included in the passed index (:issue:`18600`)
- Fixed (changed from ``object`` to ``float64``) dtype of DataFrame initialized with ``dtype=int`` and without data (:issue:`19646`)
this 3rd one not super clear, see if you can reword a bit
if not is_extension_type(subarr):
    # Take care in creating object arrays (but iterators are not
    # supported):
    if is_object_dtype(dtype) and (is_list_like(subarr) and
this is pretty hard to read, but ok for now; see if you can simplify in the future
Yes, for sure we will need some unified mechanism to process iterators
git diff upstream/master -u -- "*.py" | flake8 --diff
This does not solve the MI example in #18455, but that should be included in #18485.