BUG: to_json - prevent various segfault conditions (GH14256) #17857

matthiashuschle · 2017-10-12T15:59:09Z

closes BUG: to_json with objects causing segfault #14256
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

Several code paths could write to the JSON string buffer at enc->start beyond its reserved space:

Loops over DataFrame columns got stuck on the same column once enc->errorMsg was set. Fixed by adding a check for errorMsg to the respective iterNext function.
Loops over objects (Dir_iterNext) did not terminate for similar reasons, which allowed infinite recursion. Fixed by adding the same errorMsg check.
Column labels were written in the iterName methods inside objToJSON.c, where the check for remaining buffer space was not accessible. Fixed by moving Buffer_Reserve to the ujson header file.
Closing brackets could be written without sufficient buffer space. Fixed by adding Buffer_Reserve calls.
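
The two kinds of fix listed above (stop iterating once an error is recorded, and reserve space before every write) can be illustrated with a toy model. This is a Python sketch, not the C code from this PR: `ToyEncoder`, `reserve`, `write`, and `encode_columns` are hypothetical names, the output is a flat object rather than the nested layout pandas really produces, and the doubling growth strategy is an assumption.

```python
class ToyEncoder:
    """Toy stand-in for the C encoder state (hypothetical, not the ujson API)."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.buf = bytearray(capacity)  # output buffer with a fixed initial size
        self.offset = 0                 # next write position
        self.error_msg = None           # set once, then checked by the iterator

    def reserve(self, needed):
        """Grow the buffer (by doubling) until `needed` more bytes fit."""
        size = len(self.buf)
        while size - self.offset < needed:
            size *= 2
        self.buf.extend(bytes(size - len(self.buf)))

    def write(self, data):
        """Reserve space BEFORE writing, for labels and brackets alike."""
        self.reserve(len(data))
        self.buf[self.offset:self.offset + len(data)] = data
        self.offset += len(data)


def encode_columns(enc, columns):
    """Encode {label: int} pairs; any other value type records an error."""
    items = iter(columns.items())

    def iter_next():
        # Stop iterating as soon as an error has been recorded, instead of
        # continuing to loop over columns.
        if enc.error_msg is not None:
            return None
        return next(items, None)

    enc.write(b"{")
    first = True
    item = iter_next()
    while item is not None:
        label, value = item
        if not first:
            enc.write(b",")
        first = False
        enc.write(b'"' + label.encode() + b'":')
        if isinstance(value, int):
            enc.write(str(value).encode())
        else:
            enc.error_msg = "unsupported value for label %r" % label
        item = iter_next()
    if enc.error_msg is not None:
        raise OverflowError(enc.error_msg)
    enc.write(b"}")  # the closing bracket also goes through reserve()
    return bytes(enc.buf[:enc.offset])
```

In this model, a 3000-character label still encodes with a 4-byte initial buffer because `write` reserves first, and a non-integer value sets `error_msg`, ends the iteration, and surfaces as `OverflowError`.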

gfyoung · 2017-10-12T17:30:24Z

doc/source/whatsnew/v0.21.0.txt

@@ -940,3 +940,5 @@ Other
 ^^^^^
 - Bug where some inplace operators were not being wrapped and produced a copy when invoked (:issue:`12962`)
 - Bug in :func:`eval` where the ``inplace`` parameter was being incorrectly handled (:issue:`16732`)
+- Bug in :func:`to_json` where several conditions (including objects with unprintable symbols, objects with deep recursion, overlong labels) caused segfaults instead of raising the appropriate exception (:issue:`14256`)


This is an I/O bug, so add it under that sub-section.

gfyoung · 2017-10-12T17:31:18Z

pandas/_libs/src/ujson/lib/ultrajson.h

+        Buffer_Realloc((__enc), (__len));                             \
+    }
+
+void Buffer_Realloc(JSONObjectEncoder *enc, size_t cbNeeded);


Where do you implement this?

It is already implemented in ultrajsonenc.c. The implementation is unchanged; the declaration is only exposed in the header so that it can be used from objToJSON.c.

gfyoung · 2017-10-12T17:31:41Z

pandas/tests/io/json/test_pandas.py

+        assert df_printable.to_json() == '{"A":{"0":"%s"}}' % hexed
+        df_nonprintable = DataFrame({'A': [binthing]})
+        pytest.raises(exc_type, df_nonprintable.to_json)
+        # GH14256: failing column caused segfaults, if it is not the last one


Better to reference this at the top of the function definition.

gfyoung · 2017-10-12T17:31:53Z

pandas/tests/io/json/test_pandas.py

+            '{"A":{"0":"%s"},"B":{"0":1}}' % hexed
+
+    def test_label_overflow(self):
+        df = pd.DataFrame({'foo': [1337], 'bar' * 100000: [1]})


Reference the issue number above this line.

codecov · 2017-10-13T09:09:24Z

Codecov Report

Merging #17857 into master will decrease coverage by 0.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #17857      +/-   ##
==========================================
- Coverage   91.22%    91.2%   -0.02%     
==========================================
  Files         163      163              
  Lines       50069    50038      -31     
==========================================
- Hits        45673    45639      -34     
- Misses       4396     4399       +3

Flag	Coverage Δ
#multiple	`89.01% <ø> (ø)`	⬆️
#single	`40.26% <ø> (-0.06%)`	⬇️

Impacted Files	Coverage Δ
pandas/io/gbq.py	`25% <0%> (-58.34%)`	⬇️
pandas/core/dtypes/concat.py	`98.26% <0%> (-0.87%)`	⬇️
pandas/core/frame.py	`97.75% <0%> (-0.1%)`	⬇️
pandas/core/indexes/datetimes.py	`95.48% <0%> (-0.1%)`	⬇️
pandas/core/internals.py	`94.38% <0%> (-0.07%)`	⬇️
pandas/core/sparse/series.py	`95.26% <0%> (-0.02%)`	⬇️
pandas/core/generic.py	`92.2% <0%> (-0.01%)`	⬇️
pandas/core/dtypes/dtypes.py	`95.14% <0%> (ø)`	⬆️
pandas/core/reshape/concat.py	`97.6% <0%> (+0.03%)`	⬆️
pandas/core/indexing.py	`93% <0%> (+0.18%)`	⬆️
... and 2 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update c277cd7...41a1aa0.

codecov · 2017-10-13T09:09:25Z

Codecov Report

Merging #17857 into master will increase coverage by <.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #17857      +/-   ##
==========================================
+ Coverage   91.23%   91.24%   +<.01%     
==========================================
  Files         163      163              
  Lines       50075    50075              
==========================================
+ Hits        45688    45691       +3     
+ Misses       4387     4384       -3

Flag	Coverage Δ
#multiple	`89.05% <ø> (+0.02%)`	⬆️
#single	`40.29% <ø> (-0.06%)`	⬇️

Impacted Files	Coverage Δ
pandas/io/gbq.py	`25% <0%> (-58.34%)`	⬇️
pandas/core/frame.py	`97.75% <0%> (-0.1%)`	⬇️
pandas/plotting/_converter.py	`65.2% <0%> (+1.81%)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 3c964a4...747a942.

matthiashuschle · 2017-10-13T09:12:19Z

Thank you for the comments. I moved the test comments and the whatsnew entry, and pushed the changes.

jreback

LGTM, just some minor comments on the formatting of the tests.

jreback · 2017-10-13T10:17:59Z

pandas/tests/io/json/test_pandas.py

+        hexed = '574b4454ba8c5eb4f98a8f45'
+        exc_type = OverflowError
+        binthing = BinaryThing(hexed)
+        df_printable = DataFrame({'A': [binthing.hexed]})


Before each sub-section, put a comment on what you are testing, and add blank lines between sub-sections.

jreback · 2017-10-13T10:19:03Z

pandas/tests/io/json/test_pandas.py

+        pytest.raises(exc_type, df_nonprintable.to_json)
+        df_mixed = DataFrame({'A': [binthing], 'B': [1]},
+                             columns=['A', 'B'])
+        pytest.raises(exc_type, df_mixed.to_json)


Use the `with` (context manager) form of pytest.raises to test raising.

jreback · 2017-10-13T10:20:15Z

pandas/tests/io/json/test_pandas.py

+                return self.hexed
+
+        hexed = '574b4454ba8c5eb4f98a8f45'
+        exc_type = OverflowError


Don't define this separately; just inline the exceptions you are checking.

jreback · 2017-10-13T10:23:52Z

@matthiashuschle also please rebase on master. Some CI settings were updated to make CircleCI work with the new version of matplotlib.

matthiashuschle · 2017-10-13T12:33:30Z

thanks, I just incorporated your suggestions.

jreback · 2017-10-14T14:36:52Z

thanks @matthiashuschle nice patch!

gfyoung added the Bug and IO JSON (read_json, to_json, json_normalize) labels Oct 12, 2017

gfyoung reviewed Oct 12, 2017

matthiashuschle force-pushed the rcb branch from f784469 to 41a1aa0 Compare October 13, 2017 09:09

jreback approved these changes Oct 13, 2017

jreback added this to the 0.21.0 milestone Oct 13, 2017

BUG: to_json - prevent various segfault conditions (GH14256)

747a942

matthiashuschle force-pushed the rcb branch from 41a1aa0 to 747a942 Compare October 13, 2017 12:27

jreback merged commit 446d5b4 into pandas-dev:master Oct 14, 2017

ghost pushed a commit to reef-technologies/pandas that referenced this pull request Oct 16, 2017 (5d1aa08)

alanbato pushed a commit to alanbato/pandas that referenced this pull request Nov 10, 2017 (17a165a)

No-Stream pushed a commit to No-Stream/pandas that referenced this pull request Nov 28, 2017 (5078474)

WillAyd mentioned this pull request Aug 8, 2019

Remove Encoding of values in char** For Labels #27618 (Merged)
