Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: to_json - prevent various segfault conditions (GH14256) #17857

Merged
merged 1 commit into from
Oct 14, 2017

Conversation

matthiashuschle
Copy link
Contributor

There were several sources for the JSON string buffer at enc->start exceeding the reserved space:

  • Loops on DataFrame columns being stuck on the same column if enc->errorMsg was set. This is fixed by adding the check for the errorMsg to the respective iterNext function.
  • Loops on objects (Dir_iterNext) not breaking due to similar reasons, which allows infinite recursion. Also added check for errorMsg.
  • Column labels were added in iterName methods inside objToJSON.c where the check for remaining buffer was not accessible. Moved Buffer_Reserve to the ujson header file.
  • Closing brackets could be added without sufficient buffer. Added Buffer_Reserve calls.

@gfyoung gfyoung added Bug IO JSON read_json, to_json, json_normalize labels Oct 12, 2017
@@ -940,3 +940,5 @@ Other
^^^^^
- Bug where some inplace operators were not being wrapped and produced a copy when invoked (:issue:`12962`)
- Bug in :func:`eval` where the ``inplace`` parameter was being incorrectly handled (:issue:`16732`)
- Bug in :func:`to_json` where several conditions (including objects with unprintable symbols, objects with deep recursion, overlong labels) caused segfaults instead of raising the appropriate exception (:issue:`14256`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an I/O bug, so add it under that sub-section.

Buffer_Realloc((__enc), (__len)); \
}

void Buffer_Realloc(JSONObjectEncoder *enc, size_t cbNeeded);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where do you implement this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's already in ultrajsonenc.c. No changes, just exposed in the header to be usable from objToJSON.c

assert df_printable.to_json() == '{"A":{"0":"%s"}}' % hexed
df_nonprintable = DataFrame({'A': [binthing]})
pytest.raises(exc_type, df_nonprintable.to_json)
# GH14256: failing column caused segfaults, if it is not the last one
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to reference this at the top of the function definition.

'{"A":{"0":"%s"},"B":{"0":1}}' % hexed

def test_label_overflow(self):
df = pd.DataFrame({'foo': [1337], 'bar' * 100000: [1]})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reference the issue number above this line.

@codecov
Copy link

codecov bot commented Oct 13, 2017

Codecov Report

Merging #17857 into master will decrease coverage by 0.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #17857      +/-   ##
==========================================
- Coverage   91.22%    91.2%   -0.02%     
==========================================
  Files         163      163              
  Lines       50069    50038      -31     
==========================================
- Hits        45673    45639      -34     
- Misses       4396     4399       +3
Flag Coverage Δ
#multiple 89.01% <ø> (ø) ⬆️
#single 40.26% <ø> (-0.06%) ⬇️
Impacted Files Coverage Δ
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/dtypes/concat.py 98.26% <0%> (-0.87%) ⬇️
pandas/core/frame.py 97.75% <0%> (-0.1%) ⬇️
pandas/core/indexes/datetimes.py 95.48% <0%> (-0.1%) ⬇️
pandas/core/internals.py 94.38% <0%> (-0.07%) ⬇️
pandas/core/sparse/series.py 95.26% <0%> (-0.02%) ⬇️
pandas/core/generic.py 92.2% <0%> (-0.01%) ⬇️
pandas/core/dtypes/dtypes.py 95.14% <0%> (ø) ⬆️
pandas/core/reshape/concat.py 97.6% <0%> (+0.03%) ⬆️
pandas/core/indexing.py 93% <0%> (+0.18%) ⬆️
... and 2 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c277cd7...41a1aa0. Read the comment docs.

@codecov
Copy link

codecov bot commented Oct 13, 2017

Codecov Report

Merging #17857 into master will increase coverage by <.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #17857      +/-   ##
==========================================
+ Coverage   91.23%   91.24%   +<.01%     
==========================================
  Files         163      163              
  Lines       50075    50075              
==========================================
+ Hits        45688    45691       +3     
+ Misses       4387     4384       -3
Flag Coverage Δ
#multiple 89.05% <ø> (+0.02%) ⬆️
#single 40.29% <ø> (-0.06%) ⬇️
Impacted Files Coverage Δ
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.75% <0%> (-0.1%) ⬇️
pandas/plotting/_converter.py 65.2% <0%> (+1.81%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3c964a4...747a942. Read the comment docs.

@matthiashuschle
Copy link
Contributor Author

Thank you for the comments. I moved the test comments and the whatsnew entry, and pushed the changes.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. just some minor formatting comments of the tests.

hexed = '574b4454ba8c5eb4f98a8f45'
exc_type = OverflowError
binthing = BinaryThing(hexed)
df_printable = DataFrame({'A': [binthing.hexed]})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

before each sub-section, can put a comment on what you are testing; and blank lines between sub-sections

pytest.raises(exc_type, df_nonprintable.to_json)
df_mixed = DataFrame({'A': [binthing], 'B': [1]},
columns=['A', 'B'])
pytest.raises(exc_type, df_mixed.to_json)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with the with version to test raising

return self.hexed

hexed = '574b4454ba8c5eb4f98a8f45'
exc_type = OverflowError
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't define this separately, just inline the exceptions you are checking

@jreback jreback added this to the 0.21.0 milestone Oct 13, 2017
@jreback
Copy link
Contributor

jreback commented Oct 13, 2017

@matthiashuschle also pls rebase on master. some CI things were updated to make circleci work with the new version of mpl.

@matthiashuschle
Copy link
Contributor Author

thanks, I just incorporated your suggestions.

@jreback jreback merged commit 446d5b4 into pandas-dev:master Oct 14, 2017
@jreback
Copy link
Contributor

jreback commented Oct 14, 2017

thanks @matthiashuschle nice patch!

ghost pushed a commit to reef-technologies/pandas that referenced this pull request Oct 16, 2017
ghost pushed a commit to reef-technologies/pandas that referenced this pull request Oct 16, 2017
* upstream/master: (76 commits)
  CategoricalDtype construction: actually use fastpath (pandas-dev#17891)
  DEPR: Deprecate tupleize_cols in to_csv (pandas-dev#17877)
  BUG: Fix wrong column selection in drop_duplicates when duplicate column names (pandas-dev#17879)
  DOC: Adding examples to update docstring (pandas-dev#16812) (pandas-dev#17859)
  TST: Skip if no openpyxl in test_excel (pandas-dev#17883)
  TST: Catch read_html slow test warning (pandas-dev#17874)
  flake8 cleanup (pandas-dev#17873)
  TST: remove moar warnings (pandas-dev#17872)
  ENH: tolerance now takes list-like argument for reindex and get_indexer. (pandas-dev#17367)
  ERR: Raise ValueError when week is passed in to_datetime format witho… (pandas-dev#17819)
  TST: remove some deprecation warnings (pandas-dev#17870)
  Refactor index-as-string groupby tests and fix spurious warning (Bug 17383) (pandas-dev#17843)
  BUG: merging with a boolean/int categorical column (pandas-dev#17841)
  DEPR: Deprecate read_csv arguments fully (pandas-dev#17865)
  BUG: to_json - prevent various segfault conditions (GH14256) (pandas-dev#17857)
  CLN: Use pandas.core.common for None checks (pandas-dev#17816)
  BUG: set tz on DTI from fixed format HDFStore (pandas-dev#17844)
  RLS: v0.21.0rc1
  Whatsnew cleanup (pandas-dev#17858)
  DEPR: Deprecate the convert parameter completely (pandas-dev#17831)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO JSON read_json, to_json, json_normalize
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: to_json with objects causing segfault
3 participants