Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

offset overflow at large (>2.4GB) string or binary columns #294

Closed
wants to merge 3 commits into from

Conversation

skirpichenko
Copy link

This PR closes the issue #232.

Feather saves string and binary columns as a flat buffer preposed by an array of offsets to the beginning and ending point of each field. The types of offsets are int32_t (32-bit signed integer). This leads to overflow when column size is bigger than 2.4GB and makes impossible to work with large data tables. The problem can be fixed without changing offset type and binary file format. The difference between the ending and beginning points still gives correct field length even with overflowed offsets. Thus the correct location of each field can be reconstructed summing up the lengths of all previous fields.

@codecov-io
Copy link

Codecov Report

Merging #294 into master will decrease coverage by 27.82%.
The diff coverage is n/a.

Impacted file tree graph

@@             Coverage Diff             @@
##           master     #294       +/-   ##
===========================================
- Coverage   87.19%   59.37%   -27.83%     
===========================================
  Files           5        6        +1     
  Lines         414     2449     +2035     
===========================================
+ Hits          361     1454     +1093     
- Misses         53      995      +942
Impacted Files Coverage Δ
python/feather/__init__.py 55.23% <0%> (-44.77%) ⬇️
python/feather/api.py 71.42% <0%> (-26.2%) ⬇️
integration-tests/util.py 22.13% <0%> (ø)
python/feather/compat.py 70.65% <0%> (+9.36%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8872294...eb44c65. Read the comment docs.

@wesm
Copy link
Owner

wesm commented Apr 1, 2017

I'll have a closer look at this.

I just opened https://issues.apache.org/jira/browse/ARROW-750 -- note that the C++ code base here is effectively deprecated at this point. I'm looking for R developers to help with building more general Arrow Rcpp bindings, and continuing to support Feather in the Apache Arrow repo. Would you or anyone else from the R community like to get involved? There will be a lot of long term value in having high quality R bindings to the Arrow libraries (e.g. it would give you Parquet file support with relatively little effort)

@jameslamb jameslamb mentioned this pull request Apr 18, 2017
@terrytangyuan
Copy link

@wesm Have you got a chance to look at this yet?

@terrytangyuan
Copy link

@wesm Regarding your comment on high quality R bindings, do you guys have a list of potential items or roadmap? We should probably open a new ticket with more details so more people from R community can comment on it and discuss designs. cc: @hadley @kevinushey @krlmlr

@wesm
Copy link
Owner

wesm commented Jun 8, 2017

Hi,

We should start with R bindings for the Arrow C++ shared library that provide the same Feather support we have now (using the Rcpp code here), then expand to Arrow's more general stream and file formats (which are like "chunked Feather supporting nested data and more logical data type"). From there, it should be use case driven, e.g. supporting better Spark interoperability for sparklyr and SparkR. Does that make sense?

@wesm
Copy link
Owner

wesm commented Jun 11, 2017

Closing this as Won't Fix. There is a JIRA about adding LargeBinary and LargeString (UTF-8) types to the Arrow metadata, so that will be the approach to fix this issue: https://issues.apache.org/jira/browse/ARROW-750. I would love to have some R developers get involved with the Apache Arrow project to carry on Feather development.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants