-
Notifications
You must be signed in to change notification settings - Fork 166
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
offset overflow at large (>2.4GB) string or binary columns #294
Conversation
Codecov Report
@@ Coverage Diff @@
## master #294 +/- ##
===========================================
- Coverage 87.19% 59.37% -27.83%
===========================================
Files 5 6 +1
Lines 414 2449 +2035
===========================================
+ Hits 361 1454 +1093
- Misses 53 995 +942
Continue to review full report at Codecov.
|
I'll have a closer look at this. I just opened https://issues.apache.org/jira/browse/ARROW-750 -- note that the C++ code base here is effectively deprecated at this point. I'm looking for R developers to help with building more general Arrow Rcpp bindings, and continuing to support Feather in the Apache Arrow repo. Would you or anyone else from the R community like to get involved? There will be a lot of long term value in having high quality R bindings to the Arrow libraries (e.g. it would give you Parquet file support with relatively little effort) |
@wesm Have you got a chance to look at this yet? |
@wesm Regarding your comment on high quality R bindings, do you guys have a list of potential items or roadmap? We should probably open a new ticket with more details so more people from R community can comment on it and discuss designs. cc: @hadley @kevinushey @krlmlr |
Hi, We should start with R bindings for the Arrow C++ shared library that provide the same Feather support we have now (using the Rcpp code here), then expand to Arrow's more general stream and file formats (which are like "chunked Feather supporting nested data and more logical data type"). From there, it should be use case driven, e.g. supporting better Spark interoperability for sparklyr and SparkR. Does that make sense? |
Closing this as Won't Fix. There is a JIRA about adding LargeBinary and LargeString (UTF-8) types to the Arrow metadata, so that will be the approach to fix this issue: https://issues.apache.org/jira/browse/ARROW-750. I would love to have some R developers get involved with the Apache Arrow project to carry on Feather development. |
This PR closes the issue #232.
Feather
saves string and binary columns as a flat buffer preposed by an array of offsets to the beginning and ending point of each field. The types of offsets areint32_t
(32-bit signed integer). This leads to overflow when column size is bigger than 2.4GB and makes impossible to work with large data tables. The problem can be fixed without changing offset type and binary file format. The difference between the ending and beginning points still gives correct field length even with overflowed offsets. Thus the correct location of each field can be reconstructed summing up the lengths of all previous fields.