offset overflow at large (>2.4GB) string or binary columns #294

skirpichenko · 2017-04-01T13:41:13Z

This PR closes issue #232.

Feather stores string and binary columns as a flat buffer preceded by an array of offsets marking the start and end of each field. The offsets are of type int32_t (32-bit signed integer), so they overflow when a column is larger than 2.4GB, which makes it impossible to work with large data tables. The problem can be fixed without changing the offset type or the binary file format: the difference between a field's end and start offsets still gives the correct field length even when the offsets have overflowed. The correct location of each field can therefore be reconstructed by summing the lengths of all preceding fields.
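The reconstruction described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `RebuildOffsets` is a hypothetical helper name, and the sketch assumes that the first stored offset is 0 and that every individual field is shorter than 2^32 bytes, so that a single length survives 32-bit wraparound.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of Feather's API): rebuild 64-bit field
// positions from int32_t offsets that may have wrapped around.
// Assumptions: the first stored offset is 0, and each individual field is
// shorter than 2^32 bytes.
std::vector<int64_t> RebuildOffsets(const std::vector<int32_t>& offsets) {
  std::vector<int64_t> result;
  if (offsets.empty()) return result;
  assert(offsets.front() == 0);  // documents the "starts at 0" assumption

  result.reserve(offsets.size());
  int64_t position = 0;  // running 64-bit position, never overflows here
  result.push_back(position);
  for (std::size_t i = 1; i < offsets.size(); ++i) {
    // Subtract as unsigned: wraparound is well defined modulo 2^32, so the
    // difference is the true field length even if either offset overflowed.
    const uint32_t length = static_cast<uint32_t>(offsets[i]) -
                            static_cast<uint32_t>(offsets[i - 1]);
    position += length;
    result.push_back(position);
  }
  return result;
}
```

For example, true offsets of 0, 2000000000, 4000000000 and 4000000010 would be stored in int32_t as 0, 2000000000, -294967296 and -294967286; passing those stored values to the helper yields the original 64-bit positions again. The subtraction is done on unsigned values because signed overflow is undefined behavior in C++.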

…r binary columns

codecov-io · 2017-04-01T14:41:27Z

Codecov Report

Merging #294 into master will decrease coverage by 27.82%.
The diff coverage is n/a.

@@             Coverage Diff             @@
##           master     #294       +/-   ##
===========================================
- Coverage   87.19%   59.37%   -27.83%     
===========================================
  Files           5        6        +1     
  Lines         414     2449     +2035     
===========================================
+ Hits          361     1454     +1093     
- Misses         53      995      +942

Impacted Files	Coverage Δ
python/feather/__init__.py	`55.23% <0%> (-44.77%)`	⬇️
python/feather/api.py	`71.42% <0%> (-26.2%)`	⬇️
integration-tests/util.py	`22.13% <0%> (ø)`
python/feather/compat.py	`70.65% <0%> (+9.36%)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8872294...eb44c65. Read the comment docs.

wesm · 2017-04-01T15:30:05Z

I'll have a closer look at this.

I just opened https://issues.apache.org/jira/browse/ARROW-750 -- note that the C++ code base here is effectively deprecated at this point. I'm looking for R developers to help with building more general Arrow Rcpp bindings and with continuing to support Feather in the Apache Arrow repo. Would you or anyone else from the R community like to get involved? There will be a lot of long-term value in having high-quality R bindings to the Arrow libraries (e.g. it would give you Parquet file support with relatively little effort).

terrytangyuan · 2017-04-27T16:38:51Z

@wesm Have you had a chance to look at this yet?

terrytangyuan · 2017-04-27T17:20:40Z

@wesm Regarding your comment on high-quality R bindings, do you guys have a list of potential items or a roadmap? We should probably open a new ticket with more details so more people from the R community can comment on it and discuss designs. cc: @hadley @kevinushey @krlmlr

wesm · 2017-06-08T01:49:00Z

Hi,

We should start with R bindings for the Arrow C++ shared library that provide the same Feather support we have now (using the Rcpp code here), then expand to Arrow's more general stream and file formats (which are like "chunked Feather supporting nested data and more logical data types"). From there, it should be use-case driven, e.g. supporting better Spark interoperability for sparklyr and SparkR. Does that make sense?

wesm · 2017-06-11T15:40:58Z

Closing this as Won't Fix. There is a JIRA about adding LargeBinary and LargeString (UTF-8) types to the Arrow metadata, so that will be the approach to fix this issue: https://issues.apache.org/jira/browse/ARROW-750. I would love to have some R developers get involved with the Apache Arrow project to carry on Feather development.

Sergey Kirpichenko added 3 commits April 1, 2017 07:58

fixed offset overflow problem when processing large (>2.4GB) string o…

1196c23

…r binary columns

R test removed due to github memory limit

1735fd0

c++ code formatting

eb44c65

jameslamb mentioned this pull request Apr 18, 2017

data.table integration #293

Closed

wesm closed this Jun 11, 2017

jameslamb mentioned this pull request Nov 12, 2018

ARROW-3439: [R] R language bindings for Feather format apache/arrow#2947

Closed
