[Format] Add string view type to the arrow format #35627
github-actions bot added Component: C++, Component: Format, Component: Integration labels May 16, 2023
bkietz added a commit to bkietz/arrow that referenced this issue Sep 1, 2023
kou changed the title "Add string view type to the arrow format" to "[Format] Add string view type to the arrow format" Sep 12, 2023
bkietz added a commit to bkietz/arrow that referenced this issue Sep 13, 2023
bkietz added a commit to bkietz/arrow that referenced this issue Sep 19, 2023
bkietz added a commit to bkietz/arrow that referenced this issue Sep 20, 2023
bkietz added a commit to bkietz/arrow that referenced this issue Sep 21, 2023
bkietz added a commit that referenced this issue Sep 21, 2023
String view (and equivalent non-utf8 binary view) is an alternative representation for variable length strings which offers greater efficiency for several common operations. This representation is in use by UmbraDB, DuckDB, and Velox. Where those databases use a raw pointer to out-of-line strings this PR uses a pair of 32 bit integers as a buffer index and offset, which

- makes explicit the guarantee that lifetime of all character data is equal to that of the array which views it, which is critical for confident consumption across an interface boundary
- makes the arrays meaningfully serializable and venue agnostic; directly usable in shared memory without modification
- allows easy validation

This PR is extracted from #35628 to unblock independent PRs now that the vote has passed, including:

- New types added to Schema.fbs
- Message.fbs amended to support variable buffer counts between string view chunks
- datagen.py extended to produce integration JSON for string view arrays
- Columnar.rst amended with a description of the string view format

* Closes: #35627

Authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>
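To make the buffer index and offset idea concrete, here is a minimal Python sketch of one way a view could be packed and checked. Only the pair of 32-bit integers (buffer index and offset) comes from the commit message above. The 16-byte view size, the 12-byte inline threshold, the 4-byte prefix, and the zero padding follow my reading of the Arrow columnar format and should be treated as assumptions. The function names `encode_view`, `decode_view`, and `is_valid_view` are hypothetical and are not part of any Arrow API.

```python
import struct

INLINE_LIMIT = 12  # assumed: values up to 12 bytes are stored inside the view


def encode_view(value: bytes, buffers: list) -> bytes:
    """Pack one value into a 16-byte little-endian view.

    Long values are appended to the last data buffer in `buffers`
    (a list of bytearray), and the view records where they landed.
    """
    n = len(value)
    if n <= INLINE_LIMIT:
        # int32 length, then the bytes themselves, zero-padded to 12 bytes.
        return struct.pack("<i12s", n, value)
    if not buffers:
        buffers.append(bytearray())
    buf_index = len(buffers) - 1
    offset = len(buffers[buf_index])
    buffers[buf_index] += value
    # int32 length, 4-byte prefix, int32 buffer index, int32 offset.
    return struct.pack("<i4sii", n, value[:4], buf_index, offset)


def decode_view(view: bytes, buffers: list) -> bytes:
    """Recover the value a view refers to."""
    (n,) = struct.unpack_from("<i", view, 0)
    if n <= INLINE_LIMIT:
        return view[4:4 + n]
    buf_index, offset = struct.unpack_from("<ii", view, 8)
    return bytes(buffers[buf_index][offset:offset + n])


def is_valid_view(view: bytes, buffers: list) -> bool:
    """Bounds-check a view without dereferencing any raw pointer."""
    if len(view) != 16:
        return False
    (n,) = struct.unpack_from("<i", view, 0)
    if n < 0:
        return False
    if n <= INLINE_LIMIT:
        # This sketch assumes unused inline bytes must be zero.
        return view[4 + n:] == bytes(INLINE_LIMIT - n)
    prefix, buf_index, offset = struct.unpack_from("<4sii", view, 4)
    if not 0 <= buf_index < len(buffers):
        return False
    if offset < 0 or offset + n > len(buffers[buf_index]):
        return False
    return bytes(buffers[buf_index][offset:offset + 4]) == prefix
```

Because a view holds an index into the array's own buffer list instead of an address, validation reduces to the range checks in `is_valid_view`, and the same bytes stay meaningful after being copied to another process or into shared memory.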
Describe the enhancement requested
Adding an issue to GH to track adding string view to the arrow format, as discussed on this ML thread https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq
Current ML discussion at https://lists.apache.org/thread/w88tpz76ox8h3rxkjl4so6rg3f1rv7wt
Component(s)
C++, Format, Integration