GH-37876: [Format] Add list-view specification to arrow format #37877

Merged: 13 commits, Oct 5, 2023

felipecrv
Contributor

@felipecrv felipecrv commented Sep 26, 2023

Rationale for this change

More details in the draft implementations of this spec:

  • C++: apache#35345
  • Go: apache#37468

What changes are included in this PR?

  • Some unrelated fixes to the spec text (I can extract these to another PR if necessary)
  • Changes to the spec text
  • Additions to the Flatbuffers specifications of the Arrow format

Are these changes tested?

N/A.

Are there any user-facing changes?

Changes in documentation and backwards compatible additions to the format spec.

@felipecrv felipecrv requested review from bkietz and pitrou September 26, 2023 15:58
@felipecrv felipecrv marked this pull request as ready for review September 26, 2023 15:58
@github-actions

⚠️ GitHub issue #37876 has been automatically assigned in GitHub to PR creator.

@felipecrv felipecrv changed the title from "GH-37876: [Format] Add string-view to arrow format" to "GH-37876: [Format] Add list-view to arrow format" Sep 26, 2023
@felipecrv felipecrv changed the title from "GH-37876: [Format] Add list-view to arrow format" to "GH-37876: [Format] Add list-view specification to arrow format" Sep 26, 2023
@pitrou pitrou requested a review from wjones127 September 28, 2023 16:49
Member

@bkietz bkietz left a comment

Just a few nits in wording, otherwise looks good

docs/source/format/Columnar.rst: 5 outdated review threads (resolved)
@github-actions github-actions bot added the "awaiting changes" label and removed the "awaiting review" label Sep 29, 2023
format/Schema.fbs: 1 outdated review thread (resolved)
Member

@mapleFU mapleFU left a comment

LGTM, thanks!

Co-authored-by: David Li <li.davidm96@gmail.com>
@github-actions github-actions bot added the "awaiting change review" label and removed the "awaiting changes" label Sep 30, 2023
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
@tustvold
Contributor

tustvold commented Oct 2, 2023

Would it be enough to require that sizes[i] == 0 when i is null to call it a "valid empty list-view"

At least in Rust the rule is that a slice must have an end index less than or equal to the length of the data being sliced.

So in this case a slice would be valid iff sizes[i] + offsets[i] <= child_data[0].length().

It has been a while since I worked in C++, but if I recall correctly this is consistent with the way iterators work as well.
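
As a rough illustration of the bounds rule above, here is a minimal Python sketch. The function name `invalid_list_view_slots` and the use of plain Python sequences for the offsets and sizes buffers are assumptions made for this example; they are not part of any Arrow API. The check is applied to every slot, whether or not it is null.

```python
from typing import List, Sequence


def invalid_list_view_slots(
    offsets: Sequence[int],
    sizes: Sequence[int],
    child_length: int,
) -> List[int]:
    """Return the indices whose (offset, size) pair falls outside the child array.

    Hypothetical helper: mirrors the rule
    sizes[i] + offsets[i] <= child_data[0].length().
    """
    if len(offsets) != len(sizes):
        raise ValueError("offsets and sizes must have the same length")
    bad = []
    for i, (offset, size) in enumerate(zip(offsets, sizes)):
        # An end index equal to child_length is still in bounds,
        # just like a slice or an end iterator.
        if offset < 0 or size < 0 or offset + size > child_length:
            bad.append(i)
    return bad
```

With a child array of length 5, `invalid_list_view_slots([0, 4], [2, 2], 5)` flags slot 1, because its slice would end at index 6.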

@github-actions github-actions bot added the "awaiting change review" label and removed the "awaiting changes" label Oct 3, 2023
@felipecrv
Contributor Author

> Would it be enough to require that sizes[i] == 0 when i is null to call it a "valid empty list-view"
>
> At least in Rust the rule is that a slice must have an end index less than or equal to the length of the data being sliced.
>
> So in this case a slice would be valid iff sizes[i] + offsets[i] <= child_data[0].length().
>
> It has been a while since I worked in C++, but if I recall correctly this is consistent with the way iterators work as well.

I will rewrite the text saying that non-empty nulls are allowed, then.
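
To make "non-empty nulls" concrete, the sketch below decodes a list-view from a validity list, an offsets list, a sizes list, and a child list. The helper `list_view_to_pylist` and the plain-list buffers are hypothetical, chosen only for illustration, and the code assumes the bounds of every slot were already validated.

```python
from typing import Any, List, Optional, Sequence


def list_view_to_pylist(
    validity: Sequence[bool],
    offsets: Sequence[int],
    sizes: Sequence[int],
    child: Sequence[Any],
) -> List[Optional[List[Any]]]:
    """Decode a list-view into Python lists, using None for null slots.

    Hypothetical helper: assumes offsets[i] + sizes[i] <= len(child) for all i.
    """
    out: List[Optional[List[Any]]] = []
    for is_valid, offset, size in zip(validity, offsets, sizes):
        if not is_valid:
            # A null slot may carry a non-zero size; readers ignore it.
            out.append(None)
        else:
            out.append(list(child[offset:offset + size]))
    return out


child = [10, 20, 30, 40, 50]
# Slot 1 is null but still has size 1 (a "non-empty null").
print(list_view_to_pylist([True, False, True], [0, 2, 3], [2, 1, 2], child))
# prints [[10, 20], None, [40, 50]]
```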

@felipecrv felipecrv requested review from pitrou and tustvold October 3, 2023 18:29
@zeroshade
Member

The vote on the mailing list has officially passed. @bkietz, you have an outstanding change request; can you take a look at the updates and update your review accordingly?

@pitrou @tustvold Any outstanding comments here or can we approve this?

@tustvold
Contributor

tustvold commented Oct 5, 2023

LGTM

@@ -100,15 +100,15 @@ Arrays are defined by a few pieces of metadata and data:
 Nested arrays additionally have a sequence of one or more sets of
 these items, called the **child arrays**.
 
-Each logical data type has a well-defined physical layout. Here are
-the different physical layouts defined by Arrow:
+Each logical data type has one or more well-defined physical layouts. Here
Member

I would keep the singular. There is no disjunction in Arrow (unlike Parquet) between "logical" data type and physical layout. ListView and StringView are simply distinct types.

Contributor Author

I will change this back to the singular here and in all the other places I changed it. But in the future, the "logical data type" terminology should probably be removed altogether because it's very confusing.

Member

I definitely agree with that. The spec was often confusing to me at the start.

@felipecrv felipecrv requested a review from pitrou October 5, 2023 14:05
Member

@pitrou pitrou left a comment

Thanks a lot @felipecrv !

@pitrou
Member

pitrou commented Oct 5, 2023

@bkietz Any other comment?

@github-actions github-actions bot added the "awaiting merge" label and removed the "awaiting change review" label Oct 5, 2023
@zeroshade zeroshade merged commit 6d551aa into apache:main Oct 5, 2023
10 checks passed
@zeroshade zeroshade removed the "awaiting merge" label Oct 5, 2023
@felipecrv felipecrv deleted the format_list_view branch October 5, 2023 17:59
@conbench-apache-arrow

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 6d551aa.

There was 1 benchmark result indicating a performance regression:

The full Conbench report has more details. It also includes information about 2 possible false positives for unstable benchmarks that are known to sometimes produce them.

JerAguilon pushed a commit to JerAguilon/arrow that referenced this pull request Oct 23, 2023
…pache#37877)

### Rationale for this change

More details in the draft implementations of this spec:

 - C++: apache#35345
 - Go: apache#37468

### What changes are included in this PR?

 - Some unrelated fixes to the spec text (I can extract these to another PR if necessary)
 - Changes to the spec text
 - Additions to the Flatbuffers specifications of the Arrow format

### Are these changes tested?

N/A.

### Are there any user-facing changes?

Changes in documentation and backwards compatible additions to the format spec.

* Closes: apache#37876

Lead-authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
Co-authored-by: David Li <li.davidm96@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
Successfully merging this pull request may close these issues.

[Format] Add ListView to FlatBuffers and specification text
8 participants