apache · zeroshade · Oct 5, 2023 · Sep 26, 2023 · Sep 26, 2023 · Sep 26, 2023
diff --git a/docs/source/format/Columnar.rst b/docs/source/format/Columnar.rst
@@ -100,15 +100,15 @@ Arrays are defined by a few pieces of metadata and data:
 Nested arrays additionally have a sequence of one or more sets of
 these items, called the **child arrays**.
 
-Each logical data type has a well-defined physical layout. Here are
-the different physical layouts defined by Arrow:
+Each logical data type has one or more well-defined physical layouts. Here
+are the different physical layouts defined by Arrow:
 
 * **Primitive (fixed-size)**: a sequence of values each having the
   same byte or bit width
 * **Variable-size Binary**: a sequence of values each having a variable
   byte length. Two variants of this layout are supported using 32-bit
   and 64-bit length encoding.
-* **Views of Variable-size Binary**: a sequence of values each having a
+* **View of Variable-size Binary**: a sequence of values each having a
   variable byte length. In contrast to Variable-size Binary, the values
   of this layout are distributed across potentially multiple buffers
   instead of densely and sequentially packed in a single buffer.
@@ -118,6 +118,11 @@ the different physical layouts defined by Arrow:
   variable-length sequence of values taken from a child data type. Two
   variants of this layout are supported using 32-bit and 64-bit length
   encoding.
+* **View of Variable-size List**: a nested layout where each value is a
+  variable-length sequence of values taken from a child data type. This
+  layout differs from **Variable-size List** by having an additional
+  buffer containing the sizes of each list value. This removes a constraint
+  on the offsets buffer — it does not need to be in order.
 * **Struct**: a nested layout consisting of a collection of named
   child **fields** each having the same length but possibly different
   types.
@@ -382,7 +387,7 @@ In both the long and short string cases, the first four bytes encode the
 length of the string and can be used to determine how the rest of the view
 should be interpreted.
 
-In the short string case the string's bytes are inlined- stored inside the
+In the short string case the string's bytes are inlined — stored inside the
 view itself, in the twelve bytes which follow the length.
 
 In the long string case, a buffer index indicates which data buffer
@@ -401,11 +406,18 @@ This layout is adapted from TU Munich's `UmbraDB`_.
 
 .. _variable-size-list-layout:
 
-Variable-size List Layout
--------------------------
+Variable-size List Layouts
+--------------------------
 
 List is a nested type which is semantically similar to variable-size
-binary. It is defined by two buffers, a validity bitmap and an offsets
+binary. There are two list layout variations — "list" and "list-view" —
+and each variation can be delimited by either 32-bit or 64-bit offsets
+integers.
+
+List Layout
+~~~~~~~~~~~
+
+The List layout is defined by two buffers, a validity bitmap and an offsets
 buffer, and a child array. The offsets are the same as in the
 variable-size binary case, and both 32-bit and 64-bit signed integer
 offsets are supported options for the offsets. Rather than referencing
@@ -441,7 +453,7 @@ will have the following representation: ::
       |------------|-------------|-------------|-------------|-------------|-----------------------|
       | 0          | 3           | 3           | 7           | 7           | unspecified (padding) |
 
-    * Values array (Int8array):
+    * Values array (Int8Array):
       * Length: 7,  Null count: 0
       * Validity bitmap buffer: Not required
       * Values buffer (int8)
@@ -487,6 +499,102 @@ will be represented as follows: ::
           |-------------------------------|-----------------------|
           | 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 | unspecified (padding) |
 
+ListView Layout
+~~~~~~~~~~~~~~~
+
+The ListView layout is defined by three buffers: a validity bitmap, an offsets
+buffer, and an additional sizes buffer. Sizes and offsets have the identical bit
+width and both 32-bit and 64-bit signed integer options are supported.
+
+As in the List layout, the offsets encode the start position of each slot in the
+child array. In contrast to the List layout, list lengths are stored explicitly
+in the sizes buffer instead of inferred. This allows offsets to be out of order.
+Elements of the child array do not have to be stored in the same order they
+logically appear in the list elements of the parent array.
+
+When a value is null, the corresponding offset and size can have arbitrary
+values. When size is 0, the corresponding offset can have an arbitrary value.
+If choosing a value is possible, we recommend setting offsets and sizes to 0 in
+these cases.
+
+A list-view type is specified like ``ListView<T>``, where ``T`` is any type
+(primitive or nested). In these examples we use 32-bit offsets where
+the 64-bit offset version would be denoted by ``LargeListView<T>``.
+
+**Example Layout: ``ListView<Int8>`` Array**
+
+We illustrate an example of ``ListView<Int8>`` with length 4 having values::
+
+    [[12, -7, 25], null, [0, -127, 127, 50], []]
+
+It may have the following representation: ::
+
+    * Length: 4, Null count: 1
+    * Validity bitmap buffer:
+
+      | Byte 0 (validity bitmap) | Bytes 1-63            |
+      |--------------------------|-----------------------|
+      | 00001101                 | 0 (padding)           |
+
+    * Offsets buffer (int32)
+
+      | Bytes 0-3  | Bytes 4-7   | Bytes 8-11  | Bytes 12-15 | Bytes 16-63           |
+      |------------|-------------|-------------|-------------|-----------------------|
+      | 0          | unspecified | 3           | unspecified | unspecified (padding) |
+
+    * Sizes buffer (int32)
+
+      | Bytes 0-3  | Bytes 4-7   | Bytes 8-11  | Bytes 12-15 | Bytes 16-63           |
+      |------------|-------------|-------------|-------------|-----------------------|
+      | 3          | unspecified | 4           | 0           | unspecified (padding) |
+
+    * Values array (Int8Array):
+      * Length: 7,  Null count: 0
+      * Validity bitmap buffer: Not required
+      * Values buffer (int8)
+
+        | Bytes 0-6                    | Bytes 7-63            |
+        |------------------------------|-----------------------|
+        | 12, -7, 25, 0, -127, 127, 50 | unspecified (padding) |
+
+**Example Layout: ``ListView<Int8>`` Array**
+
+We continue with the ``ListView<Int8>`` type, but this instance illustrates out
+of order offsets and sharing of child array values. It is an array with length 5
+having logical values::
+
+    [[12, -7, 25], null, [0, -127, 127, 50], [], [50, 12]]
+
+It will have the following representation: ::
+
+    * Length: 4, Null count: 1
+    * Validity bitmap buffer:
+
+      | Byte 0 (validity bitmap) | Bytes 1-63            |
+      |--------------------------|-----------------------|
+      | 00011101                 | 0 (padding)           |
+
+    * Offsets buffer (int32)
+
+      | Bytes 0-3  | Bytes 4-7   | Bytes 8-11  | Bytes 12-15 | Bytes 16-19 | Bytes 20-63           |
+      |------------|-------------|-------------|-------------|-------------|-----------------------|
+      | 4          | unspecified | 0           | unspecified | 3           | unspecified (padding) |
+
+    * Sizes buffer (int32)
+
+      | Bytes 0-3  | Bytes 4-7   | Bytes 8-11  | Bytes 12-15 | Bytes 16-19 | Bytes 20-63           |
+      |------------|-------------|-------------|-------------|-------------|-----------------------|
+      | 3          | unspecified | 4           | 0           | 2           | unspecified (padding) |
+
+    * Values array (Int8Array):
+      * Length: 7,  Null count: 0
+      * Validity bitmap buffer: Not required
+      * Values buffer (int8)
+
+        | Bytes 0-6                    | Bytes 7-63            |
+        |------------------------------|-----------------------|
+        | 0, -127, 127, 50, 12, -7, 25 | unspecified (padding) |
+
 Fixed-Size List Layout
 ----------------------
 
@@ -618,8 +726,8 @@ for the null struct but they are "hidden" by the struct array's validity
 bitmap. However, when treated independently, corresponding entries of the
 children array will be non-null.
 
-Union Layout
-------------
+Union Layouts
+-------------
 
 A union is defined by an ordered sequence of types; each slot in the
 union can have a value chosen from these types. The types are named
@@ -858,19 +966,19 @@ are held in the second child array.
 For the purposes of determining field names and schemas, these child arrays
 are prescribed the standard names of **run_ends** and **values** respectively.
 
-The values in the first child array represent the accumulated length of all runs 
+The values in the first child array represent the accumulated length of all runs
 from the first to the current one, i.e. the logical index where the
 current run ends. This allows relatively efficient random access from a logical
 index using binary search. The length of an individual run can be determined by
 subtracting two adjacent values. (Contrast this with run-length encoding, in
 which the lengths of the runs are represented directly, and in which random
-access is less efficient.) 
+access is less efficient.)
 
 .. note::
    Because the ``run_ends`` child array cannot have nulls, it's reasonable
    to consider why the ``run_ends`` are a child array instead of just a
    buffer, like the offsets for a :ref:`variable-size-list-layout`. This
-   layout was considered, but it was decided to use the child arrays. 
+   layout was considered, but it was decided to use the child arrays.
 
    Child arrays allow us to keep the "logical length" (the decoded length)
    associated with the parent array and the "physical length" (the number

diff --git a/format/Schema.fbs b/format/Schema.fbs
@@ -22,7 +22,8 @@
 /// Version 1.1 - Add Decimal256.
 /// Version 1.2 - Add Interval MONTH_DAY_NANO.
 /// Version 1.3 - Add Run-End Encoded.
-/// Version 1.4 - Add BinaryView, Utf8View, and variadicBufferCounts.
+/// Version 1.4 - Add BinaryView, Utf8View, variadicBufferCounts, ListView, and
+/// LargeListView.
 
 namespace org.apache.arrow.flatbuf;
 
@@ -97,6 +98,17 @@ table List {
 table LargeList {
 }
 
+/// Represents the same logical types that List can, but contains offsets and
+/// sizes allowing for writes in any order and sharing of child values among
+/// list values.
+table ListView {
+}
+
+/// Same as ListView, but with 64-bit offsets and sizes, allowing to represent
+/// extremely large data values.
+table LargeListView {
+}
+
 table FixedSizeList {
   /// Number of list items per value
   listSize: int;
@@ -451,6 +463,8 @@ union Type {
   RunEndEncoded,
   BinaryView,
   Utf8View,
+  ListView,
+  LargeListView,
 }
 
 /// ----------------------------------------------------------------------