ARROW-62: Clarify null bitmap interpretation, indicate bit-endianness, add null count, remove non-nullable physical distinction #34
@@ -90,11 +90,27 @@ maximum of 2^31 - 1 elements. We choose a signed int32 for a couple reasons:
Any relative type can be nullable or non-nullable.

Nullable arrays have a contiguous memory buffer, known as the null bitmask,
It might be better to frame this paragraph in terms of validity. Drill's documentation phrases it well, I think (it doesn't reference a null bitmask):
Nullable values are represented by a vector of bit values. Each bit in the vector corresponds to an element in the ValueVector. If the bit is not set, the value is NULL.
If you do decide to rephrase the bitmask as a validity vector, you should probably also update the documentation on the Flatbuffers schema.
I rephrased the language a little bit. I'll appeal to one of the other committers to review.
Postgres uses the term "null bitmap"; if that seems reasonable, I will try to use it consistently in the code and format docs.
Thanks, I probably should have prefaced the above with IMHO. In my mind, I prefix the name of the bitmap with an "is_" and assume 1 means true.
null) bitmap, whose length is large enough to have 1 bit for each array slot.
Nullable arrays have a contiguous memory buffer, known as the null (or
validity) bitmap, whose length is large enough to have 1 bit for each array
slot.
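To make the one-bit-per-slot layout concrete, here is a minimal sketch of reading such a bitmap, where a set bit means the slot holds a value. The helper name `is_valid` is hypothetical, and the least-significant-bit numbering within each byte is an assumption made for illustration; the diff excerpt above does not state the bit order.

```python
def is_valid(bitmap: bytes, i: int) -> bool:
    """Return True when slot i is non-null, i.e. its bit is set.

    Assumption: bits are numbered from the least-significant bit of each byte.
    """
    byte = bitmap[i // 8]          # byte that holds the bit for slot i
    return bool((byte >> (i % 8)) & 1)


# One byte covers up to 8 slots. Here slots 0, 2 and 3 are valid and slot 1 is null.
bitmap = bytes([0b00001101])
validity = [is_valid(bitmap, i) for i in range(4)]
```

Reading the bit as "is valid" matches the convention suggested in the review thread: prefix the bitmap name with "is_" and treat 1 as true.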
I would propose that the null bitmap always be a multiple of 8 bytes in length. This simplifies some code by avoiding the need to handle partial-word conditions.
I agree. There's also the SIMD question: if these buffers are word-aligned, then aligned allocations should not be a concern (someone with more expertise should weigh in).
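As a rough illustration of the proposal above, the bitmap size for a given number of slots could be computed as follows. The function name `padded_bitmap_nbytes` and its `alignment` parameter are hypothetical; this sketches only the rounding arithmetic, not code from the patch.

```python
def padded_bitmap_nbytes(num_slots: int, alignment: int = 8) -> int:
    """Bytes needed for a 1-bit-per-slot bitmap, rounded up to `alignment` bytes."""
    nbytes = (num_slots + 7) // 8                               # round bits up to whole bytes
    return (nbytes + alignment - 1) // alignment * alignment    # round bytes up to the alignment


sizes = [padded_bitmap_nbytes(n) for n in (0, 1, 64, 65)]
```

With every bitmap a whole number of 8-byte words, a loop can process one 64-bit word at a time without a special case for a trailing partial word.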
@jacques-n I'm going to go ahead and expand the patch to account for ARROW-76, changes to be posted shortly. I'll await your +1 and further comments.
Merging these format changes. Debate on these subjects may continue on the mailing list. Thank you.
Requires PARQUET-485 (apache#32) The boolean Encoding::PLAIN code path was using RleDecoder, inconsistent with other implementations of Parquet. This patch adds an implementation of plain encoding and uses BitReader instead of RleDecoder to decode plain-encoded boolean data. Unit tests to verify. Also closes PR apache#12. Thanks to @edani for reporting. Author: Wes McKinney <wes@cloudera.com> Closes apache#34 from wesm/PARQUET-454 and squashes the following commits: 01cb5a7 [Wes McKinney] Use a seed in the data generation 0bf5d8a [Wes McKinney] Fix inconsistencies with boolean PLAIN encoding.
Requires PARQUET-485 (apache#32) The boolean Encoding::PLAIN code path was using RleDecoder, inconsistent with other implementations of Parquet. This patch adds an implementation of plain encoding and uses BitReader instead of RleDecoder to decode plain-encoded boolean data. Unit tests to verify. Also closes PR apache#12. Thanks to @edani for reporting. Author: Wes McKinney <wes@cloudera.com> Closes apache#34 from wesm/PARQUET-454 and squashes the following commits: 01cb5a7 [Wes McKinney] Use a seed in the data generation 0bf5d8a [Wes McKinney] Fix inconsistencies with boolean PLAIN encoding. Change-Id: I1be5252c654d4864d14c3cdd70d63c507e0a9403
This PR enables tests for `ARROW_COMPUTE`, `ARROW_DATASET`, `ARROW_FILESYSTEM`, `ARROW_HDFS`, `ARROW_ORC`, and `ARROW_IPC` (default on). #7131 enabled a minimal set of tests as a starting point. I confirmed that these tests pass locally with the current master. In the current TravisCI environment, we cannot see this result due to a lot of error messages in `arrow-utility-test`. ``` $ git log | head -1 commit ed5f534 % ctest ... Start 1: arrow-array-test 1/51 Test #1: arrow-array-test ..................... Passed 4.62 sec Start 2: arrow-buffer-test 2/51 Test #2: arrow-buffer-test .................... Passed 0.14 sec Start 3: arrow-extension-type-test 3/51 Test #3: arrow-extension-type-test ............ Passed 0.12 sec Start 4: arrow-misc-test 4/51 Test #4: arrow-misc-test ...................... Passed 0.14 sec Start 5: arrow-public-api-test 5/51 Test #5: arrow-public-api-test ................ Passed 0.12 sec Start 6: arrow-scalar-test 6/51 Test #6: arrow-scalar-test .................... Passed 0.13 sec Start 7: arrow-type-test 7/51 Test #7: arrow-type-test ...................... Passed 0.14 sec Start 8: arrow-table-test 8/51 Test #8: arrow-table-test ..................... Passed 0.13 sec Start 9: arrow-tensor-test 9/51 Test #9: arrow-tensor-test .................... Passed 0.13 sec Start 10: arrow-sparse-tensor-test 10/51 Test #10: arrow-sparse-tensor-test ............. Passed 0.16 sec Start 11: arrow-stl-test 11/51 Test #11: arrow-stl-test ....................... Passed 0.12 sec Start 12: arrow-concatenate-test 12/51 Test #12: arrow-concatenate-test ............... Passed 0.53 sec Start 13: arrow-diff-test 13/51 Test #13: arrow-diff-test ...................... Passed 1.45 sec Start 14: arrow-c-bridge-test 14/51 Test #14: arrow-c-bridge-test .................. Passed 0.18 sec Start 15: arrow-io-buffered-test 15/51 Test #15: arrow-io-buffered-test ............... Passed 0.20 sec Start 16: arrow-io-compressed-test 16/51 Test #16: arrow-io-compressed-test ............. 
Passed 3.48 sec Start 17: arrow-io-file-test 17/51 Test #17: arrow-io-file-test ................... Passed 0.74 sec Start 18: arrow-io-hdfs-test 18/51 Test #18: arrow-io-hdfs-test ................... Passed 0.12 sec Start 19: arrow-io-memory-test 19/51 Test #19: arrow-io-memory-test ................. Passed 2.77 sec Start 20: arrow-utility-test 20/51 Test #20: arrow-utility-test ...................***Failed 5.65 sec Start 21: arrow-threading-utility-test 21/51 Test #21: arrow-threading-utility-test ......... Passed 1.34 sec Start 22: arrow-compute-compute-test 22/51 Test #22: arrow-compute-compute-test ........... Passed 0.13 sec Start 23: arrow-compute-boolean-test 23/51 Test #23: arrow-compute-boolean-test ........... Passed 0.15 sec Start 24: arrow-compute-cast-test 24/51 Test #24: arrow-compute-cast-test .............. Passed 0.22 sec Start 25: arrow-compute-hash-test 25/51 Test #25: arrow-compute-hash-test .............. Passed 2.61 sec Start 26: arrow-compute-isin-test 26/51 Test #26: arrow-compute-isin-test .............. Passed 0.81 sec Start 27: arrow-compute-match-test 27/51 Test #27: arrow-compute-match-test ............. Passed 0.40 sec Start 28: arrow-compute-sort-to-indices-test 28/51 Test #28: arrow-compute-sort-to-indices-test ... Passed 3.33 sec Start 29: arrow-compute-nth-to-indices-test 29/51 Test #29: arrow-compute-nth-to-indices-test .... Passed 1.51 sec Start 30: arrow-compute-util-internal-test 30/51 Test #30: arrow-compute-util-internal-test ..... Passed 0.13 sec Start 31: arrow-compute-add-test 31/51 Test #31: arrow-compute-add-test ............... Passed 0.12 sec Start 32: arrow-compute-aggregate-test 32/51 Test #32: arrow-compute-aggregate-test ......... Passed 14.70 sec Start 33: arrow-compute-compare-test 33/51 Test #33: arrow-compute-compare-test ........... Passed 7.96 sec Start 34: arrow-compute-take-test 34/51 Test #34: arrow-compute-take-test .............. 
Passed 4.80 sec Start 35: arrow-compute-filter-test 35/51 Test #35: arrow-compute-filter-test ............ Passed 8.23 sec Start 36: arrow-dataset-dataset-test 36/51 Test #36: arrow-dataset-dataset-test ........... Passed 0.25 sec Start 37: arrow-dataset-discovery-test 37/51 Test #37: arrow-dataset-discovery-test ......... Passed 0.13 sec Start 38: arrow-dataset-file-ipc-test 38/51 Test #38: arrow-dataset-file-ipc-test .......... Passed 0.21 sec Start 39: arrow-dataset-file-test 39/51 Test #39: arrow-dataset-file-test .............. Passed 0.12 sec Start 40: arrow-dataset-filter-test 40/51 Test #40: arrow-dataset-filter-test ............ Passed 0.16 sec Start 41: arrow-dataset-partition-test 41/51 Test #41: arrow-dataset-partition-test ......... Passed 0.13 sec Start 42: arrow-dataset-scanner-test 42/51 Test #42: arrow-dataset-scanner-test ........... Passed 0.20 sec Start 43: arrow-filesystem-test 43/51 Test #43: arrow-filesystem-test ................ Passed 1.62 sec Start 44: arrow-hdfs-test 44/51 Test #44: arrow-hdfs-test ...................... Passed 0.13 sec Start 45: arrow-feather-test 45/51 Test #45: arrow-feather-test ................... Passed 0.91 sec Start 46: arrow-ipc-read-write-test 46/51 Test #46: arrow-ipc-read-write-test ............ Passed 5.77 sec Start 47: arrow-ipc-json-simple-test 47/51 Test #47: arrow-ipc-json-simple-test ........... Passed 0.16 sec Start 48: arrow-ipc-json-test 48/51 Test #48: arrow-ipc-json-test .................. Passed 0.27 sec Start 49: arrow-json-integration-test 49/51 Test #49: arrow-json-integration-test .......... Passed 0.13 sec Start 50: arrow-json-test 50/51 Test #50: arrow-json-test ...................... Passed 0.26 sec Start 51: arrow-orc-adapter-test 51/51 Test #51: arrow-orc-adapter-test ............... 
Passed 1.92 sec 98% tests passed, 1 tests failed out of 51 Label Time Summary: arrow-tests = 27.38 sec (27 tests) arrow_compute = 45.11 sec (14 tests) arrow_dataset = 1.21 sec (7 tests) arrow_ipc = 6.20 sec (3 tests) unittest = 79.91 sec (51 tests) Total Test time (real) = 79.99 sec The following tests FAILED: 20 - arrow-utility-test (Failed) Errors while running CTest ``` Closes #7142 from kiszk/ARROW-8754 Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
As the initial scribe for the Arrow format, I made a mistake in describing what the null bits mean; this patch corrects it (1 for not-null, 0 for null). I also addressed ARROW-56 (bit-numbering) here.
Database systems are split on this subject. PostgreSQL for example does it this way:
http://www.postgresql.org/docs/9.5/static/storage-page-layout.html
Since the Drill implementation predates the Arrow project, I think it's safe to go with this.
This patch also includes ARROW-76, which adds a "null count" to the memory layout indicating the actual number of nulls in an array. This also strikes the "non-nullable" distinction from the memory layout, as there is no semantic difference between arrays with null count 0 and a non-nullable array. Instead, users may choose to set `nullable=false` in the schema metadata and verify that Arrow memory conforms to the schema.
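A minimal sketch of how the null count relates to the bitmap and to a `nullable=false` schema check, under the 1 = not-null convention described above. The helpers `null_count` and `conforms_to_schema` are hypothetical names, and least-significant-bit numbering within each byte is assumed for illustration.

```python
def null_count(bitmap: bytes, length: int) -> int:
    """Count null slots: slots whose bit is 0 (assumes LSB bit numbering)."""
    valid = sum((bitmap[i // 8] >> (i % 8)) & 1 for i in range(length))
    return length - valid


def conforms_to_schema(nullable: bool, nulls: int) -> bool:
    """A field declared nullable=false conforms only when it holds no nulls."""
    return nullable or nulls == 0


bitmap = bytes([0b00001101])       # slots 0, 2, 3 valid; slot 1 null
nulls = null_count(bitmap, 4)
ok_when_non_nullable = conforms_to_schema(False, nulls)
```

Because an array with null count 0 behaves the same as a non-nullable one, the check is a comparison on the stored count rather than a separate physical layout.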