forked from apache/arrow
PARQUET-435: Change column reader methods to be array-oriented rather than scalar

Column scanning and record reconstruction are independent of the Parquet file format and depend, among other things, on the data structures where the reconstructed data will end up. This is a work in progress, but the basic idea is:

- APIs for reading a batch of repetition levels (`ReadRepetitionLevels`) or definition levels (`ReadDefinitionLevels`) into a preallocated `int16_t*`
- APIs for reading arrays of decoded values into preallocated memory (`ReadValues`)

These methods can only read data within a particular data page. Once you exhaust the data available in the data page (`ReadValues` returns 0), you must call `ReadNewPage`, which returns `true` if there is more data available.

Separately, I added a simple `Scanner` class that emulates the scalar value iteration functionality that existed previously. I used this to reimplement the `DebugPrint` method in `parquet_scanner.cc`. This currently works only for flat data.

I would like to keep the `ColumnReader` low level and primitive, concerned only with providing access to the raw data in a Parquet file as fast as possible. We can devise separate algorithms for inferring nested record structure by examining the arrays of decoded values and repetition/definition levels.

The major benefit of separating raw data access from structure inference is that the work can be pipelined with threads: one thread decompresses and decodes values and levels, and another thread turns batches into a nested record- or column-oriented structure.

Author: Wes McKinney <wes@cloudera.com>

Closes apache#26 from wesm/PARQUET-435 and squashes the following commits:

4bf5cd4 [Wes McKinney] Fix cpplint
852f4ec [Wes McKinney] Address review comments, also be sure to use Scanner::HasNext
7ea261e [Wes McKinney] Add TODO comment
4999719 [Wes McKinney] Make ColumnReader::ReadNewPage private and call HasNext() in ReadBatch
0d2e111 [Wes McKinney] Fix function description. Change #define to constexpr
111ef13 [Wes McKinney] Incorporate review comments and add some better comments
e16f7fd [Wes McKinney] Typo
ef52404 [Wes McKinney] Fix function doc
5e95cda [Wes McKinney] Configurable scanner batch size. Do not use printf in DebugPrint
1b4eca0 [Wes McKinney] New batch read API which reads levels and values in one shot
de4d6b6 [Wes McKinney] Move column_* files into parquet/column folder
aad4a86 [Wes McKinney] Finish refactoring scanner API with shared pointers
4506748 [Wes McKinney] Refactoring, do not have shared_from_this working yet
6489b15 [Wes McKinney] Batch level/value read interface on ColumnReader. Add Scanner class for flat columns. Add a couple smoke unit tests
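The page-bounded read loop and the scalar `Scanner` layered on top of it can be illustrated with a minimal, self-contained sketch. Everything below is hypothetical and is not the parquet-cpp implementation: `MockColumnReader`, `MockScanner`, `ReadAll`, and `ScanAll` are invented names, the "pages" are plain in-memory vectors of `int32_t`, and repetition/definition levels are omitted because the example assumes a flat, required column. Only the method names `ReadValues`, `ReadNewPage`, `HasNext`, and the idea of a configurable batch size are taken from the commit message. Note that a later commit in this squash makes `ReadNewPage` private, so the final interface may differ from what is shown here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for a page-oriented reader of a flat INT32 column.
class MockColumnReader {
 public:
  explicit MockColumnReader(std::vector<std::vector<int32_t>> pages)
      : pages_(std::move(pages)) {}

  // Advances to the next data page; returns true if one was available.
  bool ReadNewPage() {
    if (next_page_ >= pages_.size()) {
      has_page_ = false;
      return false;
    }
    page_ = next_page_++;
    pos_ = 0;
    has_page_ = true;
    return true;
  }

  // Copies up to batch_size values from the current page into out.
  // Returns the number copied; 0 means the current page is exhausted.
  size_t ReadValues(size_t batch_size, int32_t* out) {
    if (!has_page_) return 0;
    const std::vector<int32_t>& page = pages_[page_];
    const size_t n = std::min(batch_size, page.size() - pos_);
    std::copy_n(page.begin() + pos_, n, out);
    pos_ += n;
    return n;
  }

 private:
  std::vector<std::vector<int32_t>> pages_;
  size_t next_page_ = 0;
  size_t page_ = 0;
  size_t pos_ = 0;
  bool has_page_ = false;
};

// Hypothetical scalar iterator that hides batching and page boundaries.
class MockScanner {
 public:
  MockScanner(MockColumnReader* reader, size_t batch_size)
      : reader_(reader), buffer_(batch_size) {}

  // Refills the buffer when it is drained, moving to a new page if needed.
  bool HasNext() {
    while (pos_ == filled_) {
      filled_ = reader_->ReadValues(buffer_.size(), buffer_.data());
      pos_ = 0;
      if (filled_ == 0 && !reader_->ReadNewPage()) return false;
    }
    return true;
  }

  // Only valid after HasNext() has returned true.
  int32_t Next() { return buffer_[pos_++]; }

 private:
  MockColumnReader* reader_;
  std::vector<int32_t> buffer_;
  size_t pos_ = 0;
  size_t filled_ = 0;
};

// Array-oriented path: read batches directly, page by page.
std::vector<int32_t> ReadAll(std::vector<std::vector<int32_t>> pages,
                             size_t batch_size) {
  MockColumnReader reader(std::move(pages));
  std::vector<int32_t> out;
  std::vector<int32_t> batch(batch_size);
  while (reader.ReadNewPage()) {
    size_t n = 0;
    while ((n = reader.ReadValues(batch_size, batch.data())) > 0) {
      out.insert(out.end(), batch.begin(), batch.begin() + n);
    }
  }
  return out;
}

// Scalar path: iterate one value at a time through the scanner.
std::vector<int32_t> ScanAll(std::vector<std::vector<int32_t>> pages,
                             size_t batch_size) {
  MockColumnReader reader(std::move(pages));
  MockScanner scanner(&reader, batch_size);
  std::vector<int32_t> out;
  while (scanner.HasNext()) out.push_back(scanner.Next());
  return out;
}
```

Both helpers return the same values in the same order for the same input. In this sketch the batch size affects only how many `ReadValues` calls are made, not the result, which is the property that lets a scalar scanner sit on top of an array-oriented reader.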
Showing 11 changed files with 561 additions and 149 deletions.
@@ -18,7 +18,6 @@
 # Headers: top level
 install(FILES
   parquet.h
-  column_reader.h
   reader.h
   exception.h
   types.h
@@ -0,0 +1,22 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# Headers: top level
+install(FILES
+  reader.h
+  scanner.h
+  DESTINATION include/parquet/column)