Stub out Skip Records API (#1792) #1998

tustvold · 2022-07-03T02:24:53Z

Which issue does this PR close?

Part of #1792

Rationale for this change

Stubs out an API for providing skip records functionality within parquet. I think this will work to support #1792, #1191 and potentially other functionality down the line.

Let me know what you think @Ted-Jiang @sunchao

What changes are included in this PR?

Stubs out APIs for adding row skipping logic to the parquet implementation

Are there any user-facing changes?

No 🎉

codecov-commenter · 2022-07-03T02:52:03Z

Codecov Report

Merging #1998 (c81b77d) into master (c757829) will decrease coverage by 0.15%.
The diff coverage is 62.29%.

❗ Current head c81b77d differs from pull request most recent head 2a572d7. Consider uploading reports for the commit 2a572d7 to get more accurate results

@@            Coverage Diff             @@
##           master    #1998      +/-   ##
==========================================
- Coverage   83.58%   83.42%   -0.16%     
==========================================
  Files         222      222              
  Lines       57522    57906     +384     
==========================================
+ Hits        48078    48309     +231     
- Misses       9444     9597     +153

Impacted Files	Coverage Δ
parquet/src/arrow/array_reader/byte_array.rs	`84.47% <0.00%> (-1.24%)`	⬇️
...et/src/arrow/array_reader/byte_array_dictionary.rs	`82.26% <0.00%> (-1.66%)`	⬇️
...uet/src/arrow/array_reader/complex_object_array.rs	`93.20% <0.00%> (-1.07%)`	⬇️
parquet/src/arrow/array_reader/empty_array.rs	`45.45% <0.00%> (-10.11%)`	⬇️
parquet/src/arrow/array_reader/list_array.rs	`92.69% <0.00%> (-0.72%)`	⬇️
parquet/src/arrow/array_reader/map_array.rs	`58.82% <0.00%> (-8.98%)`	⬇️
parquet/src/arrow/array_reader/mod.rs	`88.23% <ø> (ø)`
parquet/src/arrow/array_reader/null_array.rs	`81.48% <0.00%> (-6.52%)`	⬇️
parquet/src/arrow/array_reader/primitive_array.rs	`88.63% <0.00%> (-1.02%)`	⬇️
parquet/src/arrow/array_reader/struct_array.rs	`78.99% <0.00%> (-9.69%)`	⬇️
... and 23 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c757829...2a572d7. Read the comment docs.

Ted-Jiang · 2022-07-04T01:59:01Z

Cool! 👍 @tustvold, are you the Flash? 😄 I will try to go through this and give you my opinion today.

Ted-Jiang · 2022-07-04T03:47:01Z

parquet/src/arrow/arrow_reader.rs

+    pub(crate) fn with_row_selection(
+        self,
+        selection: impl Into<Vec<RowSelection>>,
+    ) -> Self {


Could we add a total_row_count to check that this selection is valid (for example, that it is contiguous)?

Is it actually an issue if it isn't, e.g. if I only want the first 100 rows?

Yes, got it. That should be checked on the user side.

Ted-Jiang · 2022-07-04T03:50:33Z

parquet/src/column/page.rs

+
+    /// Gets metadata about the next page, returns an error if no
+    /// column index information
+    fn peek_next_page(&self) -> Result<Option<PageMetadata>>;


👍 We really need this abstraction!
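To illustrate why peeking at page metadata is useful for row skipping, here is a minimal sketch. It is not the parquet crate's implementation: the `PeekablePageReader` trait, the `skip_next_page` method, the `num_rows` field and the in-memory reader are all assumptions made for this example, and each value stands for one row. The idea is that a caller can drop whole pages without decoding them whenever the metadata says the page lies entirely inside the range to skip.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for the `PageMetadata` named in the diff.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct PageMetadata {
    pub num_rows: usize,
}

/// Simplified page reader for this sketch; a "page" is just a vector of rows.
pub trait PeekablePageReader {
    /// Metadata about the next page, or an error if no index information exists.
    fn peek_next_page(&self) -> Result<Option<PageMetadata>, String>;
    /// Discards the next page without decoding it (assumed for this sketch).
    fn skip_next_page(&mut self) -> Result<(), String>;
    /// Returns the next page, or `None` if there are no pages left.
    fn get_next_page(&mut self) -> Result<Option<Vec<i32>>, String>;
}

/// Toy reader backed by in-memory pages.
pub struct InMemoryPages {
    pages: VecDeque<Vec<i32>>,
    has_index: bool,
}

impl InMemoryPages {
    pub fn new(pages: Vec<Vec<i32>>, has_index: bool) -> Self {
        Self { pages: pages.into(), has_index }
    }
}

impl PeekablePageReader for InMemoryPages {
    fn peek_next_page(&self) -> Result<Option<PageMetadata>, String> {
        if !self.has_index {
            return Err("no index information available".to_string());
        }
        Ok(self.pages.front().map(|p| PageMetadata { num_rows: p.len() }))
    }

    fn skip_next_page(&mut self) -> Result<(), String> {
        self.pages.pop_front();
        Ok(())
    }

    fn get_next_page(&mut self) -> Result<Option<Vec<i32>>, String> {
        Ok(self.pages.pop_front())
    }
}

/// Skips as many whole pages as fit into `to_skip` rows and returns the
/// number of rows skipped. Any remainder falls inside the next page and
/// has to be handled by decoding that page.
pub fn skip_whole_pages<R: PeekablePageReader>(
    reader: &mut R,
    mut to_skip: usize,
) -> Result<usize, String> {
    let mut skipped = 0;
    while let Some(meta) = reader.peek_next_page()? {
        if meta.num_rows > to_skip {
            break;
        }
        reader.skip_next_page()?;
        skipped += meta.num_rows;
        to_skip -= meta.num_rows;
    }
    Ok(skipped)
}
```

With pages of 3, 2 and 4 rows, asking to skip 6 rows drops the first two pages (5 rows) and leaves one row to skip inside the third page.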

Ted-Jiang · 2022-07-04T03:57:16Z

parquet/src/arrow/arrow_reader.rs

 }

 impl Iterator for ParquetRecordBatchReader {
    type Item = ArrowResult<RecordBatch>;

    fn next(&mut self) -> Option<Self::Item> {
-        match self.array_reader.next_batch(self.batch_size) {
+        let to_read = match self.selection.as_mut() {


👍 Passing the mask here rather than to each column is more reasonable 😂

Ted-Jiang

👍 I think this abstraction is great! Thanks for your effort! ❤️

Left some comments, most are
Maybe after this PR merges, I will continue to work on the page index.

Ted-Jiang · 2022-07-04T04:19:17Z

parquet/src/column/page.rs

 /// API for reading pages from a column chunk.
 /// This offers a iterator like API to get the next page.
 pub trait PageReader: Iterator<Item = Result<Page>> + Send {
    /// Gets the next page in the column chunk associated with this reader.
    /// Returns `None` if there are no pages left.
    fn get_next_page(&mut self) -> Result<Option<Page>>;
+
+    /// Gets metadata about the next page, returns an error if no
+    /// column index information


Do we only need the offset index here, without the min/max index? 🤔

Ted-Jiang · 2022-07-04T08:29:57Z

parquet/src/arrow/record_reader/mod.rs

+
+        self.consume_def_levels();
+        self.consume_rep_levels();
+        self.consume_record_data();


Is this for the case where a page has been read via read_records but some buffered data is still unread?

Sorry, I don't get this point: why not call column_reader.skip_records(num_records) directly?
Could you give me a hint?

RecordReader is a bit of an odd cookie, let me try to explain what it is doing.

In the absence of repetition levels, it can simply read batch_size levels, and the corresponding number of values.

However, if repetition levels are present, it will likely need to read more than batch_size levels in order to read batch_size actual records (rows).

To achieve this it reads to its internal buffer and then splits off the data corresponding to batch_size rows, leaving the excess behind.

It is this excess data, which has been read into its buffers but not yet yielded to the caller, that we must consume here.
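A small sketch may make the point about repetition levels concrete. It is not the RecordReader code: `split_records` and its `at_end` flag are names invented for this example. It relies only on the Parquet convention that a repetition level of 0 marks the start of a new record, so a record is known to be complete only once the next 0 is seen (or the data ends).

```rust
/// Counts complete records in a buffer of repetition levels.
///
/// Returns `(records, levels)`: the number of complete records found, capped
/// at `max_records`, and the number of levels those records span. Levels
/// beyond that point are the "excess" that stays behind in the buffer.
///
/// `at_end` says that no more levels will follow, so the trailing record
/// can also be treated as complete.
pub fn split_records(rep_levels: &[i16], max_records: usize, at_end: bool) -> (usize, usize) {
    if max_records == 0 {
        return (0, 0);
    }
    let mut records = 0;
    let mut end = 0;
    for (idx, &level) in rep_levels.iter().enumerate() {
        // A level of 0 starts a new record, which completes the previous one.
        if level == 0 && idx != 0 {
            records += 1;
            end = idx;
            if records == max_records {
                return (records, end);
            }
        }
    }
    if at_end && !rep_levels.is_empty() {
        // Nothing follows, so the last record is complete as well.
        records += 1;
        end = rep_levels.len();
    }
    (records, end)
}
```

For the levels `[0, 1, 1, 0, 0, 1, 0, 1]`, two records span four levels, so a reader that fetched all eight levels to satisfy a batch of two rows is left holding four buffered levels.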

👍 Nice write-up! That saves me some time 😄
So, I got it. Some more specific details to ask about:
This is one part of the skip: we need to read the rep and def levels to skip some records in the page (which may or may not have been read already).

    let (buffered_records, buffered_values) = self.count_records(num_records);
    self.num_records += buffered_records;
    self.num_values += buffered_values;
    self.consume_def_levels();
    self.consume_rep_levels();
    self.consume_record_data();
    self.consume_bitmap();
    self.reset();
    let remaining = buffered_records - num_records;

This is also part of the skip: when remaining > 0, I think we start skipping at a new page.

    if remaining == 0 {
        return Ok(buffered_records);
    }
    let skipped = match self.column_reader.as_mut() {
        Some(column_reader) => column_reader.skip_records(remaining)?,
        None => 0,
    };

This is one part of the skip: we need to read the rep and def levels to skip some records in the page (which may or may not have been read already).

Yes, this is just to consume the data that has been read into the internal buffers of RecordReader, if any.

This is also part of the skip: when remaining > 0, I think we start skipping at a new page.

Not necessarily, the only thing RecordReader needs to handle is skipping any data that has already been read from ColumnReader into its own buffers. It can then delegate to ColumnReader to skip the remaining rows, with no requirement that this is done at a page boundary - ColumnReader must be able to handle any case.
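The division of labour described here can be sketched as follows. This is a toy model under stated assumptions, not the code in the PR: `SkipRecords`, `ToyColumnReader` and `ToyRecordReader` are hypothetical names, and the buffers are reduced to simple record counts. The outer reader first discards whatever it has already buffered, then hands the rest to the inner reader, which may stop anywhere, not only at a page boundary.

```rust
/// Hypothetical trait for anything that can skip records.
pub trait SkipRecords {
    /// Skips up to `num_records` records and returns how many were skipped.
    fn skip_records(&mut self, num_records: usize) -> Result<usize, String>;
}

/// Toy column reader that only tracks how many records are left in the chunk.
pub struct ToyColumnReader {
    pub remaining: usize,
}

impl SkipRecords for ToyColumnReader {
    fn skip_records(&mut self, num_records: usize) -> Result<usize, String> {
        let skipped = num_records.min(self.remaining);
        self.remaining -= skipped;
        Ok(skipped)
    }
}

/// Toy record reader: `buffered` counts records already read from the
/// column reader but not yet yielded to the caller.
pub struct ToyRecordReader<C> {
    pub buffered: usize,
    pub column_reader: Option<C>,
}

impl<C: SkipRecords> ToyRecordReader<C> {
    pub fn skip_records(&mut self, num_records: usize) -> Result<usize, String> {
        // Step 1: consume records that already sit in our own buffers.
        let from_buffer = num_records.min(self.buffered);
        self.buffered -= from_buffer;

        // Step 2: delegate whatever is left to the column reader.
        let remaining = num_records - from_buffer;
        if remaining == 0 {
            return Ok(from_buffer);
        }
        let skipped = match self.column_reader.as_mut() {
            Some(column_reader) => column_reader.skip_records(remaining)?,
            None => 0,
        };
        Ok(from_buffer + skipped)
    }
}
```

Note that the sketch computes the remainder as the requested count minus the buffered count, so it never underflows when fewer records are buffered than requested.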

Co-authored-by: Yang Jiang <jiangyang381@163.com>

alamb

The API looks good to me. I had some questions, and I think it would be nicer to return NotImplemented errors rather than panic in certain cases, but I think this PR could also be merged as is to unblock further dev work.

alamb · 2022-07-06T18:56:10Z

parquet/src/arrow/array_reader/byte_array.rs

@@ -210,6 +214,10 @@ impl<I: OffsetSizeTrait + ScalarValue> ColumnValueDecoder

        decoder.read(out, range.end - range.start, self.dict.as_ref())
    }
+
+    fn skip_values(&mut self, _num_values: usize) -> Result<usize> {
+        todo!()


I think adding a ticket reference here like
unimplemented!("See https://github.com/apache/arrow-rs/.....") would help future readers

Bonus points for returning ArrowError::Unimplemented

This comment applies to everything below as well
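As a rough illustration of the suggestion, the sketch below returns an error from a stubbed method instead of panicking. It deliberately avoids the crate's own error types and macros: `SketchError`, its `NotYetImplemented` variant and `StubDecoder` are assumptions made for this example, and the message points at #1792 only because this PR is described as part of that issue.

```rust
use std::fmt;

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum SketchError {
    NotYetImplemented(String),
}

impl fmt::Display for SketchError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SketchError::NotYetImplemented(msg) => write!(f, "not yet implemented: {}", msg),
        }
    }
}

impl std::error::Error for SketchError {}

/// Hypothetical decoder whose skip support has only been stubbed out.
pub struct StubDecoder;

impl StubDecoder {
    /// Returns an error the caller can handle, rather than panicking via `todo!()`.
    pub fn skip_values(&mut self, _num_values: usize) -> Result<usize, SketchError> {
        Err(SketchError::NotYetImplemented(
            "skip_values, tracked as part of #1792".to_string(),
        ))
    }
}
```

A caller that hits the stub can then report or propagate the error with `?` instead of aborting the whole process.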

alamb · 2022-07-06T19:00:30Z

parquet/src/arrow/arrow_reader.rs

+
+    /// Scan rows from the parquet file according to the provided `selection`
+    ///
+    /// TODO: Make public once row selection fully implemented


perhaps worth a ticket?

alamb · 2022-07-06T19:01:33Z

parquet/src/arrow/arrow_reader.rs

+/// [`RowSelection`] allows selecting or skipping a provided number of rows
+/// when scanning the parquet file
+#[derive(Debug, Clone, Copy)]
+pub(crate) struct RowSelection {


You probably already have thought about this, but I would expect that in certain scenarios, non contiguous rows / skips would be desired

Like "fetch the first 100 rows, skip the next 200, and then fetch the remaining"

Would this interface handle that case?

See with_row_selection which takes a Vec to allow for this use-case
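To show how a list of selections can express "fetch the first 100 rows, skip the next 200, then fetch the rest", here is a minimal sketch. The `RowSelection` type below is a stand-in written for this example: its `row_count` and `skip` fields, the `select`/`skip` constructors and the `selected_ranges` helper are assumptions, not the crate-private type in the diff.

```rust
use std::ops::Range;

/// Hypothetical stand-in: select or skip a run of `row_count` rows.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct RowSelection {
    pub row_count: usize,
    pub skip: bool,
}

impl RowSelection {
    /// Read the next `row_count` rows.
    pub fn select(row_count: usize) -> Self {
        Self { row_count, skip: false }
    }

    /// Skip the next `row_count` rows.
    pub fn skip(row_count: usize) -> Self {
        Self { row_count, skip: true }
    }
}

/// Turns a sequence of selections into the row ranges that would be read
/// from a file with `total_rows` rows. Rows after the last selection are
/// not read, and selections running past the end of the file are clamped.
pub fn selected_ranges(selection: &[RowSelection], total_rows: usize) -> Vec<Range<usize>> {
    let mut ranges = Vec::new();
    let mut offset = 0;
    for s in selection {
        let end = (offset + s.row_count).min(total_rows);
        if !s.skip && end > offset {
            ranges.push(offset..end);
        }
        offset = end;
    }
    ranges
}
```

For a 1000-row file, `[select(100), skip(200), select(700)]` yields the ranges `0..100` and `300..1000`, while a single `select(100)` reads only the first 100 rows, which matches the earlier point that a selection need not cover the whole file.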

alamb · 2022-07-06T19:02:43Z

parquet/src/file/serialized_reader.rs

@@ -555,6 +555,14 @@ impl<T: Read + Send> PageReader for SerializedPageReader<T> {
        // We are at the end of this column chunk and no more page left. Return None.
        Ok(None)
    }
+
+    fn peek_next_page(&self) -> Result<Option<PageMetadata>> {
+        todo!()


Ditto: returning "not yet implemented" would probably be nicer.

alamb · 2022-07-06T19:03:54Z

parquet/src/arrow/record_reader/definition_levels.rs

@@ -146,15 +146,15 @@ impl LevelsBufferSlice for DefinitionLevelBuffer {
    }
 }

-pub struct DefinitionLevelDecoder {
+pub struct DefinitionLevelBufferDecoder {


Is this rename a public API change as well? It does not appear in the docs

https://docs.rs/parquet/17.0.0/parquet/?search=DefinitionLevelDecoder

No it is crate local

github-actions bot added the parquet Changes to the parquet crate label Jul 3, 2022

tustvold force-pushed the skip-records-api branch from c413686 to 0071931 Compare July 3, 2022 02:33

Ted-Jiang reviewed Jul 4, 2022

Ted-Jiang approved these changes Jul 4, 2022

Ted-Jiang reviewed Jul 4, 2022

tustvold force-pushed the skip-records-api branch from 0071931 to 7527750 Compare July 5, 2022 12:40

tustvold marked this pull request as ready for review July 5, 2022 12:46

Stub API for parquet record skipping

7324873

tustvold force-pushed the skip-records-api branch from 7527750 to 7324873 Compare July 5, 2022 13:13

tustvold mentioned this pull request Jul 5, 2022

Enable serialized_reader read specific Page by passing row ranges. #1977

Closed

github-actions bot added the arrow-flight Changes to the arrow-flight crate label Jul 5, 2022

tustvold and others added 2 commits July 5, 2022 09:23

Update parquet/src/arrow/record_reader/mod.rs

7996cd2

Co-authored-by: Yang Jiang <jiangyang381@163.com>

Remove empty google.protobuf.rs

45cbee0

alamb reviewed Jul 6, 2022

alamb approved these changes Jul 6, 2022

tustvold added 2 commits July 7, 2022 09:16

Replace todo with nyi_err

d856240

Update doc comment

2a572d7

tustvold merged commit e59b023 into apache:master Jul 7, 2022

tustvold mentioned this pull request Jul 27, 2022

Add ArrayReader::skip_records API #2197

Closed

Ted-Jiang mentioned this pull request Aug 8, 2022

Combine multiple selections into the same batch size in skip_records #2358

Closed
