[BUG] ORC file-level statistics omitted with chunked writes #5826

vuule · 2020-08-01T01:53:33Z

This came up in the discussion of #5707.
Currently, the file-level statistics are not written when a file is written in multiple chunks.

- Figure out how to unit test this change.
- Refactor `gather_statistic_blobs` to facilitate file stats merging across multiple chunks.
- Collect and merge file stats to write them in the final file footer.


Related to #5826

Refactor the `ProtobufReader` API to facilitate expansion to support robust reading of column statistics. Changes include:
- Move `orc::metadata` from `reader_impl.cu` to `orc.h` so it can be reused for statistics-related APIs.
- Remove duplicated code in `read_orc_statistics`; use `orc::metadata` instead.
- Rename `ColumnStatistics` to `ColStatsBlob`, since that's what it currently is.
- Avoid redundant copies in `read_orc_statistics`.
- Replace `get_u32`, `get_i32`, etc. with templated `get`.
- Replace per-type functors (e.g. `FieldUInt64`) with templated `field_reader`s to reduce code repetition.
- The two type-specific parts of `FieldXYZ` functors (field enum and read impl) are now separate to avoid redundant code.
- `field_reader` dispatches based on the value type, so also add `packed_field_reader` and `raw_field_reader` for packed fields and blob reads (respectively).
- Replace return-value-based error checking in `ProtobufReader` with `CUDF_EXPECTS`.
- Remove `InitSchema` from `ProtobufReader`; the schema is only used to determine column names. The names are now lazily calculated in `metadata::get_column_name`.

Authors:
- vuule <vmilovanovic@nvidia.com>
- Vukasin Milovanovic <vukasin.milovanovic.87@gmail.com>

Approvers:
- Kumar Aatish
- Conor Hoekstra

URL: #7055
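To illustrate the idea of replacing `get_u32`, `get_i32`, etc. with a single templated `get`, here is a minimal standalone sketch in C++17. It is not the libcudf implementation: `byte_reader` and `expects` are hypothetical names, exceptions stand in for `CUDF_EXPECTS`, and the sketch assumes every value is a protobuf varint, with signed types zigzag-encoded (as for protobuf `sint32`/`sint64`).

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical stand-in for a protobuf byte reader; not the libcudf class.
class byte_reader {
 public:
  explicit byte_reader(std::vector<uint8_t> bytes) : bytes_(std::move(bytes)) {}

  // One templated entry point instead of get_u32, get_i32, and so on.
  template <typename T>
  T get()
  {
    static_assert(std::is_integral_v<T>, "only integral types in this sketch");
    uint64_t raw = 0;
    // A varint stores 7 payload bits per byte, least significant group first.
    for (int shift = 0; shift < 64; shift += 7) {
      expects(pos_ < bytes_.size(), "read past end of buffer");
      uint8_t const byte = bytes_[pos_++];
      raw |= static_cast<uint64_t>(byte & 0x7f) << shift;
      if ((byte & 0x80) == 0) {  // high bit clear: last byte of this value
        if constexpr (std::is_signed_v<T>) {
          // Assumption: signed values are zigzag-encoded.
          return static_cast<T>((raw >> 1) ^ (~(raw & 1) + 1));
        } else {
          return static_cast<T>(raw);
        }
      }
    }
    throw std::runtime_error("varint too long");
  }

 private:
  // Throws on failure instead of returning an error code to the caller.
  static void expects(bool condition, char const* message)
  {
    if (!condition) { throw std::runtime_error(message); }
  }

  std::vector<uint8_t> bytes_;
  std::size_t pos_ = 0;
};
```

Callers pick the value type at the call site, for example `reader.get<uint32_t>()`, and malformed input surfaces as an exception rather than a return code that each caller has to check.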

vuule · 2021-01-20T18:58:00Z

This was marked as P0 on the assumption that the file statistics are incorrect with chunked writes. That is not the case: the file statistics are simply not present with chunked writes. Based on this, changing to P1.

vuule · 2021-01-26T21:31:54Z

Some notes about the remaining work, since I won't be back to this for a few weeks at least:
- We cannot merge string stats across chunks as they hold pointers to the column data.
- To be able to merge stats with chunked writes, we need to save the min/max strings in a separate column and update the stats pointers to point to this column.
- In each chunked write, we add the min/max strings column and (non-encoded) stats to the set of per-chunk file stats.
- When we close the writer, these stats need to be merged, encoded and included in the footer.

Refactoring needed to facilitate the steps above:
Return a struct from `gather_statistic_blobs` (instead of a vector containing all stats blobs). Stripe stats can always be returned as blobs, but file stats need to be returned as a struct in the case of chunked writes.
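The pointer-lifetime problem described in these notes can be sketched on the host side as follows. This is only an illustration under simplifying assumptions: the real stats live on the GPU and the notes propose a separate strings column, whereas this sketch uses `std::string` copies. All names (`string_stats_view`, `string_stats_owned`, `persist`, `merge`) are hypothetical and do not exist in libcudf.

```cpp
#include <optional>
#include <string>
#include <string_view>

// Non-owning stats: the views point into a chunk's column data and dangle
// once that chunk's table is freed.
struct string_stats_view {
  std::string_view min;
  std::string_view max;
};

// Owning copy of the min/max strings that outlives the chunk.
struct string_stats_owned {
  std::string min;
  std::string max;
};

// Copy the min/max strings out of the chunk so they can be kept until close.
string_stats_owned persist(string_stats_view const& view)
{
  return {std::string(view.min), std::string(view.max)};
}

// Fold one chunk's (non-encoded) stats into the running file-level stats.
void merge(std::optional<string_stats_owned>& file_stats, string_stats_owned const& chunk)
{
  if (!file_stats) {
    file_stats = chunk;
    return;
  }
  if (chunk.min < file_stats->min) { file_stats->min = chunk.min; }
  if (chunk.max > file_stats->max) { file_stats->max = chunk.max; }
}
```

Keeping the merged stats as a struct rather than an encoded blob is what allows the merge to happen at close time, when the result would then be encoded and written to the footer.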

github-actions · 2021-03-14T19:12:55Z

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

We are making the changes necessary to fix issue #5826. This is the first of those changes and is a refactor of the way statistics are captured to better support chunked writing. The next change will include things like multiple data pointers passed to the kernels and storage between calls to write.

Authors:
- Mike Wilson (https://github.com/hyperbolic2346)

Approvers:
- Vukasin Milovanovic (https://github.com/vuule)
- MithunR (https://github.com/mythrocks)

URL: #10567

…rite (#10694)

This is the second half of the chunked ORC write statistics work. This part enables persisting the string data between write calls, building the file-level statistics from the stripe data, and writing out the statistics in a chunked-write file. To ensure that everything is persisted, the added test re-uses the same variable for both writes; this was verified to invalidate the first table before the second call to write.

~This will clean up once 10567 goes in as this is branched off that work.~

depends on #10567

closes #5826

Authors:
- Mike Wilson (https://github.com/hyperbolic2346)

Approvers:
- Vyas Ramasubramani (https://github.com/vyasr)
- MithunR (https://github.com/mythrocks)
- Vukasin Milovanovic (https://github.com/vuule)

URL: #10694
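The approach described in #10694 (keep per-stripe statistics across write calls, then build the file-level statistics from them) can be sketched on the host as below. This is a simplified illustration, not the libcudf writer: `chunked_stats_writer` and `int_stats` are hypothetical names, each `write` call is assumed to produce exactly one stripe, and only integer min/max/count are tracked.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

struct int_stats {
  int64_t min;
  int64_t max;
  uint64_t count;
};

// Hypothetical chunked writer that only tracks statistics.
class chunked_stats_writer {
 public:
  // Record one stripe's statistics; nothing is retained from `chunk` itself,
  // so the caller may overwrite or free it before the next call.
  void write(std::vector<int64_t> const& chunk)
  {
    if (chunk.empty()) { return; }
    auto const [lo, hi] = std::minmax_element(chunk.begin(), chunk.end());
    stripe_stats_.push_back({*lo, *hi, static_cast<uint64_t>(chunk.size())});
  }

  // Build the file-level statistics from the persisted stripe statistics.
  std::optional<int_stats> close() const
  {
    if (stripe_stats_.empty()) { return std::nullopt; }
    int_stats file = stripe_stats_.front();
    for (std::size_t i = 1; i < stripe_stats_.size(); ++i) {
      file.min = std::min(file.min, stripe_stats_[i].min);
      file.max = std::max(file.max, stripe_stats_[i].max);
      file.count += stripe_stats_[i].count;
    }
    return file;
  }

 private:
  std::vector<int_stats> stripe_stats_;
};
```

Mirroring the test strategy in the PR description, a caller can re-use one variable for both writes (assign a new table to it between calls) and still expect `close()` to report statistics covering both chunks.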

vuule added the bug (Something isn't working), Needs Triage (Need team to review and classify), and cuIO (cuIO issue) labels Aug 1, 2020

kkraus14 removed the Needs Triage (Need team to review and classify) label Aug 3, 2020

calebwin mentioned this issue Sep 4, 2020

[REVIEW] Reading ORC statistics #6142

Merged

vuule self-assigned this Dec 14, 2020

vuule mentioned this issue Dec 15, 2020

Correct ORC docstring; other minor cuIO improvements #7012

Merged

vuule mentioned this issue Dec 30, 2020

Refactor ORC ProtobufReader to make it more extendable #7055

Merged

vuule changed the title ~~[BUG] ORC file-level statistics not correct for chunked writes~~ [BUG] ORC file-level statistics omitted with chunked writes Jan 21, 2021

github-actions bot added the inactive-30d label Mar 14, 2021

This was referenced Mar 10, 2022

[BUG] GPU writing ORC columns statistics NVIDIA/spark-rapids#4860

Closed

[FEA] Add File Statistic when writing the ORC file #10075

Closed

hyperbolic2346 self-assigned this Mar 21, 2022

hyperbolic2346 mentioned this issue Apr 1, 2022

First step toward statistics in ORC files with chunked writes #10567

Merged

vuule removed the inactive-30d label Apr 7, 2022

hyperbolic2346 mentioned this issue Apr 20, 2022

Persist string statistics data across multiple calls to orc chunked write #10694

Merged

rapids-bot bot closed this as completed in #10694 May 6, 2022

amahussein mentioned this issue Jun 1, 2022

Update GPU ORC statistics write support NVIDIA/spark-rapids#5715

Merged

vuule commented Aug 1, 2020 (edited by hyperbolic2346)
