Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] ORC file-level statistics omitted with chunked writes #5826

Closed
3 tasks done
vuule opened this issue Aug 1, 2020 · 4 comments · Fixed by #10694
Closed
3 tasks done

[BUG] ORC file-level statistics omitted with chunked writes #5826

vuule opened this issue Aug 1, 2020 · 4 comments · Fixed by #10694
Assignees
Labels
bug Something isn't working cuIO cuIO issue

Comments

@vuule
Copy link
Contributor

vuule commented Aug 1, 2020

This came up in the discussion of #5707.
Currently, the file-level statistics are not written when a file is written in multiple chunks.

  • Figure out how to unit test this change.
  • Refactor gather_statistic_blobs to facilitate file stats merging across multiple chunks.
  • Collect and merge file stats to write them in the final file footer.
@vuule vuule added bug Something isn't working Needs Triage Need team to review and classify cuIO cuIO issue labels Aug 1, 2020
@kkraus14 kkraus14 removed the Needs Triage Need team to review and classify label Aug 3, 2020
@vuule vuule self-assigned this Dec 14, 2020
rapids-bot bot pushed a commit that referenced this issue Jan 5, 2021
Related to #5826

Refactor the `ProtobufReader` API to facilitate expansion to support robust reading of column statistics.
Changes include:

- Move `orc::metadata` from `readder_impl.cu` to `orc.h` so it can be reused for statistics related APIs.
- Removed duplicated code in `read_orc_statistics` - use `orc::metadata` instead.
- Rename `ColumnStatistics` to `ColStatsBlob`, since that's what it currently is.
- Avoid redundant copies in `read_orc_statistics`,
- Replace `get_u32`, `get_i32`, etc. with templated `get`.
- Replace per-type functors (e.g. `FieldUInt64`) with templated `field_reader`s to reduce code repetition.
- The two type-specific parts of `FieldXYZ` functors (field enum and read impl) are now separate to avoid redundant code.
- `field_reader` dispatches based on the value type, so also added `packed_field_reader` and `raw_field_reader` for packed fields and blob reads (respectively).
- Replace return value based error checking in `ProtobufReader` with `CUDF_EXPECTS`.
- Removed `InitSchema` from `ProtobufReader` - schema is only used to determine column names. The names are now lazily calculated in `metadata::get_column_name`

Authors:
  - vuule <vmilovanovic@nvidia.com>
  - Vukasin Milovanovic <vukasin.milovanovic.87@gmail.com>

Approvers:
  - Kumar Aatish
  - Conor Hoekstra

URL: #7055
@vuule
Copy link
Contributor Author

vuule commented Jan 20, 2021

This was marked as P0 as the assumption was that the file statistics are incorrect with chunked writes. This is not the case - the file statistics are not present with chunked writes. Based on this, changing to P1.

@vuule vuule changed the title [BUG] ORC file-level statistics not correct for chunked writes [BUG] ORC file-level statistics omitted with chunked writes Jan 21, 2021
@vuule
Copy link
Contributor Author

vuule commented Jan 26, 2021

Some notes about the remaining work, since I won't be back to this for a few weeks at least:
We cannot merge string stats across chunks as they hold pointers to the column data.
To be able to merge stats with chunked writes, we need to save the min/max strings in a separate column and update the stats pointers to point to this column.
In each chunked write, we add the min/max strings column and (non-encoded) stats to the set of per-chunk file stats.
When we close the writer, these stats need to be merged, encoded and included in the footer.

Refactoring needed to facilitate the steps above:
Return a struct from gather_statistic_blobs (instead of a vector containing all stats blobs) - stripe stats can always be returned as blobs, but file stats need to be returned as a struct in case of chunked writes.

@github-actions
Copy link

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@hyperbolic2346 hyperbolic2346 self-assigned this Mar 21, 2022
@vuule vuule removed the inactive-30d label Apr 7, 2022
rapids-bot bot pushed a commit that referenced this issue Apr 26, 2022
We are making the changes necessary to fix issue #5826. This is the first of those changes and is a refactor of the way statistics are captured to better support chunked writing. The next change will include things like multiple data pointers passed to the kernels and storage between calls to write.

Authors:
  - Mike Wilson (https://github.com/hyperbolic2346)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - MithunR (https://github.com/mythrocks)

URL: #10567
rapids-bot bot pushed a commit that referenced this issue May 6, 2022
…rite (#10694)

This is the second half of the chunked orc write statistics work. This part enables persisting the string data between write calls, building the file-level statistics from the stripe data, and writing out the statistics in a chunked-write file. Care is made to ensure that everything is persisted by re-using the same variable in the added test for both writes to ensure nothing is missed. This was verified to invalidate the first table before the second call to write.

~This will clean up once 10567 goes in as this is branched off that work.~

depends on #10567 
closes #5826

Authors:
  - Mike Wilson (https://github.com/hyperbolic2346)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - MithunR (https://github.com/mythrocks)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #10694
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working cuIO cuIO issue
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants