[BUG] ORC file-level statistics omitted with chunked writes #5826
Comments
Related to #5826

Refactor the `ProtobufReader` API to facilitate expansion to support robust reading of column statistics. Changes include:
- Move `orc::metadata` from `reader_impl.cu` to `orc.h` so it can be reused for statistics-related APIs.
- Remove duplicated code in `read_orc_statistics`; use `orc::metadata` instead.
- Rename `ColumnStatistics` to `ColStatsBlob`, since that's what it currently is.
- Avoid redundant copies in `read_orc_statistics`.
- Replace `get_u32`, `get_i32`, etc. with templated `get`.
- Replace per-type functors (e.g. `FieldUInt64`) with templated `field_reader`s to reduce code repetition.
- The two type-specific parts of `FieldXYZ` functors (field enum and read impl) are now separate to avoid redundant code.
- `field_reader` dispatches based on the value type, so also added `packed_field_reader` and `raw_field_reader` for packed fields and blob reads, respectively.
- Replace return-value-based error checking in `ProtobufReader` with `CUDF_EXPECTS`.
- Remove `InitSchema` from `ProtobufReader`; the schema is only used to determine column names. The names are now lazily calculated in `metadata::get_column_name`.

Authors:
- vuule <vmilovanovic@nvidia.com>
- Vukasin Milovanovic <vukasin.milovanovic.87@gmail.com>

Approvers:
- Kumar Aatish
- Conor Hoekstra

URL: #7055
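To illustrate the "templated `get`" item above, here is a minimal, self-contained sketch of how a single `get<T>()` can stand in for a family of `get_u32`/`get_i32`-style methods. It is not the cuDF `ProtobufReader`: the class name `varint_reader`, its constructor, and the choice to zigzag-decode every signed type (as protobuf does for `sint32`/`sint64` fields) are assumptions made for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <type_traits>

// Hypothetical, simplified reader used only for illustration;
// it is not the actual cuDF ProtobufReader.
class varint_reader {
 public:
  varint_reader(const uint8_t* data, std::size_t size) : cur_(data), end_(data + size) {}

  // One templated getter instead of get_u32 / get_i32 / get_u64 / get_i64.
  template <typename T>
  T get()
  {
    static_assert(std::is_integral_v<T>, "get<T> only supports integral types");
    uint64_t const raw = read_varint();
    if constexpr (std::is_signed_v<T>) {
      // Assumption: signed values are zigzag encoded (protobuf sint32/sint64).
      uint64_t const decoded = (raw >> 1) ^ (uint64_t{0} - (raw & 1));
      return static_cast<T>(decoded);
    } else {
      return static_cast<T>(raw);
    }
  }

 private:
  // Reads a base-128 varint: 7 payload bits per byte, MSB set on all but the last byte.
  uint64_t read_varint()
  {
    uint64_t value = 0;
    for (int shift = 0; shift < 64; shift += 7) {
      if (cur_ >= end_) { throw std::runtime_error("unexpected end of buffer"); }
      uint8_t const byte = *cur_++;
      value |= static_cast<uint64_t>(byte & 0x7f) << shift;
      if ((byte & 0x80) == 0) { return value; }
    }
    throw std::runtime_error("varint is longer than 10 bytes");
  }

  const uint8_t* cur_;
  const uint8_t* end_;
};
```

Errors are reported by throwing rather than through return values, in the same spirit as the switch to `CUDF_EXPECTS` described above; the sketch uses `std::runtime_error` only to stay dependency-free.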
This was marked as P0 as the assumption was that the file statistics are incorrect with chunked writes. This is not the case - the file statistics are not present with chunked writes. Based on this, changing to P1.
Some notes about the remaining work, since I won't be back to this for a few weeks at least: Refactoring needed to facilitate the steps above:
We are making the changes necessary to fix issue #5826. This is the first of those changes and is a refactor of the way statistics are captured to better support chunked writing. The next change will include things like multiple data pointers passed to the kernels and storage between calls to write.

Authors:
- Mike Wilson (https://github.com/hyperbolic2346)

Approvers:
- Vukasin Milovanovic (https://github.com/vuule)
- MithunR (https://github.com/mythrocks)

URL: #10567
…rite (#10694)

This is the second half of the chunked ORC write statistics work. This part enables persisting the string data between write calls, building the file-level statistics from the stripe data, and writing out the statistics in a chunked-write file. To ensure that everything is persisted and nothing is missed, the added test re-uses the same variable for both writes. This was verified to invalidate the first table before the second call to write.

~This will clean up once 10567 goes in as this is branched off that work.~

depends on #10567
closes #5826

Authors:
- Mike Wilson (https://github.com/hyperbolic2346)

Approvers:
- Vyas Ramasubramani (https://github.com/vyasr)
- MithunR (https://github.com/mythrocks)
- Vukasin Milovanovic (https://github.com/vuule)

URL: #10694
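As a rough illustration of "building the file-level statistics from the stripe data", the sketch below keeps per-stripe statistics alive across several write calls and merges them per column when the file is closed. It is a host-side simplification under stated assumptions, not the cuDF implementation: `int_column_stats`, `merge_into`, and `file_stats_accumulator` are hypothetical names, and only integer count/min/max are modeled (no strings, sums, or null counts).

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical per-column integer statistics for one stripe (or for the whole file).
struct int_column_stats {
  uint64_t num_values = 0;
  std::optional<int64_t> min;  // empty when no values were seen
  std::optional<int64_t> max;
};

// Folds `src` into `dst`: counts add up, min/max widen.
inline void merge_into(int_column_stats& dst, int_column_stats const& src)
{
  dst.num_values += src.num_values;
  if (src.min && (!dst.min || *src.min < *dst.min)) { dst.min = src.min; }
  if (src.max && (!dst.max || *src.max > *dst.max)) { dst.max = src.max; }
}

// Hypothetical state that outlives individual write() calls of a chunked writer.
class file_stats_accumulator {
 public:
  explicit file_stats_accumulator(std::size_t num_columns) : num_columns_(num_columns) {}

  // Called once per stripe; the accumulator owns its copy, so the
  // caller's table can be freed before the next write call.
  void add_stripe(std::vector<int_column_stats> stripe)
  {
    if (stripe.size() != num_columns_) {
      throw std::invalid_argument("stripe has an unexpected number of columns");
    }
    stripes_.push_back(std::move(stripe));
  }

  // Called at close: derives file-level statistics from all stripes.
  std::vector<int_column_stats> file_stats() const
  {
    std::vector<int_column_stats> result(num_columns_);
    for (auto const& stripe : stripes_) {
      for (std::size_t col = 0; col < num_columns_; ++col) {
        merge_into(result[col], stripe[col]);
      }
    }
    return result;
  }

 private:
  std::size_t num_columns_;
  std::vector<std::vector<int_column_stats>> stripes_;  // indexed [stripe][column]
};
```

Because the accumulator copies what it needs, the input of an earlier write can be invalidated before the next one, which is the property the test described above checks.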
This came up in the discussion of #5707.
Currently, the file-level statistics are not written when a file is written in multiple chunks.
`gather_statistic_blobs` to facilitate file stats merging across multiple chunks.