Reading multiple dictionary pages in Parquet file might lead to an exception #18061

Closed
2 tasks done
rolfskog opened this issue Aug 6, 2024 · 2 comments
Labels: bug, needs repro, python

rolfskog commented Aug 6, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

Log output

thread 'polars-0' panicked at crates/polars-parquet/src/parquet/read/compression.rs:222:17:
Found compressed page in the middle of the pages
stack backtrace:
   0:        0x32fb06c00 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h0690378318382463
   1:        0x32d85a888 - core::fmt::write::h32b0c8c26742285d
   2:        0x32fadbc04 - std::io::Write::write_fmt::h933c8f690febf69c
   3:        0x32fb0a6f0 - std::panicking::default_hook::{{closure}}::he930886aa81288c0
   4:        0x32fb0a218 - std::panicking::default_hook::h4f6c24245aca9b27
   5:        0x32fb0bcf8 - std::panicking::rust_panic_with_hook::h1d87874b94b8a178
   6:        0x32fb0b000 - std::panicking::begin_panic_handler::{{closure}}::hcaa207ee4a92d54b
   7:        0x32fb0af9c - std::sys::backtrace::__rust_end_short_backtrace::h07884ebaad5621e2
   8:        0x32fb0af90 - _rust_begin_unwind
   9:        0x32fc6ae4c - core::panicking::panic_fmt::h40c53935d133e936
  10:        0x32ef187b4 - <polars_parquet::parquet::read::compression::BasicDecompressor as core::iter::traits::iterator::Iterator>::next::h6a47f1fb5ef177d7
  11:        0x32ef09400 - polars_parquet::arrow::read::deserialize::simple::page_iter_to_array::h3246a55ae7e50655
  12:        0x32e5bd84c - polars_io::parquet::read::read_impl::column_idx_to_series::h599afd733396eec4
  13:        0x32e5bf204 - rayon::iter::plumbing::bridge_producer_consumer::helper::h06aa0604ac6844c9
  14:        0x32e5bfd5c - <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute::h8b28840b62c6a969
  15:        0x32ff0495c - rayon_core::registry::WorkerThread::wait_until_cold::hf4400df91e7bd60a
  16:        0x32e5bfa0c - rayon_core::join::join_context::{{closure}}::ha92d24df2623a136
  17:        0x32e5bf448 - rayon::iter::plumbing::bridge_producer_consumer::helper::h06aa0604ac6844c9
  18:        0x32e5bf914 - rayon_core::join::join_context::{{closure}}::ha92d24df2623a136
  19:        0x32e5bf448 - rayon::iter::plumbing::bridge_producer_consumer::helper::h06aa0604ac6844c9
  20:        0x32e5bf914 - rayon_core::join::join_context::{{closure}}::ha92d24df2623a136
  21:        0x32e5bf448 - rayon::iter::plumbing::bridge_producer_consumer::helper::h06aa0604ac6844c9
  22:        0x32e5bf914 - rayon_core::join::join_context::{{closure}}::ha92d24df2623a136
  23:        0x32e5bf448 - rayon::iter::plumbing::bridge_producer_consumer::helper::h06aa0604ac6844c9
  24:        0x32e5be67c - rayon_core::thread_pool::ThreadPool::install::{{closure}}::h14434fd3ead777ff
  25:        0x32e5bc490 - polars_io::parquet::read::read_impl::rg_to_dfs::h856309a87e8f9fb3
  26:        0x32e7f9684 - rayon::iter::plumbing::bridge_producer_consumer::helper::h32bf4a01e217c624
  27:        0x32e7b98b0 - rayon_core::thread_pool::ThreadPool::install::{{closure}}::h5329d292d8a4fd3d
  28:        0x32e807db4 - <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute::hdebded9e2264ac38
  29:        0x32ff0495c - rayon_core::registry::WorkerThread::wait_until_cold::hf4400df91e7bd60a
  30:        0x32f8bff4c - std::sys::backtrace::__rust_begin_short_backtrace::h07ee9f36fde246d4
  31:        0x32f8bfd2c - core::ops::function::FnOnce::call_once{{vtable.shim}}::h13a0f11c626472fa
  32:        0x32fb0e690 - std::sys::pal::unix::thread::Thread::new::thread_start::h83845b13417e2e4f
  33:        0x180e9ef94 - __pthread_joiner_wake
thread 'polars-8' panicked at crates/polars-parquet/src/parquet/read/compression.rs:222:17:
Found compressed page in the middle of the pages
stack backtrace:
   0:        0x32fb06c00 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h0690378318382463
   1:        0x32d85a888 - core::fmt::write::h32b0c8c26742285d
   2:        0x32fadbc04 - std::io::Write::write_fmt::h933c8f690febf69c
   3:        0x32fb0a6f0 - std::panicking::default_hook::{{closure}}::he930886aa81288c0
   4:        0x32fb0a218 - std::panicking::default_hook::h4f6c24245aca9b27
   5:        0x32fb0bcf8 - std::panicking::rust_panic_with_hook::h1d87874b94b8a178
   6:        0x32fb0b000 - std::panicking::begin_panic_handler::{{closure}}::hcaa207ee4a92d54b
   7:        0x32fb0af9c - std::sys::backtrace::__rust_end_short_backtrace::h07884ebaad5621e2
   8:        0x32fb0af90 - _rust_begin_unwind
   9:        0x32fc6ae4c - core::panicking::panic_fmt::h40c53935d133e936
  10:        0x32ef187b4 - <polars_parquet::parquet::read::compression::BasicDecompressor as core::iter::traits::iterator::Iterator>::next::h6a47f1fb5ef177d7
  11:        0x32ef09400 - polars_parquet::arrow::read::deserialize::simple::page_iter_to_array::h3246a55ae7e50655
  12:        0x32e5bd84c - polars_io::parquet::read::read_impl::column_idx_to_series::h599afd733396eec4
  13:        0x32e5bf204 - rayon::iter::plumbing::bridge_producer_consumer::helper::h06aa0604ac6844c9
  14:        0x32e5bfd5c - <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute::h8b28840b62c6a969
  15:        0x32ff0495c - rayon_core::registry::WorkerThread::wait_until_cold::hf4400df91e7bd60a
  16:        0x32f8bff4c - std::sys::backtrace::__rust_begin_short_backtrace::h07ee9f36fde246d4
  17:        0x32f8bfd2c - core::ops::function::FnOnce::call_once{{vtable.shim}}::h13a0f11c626472fa
  18:        0x32fb0e690 - std::sys::pal::unix::thread::Thread::new::thread_start::h83845b13417e2e4f
  19:        0x180e9ef94 - __pthread_joiner_wake
thread 'polars-2' panicked at crates/polars-parquet/src/parquet/read/compression.rs:222:17:
Found compressed page in the middle of the pages
stack backtrace:
   0:        0x32fb06c00 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h0690378318382463
   1:        0x32d85a888 - core::fmt::write::h32b0c8c26742285d
   2:        0x32fadbc04 - std::io::Write::write_fmt::h933c8f690febf69c
   3:        0x32fb0a6f0 - std::panicking::default_hook::{{closure}}::he930886aa81288c0
   4:        0x32fb0a218 - std::panicking::default_hook::h4f6c24245aca9b27
   5:        0x32fb0bcf8 - std::panicking::rust_panic_with_hook::h1d87874b94b8a178
   6:        0x32fb0b000 - std::panicking::begin_panic_handler::{{closure}}::hcaa207ee4a92d54b
   7:        0x32fb0af9c - std::sys::backtrace::__rust_end_short_backtrace::h07884ebaad5621e2
   8:        0x32fb0af90 - _rust_begin_unwind
   9:        0x32fc6ae4c - core::panicking::panic_fmt::h40c53935d133e936
  10:        0x32ef187b4 - <polars_parquet::parquet::read::compression::BasicDecompressor as core::iter::traits::iterator::Iterator>::next::h6a47f1fb5ef177d7
  11:        0x32ef09400 - polars_parquet::arrow::read::deserialize::simple::page_iter_to_array::h3246a55ae7e50655
  12:        0x32e5bd84c - polars_io::parquet::read::read_impl::column_idx_to_series::h599afd733396eec4
  13:        0x32e5bf204 - rayon::iter::plumbing::bridge_producer_consumer::helper::h06aa0604ac6844c9
  14:        0x32e5bfd5c - <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute::h8b28840b62c6a969
  15:        0x32ff0495c - rayon_core::registry::WorkerThread::wait_until_cold::hf4400df91e7bd60a
  16:        0x32e5bfa0c - rayon_core::join::join_context::{{closure}}::ha92d24df2623a136
  17:        0x32e5bf448 - rayon::iter::plumbing::bridge_producer_consumer::helper::h06aa0604ac6844c9
  18:        0x32e5bfd5c - <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute::h8b28840b62c6a969
  19:        0x32ff0495c - rayon_core::registry::WorkerThread::wait_until_cold::hf4400df91e7bd60a
  20:        0x32e5bfa0c - rayon_core::join::join_context::{{closure}}::ha92d24df2623a136
  21:        0x32e5bf448 - rayon::iter::plumbing::bridge_producer_consumer::helper::h06aa0604ac6844c9
  22:        0x32e5bf914 - rayon_core::join::join_context::{{closure}}::ha92d24df2623a136
  23:        0x32e5bf448 - rayon::iter::plumbing::bridge_producer_consumer::helper::h06aa0604ac6844c9
  24:        0x32e5bfd5c - <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute::h8b28840b62c6a969
  25:        0x32ff0495c - rayon_core::registry::WorkerThread::wait_until_cold::hf4400df91e7bd60a
  26:        0x32f8bff4c - std::sys::backtrace::__rust_begin_short_backtrace::h07ee9f36fde246d4
  27:        0x32f8bfd2c - core::ops::function::FnOnce::call_once{{vtable.shim}}::h13a0f11c626472fa
  28:        0x32fb0e690 - std::sys::pal::unix::thread::Thread::new::thread_start::h83845b13417e2e4f
  29:        0x180e9ef94 - __pthread_joiner_wake

Issue description

Reading a local Parquet file leads to an exception. The file works fine on 1.3.0, but can't be read on >=1.4.0.

Expected behavior

The file loads without raising an exception.

Installed versions

--------Version info---------
Polars:               1.4.1
Index type:           UInt32
Platform:             macOS-14.4.1-arm64-arm-64bit
Python:               3.10.13 (main, Nov 28 2023, 09:27:45) [Clang 15.0.0 (clang-1500.0.40.1)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          3.0.0
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               2023.10.0
gevent:               <not installed>
great_tables:         <not installed>
hvplot:               <not installed>
matplotlib:           3.8.2
nest_asyncio:         1.5.8
numpy:                1.26.2
openpyxl:             3.1.2
pandas:               2.2.2
pyarrow:              16.1.0
pydantic:             2.7.2
pyiceberg:            <not installed>
sqlalchemy:           2.0.23
torch:                2.1.1
xlsx2csv:             <not installed>
xlsxwriter:           3.2.0
rolfskog added the bug, needs triage, and python labels on Aug 6, 2024
rolfskog changed the title from "OanicException when reading parquet file" to "PanicException when reading parquet file" on Aug 6, 2024
@coastalwhite
Collaborator

Did Polars also write the Parquet file that you are reading? It looks to me like the file does not follow the Apache Parquet format specification. If it was written by another Parquet writer, you should probably report this issue to its maintainers.
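
One quick way to rule out a truncated or otherwise damaged file before looking at the writer is to check the outer framing. Below is a minimal sketch using only the Python standard library, assuming an unencrypted file: a Parquet file starts and ends with the magic bytes `PAR1`, and the four bytes before the trailing magic hold the little-endian footer length. The function name `check_parquet_framing` is made up for illustration. The check does not look at the pages inside a column chunk, so on its own it cannot detect the problem behind the panic above.

```python
import struct
from pathlib import Path

MAGIC = b"PAR1"  # magic bytes of an unencrypted Parquet file


def check_parquet_framing(path):
    """Return the footer length if the file has valid outer Parquet framing.

    Raises ValueError otherwise. Hypothetical helper: it checks only the
    leading/trailing magic bytes and the footer length field.
    """
    size = Path(path).stat().st_size
    # Smallest possible layout: 4-byte magic + 4-byte length + 4-byte magic.
    if size < 12:
        raise ValueError("file too small to be Parquet")
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-8, 2)  # last 8 bytes: footer length + trailing magic
        tail = f.read(8)
    if head != MAGIC or tail[4:] != MAGIC:
        raise ValueError("missing PAR1 magic bytes")
    (footer_len,) = struct.unpack("<I", tail[:4])
    if footer_len > size - 12:
        raise ValueError("footer length exceeds file size")
    return footer_len
```

If this raises, the file is likely damaged rather than merely written in an unusual way; if it passes, the page layout inside the column chunks still has to be inspected with a Parquet-aware tool.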

coastalwhite added the needs repro label and removed the needs triage label on Aug 7, 2024
coastalwhite changed the title from "PanicException when reading parquet file" to "Reading multiple dictionary pages in Parquet file might lead to an exception" on Aug 7, 2024
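
For context on the new title: as far as the Parquet format describes it, a column chunk holds at most one dictionary page, and that page comes first. The toy check below is not Polars' reader logic. It only illustrates the rule on a hypothetical list of page kinds (`"dict"` or `"data"`), and `find_misplaced_dictionary_pages` is an illustrative name.

```python
from typing import Iterable, List


def find_misplaced_dictionary_pages(page_kinds: Iterable[str]) -> List[int]:
    """Return the indices of dictionary pages that are not the first page.

    `page_kinds` is a hypothetical, already-decoded sequence of page kinds
    for one column chunk, in file order.
    """
    return [i for i, kind in enumerate(page_kinds) if kind == "dict" and i > 0]


# A well-formed chunk: one dictionary page, then data pages.
print(find_misplaced_dictionary_pages(["dict", "data", "data"]))  # []

# A second dictionary page in the middle of the chunk.
print(find_misplaced_dictionary_pages(["dict", "data", "dict", "data"]))  # [2]
```

A reader that assumes the first layout may fail on the second, which would be consistent with the panic message in the log, though the thread does not confirm that this is what the file contained.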
coastalwhite added a commit to coastalwhite/polars that referenced this issue Aug 8, 2024
This fixes an issue with some Parquet writers that write dictionary pages for Null arrays (why?? I have no idea?).

Fixes pola-rs#18085.
Fixes pola-rs#18079.

Possibly also pola-rs#18061.
coastalwhite added a commit to coastalwhite/polars that referenced this issue Aug 9, 2024
@coastalwhite
Collaborator

I am pretty sure this was closed by #18112. Closing.
