
[C++] Parquet reader is unable to read LargeString columns #39682

Closed
nicki-dese opened this issue Jan 18, 2024 · 30 comments

Comments
@nicki-dese

nicki-dese commented Jan 18, 2024

Describe the bug, including details regarding any error messages, version, and platform.

read_parquet() is giving the following error with large parquet files:

Capacity error: array cannot contain more than 2147483646 bytes, have 2147489180

Versions etc. from sessionInfo():

  • arrow 14.0.0.2
  • R version 4.3.0 (2023-04-21 ucrt)
  • Platform: x86_64-w64-mingw32/x64
  • Windows 11 x64 (build 22621)

Descriptive info on the example problematic table, with two columns:

  • 140 million rows
  • id: large_string, 4.2 GB
  • state: int32, 0.5 GB

The id is a hashed string, 24 characters long. It is not practical to change it, as it's the joining key.

Note: the data above is stored as a data.table in R and was left that way when saving it with write_parquet(), but I've converted it to an Arrow table for the descriptive stats above because I thought they'd be more useful to you!

Other relevant information:

  • The large parquet files were created with arrow::write_parquet()
  • The same files previously opened with an earlier version of read_parquet().
    (Unfortunately I'm not sure which version, but it was working in late November/early December. We work in a closed environment, use Posit Package Manager, and our VMs rebuild every 30 days, so it would have been a fairly recent version.)
  • I've duplicated the error, and it still occurs with newly created large parquet files, such as the one described above
  • Loading the same files with open_dataset() works. However, our team uses targets, which implicitly calls read_parquet, so this bug has unfortunately affected many of our workflows.

Note: I haven't been able to roll back to an earlier version of arrow. Because we only have earlier source versions and not binaries, and I'm using Windows, I get libarrow errors. If there is a workaround for this, please let me know.

UPDATE

With the help of IT, I have been able to install earlier versions of arrow in my environment, and have shown that:

  • The bug was introduced in arrow version 14.0.0, and it's the read_parquet function that is the issue, not write_parquet.
  • With version 13.0.0.1 I was able to open the above problematic file, which was created with write_parquet version 14.0.0.2.
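
For completeness, a rough C++ sketch of the same round trip through the C++ Parquet reader (C++ shown because that's where the failure is; the path, row count, and helper name are illustrative, and our actual workflow uses arrow::write_parquet()/read_parquet() in R):

    #include <arrow/api.h>
    #include <arrow/io/api.h>
    #include <parquet/arrow/reader.h>
    #include <parquet/arrow/writer.h>

    #include <memory>

    // Build a large_utf8 column whose total character data exceeds 2 GB
    // (24-character ids, as in the table described above), write it to
    // Parquet, then read it back.
    arrow::Status RoundTrip(const std::string& path) {
      arrow::LargeStringBuilder builder;
      const std::string id(24, 'a');              // stand-in for a hashed id
      for (int64_t i = 0; i < 100000000; ++i) {   // ~2.4 GB of character data
        ARROW_RETURN_NOT_OK(builder.Append(id));
      }
      std::shared_ptr<arrow::Array> ids;
      ARROW_RETURN_NOT_OK(builder.Finish(&ids));

      auto schema = arrow::schema({arrow::field("id", arrow::large_utf8())});
      auto table = arrow::Table::Make(schema, {ids});

      // store_schema() keeps the large_utf8 type in the file metadata, so the
      // reader tries to restore that type on the way back in.
      ARROW_ASSIGN_OR_RAISE(auto sink, arrow::io::FileOutputStream::Open(path));
      ARROW_RETURN_NOT_OK(parquet::arrow::WriteTable(
          *table, arrow::default_memory_pool(), sink, /*chunk_size=*/1 << 20,
          parquet::default_writer_properties(),
          parquet::ArrowWriterProperties::Builder().store_schema()->build()));

      // Reading back is the step where the reported
      // "Capacity error: array cannot contain more than 2147483646 bytes"
      // surfaced on arrow 14.0.0.
      ARROW_ASSIGN_OR_RAISE(auto source, arrow::io::ReadableFile::Open(path));
      std::unique_ptr<parquet::arrow::FileReader> reader;
      ARROW_RETURN_NOT_OK(parquet::arrow::OpenFile(
          source, arrow::default_memory_pool(), &reader));
      std::shared_ptr<arrow::Table> out;
      ARROW_RETURN_NOT_OK(reader->ReadTable(&out));
      return arrow::Status::OK();
    }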

Component(s)

Parquet, R

@assignUser
Member

Note: I haven't been able to roll back to an earlier version of arrow. Because we only have earlier source versions and not binaries, and I'm using Windows, I get libarrow errors. If there is a workaround for this, please let me know.

devtools::install_version('arrow', '13.0.0.1') works for me on Windows. It does need to download a binary, so that might get blocked? If that's the case, you can get the binary manually here and then set the env var RWINLIB_LOCAL to point to the zip file.

@nicki-dese
Author

@assignUser - thank you! Unfortunately the closed environment does not connect to the internet. I will point the IT team, whom I've asked for help with this internally, at that link and see if they'd be willing to test or upload it for me so I can test.

@nicki-dese
Author

Given that the bug was introduced in version 14.0.0, I am wondering if it relates to #37274

@thisisnic thisisnic added the Priority: Blocker label (Marks a blocker for the release) Jan 30, 2024
@thisisnic
Member

@felipecrv Mind giving this a quick skim over? I'm not sure whether this is at the R or C++ layer, but one of the places in the codebase that comes up when I grep for the "array cannot contain more than" error was updated in #35345, which appears to have been merged after this error appeared, so I figure you might be more familiar with that area of the codebase than us, even if it's not related to this?

@felipecrv
Contributor

@thisisnic sure. I'm going to take a look now.

@felipecrv
Contributor

Capacity error: array cannot contain more than 2147483646 bytes, have 2147489180

log(2147483646 + 2, 2) is 31, so this is a check coming from BaseBinaryBuilder<T>, which uses sizeof(offset_type) in the memory_limit() method. FixedSizeBinaryBuilder can produce the same message, but it always uses int64_t, as these arrays don't need an offsets buffer.

The regular string builder is based on BaseBinaryBuilder<T> and uses 32 bits for the offsets array, so it requires all the concatenated strings to be addressable with 31 bits (~2 GB). But the bug report says the column type is LargeString, which uses 64-bit offsets! It shouldn't be a problem.

Digging into the Parquet reader code, I find...

      // XXX: if a LargeBinary chunk is larger than 2GB, the MSBs of offsets
      // will be lost because they are first created as int32 and then cast to int64.

...leading to a commit from November:

commit b4a0751effe316aee1a0fd80fb1c444ecd6842c5
Author:     Antoine Pitrou <...>
AuthorDate: Mon Nov 23 14:53:56 2020 +0100

    ARROW-10426: [C++] Allow writing large strings to Parquet

    Large strings are still read back as regular strings.

    Closes #8632 from pitrou/ARROW-10426-parquet-large-binary

    Authored-by: Antoine Pitrou <antoine@python.org>

This is from PR #8632, which fixed issue #26405 about writing LargeString to Parquet, but it doesn't fix reading them back. I suppose the value in this is that C++/R/Python scripts can produce files that the Java Parquet reader can read without problems.

Next steps:

  • Change this issue to "[C++] Parquet reader is unable to read LargeString columns"
  • Get help/tips from @pitrou @mapleFU on how to fix this
  • Prepare a fix PR

@thisisnic
Member

thisisnic commented Feb 6, 2024

Thanks for figuring that one out @felipecrv. I've updated the title to reflect that it's a C++ bug, though I don't have the capacity to implement the fix myself.

@thisisnic thisisnic changed the title [R] read_parquet error - unable to read in large parquet files that previously opened [C++] read_parquet error - unable to read in large parquet files that previously opened Feb 6, 2024
@thisisnic thisisnic changed the title [C++] read_parquet error - unable to read in large parquet files that previously opened [C++] Parquet reader is unable to read LargeString columns Feb 6, 2024
@mapleFU
Member

mapleFU commented Feb 6, 2024

arrow 14.0.0.2

Do you mean 14.0.2 or something else? I did a bug fix on 14.0.2 (#38784); not sure whether this is related.

@thisisnic
Member

It could be related; we haven't been able to release 14.0.2 to CRAN, so it'll be 14.0.0.2, which is 14.0.0 with some R patches.

@felipecrv
Contributor

arrow 14.0.0.2

Do you mean 14.0.2 or something else? I did a bug fix on 14.0.2 (#38784); not sure whether this is related.

Haha. I would never have guessed that the perf regression fix also fixes reading large binary arrays :)

@mapleFU
Member

mapleFU commented Feb 6, 2024

Sigh, see #38577

The binary reading behavior changed, which caused both higher memory allocation and a regression...

@felipecrv
Contributor

The binary reading behavior changed, which caused both higher memory allocation and a regression...

Sure, but @nicki-dese probably has enough RAM to read the file they want to read. Is reading it possible since #38784?

@mapleFU
Member

mapleFU commented Feb 6, 2024

Capacity error: array cannot contain more than 2147483646 bytes, have 2147489180

@felipecrv this error is raised from BinaryBuilder or similar, which limits the string/binary size to 2 GB 🤔 Maybe it's not related to RAM.

@felipecrv
Contributor

felipecrv commented Feb 7, 2024

@mapleFU I understand the error is raised from BaseBinaryBuilder<T> (the superclass of StringBuilder). The issue is not an inability to allocate more than 2 GB of RAM; the issue is that a StringArray can't address more than 2 GB of data from its offsets buffer (32-bit offsets). The Parquet reader should figure out how to read these strings into a LargeStringArray, which requires producing 64-bit offsets that point into the data buffer of the resulting LargeStringArray. It might be as easy as introducing the right builder class in the right place, or it might require annoying changes across many layers.
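
To illustrate the offsets difference (a standalone sketch, not the reader's actual code path): StringBuilder, which backs StringArray with 32-bit offsets, returns exactly this CapacityError once the accumulated character data passes its ~2^31-byte limit, while LargeStringBuilder, with 64-bit offsets, keeps going.

    #include <arrow/api.h>

    #include <iostream>

    int main() {
      const std::string value(64 << 20, 'x');  // one 64 MiB string

      // 32-bit offsets (StringArray): Append fails once the accumulated
      // character data would exceed BaseBinaryBuilder's 2147483646-byte limit.
      arrow::StringBuilder builder32;
      arrow::Status st;
      for (int i = 0; i < 40 && st.ok(); ++i) {
        st = builder32.Append(value);          // fails around the 32nd append
      }
      std::cout << st.ToString() << std::endl;
      // -> Capacity error: array cannot contain more than 2147483646 bytes, ...

      // 64-bit offsets (LargeStringArray): the same > 2 GB of data is fine.
      arrow::LargeStringBuilder builder64;
      for (int i = 0; i < 40; ++i) {
        if (!builder64.Append(value).ok()) return 1;
      }
      std::shared_ptr<arrow::Array> out;
      if (!builder64.Finish(&out).ok()) return 1;
      std::cout << "large_utf8 array with " << out->length() << " values" << std::endl;
      return 0;
    }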

@mapleFU
Member

mapleFU commented Feb 8, 2024

@felipecrv A previous patch (#35825) that wanted to address this was never merged, sigh. Maybe we can re-check that.

@nicki-dese
Author

Hi @felipecrv - thank you for looking at this. I have 128 GB of RAM, which is enough RAM to read the file. I have successfully read the file since rolling back to 13.0.0.1.

Regarding testing since #38784, unfortunately the nature of the closed environment I'm in does not allow me to test anything until it's released on CRAN, and the current CRAN version (14.0.0.2 in R) is the one where I found the bug.

@felipecrv
Contributor

Hi @felipecrv - thank you for looking at this. I have 128 GB of RAM, which is enough RAM to read the file.

Plenty of RAM :) The problem is not memory allocation per se, but the representation instantiated by the Parquet reader being based on 32-bit integers.

Regarding testing since #38784

The actual fix would be the attempt linked by @mapleFU in the last message: #35825. It looks like a PR that was abandoned by the author.

@mapleFU
Member

mapleFU commented Jun 13, 2024

Sorry for the late reply, @nicki-dese. Has this problem been solved with a later R release or 14.0.2? If not, I'll dive into it now.

@raulcd
Member

raulcd commented Jun 26, 2024

Is this still considered a blocker for the release?
@thisisnic @assignUser @jonkeane

@jonkeane
Member

We should get @mapleFU's and @felipecrv's input on this: from reading, there's an issue in our C++ Parquet reader that is causing this. From reading #41104 and #35825, there's a proposal, but it doesn't look close enough to finished that it'll make it into the release (please do correct me if I'm wrong on that!).

I'm going to remove the blocker label so we don't hold up this release, though we absolutely should fix this issue in the Parquet reader.

@jonkeane jonkeane removed the Priority: Blocker label (Marks a blocker for the release) Jun 26, 2024
@mapleFU
Member

mapleFU commented Jun 26, 2024

We should get @mapleFU's and @felipecrv's input on this: from reading, there's an issue in our C++ Parquet reader that is causing this. From reading #41104 and #35825, there's a proposal, but it doesn't look close enough to finished that it'll make it into the release (please do correct me if I'm wrong on that!).

Personally I think this is likely to have been solved in 14.0.2. The issue is that the file can be read in 13.0, so it could have been handled by #38784. But I haven't got enough input on this :-(

@jonkeane
Member

Personally I think this is likely to have been solved in 14.0.2. The issue is that the file can be read in 13.0, so it could have been handled by #38784. But I haven't got enough input on this :-(

Thanks for the update! Yeah, I tried a gentle nudge there too; let's see if that goes forward for the next release (17.0.0)?

@mapleFU
Member

mapleFU commented Jun 26, 2024

I think large columns can be read now, and for the next release we may support reading large dictionaries. If there are any reproducible bugs, just report them and let me fix them.

@raulcd
Member

raulcd commented Jun 26, 2024

OK, so I understand that this should be fixed for 17.0.0 and no further action is required. Should we close this issue then? If we want to keep the issue open, should we remove the blocker label and the 17.0.0 milestone?

@raulcd
Member

raulcd commented Jun 26, 2024

Thanks @mapleFU @jonkeane!

@jonkeane
Member

I think large columns can be read now, and for the next release we may support reading large dictionaries.

Oh, sorry, I misunderstood this! That's great that this is fixed for strings, just not dictionaries. Do we have a test for this somewhere in the C++ suite that confirms this? I could probably add one in R too, but that's a bit higher level and superfluous if we already have a C++ one.

@mapleFU
Member

mapleFU commented Jun 26, 2024

Do we have a test for this somewhere in the C++ suite that confirms this?

You're right, I think I'd better add a test case for this 🤔 The large-column read logic is tested, but I guess this behavior was left out. I'm busy on workdays these days; I'll try to construct a unit test this weekend.

Technically, Parquet has the limit below:

  • A single string value should not be greater than 2 GB

but when reading a batch or a table the total can be greater. So when reading strings, the Parquet reader uses a group of APIs to "split into small chunks" with multiple BinaryBuilder / StringBuilder instances: during building, once an incoming string would make the buffer exceed 2 GB, the reader switches to a new builder, and finally it casts the chunks to LargeString/LargeBinary (roughly as sketched below). I think this logic is already tested, but maybe a test for this corner case should be added.
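
A rough sketch of that chunk-and-cast idea (my own illustration of the description above, not the Parquet reader's actual code; the helper name and the hard-coded limit are made up):

    #include <arrow/api.h>
    #include <arrow/compute/api.h>

    #include <memory>
    #include <string>
    #include <vector>

    arrow::Result<std::shared_ptr<arrow::ChunkedArray>> BuildChunkedLargeUtf8(
        const std::vector<std::string>& values) {
      constexpr int64_t kLimit = 2147483646;  // the 2 GB limit from the error above
      arrow::ArrayVector chunks;
      arrow::StringBuilder builder;           // 32-bit offsets, per chunk
      int64_t bytes_in_chunk = 0;

      for (const auto& v : values) {
        if (bytes_in_chunk + static_cast<int64_t>(v.size()) > kLimit) {
          // The next value would overflow the 32-bit offsets: finish this
          // chunk and start a fresh builder.
          std::shared_ptr<arrow::Array> chunk;
          ARROW_RETURN_NOT_OK(builder.Finish(&chunk));
          chunks.push_back(std::move(chunk));
          bytes_in_chunk = 0;
        }
        ARROW_RETURN_NOT_OK(builder.Append(v));
        bytes_in_chunk += static_cast<int64_t>(v.size());
      }
      std::shared_ptr<arrow::Array> last;
      ARROW_RETURN_NOT_OK(builder.Finish(&last));
      chunks.push_back(std::move(last));

      // Cast the utf8 chunks to large_utf8 so callers see the requested type.
      auto chunked = std::make_shared<arrow::ChunkedArray>(std::move(chunks));
      ARROW_ASSIGN_OR_RAISE(
          arrow::Datum casted,
          arrow::compute::Cast(arrow::Datum(chunked), arrow::large_utf8()));
      return casted.chunked_array();
    }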

@felipecrv
Contributor

@mapleFU yeah. The issue stems from offsets of StringArray being cumulative.

I think a better strategy to support LargeStringArray would be reading StringView from Parquet instead of StringArray and then casting from StringView to either StringArray or LargeStringArray at the end.

The StringView arrays can use a variadic number of buffers, so before cumulative offsets reach 2GB, you can start building up another Buffer.

I have code that casts string-views on my machine (PR soon). A StringViewArray to StringArray cast can be made very efficient if only one variadic Buffer was used — something that we can try to maximize in the Parquet reader.

@nicki-dese
Author

Apologies for the delay in replying, @mapleFU. I've just tested arrow 16.1.0 and read_parquet successfully read a previously problematic file with large strings, so the bug appears fixed from my end.

@assignUser
Member

As we have a few follow-up issues for the underlying problems and the user-facing issue is fixed, I'll close this.
