
[C++] Parquet reader is unable to read LargeString columns #39682

Closed
nicki-dese opened this issue Jan 18, 2024 · 30 comments

Comments
@nicki-dese

nicki-dese commented Jan 18, 2024

Describe the bug, including details regarding any error messages, version, and platform.

read_parquet() is giving the following error with large parquet files:

Capacity error: array cannot contain more than 2147483646 bytes, have 2147489180

Versions etc. from sessionInfo():

  • arrow 14.0.0.2
  • R version 4.3.0 (2023-04-21 ucrt)
  • Platform: x86_64-w64-mingw32/x64
  • Windows 11 x64 (build 22621)

Descriptive info on the example problematic table, with two columns:

  • 140 million rows
  • id: large_string, 4.2 GB
  • state: int32, 0.5 GB

The id is a hashed string, 24 characters long. It is not practical to change it, as it's the joining key.

Note: the data above is stored as a data.table in R and was left that way when saving it with write_parquet(), but I've converted it to an Arrow table for the descriptive stats above because I thought they'd be more useful to you!

Other relevant information:

  • The large parquet files were created with arrow::write_parquet()
  • The same files previously opened with an earlier version of read_parquet().
    (Unfortunately I'm not sure which version, but it was working in late November/early December. We work in a closed environment, use Posit Package Manager, and our VMs rebuild every 30 days, so it would have been a fairly recent version.)
  • I've duplicated the error, and it still occurs with newly created large parquet files, such as the one described above
  • Loading the same files with open_dataset() works. However, our team uses targets, which implicitly calls read_parquet, so this bug has unfortunately affected many of our workflows.

Note: I haven't been able to roll back to an earlier version of arrow. Because we only have earlier source versions and not binaries, and I'm using Windows, I get libarrow errors. If there is a workaround for this, please let me know.

UPDATE

With the help of IT, I have been able to install earlier versions of arrow in my environment, and have shown that:

  • The bug was introduced in arrow version 14.0.0, and it's the read_parquet function that is the issue, not write_parquet.
  • With version 13.0.0.1 I was able to open the above problematic file, which was created with write_parquet version 14.0.0.2.
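
For completeness, a rough C++ sketch of the same round trip through the C++ Parquet reader (C++ shown because that's where the failure is; the path, row count, and helper name are illustrative, and our actual workflow uses arrow::write_parquet()/read_parquet() in R):

    #include <arrow/api.h>
    #include <arrow/io/api.h>
    #include <parquet/arrow/reader.h>
    #include <parquet/arrow/writer.h>

    #include <memory>

    // Build a large_utf8 column whose total character data exceeds 2 GB
    // (24-character ids, as in the table described above), write it to
    // Parquet, then read it back.
    arrow::Status RoundTrip(const std::string& path) {
      arrow::LargeStringBuilder builder;
      const std::string id(24, 'a');              // stand-in for a hashed id
      for (int64_t i = 0; i < 100000000; ++i) {   // ~2.4 GB of character data
        ARROW_RETURN_NOT_OK(builder.Append(id));
      }
      std::shared_ptr<arrow::Array> ids;
      ARROW_RETURN_NOT_OK(builder.Finish(&ids));

      auto schema = arrow::schema({arrow::field("id", arrow::large_utf8())});
      auto table = arrow::Table::Make(schema, {ids});

      // store_schema() keeps the large_utf8 type in the file metadata, so the
      // reader tries to restore that type on the way back in.
      ARROW_ASSIGN_OR_RAISE(auto sink, arrow::io::FileOutputStream::Open(path));
      ARROW_RETURN_NOT_OK(parquet::arrow::WriteTable(
          *table, arrow::default_memory_pool(), sink, /*chunk_size=*/1 << 20,
          parquet::default_writer_properties(),
          parquet::ArrowWriterProperties::Builder().store_schema()->build()));

      // Reading back is the step where the reported
      // "Capacity error: array cannot contain more than 2147483646 bytes"
      // surfaced on arrow 14.0.0.
      ARROW_ASSIGN_OR_RAISE(auto source, arrow::io::ReadableFile::Open(path));
      std::unique_ptr<parquet::arrow::FileReader> reader;
      ARROW_RETURN_NOT_OK(parquet::arrow::OpenFile(
          source, arrow::default_memory_pool(), &reader));
      std::shared_ptr<arrow::Table> out;
      ARROW_RETURN_NOT_OK(reader->ReadTable(&out));
      return arrow::Status::OK();
    }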

Component(s)

Parquet, R

@assignUser
Member

Note: I haven't been able to roll back to an earlier version of arrow. Because we only have earlier source versions and not binaries, and I'm using Windows, I get libarrow errors. If there is a workaround for this, please let me know.

devtools::install_version('arrow', '13.0.0.1') works for me on Windows. It does need to download a binary, so that might get blocked? If that's the case, you can get the binary manually here and then set the env var RWINLIB_LOCAL to point to the zip file.

@nicki-dese
Author

@assignUser - thank you! Unfortunately the closed environment does not connect to the internet. I will point the IT team, whom I've asked for help with this internally, at that link and see if they'd be willing to test or upload it for me so I can test.

@nicki-dese
Author

Given that the bug was introduced in version 14.0.0, I am wondering if it relates to #37274

@thisisnic thisisnic added the Priority: Blocker label (Marks a blocker for the release) Jan 30, 2024
@thisisnic
Member

@felipecrv Mind giving this a quick skim over? I'm not sure whether this is at the R or C++ layer, but one of the places in the codebase that comes up when I grep for the "array cannot contain more than" error was updated in #35345, which appears to have been merged after this error appeared, so I figure you might be more familiar with that area of the codebase than us, even if it's not related to this?

@felipecrv
Contributor

@thisisnic sure. I'm going to take a look now.

@felipecrv
Contributor

Capacity error: array cannot contain more than 2147483646 bytes, have 2147489180

log(2147483646 + 2, 2) is 31, so this is a check coming from BaseBinaryBuilder<T>, which uses sizeof(offset_type) in the memory_limit() method. FixedSizeBinaryBuilder can produce the same message, but it always uses int64_t, as these arrays don't need an offsets buffer.

The regular string builder is based on BaseBinaryBuilder<T> and uses 32 bits for the offsets array, so it requires all the concatenated strings to be addressable with 31 bits (~2 GB). But the bug report says the column type is LargeString, which uses 64-bit offsets! It shouldn't be a problem.

Digging into the Parquet reader code, I find...

      // XXX: if a LargeBinary chunk is larger than 2GB, the MSBs of offsets
      // will be lost because they are first created as int32 and then cast to int64.

...leading to a commit from November:

commit b4a0751effe316aee1a0fd80fb1c444ecd6842c5
Author:     Antoine Pitrou <...>
AuthorDate: Mon Nov 23 14:53:56 2020 +0100

    ARROW-10426: [C++] Allow writing large strings to Parquet

    Large strings are still read back as regular strings.

    Closes #8632 from pitrou/ARROW-10426-parquet-large-binary

    Authored-by: Antoine Pitrou <antoine@python.org>

This is from PR #8632, which fixed issue #26405 about writing LargeString to Parquet, but it doesn't fix reading them back. I suppose the value in this is that C++/R/Python scripts can produce files that the Java Parquet reader can read without problems.

Next steps:

  • Change this issue to "[C++] Parquet reader is unable to read LargeString columns"
  • Get help/tips from @pitrou @mapleFU on how to fix this
  • Prepare a fix PR

@thisisnic
Member

thisisnic commented Feb 6, 2024

Thanks for figuring that one out @felipecrv. I've updated the title to reflect that it's a C++ bug, though I don't have the capacity to implement the fix myself.

@thisisnic thisisnic changed the title [R] read_parquet error - unable to read in large parquet files that previously opened [C++] read_parquet error - unable to read in large parquet files that previously opened Feb 6, 2024
@thisisnic thisisnic changed the title [C++] read_parquet error - unable to read in large parquet files that previously opened [C++] Parquet reader is unable to read LargeString columns Feb 6, 2024
@mapleFU
Member

mapleFU commented Feb 6, 2024

arrow 14.0.0.2

Do you mean 14.0.2 or something else? I did a bug fix on 14.0.2 (#38784); not sure whether this is related.

@thisisnic
Member

It could be related; we haven't been able to release 14.0.2 to CRAN, so it'll be 14.0.0.2, which is 14.0.0 with some R patches.

@felipecrv
Contributor

arrow 14.0.0.2

Do you mean 14.0.2 or something else? I did a bug fix on 14.0.2 (#38784); not sure whether this is related.

Haha. I would never have guessed that the perf regression fix also fixes reading large binary arrays :)

@mapleFU
Member

mapleFU commented Feb 6, 2024

Sigh, see #38577

The binary reading behavior changed, which caused both higher memory allocation and a regression...

@felipecrv
Contributor

The binary reading behavior changed, which caused both higher memory allocation and a regression...

Sure, but @nicki-dese probably has enough RAM to read the file they want to read. Is reading it possible since #38784?

@mapleFU
Member

mapleFU commented Feb 6, 2024

Capacity error: array cannot contain more than 2147483646 bytes, have 2147489180

@felipecrv this error is raised from BinaryBuilder or similar, which limits the string/binary size to 2 GB 🤔 Maybe it's not related to RAM.

@felipecrv
Contributor

felipecrv commented Feb 7, 2024

@mapleFU I understand the error is raised from BaseBinaryBuilder<T> (the superclass of StringBuilder). The issue is not an inability to allocate more than 2 GB of RAM; the issue is that a StringArray can't address more than 2 GB of data from its offsets buffer (32-bit offsets). The Parquet reader should figure out how to read these strings into a LargeStringArray, which requires producing 64-bit offsets that point into the data buffer of the resulting LargeStringArray. It might be as easy as introducing the right builder class in the right place, or it might require annoying changes across many layers.
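
To illustrate the offsets difference (a standalone sketch, not the reader's actual code path): StringBuilder, which backs StringArray with 32-bit offsets, returns exactly this CapacityError once the accumulated character data passes its ~2^31-byte limit, while LargeStringBuilder, with 64-bit offsets, keeps going.

    #include <arrow/api.h>

    #include <iostream>

    int main() {
      const std::string value(64 << 20, 'x');  // one 64 MiB string

      // 32-bit offsets (StringArray): Append fails once the accumulated
      // character data would exceed BaseBinaryBuilder's 2147483646-byte limit.
      arrow::StringBuilder builder32;
      arrow::Status st;
      for (int i = 0; i < 40 && st.ok(); ++i) {
        st = builder32.Append(value);          // fails around the 32nd append
      }
      std::cout << st.ToString() << std::endl;
      // -> Capacity error: array cannot contain more than 2147483646 bytes, ...

      // 64-bit offsets (LargeStringArray): the same > 2 GB of data is fine.
      arrow::LargeStringBuilder builder64;
      for (int i = 0; i < 40; ++i) {
        if (!builder64.Append(value).ok()) return 1;
      }
      std::shared_ptr<arrow::Array> out;
      if (!builder64.Finish(&out).ok()) return 1;
      std::cout << "large_utf8 array with " << out->length() << " values" << std::endl;
      return 0;
    }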

@mapleFU
Member

mapleFU commented Feb 8, 2024

@felipecrv A previous patch (#35825) that wanted to address this was never merged, sigh. Maybe we can re-check that.

@nicki-dese
Author

Hi @felipecrv - thank you for looking at this. I have 128 GB of RAM, which is enough RAM to read the file. I have successfully read the file since rolling back to 13.0.0.1.

Regarding testing since #38784, unfortunately the nature of the closed environment I'm in does not allow me to test anything until it's released on CRAN, and the current CRAN version (14.0.0.2 in R) is the one where I found the bug.

@felipecrv
Contributor

Hi @felipecrv - thank you for looking at this. I have 128 GB of RAM, which is enough RAM to read the file.

Plenty of RAM :) The problem is not memory allocation per se, but the representation instantiated by the Parquet reader being based on 32-bit integers.

Regarding testing since #38784

The actual fix would be the attempt linked by @mapleFU in the last message: #35825. It looks like a PR that was abandoned by the author.

@mapleFU
Member

mapleFU commented Jun 13, 2024

Sorry for the late reply, @nicki-dese. Has this problem been solved with a later R release or 14.0.2? If not, I'll dive into it now.

@raulcd
Member

raulcd commented Jun 26, 2024

Is this still considered a blocker for the release?
@thisisnic @assignUser @jonkeane

@jonkeane
Member

We should get @mapleFU's and @felipecrv's input on this: from reading, there's an issue in our C++ Parquet reader that is causing this. From reading #41104 and #35825, there's a proposal, but it doesn't look close enough to finished that it'll make it into the release (please do correct me if I'm wrong on that!).

I'm going to remove the blocker label so we don't hold up this release, though we absolutely should fix this issue in the Parquet reader.

@jonkeane jonkeane removed the Priority: Blocker label (Marks a blocker for the release) Jun 26, 2024
@mapleFU
Member

mapleFU commented Jun 26, 2024

We should get @mapleFU's and @felipecrv's input on this: from reading, there's an issue in our C++ Parquet reader that is causing this. From reading #41104 and #35825, there's a proposal, but it doesn't look close enough to finished that it'll make it into the release (please do correct me if I'm wrong on that!).

Personally I think this is likely to have been solved in 14.0.2. The issue is that the file can be read in 13.0, so it could have been handled by #38784. But I haven't got enough input on this :-(

@jonkeane
Member

Personally I think this is likely to have been solved in 14.0.2. The issue is that the file can be read in 13.0, so it could have been handled by #38784. But I haven't got enough input on this :-(

Thanks for the update! Yeah, I tried a gentle nudge there too; let's see if that goes forward for the next release (17.0.0)?

@mapleFU
Member

mapleFU commented Jun 26, 2024

I think large columns can be read now, and for the next release we may support reading large dictionaries. If there are any reproducible bugs, just report them and let me fix them.

@raulcd
Member

raulcd commented Jun 26, 2024

OK, so I understand that this should be fixed for 17.0.0 and no further action is required. Should we close this issue then? If we want to keep the issue open, should we remove the blocker label and the 17.0.0 milestone?

@raulcd
Member

raulcd commented Jun 26, 2024

Thanks @mapleFU @jonkeane!

@jonkeane
Member

I think large columns can be read now, and for the next release we may support reading large dictionaries.

Oh, sorry, I misunderstood this! That's great that this is fixed for strings, just not dictionaries. Do we have a test for this somewhere in the C++ suite that confirms this? I could probably add one in R too, but that's a bit higher level and superfluous if we already have a C++ one.

@mapleFU
Member

mapleFU commented Jun 26, 2024

Do we have a test for this somewhere in the C++ suite that confirms this?

You're right, I think I'd better add a test case for this 🤔 The large-column read logic is tested, but I guess this behavior was left out. I'm busy on workdays these days; I'll try to construct a unit test this weekend.

Technically, Parquet has the limit below:

  • A single string value should not be greater than 2 GB

but when reading a batch or a table the total can be greater. So when reading strings, the Parquet reader uses a group of APIs to "split into small chunks" with multiple BinaryBuilder / StringBuilder instances: during building, once an incoming string would make the buffer exceed 2 GB, the reader switches to a new builder, and finally it casts the chunks to LargeString/LargeBinary (roughly as sketched below). I think this logic is already tested, but maybe a test for this corner case should be added.
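
A rough sketch of that chunk-and-cast idea (my own illustration of the description above, not the Parquet reader's actual code; the helper name and the hard-coded limit are made up):

    #include <arrow/api.h>
    #include <arrow/compute/api.h>

    #include <memory>
    #include <string>
    #include <vector>

    arrow::Result<std::shared_ptr<arrow::ChunkedArray>> BuildChunkedLargeUtf8(
        const std::vector<std::string>& values) {
      constexpr int64_t kLimit = 2147483646;  // the 2 GB limit from the error above
      arrow::ArrayVector chunks;
      arrow::StringBuilder builder;           // 32-bit offsets, per chunk
      int64_t bytes_in_chunk = 0;

      for (const auto& v : values) {
        if (bytes_in_chunk + static_cast<int64_t>(v.size()) > kLimit) {
          // The next value would overflow the 32-bit offsets: finish this
          // chunk and start a fresh builder.
          std::shared_ptr<arrow::Array> chunk;
          ARROW_RETURN_NOT_OK(builder.Finish(&chunk));
          chunks.push_back(std::move(chunk));
          bytes_in_chunk = 0;
        }
        ARROW_RETURN_NOT_OK(builder.Append(v));
        bytes_in_chunk += static_cast<int64_t>(v.size());
      }
      std::shared_ptr<arrow::Array> last;
      ARROW_RETURN_NOT_OK(builder.Finish(&last));
      chunks.push_back(std::move(last));

      // Cast the utf8 chunks to large_utf8 so callers see the requested type.
      auto chunked = std::make_shared<arrow::ChunkedArray>(std::move(chunks));
      ARROW_ASSIGN_OR_RAISE(
          arrow::Datum casted,
          arrow::compute::Cast(arrow::Datum(chunked), arrow::large_utf8()));
      return casted.chunked_array();
    }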

@felipecrv
Contributor

@mapleFU yeah. The issue stems from offsets of StringArray being cumulative.

I think a better strategy to support LargeStringArray would be reading StringView from Parquet instead of StringArray and then casting from StringView to either StringArray or LargeStringArray at the end.

The StringView arrays can use a variadic number of buffers, so before cumulative offsets reach 2GB, you can start building up another Buffer.

I have code that casts string-views on my machine (PR soon). A StringViewArray to StringArray cast can be made very efficient if only one variadic Buffer was used — something that we can try to maximize in the Parquet reader.

@nicki-dese
Author

Apologies for the delay in replying, @mapleFU. I've just tested arrow 16.1.0 and read_parquet successfully read a previously problematic file with large strings, so the bug appears fixed from my end.

@assignUser
Member

As we have a few follow-up issues for the underlying problems and the user-facing issue is fixed, I'll close this.
