Write Bloom filters between row groups instead of the end #5860
Conversation
This allows Bloom filters not to be kept in memory, which can save significant memory when writing large files
Based on the test failures, it seems the Bloom Filters are either not written, or not picked up by the readers. Not sure why that is.
Thank you @progval cc @Ted-Jiang and @jimexist I think there is a tradeoff:
Thus, given there is a tradeoff, it seems like we should at least offer a config setting for where to write the bloom filters. I don't know if the parquet bloom filter spec dictates where the bloom filters should be written, or if the ecosystem (aka parquet-java) implicitly requires them in a particular location
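The memory side of this tradeoff can be sketched with a small stdlib-only model (the filter size and row-group count below are made-up illustration values, not numbers from this PR): buffering every filter until the footer holds one filter per row group in memory, while flushing each filter right after its row group keeps at most one filter alive at a time.

```rust
// Illustrative model of the two Bloom Filter placements, not the parquet API:
// - "end": every filter is buffered in memory until the footer is written.
// - "after row group": each filter is flushed as soon as its row group closes.

const FILTER_SIZE: usize = 1 << 20; // hypothetical 1 MiB per filter
const ROW_GROUPS: usize = 100;

fn peak_buffered_bytes(flush_after_row_group: bool) -> usize {
    let mut buffered: Vec<Vec<u8>> = Vec::new();
    let mut peak = 0;
    for _ in 0..ROW_GROUPS {
        buffered.push(vec![0u8; FILTER_SIZE]); // "build" one filter
        peak = peak.max(buffered.len() * FILTER_SIZE);
        if flush_after_row_group {
            buffered.clear(); // written to the file immediately, memory freed
        }
    }
    peak
}

fn main() {
    let end = peak_buffered_bytes(false);
    let after = peak_buffered_bytes(true);
    assert_eq!(end, ROW_GROUPS * FILTER_SIZE); // grows with the number of row groups
    assert_eq!(after, FILTER_SIZE); // constant regardless of file length
    println!("end: {end} bytes, after-row-group: {after} bytes");
}
```

The other side of the tradeoff (readers preferring all filters co-located near the footer) is not captured by this model.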
When using BloomFilterPosition::AfterRowGroup this was only writing the Bloom Filter offset to a temporary clone of the metadata, causing the Bloom Filter to never be seen by readers
Force-pushed from f3e7e78 to 83b475e
Indeed, done. I believe my changes should make it easy to add an API to allow writers to trigger flushing of Bloom Filters, so they can pick a middle-ground themselves by writing all Bloom Filters for a group of row groups next to each other.
The way I read it, it allows them to be anywhere with any layout we like
I don't know about this
Thank you @progval -- this looks very nice (as always 🙏 )
The only thing I think needs to be changed is removing the new dependencies. Otherwise this PR looks ready to me
parquet/Cargo.toml (outdated)

```diff
@@ -68,6 +68,9 @@
 twox-hash = { version = "1.6", default-features = false }
 paste = { version = "1.0" }
 half = { version = "2.1", default-features = false, features = ["num-traits"] }
+
+dsi-progress-logger = { version = "0.2.4", optional = true }
```
Could you please remove these new dependencies (even though I do realize they are optional and won't be activated very often)
I think they will add some ongoing maintenance cost (keeping the dependencies updated) which I would prefer to avoid if possible
How do you feel about depending only on sysinfo
to display the RAM usage? It has a small set of dependencies
I think it would be ok
Done. It now looks like this:
```console
$ cargo run --release --features="cli sysinfo" --example write_parquet -- /tmp/test.parquet
2024-06-13 21:45:40 Writing 1000 batches of 1000000 rows. RSS = 1MB
2024-06-13 21:45:50 Iteration 260/1000. RSS = 50MB
2024-06-13 21:46:00 Iteration 518/1000. RSS = 50MB
2024-06-13 21:46:10 Iteration 772/1000. RSS = 50MB
2024-06-13 21:46:19 Done. RSS = 17MB

$ cargo run --release --features="cli sysinfo" --example write_parquet -- /tmp/test.parquet --bloom-filter-position end
2024-06-13 21:46:29 Writing 1000 batches of 1000000 rows. RSS = 1MB
2024-06-13 21:46:39 Iteration 267/1000. RSS = 451MB
2024-06-13 21:46:49 Iteration 533/1000. RSS = 791MB
2024-06-13 21:46:59 Iteration 799/1000. RSS = 1151MB
2024-06-13 21:47:07 Done. RSS = 1055MB
```
```rust
use parquet::errors::Result;
use parquet::file::properties::WriterProperties;

fn main() -> Result<()> {
```
Perhaps we could add some comments here explaining what this example is trying to show
Done, along with a Clap argument parser:
```console
$ cargo run --release --features="cli sysinfo" --example write_parquet -- -h
Writes sequences of integers, with a Bloom Filter, while logging timing and memory usage

Usage: write_parquet [OPTIONS] <PATH>

Arguments:
  <PATH>  Path to the file to write

Options:
      --iterations <ITERATIONS>  Number of batches to write [default: 1000]
      --batch <BATCH>  Number of rows in each batch [default: 1000000]
      --bloom-filter-position <BLOOM_FILTER_POSITION>  Where to write Bloom Filters [default: after-row-group] [possible values: end, after-row-group]
  -h, --help     Print help
  -V, --version  Print version
```
Improve documentation

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Thank you @progval -- this PR now looks good to me
I merged this branch up to main to resolve conflicts and I double checked that this is an additive API (rather than an API change) so I think it can be merged for inclusion in the next minor release
Shoot -- you are right, I merged this PR by accident. I will revert this change in #5932 and open a new PR to re-add it marked correctly with api-change

PR with the changes re-introduced: #5933

This was reverted and thus will not be present in the 52.1.0 release #5905

I will merge #5933 when we open for breaking API changes
Which issue does this PR close?
Closes #5859.
Rationale for this change
This allows Bloom filters not to be kept in memory, which can save significant memory when writing large files. This switches between the two layouts mentioned in the spec.
What changes are included in this PR?
This includes a script that demonstrates the memory usage.
Before the change, memory usage increases linearly up to 4.3GB of RAM; after the change, it remains constant at 55.2MB.
This is a demo of the change, just to make sure this is something we want.
In particular, this ~~breaks `arrow::arrow_writer::tests::*_bloom_filter` because they expect to read the Bloom Filters from the memory at the end except... they aren't anymore~~ (wrong, see comment). ~~So if this looks good to you, I'll add a field in `WriterProperties` to switch between the old behavior (all Bloom Filters at the end) and this one (interleaved Bloom Filters). How should I call it?~~ (done)

Are there any user-facing changes?
The layout of output files changes significantly. This may have a negative performance effect on readers expecting data locality, as Bloom Filters are now scattered across the file.
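For readers who do care about data locality, the placement can be selected per writer. A minimal sketch, assuming the builder method is named `set_bloom_filter_position` (the `BloomFilterPosition` enum with `End` and `AfterRowGroup` variants is what this PR introduces; the exact method name is an assumption):

```rust
use parquet::file::properties::{BloomFilterPosition, WriterProperties};

// Hypothetical usage sketch: opt back into the old layout (all Bloom
// Filters written together at the end of the file, before the footer)
// at the cost of buffering every filter in memory until close().
let props = WriterProperties::builder()
    .set_bloom_filter_enabled(true)
    .set_bloom_filter_position(BloomFilterPosition::End)
    .build();
```

The default in the example above is `AfterRowGroup`, matching the `--bloom-filter-position` default of the `write_parquet` example.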
This required changes to the `flushed_row_groups` return type (`Arc<T>` to `T`) and to `OnCloseRowGroup`, as we now need to mutate row groups while `SerializedRowGroupWriter` is "live" instead of just at the end in `write_metadata()` (which used to leave the structure in an inconsistent state, but it didn't matter because only `close()` called it).
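The reason the `Arc<T>` to `T` change matters can be shown with a stdlib-only sketch (the `RowGroupMeta`/`Writer` types below are hypothetical stand-ins, not the crate's actual types): owned metadata can be patched with the Bloom Filter offset after the filter bytes are flushed, whereas a shared `Arc` would only permit mutation through `Arc::get_mut` or a clone-and-swap, which is exactly how the offset previously ended up on a temporary clone that readers never saw.

```rust
// Hypothetical, simplified stand-in for row group metadata.
#[derive(Debug, PartialEq)]
struct RowGroupMeta {
    bloom_filter_offset: Option<u64>,
}

// Holding owned metadata (rather than Arc<RowGroupMeta>) lets the writer
// record the Bloom Filter offset after the row group has been closed.
struct Writer {
    flushed: Vec<RowGroupMeta>,
}

impl Writer {
    fn close_row_group(&mut self) {
        // Offset unknown at close time: the filter bytes are not written yet.
        self.flushed.push(RowGroupMeta { bloom_filter_offset: None });
    }

    fn flush_bloom_filter(&mut self, file_offset: u64) {
        // Exclusive ownership makes this in-place mutation possible;
        // with Arc<RowGroupMeta> this would silently update a clone.
        if let Some(meta) = self.flushed.last_mut() {
            meta.bloom_filter_offset = Some(file_offset);
        }
    }
}

fn main() {
    let mut w = Writer { flushed: Vec::new() };
    w.close_row_group();
    w.flush_bloom_filter(4096);
    assert_eq!(w.flushed[0].bloom_filter_offset, Some(4096));
    println!("offset recorded: {:?}", w.flushed[0].bloom_filter_offset);
}
```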