From 9b48ff4f94dc5e89592d46a119884dbb88100884 Mon Sep 17 00:00:00 2001
From: Chungmin Lee
Date: Sun, 21 Jul 2024 00:43:59 -0700
Subject: [PATCH] Add a Parquet file with column chunk key-value metadata (#49)

* Add a Parquet file with column chunk key-value metadata

This file has a single row group with 0 row and 1 column. The column
chunk has key-value metadata, with a key "foo" mapped to a value "bar".

Created with this code:

```c++
PARQUET_ASSIGN_OR_THROW(
    auto sink, arrow::io::FileOutputStream::Open(
                   "column-chunk-key-value-metadata.parquet"));
parquet::ParquetFileWriter::Open(
    sink, std::static_pointer_cast<parquet::schema::GroupNode>(
              parquet::schema::GroupNode::Make(
                  "schema", parquet::Repetition::REQUIRED,
                  {parquet::schema::PrimitiveNode::Make(
                      "column1", parquet::Repetition::OPTIONAL,
                      parquet::Type::INT32)})))
    ->AppendRowGroup()
    ->NextColumn()
    ->key_value_metadata()
    .Append("foo", "bar");
```

* Rename to match the prevalent style

* Make it 2 columns

* Update data/README.md

* Add a KeyValue entry without Value

* Update data/README.md

Co-authored-by: mwish

* Update README.md

* Update README.md

---------

Co-authored-by: mwish
---
 data/README.md                               |   3 ++-
 data/column_chunk_key_value_metadata.parquet | Bin 0 -> 400 bytes
 2 files changed, 2 insertions(+), 1 deletion(-)
 create mode 100644 data/column_chunk_key_value_metadata.parquet

diff --git a/data/README.md b/data/README.md
index 2782a93..70bfb21 100644
--- a/data/README.md
+++ b/data/README.md
@@ -51,6 +51,7 @@
 | concatenated_gzip_members.parquet | 513 UINT64 numbers compressed using 2 concatenated gzip members in a single data page |
 | byte_stream_split.zstd.parquet | Standard normals with `BYTE_STREAM_SPLIT` encoding. See [note](#byte-stream-split) below |
 | incorrect_map_schema.parquet | Contains a Map schema without explicitly required keys, produced by Presto. See [note](#incorrect-map-schema) |
+| column_chunk_key_value_metadata.parquet | two INT32 columns, one with column chunk key-value metadata {"foo": "bar", "thisiskeywithoutvalue": null} note that the second key "thisiskeywithoutvalue", does not have a value, but the value can be mapped to an empty string "" when read depending on the client |
 
 TODO: Document what each file is in the table above.
 
@@ -425,4 +426,4 @@ message hive_schema {
     }
   }
 }
-```
\ No newline at end of file
+```
diff --git a/data/column_chunk_key_value_metadata.parquet b/data/column_chunk_key_value_metadata.parquet
new file mode 100644
index 0000000000000000000000000000000000000000..bcaf871c2142edb280e24646092f0b225d5c08fd
GIT binary patch
literal 400
zcmWG=3^EjD5oHi%@BtA*3=C>2GNMe9stjzB670$OIi|DE%5GO9!L?!*Adib#h%hi{
zXci