DRAFT: Parquet 3 metadata with decoupled column metadata #242

Draft · wants to merge 1 commit into base: master
93 changes: 92 additions & 1 deletion README.md
@@ -81,7 +81,13 @@ more pages.
- Encoding/Compression - Page

## File format
This file and the [Thrift definition](src/main/thrift/parquet.thrift) should be read together to understand the format.

This file and the [Thrift definition](src/main/thrift/parquet.thrift) should be read

together to understand the format.

"MUST", RFC-2119 style, maybe

## Overall file structure

A Parquet file is a collection of binary blocks representing data and metadata.

4-byte magic number "PAR1"
<Column 1 Chunk 1 + Column Metadata>
@@ -107,12 +113,97 @@ start locations. More details on what is contained in the metadata can be found
in the Thrift definition.

Metadata is written after the data to allow for single pass writing.
This is especially useful when writing to backends such as S3.


It is a requirement for HDFS too; S3 is simply a beneficiary of a design requirement from the outset, with an append-only filesystem being the target from day 1.


Readers are expected to first read the file metadata to find all the column
chunks they are interested in. The column chunks should then be read sequentially.

![File Layout](https://raw.github.com/apache/parquet-format/master/doc/images/FileLayout.gif)
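The trailing-footer layout above implies a simple read pattern for legacy readers. Here is a minimal Python sketch of it; the function name and the 8-byte probe are illustrative, not part of the spec:

```
import struct

def read_legacy_footer(path):
    """Read the serialized legacy File Metadata from a PAR1 file.

    Sketch only: the last 8 bytes are <4-byte little-endian length><"PAR1">,
    and the Thrift-serialized FileMetaData sits immediately before them.
    """
    with open(path, "rb") as f:
        f.seek(-8, 2)                      # 2 = os.SEEK_END
        length_bytes, magic = f.read(4), f.read(4)
        if magic != b"PAR1":
            raise ValueError("not a Parquet file (missing PAR1 footer magic)")
        (meta_len,) = struct.unpack("<I", length_bytes)
        f.seek(-(8 + meta_len), 2)
        return f.read(meta_len)            # Thrift-compact-encoded FileMetaData
```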

### Parquet 3


People who are not involved in the discussion of Parquet 3 might wonder why suddenly Parquet '3'?


Parquet 3 files have the following overall structure:

```
4-byte magic number "PAR1"
4-byte magic number "PAR3"

<Column 1 Chunk 1 + Column Metadata>
@mhaseeb123 May 21, 2024

Wouldn't <Column 1 Chunk 1 + Column 1 Chunk 1 Metadata> and so on be better here according to the V3 metadata format?

Member Author

I've literally copied this from the original snippet above. This proposal doesn't change anything in the layout here, so it would be confusing if I expressed it differently, IMHO.

<Column 2 Chunk 1 + Column Metadata>
...
<Column N Chunk 1 + Column Metadata>
<Column 1 Chunk 2 + Column Metadata>
<Column 2 Chunk 2 + Column Metadata>
...
<Column N Chunk 2 + Column Metadata>
...
<Column 1 Chunk M + Column Metadata>
<Column 2 Chunk M + Column Metadata>
...
<Column N Chunk M + Column Metadata>

<File-level Column 1 Metadata v3>
Contributor

If we have file-level column metadata, we can optimize the storage of key_metadata for encrypted columns by keeping the column key_metadata here, instead of in the ColumnChunk structures.
Note: this is a storage-only optimization [O(N) instead of O(N×M)]; the reader already processes only one key_metadata object per column [O(N)] today.

Member Author

Assuming that each column uses a single key_metadata for all chunks, this sounds like a good idea.

...
<File-level Column N Metadata v3>
pitrou marked this conversation as resolved.

File Metadata v3
4-byte length in bytes of File Metadata v3 (little endian)
Contributor

I think we might want something slightly extended here. In particular, I think it is possible we want a digest (e.g. sha256) of the v3 header. This can serve two purposes:

  1. Be able to truly disambiguate the unlikely case that "PAR3" ends up here by accident (e.g. an unencoded data page where the last value is "PAR3").
  2. Additional ability to ensure contents are reliable.

Second, I think we might want to record an offset to the "full footer" containing additional index information, etc.

Contributor

Lastly, we might want to consider whether compression should be a setting; if we move enough stuff out of Thrift, this probably isn't a concern anymore.

Member Author

  1. How many users have asked to sha256-protect the header, or how likely is it to get a corrupt header in the real world? We shouldn't start making gratuitous additions that are not backed by concrete use cases.

  2. I don't know what you mean with "full footer", are you talking about the FileMetadata struct (which is still here, just below)? Something else?

  3. As for a false positive "PAR3", there is indeed a small risk, though the additional "PAR3" at the beginning of the file should help disambiguate.

  4. What does compression have to do here? I'm not following you.

Contributor
@emkornfield May 16, 2024

  1. A very small fraction likely; a lighter-weight digest is also fine. We have digests in other parts of the spec, and I think the main reason for likely not having one on the footer was to avoid compatibility issues.
  2. FileMetadata + serialized metadata like indices/bloom filters and anything we move to data pages after all the column chunks is what I mean by "Full Footer".
  3. It isn't clear to me that everyone will check the header. This adds an additional IO for not too much benefit unless the entire file is being retrieved from disk.
  4. Compressing the thrift-serialized data to minimize size if consumers want the ultimate smallest file.

Member Author

I guess we can add a CRC32 here if other people want it.


@emkornfield IIUC the digest is not to protect for corruption but to make sure we do not mistakenly read a V3 footer in a file without one if we happen to see "PAR3" bytes before the V1 footer, correct?

Contributor

I think it can serve both purposes.


I posit the latter is more important. Reading a valid encoded PAR1 file as PAR3 by accident is unlikely. But in zettabytes of data stored in parquet globally it will happen. When it happens someone is going to come here with a pitchfork.

Wherever we store a new metadata footer, its content hash needs to be stored somewhere for verification. A crc32 of the footer should be good enough.

Contributor

Wherever we store a new metadata footer, its content hash needs to be stored somewhere for verification. A crc32 of the footer should be good enough.

This is exactly what I'm suggesting; I think it solves both use cases (sha1 is probably overkill). Today there is no digest on the footer as far as I know.
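As a rough illustration of the digest idea being discussed (not part of the proposal's current text; the framing and field order here are hypothetical), a writer could append a CRC32 of the serialized v3 footer and a reader could verify it before trusting an apparent "PAR3" match:

```
import struct
import zlib

def frame_v3_footer(footer_bytes: bytes) -> bytes:
    """Hypothetical framing: footer + crc32(footer) + length + "PAR3"."""
    crc = zlib.crc32(footer_bytes) & 0xFFFFFFFF
    return footer_bytes + struct.pack("<II", crc, len(footer_bytes)) + b"PAR3"

def check_v3_footer(framed: bytes) -> bytes:
    """Verify the CRC before accepting the bytes as a genuine v3 footer."""
    assert framed[-4:] == b"PAR3"
    crc, length = struct.unpack("<II", framed[-12:-4])
    footer = framed[-12 - length:-12]
    if zlib.crc32(footer) & 0xFFFFFFFF != crc:
        raise ValueError("CRC mismatch: likely a false-positive PAR3 match")
    return footer
```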

4-byte magic number "PAR3"
Contributor

One more thought: potentially having a 16-bit or 32-bit flag map, initially set to zero, if we do want to allow for future iterations on metadata interpretation.

Member Author

Data point: we added this to the Arrow IPC format and nobody to my knowledge is making use of it.

Member

It seems to be superior to ParquetVersion of parquet-cpp in terms of representing a collection of enabled features.

Member Author

But what would be a concrete use case for those feature flags?

Member

Test if the reader is compatible and time to upgrade?

Member Author

Readers can already scan the metadata and see if they recognize all encodings, etc.
Also, what would be the point of erroring out if a feature is optional?

Contributor

There is a lot of current discussion on evolution; some of these ideas might come to fruition, some of them won't.

As a concrete example, if we have a bitmap, we no longer need different magic footer/header values for compression.

Other potential items that we might not want to close off:

  1. Move to flatbuffers in the future.
  2. Allow for discontinuous column chunks (i.e. require the offset index to read data)

Member Author

Ok, we can perhaps add 4 reserved bytes just before the magic number. But I would recommend against giving them any specific meaning until necessary.

Contributor

Sorry, for the concrete use case I meant to say "encryption" which I think we could use immediately?

Member Author

You're right that encryption would be an immediate use case (as you did in #250, btw).


File Metadata
4-byte length in bytes of File Metadata (little endian)
4-byte magic number "PAR1"
```

Unlike the legacy File Metadata, the File Metadata v3 is designed to be light-weight
to decode, regardless of the number of columns in the file. Individual column
metadata can be opportunistically decoded depending on actual needs.

This file structure is backwards-compatible. Parquet 1 readers will read and
decode the legacy File Metadata in the file footer, while Parquet 3 readers
will notice the "PAR3" magic number just before the File Metadata and will
instead read and decode the File Metadata v3.
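A minimal sketch of the probing logic this layout implies for a PAR3-aware reader; offsets follow the diagram above, and the helper name and return shape are illustrative only:

```
import struct

def locate_footers(path):
    """Return (legacy_offset, legacy_len, v3_offset, v3_len or None).

    Per the layout above: ...<FMv3><FMv3 len><"PAR3"><File Metadata><len><"PAR1">
    """
    with open(path, "rb") as f:
        f.seek(0, 2)
        size = f.tell()
        f.seek(size - 8)
        legacy_len, magic = struct.unpack("<I4s", f.read(8))
        if magic != b"PAR1":
            raise ValueError("missing PAR1 footer")
        legacy_off = size - 8 - legacy_len
        # Probe the 8 bytes just before the legacy File Metadata.
        # (Subject to the false-positive concern discussed in the thread above.)
        f.seek(legacy_off - 8)
        v3_len, v3_magic = struct.unpack("<I4s", f.read(8))
        if v3_magic != b"PAR3":
            return legacy_off, legacy_len, None, None
        return legacy_off, legacy_len, legacy_off - 8 - v3_len, v3_len
```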
Comment on lines +162 to +165

Efficient I/O can be challenging with this proposal. A reader needs to read the last 8 bytes of the file, then read 8 bytes before the legacy footer, figure out a v3 footer exists and then read that.

It would be better if the v3 metadata are at the end of the file, right before the 4-byte len + PAR1.

Member Author

Can you suggest a layout that preserves compatibility with PAR1 readers?


There are a few options:

  1. Use FileMetadata.version to introduce a new version of the metadata. Starting from the minimal change that can be done in place (DRAFT: Incremental improvements to parquet metadata #248) we can bump the version and remove columns from RowGroup and decouple the column metadata completely.
  2. Add a binary field to FileMetadata named v3_metadata with tag number 10003. This field will encode the flatbuffer/thrift representation of the new footer. This field is going to be encoded last by thrift. Readers can manually look at the tail of the file and, if they find this field, they can ignore the rest of the footer and parse these bytes only, ignoring the old-style footer altogether.

Member Author
@pitrou May 21, 2024

The goal here is twofold:

  1. achieve better metadata parsing performance for PAR3 readers
  2. keep compatibility with PAR1 readers

This is why this proposal creates a separate array of structure types: so that PAR3 readers don't have to eagerly decode those pesky columns, while letting PAR1 readers correctly access column information.

I don't think either of your two proposals is able to achieve those two goals simultaneously, are they?
(admittedly, I'm not sure I understand proposal number 2, though it seems to require hand-coded Thrift parsing, which doesn't sound like a tremendous idea)

Contributor
@emkornfield May 21, 2024

I think for 2, thrift could avoid parsing it, assuming that we still follow the pattern of a nested footer.

e.g. <field_marker 10003 and byte size><[serialized v3 metadata] + <v3 trailing bits (length, digest, feature bitmask)>"PAR3">0x0000<footer size>PAR1; as long as the byte size in the thrift header accounts for everything through PAR3 (as @alkis mentions below) it should work.

So the encoding/serialization would be manual, but on decoding, old readers should automatically drop the unknown field (it is possible some thrift implementations retain unknown fields; I know proto does), i.e. the field ID 10003 should never actually be modeled in the schema.

Note: 0x0000 is the stop field for structs, if I am reading the thrift spec correctly.

So the trade-offs of doing this approach are:

  1. A bit of extra data to be copied for readers accessing the original version.
  2. A guaranteed lower bound on the number of IO operations for V3, since it is incorporated into v2.
  3. Potentially more memory utilization when accessing the original version, if unknown fields are retained by the thrift implementation.

Effectively, for the operation as currently proposed in V3, the trade-offs are reversed.

Contributor
@tustvold May 22, 2024

With fetch latencies of that order, does the decode latency of the existing thrift payload even matter? I would be interested to see any empirical data that would suggest the current metadata structures are an appreciable bottleneck for use-cases involving a catalog. I created a mailing list thread related to this here

Member Author

Thanks for the concrete numbers. This gives us more precise insight for us to work on (though not all Parquet files are read from S3; local reads and writes are a very common use case as well). However, I'm a bit surprised by your numbers because typical guidance for S3 involves larger reads (i.e. multi-megabyte).

@alkis May 22, 2024

On S3 reads:

Guidance from S3 involves larger reads because of cost: reads from S3 typically cost an API call and bandwidth/transfer volume is free (same zone). 8MB reads will cost half the price of 4MB reads. Each connection can go up to ~100MB/sec so just transferring 4MB of data is at least 25ms. If one is reading large files, doing 100s of 8MB reads in parallel will saturate any network card - which is typically what engines do.

I posit that the vast majority of parquet encoded data is in some cloud somewhere (like >90% of all data). Hence working well with object stores (high latency, immutable files) is a requirement for any change. This is also lesson 4 from An Empirical Evaluation of Columnar Storage Formats.

With fetch latencies of that order, does the decode latency of the existing thrift payload even matter?

Yes it does. With the numbers above in mind:

  • cold reads: baseline 110ms (as above). Optimized metadata are down to 2mb and parsing in 5ms translates to 20ms + 30ms + 5ms = 55ms --> 2x speedup
  • warm reads: footer bytes are cached on disk/s3express/something-with-low-latency. It takes 5ms fetch + 40ms parse. Optimized, it takes 2ms to fetch + 5ms to parse = 7ms --> 6x speedup
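As a quick sanity check of the arithmetic above (figures copied from the comment; the 110 ms cold baseline is quoted as-is, its breakdown is not shown here):

```
# Cold read: optimized = footer fetch (20 + 30 ms) + parse (5 ms); baseline quoted as 110 ms.
cold_optimized = 20 + 30 + 5           # 55 ms
print(110 / cold_optimized)            # ~2.0x speedup

# Warm read: baseline = 5 ms fetch + 40 ms parse; optimized = 2 ms fetch + 5 ms parse.
warm_baseline, warm_optimized = 5 + 40, 2 + 5
print(warm_baseline / warm_optimized)  # ~6.4x speedup
```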


I posit that many PiB of parquet data lives in HDFS, which also has read latencies, though not as bad as S3.
S3 standard has pretty bad latency for small byte reads, plus you get billed; best to grab large amounts speculatively.

Contributor

I'm late to this thread and haven't fully read it. I just want to give my +1 that the current proposal would require too many I/O requests, which makes it basically a deal breaker for high-latency storage like S3. We would not use it because of this.

Any change that increases the number of data-dependent requests necessary to decode is basically a deal breaker for us, and I guess it is for a lot of other data lake companies as well.


### Parquet 3 without legacy metadata
Member Author

@wgtmac @gszadovszky Let me know if it is a good idea to even mention this :-)

Contributor

my 2 cents I think at some point we want to avoid the duplication.

Member

I agree. There will be a time when people will agree to remove the old metadata. I added the "PAR1" magic number specifically for this :)


It is possible, though not recommended for the time being, to produce Parquet 3
files without the legacy metadata blocks:
```
4-byte magic number "PAR3"

<Column 1 Chunk 1 + Column Metadata>
<Column 2 Chunk 1 + Column Metadata>
...
<Column N Chunk 1 + Column Metadata>
<Column 1 Chunk 2 + Column Metadata>
<Column 2 Chunk 2 + Column Metadata>
...
<Column N Chunk 2 + Column Metadata>
...
<Column 1 Chunk M + Column Metadata>
<Column 2 Chunk M + Column Metadata>
...
<Column N Chunk M + Column Metadata>

<File-level Column 1 Metadata v3>
...
<File-level Column N Metadata v3>

File Metadata v3
4-byte length in bytes of File Metadata v3 (little endian)


offset + length

4-byte magic number "PAR3"

How do encrypted footers work in this case?

Member Author

Hopefully, the same way they work for PAR1, but I agree this will need to be spelled out more explicitly :-)

Contributor

Yep, might need some analysis. If we continue using "PARE", can the readers know if the file is v1 or v3?
Maybe we can use something like "PRE3" instead (PaRquet Encrypted, or "PQE3" - ParQuet Encrypted, or "PEF3" - parquet encrypted footer)

Member Author

Ah, you're right. I thought the "PARE" footer was in addition to the "PAR1" footer, but apparently it replaces it. Darn.

Contributor

Would using PAR3 + bit in a bitmap to represent encryption be a reasonable approach here to reduce the proliferation of magic footer values?

Contributor

Sure, as long as we can read the bitmap before decrypting the footer

```

These files are not backwards compatible with Parquet 1 readers.
Parquet 3 readers should be prepared to read such files, so that the ecosystem
can gradually switch to this new format.
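A reader that wants to support both layouts could dispatch on the trailing magic; a hedged sketch building on the `locate_footers` helper above (both helpers are illustrative, not normative, and the encrypted "PARE" case from the Encryption section is not handled here):

```
import struct

def trailing_magic(path) -> bytes:
    """Return the last 4 bytes of the file."""
    with open(path, "rb") as f:
        f.seek(-4, 2)
        return f.read(4)

def classify_file(path) -> str:
    magic = trailing_magic(path)
    if magic == b"PAR1":
        # Legacy footer at the end; a v3 footer may still precede it (see locate_footers above).
        return "legacy-or-hybrid"
    if magic == b"PAR3":
        # Standalone Parquet 3 file: 4-byte FMv3 length + "PAR3" at the end.
        return "parquet3-only"
    raise ValueError(f"unrecognized trailing magic: {magic!r}")
```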

## Encryption

Encryption with footer encryption enabled changes the above file structure slightly.
In particular, the "PAR1" magic number is replaced with "PARE".
Contributor
@ggershinsky May 22, 2024

See the [encryption specification](Encryption.md) for details.

## Metadata
There are three types of metadata: file metadata, column (chunk) metadata and page
header metadata. All thrift structures are serialized using the TCompactProtocol.
208 changes: 208 additions & 0 deletions src/main/thrift/parquet.thrift

The biggest culprit in parsing metadata is Statistics, because every one of the values is a variable-length binary field. We could improve Statistics trivially and in place if we add a few fixed-size fields:

struct Statistics {
   /**
    * DEPRECATED
    */
   1: optional binary max;
   2: optional binary min;
   /** count of null value in the column */
   3: optional i64 null_count;
   /** count of distinct values occurring */
   4: optional i64 distinct_count;
   /**
    * Only one pair of min/max will be populated. For fixed sized types, one of the minN/maxN variants will be used. Otherwise min_value/max_value is used.
    */
   5: optional binary max_value;
   6: optional binary min_value;

   7: optional byte max1;
   8: optional byte min1;
   9: optional i32 max4;
   10: optional i32 min4;
   11: optional i64 max8;
   12: optional i64 min8;
}

Member Author

Since that structure is deprecated, do we really want to "improve" it, or should we focus our efforts on non-deprecated structures such as ColumnIndex?

More generally, this should probably be discussed in a separate ML thread, for better separation of concerns.


I wasn't aware the whole structure was deprecated. I thought only the max and min fields are deprecated.

Member

I agree this is a separate topic. I still want to say that Apache ORC has defined separate statistics for different types: https://github.com/apache/orc-format/blob/6e30e63f5071b616ec5cedaac7b2e0a98ae377c5/src/main/proto/orc/proto/orc_proto.proto#L87-L99. But the dirty thing is that we might still need some sort of encoding for data types which do not have direct representation from protobuf or thrift, e.g. decimal type.

Member Author

Having an encoding shouldn't be a problem as long as the encoding is fast enough to decode (e.g. PLAIN), should it?

Contributor

Agree separate topic. I think you still need variable length types that rely on byte_array

Member

I agree with you. Defining specific statistics for different data types adds complexity as well.

Contributor

Would a union work here?

Member Author

I wasn't aware the whole structure was deprecated. I thought only the max and min fields are deprecated.

Ahah, sorry. I think you're right actually.

Member

I thought only the max and min fields are deprecated.

Sigh, an unrelated issue is that currently min/max might still be written even if min_value and max_value are being provided.

@@ -467,6 +467,35 @@ struct SchemaElement {
10: optional LogicalType logicalType
}

struct SchemaElementV3 {
Contributor
@emkornfield May 16, 2024

Antoine, on the question of keeping implementation simpler, would it pay to not revise this and just reuse the existing one?


+1 these changes are not necessary, we should mark the relevant bits in SchemaElement deprecated instead.

Member Author

We can, of course, but deprecated fields will probably never be removed from the format (except perhaps in 20 years).


Are deprecated fields a problem?

Assuming a writer writes them, the writer wastes cpu and bytes.
A reader can choose to parse them or ignore them (by removing/commenting them out in the .thrift file). The latter means the deprecated fields will be ignored by the thrift parser.

Member Author

Mandating that readers remove fields from the official Parquet Thrift file sounds like a complicated and error-prone arrangement.

@alkis May 21, 2024

There is no mandate. The official parquet thrift will comment them out.

  1. Writers compiled with old version of the official thrift file may write the fields.
  2. Writers compiled with new version of the official thrift file won't write the fields.
  3. Readers compiled with old version of the official thrift file may read the fields.
  4. Readers compiled with new version of the official thrift file will ignore the fields.

Member Author

The problem is that, for any given implementation, we'll have either 1+3 (compiled with old version: backwards compatibility but no perf improvement when reading), or 2+4 (compiled with new version: better performance on read, but backwards compatibility is lost).

This doesn't satisfy the requirements we're trying to satisfy here.


Oh I see. The problem is old readers and new writers.

Contributor

I'm OK with no cleanup of deprecated fields. Marking them semantically as logically required for PAR1 footers seems reasonable to me. I hope at some point standalone PAR3 would become the default and they would not need to be written if they present meaningful overhead once that happens.

from a wire compatibility perspective required->optional should be forward/backward compatible.

/** Data type for this field. */
1: optional Type type;

/** If type is FIXED_LEN_BYTE_ARRAY, this is the byte length of the values.
*
* CHANGED from v1: this must be omitted for other column types.
*/
2: optional i32 type_length;

/** repetition of the field. */
3: optional FieldRepetitionType repetition_type;

/** Name of the field in the schema */
4: required string name;

/** Nested fields. */
5: optional i32 num_children;

/** CHANGED from v1: from i32 to i64
Contributor

have you encountered a use-case for this? 2 Billion field ID seems quite high?


Same question!


Same. I don't think this change is needed.

Member Author

Ok. It seemed a bit arbitrary to limit the width of this, but I can undo the change.

Contributor

I think it is backwards compatible and, in the long run, it probably isn't a big deal either way. I was just curious if there was a reason?

Member Author

No concrete reason actually :-)

*/
6: optional i64 field_id;

/** The logical type of this SchemaElement */
7: optional LogicalType logicalType

/** REMOVED from v1: converted_type, scale, precision (obsolete) */
}

/**
* Encodings supported by Parquet. Not all encodings are valid for all types. These
* enums are also used to specify the encoding of definition and repetition levels.
@@ -835,6 +864,65 @@ struct ColumnMetaData {
16: optional SizeStatistics size_statistics;
}

struct ColumnChunkMetaDataV3 {

Similarly to SchemaElement, we can reuse the existing struct and deprecate the useless fields.


Also, don't we need an index into the list of SchemaElement to reference it?

Member Author

Unless I'm misunderstanding something, there should be one FileColumnMetadataV3 element per leaf SchemaElement.


Having an index referencing a SchemaElement means that:

  1. a writer can skip encoding columns that do not have values in a rowgroup range
  2. a writer can encode/write columns in different order than metadata

(1) is important when schemata are very wide but data is sparse.

Member Author

Well, ignoring the fact that Parquet is currently not a sparse format, your proposal implies that readers have to do a O(n) search to find a given column?


readers have to do a O(n) search to find a given column?

Why would they need an O(n) search? The index indexes an SchemaElement[] in java or std::vector<SchemaElement> in C++ which is O(1).

Member Author

Ok, let's step back because I think I am not following you.

Here is the file organization as proposed here:

  1. FileMetaDataV3 points to one FileColumnMetadataV3 per column
  2. FileColumnMetadataV3 points to one ColumnChunkV3 per row group

I'm not sure what you're proposing exactly when you mean "can skip encoding columns that do not have values"? This is currently not possible using Parquet, as at least the corresponding definition levels would be expected (even if they are all 0).

Contributor

Well, ignoring the fact that Parquet is currently not a sparse format, your proposal implies that readers have to do a O(n) search to find a given column?

IIUC, finding a column via schema elements today is also O(N), assuming no nesting. I think the difference is that today the first thing implementations do is create an efficient dictionary structure to amortize lookup of further columns.

I think if we want fast lookups without building any additional dictionaries in memory we should be considering a new stored index structure (or reconsider how we organize schema elements instead of a straight BFS).

/** REMOVED from v1: type (redundant with SchemaElementV3) */
/** REMOVED from v1: encodings (unnecessary and non-trivial to get right) */
/** REMOVED from v1: path_in_schema (unnecessary and wasteful) */
/** REMOVED from v1: index_page_offset (unused in practice?) */

/** Compression codec **/
1: required CompressionCodec codec

This could go in the SchemaElement since it should not in principle vary between row groups.

Member Author

Well, right now it's possible to make it vary. I don't know if any writers make use of this possibility, but I'm not sure there's any harm in keeping it.

Member

Enforcing the same codec for all row groups would prohibit fast merging of row groups from different parquet files without rewriting chunk data. So I vote for keeping it as is.


/** Number of values in this column chunk **/
2: required i64 num_values

/** total byte size of all uncompressed pages in this column chunk (including the headers) **/
3: required i64 total_uncompressed_size

/** total byte size of all compressed, and potentially encrypted, pages
* in this column chunk (including the headers) **/
4: required i64 total_compressed_size

/** Optional key/value metadata for this column chunk.
** CHANGED from v1: only use this for chunk-specific metadata, otherwise
** use `FileColumnMetadataV3.key_value_metadata`.
**/
5: optional list<KeyValue> key_value_metadata

/** Byte offset from beginning of file to first data page **/
6: required i64 data_page_offset

/** Byte offset from the beginning of file to first (only) dictionary page **/
7: optional i64 dictionary_page_offset

/** optional statistics for this column chunk */
8: optional Statistics statistics;

/** Set of all encodings used for pages in this column chunk.
* This information can be used to determine if all data pages are
* dictionary encoded for example **/
9: optional list<PageEncodingStats> encoding_stats;

/** Byte offset from beginning of file to Bloom filter data. **/
10: optional i64 bloom_filter_offset;

/** Size of Bloom filter data including the serialized header, in bytes.
* Added in 2.10 so readers may not read this field from old files and
* it can be obtained after the BloomFilterHeader has been deserialized.
* Writers should write this field so readers can read the bloom filter
* in a single I/O.
*/
11: optional i32 bloom_filter_length;

/**
* Optional statistics to help estimate total memory when converted to in-memory
* representations. The histograms contained in these statistics can
* also be useful in some cases for more fine-grained nullability/list length
* filter pushdown.
*/
12: optional SizeStatistics size_statistics;
}

struct EncryptionWithFooterKey {
}

@@ -885,6 +973,44 @@ struct ColumnChunk {
9: optional binary encrypted_column_metadata
}

struct ColumnChunkV3 {
/** File where column data is stored. **/
1: optional string file_path
Contributor

Do we want to keep this concept of having the metadata in a separate file? I did not see it working anywhere.

Member Author

I think we do support it in PyArrow (see tests here and here), and I think Dask makes use of it.

@jorisvandenbossche @mrocklin Am I right?

Member

This should be the summary metadata file, IIUC. cc @rdblue for more context.


Deferring this question from Dask's perspective to @fjetter and @phofl


(thanks for pinging us)

Member
@rok Jun 6, 2024

PyArrow provides write_metadata and parquet_dataset for such usecases. Original PR goes more in depth.
cc @jorisvandenbossche


And the benefits of using this _metadata index file should translate to cloud object stores too by reducing the number of objects/files to be read.

not really

  1. forces a full read of all generated files in job commit, which even if done in parallel is really slow. If it were to be done, it'd be better off done on demand in the first query. (note, faster reads would improve this)
  2. it doesn't work with the cloud committer design, which #1361 formalises without doing some bridging classes.

the reason for (2) is that the hadoop cloud-native committer design kept clear of making any changes to the superclass of ParquetOutputCommitter as it is a critical piece of code in so many existing workflows, and really hard to understand. Not just a co-recursive algorithm, but two intermingled algorithms, one of which lacks the correctness guarantees (failures during task commit can be recovered from).

with a move to table based formats rather than directory trees, that whole commit process becomes much easier as well as supporting atomic job commits on a table (including deletes!). And as you note, these formats can include schema info too.

Member
@jorisvandenbossche Jun 6, 2024

For context, AFAIK the _metadata summary file was a practice originally used in Spark (and supported by Parquet-mr), and, inspired by that, it was for example also adopted in Dask. We then implemented support for this in Arrow C++ / PyArrow mostly based on the Dask usage (as a downstream user of pyarrow). In the meantime though, Spark disabled writing those files by default a long time ago (https://issues.apache.org/jira/browse/SPARK-15719), and Dask also stopped doing this 2 years ago (dask/dask#8901).

Another Parquet dev mailing list thread with some discussion about this: https://lists.apache.org/thread/142yj57c68s2ob5wkrs80xsjoksm7rb7


@adamreeve I see so the parquet file is one with all the metadata and all the data is in files pointed to by this singleton.


And the benefits of using this _metadata index file should translate to cloud object stores too by reducing the number of objects/files to be read.

not really

Sorry I couldn't really follow this argument, that sounds like a Hadoop specific problem. To me a cloud object store means something like S3, and for our use case we're mostly concerned with reducing the number of objects that need to be read to satisfy read queries that filter data and don't need to read all files in a dataset, as we have many concurrent jobs running and adding load on storage.

Another Parquet dev mailing list thread with some discussion about this: https://lists.apache.org/thread/142yj57c68s2ob5wkrs80xsjoksm7rb7

Much of the discussion there seems to be related to issues users can run into if doing things like overwriting Parquet files or having heterogeneous schemas, which this feature was not designed for. But it sounds like others have also found this feature useful. I think this quote from Patrick Woody matches our experience: "outstandingly useful when you have well laid out data with a sort-order"

@adamreeve I see so the parquet file is one with all the metadata and all the data is in files pointed to by this singleton.

Yes, exactly.

The _metadata file format could have been designed so that the file_path field wasn't needed in the column chunk metadata. But it's there now and provides value to users while adding minimal overhead to those not using it (missing fields require zero space in serialized messages if I've understood the Thrift Compact Protocol correctly).
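For context, a sketch of the _metadata summary-file pattern being discussed, using the pyarrow APIs mentioned above (write_metadata and pyarrow.dataset.parquet_dataset); paths and schema are made up, and exact keyword support depends on the pyarrow version:

```
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

table = pa.table({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})

# Write a dataset while collecting the per-file FileMetaData objects
# (each column chunk records its file_path relative to the dataset root).
collected = []
pq.write_to_dataset(table, root_path="dataset_root", metadata_collector=collected)

# Combine the collected metadata into a single _metadata summary file.
pq.write_metadata(table.schema, "dataset_root/_metadata", metadata_collector=collected)

# A reader can then plan the whole dataset from the summary file alone.
dataset = ds.parquet_dataset("dataset_root/_metadata")
print(dataset.to_table().num_rows)
```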


/** Byte offset in file_path to the ColumnChunkMetaDataV3, optionally encrypted
** CHANGED from v1: renamed to metadata_file_offset
**/
2: required i64 metadata_file_offset

/** NEW from v1: Byte length in file_path of ColumnChunkMetaDataV3, optionally encrypted
**/
3: required i32 metadata_file_length
Comment on lines +985 to +987
Contributor

Thank you :)

The naming is a little awkward...I read these as "metadata-file offset"/"metadata-file length". Perhaps instead just "metadata_offset" and "metadata_length"?

Aside: just curious, but in the current format how do you make use of the file_offset? Is there a way to deduce the length of the metadata? Or do you have to use a file based reader and seek to the offset?

Member

BTW, do we still want to put a copy of ColumnMetaData at the end of the column chunk and another copy here at 4: optional binary encoded_metadata? I know it is good to keep backward compatibility. Does any implementation actually read it from the end of the column chunk?


/** REMOVED from v1: meta_data, encrypted_column_metadata.
** Use encoded_metadata instead.
**/

/** NEW from v1: Column metadata for this chunk, duplicated here from file_path.
** This is a Thrift-encoded ColumnChunkMetaDataV3, optionally encrypted.
**/
4: optional binary encoded_metadata
pitrou marked this conversation as resolved.

/** File offset of ColumnChunk's OffsetIndex **/
5: optional i64 offset_index_offset

/** Size of ColumnChunk's OffsetIndex, in bytes **/
6: optional i32 offset_index_length

/** File offset of ColumnChunk's ColumnIndex **/
7: optional i64 column_index_offset

/** Size of ColumnChunk's ColumnIndex, in bytes **/
8: optional i32 column_index_length

/** Crypto metadata of encrypted columns **/
9: optional ColumnCryptoMetaData crypto_metadata
Contributor

not needed if we keep this in FileColumnMetadataV3


Agreed especially if it can't be different for individual chunks

}

struct RowGroup {
/** Metadata for each column chunk in this row group.
* This list must have the same order as the SchemaElement list in FileMetaData.
@@ -914,6 +1040,32 @@ struct RowGroup {
7: optional i16 ordinal
}

struct RowGroupV3 {
/** REMOVED from v1: columns.
* Instead, decode each FileColumnMetadataV3 individually as needed.
*/

/** Total byte size of all the uncompressed column data in this row group **/
1: required i64 total_byte_size
Contributor

I'm not sure if this field is particularly useful, as it doesn't account for encodings. We might want a different field name with a clearer meaning.

Member Author
@pitrou May 16, 2024

This field was introduced in 2013, I'm not sure anyone still has a memory of the reasons. The author is named Nong Li, without a corresponding GitHub account. @nongli Is it you?
Also ping @julienledem, who was there too.

Member Author
@pitrou May 16, 2024

It seems to be used by parquet-mr's ParquetInputFormat.getSplits, which hasn't changed much since 2014. It doesn't seem to be used elsewhere in parquet-mr, and isn't used by Parquet C++ at all.

Contributor

I guess it might be used as a proxy for the actual data size, but it isn't a good one; maybe we can keep it and also add another, better field.

Member Author

I'm ok with removing this, FTR, I would just like to make sure this isn't actually useful for some unexpected use case.

Member

I believe it was added before that; this commit is just referencing when we created the separate parquet-format repo. It might well be part of the original metadata. The redfile.thrift name refers to the old RedElm name before we renamed it Parquet. It is the size of the decompressed (after decompression, before decoding) page. Agreed it's not super useful.


/** Number of rows in this row group **/
2: required i64 num_rows

/** If set, specifies a sort ordering of the rows in this row group. */
3: optional list<SortingColumn> sorting_columns
Contributor

I wonder if this should be per file. I guess the downside is the expense of concatenating files with different sort orders.

Member Author

I have no idea. Not knowing the precise use cases for this, I would err on the side of safety and not change this.

Contributor

I was going to make the same comment. I would imagine the point of sorting_columns is to tip off a query engine as to which columns can be used for filtering (and perhaps also columns used as a compound key?). I can't see how it would make sense for this to vary per row group. Since we're blowing things up, I'd vote to move this to FileMetaData.

Contributor

I think this is kind of the same rationale as the codecs (easily being able to concatenate row groups with different sort orders)

Member

I think this is kind of the same rationale as the codecs (easily being able to concatenate row groups with different sort orders)

Yes, but it can be worked around by simply dropping sort order metadata if row groups have inconsistent values. This is different to codec.

Contributor

That is fair, but why lose the per-row-group information? It seems some engines could make use of this, although it is probably superseded by page indexes.


/** REMOVED from v1: file_offset.
* Use the OffsetIndex for each column instead.
Member

I believe this is useful when we want to estimate the whole read range but do not want to read offset index.

Member Author
@pitrou May 16, 2024

So my question would be: how do you estimate the read range if you only have the file offset, but not the on-disk length? Using total_compressed_size perhaps?

Contributor

total_compressed_size should give the range I think.

Member

IIUC, the current spec does not prohibit placing column chunks in random order. So we must use file_offset together with total_compressed_size to determine the read range. This is a trick used to place small column chunks together which may be fetched in a single I/O.

*/

/** Total byte size of all compressed (and potentially encrypted) column data
* in this row group **/
4: optional i64 total_compressed_size

/** Row group ordinal in the file **/
5: optional i16 ordinal
}

/** Empty struct to signal the order defined by the physical or logical type */
struct TypeDefinedOrder {}

@@ -1165,6 +1317,62 @@ struct FileMetaData {
9: optional binary footer_signing_key_metadata
}

/** Metadata for a column in this file. */
struct FileColumnMetadataV3 {
/** All column chunks in this file (one per row group) **/
Contributor

add in this column in the comment? also, rename the field to chunks?

/** Metadata for a column in this file. */
struct FileColumnMetadataV3 {
  /** All column chunks in this column (in this file - one per row group) **/
  1: required list<ColumnChunkV3> chunks

1: required list<ColumnChunkV3> columns
shangxinli marked this conversation as resolved.

/** Sort order used for the Statistics min_value and max_value fields
**/
2: optional ColumnOrder column_order;

/** NEW from v1: Optional key/value metadata for this column at the file level
**/
3: optional list<KeyValue> key_value_metadata
Contributor

We might want to model this as two data pages (key and value), or follow the suggestion above of simply using a page to individually store the encodings.

Member Author
@pitrou May 16, 2024

This is file-level metadata. Do we have reasons to believe there will be enough key-values to warrant the complexity of using dedicated data pages for this?

The more we deviate from the Parquet 1 file organization, the more work it will create for implementors and the more potential for incompatibilities and bugs.

We should perhaps ask on the ML for opinions...

Edit: started a discussion on https://lists.apache.org/thread/9z4o0zbz34lt5jtllzfrr4gmxczddqyb

Contributor

it is mostly due to the fact that I think we want to follow a policy of "lazy" decoding as much as possible.

Member Author

I'm not convinced that "lazy as much as possible" is that desirable. "Lazy where necessary" sounds more reasonable to me. Hence my question about typical metadata sizes.

Member

Putting all user-defined metadata in a list is subject to limitations from thrift. That's why we have to care about its size. Switching it to separate "page" or something may unblock other potentials like putting extra user-defined secondary index. For now, we can only choose a "black hole" from somewhere in the file and put its offset/length pair into the key_value_metadata if we want to add custom index.

Member Author

Putting all user-defined metadata in a list is subject to limitations from thrift. That's why we have to care about its size.

Ok, but is it a practical concern? In Parquet C++ we have:

constexpr int32_t kDefaultThriftStringSizeLimit = 100 * 1000 * 1000;
// Structs in the thrift definition are relatively large (at least 300 bytes).
// This limits total memory to the same order of magnitude as
// kDefaultStringSizeLimit.
constexpr int32_t kDefaultThriftContainerSizeLimit = 1000 * 1000;

For now, we can only choose a "black hole" from somewhere in the file and put its offset/length pair into the key_value_metadata if we want to add custom index.

Well, you could also have a special-named column with 1 defined BYTE_ARRAY value for the piece of metadata you care about (or you could also model it more finely using Parquet types).
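
For added context (not part of the PR): those defaults are also adjustable per reader in Arrow's Parquet C++, so very large footer metadata is not a hard wall. A minimal sketch, assuming the current `parquet::ReaderProperties` API; the file name is hypothetical:

```cpp
#include <memory>

#include "parquet/file_reader.h"
#include "parquet/properties.h"

int main() {
  // Relax the Thrift decoding limits for a file whose footer carries
  // unusually large key/value metadata (the values here are illustrative).
  parquet::ReaderProperties props;
  props.set_thrift_string_size_limit(500 * 1000 * 1000);
  props.set_thrift_container_size_limit(10 * 1000 * 1000);

  // "large_metadata.parquet" is a hypothetical file name.
  std::unique_ptr<parquet::ParquetFileReader> reader =
      parquet::ParquetFileReader::OpenFile("large_metadata.parquet",
                                           /*memory_map=*/false, props);

  // The footer (including key_value_metadata) is decoded with the relaxed
  // limits; column data is still read lazily as usual.
  return 0;
}
```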

}
Contributor:

We could add the column encryption key_metadata here:

/** Crypto metadata of encrypted column **/
  4: optional ColumnCryptoMetaData crypto_metadata


struct FileMetaDataV3 {
/** Version of this file **/
1: required i32 version

/** Parquet schema for this file **/
2: required list<SchemaElementV3> schema;
Contributor (@emkornfield, May 16, 2024):

Lists cannot really be decoded in a random-access manner. I suggest that in V3 we consider moving any list elements to a page of type BYTE_ARRAY where each element is a serialized struct (Thrift, or something else if we choose to move away from it).

Contributor:

For heavily nested lists we might want to separate out type-specific fields.

Member Author:

Does it matter here?

Contributor:

No, not here. I think I need to do a more formal audit to see what makes sense.

Skip lists do improve random-access I/O performance on lists, or some variant where, as the element list is built up, an offset to a later list element is added. But I don't know of any easy way to do that here.

Contributor:

My concrete suggestion is to introduce a page encoding that supports random access (#250), which IIUC is similar to the approach described here for columns but allows the solution to be generalized more easily.

Member Author:

If you build random access at the encoding level, then how about the compression step?
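
To make the shape of such a random-access page concrete, here is a minimal, purely illustrative sketch (not part of this proposal or of #250): a count, N+1 byte offsets, then the concatenated serialized entries, so entry i can be located without decoding the others.

```cpp
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical offset-indexed page layout (illustrative only):
// [int32 count][count + 1 int32 offsets][concatenated entry bytes]
// Offsets are relative to the start of the payload section.
struct RandomAccessPage {
  std::vector<uint8_t> buffer;

  static RandomAccessPage Build(const std::vector<std::string>& entries) {
    RandomAccessPage page;
    const int32_t count = static_cast<int32_t>(entries.size());
    std::vector<int32_t> offsets(count + 1, 0);
    for (int32_t i = 0; i < count; ++i) {
      offsets[i + 1] = offsets[i] + static_cast<int32_t>(entries[i].size());
    }
    auto append = [&page](const void* src, size_t n) {
      const auto* p = static_cast<const uint8_t*>(src);
      page.buffer.insert(page.buffer.end(), p, p + n);
    };
    append(&count, sizeof(count));
    append(offsets.data(), offsets.size() * sizeof(int32_t));
    for (const std::string& e : entries) append(e.data(), e.size());
    return page;
  }

  // Returns the raw bytes of entry i without decoding any other entry.
  std::string Get(int32_t i) const {
    int32_t count = 0;
    std::memcpy(&count, buffer.data(), sizeof(count));
    int32_t begin = 0, end = 0;
    std::memcpy(&begin, buffer.data() + sizeof(count) + i * sizeof(int32_t), sizeof(begin));
    std::memcpy(&end, buffer.data() + sizeof(count) + (i + 1) * sizeof(int32_t), sizeof(end));
    const size_t payload = sizeof(count) + (count + 1) * sizeof(int32_t);
    return std::string(reinterpret_cast<const char*>(buffer.data()) + payload + begin,
                       static_cast<size_t>(end - begin));
  }
};
```

Note that this only gives random access within an uncompressed (or already decompressed) page; if the page is compressed as a whole, the compression step still has to be undone first, which is exactly the concern raised above.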


/** Number of rows in this file **/
3: required i64 num_rows

/** Row groups in this file **/
4: required list<RowGroupV3> row_groups

As per the discussion above, we would really like to move away from using a list for the row_groups so that individual row groups can be read in a random-access way. That is, we shouldn't have to read metadata about all the row groups if we just want a single row group from the Parquet file.

Perhaps that's true, but is this something an engine would do other than when dealing with a sampling read or a limit query?

Member Author:

Note that RowGroupV3 is heavily reduced compared to RowGroup. That said, I understand the concern.

Contributor:

> Perhaps that's true, but is this something an engine would do other than when dealing with a sampling read or a limit query?

I think yes? An engine might dispatch individual row groups to different workers if there is external metadata that lets it know the number of row groups in a file beforehand. Then again, given the reduced size of row groups, this might be too much of a micro-optimization.


/** Optional key/value metadata for this file. **/
5: optional list<KeyValue> key_value_metadata

/** String for application that wrote this file. **/
6: optional string created_by

/** NEW from v1: byte offset of FileColumnMetadataV3, for each column **/
7: required list<i64> file_column_metadata_offset;
Contributor:

Is this supposed to be at the beginning of the column chunk, or is it part of the footer?

Contributor:

I think having the column information in the footer is probably still useful.

Member Author:

By "in the footer", you mean after all the data?
In the README above, it is laid out thusly, so I suppose it is in the footer :-)

[...]
<File-level Column 1 Metadata v3>
...
<File-level Column N Metadata v3>

File Metadata v3
[...]

Though, of course, this is purely advisory and writers are free to place them elsewhere.
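
To illustrate how a reader could use the per-column offset (field 7 above) together with the per-column length (field 8, just below), here is a rough sketch of fetching a single column's file-level metadata without touching any other column's. The types and the deserialization helper are hypothetical stand-ins, not the proposal's actual generated code.

```cpp
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical stand-in for the Thrift-generated FileColumnMetadataV3.
struct FileColumnMetadataV3 { /* chunks, column_order, key_value_metadata, ... */ };

// Assumed helper: a real reader would call the Thrift deserializer here.
FileColumnMetadataV3 DeserializeFileColumnMetadataV3(const std::string& bytes) {
  (void)bytes;  // stubbed out for the sketch
  return {};
}

// Reads only the file-level metadata of `column_index`, using the
// file_column_metadata_offset / file_column_metadata_length lists that
// were already decoded from FileMetaDataV3 in the footer.
FileColumnMetadataV3 ReadOneColumnMetadata(std::ifstream& file,
                                           const std::vector<int64_t>& offsets,
                                           const std::vector<int32_t>& lengths,
                                           size_t column_index) {
  std::string buf(static_cast<size_t>(lengths[column_index]), '\0');
  file.seekg(offsets[column_index]);  // jump straight to this column's block
  file.read(&buf[0], buf.size());     // read only that block
  return DeserializeFileColumnMetadataV3(buf);
}
```

The intent of the two fields is exactly this pattern: one seek plus one bounded read per column of interest, instead of decoding every column's metadata.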

/** NEW from v1: byte length of FileColumnMetadataV3, for each column **/
8: required list<i32> file_column_metadata_length;
Member:

Is it too late to add something like the below for all offset/length pairs?

struct Reference {
  1: i64 offset
  2: i32 length
}

Contributor:

If we're looking to speed up thrift deserialization, I'd bet two int lists are going to be faster to parse than a list of structs. If the metadata objects are contiguous, maybe instead extend the offsets list by one and use deltas to calculate the lengths.

Member Author (@pitrou, May 22, 2024):

I was going on the same intuition as @etseidl's, though I don't have any precise insight into typical Thrift deserializer performance characteristics.

Mandating contiguous column metadata objects and using N+1 offsets is an intriguing idea. It could perhaps make it easier to preload many columns' metadata at once.


+1 to N+1 offsets. They are going to parse a lot faster (>2x) than structs.
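
A tiny sketch of the N+1-offsets variant discussed here, purely illustrative: with the per-column metadata blocks written contiguously, a single offsets list (one extra entry) replaces the separate offset and length lists, and each length falls out as a delta.

```cpp
#include <cstdint>
#include <vector>

// Given N+1 file offsets of contiguously written metadata blocks, the length
// of block i is the delta to the next offset. The whole range
// [offsets.front(), offsets.back()) can also be preloaded with a single read.
std::vector<int32_t> LengthsFromOffsets(const std::vector<int64_t>& offsets) {
  std::vector<int32_t> lengths;
  if (offsets.size() < 2) return lengths;
  lengths.reserve(offsets.size() - 1);
  for (size_t i = 0; i + 1 < offsets.size(); ++i) {
    lengths.push_back(static_cast<int32_t>(offsets[i + 1] - offsets[i]));
  }
  return lengths;
}
```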


/** REMOVED from v1: column_orders.
** Use `FileColumnMetadataV3.column_order` instead.
**/

Member:

+1

Contributor (@emkornfield, May 27, 2024):

Updated. I misread where this was being moved to; this seems fine.

/**
* Encryption algorithm. This field is set only in encrypted files
* with plaintext footer. Files with encrypted footer store algorithm id
* in FileCryptoMetaData structure.
*/
9: optional EncryptionAlgorithm encryption_algorithm

/**
* Retrieval metadata of key used for signing the footer.
* Used only in encrypted files with plaintext footer.
*/
10: optional binary footer_signing_key_metadata
}

/** Crypto metadata for files with encrypted footer **/
struct FileCryptoMetaData {
/**