PARQUET-2139: Deprecate ColumnChunk::file_offset field (#440)
This field is not consistently set or read by implementations.
etseidl committed Jul 3, 2024
1 parent 3857dc1 commit 5b564f3
Showing 2 changed files with 27 additions and 18 deletions.
26 changes: 13 additions & 13 deletions README.md
@@ -89,38 +89,38 @@ more pages.
 This file and the [Thrift definition](src/main/thrift/parquet.thrift) should be read together to understand the format.

 4-byte magic number "PAR1"
-<Column 1 Chunk 1 + Column Metadata>
-<Column 2 Chunk 1 + Column Metadata>
+<Column 1 Chunk 1>
+<Column 2 Chunk 1>
 ...
-<Column N Chunk 1 + Column Metadata>
-<Column 1 Chunk 2 + Column Metadata>
-<Column 2 Chunk 2 + Column Metadata>
+<Column N Chunk 1>
+<Column 1 Chunk 2>
+<Column 2 Chunk 2>
 ...
-<Column N Chunk 2 + Column Metadata>
+<Column N Chunk 2>
 ...
-<Column 1 Chunk M + Column Metadata>
-<Column 2 Chunk M + Column Metadata>
+<Column 1 Chunk M>
+<Column 2 Chunk M>
 ...
-<Column N Chunk M + Column Metadata>
+<Column N Chunk M>
 File Metadata
 4-byte length in bytes of file metadata (little endian)
 4-byte magic number "PAR1"

 In the above example, there are N columns in this table, split into M row
-groups. The file metadata contains the locations of all the column metadata
+groups. The file metadata contains the locations of all the column chunk
 start locations. More details on what is contained in the metadata can be found
 in the Thrift definition.

-Metadata is written after the data to allow for single pass writing.
+File Metadata is written after the data to allow for single pass writing.

 Readers are expected to first read the file metadata to find all the column
 chunks they are interested in. The columns chunks should then be read sequentially.

 ![File Layout](https://raw.github.com/apache/parquet-format/master/doc/images/FileLayout.gif)

 ## Metadata
-There are three types of metadata: file metadata, column (chunk) metadata and page
-header metadata. All thrift structures are serialized using the TCompactProtocol.
+There are two types of metadata: file metadata and page header metadata. All thrift structures
+are serialized using the TCompactProtocol.

 ![Metadata diagram](https://github.com/apache/parquet-format/raw/master/doc/images/FileFormat.gif)
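
The file layout in the README hunk above implies that a reader starts at the end of the file: the last 4 bytes are the magic number and the 4 bytes before them hold the little-endian length of the file metadata. The following Python sketch illustrates only that step. The function name `locate_file_metadata` is hypothetical, treating the length as unsigned is an assumption, and decoding the Thrift payload itself (TCompactProtocol) is deliberately left out.

```python
import struct
from typing import Tuple

MAGIC = b"PAR1"
TAIL_SIZE = 8  # 4-byte metadata length followed by the 4-byte magic number


def locate_file_metadata(data: bytes) -> Tuple[int, int]:
    """Return (offset, length) of the serialized file metadata inside `data`.

    `data` is the whole file held in memory, which keeps the sketch short.
    """
    if len(data) < len(MAGIC) + TAIL_SIZE:
        raise ValueError("too short to be a Parquet file")
    if data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("missing PAR1 magic number")
    # Assumption: the 4-byte length is read as an unsigned little-endian int.
    (length,) = struct.unpack("<I", data[-8:-4])
    offset = len(data) - TAIL_SIZE - length
    if offset < len(MAGIC):
        raise ValueError("metadata length exceeds file size")
    return offset, length
```

A real reader would seek to the end of the file and fetch only the tail instead of loading everything, then hand the bytes at `offset` to a Thrift compact-protocol decoder.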

19 changes: 14 additions & 5 deletions src/main/thrift/parquet.thrift
@@ -867,12 +867,21 @@ struct ColumnChunk {
    **/
   1: optional string file_path

-  /** Byte offset in file_path to the ColumnMetaData **/
-  2: required i64 file_offset
+  /** Deprecated: Byte offset in file_path to the ColumnMetaData
+   *
+   * Past use of this field has been inconsistent, with some implementations
+   * using it to point to the ColumnMetaData and some using it to point to
+   * the first page in the column chunk. In many cases, the ColumnMetaData at this
+   * location is wrong. This field is now deprecated and should not be used.
+   * Writers should set this field to 0 if no ColumnMetaData has been written outside
+   * the footer.
+   */
+  2: required i64 file_offset = 0

-  /** Column metadata for this chunk. This is the same content as what is at
-   * file_path/file_offset. Having it here has it replicated in the file
-   * metadata.
+  /** Column metadata for this chunk. Some writers may also replicate this at the
+   * location pointed to by file_path/file_offset.
+   * Note: while marked as optional, this field is in fact required by most major
+   * Parquet implementations. As such, writers MUST populate this field.
    **/
   3: optional ColumnMetaData meta_data
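
With `file_offset` deprecated, a reader has to find a column chunk's pages from the footer copy of `meta_data` instead. The Python sketch below shows one way to do that, under stated assumptions: the dataclasses are hypothetical stand-ins for the generated Thrift types and carry only a subset of fields, `chunk_byte_range` is an illustrative name, and the `0 < dictionary_page_offset < data_page_offset` guard is a defensive choice for writers that store 0 when no dictionary page exists, not something this commit prescribes.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ColumnMetaData:
    # Subset of the Thrift ColumnMetaData fields needed to locate the pages.
    data_page_offset: int
    total_compressed_size: int
    dictionary_page_offset: Optional[int] = None


@dataclass
class ColumnChunk:
    meta_data: Optional[ColumnMetaData] = None
    file_path: Optional[str] = None
    file_offset: int = 0  # deprecated: writers set 0, this sketch never reads it


def chunk_byte_range(chunk: ColumnChunk) -> Tuple[int, int]:
    """Return (start_offset, byte_length) of the chunk's pages."""
    md = chunk.meta_data
    if md is None:
        # Optional in the Thrift IDL, but required in practice per the new comment.
        raise ValueError("ColumnChunk.meta_data is missing")
    start = md.data_page_offset
    dict_off = md.dictionary_page_offset
    # Assumption: a dictionary page, when present, precedes the first data page.
    if dict_off is not None and 0 < dict_off < start:
        start = dict_off
    return start, md.total_compressed_size
```

Because the function never consults `file_offset`, it returns the same range whether a writer stored 0, a pointer to the first page, or a pointer to a trailing copy of the ColumnMetaData.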
