
RedElm file format design


The RedElm file format is a columnar format inspired by the ColumnIO format described in the Dremel paper from Google. Its primary target is HDFS.

This document reflects the current state of thinking around this format. It is evolving and being discussed.

Overview

The records are organized in row groups (data blocks). Each row group is organized by column. There are three columns per leaf of the schema (primitive type):

  • repetition level
  • definition level
  • data

Each column chunk in a row group is compressed using the configured compression codec (standard codecs only). Metadata is stored in a footer. The format is cross-language compatible. A minimal sketch of this per-leaf structure follows.
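As an illustration only, the sketch below shows one way to model a row group and its per-leaf column chunks; the class and field names are hypothetical, not part of the RedElm code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of one leaf column inside a row group: the three streams
// described above (repetition levels, definition levels, data).
class LeafColumnChunk {
    final String[] pathInSchema;   // identifier: path to the leaf in the schema
    final byte[] repetitionLevels; // compressed repetition levels
    final byte[] definitionLevels; // compressed definition levels
    final byte[] data;             // encoding header followed by compressed values

    LeafColumnChunk(String[] pathInSchema, byte[] repetitionLevels,
                    byte[] definitionLevels, byte[] data) {
        this.pathInSchema = pathInSchema;
        this.repetitionLevels = repetitionLevels;
        this.definitionLevels = definitionLevels;
        this.data = data;
    }
}

// A row group is the list of column chunks for every leaf of the schema,
// plus the number of records it contains.
class RowGroup {
    final List<LeafColumnChunk> columns = new ArrayList<>();
    long recordCount;
}
```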

Goals

  • The primary target is HDFS
  • Each column chunk can be read independently so that only the data for the fields accessed is read.
  • All the column data for the same row group should be collocated to speed up record assembly.
  • Repetition levels and definition levels can be read independently of the data (for example, for a fast count() implementation)
  • Row groups should be large enough to amortize IO seek cost over scan.
  • Backward compatibility: The format should allow for new extensions (column data encoding: dictionary, delta, ...) to be implemented while allowing new versions of the library to read old versions of the files.
  • The file can be generated in one pass from an M/R job (either from mappers or reducers) without requiring more than a given budget of memory.
  • The format should be readable in any language and not be tied to Java.
  • Application-specific metadata blocks can be added to the footer to allow customization of the conversion. For example, when writing from Pig, a Pig metadata block containing the Pig schema is added. This allows converting the data back to the original Pig schema: Pig Maps are converted to lists of key/value pairs, and the Pig schema is required to convert them back to a Map rather than a bag of key/value tuples. (See the sketch after this list.)
  • Column data in a row group has a header to record the encoding used in this particular data block for this particular column. (This would also allow skipping compression)
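To illustrate the application-specific metadata blocks mentioned above, here is a rough sketch of attaching a Pig block that carries the original Pig schema. The FooterBuilder class, the "pig" block name, and the schema string are all hypothetical; only the idea of named blocks in the footer comes from this design.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: the footer holds named, application-specific metadata
// blocks next to the required RedElm block.
class FooterBuilder {
    private final Map<String, String> namedBlocks = new LinkedHashMap<>();

    // Adds (or replaces) an application-specific block by name.
    void addMetaDataBlock(String name, String payload) {
        namedBlocks.put(name, payload);
    }

    Map<String, String> blocks() {
        return namedBlocks;
    }
}

class PigWriterExample {
    public static void main(String[] args) {
        FooterBuilder footer = new FooterBuilder();
        // The Pig schema is stored verbatim so that a Pig Map written as a
        // list of key/value pairs can be converted back to a Map on read.
        // The schema string below is only an example.
        footer.addMetaDataBlock("pig",
            "m: map[], names: {t: (name: chararray)}");
        System.out.println(footer.blocks());
    }
}
```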

Design choices

  • the metadata is stored in a footer, as it is accumulated in memory while the row groups are written. Once all the data blocks (one per row group) have been written, the metadata blocks are appended at the end of the file.
  • the index of the footer is stored at the very end of the file to allow finding the footer (see the reading sketch after this list).
  • a given row group is buffered in memory and is flushed to disk when a threshold is reached.
  • metadata blocks are stored in JSON (TODO) to simplify backward and cross-language compatibility.
  • the footer is gzipped (to be discussed).
  • In general, caching the metadata would be a good idea (HBase, HCatalog?). As a simple version of this, we create a summary file in the partition directory, alongside the part files, that consolidates all their footers. This allows faster lookup of metadata and easy addition of custom file footers (for example: custom conversion hints to 3rd-party schemas).
  • we want to allow application-specific customization (Pig, Hive, ...) without breaking cross-language compatibility.
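The following sketch shows how a reader could locate and load the footer. It assumes (this is not fixed anywhere in this document) that the footer index is an 8-byte offset written just before the trailing 8-byte magic number; a real reader would go through the HDFS FileSystem API rather than RandomAccessFile.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Sketch of footer discovery. Assumed layout at the end of the file:
// [footer][8-byte footer offset, big-endian][8-byte magic number].
class FooterReaderSketch {
    static final int MAGIC_LENGTH = 8;
    static final int FOOTER_INDEX_LENGTH = 8;

    static byte[] readFooter(String path) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            long fileLength = file.length();
            // Read the footer index stored just before the trailing magic number.
            file.seek(fileLength - MAGIC_LENGTH - FOOTER_INDEX_LENGTH);
            long footerStart = file.readLong();
            long footerLength = fileLength - MAGIC_LENGTH - FOOTER_INDEX_LENGTH - footerStart;
            byte[] footer = new byte[(int) footerLength];
            file.seek(footerStart);
            file.readFully(footer);
            return footer; // gzipped JSON metadata blocks, per the choices above
        }
    }
}
```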

To do

  • Skiplists to skip values in a row group when a filter is being evaluated on a different column (see CIF paper).
  • Look into setting the HDFS block size to a large value (1-2GB) for the resulting file. The goal is to ensure complete collocation of column data within the same HDFS block (so that row groups do not cross HDFS block boundaries); see the sketch after this list.
  • Explore a two-pass generation: as row groups are buffered in memory, we want to flush them to disk when they get too big; however, when scanning the data we want larger blocks so that scans are more effective. We could write files with relatively small row groups and rewrite them at the end, one column at a time, in bigger chunks.
  • Figure out the best way to deal with schema conversion across systems (Pig, Hive, Thrift, ...) and a good way to capture types that are not in the Protobuf spec (Map, Set, ...).
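A minimal sketch of creating the output file with a large per-file HDFS block size, using the standard Hadoop FileSystem API; the 1GB block size and the buffer size are illustrative values, not decisions made in this document.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

class LargeBlockWriterSketch {
    // Creates the output file with a large HDFS block size so that row groups
    // can stay collocated within a single block.
    static FSDataOutputStream createOutput(Configuration conf, Path path) throws IOException {
        FileSystem fs = path.getFileSystem(conf);
        long blockSize = 1024L * 1024L * 1024L; // 1GB HDFS block (illustrative)
        int bufferSize = 64 * 1024;             // illustrative buffer size
        return fs.create(path, true, bufferSize, fs.getDefaultReplication(), blockSize);
    }
}
```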

File layout

See the picture after the description

  • 8-byte magic number: 82, 101, 100, 32, 69, 108, 109, 10 ("Red Elm\n": to avoid trying to read non-RedElm files; a check is sketched after this layout)
  • row group (repeated):
    • column (repeated):
      • repetition levels (ShortInt compressed)
      • definition levels (compressed)
      • data:
        • header: describes the encoding of the data (ToDo)
        • actual data: (compressed)
  • the footer stores metadata blocks:
    • file version
    • metadata blocks count
    • RedElm block: (required)
      • File-level metadata
        • RedElm schema
        • Codec used to compress (TODO: this is one of a predefined set to ensure cross-language compatibility, for example lzo, gz, snappy, ...). This is a generic compression codec applied on top of the existing encoding. It may optionally be turned off per column. (TBD)
      • block-level metadata. For each block:
        • start index (could be replaced by block length)
        • end index (could be replaced by block length)
        • record count
        • columns for this block:
          • r start (could be replaced by r length)
          • d start (could be replaced by d length)
          • data start (could be replaced by data length)
          • data end (could be replaced by data length)
          • identifier (path in the schema)
          • type (one of the primitive types)
          • values count
          • r uncompressed size (useful for planning)
          • d uncompressed size
          • data uncompressed size
    • Pig Block: (optional: shown as an example, only the Pig extension of the InputFormat knows about this)
      • Pig Schema (to allow reading with the exact same schema used to write, since the conversion from the Pig schema to the RedElm schema is lossy)
    • Thrift Block: (optional: same thing, the format just allows arbitrary named metadata blocks in the footer)
      • thrift class name (same thing)
  • footer index
  • Magic number

(Figure: RedElm file format layout)
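A small sketch of the magic-number check mentioned in the layout: reading the first and last 8 bytes and comparing them to the byte values listed above. Class and method names are illustrative.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Arrays;

// Sketch: verify the leading and trailing magic number before attempting
// to parse a file as RedElm.
class MagicCheckSketch {
    // "Red Elm\n", as listed in the layout above
    static final byte[] MAGIC = {82, 101, 100, 32, 69, 108, 109, 10};

    static boolean looksLikeRedElm(String path) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            if (file.length() < 2L * MAGIC.length) {
                return false;
            }
            byte[] head = new byte[MAGIC.length];
            file.readFully(head);
            byte[] tail = new byte[MAGIC.length];
            file.seek(file.length() - MAGIC.length);
            file.readFully(tail);
            return Arrays.equals(head, MAGIC) && Arrays.equals(tail, MAGIC);
        }
    }
}
```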