Create object storage file formats page
m57lyra committed Aug 15, 2023
1 parent fda75bc commit 14d5885
Showing 5 changed files with 123 additions and 153 deletions.
3 changes: 3 additions & 0 deletions docs/src/main/sphinx/connector/delta-lake.rst
@@ -25,6 +25,9 @@ To connect to Databricks Delta Lake, you need:
* Access to the Hive metastore service (HMS) of Delta Lake or a separate HMS.
* Network access to the HMS from the coordinator and workers. Port 9083 is the
default port for the Thrift protocol used by the HMS.
* Data files stored in the Parquet file format. The format can be configured
  using :ref:`file format configuration properties <hive-parquet-configuration>`
  per catalog, as shown in the sketch after this list.
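
A minimal sketch of a Delta Lake catalog properties file that tunes one of the
referenced Parquet properties; the connector name, metastore URI, and value are
illustrative assumptions:

.. code-block:: properties

   # Assumed catalog file, for example etc/catalog/example.properties
   connector.name=delta_lake
   hive.metastore.uri=thrift://example.net:9083
   # Turn off post-write Parquet validation, per the referenced Parquet properties
   parquet.writer.validation-percentage=0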

General configuration
---------------------
107 changes: 14 additions & 93 deletions docs/src/main/sphinx/connector/hive.rst
@@ -17,6 +17,7 @@ Hive connector
IBM Cloud Object Storage <hive-cos>
Storage Caching <hive-caching>
Alluxio <hive-alluxio>
Object storage file formats <object-storage-file-formats>

The Hive connector allows querying data stored in an
`Apache Hive <https://hive.apache.org/>`_
@@ -54,6 +55,19 @@ The coordinator and all workers must have network access to the Hive metastore
and the storage system. Hive metastore access with the Thrift protocol defaults
to using port 9083.

Data files must be in a supported file format. Some file formats can be
configured using file format configuration properties per catalog (see the
sketch after this list):

* :ref:`ORC <hive-orc-configuration>`
* :ref:`Parquet <hive-parquet-configuration>`
* Avro
* RCText (RCFile using ``ColumnarSerDe``)
* RCBinary (RCFile using ``LazyBinaryColumnarSerDe``)
* SequenceFile
* JSON (using ``org.apache.hive.hcatalog.data.JsonSerDe``)
* CSV (using ``org.apache.hadoop.hive.serde2.OpenCSVSerde``)
* TextFile
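
A minimal sketch of a Hive catalog properties file that adjusts ORC and Parquet
handling per catalog; the metastore URI is a placeholder and the values are
only examples:

.. code-block:: properties

   connector.name=hive
   hive.metastore.uri=thrift://example.net:9083
   # ORC: enable bloom filters for predicate pushdown
   hive.orc.bloom-filters.enabled=true
   # Parquet: adjust timestamps to UTC, as recommended for Hive 3.1+
   hive.parquet.time-zone=UTC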

General configuration
---------------------

@@ -1591,99 +1605,6 @@ connector.
also have more overhead and increase load on the system.
- ``64 MB``

File formats
------------

The following file types and formats are supported for the Hive connector:

* ORC
* Parquet
* Avro
* RCText (RCFile using ``ColumnarSerDe``)
* RCBinary (RCFile using ``LazyBinaryColumnarSerDe``)
* SequenceFile
* JSON (using ``org.apache.hive.hcatalog.data.JsonSerDe``)
* CSV (using ``org.apache.hadoop.hive.serde2.OpenCSVSerde``)
* TextFile

ORC format configuration properties
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The following properties are used to configure the read and write operations
with ORC files performed by the Hive connector.

.. list-table:: ORC format configuration properties
:widths: 30, 50, 20
:header-rows: 1

* - Property Name
- Description
- Default
* - ``hive.orc.time-zone``
- Sets the default time zone for legacy ORC files that did not declare a
time zone.
- JVM default
* - ``hive.orc.use-column-names``
- Access ORC columns by name. By default, columns in ORC files are
accessed by their ordinal position in the Hive table definition. The
equivalent catalog session property is ``orc_use_column_names``.
- ``false``
* - ``hive.orc.bloom-filters.enabled``
- Enable bloom filters for predicate pushdown.
- ``false``
* - ``hive.orc.read-legacy-short-zone-id``
- Allow reads on ORC files with short zone ID in the stripe footer.
- ``false``

.. _hive-parquet-configuration:

Parquet format configuration properties
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The following properties are used to configure the read and write operations
with Parquet files performed by the Hive connector.

.. list-table:: Parquet format configuration properties
:widths: 30, 50, 20
:header-rows: 1

* - Property Name
- Description
- Default
* - ``hive.parquet.time-zone``
- Adjusts timestamp values to a specific time zone. For Hive 3.1+, set
this to UTC.
- JVM default
* - ``hive.parquet.use-column-names``
- Access Parquet columns by name by default. Set this property to
``false`` to access columns by their ordinal position in the Hive table
definition. The equivalent catalog session property is
``parquet_use_column_names``.
- ``true``
* - ``parquet.writer.validation-percentage``
- Percentage of Parquet files to validate after write by re-reading the whole file.
The equivalent catalog session property is ``parquet_optimized_writer_validation_percentage``.
Validation can be turned off by setting this property to ``0``.
- ``5``
* - ``parquet.writer.page-size``
- Maximum page size for the Parquet writer.
- ``1 MB``
* - ``parquet.writer.block-size``
- Maximum row group size for the Parquet writer.
- ``128 MB``
* - ``parquet.writer.batch-size``
- Maximum number of rows processed by the parquet writer in a batch.
- ``10000``
* - ``parquet.use-bloom-filter``
- Whether bloom filters are used for predicate pushdown when reading
Parquet files. Set this property to ``false`` to disable the usage of
bloom filters by default. The equivalent catalog session property is
``parquet_use_bloom_filter``.
- ``true``
* - ``parquet.max-read-block-row-count``
- Sets the maximum number of rows read in a batch.
- ``8192``

Hive 3-related limitations
--------------------------

14 changes: 6 additions & 8 deletions docs/src/main/sphinx/connector/hudi.rst
@@ -17,6 +17,9 @@ To use the Hudi connector, you need:
* Network access from the Trino coordinator and workers to the Hudi storage.
* Access to the Hive metastore service (HMS).
* Network access from the Trino coordinator to the HMS.
* Data files stored in the Parquet file format. The format can be configured
  using :ref:`file format configuration properties <hive-parquet-configuration>`
  per catalog, as shown in the sketch after this list.
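
A minimal sketch of a Hudi catalog properties file with one of the referenced
Parquet read properties; the metastore URI and value are illustrative
assumptions:

.. code-block:: properties

   connector.name=hudi
   hive.metastore.uri=thrift://example.net:9083
   # Lower the maximum number of rows read in a batch (default 8192)
   parquet.max-read-block-row-count=4096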

General configuration
---------------------
@@ -204,15 +207,10 @@ The output of the query has the following columns:
- Description
* - ``timestamp``
- ``VARCHAR``
- Instant time is typically a timestamp when the actions performed
- Instant time is typically a timestamp of when the actions were performed.
* - ``action``
- ``VARCHAR``
- `Type of action <https://hudi.apache.org/docs/concepts/#timeline>`_ performed on the table
- `Type of action <https://hudi.apache.org/docs/concepts/#timeline>`_ performed on the table.
* - ``state``
- ``VARCHAR``
- Current state of the instant

File formats
------------

The connector supports Parquet file format.
- Current state of the instant.
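
As a hedged sketch, these columns can be read from the ``$timeline`` metadata
table; the catalog, schema, and table names are placeholders:

.. code-block:: sql

   SELECT "timestamp", action, state
   FROM example.default."orders$timeline";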
58 changes: 6 additions & 52 deletions docs/src/main/sphinx/connector/iceberg.rst
@@ -42,8 +42,11 @@ To use Iceberg, you need:
:ref:`AWS Glue catalog<iceberg-glue-catalog>`, a :ref:`JDBC catalog
<iceberg-jdbc-catalog>`, a :ref:`REST catalog<iceberg-rest-catalog>`, or a
:ref:`Nessie server<iceberg-nessie-catalog>`.
* Network access from the Trino coordinator to the HMS. Hive metastore access
with the Thrift protocol defaults to using port 9083.
* Data files stored in a supported file format. The format can be configured
  using file format configuration properties per catalog (see the sketch after
  this list):

- :ref:`ORC <hive-orc-configuration>`
- :ref:`Parquet <hive-parquet-configuration>` (default)
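
A minimal sketch of an Iceberg catalog properties file; the metastore URI is a
placeholder, and ``iceberg.file-format`` is an assumed property name for
selecting the default file format of new tables:

.. code-block:: properties

   connector.name=iceberg
   hive.metastore.uri=thrift://example.net:9083
   # Assumed: write new tables as ORC instead of the Parquet default
   iceberg.file-format=ORC
   # Referenced ORC property: enable bloom filters for predicate pushdown
   hive.orc.bloom-filters.enabled=true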

General configuration
---------------------
@@ -1551,53 +1554,4 @@ Table redirection
.. include:: table-redirection.fragment

The connector supports redirection from Iceberg tables to Hive tables with the
``iceberg.hive-catalog-name`` catalog configuration property.

File formats
------------

The following file types and formats are supported for the Iceberg connector:

* ORC
* Parquet
* Avro

ORC format configuration
^^^^^^^^^^^^^^^^^^^^^^^^

The following properties are used to configure the read and write operations
with ORC files performed by the Iceberg connector.

.. list-table:: ORC format configuration properties
:widths: 30, 58, 12
:header-rows: 1

* - Property name
- Description
- Default
* - ``hive.orc.bloom-filters.enabled``
- Enable bloom filters for predicate pushdown.
- ``false``

Parquet format configuration
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The following properties are used to configure the read and write operations
with Parquet files performed by the Iceberg connector.

.. list-table:: Parquet format configuration properties
:widths: 30, 50, 20
:header-rows: 1

* - Property Name
- Description
- Default
* - ``parquet.max-read-block-row-count``
- Sets the maximum number of rows read in a batch.
- ``8192``
* - ``parquet.use-bloom-filter``
- Whether bloom filters are used for predicate pushdown when reading
Parquet files. Set this property to ``false`` to disable the usage of
bloom filters by default. The equivalent catalog session property is
``parquet_use_bloom_filter``.
- ``true``
``iceberg.hive-catalog-name`` catalog configuration property.
94 changes: 94 additions & 0 deletions docs/src/main/sphinx/connector/object-storage-file-formats.rst
@@ -0,0 +1,94 @@
===========================
Object storage file formats
===========================

Object storage connectors support one or more file formats specified by the
underlying data source.

In the case of serializable formats, only specific
`SerDes <https://www.wikipedia.org/wiki/SerDes>`_ are allowed:

* RCText - RCFile ``ColumnarSerDe``
* RCBinary - RCFile ``LazyBinaryColumnarSerDe``
* JSON - ``org.apache.hive.hcatalog.data.JsonSerDe``
* CSV - ``org.apache.hadoop.hive.serde2.OpenCSVSerde``
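
As an illustrative sketch, a table created with the CSV format is read and
written with ``OpenCSVSerde``; the catalog, schema, table, and column names are
placeholders, and the columns are assumed to be ``varchar``:

.. code-block:: sql

   CREATE TABLE example.default.events (
       id varchar,
       payload varchar
   )
   WITH (format = 'CSV');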

.. _hive-orc-configuration:

ORC format configuration properties
-----------------------------------

The following properties are used to configure the read and write operations
with ORC files performed by supported object storage connectors:

.. list-table:: ORC format configuration properties
:widths: 30, 50, 20
:header-rows: 1

* - Property Name
- Description
- Default
* - ``hive.orc.time-zone``
- Sets the default time zone for legacy ORC files that did not declare a
time zone.
- JVM default
* - ``hive.orc.use-column-names``
- Access ORC columns by name. By default, columns in ORC files are
accessed by their ordinal position in the Hive table definition. The
equivalent catalog session property is ``orc_use_column_names``.
- ``false``
* - ``hive.orc.bloom-filters.enabled``
- Enable bloom filters for predicate pushdown.
- ``false``
* - ``hive.orc.read-legacy-short-zone-id``
- Allow reads on ORC files with short zone ID in the stripe footer.
- ``false``
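
Where an equivalent catalog session property is named, the same setting can be
changed for the current session; a sketch assuming a catalog named ``example``:

.. code-block:: sql

   -- Session-level equivalent of hive.orc.use-column-names
   SET SESSION example.orc_use_column_names = true;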

.. _hive-parquet-configuration:

Parquet format configuration properties
---------------------------------------

The following properties are used to configure the read and write operations
with Parquet files performed by supported object storage connectors:

.. list-table:: Parquet format configuration properties
:widths: 30, 50, 20
:header-rows: 1

* - Property Name
- Description
- Default
* - ``hive.parquet.time-zone``
- Adjusts timestamp values to a specific time zone. For Hive 3.1+, set
this to UTC.
- JVM default
* - ``hive.parquet.use-column-names``
- Access Parquet columns by name by default. Set this property to
``false`` to access columns by their ordinal position in the Hive table
definition. The equivalent catalog session property is
``parquet_use_column_names``.
- ``true``
* - ``parquet.writer.validation-percentage``
- Percentage of Parquet files to validate after write by re-reading the whole file.
The equivalent catalog session property is ``parquet_optimized_writer_validation_percentage``.
Validation can be turned off by setting this property to ``0``.
- ``5``
* - ``parquet.writer.page-size``
- Maximum page size for the Parquet writer.
- ``1 MB``
* - ``parquet.writer.block-size``
- Maximum row group size for the Parquet writer.
- ``128 MB``
* - ``parquet.writer.batch-size``
- Maximum number of rows processed by the Parquet writer in a batch.
- ``10000``
* - ``parquet.use-bloom-filter``
- Whether bloom filters are used for predicate pushdown when reading
Parquet files. Set this property to ``false`` to disable the usage of
bloom filters by default. The equivalent catalog session property is
``parquet_use_bloom_filter``.
- ``true``
* - ``parquet.max-read-block-row-count``
- Sets the maximum number of rows read in a batch.
- ``8192``
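
A minimal catalog properties sketch combining several of the Parquet properties
above; all values are illustrative only:

.. code-block:: properties

   # Writer tuning: larger row groups, smaller write batches
   parquet.writer.block-size=256MB
   parquet.writer.batch-size=5000
   # Disable bloom filter usage for predicate pushdown on read
   parquet.use-bloom-filter=false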
