Create object storage file formats page
m57lyra committed Aug 15, 2023
1 parent fda75bc commit 14d5885
Showing 5 changed files with 123 additions and 153 deletions.
3 changes: 3 additions & 0 deletions docs/src/main/sphinx/connector/delta-lake.rst
@@ -25,6 +25,9 @@ To connect to Databricks Delta Lake, you need:
* Access to the Hive metastore service (HMS) of Delta Lake or a separate HMS.
* Network access to the HMS from the coordinator and workers. Port 9083 is the
default port for the Thrift protocol used by the HMS.
* Data files stored in the Parquet file format. The format can be configured
  using :ref:`file format configuration properties <hive-parquet-configuration>`
  per catalog, as shown in the sketch after this list.
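
A minimal sketch of a Delta Lake catalog properties file that tunes one of the
referenced Parquet properties; the connector name, metastore URI, and value are
illustrative assumptions:

.. code-block:: properties

   # Assumed catalog file, for example etc/catalog/example.properties
   connector.name=delta_lake
   hive.metastore.uri=thrift://example.net:9083
   # Turn off post-write Parquet validation, per the referenced Parquet properties
   parquet.writer.validation-percentage=0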

General configuration
---------------------
107 changes: 14 additions & 93 deletions docs/src/main/sphinx/connector/hive.rst
@@ -17,6 +17,7 @@ Hive connector
IBM Cloud Object Storage <hive-cos>
Storage Caching <hive-caching>
Alluxio <hive-alluxio>
Object storage file formats <object-storage-file-formats>

The Hive connector allows querying data stored in an
`Apache Hive <https://hive.apache.org/>`_
@@ -54,6 +55,19 @@ The coordinator and all workers must have network access to the Hive metastore
and the storage system. Hive metastore access with the Thrift protocol defaults
to using port 9083.

Data files must be in a supported file format. Some file formats can be
configured using file format configuration properties per catalog (see the
sketch after this list):

* :ref:`ORC <hive-orc-configuration>`
* :ref:`Parquet <hive-parquet-configuration>`
* Avro
* RCText (RCFile using ``ColumnarSerDe``)
* RCBinary (RCFile using ``LazyBinaryColumnarSerDe``)
* SequenceFile
* JSON (using ``org.apache.hive.hcatalog.data.JsonSerDe``)
* CSV (using ``org.apache.hadoop.hive.serde2.OpenCSVSerde``)
* TextFile
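
A minimal sketch of a Hive catalog properties file that adjusts ORC and Parquet
handling per catalog; the metastore URI is a placeholder and the values are
only examples:

.. code-block:: properties

   connector.name=hive
   hive.metastore.uri=thrift://example.net:9083
   # ORC: enable bloom filters for predicate pushdown
   hive.orc.bloom-filters.enabled=true
   # Parquet: adjust timestamps to UTC, as recommended for Hive 3.1+
   hive.parquet.time-zone=UTC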

General configuration
---------------------

@@ -1591,99 +1605,6 @@ connector.
also have more overhead and increase load on the system.
- ``64 MB``

File formats
------------

The following file types and formats are supported for the Hive connector:

* ORC
* Parquet
* Avro
* RCText (RCFile using ``ColumnarSerDe``)
* RCBinary (RCFile using ``LazyBinaryColumnarSerDe``)
* SequenceFile
* JSON (using ``org.apache.hive.hcatalog.data.JsonSerDe``)
* CSV (using ``org.apache.hadoop.hive.serde2.OpenCSVSerde``)
* TextFile

ORC format configuration properties
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The following properties are used to configure the read and write operations
with ORC files performed by the Hive connector.

.. list-table:: ORC format configuration properties
:widths: 30, 50, 20
:header-rows: 1

* - Property Name
- Description
- Default
* - ``hive.orc.time-zone``
- Sets the default time zone for legacy ORC files that did not declare a
time zone.
- JVM default
* - ``hive.orc.use-column-names``
- Access ORC columns by name. By default, columns in ORC files are
accessed by their ordinal position in the Hive table definition. The
equivalent catalog session property is ``orc_use_column_names``.
- ``false``
* - ``hive.orc.bloom-filters.enabled``
- Enable bloom filters for predicate pushdown.
- ``false``
* - ``hive.orc.read-legacy-short-zone-id``
- Allow reads on ORC files with short zone ID in the stripe footer.
- ``false``

.. _hive-parquet-configuration:

Parquet format configuration properties
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The following properties are used to configure the read and write operations
with Parquet files performed by the Hive connector.

.. list-table:: Parquet format configuration properties
:widths: 30, 50, 20
:header-rows: 1

* - Property Name
- Description
- Default
* - ``hive.parquet.time-zone``
- Adjusts timestamp values to a specific time zone. For Hive 3.1+, set
this to UTC.
- JVM default
* - ``hive.parquet.use-column-names``
- Access Parquet columns by name by default. Set this property to
``false`` to access columns by their ordinal position in the Hive table
definition. The equivalent catalog session property is
``parquet_use_column_names``.
- ``true``
* - ``parquet.writer.validation-percentage``
- Percentage of Parquet files to validate after write by re-reading the whole file.
The equivalent catalog session property is ``parquet_optimized_writer_validation_percentage``.
Validation can be turned off by setting this property to ``0``.
- ``5``
* - ``parquet.writer.page-size``
- Maximum page size for the Parquet writer.
- ``1 MB``
* - ``parquet.writer.block-size``
- Maximum row group size for the Parquet writer.
- ``128 MB``
* - ``parquet.writer.batch-size``
- Maximum number of rows processed by the parquet writer in a batch.
- ``10000``
* - ``parquet.use-bloom-filter``
- Whether bloom filters are used for predicate pushdown when reading
Parquet files. Set this property to ``false`` to disable the usage of
bloom filters by default. The equivalent catalog session property is
``parquet_use_bloom_filter``.
- ``true``
* - ``parquet.max-read-block-row-count``
- Sets the maximum number of rows read in a batch.
- ``8192``

Hive 3-related limitations
--------------------------

14 changes: 6 additions & 8 deletions docs/src/main/sphinx/connector/hudi.rst
@@ -17,6 +17,9 @@ To use the Hudi connector, you need:
* Network access from the Trino coordinator and workers to the Hudi storage.
* Access to the Hive metastore service (HMS).
* Network access from the Trino coordinator to the HMS.
* Data files stored in the Parquet file format. The format can be configured
  using :ref:`file format configuration properties <hive-parquet-configuration>`
  per catalog, as shown in the sketch after this list.
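
A minimal sketch of a Hudi catalog properties file with one of the referenced
Parquet read properties; the metastore URI and value are illustrative
assumptions:

.. code-block:: properties

   connector.name=hudi
   hive.metastore.uri=thrift://example.net:9083
   # Lower the maximum number of rows read in a batch (default 8192)
   parquet.max-read-block-row-count=4096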

General configuration
---------------------
@@ -204,15 +207,10 @@ The output of the query has the following columns:
- Description
* - ``timestamp``
- ``VARCHAR``
- Instant time is typically a timestamp when the actions performed
- Instant time is typically a timestamp of when the actions were performed.
* - ``action``
- ``VARCHAR``
- `Type of action <https://hudi.apache.org/docs/concepts/#timeline>`_ performed on the table
- `Type of action <https://hudi.apache.org/docs/concepts/#timeline>`_ performed on the table.
* - ``state``
- ``VARCHAR``
- Current state of the instant

File formats
------------

The connector supports Parquet file format.
- Current state of the instant.
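
As a hedged sketch, these columns can be read from the ``$timeline`` metadata
table; the catalog, schema, and table names are placeholders:

.. code-block:: sql

   SELECT "timestamp", action, state
   FROM example.default."orders$timeline";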
58 changes: 6 additions & 52 deletions docs/src/main/sphinx/connector/iceberg.rst
@@ -42,8 +42,11 @@ To use Iceberg, you need:
:ref:`AWS Glue catalog<iceberg-glue-catalog>`, a :ref:`JDBC catalog
<iceberg-jdbc-catalog>`, a :ref:`REST catalog<iceberg-rest-catalog>`, or a
:ref:`Nessie server<iceberg-nessie-catalog>`.
* Network access from the Trino coordinator to the HMS. Hive metastore access
with the Thrift protocol defaults to using port 9083.
* Data files stored in a supported file format. The format can be configured
  using file format configuration properties per catalog (see the sketch after
  this list):

- :ref:`ORC <hive-orc-configuration>`
- :ref:`Parquet <hive-parquet-configuration>` (default)
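
A minimal sketch of an Iceberg catalog properties file; the metastore URI is a
placeholder, and ``iceberg.file-format`` is an assumed property name for
selecting the default file format of new tables:

.. code-block:: properties

   connector.name=iceberg
   hive.metastore.uri=thrift://example.net:9083
   # Assumed: write new tables as ORC instead of the Parquet default
   iceberg.file-format=ORC
   # Referenced ORC property: enable bloom filters for predicate pushdown
   hive.orc.bloom-filters.enabled=true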

General configuration
---------------------
@@ -1551,53 +1554,4 @@ Table redirection
.. include:: table-redirection.fragment

The connector supports redirection from Iceberg tables to Hive tables with the
``iceberg.hive-catalog-name`` catalog configuration property.

File formats
------------

The following file types and formats are supported for the Iceberg connector:

* ORC
* Parquet
* Avro

ORC format configuration
^^^^^^^^^^^^^^^^^^^^^^^^

The following properties are used to configure the read and write operations
with ORC files performed by the Iceberg connector.

.. list-table:: ORC format configuration properties
:widths: 30, 58, 12
:header-rows: 1

* - Property name
- Description
- Default
* - ``hive.orc.bloom-filters.enabled``
- Enable bloom filters for predicate pushdown.
- ``false``

Parquet format configuration
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The following properties are used to configure the read and write operations
with Parquet files performed by the Iceberg connector.

.. list-table:: Parquet format configuration properties
:widths: 30, 50, 20
:header-rows: 1

* - Property Name
- Description
- Default
* - ``parquet.max-read-block-row-count``
- Sets the maximum number of rows read in a batch.
- ``8192``
* - ``parquet.use-bloom-filter``
- Whether bloom filters are used for predicate pushdown when reading
Parquet files. Set this property to ``false`` to disable the usage of
bloom filters by default. The equivalent catalog session property is
``parquet_use_bloom_filter``.
- ``true``
``iceberg.hive-catalog-name`` catalog configuration property.
94 changes: 94 additions & 0 deletions docs/src/main/sphinx/connector/object-storage-file-formats.rst
@@ -0,0 +1,94 @@
===========================
Object storage file formats
===========================

Object storage connectors support one or more file formats specified by the
underlying data source.

In the case of serializable formats, only specific
`SerDes <https://www.wikipedia.org/wiki/SerDes>`_ are allowed:

* RCText - RCFile ``ColumnarSerDe``
* RCBinary - RCFile ``LazyBinaryColumnarSerDe``
* JSON - ``org.apache.hive.hcatalog.data.JsonSerDe``
* CSV - ``org.apache.hadoop.hive.serde2.OpenCSVSerde``
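
As an illustrative sketch, a table created with the CSV format is read and
written with ``OpenCSVSerde``; the catalog, schema, table, and column names are
placeholders, and the columns are assumed to be ``varchar``:

.. code-block:: sql

   CREATE TABLE example.default.events (
       id varchar,
       payload varchar
   )
   WITH (format = 'CSV');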

.. _hive-orc-configuration:

ORC format configuration properties
-----------------------------------

The following properties are used to configure the read and write operations
with ORC files performed by supported object storage connectors:

.. list-table:: ORC format configuration properties
:widths: 30, 50, 20
:header-rows: 1

* - Property Name
- Description
- Default
* - ``hive.orc.time-zone``
- Sets the default time zone for legacy ORC files that did not declare a
time zone.
- JVM default
* - ``hive.orc.use-column-names``
- Access ORC columns by name. By default, columns in ORC files are
accessed by their ordinal position in the Hive table definition. The
equivalent catalog session property is ``orc_use_column_names``.
- ``false``
* - ``hive.orc.bloom-filters.enabled``
- Enable bloom filters for predicate pushdown.
- ``false``
* - ``hive.orc.read-legacy-short-zone-id``
- Allow reads on ORC files with short zone ID in the stripe footer.
- ``false``
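
Where an equivalent catalog session property is named, the same setting can be
changed for the current session; a sketch assuming a catalog named ``example``:

.. code-block:: sql

   -- Session-level equivalent of hive.orc.use-column-names
   SET SESSION example.orc_use_column_names = true;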

.. _hive-parquet-configuration:

Parquet format configuration properties
---------------------------------------

The following properties are used to configure the read and write operations
with Parquet files performed by supported object storage connectors:

.. list-table:: Parquet format configuration properties
:widths: 30, 50, 20
:header-rows: 1

* - Property Name
- Description
- Default
* - ``hive.parquet.time-zone``
- Adjusts timestamp values to a specific time zone. For Hive 3.1+, set
this to UTC.
- JVM default
* - ``hive.parquet.use-column-names``
- Access Parquet columns by name by default. Set this property to
``false`` to access columns by their ordinal position in the Hive table
definition. The equivalent catalog session property is
``parquet_use_column_names``.
- ``true``
* - ``parquet.writer.validation-percentage``
- Percentage of Parquet files to validate after write by re-reading the whole file.
The equivalent catalog session property is ``parquet_optimized_writer_validation_percentage``.
Validation can be turned off by setting this property to ``0``.
- ``5``
* - ``parquet.writer.page-size``
- Maximum page size for the Parquet writer.
- ``1 MB``
* - ``parquet.writer.block-size``
- Maximum row group size for the Parquet writer.
- ``128 MB``
* - ``parquet.writer.batch-size``
- Maximum number of rows processed by the Parquet writer in a batch.
- ``10000``
* - ``parquet.use-bloom-filter``
- Whether bloom filters are used for predicate pushdown when reading
Parquet files. Set this property to ``false`` to disable the usage of
bloom filters by default. The equivalent catalog session property is
``parquet_use_bloom_filter``.
- ``true``
* - ``parquet.max-read-block-row-count``
- Sets the maximum number of rows read in a batch.
- ``8192``
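
A minimal catalog properties sketch combining several of the Parquet properties
above; all values are illustrative only:

.. code-block:: properties

   # Writer tuning: larger row groups, smaller write batches
   parquet.writer.block-size=256MB
   parquet.writer.batch-size=5000
   # Disable bloom filter usage for predicate pushdown on read
   parquet.use-bloom-filter=false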
