Document domain-compaction-threshold description #12965

Closed
wants to merge 9 commits into from

27 changes: 14 additions & 13 deletions docs/src/main/sphinx/connector/delta-lake.rst

@@ -40,7 +40,7 @@ metastore configuration properties as the :doc:`Hive connector
</connector/hive>`. At a minimum, ``hive.metastore.uri`` must be configured.

The connector recognizes Delta tables created in the metastore by the Databricks
-runtime. If non-Delta tables are present in the metastore, as well, they are not
+runtime. If non-Delta tables are present in the metastore as well, they are not
visible to the connector.

To configure the Delta Lake connector, create a catalog properties file, for
@@ -173,10 +173,11 @@ connector.
- Description
- Default
* - ``delta.domain-compaction-threshold``
-- Sets the number of transactions to act as threshold. Once reached the
-connector initiates compaction of the underlying files and the delta
-files. A higher compaction threshold means reading less data from the
-underlying data source, but a higher memory and network consumption.
+- Minimum size of query predicates above which Trino starts compaction.
+Pushing a large list of predicates down to the data source can
+compromise performance. For optimization in that situation, Trino can
+compact the large predicates. If necessary, adjust the threshold to
+ensure a balance between performance and predicate pushdown.
- 100
* - ``delta.max-outstanding-splits``
- The target number of buffered splits for each table scan in a query,
@@ -203,8 +204,8 @@ connector.
- ``32MB``
* - ``delta.max-split-size``
- Sets the largest :ref:`prop-type-data-size` for a single read section
-assigned to a worker after max-initial-splits have been processed. You
-can also use the corresponding catalog session property
+assigned to a worker after ``max-initial-splits`` have been processed.
+You can also use the corresponding catalog session property
``<catalog-name>.max_split_size``.
- ``64MB``
* - ``delta.minimum-assigned-split-weight``
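
For illustration, a minimal sketch of how these properties are set in a catalog
properties file, assuming a hypothetical catalog named ``example`` (the
metastore host is also an assumption, and the values shown are the documented
defaults)::

    # etc/catalog/example.properties -- connector.name and the other required
    # properties are configured as described earlier on this page
    hive.metastore.uri=thrift://metastore.example.net:9083

    # threshold above which large predicate lists are compacted (default shown)
    delta.domain-compaction-threshold=100

    # largest split assigned to a worker after the initial splits (default shown)
    delta.max-split-size=64MB

A property such as ``delta.max-split-size`` can also be changed for a single
session through its catalog session property, for example
``SET SESSION example.max_split_size = '128MB';``.
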
@@ -441,7 +442,7 @@ to register them::
)

Columns listed in the DDL, such as ``dummy`` in the preceding example, are
-ignored. The table schema is read from the transaction log, instead. If the
+ignored. The table schema is read from the transaction log instead. If the
schema is changed by an external system, Trino automatically uses the new
schema.
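
A minimal sketch of such a registration statement follows; the catalog, schema,
table, and S3 path are hypothetical, and the ``location`` table property is
assumed to point at the existing Delta table directory::

    CREATE TABLE example.default.my_delta_table (
       dummy bigint
    )
    WITH (
       location = 's3://example-bucket/path/to/my_delta_table'
    );

The ``dummy`` column only satisfies the DDL syntax; the actual columns come
from the transaction log, as described above.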

@@ -499,7 +500,7 @@ Write operations are supported for tables stored on the following systems:

Writes to :doc:`Amazon S3 <hive-s3>` and S3-compatible storage must be enabled
with the ``delta.enable-non-concurrent-writes`` property. Writes to S3 can
-safely be made from multiple Trino clusters, however write collisions are not
+safely be made from multiple Trino clusters; however, write collisions are not
detected when writing concurrently from other Delta Lake engines. You need to
make sure that no concurrent data modifications are run to avoid data
corruption.
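
As a sketch, enabling such writes amounts to a single entry in the hypothetical
catalog properties file used earlier::

    # required before Trino writes to S3 or S3-compatible storage
    delta.enable-non-concurrent-writes=true
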
@@ -519,7 +520,7 @@ Table statistics

You can use :doc:`/sql/analyze` statements in Trino to populate the table
statistics in Delta Lake. Data size and number of distinct values (NDV)
-statistics are supported, while Minimum value, maximum value, and null value
+statistics are supported; whereas minimum value, maximum value, and null value
count statistics are not supported. The :doc:`cost-based optimizer
</optimizer/cost-based-optimizations>` then uses these statistics to improve
query performance.
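
For example, statistics for the hypothetical table registered earlier are
collected with::

    ANALYZE example.default.my_delta_table;
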
@@ -578,7 +579,7 @@ disable it for a session, with the :doc:`catalog session property
</sql/set-session>` ``extended_statistics_enabled`` set to ``false``.
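
For example, assuming a catalog named ``example``::

    SET SESSION example.extended_statistics_enabled = false;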

If a table is changed with many delete and update operations, calling ``ANALYZE``
-does not result in accurate statistics. To correct the statistics you have to
+does not result in accurate statistics. To correct the statistics, you have to
drop the extended stats and analyze the table again.

Use the ``system.drop_extended_stats`` procedure in the catalog to drop the
@@ -630,7 +631,7 @@ this property is ``0s``. There is a minimum retention session property as well,
Memory monitoring
"""""""""""""""""

-When using the Delta Lake connector you need to monitor memory usage on the
+When using the Delta Lake connector, you need to monitor memory usage on the
coordinator. Specifically monitor JVM heap utilization using standard tools as
part of routine operation of the cluster.

@@ -657,5 +658,5 @@ Following is an example result:
node | trino-master
object_name | io.trino.plugin.deltalake.transactionlog:type=TransactionLogAccess,name=delta

-In a healthy system both ``datafilemetadatacachestats.hitrate`` and
+In a healthy system, both ``datafilemetadatacachestats.hitrate`` and
``metadatacachestats.hitrate`` are close to ``1.0``.