Document OPTIMIZE for Iceberg #10790
da9659c
to
cbbedbc
Compare
Let me have a look again after you updated.
ALTER TABLE EXECUTE
^^^^^^^^^^^^^^^^^^^

The connector supports the following commands for use with
I would simplify .. something like this:
The connector supports collapsing files in a table into fewer larger files with identical data. This operation improves performance and reduces disk usage.
All files with a size below the optional ``file_size_threshold`` parameter are collapsed. The default value for the threshold is ``100MB``:

.. code-block:: sql

    ALTER TABLE test_table EXECUTE OPTIMIZE
I reworded the paragraph following your suggestion, but kept the reference to partitioning. I think it is important to mention that, for partitioned tables, the collapsing takes place per partition.
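For readers following the thread, here is a minimal sketch of how the threshold parameter and the per-partition behaviour discussed above might be exercised. The table names, the ``10MB`` value, and the ``partition_key`` column are illustrative assumptions and do not come from the PR itself.

```sql
-- Merge only files smaller than 10MB instead of the 100MB default.
ALTER TABLE test_table EXECUTE optimize(file_size_threshold => '10MB');

-- Limit compaction to selected partitions of a partitioned table.
-- partition_key is a hypothetical partitioning column.
ALTER TABLE test_partitioned_table EXECUTE optimize
WHERE partition_key = 1;
```

The second statement only narrows which partitions are considered; files are still merged separately within each selected partition.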
cbbedbc
to
2f5b181
Compare
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The connector offers :ref:`ALTER TABLE EXECUTE <alter-table-execute>`
``OPTIMIZE`` command for collapsing files which correspond to the
@findepi please let me know if I'm wrong in assuming that OPTIMIZE touches only the files that correspond to the current snapshot of the table.
- yes it reads off of current snapshot
- talking about snapshots here is not necessary. Do you specify that UPDATE or SELECT operate on current snapshot? worse, it can be misunderstood, as if OPTIMIZE was replacing current snapshot, or rewriting the files in situ
988ee44
to
4d24677
Compare
@findepi I think that because this command acts only on the current snapshot, it is worth mentioning this aspect. Not mentioning it could suggest, to a user who does not know the internals of the Iceberg connector, that all the existing snapshots of the table are compacted.
4d24677
to
c5b5734
Compare
A user who doesn't know the internals of Iceberg won't know and won't care about snapshots.
c5b5734
to
147a867
Compare
This is the current state that I have:
@findepi I don't know how I can exclude the
How do we talk about inserts, updates and deletes?
147a867
to
dfe48ea
Compare
@findepi I took "snapshot" out of the documentation. Please have another look at the PR.
@findinpath please rebase, there is a conflict.
ALTER TABLE EXECUTE optimize
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The connector offers :ref:`ALTER TABLE EXECUTE <alter-table-execute>`
``optimize`` command for rewriting the current content of the specified
At risk of overriding your work so far, we have a similar section that documents ALTER TABLE EXECUTE for the Hive connector: https://trino.io/docs/current/connector/hive.html#alter-table-execute
For the sake of consistency, we should make these as similar as possible. Are there any differences in implementation on ALTER TABLE EXECUTE on Hive vs on Iceberg?
The optimize procedure is disabled by default, and can be enabled for a catalog with the <catalog name>.non_transactional_optimize_enabled session property.
This property is not required for the Iceberg connector.
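To make the difference concrete, the following sketch shows what enabling the property for Hive might look like, compared with Iceberg where no such step is needed. The catalog names ``hive`` and ``iceberg``, the ``default`` schema, and the table name are assumptions chosen for illustration.

```sql
-- Hive: optimize must first be enabled for the session.
-- "hive" is a hypothetical catalog name.
SET SESSION hive.non_transactional_optimize_enabled = true;
ALTER TABLE hive.default.test_table EXECUTE optimize;

-- Iceberg: no session property is required.
-- "iceberg" is a hypothetical catalog name.
ALTER TABLE iceberg.default.test_table EXECUTE optimize;
```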
Are there any differences in implementation on ALTER TABLE EXECUTE on Hive vs on Iceberg?
@findepi can you please answer this question? I am not familiar enough with the Hive implementation of OPTIMIZE
Are there any differences in implementation on ALTER TABLE EXECUTE on Hive vs on Iceberg?
They are independent. Hive's is a dangerous, non-transactional operation that can, in case of failure, lead to corrupted table state.
Are there any differences in implementation on ALTER TABLE EXECUTE on Hive vs on Iceberg?
They are independent. Hive's is a dangerous, non-transactional operation that can, in case of failure, lead to corrupted table state.
Would the content in https://trino.io/docs/current/connector/hive.html#alter-table-execute make sense here then, just without the warning note about it being a non-transactional operation?
@jhlodin mostly, yes, except for minor wording #10790 (comment)
@findinpath Can you update the contents of this commit to reuse the Hive documentation? Minus the note as mentioned, and with whatever other feedback there is.
I have aligned the Hive and Iceberg documentation for the optimize command so that both follow the same structure.
I've taken the liberty of adding along the way a few more usage examples for the command in the Hive documentation.
Looks great, thanks!
dfe48ea
to
23c4665
Compare
23c4665
to
cc14ad0
Compare
LGTM pending comments
The ``optimize`` command is used for rewriting the active content
of the specified table so that it is merged into potentially
fewer larger files.
In case that the table is partitioned, the data compaction
acts separately on each partition selected for optimization.
This operation improves read performance and reduces disk usage.

All files with a size below the optional ``file_size_threshold``
parameter (default value for the threshold is ``100MB``) are
merged:
I think this introduction is better than what we had before in Hive, can you apply this there as well?
I reworded it slightly for Hive because the optimize operation applies only to non-transactional tables.
485714d
to
7d61f68
Compare
Hey all, are we ready to merge this PR?
7d61f68
to
3f90332
Compare
3f90332
to
db59b49
Compare
No description provided.