Disable task level writer scaling from TableExecute #18388

Merged

@gaurav8297 gaurav8297 commented Jul 24, 2023

Description

Currently, task-level writer scaling can produce files whose sizes are not controlled by the `OPTIMIZE` command, which makes `OPTIMIZE` unreliable. This change disables task-level writer scaling for `TableExecute`.
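
To illustrate the problem, here is a deliberately simplified model, not Trino's actual writer logic. It assumes each writer in a task emits exactly one file and that the input is split evenly across writers. The class name, method names, and byte sizes are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class WriterScalingSketch {
    // Assumption: input is split evenly and each writer produces exactly one file.
    public static List<Long> fileSizes(long totalBytes, int writerCount) {
        if (writerCount < 1) {
            throw new IllegalArgumentException("writerCount must be at least 1");
        }
        List<Long> sizes = new ArrayList<>();
        long base = totalBytes / writerCount;
        long remainder = totalBytes % writerCount;
        for (int i = 0; i < writerCount; i++) {
            // Spread any leftover bytes over the first few writers.
            sizes.add(base + (i < remainder ? 1 : 0));
        }
        return sizes;
    }

    // Counts output files that fall below the given threshold.
    public static long countBelowThreshold(List<Long> sizes, long thresholdBytes) {
        return sizes.stream().filter(size -> size < thresholdBytes).count();
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long threshold = 100 * mb; // hypothetical file_size_threshold
        long total = 240 * mb;     // hypothetical amount of data rewritten by one task
        for (int writers : new int[] {1, 4}) {
            List<Long> sizes = fileSizes(total, writers);
            System.out.println(writers + " writer(s): "
                    + countBelowThreshold(sizes, threshold) + " file(s) below threshold");
        }
    }
}
```

Under these assumptions, a single writer produces one file above the threshold, while scaling the same task to four writers produces four files that are all below it and would be picked up again by the next `OPTIMIZE` run.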

Additional context and related issues

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Hive, Hudi, Delta, Iceberg
* Fix `OPTIMIZE` to avoid creating files smaller than `file_size_threshold` due to task writer scaling. ({issue}`18388`)

@cla-bot cla-bot bot added the cla-signed label Jul 24, 2023
@gaurav8297 gaurav8297 requested a review from raunaqmorarka July 24, 2023 15:01
@gaurav8297 gaurav8297 force-pushed the disable_scaling_table_execute branch from bb6e2ce to 29db74d Compare July 24, 2023 15:08
@github-actions github-actions bot added the tests:hive and hive (Hive connector) labels Jul 24, 2023
@gaurav8297 gaurav8297 force-pushed the disable_scaling_table_execute branch from 29db74d to 8536820 Compare July 24, 2023 16:44
@gaurav8297 gaurav8297 force-pushed the disable_scaling_table_execute branch from 8536820 to c06d835 Compare July 24, 2023 16:57
@gaurav8297 gaurav8297 force-pushed the disable_scaling_table_execute branch from c06d835 to 0f6a774 Compare July 24, 2023 19:43
assertUpdate("INSERT INTO " + tableName + " VALUES 10", 1);
assertUpdate("INSERT INTO " + tableName + " VALUES 20", 1);
assertUpdate("INSERT INTO " + tableName + " VALUES NULL", 1);
assertUpdate("ALTER TABLE " + tableName + " EXECUTE OPTIMIZE");
assertUpdate(Session.builder(getQueryRunner().getDefaultSession())
.setSystemProperty("task_writer_count", "1")
Member

Instead of setting this session property, can we adjust the assertions for the increased file count?

Member Author

I think we should explicitly set the value of `task_writer_count`. This makes the test more reliable.

@github-actions github-actions bot added the iceberg (Iceberg connector) and delta-lake (Delta Lake connector) labels Jul 24, 2023
Task-level writer scaling can result in files smaller than file_size_threshold,
which is undesirable behaviour.
Labels
cla-signed, delta-lake (Delta Lake connector), hive (Hive connector), iceberg (Iceberg connector)