Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for preferred insert or create table layouts #2358

Merged
merged 10 commits into from
Jan 30, 2020

Conversation

sopel39
Copy link
Member

@sopel39 sopel39 commented Dec 28, 2019

No description provided.

@sopel39 sopel39 added the WIP label Dec 28, 2019
@cla-bot cla-bot bot added the cla-signed label Dec 28, 2019
@sopel39 sopel39 force-pushed the ks/insert_repartition branch 6 times, most recently from fefa7f0 to 69a4f64 Compare December 31, 2019 11:01
@sopel39 sopel39 changed the title [WIP] Add support for preferred insert or create table layouts Add support for preferred insert or create table layouts Dec 31, 2019
@sopel39 sopel39 requested a review from martint December 31, 2019 11:01
@sopel39 sopel39 assigned electrum and unassigned electrum Dec 31, 2019
@sopel39 sopel39 requested a review from electrum December 31, 2019 11:01
@sopel39
Copy link
Member Author

sopel39 commented Dec 31, 2019

this is ready to go

@sopel39 sopel39 removed the WIP label Dec 31, 2019
@sopel39 sopel39 force-pushed the ks/insert_repartition branch 2 times, most recently from 18c9274 to 7ef8a04 Compare December 31, 2019 12:25
@@ -2002,7 +2002,16 @@ private static Domain buildColumnDomain(ColumnHandle column, List<HivePartition>

Optional<HiveBucketHandle> hiveBucketHandle = getHiveBucketHandle(table, typeManager);
if (!hiveBucketHandle.isPresent()) {
return Optional.empty();
// return preferred layout which is partitioned by partition columns
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apply partitioning by partition columns also when the table is bucketed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When table is bucketed then data partitioning on partition columns should be required, but this is outside of scope of this PR.
For example,

  • currently: if you have 10 buckets per partition, then we partition data by bucket columns. This means only 10 workers will write data for all partitions
  • better: if you partition data by bucket columns AND partition columns, then still 10 workers will write data per partition. However, set of workers per partition changes therefore improving parallelism.

@findepi
Copy link
Member

findepi commented Jan 1, 2020

This adds use-preferred-write-partitioning configuration / use_preferred_write_partitioning session toggle.
Shouldn't we have a Hive- (catalog-) specific toggle instead?

@sopel39
Copy link
Member Author

sopel39 commented Jan 2, 2020

Shouldn't we have a Hive- (catalog-) specific toggle instead?

Some more context on engine vs connector approach. While connector specific toggle could be some stop-gap here, I don't think it's a good approach. This PR opens a way for CBO to make decision on using preferred insert layout. For example there could be very basic CBO rule that chooses proffered partitioning if number of groups if greater than some constant. I think such rule would work good enough in practice. This PR is step toward it. I don't like too many manual toggles
and insert queries are often simpler so stats are easier to estimate.

The SPI needs to be changed regardless of approach as connectors need to be able to return FIXED_HASH ConnectorPartitioningHandle. This is system partitioning handle and reimplementing it in connectors (there are various methods associated with it) doesn't make much sense.

if (getTaskWriterCount(session) > 1 && !node.getPartitioningScheme().isPresent()) {
requiredProperties = fixedParallelism();
preferredProperties = fixedParallelism();
// TODO: add support for arbitrary partitioning in local exchanges
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dain mentioned there was some PR to add support for arbitrary partitioning in local exchanges.

@@ -58,7 +58,7 @@ public ConnectorNewTableLayout getLayout()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand what the commit message is referring to. Whats "evenly partitioning"?


import static java.util.Objects.requireNonNull;

public class ConnectorNewTableLayout
{
private final ConnectorPartitioningHandle partitioning;
private final Optional<ConnectorPartitioningHandle> partitioning;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is strange. Why would partioningColumns be missing when partitioningColumns is present?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is strange. Why would partitioning be missing when partitioningColumns is present?

If you just care that data is partitioned on columns, but don't care how exactly, then you don't need to specify partitioning

Comment on lines 423 to 446
if (writeTableLayout.get().getLayout().getPartitioning().isPresent()) {
partitioningHandle = writeTableLayout.get().getPartitioning();
}
else {
// empty connector partitioning handle means evenly partitioning on partitioning columns
partitioningHandle = FIXED_HASH_DISTRIBUTION;
}
Copy link
Member

@martint martint Jan 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find the fact we're looking at getLayout().getPartitioning() to decide whether to call getPartitioning() very confusing, and probably a sign of an API design issue.

It'd be cleaner to make getPartitioning() for NewTableLayout return optional and just do:

PartitioningHandle partitioningHandle = writeTableLayout.get()
    .getPartitioning()
    .orElse(FIXED_HASH_DISTRIBUTION)

requiredProperties = fixedParallelism();
preferredProperties = fixedParallelism();
// TODO: add support for arbitrary partitioning in local exchanges
boolean hasFixedHashDistribution = node.getPartitioningScheme().isPresent()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe...

boolean hasFixedHashDistribution = node.getPartitioningScheme()
        .map(scheme -> scheme.getPartitioning().getHandle())
        .filter(isEqual(FIXED_HASH_DISTRIBUTION))
        .isPresent();

or

boolean hasFixedHashDistribution = node.getPartitioningScheme()
        .filter(scheme -> scheme.getPartitioning().getHandle().equals(FIXED_HASH_DISTRIBUTION))
        .isPresent();

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can move inside the if (getTaskWriterCount(session) > 1) { block

@sopel39 sopel39 force-pushed the ks/insert_repartition branch from 3edf486 to 3c65d37 Compare January 28, 2020 16:28
Copy link
Member Author

@sopel39 sopel39 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ac


import static java.util.Objects.requireNonNull;

public class ConnectorNewTableLayout
{
private final ConnectorPartitioningHandle partitioning;
private final Optional<ConnectorPartitioningHandle> partitioning;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is strange. Why would partitioning be missing when partitioningColumns is present?

If you just care that data is partitioned on columns, but don't care how exactly, then you don't need to specify partitioning

@sopel39 sopel39 force-pushed the ks/insert_repartition branch from 3c65d37 to 0c629af Compare January 28, 2020 16:34
@martint martint self-requested a review January 28, 2020 16:52
@sopel39 sopel39 force-pushed the ks/insert_repartition branch from 0c629af to 835ec62 Compare January 29, 2020 13:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

4 participants