Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support multiple where calls in DeltaTable.optimize #1353

Conversation

sherlockbeard
Copy link
Contributor

Description

Resolves #1338

How was this patch tested?

test suite & by check the metrics of optimize operation in history table
one test added which replicates the previous (test above it ) with multiple where statement

Does this PR introduce any user-facing changes?

@sherlockbeard
Copy link
Contributor Author

#1344 continued

@tdas tdas requested a review from zsxwing August 24, 2022 18:15
Copy link
Member

@zsxwing zsxwing left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good. Only one style comment

@@ -125,14 +125,16 @@ case class OptimizeTableCommand(
val partitionColumns = txn.snapshot.metadata.partitionColumns
// Parse the predicate expression into Catalyst expression and verify only simple filters
// on partition columns are present
val partitionPredicates = partitionPredicate.map(predicate => {

val partitionPredicates = userPartitionPredicates.flatMap(partitionPredicate =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe re-format the code a bit like this:

    val partitionPredicates = userPartitionPredicates.flatMap { partitionPredicate =>
      partitionPredicate.map { predicate =>
        val predicates = parsePredicates(sparkSession, predicate)
        verifyPartitionPredicates(
          sparkSession,
          partitionColumns,
          predicates)
        predicates
      }.getOrElse(Seq.empty)
    }

The current format took me a while to realize getOrElse is inside the function.

@@ -39,7 +39,7 @@ class DeltaOptimizeBuilder private(
sparkSession: SparkSession,
tableIdentifier: String,
options: Map[String, String]) extends AnalysisHelper {
@volatile private var partitionFilter: Option[String] = None
@volatile private var partitionFilter: Seq[Option[String]] = Seq.empty
Copy link
Member

@zsxwing zsxwing Aug 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just found we may be able to write a cleaner code if we change the type to Seq[String] (including changing OptimizeTableCommand.userPartitionPredicates to Seq[String]) ? Sorry. I should have found this earlier.

Could you also remove @volatile since you are touching the code? It's not needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed volatile . if you don't mind me asking what is the use Annotation volatile here ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@volatile is the same as Java volatile. As DeltaOptimizeBuilder is not supposed to be used in multiple thread, we don't need it.

Copy link
Member

@zsxwing zsxwing left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@vkorukanti vkorukanti added this to the 2.2.0 milestone Dec 5, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Python DeltaOptimizeBuilder only applies the last where condition
3 participants