Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spark: Fix changelog table bug for start time older than current snapshot #11564

Merged
merged 5 commits into from
Nov 21, 2024

Conversation

Acehaidrey
Copy link
Contributor

@Acehaidrey Acehaidrey commented Nov 16, 2024

Fix changelog table bug for start time older than current snapshot

Problem

After upgrading to Iceberg 1.5 and Spark 3.5.1, the create_changelog_view procedure occasionally returns records that were inserted outside (before) the specified time range. Specifically:

  • When querying for recent changes (e.g. last 10 minutes)
  • The view sometimes includes records from 3-5 hours ago
  • Re-running the same query returns the expected 0 records
  • This inconsistency wasn't present in Iceberg 1.3

Changes

Added a unit test testChangelogViewOutsideTimeRange to TestChangelogTable that:

  1. Inserts test records into the table
  2. Waits for a time period to ensure those records are "old"
  3. Creates a changelog view with a time range that starts after the inserts
  4. Verifies the view returns 0 records (expected behavior)
  5. The test helps validate that records outside the specified time window are not incorrectly included

Testing

  • Added unit test directly targeting the time range filtering behavior
  • Test runs as part of the existing TestChangelogTable suite
  • Follows existing test patterns and conventions

To invoke the unit test:
./gradlew :iceberg-spark:iceberg-spark-extensions-3.5_2.12:test --tests "org.apache.iceberg.spark.extensions.TestChangelogTable"

Related Issue

TBD

Additional Context

@github-actions github-actions bot added the spark label Nov 16, 2024
@flyrain
Copy link
Contributor

flyrain commented Nov 20, 2024

Thank you, @Acehaidrey, for reporting this issue! It seems that the method buildChangelogScan() does not properly set up the scan when the startTimestamp is newer than the timestamp of the current snapshot. Adding the following logic at this line should address the problem:

if (startTimestamp != null && table.currentSnapshot().timestampMillis() < startTimestamp) {
  emptyScan = true;
}

However, I believe there's an opportunity to revisit the buildChangelogScan() method more holistically. Refactoring it could improve its clarity and make it less prone to errors in the future. I'd be happy to collaborate or discuss further ideas for restructuring the method. Let me know what you think!

@flyrain
Copy link
Contributor

flyrain commented Nov 20, 2024

Hi @bryanck, do we still have time to include this bug fix in 1.7.1?

@bryanck
Copy link
Contributor

bryanck commented Nov 20, 2024

The 1.7.1 ship has sailed, but if this is a regression or high priority, we can try to get it in.

@flyrain
Copy link
Contributor

flyrain commented Nov 20, 2024

It is a regression, better to get it in. @Acehaidrey should be able to provide a quick fix soon, like today. Appreciate if you can hold 1.7.1 a bit more, @bryanck.

@RussellSpitzer
Copy link
Member

@Acehaidrey

Execution failed for task ':iceberg-spark:iceberg-spark-extensions-3.5_2.12:spotlessJavaCheck'.
> The following files had format violations:
      src/test/java/org/apache/iceberg/spark/extensions/TestChangelogTable.java
          @@ -460,7 +460,6 @@
           ····//········[4,·"d",·"INSERT",·3,·7807948732874377651L]]

           ····assertThat(results).as("Num·records·must·be·zero").isEmpty();
           
          -
           ····//·Clean·up·the·changelog·view
           ····sql("DROP·VIEW·IF·EXISTS·test_changelog_view");
           ··}
           ```

@RussellSpitzer
Copy link
Member

@flyrain + @Acehaidrey We expect this test to fail right? Like we need to still add a fix to this pr?

@sfc-gh-ygu
Copy link

We expect this test to fail right? Like we need to still add a fix to this pr?

That's correct!

@Acehaidrey
Copy link
Contributor Author

Acehaidrey commented Nov 20, 2024 via email

@Acehaidrey
Copy link
Contributor Author

Hey @sfc-gh-ygu @flyrain @RussellSpitzer @bryanck I have gone ahead and updated this . Please if you can take a look - the test passes now as do the other tests.

Sorry for delays, home had a water leak today, and been a mess.

@Acehaidrey
Copy link
Contributor Author

Acehaidrey commented Nov 21, 2024 via email

Copy link
Contributor

@flyrain flyrain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall, left minor comments.

@Acehaidrey
Copy link
Contributor Author

thank you @flyrain for all the help here - I actually took your advice, think it looks cleaner this way if you dont mind seeing once more

Copy link
Contributor

@flyrain flyrain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 pending tests

@Acehaidrey
Copy link
Contributor Author

Thank you, I had to fix a formatting issue so pushed another update

@flyrain flyrain changed the title Add unit test for create_changelog_view time range behavior Spark: Fix changelog table bug for start time older than current snapshot Nov 21, 2024
@flyrain flyrain merged commit c448a4b into apache:main Nov 21, 2024
31 checks passed
@Acehaidrey Acehaidrey deleted the cdc-timerange-test branch November 21, 2024 02:24
@manuzhang
Copy link
Collaborator

@Acehaidrey @flyrain

The bug was due to we were not checking the timestamp of endSnapshot calculated from endTimestamp is less than the startTimestamp. I've submitted #11612 to make the logic easier to reason about, hopefully.

amogh-jahagirdar pushed a commit that referenced this pull request Nov 21, 2024
…shot (#11564) (#11613)

Co-authored-by: Ace Haidrey <acehaidrey@gmail.com>
nastra pushed a commit to nastra/iceberg that referenced this pull request Nov 21, 2024
zachdisc pushed a commit to zachdisc/iceberg that referenced this pull request Dec 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants