Spark 4: Fix miscellaneous tests including logic, repart, hive_delimited. [databricks] #11129

Merged

mythrocks
Collaborator

@mythrocks mythrocks commented Jul 2, 2024

Fixes #11031.

This PR addresses tests that fail on Spark 4.0 in the following files:

  1. integration_tests/src/main/python/datasourcev2_read_test.py
  2. integration_tests/src/main/python/expand_exec_test.py
  3. integration_tests/src/main/python/get_json_test.py
  4. integration_tests/src/main/python/hive_delimited_text_test.py
  5. integration_tests/src/main/python/logic_test.py
  6. integration_tests/src/main/python/repart_test.py
  7. integration_tests/src/main/python/time_window_test.py
  8. integration_tests/src/main/python/json_matrix_test.py
  9. integration_tests/src/main/python/misc_expr_test.py
  10. integration_tests/src/main/python/orc_write_test.py

Signed-off-by: MithunR <mithunr@nvidia.com>
Test is inexplicably failing with ANSI off.
Moved overflowing test into a separate function, tested with ANSI on/off.
@mythrocks mythrocks changed the title from "WIP:Spark 4: Fix miscellaneous tests including logic, repart, hive_delimited." to "WIP: Spark 4: Fix miscellaneous tests including logic, repart, hive_delimited." Jul 2, 2024
@mythrocks mythrocks marked this pull request as draft July 2, 2024 23:49
@mythrocks
Collaborator Author

Still a work in progress. A couple of other tests to be addressed.

@mythrocks mythrocks self-assigned this Jul 5, 2024
Record comparisons do not currently account for legitimate whitespace
differences.
See NVIDIA#11154.
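
As an illustration of the whitespace problem noted in this commit message, here is a minimal sketch of a record comparison that ignores insignificant JSON whitespace. The helper name `same_json_record` is hypothetical; it is not part of the integration test framework, and the source does not say how #11154 will be resolved.

```python
import json


def same_json_record(left, right):
    """Compare two JSON strings by parsed value rather than by raw text.

    Hypothetical helper for illustration only. Parsing discards whitespace
    between tokens but preserves whitespace inside string values. Note that
    it also ignores object key order, which may or may not be desirable.
    """
    try:
        return json.loads(left) == json.loads(right)
    except (TypeError, ValueError):
        # Not valid JSON (or None): fall back to an exact comparison.
        return left == right


print(same_json_record('{"a": 1, "b": [1, 2]}', '{"a":1,"b":[1,2]}'))
print(same_json_record('{"a": " x"}', '{"a": "x"}'))
```

The first call compares records that differ only in spacing between tokens; the second differs inside a string value, which is a real difference in content.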
@mythrocks mythrocks marked this pull request as ready for review July 8, 2024 21:27
@mythrocks mythrocks changed the title from "WIP: Spark 4: Fix miscellaneous tests including logic, repart, hive_delimited." to "Spark 4: Fix miscellaneous tests including logic, repart, hive_delimited." Jul 8, 2024
@mythrocks
Collaborator Author

Build

@mythrocks
Collaborator Author

Build

@mythrocks
Collaborator Author

That last failure was an interesting one to track down.

Time interval calculations on Spark < 3.3 involve multiplication/division aggregation operations, which tend to fall back to the CPU in ANSI mode because of #5114. With part of the plan off the GPU, this test is guaranteed to fail there.

On Spark >= 3.3, the same calculations appear to use modulo operations, which don't seem susceptible to ANSI-mode failures.

I've included a skip for this test with ANSI enabled, on Spark < 3.3. This can be rolled back once #5114 is addressed.
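
The skip condition described above can be sketched as a small predicate. This is only an illustration under stated assumptions: `parse_version` and `should_skip_interval_test` are hypothetical names, not the helpers the PR actually uses, and the version strings are examples.

```python
def parse_version(version):
    """Return (major, minor) from a version string such as '3.2.4'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def should_skip_interval_test(spark_version, ansi_enabled):
    """Skip only when ANSI mode is on AND Spark is older than 3.3.

    Hypothetical predicate: on those versions, part of the plan falls back
    to the CPU (see #5114), so the test cannot pass.
    """
    return ansi_enabled and parse_version(spark_version) < (3, 3)


for version, ansi in [("3.2.4", True), ("3.2.4", False), ("3.3.0", True)]:
    print(version, ansi, should_skip_interval_test(version, ansi))
```

In a pytest suite, a predicate like this could feed `pytest.mark.skipif(condition, reason=...)`, with the reason citing #5114 so the skip is easy to roll back later.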

@mythrocks
Collaborator Author

Build

Collaborator

@NVnavkumar NVnavkumar left a comment

Overall LGTM, a couple of minor documentation/comment nits.

1. Cited the source for the clumsy error message.
2. Fixed comment regarding fallback to CPU.
@mythrocks
Collaborator Author

Build

@mythrocks
Collaborator Author

@NVnavkumar, I was wondering if you might take another look at this one.

Collaborator

@NVnavkumar NVnavkumar left a comment

One nit left here.

integration_tests/src/main/python/get_json_test.py (outdated, resolved)
@mythrocks mythrocks changed the title from "Spark 4: Fix miscellaneous tests including logic, repart, hive_delimited." to "Spark 4: Fix miscellaneous tests including logic, repart, hive_delimited. [databricks]" Jul 16, 2024
@mythrocks
Collaborator Author

Build

@mythrocks
Collaborator Author

There seems to be an error on Spark 3.3, where the expected exception isn't thrown. It's taking a bit of time to repro. I'll update here once I have something.

Changed datagen to guarantee overflow.
Dropped superfluous num_parts value.
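
To illustrate why pinning the generated data matters for a test that expects an overflow exception, here is a minimal sketch. It assumes a 64-bit addition for the sake of the example; the source does not say which operation or data generator the commit changed, and every name below is hypothetical.

```python
import random

INT64_MAX = 2**63 - 1
INT64_MIN = -(2**63)


def checked_add(a, b):
    """ANSI-style add: raise on 64-bit overflow instead of wrapping."""
    result = a + b
    if not INT64_MIN <= result <= INT64_MAX:
        raise ArithmeticError("long overflow")
    return result


def overflowing_pairs(count, seed):
    """Generate pairs whose sum always exceeds INT64_MAX.

    The left operand is pinned to the maximum and the right operand is
    strictly positive, so overflow does not depend on the random draw.
    """
    rng = random.Random(seed)
    return [(INT64_MAX, rng.randint(1, INT64_MAX)) for _ in range(count)]


def overflows(a, b):
    """Return True if checked_add raises for this pair."""
    try:
        checked_add(a, b)
    except ArithmeticError:
        return True
    return False


print(all(overflows(a, b) for a, b in overflowing_pairs(10, seed=0)))
```

With unconstrained random inputs, whether any row overflows depends on the seed, so a test that asserts on the exception can pass on one platform and fail on another.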
@mythrocks
Collaborator Author

I think I've addressed the Databricks failure. I'll kick off another build, and ask the reviewers for another round.

@mythrocks
Collaborator Author

Build

@mythrocks
Collaborator Author

@NVnavkumar, I've fixed the last nit. Does this look agreeable?

Collaborator

@NVnavkumar NVnavkumar left a comment

LGTM

@mythrocks mythrocks merged commit 125feb2 into NVIDIA:branch-24.08 Jul 18, 2024
43 checks passed
@mythrocks
Collaborator Author

Thank you for reviewing, @NVnavkumar. This change has now been merged.

@sameerz sameerz added the Spark 4.0+ Spark 4.0+ issues label Jul 19, 2024
Development

Successfully merging this pull request may close these issues.

Fix tests failures in multiple files
3 participants