pd: skip certain UT and fix paddle ver in test_cuda.yml #4439
Conversation
… in jit.save; fix the paddlepaddle-gpu version in test_cuda.yml
📝 Walkthrough
The pull request introduces several updates to the workflow configuration for testing CUDA, specifically in `.github/workflows/test_cuda.yml`.
Actionable comments posted: 3
🧹 Outside diff range and nitpick comments (2)
source/tests/pd/test_dp_show.py (2)
`32-33`: **Enhance skip message with issue tracking reference.**

While the skip message clearly indicates the CUDA error 709 issue, it would be helpful to add a reference to the tracking issue for future maintainers.

-@unittest.skip("Skip until solving cuda error 709 in jit.save")
+@unittest.skip("Skip until solving cuda error 709 in jit.save (see issue #XXXX)")

Also applies to: 105-106
Line range hint `107-204`: **Consider adding error test case for invalid model branch.**

The TestMultiTaskModel thoroughly tests the happy path but could benefit from additional error handling tests, similar to test_checkpoint_error in TestSingleTaskModel.

Consider adding a test case like:

    def test_invalid_model_branch(self):
        INPUT = "model.ckpt.pd"
        ATTRIBUTES = "model-branch type-map descriptor fitting-net"
        with self.assertRaisesRegex(RuntimeError, "Invalid model branch"):
            run_dp(f"dp --pd show {INPUT} {ATTRIBUTES} --model-branch invalid_branch")
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (3)

- `.github/workflows/test_cuda.yml` (1 hunks)
- `source/tests/pd/test_dp_show.py` (2 hunks)
- `source/tests/pd/test_multitask.py` (2 hunks)
🔇 Additional comments (5)
.github/workflows/test_cuda.yml (3)
Line range hint `19-20`: **Verify CUDA version compatibility across dependencies**
The workflow uses CUDA 12.6.2 base image but installs packages with different CUDA versions:
- PaddlePaddle: CUDA 12.3
- TensorFlow: 2.18.0rc2
- PyTorch: 2.5.0
- JAX: CUDA 12
This version mix might lead to compatibility issues. Consider:
- Aligning CUDA versions across all dependencies
- Documenting why specific versions are required
- Adding explicit tests to verify CUDA compatibility
Let's check for potential CUDA conflicts:
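The verification script itself is not reproduced in this excerpt; a minimal sketch of what such a check could look like (the package names come from the list above, everything else is an assumption) is:

```bash
#!/bin/bash
# Hypothetical sketch: list the CUDA-related packages installed in the CI image
# so mismatched CUDA builds across frameworks are easy to spot.
python -m pip list 2>/dev/null | grep -Ei 'paddlepaddle|torch|tensorflow|jax|nvidia' \
  || echo "No CUDA-related packages found"
# Report the CUDA toolkit version shipped with the base image, if nvcc is present.
command -v nvcc >/dev/null && nvcc --version | tail -n 1
```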
Also applies to: 54-54
🧰 Tools
🪛 actionlint (1.7.4)
51-51: shellcheck reported issue in this script: SC2155:warning:1:8: Declare and assign separately to avoid masking return values
(shellcheck)
51-51: shellcheck reported issue in this script: SC2155:warning:2:8: Declare and assign separately to avoid masking return values
(shellcheck)
51-51: shellcheck reported issue in this script: SC2102:info:4:61: Ranges can only match single chars (mentioned due to duplicates)
(shellcheck)
`54-54`: **Consider using a stable PaddlePaddle release instead of nightly build**

The current installation uses a nightly development build (dev20241126), which may introduce instability in the CI pipeline. While this might temporarily fix the CUDA error 709, it's generally recommended to use stable releases in CI environments.
Consider either:
- Using the latest stable release with CUDA 12.3 support
- If the nightly build is necessary, document why it's required and add a reminder to update it to a stable version once available
Let's verify if there are stable versions available:
✅ Verification successful
Let me try another verification approach to understand the context better:
Using development build is justified due to CUDA error 709
Based on the search results, using the nightly build appears to be a deliberate choice to address the CUDA error 709 that's affecting multiple tests. The codebase shows:
- Multiple test files are skipped with the comment "Skip until solving cuda error 709 in jit.save"
- The project requires PaddlePaddle >= 3.0.0b1 as seen in
backend/find_paddle.py
- No stable CUDA 12.3 versions are currently available
Since this is a known issue being actively worked on, and the development build is specifically chosen to address the CUDA error 709, the current approach is appropriate for now.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check for stable PaddlePaddle versions with CUDA 12.3 support
curl -s "https://www.paddlepaddle.org.cn/whl/linux/gpu-cuda-12.3/flagged" | grep -o 'paddlepaddle_gpu-[0-9.]*-cp311-cp311-linux_x86_64.whl' || echo "No stable versions found"
Length of output: 198
Script:
#!/bin/bash
# Check the repository for any documentation or comments about PaddlePaddle version requirements
rg -i "paddlepaddle" -g "!*.whl" -g "!*.pyc" -g "!*.log" -A 3 -B 3
# Also check if there are any issues or PRs mentioning CUDA error 709
rg -i "cuda.*709|error.*709" -g "!*.whl" -g "!*.pyc" -g "!*.log" -A 2 -B 2
Length of output: 4375
Line range hint `65-69`: **Consider optimizing test environment for CUDA stability**
Given the CUDA error 709 issues mentioned in the PR objectives, consider these environment optimizations:
1. Add memory management settings:

       env:
         NUM_WORKERS: 0
         CUDA_VISIBLE_DEVICES: 0
         XLA_PYTHON_CLIENT_PREALLOCATE: false
         CUDA_LAUNCH_BLOCKING: 1
         PYTORCH_CUDA_ALLOC_CONF: max_split_size_mb:512

2. Add error logging to capture CUDA-specific issues
Let's check current CUDA memory management settings:
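The check itself is not shown here; a hedged sketch of how one might inspect the workflow for these settings (the file path comes from the changed-files list above, the search pattern is an assumption) could be:

```bash
#!/bin/bash
# Hypothetical sketch: search the CUDA test workflow for memory- and
# device-related environment variables that are already configured.
rg -n -i 'CUDA_VISIBLE_DEVICES|XLA_PYTHON_CLIENT|CUDA_LAUNCH_BLOCKING|PYTORCH_CUDA_ALLOC_CONF' \
  .github/workflows/test_cuda.yml || echo "No CUDA memory management settings found"
```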
source/tests/pd/test_dp_show.py (1)

Line range hint `34-104`: **LGTM! Well-structured test implementation.**

The TestSingleTaskModel implementation follows testing best practices with:
- Proper setup and teardown
- Comprehensive attribute verification
- Good error handling for invalid cases
source/tests/pd/test_multitask.py (1)

Line range hint `43-185`: **Verify the scope of CUDA error 709 impact.**

Let's verify if this CUDA error affects other test files and if there are any potential workarounds.
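The verification script is not included in this excerpt; a minimal sketch in the spirit of the earlier rg searches (the skip message string comes from the diff above, the rest is an assumption) might be:

```bash
#!/bin/bash
# Hypothetical sketch: list every test file skipped because of CUDA error 709
# in jit.save, to gauge how widespread the problem is.
rg -l "Skip until solving cuda error 709 in jit.save" source/tests \
  || echo "No other affected test files found"
```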
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files

@@            Coverage Diff             @@
##            devel    #4439      +/-   ##
==========================================
- Coverage   83.27%   82.75%   -0.52%
==========================================
  Files         667      667
  Lines       61446    61446
  Branches     3486     3486
==========================================
- Hits        51167    50851     -316
- Misses       9151     9469     +318
+ Partials     1128     1126       -2

☔ View full report in Codecov by Sentry.
In two unit tests under `pd/`, `paddle.jit.save` is called, which occasionally leads to CUDA error 709. Until this issue is resolved, these two unit tests (`pd/test_dp_show.py` and `pd/test_multitask.py`) are temporarily marked as skipped. Meanwhile, the paddlepaddle-gpu version in `test_cuda.yml` has been pinned.
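For anyone reproducing the failure locally, a quick sanity check of the installed paddlepaddle-gpu build may help; this is only a sketch and not part of the PR, assuming a working PaddlePaddle installation:

```bash
#!/bin/bash
# Print the installed PaddlePaddle version and run Paddle's built-in GPU self-check;
# useful when investigating the intermittent CUDA error 709 in paddle.jit.save.
python -c "import paddle; print(paddle.__version__); paddle.utils.run_check()"
```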
@njzjz
Summary by CodeRabbit
Bug Fixes
Tests