feat(pt): add more information to summary and error message of loading library #3895
Conversation
… loading library
Most are copied from TF.
Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>
Walkthrough
The changes make the PyTorch backend of deepmd-kit more robust when loading the custom-op shared library. Exceptions raised during loading are now caught and checked against the CXX11 ABI flag and the PyTorch version used at build time, so incompatibilities produce a clear error instead of a raw loader failure. In addition, the entry points report extra backend information in the summary when custom operations are enabled.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant Main as main.py
    participant PT as cxx_op.py
    participant PyTorch
    participant Config as deepmd.env
    User->>Main: Start application
    Main->>Config: Get GLOBAL_CONFIG
    Main->>PT: Load CXX operations
    PT->>PyTorch: torch.ops.load_library(module_file)
    PyTorch-->>PT: Check CXX11 ABI and version compatibility
    PT->>PT: Exception handling and error raising
    Main->>Main: get_backend_info()
    Main-->>User: Provide backend info if ENABLE_CUSTOMIZED_OP is set
```
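As a rough illustration of the flow in the diagram, the sketch below shows how an entry point might assemble extra backend information only when the custom-op library was loaded successfully. It is a minimal sketch, not the code in `deepmd/pt/entrypoints/main.py`; the `GLOBAL_CONFIG` keys used here (`pt_version`, `pt_include_dir`, `pt_lib_dir`) are illustrative placeholders, not the exact keys used by deepmd-kit.

```python
# Hedged sketch of the summary flow; not the actual deepmd-kit source.
import torch

from deepmd.env import GLOBAL_CONFIG  # build-time configuration dictionary
from deepmd.pt.cxx_op import ENABLE_CUSTOMIZED_OP  # set by load_library() at import time


def get_backend_info() -> dict:
    """Collect PyTorch backend details for the startup summary."""
    if ENABLE_CUSTOMIZED_OP:
        # Only meaningful when the custom-op shared library could be loaded.
        # The keys below are placeholders for values recorded at build time.
        op_info = {
            "build with PT ver": GLOBAL_CONFIG.get("pt_version", "unknown"),
            "build with PT inc": GLOBAL_CONFIG.get("pt_include_dir", "unknown"),
            "build with PT lib": GLOBAL_CONFIG.get("pt_lib_dir", "unknown"),
        }
    else:
        op_info = {}
    return {
        "Backend": "PyTorch",
        "PT ver": torch.__version__,
        "Enable custom OP": ENABLE_CUSTOMIZED_OP,
        **op_info,
    }
```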
Codecov Report
Attention: Patch coverage is …

Additional details and impacted files:

```
@@            Coverage Diff             @@
##            devel    #3895      +/-   ##
==========================================
- Coverage   82.74%   82.72%   -0.03%
==========================================
  Files         519      519
  Lines       50491    50510      +19
  Branches     3015     3015
==========================================
+ Hits        41781    41786       +5
- Misses       7773     7787      +14
  Partials      937      937
```

☔ View full report in Codecov by Sentry.
Actionable comments posted: 2
…fo = None` to `op_info = {}` (#3912)

Solves issue #3911. When I run `examples/water/dpa2` using `dp --pt train input_torch.json`, an error occurs:

```
To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, DP_INTRA_OP_PARALLELISM_THREADS, and DP_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
[2024-06-26 07:53:43,325] DEEPMD INFO DeepMD version: 2.2.0b1.dev892+g73dab63f.d20240612
[2024-06-26 07:53:43,325] DEEPMD INFO Configuration path: input_torch.json
Traceback (most recent call last):
  File "/home/data/zhangcq/conda_env/deepmd-pt-1026/bin/dp", line 8, in <module>
    sys.exit(main())
  File "/home/data/zcq/deepmd-source/deepmd-kit/deepmd/main.py", line 842, in main
    deepmd_main(args)
  File "/home/data/zhangcq/conda_env/deepmd-pt-1026/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/data/zcq/deepmd-source/deepmd-kit/deepmd/pt/entrypoints/main.py", line 384, in main
    train(FLAGS)
  File "/home/data/zcq/deepmd-source/deepmd-kit/deepmd/pt/entrypoints/main.py", line 223, in train
    SummaryPrinter()()
  File "/home/data/zcq/deepmd-source/deepmd-kit/deepmd/utils/summary.py", line 62, in __call__
    build_info.update(self.get_backend_info())
  File "/home/data/zcq/deepmd-source/deepmd-kit/deepmd/pt/entrypoints/main.py", line 213, in get_backend_info
    return {
TypeError: 'NoneType' object is not a mapping
```

This bug was introduced by PR #3895.

![20240626-160240](https://github.com/deepmodeling/deepmd-kit/assets/100290172/92008b01-1e3d-437d-a09e-cc74b2da6412)

When `op_info` is `None`, `{**op_info}` raises an error. Changing `op_info = None` to `op_info = {}` solves the issue.

## Summary by CodeRabbit

- **Bug Fixes**
  - Improved stability by initializing `op_info` as an empty dictionary instead of `None`, preventing a runtime error.
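In isolation, the failure mode and the fix look like this (a trimmed illustration, not the project code):

```python
op_info = None
# summary = {"Backend": "PyTorch", **op_info}  # TypeError: 'NoneType' object is not a mapping

op_info = {}                                    # the fix
summary = {"Backend": "PyTorch", **op_info}     # unpacking an empty dict adds nothing
```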
…g library (deepmodeling#3895)

Most are copied from TF for consistent user experience.

The summary is like

```
[2024-06-22 06:19:31,090] DEEPMD INFO --------------------------------------------------------------------------------------------------------------
[2024-06-22 06:19:31,090] DEEPMD INFO installed to: /home/jz748/codes/deepmd-kit/deepmd
[2024-06-22 06:19:31,090] DEEPMD INFO               /home/jz748/anaconda3/lib/python3.10/site-packages/deepmd
[2024-06-22 06:19:31,090] DEEPMD INFO source: v3.0.0a0-229-g6d2c6095
[2024-06-22 06:19:31,090] DEEPMD INFO source brach: pt-add-more-info
[2024-06-22 06:19:31,090] DEEPMD INFO source commit: 6d2c609
[2024-06-22 06:19:31,090] DEEPMD INFO source commit at: 2024-06-22 06:16:55 -0400
[2024-06-22 06:19:31,090] DEEPMD INFO use float prec: double
[2024-06-22 06:19:31,090] DEEPMD INFO build variant: cuda
[2024-06-22 06:19:31,090] DEEPMD INFO Backend: PyTorch
[2024-06-22 06:19:31,090] DEEPMD INFO PT ver: v2.1.2.post300-ge32f208075b
[2024-06-22 06:19:31,090] DEEPMD INFO Enable custom OP: True
[2024-06-22 06:19:31,090] DEEPMD INFO build with PT ver: 2.1.2
[2024-06-22 06:19:31,090] DEEPMD INFO build with PT inc: /home/jz748/anaconda3/lib/python3.10/site-packages/torch/include
[2024-06-22 06:19:31,090] DEEPMD INFO                    /home/jz748/anaconda3/lib/python3.10/site-packages/torch/include/torch/csrc/api/include
[2024-06-22 06:19:31,090] DEEPMD INFO build with PT lib: /home/jz748/anaconda3/lib/python3.10/site-packages/torch/lib
[2024-06-22 06:19:31,090] DEEPMD INFO running on: localhost.localdomain
[2024-06-22 06:19:31,090] DEEPMD INFO computing device: cuda:0
[2024-06-22 06:19:31,090] DEEPMD INFO CUDA_VISIBLE_DEVICES: unset
[2024-06-22 06:19:31,090] DEEPMD INFO Count of visible GPUs: 2
[2024-06-22 06:19:31,090] DEEPMD INFO num_intra_threads: 0
[2024-06-22 06:19:31,091] DEEPMD INFO num_inter_threads: 0
[2024-06-22 06:19:31,091] DEEPMD INFO --------------------------------------------------------------------------------------------------------------
```

The error message is like

```
deepmd/pt/cxx_op.py:39: in load_library
    torch.ops.load_library(module_file)
../../anaconda3/lib/python3.10/site-packages/torch/_ops.py:852: in load_library
    ctypes.CDLL(path)
../../anaconda3/lib/python3.10/ctypes/__init__.py:374: in __init__
    self._handle = _dlopen(self._name, mode)
E   OSError: /home/jz748/anaconda3/lib/python3.10/site-packages/deepmd/lib/libdeepmd_op_pt.so: undefined symbol: _ZNK5torch8autograd4Node4nameEv

The above exception was the direct cause of the following exception:

source/tests/pt/test_LKF.py:9: in <module>
    from deepmd.pt.entrypoints.main import (
deepmd/pt/__init__.py:4: in <module>
    from deepmd.pt.cxx_op import (
deepmd/pt/cxx_op.py:95: in <module>
    ENABLE_CUSTOMIZED_OP = load_library("deepmd_op_pt")
deepmd/pt/cxx_op.py:51: in load_library
    raise RuntimeError(
E   RuntimeError: This deepmd-kit package was compiled with CXX11_ABI_FLAG=0, but PyTorch runtime was compiled with CXX11_ABI_FLAG=1. These two library ABIs are incompatible and thus an error is raised when loading deepmd_op_pt. You need to rebuild deepmd-kit against this TensorFlow runtime.
```

## Summary by CodeRabbit

- **New Features**
  - Introduced compatibility checks between the compiled deepmd-kit package and the PyTorch runtime, ensuring a consistent ABI flag and version.
- **Bug Fixes**
  - Enhanced error handling when loading libraries to prevent runtime issues related to incompatibilities.

---------

Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>
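For reference, the sketch below shows the kind of check that can turn a raw `undefined symbol` failure into the explicit `RuntimeError` quoted above. It is a minimal sketch under stated assumptions, not the code in `deepmd/pt/cxx_op.py`; the build-time constants and the library path are placeholders.

```python
# Illustrative sketch only; the constants stand in for values recorded at build time.
import torch

BUILT_CXX11_ABI_FLAG = 0    # assumed: CXX11 ABI flag deepmd-kit was compiled with
BUILT_PT_VERSION = "2.1.2"  # assumed: PyTorch version deepmd-kit was compiled with


def load_library(module_file: str) -> bool:
    """Try to load a custom-op shared library; return whether it is usable."""
    try:
        torch.ops.load_library(module_file)
    except OSError as e:
        # A bare dlopen error (e.g. an undefined C++ symbol) is hard to act on,
        # so compare the build-time environment against the runtime PyTorch and
        # raise a message that names the actual incompatibility.
        runtime_abi = int(torch.compiled_with_cxx11_abi())
        if runtime_abi != BUILT_CXX11_ABI_FLAG:
            raise RuntimeError(
                f"This deepmd-kit package was compiled with "
                f"CXX11_ABI_FLAG={BUILT_CXX11_ABI_FLAG}, but the PyTorch runtime "
                f"was compiled with CXX11_ABI_FLAG={runtime_abi}. "
                "Rebuild deepmd-kit against this PyTorch runtime."
            ) from e
        if not torch.__version__.startswith(BUILT_PT_VERSION):
            raise RuntimeError(
                f"This deepmd-kit package was compiled with PyTorch "
                f"{BUILT_PT_VERSION}, but the runtime PyTorch version is "
                f"{torch.__version__}."
            ) from e
        # The library is simply missing or otherwise unusable: disable custom ops.
        return False
    return True


# Hypothetical usage: the resulting flag gates the extra summary information.
ENABLE_CUSTOMIZED_OP = load_library("/path/to/libdeepmd_op_pt.so")
```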