Perf: load data systems on rank 0 #4478

caic99 · 2024-12-19T03:05:53Z

The current implementation loads the data systems on every rank, which stresses the file system.
In this PR, only rank 0 loads the data systems, and the result is broadcast to all other ranks.
The data samplers, which are initialized later, still use each rank's own seed.
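
The pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function name `load_systems_on_rank0` and the `construct` callback are hypothetical, and `torch` is imported optionally so that the sketch also runs in a single process without a process group.

```python
from typing import Callable, Sequence

try:  # torch is optional here so the sketch also runs without it
    import torch.distributed as dist
except ImportError:  # assumption: fall back to single-process behavior
    dist = None


def _distributed() -> bool:
    """True only when a torch.distributed process group is active."""
    return dist is not None and dist.is_available() and dist.is_initialized()


def load_systems_on_rank0(
    paths: Sequence[str],
    construct: Callable[[str], object],  # hypothetical per-system loader
) -> list:
    """Build one dataset per path on rank 0, then share the list with all ranks."""
    rank = dist.get_rank() if _distributed() else 0
    if rank == 0:
        # Only this rank touches the file system.
        systems: list = [construct(p) for p in paths]
    else:
        # Placeholders of the right length; the broadcast overwrites them.
        systems = [None] * len(paths)
    if _distributed():
        # Pickles the list on rank 0 and fills it in place on the other ranks.
        dist.broadcast_object_list(systems, src=0)
        assert all(s is not None for s in systems), "incomplete broadcast"
    return systems
```

Without an initialized process group the function simply loads everything locally, for example `load_systems_on_rank0(["a", "b"], str.upper)` returns `["A", "B"]`.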

Summary by CodeRabbit

New Features
- Enhanced handling of distributed data loading for improved synchronization across processes.
- Added broadcasting of the constructed dataset to ensure consistency in all processes.
Bug Fixes
- Implemented safeguards to prevent incomplete data distribution by asserting the integrity of the dataset.

coderabbitai · 2024-12-19T03:07:39Z

Walkthrough

The pull request modifies the DpLoaderSet class in the deepmd/pt/utils/dataloader.py file to improve distributed data loading. The changes focus on enhancing the initialization of the self.systems attribute by introducing a process rank-based conditional check. When the global rank is 0, the dataset is constructed using a multiprocessing pool, and the self.systems list is broadcast to all processes using dist.broadcast_object_list(). An assertion is added to ensure complete data distribution.

Changes

| File | Change Summary |
|---|---|
| `deepmd/pt/utils/dataloader.py` | Modified `DpLoaderSet` class initialization to handle distributed data loading; added conditional check for global process rank; implemented `dist.broadcast_object_list()` for synchronizing `self.systems`; added assertion to verify complete data distribution |

Possibly related PRs

Perf: print summary on rank 0 #4434: Modification to print_summary method in DpLoaderSet class related to distributed process handling
refactor: simplify dataset construction #4437: Introduction of construct_dataset method and updates to print_summary method with rank check

Suggested reviewers

njzjz
CaRoLZhangxy
wanghan-iapcm

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (3)

deepmd/pt/utils/dataloader.py (3)

96-97: Consider documenting the self.systems initialization more explicitly.

Here, you add a new typed attribute, but it would be helpful to have a docstring or an inline comment indicating that this list will either be populated with real datasets on rank 0 or with dummy placeholders on other ranks. This clarifies the rank-dependent data flow for future maintainers.

103-104: Explore building partial placeholders instead of a full list of None.

Currently, you allocate a “None” list for all systems on non-zero ranks. This is fine, but consider if there's an even lighter or more descriptive placeholder structure (e.g., a small object with essential metadata) to prevent confusion about what these positions represent during debugging. This can help future readers who might assume that "None" indicates an error rather than a deferred load.

105-107: Strengthen your broadcast verification.

The assertion only checks if the last dataset is non-None. Consider verifying that none of the items in self.systems remain None to ensure a complete broadcast. For example:
assert all(s is not None for s in self.systems)
This reduces the risk of partial broadcast issues going unnoticed.
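
The difference between the two checks can be seen on a small example. The lists below are hypothetical stand-ins for `self.systems`; whether a partially filled list can actually occur after `dist.broadcast_object_list()` is not established in this thread, so this only illustrates what each assertion would and would not catch.

```python
def last_item_check(systems: list) -> bool:
    """The check from the PR: only the final entry is inspected."""
    return systems[-1] is not None


def all_items_check(systems: list) -> bool:
    """The stricter check suggested in the review."""
    return all(s is not None for s in systems)


# Hypothetical partially filled list: the middle entry was never replaced.
partial = ["system_a", None, "system_c"]
complete = ["system_a", "system_b", "system_c"]

print(last_item_check(partial), all_items_check(partial))
print(last_item_check(complete), all_items_check(complete))
```

Note that the two checks also differ on an empty list: indexing `systems[-1]` raises `IndexError`, while `all(...)` over an empty list is `True`.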

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between c0914e1 and 555c3f9.

📒 Files selected for processing (1)

deepmd/pt/utils/dataloader.py (1 hunks)

🔇 Additional comments (1)

deepmd/pt/utils/dataloader.py (1)

98-102: Evaluate large-scale broadcasting overhead.

When rank 0 constructs numerous (or very large) datasets and then broadcasts them, it might be memory-intensive or time-consuming in large-scale distributed settings. Consider verifying that each dataset remains lightweight enough for object broadcasting. If necessary, explore an alternative approach (e.g., distributing only minimal metadata) to avoid potential OOM or performance bottlenecks.
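
One rough way to judge whether the objects are light enough is to measure their pickled size, since `dist.broadcast_object_list()` serializes objects with `pickle`. The sketch below is an assumption-laden estimate: the helper `pickled_size_bytes` and the two dictionaries are hypothetical stand-ins, not DeePMD-kit datasets, and the bytes actually sent may differ somewhat from this number.

```python
import pickle
from typing import Iterable


def pickled_size_bytes(objects: Iterable[object]) -> int:
    """Total size of the objects when each is serialized with pickle."""
    return sum(len(pickle.dumps(obj)) for obj in objects)


# Hypothetical stand-ins: a metadata-only record vs. one carrying array data.
light = {"path": "/data/system_0", "nframes": 1000}
heavy = {"path": "/data/system_0", "coords": [0.0] * 100_000}

print("light:", pickled_size_bytes([light]), "bytes")
print("heavy:", pickled_size_bytes([heavy]), "bytes")
```

If the real dataset objects hold only paths and metadata at construction time, the broadcast stays small; if they hold loaded arrays, the size grows with the data.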

njzjz · 2024-12-19T03:25:25Z

> In this PR, only rank 0 will load data systems, and it will be broadcasted to each rank

Why do you expect each rank loads the same data?

codecov · 2024-12-19T03:53:18Z

Codecov Report

Attention: Patch coverage is 70.00000% with 3 lines in your changes missing coverage. Please review.

Project coverage is 84.41%. Comparing base (c0914e1) to head (1dcc5d5).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| deepmd/pt/utils/dataloader.py | 70.00% | 3 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##            devel    #4478      +/-   ##
==========================================
- Coverage   84.41%   84.41%   -0.01%     
==========================================
  Files         670      670              
  Lines       62147    62152       +5     
  Branches     3487     3487              
==========================================
+ Hits        52464    52467       +3     
- Misses       8556     8558       +2     
  Partials     1127     1127

caic99 · 2024-12-19T04:10:48Z

> > In this PR, only rank 0 will load data systems, and it will be broadcasted to each rank
>
> Why do you expect each rank loads the same data?

Good question. Currently, every rank reads the config from the same input.json file and initializes its datasets from it.

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

deepmd/pt/utils/dataloader.py (1)

96-106: Consider memory management and error handling improvements

While centralizing data loading to rank 0 reduces file system stress, there are several areas for improvement:

- Memory Management: Loading all systems on rank 0 before broadcasting could cause memory pressure on that node.
- Error Handling: Add explicit error handling for broadcast failures.
- Verification: The assertion only checks the last system; consider verifying all systems.

Consider these improvements:

 self.systems: list[DeepmdDataSetForLoader] = []
 global_rank = dist.get_rank() if dist.is_initialized() else 0
 if global_rank == 0:
     log.info(f"Constructing DataLoaders from {len(systems)} systems")
-    with Pool(os.cpu_count()) as pool:
+    # Use a configurable number of workers to avoid overwhelming the system
+    n_workers = min(os.cpu_count() or 1, 8)  # os.cpu_count() may return None; or make this configurable
+    with Pool(n_workers) as pool:
         self.systems = pool.map(construct_dataset, systems)
 else:
     self.systems = [None] * len(systems)  # type: ignore
 if dist.is_initialized():
-    dist.broadcast_object_list(self.systems)
-    assert self.systems[-1] is not None
+    try:
+        dist.broadcast_object_list(self.systems)
+        # Verify all systems were received correctly
+        assert all(system is not None for system in self.systems), "Some systems failed to broadcast"
+    except Exception as e:
+        log.error(f"Failed to broadcast systems: {str(e)}")
+        raise

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 555c3f9 and 1dcc5d5.

📒 Files selected for processing (1)

deepmd/pt/utils/dataloader.py (1 hunks)

🔇 Additional comments (1)

deepmd/pt/utils/dataloader.py (1)

96-106: Distributed design explanation: Same data, different sampling

The implementation correctly handles distributed training by:

- Loading data once on rank 0 and broadcasting to all ranks (reducing I/O)
- Using separate DistributedSampler instances per rank (ensuring different samples per rank)

This is a standard pattern in distributed deep learning where the dataset is shared but each rank samples differently.

@njzjz: Regarding your question about why each rank loads the same data - while ranks receive identical data, they use different samplers to access different portions during training. This approach reduces file system load while maintaining training correctness.
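
The "same data, different sampling" idea can be illustrated without any framework. The sketch below is a simplified imitation of the index-striding idea behind PyTorch's `DistributedSampler` (shared shuffle, then every `world_size`-th index per rank, with no padding); the function `rank_indices` is hypothetical and is not DeePMD-kit's sampler code.

```python
import random


def rank_indices(n_items: int, rank: int, world_size: int, seed: int = 0) -> list[int]:
    """Indices of a shared dataset assigned to one rank (simplified, no padding)."""
    order = list(range(n_items))
    # Every rank shuffles with the same seed, so all ranks agree on the order.
    random.Random(seed).shuffle(order)
    # Each rank then takes every world_size-th index, starting at its own rank.
    return order[rank::world_size]


# All four ranks hold the same 10-item dataset but draw disjoint subsets.
parts = [rank_indices(10, rank, world_size=4) for rank in range(4)]
for rank, part in enumerate(parts):
    print(f"rank {rank}: {part}")
```

Every rank holds the full dataset after the broadcast, yet the subsets are disjoint and together cover all items, which is why sharing the data does not make the ranks train on identical batches.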

Let's verify the distributed sampling behavior:

✅ Verification successful

Distributed implementation is correct and follows best practices

The code correctly implements distributed data loading and sampling:

- Data is efficiently loaded once on rank 0 and broadcast to other ranks
- Each rank gets its own DistributedSampler instance when distributed mode is active (dist.is_available() and dist.is_initialized())
- Samplers are properly initialized with the dataset, ensuring each rank gets different portions of data during training
- The implementation follows PyTorch's recommended pattern for distributed training

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify that the code maintains distributed sampling after loading data

# Check for DistributedSampler usage
rg "DistributedSampler" --type py

# Verify sampler initialization is rank-aware
ast-grep --pattern 'DistributedSampler($system)'

Length of output: 315

Script:

#!/bin/bash
# Let's verify the complete context around DistributedSampler usage

# Check the full context of sampler initialization
rg "DistributedSampler" -B 5 -A 5 deepmd/pt/utils/dataloader.py

# Check if there's any rank-specific logic in sampler setup
ast-grep --pattern 'def _setup_samplers(self) {
  $$$
}'

Length of output: 1241

njzjz · 2024-12-19T04:29:10Z

> > > In this PR, only rank 0 will load data systems, and it will be broadcasted to each rank
> >
> > Why do you expect each rank loads the same data?
>
> Good question. Currently every rank reads the config from the same input.json file, and initializes datasets there.

I don't get your point. I believe the data is loaded asynchronously.

caic99 · 2024-12-19T06:32:46Z

> Why do you expect each rank loads the same data?

> I don't get your point. I believe the data is loaded asynchronously.

I see. You mean that not all ranks access the same data files, which is correct. The problem arises when the data is stored on a shared file system with limited bandwidth.

njzjz · 2024-12-19T06:46:22Z

So it just affects the initialization of the data class? Well, I am not sure how much data is loaded in the initialization step.

Perf: load data systems on rank 0

5ad15d1

github-actions bot added the Python label Dec 19, 2024

refactor

96c9f03

caic99 and others added 6 commits December 19, 2024 03:08

update logging

e67ca00

[pre-commit.ci] auto fixes from pre-commit.com hooks

7bf77cb

for more information, see https://pre-commit.ci

revert changes on mp

4954f06

Merge branch 'load-data' of https://github.com/caic99/deepmd-kit into…

10cbd06

… load-data

Merge branch 'devel' into load-data

2ed59c3

[pre-commit.ci] auto fixes from pre-commit.com hooks

555c3f9

for more information, see https://pre-commit.ci

coderabbitai bot reviewed Dec 19, 2024

wanghan-iapcm requested changes Dec 19, 2024

wanghan-iapcm requested a review from iProzd December 19, 2024 03:20

always output logging info

1dcc5d5

coderabbitai bot reviewed Dec 19, 2024
