
Support multi node for lmi-dist #2125

Merged · 2 commits into deepjavalibrary:master from multinode · Jul 9, 2024

Conversation

xyang16 (Contributor) commented Jun 28, 2024

Description

Brief description of what this PR is about

  • If this change is a backward incompatible change, why must this change be made?
  • Interesting edge cases to note here

@xyang16 xyang16 force-pushed the multinode branch 4 times, most recently from bf4e1ac to d3f7acf Compare June 28, 2024 17:11
engines/python/setup/djl_python_engine.py (outdated thread, resolved)
logger.info("Printing mpi boolean: {}", pyEnv.isMpiMode());

if (clusterSize > 1) {
String leaderAddress = Utils.getenv("LWS_LEADER_ADDRESS");
Contributor
We need to revisit all the env names.

nikhil-sk (Contributor) commented Jun 28, 2024

These names are fixed by the LWS environment; we cannot change them in an LWS + EKS setup, except for DJL_CLUSTER_SIZE.
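To make the discussion above concrete, here is a minimal sketch of reading the cluster environment, assuming `LWS_LEADER_ADDRESS` is injected by LeaderWorkerSet and `DJL_CLUSTER_SIZE` is the DJL-owned variable; the class and method names are hypothetical, not from the PR.

```java
// Hypothetical sketch: reading the cluster environment. LWS_LEADER_ADDRESS
// comes from LeaderWorkerSet and cannot be renamed; DJL_CLUSTER_SIZE is the
// DJL-owned variable. Class/method names are illustrative, not PR code.
public final class LwsEnv {

    /** Parses a cluster-size env value, falling back to 1 (single node). */
    public static int parseClusterSize(String raw) {
        if (raw == null || raw.isEmpty()) {
            return 1;
        }
        return Integer.parseInt(raw);
    }

    public static void main(String[] args) {
        int clusterSize = parseClusterSize(System.getenv("DJL_CLUSTER_SIZE"));
        if (clusterSize > 1) {
            // Only meaningful in a multi-node LWS deployment.
            String leaderAddress = System.getenv("LWS_LEADER_ADDRESS");
            System.out.println("multi-node, leader=" + leaderAddress);
        }
        System.out.println("clusterSize=" + clusterSize);
    }
}
```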

@xyang16 xyang16 force-pushed the multinode branch 2 times, most recently from 6db60b5 to 7d63ed7 Compare June 28, 2024 17:19
@@ -187,6 +217,20 @@ synchronized void startPythonProcess() {
}
}

public static String[] getHosts() {
Contributor
Better to create an object to hold all those values.
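A minimal sketch of the holder object the reviewer suggests: group the leader address, cluster size, and host list in one class instead of returning a bare `String[]`. The name `ClusterProperties` is an assumption, not code from the PR.

```java
// Hypothetical value object grouping the cluster-related fields that
// getHosts() and its callers pass around. Name is illustrative only.
public final class ClusterProperties {

    private final String leaderAddress;
    private final int clusterSize;
    private final String[] hosts;

    public ClusterProperties(String leaderAddress, int clusterSize, String[] hosts) {
        this.leaderAddress = leaderAddress;
        this.clusterSize = clusterSize;
        this.hosts = hosts;
    }

    public String getLeaderAddress() {
        return leaderAddress;
    }

    public int getClusterSize() {
        return clusterSize;
    }

    public String[] getHosts() {
        return hosts;
    }
}
```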

@xyang16 xyang16 marked this pull request as ready for review June 28, 2024 17:39
@xyang16 xyang16 requested review from zachgk and a team as code owners June 28, 2024 17:39
@xyang16 xyang16 marked this pull request as draft June 28, 2024 17:59
@xyang16 xyang16 marked this pull request as ready for review June 28, 2024 17:59
Device device = model.getNDManager().getDevice();
int deviceId = device.getDeviceId();
int tensorParallelDegree = pyEnv.getTensorParallelDegree();
// int pipelineParallelDegree = pyEnv.getPipelineParallelDegree();

if (clusterSize > 1) {
Contributor
I think it's better to wrap these settings in separate functions for better debuggability.

Contributor Author

Added a PyEnv.isMultiNode() utility function.
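A sketch of what such an `isMultiNode()` helper amounts to, given the `clusterSize > 1` checks in this PR: multi-node simply means the cluster size exceeds one. The surrounding class is simplified; the real PyEnv carries many more fields.

```java
// Simplified stand-in for PyEnv, showing only the multi-node check the
// author describes adding. Field and constructor shape are assumptions.
public final class PyEnvSketch {

    private final int clusterSize;

    public PyEnvSketch(int clusterSize) {
        this.clusterSize = clusterSize;
    }

    /** Returns true when the engine runs across more than one node. */
    public boolean isMultiNode() {
        return clusterSize > 1;
    }
}
```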

args[6] = "--bind-to";
args[7] = "none";
args[8] = "--mca";
args[9] = "btl_vader_single_copy_mechanism";
Contributor
I believe the MCA parameters for multi-node might be different.

Contributor Author

Changed.
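For context on the argument array being built above, here is a hedged sketch of assembling multi-node mpirun arguments as a list. `-np`, `--hostfile`, and `--bind-to` are standard Open MPI flags; the exact MCA parameters the PR settled on are not reproduced here, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative builder for a multi-node mpirun command line. Only standard
// Open MPI flags are shown; the PR's final MCA settings are not claimed here.
public final class MpiArgsSketch {

    public static List<String> build(String hostFile, int worldSize) {
        List<String> args = new ArrayList<>();
        args.add("mpirun");
        args.add("-np");
        args.add(Integer.toString(worldSize));
        args.add("--hostfile");
        args.add(hostFile);
        args.add("--bind-to");
        args.add("none");
        return args;
    }
}
```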

@@ -112,10 +113,12 @@ def construct_kwargs_device_map(self):
self.kwargs["device_map"] = self.device_map
self.device = None
logging.info(f"Using device map {self.device_map}")
elif self.tensor_parallel_degree > 0 and torch.cuda.device_count() > 0:
elif self.tensor_parallel_degree > 0 \
and self.cluster_size > 0 \
Contributor
In which case is cluster_size 0?

Contributor Author

cluster_size should not be 0; this is just a safety check.

engines/python/setup/djl_python_engine.py (outdated thread, resolved)
@xyang16 xyang16 force-pushed the multinode branch 2 times, most recently from 52cd931 to 1ab823e Compare July 8, 2024 17:57
*
* @return the pipeline parallel degree
*/
public int getPipelineParallelDegree() {
Contributor
Can you remove the pipelineParallelDegree read from this PR, since we're not directly using it here? I will add it separately later as required.

Contributor Author

Removed.

@xyang16 xyang16 force-pushed the multinode branch 3 times, most recently from a496758 to 7caaad0 Compare July 8, 2024 22:57
@xyang16 xyang16 force-pushed the multinode branch 3 times, most recently from a38f240 to 2b5c3cb Compare July 8, 2024 23:28
String[] res = new String[clusterSize];
res[0] = leaderAddress;
for (int i = 1; i < clusterSize; i++) {
res[i] = String.format("%s-%s-%d.%s.%s", lwsName, groupIndex, i, lwsName, namespace);
Contributor
Suggested change
res[i] = String.format("%s-%s-%d.%s.%s", lwsName, groupIndex, i, lwsName, namespace);
res[i] = lwsName + '-' + groupIndex + '-' + i + '.' + lwsName + '.' + namespace;

Contributor Author

Replaced the environment variable name.
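For reference, the host-list construction shown in this thread can be exercised in isolation: index 0 holds the leader address, and workers follow the LWS pod DNS pattern `{lwsName}-{groupIndex}-{i}.{lwsName}.{namespace}`. Only the wrapper class name is invented here; the loop body matches the snippet above.

```java
// Self-contained version of the getHosts() logic from this review thread.
// Index 0 is the leader address; worker hostnames follow LWS pod DNS naming.
public final class HostsSketch {

    public static String[] getHosts(
            String leaderAddress, String lwsName, String groupIndex,
            String namespace, int clusterSize) {
        String[] res = new String[clusterSize];
        res[0] = leaderAddress;
        for (int i = 1; i < clusterSize; i++) {
            res[i] = String.format("%s-%s-%d.%s.%s", lwsName, groupIndex, i, lwsName, namespace);
        }
        return res;
    }
}
```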

@xyang16 xyang16 force-pushed the multinode branch 2 times, most recently from 9ddc65c to c3471e3 Compare July 9, 2024 00:10
@xyang16 xyang16 merged commit 70aca0c into deepjavalibrary:master Jul 9, 2024
9 checks passed
@xyang16 xyang16 deleted the multinode branch October 21, 2024 16:11