Fix bug with --node-count arg for worker.
Fix a bug in determining the value of the --node-count arg passed to the
job worker entrypoint script: it was counting allocations rather than the
total number of CPUs across the allocations.
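The sketch below shows the difference between the two computations. It assumes a hypothetical `Allocation` dataclass with a `cpu_count` attribute standing in for the real DMOD allocation type, which this commit does not show. `node_count_old` mirrors the removed `len(job.allocations)` logic, and `node_count_fixed` mirrors the new summed CPU count.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Allocation:
    """Hypothetical stand-in for a scheduler allocation; only cpu_count matters here."""
    hostname: str
    cpu_count: int


def node_count_old(allocations: List[Allocation]) -> str:
    # Old (buggy) behavior: counts allocation objects, ignoring how many CPUs each holds
    return str(len(allocations))


def node_count_fixed(allocations: List[Allocation]) -> str:
    # Fixed behavior: total CPUs across all allocations (0 if there are none)
    return str(sum(a.cpu_count for a in allocations))


allocs = [Allocation("host-a", 4), Allocation("host-b", 2)]
print(node_count_old(allocs))    # prints 2
print(node_count_fixed(allocs))  # prints 6
```

The two values agree only when every allocation holds exactly one CPU. Any multi-CPU allocation makes the old value too small.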
robertbartel authored and christophertubbs committed Aug 16, 2024
1 parent f63a6b1 commit 8fad377
Showing 1 changed file with 6 additions and 2 deletions.
python/lib/scheduler/dmod/scheduler/scheduler.py (6 additions, 2 deletions)
@@ -403,8 +403,12 @@ def _generate_docker_cmd_args(self, job: 'Job', worker_index: int) -> List[str]:
             raise RuntimeError(f"Unexpected request type {job.model_request.__class__.__name__}: cannot build Docker CMD arg list")
 
         # For now at least, all image args sets will have these (i.e, node count, host string, and job id)
-        docker_cmd_arg_map = {"--node-count": str(len(job.allocations)), "--host-string": self.build_host_list(job),
-                              "--job-id": str(job.job_id), "--worker-index": str(worker_index)}
+        docker_cmd_arg_map = {
+            "--node-count": str(sum(a.cpu_count for a in job.allocations)),
+            "--host-string": self.build_host_list(job),
+            "--job-id": str(job.job_id),
+            "--worker-index": str(worker_index)
+        }
 
         if isinstance(job.model_request, AbstractNgenRequest):
             docker_cmd_arg_map.update(self._generate_nextgen_job_docker_cmd_args(job, worker_index))
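This hunk does not show how `docker_cmd_arg_map` becomes the `List[str]` that `_generate_docker_cmd_args` returns. The following is a minimal sketch that assumes each flag is simply followed by its value. `flatten_cmd_args` is a hypothetical helper, and the `--host-string` value is a placeholder rather than real `build_host_list` output.

```python
from typing import Dict, List


def flatten_cmd_args(arg_map: Dict[str, str]) -> List[str]:
    """Hypothetical helper: turn {"--flag": "value", ...} into ["--flag", "value", ...]."""
    args: List[str] = []
    # dicts preserve insertion order (Python 3.7+), so flags keep the order they were added
    for flag, value in arg_map.items():
        args.extend([flag, value])
    return args


docker_cmd_arg_map = {
    "--node-count": "6",             # e.g. summed cpu_count of 4 + 2
    "--host-string": "PLACEHOLDER",  # placeholder; real value comes from build_host_list(job)
    "--job-id": "42",
    "--worker-index": "0",
}
print(flatten_cmd_args(docker_cmd_arg_map))
# prints ['--node-count', '6', '--host-string', 'PLACEHOLDER', '--job-id', '42', '--worker-index', '0']
```

Because `update()` on the map overwrites matching keys, any request-specific args merged in afterwards (as in the `AbstractNgenRequest` branch) would replace, not duplicate, a flag of the same name before flattening.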