Dask backend: run 15600 sleepers #3991

Closed
4 tasks
sanderegg opened this issue Mar 17, 2023 · 3 comments
Labels: a:api (framework api, data schemas), a:dask-service (any of the dask services: dask-scheduler/sidecar or worker), a:director-v2 (issue related with the director-v2 service)

Comments

@sanderegg
Member

To test the dask-gateway, here are the steps:

  • create a new sleeper version that needs less than 2 GB of memory (100 MB is probably enough)
  • create a separate cluster in the ITIS account
    • create a dask-gateway-manager node
    • create 25-50 worker nodes as t2.medium instances (@Surfict we need enough vCPUs for that)
  • test a few sleeper runs on the separate cluster from the GUI
  • run 15600 sleeper jobs through the osparc API

--> collect results and observations/problems
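For the last step, a minimal sketch of how a bulk run could be driven from a script while keeping the number of in-flight requests bounded and collecting failures for the "observations/problems" part. The `submit_job` callable and `fake_sleeper` are hypothetical stand-ins; the actual osparc API calls are not shown in this thread and are not reproduced here.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_many(submit_job, n_jobs, max_in_flight=50):
    """Call submit_job(index) for every job, at most max_in_flight at a time.

    Returns (results, failures): results maps job index -> return value,
    failures maps job index -> the exception that was raised.
    """
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        futures = {pool.submit(submit_job, i): i for i in range(n_jobs)}
        for fut in as_completed(futures):
            i = futures[fut]
            try:
                results[i] = fut.result()
            except Exception as exc:  # keep going, record the problem
                failures[i] = exc
    return results, failures


def fake_sleeper(i):
    # Hypothetical stand-in for "create job, start it, wait for the result".
    # It fails on purpose for a few indices so the failure path is exercised.
    if i % 1000 == 999:
        raise RuntimeError(f"job {i} failed")
    return f"job-{i}-done"


if __name__ == "__main__":
    results, failures = run_many(fake_sleeper, n_jobs=15600)
    print(len(results), "succeeded,", len(failures), "failed")
```

Recording failures per job index instead of aborting on the first error makes it easier to report which jobs misbehaved after a long run.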

@sanderegg sanderegg added this to the Mithril milestone Mar 17, 2023
@sanderegg sanderegg added Feedback a:api framework api, data schemas, Epic a:director-v2 issue related with the director-v2 service a:dask-service Any of the dask services: dask-scheduler/sidecar or worker and removed Feedback labels Mar 17, 2023
@sanderegg
Member Author

sanderegg commented Mar 29, 2023

  • using resource overrides, sleepers:2.0.2 was made to use 0.1 CPU and 100 MB
  • created 1 dask-gateway-manager node
  • created 21 dask-gateway worker nodes (t2.large)
  • tested running 10, 20, 200 sleepers
  • delete job projects POC
  • test with 2000 sleepers
  • test with 16000 sleepers
  • check get results
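A rough capacity estimate for this setup, as a sketch only. It assumes a t2.large offers 2 vCPUs and 8 GiB of memory, treats MB and MiB as equivalent, and ignores whatever the OS and the dask worker itself consume, so real numbers will be somewhat lower. The function names are illustrative.

```python
import math


def sleepers_per_node(node_vcpus, node_mem_mb, job_cpu_milli=100, job_mem_mb=100):
    """Max concurrent sleepers on one node, limited by CPU or by memory.

    CPU is counted in millicores (0.1 CPU = 100) to avoid float rounding.
    """
    by_cpu = (node_vcpus * 1000) // job_cpu_milli
    by_mem = node_mem_mb // job_mem_mb
    return min(by_cpu, by_mem)


def cluster_slots(n_nodes, node_vcpus, node_mem_mb, **job):
    """Total concurrent sleepers across identical nodes."""
    return n_nodes * sleepers_per_node(node_vcpus, node_mem_mb, **job)


def waves_needed(n_jobs, slots):
    """Number of full rounds needed if every slot runs one job per round."""
    return math.ceil(n_jobs / slots)


T2_LARGE = (2, 8192)  # assumed: 2 vCPUs, 8 GiB
slots = cluster_slots(21, *T2_LARGE)
for n_jobs in (200, 2000, 16000):
    print(n_jobs, "jobs:", slots, "slots,", waves_needed(n_jobs, slots), "waves")
```

Under these assumptions CPU is the limiting resource (20 sleepers per node, 420 across 21 nodes), so the 16000-sleeper test would need about 39 rounds.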

@sanderegg
Member Author

sanderegg commented Apr 18, 2023

  • with dask-gateway adaptive mode disabled, 2200 jobs ran without issues on 51 machines
  • director-v2 now asks for a large number of workers (10000) to ensure the number of machines is maximized
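A sketch of what "no adaptive mode, ask for many workers" can look like against a dask-gateway cluster object. It assumes the cluster exposes `adapt(active=False)` and `scale(n)` as the dask-gateway `GatewayCluster` client does; how director-v2 actually issues the request is not shown in this thread. `pin_cluster_size` and `RecordingCluster` are hypothetical names, and the fake cluster only records calls so the snippet runs without a gateway.

```python
def pin_cluster_size(cluster, n_workers=10_000):
    """Turn adaptive scaling off, then request a fixed number of workers.

    Requesting far more workers than machines exist is deliberate here:
    the gateway can then fill every available machine.
    """
    cluster.adapt(active=False)
    cluster.scale(n_workers)


class RecordingCluster:
    """Hypothetical stand-in for a gateway cluster; it only records calls."""

    def __init__(self):
        self.calls = []

    def adapt(self, minimum=None, maximum=None, active=True):
        self.calls.append(("adapt", active))

    def scale(self, n):
        self.calls.append(("scale", n))


cluster = RecordingCluster()
pin_cluster_size(cluster)
print(cluster.calls)  # [('adapt', False), ('scale', 10000)]
```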

This shows the status on the gateway while running 2200 jobs
Image

Image

Adaptive scaling seems to be evil
Image

remaining issues:

  • throughput of director-v2 to send the tasks could be increased

@sanderegg
Member Author

Closing this as outdated.


4 participants