feat: Cache gpu_alloc_map in Redis, and Add RescanGPUAllocMaps mutation #3293

Open
wants to merge 24 commits into
base: topic/06-13-feat_support_scanning_gpu_allocation

jopemachine
Member

@jopemachine jopemachine commented Dec 24, 2024

Why?

Because gpu_alloc_map lives on the agent, every query for this field requires an RPC call to that agent.

In a production environment with many agents, issuing these RPC calls on every query is inefficient and can significantly slow down response times.

How it works

This PR addresses the inefficiency by caching gpu_alloc_map in REDIS_STAT_DB, and introduces a RescanGPUAllocMaps mutation for refreshing this cache.

RescanGPUAllocMaps takes agent_id as an argument and scans the gpu_alloc_map of that specific agent. If agent_id is None, it instead iterates through all agents in the "Alive" state and scans each of their allocation maps.
The scan results are cached in Redis in JSON format.
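
The flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `rescan_gpu_alloc_maps`, `cache_key`, the `scan_agent` callable (a stand-in for the agent RPC), and the dict-like `store` (a stand-in for REDIS_STAT_DB) are all hypothetical names chosen for the example.

```python
import json
from decimal import Decimal
from typing import Callable, Mapping, MutableMapping, Optional

# Hypothetical types: the real manager code is not shown in this description.
AllocMap = Mapping[str, Decimal]    # device_id -> allocated amount
ScanFn = Callable[[str], AllocMap]  # stand-in for the RPC call to one agent


def cache_key(agent_id: str) -> str:
    # Key layout taken from the PR description: gpu_alloc_map.<agent_id>
    return f"gpu_alloc_map.{agent_id}"


def rescan_gpu_alloc_maps(
    agent_id: Optional[str],
    alive_agent_ids: list[str],
    scan_agent: ScanFn,
    store: MutableMapping[str, str],
) -> list[str]:
    """Scan one agent, or every alive agent when agent_id is None.

    Each scan result is written to `store` as a JSON string.
    Returns the IDs of the agents that were scanned.
    """
    targets = [agent_id] if agent_id is not None else list(alive_agent_ids)
    for aid in targets:
        alloc_map = scan_agent(aid)
        # Amounts are serialized as strings, matching the "8.00" style value.
        payload = {device_id: str(amount) for device_id, amount in alloc_map.items()}
        store[cache_key(aid)] = json.dumps(payload)
    return targets
```

Passing the scanner and the store as parameters keeps the sketch testable without a live agent or Redis instance; a real implementation would presumably issue the RPC and the Redis write asynchronously.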

Example

For example, the key for the alloc_map of an agent with the ID i-ubuntu is gpu_alloc_map.i-ubuntu, and its value has the following form:

// {"device_id": "value"}
{"c59395cd-ac91-4cd3-a1b0-3d2568aa2d04": "8.00"}
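
As a rough sketch of how a reader of this cache might decode such an entry: the helper names `alloc_map_key` and `read_cached_alloc_map` are hypothetical, a plain dict stands in for Redis, and treating the value as a decimal string is an assumption based on the `"8.00"` example above.

```python
import json
from decimal import Decimal
from typing import Mapping, Optional


def alloc_map_key(agent_id: str) -> str:
    # Key layout from the example above: gpu_alloc_map.<agent_id>
    return f"gpu_alloc_map.{agent_id}"


def read_cached_alloc_map(
    store: Mapping[str, str], agent_id: str
) -> Optional[dict[str, Decimal]]:
    """Return the cached {device_id: amount} map, or None on a cache miss."""
    raw = store.get(alloc_map_key(agent_id))
    if raw is None:
        return None
    # Assumption: each value is a decimal string such as "8.00".
    return {device_id: Decimal(amount) for device_id, amount in json.loads(raw).items()}


# A plain dict standing in for the Redis database, holding the example entry.
cache = {"gpu_alloc_map.i-ubuntu": '{"c59395cd-ac91-4cd3-a1b0-3d2568aa2d04": "8.00"}'}
alloc_map = read_cached_alloc_map(cache, "i-ubuntu")
```

Returning None on a miss lets the caller decide whether to fall back to a direct RPC query or to trigger a rescan first.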

Checklist: (if applicable)

  • Milestone metadata specifying the target backport version
  • Test case(s) to:
    • Demonstrate the difference of before/after
    • Demonstrate the flow of abstract/conceptual models with a concrete implementation

📚 Documentation preview 📚: https://sorna--3293.org.readthedocs.build/en/3293/


📚 Documentation preview 📚: https://sorna-ko--3293.org.readthedocs.build/ko/3293/

@github-actions github-actions bot added area:docs Documentations comp:manager Related to Manager component labels Dec 24, 2024
Member Author

jopemachine commented Dec 24, 2024

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@github-actions github-actions bot added the size:M 30~100 LoC label Dec 24, 2024
@jopemachine jopemachine force-pushed the topic/06-13-feat_support_scanning_gpu_allocation branch from f3056c8 to 0c45322 Compare December 24, 2024 02:35
@jopemachine jopemachine force-pushed the topic/12-24-feat_cache_gpu_alloc_map_and_add_scangpuallocmap_mutation branch from f679650 to 3df9731 Compare December 24, 2024 02:36
@jopemachine jopemachine changed the title from "feat: Cache gpu_alloc_map, and Add ScanGPUAllocMap mutation" to "feat: Cache gpu_alloc_map, and Add ScanGPUAllocMap mutation" Dec 24, 2024
@jopemachine jopemachine added the type:feature Add new features label Dec 24, 2024
@jopemachine jopemachine force-pushed the topic/12-24-feat_cache_gpu_alloc_map_and_add_scangpuallocmap_mutation branch from 06e7e47 to 68617a4 Compare December 24, 2024 03:31
@jopemachine jopemachine added this to the 24.12 milestone Dec 24, 2024
@jopemachine jopemachine changed the title from "feat: Cache gpu_alloc_map, and Add ScanGPUAllocMap mutation" to "feat: Cache gpu_alloc_map, and Add ScanGPUAllocMaps mutation" Dec 24, 2024
@jopemachine jopemachine force-pushed the topic/12-24-feat_cache_gpu_alloc_map_and_add_scangpuallocmap_mutation branch from 41285f7 to fa5ce5b Compare December 24, 2024 03:59
@github-actions github-actions bot added size:L 100~500 LoC and removed size:M 30~100 LoC labels Dec 24, 2024
@jopemachine jopemachine changed the title from "feat: Cache gpu_alloc_map, and Add ScanGPUAllocMaps mutation" to "feat: Cache gpu_alloc_map, and Add RescanGPUAllocMaps mutation" Dec 26, 2024
@jopemachine jopemachine marked this pull request as ready for review December 26, 2024 03:10
@jopemachine jopemachine marked this pull request as draft December 26, 2024 03:34
@jopemachine jopemachine marked this pull request as ready for review December 26, 2024 06:42
@jopemachine jopemachine changed the title from "feat: Cache gpu_alloc_map, and Add RescanGPUAllocMaps mutation" to "feat: Cache gpu_alloc_map in Redis, and Add RescanGPUAllocMaps mutation" Dec 26, 2024
@jopemachine jopemachine force-pushed the topic/06-13-feat_support_scanning_gpu_allocation branch from cc44683 to 4704dd6 Compare December 26, 2024 06:57
@jopemachine jopemachine force-pushed the topic/12-24-feat_cache_gpu_alloc_map_and_add_scangpuallocmap_mutation branch from d79ad90 to 26dd587 Compare December 26, 2024 06:58
Labels
area:docs (Documentations), comp:manager (Related to Manager component), size:L (100~500 LoC), type:feature (Add new features)