Deriving resources from a scope #4

Open
eleon opened this issue Jul 6, 2021 · 12 comments

Comments

@eleon
Member

eleon commented Jul 6, 2021

We need a function to add resources to a given scope. For example, if a user starts with an empty scope, she may want to add resources assigned to her that are part of the user scope.

int qv_scope_add(QV_handle qv,
	       QV_scope_t scope,
	       QV_obj_type_t obj,
	       int num_objs);

This function adds num_objs objects of type obj to the input scope.

@eleon
Member Author

eleon commented Oct 12, 2021

Actually, rather than having a function to add resources to a scope, we need a function to derive or extract resources from an existing scope (and thus from resources owned). This function can be used when individual workers want to launch work on a specific set of resources.

int
qv_scope_extract(
    qv_context_t *ctx,
    qv_scope_t *scope,
    qv_hw_obj_type_t obj_type,
    int num_objs,
    int hint, 
    qv_scope_t **subscope
);

This function creates subscope with num_objs objects of type obj_type taken from scope. The hint parameter is intended to tell QV about a desired trait, such as proximity to a NIC. We still need to determine which hints would be useful and how to represent them.
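One possible representation, offered only as a sketch: since hint is an int in the proposed signature, traits could be bit flags that callers OR together. Every `ex_`-prefixed name below is hypothetical and not part of QV, and the traits other than NIC proximity are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical hint flags; none of these identifiers exist in QV. */
typedef enum {
    EX_HINT_NONE      = 0,
    EX_HINT_NEAR_NIC  = 1 << 0, /* prefer objects close to a NIC */
    EX_HINT_NEAR_GPU  = 1 << 1, /* assumed trait, for illustration */
    EX_HINT_SAME_NUMA = 1 << 2  /* assumed trait, for illustration */
} ex_hint_t;

/* Returns true if the given flag is set in the combined hints value. */
static inline bool ex_hint_has(int hints, ex_hint_t flag)
{
    assert(hints >= 0);
    return (hints & (int)flag) != 0;
}
```

A caller would pass something like `EX_HINT_NEAR_NIC | EX_HINT_SAME_NUMA` as the hint argument, and the implementation would test each flag with `ex_hint_has`.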

@eleon changed the title from "Adding resources to a scope" to "Deriving resources from a scope" on Oct 12, 2021
@eleon
Member Author

eleon commented Oct 19, 2021

We should keep track of what objects are given so that we do not give them again until they are freed. We can use scope_free to free the resources associated with a scope. I am not sure we would need qv_scope_nobjs_free(qv, scope, obj_type, nobjs).
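To make the bookkeeping concrete, here is a minimal sketch that tracks handed-out objects with a bitmap, assuming at most 64 objects per parent scope. The `ex_pool_*` names are hypothetical and this is not how QV stores scopes; it only illustrates the "do not give again until freed" rule.

```c
#include <assert.h>
#include <stdint.h>

#define EX_MAX_OBJS 64

/* Hypothetical tracker for the objects of one parent scope. */
typedef struct {
    int nobjs;       /* number of objects in the parent scope */
    uint64_t in_use; /* bit i set => object i is currently given out */
} ex_pool_t;

/* Claims n free objects. Returns a mask of the granted objects,
 * or 0 if fewer than n objects are free (nothing is claimed then). */
static uint64_t ex_pool_take(ex_pool_t *pool, int n)
{
    assert(pool && pool->nobjs >= 0 && pool->nobjs <= EX_MAX_OBJS);
    uint64_t granted = 0;
    int found = 0;
    for (int i = 0; i < pool->nobjs && found < n; ++i) {
        uint64_t bit = UINT64_C(1) << i;
        if (!(pool->in_use & bit)) {
            granted |= bit;
            ++found;
        }
    }
    if (found < n) return 0;
    pool->in_use |= granted;
    return granted;
}

/* Returns previously granted objects to the pool (the scope_free step). */
static void ex_pool_release(ex_pool_t *pool, uint64_t granted)
{
    assert(pool);
    pool->in_use &= ~granted;
}
```

With this shape, freeing a scope releases exactly the mask it was granted, which suggests a separate per-count free call may not be needed.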

@eleon
Member Author

eleon commented Oct 21, 2021

New suggestion for the function name: qv_subscope_create()

@samuelkgutierrez
Member

@eleon, to maintain tracking on this issue please see progress in c460005. We can discuss the name once we are settled on its functionality.

@eleon
Member Author

eleon commented May 13, 2022

Thank you, @samuelkgutierrez! The semantics look good based on test-mpi-scopes.c. Perhaps we can create an additional test that focuses solely on qv_scope_create. I'm thinking of testing two aspects: once a resource is given through this function, it should not be given again (unless the scope is freed), and GPUs should be given correctly as well. As soon as I get an opportunity, I can write the test.

@eleon
Member Author

eleon commented Nov 22, 2022

@samuelkgutierrez, I am trying out test-mpi-scopes and found strange behavior from the qv_scope_split operation. I am using a 2-socket architecture, each socket with 18 SMT-2 cores. I would have thought that splitting the node with 2 tasks would have resulted in one socket (18 cores) per task, but that is not the case:

leon@pascal30:qv$ QV_PORT=55996 srun -N1 -n2 quo-vadis/build-pascal/tests/test-mpi-scopes
[1] self_scope taskid is 0
[1] self_scope ntasks is 1
[0] self_scope taskid is 0
[0] self_scope ntasks is 1
[0] base_scope taskid is 0
[0] base_scope ntasks is 2
[1] base_scope taskid is 1
[1] base_scope ntasks is 2
[1] Number of PUs in base_scope is 72
[1] base GID is 1
[0] Number of PUs in base_scope is 72
[0] base GID is 0
[1] Number of PUs in sub_scope is 18
[1] sub_scope taskid is 0
[1] sub_scope ntasks is 1
[0] Number of PUs in sub_scope is 36
[0] sub_scope taskid is 0
[0] sub_scope ntasks is 1
[0] New cpubind is     0-17,36-53
[1] New cpubind is     9-17,45-53
[0] Popped cpubind is  0-17
[1] Popped cpubind is  18-35
[0] Number of PUs in create_scope is 2
[0] create_scope taskid is 0
[0] create_scope ntasks is 1
[1] Number of PUs in sub_sub_scope is 9
[0] Number of PUs in sub_sub_scope is 18

The strange behavior is apparent here:

[1] Number of PUs in sub_scope is 18
[0] Number of PUs in sub_scope is 36
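For reference, the expectation behind those numbers is plain block arithmetic: 72 PUs over 2 tasks is 36 PUs (one socket) each. A tiny sketch of that expectation follows; `ex_split_share` is a hypothetical helper and says nothing about how qv_scope_split actually partitions a scope.

```c
#include <assert.h>

/* Number of objects that part `part` (0-based) receives when nobjs
 * objects are divided as evenly as possible among nparts parts.
 * The first (nobjs % nparts) parts receive one extra object. */
static int ex_split_share(int nobjs, int nparts, int part)
{
    assert(nobjs >= 0 && nparts > 0 && part >= 0 && part < nparts);
    return nobjs / nparts + (part < nobjs % nparts ? 1 : 0);
}
```

Under this model both tasks would report 36 PUs in sub_scope, instead of the 18 and 36 seen above.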

@eleon
Member Author

eleon commented Nov 22, 2022

Perhaps I need the AFFINITY_PRESERVING flag? If so, shouldn't it be the default? I guess I need to read a bit more to fully understand. I think this may be related to Issue #9 rather than to this issue.
Sorry for the detour; I am now focusing on qv_scope_create :)

@eleon
Member Author

eleon commented Nov 22, 2022

Tested qv_scope_create and added the associated test, test-mpi-scope-create.c.
It works! Thanks, @samuelkgutierrez.
The only issue is that when a set of cores has been assigned to a scope, those cores can be reassigned to another scope even if the original scope has not been released:

leon@pascal30:qv$ QV_PORT=55996 srun -N1 -n2 quo-vadis/build-pascal/tests/test-mpi-scope-create 
[0] Base scope w/36 cores, running on 0-17
[1] Base scope w/36 cores, running on 18-35

===Scope split===
=> [0] Split: got 18 cores, running on 0-17,36-53
=> [1] Split: got 18 cores, running on 18-35,54-71

===Asking and not releasing 1,10 core scopes===

===Scope w/1 cores===
=> [0] Core scope: got 1 cores, running on 0,36
=> [1] Core scope: got 1 cores, running on 18,54
[0] Popped up to 0-17,36-53
[1] Popped up to 18-35,54-71

===Scope w/10 cores===
=> [0] Core scope: got 10 cores, running on 0-9,36-45
=> [1] Core scope: got 10 cores, running on 18-27,54-63
[0] Popped up to 0-17,36-53
[1] Popped up to 18-35,54-71

===Asking and releasing 5-core scopes===

===Scope w/5 cores===
=> [1] Core scope: got 5 cores, running on 18-22,54-58
=> [0] Core scope: got 5 cores, running on 0-4,36-40
[1] Popped up to 18-35,54-71
[0] Popped up to 0-17,36-53

===Scope w/5 cores===
=> [0] Core scope: got 5 cores, running on 0-4,36-40
=> [1] Core scope: got 5 cores, running on 18-22,54-58
[0] Popped up to 0-17,36-53
[1] Popped up to 18-35,54-71

@eleon
Member Author

eleon commented Nov 22, 2022

In the example above, each task gets a scope with 1 core:

task 0: 0,36
task 1: 18,54

The scope is not released, and then each task asks for 10 cores:

task 0: 0-9,36-45
task 1: 18-27,54-63

In this case, cores 0,36 and 18,54 should not have been used for the second scope because they are part of an active scope.
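A test could check this condition directly by comparing the two cpubind lists. The sketch below models each list as inclusive PU ranges and reports whether any ranges intersect; `ex_range_t` and `ex_ranges_overlap` are hypothetical helpers, not part of QV or of the existing tests.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Inclusive range of PU indices, e.g. "0-9" is {0, 9}. */
typedef struct {
    int lo;
    int hi;
} ex_range_t;

/* Returns true if any range in a intersects any range in b. */
static bool ex_ranges_overlap(const ex_range_t *a, size_t na,
                              const ex_range_t *b, size_t nb)
{
    assert(a && b);
    for (size_t i = 0; i < na; ++i) {
        for (size_t j = 0; j < nb; ++j) {
            if (a[i].lo <= b[j].hi && b[j].lo <= a[i].hi) {
                return true;
            }
        }
    }
    return false;
}
```

For task 0, comparing the first scope (0,36) against the second (0-9,36-45) reports an overlap, which is the behavior described above.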

@samuelkgutierrez
Member

I don't think we want to exclude the possibility of resources being shared across scopes. We could certainly make better decisions when resource reference counting is implemented, but I don't like the idea of returning a resource exhaustion error code.

@eleon
Member Author

eleon commented Nov 23, 2022

I agree, @samuelkgutierrez: resources being shared across scopes is fine. The issue here is as follows.
Let's say a process has 18 cores in a scope. Then, threads of this process start requesting cores using qv_scope_create (one core per thread, for example). Even though the parent scope has 18 cores, every thread will get the first core rather than a different one. As you said, perhaps this will be solved with reference counting :) Thanks, Sam.
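As a rough illustration of how reference counting could address this, the sketch below hands out the least-referenced core on each request. Sharing is still possible once every core is in use, so no exhaustion error is needed. The `ex_refpool_*` names and the fixed size of 18 are assumptions for this example only, and a real version would also need locking for concurrent threads.

```c
#include <assert.h>

#define EX_NCORES 18

/* Hypothetical per-core reference counts for one parent scope. */
typedef struct {
    int refs[EX_NCORES];
} ex_refpool_t;

/* Picks the core with the fewest references (lowest index wins ties),
 * increments its count, and returns its index. */
static int ex_refpool_acquire(ex_refpool_t *p)
{
    assert(p);
    int best = 0;
    for (int i = 1; i < EX_NCORES; ++i) {
        if (p->refs[i] < p->refs[best]) {
            best = i;
        }
    }
    p->refs[best]++;
    return best;
}

/* Drops one reference to a core when its scope is freed. */
static void ex_refpool_release(ex_refpool_t *p, int core)
{
    assert(p && core >= 0 && core < EX_NCORES && p->refs[core] > 0);
    p->refs[core]--;
}
```

With 18 threads each requesting one core, every thread gets a different core; a 19th request shares core 0 rather than failing.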

@eleon
Member Author

eleon commented Jan 31, 2024

Now that we have qv_scope_create (see below), we talked about implementing this functionality with a qv_scope_create_hint_t value named QV_SCOPE_CREATE_EXCLUSIVE. When this hint is used, resources given by qv_scope_create won't be given again until the associated scope is freed.

qv_scope_create(
    qv_context_t *ctx,
    qv_scope_t *scope,
    qv_hw_obj_type_t type,
    int nobjs,
    qv_scope_create_hint_t hint,
    qv_scope_t **subscope
);
