PMIx Fence: single-job partial barrier #72

Open
artpol84 opened this issue Feb 10, 2021 · 3 comments

@artpol84

artpol84 commented Feb 10, 2021

Test description

Verifies that a partial Fence (a Fence over a subset of the processes of a single job) works properly.

Test sketch

#include "pmix.h"

double max_fence_time()
{
    double fence_time = 0;
    double ts1, ts2;
    int i;

    /* Measure the typical fence execution time (worst case over 100 runs) */
    for (i = 0; i < 100; i++) {
        ts1 = timestamp();
        PMIx_Fence(without_data_collection);
        ts2 = timestamp();
        fence_time = max(fence_time, ts2 - ts1);
    }
    return fence_time;
}

int main()
{
    double T, fence_time;
    double ts1, ts2;

    PMIx_Init(); // rank is taken from the proc identifier returned here

    fence_time = max_fence_time();
    T = Ratio * fence_time; // Ratio might be 100; should be selected for the particular system

    // Rank 0 simulates the delay; it is even, so it does not take part in the partial Fence
    if (rank == 0) {
        sleep(T);
    }
    if (rank % 2) {
        ts1 = timestamp();
        PMIx_Fence(without_data_collection, only-odd-procs);
        ts2 = timestamp();
        // Odd ranks should not be affected by the rank = 0 delay
        assert( (ts2 - ts1) ~ fence_time);
    }
    ts1 = timestamp();
    PMIx_Fence(without_data_collection);
    ts2 = timestamp();

    if (rank != 0) {
        // Everyone else waits for the delayed rank
        assert( (ts2 - ts1) ~ T);
    } else {
        // The delayed rank arrives last and does not wait
        assert( (ts2 - ts1) ~ fence_time);
    }
    PMIx_Finalize();
}
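
The sketch above leaves `timestamp()` and the `~` comparison undefined. A minimal sketch of both follows, assuming a POSIX monotonic clock and a relative tolerance with an absolute floor. The names `timestamp`, `approx_equal`, `rel_tol` and `abs_floor` are hypothetical and not part of PMIx; the tolerance values would have to be selected for the particular system, as with Ratio.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <stdbool.h>
#include <time.h>

/* Hypothetical helper: monotonic time in seconds. */
static double timestamp(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Hypothetical reading of "a ~ b": |a - b| is within rel_tol * |b|,
 * with an absolute floor so very short durations do not fail on jitter. */
static bool approx_equal(double a, double b, double rel_tol, double abs_floor)
{
    double diff = (a > b) ? (a - b) : (b - a);
    double mag  = (b < 0.0) ? -b : b;
    double tol  = rel_tol * mag;

    if (tol < abs_floor) {
        tol = abs_floor;
    }
    return diff <= tol;
}
```

With these, `assert( (ts2 - ts1) ~ T)` could be written as `assert(approx_equal(ts2 - ts1, T, 0.1, 1e-3))`, where 0.1 and 1e-3 are placeholder tolerances.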

Execution details

  • 4 servers
  • 16 clients
  • Predefined (passed through cmdline) namespace
  • Predefined process placement: "0:0,1,2,3; 1:4,5,6,7; 2:8,9,10,11; 3:12,13,14,15;"
  • Ratio and "~" are selected to match the system
    • The time-dependent checks can be turned off
  • Execute M times to capture race conditions
  • The first rank simulates the delay. The test verifies that the Fence is really synchronizing.
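
The placement string above could be turned into a rank-to-node table along these lines. This is only a sketch: the function name `parse_placement`, the `MAX_RANKS` limit and the exact grammar ("node:rank,rank,...;" groups separated by spaces) are assumptions inferred from the example string, not the test suite's actual parser.

```c
#include <stdlib.h>

#define MAX_RANKS 64 /* hypothetical upper bound for this sketch */

/* Parse "node:rank,rank,...;" groups into rank_to_node[].
 * Returns the number of ranks placed, or -1 on malformed input. */
static int parse_placement(const char *spec, int rank_to_node[], int max_ranks)
{
    const char *p = spec;
    int count = 0;

    for (int r = 0; r < max_ranks; r++) {
        rank_to_node[r] = -1; /* -1 marks "not placed" */
    }

    while (*p != '\0') {
        char *end;

        /* Skip separators between groups. */
        while (*p == ' ' || *p == ';') {
            p++;
        }
        if (*p == '\0') {
            break;
        }

        long node = strtol(p, &end, 10);
        if (end == p || *end != ':' || node < 0) {
            return -1;
        }
        p = end + 1;

        /* Comma-separated rank list, ended by ';' or end of string. */
        for (;;) {
            long rank = strtol(p, &end, 10);
            if (end == p || rank < 0 || rank >= max_ranks) {
                return -1;
            }
            if (rank_to_node[rank] != -1) {
                return -1; /* rank listed twice */
            }
            rank_to_node[rank] = (int)node;
            count++;
            p = end;
            if (*p != ',') {
                break;
            }
            p++;
        }
        if (*p != ';' && *p != '\0') {
            return -1;
        }
    }
    return count;
}
```

For the placement listed above, the call would place 16 ranks, four per node, and leave every other entry of the table at -1.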

Client-side expectations:

  1. All PMIx calls return PMIX_SUCCESS
  2. All ranks except rank 0 observe a Fence delay of about T.

Server-side expectations:

  1. N invocations of:
    • client_connected
    • client_finalized
  2. Verify that the proc structure was set for the individual ranks.
  3. Two Fence callback invocations with WILDCARD.
  4. The time between the Fences on node0 is > T.
  5. Starting from "modex: avoid exchange unnecessary buffer when collect flag is not set openpmix#1135", the size of the Fence data should be 0B.
  6. No other callbacks are called (no direct modex requests).
    (? Any event-related activity?)
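
One way to check the server-side expectations is to have the test's server callbacks increment counters and compare them at the end. The sketch below assumes such a tally exists; `struct server_tally`, its fields and `tally_ok` are hypothetical names, and the expected count of two WILDCARD Fences is taken from the list above (so any calibration Fences would have to be counted separately).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-server tally, filled in by the test's server callbacks. */
struct server_tally {
    size_t connected;       /* client_connected invocations */
    size_t finalized;       /* client_finalized invocations */
    size_t wildcard_fences; /* Fence callbacks invoked with a WILDCARD rank */
    size_t max_fence_bytes; /* largest data blob seen in any Fence callback */
    size_t other_callbacks; /* anything else, e.g. direct modex requests */
};

/* Returns true when the tally matches the server-side expectations
 * for a server hosting n_clients local clients. */
static bool tally_ok(const struct server_tally *t, size_t n_clients)
{
    return t->connected == n_clients &&
           t->finalized == n_clients &&
           t->wildcard_fences == 2 &&
           t->max_fence_bytes == 0 &&
           t->other_callbacks == 0;
}
```

The time-based expectation (time between the Fences on node0 > T) is left out here because it depends on how the server records timestamps.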

Reference implementation:

TBD

Notes

The test suite's RTE component should implement support for multiple in-flight Fences.
This is currently not supported.

artpol84 added the "Unit Test Spec" (Unit Test Specification) label Feb 10, 2021
@jjhursey
Member

Do you need an if( !(rank % 2) ){ after the second fence to account for the additional delay that the processes not participating in the first fence will see at the start of the synchronization in the second fence? Something like:

if( rank % 2 ){
  // Shouldn't see any additional synchronization delay in second fence
} else {
  // Account for synchronization delay from other ranks participating in the first fence.
}

@artpol84
Author

The idea is that fence_time is negligible compared to T.
It's like O(T), where O(T) ~ O(fence_time + T).
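
To make the argument concrete, here is a small timing model, not PMIx code. It assumes that every rank starts at time 0 except one delayed rank that starts at T, that a Fence completes a fixed cost f after its last participant arrives, and that the delayed rank is rank 0 as in the client-side expectations. The function name `model_waits` and its parameters are hypothetical.

```c
#include <stddef.h>

/* Hypothetical model: odd ranks run a partial Fence, then all ranks run a
 * full Fence. wait_partial[i] and wait_full[i] receive the time rank i
 * spends inside each Fence (0 for ranks that skip the partial Fence). */
static void model_waits(size_t n, size_t delayed_rank, double T, double f,
                        double wait_partial[], double wait_full[])
{
    double done1 = 0.0; /* completion time of the partial Fence */
    double done2 = 0.0; /* completion time of the full Fence */

    /* Partial Fence: completes f after the last odd rank arrives. */
    for (size_t i = 1; i < n; i += 2) {
        double start = (i == delayed_rank) ? T : 0.0;
        if (start > done1) {
            done1 = start;
        }
    }
    done1 += f;

    /* Full Fence: odd ranks arrive at done1, even ranks at their start. */
    for (size_t i = 0; i < n; i++) {
        double start  = (i == delayed_rank) ? T : 0.0;
        double arrive = (i % 2) ? done1 : start;

        wait_partial[i] = (i % 2) ? (done1 - start) : 0.0;
        wait_full[i] = arrive; /* holds the arrival time for now */
        if (arrive > done2) {
            done2 = arrive;
        }
    }
    done2 += f;

    for (size_t i = 0; i < n; i++) {
        wait_full[i] = done2 - wait_full[i];
    }
}
```

With n = 16, T = 100 and f = 1, the model gives a full-Fence wait of 1 for rank 0, 100 for the odd ranks and 101 for the other even ranks. The even ranks do see the extra f, but the difference disappears inside the "~ T" tolerance as long as f is much smaller than T.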

@cpshereda
Contributor

See openpmix/openpmix#2327.
