Hüseyin Tuğrul BÜYÜKIŞIK edited this page Feb 26, 2022 · 9 revisions

Welcome to the gpgpu-loadbalancerx wiki!

How To Use Load Balancer X?

  • Include the single-header, header-only library LoadBalancerX.h
  • Compile with the -std=c++1y and -pthread flags (tested in Eclipse with the g++ compiler on Ubuntu)
  • Link with the -lpthread flag (tested in Eclipse with the g++ linker on Ubuntu)

Define a device state class of your own

An instance of this class is passed to every GrainOfWork callback that runs on a device, so the callbacks can select the right physical device for computations and data transfers.

class DeviceState
{
public:
	int gpuId; // a cuda program can use this to select a gpu
};

Define a grain state class of your own

The class can hold any per-grain (or per-device) information needed to coordinate later compute/copy commands.

// necessary grain state information
class GrainState
{
public:
	GrainState():whichGpuComputedMeLastTime(-1){}
	int whichGpuComputedMeLastTime;

	// just simulating a GPU's video-memory buffer
	// gpu-id --> buffer
	std::map<int,std::vector<float>> cudaInputDevice;
	std::map<int,std::vector<float>> cudaOutputDevice;
};

Create load balancer instance from class LoadBalanceLib::LoadBalancerX

LoadBalanceLib::LoadBalancerX<DeviceState, GrainState> lb;

A LoadBalancerX instance can be freely copied and moved anywhere: all copies share the same internal state through smart pointers, so every clone acts as the same balancer.

Add grains to load balancer instance

Grains are units of work that can be traded between devices; trading grains is how the balancer optimizes run time. Every added grain is re-used on each lb.run() call, so addWork() is needed only once per grain.

Creating a grain:

auto grain = LoadBalanceLib::GrainOfWork<DeviceState, GrainState>(
			lambdaFunctionForInitialization,
			lambdaFunctionForHostToDeviceCopy,
			lambdaFunctionForCompute,
			lambdaFunctionForDeviceToHostCopy,
			lambdaFunctionForSynchronization
);

lambdaFunctionForInitialization is called only the first time a grain is taken by a device. It is meant to initialize the buffers and any other parts of the algorithm the grain (and device) needs.

The lambdaFunctionForHostToDeviceCopy, lambdaFunctionForCompute, and lambdaFunctionForDeviceToHostCopy functions are always called together on the same device, so it is enough to select the device id in just one of them (for example, picking integer 0 for the first CUDA GPU in the system via cudaSetDevice(id)).

lambdaFunctionForSynchronization is called last for each grain.

The synchronization lambda must contain the synchronization commands for the current grain: it synchronizes the host with the selected device, but not necessarily the whole device (possibly only a single stream of that device).

Without pipelining (LoadBalancerX::run(p=false)), all grains are launched in parallel in a breadth-first scheme. With pipelining enabled (LoadBalancerX::run(p=true)) and 3+ grains arriving at a device, that device's grains are launched with 3-way concurrency:

I: copy input
C: compute
O: copy output

time 0     time 1   time 2   time 3   time 4   time 5   time 6
init all   I        I        I
                    C        C        C
                             O        O        O
                                                        synchronize all

All lambda functions given to a grain must have this type:

std::function<void(DeviceState, GrainState&)>

Adding grain to load balancer:

lb.addWork(grain);

Add devices to load balancer instance

Creating device instance:

The following line creates a device whose device state has gpu-id = 0:

auto device = LoadBalanceLib::ComputeDevice<DeviceState>({0});

Adding device instance to the load balancer instance:

lb.addDevice(device);

Run the load balancer (optionally with 3-way concurrency by pipelining)

bool pipelined = true;
for(int i=0;i<iterations;i++)
{
    size_t elapsedNanoSeconds = lb.run(pipelined); // each successive iteration runs with lower latency until it converges to optimum performance
}

// optionally check the current work distribution in percentages
auto performances = lb.getRelativePerformancesOfDevices();
for(auto p:performances)
    std::cout<<p<<"% "; 

Results: