Cuda devices monitoring #218

Merged
31 commits
b98cd83
introduce envoy config with gpus
igor-davidyuk Aug 27, 2021
f64fc39
added cuda version to message
igor-davidyuk Aug 31, 2021
d3fe055
introduced device_monitor plugin
igor-davidyuk Sep 1, 2021
cc128e3
cuda devices included in shard_info
igor-davidyuk Sep 6, 2021
a04cb1a
experiment update
igor-davidyuk Sep 6, 2021
33a7d58
fix repeated field assignmet
igor-davidyuk Sep 7, 2021
5a7078f
cuda status updates
igor-davidyuk Sep 9, 2021
e487ac4
fixes
igor-davidyuk Sep 10, 2021
a733210
envoy represented as dict
igor-davidyuk Sep 16, 2021
fbb85c5
working example
igor-davidyuk Sep 17, 2021
d765e6e
flake8 fixes
igor-davidyuk Sep 21, 2021
dc401d6
Iliya's suggestion for template unpacking
igor-davidyuk Sep 29, 2021
61854cf
Required fixes
igor-davidyuk Sep 29, 2021
6bd00c8
enum fix in collaborator
igor-davidyuk Sep 29, 2021
309fba2
fix envoy client test
igor-davidyuk Sep 29, 2021
c290ac3
avoid passing new device parameter to old tusk runners
igor-davidyuk Sep 29, 2021
e0c8531
moved experiment
igor-davidyuk Oct 8, 2021
3508cac
removed unsued files
igor-davidyuk Oct 8, 2021
8338f05
fix envoy cli after rebase
igor-davidyuk Oct 15, 2021
c8ede97
director fix
igor-davidyuk Oct 15, 2021
da673eb
fix tests
igor-davidyuk Oct 15, 2021
07c16bb
removed additional notebook
igor-davidyuk Oct 15, 2021
c512e33
added plugin to setup.py and fixed envoy configs in tutorials
igor-davidyuk Oct 15, 2021
3fb7616
fixed default value for device monitor plugin
igor-davidyuk Oct 18, 2021
2bee1f5
fixed tensorflow test
igor-davidyuk Oct 18, 2021
ef2321e
fix rebase
igor-davidyuk Oct 18, 2021
569405a
initialized docks
igor-davidyuk Oct 19, 2021
482ba17
update docs
igor-davidyuk Oct 19, 2021
509e014
restore setup.py content
igor-davidyuk Oct 29, 2021
56cab83
shard-config -> envoy-config renaming
igor-davidyuk Oct 29, 2021
f217a5a
more renamings: shard_config -> envoy_config
igor-davidyuk Oct 29, 2021
44 changes: 41 additions & 3 deletions docs/source/openfl/plugins.rst
Original file line number Diff line number Diff line change
Expand Up @@ -10,18 +10,25 @@

framework_adapter_
serializer_plugin_
device_monitor_plugin_


|productName| is designed to be a flexible and extensible framework. Plugins are interchangeable parts of
|productName| components. Different plugins support varying usage scenarios. |productName| users are free to provide
their implementations of |productName| plugins to support desired behavior.
|productName| components.
A plugin may be :code:`required` or :code:`optional`. |productName| can run without optional plugins.
|productName| users are free to provide
their implementations of |productName| plugins to achieve a desired behavior.
Technically, a plugin is just a class that satisfies a certain interface. One may enable a plugin by putting its
import path and initialization parameters into the config file of the corresponding |productName| component
or by passing them to the frontend Python API. Please refer to openfl-tutorials for more information.
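
As a rough illustration of how an import path plus initialization parameters can be turned into a plugin object, here is a minimal sketch. The helper name ``build_plugin`` is hypothetical, the settings are assumed to be keyword arguments, and a standard-library class stands in for a real plugin; this is not necessarily how |productName| resolves plugins internally.

```python
import importlib


def build_plugin(template: str, settings: dict):
    """Import the class named by a dotted path and instantiate it.

    `template` is a dotted import path such as 'package.module.ClassName';
    `settings` is assumed to hold keyword arguments for the constructor.
    """
    module_path, _, class_name = template.rpartition('.')
    module = importlib.import_module(module_path)
    plugin_class = getattr(module, class_name)
    return plugin_class(**settings)


# A standard-library class is used here in place of a real plugin.
plugin = build_plugin('fractions.Fraction', {'numerator': 3, 'denominator': 4})
print(plugin)
```

Keeping the class path in configuration means a different implementation can be swapped in without touching the component's code.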

.. _framework_adapter:

Framework Adapter
######################

Framework Adapter plugins enable |productName| to support Deep Learning frameworks in FL experiments.
It is a required plugin for the frontend API component and Envoy.
All the framework-specific operations on model weights are isolated in this plugin so |productName| can be framework-agnostic.
The Framework Adapter plugin interface is simple: there are two required methods, one to load tensors into
a model and an optimizer and one to extract tensors from them.
Expand Down Expand Up @@ -57,7 +64,7 @@ Experiment Serializer

Serializer plugins are used on the Frontend API to serialize the Experiment components and then on Envoys to deserialize them back.
Currently, the default serializer is based on pickling.

It is a required plugin.
A Serializer plugin must implement the :code:`serialize` method, which creates a Python object's representation on disk.

.. code-block:: python
Expand All @@ -71,3 +78,34 @@ As well as :code:`restore_object` that will load previously serialized object fr

@staticmethod
def restore_object(filename: str):
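
A minimal sketch of a pickle-based serializer with these two methods might look as follows. The class name ``PickleSerializer`` and the exact ``serialize`` signature are assumptions made for illustration; the serializer shipped with |productName| may differ.

```python
import os
import pickle
import tempfile


class PickleSerializer:
    """Hypothetical serializer sketch built on the standard pickle module."""

    @staticmethod
    def serialize(object_, filename: str) -> None:
        """Write a Python object's representation to disk."""
        with open(filename, 'wb') as f:
            pickle.dump(object_, f)

    @staticmethod
    def restore_object(filename: str):
        """Load a previously serialized object from disk."""
        with open(filename, 'rb') as f:
            return pickle.load(f)


# Round trip through a temporary file.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'experiment_part.pkl')
    PickleSerializer.serialize({'rounds_to_train': 2}, path)
    restored = PickleSerializer.restore_object(path)

print(restored)
```

Unpickling executes code chosen by whoever produced the file, so a serializer of this kind is only appropriate between parties that trust each other.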


.. _device_monitor_plugin:

CUDA Device Monitor
######################

The CUDA Device Monitor plugin is an optional plugin for Envoy that gathers status information about GPU devices.
Envoy may use this information and include it in the healthcheck message that it sends to the Director.
Thus, CUDA device statuses are visible to frontend users, who may query this Envoy Registry information from the Director.

A CUDA Device Monitor plugin must implement the following interface:

.. code-block:: python

    class CUDADeviceMonitor:

        def get_driver_version(self) -> str:
            ...

        def get_device_memory_total(self, index: int) -> int:
            ...

        def get_device_memory_utilized(self, index: int) -> int:
            ...

        def get_device_utilization(self, index: int) -> str:
            """It is just a general method that returns a string that may be shown to the frontend user."""
            ...
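
To show how the interface can be satisfied and consumed, here is a minimal sketch with a fake monitor that returns fixed values instead of querying a driver. The names ``FakeCUDADeviceMonitor`` and ``collect_device_statuses`` and the record layout are hypothetical, memory values are assumed to be in bytes, and the utilization string is derived from memory use purely for simplicity.

```python
class FakeCUDADeviceMonitor:
    """Hypothetical stand-in that returns fixed values instead of querying a GPU."""

    def __init__(self, memory_total: int, memory_used: dict):
        self._memory_total = memory_total  # assumed to be bytes
        self._memory_used = memory_used    # device index -> bytes in use

    def get_driver_version(self) -> str:
        return 'fake-driver-1.0'

    def get_device_memory_total(self, index: int) -> int:
        return self._memory_total

    def get_device_memory_utilized(self, index: int) -> int:
        return self._memory_used.get(index, 0)

    def get_device_utilization(self, index: int) -> str:
        # Simplification: report memory use as a percentage string.
        used = self.get_device_memory_utilized(index)
        return f'{100 * used // self._memory_total}%'


def collect_device_statuses(monitor, cuda_devices):
    """Build one status record per configured device index."""
    return [
        {
            'index': index,
            'memory_total': monitor.get_device_memory_total(index),
            'memory_utilized': monitor.get_device_memory_utilized(index),
            'device_utilization': monitor.get_device_utilization(index),
        }
        for index in cuda_devices
    ]


GIB = 1024 ** 3
monitor = FakeCUDADeviceMonitor(memory_total=16 * GIB, memory_used={0: 4 * GIB})
statuses = collect_device_statuses(monitor, cuda_devices=[0, 2])
for status in statuses:
    print(status)
```

A fake of this kind is also convenient for testing Envoy-side code on machines without a GPU.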


20 changes: 18 additions & 2 deletions docs/source/workflow/director_based_workflow.rst
Original file line number Diff line number Diff line change
Expand Up @@ -118,7 +118,7 @@ To start the Envoy without mTLS use the following CLI command:
.. code-block:: console

$ fx envoy start -n env_one --disable-tls \
--shard-config-path shard_config.yaml -d director_fqdn:port
--envoy-config-path envoy_config.yaml -d director_fqdn:port

Alternatively, use the following command to establish a secured connection:

Expand All @@ -127,7 +127,7 @@ Alternatively, use the following command to establish a secured connection:
$ ENVOY_NAME=envoy_example_name

$ fx envoy start -n "$ENVOY_NAME" \
--shard-config-path shard_config.yaml \
--envoy-config-path envoy_config.yaml \
-d director_fqdn:port -rc cert/root_ca.crt \
-pk cert/"$ENVOY_NAME".key -oc cert/"$ENVOY_NAME".crt

Expand Down Expand Up @@ -345,6 +345,22 @@ This method:
* Compresses the whole workspace to an archive.
* Sends the experiment archive to the Director so it may distribute the archive across the Federation and start the *Aggregator*.

FLExperiment's :code:`start()` method parameters
--------------------------------------------------

* :code:`model_provider` - the :code:`ModelInterface` object defined earlier
* :code:`task_keeper` - the :code:`TaskInterface` object defined earlier
* :code:`data_loader` - the :code:`DataInterface` object defined earlier
* :code:`rounds_to_train` - the number of aggregation rounds to conduct before the experiment is considered finished
* :code:`delta_updates` - use calculated gradients instead of model checkpoints for aggregation
* :code:`opt_treatment` - optimizer state treatment in the federation. Possible values: with 'RESET', the optimizer state
  is initialized each round from noise; with 'CONTINUE_LOCAL', the optimizer state is reused locally by every collaborator;
  with 'CONTINUE_GLOBAL', the optimizer state is aggregated.
* :code:`device_assignment_policy` - either 'CPU_ONLY' or 'CUDA_PREFERRED'. With 'CPU_ONLY', the :code:`device`
  parameter (which is a part of a task contract) that is passed to an FL task each round is 'cpu'. With
  :code:`device_assignment_policy='CUDA_PREFERRED'`, the :code:`device` parameter is 'cuda:{index}' if CUDA devices
  are enabled in the Envoy config and 'cpu' otherwise.
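
The device assignment rule described above can be sketched as a small function. The name ``resolve_device`` is hypothetical, and picking the first configured index is an assumption made for illustration; the actual selection logic may differ.

```python
def resolve_device(device_assignment_policy: str, cuda_devices: list) -> str:
    """Return the device string that would be handed to an FL task.

    `cuda_devices` mirrors the list of device indices from the Envoy config.
    Assumption: the first configured index is chosen when CUDA is preferred.
    """
    if device_assignment_policy == 'CUDA_PREFERRED' and cuda_devices:
        return f'cuda:{cuda_devices[0]}'
    return 'cpu'


print(resolve_device('CPU_ONLY', [0, 2]))
print(resolve_device('CUDA_PREFERRED', [0, 2]))
print(resolve_device('CUDA_PREFERRED', []))
```

Because the task receives the device as a parameter, the task body no longer needs to probe for CUDA availability itself.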

Observing the Experiment execution
----------------------------------

Expand Down
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
settings:
listen_host: localhost
listen_port: 50051
listen_port: 50050
sample_shape: ['300', '400', '3']
target_shape: ['300', '400']
envoy_health_check_period: 60 # in seconds
envoy_health_check_period: 5 # in seconds
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
params:
  cuda_devices: [0,2]

optional_plugin_components:
  cuda_device_monitor:
    template: openfl.plugins.processing_units_monitor.pynvml_monitor.PynvmlCUDADeviceMonitor
    settings: []

shard_descriptor:
  template: kvasir_shard_descriptor.KvasirShardDescriptor
  params:
    data_folder: kvasir_data
    rank_worldsize: 1,10
    enforce_image_hw: '300,400'
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
params:
  cuda_devices: []

optional_plugin_components: {}

shard_descriptor:
  template: kvasir_shard_descriptor.KvasirShardDescriptor
  params:
    data_folder: kvasir_data
    rank_worldsize: 2,10
    enforce_image_hw: '300,400'

Original file line number Diff line number Diff line change
@@ -1,2 +1,3 @@
numpy
pillow
pillow
nvidia-ml-py3

This file was deleted.

Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
#!/bin/bash
set -e

fx envoy start -n env_one --disable-tls --shard-config-path shard_config.yaml -dh localhost -dp 50051
fx envoy start -n env_one --disable-tls --envoy-config-path envoy_config.yaml -dh localhost -dp 50050
Original file line number Diff line number Diff line change
Expand Up @@ -3,4 +3,4 @@ set -e
ENVOY_NAME=$1
DIRECTOR_FQDN=$2

fx envoy start -n "$ENVOY_NAME" --shard-config-path shard_config.yaml -dh "$DIRECTOR_FQDN" -dp 50051 -rc cert/root_ca.crt -pk cert/"$ENVOY_NAME".key -oc cert/"$ENVOY_NAME".crt
fx envoy start -n "$ENVOY_NAME" --envoy-config-path envoy_config.yaml -dh "$DIRECTOR_FQDN" -dp 50050 -rc cert/root_ca.crt -pk cert/"$ENVOY_NAME".key -oc cert/"$ENVOY_NAME".crt
Original file line number Diff line number Diff line change
Expand Up @@ -44,7 +44,7 @@
"outputs": [],
"source": [
"# Install dependencies if not already installed\n",
"!pip install torchvision==0.8.1"
"!pip install torchvision"
]
},
{
Expand Down Expand Up @@ -81,7 +81,7 @@
"\n",
"# 2) Run with TLS disabled (trusted environment)\n",
"# Federation can also determine local fqdn automatically\n",
"federation = Federation(client_id='frontend', director_node_fqdn='localhost', director_port='50051', tls=False)\n"
"federation = Federation(client_id='frontend', director_node_fqdn='localhost', director_port='50050', tls=False)\n"
]
},
{
Expand All @@ -91,6 +91,11 @@
"metadata": {},
"outputs": [],
"source": [
"# import time\n",
"# while True:\n",
"# shard_registry = federation.get_shard_registry()\n",
"# print(shard_registry)\n",
"# time.sleep(5)\n",
"shard_registry = federation.get_shard_registry()\n",
"shard_registry"
]
Expand Down Expand Up @@ -385,10 +390,19 @@
" device='device', optimizer='optimizer') \n",
"@TI.set_aggregation_function(aggregation_function)\n",
"def train(unet_model, train_loader, optimizer, device, loss_fn=soft_dice_loss, some_parameter=None):\n",
" \n",
" \"\"\" \n",
" The following constructions, that may lead to resource race\n",
" is no longer needed:\n",
" \n",
" if not torch.cuda.is_available():\n",
" device = 'cpu'\n",
" else:\n",
" device = 'cuda'\n",
" \n",
" \"\"\"\n",
"\n",
" print(f'\\n\\n TASK TRAIN GOT DEVICE {device}\\n\\n')\n",
" \n",
" function_defined_in_notebook(some_parameter)\n",
" \n",
Expand All @@ -414,11 +428,8 @@
"\n",
"@TI.register_fl_task(model='unet_model', data_loader='val_loader', device='device') \n",
"def validate(unet_model, val_loader, device):\n",
" if not torch.cuda.is_available():\n",
" device = 'cpu'\n",
" else:\n",
" device = 'cuda'\n",
" \n",
" print(f'\\n\\n TASK VALIDATE GOT DEVICE {device}\\n\\n')\n",
" \n",
" unet_model.eval()\n",
" unet_model.to(device)\n",
" \n",
Expand Down Expand Up @@ -475,7 +486,7 @@
" data_loader=fed_dataset,\n",
" rounds_to_train=2,\n",
" opt_treatment='CONTINUE_GLOBAL',\n",
" )\n"
" device_assignment_policy='CUDA_PREFERRED')\n"
]
},
{
Expand Down Expand Up @@ -588,7 +599,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
Expand All @@ -602,7 +613,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.12"
"version": "3.7.10"
}
},
"nbformat": 4,
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
params:
  cuda_devices: []

optional_plugin_components: {}

shard_descriptor:
  template: market_shard_descriptor.MarketShardDescriptor
  params:
    datafolder: Market-1501-v15.09.15
    rank_worldsize: 1,2
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
params:
  cuda_devices: []

optional_plugin_components: {}

shard_descriptor:
  template: market_shard_descriptor.MarketShardDescriptor
  params:
    datafolder: Market-1501-v15.09.15
    rank_worldsize: 2,2

This file was deleted.

This file was deleted.

Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
#!/bin/bash
set -e

fx envoy start -n env_one --disable-tls -dh localhost -dp 50051 -sc shard_config_one.yaml
fx envoy start -n env_one --disable-tls -dh localhost -dp 50051 -ec envoy_config_one.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -3,4 +3,4 @@ set -e
ENVOY_NAME=$1
DIRECTOR_FQDN=$2

fx envoy start -n "$ENVOY_NAME" --shard-config-path shard_config.yaml -d "$DIRECTOR_FQDN":50051 -rc cert/root_ca.crt -pk cert/"$ENVOY_NAME".key -oc cert/"$ENVOY_NAME".crt
fx envoy start -n "$ENVOY_NAME" --envoy-config-path envoy_config.yaml -d "$DIRECTOR_FQDN":50051 -rc cert/root_ca.crt -pk cert/"$ENVOY_NAME".key -oc cert/"$ENVOY_NAME".crt
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
params:
  cuda_devices: []

optional_plugin_components: {}

shard_descriptor:
  template: tinyimagenet_shard_descriptor.TinyImageNetShardDescriptor
  params:
    data_folder: tinyimagenet_data
    rank_worldsize: 1,2

This file was deleted.

Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
#!/bin/bash
set -e

fx envoy start -n env_one --disable-tls --shard-config-path shard_config.yaml -dh localhost -dp 50051
fx envoy start -n env_one --disable-tls --envoy-config-path envoy_config.yaml -dh localhost -dp 50051
Original file line number Diff line number Diff line change
Expand Up @@ -3,4 +3,4 @@ set -e
ENVOY_NAME=$1
DIRECTOR_FQDN=$2

fx envoy start -n "$ENVOY_NAME" --shard-config-path shard_config.yaml -dh "$DIRECTOR_FQDN" -dp 50051 -rc cert/root_ca.crt -pk cert/"$ENVOY_NAME".key -oc cert/"$ENVOY_NAME".crt
fx envoy start -n "$ENVOY_NAME" --envoy-config-path envoy_config.yaml -dh "$DIRECTOR_FQDN" -dp 50051 -rc cert/root_ca.crt -pk cert/"$ENVOY_NAME".key -oc cert/"$ENVOY_NAME".crt
4 changes: 2 additions & 2 deletions openfl-tutorials/interactive_api/Tensorflow_MNIST/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -22,13 +22,13 @@ cd director_folder
2. Run envoy:
```sh
cd envoy_folder
./start_envoy.sh env_one shard_config_one.yaml
./start_envoy.sh env_one envoy_config_one.yaml
```

Optional: start second envoy:
- Copy `envoy_folder` to another place and run from there:
```sh
./start_envoy.sh env_two shard_config_two.yaml
./start_envoy.sh env_two envoy_config_two.yaml
```

3. Run the `Mnist_Classification_FL.ipynb` Jupyter notebook:
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
params:
  cuda_devices: []

optional_plugin_components: {}

shard_descriptor:
  template: mnist_shard_descriptor.MnistShardDescriptor
  params:
    rank_worldsize: 1, 2
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
params:
  cuda_devices: []

optional_plugin_components: {}

shard_descriptor:
  template: mnist_shard_descriptor.MnistShardDescriptor
  params:
    rank_worldsize: 2, 2

This file was deleted.

This file was deleted.
