securefederatedai · alexey-gruzdev · Oct 29, 2021 · Aug 27, 2021 · Aug 31, 2021 · Sep 1, 2021
diff --git a/docs/source/openfl/plugins.rst b/docs/source/openfl/plugins.rst
@@ -10,18 +10,25 @@
 
    framework_adapter_
    serializer_plugin_
+   device_monitor_plugin_
 
 
 |productName| is designed to be a flexible and extensible framework. Plugins are interchangeable parts of 
-|productName| components. Different plugins support varying usage scenarios. |productName| users are free to provide 
-their implementations of |productName| plugins to support desired behavior.
+|productName| components. 
+A plugin may be :code:`required` or :code:`optional`. |productName| can run without optional plugins. 
+|productName| users are free to provide 
+their own implementations of |productName| plugins to achieve the desired behavior. 
+Technically, a plugin is just a class that satisfies a certain interface. To enable a plugin, add its 
+import path and initialization parameters to the config file of the corresponding |productName| component 
+or pass them to the frontend Python API. Please refer to openfl-tutorials for more information.
 
 .. _framework_adapter:
 
 Framework Adapter
 ######################
 
 Framework Adapter plugins enable |productName| support for Deep Learning frameworks usage in FL experiments. 
+It is a required plugin for the frontend API component and Envoy.
 All the framework-specific operations on model weights are isolated in this plugin so |productName| can be framework-agnostic.
 The Framework adapter plugin interface is simple: there are two required methods to load and extract tensors from 
 a model and an optimizer. 
@@ -57,7 +64,7 @@ Experiment Serializer
 
 Serializer plugins are used on the Frontend API to serialize the Experiment components and then on Envoys to deserialize them back.
 Currently, the default serializer is based on pickling.
-
+It is a required plugin.
 A Serializer plugin must implement :code:`serialize` method that creates a python object's representation on disk.
 
 .. code-block:: python
@@ -71,3 +78,34 @@ As well as :code:`restore_object` that will load previously serialized object fr
 
    @staticmethod
    def restore_object(filename: str):
+
+
+.. _device_monitor_plugin:
+
+CUDA Device Monitor
+######################
+
+The CUDA Device Monitor plugin is an optional plugin for Envoy that gathers status information about GPU devices. 
+Envoy may use this information and include it in the healthcheck message that it sends to the Director. 
+The status of the CUDA devices is thus visible to frontend users, who may query this Envoy Registry information from the Director.
+
+A CUDA Device Monitor plugin must implement the following interface:
+
+.. code-block:: python
+
+   class CUDADeviceMonitor:
+
+      def get_driver_version(self) -> str:
+         ...
+
+      def get_device_memory_total(self, index: int) -> int:
+         ...
+
+      def get_device_memory_utilized(self, index: int) -> int:
+         ...
+
+      def get_device_utilization(self, index: int) -> str:
+         """Return a general-purpose status string that may be shown to the frontend user."""
+         ...
+
+
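
A minimal sketch of a class that satisfies the interface above, for illustration only. `StaticCUDADeviceMonitor` and its constructor arguments are hypothetical names, not part of OpenFL. It serves fixed values instead of querying a real driver, so it also runs on machines without a GPU. Reporting memory in bytes is an assumption, since the interface does not specify units.

```python
from typing import Dict, Tuple


class StaticCUDADeviceMonitor:
    """Hypothetical monitor that serves fixed values instead of querying a GPU driver."""

    def __init__(self, driver_version: str, devices: Dict[int, Tuple[int, int]]) -> None:
        # devices maps a device index to (total_memory, used_memory); bytes are assumed.
        self._driver_version = driver_version
        self._devices = dict(devices)

    def get_driver_version(self) -> str:
        return self._driver_version

    def get_device_memory_total(self, index: int) -> int:
        return self._devices[index][0]

    def get_device_memory_utilized(self, index: int) -> int:
        return self._devices[index][1]

    def get_device_utilization(self, index: int) -> str:
        # Free-form string intended for display to the frontend user.
        total, used = self._devices[index]
        return f'{100 * used // total}%'


# Example values are made up for the sketch.
monitor = StaticCUDADeviceMonitor('470.57', {0: (16 * 1024**3, 4 * 1024**3)})
print(monitor.get_driver_version(), monitor.get_device_utilization(0))
```

A stand-in like this may be convenient for testing an Envoy setup without GPU hardware; a real implementation would read the same values from the driver.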
diff --git a/docs/source/workflow/director_based_workflow.rst b/docs/source/workflow/director_based_workflow.rst
@@ -118,7 +118,7 @@ To start the Envoy without mTLS use the following CLI command:
     .. code-block:: console
 
         $ fx envoy start -n env_one --disable-tls \
-            --shard-config-path shard_config.yaml -d director_fqdn:port
+            --envoy-config-path envoy_config.yaml -d director_fqdn:port
 
 Alternatively, use the following command to establish a secured connection:
 
@@ -127,7 +127,7 @@ Alternatively, use the following command to establish a secured connection:
         $ ENVOY_NAME=envoy_example_name
 
         $ fx envoy start -n "$ENVOY_NAME" \
-            --shard-config-path shard_config.yaml \
+            --envoy-config-path envoy_config.yaml \
             -d director_fqdn:port -rc cert/root_ca.crt \
             -pk cert/"$ENVOY_NAME".key -oc cert/"$ENVOY_NAME".crt
 
@@ -345,6 +345,22 @@ This method:
 * Compresses the whole workspace to an archive.
 * Sends the experiment archive to the Director so it may distribute the archive across the Federation and start the *Aggregator*.
 
+FLExperiment's :code:`start()` method parameters
+--------------------------------------------------
+
+* :code:`model_provider` - the :code:`ModelInterface` object defined earlier
+* :code:`task_keeper` - the :code:`TaskInterface` object defined earlier
+* :code:`data_loader` - the :code:`DataInterface` object defined earlier
+* :code:`rounds_to_train` - the number of aggregation rounds to conduct before the experiment is considered finished
+* :code:`delta_updates` - use calculated gradients instead of model checkpoints for aggregation
+* :code:`opt_treatment` - the optimizer state treatment in the federation. Possible values: with 'RESET', the optimizer state 
+  is initialized each round from noise; with 'CONTINUE_LOCAL', the optimizer state is reused locally by every collaborator; 
+  with 'CONTINUE_GLOBAL', the optimizer state is aggregated.
+* :code:`device_assignment_policy` - either 'CPU_ONLY' or 'CUDA_PREFERRED'. With 'CPU_ONLY', the :code:`device` 
+  parameter (which is part of a task contract) passed to an FL task each round is 'cpu'. With 
+  :code:`device_assignment_policy='CUDA_PREFERRED'`, the :code:`device` parameter is 'cuda:{index}' if CUDA devices 
+  are enabled in the Envoy config and 'cpu' otherwise.
+
 Observing the Experiment execution
 ----------------------------------
 

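
The device assignment rule described in the documentation change above can be sketched as a small function. This is an illustration of the documented behavior, not OpenFL's actual implementation. `resolve_device` is a hypothetical name, the policy string is spelled as in the notebook call (`'CUDA_PREFERRED'`), and picking the first listed device index is an assumption, since the text does not say how the index is chosen.

```python
from typing import Sequence


def resolve_device(policy: str, cuda_devices: Sequence[int]) -> str:
    """Return the device string for one task under the documented policies."""
    if policy == 'CPU_ONLY':
        return 'cpu'
    if policy == 'CUDA_PREFERRED':
        # Assumption: use the first device index listed in the Envoy config.
        return f'cuda:{cuda_devices[0]}' if cuda_devices else 'cpu'
    raise ValueError(f'Unknown device assignment policy: {policy}')


print(resolve_device('CUDA_PREFERRED', [0, 2]))
print(resolve_device('CUDA_PREFERRED', []))
```

With an empty `cuda_devices` list, as in the `envoy_config_no_gpu.yaml` example, the sketch falls back to `'cpu'` under either policy.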
diff --git a/openfl-tutorials/interactive_api/PyTorch_Kvasir_UNet/director/director_config.yaml b/openfl-tutorials/interactive_api/PyTorch_Kvasir_UNet/director/director_config.yaml
@@ -1,6 +1,6 @@
 settings:
   listen_host: localhost
-  listen_port: 50051
+  listen_port: 50050
   sample_shape: ['300', '400', '3']
   target_shape: ['300', '400']
-  envoy_health_check_period: 60  # in seconds
+  envoy_health_check_period: 5  # in seconds
diff --git a/openfl-tutorials/interactive_api/PyTorch_Kvasir_UNet/envoy/envoy_config.yaml b/openfl-tutorials/interactive_api/PyTorch_Kvasir_UNet/envoy/envoy_config.yaml
@@ -0,0 +1,14 @@
+params:
+  cuda_devices: [0,2]
+
+optional_plugin_components:
+ cuda_device_monitor:
+   template: openfl.plugins.processing_units_monitor.pynvml_monitor.PynvmlCUDADeviceMonitor
+   settings: []
+
+shard_descriptor:
+  template: kvasir_shard_descriptor.KvasirShardDescriptor
+  params:
+    data_folder: kvasir_data
+    rank_worldsize: 1,10
+    enforce_image_hw: '300,400'
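
The `template` keys in the config above hold dotted import paths to classes. A hedged sketch of how such a path could be turned into an object is shown below. `load_class` is a hypothetical helper, not OpenFL's loader, and treating the initialization parameters as keyword arguments is an assumption. A standard-library class stands in for a plugin so the example runs without OpenFL installed.

```python
from importlib import import_module
from typing import Any


def load_class(template: str) -> Any:
    """Import and return the class named by a dotted path like 'package.module.ClassName'."""
    module_path, _, class_name = template.rpartition('.')
    return getattr(import_module(module_path), class_name)


# Stand-in for a parsed config entry; the values here are illustrative only.
entry = {'template': 'collections.OrderedDict', 'settings': {}}
instance = load_class(entry['template'])(**entry['settings'])
print(type(instance).__name__)
```

Resolving classes by import path keeps the component itself unaware of concrete plugin implementations, which matches the pluggable design described in the documentation.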
diff --git a/openfl-tutorials/interactive_api/PyTorch_Kvasir_UNet/envoy/envoy_config_no_gpu.yaml b/openfl-tutorials/interactive_api/PyTorch_Kvasir_UNet/envoy/envoy_config_no_gpu.yaml
@@ -0,0 +1,12 @@
+params:
+  cuda_devices: []
+
+optional_plugin_components: {}
+
+shard_descriptor:
+  template: kvasir_shard_descriptor.KvasirShardDescriptor
+  params:
+    data_folder: kvasir_data
+    rank_worldsize: 2,10
+    enforce_image_hw: '300,400'
+
diff --git a/openfl-tutorials/interactive_api/PyTorch_Kvasir_UNet/envoy/sd_requirements.txt b/openfl-tutorials/interactive_api/PyTorch_Kvasir_UNet/envoy/sd_requirements.txt
@@ -1,2 +1,3 @@
 numpy
-pillow
+pillow
+nvidia-ml-py3
diff --git a/openfl-tutorials/interactive_api/PyTorch_Kvasir_UNet/envoy/shard_config.yaml b/openfl-tutorials/interactive_api/PyTorch_Kvasir_UNet/envoy/shard_config.yaml
diff --git a/openfl-tutorials/interactive_api/PyTorch_Kvasir_UNet/envoy/start_envoy.sh b/openfl-tutorials/interactive_api/PyTorch_Kvasir_UNet/envoy/start_envoy.sh
@@ -1,4 +1,4 @@
 #!/bin/bash
 set -e
 
-fx envoy start -n env_one --disable-tls --shard-config-path shard_config.yaml -dh localhost -dp 50051
+fx envoy start -n env_one --disable-tls --envoy-config-path envoy_config.yaml -dh localhost -dp 50050
diff --git a/openfl-tutorials/interactive_api/PyTorch_Kvasir_UNet/envoy/start_envoy_with_tls.sh b/openfl-tutorials/interactive_api/PyTorch_Kvasir_UNet/envoy/start_envoy_with_tls.sh
@@ -3,4 +3,4 @@ set -e
 ENVOY_NAME=$1
 DIRECTOR_FQDN=$2
 
-fx envoy start -n "$ENVOY_NAME" --shard-config-path shard_config.yaml -dh "$DIRECTOR_FQDN" -dp 50051 -rc cert/root_ca.crt -pk cert/"$ENVOY_NAME".key -oc cert/"$ENVOY_NAME".crt
+fx envoy start -n "$ENVOY_NAME" --envoy-config-path envoy_config.yaml -dh "$DIRECTOR_FQDN" -dp 50050 -rc cert/root_ca.crt -pk cert/"$ENVOY_NAME".key -oc cert/"$ENVOY_NAME".crt
diff --git a/openfl-tutorials/interactive_api/PyTorch_Kvasir_UNet/workspace/PyTorch_Kvasir_UNet.ipynb b/openfl-tutorials/interactive_api/PyTorch_Kvasir_UNet/workspace/PyTorch_Kvasir_UNet.ipynb
@@ -44,7 +44,7 @@
    "outputs": [],
    "source": [
     "# Install dependencies if not already installed\n",
-    "!pip install torchvision==0.8.1"
+    "!pip install torchvision"
    ]
   },
   {
@@ -81,7 +81,7 @@
     "\n",
     "# 2) Run with TLS disabled (trusted environment)\n",
     "# Federation can also determine local fqdn automatically\n",
-    "federation = Federation(client_id='frontend', director_node_fqdn='localhost', director_port='50051', tls=False)\n"
+    "federation = Federation(client_id='frontend', director_node_fqdn='localhost', director_port='50050', tls=False)\n"
    ]
   },
   {
@@ -91,6 +91,11 @@
    "metadata": {},
    "outputs": [],
    "source": [
+    "# import time\n",
+    "# while True:\n",
+    "#     shard_registry = federation.get_shard_registry()\n",
+    "#     print(shard_registry)\n",
+    "#     time.sleep(5)\n",
     "shard_registry = federation.get_shard_registry()\n",
     "shard_registry"
    ]
@@ -385,10 +390,19 @@
     "                     device='device', optimizer='optimizer')     \n",
     "@TI.set_aggregation_function(aggregation_function)\n",
     "def train(unet_model, train_loader, optimizer, device, loss_fn=soft_dice_loss, some_parameter=None):\n",
+    "    \n",
+    "    \"\"\"    \n",
+    "    The following construct, which may lead to a resource race,\n",
+    "    is no longer needed:\n",
+    "    \n",
     "    if not torch.cuda.is_available():\n",
     "        device = 'cpu'\n",
     "    else:\n",
     "        device = 'cuda'\n",
+    "        \n",
+    "    \"\"\"\n",
+    "\n",
+    "    print(f'\\n\\n TASK TRAIN GOT DEVICE {device}\\n\\n')\n",
     "    \n",
     "    function_defined_in_notebook(some_parameter)\n",
     "    \n",
@@ -414,11 +428,8 @@
     "\n",
     "@TI.register_fl_task(model='unet_model', data_loader='val_loader', device='device')     \n",
     "def validate(unet_model, val_loader, device):\n",
-    "    if not torch.cuda.is_available():\n",
-    "        device = 'cpu'\n",
-    "    else:\n",
-    "        device = 'cuda'\n",
-    "        \n",
+    "    print(f'\\n\\n TASK VALIDATE GOT DEVICE {device}\\n\\n')\n",
+    "    \n",
     "    unet_model.eval()\n",
     "    unet_model.to(device)\n",
     "    \n",
@@ -475,7 +486,7 @@
     "                    data_loader=fed_dataset,\n",
     "                    rounds_to_train=2,\n",
     "                    opt_treatment='CONTINUE_GLOBAL',\n",
-    "                    )\n"
+    "                    device_assignment_policy='CUDA_PREFERRED')\n"
    ]
   },
   {
@@ -588,7 +599,7 @@
  ],
  "metadata": {
   "kernelspec": {
-   "display_name": "Python 3",
+   "display_name": "Python 3 (ipykernel)",
    "language": "python",
    "name": "python3"
   },
@@ -602,7 +613,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.8.12"
+   "version": "3.7.10"
   }
  },
  "nbformat": 4,

diff --git a/openfl-tutorials/interactive_api/PyTorch_Market_Re-ID/envoy/envoy_config_one.yaml b/openfl-tutorials/interactive_api/PyTorch_Market_Re-ID/envoy/envoy_config_one.yaml
@@ -0,0 +1,10 @@
+params:
+  cuda_devices: []
+
+optional_plugin_components: {}
+
+shard_descriptor:
+  template: market_shard_descriptor.MarketShardDescriptor
+  params:
+    datafolder: Market-1501-v15.09.15
+    rank_worldsize: 1,2
diff --git a/openfl-tutorials/interactive_api/PyTorch_Market_Re-ID/envoy/envoy_config_two.yaml b/openfl-tutorials/interactive_api/PyTorch_Market_Re-ID/envoy/envoy_config_two.yaml
@@ -0,0 +1,10 @@
+params:
+  cuda_devices: []
+
+optional_plugin_components: {}
+
+shard_descriptor:
+  template: market_shard_descriptor.MarketShardDescriptor
+  params:
+    datafolder: Market-1501-v15.09.15
+    rank_worldsize: 2,2
diff --git a/openfl-tutorials/interactive_api/PyTorch_Market_Re-ID/envoy/shard_config_one.yaml b/openfl-tutorials/interactive_api/PyTorch_Market_Re-ID/envoy/shard_config_one.yaml
diff --git a/openfl-tutorials/interactive_api/PyTorch_Market_Re-ID/envoy/shard_config_two.yaml b/openfl-tutorials/interactive_api/PyTorch_Market_Re-ID/envoy/shard_config_two.yaml
diff --git a/openfl-tutorials/interactive_api/PyTorch_Market_Re-ID/envoy/start_envoy.sh b/openfl-tutorials/interactive_api/PyTorch_Market_Re-ID/envoy/start_envoy.sh
@@ -1,4 +1,4 @@
 #!/bin/bash
 set -e
 
-fx envoy start -n env_one --disable-tls -dh localhost -dp 50051  -sc shard_config_one.yaml
+fx envoy start -n env_one --disable-tls -dh localhost -dp 50051  -ec envoy_config_one.yaml
diff --git a/openfl-tutorials/interactive_api/PyTorch_Market_Re-ID/envoy/start_envoy_with_tls.sh b/openfl-tutorials/interactive_api/PyTorch_Market_Re-ID/envoy/start_envoy_with_tls.sh
@@ -3,4 +3,4 @@ set -e
 ENVOY_NAME=$1
 DIRECTOR_FQDN=$2
 
-fx envoy start -n "$ENVOY_NAME" --shard-config-path shard_config.yaml -d "$DIRECTOR_FQDN":50051 -rc cert/root_ca.crt -pk cert/"$ENVOY_NAME".key -oc cert/"$ENVOY_NAME".crt
+fx envoy start -n "$ENVOY_NAME" --envoy-config-path envoy_config.yaml -d "$DIRECTOR_FQDN":50051 -rc cert/root_ca.crt -pk cert/"$ENVOY_NAME".key -oc cert/"$ENVOY_NAME".crt
diff --git a/openfl-tutorials/interactive_api/PyTorch_TinyImageNet/envoy/envoy_config.yaml b/openfl-tutorials/interactive_api/PyTorch_TinyImageNet/envoy/envoy_config.yaml
@@ -0,0 +1,10 @@
+params:
+  cuda_devices: []
+
+optional_plugin_components: {}
+
+shard_descriptor:
+  template: tinyimagenet_shard_descriptor.TinyImageNetShardDescriptor
+  params:
+    data_folder: tinyimagenet_data
+    rank_worldsize: 1,2
diff --git a/openfl-tutorials/interactive_api/PyTorch_TinyImageNet/envoy/shard_config.yaml b/openfl-tutorials/interactive_api/PyTorch_TinyImageNet/envoy/shard_config.yaml
diff --git a/openfl-tutorials/interactive_api/PyTorch_TinyImageNet/envoy/start_envoy.sh b/openfl-tutorials/interactive_api/PyTorch_TinyImageNet/envoy/start_envoy.sh
@@ -1,4 +1,4 @@
 #!/bin/bash
 set -e
 
-fx envoy start -n env_one --disable-tls --shard-config-path shard_config.yaml -dh localhost -dp 50051
+fx envoy start -n env_one --disable-tls --envoy-config-path envoy_config.yaml -dh localhost -dp 50051
diff --git a/openfl-tutorials/interactive_api/PyTorch_TinyImageNet/envoy/start_envoy_with_tls.sh b/openfl-tutorials/interactive_api/PyTorch_TinyImageNet/envoy/start_envoy_with_tls.sh
@@ -3,4 +3,4 @@ set -e
 ENVOY_NAME=$1
 DIRECTOR_FQDN=$2
 
-fx envoy start -n "$ENVOY_NAME" --shard-config-path shard_config.yaml -dh "$DIRECTOR_FQDN" -dp 50051 -rc cert/root_ca.crt -pk cert/"$ENVOY_NAME".key -oc cert/"$ENVOY_NAME".crt
+fx envoy start -n "$ENVOY_NAME" --envoy-config-path envoy_config.yaml -dh "$DIRECTOR_FQDN" -dp 50051 -rc cert/root_ca.crt -pk cert/"$ENVOY_NAME".key -oc cert/"$ENVOY_NAME".crt
diff --git a/openfl-tutorials/interactive_api/Tensorflow_MNIST/README.md b/openfl-tutorials/interactive_api/Tensorflow_MNIST/README.md
@@ -22,13 +22,13 @@ cd director_folder
 2. Run envoy:
 ```sh
 cd envoy_folder
-./start_envoy.sh env_one shard_config_one.yaml
+./start_envoy.sh env_one envoy_config_one.yaml
 ```
 
 Optional: start second envoy:
  - Copy `envoy_folder` to another place and run from there:
 ```sh
-./start_envoy.sh env_two shard_config_two.yaml
+./start_envoy.sh env_two envoy_config_two.yaml
 ```
 
 3. Run `Mnist_Classification_FL.ipybnb` jupyter notebook:

diff --git a/openfl-tutorials/interactive_api/Tensorflow_MNIST/envoy/envoy_config_one.yaml b/openfl-tutorials/interactive_api/Tensorflow_MNIST/envoy/envoy_config_one.yaml
@@ -0,0 +1,9 @@
+params:
+  cuda_devices: []
+
+optional_plugin_components: {}
+
+shard_descriptor:
+  template: mnist_shard_descriptor.MnistShardDescriptor
+  params:
+    rank_worldsize: 1, 2
diff --git a/openfl-tutorials/interactive_api/Tensorflow_MNIST/envoy/envoy_config_two.yaml b/openfl-tutorials/interactive_api/Tensorflow_MNIST/envoy/envoy_config_two.yaml
@@ -0,0 +1,9 @@
+params:
+  cuda_devices: []
+
+optional_plugin_components: {}
+
+shard_descriptor:
+  template: mnist_shard_descriptor.MnistShardDescriptor
+  params:
+    rank_worldsize: 2, 2
diff --git a/openfl-tutorials/interactive_api/Tensorflow_MNIST/envoy/shard_config_one.yaml b/openfl-tutorials/interactive_api/Tensorflow_MNIST/envoy/shard_config_one.yaml
diff --git a/openfl-tutorials/interactive_api/Tensorflow_MNIST/envoy/shard_config_two.yaml b/openfl-tutorials/interactive_api/Tensorflow_MNIST/envoy/shard_config_two.yaml