diff --git a/.binder/postBuild b/.binder/postBuild index c30b91b9..86be4366 100644 --- a/.binder/postBuild +++ b/.binder/postBuild @@ -1,18 +1,3 @@ #!/bin/bash # binder post build script set -ex - -(cd docs && make html) - -# uninstall docs requirements for a lighter docker image -pip uninstall -y -r docs/requirements.txt - -# move examples to the notebooks folder -mv docs/build/html/notebooks . -mv examples notebooks/examples - -# delete everything but the notebooks folder and the substra dependencies -shopt -s extglob -rm -rf .[!.]* -rm -rf !(notebooks|docs) -(cd docs && rm -rf !(src)) diff --git a/.binder/runtime.txt b/.binder/runtime.txt index 9850e861..55090899 100644 --- a/.binder/runtime.txt +++ b/.binder/runtime.txt @@ -1 +1 @@ -python-3.8 +python-3.10 diff --git a/.github/workflows/build.yml b/.github/workflows/build.yml index 7ed9cd3a..70454283 100644 --- a/.github/workflows/build.yml +++ b/.github/workflows/build.yml @@ -15,7 +15,7 @@ jobs: - name: Set up python uses: actions/setup-python@v2 with: - python-version: 3.8 + python-version: "3.10" - name: Clone substra-tools uses: actions/checkout@v2 @@ -40,7 +40,7 @@ jobs: - name: Install substra, substra-tools and substrafl run: | - pip install -e './substrafl[dev]' + pip install -e ./substrafl pip install -e ./substra pip install -e ./substra-tools @@ -49,8 +49,14 @@ jobs: cp -r substra/references docs/source/documentation/references cp -r substrafl/docs/api docs/source/substrafl_doc/ + - name: Install Pandoc + run: | + sudo wget https://github.com/jgm/pandoc/releases/download/3.1.8/pandoc-3.1.8-1-amd64.deb + sudo dpkg -i pandoc-3.1.8-1-amd64.deb + - name: Install requirements - run: pip install -r requirements.txt + run: | + pip install -r requirements.txt - name: Sphinx make working-directory: ./docs diff --git a/.gitignore b/.gitignore index 2036f6fd..87337b05 100644 --- a/.gitignore +++ b/.gitignore @@ -135,8 +135,6 @@ _build/ # Misc build artefacts tmp/** -docs/source/examples/** 
-doc/source/substrafl_doc/examples/** docs/source/documentation/references/** docs/source/substrafl_doc/api diff --git a/.readthedocs.yaml b/.readthedocs.yaml index 7bafdf2e..898b309b 100644 --- a/.readthedocs.yaml +++ b/.readthedocs.yaml @@ -8,17 +8,12 @@ version: 2 build: os: "ubuntu-22.04" tools: - python: "3.8" + python: "mambaforge-22.9" # Build documentation in the docs/ directory with Sphinx sphinx: configuration: docs/source/conf.py fail_on_warning: True -# Optionally build your docs in additional formats such as PDF -formats: [] - -# Optionally set the version of Python and requirements required to build your docs -python: - install: - - requirements: requirements.txt +conda: + environment: environment.yml diff --git a/Makefile b/Makefile index c8ed39b8..c3ef32e9 100644 --- a/Makefile +++ b/Makefile @@ -1,27 +1,22 @@ install-examples-dependencies: - pip3 install -r examples/substra_core/diabetes_example/assets/requirements.txt \ - -r examples/substra_core/titanic_example/assets/requirements.txt \ - -r examples/substrafl/get_started/torch_fedavg_assets/requirements.txt \ - -r examples/substrafl/go_further/sklearn_fedavg_assets/requirements.txt \ - -r examples/substrafl/go_further/torch_cyclic_assets/requirements.txt \ - -r examples/substrafl/go_further/diabetes_substrafl_assets/requirements.txt \ + pip3 install -r examples_requirements.txt examples: example-substra example-substrafl example-substra: example-core-diabetes example-core-titanic example-core-diabetes: - cd examples/substra_core/diabetes_example/ && python run_diabetes.py + cd docs/source/examples/substra_core/diabetes_example/ && ipython -c "%run run_diabetes.ipynb" example-core-titanic: - cd examples/substra_core/titanic_example/ && python run_titanic.py + cd docs/source/examples/substra_core/titanic_example/ && ipython -c "%run run_titanic.ipynb" example-substrafl: example-fl-mnist example-fl-iris example-fl-cyclic example-fl-diabetes example-fl-mnist: - cd examples/substrafl/get_started/ && 
python run_mnist_torch.py + cd docs/source/examples/substrafl/get_started/ && ipython -c "%run run_mnist_torch.ipynb" example-fl-iris: - cd examples/substrafl/go_further/ && python run_iris_sklearn.py + cd docs/source/examples/substrafl/go_further/ && ipython -c "%run run_iris_sklearn.ipynb" example-fl-cyclic: - cd examples/substrafl/go_further/ && python run_mnist_cyclic.py + cd docs/source/examples/substrafl/go_further/ && ipython -c "%run run_mnist_cyclic.ipynb" example-fl-diabetes: - cd examples/substrafl/go_further/ && python run_diabetes_substrafl.py \ No newline at end of file + cd docs/source/examples/substrafl/go_further/ && ipython -c "%run run_diabetes_substrafl.ipynb" \ No newline at end of file diff --git a/docker/substra-documentation-examples/Dockerfile b/docker/substra-documentation-examples/Dockerfile index cd90e77e..88939568 100644 --- a/docker/substra-documentation-examples/Dockerfile +++ b/docker/substra-documentation-examples/Dockerfile @@ -21,7 +21,8 @@ RUN cd substra && python -m pip install --no-cache-dir -e . RUN cd substra-tools && python -m pip install --no-cache-dir -e . COPY substra-documentation/Makefile substra-documentation/ -COPY substra-documentation/examples substra-documentation/examples/ +COPY substra-documentation/examples_requirements.txt substra-documentation/ +COPY substra-documentation/docs/source/examples substra-documentation/docs/source/examples/ RUN cd substra-documentation && make install-examples-dependencies diff --git a/docs/Makefile b/docs/Makefile index 7d144c07..f1870b5e 100644 --- a/docs/Makefile +++ b/docs/Makefile @@ -7,9 +7,7 @@ SPHINXOPTS ?= -W --keep-going -n SPHINXBUILD ?= sphinx-build SOURCEDIR = source BUILDDIR = build -SUBSTRAEXAMPLEDIR = source/examples/substra_core SUBSTRADOCDIR = source/documentation/references -SUBSTRAFLEXAMPLEDIR = source/substrafl_doc/examples SUBSTRAFLDOCDIR = source/substrafl_doc/api # Put it first so that "make" without argument is like "make help". 
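The updated Makefile targets execute each example notebook headlessly by `cd`-ing into its folder and running `ipython -c "%run <notebook>.ipynb"`. As an illustration of what each target expands to, here is a hedged Python sketch; the helper name `notebook_run_command` is ours, not part of this repository:

```python
import shlex


def notebook_run_command(notebook_dir: str, notebook_name: str) -> str:
    """Build the shell command used by the Makefile-style targets:
    change into the example directory, then run the notebook with
    IPython's %run magic (sketch only, not repo code)."""
    run_magic = f"%run {notebook_name}"
    # shlex.quote protects against spaces/special characters in names
    return f"cd {shlex.quote(notebook_dir)} && ipython -c {shlex.quote(run_magic)}"


cmd = notebook_run_command(
    "docs/source/examples/substra_core/titanic_example", "run_titanic.ipynb"
)
print(cmd)  # cd docs/source/examples/substra_core/titanic_example && ipython -c '%run run_titanic.ipynb'
```

This mirrors the `example-core-titanic` target; running the notebook this way fails the make target if any cell raises, which is what makes the examples usable as CI checks.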
@@ -33,8 +31,6 @@ clean: rm -rf $(BUILDDIR) rm -rf $(SUBSTRADOCDIR) rm -rf $(SUBSTRAFLDOCDIR) - rm -rf $(SUBSTRAEXAMPLEDIR) - rm -rf $(SUBSTRAFLEXAMPLEDIR) # Delete the local worker folders in substra-documentation find .. -type d -name local-worker -prune -exec rm -rf {} \; # Delete the tmp folders in substra-documentation diff --git a/docs/requirements.txt b/docs/requirements.txt index 3c82e0d9..23cd6259 100644 --- a/docs/requirements.txt +++ b/docs/requirements.txt @@ -11,7 +11,9 @@ myst-parser==0.16.1 # but docutils 0.17 changed the output html markup, breaking the RTD theme # original issue: https://github.com/sphinx-doc/sphinx/issues/9051 docutils==0.16 -sphinx-gallery==0.7.0 sphinx-fontawesome==0.0.6 sphinx-copybutton==0.5.0 pyyaml==6.0 +nbsphinx==0.9.3 +pandoc==2.3 +git-python==1.0.3 \ No newline at end of file diff --git a/docs/source/conf.py b/docs/source/conf.py index 3345be8a..c0b90c67 100644 --- a/docs/source/conf.py +++ b/docs/source/conf.py @@ -24,9 +24,7 @@ import git import yaml -from sphinx_gallery.sorting import ExplicitOrder - -TMP_FOLDER = Path(__file__).parents[2] / "tmp" +TMP_FOLDER = Path(__file__).parent / "tmp" TMP_FOLDER.mkdir(exist_ok=True) # Generate a JSON compatibility table @@ -99,6 +97,31 @@ def __call__(self, directory): return directory +# Nbsphinx config + +nbsphinx_thumbnails = { + "examples/substra_core/diabetes_example/run_diabetes": "_static/example_thumbnail/diabetes.png", + "examples/substra_core/titanic_example/run_titanic": "_static/example_thumbnail/titanic.jpg", + "examples/substrafl/get_started/run_mnist_torch": "_static/example_thumbnail/mnist.png", + "examples/substrafl/go_further/run_diabetes_substrafl": "_static/example_thumbnail/diabetes.png", + "examples/substrafl/go_further/run_iris_sklearn": "_static/example_thumbnail/iris.jpg", + "examples/substrafl/go_further/run_mnist_cyclic": "_static/example_thumbnail/cyclic-mnist.png", +} + +nbsphinx_prolog = r""" +{% set docname = 'doc/' + env.doc2path(env.docname, 
base=None) %} + +.. raw:: html + +
+ Launch notebook online Binder badge + or download it Download badge +
+""" + +nbsphinx_epilog = nbsphinx_prolog + + # zip the assets directory found in the examples directory and place it in the current dir def zip_dir(source_dir, zip_file_name): # Create archive with compressed files @@ -111,29 +134,29 @@ def zip_dir(source_dir, zip_file_name): ) -assets_dir_titanic = Path(__file__).parents[2] / "examples" / "substra_core" / "titanic_example" / "assets" +assets_dir_titanic = Path(__file__).parent / "examples" / "substra_core" / "titanic_example" / "assets" zip_dir(assets_dir_titanic, "titanic_assets.zip") -assets_dir_diabetes = Path(__file__).parents[2] / "examples" / "substra_core" / "diabetes_example" / "assets" +assets_dir_diabetes = Path(__file__).parent / "examples" / "substra_core" / "diabetes_example" / "assets" zip_dir(assets_dir_diabetes, "diabetes_assets.zip") assets_dir_substrafl_torch_fedavg = ( - Path(__file__).parents[2] / "examples" / "substrafl" / "get_started" / "torch_fedavg_assets" + Path(__file__).parent / "examples" / "substrafl" / "get_started" / "torch_fedavg_assets" ) zip_dir(assets_dir_substrafl_torch_fedavg, "torch_fedavg_assets.zip") assets_dir_substrafl_diabetes = ( - Path(__file__).parents[2] / "examples" / "substrafl" / "go_further" / "diabetes_substrafl_assets" + Path(__file__).parent / "examples" / "substrafl" / "go_further" / "diabetes_substrafl_assets" ) zip_dir(assets_dir_substrafl_diabetes, "diabetes_substrafl_assets.zip") assets_dir_substrafl_sklearn_fedavg = ( - Path(__file__).parents[2] / "examples" / "substrafl" / "go_further" / "sklearn_fedavg_assets" + Path(__file__).parent / "examples" / "substrafl" / "go_further" / "sklearn_fedavg_assets" ) zip_dir(assets_dir_substrafl_sklearn_fedavg, "sklearn_fedavg_assets.zip") assets_dir_substrafl_sklearn_fedavg = ( - Path(__file__).parents[2] / "examples" / "substrafl" / "go_further" / "torch_cyclic_assets" + Path(__file__).parent / "examples" / "substrafl" / "go_further" / "torch_cyclic_assets" ) zip_dir(assets_dir_substrafl_sklearn_fedavg, 
"torch_cyclic_assets.zip") @@ -248,7 +271,6 @@ def reformat_md_section_links(file_path: Path): for file_path in Path(".").rglob("*.md"): reformat_md_section_links(file_path) - # -- Project information ----------------------------------------------------- project = "Substra" @@ -267,7 +289,9 @@ def reformat_md_section_links(file_path: Path): # Add any Sphinx extension module names here, as strings. They can be # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom # ones. -extensions = ["sphinx_gallery.gen_gallery"] +extensions = [ + "nbsphinx", +] extensions.extend( [ @@ -297,7 +321,6 @@ def reformat_md_section_links(file_path: Path): "torch": ("https://pytorch.org/docs/stable/", None), } - ################ # Substrafl API ################ @@ -379,7 +402,7 @@ def reformat_md_section_links(file_path: Path): # List of patterns, relative to source directory, that match files and # directories to ignore when looking for source files. # This pattern also affects html_static_path and html_extra_path. -exclude_patterns = ["_build", "Thumbs.db", ".DS_Store"] +exclude_patterns = ["_build", "Thumbs.db", ".DS_Store", "**/description.md"] rst_epilog = f""" .. 
|substra_version| replace:: {importlib.import_module('substra').__version__} @@ -414,21 +437,3 @@ def reformat_md_section_links(file_path: Path): html_context = { "display_github": False, } - -sphinx_gallery_conf = { - "remove_config_comments": True, - "doc_module": "substra", - "reference_url": {"Substra": None}, - "examples_dirs": ["../../examples/substra_core", "../../examples/substrafl"], - "gallery_dirs": ["examples/substra_core", "examples/substrafl"], - "subsection_order": ExplicitOrder( - [ - "../../examples/substra_core/titanic_example", - "../../examples/substra_core/diabetes_example", - "../../examples/substrafl/get_started", - "../../examples/substrafl/go_further", - ] - ), - "download_all_examples": False, - "filename_pattern": "/run_", -} diff --git a/examples/substra_core/diabetes_example/assets/dataset/diabetes_dataset.py b/docs/source/examples/substra_core/diabetes_example/assets/dataset/diabetes_dataset.py similarity index 100% rename from examples/substra_core/diabetes_example/assets/dataset/diabetes_dataset.py rename to docs/source/examples/substra_core/diabetes_example/assets/dataset/diabetes_dataset.py diff --git a/examples/substra_core/diabetes_example/assets/dataset/diabetes_opener.py b/docs/source/examples/substra_core/diabetes_example/assets/dataset/diabetes_opener.py similarity index 100% rename from examples/substra_core/diabetes_example/assets/dataset/diabetes_opener.py rename to docs/source/examples/substra_core/diabetes_example/assets/dataset/diabetes_opener.py diff --git a/examples/substra_core/diabetes_example/assets/functions/aggregation/Dockerfile b/docs/source/examples/substra_core/diabetes_example/assets/functions/aggregation/Dockerfile similarity index 100% rename from examples/substra_core/diabetes_example/assets/functions/aggregation/Dockerfile rename to docs/source/examples/substra_core/diabetes_example/assets/functions/aggregation/Dockerfile diff --git a/examples/substra_core/diabetes_example/assets/functions/description.md 
b/docs/source/examples/substra_core/diabetes_example/assets/functions/description.md similarity index 100% rename from examples/substra_core/diabetes_example/assets/functions/description.md rename to docs/source/examples/substra_core/diabetes_example/assets/functions/description.md diff --git a/examples/substra_core/diabetes_example/assets/functions/federated_analytics_functions.py b/docs/source/examples/substra_core/diabetes_example/assets/functions/federated_analytics_functions.py similarity index 100% rename from examples/substra_core/diabetes_example/assets/functions/federated_analytics_functions.py rename to docs/source/examples/substra_core/diabetes_example/assets/functions/federated_analytics_functions.py diff --git a/examples/substra_core/diabetes_example/assets/functions/local_first_order_computation/Dockerfile b/docs/source/examples/substra_core/diabetes_example/assets/functions/local_first_order_computation/Dockerfile similarity index 100% rename from examples/substra_core/diabetes_example/assets/functions/local_first_order_computation/Dockerfile rename to docs/source/examples/substra_core/diabetes_example/assets/functions/local_first_order_computation/Dockerfile diff --git a/examples/substra_core/diabetes_example/assets/functions/local_second_order_computation/Dockerfile b/docs/source/examples/substra_core/diabetes_example/assets/functions/local_second_order_computation/Dockerfile similarity index 100% rename from examples/substra_core/diabetes_example/assets/functions/local_second_order_computation/Dockerfile rename to docs/source/examples/substra_core/diabetes_example/assets/functions/local_second_order_computation/Dockerfile diff --git a/examples/substra_core/diabetes_example/assets/requirements.txt b/docs/source/examples/substra_core/diabetes_example/assets/requirements.txt similarity index 100% rename from examples/substra_core/diabetes_example/assets/requirements.txt rename to 
docs/source/examples/substra_core/diabetes_example/assets/requirements.txt diff --git a/docs/source/examples/substra_core/diabetes_example/run_diabetes.ipynb b/docs/source/examples/substra_core/diabetes_example/run_diabetes.ipynb new file mode 100644 index 00000000..da7031fa --- /dev/null +++ b/docs/source/examples/substra_core/diabetes_example/run_diabetes.ipynb @@ -0,0 +1,725 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "%matplotlib inline" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Federated Analytics on the diabetes dataset\n", + "\n", + "This example demonstrates how to use the flexibility of the Substra library to do Federated Analytics.\n", + "\n", + "We use the **Diabetes dataset** available from the [Scikit-Learn dataset module](https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset).\n", + "This dataset contains medical information such as Age, Sex or Blood pressure.\n", + "The goal of this example is to compute some analytics such as Age mean, Blood pressure standard deviation or Sex percentage.\n", + "\n", + "We simulate having two different data organisations, and a third organisation which wants to compute aggregated analytics\n", + "without having access to the raw data. The example here runs everything locally; however there is only one parameter to\n", + "change to run it on a real network.\n", + "\n", + "**Caution:**\n", + " This example is provided as an illustrative example only. In real life, you should be careful not to\n", + " accidentally leak private information when doing Federated Analytics. 
For example, if a column contains very similar values,\n", + " sharing its mean and its standard deviation is functionally equivalent to sharing the content of the column.\n", + " It is **strongly recommended** to consider the potential security risks in your use case, and to act accordingly.\n", + " It is possible to use other privacy-preserving techniques, such as\n", + " [Differential Privacy](https://en.wikipedia.org/wiki/Differential_privacy), in addition to Substra.\n", + " Because the focus of this example is Substra capabilities and for the sake of simplicity, such safeguards are not implemented here.\n", + "\n", + "\n", + "To run this example, you need to download and unzip the assets needed to run it in the same directory as this example:\n", + "\n", + "- [assets required to run this example](../../../tmp/diabetes_assets.zip)\n", + "\n", + "Please ensure all the required libraries are installed. A *requirements.txt* file is included in the zip file; you can install them by running `pip install -r requirements.txt`.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Importing all the dependencies" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "import os\n", + "import zipfile\n", + "import pathlib\n", + "\n", + "import substra\n", + "from substra.sdk.schemas import (\n", + " FunctionSpec,\n", + " FunctionInputSpec,\n", + " FunctionOutputSpec,\n", + " AssetKind,\n", + " DataSampleSpec,\n", + " DatasetSpec,\n", + " Permissions,\n", + " TaskSpec,\n", + " ComputeTaskOutputSpec,\n", + " InputRef,\n", + ")\n", + "\n", + "\n", + "from assets.dataset.diabetes_dataset import setup_diabetes" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Instantiating the Substra clients\n", + "\n", + "We work with three different organizations.\n", + "Two organizations provide data, and a third one performs 
Federated Analytics to compute aggregated statistics without\n", + "having access to the raw datasets.\n", + "\n", + "This example runs in local mode, simulating a federated learning experiment." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "# Choose the subprocess mode to locally simulate the FL process\n", + "N_CLIENTS = 3\n", + "clients_list = [substra.Client(client_name=f\"org-{i+1}\") for i in range(N_CLIENTS)]\n", + "clients = {client.organization_info().organization_id: client for client in clients_list}\n", + "\n", + "# Store organization IDs\n", + "ORGS_ID = list(clients)\n", + "\n", + "# The provider of the functions for computing analytics is defined as the first organization.\n", + "ANALYTICS_PROVIDER_ORG_ID = ORGS_ID[0]\n", + "# Data provider orgs are the last two organizations.\n", + "DATA_PROVIDER_ORGS_ID = ORGS_ID[1:]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Creating and registering the assets\n", + "\n", + "Every asset is created according to predefined schemas (Spec) previously imported from\n", + "`substra.sdk.schemas`. To register assets, the [schemas](https://docs.substra.org/en/stable/documentation/references/sdk_schemas.html#schemas)\n", + "are first instantiated and the specs are then registered, which generates the real assets.\n", + "\n", + "Permissions are defined when registering assets. In a nutshell:\n", + "\n", + "- Data cannot be seen once it's registered on the platform.\n", + "- Metadata are visible to all the users of a network.\n", + "- Permissions allow you to execute a function on a certain dataset." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "permissions_local = Permissions(public=False, authorized_ids=DATA_PROVIDER_ORGS_ID)\n", + "permissions_aggregation = Permissions(public=False, authorized_ids=[ANALYTICS_PROVIDER_ORG_ID])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Next, we need to define the asset directory. You should have already downloaded the assets folder as stated above.\n", + "\n", + "The function `setup_diabetes` downloads the *diabetes* dataset if needed, and splits it in two. Each data organisation\n", + "has access to a chunk of the dataset." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "root_dir = pathlib.Path.cwd()\n", + "assets_directory = root_dir / \"assets\"\n", + "assert assets_directory.is_dir(), \"\"\"Did not find the asset directory,\n", + "a directory called 'assets' is expected in the same location as this file\"\"\"\n", + "\n", + "data_path = assets_directory / \"data\"\n", + "data_path.mkdir(exist_ok=True)\n", + "\n", + "setup_diabetes(data_path=data_path)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Registering data samples and dataset\n", + "\n", + "A dataset represents the data in Substra. It contains some metadata and an *opener*, a script used to load the
You can find more details about datasets\n", + "in the [API reference DatasetSpec](https://docs.substra.org/en/stable/documentation/references/sdk_schemas.html#datasetspec).\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "dataset = DatasetSpec(\n", + " name=f\"Diabetes dataset\",\n", + " type=\"csv\",\n", + " data_opener=assets_directory / \"dataset\" / \"diabetes_opener.py\",\n", + " description=data_path / \"description.md\",\n", + " permissions=permissions_local,\n", + " logs_permission=permissions_local,\n", + ")\n", + "\n", + "# We register the dataset for each of the organisations\n", + "dataset_keys = {client_id: clients[client_id].add_dataset(dataset) for client_id in DATA_PROVIDER_ORGS_ID}\n", + "\n", + "for client_id, key in dataset_keys.items():\n", + " print(f\"Dataset key for {client_id}: {key}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The dataset object itself is an empty shell. 
Data samples are needed in order to add actual data.\n", + "A data sample is a folder containing a single data file, such as a CSV, together with the key identifying\n", + "the dataset it is linked to.\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "datasample_keys = {\n", + " org_id: clients[org_id].add_data_sample(\n", + " DataSampleSpec(\n", + " data_manager_keys=[dataset_keys[org_id]],\n", + " test_only=False,\n", + " path=data_path / f\"org_{i + 1}\",\n", + " ),\n", + " local=True,\n", + " )\n", + " for i, org_id in enumerate(DATA_PROVIDER_ORGS_ID)\n", + "}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The data has now been added as an asset through the data samples.\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Adding functions to execute with Substra\n", + "\n", + "A [Substra function](https://docs.substra.org/en/stable/documentation/references/sdk_schemas.html#functionspec)\n", + "specifies the function to apply to a dataset or the function to aggregate models (artifacts).\n", + "Concretely, a function corresponds to an archive (tar or zip file) containing:\n", + "\n", + "- One or more Python scripts that implement the function.\n", + "- A Dockerfile in which the user specifies the required dependencies of the Python scripts.\n", + " This Dockerfile also specifies the function name to execute.\n", + "\n", + "In this example, we will:\n", + "\n", + "1. compute prerequisites for first-moment statistics on each data organization;\n", + "2. aggregate these values on the analytics computation organization to get aggregated statistics;\n", + "3. send these aggregated values to the data organizations, in order to compute second-moment prerequisite values;\n", + "4. finally, aggregate these values to get second-moment aggregated statistics." 
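The four steps above can be sketched in plain Python. This is illustrative only — the function and variable names below are ours, not those of the example's `federated_analytics_functions.py` — but it shows the core idea: data nodes share only counts, sums and squared deviations, never raw values, and the aggregator derives the global mean and standard deviation from them.

```python
import math


# Step 1 (local): each node shares only its count and sum, never raw values.
def local_first_moments(values):
    return {"n": len(values), "sum": sum(values)}


# Step 2 (aggregation): global mean from the shared first moments.
def aggregate_mean(local_states):
    n = sum(s["n"] for s in local_states)
    return {"n": n, "mean": sum(s["sum"] for s in local_states) / n}


# Step 3 (local): squared deviations around the global mean, shared back.
def local_second_moments(values, shared):
    return {"sq_dev": sum((v - shared["mean"]) ** 2 for v in values)}


# Step 4 (aggregation): global (population) standard deviation.
def aggregate_std(local_states, shared):
    return math.sqrt(sum(s["sq_dev"] for s in local_states) / shared["n"])


org_1, org_2 = [1.0, 2.0, 3.0], [4.0, 5.0]
shared = aggregate_mean([local_first_moments(org_1), local_first_moments(org_2)])
std = aggregate_std(
    [local_second_moments(org_1, shared), local_second_moments(org_2, shared)], shared
)
print(shared["mean"], std)  # 3.0 1.4142135623730951
```

The two round trips are exactly what the Substra tasks below orchestrate: the intermediate dictionaries play the role of the model outputs passed between organizations.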
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Local step: computing first order statistic moments\n", + "\n", + "First, we will compute on each data node some aggregated values: number of samples, sum of each numerical column\n", + "(it will be used to compute the mean), and counts for each category for the categorical column (*Sex*).\n", + "\n", + "The computation is implemented in a *Python function* in the `federated_analytics_functions.py` file.\n", + "We also write a `Dockerfile` to define the entrypoint, and we wrap everything in a Substra [FunctionSpec](https://docs.substra.org/en/stable/documentation/references/sdk_schemas.html#functionspec) object.\n", + "\n", + "If you're running this example in a Notebook, you can uncomment and execute the next cell to see what code is executed\n", + "on each data node.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "# %load -s local_first_order_computation assets/functions/federated_analytics_functions.py" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "local_first_order_computation_docker_files = [\n", + " assets_directory / \"functions\" / \"federated_analytics_functions.py\",\n", + " assets_directory / \"functions\" / \"local_first_order_computation\" / \"Dockerfile\",\n", + "]\n", + "\n", + "local_archive_first_order_computation_path = assets_directory / \"functions\" / \"local_first_order_analytics.zip\"\n", + "with zipfile.ZipFile(local_archive_first_order_computation_path, \"w\") as z:\n", + " for filepath in local_first_order_computation_docker_files:\n", + " z.write(filepath, arcname=os.path.basename(filepath))\n", + "\n", + "local_first_order_function_inputs = [\n", + " FunctionInputSpec(\n", + " identifier=\"datasamples\",\n", + " kind=AssetKind.data_sample,\n", + " optional=False,\n", + 
" multiple=True,\n", + " ),\n", + " FunctionInputSpec(identifier=\"opener\", kind=AssetKind.data_manager, optional=False, multiple=False),\n", + "]\n", + "\n", + "local_first_order_function_outputs = [\n", + " FunctionOutputSpec(identifier=\"local_analytics_first_moments\", kind=AssetKind.model, multiple=False)\n", + "]\n", + "\n", + "local_first_order_function = FunctionSpec(\n", + " name=\"Local Federated Analytics - step 1\",\n", + " inputs=local_first_order_function_inputs,\n", + " outputs=local_first_order_function_outputs,\n", + " description=assets_directory / \"functions\" / \"description.md\",\n", + " file=local_archive_first_order_computation_path,\n", + " permissions=permissions_local,\n", + ")\n", + "\n", + "\n", + "local_first_order_function_keys = {\n", + " client_id: clients[client_id].add_function(local_first_order_function) for client_id in DATA_PROVIDER_ORGS_ID\n", + "}\n", + "\n", + "print(f\"Local function key for step 1: computing first order moments {local_first_order_function_keys}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### First aggregation step\n", + "\n", + "In a similar way, we define the [FunctionSpec](https://docs.substra.org/en/stable/documentation/references/sdk_schemas.html#functionspec) for the aggregation node.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "# %load -s aggregation assets/functions/federated_analytics_functions.py" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "aggregate_function_docker_files = [\n", + " assets_directory / \"functions\" / \"federated_analytics_functions.py\",\n", + " assets_directory / \"functions\" / \"aggregation\" / \"Dockerfile\",\n", + "]\n", + "\n", + "aggregate_archive_path = assets_directory / \"functions\" / \"aggregate_function_analytics.zip\"\n", + "with 
zipfile.ZipFile(aggregate_archive_path, \"w\") as z:\n", + " for filepath in aggregate_function_docker_files:\n", + " z.write(filepath, arcname=os.path.basename(filepath))\n", + "\n", + "aggregate_function_inputs = [\n", + " FunctionInputSpec(\n", + " identifier=\"local_analytics_list\",\n", + " kind=AssetKind.model,\n", + " optional=False,\n", + " multiple=True,\n", + " ),\n", + "]\n", + "\n", + "aggregate_function_outputs = [FunctionOutputSpec(identifier=\"shared_states\", kind=AssetKind.model, multiple=False)]\n", + "\n", + "aggregate_function = FunctionSpec(\n", + " name=\"Aggregate Federated Analytics\",\n", + " inputs=aggregate_function_inputs,\n", + " outputs=aggregate_function_outputs,\n", + " description=assets_directory / \"functions\" / \"description.md\",\n", + " file=aggregate_archive_path,\n", + " permissions=permissions_aggregation,\n", + ")\n", + "\n", + "\n", + "aggregate_function_key = clients[ANALYTICS_PROVIDER_ORG_ID].add_function(aggregate_function)\n", + "\n", + "print(f\"Aggregation function key {aggregate_function_key}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Local step: computing second order statistic moments\n", + "\n", + "We also register the function for the second round of computations happening locally on the data nodes.\n", + "\n", + "Both aggregation steps will use the same function, so we don't need to register it again.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "# %load -s local_second_order_computation assets/functions/federated_analytics_functions.py" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "local_second_order_computation_docker_files = [\n", + " assets_directory / \"functions\" / \"federated_analytics_functions.py\",\n", + " assets_directory / \"functions\" / 
\"local_second_order_computation\" / \"Dockerfile\",\n", + "]\n", + "\n", + "local_archive_second_order_computation_path = assets_directory / \"functions\" / \"local_function_analytics.zip\"\n", + "with zipfile.ZipFile(local_archive_second_order_computation_path, \"w\") as z:\n", + " for filepath in local_second_order_computation_docker_files:\n", + " z.write(filepath, arcname=os.path.basename(filepath))\n", + "\n", + "local_second_order_function_inputs = [\n", + " FunctionInputSpec(\n", + " identifier=\"datasamples\",\n", + " kind=AssetKind.data_sample,\n", + " optional=False,\n", + " multiple=True,\n", + " ),\n", + " FunctionInputSpec(identifier=\"opener\", kind=AssetKind.data_manager, optional=False, multiple=False),\n", + " FunctionInputSpec(identifier=\"shared_states\", kind=AssetKind.model, optional=False, multiple=False),\n", + "]\n", + "\n", + "local_second_order_function_outputs = [\n", + " FunctionOutputSpec(\n", + " identifier=\"local_analytics_second_moments\",\n", + " kind=AssetKind.model,\n", + " multiple=False,\n", + " )\n", + "]\n", + "\n", + "local_second_order_function = FunctionSpec(\n", + " name=\"Local Federated Analytics - step 2\",\n", + " inputs=local_second_order_function_inputs,\n", + " outputs=local_second_order_function_outputs,\n", + " description=assets_directory / \"functions\" / \"description.md\",\n", + " file=local_archive_second_order_computation_path,\n", + " permissions=permissions_local,\n", + ")\n", + "\n", + "\n", + "local_second_order_function_keys = {\n", + " client_id: clients[client_id].add_function(local_second_order_function) for client_id in DATA_PROVIDER_ORGS_ID\n", + "}\n", + "\n", + "print(f\"Local function key for step 2: computing second order moments {local_second_order_function_keys}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The data and the functions are now registered." 
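The example's `wait_task` helper polls a task's status in a bare `while` loop, which can spin forever if a task fails. A more defensive variant is sketched below; it is purely illustrative and not part of the Substra SDK — the `get_status` callable and the helper name are our own stand-ins for a real call such as `client.get_task(key).status`:

```python
import time


def wait_until_done(get_status, key, timeout=300.0, poll_interval=1.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll `get_status(key)` until it returns 'done'.

    Raises RuntimeError if the task reports 'failed', and TimeoutError
    once `timeout` seconds have elapsed. `clock` and `sleep` are
    injectable to keep the helper testable (sketch only).
    """
    deadline = clock() + timeout
    while True:
        status = get_status(key)
        if status == "done":
            return status
        if status == "failed":
            raise RuntimeError(f"task {key} failed")
        if clock() >= deadline:
            raise TimeoutError(f"task {key} still {status} after {timeout}s")
        sleep(poll_interval)


# Simulated client: the task finishes on the third poll.
statuses = iter(["waiting", "doing", "done"])
print(wait_until_done(lambda key: next(statuses), "task-key", sleep=lambda s: None))  # done
```

Surfacing the failed state early matters in deployed mode, where tasks are queued and a silent infinite poll would otherwise hide a broken compute plan.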
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Registering tasks in Substra\n", + "\n", + "The next step is to register the actual machine learning tasks." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "from substra.sdk.models import Status\n", + "import time\n", + "\n", + "\n", + "def wait_task(client: substra.Client, key: str):\n", + "    \"\"\"Wait for a task to complete (or fail) before continuing.\n", + "\n", + "    Args:\n", + "        client (substra.Client): client owning the task.\n", + "        key (str): key of the task to wait for.\n", + "    \"\"\"\n", + "    task_status = client.get_task(key).status\n", + "\n", + "    while task_status not in (Status.done, Status.failed):\n", + "        time.sleep(1)\n", + "        task_status = client.get_task(key).status\n", + "\n", + "    client_id = client.organization_info().organization_id\n", + "    print(f\"Status of task {key} on client {client_id}: {task_status}\")\n", + "\n", + "\n", + "data_manager_input = {\n", + "    client_id: [InputRef(identifier=\"opener\", asset_key=key)] for client_id, key in dataset_keys.items()\n", + "}\n", + "\n", + "datasample_inputs = {\n", + "    client_id: [InputRef(identifier=\"datasamples\", asset_key=key)] for client_id, key in datasample_keys.items()\n", + "}\n", + "\n", + "local_task_1_keys = {\n", + "    client_id: clients[client_id].add_task(\n", + "        TaskSpec(\n", + "            function_key=local_first_order_function_keys[client_id],\n", + "            inputs=data_manager_input[client_id] + datasample_inputs[client_id],\n", + "            outputs={\"local_analytics_first_moments\": ComputeTaskOutputSpec(permissions=permissions_aggregation)},\n", + "            worker=client_id,\n", + "        )\n", + "    )\n", + "    for client_id in DATA_PROVIDER_ORGS_ID\n", + "}\n", + "\n", + "for client_id, key in local_task_1_keys.items():\n", + "    wait_task(client=clients[client_id], key=key)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In local mode, the 
registered task is executed at once:\n", + "the registration function returns a value once the task has been executed.\n", + "\n", + "In deployed mode, the registered task is added to a queue and treated asynchronously: this means that the\n", + "code that registers the tasks keeps executing. To wait for a task to be done, create a loop and get the task\n", + "every `n` seconds until its status is done or failed.\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "aggregation_1_inputs = [\n", + " InputRef(\n", + " identifier=\"local_analytics_list\",\n", + " parent_task_key=local_key,\n", + " parent_task_output_identifier=\"local_analytics_first_moments\",\n", + " )\n", + " for local_key in local_task_1_keys.values()\n", + "]\n", + "\n", + "\n", + "aggregation_task_1 = TaskSpec(\n", + " function_key=aggregate_function_key,\n", + " inputs=aggregation_1_inputs,\n", + " outputs={\"shared_states\": ComputeTaskOutputSpec(permissions=permissions_local)},\n", + " worker=ANALYTICS_PROVIDER_ORG_ID,\n", + ")\n", + "\n", + "aggregation_task_1_key = clients[ANALYTICS_PROVIDER_ORG_ID].add_task(aggregation_task_1)\n", + "\n", + "wait_task(client=clients[ANALYTICS_PROVIDER_ORG_ID], key=aggregation_task_1_key)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "shared_inputs = [\n", + " InputRef(\n", + " identifier=\"shared_states\",\n", + " parent_task_key=aggregation_task_1_key,\n", + " parent_task_output_identifier=\"shared_states\",\n", + " )\n", + "]\n", + "\n", + "local_task_2_keys = {\n", + " client_id: clients[client_id].add_task(\n", + " TaskSpec(\n", + " function_key=local_second_order_function_keys[client_id],\n", + " inputs=data_manager_input[client_id] + datasample_inputs[client_id] + shared_inputs,\n", + " outputs={\"local_analytics_second_moments\": 
ComputeTaskOutputSpec(permissions=permissions_aggregation)},\n", + " worker=client_id,\n", + " )\n", + " )\n", + " for client_id in DATA_PROVIDER_ORGS_ID\n", + "}\n", + "\n", + "for client_id, key in local_task_2_keys.items():\n", + " wait_task(client=clients[client_id], key=key)\n", + "\n", + "aggregation_2_inputs = [\n", + " InputRef(\n", + " identifier=\"local_analytics_list\",\n", + " parent_task_key=local_key,\n", + " parent_task_output_identifier=\"local_analytics_second_moments\",\n", + " )\n", + " for local_key in local_task_2_keys.values()\n", + "]\n", + "\n", + "aggregation_task_2 = TaskSpec(\n", + " function_key=aggregate_function_key,\n", + " inputs=aggregation_2_inputs,\n", + " outputs={\"shared_states\": ComputeTaskOutputSpec(permissions=permissions_local)},\n", + " worker=ANALYTICS_PROVIDER_ORG_ID,\n", + ")\n", + "\n", + "aggregation_task_2_key = clients[ANALYTICS_PROVIDER_ORG_ID].add_task(aggregation_task_2)\n", + "\n", + "wait_task(client=clients[ANALYTICS_PROVIDER_ORG_ID], key=aggregation_task_2_key)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Results\n", + "\n", + "Now we can view the results." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "import pickle\n", + "import tempfile\n", + "\n", + "\n", + "with tempfile.TemporaryDirectory() as temp_folder:\n", + " out_model1_file = clients[ANALYTICS_PROVIDER_ORG_ID].download_model_from_task(\n", + " aggregation_task_1_key, folder=temp_folder, identifier=\"shared_states\"\n", + " )\n", + " out1 = pickle.load(out_model1_file.open(\"rb\"))\n", + "\n", + " out_model2_file = clients[ANALYTICS_PROVIDER_ORG_ID].download_model_from_task(\n", + " aggregation_task_2_key, folder=temp_folder, identifier=\"shared_states\"\n", + " )\n", + " out2 = pickle.load(out_model2_file.open(\"rb\"))\n", + "\n", + "print(\n", + " f\"\"\"Age mean: {out1['means']['age']:.2f} years\n", + "Sex percentage:\n", + " Male: {100*out1['counts']['sex']['M']:.2f}%\n", + " Female: {100*out1['counts']['sex']['F']:.2f}%\n", + "Blood pressure std: {out2[\"std\"][\"bp\"]:.2f} mm Hg\n", + "\"\"\"\n", + ")" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.17" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} diff --git a/docs/source/examples/substra_core/index.rst b/docs/source/examples/substra_core/index.rst new file mode 100644 index 00000000..ae6cae90 --- /dev/null +++ b/docs/source/examples/substra_core/index.rst @@ -0,0 +1,17 @@ +Substra examples +================ + +The examples below are compatible with Substra |substra_version|. + + +Examples to get started +^^^^^^^^^^^^^^^^^^^^^^^ + +.. nbgallery:: + ../../../examples/substra_core/titanic_example/run_titanic.ipynb + +Examples to go further +^^^^^^^^^^^^^^^^^^^^^^ + +.. 
nbgallery:: + ../../../examples/substra_core/diabetes_example/run_diabetes.ipynb \ No newline at end of file diff --git a/examples/substra_core/titanic_example/assets/dataset/description.md b/docs/source/examples/substra_core/titanic_example/assets/dataset/description.md similarity index 100% rename from examples/substra_core/titanic_example/assets/dataset/description.md rename to docs/source/examples/substra_core/titanic_example/assets/dataset/description.md diff --git a/examples/substra_core/titanic_example/assets/dataset/titanic_opener.py b/docs/source/examples/substra_core/titanic_example/assets/dataset/titanic_opener.py similarity index 100% rename from examples/substra_core/titanic_example/assets/dataset/titanic_opener.py rename to docs/source/examples/substra_core/titanic_example/assets/dataset/titanic_opener.py diff --git a/examples/substra_core/titanic_example/assets/function_random_forest/description.md b/docs/source/examples/substra_core/titanic_example/assets/function_random_forest/description.md similarity index 100% rename from examples/substra_core/titanic_example/assets/function_random_forest/description.md rename to docs/source/examples/substra_core/titanic_example/assets/function_random_forest/description.md diff --git a/examples/substra_core/titanic_example/assets/function_random_forest/predict/Dockerfile b/docs/source/examples/substra_core/titanic_example/assets/function_random_forest/predict/Dockerfile similarity index 100% rename from examples/substra_core/titanic_example/assets/function_random_forest/predict/Dockerfile rename to docs/source/examples/substra_core/titanic_example/assets/function_random_forest/predict/Dockerfile diff --git a/examples/substra_core/titanic_example/assets/function_random_forest/titanic_function_rf.py b/docs/source/examples/substra_core/titanic_example/assets/function_random_forest/titanic_function_rf.py similarity index 100% rename from 
examples/substra_core/titanic_example/assets/function_random_forest/titanic_function_rf.py rename to docs/source/examples/substra_core/titanic_example/assets/function_random_forest/titanic_function_rf.py diff --git a/examples/substra_core/titanic_example/assets/function_random_forest/train/Dockerfile b/docs/source/examples/substra_core/titanic_example/assets/function_random_forest/train/Dockerfile similarity index 100% rename from examples/substra_core/titanic_example/assets/function_random_forest/train/Dockerfile rename to docs/source/examples/substra_core/titanic_example/assets/function_random_forest/train/Dockerfile diff --git a/examples/substra_core/titanic_example/assets/metric/Dockerfile b/docs/source/examples/substra_core/titanic_example/assets/metric/Dockerfile similarity index 100% rename from examples/substra_core/titanic_example/assets/metric/Dockerfile rename to docs/source/examples/substra_core/titanic_example/assets/metric/Dockerfile diff --git a/examples/substra_core/titanic_example/assets/metric/description.md b/docs/source/examples/substra_core/titanic_example/assets/metric/description.md similarity index 100% rename from examples/substra_core/titanic_example/assets/metric/description.md rename to docs/source/examples/substra_core/titanic_example/assets/metric/description.md diff --git a/examples/substra_core/titanic_example/assets/metric/titanic_metrics.py b/docs/source/examples/substra_core/titanic_example/assets/metric/titanic_metrics.py similarity index 100% rename from examples/substra_core/titanic_example/assets/metric/titanic_metrics.py rename to docs/source/examples/substra_core/titanic_example/assets/metric/titanic_metrics.py diff --git a/examples/substra_core/titanic_example/assets/requirements.txt b/docs/source/examples/substra_core/titanic_example/assets/requirements.txt similarity index 100% rename from examples/substra_core/titanic_example/assets/requirements.txt rename to 
docs/source/examples/substra_core/titanic_example/assets/requirements.txt diff --git a/examples/substra_core/titanic_example/assets/test_data_samples/data_sample_0/data_sample_0.csv b/docs/source/examples/substra_core/titanic_example/assets/test_data_samples/data_sample_0/data_sample_0.csv similarity index 100% rename from examples/substra_core/titanic_example/assets/test_data_samples/data_sample_0/data_sample_0.csv rename to docs/source/examples/substra_core/titanic_example/assets/test_data_samples/data_sample_0/data_sample_0.csv diff --git a/examples/substra_core/titanic_example/assets/test_data_samples/data_sample_1/data_sample_1.csv b/docs/source/examples/substra_core/titanic_example/assets/test_data_samples/data_sample_1/data_sample_1.csv similarity index 100% rename from examples/substra_core/titanic_example/assets/test_data_samples/data_sample_1/data_sample_1.csv rename to docs/source/examples/substra_core/titanic_example/assets/test_data_samples/data_sample_1/data_sample_1.csv diff --git a/examples/substra_core/titanic_example/assets/train_data_samples/data_sample_0/data_sample_0.csv b/docs/source/examples/substra_core/titanic_example/assets/train_data_samples/data_sample_0/data_sample_0.csv similarity index 100% rename from examples/substra_core/titanic_example/assets/train_data_samples/data_sample_0/data_sample_0.csv rename to docs/source/examples/substra_core/titanic_example/assets/train_data_samples/data_sample_0/data_sample_0.csv diff --git a/examples/substra_core/titanic_example/assets/train_data_samples/data_sample_1/data_sample_1.csv b/docs/source/examples/substra_core/titanic_example/assets/train_data_samples/data_sample_1/data_sample_1.csv similarity index 100% rename from examples/substra_core/titanic_example/assets/train_data_samples/data_sample_1/data_sample_1.csv rename to docs/source/examples/substra_core/titanic_example/assets/train_data_samples/data_sample_1/data_sample_1.csv diff --git 
a/examples/substra_core/titanic_example/assets/train_data_samples/data_sample_2/data_sample_2.csv b/docs/source/examples/substra_core/titanic_example/assets/train_data_samples/data_sample_2/data_sample_2.csv similarity index 100% rename from examples/substra_core/titanic_example/assets/train_data_samples/data_sample_2/data_sample_2.csv rename to docs/source/examples/substra_core/titanic_example/assets/train_data_samples/data_sample_2/data_sample_2.csv diff --git a/examples/substra_core/titanic_example/assets/train_data_samples/data_sample_3/data_sample_3.csv b/docs/source/examples/substra_core/titanic_example/assets/train_data_samples/data_sample_3/data_sample_3.csv similarity index 100% rename from examples/substra_core/titanic_example/assets/train_data_samples/data_sample_3/data_sample_3.csv rename to docs/source/examples/substra_core/titanic_example/assets/train_data_samples/data_sample_3/data_sample_3.csv diff --git a/examples/substra_core/titanic_example/assets/train_data_samples/data_sample_4/data_sample_4.csv b/docs/source/examples/substra_core/titanic_example/assets/train_data_samples/data_sample_4/data_sample_4.csv similarity index 100% rename from examples/substra_core/titanic_example/assets/train_data_samples/data_sample_4/data_sample_4.csv rename to docs/source/examples/substra_core/titanic_example/assets/train_data_samples/data_sample_4/data_sample_4.csv diff --git a/examples/substra_core/titanic_example/assets/train_data_samples/data_sample_5/data_sample_5.csv b/docs/source/examples/substra_core/titanic_example/assets/train_data_samples/data_sample_5/data_sample_5.csv similarity index 100% rename from examples/substra_core/titanic_example/assets/train_data_samples/data_sample_5/data_sample_5.csv rename to docs/source/examples/substra_core/titanic_example/assets/train_data_samples/data_sample_5/data_sample_5.csv diff --git a/examples/substra_core/titanic_example/assets/train_data_samples/data_sample_6/data_sample_6.csv 
b/docs/source/examples/substra_core/titanic_example/assets/train_data_samples/data_sample_6/data_sample_6.csv similarity index 100% rename from examples/substra_core/titanic_example/assets/train_data_samples/data_sample_6/data_sample_6.csv rename to docs/source/examples/substra_core/titanic_example/assets/train_data_samples/data_sample_6/data_sample_6.csv diff --git a/examples/substra_core/titanic_example/assets/train_data_samples/data_sample_7/data_sample_7.csv b/docs/source/examples/substra_core/titanic_example/assets/train_data_samples/data_sample_7/data_sample_7.csv similarity index 100% rename from examples/substra_core/titanic_example/assets/train_data_samples/data_sample_7/data_sample_7.csv rename to docs/source/examples/substra_core/titanic_example/assets/train_data_samples/data_sample_7/data_sample_7.csv diff --git a/examples/substra_core/titanic_example/assets/train_data_samples/data_sample_8/data_sample_8.csv b/docs/source/examples/substra_core/titanic_example/assets/train_data_samples/data_sample_8/data_sample_8.csv similarity index 100% rename from examples/substra_core/titanic_example/assets/train_data_samples/data_sample_8/data_sample_8.csv rename to docs/source/examples/substra_core/titanic_example/assets/train_data_samples/data_sample_8/data_sample_8.csv diff --git a/examples/substra_core/titanic_example/assets/train_data_samples/data_sample_9/data_sample_9.csv b/docs/source/examples/substra_core/titanic_example/assets/train_data_samples/data_sample_9/data_sample_9.csv similarity index 100% rename from examples/substra_core/titanic_example/assets/train_data_samples/data_sample_9/data_sample_9.csv rename to docs/source/examples/substra_core/titanic_example/assets/train_data_samples/data_sample_9/data_sample_9.csv diff --git a/docs/source/examples/substra_core/titanic_example/run_titanic.ipynb b/docs/source/examples/substra_core/titanic_example/run_titanic.ipynb new file mode 100644 index 00000000..5e2f4dc7 --- /dev/null +++ 
b/docs/source/examples/substra_core/titanic_example/run_titanic.ipynb @@ -0,0 +1,568 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "%matplotlib inline" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "# Running Substra with a single organization on the Titanic dataset\n", + "\n", + "\n", + "This example is based on [the similarly named Kaggle challenge](https://www.kaggle.com/c/titanic/overview).\n", + "\n", + "In this example, we work on the Titanic tabular dataset. This is a classification problem\n", + "that uses a random forest model.\n", + "\n", + "Here you will learn how to interact with Substra, more specifically:\n", + "\n", + "- instantiating a Substra Client\n", + "- creating and registering assets\n", + "- launching an experiment\n", + "\n", + "\n", + "There is no federated learning in this example; training and testing will happen on a single [Organization](https://docs.substra.org/en/stable/additional/glossary.html#term-Organization).\n", + "\n", + "\n", + "To run this example, you need to download and unzip the assets needed to run it in the same directory as this example:\n", + "\n", + "- [assets required to run this example](../../../tmp/titanic_assets.zip)\n", + "\n", + "Please make sure all the required libraries are installed. 
A *requirements.txt* file is included in the zip file; run `pip install -r requirements.txt` to install them.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Import all the dependencies" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "import os\n", + "import zipfile\n", + "from pathlib import Path\n", + "\n", + "import substra\n", + "from substra.sdk.schemas import (\n", + "    AssetKind,\n", + "    DataSampleSpec,\n", + "    DatasetSpec,\n", + "    FunctionSpec,\n", + "    FunctionInputSpec,\n", + "    FunctionOutputSpec,\n", + "    Permissions,\n", + "    TaskSpec,\n", + "    ComputeTaskOutputSpec,\n", + "    InputRef,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Instantiating the Substra Client\n", + "\n", + "The client allows us to interact with the Substra platform.\n", + "\n", + "By setting the argument `backend_type` to:\n", + "\n", + " - `docker`: all tasks will be executed in Docker containers (default)\n", + " - `subprocess`: all tasks will be executed in Python subprocesses (faster)\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "client = substra.Client(client_name=\"org-1\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Creation and Registration of the assets\n", + "\n", + "Every asset will be created according to predefined schemas (Spec) previously imported from\n", + "substra.sdk.schemas. To register assets, asset [schemas](https://docs.substra.org/en/stable/documentation/references/sdk_schemas.html#schemas)\n", + "are first instantiated and the specs are then registered, which generates the real assets.\n", + "\n", + "Permissions are defined when registering assets. 
In a nutshell:\n", + "\n", + "- Data cannot be seen once it's registered on the platform.\n", + "- Metadata are visible to all the users of a channel.\n", + "- Permissions allow you to execute a function on a certain dataset.\n", + "\n", + "In a remote deployment, setting the parameter `public` to `False` means that the dataset can only be used by tasks in\n", + "the same organization or by organizations that are in the `authorized_ids`. However, these permissions are ignored in local mode.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "permissions = Permissions(public=True, authorized_ids=[])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Next, we need to define the asset directory. You should have already downloaded the assets folder as stated above.\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "root_dir = Path.cwd()\n", + "assets_directory = root_dir / \"assets\"\n", + "assert assets_directory.is_dir(), \"\"\"Did not find the asset directory, a directory called 'assets' is\n", + "expected in the same location as this notebook\"\"\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Registering data samples and dataset\n", + "\n", + "A dataset represents the data in Substra. It is made up of an opener, which is a script used to load the\n", + "data from files into memory. 
You can find more details about datasets\n", + "in the [API reference](https://docs.substra.org/en/stable/documentation/api_reference.html#sdk-reference).\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "dataset = DatasetSpec(\n", + "    name=\"Titanic dataset - Org 1\",\n", + "    type=\"csv\",\n", + "    data_opener=assets_directory / \"dataset\" / \"titanic_opener.py\",\n", + "    description=assets_directory / \"dataset\" / \"description.md\",\n", + "    permissions=permissions,\n", + "    logs_permission=permissions,\n", + ")\n", + "\n", + "dataset_key = client.add_dataset(dataset)\n", + "print(f\"Dataset key {dataset_key}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Adding train data samples\n", + "\n", + "The dataset object itself is an empty shell. Data samples are needed in order to add actual data.\n", + "A data sample is a folder containing a single data file (such as a CSV) and the key identifying\n", + "the dataset it is linked to.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "train_data_sample_folder = assets_directory / \"train_data_samples\"\n", + "train_data_sample_keys = client.add_data_samples(\n", + "    DataSampleSpec(\n", + "        paths=list(train_data_sample_folder.glob(\"*\")),\n", + "        data_manager_keys=[dataset_key],\n", + "    )\n", + ")\n", + "\n", + "print(f\"{len(train_data_sample_keys)} data samples were registered\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Adding test data samples\n", + "\n", + "The same operation is repeated for the test data samples.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "test_data_sample_folder = assets_directory / \"test_data_samples\"\n", + 
"test_data_sample_keys = client.add_data_samples(\n", + "    DataSampleSpec(\n", + "        paths=list(test_data_sample_folder.glob(\"*\")),\n", + "        data_manager_keys=[dataset_key],\n", + "    )\n", + ")\n", + "\n", + "print(f\"{len(test_data_sample_keys)} data samples were registered\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The data has now been added as an asset through the data samples both for the training and\n", + "testing part of our experiment.\n", + "\n", + "### Adding Metrics\n", + "\n", + "A metric corresponds to a function that evaluates the performance of a model on a dataset.\n", + "Concretely, a metric corresponds to an archive (tar or zip file) containing:\n", + "\n", + "- Python scripts that implement the metric computation\n", + "- a Dockerfile in which the user can specify the required dependencies of the Python scripts\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "inputs_metrics = [\n", + "    FunctionInputSpec(identifier=\"datasamples\", kind=AssetKind.data_sample, optional=False, multiple=True),\n", + "    FunctionInputSpec(identifier=\"opener\", kind=AssetKind.data_manager, optional=False, multiple=False),\n", + "    FunctionInputSpec(identifier=\"predictions\", kind=AssetKind.model, optional=False, multiple=False),\n", + "]\n", + "\n", + "outputs_metrics = [FunctionOutputSpec(identifier=\"performance\", kind=AssetKind.performance, multiple=False)]\n", + "\n", + "\n", + "METRICS_DOCKERFILE_FILES = [\n", + "    assets_directory / \"metric\" / \"titanic_metrics.py\",\n", + "    assets_directory / \"metric\" / \"Dockerfile\",\n", + "]\n", + "\n", + "metric_archive_path = assets_directory / \"metric\" / \"metrics.zip\"\n", + "\n", + "with zipfile.ZipFile(metric_archive_path, \"w\") as z:\n", + "    for filepath in METRICS_DOCKERFILE_FILES:\n", + "        z.write(filepath, arcname=os.path.basename(filepath))\n", + "\n", + "metric_function = 
FunctionSpec(\n", + " inputs=inputs_metrics,\n", + " outputs=outputs_metrics,\n", + " name=\"Testing with Accuracy metric\",\n", + " description=assets_directory / \"metric\" / \"description.md\",\n", + " file=metric_archive_path,\n", + " permissions=permissions,\n", + ")\n", + "\n", + "metric_key = client.add_function(metric_function)\n", + "\n", + "print(f\"Metric key {metric_key}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Adding Function\n", + "\n", + "A [Function](https://docs.substra.org/en/stable/documentation/concepts.html#function) specifies the method to train a model on a dataset or the method to aggregate models.\n", + "Concretely, a function corresponds to an archive (tar or zip file) containing:\n", + "\n", + "- One or more Python scripts that implement the function. It is required to define `train` and `predict` functions.\n", + "- A Dockerfile in which the user can specify the required dependencies of the Python scripts.\n", + " This Dockerfile also specifies the method name to execute (either `train` or `predict` here).\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "ALGO_TRAIN_DOCKERFILE_FILES = [\n", + " assets_directory / \"function_random_forest/titanic_function_rf.py\",\n", + " assets_directory / \"function_random_forest/train/Dockerfile\",\n", + "]\n", + "\n", + "train_archive_path = assets_directory / \"function_random_forest\" / \"function_random_forest.zip\"\n", + "with zipfile.ZipFile(train_archive_path, \"w\") as z:\n", + " for filepath in ALGO_TRAIN_DOCKERFILE_FILES:\n", + " z.write(filepath, arcname=os.path.basename(filepath))\n", + "\n", + "train_function_inputs = [\n", + " FunctionInputSpec(identifier=\"datasamples\", kind=AssetKind.data_sample, optional=False, multiple=True),\n", + " FunctionInputSpec(identifier=\"opener\", kind=AssetKind.data_manager, optional=False, multiple=False),\n", + 
"]\n", + "\n", + "train_function_outputs = [FunctionOutputSpec(identifier=\"model\", kind=AssetKind.model, multiple=False)]\n", + "\n", + "train_function = FunctionSpec(\n", + " name=\"Training with Random Forest\",\n", + " inputs=train_function_inputs,\n", + " outputs=train_function_outputs,\n", + " description=assets_directory / \"function_random_forest\" / \"description.md\",\n", + " file=train_archive_path,\n", + " permissions=permissions,\n", + ")\n", + "\n", + "\n", + "train_function_key = client.add_function(train_function)\n", + "\n", + "print(f\"Train function key {train_function_key}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The predict function uses the same Python file as the function used for training.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "ALGO_PREDICT_DOCKERFILE_FILES = [\n", + " assets_directory / \"function_random_forest/titanic_function_rf.py\",\n", + " assets_directory / \"function_random_forest/predict/Dockerfile\",\n", + "]\n", + "\n", + "predict_archive_path = assets_directory / \"function_random_forest\" / \"function_random_forest.zip\"\n", + "with zipfile.ZipFile(predict_archive_path, \"w\") as z:\n", + " for filepath in ALGO_PREDICT_DOCKERFILE_FILES:\n", + " z.write(filepath, arcname=os.path.basename(filepath))\n", + "\n", + "predict_function_inputs = [\n", + " FunctionInputSpec(identifier=\"datasamples\", kind=AssetKind.data_sample, optional=False, multiple=True),\n", + " FunctionInputSpec(identifier=\"opener\", kind=AssetKind.data_manager, optional=False, multiple=False),\n", + " FunctionInputSpec(identifier=\"models\", kind=AssetKind.model, optional=False, multiple=False),\n", + "]\n", + "\n", + "predict_function_outputs = [FunctionOutputSpec(identifier=\"predictions\", kind=AssetKind.model, multiple=False)]\n", + "\n", + "predict_function_spec = FunctionSpec(\n", + " name=\"Predicting with 
Random Forest\",\n", + " inputs=predict_function_inputs,\n", + " outputs=predict_function_outputs,\n", + " description=assets_directory / \"function_random_forest\" / \"description.md\",\n", + " file=predict_archive_path,\n", + " permissions=permissions,\n", + ")\n", + "\n", + "predict_function_key = client.add_function(predict_function_spec)\n", + "\n", + "print(f\"Predict function key {predict_function_key}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The data, the functions and the metric are now registered.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Registering tasks\n", + "\n", + "The next step is to register the actual machine learning tasks.\n", + "First a training task is registered which will produce a machine learning model.\n", + "Then a testing task is registered to test the trained model." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "data_manager_input = [InputRef(identifier=\"opener\", asset_key=dataset_key)]\n", + "train_data_sample_inputs = [InputRef(identifier=\"datasamples\", asset_key=key) for key in train_data_sample_keys]\n", + "test_data_sample_inputs = [InputRef(identifier=\"datasamples\", asset_key=key) for key in test_data_sample_keys]\n", + "\n", + "train_task = TaskSpec(\n", + " function_key=train_function_key,\n", + " inputs=data_manager_input + train_data_sample_inputs,\n", + " outputs={\"model\": ComputeTaskOutputSpec(permissions=permissions)},\n", + " worker=client.organization_info().organization_id,\n", + ")\n", + "\n", + "train_task_key = client.add_task(train_task)\n", + "\n", + "print(f\"Train task key {train_task_key}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In local mode, the registered task is executed at once:\n", + "the registration function returns a value once the task has been executed.\n", + "\n", + "In deployed 
mode, the registered task is added to a queue and treated asynchronously: this means that the\n", + "code that registers the tasks keeps executing. To wait for a task to be done, create a loop and get the task\n", + "every ``n`` seconds until its status is done or failed.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "model_input = [\n", + " InputRef(\n", + " identifier=\"models\",\n", + " parent_task_key=train_task_key,\n", + " parent_task_output_identifier=\"model\",\n", + " )\n", + "]\n", + "\n", + "predict_task = TaskSpec(\n", + " function_key=predict_function_key,\n", + " inputs=data_manager_input + test_data_sample_inputs + model_input,\n", + " outputs={\"predictions\": ComputeTaskOutputSpec(permissions=permissions)},\n", + " worker=client.organization_info().organization_id,\n", + ")\n", + "\n", + "predict_task_key = client.add_task(predict_task)\n", + "\n", + "predictions_input = [\n", + " InputRef(\n", + " identifier=\"predictions\",\n", + " parent_task_key=predict_task_key,\n", + " parent_task_output_identifier=\"predictions\",\n", + " )\n", + "]\n", + "\n", + "test_task = TaskSpec(\n", + " function_key=metric_key,\n", + " inputs=data_manager_input + test_data_sample_inputs + predictions_input,\n", + " outputs={\"performance\": ComputeTaskOutputSpec(permissions=permissions)},\n", + " worker=client.organization_info().organization_id,\n", + ")\n", + "\n", + "test_task_key = client.add_task(test_task)\n", + "\n", + "print(f\"Test task key {test_task_key}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Results\n", + "\n", + "Now we can view the results" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "from substra.sdk.models import Status\n", + "import time\n", + "\n", + "test_task = client.get_task(test_task_key)\n", + "while 
test_task.status not in (Status.done, Status.failed):\n", + " time.sleep(1)\n", + " test_task = client.get_task(test_task_key)\n", + "\n", + "print(f\"Test task status: {test_task.status}\")\n", + "\n", + "performance = client.get_task_output_asset(test_task.key, identifier=\"performance\")\n", + "print(\"Metric: \", test_task.function.name)\n", + "print(\"Performance on the metric: \", performance.asset)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.17" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} diff --git a/docs/source/examples/substrafl/get_started/run_mnist_torch.ipynb b/docs/source/examples/substrafl/get_started/run_mnist_torch.ipynb new file mode 100644 index 00000000..deef11e3 --- /dev/null +++ b/docs/source/examples/substrafl/get_started/run_mnist_torch.ipynb @@ -0,0 +1,797 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "%matplotlib inline" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Using Torch FedAvg on MNIST dataset\n", + "\n", + "This example illustrates the basic usage of SubstraFL, demonstrating Federated Learning with the Federated Averaging strategy\n", + "on the [MNIST Dataset of handwritten digits](http://yann.lecun.com/exdb/mnist/) using PyTorch.\n", + "In this example, we work on 28x28 pixel grayscale images. 
This is a classification problem\n", + "aiming to recognize the number written on each image.\n", + "\n", + "SubstraFL can be used with any machine learning framework (PyTorch, TensorFlow, scikit-learn, etc.).\n", + "\n", + "However, a specific interface has been developed for PyTorch which makes writing PyTorch code simpler than with other frameworks. This example uses that PyTorch interface.\n", + "\n", + "This example does not use a deployed platform of Substra and runs in local mode.\n", + "\n", + "To run this example, you need to download and unzip the assets needed to run it in the same directory as this example:\n", + "\n", + "- [assets required to run this example](../../../tmp/torch_fedavg_assets.zip)\n", + "\n", + "Please make sure all the libraries are installed. A *requirements.txt* file is included in the zip file; run `pip install -r requirements.txt` to install them.\n", + "\n", + "**Substra** and **SubstraFL** should already be installed. If not, follow the instructions described [here](https://docs.substra.org/en/stable/substrafl_doc/substrafl_overview.html#installation).\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Setup\n", + "\n", + "This example runs with three organizations. 
Two organizations provide datasets, while a third\n", + "one provides the algorithm.\n", + "\n", + "In the following code cell, we define the different organizations needed for our FL experiment.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "from substra import Client\n", + "\n", + "N_CLIENTS = 3\n", + "\n", + "client_0 = Client(client_name=\"org-1\")\n", + "client_1 = Client(client_name=\"org-2\")\n", + "client_2 = Client(client_name=\"org-3\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Every computation will run in ``subprocess`` mode, where everything runs locally in Python\n", + "subprocesses.\n", + "Other backend_types are:\n", + "\n", + "- ``docker`` mode where computations run locally in docker containers\n", + "- ``remote`` mode where computations run remotely (you need to have a deployed platform for that)\n", + "\n", + "To run in remote mode, use the following syntax:\n", + "\n", + "```py\n", + "client_remote = Client(backend_type=\"remote\", url=\"MY_BACKEND_URL\", username=\"my-username\", password=\"my-password\")\n", + "```\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "# Create a dictionary to easily access each client from its human-friendly id\n", + "clients = {\n", + " client_0.organization_info().organization_id: client_0,\n", + " client_1.organization_info().organization_id: client_1,\n", + " client_2.organization_info().organization_id: client_2,\n", + "}\n", + "\n", + "# Store organization IDs\n", + "ORGS_ID = list(clients)\n", + "ALGO_ORG_ID = ORGS_ID[0] # Algo provider is defined as the first organization.\n", + "DATA_PROVIDER_ORGS_ID = ORGS_ID[1:] # Data providers orgs are the two last organizations." 
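The cell above hardcodes three local clients. As a sketch of how the backend choice could be made configurable, the helper below builds the `Client` keyword arguments from environment variables, defaulting to the local ``subprocess`` mode used throughout this example. The helper and the environment variable names are hypothetical, not part of the Substra API:

```python
import os


# Hypothetical helper (not part of the Substra API): build the keyword
# arguments for substra.Client from environment variables, falling back to
# the local "subprocess" mode used throughout this example.
def client_kwargs(org_name: str) -> dict:
    backend = os.environ.get("SUBSTRA_BACKEND_TYPE", "subprocess")
    kwargs = {"client_name": org_name, "backend_type": backend}
    if backend == "remote":
        # Credentials are only needed when talking to a deployed platform.
        kwargs.update(
            url=os.environ["SUBSTRA_URL"],
            username=os.environ["SUBSTRA_USERNAME"],
            password=os.environ["SUBSTRA_PASSWORD"],
        )
    return kwargs


print(client_kwargs("org-1"))
```

With such a helper, the three clients above could be created as `Client(**client_kwargs(f"org-{i + 1}"))` in a loop, and the same notebook could target a deployed platform by exporting the environment variables.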
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Data and metrics\n", + "\n", + "### Data preparation\n", + "\n", + "This section downloads (if needed) the **MNIST dataset** using the [torchvision library](https://pytorch.org/vision/stable/index.html).\n", + "It extracts the images from the raw files and locally creates a folder for each\n", + "organization.\n", + "\n", + "Each organization will have access to half the training data and half the test data (which\n", + "corresponds to **30,000**\n", + "images for training and **5,000** for testing each).\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "import pathlib\n", + "from torch_fedavg_assets.dataset.mnist_dataset import setup_mnist\n", + "\n", + "\n", + "# Create the temporary directory for generated data\n", + "(pathlib.Path.cwd() / \"tmp\").mkdir(exist_ok=True)\n", + "data_path = pathlib.Path.cwd() / \"tmp\" / \"data_mnist\"\n", + "\n", + "setup_mnist(data_path, len(DATA_PROVIDER_ORGS_ID))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Dataset registration\n", + "\n", + "A [Dataset](https://docs.substra.org/en/stable/documentation/concepts.html#dataset) is composed of an **opener**, which is a Python script that can load\n", + "the data from the files in memory and a description markdown file.\n", + "The [Dataset](https://docs.substra.org/en/stable/documentation/concepts.html#dataset) object itself does not contain the data. The proper asset that contains the\n", + "data is the **datasample asset**.\n", + "\n", + "A **datasample** contains a local path to the data. A datasample can be linked to a dataset in order to add data to a\n", + "dataset.\n", + "\n", + "Data privacy is a key concept for Federated Learning experiments. 
That is why we set\n", + "[Permissions](https://docs.substra.org/en/stable/documentation/concepts.html#permissions) for an [Asset](https://docs.substra.org/en/stable/documentation/concepts.html#permissions) to determine how each organization\n", + "can access a specific asset.\n", + "\n", + "Note that metadata such as the assets' creation date and the asset owner are visible to all the organizations of a\n", + "network.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "from substra.sdk.schemas import DatasetSpec\n", + "from substra.sdk.schemas import Permissions\n", + "from substra.sdk.schemas import DataSampleSpec\n", + "\n", + "assets_directory = pathlib.Path.cwd() / \"torch_fedavg_assets\"\n", + "dataset_keys = {}\n", + "train_datasample_keys = {}\n", + "test_datasample_keys = {}\n", + "\n", + "for i, org_id in enumerate(DATA_PROVIDER_ORGS_ID):\n", + " client = clients[org_id]\n", + "\n", + " permissions_dataset = Permissions(public=False, authorized_ids=[ALGO_ORG_ID])\n", + "\n", + " # DatasetSpec is the specification of a dataset. 
It makes sure every field\n", + " # is well-defined, and that our dataset is ready to be registered.\n", + " # The real dataset object is created in the add_dataset method.\n", + "\n", + " dataset = DatasetSpec(\n", + " name=\"MNIST\",\n", + " type=\"npy\",\n", + " data_opener=assets_directory / \"dataset\" / \"mnist_opener.py\",\n", + " description=assets_directory / \"dataset\" / \"description.md\",\n", + " permissions=permissions_dataset,\n", + " logs_permission=permissions_dataset,\n", + " )\n", + " dataset_keys[org_id] = client.add_dataset(dataset)\n", + " assert dataset_keys[org_id], \"Missing dataset key\"\n", + "\n", + " # Add the training data on each organization.\n", + " data_sample = DataSampleSpec(\n", + " data_manager_keys=[dataset_keys[org_id]],\n", + " path=data_path / f\"org_{i+1}\" / \"train\",\n", + " )\n", + " train_datasample_keys[org_id] = client.add_data_sample(data_sample)\n", + "\n", + " # Add the testing data on each organization.\n", + " data_sample = DataSampleSpec(\n", + " data_manager_keys=[dataset_keys[org_id]],\n", + " path=data_path / f\"org_{i+1}\" / \"test\",\n", + " )\n", + " test_datasample_keys[org_id] = client.add_data_sample(data_sample)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Metrics definition\n", + "\n", + "A metric is a function used to evaluate the performance of your model on one or several\n", + "**datasamples**.\n", + "\n", + "To add a metric, you need to define a function that computes and returns a performance\n", + "from the datasamples (as returned by the opener) and the predictions_path (to be loaded within the function).\n", + "\n", + "When using a Torch SubstraFL algorithm, the predictions are saved in the `predict` function in numpy format\n", + "so that you can simply load them using `np.load`.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "from sklearn.metrics import 
accuracy_score\n", + "from sklearn.metrics import roc_auc_score\n", + "import numpy as np\n", + "\n", + "\n", + "def accuracy(datasamples, predictions_path):\n", + " y_true = datasamples[\"labels\"]\n", + " y_pred = np.load(predictions_path)\n", + "\n", + " return accuracy_score(y_true, np.argmax(y_pred, axis=1))\n", + "\n", + "\n", + "def roc_auc(datasamples, predictions_path):\n", + " y_true = datasamples[\"labels\"]\n", + " y_pred = np.load(predictions_path)\n", + "\n", + " n_class = np.max(y_true) + 1\n", + " y_true_one_hot = np.eye(n_class)[y_true]\n", + "\n", + " return roc_auc_score(y_true_one_hot, y_pred)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Machine learning components definition\n", + "\n", + "This section uses the PyTorch based SubstraFL API to simplify the definition of machine learning components.\n", + "However, SubstraFL is compatible with any machine learning framework.\n", + "\n", + "\n", + "In this section, you will:\n", + "\n", + "- Register a model and its dependencies\n", + "- Specify the federated learning strategy\n", + "- Specify the training and aggregation nodes\n", + "- Specify the test nodes\n", + "- Actually run the computations\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Model definition\n", + "\n", + "We choose to use a classic torch CNN as the model to train. 
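To make the metric contract concrete, here is a small self-contained check of the `accuracy` logic above on a toy array. The in-memory `y_pred` array stands in for the `.npy` file normally read from `predictions_path`, and its values are made up for illustration:

```python
import numpy as np

# Toy stand-in for the predictions normally saved to predictions_path:
# one row of class scores per sample, for a 3-class problem.
y_true = np.array([0, 1, 2, 1])
y_pred = np.array(
    [
        [0.9, 0.05, 0.05],  # argmax 0 -> correct
        [0.1, 0.80, 0.10],  # argmax 1 -> correct
        [0.2, 0.20, 0.60],  # argmax 2 -> correct
        [0.7, 0.20, 0.10],  # argmax 0 -> wrong (true label is 1)
    ]
)

# Same computation as accuracy_score(y_true, np.argmax(y_pred, axis=1))
toy_accuracy = float((np.argmax(y_pred, axis=1) == y_true).mean())
print(toy_accuracy)  # 0.75
```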
The model architecture is defined by the user\n", + "independently of SubstraFL.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "import torch\n", + "from torch import nn\n", + "import torch.nn.functional as F\n", + "\n", + "seed = 42\n", + "torch.manual_seed(seed)\n", + "\n", + "\n", + "class CNN(nn.Module):\n", + " def __init__(self):\n", + " super(CNN, self).__init__()\n", + " self.conv1 = nn.Conv2d(1, 32, kernel_size=5)\n", + " self.conv2 = nn.Conv2d(32, 32, kernel_size=5)\n", + " self.conv3 = nn.Conv2d(32, 64, kernel_size=5)\n", + " self.fc1 = nn.Linear(3 * 3 * 64, 256)\n", + " self.fc2 = nn.Linear(256, 10)\n", + "\n", + " def forward(self, x, eval=False):\n", + " x = F.relu(self.conv1(x))\n", + " x = F.relu(F.max_pool2d(self.conv2(x), 2))\n", + " x = F.dropout(x, p=0.5, training=not eval)\n", + " x = F.relu(F.max_pool2d(self.conv3(x), 2))\n", + " x = F.dropout(x, p=0.5, training=not eval)\n", + " x = x.view(-1, 3 * 3 * 64)\n", + " x = F.relu(self.fc1(x))\n", + " x = F.dropout(x, p=0.5, training=not eval)\n", + " x = self.fc2(x)\n", + " return F.log_softmax(x, dim=1)\n", + "\n", + "\n", + "model = CNN()\n", + "optimizer = torch.optim.Adam(model.parameters(), lr=0.001)\n", + "criterion = torch.nn.CrossEntropyLoss()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Specifying on how much data to train\n", + "\n", + "To specify on how much data to train at each round, we use the `index_generator` object.\n", + "We specify the batch size and the number of batches (named `num_updates`) to consider for each round.\n", + "See [Index Generator](https://docs.substra.org/en/stable/substrafl_doc/substrafl_overview.html#index-generator) for more details.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "from substrafl.index_generator import 
NpIndexGenerator\n", + "\n", + "# Number of model updates between each FL strategy aggregation.\n", + "NUM_UPDATES = 100\n", + "\n", + "# Number of samples per update.\n", + "BATCH_SIZE = 32\n", + "\n", + "index_generator = NpIndexGenerator(\n", + " batch_size=BATCH_SIZE,\n", + " num_updates=NUM_UPDATES,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Torch Dataset definition\n", + "\n", + "This torch Dataset is used to preprocess the data using the `__getitem__` function.\n", + "\n", + "This torch Dataset needs to have a specific `__init__` signature, that must contain (self, datasamples, is_inference).\n", + "\n", + "The `__getitem__` function is expected to return (inputs, outputs) if `is_inference` is `False`, else only the inputs.\n", + "This behavior can be changed by re-writing the `_local_train` or `predict` methods.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "class TorchDataset(torch.utils.data.Dataset):\n", + " def __init__(self, datasamples, is_inference: bool):\n", + " self.x = datasamples[\"images\"]\n", + " self.y = datasamples[\"labels\"]\n", + " self.is_inference = is_inference\n", + "\n", + " def __getitem__(self, idx):\n", + " if self.is_inference:\n", + " x = torch.FloatTensor(self.x[idx][None, ...]) / 255\n", + " return x\n", + "\n", + " else:\n", + " x = torch.FloatTensor(self.x[idx][None, ...]) / 255\n", + "\n", + " y = torch.tensor(self.y[idx]).type(torch.int64)\n", + " y = F.one_hot(y, 10)\n", + " y = y.type(torch.float32)\n", + "\n", + " return x, y\n", + "\n", + " def __len__(self):\n", + " return len(self.x)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### SubstraFL algo definition\n", + "\n", + "A SubstraFL Algo gathers all the defined elements that run locally in each organization.\n", + "This is the only SubstraFL object that is framework specific (here PyTorch 
specific).\n", + "\n", + "The `TorchDataset` is passed **as a class** to the\n", + "[Torch Algorithms](https://docs.substra.org/en/stable/substrafl_doc/api/algorithms.html#torch-algorithms).\n", + "Indeed, this `TorchDataset` will be instantiated directly on the data provider organization.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "from substrafl.algorithms.pytorch import TorchFedAvgAlgo\n", + "\n", + "\n", + "class TorchCNN(TorchFedAvgAlgo):\n", + " def __init__(self):\n", + " super().__init__(\n", + " model=model,\n", + " criterion=criterion,\n", + " optimizer=optimizer,\n", + " index_generator=index_generator,\n", + " dataset=TorchDataset,\n", + " seed=seed,\n", + " )" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Federated Learning strategies\n", + "\n", + "An FL strategy specifies how to train a model on distributed data.\n", + "The most well-known strategy is Federated Averaging: train a model locally on every organization,\n", + "then aggregate the weight updates from every organization, and then apply the averaged\n", + "updates locally at each organization.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "from substrafl.strategies import FedAvg\n", + "\n", + "strategy = FedAvg(algo=TorchCNN())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Where to train and where to aggregate\n", + "\n", + "We specify on which data we want to train our model, using the [TrainDataNode](https://docs.substra.org/en/stable/substrafl_doc/api/nodes.html#traindatanode) object.\n", + "Here we train on the two datasets that we have registered earlier.\n", + "\n", + "The [AggregationNode](https://docs.substra.org/en/stable/substrafl_doc/api/nodes.html#aggregationnode) specifies the organization on which 
the aggregation operation\n", + "will be computed.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "from substrafl.nodes import TrainDataNode\n", + "from substrafl.nodes import AggregationNode\n", + "\n", + "\n", + "aggregation_node = AggregationNode(ALGO_ORG_ID)\n", + "\n", + "# Create the Train Data Nodes (or training tasks) and save them in a list\n", + "train_data_nodes = [\n", + " TrainDataNode(\n", + " organization_id=org_id,\n", + " data_manager_key=dataset_keys[org_id],\n", + " data_sample_keys=[train_datasample_keys[org_id]],\n", + " )\n", + " for org_id in DATA_PROVIDER_ORGS_ID\n", + "]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Where and when to test\n", + "\n", + "With the same logic as the train nodes, we create [TestDataNode](https://docs.substra.org/en/stable/substrafl_doc/api/nodes.html#testdatanode) to specify on which\n", + "data we want to test our model.\n", + "\n", + "The [Evaluation Strategy](https://docs.substra.org/en/stable/substrafl_doc/api/evaluation_strategy.html) defines where and at which frequency we\n", + "evaluate the model, using the given metric(s) that you registered in a previous section.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "from substrafl.nodes import TestDataNode\n", + "from substrafl.evaluation_strategy import EvaluationStrategy\n", + "\n", + "# Create the Test Data Nodes (or testing tasks) and save them in a list\n", + "test_data_nodes = [\n", + " TestDataNode(\n", + " organization_id=org_id,\n", + " data_manager_key=dataset_keys[org_id],\n", + " test_data_sample_keys=[test_datasample_keys[org_id]],\n", + " metric_functions={\"Accuracy\": accuracy, \"ROC AUC\": roc_auc},\n", + " )\n", + " for org_id in DATA_PROVIDER_ORGS_ID\n", + "]\n", + "\n", + "\n", + "# Test at the end of 
every round\n", + "my_eval_strategy = EvaluationStrategy(test_data_nodes=test_data_nodes, eval_frequency=1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Running the experiment\n", + "\n", + "As a last step before launching our experiment, we need to specify the third-party dependencies required to run it.\n", + "The [Dependency](https://docs.substra.org/en/stable/substrafl_doc/api/dependency.html) object is instantiated in order to install the right libraries in\n", + "the Python environment of each organization.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "from substrafl.dependency import Dependency\n", + "\n", + "dependencies = Dependency(pypi_dependencies=[\"numpy==1.23.1\", \"torch==1.11.0\", \"scikit-learn==1.1.1\"])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We now have all the necessary objects to launch our experiment. 
Please see a summary below of all the objects we created so far:\n", + "\n", + "- A [Client](https://docs.substra.org/en/stable/documentation/references/sdk.html#client) to add or retrieve the assets of our experiment, using their keys to\n", + " identify them.\n", + "- A [Torch algorithm](https://docs.substra.org/en/stable/substrafl_doc/api/algorithms.html#torch-algorithms) to define the training parameters *(optimizer, train\n", + " function, predict function, etc.)*.\n", + "- A [Strategy](https://docs.substra.org/en/stable/substrafl_doc/api/strategies.html#strategies) to specify how to train the model on\n", + " distributed data.\n", + "- [Train data nodes](https://docs.substra.org/en/stable/substrafl_doc/api/nodes.html#traindatanode) to indicate on which data to train.\n", + "- An [Evaluation Strategy](https://docs.substra.org/en/stable/substrafl_doc/api/evaluation_strategy.html#evaluation-strategy), to define where and at which frequency we\n", + " evaluate the model.\n", + "- An [Aggregation Node](https://docs.substra.org/en/stable/substrafl_doc/api/nodes.html#aggregationnode), to specify the organization on which the aggregation operation\n", + " will be computed.\n", + "- The **number of rounds**, a round being defined by a local training step followed by an aggregation operation.\n", + "- An **experiment folder** to save a summary of the operations performed.\n", + "- The [Dependency](https://docs.substra.org/en/stable/substrafl_doc/api/dependency.html) to define the libraries on which the experiment needs to run.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "from substrafl.experiment import execute_experiment\n", + "\n", + "# A round is defined by a local training step followed by an aggregation operation\n", + "NUM_ROUNDS = 3\n", + "\n", + "compute_plan = execute_experiment(\n", + " client=clients[ALGO_ORG_ID],\n", + " strategy=strategy,\n", + " 
train_data_nodes=train_data_nodes,\n", + " evaluation_strategy=my_eval_strategy,\n", + " aggregation_node=aggregation_node,\n", + " num_rounds=NUM_ROUNDS,\n", + " experiment_folder=str(pathlib.Path.cwd() / \"tmp\" / \"experiment_summaries\"),\n", + " dependencies=dependencies,\n", + " clean_models=False,\n", + " name=\"MNIST documentation example\",\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The compute plan created is composed of 29 tasks:\n", + "\n", + "* For each local training step, we create 3 tasks per organization: training + prediction + evaluation -> 3 tasks.\n", + "* We are training on 2 data organizations; for each round, we have 3 * 2 local tasks + 1 aggregation task -> 7 tasks.\n", + "* We are training for 3 rounds: 3 * 7 -> 21 tasks.\n", + "* Before the first local training step, there is an initialization step on each data organization: 21 + 2 -> 23 tasks.\n", + "* After the last aggregation step, there are three more tasks: applying the last updates from the aggregator + prediction + evaluation, on both organizations: 23 + 2 * 3 -> 29 tasks.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Explore the results" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "# The results will be available once the compute plan is completed\n", + "client_0.wait_compute_plan(compute_plan.key)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### List results" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "import pandas as pd\n", + "\n", + "performances_df = pd.DataFrame(client_0.get_performances(compute_plan.key).dict())\n", + "print(\"\\nPerformance Table: \\n\")\n", + "print(performances_df[[\"worker\", \"round_idx\", \"identifier\", \"performance\"]])" + ] + }, + { + "cell_type": 
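The task-count breakdown above can be sanity-checked with a few lines of arithmetic. This is just a sketch that re-derives the figure of 29 for this example's setup of 2 data organizations and 3 rounds:

```python
# Re-derive the compute plan size quoted above for this example's setup.
n_data_orgs = 2
n_rounds = 3

tasks_per_round = 3 * n_data_orgs + 1  # (train + predict + eval) per org, plus 1 aggregation
init_tasks = n_data_orgs               # one initialization task per data organization
final_tasks = 3 * n_data_orgs          # apply last update + predict + eval, per org

total_tasks = n_rounds * tasks_per_round + init_tasks + final_tasks
print(total_tasks)  # 29
```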
"markdown", + "metadata": {}, + "source": [ + "### Plot results" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "import matplotlib.pyplot as plt\n", + "\n", + "fig, axs = plt.subplots(1, 2, figsize=(12, 6))\n", + "fig.suptitle(\"Test dataset results\")\n", + "\n", + "axs[0].set_title(\"Accuracy\")\n", + "axs[1].set_title(\"ROC AUC\")\n", + "\n", + "for ax in axs.flat:\n", + " ax.set(xlabel=\"Rounds\", ylabel=\"Score\")\n", + "\n", + "\n", + "for org_id in DATA_PROVIDER_ORGS_ID:\n", + " org_df = performances_df[performances_df[\"worker\"] == org_id]\n", + " acc_df = org_df[org_df[\"identifier\"] == \"Accuracy\"]\n", + " axs[0].plot(acc_df[\"round_idx\"], acc_df[\"performance\"], label=org_id)\n", + "\n", + " auc_df = org_df[org_df[\"identifier\"] == \"ROC AUC\"]\n", + " axs[1].plot(auc_df[\"round_idx\"], auc_df[\"performance\"], label=org_id)\n", + "\n", + "plt.legend(loc=\"lower right\")\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Download a model\n", + "\n", + "After the experiment, you might be interested in downloading your trained model.\n", + "To do so, you will need the source code in order to reload your code architecture in memory.\n", + "You have the option to choose the client and the round you are interested in downloading.\n", + "\n", + "If `round_idx` is set to `None`, the last round will be selected by default.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "from substrafl.model_loading import download_algo_state\n", + "\n", + "client_to_download_from = DATA_PROVIDER_ORGS_ID[0]\n", + "round_idx = None\n", + "\n", + "algo = download_algo_state(\n", + " client=clients[client_to_download_from],\n", + " compute_plan_key=compute_plan.key,\n", + " round_idx=round_idx,\n", + ")\n", + "\n", + "model = algo.model\n", 
+ "\n", + "print(model)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.17" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} diff --git a/examples/substrafl/get_started/torch_fedavg_assets/dataset/description.md b/docs/source/examples/substrafl/get_started/torch_fedavg_assets/dataset/description.md similarity index 100% rename from examples/substrafl/get_started/torch_fedavg_assets/dataset/description.md rename to docs/source/examples/substrafl/get_started/torch_fedavg_assets/dataset/description.md diff --git a/examples/substrafl/get_started/torch_fedavg_assets/dataset/mnist_dataset.py b/docs/source/examples/substrafl/get_started/torch_fedavg_assets/dataset/mnist_dataset.py similarity index 100% rename from examples/substrafl/get_started/torch_fedavg_assets/dataset/mnist_dataset.py rename to docs/source/examples/substrafl/get_started/torch_fedavg_assets/dataset/mnist_dataset.py diff --git a/examples/substrafl/get_started/torch_fedavg_assets/dataset/mnist_opener.py b/docs/source/examples/substrafl/get_started/torch_fedavg_assets/dataset/mnist_opener.py similarity index 100% rename from examples/substrafl/get_started/torch_fedavg_assets/dataset/mnist_opener.py rename to docs/source/examples/substrafl/get_started/torch_fedavg_assets/dataset/mnist_opener.py diff --git a/examples/substrafl/get_started/torch_fedavg_assets/requirements.txt b/docs/source/examples/substrafl/get_started/torch_fedavg_assets/requirements.txt similarity index 100% rename from examples/substrafl/get_started/torch_fedavg_assets/requirements.txt rename to docs/source/examples/substrafl/get_started/torch_fedavg_assets/requirements.txt diff --git 
a/examples/substrafl/go_further/diabetes_substrafl_assets/dataset/diabetes_substrafl_dataset.py b/docs/source/examples/substrafl/go_further/diabetes_substrafl_assets/dataset/diabetes_substrafl_dataset.py similarity index 100% rename from examples/substrafl/go_further/diabetes_substrafl_assets/dataset/diabetes_substrafl_dataset.py rename to docs/source/examples/substrafl/go_further/diabetes_substrafl_assets/dataset/diabetes_substrafl_dataset.py diff --git a/examples/substrafl/go_further/diabetes_substrafl_assets/dataset/diabetes_substrafl_opener.py b/docs/source/examples/substrafl/go_further/diabetes_substrafl_assets/dataset/diabetes_substrafl_opener.py similarity index 100% rename from examples/substrafl/go_further/diabetes_substrafl_assets/dataset/diabetes_substrafl_opener.py rename to docs/source/examples/substrafl/go_further/diabetes_substrafl_assets/dataset/diabetes_substrafl_opener.py diff --git a/examples/substrafl/go_further/diabetes_substrafl_assets/requirements.txt b/docs/source/examples/substrafl/go_further/diabetes_substrafl_assets/requirements.txt similarity index 100% rename from examples/substrafl/go_further/diabetes_substrafl_assets/requirements.txt rename to docs/source/examples/substrafl/go_further/diabetes_substrafl_assets/requirements.txt diff --git a/docs/source/examples/substrafl/go_further/run_diabetes_substrafl.ipynb b/docs/source/examples/substrafl/go_further/run_diabetes_substrafl.ipynb new file mode 100644 index 00000000..a987b301 --- /dev/null +++ b/docs/source/examples/substrafl/go_further/run_diabetes_substrafl.ipynb @@ -0,0 +1,698 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "%matplotlib inline" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Federated Analytics on the diabetes dataset\n", + "\n", + "This example demonstrates how to use the flexibility of the SubstraFL library and the base class\n", + 
"ComputePlanBuilder to do Federated Analytics. It reproduces the [diabetes example](https://docs.substra.org/en/stable/examples/substra_core/diabetes_example/run_diabetes.html)\n", + "of the Substra SDK example section using SubstraFL.\n", + "If you are new to SubstraFL, we recommend starting with the [MNIST Example](https://docs.substra.org/en/stable/examples/substrafl/get_started/run_mnist_torch.html)\n", + "to learn how to use the library in the simplest configuration first.\n", + "\n", + "We use the **Diabetes dataset** available from the [Scikit-Learn dataset module](https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset).\n", + "This dataset contains medical information such as Age, Sex or Blood pressure.\n", + "The goal of this example is to compute some analytics such as Age mean, Blood pressure standard deviation or Sex percentage.\n", + "\n", + "We simulate having two different data organizations, and a third organization which wants to compute aggregated analytics\n", + "without having access to the raw data. The example here runs everything locally; however, there is only one parameter to\n", + "change to run it on a real network.\n", + "\n", + "**Caution:**\n", + " This example is provided as an illustrative example only. In real life, you should be careful not to\n", + " accidentally leak private information when doing Federated Analytics. 
For example, if a column contains very similar values,\n", + " sharing its mean and its standard deviation is functionally equivalent to sharing the content of the column.\n", + " It is **strongly recommended** to consider the potential security risks in your use case, and to act accordingly.\n", + " It is possible to use other privacy-preserving techniques, such as\n", + " [Differential Privacy](https://en.wikipedia.org/wiki/Differential_privacy), in addition to Substra.\n", + " Because the focus of this example is Substra capabilities and for the sake of simplicity, such safeguards are not implemented here.\n", + "\n", + "\n", + "To run this example, you need to download and unzip the assets needed to run it in the same directory as this example.\n", + "\n", + "- [assets required to run this example](../../../tmp/diabetes_substrafl_assets.zip)\n", + "\n", + "Please make sure all the libraries are installed. A *requirements.txt* file is included in the zip file; run `pip install -r requirements.txt` to install them.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Instantiating the Substra clients\n", + "\n", + "We work with three different organizations.\n", + "Two organizations provide data, and a third one performs Federated Analytics to compute aggregated statistics without\n", + "having access to the raw datasets.\n", + "\n", + "This example runs in local mode, simulating a federated learning experiment.\n", + "\n", + "In the following code cell, we define the different organizations needed for our FL experiment.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "from substra import Client\n", + "\n", + "# Choose the subprocess mode to locally simulate the FL process\n", + "N_CLIENTS = 3\n", + "client_0 = Client(client_name=\"org-1\")\n", + "client_1 = Client(client_name=\"org-2\")\n", +
"client_2 = Client(client_name=\"org-3\")\n", + "\n", + "# Create a dictionary to easily access each client from its human-friendly id\n", + "clients = {\n", + " client_0.organization_info().organization_id: client_0,\n", + " client_1.organization_info().organization_id: client_1,\n", + " client_2.organization_info().organization_id: client_2,\n", + "}\n", + "# Store organization IDs\n", + "ORGS_ID = list(clients)\n", + "\n", + "# The provider of the functions for computing analytics is defined as the first organization.\n", + "ANALYTICS_PROVIDER_ORG_ID = ORGS_ID[0]\n", + "# Data provider orgs are the last two organizations.\n", + "DATA_PROVIDER_ORGS_ID = ORGS_ID[1:]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Prepare the data\n", + "\n", + "The function `setup_diabetes` downloads the *diabetes* dataset if needed, and splits it in two to simulate a\n", + "federated setup. Each data organization has access to a chunk of the dataset.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "import pathlib\n", + "\n", + "from diabetes_substrafl_assets.dataset.diabetes_substrafl_dataset import setup_diabetes\n", + "\n", + "data_path = pathlib.Path.cwd() / \"tmp\" / \"data_diabetes\"\n", + "data_path.mkdir(parents=True, exist_ok=True)\n", + "\n", + "setup_diabetes(data_path=data_path)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Registering data samples and dataset\n", + "\n", + "Every asset is created according to predefined specifications imported from\n", + "`substra.sdk.schemas`. To register assets, [Schemas](https://docs.substra.org/en/stable/documentation/references/sdk_schemas.html#schemas)\n", + "are first instantiated and the specs are then registered, which generates the real assets.\n", + "\n", + "Permissions are defined when registering assets. 
In a nutshell:\n", + "\n", + "- Data cannot be seen once it's registered on the platform.\n", + "- Metadata are visible to all the users of a network.\n", + "- Permissions allow you to execute a function on a certain dataset.\n", + "\n", + "Next, we need to define the asset directory. You should have already downloaded the assets folder as stated above.\n", + "\n", + "A dataset represents the data in Substra. It contains some metadata and an *opener*, a script used to load the\n", + "data from files into memory. You can find more details about datasets\n", + "in the [API Reference DatasetSpec](https://docs.substra.org/en/stable/documentation/references/sdk_schemas.html#datasetspec).\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "from substra.sdk.schemas import DataSampleSpec\n", + "from substra.sdk.schemas import DatasetSpec\n", + "from substra.sdk.schemas import Permissions\n", + "\n", + "\n", + "assets_directory = pathlib.Path.cwd() / \"diabetes_substrafl_assets\"\n", + "assert assets_directory.is_dir(), \"\"\"Did not find the asset directory,\n", + "a directory called 'diabetes_substrafl_assets' is expected in the same location as this file\"\"\"\n", + "\n", + "permissions_dataset = Permissions(public=False, authorized_ids=[ANALYTICS_PROVIDER_ORG_ID])\n", + "\n", + "dataset = DatasetSpec(\n", + " name=\"Diabetes dataset\",\n", + " type=\"csv\",\n", + " data_opener=assets_directory / \"dataset\" / \"diabetes_substrafl_opener.py\",\n", + " description=data_path / \"description.md\",\n", + " permissions=permissions_dataset,\n", + " logs_permission=permissions_dataset,\n", + ")\n", + "\n", + "# We register the dataset for each organization\n", + "dataset_keys = {client_id: clients[client_id].add_dataset(dataset) for client_id in DATA_PROVIDER_ORGS_ID}\n", + "\n", + "for client_id, key in dataset_keys.items():\n", + " print(f\"Dataset key for {client_id}: {key}\")" + ] + }, + { + 
"cell_type": "markdown", + "metadata": {}, + "source": [ + "The dataset object itself is an empty shell. Data samples are needed in order to add actual data.\n", + "A data sample references a folder containing the actual data file (here a CSV), together with the key of\n", + "the dataset it is linked to.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "datasample_keys = {\n", + " org_id: clients[org_id].add_data_sample(\n", + " DataSampleSpec(\n", + " data_manager_keys=[dataset_keys[org_id]],\n", + " test_only=False,\n", + " path=data_path / f\"org_{i + 1}\",\n", + " ),\n", + " local=True,\n", + " )\n", + " for i, org_id in enumerate(DATA_PROVIDER_ORGS_ID)\n", + "}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## The flexibility of the ComputePlanBuilder class\n", + "\n", + "This example aims at explaining how to use the [Compute Plan Builder](https://docs.substra.org/en/stable/substrafl_doc/api/compute_plan_builder.html#compute-plan-builder)\n", + "class, and how to use the full power of the flexibility it provides.\n", + "\n", + "Before starting, keep in mind that a federated computation can be represented as a graph of tasks.\n", + "Some of these tasks need data to be executed (training tasks), while others aggregate local results\n", + "(aggregation tasks).\n", + "\n", + "Substra does not store an explicit definition of this graph; instead, it gives the user full flexibility to define\n", + "the compute plan (or computation graph) they need, by linking a task to its parents.\n", + "\n", + "To create this graph of computations, SubstraFL provides the `Node` abstraction. A `Node`\n", + "assigns tasks of a given type to an organization (aka a Client). 
The type of the `Node` depends on the type of tasks\n", + "we want to run on this organization (training or aggregation tasks).\n", + "\n", + "An organization (aka a Client) without data can host an\n", + "[Aggregation node](https://docs.substra.org/en/stable/substrafl_doc/api/nodes.html#aggregationnode).\n", + "We will use the [Aggregation node](https://docs.substra.org/en/stable/substrafl_doc/api/nodes.html#aggregationnode) object to compute the aggregated\n", + "analytics.\n", + "\n", + "An organization (aka a Client) containing the data samples can host a\n", + "[Train data node](https://docs.substra.org/en/stable/substrafl_doc/api/nodes.html#traindatanode).\n", + "Each node only has access to data from the organization hosting it.\n", + "These data samples must be instantiated with the right permissions to be processed by the given Client.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "from substrafl.nodes import TrainDataNode\n", + "from substrafl.nodes import AggregationNode\n", + "\n", + "\n", + "aggregation_node = AggregationNode(ANALYTICS_PROVIDER_ORG_ID)\n", + "\n", + "train_data_nodes = [\n", + " TrainDataNode(\n", + " organization_id=org_id,\n", + " data_manager_key=dataset_keys[org_id],\n", + " data_sample_keys=[datasample_keys[org_id]],\n", + " )\n", + " for org_id in DATA_PROVIDER_ORGS_ID\n", + "]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The [Compute Plan Builder](https://docs.substra.org/en/stable/substrafl_doc/api/compute_plan_builder.html#compute-plan-builder) is an abstract class that asks the user to\n", + "implement only three methods:\n", + "\n", + "- `build_compute_plan(...)`\n", + "- `load_local_state(...)`\n", + "- `save_local_state(...)`\n", + "\n", + "The `build_compute_plan` method is essential to create the graph of the compute plan that will be executed on\n", + "Substra. 
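Before wiring these computations into Substra tasks, the two-round analytics this compute plan implements can be sketched in plain Python (toy data, no Substra involved): the first round shares only per-center sums and counts, and the second round shares squared deviations from the global mean.

```python
import numpy as np
import pandas as pd

# Two toy "organizations" (made-up values standing in for the real split).
df1 = pd.DataFrame({"age": [50.0, 60.0, 70.0]})
df2 = pd.DataFrame({"age": [40.0, 80.0]})

# Round 1: each center shares only its sum and its sample count.
shared = [{"n_samples": len(df), "sum": df["age"].sum()} for df in (df1, df2)]
total_n = sum(s["n_samples"] for s in shared)
global_mean = sum(s["sum"] for s in shared) / total_n

# Round 2: each center shares its sum of squared deviations from the global mean.
sq_devs = [((df["age"] - global_mean) ** 2).sum() for df in (df1, df2)]
global_std = np.sqrt(sum(sq_devs) / total_n)

# The federated results match the pooled computation on the concatenated data.
pooled = pd.concat([df1, df2])["age"]
assert np.isclose(global_mean, pooled.mean())
assert np.isclose(global_std, pooled.std(ddof=0))
```

Note that only aggregates (sums, counts, squared deviations) ever cross organization boundaries; the raw values never do.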
Using the different `Nodes` we created, we will update their states by applying user-defined methods.\n", + "\n", + "These methods are passed as arguments to the `Node` using its `update_state` method.\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "import json\n", + "from collections import defaultdict\n", + "from typing import List, Dict\n", + "\n", + "from substrafl import ComputePlanBuilder\n", + "from substrafl.remote import remote_data, remote\n", + "\n", + "\n", + "class Analytics(ComputePlanBuilder):\n", + " def __init__(self):\n", + " super().__init__()\n", + " self.first_order_aggregated_state = {}\n", + " self.second_order_aggregated_state = {}\n", + "\n", + " @remote_data\n", + " def local_first_order_computation(self, datasamples: pd.DataFrame, shared_state=None):\n", + " \"\"\"Compute from the data samples, expected to be a pandas dataframe,\n", + " the means and counts of each column of the data frame.\n", + " These datasamples are the output of the ``get_data`` function defined\n", + " in the ``diabetes_substrafl_opener.py`` file, available in the asset\n", + " folder downloaded at the beginning of the example.\n", + "\n", + " The signature of a function decorated by @remote_data must contain\n", + " the datasamples and the shared_state arguments.\n", + "\n", + " Args:\n", + " datasamples (pd.DataFrame): Pandas dataframe provided by the opener.\n", + " shared_state (None, optional): Unused here as this function only\n", + " uses local information already present in the datasamples.\n", + " Defaults to None.\n", + "\n", + " Returns:\n", + " dict: dictionary containing the local information on means, counts\n", + " and number of samples. 
This dict will be used as a state to be\n", + " shared with an AggregationNode in order to compute the aggregation\n", + " of the different analytics.\n", + " \"\"\"\n", + " df = datasamples\n", + " states = {\n", + " \"n_samples\": len(df),\n", + " \"means\": df.select_dtypes(include=np.number).sum().to_dict(),\n", + " \"counts\": {\n", + " name: series.value_counts().to_dict() for name, series in df.select_dtypes(include=\"category\").items()\n", + " },\n", + " }\n", + " return states\n", + "\n", + " @remote_data\n", + " def local_second_order_computation(self, datasamples: pd.DataFrame, shared_state: Dict):\n", + " \"\"\"This function will use the output of the ``aggregation`` function to compute\n", + " the standard deviation of the different columns locally.\n", + "\n", + " Args:\n", + " datasamples (pd.DataFrame): Pandas dataframe provided by the opener.\n", + " shared_state (Dict): Output of a first order analytics computation,\n", + " that must contain the means.\n", + "\n", + " Returns:\n", + " Dict: dictionary containing the local information on standard deviation\n", + " and number of samples. 
This dict will be used as a state to be shared\n", + " with an AggregationNode in order to compute the aggregation of the\n", + " different analytics.\n", + " \"\"\"\n", + " df = datasamples\n", + " means = pd.Series(shared_state[\"means\"])\n", + " states = {\n", + " \"n_samples\": len(df),\n", + " \"std\": np.power(df.select_dtypes(include=np.number) - means, 2).sum(),\n", + " }\n", + " return states\n", + "\n", + " @remote\n", + " def aggregation(self, shared_states: List[Dict]):\n", + " \"\"\"Aggregation function that receives a list of locally computed analytics and\n", + " aggregates them.\n", + " The aggregation will be a weighted average using \"n_samples\" as weight coefficient.\n", + "\n", + " Args:\n", + " shared_states (List[Dict]): list of dictionaries containing a field \"n_samples\",\n", + " and the analytics to aggregate in separate fields.\n", + "\n", + " Returns:\n", + " Dict: dictionary containing the aggregated analytics.\n", + " \"\"\"\n", + " total_len = 0\n", + " for state in shared_states:\n", + " total_len += state[\"n_samples\"]\n", + "\n", + " aggregated_values = defaultdict(lambda: defaultdict(float))\n", + " for state in shared_states:\n", + " for analytics_name, col_dict in state.items():\n", + " if analytics_name == \"n_samples\":\n", + " # already aggregated in total_len\n", + " continue\n", + " for col_name, v in col_dict.items():\n", + " if isinstance(v, dict):\n", + " # this column is categorical and v is a dict over\n", + " # the different modalities\n", + " if not aggregated_values[analytics_name][col_name]:\n", + " aggregated_values[analytics_name][col_name] = defaultdict(float)\n", + " for modality, vv in v.items():\n", + " aggregated_values[analytics_name][col_name][modality] += vv / total_len\n", + " else:\n", + " # this is a numerical column and v is numerical\n", + " aggregated_values[analytics_name][col_name] += v / total_len\n", + "\n", + " # transform defaultdict to regular dict\n", + " aggregated_values = 
json.loads(json.dumps(aggregated_values))\n", + "\n", + " return aggregated_values\n", + "\n", + " def build_compute_plan(\n", + " self,\n", + " train_data_nodes: List[TrainDataNode],\n", + " aggregation_node: AggregationNode,\n", + " num_rounds=None,\n", + " evaluation_strategy=None,\n", + " clean_models=False,\n", + " ):\n", + " \"\"\"Method to build and link the different computations to execute with each other.\n", + " We will use the ``update_state`` method of the nodes given as input to choose which\n", + " method to apply.\n", + " For our example, we will only use TrainDataNodes and AggregationNodes.\n", + "\n", + " Args:\n", + " train_data_nodes (List[TrainDataNode]): Nodes linked to the data samples on which\n", + " to compute analytics.\n", + " aggregation_node (AggregationNode): Node on which to compute the aggregation\n", + " of the analytics extracted from the train_data_nodes.\n", + " num_rounds (Optional[int]): Number of rounds used to iterate on the recurrent part of\n", + " the compute plan. Defaults to None.\n", + " evaluation_strategy (Optional[substrafl.EvaluationStrategy]): Object storing the\n", + " TestDataNode. Unused in this example. Defaults to None.\n", + " clean_models (bool): Clean the intermediary models of this round on the\n", + " Substra platform. 
Defaults to False.\n", + " \"\"\"\n", + " first_order_shared_states = []\n", + " local_states = {}\n", + "\n", + " for node in train_data_nodes:\n", + " # Call local_first_order_computation on each train data node\n", + " next_local_state, next_shared_state = node.update_states(\n", + " self.local_first_order_computation(\n", + " node.data_sample_keys,\n", + " shared_state=None,\n", + " _algo_name=f\"Computing first order means with {self.__class__.__name__}\",\n", + " ),\n", + " local_state=None,\n", + " round_idx=0,\n", + " authorized_ids=set([node.organization_id]),\n", + " aggregation_id=aggregation_node.organization_id,\n", + " clean_models=False,\n", + " )\n", + "\n", + " # All local analytics are stored in the first_order_shared_states,\n", + " # given as input to the aggregation method.\n", + " first_order_shared_states.append(next_shared_state)\n", + " local_states[node.organization_id] = next_local_state\n", + "\n", + " # Call the aggregation method on the first_order_shared_states\n", + " self.first_order_aggregated_state = aggregation_node.update_states(\n", + " self.aggregation(\n", + " shared_states=first_order_shared_states,\n", + " _algo_name=\"Aggregating first order\",\n", + " ),\n", + " round_idx=0,\n", + " authorized_ids=set([train_data_node.organization_id for train_data_node in train_data_nodes]),\n", + " clean_models=False,\n", + " )\n", + "\n", + " second_order_shared_states = []\n", + "\n", + " for node in train_data_nodes:\n", + " # Call local_second_order_computation on each train data node\n", + " _, next_shared_state = node.update_states(\n", + " self.local_second_order_computation(\n", + " node.data_sample_keys,\n", + " shared_state=self.first_order_aggregated_state,\n", + " _algo_name=f\"Computing second order analytics with {self.__class__.__name__}\",\n", + " ),\n", + " local_state=local_states[node.organization_id],\n", + " round_idx=1,\n", + " authorized_ids=set([node.organization_id]),\n", + " 
aggregation_id=aggregation_node.organization_id,\n", + " clean_models=False,\n", + " )\n", + "\n", + " # All local analytics are stored in the second_order_shared_states,\n", + " # given as input to the aggregation method.\n", + " second_order_shared_states.append(next_shared_state)\n", + "\n", + " # Call the aggregation method on the second_order_shared_states\n", + " self.second_order_aggregated_state = aggregation_node.update_states(\n", + " self.aggregation(\n", + " shared_states=second_order_shared_states,\n", + " _algo_name=\"Aggregating second order\",\n", + " ),\n", + " round_idx=1,\n", + " authorized_ids=set([train_data_node.organization_id for train_data_node in train_data_nodes]),\n", + " clean_models=False,\n", + " )\n", + "\n", + " def save_local_state(self, path: pathlib.Path):\n", + " \"\"\"This function saves the important local state to be retrieved after each new\n", + " call to a train or test task.\n", + "\n", + " Args:\n", + " path (pathlib.Path): Path where to save the local_state. Provided internally by\n", + " Substra.\n", + " \"\"\"\n", + " state_to_save = {\n", + " \"first_order\": self.first_order_aggregated_state,\n", + " \"second_order\": self.second_order_aggregated_state,\n", + " }\n", + " with open(path, \"w\") as f:\n", + " json.dump(state_to_save, f)\n", + "\n", + " def load_local_state(self, path: pathlib.Path):\n", + " \"\"\"Mirror function to load the local_state from a file saved using\n", + " ``save_local_state``.\n", + "\n", + " Args:\n", + " path (pathlib.Path): Path where to load the local_state. 
Provided internally by\n", + " Substra.\n", + "\n", + " Returns:\n", + " ComputePlanBuilder: return self with the updated local state.\n", + " \"\"\"\n", + " with open(path, \"r\") as f:\n", + " state_to_load = json.load(f)\n", + "\n", + " self.first_order_aggregated_state = state_to_load[\"first_order\"]\n", + " self.second_order_aggregated_state = state_to_load[\"second_order\"]\n", + "\n", + " return self" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now that we have seen the implementation of the custom `Analytics` class, we can add details to some of the previously\n", + "introduced concepts.\n", + "\n", + "The `update_state` method outputs the new state of the node, that can be passed as an argument to a following one.\n", + "This succession of `next_state` passed to a new `node.update_state` is how Substra builds the graph of the\n", + "compute plan.\n", + "\n", + "The `load_local_state` and `save_local_state` methods are used at each new iteration on a Node, in order to\n", + "retrieve the previous local state that has not been shared with the other `Nodes`.\n", + "\n", + "For instance, after updating a [Train data node](https://docs.substra.org/en/stable/substrafl_doc/api/nodes.html#traindatanode) using its\n", + "`update_state` method, we will have access to its next local state, which we will pass as an argument to the\n", + "next `update_state` we will apply on this [Train data node](https://docs.substra.org/en/stable/substrafl_doc/api/nodes.html#traindatanode).\n", + "\n", + "To summarize, a [Compute Plan Builder](https://docs.substra.org/en/stable/substrafl_doc/api/compute_plan_builder.html#compute-plan-builder) is composed of several decorated\n", + "user-defined functions, which may need data (decorated with `@remote_data`) or not (decorated with `@remote`).\n", + "\n", + "See the SubstraFL API documentation for more information on these decorators.\n", + "\n", + "These user-defined functions will be used to create the graph of the compute plan through the 
`build_compute_plan`\n", + "method and the `update_state` method of the different `Nodes`.\n", + "\n", + "The local state obtained after updating a [Train data node](https://docs.substra.org/en/stable/substrafl_doc/api/nodes.html#traindatanode) needs the\n", + "`save_local_state` and `load_local_state` methods to retrieve the state the Node was in at the end of\n", + "the last update.\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Running the experiment\n", + "\n", + "As a last step before launching our experiment, we need to specify the third-party dependencies required to run it.\n", + "The [Dependency](https://docs.substra.org/en/stable/substrafl_doc/api/dependency.html#dependency) object is instantiated in order to install the right libraries in\n", + "the Python environment of each organization.\n", + "\n", + "We now have all the necessary objects to launch our experiment. Here is a summary of all the objects we have created so far:\n", + "\n", + "- A [Client](https://docs.substra.org/en/stable/documentation/references/sdk.html#client) to add or retrieve the assets of our experiment, using their keys to\n", + " identify them.\n", + "- A [Federated Strategy](https://docs.substra.org/en/stable/substrafl_doc/api/strategies.html#strategies), to specify what compute plan we want to execute.\n", + "- [Train data nodes](https://docs.substra.org/en/stable/substrafl_doc/api/nodes.html#traindatanode) to indicate on which data to train.\n", + "- An [Evaluation Strategy](https://docs.substra.org/en/stable/substrafl_doc/api/evaluation_strategy.html#evaluation-strategy), to define where and at which frequency we\n", + " evaluate the model. Here this does not apply to our experiment. 
We set it to None.\n", + "- An [Aggregation Node](https://docs.substra.org/en/stable/substrafl_doc/api/nodes.html#aggregationnode), to specify the organization on which the aggregation operation\n", + " will be computed.\n", + "- An **experiment folder** to save a summary of the operations made.\n", + "- The [Dependency](https://docs.substra.org/en/stable/substrafl_doc/api/dependency.html#dependency) to define the libraries on which the experiment needs to run.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "from substrafl.dependency import Dependency\n", + "from substrafl.experiment import execute_experiment\n", + "\n", + "dependencies = Dependency(pypi_dependencies=[\"numpy==1.23.1\", \"pandas==1.5.3\"])\n", + "\n", + "compute_plan = execute_experiment(\n", + " client=clients[ANALYTICS_PROVIDER_ORG_ID],\n", + " strategy=Analytics(),\n", + " train_data_nodes=train_data_nodes,\n", + " evaluation_strategy=None,\n", + " aggregation_node=aggregation_node,\n", + " experiment_folder=str(pathlib.Path.cwd() / \"tmp\" / \"experiment_summaries\"),\n", + " dependencies=dependencies,\n", + " clean_models=False,\n", + " name=\"Federated Analytics with SubstraFL documentation example\",\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Results\n", + "\n", + "The output of a task can be downloaded using utility functions provided by SubstraFL, such as\n", + "`download_algo_state`, `download_train_shared_state` or `download_aggregate_shared_state`.\n", + "\n", + "These functions download the output of a given `round_idx` or `rank_idx` from a given `Client` and a given\n", + "`compute_plan_key`.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "from substrafl.model_loading import download_aggregate_shared_state\n", + "\n", + "# The aggregated 
analytics are computed in the ANALYTICS_PROVIDER_ORG_ID client.\n", + "client_to_download_from = clients[ANALYTICS_PROVIDER_ORG_ID]\n", + "\n", + "# The results will be available once the compute plan is completed\n", + "client_to_download_from.wait_compute_plan(compute_plan.key)\n", + "\n", + "first_rank_analytics = download_aggregate_shared_state(\n", + " client=client_to_download_from,\n", + " compute_plan_key=compute_plan.key,\n", + " round_idx=0,\n", + ")\n", + "\n", + "second_rank_analytics = download_aggregate_shared_state(\n", + " client=client_to_download_from,\n", + " compute_plan_key=compute_plan.key,\n", + " round_idx=1,\n", + ")\n", + "\n", + "print(\n", + " f\"\"\"Age mean: {first_rank_analytics['means']['age']:.2f} years\n", + "Sex percentage:\n", + " Male: {100*first_rank_analytics['counts']['sex']['M']:.2f}%\n", + " Female: {100*first_rank_analytics['counts']['sex']['F']:.2f}%\n", + "Blood pressure std: {second_rank_analytics[\"std\"][\"bp\"]:.2f} mm Hg\n", + "\"\"\"\n", + ")" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.17" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} diff --git a/docs/source/examples/substrafl/go_further/run_iris_sklearn.ipynb b/docs/source/examples/substrafl/go_further/run_iris_sklearn.ipynb new file mode 100644 index 00000000..80602186 --- /dev/null +++ b/docs/source/examples/substrafl/go_further/run_iris_sklearn.ipynb @@ -0,0 +1,679 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "%matplotlib inline" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Using scikit-learn FedAvg 
on IRIS dataset\n", + "\n", + "This example illustrates an advanced usage of SubstraFL as it does not use the SubstraFL PyTorch interface, but showcases the general SubstraFL interface that you can use with any ML framework.\n", + "\n", + "\n", + "This example is based on:\n", + "\n", + "- Dataset: IRIS, tabular dataset to classify iris type\n", + "- Model type: Logistic regression using Scikit-Learn\n", + "- FL setup: three organizations, two data providers and one algo provider\n", + "\n", + "This example does not use the deployed platform of Substra, it runs in local mode.\n", + "\n", + "To run this example, you need to download and unzip the assets needed to run it in the same directory as this example:\n", + "\n", + "- [assets required to run this example](../../../tmp/sklearn_fedavg_assets.zip)\n", + "\n", + "Please ensure all the required libraries are installed. A *requirements.txt* file is included in the zip file; you can run the command `pip install -r requirements.txt` to install them.\n", + "\n", + "**Substra** and **SubstraFL** should already be installed. If not, follow the instructions described [here](https://docs.substra.org/en/stable/substrafl_doc/substrafl_overview.html#installation).\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Setup\n", + "\n", + "We work with three different organizations. 
Two organizations provide a dataset, and a third\n", + "one provides the algorithm and registers the machine learning tasks.\n", + "\n", + "This example runs in local mode, simulating a federated learning experiment.\n", + "\n", + "In the following code cell, we define the different organizations needed for our FL experiment.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "import numpy as np\n", + "\n", + "from substra import Client\n", + "\n", + "SEED = 42\n", + "np.random.seed(SEED)\n", + "\n", + "# Choose the subprocess mode to locally simulate the FL process\n", + "N_CLIENTS = 3\n", + "clients_list = [Client(client_name=f\"org-{i+1}\") for i in range(N_CLIENTS)]\n", + "clients = {client.organization_info().organization_id: client for client in clients_list}\n", + "\n", + "# Store organization IDs\n", + "ORGS_ID = list(clients)\n", + "ALGO_ORG_ID = ORGS_ID[0] # Algo provider is defined as the first organization.\n", + "DATA_PROVIDER_ORGS_ID = ORGS_ID[1:] # Data provider orgs are the last two organizations." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Data and metrics\n", + "\n", + "### Data preparation\n", + "\n", + "This section downloads (if needed) the **IRIS dataset** using the [Scikit-Learn dataset module](https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html).\n", + "It extracts the data locally and creates two folders: one for each organization.\n", + "\n", + "Each organization will have access to half the train data, and to half the test data.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "import pathlib\n", + "from sklearn_fedavg_assets.dataset.iris_dataset import setup_iris\n", + "\n", + "\n", + "# Create the temporary directory for generated data\n", + "(pathlib.Path.cwd() / \"tmp\").mkdir(exist_ok=True)\n", + "data_path = pathlib.Path.cwd() / \"tmp\" / \"data_iris\"\n", + "\n", + "setup_iris(data_path=data_path, n_client=len(DATA_PROVIDER_ORGS_ID))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Dataset registration" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "from substra.sdk.schemas import DatasetSpec\n", + "from substra.sdk.schemas import Permissions\n", + "from substra.sdk.schemas import DataSampleSpec\n", + "\n", + "assets_directory = pathlib.Path.cwd() / \"sklearn_fedavg_assets\"\n", + "\n", + "permissions_dataset = Permissions(public=False, authorized_ids=[ALGO_ORG_ID])\n", + "\n", + "dataset = DatasetSpec(\n", + " name=\"Iris\",\n", + " type=\"npy\",\n", + " data_opener=assets_directory / \"dataset\" / \"iris_opener.py\",\n", + " description=assets_directory / \"dataset\" / \"description.md\",\n", + " permissions=permissions_dataset,\n", + " logs_permission=permissions_dataset,\n", + ")\n", + "\n", + "dataset_keys = {}\n", + "train_datasample_keys = {}\n", + 
"test_datasample_keys = {}\n", + "\n", + "for i, org_id in enumerate(DATA_PROVIDER_ORGS_ID):\n", + " client = clients[org_id]\n", + "\n", + " # Add the dataset to the client to provide access to the opener in each organization.\n", + " dataset_keys[org_id] = client.add_dataset(dataset)\n", + " assert dataset_keys[org_id], \"Missing data manager key\"\n", + "\n", + " # Add the training data on each organization.\n", + " data_sample = DataSampleSpec(\n", + " data_manager_keys=[dataset_keys[org_id]],\n", + " path=data_path / f\"org_{i+1}\" / \"train\",\n", + " )\n", + " train_datasample_keys[org_id] = client.add_data_sample(\n", + " data_sample,\n", + " local=True,\n", + " )\n", + "\n", + " # Add the testing data on each organization.\n", + " data_sample = DataSampleSpec(\n", + " data_manager_keys=[dataset_keys[org_id]],\n", + " path=data_path / f\"org_{i+1}\" / \"test\",\n", + " )\n", + " test_datasample_keys[org_id] = client.add_data_sample(\n", + " data_sample,\n", + " local=True,\n", + " )" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Metrics registration" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "from sklearn.metrics import accuracy_score\n", + "import numpy as np\n", + "\n", + "\n", + "def accuracy(datasamples, predictions_path):\n", + " y_true = datasamples[\"targets\"]\n", + " y_pred = np.load(predictions_path)\n", + "\n", + " return accuracy_score(y_true, y_pred)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Specify the machine learning components\n", + "\n", + "SubstraFL can be used with any machine learning framework. 
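A metric like the one registered above can be exercised on its own, outside Substra. The sketch below uses a plain-NumPy equivalent of `accuracy_score` and a temporary predictions file (toy labels and a hypothetical file name, not from the example), mimicking how Substra hands the metric the opener's `datasamples` dict and a `predictions_path`:

```python
import pathlib
import tempfile

import numpy as np


def accuracy(datasamples, predictions_path):
    """Plain-NumPy stand-in for the example's metric: fraction of correct predictions."""
    y_true = np.asarray(datasamples["targets"])
    y_pred = np.load(predictions_path)
    return float((y_true == y_pred).mean())


# Standalone check with toy labels: save predictions to a file, then score them.
with tempfile.TemporaryDirectory() as tmp:
    predictions_path = pathlib.Path(tmp) / "predictions.npy"
    np.save(predictions_path, np.array([0, 1, 1, 2]))
    score = accuracy({"targets": np.array([0, 1, 2, 2])}, predictions_path)

print(score)  # 0.75 (three of four toy predictions are correct)
```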
The framework-dependent\n", + "functions are written in the [Algorithm](https://docs.substra.org/en/stable/substrafl_doc/api/algorithms.html#algorithms) object.\n", + "\n", + "In this section, you will:\n", + "\n", + "- register a model and its dependencies\n", + "- write your own Sklearn SubstraFL algorithm\n", + "- specify the federated learning strategy\n", + "- specify the organizations where to train and where to aggregate\n", + "- specify the organizations where to test the models\n", + "- actually run the computations\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Model definition\n", + "\n", + "The machine learning model used here is a logistic regression.\n", + "The `warm_start` argument is essential in this example, as it tells scikit-learn to use the current state of the model\n", + "as initialization for the future training.\n", + "By default scikit-learn uses `max_iter=100`, which means the model trains for up to 100 iterations.\n", + "When doing federated learning, we don't want to train too much locally at every round,\n", + "otherwise the local training will erase what was learned from the other centers. 
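At each round, the FedAvg strategy combines such locally trained models. The parameter update it performs can be sketched as a sample-weighted average of the centers' coefficient arrays (toy numbers below, not taken from the example):

```python
import numpy as np

# Hypothetical coefficients from two centers after a round of local training,
# with each center's number of samples used as its FedAvg weight.
local_coefs = [np.array([0.2, 0.4]), np.array([0.6, 0.8])]
n_samples = [100, 300]

# Weighted average: each center contributes proportionally to its share of the data.
total = sum(n_samples)
averaged = sum(n * coef for n, coef in zip(n_samples, local_coefs)) / total

# 0.25 * [0.2, 0.4] + 0.75 * [0.6, 0.8] = [0.5, 0.7]
assert np.allclose(averaged, [0.5, 0.7])
```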
That is why we set `max_iter=3`.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "import os\n", + "from sklearn import linear_model\n", + "\n", + "cls = linear_model.LogisticRegression(random_state=SEED, warm_start=True, max_iter=3)\n", + "\n", + "# Optional:\n", + "# Scikit-learn raises warnings in case of non-convergence, which we choose to disable here.\n", + "# As this example runs with python subprocess, the way to disable them is to use the following environment\n", + "# variable:\n", + "os.environ[\"PYTHONWARNINGS\"] = \"ignore:lbfgs failed to converge (status=1):UserWarning\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### SubstraFL algo definition\n", + "\n", + "This section is the most important one for this example. We will define here the function that will run locally on\n", + "each node to train the model.\n", + "\n", + "As SubstraFL does not provide an algorithm compatible with Sklearn, we need to define one using the provided documentation on\n", + "`substrafl_doc/api/algorithms:Base Class`.\n", + "\n", + "To define a custom algorithm, we will need to inherit from the base class Algo, and to define two properties and four\n", + "methods:\n", + "\n", + "- **strategies** (property): the list of strategies our algorithm is compatible with.\n", + "- **model** (property): a property that returns the model from the defined algo.\n", + "- **train** (method): a function to describe the training process to\n", + " apply to train our model in a federated way.\n", + " The train method must accept as parameters `datasamples` and `shared_state`.\n", + "- **predict** (method): a function to describe how to compute the\n", + " predictions from the algo model.\n", + " The predict method must accept as parameters `datasamples`, `shared_state` and `predictions_path`.\n", + "- **save** (method): specify how to save the important states of our 
algo.\n", + "- **load** (method): specify how to load the important states of our algo from a previously saved file\n", + " by the `save` function described above.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "from substrafl import algorithms\n", + "from substrafl import remote\n", + "from substrafl.strategies import schemas as fl_schemas\n", + "\n", + "import joblib\n", + "from typing import Optional\n", + "import shutil\n", + "\n", + "# The Iris dataset has four features to predict three different classes.\n", + "INPUT_SIZE = 4\n", + "OUTPUT_SIZE = 3\n", + "\n", + "\n", + "class SklearnLogisticRegression(algorithms.Algo):\n", + " def __init__(self, model, seed=None):\n", + " super().__init__(model=model, seed=seed)\n", + "\n", + " self._model = model\n", + "\n", + " # We need all different instances of the algorithm to have the same\n", + " # initialization.\n", + " self._model.coef_ = np.ones((OUTPUT_SIZE, INPUT_SIZE))\n", + " self._model.intercept_ = np.zeros(OUTPUT_SIZE)\n", + " self._model.classes_ = np.array([-1])\n", + "\n", + " if seed is not None:\n", + " np.random.seed(seed)\n", + "\n", + " @property\n", + " def strategies(self):\n", + " \"\"\"List of compatible strategies\"\"\"\n", + " return [fl_schemas.StrategyName.FEDERATED_AVERAGING]\n", + "\n", + " @property\n", + " def model(self):\n", + " return self._model\n", + "\n", + " @remote.remote_data\n", + " def train(\n", + " self,\n", + " datasamples,\n", + " shared_state: Optional[fl_schemas.FedAvgAveragedState] = None,\n", + " ) -> fl_schemas.FedAvgSharedState:\n", + " \"\"\"The train function to be executed on organizations containing\n", + " data we want to train our model on. 
The @remote_data decorator is mandatory\n", + " to allow this function to be sent and executed on the right organization.\n", + "\n", + " Args:\n", + " datasamples: datasamples extracted from the organizations' data using\n", + " the given opener.\n", + " shared_state (Optional[fl_schemas.FedAvgAveragedState], optional):\n", + " shared_state provided by the aggregator. Defaults to None.\n", + "\n", + " Returns:\n", + " fl_schemas.FedAvgSharedState: State to be sent to the aggregator.\n", + " \"\"\"\n", + "\n", + " if shared_state is not None:\n", + " # If we have a shared state, we update the model parameters with\n", + " # the average parameters updates.\n", + " self._model.coef_ += np.reshape(\n", + " shared_state.avg_parameters_update[:-1],\n", + " (OUTPUT_SIZE, INPUT_SIZE),\n", + " )\n", + " self._model.intercept_ += shared_state.avg_parameters_update[-1]\n", + "\n", + " # To be able to compute the delta between the parameters before and after training,\n", + " # we need to save them in a temporary variable.\n", + " old_coef = self._model.coef_\n", + " old_intercept = self._model.intercept_\n", + "\n", + " # Model training.\n", + " self._model.fit(datasamples[\"data\"], datasamples[\"targets\"])\n", + "\n", + " # We compute the delta.\n", + " delta_coef = self._model.coef_ - old_coef\n", + " delta_bias = self._model.intercept_ - old_intercept\n", + "\n", + " # We reset the model parameters to their state before training in order to remove\n", + " # the local updates from it.\n", + " self._model.coef_ = old_coef\n", + " self._model.intercept_ = old_intercept\n", + "\n", + " # We output the length of the dataset to apply a weighted average between\n", + " # the organizations regarding their number of samples, and the local\n", + " # parameters updates.\n", + " # These updates are sent to the aggregator to compute the average\n", + " # parameters updates, which we will receive in the next round in the\n", + " # `shared_state`.\n", + " return 
fl_schemas.FedAvgSharedState(\n", + " n_samples=len(datasamples[\"targets\"]),\n", + " parameters_update=[p for p in delta_coef] + [delta_bias],\n", + " )\n", + "\n", + " @remote.remote_data\n", + " def predict(self, datasamples, shared_state, predictions_path):\n", + " \"\"\"The predict function to be executed on organizations containing\n", + " data we want to test our model on. The @remote_data decorator is mandatory\n", + " to allow this function to be sent and executed on the right organization.\n", + "\n", + " Args:\n", + " datasamples: datasamples extracted from the organizations data using\n", + " the given opener.\n", + " shared_state: shared_state provided by the aggregator.\n", + " predictions_path: Path where to save the predictions.\n", + " This path is provided by Substra and the metric will automatically\n", + " get access to this path to load the predictions.\n", + " \"\"\"\n", + " predictions = self._model.predict(datasamples[\"data\"])\n", + "\n", + " if predictions_path is not None:\n", + " np.save(predictions_path, predictions)\n", + "\n", + " # np.save() automatically adds a \".npy\" to the end of the file.\n", + " # We rename the file produced by removing the \".npy\" suffix, to make sure that\n", + " # predictions_path is the actual file name.\n", + " shutil.move(str(predictions_path) + \".npy\", predictions_path)\n", + "\n", + " def save_local_state(self, path):\n", + " joblib.dump(\n", + " {\n", + " \"model\": self._model,\n", + " \"coef\": self._model.coef_,\n", + " \"bias\": self._model.intercept_,\n", + " },\n", + " path,\n", + " )\n", + "\n", + " def load_local_state(self, path):\n", + " loaded_dict = joblib.load(path)\n", + " self._model = loaded_dict[\"model\"]\n", + " self._model.coef_ = loaded_dict[\"coef\"]\n", + " self._model.intercept_ = loaded_dict[\"bias\"]\n", + " return self" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Federated Learning strategies" + ] + }, + { + "cell_type": "code", + 
"execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "from substrafl.strategies import FedAvg\n", + "\n", + "strategy = FedAvg(algo=SklearnLogisticRegression(model=cls, seed=SEED))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Where to train, where to aggregate" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "from substrafl.nodes import TrainDataNode\n", + "from substrafl.nodes import AggregationNode\n", + "\n", + "\n", + "aggregation_node = AggregationNode(ALGO_ORG_ID)\n", + "\n", + "# Create the Train Data Nodes (or training tasks) and save them in a list\n", + "train_data_nodes = [\n", + " TrainDataNode(\n", + " organization_id=org_id,\n", + " data_manager_key=dataset_keys[org_id],\n", + " data_sample_keys=[train_datasample_keys[org_id]],\n", + " )\n", + " for org_id in DATA_PROVIDER_ORGS_ID\n", + "]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Where and when to test" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "from substrafl.nodes import TestDataNode\n", + "from substrafl.evaluation_strategy import EvaluationStrategy\n", + "\n", + "# Create the Test Data Nodes (or testing tasks) and save them in a list\n", + "test_data_nodes = [\n", + " TestDataNode(\n", + " organization_id=org_id,\n", + " data_manager_key=dataset_keys[org_id],\n", + " test_data_sample_keys=[test_datasample_keys[org_id]],\n", + " metric_functions=accuracy,\n", + " )\n", + " for org_id in DATA_PROVIDER_ORGS_ID\n", + "]\n", + "\n", + "my_eval_strategy = EvaluationStrategy(test_data_nodes=test_data_nodes, eval_frequency=1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Running the experiment" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { 
+ "collapsed": false + }, + "outputs": [], + "source": [ + "from substrafl.experiment import execute_experiment\n", + "from substrafl.dependency import Dependency\n", + "\n", + "# Number of times to apply the compute plan.\n", + "NUM_ROUNDS = 6\n", + "\n", + "dependencies = Dependency(pypi_dependencies=[\"numpy==1.23.1\", \"scikit-learn==1.1.1\"])\n", + "\n", + "compute_plan = execute_experiment(\n", + " client=clients[ALGO_ORG_ID],\n", + " strategy=strategy,\n", + " train_data_nodes=train_data_nodes,\n", + " evaluation_strategy=my_eval_strategy,\n", + " aggregation_node=aggregation_node,\n", + " num_rounds=NUM_ROUNDS,\n", + " experiment_folder=str(pathlib.Path.cwd() / \"tmp\" / \"experiment_summaries\"),\n", + " dependencies=dependencies,\n", + " name=\"IRIS documentation example\",\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Explore the results" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "# The results will be available once the compute plan is completed\n", + "clients[ALGO_ORG_ID].wait_compute_plan(compute_plan.key)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Listing results" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "import pandas as pd\n", + "\n", + "performances_df = pd.DataFrame(client.get_performances(compute_plan.key).dict())\n", + "print(\"\\nPerformance Table: \\n\")\n", + "print(performances_df[[\"worker\", \"round_idx\", \"performance\"]])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Plot results" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "import matplotlib.pyplot as plt\n", + "\n", + "plt.title(\"Test dataset results\")\n", + "plt.xlabel(\"Rounds\")\n", + 
"plt.ylabel(\"Accuracy\")\n", + "\n", + "for org_id in DATA_PROVIDER_ORGS_ID:\n", + " df = performances_df[performances_df[\"worker\"] == org_id]\n", + " plt.plot(df[\"round_idx\"], df[\"performance\"], label=org_id)\n", + "\n", + "plt.legend(loc=\"lower right\")\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Download a model" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "from substrafl.model_loading import download_algo_state\n", + "\n", + "client_to_download_from = DATA_PROVIDER_ORGS_ID[0]\n", + "round_idx = None\n", + "\n", + "algo = download_algo_state(\n", + " client=clients[client_to_download_from],\n", + " compute_plan_key=compute_plan.key,\n", + " round_idx=round_idx,\n", + ")\n", + "\n", + "cls = algo.model\n", + "\n", + "print(\"Coefs: \", cls.coef_)\n", + "print(\"Intercepts: \", cls.intercept_)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.17" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} diff --git a/docs/source/examples/substrafl/go_further/run_mnist_cyclic.ipynb b/docs/source/examples/substrafl/go_further/run_mnist_cyclic.ipynb new file mode 100644 index 00000000..4c14573d --- /dev/null +++ b/docs/source/examples/substrafl/go_further/run_mnist_cyclic.ipynb @@ -0,0 +1,1113 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "%matplotlib inline" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Creating Torch Cyclic strategy on MNIST dataset\n", + "\n", + "This 
example illustrates an advanced usage of SubstraFL and proposes to implement a new Federated Learning strategy,\n", + "called **Cyclic Strategy**, using the SubstraFL base classes.\n", + "This example runs on the [MNIST Dataset of handwritten digits](http://yann.lecun.com/exdb/mnist/) using PyTorch.\n", + "In this example, we work on 28x28-pixel grayscale images. This is a classification problem\n", + "aiming to recognize the number written on each image.\n", + "\n", + "The **Cyclic Strategy** consists in training a model locally on different organizations (or centers) sequentially (one after the other). We\n", + "consider a round of this strategy to be a full cycle of local trainings.\n", + "\n", + "This example shows an implementation of the CyclicTorchAlgo using\n", + "[TorchAlgo](https://docs.substra.org/en/stable/substrafl_doc/api/algorithms.html#torch-algorithms) as base class, and the CyclicStrategy implementation using\n", + "[Strategy](https://docs.substra.org/en/stable/substrafl_doc/api/strategies.html) as base class.\n", + "\n", + "This example does not use a deployed platform of Substra and runs in local mode.\n", + "\n", + "To run this example, you need to download and unzip the assets needed to run it in the same directory as this example:\n", + "\n", + "- [assets required to run this example](../../../tmp/torch_cyclic_assets.zip)\n", + "\n", + "Please ensure all the libraries are installed. A *requirements.txt* file is included in the zip file; you can run the command `pip install -r requirements.txt` to install them.\n", + "\n", + "**Substra** and **SubstraFL** should already be installed. If not, follow the instructions described [here](https://docs.substra.org/en/stable/substrafl_doc/substrafl_overview.html#installation).\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Setup\n", + "\n", + "This example runs with three organizations. 
Two organizations provide datasets, while a third\n", + "one provides the algorithm.\n", + "\n", + "In the following code cell, we define the different organizations needed for our FL experiment.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "from substra import Client\n", + "\n", + "N_CLIENTS = 3\n", + "\n", + "client_0 = Client(client_name=\"org-1\")\n", + "client_1 = Client(client_name=\"org-2\")\n", + "client_2 = Client(client_name=\"org-3\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Every computation will run in `subprocess` mode, where everything runs locally in Python\n", + "subprocesses.\n", + "Other backend_types are:\n", + "\n", + "- `docker` mode where computations run locally in docker containers\n", + "- `remote` mode where computations run remotely (you need to have a deployed platform for that)\n", + "\n", + "To run in remote mode, use the following syntax:\n", + "\n", + "`client_remote = Client(backend_type=\"remote\", url=\"MY_BACKEND_URL\", username=\"my-username\", password=\"my-password\")`\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "# Create a dictionary to easily access each client from its human-friendly id\n", + "clients = {\n", + " client_0.organization_info().organization_id: client_0,\n", + " client_1.organization_info().organization_id: client_1,\n", + " client_2.organization_info().organization_id: client_2,\n", + "}\n", + "\n", + "# Store organization IDs\n", + "ORGS_ID = list(clients)\n", + "# Algo provider is defined as the first organization.\n", + "ALGO_ORG_ID = ORGS_ID[0]\n", + "# All organizations provide data in this cyclic setup.\n", + "DATA_PROVIDER_ORGS_ID = ORGS_ID" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Data and metrics\n", + "\n", + "### Data 
preparation\n", + "\n", + "This section downloads (if needed) the **MNIST dataset** using the [torchvision library](https://pytorch.org/vision/stable/index.html).\n", + "It extracts the images from the raw files and locally creates a folder for each\n", + "organization.\n", + "\n", + "Each organization will have access to a third of the training data and a third of the test data (which\n", + "corresponds to **20,000**\n", + "images for training and about **3,333** for testing each).\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "import pathlib\n", + "from torch_cyclic_assets.dataset.cyclic_mnist_dataset import setup_mnist\n", + "\n", + "\n", + "# Create the temporary directory for generated data\n", + "(pathlib.Path.cwd() / \"tmp\").mkdir(exist_ok=True)\n", + "data_path = pathlib.Path.cwd() / \"tmp\" / \"data_mnist\"\n", + "\n", + "setup_mnist(data_path, len(DATA_PROVIDER_ORGS_ID))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Dataset registration\n", + "\n", + "A [Dataset](https://docs.substra.org/en/stable/documentation/concepts.html#dataset) is composed of an **opener**, which is a Python script that loads\n", + "the data from the files into memory, and a description markdown file.\n", + "The [Dataset](https://docs.substra.org/en/stable/documentation/concepts.html#dataset) object itself does not contain the data. The proper asset that contains the\n", + "data is the **datasample asset**.\n", + "\n", + "A **datasample** contains a local path to the data. A datasample can be linked to a dataset in order to add data to a\n", + "dataset.\n", + "\n", + "Data privacy is a key concept for Federated Learning experiments. 
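The per-organization split described above can be pictured with a toy stand-in for `setup_mnist` (hypothetical helper; the real function also downloads and extracts the raw MNIST files):

```python
import numpy as np

def split_between_orgs(images, labels, n_orgs):
    # Round-robin split: organization i gets samples i, i+n_orgs, i+2*n_orgs, ...
    return [(images[i::n_orgs], labels[i::n_orgs]) for i in range(n_orgs)]

# 60 fake 28x28 "images" split across 3 organizations -> 20 each.
images = np.zeros((60, 28, 28))
labels = np.arange(60) % 10
shards = split_between_orgs(images, labels, n_orgs=3)
print([len(x) for x, _ in shards])  # -> [20, 20, 20]
```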
That is why we set\n", + "[Permissions](https://docs.substra.org/en/stable/documentation/concepts.html#permissions) for [Assets](https://docs.substra.org/en/stable/documentation/concepts.html#assets) to determine how each organization\n", + "can access a specific asset.\n", + "You can read more about these concepts in the [User Guide](https://docs.substra.org/en/stable/documentation/concepts.html).\n", + "\n", + "Note that metadata such as the assets' creation date and the asset owner are visible to all the organizations of a\n", + "network.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "from substra.sdk.schemas import DatasetSpec\n", + "from substra.sdk.schemas import Permissions\n", + "from substra.sdk.schemas import DataSampleSpec\n", + "\n", + "assets_directory = pathlib.Path.cwd() / \"torch_cyclic_assets\"\n", + "dataset_keys = {}\n", + "train_datasample_keys = {}\n", + "test_datasample_keys = {}\n", + "\n", + "for i, org_id in enumerate(DATA_PROVIDER_ORGS_ID):\n", + " client = clients[org_id]\n", + "\n", + " permissions_dataset = Permissions(public=False, authorized_ids=[ALGO_ORG_ID])\n", + "\n", + " # DatasetSpec is the specification of a dataset. 
It makes sure every field\n", + " # is well-defined, and that our dataset is ready to be registered.\n", + " # The real dataset object is created in the add_dataset method.\n", + "\n", + " dataset = DatasetSpec(\n", + " name=\"MNIST\",\n", + " type=\"npy\",\n", + " data_opener=assets_directory / \"dataset\" / \"cyclic_mnist_opener.py\",\n", + " description=assets_directory / \"dataset\" / \"description.md\",\n", + " permissions=permissions_dataset,\n", + " logs_permission=permissions_dataset,\n", + " )\n", + " dataset_keys[org_id] = client.add_dataset(dataset)\n", + " assert dataset_keys[org_id], \"Missing dataset key\"\n", + "\n", + " # Add the training data on each organization.\n", + " data_sample = DataSampleSpec(\n", + " data_manager_keys=[dataset_keys[org_id]],\n", + " path=data_path / f\"org_{i+1}\" / \"train\",\n", + " )\n", + " train_datasample_keys[org_id] = client.add_data_sample(data_sample)\n", + "\n", + " # Add the testing data on each organization.\n", + " data_sample = DataSampleSpec(\n", + " data_manager_keys=[dataset_keys[org_id]],\n", + " path=data_path / f\"org_{i+1}\" / \"test\",\n", + " )\n", + " test_datasample_keys[org_id] = client.add_data_sample(data_sample)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Metrics definition\n", + "\n", + "A metric is a function used to evaluate the performance of your model on one or several\n", + "**datasamples**.\n", + "\n", + "To add a metric, you need to define a function that computes and returns a performance\n", + "from the datasamples (as returned by the opener) and the predictions_path (to be loaded within the function).\n", + "\n", + "When using a Torch SubstraFL algorithm, the predictions are saved in the `predict` function in numpy format\n", + "so that you can simply load them using `np.load`.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "from sklearn.metrics 
import accuracy_score\n", + "from sklearn.metrics import roc_auc_score\n", + "import numpy as np\n", + "\n", + "\n", + "def accuracy(datasamples, predictions_path):\n", + " y_true = datasamples[\"labels\"]\n", + " y_pred = np.load(predictions_path)\n", + "\n", + " return accuracy_score(y_true, np.argmax(y_pred, axis=1))\n", + "\n", + "\n", + "def roc_auc(datasamples, predictions_path):\n", + " y_true = datasamples[\"labels\"]\n", + " y_pred = np.load(predictions_path)\n", + "\n", + " n_class = np.max(y_true) + 1\n", + " y_true_one_hot = np.eye(n_class)[y_true]\n", + "\n", + " return roc_auc_score(y_true_one_hot, y_pred)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Machine learning components definition\n", + "\n", + "This section uses the PyTorch based SubstraFL API to simplify the definition of machine learning components.\n", + "However, SubstraFL is compatible with any machine learning framework.\n", + "\n", + "\n", + "In this section, you will:\n", + "\n", + "- Register a model and its dependencies\n", + "- Create a federated learning strategy\n", + "- Specify the training and aggregation nodes\n", + "- Specify the test nodes\n", + "- Actually run the computations\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Model definition\n", + "\n", + "We choose to use a classic torch CNN as the model to train. 
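The `np.eye(n_class)[y_true]` trick used in the `roc_auc` metric above turns integer labels into the one-hot matrix that `roc_auc_score` expects for multi-class scores; a minimal sketch, assuming labels in `0..n_class-1`:

```python
import numpy as np

# Toy integer labels for a 3-class problem.
y_true = np.array([0, 2, 1, 2])

# Indexing the identity matrix with the label vector yields one row
# of the identity (i.e. the one-hot encoding) per sample.
n_class = np.max(y_true) + 1
y_true_one_hot = np.eye(n_class)[y_true]

print(y_true_one_hot)
# -> [[1. 0. 0.]
#     [0. 0. 1.]
#     [0. 1. 0.]
#     [0. 0. 1.]]
```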
The model architecture is defined by the user\n", + "independently of SubstraFL.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "import torch\n", + "from torch import nn\n", + "import torch.nn.functional as F\n", + "\n", + "seed = 42\n", + "torch.manual_seed(seed)\n", + "\n", + "\n", + "class CNN(nn.Module):\n", + " def __init__(self):\n", + " super(CNN, self).__init__()\n", + " self.conv1 = nn.Conv2d(1, 32, kernel_size=5)\n", + " self.conv2 = nn.Conv2d(32, 32, kernel_size=5)\n", + " self.conv3 = nn.Conv2d(32, 64, kernel_size=5)\n", + " self.fc1 = nn.Linear(3 * 3 * 64, 256)\n", + " self.fc2 = nn.Linear(256, 10)\n", + "\n", + " def forward(self, x, eval=False):\n", + " x = F.relu(self.conv1(x))\n", + " x = F.relu(F.max_pool2d(self.conv2(x), 2))\n", + " x = F.dropout(x, p=0.5, training=not eval)\n", + " x = F.relu(F.max_pool2d(self.conv3(x), 2))\n", + " x = F.dropout(x, p=0.5, training=not eval)\n", + " x = x.view(-1, 3 * 3 * 64)\n", + " x = F.relu(self.fc1(x))\n", + " x = F.dropout(x, p=0.5, training=not eval)\n", + " x = self.fc2(x)\n", + " return F.log_softmax(x, dim=1)\n", + "\n", + "\n", + "model = CNN()\n", + "optimizer = torch.optim.Adam(model.parameters(), lr=0.001)\n", + "criterion = torch.nn.CrossEntropyLoss()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Specifying how much data to train on\n", + "\n", + "To specify how much data to train on at each round, we use the `index_generator` object.\n", + "We specify the batch size and the number of batches (named `num_updates`) to consider for each round.\n", + "See [Index Generator](https://docs.substra.org/en/stable/substrafl_doc/substrafl_overview.html#index-generator) for more details.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "from substrafl.index_generator import 
NpIndexGenerator\n", + "\n", + "# Number of model updates between each FL strategy aggregation.\n", + "NUM_UPDATES = 100\n", + "\n", + "# Number of samples per update.\n", + "BATCH_SIZE = 32\n", + "\n", + "index_generator = NpIndexGenerator(\n", + "    batch_size=BATCH_SIZE,\n", + "    num_updates=NUM_UPDATES,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Torch Dataset definition\n", + "\n", + "This torch Dataset is used to preprocess the data using the `__getitem__` function.\n", + "\n", + "This torch Dataset needs to have a specific `__init__` signature, that must contain `(self, datasamples, is_inference)`.\n", + "\n", + "The `__getitem__` function is expected to return `(inputs, outputs)` if `is_inference` is `False`, else only the inputs.\n", + "This behavior can be changed by rewriting the `_local_train` or `predict` methods.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "class TorchDataset(torch.utils.data.Dataset):\n", + "    def __init__(self, datasamples, is_inference: bool):\n", + "        self.x = datasamples[\"images\"]\n", + "        self.y = datasamples[\"labels\"]\n", + "        self.is_inference = is_inference\n", + "\n", + "    def __getitem__(self, idx):\n", + "        if self.is_inference:\n", + "            x = torch.FloatTensor(self.x[idx][None, ...]) / 255\n", + "            return x\n", + "\n", + "        else:\n", + "            x = torch.FloatTensor(self.x[idx][None, ...]) / 255\n", + "\n", + "            y = torch.tensor(self.y[idx]).type(torch.int64)\n", + "            y = F.one_hot(y, 10)\n", + "            y = y.type(torch.float32)\n", + "\n", + "            return x, y\n", + "\n", + "    def __len__(self):\n", + "        return len(self.x)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Cyclic Strategy implementation\n", + "\n", + "An FL strategy specifies how to train a model on distributed data.\n", + "\n", + "The **Cyclic Strategy** passes the model from an organization to the next one, until 
all\n", + "the data available in Substra has been sequentially presented to the model.\n", + "\n", + "This is not the most efficient strategy. The model will overfit the last dataset it sees,\n", + "and the order of training will impact the performance of the model. But we will use this implementation\n", + "as an example to explain and show how to implement your own strategies using SubstraFL.\n", + "\n", + "To implement this new strategy, we need to overwrite three methods:\n", + "\n", + "- `initialization_round`, to indicate what tasks to execute at round 0, in order to set up the variables\n", + "  and be able to compute the performance of the model before any training.\n", + "- `perform_round`, to indicate what tasks to compute, and in which order, to execute a round of the strategy.\n", + "- `perform_predict`, to indicate how to compute the predictions and performances.\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "from typing import Any\n", + "from typing import List\n", + "from typing import Optional\n", + "\n", + "from substrafl import strategies\n", + "from substrafl.algorithms.algo import Algo\n", + "from substrafl.nodes.aggregation_node import AggregationNode\n", + "from substrafl.nodes.test_data_node import TestDataNode\n", + "from substrafl.nodes.train_data_node import TrainDataNode\n", + "\n", + "\n", + "class CyclicStrategy(strategies.Strategy):\n", + "    \"\"\"The base class Strategy proposes a default compute plan structure\n", + "    in its ``build_compute_plan`` method implementation, dedicated to Federated Learning compute plans.\n", + "    This method calls ``initialization_round`` at round 0, and then repeats ``perform_round`` for ``num_rounds``.\n", + "\n", + "    The default ``build_compute_plan`` implementation also takes into account the given evaluation\n", + "    strategy to trigger the test tasks when needed.\n", + "    \"\"\"\n", +
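Stripped of Substra's task-graph machinery, the control flow that `initialization_round` and `perform_round` set up boils down to handing one model around the ring of organizations; a toy in-memory sketch (the `local_train` helper is a hypothetical placeholder for a real training step):

```python
def local_train(model, shard):
    # Placeholder for a real local training step: record what the model "saw".
    model["n_samples_seen"] += len(shard["data"])
    model["visits"].append(shard["org"])
    return model

def run_cyclic(shards, num_rounds):
    # Round 0: the model only needs to be initialized on the first node.
    model = {"n_samples_seen": 0, "visits": []}
    for _ in range(num_rounds):
        # One round = one full cycle: the output of each local training
        # is handed directly to the next organization in the list.
        for shard in shards:
            model = local_train(model, shard)
    return model

shards = [{"org": f"org-{i + 1}", "data": list(range(10))} for i in range(3)]
final = run_cyclic(shards, num_rounds=2)
print(final["visits"])
# -> ['org-1', 'org-2', 'org-3', 'org-1', 'org-2', 'org-3']
```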
"\n", + "    def __init__(self, algo: Algo, *args, **kwargs):\n", + "        \"\"\"\n", + "        It is possible to add any arguments to a Strategy. It is important to pass these arguments as\n", + "        args or kwargs to the parent class, using the super().__init__(...) method.\n", + "        Indeed, SubstraFL does not use the instance of the object. It re-instantiates it at each new task\n", + "        using the args and kwargs passed to the parent class, and uses the save and load local state methods to retrieve\n", + "        its state.\n", + "\n", + "        Args:\n", + "            algo (Algo): A Strategy takes an Algo as argument, in order to deal with framework-\n", + "                specific functions in a dedicated object.\n", + "        \"\"\"\n", + "        super().__init__(algo=algo, *args, **kwargs)\n", + "\n", + "        self._cyclic_local_state = None\n", + "        self._cyclic_shared_state = None\n", + "\n", + "    @property\n", + "    def name(self) -> str:\n", + "        \"\"\"The name of the strategy. Useful to indicate which Algos\n", + "        are or aren't compatible with this strategy.\n", + "\n", + "        Returns:\n", + "            str: Name of the strategy\n", + "        \"\"\"\n", + "        return \"Cyclic Strategy\"\n", + "\n", + "    def initialization_round(\n", + "        self,\n", + "        *,\n", + "        train_data_nodes: List[TrainDataNode],\n", + "        clean_models: bool,\n", + "        round_idx: Optional[int] = 0,\n", + "        additional_orgs_permissions: Optional[set] = None,\n", + "    ):\n", + "        \"\"\"The ``initialization_round`` function is called at round 0 by the\n", + "        ``build_compute_plan`` method. 
In our strategy, we want to initialize\n", + "        ``_cyclic_local_state`` in order to be able to test the model before\n", + "        any training.\n", + "\n", + "        We only initialize the model on the first train data node.\n", + "\n", + "        Args:\n", + "            train_data_nodes (List[TrainDataNode]): Train data nodes representing the different\n", + "                organizations containing data we want to train on.\n", + "            clean_models (bool): Boolean to indicate if we want to keep intermediate shared states.\n", + "                Only taken into account in ``remote`` mode.\n", + "            round_idx (Optional[int], optional): Current round index. The initialization round is zero by default,\n", + "                but you are free to change it in the ``build_compute_plan`` method. Defaults to 0.\n", + "            additional_orgs_permissions (Optional[set], optional): additional organization ids that could\n", + "                have access to the outputs of the task. In our case, this corresponds to the organization\n", + "                containing test data nodes, in order to provide access to the model and to allow\n", + "                using it on the test data.\n", + "        \"\"\"\n", + "        first_train_data_node = train_data_nodes[0]\n", + "\n", + "        # The algo.initialize method is an empty method useful to load all Python objects to the platform.\n", + "        self._cyclic_local_state = first_train_data_node.init_states(\n", + "            operation=self.algo.initialize(\n", + "                _algo_name=f\"Initializing with {self.algo.__class__.__name__}\",\n", + "            ),\n", + "            round_idx=round_idx,\n", + "            authorized_ids=set([first_train_data_node.organization_id]) | additional_orgs_permissions,\n", + "            clean_models=clean_models,\n", + "        )\n", + "\n", + "    def perform_round(\n", + "        self,\n", + "        *,\n", + "        train_data_nodes: List[TrainDataNode],\n", + "        aggregation_node: Optional[AggregationNode],\n", + "        round_idx: int,\n", + "        clean_models: bool,\n", + "        additional_orgs_permissions: Optional[set] = None,\n", + "    ):\n", + "        \"\"\"This method is called at each round to perform a series of tasks. 
For the cyclic\n", + "        strategy we want to design, a round is a full cycle over the different train data\n", + "        nodes.\n", + "        We link the output of a computed task directly to the next one.\n", + "\n", + "        Args:\n", + "            train_data_nodes (List[TrainDataNode]): Train data nodes representing the different\n", + "                organizations containing data we want to train on.\n", + "            aggregation_node (Optional[AggregationNode]): In the case of the Cyclic Strategy, there are no\n", + "                aggregation tasks, so there is no need for an AggregationNode.\n", + "            clean_models (bool): Boolean to indicate if we want to keep intermediate shared states.\n", + "                Only taken into account in ``remote`` mode.\n", + "            round_idx (Optional[int], optional): Current round index.\n", + "            additional_orgs_permissions (Optional[set], optional): additional organization ids that could\n", + "                have access to the outputs of the task. In our case, this will correspond to the organization\n", + "                containing test data nodes, in order to provide access to the model and to allow using\n", + "                it on the test data.\n", + "        \"\"\"\n", + "        for i, node in enumerate(train_data_nodes):\n", + "            # We get the next train_data_node in order to add the organization of the node\n", + "            # to the authorized_ids\n", + "            next_train_data_node = train_data_nodes[(i + 1) % len(train_data_nodes)]\n", + "\n", + "            self._cyclic_local_state, self._cyclic_shared_state = node.update_states(\n", + "                operation=self.algo.train(\n", + "                    node.data_sample_keys,\n", + "                    shared_state=self._cyclic_shared_state,\n", + "                    _algo_name=f\"Training with {self.algo.__class__.__name__}\",\n", + "                ),\n", + "                local_state=self._cyclic_local_state,\n", + "                round_idx=round_idx,\n", + "                authorized_ids=set([next_train_data_node.organization_id]) | additional_orgs_permissions,\n", + "                aggregation_id=None,\n", + "                clean_models=clean_models,\n", + "            )\n", + "\n", + "    def perform_predict(\n", + "        self,\n", + "        test_data_nodes: List[TestDataNode],\n", + "        train_data_nodes:
List[TrainDataNode],\n", + "        round_idx: int,\n", + "    ):\n", + "        \"\"\"This method is called according to the given evaluation strategy. If the round is included\n", + "        in the evaluation strategy, the ``perform_predict`` method will be called on the different concerned nodes.\n", + "\n", + "        We are using the last computed ``_cyclic_local_state`` to feed the test task, which means that we will\n", + "        always test the model after its training on the last train data node of the list.\n", + "\n", + "        Args:\n", + "            test_data_nodes (List[TestDataNode]): List of all the registered test data nodes containing data\n", + "                we want to test on.\n", + "            train_data_nodes (List[TrainDataNode]): List of all the registered train data nodes.\n", + "            round_idx (int): Current round index.\n", + "        \"\"\"\n", + "        for test_node in test_data_nodes:\n", + "            test_node.update_states(\n", + "                traintask_id=self._cyclic_local_state.key,\n", + "                operation=self.algo.predict(\n", + "                    data_samples=test_node.test_data_sample_keys,\n", + "                    _algo_name=f\"Predicting with {self.algo.__class__.__name__}\",\n", + "                ),\n", + "                round_idx=round_idx,\n", + "            )" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Torch Cyclic Algo implementation\n", + "\n", + "A SubstraFL Algo gathers all the defined elements that run locally in each organization.\n", + "This is the only SubstraFL object that is framework-specific (here PyTorch-specific).\n", + "\n", + "In the case of our **Cyclic Strategy**, we need to use the TorchAlgo base class, and\n", + "overwrite the `strategies` property and the `train` method to ensure that we output\n", + "the shared state we need for our Federated Learning compute plan.\n", + "\n", + "For the **Cyclic Strategy**, the **shared state** will be directly the **model parameters**.
We will\n", + "retrieve the model from the shared state we receive and send the new parameters updated after\n", + "the local training.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "from substrafl.algorithms.pytorch.torch_base_algo import TorchAlgo\n", + "from substrafl.remote import remote_data\n", + "from substrafl.algorithms.pytorch import weight_manager\n", + "\n", + "\n", + "class TorchCyclicAlgo(TorchAlgo):\n", + "    \"\"\"We create here our cyclic Algo, inheriting from the TorchAlgo base class.\n", + "    An Algo is a SubstraFL object that contains all framework-specific functions.\n", + "    \"\"\"\n", + "\n", + "    def __init__(\n", + "        self,\n", + "        model: torch.nn.Module,\n", + "        criterion: torch.nn.modules.loss._Loss,\n", + "        optimizer: torch.optim.Optimizer,\n", + "        index_generator: NpIndexGenerator,\n", + "        dataset: torch.utils.data.Dataset,\n", + "        seed: Optional[int] = None,\n", + "        use_gpu: bool = True,\n", + "        *args,\n", + "        **kwargs,\n", + "    ):\n", + "        \"\"\"It is possible to add any arguments to an Algo. It is important to pass these arguments as\n", + "        args or kwargs to the parent class, using the super().__init__(...) method.\n", + "        Indeed, SubstraFL does not use the instance of the object.
It re-instantiates it at each new task\n", + "        using the args and kwargs passed to the parent class, and the save and load local state methods to retrieve the\n", + "        right state.\n", + "\n", + "        Args:\n", + "            model (torch.nn.modules.module.Module): A torch model.\n", + "            criterion (torch.nn.modules.loss._Loss): A torch criterion (loss).\n", + "            optimizer (torch.optim.Optimizer): A torch optimizer linked to the model.\n", + "            index_generator (BaseIndexGenerator): a stateful index generator.\n", + "            dataset (torch.utils.data.Dataset): an instantiable dataset class whose ``__init__`` arguments are\n", + "                ``x``, ``y`` and ``is_inference``.\n", + "            seed (typing.Optional[int]): Seed set at the algo initialization on each organization. Defaults to None.\n", + "            use_gpu (bool): Whether to use the GPUs if they are available. Defaults to True.\n", + "        \"\"\"\n", + "        super().__init__(\n", + "            model=model,\n", + "            criterion=criterion,\n", + "            optimizer=optimizer,\n", + "            index_generator=index_generator,\n", + "            dataset=dataset,\n", + "            scheduler=None,\n", + "            seed=seed,\n", + "            use_gpu=use_gpu,\n", + "            *args,\n", + "            **kwargs,\n", + "        )\n", + "\n", + "    @property\n", + "    def strategies(self) -> List[str]:\n", + "        \"\"\"List of compatible strategies.\n", + "\n", + "        Returns:\n", + "            List[str]: list of compatible strategy names.\n", + "        \"\"\"\n", + "        return [\"Cyclic Strategy\"]\n", + "\n", + "    @remote_data\n", + "    def train(\n", + "        self,\n", + "        datasamples: Any,\n", + "        shared_state: Optional[dict] = None,\n", + "    ) -> dict:\n", + "        \"\"\"This method, decorated with ``@remote_data``, is executed inside\n", + "        the train tasks of our strategy.\n", + "        The decorator is used to retrieve the entire Algo object inside the task, to be able to access all values\n", + "        useful for the training (such as the model, the optimizer, etc...).\n", + "        The objective is to perform the local training on the given data samples, and send the right shared state\n", + "        to the next
task.\n", + "\n", + "        Args:\n", + "            datasamples (Any): datasamples are the output of the ``get_data`` method of an opener. This opener\n", + "                accesses the data of a train data node, and transforms it to feed the methods decorated with\n", + "                ``@remote_data``.\n", + "            shared_state (Optional[dict], optional): a shared state is a dictionary containing the necessary values\n", + "                from the previous trainings of the compute plan, used to initialize the model. In our case,\n", + "                the shared state is the model parameters obtained after the local training on the previous organization.\n", + "                The shared state is equal to None if it is the first training of the compute plan.\n", + "\n", + "        Returns:\n", + "            dict: returns a dict corresponding to the shared state that will be used by the next train function on\n", + "                a different organization.\n", + "        \"\"\"\n", + "        # Create torch dataset\n", + "        train_dataset = self._dataset(datasamples, is_inference=False)\n", + "\n", + "        if self._index_generator.n_samples is None:\n", + "            # We need to initialize the index generator's number of samples the first time we have access to\n", + "            # this information.\n", + "            self._index_generator.n_samples = len(train_dataset)\n", + "\n", + "        # If the shared state is None, it means that this is the first training of the compute plan,\n", + "        # and that we don't have a shared state to take into account yet.\n", + "        if shared_state is not None:\n", + "            assert self._index_generator.n_samples is not None\n", + "            # The shared state contains the model parameters trained on the previous organization.
We set\n", + "            # the model to these updated values.\n", + "            model_parameters = [torch.from_numpy(x).to(self._device) for x in shared_state[\"model_parameters\"]]\n", + "            weight_manager.set_parameters(\n", + "                model=self._model,\n", + "                parameters=model_parameters,\n", + "                with_batch_norm_parameters=False,\n", + "            )\n", + "\n", + "        # We set the counter of updates to zero.\n", + "        self._index_generator.reset_counter()\n", + "\n", + "        # Train mode for torch model.\n", + "        self._model.train()\n", + "\n", + "        # Train the model.\n", + "        self._local_train(train_dataset)\n", + "\n", + "        # We verify that we trained the model on the right number of updates.\n", + "        self._index_generator.check_num_updates()\n", + "\n", + "        # Eval mode for torch model.\n", + "        self._model.eval()\n", + "\n", + "        # We get the new model parameter values in order to send them in the shared states.\n", + "        model_parameters = weight_manager.get_parameters(model=self._model, with_batch_norm_parameters=False)\n", + "        new_shared_state = {\"model_parameters\": [p.cpu().detach().numpy() for p in model_parameters]}\n", + "\n", + "        return new_shared_state" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To instantiate your algo, you need to wrap it in a class that takes no arguments. This constraint only applies when you\n", + "inherit from the TorchAlgo base class.\n", + "\n", + "The `TorchDataset` is passed **as a class** to the `TorchAlgo`.\n", + "Indeed, this `TorchDataset` will be instantiated directly on the data provider organization.\n", + "\n", + "> **⚠ WARNING** \n", + "> It is possible to add any arguments to an Algo or a Strategy. It is important to pass these arguments as\n", + "> args or kwargs to the parent class, using the `super().__init__(...)` method.\n", + ">\n", + "> Indeed, SubstraFL does not use the instance of the object.
It **re-instantiates** it at each new task\n", + "> using the args and kwargs passed to the parent class, and the save and load local state methods to retrieve the\n", + "> right state.\n", + "\n", + "To summarize, the `Algo` is the place to put all the framework-specific code we want to apply in tasks. These are often\n", + "the tasks that need the data to be executed, and that are decorated with `@remote_data`.\n", + "\n", + "The `Strategy` contains the non-framework-specific code: the `build_compute_plan` method, which creates the\n", + "graph of tasks, and the **initialization round**, **perform round** and **perform predict** methods, which link tasks to\n", + "each other and link the functions to the nodes.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "class MyAlgo(TorchCyclicAlgo):\n", + "    def __init__(self):\n", + "        super().__init__(\n", + "            model=model,\n", + "            criterion=criterion,\n", + "            optimizer=optimizer,\n", + "            index_generator=index_generator,\n", + "            dataset=TorchDataset,\n", + "            seed=seed,\n", + "        )\n", + "\n", + "\n", + "strategy = CyclicStrategy(algo=MyAlgo())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Where to train, where to aggregate\n", + "\n", + "We specify on which data we want to train our model, using the [TrainDataNode](https://docs.substra.org/en/stable/substrafl_doc/api/nodes.html#traindatanode) object.\n", + "Here we train on the two datasets that we have registered earlier.\n", + "\n", + "The [AggregationNode](https://docs.substra.org/en/stable/substrafl_doc/api/nodes.html#aggregationnode) specifies the organization on which the aggregation operation\n", + "will be computed.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "from substrafl.nodes import TrainDataNode\n", + "\n", + "# Create the Train
Data Nodes (or training tasks) and save them in a list\n", + "train_data_nodes = [\n", + "    TrainDataNode(\n", + "        organization_id=org_id,\n", + "        data_manager_key=dataset_keys[org_id],\n", + "        data_sample_keys=[train_datasample_keys[org_id]],\n", + "    )\n", + "    for org_id in DATA_PROVIDER_ORGS_ID\n", + "]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Where and when to test\n", + "\n", + "With the same logic as the train nodes, we create [TestDataNode](https://docs.substra.org/en/stable/substrafl_doc/api/nodes.html#testdatanode) to specify on which\n", + "data we want to test our model.\n", + "\n", + "The [Evaluation Strategy](https://docs.substra.org/en/stable/substrafl_doc/api/evaluation_strategy.html) defines where and at which frequency we\n", + "evaluate the model, using the given metric(s) that you registered in a previous section.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "from substrafl.nodes import TestDataNode\n", + "from substrafl.evaluation_strategy import EvaluationStrategy\n", + "\n", + "# Create the Test Data Nodes (or testing tasks) and save them in a list\n", + "test_data_nodes = [\n", + "    TestDataNode(\n", + "        organization_id=org_id,\n", + "        data_manager_key=dataset_keys[org_id],\n", + "        test_data_sample_keys=[test_datasample_keys[org_id]],\n", + "        metric_functions={\"Accuracy\": accuracy, \"ROC AUC\": roc_auc},\n", + "    )\n", + "    for org_id in DATA_PROVIDER_ORGS_ID\n", + "]\n", + "\n", + "\n", + "# Test at the end of every round\n", + "my_eval_strategy = EvaluationStrategy(test_data_nodes=test_data_nodes, eval_frequency=1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Running the experiment\n", + "\n", + "As a last step before launching our experiment, we need to specify the third-party dependencies required to run it.\n", + "The
[Dependency](https://docs.substra.org/en/stable/substrafl_doc/api/dependency.html) object is instantiated in order to install the right libraries in\n", + "the Python environment of each organization.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "from substrafl.dependency import Dependency\n", + "\n", + "dependencies = Dependency(pypi_dependencies=[\"numpy==1.23.1\", \"torch==1.11.0\", \"scikit-learn==1.1.1\"])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We now have all the necessary objects to launch our experiment. Please see a summary below of all the objects we created so far:\n", + "\n", + "- A [Client](https://docs.substra.org/en/stable/documentation/references/sdk.html#client) to add or retrieve the assets of our experiment, using their keys to identify them.\n", + "- A [Torch Algorithm](https://docs.substra.org/en/stable/substrafl_doc/api/algorithms.html#torch-algorithms) to define the training parameters *(optimizer, train function, predict function, etc.)*.\n", + "- A [Strategy](https://docs.substra.org/en/stable/substrafl_doc/api/strategies.html#strategies), to specify how to train the model on distributed data.\n", + "- [Train data nodes](https://docs.substra.org/en/stable/substrafl_doc/api/nodes.html#traindatanode) to indicate on which data to train.\n", + "- An [Evaluation Strategy](https://docs.substra.org/en/stable/substrafl_doc/api/evaluation_strategy.html#evaluation-strategy), to define where and at which frequency we evaluate the model.\n", + "- An [Aggregation Node](https://docs.substra.org/en/stable/substrafl_doc/api/nodes.html#aggregationnode), to specify the organization on which the aggregation operation will be computed.\n", + "- The **number of rounds**, a round being defined by a local training step followed by an aggregation operation.\n", + "- An **experiment folder** to save a summary of the
operations performed.\n", + "- The [Dependency](https://docs.substra.org/en/stable/substrafl_doc/api/dependency.html) to define the libraries on which the experiment needs to run." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "from substrafl.experiment import execute_experiment\n", + "\n", + "# A round is defined by a local training step followed by an aggregation operation\n", + "NUM_ROUNDS = 3\n", + "\n", + "compute_plan = execute_experiment(\n", + "    client=clients[ALGO_ORG_ID],\n", + "    strategy=strategy,\n", + "    train_data_nodes=train_data_nodes,\n", + "    evaluation_strategy=my_eval_strategy,\n", + "    aggregation_node=None,\n", + "    num_rounds=NUM_ROUNDS,\n", + "    experiment_folder=str(pathlib.Path.cwd() / \"tmp\" / \"experiment_summaries\"),\n", + "    dependencies=dependencies,\n", + "    clean_models=False,\n", + "    name=\"Cyclic MNIST documentation example\",\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Explore the results" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "# The results will be available once the compute plan is completed\n", + "client_0.wait_compute_plan(compute_plan.key)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### List results" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "import pandas as pd\n", + "\n", + "performances_df = pd.DataFrame(client.get_performances(compute_plan.key).dict())\n", + "print(\"\\nPerformance Table: \\n\")\n", + "print(performances_df[[\"worker\", \"round_idx\", \"identifier\", \"performance\"]])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Plot results" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed":
false + }, + "outputs": [], + "source": [ + "import matplotlib.pyplot as plt\n", + "\n", + "fig, axs = plt.subplots(1, 2, figsize=(12, 6))\n", + "fig.suptitle(\"Test dataset results\")\n", + "\n", + "axs[0].set_title(\"Accuracy\")\n", + "axs[1].set_title(\"ROC AUC\")\n", + "\n", + "for ax in axs.flat:\n", + " ax.set(xlabel=\"Rounds\", ylabel=\"Score\")\n", + "\n", + "\n", + "for org_id in DATA_PROVIDER_ORGS_ID:\n", + " org_df = performances_df[performances_df[\"worker\"] == org_id]\n", + " acc_df = org_df[org_df[\"identifier\"] == \"Accuracy\"]\n", + " axs[0].plot(acc_df[\"round_idx\"], acc_df[\"performance\"], label=org_id)\n", + "\n", + " auc_df = org_df[org_df[\"identifier\"] == \"ROC AUC\"]\n", + " axs[1].plot(auc_df[\"round_idx\"], auc_df[\"performance\"], label=org_id)\n", + "\n", + "plt.legend(loc=\"lower right\")\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Download a model\n", + "\n", + "After the experiment, you might be interested in downloading your trained model.\n", + "To do so, you will need the source code in order to reload your code architecture in memory.\n", + "You have the option to choose the client and the round you are interested in downloading.\n", + "\n", + "If `round_idx` is set to `None`, the last round will be selected by default.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "from substrafl.model_loading import download_algo_state\n", + "\n", + "client_to_download_from = DATA_PROVIDER_ORGS_ID[-1]\n", + "round_idx = None\n", + "\n", + "algo = download_algo_state(\n", + " client=clients[client_to_download_from],\n", + " compute_plan_key=compute_plan.key,\n", + " round_idx=round_idx,\n", + ")\n", + "\n", + "model = algo.model\n", + "\n", + "print(model)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + 
"language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.17" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} diff --git a/examples/substrafl/go_further/sklearn_fedavg_assets/dataset/description.md b/docs/source/examples/substrafl/go_further/sklearn_fedavg_assets/dataset/description.md similarity index 100% rename from examples/substrafl/go_further/sklearn_fedavg_assets/dataset/description.md rename to docs/source/examples/substrafl/go_further/sklearn_fedavg_assets/dataset/description.md diff --git a/examples/substrafl/go_further/sklearn_fedavg_assets/dataset/iris_dataset.py b/docs/source/examples/substrafl/go_further/sklearn_fedavg_assets/dataset/iris_dataset.py similarity index 100% rename from examples/substrafl/go_further/sklearn_fedavg_assets/dataset/iris_dataset.py rename to docs/source/examples/substrafl/go_further/sklearn_fedavg_assets/dataset/iris_dataset.py diff --git a/examples/substrafl/go_further/sklearn_fedavg_assets/dataset/iris_opener.py b/docs/source/examples/substrafl/go_further/sklearn_fedavg_assets/dataset/iris_opener.py similarity index 100% rename from examples/substrafl/go_further/sklearn_fedavg_assets/dataset/iris_opener.py rename to docs/source/examples/substrafl/go_further/sklearn_fedavg_assets/dataset/iris_opener.py diff --git a/examples/substrafl/go_further/sklearn_fedavg_assets/requirements.txt b/docs/source/examples/substrafl/go_further/sklearn_fedavg_assets/requirements.txt similarity index 100% rename from examples/substrafl/go_further/sklearn_fedavg_assets/requirements.txt rename to docs/source/examples/substrafl/go_further/sklearn_fedavg_assets/requirements.txt diff --git a/examples/substrafl/go_further/torch_cyclic_assets/dataset/cyclic_mnist_dataset.py 
b/docs/source/examples/substrafl/go_further/torch_cyclic_assets/dataset/cyclic_mnist_dataset.py similarity index 100% rename from examples/substrafl/go_further/torch_cyclic_assets/dataset/cyclic_mnist_dataset.py rename to docs/source/examples/substrafl/go_further/torch_cyclic_assets/dataset/cyclic_mnist_dataset.py diff --git a/examples/substrafl/go_further/torch_cyclic_assets/dataset/cyclic_mnist_opener.py b/docs/source/examples/substrafl/go_further/torch_cyclic_assets/dataset/cyclic_mnist_opener.py similarity index 100% rename from examples/substrafl/go_further/torch_cyclic_assets/dataset/cyclic_mnist_opener.py rename to docs/source/examples/substrafl/go_further/torch_cyclic_assets/dataset/cyclic_mnist_opener.py diff --git a/examples/substrafl/go_further/torch_cyclic_assets/dataset/description.md b/docs/source/examples/substrafl/go_further/torch_cyclic_assets/dataset/description.md similarity index 100% rename from examples/substrafl/go_further/torch_cyclic_assets/dataset/description.md rename to docs/source/examples/substrafl/go_further/torch_cyclic_assets/dataset/description.md diff --git a/examples/substrafl/go_further/torch_cyclic_assets/requirements.txt b/docs/source/examples/substrafl/go_further/torch_cyclic_assets/requirements.txt similarity index 100% rename from examples/substrafl/go_further/torch_cyclic_assets/requirements.txt rename to docs/source/examples/substrafl/go_further/torch_cyclic_assets/requirements.txt diff --git a/docs/source/examples/substrafl/index.rst b/docs/source/examples/substrafl/index.rst new file mode 100644 index 00000000..1a503360 --- /dev/null +++ b/docs/source/examples/substrafl/index.rst @@ -0,0 +1,19 @@ +SubstraFL examples +================== + +The examples below are compatible with SubstraFL |substrafl_version|. + + +Example to get started using the PyTorch interface +************************************************** + +.. nbgallery:: + get_started/run_mnist_torch.ipynb + +Example to go further +********************* + +.. 
nbgallery:: + go_further/run_iris_sklearn.ipynb + go_further/run_diabetes_substrafl.ipynb + go_further/run_mnist_cyclic.ipynb diff --git a/docs/source/index.rst b/docs/source/index.rst index 6a927ab1..c68e8e5e 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -93,8 +93,8 @@ Some quick links: :caption: Tutorials :hidden: - examples/substrafl/index - examples/substra_core/index + examples/substrafl/index.rst + examples/substra_core/index.rst .. toctree:: diff --git a/docs/source/static/nbsphinx-gallery.css b/docs/source/static/nbsphinx-gallery.css new file mode 100644 index 00000000..246ce7dc --- /dev/null +++ b/docs/source/static/nbsphinx-gallery.css @@ -0,0 +1,40 @@ +.nbsphinx-gallery { + display: grid; + grid-template-columns: repeat(auto-fill, 180px); + gap: 5px; + margin-top: 1em; + margin-bottom: 1em; +} + +.nbsphinx-gallery > a { + background-image: none; + border: solid #fff 1px; + background-color: #fff; + box-shadow: 0 0 15px rgba(142, 176, 202, 0.2); + border-radius: 5px; + min-height: 230px; + min-width: 180px; + padding: 10px 24px; + text-decoration: none; + color: var(--color-primary-500); + transition: transform 0.2s ease; +} + +.nbsphinx-gallery > a:hover { + border: solid var(--color-primary-500) 1px; + box-shadow: 0 0 15px rgba(142, 176, 202, 0.5); + transform: scale(1.05); +} + +.nbsphinx-gallery img { + max-width: 100%; + max-height: 100%; +} + +.nbsphinx-gallery > a > div:first-child { + display: flex; + align-items: start; + justify-content: center; + height: 120px; + margin-bottom: 5px; +} diff --git a/environment.yml b/environment.yml new file mode 100644 index 00000000..25c9c067 --- /dev/null +++ b/environment.yml @@ -0,0 +1,10 @@ +name: rtd +channels: + - defaults + - conda-forge +dependencies: + - python=3.8 + - pandoc=3.1 + - pip + - pip: + - -r requirements.txt diff --git a/examples/substra_core/README.rst b/examples/substra_core/README.rst deleted file mode 100644 index be41bc60..00000000 --- 
a/examples/substra_core/README.rst +++ /dev/null @@ -1,4 +0,0 @@ -Substra examples -================ - -The examples below are compatible with Substra |substra_version|. diff --git a/examples/substra_core/diabetes_example/README.rst b/examples/substra_core/diabetes_example/README.rst deleted file mode 100644 index 2290961f..00000000 --- a/examples/substra_core/diabetes_example/README.rst +++ /dev/null @@ -1,2 +0,0 @@ -Examples to go further -^^^^^^^^^^^^^^^^^^^^^^ \ No newline at end of file diff --git a/examples/substra_core/diabetes_example/run_diabetes.py b/examples/substra_core/diabetes_example/run_diabetes.py deleted file mode 100644 index 26a0ecd1..00000000 --- a/examples/substra_core/diabetes_example/run_diabetes.py +++ /dev/null @@ -1,510 +0,0 @@ -""" -=========================================== -Federated Analytics on the diabetes dataset -=========================================== - -This example demonstrates how to use the flexibility of the Substra library to do Federated Analytics. - -We use the **Diabetes dataset** available from the `Scikit-Learn dataset module `__. -This dataset contains medical information such as Age, Sex or Blood pressure. -The goal of this example is to compute some analytics such as Age mean, Blood pressure standard deviation or Sex percentage. - -We simulate having two different data organisations, and a third organisation which wants to compute aggregated analytics -without having access to the raw data. The example here runs everything locally; however there is only one parameter to -change to run it on a real network. - -**Caution:** - This example is provided as an illustrative example only. In real life, you should be careful not to - accidentally leak private information when doing Federated Analytics. For example if a column contains very similar values, - sharing its mean and its standard deviation is functionally equivalent to sharing the content of the column. 
- It is **strongly recommended** to consider what are the potential security risks in your use case, and to act accordingly. - It is possible to use other privacy-preserving techniques, such as - `Differential Privacy `_, in addition to Substra. - Because the focus of this example is Substra capabilities and for the sake of simplicity, such safeguards are not implemented here. - - -To run this example, you need to download and unzip the assets needed to run it in the same directory as used this example. - - .. only:: builder_html or readthedocs - - :download:`assets required to run this example <../../../../../tmp/diabetes_assets.zip>` - - Please ensure to have all the libraries installed. A *requirements.txt* file is included in the zip file, where you can run the command ``pip install -r requirements.txt`` to install them. - -""" - -# %% -# Importing all the dependencies -# ============================== - -import os -import zipfile -import pathlib - -import substra -from substra.sdk.schemas import ( - FunctionSpec, - FunctionInputSpec, - FunctionOutputSpec, - AssetKind, - DataSampleSpec, - DatasetSpec, - Permissions, - TaskSpec, - ComputeTaskOutputSpec, - InputRef, -) - -# sphinx_gallery_thumbnail_path = 'static/example_thumbnail/diabetes.png' - -from assets.dataset.diabetes_dataset import setup_diabetes - -# %% -# Instantiating the Substra clients -# ================================= -# -# We work with three different organizations. -# Two organizations provide data, and a third one performs Federate Analytics to compute aggregated statistics without -# having access to the raw datasets. -# -# This example runs in local mode, simulating a federated learning experiment. 
-# - - -# Choose the subprocess mode to locally simulate the FL process -N_CLIENTS = 3 -clients_list = [substra.Client(client_name=f"org-{i+1}") for i in range(N_CLIENTS)] -clients = {client.organization_info().organization_id: client for client in clients_list} - -# Store organization IDs -ORGS_ID = list(clients) - -# The provider of the functions for computing analytics is defined as the first organization. -ANALYTICS_PROVIDER_ORG_ID = ORGS_ID[0] -# Data providers orgs are the two last organizations. -DATA_PROVIDER_ORGS_ID = ORGS_ID[1:] - -# %% -# Creating and registering the assets -# ----------------------------------- -# -# Every asset will be created in respect to predefined schemas (Spec) previously imported from -# ``substra.sdk.schemas``. To register assets, :ref:`documentation/api_reference:Schemas` -# are first instantiated and the specs are then registered, which generate the real assets. -# -# Permissions are defined when registering assets. In a nutshell: -# -# - Data cannot be seen once it's registered on the platform. -# - Metadata are visible by all the users of a network. -# - Permissions allow you to execute a function on a certain dataset. -# - -permissions_local = Permissions(public=False, authorized_ids=DATA_PROVIDER_ORGS_ID) -permissions_aggregation = Permissions(public=False, authorized_ids=[ANALYTICS_PROVIDER_ORG_ID]) - -# %% -# Next, we need to define the asset directory. You should have already downloaded the assets folder as stated above. -# -# The function ``setup_diabetes`` downloads if needed the *diabetes* dataset, and split it in two. Each data organisation -# has access to a chunk of the dataset. 
- -root_dir = pathlib.Path.cwd() -assets_directory = root_dir / "assets" -assert assets_directory.is_dir(), """Did not find the asset directory, -a directory called 'assets' is expected in the same location as this file""" - -data_path = assets_directory / "data" -data_path.mkdir(exist_ok=True) - -setup_diabetes(data_path=data_path) - - -# %% -# Registering data samples and dataset -# ------------------------------------ -# -# A dataset represents the data in Substra. It contains some metadata and an *opener*, a script used to load the -# data from files into memory. You can find more details about datasets -# in the :ref:`API reference`. -# - -dataset = DatasetSpec( - name=f"Diabetes dataset", - type="csv", - data_opener=assets_directory / "dataset" / "diabetes_opener.py", - description=data_path / "description.md", - permissions=permissions_local, - logs_permission=permissions_local, -) - -# We register the dataset for each of the organisations -dataset_keys = {client_id: clients[client_id].add_dataset(dataset) for client_id in DATA_PROVIDER_ORGS_ID} - -for client_id, key in dataset_keys.items(): - print(f"Dataset key for {client_id}: {key}") - - -# %% -# The dataset object itself is an empty shell. Data samples are needed in order to add actual data. -# A data sample contains subfolders containing a single data file like a CSV and the key identifying -# the dataset it is linked to. -# - -datasample_keys = { - org_id: clients[org_id].add_data_sample( - DataSampleSpec( - data_manager_keys=[dataset_keys[org_id]], - test_only=False, - path=data_path / f"org_{i + 1}", - ), - local=True, - ) - for i, org_id in enumerate(DATA_PROVIDER_ORGS_ID) -} - -# %% -# The data has now been added as an asset through the data samples. -# - -# %% -# Adding functions to execute with Substra -# ======================================== -# A :ref:`Substra function` -# specifies the function to apply to a dataset or the function to aggregate models (artifacts). 
-# Concretely, a function corresponds to an archive (tar or zip file) containing:
-#
-# - One or more Python scripts that implement the function.
-# - A Dockerfile in which the user can specify the required dependencies of the Python scripts.
-#   This Dockerfile also specifies the function name to execute.
-#
-# In this example, we will:
-#
-# 1. compute prerequisites for first-moment statistics on each data organization;
-# 2. aggregate these values on the analytics computation organization to get aggregated statistics;
-# 3. send these aggregated values to the data organizations, in order to compute second-moment prerequisite values;
-# 4. finally, aggregate these values to get second-moment aggregated statistics.
-#
-
-
-# %%
-# Local step: computing first order statistic moments
-# ---------------------------------------------------
-# First, we compute some aggregated values on each data node: the number of samples, the sum of each numerical column
-# (used to compute the mean), and the counts for each category of the categorical column (*Sex*).
-#
-# The computation is implemented in a *Python function* in the ``federated_analytics_functions.py`` file.
-# We also write a ``Dockerfile`` to define the entrypoint, and we wrap everything in a Substra ``FunctionSpec`` object.
-#
-# If you're running this example in a notebook, you can uncomment and execute the next cell to see what code is executed
-# on each data node.
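Since the ``%load`` cell below only renders in a notebook, here is an illustrative sketch of what such a step-1 local function might compute. The function name matches the one loaded from ``federated_analytics_functions.py``, but the body and the row format (a list of dicts, as a hypothetical opener might return) are assumptions of this sketch, not the actual file contents:

```python
def local_first_order_computation(datasamples):
    # Only aggregates ever leave the organization: a sample count, per-column
    # sums (enough to later derive means), and category counts for "sex".
    n_samples = len(datasamples)
    sums = {}
    sex_counts = {"M": 0, "F": 0}
    for row in datasamples:
        for col, value in row.items():
            if col == "sex":
                sex_counts[value] += 1
            else:
                sums[col] = sums.get(col, 0.0) + float(value)
    return {"n_samples": n_samples, "sums": sums, "counts": {"sex": sex_counts}}
```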
- -# %% - -# %load -s local_first_order_computation assets/functions/federated_analytics_functions.py - -# %% - - -local_first_order_computation_docker_files = [ - assets_directory / "functions" / "federated_analytics_functions.py", - assets_directory / "functions" / "local_first_order_computation" / "Dockerfile", -] - -local_archive_first_order_computation_path = assets_directory / "functions" / "local_first_order_analytics.zip" -with zipfile.ZipFile(local_archive_first_order_computation_path, "w") as z: - for filepath in local_first_order_computation_docker_files: - z.write(filepath, arcname=os.path.basename(filepath)) - -local_first_order_function_inputs = [ - FunctionInputSpec( - identifier="datasamples", - kind=AssetKind.data_sample, - optional=False, - multiple=True, - ), - FunctionInputSpec(identifier="opener", kind=AssetKind.data_manager, optional=False, multiple=False), -] - -local_first_order_function_outputs = [ - FunctionOutputSpec(identifier="local_analytics_first_moments", kind=AssetKind.model, multiple=False) -] - -local_first_order_function = FunctionSpec( - name="Local Federated Analytics - step 1", - inputs=local_first_order_function_inputs, - outputs=local_first_order_function_outputs, - description=assets_directory / "functions" / "description.md", - file=local_archive_first_order_computation_path, - permissions=permissions_local, -) - - -local_first_order_function_keys = { - client_id: clients[client_id].add_function(local_first_order_function) for client_id in DATA_PROVIDER_ORGS_ID -} - -print(f"Local function key for step 1: computing first order moments {local_first_order_function_keys}") - -# %% -# First aggregation step -# ---------------------- -# In a similar way, we define the `FunctionSpec` for the aggregation node. 
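As above, the loaded ``aggregation`` function is not reproduced in this document. A sketch of what the first aggregation might do with the step-1 summaries; the shared-state layout is the one assumed by this document's sketches, not the actual file contents:

```python
def aggregation(local_analytics_list):
    # Total sample count across all data organizations.
    n_total = sum(state["n_samples"] for state in local_analytics_list)

    # Merge the per-organization sums, then normalize into means.
    means = {}
    for state in local_analytics_list:
        for col, col_sum in state["sums"].items():
            means[col] = means.get(col, 0.0) + col_sum
    means = {col: col_sum / n_total for col, col_sum in means.items()}

    # Merge the categorical counts, then normalize into proportions.
    sex_counts = {}
    for state in local_analytics_list:
        for category, count in state["counts"]["sex"].items():
            sex_counts[category] = sex_counts.get(category, 0) + count
    proportions = {cat: count / n_total for cat, count in sex_counts.items()}

    return {"n_samples": n_total, "means": means, "counts": {"sex": proportions}}
```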
- -# %% - -# %load -s aggregation assets/functions/federated_analytics_functions.py - -# %% - -aggregate_function_docker_files = [ - assets_directory / "functions" / "federated_analytics_functions.py", - assets_directory / "functions" / "aggregation" / "Dockerfile", -] - -aggregate_archive_path = assets_directory / "functions" / "aggregate_function_analytics.zip" -with zipfile.ZipFile(aggregate_archive_path, "w") as z: - for filepath in aggregate_function_docker_files: - z.write(filepath, arcname=os.path.basename(filepath)) - -aggregate_function_inputs = [ - FunctionInputSpec( - identifier="local_analytics_list", - kind=AssetKind.model, - optional=False, - multiple=True, - ), -] - -aggregate_function_outputs = [FunctionOutputSpec(identifier="shared_states", kind=AssetKind.model, multiple=False)] - -aggregate_function = FunctionSpec( - name="Aggregate Federated Analytics", - inputs=aggregate_function_inputs, - outputs=aggregate_function_outputs, - description=assets_directory / "functions" / "description.md", - file=aggregate_archive_path, - permissions=permissions_aggregation, -) - - -aggregate_function_key = clients[ANALYTICS_PROVIDER_ORG_ID].add_function(aggregate_function) - -print(f"Aggregation function key {aggregate_function_key}") - -# %% -# Local step: computing second order statistic moments -# ---------------------------------------------------- -# We also register the function for the second round of computations happening locally on the data nodes. -# -# Both aggregation steps will use the same function, so we don't need to register it again. 
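A sketch of what the step-2 local function might look like: given the global means from the first aggregation, each organization returns only its sums of squared deviations, which is enough to later derive a global standard deviation. Again, the body is an assumption of this sketch, not the actual file contents:

```python
def local_second_order_computation(datasamples, shared_states):
    # The global means arrive via the shared state produced by the first
    # aggregation; only squared-deviation sums leave the organization.
    sq_dev_sums = {}
    for row in datasamples:
        for col, mean in shared_states["means"].items():
            diff = float(row[col]) - mean
            sq_dev_sums[col] = sq_dev_sums.get(col, 0.0) + diff * diff
    return {"n_samples": len(datasamples), "sums": sq_dev_sums}
```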
- -# %% - -# %load -s local_second_order_computation assets/functions/federated_analytics_functions.py - -# %% - -local_second_order_computation_docker_files = [ - assets_directory / "functions" / "federated_analytics_functions.py", - assets_directory / "functions" / "local_second_order_computation" / "Dockerfile", -] - -local_archive_second_order_computation_path = assets_directory / "functions" / "local_function_analytics.zip" -with zipfile.ZipFile(local_archive_second_order_computation_path, "w") as z: - for filepath in local_second_order_computation_docker_files: - z.write(filepath, arcname=os.path.basename(filepath)) - -local_second_order_function_inputs = [ - FunctionInputSpec( - identifier="datasamples", - kind=AssetKind.data_sample, - optional=False, - multiple=True, - ), - FunctionInputSpec(identifier="opener", kind=AssetKind.data_manager, optional=False, multiple=False), - FunctionInputSpec(identifier="shared_states", kind=AssetKind.model, optional=False, multiple=False), -] - -local_second_order_function_outputs = [ - FunctionOutputSpec( - identifier="local_analytics_second_moments", - kind=AssetKind.model, - multiple=False, - ) -] - -local_second_order_function = FunctionSpec( - name="Local Federated Analytics - step 2", - inputs=local_second_order_function_inputs, - outputs=local_second_order_function_outputs, - description=assets_directory / "functions" / "description.md", - file=local_archive_second_order_computation_path, - permissions=permissions_local, -) - - -local_second_order_function_keys = { - client_id: clients[client_id].add_function(local_second_order_function) for client_id in DATA_PROVIDER_ORGS_ID -} - -print(f"Local function key for step 2: computing second order moments {local_second_order_function_keys}") - -# %% -# The data and the functions are now registered. -# - -# %% -# Registering tasks in Substra -# ============================ -# The next step is to register the actual machine learning tasks. 
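The six tasks registered below form a small dependency graph: two step-1 local tasks feed the first aggregation, whose output feeds two step-2 local tasks, which feed the second aggregation. The wiring can be sketched with the standard library's ``graphlib`` (the task names here are illustrative, not Substra identifiers):

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks whose outputs it consumes.
dependencies = {
    "local_1_org2": set(),
    "local_1_org3": set(),
    "aggregation_1": {"local_1_org2", "local_1_org3"},
    "local_2_org2": {"aggregation_1"},
    "local_2_org3": {"aggregation_1"},
    "aggregation_2": {"local_2_org2", "local_2_org3"},
}

# A valid execution order: every task appears after its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
```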
-#
-
-from substra.sdk.models import Status
-import time
-
-
-def wait_task(client: substra.Client, key: str):
-    """Wait for a task to complete before continuing.
-
-    Args:
-        client (substra.Client): client that owns the task.
-        key (str): key of the task to wait for.
-    """
-    task_status = client.get_task(key).status
-
-    while task_status not in (Status.done, Status.failed):
-        time.sleep(1)
-        task_status = client.get_task(key).status
-
-    client_id = client.organization_info().organization_id
-    print(f"Status of task {key} on client {client_id}: {task_status}")
-
-
-data_manager_input = {
-    client_id: [InputRef(identifier="opener", asset_key=key)] for client_id, key in dataset_keys.items()
-}
-
-datasample_inputs = {
-    client_id: [InputRef(identifier="datasamples", asset_key=key)] for client_id, key in datasample_keys.items()
-}
-
-local_task_1_keys = {
-    client_id: clients[client_id].add_task(
-        TaskSpec(
-            function_key=local_first_order_function_keys[client_id],
-            inputs=data_manager_input[client_id] + datasample_inputs[client_id],
-            outputs={"local_analytics_first_moments": ComputeTaskOutputSpec(permissions=permissions_aggregation)},
-            worker=client_id,
-        )
-    )
-    for client_id in DATA_PROVIDER_ORGS_ID
-}
-
-for client_id, key in local_task_1_keys.items():
-    wait_task(client=clients[client_id], key=key)
-
-
-# %%
-# In local mode, the registered task is executed at once:
-# the registration function returns a value once the task has been executed.
-#
-# In deployed mode, the registered task is added to a queue and treated asynchronously: this means that the
-# code that registers the tasks keeps executing. To wait for a task to be done, create a loop and get the task
-# every ``n`` seconds until its status is done or failed.
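In deployed mode, polling without a timeout can hang indefinitely, for instance if a task stays queued. A sketch of a generic polling helper with a timeout; ``wait_until`` and its parameters are inventions of this sketch, not part of the Substra API:

```python
import time


def wait_until(poll, is_done, is_failed, timeout=300.0, interval=1.0):
    """Poll `poll()` until `is_done(status)` is true.

    Raises RuntimeError if `is_failed(status)` becomes true, and
    TimeoutError once `timeout` seconds have elapsed. Illustrative only.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = poll()
        if is_done(status):
            return status
        if is_failed(status):
            raise RuntimeError(f"task failed with status {status!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task still {status!r} after {timeout}s")
        time.sleep(interval)
```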
-# - -aggregation_1_inputs = [ - InputRef( - identifier="local_analytics_list", - parent_task_key=local_key, - parent_task_output_identifier="local_analytics_first_moments", - ) - for local_key in local_task_1_keys.values() -] - - -aggregation_task_1 = TaskSpec( - function_key=aggregate_function_key, - inputs=aggregation_1_inputs, - outputs={"shared_states": ComputeTaskOutputSpec(permissions=permissions_local)}, - worker=ANALYTICS_PROVIDER_ORG_ID, -) - -aggregation_task_1_key = clients[ANALYTICS_PROVIDER_ORG_ID].add_task(aggregation_task_1) - -wait_task(client=clients[ANALYTICS_PROVIDER_ORG_ID], key=aggregation_task_1_key) - -# %% - -shared_inputs = [ - InputRef( - identifier="shared_states", - parent_task_key=aggregation_task_1_key, - parent_task_output_identifier="shared_states", - ) -] - -local_task_2_keys = { - client_id: clients[client_id].add_task( - TaskSpec( - function_key=local_second_order_function_keys[client_id], - inputs=data_manager_input[client_id] + datasample_inputs[client_id] + shared_inputs, - outputs={"local_analytics_second_moments": ComputeTaskOutputSpec(permissions=permissions_aggregation)}, - worker=client_id, - ) - ) - for client_id in DATA_PROVIDER_ORGS_ID -} - -for client_id, key in local_task_2_keys.items(): - wait_task(client=clients[client_id], key=key) - -aggregation_2_inputs = [ - InputRef( - identifier="local_analytics_list", - parent_task_key=local_key, - parent_task_output_identifier="local_analytics_second_moments", - ) - for local_key in local_task_2_keys.values() -] - -aggregation_task_2 = TaskSpec( - function_key=aggregate_function_key, - inputs=aggregation_2_inputs, - outputs={"shared_states": ComputeTaskOutputSpec(permissions=permissions_local)}, - worker=ANALYTICS_PROVIDER_ORG_ID, -) - -aggregation_task_2_key = clients[ANALYTICS_PROVIDER_ORG_ID].add_task(aggregation_task_2) - -wait_task(client=clients[ANALYTICS_PROVIDER_ORG_ID], key=aggregation_task_2_key) - -# %% -# Results -# ------- -# Now we can view the results. 
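Before looking at the downloaded outputs, the two-pass logic itself can be sanity-checked in plain Python with hypothetical per-organization values: the first pass shares only counts and sums (enough for the global mean), the second pass shares squared deviations from that mean (enough for the global standard deviation):

```python
import math

# Hypothetical local datasets: raw values never leave their organization.
org_data = {
    "org-2": [2.0, 4.0, 4.0],
    "org-3": [4.0, 5.0, 5.0, 7.0],
}

# Pass 1 (local): each org computes its sample count and sum.
pass1 = {org: {"n": len(xs), "sum": sum(xs)} for org, xs in org_data.items()}

# Aggregation 1: the analytics org derives the global mean.
n_total = sum(s["n"] for s in pass1.values())
mean = sum(s["sum"] for s in pass1.values()) / n_total

# Pass 2 (local): squared deviations from the *global* mean.
pass2 = {org: sum((x - mean) ** 2 for x in xs) for org, xs in org_data.items()}

# Aggregation 2: global (population) standard deviation.
std = math.sqrt(sum(pass2.values()) / n_total)
```

The result matches what a single centralized pass over the pooled values would give, which is exactly the point of the construction.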
-# - -import pickle -import tempfile - - -with tempfile.TemporaryDirectory() as temp_folder: - out_model1_file = clients[ANALYTICS_PROVIDER_ORG_ID].download_model_from_task( - aggregation_task_1_key, folder=temp_folder, identifier="shared_states" - ) - out1 = pickle.load(out_model1_file.open("rb")) - - out_model2_file = clients[ANALYTICS_PROVIDER_ORG_ID].download_model_from_task( - aggregation_task_2_key, folder=temp_folder, identifier="shared_states" - ) - out2 = pickle.load(out_model2_file.open("rb")) - -print( - f"""Age mean: {out1['means']['age']:.2f} years -Sex percentage: - Male: {100*out1['counts']['sex']['M']:.2f}% - Female: {100*out1['counts']['sex']['F']:.2f}% -Blood pressure std: {out2["std"]["bp"]:.2f} mm Hg -""" -) diff --git a/examples/substra_core/titanic_example/README.rst b/examples/substra_core/titanic_example/README.rst deleted file mode 100644 index b997063b..00000000 --- a/examples/substra_core/titanic_example/README.rst +++ /dev/null @@ -1,2 +0,0 @@ -Examples to get started -^^^^^^^^^^^^^^^^^^^^^^^ \ No newline at end of file diff --git a/examples/substra_core/titanic_example/run_titanic.py b/examples/substra_core/titanic_example/run_titanic.py deleted file mode 100644 index 47b4f68a..00000000 --- a/examples/substra_core/titanic_example/run_titanic.py +++ /dev/null @@ -1,361 +0,0 @@ -""" -================================================================= -Running Substra with a single organisation on the Titanic dataset -================================================================= - -This example is based on `the similarly named Kaggle challenge `__. - -In this example, we work on the Titanic tabular dataset. This is a classification problem -that uses a random forest model. 
-
-Here you will learn how to interact with Substra, more specifically:
-
-- instantiating a Substra Client
-- creating and registering assets
-- launching an experiment
-
-
-There is no federated learning in this example; training and testing will happen on only one :term:`Organization`.
-
-
-To run this example, you need to download and unzip the assets needed to run it in the same directory as this example.
-
-    .. only:: builder_html or readthedocs
-
-        :download:`assets required to run this example <../../../../../tmp/titanic_assets.zip>`
-
-        Please ensure all the libraries are installed. A *requirements.txt* file is included in the zip file; you can run the command ``pip install -r requirements.txt`` to install them.
-
-"""
-
-# %%
-# Import all the dependencies
-# ---------------------------
-
-import os
-import zipfile
-from pathlib import Path
-
-import substra
-from substra.sdk.schemas import (
-    AssetKind,
-    DataSampleSpec,
-    DatasetSpec,
-    FunctionSpec,
-    FunctionInputSpec,
-    FunctionOutputSpec,
-    Permissions,
-    TaskSpec,
-    ComputeTaskOutputSpec,
-    InputRef,
-)
-
-# %%
-# Instantiating the Substra Client
-# ================================
-#
-# The client allows us to interact with the Substra platform.
-#
-# By setting the argument ``backend_type`` to:
-#
-# - ``docker`` all tasks will be executed in Docker containers (default)
-# - ``subprocess`` all tasks will be executed in Python subprocesses (faster)
-
-client = substra.Client(client_name="org-1")
-
-# %%
-#
-# Creation and registration of the assets
-# ---------------------------------------
-#
-# Every asset will be created in accordance with predefined schemas (Spec) previously imported from
-# substra.sdk.schemas. To register assets, asset :ref:`documentation/api_reference:Schemas`
-# are first instantiated and the specs are then registered, which generates the real assets.
-#
-# Permissions are defined when registering assets.
In a nutshell: -# -# - Data cannot be seen once it's registered on the platform. -# - Metadata are visible by all the users of a channel. -# - Permissions allow you to execute a function on a certain dataset. -# -# In a remote deployment, setting the parameter ``public`` to false means that the dataset can only be used by tasks in -# the same organization or by organizations that are in the ``authorized_ids``. However, these permissions are ignored in local mode. - -permissions = Permissions(public=True, authorized_ids=[]) - -# %% -# Next, we need to define the asset directory. You should have already downloaded the assets folder as stated above. -# - -root_dir = Path.cwd() -assets_directory = root_dir / "assets" -assert assets_directory.is_dir(), """Did not find the asset directory, a directory called 'assets' is -expected in the same location as this py file""" - -# %% -# -# Registering data samples and dataset -# ==================================== -# -# A dataset represents the data in Substra. It is made up of an opener, which is a script used to load the -# data from files into memory. You can find more details about datasets -# in the :ref:`API reference` - -dataset = DatasetSpec( - name="Titanic dataset - Org 1", - type="csv", - data_opener=assets_directory / "dataset" / "titanic_opener.py", - description=assets_directory / "dataset" / "description.md", - permissions=permissions, - logs_permission=permissions, -) - -dataset_key = client.add_dataset(dataset) -print(f"Dataset key {dataset_key}") - - -# %% -# Adding train data samples -# ========================= -# -# The dataset object itself is an empty shell. Data samples are needed in order to add actual data. -# A data sample contains subfolders containing a single data file like a CSV and the key identifying -# the dataset it is linked to. 
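The opener script itself (``titanic_opener.py``) is not shown in this file. As a rough illustration of an opener's job, loading every CSV found in the registered data sample folders into memory might look like the snippet below; the class name and the use of the ``csv`` module are assumptions of this sketch (a real opener implements the ``substratools`` opener interface):

```python
import csv
from pathlib import Path


class CsvOpener:
    # Illustrative stand-in for a Substra opener: turn the files inside the
    # registered data sample folders into one in-memory object.
    def get_data(self, folders):
        rows = []
        for folder in folders:
            for csv_file in sorted(Path(folder).glob("*.csv")):
                with open(csv_file, newline="") as f:
                    rows.extend(csv.DictReader(f))
        return rows
```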
-
-# sphinx_gallery_thumbnail_path = 'static/example_thumbnail/titanic.jpg'
-
-train_data_sample_folder = assets_directory / "train_data_samples"
-train_data_sample_keys = client.add_data_samples(
-    DataSampleSpec(
-        paths=list(train_data_sample_folder.glob("*")),
-        data_manager_keys=[dataset_key],
-    )
-)
-
-print(f"{len(train_data_sample_keys)} data samples were registered")
-
-# %%
-# Adding test data samples
-# ========================
-# The same operation is repeated with the test data samples.
-
-test_data_sample_folder = assets_directory / "test_data_samples"
-test_data_sample_keys = client.add_data_samples(
-    DataSampleSpec(
-        paths=list(test_data_sample_folder.glob("*")),
-        data_manager_keys=[dataset_key],
-    )
-)
-
-print(f"{len(test_data_sample_keys)} data samples were registered")
-
-
-# %%
-# The data has now been added as an asset through the data samples, for both the training and
-# testing parts of our experiment.
-#
-# Adding Metrics
-# ==============
-# A metric corresponds to a function used to evaluate the performance of a model on a dataset.
-# Concretely, a metric corresponds to an archive (tar or zip file) containing: -# -# - Python scripts that implement the metric computation -# - a Dockerfile on which the user can specify the required dependencies of the Python scripts - -inputs_metrics = [ - FunctionInputSpec(identifier="datasamples", kind=AssetKind.data_sample, optional=False, multiple=True), - FunctionInputSpec(identifier="opener", kind=AssetKind.data_manager, optional=False, multiple=False), - FunctionInputSpec(identifier="predictions", kind=AssetKind.model, optional=False, multiple=False), -] - -outputs_metrics = [FunctionOutputSpec(identifier="performance", kind=AssetKind.performance, multiple=False)] - - -METRICS_DOCKERFILE_FILES = [ - assets_directory / "metric" / "titanic_metrics.py", - assets_directory / "metric" / "Dockerfile", -] - -metric_archive_path = assets_directory / "metric" / "metrics.zip" - -with zipfile.ZipFile(metric_archive_path, "w") as z: - for filepath in METRICS_DOCKERFILE_FILES: - z.write(filepath, arcname=os.path.basename(filepath)) - -metric_function = FunctionSpec( - inputs=inputs_metrics, - outputs=outputs_metrics, - name="Testing with Accuracy metric", - description=assets_directory / "metric" / "description.md", - file=metric_archive_path, - permissions=permissions, -) - -metric_key = client.add_function(metric_function) - -print(f"Metric key {metric_key}") - - -# %% -# Adding Function -# =============== -# A :ref:`documentation/concepts:Function` specifies the method to train a model on a dataset or the method to aggregate models. -# Concretely, a function corresponds to an archive (tar or zip file) containing: -# -# - One or more Python scripts that implement the function. It is required to define ``train`` and ``predict`` functions. -# - A Dockerfile in which the user can specify the required dependencies of the Python scripts. -# This Dockerfile also specifies the method name to execute (either ``train`` or ``predict`` here). 
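The ``train``/``predict`` contract described above can be illustrated without any ML library. The toy "model" below just memorizes the majority class — a deliberately trivial stand-in for the real random forest, meant only to show the shape of the two functions:

```python
from collections import Counter


def train(labels):
    # Toy stand-in for the training function: the "model" is simply the most
    # frequent label seen during training.
    majority_class, _ = Counter(labels).most_common(1)[0]
    return {"majority_class": majority_class}


def predict(model, samples):
    # Toy stand-in for the predict function: apply the stored model to each
    # input sample, producing one prediction per sample.
    return [model["majority_class"] for _ in samples]
```

In the real archive, both functions live in the same ``titanic_function_rf.py`` file, and each Dockerfile selects which one to execute.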
- - -ALGO_TRAIN_DOCKERFILE_FILES = [ - assets_directory / "function_random_forest/titanic_function_rf.py", - assets_directory / "function_random_forest/train/Dockerfile", -] - -train_archive_path = assets_directory / "function_random_forest" / "function_random_forest.zip" -with zipfile.ZipFile(train_archive_path, "w") as z: - for filepath in ALGO_TRAIN_DOCKERFILE_FILES: - z.write(filepath, arcname=os.path.basename(filepath)) - -train_function_inputs = [ - FunctionInputSpec(identifier="datasamples", kind=AssetKind.data_sample, optional=False, multiple=True), - FunctionInputSpec(identifier="opener", kind=AssetKind.data_manager, optional=False, multiple=False), -] - -train_function_outputs = [FunctionOutputSpec(identifier="model", kind=AssetKind.model, multiple=False)] - -train_function = FunctionSpec( - name="Training with Random Forest", - inputs=train_function_inputs, - outputs=train_function_outputs, - description=assets_directory / "function_random_forest" / "description.md", - file=train_archive_path, - permissions=permissions, -) - - -train_function_key = client.add_function(train_function) - -print(f"Train function key {train_function_key}") - -# %% -# The predict function uses the same Python file as the function used for training. 
-ALGO_PREDICT_DOCKERFILE_FILES = [ - assets_directory / "function_random_forest/titanic_function_rf.py", - assets_directory / "function_random_forest/predict/Dockerfile", -] - -predict_archive_path = assets_directory / "function_random_forest" / "function_random_forest.zip" -with zipfile.ZipFile(predict_archive_path, "w") as z: - for filepath in ALGO_PREDICT_DOCKERFILE_FILES: - z.write(filepath, arcname=os.path.basename(filepath)) - -predict_function_inputs = [ - FunctionInputSpec(identifier="datasamples", kind=AssetKind.data_sample, optional=False, multiple=True), - FunctionInputSpec(identifier="opener", kind=AssetKind.data_manager, optional=False, multiple=False), - FunctionInputSpec(identifier="models", kind=AssetKind.model, optional=False, multiple=False), -] - -predict_function_outputs = [FunctionOutputSpec(identifier="predictions", kind=AssetKind.model, multiple=False)] - -predict_function_spec = FunctionSpec( - name="Predicting with Random Forest", - inputs=predict_function_inputs, - outputs=predict_function_outputs, - description=assets_directory / "function_random_forest" / "description.md", - file=predict_archive_path, - permissions=permissions, -) - -predict_function_key = client.add_function(predict_function_spec) - -print(f"Predict function key {predict_function_key}") - -# %% -# The data, the functions and the metric are now registered. - -# %% -# Registering tasks -# ----------------- -# The next step is to register the actual machine learning tasks. -# First a training task is registered which will produce a machine learning model. -# Then a testing task is registered to test the trained model. 
- -data_manager_input = [InputRef(identifier="opener", asset_key=dataset_key)] -train_data_sample_inputs = [InputRef(identifier="datasamples", asset_key=key) for key in train_data_sample_keys] -test_data_sample_inputs = [InputRef(identifier="datasamples", asset_key=key) for key in test_data_sample_keys] - -train_task = TaskSpec( - function_key=train_function_key, - inputs=data_manager_input + train_data_sample_inputs, - outputs={"model": ComputeTaskOutputSpec(permissions=permissions)}, - worker=client.organization_info().organization_id, -) - -train_task_key = client.add_task(train_task) - -print(f"Train task key {train_task_key}") - -# %% -# In local mode, the registered task is executed at once: -# the registration function returns a value once the task has been executed. -# -# In deployed mode, the registered task is added to a queue and treated asynchronously: this means that the -# code that registers the tasks keeps executing. To wait for a task to be done, create a loop and get the task -# every ``n`` seconds until its status is done or failed. 
- -model_input = [ - InputRef( - identifier="models", - parent_task_key=train_task_key, - parent_task_output_identifier="model", - ) -] - -predict_task = TaskSpec( - function_key=predict_function_key, - inputs=data_manager_input + test_data_sample_inputs + model_input, - outputs={"predictions": ComputeTaskOutputSpec(permissions=permissions)}, - worker=client.organization_info().organization_id, -) - -predict_task_key = client.add_task(predict_task) - -predictions_input = [ - InputRef( - identifier="predictions", - parent_task_key=predict_task_key, - parent_task_output_identifier="predictions", - ) -] - -test_task = TaskSpec( - function_key=metric_key, - inputs=data_manager_input + test_data_sample_inputs + predictions_input, - outputs={"performance": ComputeTaskOutputSpec(permissions=permissions)}, - worker=client.organization_info().organization_id, -) - -test_task_key = client.add_task(test_task) - -print(f"Test task key {test_task_key}") - - -# %% -# Results -# ------- -# Now we can view the results - -from substra.sdk.models import Status -import time - -test_task = client.get_task(test_task_key) -while test_task.status != Status.done: - time.sleep(1) - test_task = client.get_task(test_task_key) - -print(f"Test tasks status: {test_task.status}") - -performance = client.get_task_output_asset(test_task.key, identifier="performance") -print("Metric: ", test_task.function.name) -print("Performance on the metric: ", performance.asset) diff --git a/examples/substrafl/README.rst b/examples/substrafl/README.rst deleted file mode 100644 index dc5e68e0..00000000 --- a/examples/substrafl/README.rst +++ /dev/null @@ -1,4 +0,0 @@ -SubstraFL examples -================== - -The examples below are compatible with SubstraFL |substrafl_version|. 
diff --git a/examples/substrafl/get_started/README.rst b/examples/substrafl/get_started/README.rst deleted file mode 100644 index 26a7b47f..00000000 --- a/examples/substrafl/get_started/README.rst +++ /dev/null @@ -1,2 +0,0 @@ -Example to get started using the PyTorch interface -************************************************** \ No newline at end of file diff --git a/examples/substrafl/get_started/run_mnist_torch.py b/examples/substrafl/get_started/run_mnist_torch.py deleted file mode 100644 index dd00b0a6..00000000 --- a/examples/substrafl/get_started/run_mnist_torch.py +++ /dev/null @@ -1,535 +0,0 @@ -""" -=================================== -Using Torch FedAvg on MNIST dataset -=================================== - -This example illustrates the basic usage of SubstraFL and proposes Federated Learning using the Federated Averaging strategy -on the `MNIST Dataset of handwritten digits `__ using PyTorch. -In this example, we work on 28x28 pixel sized grayscale images. This is a classification problem -aiming to recognize the number written on each image. - -SubstraFL can be used with any machine learning framework (PyTorch, Tensorflow, Scikit-Learn, etc). - -However a specific interface has been developed for PyTorch which makes writing PyTorch code simpler than with other frameworks. This example here uses the specific PyTorch interface. - -This example does not use a deployed platform of Substra and runs in local mode. - -To run this example, you need to download and unzip the assets needed to run it in the same directory as used this example. - - .. only:: builder_html or readthedocs - - :download:`assets required to run this example <../../../../../tmp/torch_fedavg_assets.zip>` - - * Please ensure to have all the libraries installed. A *requirements.txt* file is included in the zip file, where you can run the command ``pip install -r requirements.txt`` to install them. - * **Substra** and **SubstraFL** should already be installed. 
If not follow the instructions described here: :ref:`substrafl_doc/substrafl_overview:Installation`. - - -""" -# %% -# Setup -# ***** -# -# This example runs with three organizations. Two organizations provide datasets, while a third -# one provides the algorithm. -# -# In the following code cell, we define the different organizations needed for our FL experiment. - - -from substra import Client - -N_CLIENTS = 3 - -client_0 = Client(client_name="org-1") -client_1 = Client(client_name="org-2") -client_2 = Client(client_name="org-3") - -# %% -# Every computation will run in ``subprocess`` mode, where everything runs locally in Python -# subprocesses. -# Other backend_types are: -# -# - ``docker`` mode where computations run locally in docker containers -# - ``remote`` mode where computations run remotely (you need to have a deployed platform for that) -# -# To run in remote mode, use the following syntax: -# -# ``client_remote = Client(backend_type="remote", url="MY_BACKEND_URL", username="my-username", password="my-password")`` - - -# Create a dictionary to easily access each client from its human-friendly id -clients = { - client_0.organization_info().organization_id: client_0, - client_1.organization_info().organization_id: client_1, - client_2.organization_info().organization_id: client_2, -} - -# Store organization IDs -ORGS_ID = list(clients) -ALGO_ORG_ID = ORGS_ID[0] # Algo provider is defined as the first organization. -DATA_PROVIDER_ORGS_ID = ORGS_ID[1:] # Data providers orgs are the two last organizations. - -# %% -# Data and metrics -# **************** - -# %% -# Data preparation -# ================ -# -# This section downloads (if needed) the **MNIST dataset** using the `torchvision library -# `__. -# It extracts the images from the raw files and locally creates a folder for each -# organization. 
-# -# Each organization will have access to half the training data and half the test data (which -# corresponds to **30,000** -# images for training and **5,000** for testing each). - -import pathlib -from torch_fedavg_assets.dataset.mnist_dataset import setup_mnist - -# sphinx_gallery_thumbnail_path = 'static/example_thumbnail/mnist.png' - -# Create the temporary directory for generated data -(pathlib.Path.cwd() / "tmp").mkdir(exist_ok=True) -data_path = pathlib.Path.cwd() / "tmp" / "data_mnist" - -setup_mnist(data_path, len(DATA_PROVIDER_ORGS_ID)) - -# %% -# Dataset registration -# ==================== -# -# A :ref:`documentation/concepts:Dataset` is composed of an **opener**, which is a Python script that can load -# the data from the files in memory and a description markdown file. -# The :ref:`documentation/concepts:Dataset` object itself does not contain the data. The proper asset that contains the -# data is the **datasample asset**. -# -# A **datasample** contains a local path to the data. A datasample can be linked to a dataset in order to add data to a -# dataset. -# -# Data privacy is a key concept for Federated Learning experiments. That is why we set -# :ref:`documentation/concepts:Permissions` for :ref:`documentation/concepts:Assets` to determine how each organization -# can access a specific asset. -# -# Note that metadata such as the assets' creation date and the asset owner are visible to all the organizations of a -# network. - -from substra.sdk.schemas import DatasetSpec -from substra.sdk.schemas import Permissions -from substra.sdk.schemas import DataSampleSpec - -assets_directory = pathlib.Path.cwd() / "torch_fedavg_assets" -dataset_keys = {} -train_datasample_keys = {} -test_datasample_keys = {} - -for i, org_id in enumerate(DATA_PROVIDER_ORGS_ID): - client = clients[org_id] - - permissions_dataset = Permissions(public=False, authorized_ids=[ALGO_ORG_ID]) - - # DatasetSpec is the specification of a dataset. 
It makes sure every field - # is well-defined, and that our dataset is ready to be registered. - # The real dataset object is created in the add_dataset method. - - dataset = DatasetSpec( - name="MNIST", - type="npy", - data_opener=assets_directory / "dataset" / "mnist_opener.py", - description=assets_directory / "dataset" / "description.md", - permissions=permissions_dataset, - logs_permission=permissions_dataset, - ) - dataset_keys[org_id] = client.add_dataset(dataset) - assert dataset_keys[org_id], "Missing dataset key" - - # Add the training data on each organization. - data_sample = DataSampleSpec( - data_manager_keys=[dataset_keys[org_id]], - path=data_path / f"org_{i+1}" / "train", - ) - train_datasample_keys[org_id] = client.add_data_sample(data_sample) - - # Add the testing data on each organization. - data_sample = DataSampleSpec( - data_manager_keys=[dataset_keys[org_id]], - path=data_path / f"org_{i+1}" / "test", - ) - test_datasample_keys[org_id] = client.add_data_sample(data_sample) - - -# %% -# Metrics definition -# ================== -# -# A metric is a function used to evaluate the performance of your model on one or several -# **datasamples**. -# -# To add a metric, you need to define a function that computes and returns a performance -# from the datasamples (as returned by the opener) and the predictions_path (to be loaded within the function). -# -# When using a Torch SubstraFL algorithm, the predictions are saved in the ``predict`` function in numpy format -# so that you can simply load them using ``np.load``. 
- -from sklearn.metrics import accuracy_score -from sklearn.metrics import roc_auc_score -import numpy as np - - -def accuracy(datasamples, predictions_path): - y_true = datasamples["labels"] - y_pred = np.load(predictions_path) - - return accuracy_score(y_true, np.argmax(y_pred, axis=1)) - - -def roc_auc(datasamples, predictions_path): - y_true = datasamples["labels"] - y_pred = np.load(predictions_path) - - n_class = np.max(y_true) + 1 - y_true_one_hot = np.eye(n_class)[y_true] - - return roc_auc_score(y_true_one_hot, y_pred) - - -# %% -# Machine learning components definition -# ************************************** -# This section uses the PyTorch based SubstraFL API to simplify the definition of machine learning components. -# However, SubstraFL is compatible with any machine learning framework. -# -# -# In this section, you will: -# -# - Register a model and its dependencies -# - Specify the federated learning strategy -# - Specify the training and aggregation nodes -# - Specify the test nodes -# - Actually run the computations - - -# %% -# Model definition -# ================ -# -# We choose to use a classic torch CNN as the model to train. The model architecture is defined by the user -# independently of SubstraFL. 
-
-import torch
-from torch import nn
-import torch.nn.functional as F
-
-seed = 42
-torch.manual_seed(seed)
-
-
-class CNN(nn.Module):
-    def __init__(self):
-        super(CNN, self).__init__()
-        self.conv1 = nn.Conv2d(1, 32, kernel_size=5)
-        self.conv2 = nn.Conv2d(32, 32, kernel_size=5)
-        self.conv3 = nn.Conv2d(32, 64, kernel_size=5)
-        self.fc1 = nn.Linear(3 * 3 * 64, 256)
-        self.fc2 = nn.Linear(256, 10)
-
-    def forward(self, x, eval=False):
-        x = F.relu(self.conv1(x))
-        x = F.relu(F.max_pool2d(self.conv2(x), 2))
-        x = F.dropout(x, p=0.5, training=not eval)
-        x = F.relu(F.max_pool2d(self.conv3(x), 2))
-        x = F.dropout(x, p=0.5, training=not eval)
-        x = x.view(-1, 3 * 3 * 64)
-        x = F.relu(self.fc1(x))
-        x = F.dropout(x, p=0.5, training=not eval)
-        x = self.fc2(x)
-        return F.log_softmax(x, dim=1)
-
-
-model = CNN()
-optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
-criterion = torch.nn.CrossEntropyLoss()
-
-# %%
-# Specifying how much data to train on
-# ====================================
-#
-# To specify how much data to train on at each round, we use the ``index_generator`` object.
-# We specify the batch size and the number of batches (named ``num_updates``) to consider for each round.
-# See :ref:`substrafl_doc/substrafl_overview:Index Generator` for more details.
-
-
-from substrafl.index_generator import NpIndexGenerator
-
-# Number of model updates between each FL strategy aggregation.
-NUM_UPDATES = 100
-
-# Number of samples per update.
-BATCH_SIZE = 32
-
-index_generator = NpIndexGenerator(
-    batch_size=BATCH_SIZE,
-    num_updates=NUM_UPDATES,
-)
-
-# %%
-# Torch Dataset definition
-# ==========================
-#
-# This torch Dataset is used to preprocess the data using the ``__getitem__`` function.
-#
-# This torch Dataset needs to have a specific ``__init__`` signature, which must contain ``(self, datasamples, is_inference)``.
-# -# The ``__getitem__`` function is expected to return (inputs, outputs) if ``is_inference`` is ``False``, else only the inputs. -# This behavior can be changed by re-writing the ``_local_train`` or ``predict`` methods. - - -class TorchDataset(torch.utils.data.Dataset): - def __init__(self, datasamples, is_inference: bool): - self.x = datasamples["images"] - self.y = datasamples["labels"] - self.is_inference = is_inference - - def __getitem__(self, idx): - if self.is_inference: - x = torch.FloatTensor(self.x[idx][None, ...]) / 255 - return x - - else: - x = torch.FloatTensor(self.x[idx][None, ...]) / 255 - - y = torch.tensor(self.y[idx]).type(torch.int64) - y = F.one_hot(y, 10) - y = y.type(torch.float32) - - return x, y - - def __len__(self): - return len(self.x) - - -# %% -# SubstraFL algo definition -# ========================== -# -# A SubstraFL Algo gathers all the defined elements that run locally in each organization. -# This is the only SubstraFL object that is framework specific (here PyTorch specific). -# -# The ``TorchDataset`` is passed **as a class** to the -# :ref:`Torch algorithm`. -# Indeed, this ``TorchDataset`` will be instantiated directly on the data provider organization. - - -from substrafl.algorithms.pytorch import TorchFedAvgAlgo - - -class TorchCNN(TorchFedAvgAlgo): - def __init__(self): - super().__init__( - model=model, - criterion=criterion, - optimizer=optimizer, - index_generator=index_generator, - dataset=TorchDataset, - seed=seed, - ) - - -# %% -# Federated Learning strategies -# ============================= -# -# A FL strategy specifies how to train a model on distributed data. -# The most well known strategy is the Federated Averaging strategy: train locally a model on every organization, -# then aggregate the weight updates from every organization, and then apply locally at each organization the averaged -# updates. 
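Strictly as an illustration (this is not the SubstraFL implementation, and the parameter values below are made up), the weighted averaging at the heart of FedAvg can be sketched in a few lines of plain Python:

```python
def fed_avg(org_weights, org_n_samples):
    """Weighted average of each model parameter across organizations.

    org_weights: one flat list of parameters per organization.
    org_n_samples: number of training samples held by each organization,
    used as the weighting coefficient.
    """
    total = sum(org_n_samples)
    n_params = len(org_weights[0])
    return [
        sum(w[i] * n for w, n in zip(org_weights, org_n_samples)) / total
        for i in range(n_params)
    ]


# Toy example: two organizations with 30,000 training images each
# (matching the MNIST split used in this example).
weights_org1 = [0.0, 2.0]  # illustrative parameters after local training
weights_org2 = [1.0, 4.0]
averaged = fed_avg([weights_org1, weights_org2], [30_000, 30_000])
print(averaged)  # [0.5, 3.0]
```

With equal sample counts this reduces to a plain mean; with unequal counts, organizations holding more data pull the average toward their local update.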
-
-
-from substrafl.strategies import FedAvg
-
-strategy = FedAvg(algo=TorchCNN())
-
-# %%
-# Where to train, where to aggregate
-# ==================================
-#
-# We specify on which data we want to train our model, using the :ref:`substrafl_doc/api/nodes:TrainDataNode` object.
-# Here we train on the two datasets that we registered earlier.
-#
-# The :ref:`substrafl_doc/api/nodes:AggregationNode` specifies the organization on which the aggregation operation
-# will be computed.
-
-from substrafl.nodes import TrainDataNode
-from substrafl.nodes import AggregationNode
-
-
-aggregation_node = AggregationNode(ALGO_ORG_ID)
-
-# Create the Train Data Nodes (or training tasks) and save them in a list
-train_data_nodes = [
-    TrainDataNode(
-        organization_id=org_id,
-        data_manager_key=dataset_keys[org_id],
-        data_sample_keys=[train_datasample_keys[org_id]],
-    )
-    for org_id in DATA_PROVIDER_ORGS_ID
-]
-
-
-# %%
-# Where and when to test
-# ======================
-#
-# With the same logic as the train nodes, we create a :ref:`substrafl_doc/api/nodes:TestDataNode` to specify on which
-# data we want to test our model.
-#
-# The :ref:`substrafl_doc/api/evaluation_strategy:Evaluation Strategy` defines where and at which frequency we
-# evaluate the model, using the metric(s) that you registered in a previous section.
-
-
-from substrafl.nodes import TestDataNode
-from substrafl.evaluation_strategy import EvaluationStrategy
-
-# Create the Test Data Nodes (or testing tasks) and save them in a list
-test_data_nodes = [
-    TestDataNode(
-        organization_id=org_id,
-        data_manager_key=dataset_keys[org_id],
-        test_data_sample_keys=[test_datasample_keys[org_id]],
-        metric_functions={"Accuracy": accuracy, "ROC AUC": roc_auc},
-    )
-    for org_id in DATA_PROVIDER_ORGS_ID
-]
-
-
-# Test at the end of every round
-my_eval_strategy = EvaluationStrategy(test_data_nodes=test_data_nodes, eval_frequency=1)
-
-# %%
-# Running the experiment
-# **********************
-#
-# As a last step before launching our experiment, we need to specify the third-party dependencies required to run it.
-# The :ref:`substrafl_doc/api/dependency:Dependency` object is instantiated in order to install the right libraries in
-# the Python environment of each organization.
-
-from substrafl.dependency import Dependency
-
-dependencies = Dependency(pypi_dependencies=["numpy==1.23.1", "torch==1.11.0", "scikit-learn==1.1.1"])
-
-# %%
-# We now have all the necessary objects to launch our experiment. Please see a summary below of all the objects we created so far:
-#
-# - A :ref:`documentation/references/sdk:Client` to add or retrieve the assets of our experiment, using their keys to
-#   identify them.
-# - A :ref:`Torch algorithm` to define the training parameters *(optimizer, train
-#   function, predict function, etc...)*.
-# - A :ref:`Federated Strategy`, to specify how to train the model on
-#   distributed data.
-# - :ref:`Train data nodes` to indicate on which data to train.
-# - An :ref:`substrafl_doc/api/evaluation_strategy:Evaluation Strategy`, to define where and at which frequency we
-#   evaluate the model.
-# - An :ref:`substrafl_doc/api/nodes:AggregationNode`, to specify the organization on which the aggregation operation
-#   will be computed.
-# - The **number of rounds**, a round being defined by a local training step followed by an aggregation operation. -# - An **experiment folder** to save a summary of the operation made. -# - The :ref:`substrafl_doc/api/dependency:Dependency` to define the libraries on which the experiment needs to run. - -from substrafl.experiment import execute_experiment - -# A round is defined by a local training step followed by an aggregation operation -NUM_ROUNDS = 3 - -compute_plan = execute_experiment( - client=clients[ALGO_ORG_ID], - strategy=strategy, - train_data_nodes=train_data_nodes, - evaluation_strategy=my_eval_strategy, - aggregation_node=aggregation_node, - num_rounds=NUM_ROUNDS, - experiment_folder=str(pathlib.Path.cwd() / "tmp" / "experiment_summaries"), - dependencies=dependencies, - clean_models=False, - name="MNIST documentation example", -) - - -# %% -# The compute plan created is composed of 29 tasks: -# -# * For each local training step, we create 3 tasks per organisation: training + prediction + evaluation -> 3 tasks. -# * We are training on 2 data organizations; for each round, we have 3 * 2 local tasks + 1 aggregation task -> 7 tasks. -# * We are training for 3 rounds: 3 * 7 -> 21 tasks. -# * Before the first local training step, there is an initialization step on each data organization: 21 + 2 -> 23 tasks. 
-# * After the last aggregation step, there are three more tasks: applying the last updates from the aggregator + prediction + evaluation, on both organizations: 23 + 2 * 3 -> 29 tasks - -# %% -# Explore the results -# ******************* - -# The results will be available once the compute plan is completed -client_0.wait_compute_plan(compute_plan.key) - -# %% -# List results -# ============ - - -import pandas as pd - -performances_df = pd.DataFrame(client.get_performances(compute_plan.key).dict()) -print("\nPerformance Table: \n") -print(performances_df[["worker", "round_idx", "identifier", "performance"]]) - -# %% -# Plot results -# ============ - -import matplotlib.pyplot as plt - -fig, axs = plt.subplots(1, 2, figsize=(12, 6)) -fig.suptitle("Test dataset results") - -axs[0].set_title("Accuracy") -axs[1].set_title("ROC AUC") - -for ax in axs.flat: - ax.set(xlabel="Rounds", ylabel="Score") - - -for org_id in DATA_PROVIDER_ORGS_ID: - org_df = performances_df[performances_df["worker"] == org_id] - acc_df = org_df[org_df["identifier"] == "Accuracy"] - axs[0].plot(acc_df["round_idx"], acc_df["performance"], label=org_id) - - auc_df = org_df[org_df["identifier"] == "ROC AUC"] - axs[1].plot(auc_df["round_idx"], auc_df["performance"], label=org_id) - -plt.legend(loc="lower right") -plt.show() - -# %% -# Download a model -# ================ -# -# After the experiment, you might be interested in downloading your trained model. -# To do so, you will need the source code in order to reload your code architecture in memory. -# You have the option to choose the client and the round you are interested in downloading. -# -# If ``round_idx`` is set to ``None``, the last round will be selected by default. 
- -from substrafl.model_loading import download_algo_state - -client_to_download_from = DATA_PROVIDER_ORGS_ID[0] -round_idx = None - -algo = download_algo_state( - client=clients[client_to_download_from], - compute_plan_key=compute_plan.key, - round_idx=round_idx, -) - -model = algo.model - -print(model) diff --git a/examples/substrafl/go_further/README.rst b/examples/substrafl/go_further/README.rst deleted file mode 100644 index 8fdc1eeb..00000000 --- a/examples/substrafl/go_further/README.rst +++ /dev/null @@ -1,2 +0,0 @@ -Example to go further -********************* \ No newline at end of file diff --git a/examples/substrafl/go_further/run_diabetes_substrafl.py b/examples/substrafl/go_further/run_diabetes_substrafl.py deleted file mode 100644 index 9d6c45f5..00000000 --- a/examples/substrafl/go_further/run_diabetes_substrafl.py +++ /dev/null @@ -1,559 +0,0 @@ -""" -=========================================== -Federated Analytics on the diabetes dataset -=========================================== - -This example demonstrates how to use the flexibility of the SubstraFL library and the base class -ComputePlanBuilder to do Federated Analytics. It reproduces the `diabetes example `__ -of the Substra SDK example section using SubstraFL. -If you are new to SubstraFL, we recommend to start by the `MNIST Example `__ -to learn how to use the library in the simplest configuration first. - -We use the **Diabetes dataset** available from the `Scikit-Learn dataset module `__. -This dataset contains medical information such as Age, Sex or Blood pressure. -The goal of this example is to compute some analytics such as Age mean, Blood pressure standard deviation or Sex percentage. - -We simulate having two different data organizations, and a third organization which wants to compute aggregated analytics -without having access to the raw data. The example here runs everything locally; however there is only one parameter to -change to run it on a real network. 
-
-**Caution:**
-    This example is provided as an illustrative example only. In real life, you should be careful not to
-    accidentally leak private information when doing Federated Analytics. For example, if a column contains very similar values,
-    sharing its mean and its standard deviation is functionally equivalent to sharing the content of the column.
-    It is **strongly recommended** to consider what the potential security risks are in your use case, and to act accordingly.
-    It is possible to use other privacy-preserving techniques, such as
-    `Differential Privacy `_, in addition to Substra.
-    Because the focus of this example is Substra capabilities and for the sake of simplicity, such safeguards are not implemented here.
-
-
-To run this example, you need to download and unzip the assets needed to run it in the same directory as this example.
-
-    .. only:: builder_html or readthedocs
-
-        :download:`assets required to run this example <../../../../../tmp/diabetes_substrafl_assets.zip>`
-
-    Please ensure all the libraries are installed. A *requirements.txt* file is included in the zip file; you can run the command ``pip install -r requirements.txt`` to install them.
-
-"""
-
-
-# %%
-# Instantiating the Substra clients
-# =================================
-#
-# We work with three different organizations.
-# Two organizations provide data, and a third one performs Federated Analytics to compute aggregated statistics without
-# having access to the raw datasets.
-#
-# This example runs in local mode, simulating a federated learning experiment.
-#
-# In the following code cell, we define the different organizations needed for our FL experiment.
-
-# sphinx_gallery_thumbnail_path = 'static/example_thumbnail/diabetes.png'
-
-from substra import Client
-
-# Choose the subprocess mode to locally simulate the FL process
-N_CLIENTS = 3
-client_0 = Client(client_name="org-1")
-client_1 = Client(client_name="org-2")
-client_2 = Client(client_name="org-3")
-
-# Create a dictionary to easily access each client from its human-friendly id
-clients = {
-    client_0.organization_info().organization_id: client_0,
-    client_1.organization_info().organization_id: client_1,
-    client_2.organization_info().organization_id: client_2,
-}
-# Store organization IDs
-ORGS_ID = list(clients)
-
-# The provider of the functions for computing analytics is defined as the first organization.
-ANALYTICS_PROVIDER_ORG_ID = ORGS_ID[0]
-# Data providers orgs are the two last organizations.
-DATA_PROVIDER_ORGS_ID = ORGS_ID[1:]
-
-# %%
-# Prepare the data
-# ----------------
-#
-# The function ``setup_diabetes`` downloads the *diabetes* dataset if needed, and splits it in two to simulate a
-# federated setup. Each data organization has access to a chunk of the dataset.
-
-import pathlib
-
-from diabetes_substrafl_assets.dataset.diabetes_substrafl_dataset import setup_diabetes
-
-data_path = pathlib.Path.cwd() / "tmp" / "data_diabetes"
-data_path.mkdir(parents=True, exist_ok=True)
-
-setup_diabetes(data_path=data_path)
-
-
-# %%
-# Registering data samples and dataset
-# ------------------------------------
-#
-# Every asset is created with respect to predefined specifications previously imported from
-# ``substra.sdk.schemas``. To register assets, :ref:`documentation/api_reference:Schemas`
-# are first instantiated and the specs are then registered, which generates the real assets.
-#
-# Permissions are defined when registering assets. In a nutshell:
-#
-# - Data cannot be seen once it's registered on the platform.
-# - Metadata are visible by all the users of a network.
-# - Permissions allow you to execute a function on a certain dataset.
-#
-# Next, we need to define the asset directory. You should have already downloaded the assets folder as stated above.
-#
-# A dataset represents the data in Substra. It contains some metadata and an *opener*, a script used to load the
-# data from files into memory. You can find more details about datasets
-# in the :ref:`API reference`.
-
-from substra.sdk.schemas import DataSampleSpec
-from substra.sdk.schemas import DatasetSpec
-from substra.sdk.schemas import Permissions
-
-
-assets_directory = pathlib.Path.cwd() / "diabetes_substrafl_assets"
-assert assets_directory.is_dir(), """Did not find the asset directory,
-a directory called 'assets' is expected in the same location as this file"""
-
-permissions_dataset = Permissions(public=False, authorized_ids=[ANALYTICS_PROVIDER_ORG_ID])
-
-dataset = DatasetSpec(
-    name="Diabetes dataset",
-    type="csv",
-    data_opener=assets_directory / "dataset" / "diabetes_substrafl_opener.py",
-    description=data_path / "description.md",
-    permissions=permissions_dataset,
-    logs_permission=permissions_dataset,
-)
-
-# We register the dataset for each organization
-dataset_keys = {client_id: clients[client_id].add_dataset(dataset) for client_id in DATA_PROVIDER_ORGS_ID}
-
-for client_id, key in dataset_keys.items():
-    print(f"Dataset key for {client_id}: {key}")
-
-
-# %%
-# The dataset object itself is an empty shell. Data samples are needed in order to add actual data.
-# Each data sample references a folder containing a single data file (here a CSV), together with the key of
-# the dataset it is linked to.
-
-datasample_keys = {
-    org_id: clients[org_id].add_data_sample(
-        DataSampleSpec(
-            data_manager_keys=[dataset_keys[org_id]],
-            test_only=False,
-            path=data_path / f"org_{i + 1}",
-        ),
-        local=True,
-    )
-    for i, org_id in enumerate(DATA_PROVIDER_ORGS_ID)
-}
-
-# %%
-# The flexibility of the ComputePlanBuilder class
-# ===============================================
-#
-# This example aims at explaining how to use the :ref:`substrafl_doc/api/compute_plan_builder:Compute Plan Builder`
-# class, and how to use the full power of the flexibility it provides.
-#
-# Before starting, we need to keep in mind that a federated computation can be represented as a graph of tasks.
-# Some of these tasks need data to be executed (training tasks) and others are there to aggregate local results
-# (aggregation tasks).
-#
-# Substra does not store an explicit definition of this graph; instead, it gives the user full flexibility to define
-# the compute plan (or computation graph) they need, by linking a task to its parents.
-#
-# To create this graph of computations, SubstraFL provides the ``Node`` abstraction. A ``Node``
-# assigns tasks of a given type to an organization (aka a Client). The type of the ``Node`` depends on the type of tasks
-# we want to run on this organization (training or aggregation tasks).
-#
-# An organization (aka Client) without data can host an
-# :ref:`Aggregation node`.
-# We will use the :ref:`Aggregation node` object to compute the aggregated
-# analytics.
-#
-# An organization (aka a Client) containing the data samples can host a
-# :ref:`Train data node`.
-# Each node only has access to data from the organization hosting it.
-# These data samples must be instantiated with the right permissions to be processed by the given Client.
-
-from substrafl.nodes import TrainDataNode
-from substrafl.nodes import AggregationNode
-
-
-aggregation_node = AggregationNode(ANALYTICS_PROVIDER_ORG_ID)
-
-train_data_nodes = [
-    TrainDataNode(
-        organization_id=org_id,
-        data_manager_key=dataset_keys[org_id],
-        data_sample_keys=[datasample_keys[org_id]],
-    )
-    for org_id in DATA_PROVIDER_ORGS_ID
-]
-
-# %%
-# The :ref:`substrafl_doc/api/compute_plan_builder:Compute Plan Builder` is an abstract class that asks the user to
-# implement only three methods:
-#
-# - ``build_compute_plan(...)``
-# - ``load_local_state(...)``
-# - ``save_local_state(...)``
-#
-# The ``build_compute_plan`` method is essential to create the graph of the compute plan that will be executed on
-# Substra. Using the different ``Nodes`` we created, we will update their states by applying user-defined methods.
-#
-# These methods are passed as arguments to the ``Node`` using its ``update_state`` method.
-#
-
-
-import numpy as np
-import pandas as pd
-import json
-from collections import defaultdict
-from typing import List, Dict
-
-from substrafl import ComputePlanBuilder
-from substrafl.remote import remote_data, remote
-
-
-class Analytics(ComputePlanBuilder):
-    def __init__(self):
-        super().__init__()
-        self.first_order_aggregated_state = {}
-        self.second_order_aggregated_state = {}
-
-    @remote_data
-    def local_first_order_computation(self, datasamples: pd.DataFrame, shared_state=None):
-        """Compute from the data samples, expected to be a pandas dataframe,
-        the means and counts of each column of the data frame.
-        These datasamples are the output of the ``get_data`` function defined
-        in the ``diabetes_substrafl_opener.py`` file, which is available in the asset
-        folder downloaded at the beginning of the example.
-
-        The signature of a function decorated by @remote_data must contain
-        the datasamples and the shared_state arguments.
-
-        Args:
-            datasamples (pd.DataFrame): Pandas dataframe provided by the opener.
-            shared_state (None, optional): Unused here as this function only
-                uses local information already present in the datasamples.
-                Defaults to None.
-
-        Returns:
-            dict: dictionary containing the local information on means, counts
-                and number of samples. This dict will be used as a state to be
-                shared with an AggregationNode in order to compute the aggregation
-                of the different analytics.
-        """
-        df = datasamples
-        states = {
-            "n_samples": len(df),
-            "means": df.select_dtypes(include=np.number).sum().to_dict(),
-            "counts": {
-                name: series.value_counts().to_dict() for name, series in df.select_dtypes(include="category").items()
-            },
-        }
-        return states
-
-    @remote_data
-    def local_second_order_computation(self, datasamples: pd.DataFrame, shared_state: Dict):
-        """This function uses the output of the ``aggregation`` function to compute
-        locally the standard deviation of the different columns.
-
-        Args:
-            datasamples (pd.DataFrame): Pandas dataframe provided by the opener.
-            shared_state (Dict): Output of a first order analytics computation,
-                that must contain the means.
-
-        Returns:
-            Dict: dictionary containing the local information on standard deviation
-                and number of samples. This dict will be used as a state to be shared
-                with an AggregationNode in order to compute the aggregation of the
-                different analytics.
-        """
-        df = datasamples
-        means = pd.Series(shared_state["means"])
-        states = {
-            "n_samples": len(df),
-            "std": np.power(df.select_dtypes(include=np.number) - means, 2).sum(),
-        }
-        return states
-
-    @remote
-    def aggregation(self, shared_states: List[Dict]):
-        """Aggregation function that receives a list of locally computed analytics in order to
-        aggregate them.
-        The aggregation is a weighted average using "n_samples" as the weight coefficient.
-
-        Args:
-            shared_states (List[Dict]): list of dictionaries containing a field "n_samples",
-                and the analytics to aggregate in separate fields.
-
-        Returns:
-            Dict: dictionary containing the aggregated analytics.
-        """
-        total_len = 0
-        for state in shared_states:
-            total_len += state["n_samples"]
-
-        aggregated_values = defaultdict(lambda: defaultdict(float))
-        for state in shared_states:
-            for analytics_name, col_dict in state.items():
-                if analytics_name == "n_samples":
-                    # already aggregated in total_len
-                    continue
-                for col_name, v in col_dict.items():
-                    if isinstance(v, dict):
-                        # this column is categorical and v is a dict over
-                        # the different modalities
-                        if not aggregated_values[analytics_name][col_name]:
-                            aggregated_values[analytics_name][col_name] = defaultdict(float)
-                        for modality, vv in v.items():
-                            aggregated_values[analytics_name][col_name][modality] += vv / total_len
-                    else:
-                        # this is a numerical column and v is numerical
-                        aggregated_values[analytics_name][col_name] += v / total_len
-
-        # transform default_dict to regular dict
-        aggregated_values = json.loads(json.dumps(aggregated_values))
-
-        return aggregated_values
-
-    def build_compute_plan(
-        self,
-        train_data_nodes: List[TrainDataNode],
-        aggregation_node: AggregationNode,
-        num_rounds=None,
-        evaluation_strategy=None,
-        clean_models=False,
-    ):
-        """Method to build and link the different computations to execute with each other.
-        We will use the ``update_state`` method of the nodes given as input to choose which
-        method to apply.
-        For our example, we will only use TrainDataNodes and AggregationNodes.
-
-        Args:
-            train_data_nodes (List[TrainDataNode]): Nodes linked to the data samples on which
-                to compute analytics.
-            aggregation_node (AggregationNode): Node on which to compute the aggregation
-                of the analytics extracted from the train_data_nodes.
-            num_rounds (Optional[int]): Number of rounds used to iterate on the recurrent part of
-                the compute plan. Defaults to None.
-            evaluation_strategy (Optional[substrafl.EvaluationStrategy]): Object storing the
-                TestDataNode. Unused in this example. Defaults to None.
-            clean_models (bool): Whether to clean the intermediary models of this round on the
-                Substra platform. Defaults to False.
-        """
-        first_order_shared_states = []
-        local_states = {}
-
-        for node in train_data_nodes:
-            # Call local_first_order_computation on each train data node
-            next_local_state, next_shared_state = node.update_states(
-                self.local_first_order_computation(
-                    node.data_sample_keys,
-                    shared_state=None,
-                    _algo_name=f"Computing first order means with {self.__class__.__name__}",
-                ),
-                local_state=None,
-                round_idx=0,
-                authorized_ids=set([node.organization_id]),
-                aggregation_id=aggregation_node.organization_id,
-                clean_models=False,
-            )
-
-            # All local analytics are stored in first_order_shared_states,
-            # given as input to the aggregation method.
-            first_order_shared_states.append(next_shared_state)
-            local_states[node.organization_id] = next_local_state
-
-        # Call the aggregation method on the first_order_shared_states
-        self.first_order_aggregated_state = aggregation_node.update_states(
-            self.aggregation(
-                shared_states=first_order_shared_states,
-                _algo_name="Aggregating first order",
-            ),
-            round_idx=0,
-            authorized_ids=set([train_data_node.organization_id for train_data_node in train_data_nodes]),
-            clean_models=False,
-        )
-
-        second_order_shared_states = []
-
-        for node in train_data_nodes:
-            # Call local_second_order_computation on each train data node
-            _, next_shared_state = node.update_states(
-                self.local_second_order_computation(
-                    node.data_sample_keys,
-                    shared_state=self.first_order_aggregated_state,
-                    _algo_name=f"Computing second order analytics with {self.__class__.__name__}",
-                ),
-                local_state=local_states[node.organization_id],
-                round_idx=1,
-                authorized_ids=set([node.organization_id]),
-                aggregation_id=aggregation_node.organization_id,
-                clean_models=False,
-            )
-
-            # All local analytics are stored in second_order_shared_states,
-            # given as input to the aggregation method.
-            second_order_shared_states.append(next_shared_state)
-
-        # Call the aggregation method on the second_order_shared_states
-        self.second_order_aggregated_state = aggregation_node.update_states(
-            self.aggregation(
-                shared_states=second_order_shared_states,
-                _algo_name="Aggregating second order",
-            ),
-            round_idx=1,
-            authorized_ids=set([train_data_node.organization_id for train_data_node in train_data_nodes]),
-            clean_models=False,
-        )
-
-    def save_local_state(self, path: pathlib.Path):
-        """Save the important local state, to be retrieved after each new
-        call to a train or test task.
-
-        Args:
-            path (pathlib.Path): Path where to save the local_state. Provided internally by
-                Substra.
-        """
-        state_to_save = {
-            "first_order": self.first_order_aggregated_state,
-            "second_order": self.second_order_aggregated_state,
-        }
-        with open(path, "w") as f:
-            json.dump(state_to_save, f)
-
-    def load_local_state(self, path: pathlib.Path):
-        """Mirror function to load the local_state from a file saved using
-        ``save_local_state``.
-
-        Args:
-            path (pathlib.Path): Path from where to load the local_state. Provided internally by
-                Substra.
-
-        Returns:
-            ComputePlanBuilder: return self with the updated local state.
-        """
-        with open(path, "r") as f:
-            state_to_load = json.load(f)
-
-        self.first_order_aggregated_state = state_to_load["first_order"]
-        self.second_order_aggregated_state = state_to_load["second_order"]
-
-        return self
-
-
-# %%
-# Now that we have seen the implementation of the custom ``Analytics`` class, we can add details to some of the previously
-# introduced concepts.
-#
-# The ``update_state`` method outputs the new state of the node, which can be passed as an argument to a following one.
-# This succession of ``next_state`` passed to a new ``node.update_state`` is how Substra builds the graph of the
-# compute plan.
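The two-round pattern above — share local sums and counts, aggregate a global mean, then share squared deviations from that mean — can be checked on toy data without Substra. The organization names and values below are made up for illustration; the real example exchanges these states through ``update_states``.

```python
import math

# Toy data: each organization only ever shares aggregates, never raw rows.
org_data = {
    "org-2": [1.0, 2.0, 3.0],
    "org-3": [5.0, 7.0],
}

# Round 1 (first order): each organization shares its sample count and sum.
first_order = {org: {"n": len(rows), "sum": sum(rows)} for org, rows in org_data.items()}
total_n = sum(state["n"] for state in first_order.values())
global_mean = sum(state["sum"] for state in first_order.values()) / total_n

# Round 2 (second order): each organization shares its sum of squared
# deviations from the global mean computed in round 1.
second_order = {
    org: sum((x - global_mean) ** 2 for x in rows) for org, rows in org_data.items()
}
global_std = math.sqrt(sum(second_order.values()) / total_n)

print(global_mean, round(global_std, 4))  # 3.6 2.1541
```

The result matches the mean and population standard deviation computed over the pooled rows, even though each round only reveals aggregates.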
-#
-# The ``load_local_state`` and ``save_local_state`` methods are used at each new iteration on a Node, in order to
-# retrieve the previous local state that has not been shared with the other ``Nodes``.
-#
-# For instance, after updating a :ref:`Train data node` using its
-# ``update_state`` method, we will have access to its next local state, which we will pass as an argument to the
-# next ``update_state`` we will apply on this :ref:`Train data node`.
-#
-# To summarize, a :ref:`substrafl_doc/api/compute_plan_builder:Compute Plan Builder` is composed of several decorated
-# user-defined functions, which may need data (decorated with ``@remote_data``) or not (decorated with ``@remote``).
-#
-# See :ref:`substrafl_doc/api/remote:Decorator` for more information on these decorators.
-#
-# These user-defined functions will be used to create the graph of the compute plan through the ``build_compute_plan``
-# method and the ``update_state`` method of the different ``Nodes``.
-#
-# The local state obtained after updating a :ref:`Train data node` needs the
-# ``save_local_state`` and ``load_local_state`` methods to retrieve the state the Node was in at the end of
-# the last update.
-#
-
-# %%
-# Running the experiment
-# ======================
-#
-# As a last step before launching our experiment, we need to specify the third-party dependencies required to run it.
-# The :ref:`substrafl_doc/api/dependency:Dependency` object is instantiated in order to install the right libraries in
-# the Python environment of each organization.
-#
-# We now have all the necessary objects to launch our experiment. Please see a summary below of all the objects we created so far:
-#
-# - A :ref:`documentation/references/sdk:Client` to add or retrieve the assets of our experiment, using their keys to
-#   identify them.
-# - A :ref:`Federated Strategy`, to specify what compute plan we want to execute.
-# - :ref:`Train data nodes` to indicate on which data to train.
-# - An :ref:`substrafl_doc/api/evaluation_strategy:Evaluation Strategy`, to define where and at which frequency we -# evaluate the model. This does not apply to our experiment, so we set it to None. -# - An :ref:`substrafl_doc/api/nodes:AggregationNode`, to specify the organization on which the aggregation operation -# will be computed. -# - An **experiment folder** to save a summary of the operations performed. -# - The :ref:`substrafl_doc/api/dependency:Dependency` to define the libraries on which the experiment needs to run. - -from substrafl.dependency import Dependency -from substrafl.experiment import execute_experiment - -dependencies = Dependency(pypi_dependencies=["numpy==1.23.1", "pandas==1.5.3"]) - -compute_plan = execute_experiment( - client=clients[ANALYTICS_PROVIDER_ORG_ID], - strategy=Analytics(), - train_data_nodes=train_data_nodes, - evaluation_strategy=None, - aggregation_node=aggregation_node, - experiment_folder=str(pathlib.Path.cwd() / "tmp" / "experiment_summaries"), - dependencies=dependencies, - clean_models=False, - name="Federated Analytics with SubstraFL documentation example", -) - -# %% -# Results -# ------- -# -# The output of a task can be downloaded using utility functions provided by SubstraFL, such as -# ``download_algo_state``, ``download_train_shared_state`` or ``download_aggregate_shared_state``. -# -# These functions download the output of a given ``round_idx`` or ``rank_idx`` from a given ``Client`` and a -# given ``compute_plan_key``. - -from substrafl.model_loading import download_aggregate_shared_state - -# The aggregated analytics are computed in the ANALYTICS_PROVIDER_ORG_ID client.
-client_to_download_from = clients[ANALYTICS_PROVIDER_ORG_ID] - -# The results will be available once the compute plan is completed -client_to_download_from.wait_compute_plan(compute_plan.key) - -first_rank_analytics = download_aggregate_shared_state( - client=client_to_download_from, - compute_plan_key=compute_plan.key, - round_idx=0, -) - -second_rank_analytics = download_aggregate_shared_state( - client=client_to_download_from, - compute_plan_key=compute_plan.key, - round_idx=1, -) - -print( - f"""Age mean: {first_rank_analytics['means']['age']:.2f} years -Sex percentage: - Male: {100*first_rank_analytics['counts']['sex']['M']:.2f}% - Female: {100*first_rank_analytics['counts']['sex']['F']:.2f}% -Blood pressure std: {second_rank_analytics["std"]["bp"]:.2f} mm Hg -""" -) diff --git a/examples/substrafl/go_further/run_iris_sklearn.py b/examples/substrafl/go_further/run_iris_sklearn.py deleted file mode 100644 index 4c1ba1c2..00000000 --- a/examples/substrafl/go_further/run_iris_sklearn.py +++ /dev/null @@ -1,475 +0,0 @@ -""" -========================================= -Using scikit-learn FedAvg on IRIS dataset -========================================= - -This example illustrates an advanced usage of SubstraFL as it does not use the SubstraFL PyTorch interface, but showcases the general SubstraFL interface that you can use with any ML framework. - - -This example is based on: - -- Dataset: IRIS, tabular dataset to classify iris type -- Model type: Logistic regression using Scikit-Learn -- FL setup: three organizations, two data providers and one algo provider - -This example does not use a deployed Substra platform; it runs in local mode. - -To run this example, you need to download and unzip the assets needed to run it in the same directory as this example. - .. only:: builder_html or readthedocs - :download:`assets required to run this example <../../../../../tmp/sklearn_fedavg_assets.zip>` - - * Please ensure all the libraries are installed.
A *requirements.txt* file is included in the zip file; you can run the command ``pip install -r requirements.txt`` to install them. - * **Substra** and **SubstraFL** should already be installed. If not, follow the instructions described here: :ref:`substrafl_doc/substrafl_overview:Installation`. - -""" -# %% -# Setup -# ***** -# -# We work with three different organizations. Two organizations provide a dataset, and a third -# one provides the algorithm and registers the machine learning tasks. -# -# This example runs in local mode, simulating a federated learning experiment. -# -# In the following code cell, we define the different organizations needed for our FL experiment. - - -import numpy as np - -from substra import Client - -SEED = 42 -np.random.seed(SEED) - -# Choose the subprocess mode to locally simulate the FL process -N_CLIENTS = 3 -clients_list = [Client(client_name=f"org-{i+1}") for i in range(N_CLIENTS)] -clients = {client.organization_info().organization_id: client for client in clients_list} - -# Store organization IDs -ORGS_ID = list(clients) -ALGO_ORG_ID = ORGS_ID[0] # Algo provider is defined as the first organization. -DATA_PROVIDER_ORGS_ID = ORGS_ID[1:] # Data provider orgs are the last two organizations. - -# %% -# Data and metrics -# **************** - -# %% -# Data preparation -# ================ -# -# This section downloads (if needed) the **IRIS dataset** using the `Scikit-Learn dataset module -# `__. -# It extracts the data locally and creates two folders: one for each organization. -# -# Each organization will have access to half the train data, and to half the test data.
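The ``setup_iris`` helper imported below takes care of this split. As a standalone illustration of what such an even split looks like (the function ``split_iris`` and its layout here are assumptions for the sketch, not the helper's actual code):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split


def split_iris(n_client, seed=42):
    """Hypothetical sketch: hold out a test set, then deal the train and
    test data evenly between the clients."""
    data = load_iris()
    x_train, x_test, y_train, y_test = train_test_split(
        data.data, data.target, test_size=0.2, random_state=seed
    )
    # np.array_split tolerates shares of slightly unequal length
    return (
        np.array_split(x_train, n_client),
        np.array_split(y_train, n_client),
        np.array_split(x_test, n_client),
        np.array_split(y_test, n_client),
    )


x_shares, y_shares, x_test_shares, y_test_shares = split_iris(n_client=2)
print([s.shape for s in x_shares])  # two shares of 60 samples, 4 features each
```

With IRIS's 150 samples and a 20% test split, each of the two clients ends up with 60 train and 15 test samples; the real helper additionally writes each share to a per-organization folder.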
- -import pathlib -from sklearn_fedavg_assets.dataset.iris_dataset import setup_iris - -# sphinx_gallery_thumbnail_path = 'static/example_thumbnail/iris.jpg' - -# Create the temporary directory for generated data -(pathlib.Path.cwd() / "tmp").mkdir(exist_ok=True) -data_path = pathlib.Path.cwd() / "tmp" / "data_iris" - -setup_iris(data_path=data_path, n_client=len(DATA_PROVIDER_ORGS_ID)) - -# %% -# Dataset registration -# ==================== - -from substra.sdk.schemas import DatasetSpec -from substra.sdk.schemas import Permissions -from substra.sdk.schemas import DataSampleSpec - -assets_directory = pathlib.Path.cwd() / "sklearn_fedavg_assets" - -permissions_dataset = Permissions(public=False, authorized_ids=[ALGO_ORG_ID]) - -dataset = DatasetSpec( - name="Iris", - type="npy", - data_opener=assets_directory / "dataset" / "iris_opener.py", - description=assets_directory / "dataset" / "description.md", - permissions=permissions_dataset, - logs_permission=permissions_dataset, -) - -dataset_keys = {} -train_datasample_keys = {} -test_datasample_keys = {} - -for i, org_id in enumerate(DATA_PROVIDER_ORGS_ID): - client = clients[org_id] - - # Add the dataset to the client to provide access to the opener in each organization. - dataset_keys[org_id] = client.add_dataset(dataset) - assert dataset_keys[org_id], "Missing data manager key" - - # Add the training data on each organization. - data_sample = DataSampleSpec( - data_manager_keys=[dataset_keys[org_id]], - path=data_path / f"org_{i+1}" / "train", - ) - train_datasample_keys[org_id] = client.add_data_sample( - data_sample, - local=True, - ) - - # Add the testing data on each organization.
- data_sample = DataSampleSpec( - data_manager_keys=[dataset_keys[org_id]], - path=data_path / f"org_{i+1}" / "test", - ) - test_datasample_keys[org_id] = client.add_data_sample( - data_sample, - local=True, - ) - -# %% -# Metrics registration -# ==================== - -from sklearn.metrics import accuracy_score -import numpy as np - - -def accuracy(datasamples, predictions_path): - y_true = datasamples["targets"] - y_pred = np.load(predictions_path) - - return accuracy_score(y_true, y_pred) - - -# %% -# Specify the machine learning components -# *************************************** -# -# SubstraFL can be used with any machine learning framework. The -# framework-dependent functions are written in the :ref:`Algorithm` object. -# -# In this section, you will: -# -# - register a model and its dependencies -# - write your own Sklearn SubstraFL algorithm -# - specify the federated learning strategy -# - specify the organizations where to train and where to aggregate -# - specify the organizations where to test the models -# - actually run the computations -# %% -# Model definition -# ================ -# -# The machine learning model used here is a logistic regression. -# The `warm_start` argument is essential in this example as it tells the model to use its current state -# as initialization for future training. -# By default scikit-learn uses `max_iter=100`, which means the model trains on up to 100 epochs. -# When doing federated learning, we don't want to train too much locally at every round, -# otherwise the local training will erase what was learned from the other centers. That is why we set `max_iter=3`. - - -import os -from sklearn import linear_model - -cls = linear_model.LogisticRegression(random_state=SEED, warm_start=True, max_iter=3) - -# Optional: -# Scikit-Learn raises warnings in case of non-convergence, which we choose to disable here.
-# As this example runs with Python subprocesses, the way to disable them is to use the following environment -# variable: -os.environ["PYTHONWARNINGS"] = "ignore:lbfgs failed to converge (status=1):UserWarning" - -# %% -# SubstraFL algo definition -# ========================== -# -# This section is the most important one for this example. We will define here the function that will run locally on -# each node to train the model. -# -# As SubstraFL does not provide an algorithm compatible with Sklearn, we need to define one using the provided documentation on -# :ref:`substrafl_doc/api/algorithms:Base Class`. -# -# To define a custom algorithm, we will need to inherit from the base class Algo, and to define two properties and four -# methods: -# -# - **strategies** (property): the list of strategies our algorithm is compatible with. -# - **model** (property): a property that returns the model from the defined algo. -# - **train** (method): a function to describe the training process to -# apply to train our model in a federated way. -# The train method must accept as parameters `datasamples` and `shared_state`. -# - **predict** (method): a function to describe how to compute the -# predictions from the algo model. -# The predict method must accept as parameters `datasamples`, `shared_state` and `predictions_path`. -# - **save** (method): specify how to save the important states of our algo. -# - **load** (method): specify how to load the important states of our algo from a file previously saved -# by the `save` function described above. - -from substrafl import algorithms -from substrafl import remote -from substrafl.strategies import schemas as fl_schemas - -import joblib -from typing import Optional -import shutil - -# The Iris dataset has four attributes used to predict three different classes.
-INPUT_SIZE = 4 -OUTPUT_SIZE = 3 - - -class SklearnLogisticRegression(algorithms.Algo): - def __init__(self, model, seed=None): - super().__init__(model=model, seed=seed) - - self._model = model - - # We need all different instances of the algorithm to have the same - # initialization. - self._model.coef_ = np.ones((OUTPUT_SIZE, INPUT_SIZE)) - self._model.intercept_ = np.zeros(3) - self._model.classes_ = np.array([-1]) - - if seed is not None: - np.random.seed(seed) - - @property - def strategies(self): - """List of compatible strategies""" - return [fl_schemas.StrategyName.FEDERATED_AVERAGING] - - @property - def model(self): - return self._model - - @remote.remote_data - def train( - self, - datasamples, - shared_state: Optional[fl_schemas.FedAvgAveragedState] = None, - ) -> fl_schemas.FedAvgSharedState: - """The train function to be executed on organizations containing - data we want to train our model on. The @remote_data decorator is mandatory - to allow this function to be sent and executed on the right organization. - - Args: - datasamples: datasamples extracted from the organizations data using - the given opener. - shared_state (Optional[fl_schemas.FedAvgAveragedState], optional): - shared_state provided by the aggregator. Defaults to None. - - Returns: - fl_schemas.FedAvgSharedState: State to be sent to the aggregator. - """ - - if shared_state is not None: - # If we have a shared state, we update the model parameters with - # the average parameters updates. - self._model.coef_ += np.reshape( - shared_state.avg_parameters_update[:-1], - (OUTPUT_SIZE, INPUT_SIZE), - ) - self._model.intercept_ += shared_state.avg_parameters_update[-1] - - # To be able to compute the delta between the parameters before and after training, - # we need to save them in a temporary variable. - old_coef = self._model.coef_ - old_intercept = self._model.intercept_ - - # Model training. - self._model.fit(datasamples["data"], datasamples["targets"]) - - # We compute the delta.
- delta_coef = self._model.coef_ - old_coef - delta_bias = self._model.intercept_ - old_intercept - - # We reset the model parameters to their state before training in order to remove - # the local updates from it. - self._model.coef_ = old_coef - self._model.intercept_ = old_intercept - - # We output the length of the dataset to apply a weighted average between - # the organizations regarding their number of samples, and the local - # parameters updates. - # These updates are sent to the aggregator to compute the average - # parameters updates, that we will receive in the next round in the - # `shared_state`. - return fl_schemas.FedAvgSharedState( - n_samples=len(datasamples["targets"]), - parameters_update=[p for p in delta_coef] + [delta_bias], - ) - - @remote.remote_data - def predict(self, datasamples, shared_state, predictions_path): - """The predict function to be executed on organizations containing - data we want to test our model on. The @remote_data decorator is mandatory - to allow this function to be sent and executed on the right organization. - - Args: - datasamples: datasamples extracted from the organizations data using - the given opener. - shared_state: shared_state provided by the aggregator. - predictions_path: Path where to save the predictions. - This path is provided by Substra and the metric will automatically - get access to this path to load the predictions. - """ - predictions = self._model.predict(datasamples["data"]) - - if predictions_path is not None: - np.save(predictions_path, predictions) - - # np.save() automatically adds a ".npy" to the end of the file. - # We rename the file produced by removing the ".npy" suffix, to make sure that - # predictions_path is the actual file name. 
- shutil.move(str(predictions_path) + ".npy", predictions_path) - - def save_local_state(self, path): - joblib.dump( - { - "model": self._model, - "coef": self._model.coef_, - "bias": self._model.intercept_, - }, - path, - ) - - def load_local_state(self, path): - loaded_dict = joblib.load(path) - self._model = loaded_dict["model"] - self._model.coef_ = loaded_dict["coef"] - self._model.intercept_ = loaded_dict["bias"] - return self - - -# %% -# Federated Learning strategies -# ============================= - -from substrafl.strategies import FedAvg - -strategy = FedAvg(algo=SklearnLogisticRegression(model=cls, seed=SEED)) - -# %% -# Where to train where to aggregate -# ================================= - -from substrafl.nodes import TrainDataNode -from substrafl.nodes import AggregationNode - - -aggregation_node = AggregationNode(ALGO_ORG_ID) - -# Create the Train Data Nodes (or training tasks) and save them in a list -train_data_nodes = [ - TrainDataNode( - organization_id=org_id, - data_manager_key=dataset_keys[org_id], - data_sample_keys=[train_datasample_keys[org_id]], - ) - for org_id in DATA_PROVIDER_ORGS_ID -] - -# %% -# Where and when to test -# ====================== - - -from substrafl.nodes import TestDataNode -from substrafl.evaluation_strategy import EvaluationStrategy - -# Create the Test Data Nodes (or testing tasks) and save them in a list -test_data_nodes = [ - TestDataNode( - organization_id=org_id, - data_manager_key=dataset_keys[org_id], - test_data_sample_keys=[test_datasample_keys[org_id]], - metric_functions=accuracy, - ) - for org_id in DATA_PROVIDER_ORGS_ID -] - -my_eval_strategy = EvaluationStrategy(test_data_nodes=test_data_nodes, eval_frequency=1) - -# %% -# Running the experiment -# ********************** - -from substrafl.experiment import execute_experiment -from substrafl.dependency import Dependency - -# Number of times to apply the compute plan. 
-NUM_ROUNDS = 6 - -dependencies = Dependency(pypi_dependencies=["numpy==1.23.1", "scikit-learn==1.1.1"]) - -compute_plan = execute_experiment( - client=clients[ALGO_ORG_ID], - strategy=strategy, - train_data_nodes=train_data_nodes, - evaluation_strategy=my_eval_strategy, - aggregation_node=aggregation_node, - num_rounds=NUM_ROUNDS, - experiment_folder=str(pathlib.Path.cwd() / "tmp" / "experiment_summaries"), - dependencies=dependencies, - name="IRIS documentation example", -) - -# %% -# Explore the results -# ******************* - -# The results will be available once the compute plan is completed -clients[ALGO_ORG_ID].wait_compute_plan(compute_plan.key) - -# %% -# Listing results -# =============== - -import pandas as pd - -performances_df = pd.DataFrame(client.get_performances(compute_plan.key).dict()) -print("\nPerformance Table: \n") -print(performances_df[["worker", "round_idx", "performance"]]) - -# %% -# Plot results -# ============ - -import matplotlib.pyplot as plt - -plt.title("Test dataset results") -plt.xlabel("Rounds") -plt.ylabel("Accuracy") - -for org_id in DATA_PROVIDER_ORGS_ID: - df = performances_df[performances_df["worker"] == org_id] - plt.plot(df["round_idx"], df["performance"], label=org_id) - -plt.legend(loc="lower right") -plt.show() - -# %% -# Download a model -# ================ - -from substrafl.model_loading import download_algo_state - -client_to_download_from = DATA_PROVIDER_ORGS_ID[0] -round_idx = None - -algo = download_algo_state( - client=clients[client_to_download_from], - compute_plan_key=compute_plan.key, - round_idx=round_idx, -) - -cls = algo.model - -print("Coefs: ", cls.coef_) -print("Intercepts: ", cls.intercept_) diff --git a/examples/substrafl/go_further/run_mnist_cyclic.py b/examples/substrafl/go_further/run_mnist_cyclic.py deleted file mode 100644 index 00964435..00000000 --- a/examples/substrafl/go_further/run_mnist_cyclic.py +++ /dev/null @@ -1,851 +0,0 @@ -""" -=============================================== 
-Creating Torch Cyclic strategy on MNIST dataset -=============================================== - -This example illustrates an advanced usage of SubstraFL and proposes to implement a new Federated Learning strategy, -called **Cyclic Strategy**, using the SubstraFL base classes. -This example runs on the `MNIST Dataset of handwritten digits `__ using PyTorch. -In this example, we work on 28x28-pixel grayscale images. This is a classification problem -aiming to recognize the number written on each image. - -The **Cyclic Strategy** consists of training a model locally on the different organizations (or centers) sequentially (one after the other). We -consider a round of this strategy to be a full cycle of local trainings. - -This example shows an implementation of the CyclicTorchAlgo using -:ref:`TorchAlgo ` as base class, and the CyclicStrategy implementation using -:ref:`Strategy ` as base class. - -This example does not use a deployed platform of Substra and runs in local mode. - -To run this example, you need to download and unzip the assets needed to run it in the same directory as this example. - .. only:: builder_html or readthedocs - :download:`assets required to run this example <../../../../../tmp/torch_cyclic_assets.zip>` - - * Please ensure all the libraries are installed. A *requirements.txt* file is included in the zip file; you can run the command ``pip install -r requirements.txt`` to install them. - * **Substra** and **SubstraFL** should already be installed. If not, follow the instructions described here: :ref:`substrafl_doc/substrafl_overview:Installation`. - -""" -# %% -# Setup -# ***** -# -# This example runs with three organizations. Two organizations provide datasets, while a third -# one provides the algorithm. -# -# In the following code cell, we define the different organizations needed for our FL experiment.
- - -from substra import Client - -N_CLIENTS = 3 - -client_0 = Client(client_name="org-1") -client_1 = Client(client_name="org-2") -client_2 = Client(client_name="org-3") - -# %% -# Every computation will run in ``subprocess`` mode, where everything runs locally in Python -# subprocesses. -# Other backend_types are: -# -# - ``docker`` mode where computations run locally in docker containers -# - ``remote`` mode where computations run remotely (you need to have a deployed platform for that) -# -# To run in remote mode, use the following syntax: -# -# ``client_remote = Client(backend_type="remote", url="MY_BACKEND_URL", username="my-username", password="my-password")`` - - -# Create a dictionary to easily access each client from its human-friendly id -clients = { - client_0.organization_info().organization_id: client_0, - client_1.organization_info().organization_id: client_1, - client_2.organization_info().organization_id: client_2, -} - -# Store organization IDs -ORGS_ID = list(clients) -# Algo provider is defined as the first organization. -ALGO_ORG_ID = ORGS_ID[0] -# All organizations provide data in this cyclic setup. -DATA_PROVIDER_ORGS_ID = ORGS_ID - -# %% -# Data and metrics -# **************** - -# %% -# Data preparation -# ================ -# -# This section downloads (if needed) the **MNIST dataset** using the `torchvision library -# `__. -# It extracts the images from the raw files and locally creates a folder for each -# organization. -# -# Each organization will have access to half the training data and half the test data (which -# corresponds to **30,000** -# images for training and **5,000** for testing each). 
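The ``setup_mnist`` helper imported below performs the download and this split. The partition itself amounts to something like the following sketch (the names are hypothetical; the real helper also writes each share to a per-organization folder):

```python
import numpy as np


def split_between_orgs(images, labels, n_orgs):
    """Hypothetical sketch: deal the images and labels evenly between
    the organizations, keeping each image aligned with its label."""
    return list(zip(np.array_split(images, n_orgs), np.array_split(labels, n_orgs)))


# Toy stand-ins for the arrays extracted from the raw MNIST files:
# 60 blank 28x28 grayscale images instead of the real 60,000.
images = np.zeros((60, 28, 28), dtype=np.uint8)
labels = np.arange(60) % 10

shares = split_between_orgs(images, labels, n_orgs=2)
print(shares[0][0].shape)  # each organization gets half the images
```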
- -import pathlib -from torch_cyclic_assets.dataset.cyclic_mnist_dataset import setup_mnist - -# sphinx_gallery_thumbnail_path = 'static/example_thumbnail/cyclic-mnist.png' - -# Create the temporary directory for generated data -(pathlib.Path.cwd() / "tmp").mkdir(exist_ok=True) -data_path = pathlib.Path.cwd() / "tmp" / "data_mnist" - -setup_mnist(data_path, len(DATA_PROVIDER_ORGS_ID)) - -# %% -# Dataset registration -# ==================== -# -# A :ref:`documentation/concepts:Dataset` is composed of an **opener**, which is a Python script that can load -# the data from the files in memory and a description markdown file. -# The :ref:`documentation/concepts:Dataset` object itself does not contain the data. The proper asset that contains the -# data is the **datasample asset**. -# -# A **datasample** contains a local path to the data. A datasample can be linked to a dataset in order to add data to a -# dataset. -# -# Data privacy is a key concept for Federated Learning experiments. That is why we set -# :ref:`documentation/concepts:Permissions` for :ref:`documentation/concepts:Assets` to determine how each organization -# can access a specific asset. -# You can read more about permissions in the :ref:`User Guide`. -# -# Note that metadata such as the assets' creation date and the asset owner are visible to all the organizations of a -# network. - -from substra.sdk.schemas import DatasetSpec -from substra.sdk.schemas import Permissions -from substra.sdk.schemas import DataSampleSpec - -assets_directory = pathlib.Path.cwd() / "torch_cyclic_assets" -dataset_keys = {} -train_datasample_keys = {} -test_datasample_keys = {} - -for i, org_id in enumerate(DATA_PROVIDER_ORGS_ID): - client = clients[org_id] - - permissions_dataset = Permissions(public=False, authorized_ids=[ALGO_ORG_ID]) - - # DatasetSpec is the specification of a dataset. It makes sure every field - # is well-defined, and that our dataset is ready to be registered. 
- # The real dataset object is created in the add_dataset method. - - dataset = DatasetSpec( - name="MNIST", - type="npy", - data_opener=assets_directory / "dataset" / "cyclic_mnist_opener.py", - description=assets_directory / "dataset" / "description.md", - permissions=permissions_dataset, - logs_permission=permissions_dataset, - ) - dataset_keys[org_id] = client.add_dataset(dataset) - assert dataset_keys[org_id], "Missing dataset key" - - # Add the training data on each organization. - data_sample = DataSampleSpec( - data_manager_keys=[dataset_keys[org_id]], - path=data_path / f"org_{i+1}" / "train", - ) - train_datasample_keys[org_id] = client.add_data_sample(data_sample) - - # Add the testing data on each organization. - data_sample = DataSampleSpec( - data_manager_keys=[dataset_keys[org_id]], - path=data_path / f"org_{i+1}" / "test", - ) - test_datasample_keys[org_id] = client.add_data_sample(data_sample) - - -# %% -# Metrics definition -# ================== -# -# A metric is a function used to evaluate the performance of your model on one or several -# **datasamples**. -# -# To add a metric, you need to define a function that computes and returns a performance -# from the datasamples (as returned by the opener) and the predictions_path (to be loaded within the function). -# -# When using a Torch SubstraFL algorithm, the predictions are saved in the ``predict`` function in numpy format -# so that you can simply load them using ``np.load``. 
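That contract between the predict step and the metric can be exercised end-to-end with a small standalone sketch (fake logits and labels for illustration, not SubstraFL internals):

```python
import pathlib
import tempfile

import numpy as np
from sklearn.metrics import accuracy_score


def accuracy(datasamples, predictions_path):
    # Same shape as the metrics registered in this example: load the saved
    # predictions, then compare their argmax to the true labels.
    y_true = datasamples["labels"]
    y_pred = np.load(predictions_path)
    return accuracy_score(y_true, np.argmax(y_pred, axis=1))


# The predict step saves the model outputs (here, fake logits) in numpy format...
logits = np.array([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]])
datasamples = {"labels": np.array([1, 0, 0])}

with tempfile.TemporaryDirectory() as tmp:
    predictions_path = pathlib.Path(tmp) / "predictions.npy"
    np.save(predictions_path, logits)
    # ...and the metric reloads them with np.load.
    score = accuracy(datasamples, predictions_path)

print(score)  # 2 of the 3 argmax predictions match the labels
```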
- -from sklearn.metrics import accuracy_score -from sklearn.metrics import roc_auc_score -import numpy as np - - -def accuracy(datasamples, predictions_path): - y_true = datasamples["labels"] - y_pred = np.load(predictions_path) - - return accuracy_score(y_true, np.argmax(y_pred, axis=1)) - - -def roc_auc(datasamples, predictions_path): - y_true = datasamples["labels"] - y_pred = np.load(predictions_path) - - n_class = np.max(y_true) + 1 - y_true_one_hot = np.eye(n_class)[y_true] - - return roc_auc_score(y_true_one_hot, y_pred) - - -# %% -# Machine learning components definition -# ************************************** -# -# This section uses the PyTorch based SubstraFL API to simplify the definition of machine learning components. -# However, SubstraFL is compatible with any machine learning framework. -# -# -# In this section, you will: -# -# - Register a model and its dependencies -# - Create a federated learning strategy -# - Specify the training and aggregation nodes -# - Specify the test nodes -# - Actually run the computations - - -# %% -# Model definition -# ================ -# -# We choose to use a classic torch CNN as the model to train. The model architecture is defined by the user -# independently of SubstraFL. 
- -import torch -from torch import nn -import torch.nn.functional as F - -seed = 42 -torch.manual_seed(seed) - - -class CNN(nn.Module): - def __init__(self): - super(CNN, self).__init__() - self.conv1 = nn.Conv2d(1, 32, kernel_size=5) - self.conv2 = nn.Conv2d(32, 32, kernel_size=5) - self.conv3 = nn.Conv2d(32, 64, kernel_size=5) - self.fc1 = nn.Linear(3 * 3 * 64, 256) - self.fc2 = nn.Linear(256, 10) - - def forward(self, x, eval=False): - x = F.relu(self.conv1(x)) - x = F.relu(F.max_pool2d(self.conv2(x), 2)) - x = F.dropout(x, p=0.5, training=not eval) - x = F.relu(F.max_pool2d(self.conv3(x), 2)) - x = F.dropout(x, p=0.5, training=not eval) - x = x.view(-1, 3 * 3 * 64) - x = F.relu(self.fc1(x)) - x = F.dropout(x, p=0.5, training=not eval) - x = self.fc2(x) - return F.log_softmax(x, dim=1) - - -model = CNN() -optimizer = torch.optim.Adam(model.parameters(), lr=0.001) -criterion = torch.nn.CrossEntropyLoss() - -# %% -# Specifying how much data to train on -# ===================================== -# -# To specify how much data to train on at each round, we use the ``index_generator`` object. -# We specify the batch size and the number of batches (named ``num_updates``) to consider for each round. -# See :ref:`substrafl_doc/substrafl_overview:Index Generator` for more details. - - -from substrafl.index_generator import NpIndexGenerator - -# Number of model updates between each FL strategy aggregation. -NUM_UPDATES = 100 - -# Number of samples per update. -BATCH_SIZE = 32 - -index_generator = NpIndexGenerator( - batch_size=BATCH_SIZE, - num_updates=NUM_UPDATES, -) - -# %% -# Torch Dataset definition -# ========================== -# -# This torch Dataset is used to preprocess the data using the ``__getitem__`` function. -# -# This torch Dataset needs to have a specific ``__init__`` signature, that must contain (self, datasamples, is_inference).
-# -# The ``__getitem__`` function is expected to return (inputs, outputs) if ``is_inference`` is ``False``, else only the inputs. -# This behavior can be changed by re-writing the ``_local_train`` or ``predict`` methods. - - -class TorchDataset(torch.utils.data.Dataset): - def __init__(self, datasamples, is_inference: bool): - self.x = datasamples["images"] - self.y = datasamples["labels"] - self.is_inference = is_inference - - def __getitem__(self, idx): - if self.is_inference: - x = torch.FloatTensor(self.x[idx][None, ...]) / 255 - return x - - else: - x = torch.FloatTensor(self.x[idx][None, ...]) / 255 - - y = torch.tensor(self.y[idx]).type(torch.int64) - y = F.one_hot(y, 10) - y = y.type(torch.float32) - - return x, y - - def __len__(self): - return len(self.x) - - -# %% -# Cyclic Strategy implementation -# ============================== -# -# A FL strategy specifies how to train a model on distributed data. -# -# The **Cyclic Strategy** passes the model from an organization to the next one, until all -# the data available in Substra has been sequentially presented to the model. -# -# This is not the most efficient strategy. The model will overfit the last dataset it sees, -# and the order of training will impact the performance of the model. But we will use this implementation -# as an example to explain and show how to implement your own strategies using SubstraFL. -# -# To implement this new strategy, we need to overwrite three methods: -# -# - ``initialization_round``, to indicate what tasks to execute at round 0, in order to set up the variables -# and be able to compute the performance of the model before any training. -# - ``perform_round``, to indicate which tasks to execute, and in which order, during a round of the strategy. -# - ``perform_predict``, to indicate how to compute the predictions and performances.
-# - -from typing import Any -from typing import List -from typing import Optional - -from substrafl import strategies -from substrafl.algorithms.algo import Algo -from substrafl.nodes.aggregation_node import AggregationNode -from substrafl.nodes.test_data_node import TestDataNode -from substrafl.nodes.train_data_node import TrainDataNode - - -class CyclicStrategy(strategies.Strategy): - """The base class Strategy proposes a default compute plan structure - in its ``build_compute_plan`` method implementation, dedicated to Federated Learning compute plans. - This method calls ``initialization_round`` at round 0, and then repeats ``perform_round`` for ``num_rounds``. - - The default ``build_compute_plan`` implementation also takes into account the given evaluation - strategy to trigger the test tasks when needed. - """ - - def __init__(self, algo: Algo, *args, **kwargs): - """ - It is possible to add any arguments to a Strategy. It is important to pass these arguments as - args or kwargs to the parent class, using the super().__init__(...) method. - Indeed, SubstraFL does not use the instance of the object: it re-instantiates it at each new task - using the args and kwargs passed to the parent class, and uses the save and load local state methods to retrieve - its state. - - Args: - algo (Algo): A Strategy takes an Algo as argument, in order to deal with framework-specific - functions in a dedicated object. - """ - super().__init__(algo=algo, *args, **kwargs) - - self._cyclic_local_state = None - self._cyclic_shared_state = None - - @property - def name(self) -> str: - """The name of the strategy. Useful to indicate which Algos - are or are not compatible with this strategy.
- - Returns: - str: Name of the strategy - """ - return "Cyclic Strategy" - - def initialization_round( - self, - *, - train_data_nodes: List[TrainDataNode], - clean_models: bool, - round_idx: Optional[int] = 0, - additional_orgs_permissions: Optional[set] = None, - ): - """The ``initialization_round`` function is called at round 0 by the - ``build_compute_plan`` method. In our strategy, we want to initialize - ``_cyclic_local_state`` in order to be able to test the model before - any training. - - We only initialize the model on the first train data node. - - Args: - train_data_nodes (List[TrainDataNode]): Train data nodes representing the different - organizations containing data we want to train on. - clean_models (bool): Boolean to indicate whether to clean the intermediate shared states. - Only taken into account in ``remote`` mode. - round_idx (Optional[int], optional): Current round index. The initialization round is zero by default, - but you are free to change it in the ``build_compute_plan`` method. Defaults to 0. - additional_orgs_permissions (Optional[set], optional): additional organization ids that could - have access to the outputs of the task. In our case, this corresponds to the organization - containing test data nodes, in order to provide access to the model and to allow using - it on the test data. - """ - first_train_data_node = train_data_nodes[0] - - # The algo.initialize method is an empty method used to load all the Python objects onto the platform.
-        self._cyclic_local_state = first_train_data_node.init_states(
-            operation=self.algo.initialize(
-                _algo_name=f"Initializing with {self.algo.__class__.__name__}",
-            ),
-            round_idx=round_idx,
-            authorized_ids=set([first_train_data_node.organization_id]) | additional_orgs_permissions,
-            clean_models=clean_models,
-        )
-
-    def perform_round(
-        self,
-        *,
-        train_data_nodes: List[TrainDataNode],
-        aggregation_node: Optional[AggregationNode],
-        round_idx: int,
-        clean_models: bool,
-        additional_orgs_permissions: Optional[set] = None,
-    ):
-        """This method is called at each round to perform a series of tasks. For the cyclic
-        strategy we want to design, a round is a full cycle over the different train data
-        nodes.
-        We link the output of a computed task directly to the next one.
-
-        Args:
-            train_data_nodes (List[TrainDataNode]): Train data nodes representing the different
-                organizations containing data we want to train on.
-            aggregation_node (List[AggregationNode]): In the case of the Cyclic Strategy, there are no
-                aggregation tasks, so there is no need for an AggregationNode.
-            clean_models (bool): Boolean to indicate if we want to keep intermediate shared states.
-                Only taken into account in ``remote`` mode.
-            round_idx (Optional[int], optional): Current round index.
-            additional_orgs_permissions (Optional[set], optional): Additional organization ids that could
-                have access to the outputs of the task. In our case, this corresponds to the organizations
-                containing test data nodes, in order to give them access to the model and allow them to
-                use it on the test data.
-        """
-        for i, node in enumerate(train_data_nodes):
-            # We get the next train_data_node in order to add the organization of the node
-            # to the authorized_ids
-            next_train_data_node = train_data_nodes[(i + 1) % len(train_data_nodes)]
-
-            self._cyclic_local_state, self._cyclic_shared_state = node.update_states(
-                operation=self.algo.train(
-                    node.data_sample_keys,
-                    shared_state=self._cyclic_shared_state,
-                    _algo_name=f"Training with {self.algo.__class__.__name__}",
-                ),
-                local_state=self._cyclic_local_state,
-                round_idx=round_idx,
-                authorized_ids=set([next_train_data_node.organization_id]) | additional_orgs_permissions,
-                aggregation_id=None,
-                clean_models=clean_models,
-            )
-
-    def perform_predict(
-        self,
-        test_data_nodes: List[TestDataNode],
-        train_data_nodes: List[TrainDataNode],
-        round_idx: int,
-    ):
-        """This method is called according to the given evaluation strategy. If the round is included
-        in the evaluation strategy, the ``perform_predict`` method will be called on the concerned nodes.
-
-        We are using the last computed ``_cyclic_local_state`` to feed the test task, which means that we will
-        always test the model after its training on the last train data node of the list.
-
-        Args:
-            test_data_nodes (List[TestDataNode]): List of all the registered test data nodes containing data
-                we want to test on.
-            train_data_nodes (List[TrainDataNode]): List of all the registered train data nodes.
-            round_idx (int): Current round index.
-        """
-        for test_node in test_data_nodes:
-            test_node.update_states(
-                traintask_id=self._cyclic_local_state.key,
-                operation=self.algo.predict(
-                    data_samples=test_node.test_data_sample_keys,
-                    _algo_name=f"Predicting with {self.algo.__class__.__name__}",
-                ),
-                round_idx=round_idx,
-            )
-
-
-# %%
-# Torch Cyclic Algo implementation
-# ================================
-#
-# A SubstraFL Algo gathers all the defined elements that run locally in each organization.
-# This is the only SubstraFL object that is framework specific (here PyTorch specific).
-#
-# In the case of our **Cyclic Strategy**, we need to use the TorchAlgo base class, and
-# override the ``strategies`` property and the ``train`` method to ensure that we output
-# the shared state we need for our Federated Learning compute plan.
-#
-# For the **Cyclic Strategy**, the **shared state** will be directly the **model parameters**. We will
-# retrieve the model from the shared state we receive and send the new parameters updated after
-# the local training.

-from substrafl.algorithms.pytorch.torch_base_algo import TorchAlgo
-from substrafl.remote import remote_data
-from substrafl.algorithms.pytorch import weight_manager
-
-
-class TorchCyclicAlgo(TorchAlgo):
-    """We create here the algo class to be used with our strategy.
-    An Algo is a SubstraFL object that contains all framework-specific functions.
-    """
-
-    def __init__(
-        self,
-        model: torch.nn.Module,
-        criterion: torch.nn.modules.loss._Loss,
-        optimizer: torch.optim.Optimizer,
-        index_generator: NpIndexGenerator,
-        dataset: torch.utils.data.Dataset,
-        seed: Optional[int] = None,
-        use_gpu: bool = True,
-        *args,
-        **kwargs,
-    ):
-        """It is possible to add any arguments to an Algo. It is important to pass these arguments as
-        args or kwargs to the parent class, using the super().__init__(...) method.
-        Indeed, SubstraFL does not use the instance of the object directly. It re-instantiates it at each
-        new task using the args and kwargs passed to the parent class, and the save and load local state
-        methods to retrieve the right state.
-
-        Args:
-            model (torch.nn.modules.module.Module): A torch model.
-            criterion (torch.nn.modules.loss._Loss): A torch criterion (loss).
-            optimizer (torch.optim.Optimizer): A torch optimizer linked to the model.
-            index_generator (BaseIndexGenerator): A stateful index generator.
-            dataset (torch.utils.data.Dataset): an instantiable dataset class whose ``__init__`` arguments are
-                ``x``, ``y`` and ``is_inference``.
-            seed (typing.Optional[int]): Seed set at the algo initialization on each organization. Defaults to None.
-            use_gpu (bool): Whether to use the GPUs if they are available. Defaults to True.
-        """
-        super().__init__(
-            model=model,
-            criterion=criterion,
-            optimizer=optimizer,
-            index_generator=index_generator,
-            dataset=dataset,
-            scheduler=None,
-            seed=seed,
-            use_gpu=use_gpu,
-            *args,
-            **kwargs,
-        )
-
-    @property
-    def strategies(self) -> List[str]:
-        """List of compatible strategies.
-
-        Returns:
-            List[str]: list of compatible strategy names.
-        """
-        return ["Cyclic Strategy"]
-
-    @remote_data
-    def train(
-        self,
-        datasamples: Any,
-        shared_state: Optional[dict] = None,
-    ) -> dict:
-        """This method, decorated with ``@remote_data``, is executed inside
-        the train tasks of our strategy.
-        The decorator is used to retrieve the entire Algo object inside the task, to be able to access all values
-        useful for the training (such as the model, the optimizer, etc...).
-        The objective is to perform the local training on the given data samples, and send the right shared state
-        to the next task.
-
-        Args:
-            datasamples (Any): Datasamples are the output of the ``get_data`` method of an opener. This opener
-                accesses the data of a train data node, and transforms it to feed methods decorated with
-                ``@remote_data``.
-            shared_state (Optional[dict], optional): A shared state is a dictionary containing the necessary values
-                to use from the previous trainings of the compute plan and initialize the model with it. In our case,
-                the shared state is the model parameters obtained after the local train on the previous organization.
-                The shared state is equal to None if it is the first training of the compute plan.
-
-        Returns:
-            dict: returns a dict corresponding to the shared state that will be used by the next train function on
-                a different organization.
-        """
-        # Create torch dataset
-        train_dataset = self._dataset(datasamples, is_inference=False)
-
-        if self._index_generator.n_samples is None:
-            # We need to initialize the index generator's number of samples the first time we have access to
-            # the information.
-            self._index_generator.n_samples = len(train_dataset)
-
-        # If the shared state is None, it means that this is the first training of the compute plan,
-        # and that we don't have a shared state to take into account yet.
-        if shared_state is not None:
-            assert self._index_generator.n_samples is not None
-            # The shared state contains the model parameters trained on the previous organization. We set
-            # the model to these updated values.
-            model_parameters = [torch.from_numpy(x).to(self._device) for x in shared_state["model_parameters"]]
-            weight_manager.set_parameters(
-                model=self._model,
-                parameters=model_parameters,
-                with_batch_norm_parameters=False,
-            )
-
-        # We set the counter of updates to zero.
-        self._index_generator.reset_counter()
-
-        # Train mode for torch model.
-        self._model.train()
-
-        # Train the model.
-        self._local_train(train_dataset)
-
-        # We verify that we trained the model on the right number of updates.
-        self._index_generator.check_num_updates()
-
-        # Eval mode for torch model.
-        self._model.eval()
-
-        # We get the new model parameter values in order to send them in the shared states.
-        model_parameters = weight_manager.get_parameters(model=self._model, with_batch_norm_parameters=False)
-        new_shared_state = {"model_parameters": [p.cpu().detach().numpy() for p in model_parameters]}
-
-        return new_shared_state
-
-
-# %%
-# To instantiate your algo, you need to wrap it in a class that takes no arguments. This constraint only applies
-# when you inherit from the TorchAlgo base class.
-#
-# The ``TorchDataset`` is passed **as a class** to the :ref:`TorchAlgo `.
-# Indeed, this ``TorchDataset`` will be instantiated directly on the data provider organization.
-#
-# .. warning::
-#
-#    It is possible to add any arguments to an Algo or a Strategy. It is important to pass these arguments as
-#    args or kwargs to the parent class, using the ``super().__init__(...)`` method.
-#
-#    Indeed, SubstraFL does not use the instance of the object directly. It **re-instantiates** it at each new task
-#    using the args and kwargs passed to the parent class, and the save and load local state methods to retrieve the
-#    right state.
-#
-# To summarize, the ``Algo`` is the place to put all framework-specific code we want to apply in tasks. It is often
-# the tasks that need the data to be executed that are decorated with ``@remote_data``.
-#
-# The ``Strategy`` contains the non-framework-specific code, such as the ``build_compute_plan`` method, which creates
-# the graph of tasks, and the **initialization round**, **perform round** and **perform predict** methods, which link
-# tasks to each other and link the functions to the nodes.
-
-
-class MyAlgo(TorchCyclicAlgo):
-    def __init__(self):
-        super().__init__(
-            model=model,
-            criterion=criterion,
-            optimizer=optimizer,
-            index_generator=index_generator,
-            dataset=TorchDataset,
-            seed=seed,
-        )
-
-
-strategy = CyclicStrategy(algo=MyAlgo())
-
-# %%
-# Where to train where to aggregate
-# =================================
-#
-# We specify on which data we want to train our model, using the :ref:`substrafl_doc/api/nodes:TrainDataNode` object.
-# Here we train on the two datasets that we have registered earlier.
-#
-# The :ref:`substrafl_doc/api/nodes:AggregationNode` specifies the organization on which the aggregation operation
-# will be computed.
-
-from substrafl.nodes import TrainDataNode
-
-# Create the Train Data Nodes (or training tasks) and save them in a list
-train_data_nodes = [
-    TrainDataNode(
-        organization_id=org_id,
-        data_manager_key=dataset_keys[org_id],
-        data_sample_keys=[train_datasample_keys[org_id]],
-    )
-    for org_id in DATA_PROVIDER_ORGS_ID
-]
-
-
-# %%
-# Where and when to test
-# ======================
-#
-# With the same logic as the train nodes, we create :ref:`substrafl_doc/api/nodes:TestDataNode` to specify on which
-# data we want to test our model.
-#
-# The :ref:`substrafl_doc/api/evaluation_strategy:Evaluation Strategy` defines where and at which frequency we
-# evaluate the model, using the given metric(s) that you registered in a previous section.
-
-
-from substrafl.nodes import TestDataNode
-from substrafl.evaluation_strategy import EvaluationStrategy
-
-# Create the Test Data Nodes (or testing tasks) and save them in a list
-test_data_nodes = [
-    TestDataNode(
-        organization_id=org_id,
-        data_manager_key=dataset_keys[org_id],
-        test_data_sample_keys=[test_datasample_keys[org_id]],
-        metric_functions={"Accuracy": accuracy, "ROC AUC": roc_auc},
-    )
-    for org_id in DATA_PROVIDER_ORGS_ID
-]
-
-
-# Test at the end of every round
-my_eval_strategy = EvaluationStrategy(test_data_nodes=test_data_nodes, eval_frequency=1)
-
-# %%
-# Running the experiment
-# **********************
-#
-# As a last step before launching our experiment, we need to specify the third-party dependencies required to run it.
-# The :ref:`substrafl_doc/api/dependency:Dependency` object is instantiated in order to install the right libraries in
-# the Python environment of each organization.
-
-from substrafl.dependency import Dependency
-
-dependencies = Dependency(pypi_dependencies=["numpy==1.23.1", "torch==1.11.0", "scikit-learn==1.1.1"])
-
-# %%
-# We now have all the necessary objects to launch our experiment. Please see below a summary of all the objects
-# we created so far:
-#
-# - A :ref:`documentation/references/sdk:Client` to add or retrieve the assets of our experiment, using their keys to
-#   identify them.
-# - A :ref:`Torch algorithm` to define the training parameters *(optimizer, train
-#   function, predict function, etc...)*.
-# - A :ref:`Federated Strategy`, to specify how to train the model on
-#   distributed data.
-# - :ref:`Train data nodes` to indicate on which data to train.
-# - An :ref:`substrafl_doc/api/evaluation_strategy:Evaluation Strategy`, to define where and at which frequency we
-#   evaluate the model.
-# - An :ref:`substrafl_doc/api/nodes:AggregationNode`, to specify the organization on which the aggregation operation
-#   will be computed.
-# - The **number of rounds**, a round being defined by a local training step followed by an aggregation operation.
-# - An **experiment folder** to save a summary of the operations performed.
-# - The :ref:`substrafl_doc/api/dependency:Dependency` to define the libraries the experiment needs to run with.
-
-from substrafl.experiment import execute_experiment
-
-# A round is defined by a local training step followed by an aggregation operation
-NUM_ROUNDS = 3
-
-compute_plan = execute_experiment(
-    client=clients[ALGO_ORG_ID],
-    strategy=strategy,
-    train_data_nodes=train_data_nodes,
-    evaluation_strategy=my_eval_strategy,
-    aggregation_node=None,
-    num_rounds=NUM_ROUNDS,
-    experiment_folder=str(pathlib.Path.cwd() / "tmp" / "experiment_summaries"),
-    dependencies=dependencies,
-    clean_models=False,
-    name="Cyclic MNIST documentation example",
-)
-
-
-# %%
-# Explore the results
-# *******************
-
-# The results will be available once the compute plan is completed
-client_0.wait_compute_plan(compute_plan.key)
-
-# %%
-# List results
-# ============
-
-
-import pandas as pd
-
-performances_df = pd.DataFrame(client.get_performances(compute_plan.key).dict())
-print("\nPerformance Table: \n")
-print(performances_df[["worker", "round_idx", "identifier", "performance"]])
-
-# %%
-# Plot results
-# ============
-
-import matplotlib.pyplot as plt
-
-fig, axs = plt.subplots(1, 2, figsize=(12, 6))
-fig.suptitle("Test dataset results")
-
-axs[0].set_title("Accuracy")
-axs[1].set_title("ROC AUC")
-
-for ax in axs.flat:
-    ax.set(xlabel="Rounds", ylabel="Score")
-
-
-for org_id in DATA_PROVIDER_ORGS_ID:
-    org_df = performances_df[performances_df["worker"] == org_id]
-    acc_df = org_df[org_df["identifier"] == "Accuracy"]
-    axs[0].plot(acc_df["round_idx"], acc_df["performance"], label=org_id)
-
-    auc_df = org_df[org_df["identifier"] == "ROC AUC"]
-    axs[1].plot(auc_df["round_idx"], auc_df["performance"], label=org_id)
-
-plt.legend(loc="lower right")
-plt.show()
-
-# %%
-# Download a model
-# ================
-#
-# After the experiment, you might be interested in downloading your trained model.
-# To do so, you will need the source code in order to reload your code architecture in memory.
-# You have the option to choose the client and the round you are interested in downloading.
-#
-# If ``round_idx`` is set to ``None``, the last round will be selected by default.
-
-from substrafl.model_loading import download_algo_state
-
-client_to_download_from = DATA_PROVIDER_ORGS_ID[-1]
-round_idx = None
-
-algo = download_algo_state(
-    client=clients[client_to_download_from],
-    compute_plan_key=compute_plan.key,
-    round_idx=round_idx,
-)
-
-model = algo.model
-
-print(model)
diff --git a/examples_requirements.txt b/examples_requirements.txt
index b5891c4f..d4ef0342 100644
--- a/examples_requirements.txt
+++ b/examples_requirements.txt
@@ -1,8 +1,8 @@
-matplotlib==3.6.3
-scikit-learn==1.1.1
-pandas==1.5.3
-# Dependencies for the SubstraFL FedAvg example on MNIST dataset
-torch==1.13.1
-torchvision==0.14.1
-numpy==1.23.1
-gitpython==3.1.35
+ipython==8.12.0
+jupyter==1.0.0
+-r docs/source/examples/substra_core/diabetes_example/assets/requirements.txt
+-r docs/source/examples/substra_core/titanic_example/assets/requirements.txt
+-r docs/source/examples/substrafl/get_started/torch_fedavg_assets/requirements.txt
+-r docs/source/examples/substrafl/go_further/sklearn_fedavg_assets/requirements.txt
+-r docs/source/examples/substrafl/go_further/torch_cyclic_assets/requirements.txt
+-r docs/source/examples/substrafl/go_further/diabetes_substrafl_assets/requirements.txt
\ No newline at end of file
diff --git a/requirements.txt b/requirements.txt
index 1ae79bd3..440d4195 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,2 +1,2 @@
 -r docs/requirements.txt
--r examples_requirements.txt
+-r examples_requirements.txt
\ No newline at end of file