Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add analytics example #311

Merged
merged 38 commits into from
Jun 30, 2023
Merged

feat: add analytics example #311

merged 38 commits into from
Jun 30, 2023

Conversation

ThibaultFy
Copy link
Member

@ThibaultFy ThibaultFy commented May 17, 2023

Context and user need:

Documentation example on how to create a simple Analytics using the ComputePlanBuilder class.

See https://www.notion.so/owkin-fdn/Federated-Analytics-in-SubstraFL-5685c4f2549748a89a3ad2e314cde886?pvs=4 for more info, especially the Mean Analytics example.

Functional spec:

Write the example in the SubstraFL "get started" section.
Mirror of the diabetes substra example

closes FL-631

@ThibaultFy ThibaultFy force-pushed the substrafl-analytics-example branch 2 times, most recently from 86c72f1 to 9d37d58 Compare May 22, 2023 13:19
@ThibaultFy ThibaultFy force-pushed the substrafl-analytics-example branch from 9d37d58 to dcc4b55 Compare June 8, 2023 08:45
@ThibaultFy ThibaultFy marked this pull request as ready for review June 15, 2023 15:13
@ThibaultFy ThibaultFy requested review from a team and RomainGoussault as code owners June 15, 2023 15:13
@linear
Copy link

linear bot commented Jun 15, 2023

FL-631 [Flex API] - Documentation on how to write FA

Context and user need:
Documentation example on how to create a simple Analytics using the ComputePlanBuilder class.

See https://www.notion.so/owkin-fdn/Federated-Analytics-in-SubstraFL-5685c4f2549748a89a3ad2e314cde886?pvs=4 for more info, especially the Mean Analytics example.

Functional spec:
Write the example in the SubstraFL "get started" section.
Mirror of the diabetes substra example

Technical spec:

Acceptance criteria:

Copy link
Contributor

@RomainGoussault RomainGoussault left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR, this is great for better understanding SubstraFL.
I will come back have another read later on.

Also I think I am not sure if we should keep it in the get started section as it's still quite complex.

Comment on lines 2 to 4
===========================================
Federated Analytics on the diabetes dataset
===========================================
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add a few sentences explaining that is the same example as the Substra (no FL) one.

# Registering data samples and dataset
# ------------------------------------
#
# Every asset will be created in respect to predefined schemas (Spec) previously imported from
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Every asset will be created in respect to predefined schemas (Spec) previously imported from
# Every asset will be created in respect to predefined specifications previously imported from

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe not, I am confused, what is the difference between Schema and Specs?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my mind it's quite the same...

Comment on lines 168 to 169
# SubstraFL provides different type of Nodes to ingest these data, a Node being an object that will create an link
# the different tasks with each other.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have trouble understanding this part, but more generally I believe this outlines the half broken concept of node.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I must admit, I had to re-read twice (admittedly, it's Friday and I got up waaay too early), and I contributed to the definition of those concepts 🙃

It's just a suggestion but I think I would do the following:

  • I would start a new section at this point, after the data registration (which should be pretty straightforward for anyone familiar with Susbtra); this is the important part of the tutorial, it should be easy to refer to it; this section will include the discussion of Nodes and of ComputePlanBuilder, maybe with two subsections.
  • Move part of the "theoretical" explanation about compute graphs at the top of the section, to give some contexts why we are caring about nodes. Something like "a Federated computation can be represented as a graph of tasks, some of them executing on the data nodes (training tasks) and others aggregating local results (aggregation tasks). Substra does not store an explicit definition of this graph; instead, it gives the user full flexibility to define the computation graph they need, by linking a task to its parents (tasks that needs to be executed before this one, for example because it needs the outputs of the previous tasks). "
    And then proceed with the discussion of the type of Nodes, and after of the ComputePlanBuilder.

(I hope this is not too confusing, we can discuss it live if you want. Basically the idea is to first explain why it's important to set up the nodes, and how they relate with what the user really wants to do, that is building a Compute Graph. Then present the technical details. )

Comment on lines 179 to 182
# A third type of node exists in SubstraFL, called :ref:`Test data node<substrafl_doc/api/nodes:TestDataNode>`.
# We will not need it in the current example. See the `MNIST Example
# <https://docs.substra.org/en/stable/substrafl_doc/examples/get_started/run_mnist_torch.html#sphx-glr-substrafl-doc-examples-get-started-run-mnist-torch-py>`__
# to learn how to use this type of Nodes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not mention it then?

#
# The ``update_state`` method outputs the new state of the node, that can be passed as an argument to a following one.
# This succession of ``next_state`` pass to new ``node.update_state`` is how Substra create the graph of the
# ``ComputePlan``.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's be consistent with either ComputePlan or .ComputePlan.
And if we do use ComputePlan then I would use Node as well

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure to understand what you meant :/

Copy link
Contributor

@SdgJlbl SdgJlbl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the PR
Committing comments before I loose them to my flaky internet connection 😅

substrafl_examples/get_started/run_diabetes_substrafl.py Outdated Show resolved Hide resolved
substrafl_examples/get_started/run_diabetes_substrafl.py Outdated Show resolved Hide resolved
Comment on lines 168 to 169
# SubstraFL provides different type of Nodes to ingest these data, a Node being an object that will create an link
# the different tasks with each other.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I must admit, I had to re-read twice (admittedly, it's Friday and I got up waaay too early), and I contributed to the definition of those concepts 🙃

It's just a suggestion but I think I would do the following:

  • I would start a new section at this point, after the data registration (which should be pretty straightforward for anyone familiar with Susbtra); this is the important part of the tutorial, it should be easy to refer to it; this section will include the discussion of Nodes and of ComputePlanBuilder, maybe with two subsections.
  • Move part of the "theoretical" explanation about compute graphs at the top of the section, to give some contexts why we are caring about nodes. Something like "a Federated computation can be represented as a graph of tasks, some of them executing on the data nodes (training tasks) and others aggregating local results (aggregation tasks). Substra does not store an explicit definition of this graph; instead, it gives the user full flexibility to define the computation graph they need, by linking a task to its parents (tasks that needs to be executed before this one, for example because it needs the outputs of the previous tasks). "
    And then proceed with the discussion of the type of Nodes, and after of the ComputePlanBuilder.

(I hope this is not too confusing, we can discuss it live if you want. Basically the idea is to first explain why it's important to set up the nodes, and how they relate with what the user really wants to do, that is building a Compute Graph. Then present the technical details. )

Comment on lines 216 to 217
# Substra. Using the different nodes we created, we will update there states by applying user defined methods,
# called ``RemoteMethod`` or ``RemoteDataMethod``, created using simply decorators, such as ``@remote`` or ``@remote_data``.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it's just me, but I will present first what we want each method to do. And then, after the code, add a note about "oh, look, there are some decorators @remote and @remote_data on some of the functions, it's very important to use them. This is because these functions are not going to be executed locally, but on a distant node, and they either need access to data or not. It's a bit of magic Substra provides to help the user."

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the good advice. I tried a new configuration, let me know what it looks like :)

Comment on lines 225 to 229
# The ``load_local_state`` and ``save_local_state`` are two methods used at each new iteration on a Node, in order to
# retrieve a the previous local state that have not been shared with the other nodes.
# For instance, after updating a :ref:`Train data node<substrafl_doc/api/nodes:TrainDataNode>` using its
# ``update_state`` method, we will have access to its next local state, that we will pass as argument to the
# next ``update_state`` we will apply on this :ref:`Train data node<substrafl_doc/api/nodes:TrainDataNode>`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

substrafl_examples/get_started/run_diabetes_substrafl.py Outdated Show resolved Hide resolved
substrafl_examples/get_started/run_diabetes_substrafl.py Outdated Show resolved Hide resolved
substrafl_examples/get_started/run_diabetes_substrafl.py Outdated Show resolved Hide resolved
#
# As a last step before launching our experiment, we need to specify the third parties dependencies required to run it.
# The :ref:`substrafl_doc/api/dependency:Dependency` object is instantiated in order to install the right libraries in
# the Python environment of each organization.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall the Dependency object definition be moved to its own cell here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we move the example to "go further", I would say no

substrafl_examples/get_started/run_diabetes_substrafl.py Outdated Show resolved Hide resolved
@ThibaultFy
Copy link
Member Author

Thanks for the PR, this is great for better understanding SubstraFL. I will come back have another read later on.

Also I think I am not sure if we should keep it in the get started section as it's still quite complex.

I changed the example to the go_further section. Should I remove some of the explanation then ?
On an other subject, do we keep the exact same image as the diabetes example ?

@SdgJlbl
Copy link
Contributor

SdgJlbl commented Jun 16, 2023

I changed the example to the go_further section. Should I remove some of the explanation then ?

Maybe a few small things, but most of it is not too cumbersome and I think it's important that tutorials remain relatively self-contained and can be followed in any order.

Comment on lines 546 to 527
clean_models=False,
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
clean_models=False,
)
clean_models=False,
name="Federated Analytics with SubstraFL documentation example",
)

Copy link
Contributor

@SdgJlbl SdgJlbl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the PR.
A few comments, but mostly typos. Well done 🚀

Comment on lines 177 to 179
# To be able to create this graph of computation, SubstraFL provides the ``Node`` abstraction. A ``Node`` is used to
# attached an organization (aka a Client) to a certain type of task depending on the computation we expect to happen on
# this organization.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# To be able to create this graph of computation, SubstraFL provides the ``Node`` abstraction. A ``Node`` is used to
# attached an organization (aka a Client) to a certain type of task depending on the computation we expect to happen on
# this organization.
# To create this graph of computations, SubstraFL provides the ``Node`` abstraction. A ``Node``
# assigns to an organization (aka a Client) tasks of a given type. The type of the ``Node`` depends on the type of tasks
# we want to run on this organization (training or aggregation tasks).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find Sarah version easier to understand


aggregation_node = AggregationNode(ANALYTICS_PROVIDER_ORG_ID)

# Create the Train Data Nodes (or training tasks) and save them in a list
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can remove this comment, and instead add in the explanation (l 187) that we need to create a Train Node for each organization which has data.

# - ``save_local_state(...)``
#
# The ``build_compute_plan`` method is essential to create the graph of the compute plan that will be executed on
# Substra. Using the different ``Nodes`` we created, we will update there states by applying user defined methods.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Substra. Using the different ``Nodes`` we created, we will update there states by applying user defined methods.
# Substra. Using the different ``Nodes`` we created, we will update their states by applying user defined methods.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't look resolved


@remote_data
def local_first_order_computation(self, datasamples: pd.DataFrame, shared_state=None):
"""Compute from the data samples, expected to be pandas dataframe,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"""Compute from the data samples, expected to be pandas dataframe,
"""Compute from the data samples, expected to be a pandas dataframe,

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same

evaluation_strategy=None,
clean_models=False,
):
"""Method to build and link the different operation to execute with each other.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"""Method to build and link the different operation to execute with each other.
"""Method to build and link the different operations to execute with each other.

or maybe even "computations"?

substrafl_examples/go_further/run_diabetes_substrafl.py Outdated Show resolved Hide resolved
substrafl_examples/go_further/run_diabetes_substrafl.py Outdated Show resolved Hide resolved
substrafl_examples/go_further/run_diabetes_substrafl.py Outdated Show resolved Hide resolved
# compute plan.
#
# The ``load_local_state`` and ``save_local_state`` are two methods used at each new iteration on a Node, in order to
# retrieve a the previous local state that have not been shared with the other ``Nodes``.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a / the ⚔️

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I don't get the comment 😶

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry I wasn't very clear. It's written retrieve a the previous local state, and I think one of the two articles is enough :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aaaaaah ahah :D thanks !!

@ThibaultFy
Copy link
Member Author

@RomainGoussault should we keep the same image as the Substra example ?

@RomainGoussault
Copy link
Contributor

@RomainGoussault should we keep the same image as the Substra example ?

yes

Copy link
Contributor

@guilhem-barthes guilhem-barthes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your work 🙏 However, some of the conversations have been marked as resolved although the point didn't seem to be actually resolved.

The goal of this example is to compute some analytics such as Age mean, Blood pressure standard deviation or Sex percentage.

We simulate having two different data organizations, and a third organization which wants to compute aggregated analytics
without having access to the raw data. The example here runs everything locally; however there is only one parameter to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we mention which one?

logs_permission=permissions_dataset,
)

# We register the dataset for each of the organizations
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpicking

Suggested change
# We register the dataset for each of the organizations
# We register the dataset for each organization

substrafl_examples/go_further/run_diabetes_substrafl.py Outdated Show resolved Hide resolved
# the compute plan (or computation graph) they need, by linking a task to its parents.
#
# To be able to create this graph of computation, SubstraFL provides the ``Node`` abstraction. A ``Node`` is used to
# attached an organization (aka a Client) to a certain type of task depending on the computation we expect to happen on
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# attached an organization (aka a Client) to a certain type of task depending on the computation we expect to happen on
# attach an organization (aka a Client) to a certain type of task depending on the computation we expect to happen on

Comment on lines 177 to 179
# To be able to create this graph of computation, SubstraFL provides the ``Node`` abstraction. A ``Node`` is used to
# attached an organization (aka a Client) to a certain type of task depending on the computation we expect to happen on
# this organization.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find Sarah version easier to understand

def aggregation(self, shared_states: List[Dict]):
"""Aggregation function that receive a list on locally computed analytics in order to
aggregate them.
The aggregation will be a weighted average using "n_samples" as weighted coefficient.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The aggregation will be a weighted average using "n_samples" as weighted coefficient.
The aggregation will be a weighted average using "n_samples" as weight coefficient.

Comment on lines 347 to 350
train_data_nodes (TrainDataNode): Nodes linked to the data samples on which
to compute analytics.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
train_data_nodes (TrainDataNode): Nodes linked to the data samples on which
to compute analytics.
train_data_nodes (List[TrainDataNode]): Nodes linked to the data samples on which
to compute analytics.

to compute analytics.
aggregation_node (AggregationNode): Node on which to compute the aggregation
of the analytics extracted from the train_data_nodes.
num_rounds (int): Num rounds to be used to iterate on recurrent part of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is Optional as it defaults to None


# %%
# Now that we saw the implementation of the custom ``Analytics`` class, we can add detailed to some of the previously
# introduced concept.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# introduced concept.
# introduced concepts.

@ThibaultFy
Copy link
Member Author

Thanks for your work 🙏 However, some of the conversations have been marked as resolved although the point didn't seem to be actually resolved.

Wouah thanks a lot for catching this... I must have completely messed up the reset to sign up the automated commit from suggestions... And thanks a lot for your comments !

oleobal and others added 7 commits June 30, 2023 14:54
Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com>
Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com>
Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com>
Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com>
Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com>
Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com>
Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com>
Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com>
Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com>
Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com>
Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com>
Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com>
Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com>
Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com>
Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com>
Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com>
Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com>
Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com>
Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com>
Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com>
Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com>
Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com>
Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com>
Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com>
Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com>
Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com>
Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com>
Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com>
Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com>
Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com>
Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com>
Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com>
Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com>
Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com>
@ThibaultFy ThibaultFy merged commit b2733ae into main Jun 30, 2023
@ThibaultFy ThibaultFy deleted the substrafl-analytics-example branch June 30, 2023 13:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants