-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add analytics example #311
Conversation
86c72f1
to
9d37d58
Compare
9d37d58
to
dcc4b55
Compare
5db7ad6
to
f6cfb7a
Compare
FL-631 [Flex API] - Documentation on how to write FA
Context and user need: See https://www.notion.so/owkin-fdn/Federated-Analytics-in-SubstraFL-5685c4f2549748a89a3ad2e314cde886?pvs=4 for more info, especially the Mean Analytics example. Functional spec: Technical spec: Acceptance criteria: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR, this is great for better understanding SubstraFL.
I will come back have another read later on.
Also I think I am not sure if we should keep it in the get started section as it's still quite complex.
=========================================== | ||
Federated Analytics on the diabetes dataset | ||
=========================================== |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would add a few sentences explaining that is the same example as the Substra (no FL) one.
# Registering data samples and dataset | ||
# ------------------------------------ | ||
# | ||
# Every asset will be created in respect to predefined schemas (Spec) previously imported from |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# Every asset will be created in respect to predefined schemas (Spec) previously imported from | |
# Every asset will be created in respect to predefined specifications previously imported from |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe not, I am confused, what is the difference between Schema and Specs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my mind it's quite the same...
# SubstraFL provides different type of Nodes to ingest these data, a Node being an object that will create an link | ||
# the different tasks with each other. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have trouble understanding this part, but more generally I believe this outlines the half broken concept of node.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I must admit, I had to re-read twice (admittedly, it's Friday and I got up waaay too early), and I contributed to the definition of those concepts 🙃
It's just a suggestion but I think I would do the following:
- I would start a new section at this point, after the data registration (which should be pretty straightforward for anyone familiar with Susbtra); this is the important part of the tutorial, it should be easy to refer to it; this section will include the discussion of
Nodes
and ofComputePlanBuilder
, maybe with two subsections. - Move part of the "theoretical" explanation about compute graphs at the top of the section, to give some contexts why we are caring about nodes. Something like "a Federated computation can be represented as a graph of tasks, some of them executing on the data nodes (training tasks) and others aggregating local results (aggregation tasks). Substra does not store an explicit definition of this graph; instead, it gives the user full flexibility to define the computation graph they need, by linking a task to its parents (tasks that needs to be executed before this one, for example because it needs the outputs of the previous tasks). "
And then proceed with the discussion of the type of Nodes, and after of theComputePlanBuilder
.
(I hope this is not too confusing, we can discuss it live if you want. Basically the idea is to first explain why it's important to set up the nodes, and how they relate with what the user really wants to do, that is building a Compute Graph. Then present the technical details. )
# A third type of node exists in SubstraFL, called :ref:`Test data node<substrafl_doc/api/nodes:TestDataNode>`. | ||
# We will not need it in the current example. See the `MNIST Example | ||
# <https://docs.substra.org/en/stable/substrafl_doc/examples/get_started/run_mnist_torch.html#sphx-glr-substrafl-doc-examples-get-started-run-mnist-torch-py>`__ | ||
# to learn how to use this type of Nodes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's not mention it then?
# | ||
# The ``update_state`` method outputs the new state of the node, that can be passed as an argument to a following one. | ||
# This succession of ``next_state`` pass to new ``node.update_state`` is how Substra create the graph of the | ||
# ``ComputePlan``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's be consistent with either ComputePlan
or .ComputePlan.
And if we do use ComputePlan
then I would use Node
as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure to understand what you meant :/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for the PR
Committing comments before I loose them to my flaky internet connection 😅
# SubstraFL provides different type of Nodes to ingest these data, a Node being an object that will create an link | ||
# the different tasks with each other. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I must admit, I had to re-read twice (admittedly, it's Friday and I got up waaay too early), and I contributed to the definition of those concepts 🙃
It's just a suggestion but I think I would do the following:
- I would start a new section at this point, after the data registration (which should be pretty straightforward for anyone familiar with Susbtra); this is the important part of the tutorial, it should be easy to refer to it; this section will include the discussion of
Nodes
and ofComputePlanBuilder
, maybe with two subsections. - Move part of the "theoretical" explanation about compute graphs at the top of the section, to give some contexts why we are caring about nodes. Something like "a Federated computation can be represented as a graph of tasks, some of them executing on the data nodes (training tasks) and others aggregating local results (aggregation tasks). Substra does not store an explicit definition of this graph; instead, it gives the user full flexibility to define the computation graph they need, by linking a task to its parents (tasks that needs to be executed before this one, for example because it needs the outputs of the previous tasks). "
And then proceed with the discussion of the type of Nodes, and after of theComputePlanBuilder
.
(I hope this is not too confusing, we can discuss it live if you want. Basically the idea is to first explain why it's important to set up the nodes, and how they relate with what the user really wants to do, that is building a Compute Graph. Then present the technical details. )
# Substra. Using the different nodes we created, we will update there states by applying user defined methods, | ||
# called ``RemoteMethod`` or ``RemoteDataMethod``, created using simply decorators, such as ``@remote`` or ``@remote_data``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it's just me, but I will present first what we want each method to do. And then, after the code, add a note about "oh, look, there are some decorators @remote
and @remote_data
on some of the functions, it's very important to use them. This is because these functions are not going to be executed locally, but on a distant node, and they either need access to data or not. It's a bit of magic Substra provides to help the user."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the good advice. I tried a new configuration, let me know what it looks like :)
# The ``load_local_state`` and ``save_local_state`` are two methods used at each new iteration on a Node, in order to | ||
# retrieve a the previous local state that have not been shared with the other nodes. | ||
# For instance, after updating a :ref:`Train data node<substrafl_doc/api/nodes:TrainDataNode>` using its | ||
# ``update_state`` method, we will have access to its next local state, that we will pass as argument to the | ||
# next ``update_state`` we will apply on this :ref:`Train data node<substrafl_doc/api/nodes:TrainDataNode>`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
# | ||
# As a last step before launching our experiment, we need to specify the third parties dependencies required to run it. | ||
# The :ref:`substrafl_doc/api/dependency:Dependency` object is instantiated in order to install the right libraries in | ||
# the Python environment of each organization. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall the Dependency object definition be moved to its own cell here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we move the example to "go further", I would say no
b10abea
to
6a75c22
Compare
I changed the example to the go_further section. Should I remove some of the explanation then ? |
Maybe a few small things, but most of it is not too cumbersome and I think it's important that tutorials remain relatively self-contained and can be followed in any order. |
clean_models=False, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clean_models=False, | |
) | |
clean_models=False, | |
name="Federated Analytics with SubstraFL documentation example", | |
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for the PR.
A few comments, but mostly typos. Well done 🚀
# To be able to create this graph of computation, SubstraFL provides the ``Node`` abstraction. A ``Node`` is used to | ||
# attached an organization (aka a Client) to a certain type of task depending on the computation we expect to happen on | ||
# this organization. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# To be able to create this graph of computation, SubstraFL provides the ``Node`` abstraction. A ``Node`` is used to | |
# attached an organization (aka a Client) to a certain type of task depending on the computation we expect to happen on | |
# this organization. | |
# To create this graph of computations, SubstraFL provides the ``Node`` abstraction. A ``Node`` | |
# assigns to an organization (aka a Client) tasks of a given type. The type of the ``Node`` depends on the type of tasks | |
# we want to run on this organization (training or aggregation tasks). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find Sarah version easier to understand
|
||
aggregation_node = AggregationNode(ANALYTICS_PROVIDER_ORG_ID) | ||
|
||
# Create the Train Data Nodes (or training tasks) and save them in a list |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can remove this comment, and instead add in the explanation (l 187) that we need to create a Train Node for each organization which has data.
# - ``save_local_state(...)`` | ||
# | ||
# The ``build_compute_plan`` method is essential to create the graph of the compute plan that will be executed on | ||
# Substra. Using the different ``Nodes`` we created, we will update there states by applying user defined methods. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# Substra. Using the different ``Nodes`` we created, we will update there states by applying user defined methods. | |
# Substra. Using the different ``Nodes`` we created, we will update their states by applying user defined methods. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't look resolved
|
||
@remote_data | ||
def local_first_order_computation(self, datasamples: pd.DataFrame, shared_state=None): | ||
"""Compute from the data samples, expected to be pandas dataframe, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"""Compute from the data samples, expected to be pandas dataframe, | |
"""Compute from the data samples, expected to be a pandas dataframe, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same
evaluation_strategy=None, | ||
clean_models=False, | ||
): | ||
"""Method to build and link the different operation to execute with each other. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"""Method to build and link the different operation to execute with each other. | |
"""Method to build and link the different operations to execute with each other. |
or maybe even "computations"?
# compute plan. | ||
# | ||
# The ``load_local_state`` and ``save_local_state`` are two methods used at each new iteration on a Node, in order to | ||
# retrieve a the previous local state that have not been shared with the other ``Nodes``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a / the ⚔️
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I don't get the comment 😶
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry I wasn't very clear. It's written retrieve a the previous local state
, and I think one of the two articles is enough :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aaaaaah ahah :D thanks !!
a51674b
to
8c3f032
Compare
@RomainGoussault should we keep the same image as the Substra example ? |
yes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your work 🙏 However, some of the conversations have been marked as resolved although the point didn't seem to be actually resolved.
The goal of this example is to compute some analytics such as Age mean, Blood pressure standard deviation or Sex percentage. | ||
|
||
We simulate having two different data organizations, and a third organization which wants to compute aggregated analytics | ||
without having access to the raw data. The example here runs everything locally; however there is only one parameter to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we mention which one?
logs_permission=permissions_dataset, | ||
) | ||
|
||
# We register the dataset for each of the organizations |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpicking
# We register the dataset for each of the organizations | |
# We register the dataset for each organization |
# the compute plan (or computation graph) they need, by linking a task to its parents. | ||
# | ||
# To be able to create this graph of computation, SubstraFL provides the ``Node`` abstraction. A ``Node`` is used to | ||
# attached an organization (aka a Client) to a certain type of task depending on the computation we expect to happen on |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# attached an organization (aka a Client) to a certain type of task depending on the computation we expect to happen on | |
# attach an organization (aka a Client) to a certain type of task depending on the computation we expect to happen on |
# To be able to create this graph of computation, SubstraFL provides the ``Node`` abstraction. A ``Node`` is used to | ||
# attached an organization (aka a Client) to a certain type of task depending on the computation we expect to happen on | ||
# this organization. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find Sarah version easier to understand
def aggregation(self, shared_states: List[Dict]): | ||
"""Aggregation function that receive a list on locally computed analytics in order to | ||
aggregate them. | ||
The aggregation will be a weighted average using "n_samples" as weighted coefficient. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The aggregation will be a weighted average using "n_samples" as weighted coefficient. | |
The aggregation will be a weighted average using "n_samples" as weight coefficient. |
train_data_nodes (TrainDataNode): Nodes linked to the data samples on which | ||
to compute analytics. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
train_data_nodes (TrainDataNode): Nodes linked to the data samples on which | |
to compute analytics. | |
train_data_nodes (List[TrainDataNode]): Nodes linked to the data samples on which | |
to compute analytics. |
to compute analytics. | ||
aggregation_node (AggregationNode): Node on which to compute the aggregation | ||
of the analytics extracted from the train_data_nodes. | ||
num_rounds (int): Num rounds to be used to iterate on recurrent part of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is Optional
as it defaults to None
|
||
# %% | ||
# Now that we saw the implementation of the custom ``Analytics`` class, we can add detailed to some of the previously | ||
# introduced concept. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# introduced concept. | |
# introduced concepts. |
Wouah thanks a lot for catching this... I must have completely messed up the reset to sign up the automated commit from suggestions... And thanks a lot for your comments ! |
Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com>
Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com>
Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com>
Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com>
Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com>
Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com>
Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com>
Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com>
Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com>
Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com>
Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com>
Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com>
Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com>
Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com>
Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com>
Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com>
Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com>
Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com>
Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com>
Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com>
Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com>
Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com>
Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com>
Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com>
Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com>
968e7bf
to
8329154
Compare
Context and user need:
Documentation example on how to create a simple Analytics using the ComputePlanBuilder class.
See https://www.notion.so/owkin-fdn/Federated-Analytics-in-SubstraFL-5685c4f2549748a89a3ad2e314cde886?pvs=4 for more info, especially the Mean Analytics example.
Functional spec:
Write the example in the SubstraFL "get started" section.
Mirror of the diabetes substra example
closes FL-631