Debugging Locally Learner
Although K-FLTK is written to run experiments at scale, debugging locally is possible as well. Beware that this way of debugging does not scale beyond a few (say 3) nodes. Debugging on your cluster is also possible, but requires a little more setup; refer to the writeup by Thomas Stringer for an idea of how to set this up.
The following must be in place on the machine you intend to debug/develop on; a single machine suffices.
- The repository is cloned into a directory, which we will refer to as content_root.
- A working venv is created. It is assumed that this virtual environment is active at all times.
  python3.9 -m venv venv
  source venv/bin/activate
  pip install -r requirements-cpu.txt
- Preferably use PyCharm to run the configuration files.
  - We provide configuration files that allow you to 'Run' and 'Debug' in PyCharm directly:
    - Download the configurations client.xml and federator.xml, and place them under .idea/runConfigurations in your content_root directory (i.e. the project in PyCharm). These provide the runtime configuration files.
N.B. Make sure to change WORLD_SIZE and RANK to the required size for your debugging experiment (defaults to only 2 nodes). Rank value 0 needs to be assigned to the Federator, and all others in [1, WORLD_SIZE-1] to the remaining nodes. You will need to create a run configuration for EACH client that you intend to use.
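The rank rule in the note above can be sketched in Python as follows. This is only an illustration: assign_ranks is a hypothetical helper and not part of the toolkit, and the lower bound of two nodes is an assumption (one Federator plus at least one Client).

```python
def assign_ranks(world_size: int) -> dict[int, str]:
    """Map every rank in [0, world_size - 1] to its role (hypothetical helper)."""
    if world_size < 2:
        # Assumption: a useful debugging run needs one Federator and one Client.
        raise ValueError("WORLD_SIZE must be at least 2")
    roles = {0: "federator"}  # rank 0 is always the Federator
    for rank in range(1, world_size):
        roles[rank] = "client"  # ranks 1 .. WORLD_SIZE-1 are Clients
    return roles


if __name__ == "__main__":
    # One line per node, i.e. one run configuration per line.
    for rank, role in assign_ranks(3).items():
        print(f"RANK={rank} WORLD_SIZE=3 -> {role}")
```

Each entry of the returned mapping corresponds to one run configuration that has to be created.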
- Make sure to have set up a 'Python interpreter' with all the requirements from requirements-{*}.txt. In addition, make this your project's default interpreter.
- Download the federated.yaml to experiments/test/federated.yaml, or edit the file according to your needs (i.e. your experiment configuration). This will be the file used by the Federator and Client, which normally would be created by the KFLTK Orchestrator on a Kubernetes cluster.
When you are not using PyCharm, the following settings/environment variables need to be set, assuming that you are running locally on a single machine. Make sure that no process is bound to port 12345; otherwise, change the environment variables accordingly.
For the Federator node:
PYTHONUNBUFFERED=1
RANK=0
WORLD_SIZE=2
MASTER_PORT=12345
MASTER_ADDR=localhost
For the Client node:
PYTHONUNBUFFERED=1
RANK=1
WORLD_SIZE=2
MASTER_PORT=12345
MASTER_ADDR=localhost
N.B. Make sure to change WORLD_SIZE and RANK to the required size for your debugging experiment. Rank value 0 needs to be assigned to the Federator, and all others in [1, WORLD_SIZE-1] to the remaining nodes.
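Whether port 12345 is still free can be checked before starting any node. The sketch below uses only the Python standard library; is_port_free is a hypothetical helper name, and the result is a snapshot, so another process could still claim the port afterwards.

```python
import socket


def is_port_free(port: int, host: str = "localhost") -> bool:
    """Return True if nothing is currently bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))  # fails if another process holds the port
        except OSError:
            return False
    return True


if __name__ == "__main__":
    port = 12345  # the MASTER_PORT used in this guide
    print(f"port {port} free: {is_port_free(port)}")
```

If the check reports that the port is taken, pick another port and set MASTER_PORT to the same value on every node.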
Rather than relying on Kubeflow to prepare the containers' environment, we will prepare this manually.
This tutorial will create the following configuration:
- A single Federator node, which will communicate with Client nodes.
- A single Client node, which will register at the Federator and do the training.
The following assumes that you are using PyCharm. If you are not using PyCharm, refer to When not using PyCharm.
- Run the client and federator configurations in PyCharm.
- When debugging or encountering errors, you can make use of PyCharm's debugger, which in combination with breakpoints allows for debugging.
N.B. Parallel training is done using PyTorch RPC calls; as such, default breakpoints will most likely not work. However, you can add logging to obtain state information and quickly run and re-run.
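As a starting point for the logging approach mentioned in the note above, the sketch below tags every message with the node's rank so that interleaved output from several nodes stays readable. It is a minimal example built on the standard logging module; make_rank_logger and the logger name debug_local are hypothetical, and the toolkit's own logging setup may differ.

```python
import logging
import os


def make_rank_logger(name: str = "debug_local") -> logging.Logger:
    """Create a logger whose messages are prefixed with this node's RANK."""
    rank = os.environ.get("RANK", "?")  # "?" if the variable is not set
    logger = logging.getLogger(f"{name}.rank{rank}")
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter(f"%(asctime)s [rank {rank}] %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False  # keep output to the single handler above
    return logger


if __name__ == "__main__":
    log = make_rank_logger()
    log.debug("entered training step, batch=%d", 0)
```

Calling the returned logger at the point where a breakpoint would have been placed gives comparable state information on the next run.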
Each node runs the following command, provided that you supply the correct environment variables. Look into python dotenv in case you want to partially automate this.
From content_root, execute:
For Federator:
PYTHONUNBUFFERED=1 \
RANK=0 \
WORLD_SIZE=2 \
MASTER_PORT=12345 \
MASTER_ADDR=localhost \
python3.9 -m fltk remote experiments/test/federated.yaml
For Client:
PYTHONUNBUFFERED=1 \
RANK=1 \
WORLD_SIZE=2 \
MASTER_PORT=12345 \
MASTER_ADDR=localhost \
python3.9 -m fltk remote experiments/test/federated.yaml
N.B. Make sure to change WORLD_SIZE and RANK to the required size for your debugging experiment (defaults to only 2 nodes). Rank value 0 needs to be assigned to the Federator, and all others in [1, WORLD_SIZE-1] to the remaining nodes. You will need to create a run configuration for EACH client that you intend to use.
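Starting every node by hand becomes tedious once more than one Client is involved. The sketch below shows one way the commands above could be launched from a single Python script. build_launch and launch_all are hypothetical helpers, not part of the toolkit, and the sketch assumes the active interpreter (sys.executable) is the one from the virtual environment, standing in for python3.9.

```python
import os
import subprocess
import sys


def build_launch(rank: int, world_size: int, port: int = 12345,
                 config: str = "experiments/test/federated.yaml"):
    """Return (command, environment) for one node; rank 0 is the Federator."""
    if not 0 <= rank < world_size:
        raise ValueError("RANK must lie in [0, WORLD_SIZE-1]")
    env = dict(os.environ)  # inherit the active virtual environment
    env.update({
        "PYTHONUNBUFFERED": "1",
        "RANK": str(rank),
        "WORLD_SIZE": str(world_size),
        "MASTER_PORT": str(port),
        "MASTER_ADDR": "localhost",
    })
    # Assumption: sys.executable is the venv interpreter (python3.9 in this guide).
    command = [sys.executable, "-m", "fltk", "remote", config]
    return command, env


def launch_all(world_size: int = 2):
    """Start every node as a child process and return their exit codes."""
    procs = []
    for rank in range(world_size):
        command, env = build_launch(rank, world_size)
        procs.append(subprocess.Popen(command, env=env))
    return [proc.wait() for proc in procs]
```

Calling launch_all(world_size=2) from content_root with the virtual environment active starts one Federator and one Client and blocks until both exit.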