Debugging Locally Learner

Tutorial on debugging locally

Although K-FLTK is written to run experiments at scale, you can also debug it locally. Beware that local debugging does not scale beyond a few (say 3) nodes. Debugging on your cluster is possible as well but requires a little more setup; refer to the writeup by Thomas Stringer for an idea of how to set this up.

Pre-requisites

The following must be in place on the machine you intend to debug or develop on; a single machine suffices.

A clone of the repository in a directory; we will refer to this directory as content_root.
A working virtual environment (venv); the rest of this tutorial assumes that this virtual environment is active at all times.
- python3.9 -m venv venv
- source venv/bin/activate
- pip install -r requirements-cpu.txt
Preferably, use PyCharm to run the configuration files.
- We provide configuration files that allow you to 'Run' and 'Debug' in PyCharm directly:
  - Download the configurations client.xml and federator.xml, and place them under .idea/runConfigurations in your content_root directory (i.e. the project in PyCharm). These files provide the run configurations.

N.B. Make sure to set WORLD_SIZE and RANK to match the number of nodes in your debugging experiment (the defaults assume only 2 nodes). Rank 0 must be assigned to the Federator, and the ranks in [1, WORLD_SIZE-1] to the remaining nodes. You will need to create a run configuration for EACH client that you intend to use.

Make sure to have set up a 'Python interpreter' with all the requirements from requirements-{*}.txt. In addition, make this your project's default interpreter.
Download the federated.yaml to experiments/test/federated.yaml, or edit the file according to your needs (i.e. your experiment configuration). This will be the file used by the Federator and Client, which normally would be created by the KFLTK Orchestrator on a Kubernetes cluster.

When not using PyCharm

When you are not using PyCharm, the following environment variables need to be set, assuming that you are running locally on a single machine. Make sure that no process is bound to port 12345; otherwise, change MASTER_PORT accordingly.
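Whether the port is free can be checked before starting any node. The following is a minimal sketch using only the Python standard library; the helper name `port_is_free` is hypothetical and not part of the toolkit, and it assumes that MASTER_ADDR=localhost resolves to 127.0.0.1.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing is currently bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            # Binding fails with OSError when another process holds the port.
            sock.bind((host, port))
        except OSError:
            return False
    return True


if __name__ == "__main__":
    # 12345 is the MASTER_PORT used throughout this tutorial.
    print("port 12345 free:", port_is_free(12345))
```

If the check reports that the port is taken, pick another port and use the same MASTER_PORT value for the Federator and every Client.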

For the Federator node:

PYTHONUNBUFFERED=1
RANK=0
WORLD_SIZE=2
MASTER_PORT=12345
MASTER_ADDR=localhost

For the Client node:

PYTHONUNBUFFERED=1
RANK=1
WORLD_SIZE=2
MASTER_PORT=12345
MASTER_ADDR=localhost

N.B. Make sure to set WORLD_SIZE and RANK to match the number of nodes in your debugging experiment. Rank 0 must be assigned to the Federator, and the ranks in [1, WORLD_SIZE-1] to the remaining nodes.
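The rank assignment above can be sketched for an arbitrary number of nodes. This is an illustrative helper only: the function name `node_environments` is hypothetical, and the variable names and defaults are the ones listed in this section.

```python
from typing import Dict, List


def node_environments(world_size: int,
                      master_addr: str = "localhost",
                      master_port: int = 12345) -> List[Dict[str, str]]:
    """Build one environment mapping per node; index 0 is the Federator."""
    if world_size < 2:
        raise ValueError("need a Federator and at least one Client")
    return [
        {
            "PYTHONUNBUFFERED": "1",
            "RANK": str(rank),              # 0 = Federator, 1.. = Clients
            "WORLD_SIZE": str(world_size),  # identical on every node
            "MASTER_PORT": str(master_port),
            "MASTER_ADDR": master_addr,
        }
        for rank in range(world_size)
    ]


if __name__ == "__main__":
    for env in node_environments(3):
        role = "Federator" if env["RANK"] == "0" else "Client"
        print(role, env)
```

Only RANK differs between nodes; WORLD_SIZE, MASTER_PORT and MASTER_ADDR must be the same everywhere.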

General idea

Rather than relying on Kubeflow to prepare the containers' environment, we will prepare it manually. This tutorial creates the following configuration:

A single Federator node, which will communicate with Client nodes.
A single Client node, which will register at the Federator and do the training.

Get started

The following assumes that you are using PyCharm. If you are not using PyCharm, refer to the section 'When not using PyCharm' below.

Run the client and federator configurations in PyCharm.
When debugging or encountering errors, you can make use of PyCharm's debugger, which in combination with breakpoints lets you inspect the running code.

N.B. Parallel training is done using PyTorch RPC calls; as such, default breakpoints will most likely not work. However, you can add logging to obtain state information and quickly run and re-run.
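As an illustration of the logging approach, the sketch below prefixes every message with the RANK of the process that wrote it, which makes interleaved Federator and Client output easier to follow. It uses only the standard `logging` module; the helper name `make_rank_logger` is hypothetical and not part of the toolkit.

```python
import logging
import os


def make_rank_logger(name: str = "debug") -> logging.Logger:
    """Return a logger whose messages are prefixed with this process's RANK."""
    rank = os.environ.get("RANK", "?")
    logger = logging.getLogger(f"{name}.rank{rank}")
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter(
            f"%(asctime)s [rank {rank}] %(levelname)s %(message)s"))
        logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False  # do not print a second time via the root logger
    return logger


if __name__ == "__main__":
    log = make_rank_logger()
    log.debug("entering training step, epoch=%d", 0)
```

Placing such calls around the code you would otherwise stop at with a breakpoint gives a rough view of the state on each node.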

When not using PyCharm

Each node runs the following command with its own environment variables set. Look into python dotenv in case you want to partially automate this.

From content_root, execute:

For Federator:

PYTHONUNBUFFERED=1 \
RANK=0 \
WORLD_SIZE=2 \
MASTER_PORT=12345 \
MASTER_ADDR=localhost \
python3.9 -m fltk remote experiments/test/federated.yaml

For Client:

PYTHONUNBUFFERED=1 \
RANK=1 \
WORLD_SIZE=2 \
MASTER_PORT=12345 \
MASTER_ADDR=localhost \
python3.9 -m fltk remote experiments/test/federated.yaml

N.B. Make sure to set WORLD_SIZE and RANK to match the number of nodes in your debugging experiment (the values above assume only 2 nodes). Rank 0 must be assigned to the Federator, and the ranks in [1, WORLD_SIZE-1] to the remaining nodes. You will need to run the command once, with its own RANK, for EACH client that you intend to use.
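Starting every node by hand becomes tedious with more than one client. The following is a rough sketch of a launcher that starts all nodes from one script, run from content_root. The helper names (`build_env`, `build_command`, `launch`) are hypothetical; the command and environment variables are the ones shown above, and `sys.executable` is used on the assumption that the virtual environment is active.

```python
import os
import subprocess
import sys
from typing import Dict, List

CONFIG = "experiments/test/federated.yaml"  # path used in this tutorial


def build_env(rank: int, world_size: int) -> Dict[str, str]:
    """Copy the current environment and add the variables one node needs."""
    env = dict(os.environ)
    env.update({
        "PYTHONUNBUFFERED": "1",
        "RANK": str(rank),
        "WORLD_SIZE": str(world_size),
        "MASTER_PORT": "12345",
        "MASTER_ADDR": "localhost",
    })
    return env


def build_command(config: str = CONFIG) -> List[str]:
    """Same command as above, using the interpreter of the active venv."""
    return [sys.executable, "-m", "fltk", "remote", config]


def launch(world_size: int = 2) -> List[subprocess.Popen]:
    """Start the Federator (rank 0) and one process per Client."""
    return [
        subprocess.Popen(build_command(), env=build_env(rank, world_size))
        for rank in range(world_size)
    ]


if __name__ == "__main__":
    processes = launch(world_size=2)
    for process in processes:
        process.wait()  # block until every node has exited
```

All processes share one terminal in this sketch, so their output is interleaved; separate terminals remain the simpler option when reading logs per node.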
