xtreemfs/xtreemfs-slurm

These scripts distribute XtreemFS over the allocated Slurm nodes.

1. Adjust to your environment.

First, please adjust env.sh to your environment. Afterwards, use "xtreemfs_slurm.sh"
to start and stop a distributed XtreemFS environment. The script can also clone a git
repository (clone) or clean up (cleanup) the local XtreemFS folder for the current
Slurm job on each allocated node.
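
A sketch of typical invocations (assuming the action is passed as the first argument;
the script itself documents the exact interface):

 $ ./xtreemfs_slurm.sh start     # start DIR, MRC and OSDs on the allocated nodes
 $ ./xtreemfs_slurm.sh stop      # stop all started services
 $ ./xtreemfs_slurm.sh clone     # clone the XtreemFS git repository
 $ ./xtreemfs_slurm.sh cleanup   # remove the local XtreemFS folder on each node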

An environment file will be created in the current job folder containing useful
variables regarding the started XtreemFS environment.

2. How to allocate Slurm nodes?

Use the following statement, where -N specifies the number of nodes to allocate.

 $ salloc -N2 -p <partition> -A <account> --exclusive
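
Once the allocation is granted, the shell runs inside the Slurm job and the standard
Slurm environment variables are available, e.g. to check which nodes were allocated:

 $ echo $SLURM_JOB_ID
 $ echo $SLURM_JOB_NODELIST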

3. How to limit the number of OSD servers?

Adjust the variable "NUMBER_OF_NODES" in the env.sh file: start from the desired
number of OSD servers and, depending on "SAME_DIR_AND_MRC_NODE", add one (shared DIR
and MRC node) or two (separate DIR and MRC nodes). E.g. for four OSD servers and a
shared DIR and MRC server, set NUMBER_OF_NODES to five (four + one).
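
A minimal sketch of the corresponding env.sh entries for that example (the values are
illustrative; the exact variable format is defined in env.sh itself):

 NUMBER_OF_NODES=5           # four OSD nodes + one shared DIR/MRC node
 SAME_DIR_AND_MRC_NODE=true  # DIR and MRC run on the same node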


4. Something went wrong while stopping a server. How to fix a broken server?

Connect to the specific node via the following statement:

 $ srun -N1-1 --nodelist=node-n[xx] --pty bash

If you want to stop a running server, locate the PID file and call "kill", e.g.:

 $ cat /local/$USER/xtreemfs/xxxxx/MRC.pid
 12345
 $ kill 12345

or, in short:

 $ kill $(</local/$USER/xtreemfs/xxxxx/MRC.pid)
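
To stop every XtreemFS service of the current job at once, a sketch along the same
lines (xxxxx again stands for the current job folder):

 $ for f in /local/$USER/xtreemfs/xxxxx/*.pid; do kill "$(<"$f")"; done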

If the current job folder has already been deleted, use the following statements to
retrieve the process ID and kill it:

 $ ps -ax | grep java
 $ kill 12345
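
Alternatively, pgrep can list your own Java processes directly (a sketch; the match
pattern may need to be narrowed to the XtreemFS services on your system):

 $ pgrep -af -u $USER java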

5. There are still traces left on the Slurm nodes due to an error.

Call "cleanup" or connect to each server (comp. (2)) and remove the folder by

 $ rm -r /local/$USER/xtreemfs
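
To remove the folder on all allocated nodes at once, srun can run the command on every
node of the allocation (a sketch, assuming you are still inside the salloc session from
section 2):

 $ srun --ntasks-per-node=1 rm -r /local/$USER/xtreemfs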
