Distributing Local Sequence Alignment using Volunteer Computing

A coordinator-worker based distributed system for crowdsourced local sequence alignment.

It was developed as a lab project for the 2023/2024 Distributed Systems course at the Vrije Universiteit Amsterdam.

The key idea of the project is to enable crowdsourced local sequence alignment. This allows heterogeneous computers of different sizes (e.g., a laptop or a compute cluster node) to work together to perform sequence alignment jobs for scientists (this is a similar idea to Folding@Home).

The project report can be found here, and the experiment archive is over at DLSA-Experiments.

Overview

The project consists of two main parts: 1) an implementation of the Smith-Waterman algorithm, and 2) a coordinator-worker architecture that can "intelligently" schedule and distribute the sequence alignment jobs across the pool of workers. Each heterogeneous worker runs a compute capacity estimation benchmark (using synthetic sequences), whose result is communicated to the scheduler and used to distribute the work.
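
For reference, the snippet below is a minimal, unoptimized Python sketch of the Smith-Waterman recurrence with linear gap penalties, using the same scoring parameters (match score, mismatch penalty, gap penalty) that the CLI exposes. The project's actual implementation is the SIMD-accelerated Rust version in this repository, so treat this purely as an illustration of the algorithm.

    # Minimal Smith-Waterman sketch (linear gap penalties), for illustration only.
    # The project's real implementation is the SIMD-accelerated Rust version.
    def smith_waterman(query, target, match_score=2, mismatch_penalty=1, gap_penalty=1):
        rows, cols = len(query) + 1, len(target) + 1
        H = [[0] * cols for _ in range(rows)]
        best = 0
        for i in range(1, rows):
            for j in range(1, cols):
                diag = H[i - 1][j - 1] + (match_score if query[i - 1] == target[j - 1]
                                          else -mismatch_penalty)
                up = H[i - 1][j] - gap_penalty
                left = H[i][j - 1] - gap_penalty
                H[i][j] = max(0, diag, up, left)  # local alignment never drops below 0
                best = max(best, H[i][j])
        return best

    print(smith_waterman("ACACACTA", "AGCACACA"))  # prints the best local alignment score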

The diagram below depicts the coordinator-worker architecture. The project requires 1 master node and 1+ worker nodes to be spun up (see the instructions below). A command-line tool (see CLI below) can be used by the "User" (i.e., a scientist) to submit sequence alignment jobs to the master node. The master subsequently schedules and distributes the work across the pool of worker nodes, returning the result to the user when the work is finished.

The Architecture
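
As a rough illustration of the benchmark-based scheduling idea described above, the sketch below splits a list of query-target pairs across workers in proportion to their reported compute capacity. The names used here (Worker, benchmark_score, split_work) are hypothetical and only convey the idea; the master's actual scheduler logic lives in its own code.

    # Hypothetical sketch: assign work in proportion to each worker's benchmark score.
    # Worker, benchmark_score, and split_work are illustrative names, not the project's API.
    from dataclasses import dataclass

    @dataclass
    class Worker:
        name: str
        benchmark_score: float  # higher = more compute capacity

    def split_work(pairs, workers):
        total = sum(w.benchmark_score for w in workers)
        assignments, start = {}, 0
        for i, w in enumerate(workers):
            share = round(len(pairs) * w.benchmark_score / total)
            end = len(pairs) if i == len(workers) - 1 else min(start + share, len(pairs))
            assignments[w.name] = pairs[start:end]
            start = end
        return assignments

    pairs = [("query1", f"target{i}") for i in range(10)]
    print(split_work(pairs, [Worker("laptop", 1.0), Worker("cluster-node", 4.0)]))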

Prerequisites

The project uses Python, Golang, and (nightly) Rust.

We used the following versions in our testing. Nightly Rust is currently used for the std::simd module; once the module is stabilized, stable Rust can be used.

Dependency   Version
Python       3.11.5
poetry       1.7.1
Go           go1.21.4 linux/amd64
rustc        rustc 1.76.0-nightly (eeff92ad3 2023-12-13)
cargo        cargo 1.76.0-nightly (1aa9df1a5 2023-12-12)

Python dependencies are managed by Poetry (installation instructions). After installing Poetry, you can install the project's dependencies from the root folder using poetry install.

Note: Specific instructions for running this project on the DAS5 compute cluster can be found here.

Master

Usage

Execute poetry run python3 master/run.py to start the master node (see here for more details about poetry and virtual environments).

Optionally, navigate to http://localhost:8000/docs for the API documentation.
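
To explore the API programmatically, the snippet below fetches the OpenAPI schema behind the /docs page and lists the available endpoints. It assumes the master serves the schema at /openapi.json (the usual default for FastAPI-style services); adjust the URL if your setup differs.

    # List the master's API endpoints from its OpenAPI schema.
    # Assumes the schema is served at /openapi.json (a common default); adjust if needed.
    import json
    from urllib.request import urlopen

    with urlopen("http://localhost:8000/openapi.json") as resp:
        schema = json.load(resp)

    for path, methods in schema.get("paths", {}).items():
        print(", ".join(m.upper() for m in methods), path)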

Testing

Run poetry run pytest master inside the root directory.

Worker

Usage

Execute go run cmd/worker/main.go "0.0.0.0:8000" to start the worker node. Golang should automatically install the required dependencies.

To manually set the number of cores, run go run cmd/worker/main.go "0.0.0.0:8000 4" (for 4 cores). This option is intended for experimentation purposes only; by default, the worker uses all available cores.

If the "master node IP and port" argument is not supplied, then the worker will connect to a default master node hosted locally at 0.0.0.0:8000.

Testing

Run go test ./... inside the root directory.

Inner Workings

The worker runs in an infinite loop that tries to register with the master node every X seconds. If the registration is successful, the worker starts sending a pulse every Y seconds to show the master it is alive, and it also enters another loop in which it asks for work every Z seconds. If it receives work from the master, it iterates through every query-target pair it was tasked to calculate and runs the Smith-Waterman algorithm on each one. As soon as a pair's result is calculated, the worker sends it to the master, so that if the worker were to shut down in the middle of its calculations, the remaining work could be delegated to another worker.
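
The sketch below restates that control flow in Python for clarity. The real worker is written in Go, and the names, intervals, and MasterStub used here are placeholders rather than the project's actual API.

    # Illustrative sketch of the worker's control flow; all names and intervals are
    # placeholders, and MasterStub stands in for the worker's HTTP calls to the master.
    import time

    REGISTER_INTERVAL = 5  # "X" seconds between registration attempts
    POLL_INTERVAL = 2      # "Y"/"Z" seconds between pulses and work requests

    class MasterStub:
        def register(self):            return True
        def send_pulse(self):          pass       # heartbeat
        def request_work(self):        return []  # list of (query, target) pairs
        def send_result(self, result): print("sent:", result)

    def run_worker(master, align):
        while True:                               # outer loop: (re)register
            if not master.register():
                time.sleep(REGISTER_INTERVAL)
                continue
            while True:                           # registered: pulse and poll for work
                master.send_pulse()               # show the master we are alive
                for query, target in master.request_work():
                    result = align(query, target)   # Smith-Waterman on one pair
                    master.send_result(result)      # sent immediately, so partial
                                                    # progress survives a shutdown
                time.sleep(POLL_INTERVAL)

    # run_worker(MasterStub(), smith_waterman)  # would loop forever in this sketch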

CLI

A command-line tool has been developed that allows one to submit sequence alignment jobs.

Run poetry run python3 cli [params] to submit a job. Run it without any parameters for help.

An example use: poetry run python3 cli --query datasets/query_sequences.fasta --database datasets/target_sequences.fasta --server-url http://0.0.0.0:8000 --match-score 2 --mismatch-penalty 1 --gap-penalty 1 --top-k 5

The alignment results are saved to the results directory. For every query sequence, a file is generated containing the best result for every target in the database file, using the same IDs as in the original files.

Synthetic Dataset Generation

To generate a synthetic query file and a target/database file, you can use the generate_synthetic_data.py script. First adjust the configuration in the script, and then execute python3 ./utils/generate_synthetic_data.py; the query and target files will be saved to the current working directory.
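
For context, the sketch below shows the general idea behind such a generator: write random nucleotide sequences out in FASTA format. It is not the project's script, and the file names, sequence counts, and lengths are made-up examples; use the configurable script in utils/ for real experiments.

    # Illustrative FASTA generator; file names, counts, and lengths are made up.
    # Use the project's configurable script in utils/ for real experiments.
    import random

    def write_fasta(path, n_sequences, length, prefix):
        with open(path, "w") as f:
            for i in range(n_sequences):
                f.write(f">{prefix}_{i}\n")
                f.write("".join(random.choice("ACGT") for _ in range(length)) + "\n")

    write_fasta("query_sequences.fasta", n_sequences=5, length=200, prefix="query")
    write_fasta("target_sequences.fasta", n_sequences=50, length=1000, prefix="target")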

Experiments

To effortlessly run experiments on the DAS5 cluster, the run_das5_experiments.py script was created. This quick-and-dirty script automates starting the master and workers, submitting a job via the CLI, and collecting all results into a JSON file. See the script and DAS5.md for more information.

For detailed experiment setups, results, and plotting, see the DLSA-Experiments repository.