SparkLeBLAST

Scalable Parallelization of BLAST Sequence Alignment Using Spark

Dependencies

Scala Build Tool

sbt is used to build SparkLeBLAST (https://www.scala-sbt.org/).

Spark

We conducted our experiments using Spark v2.2.0 (https://archive.apache.org/dist/spark/spark-2.2.0/).

NCBI BLAST

SparkLeBLAST works independently of any specific BLAST version. We recently tested it with BLAST v2.13.0 (https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/).

spark-slurm

For usage on an HPC cluster with the SLURM workload manager (details below), there are two approaches:

Use the shipped start_spark_slurm.sbatch script
Use spark-slurm (https://github.com/NIH-HPC/spark-slurm)

Note: spark-slurm enables more flexible configuration and logging options. It may need some edits to adapt it to your runtime environment. Our adapted version is available at https://github.com/karimyoussef91/spark-slurm

Usage

HPC Cluster With SLURM Workload Manager

Environment variables

    export SPARK_HOME=</path/to/installed/spark/directory> # Directory of the pre-built Spark downloaded from the dependencies above
    export SPARK_SLURM_PATH=</path/to/spark-slurm> # Only needed if using spark-slurm
    export NCBI_BLAST_PATH=</path/to/ncbi_blast/binaries>
    export SLB_WORKDIR=$(pwd) # Path to SparkLeBLAST root directory
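
Before submitting a job, it can help to confirm that these variables are set and point to existing directories. The following is a minimal sketch of such a check. The helper name check_slb_env is hypothetical and is not part of SparkLeBLAST. It treats SPARK_SLURM_PATH as optional, since that variable is only needed when using spark-slurm.

```shell
# check_slb_env: hypothetical helper, not shipped with SparkLeBLAST.
# Returns 0 if every required variable is set and names an existing directory.
check_slb_env() (
    rc=0
    for var in SPARK_HOME NCBI_BLAST_PATH SLB_WORKDIR; do
        # Indirectly read the value of the variable whose name is in $var.
        eval "value=\${$var:-}"
        if [ -z "$value" ]; then
            echo "error: $var is not set" >&2
            rc=1
        elif [ ! -d "$value" ]; then
            echo "error: $var points to a missing directory: $value" >&2
            rc=1
        fi
    done
    # SPARK_SLURM_PATH is only required for the spark-slurm approach, so only warn.
    if [ -z "${SPARK_SLURM_PATH:-}" ]; then
        echo "note: SPARK_SLURM_PATH is not set (needed only with spark-slurm)" >&2
    fi
    return "$rc"
)
```

Calling check_slb_env after the exports prints one message per problem to standard error and returns a non-zero status if any required variable is missing or invalid.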

Partitioning and formatting a BLAST database:

    ./SparkLeMakeDB.sh -p <num_partitions> -time <job_time_integer_minutes> -i <input_DB_path> -t <output_partitions_base_path>

Running BLAST search:

    ./SparkLeBLASTSearch.sh -p <num_partitions> -time <job_time_integer_minutes> -q <query_file_path> -db <database_partitions_base_path> -d <spark_logs_path>
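
As a concrete illustration of the search command, the sketch below validates the two numeric options and prints the resulting command line instead of running it. Only the script name and flags come from this README. The helper names is_positive_int and slb_search_cmd and the example paths are hypothetical placeholders.

```shell
# Succeeds only for a positive whole number such as 4 or 120.
is_positive_int() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$1" -gt 0 ]
}

# Prints the SparkLeBLASTSearch.sh command line for the given settings.
# Arguments: partitions, job time in minutes, query file,
#            database partitions base path, Spark logs path.
slb_search_cmd() {
    if ! is_positive_int "$1" || ! is_positive_int "$2"; then
        echo "error: partitions and job time must be positive integers" >&2
        return 1
    fi
    printf './SparkLeBLASTSearch.sh -p %s -time %s -q %s -db %s -d %s\n' \
        "$1" "$2" "$3" "$4" "$5"
}

# Example with placeholder paths (hypothetical; substitute your own).
slb_search_cmd 4 60 /path/to/query.fasta /path/to/db_partitions /path/to/spark_logs
```

Printing the command first makes it easy to review the options, in particular that the job time is given in whole minutes, before submitting the real job.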

Custom Spark Cluster

A version of the launch scripts that does not require SLURM will be available soon.

Data

Small Data Samples for Testing

COVID-19 Genomic Diversity Analysis

Preprocessed Query: A compressed file is available in this repository under covdiv_sample
Database (Compressed Raw Size of 144GB): https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nt.gz
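
One possible way to fetch the database archive is sketched below, assuming curl and gzip are installed and the destination has enough free space for a file of this size. The helper name fetch_nt_db is hypothetical and is not part of SparkLeBLAST. The sketch only downloads the archive and checks its integrity. Whether it must be decompressed before partitioning is not stated here, so that step is left out.

```shell
# URL taken from the Database entry above.
NT_URL="https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nt.gz"

# fetch_nt_db: hypothetical helper, not shipped with SparkLeBLAST.
# Usage: fetch_nt_db <destination_dir>
fetch_nt_db() (
    dest_dir=${1:?"usage: fetch_nt_db <destination_dir>"}
    mkdir -p "$dest_dir" || return 1
    cd "$dest_dir" || return 1
    # -f: fail on HTTP errors, -L: follow redirects, -O: keep the remote name (nt.gz).
    curl -f -L -O "$NT_URL" || return 1
    # Check the archive integrity without decompressing it.
    gzip -t nt.gz
)
```

For example, fetch_nt_db /path/to/data (a placeholder path) leaves nt.gz in that directory and returns a non-zero status if the download or the integrity check fails.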

Publication

Youssef, Karim, and Wu-chun Feng. "SparkLeBLAST: Scalable Parallelization of BLAST Sequence Alignment Using Spark." 2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID). IEEE, 2020.

License

Please refer to the included LICENSE file.
