HLPI-Ensemble

A tool for predicting lncRNA-protein interaction from sequences. Based on http://ccsipb.lnu.edu.cn/hlpiensemble/index.php web server.

Requirements

The package was tested under following dependencies:

Package	Version
python	3.6.7
R	3.5.1
R::dplyr	0.7.8
R::caret	6.0.82
R::randomForest	4.6.14
R::xgboost	0.82.1
R::kernlab	0.9.26
R::doParallel	1.0.14
R::foreach	1.4.4
R::iterators	1.0.9

Currently, the package works only under UNIX. Windows users should suggest using WSL or Cygwin.

Installation

One needs git >= 1.8.2 and git-lfs to be installed to deal with large binaries at models/*.Rdata. Run

git clone https://github.com/dmitrymyl/hlpiensemble.git

and files will be fetched via git and git lfs. After that, the package is ready for usage.

Usage

The master script is hlpiensemble.py that allows one to run prediction from any directory. One should do as follows:

python3 path/to/repo/hlpiensemble.py -rna rna.fasta -protein protein.fasta -mode result -output here.csv -taskname some_task

Command line arguments are:

argument	type	description	default
-rna	mandatory	fasta file containing one or many RNA sequences. Allowed symbols for sequences are A, T, G, C, U.	None
-protein	mandatory	fasta file containing one or many protein sequences. Allowed symbols for sequences are 20 amino acid letter.	None
-mode	optional	Mode of output. If "result", will produce a .csv file. If "full", will produce a directory with all intermediate files.	result
-output	mandatory	Name of the output file/directory.	None
-taskname	optional	Name of task	some_task
-cores	optional	Number of cores to use for prediction	1
--timing	optional	Whether to profile execution time or not	False

How predictions are made

The training dataset was NPInter v2.0 database of lncRNA-protein interactions. Several features (named pse, kmer and acc) are exctracted from sequences and then applied to three mainstream algorithms: RF, SVM, XGBoost.

Execution scheme

The programme works in hlpiensemble directory. It copies sequence files (upcasing sequences and replacing Ts with Us in RNA sequence) to path/to/hlpiensemble/task/taskname directory. Intermediate files and prediction results are produced in the same directory. Completing prediction, programme copies HLPI-Ensemble.csv file to specified place in -output (with renaming) in case of result mode or move the entire directory to specified place (with renaming) in case of full mode. In both modes, task/taskname directory will be deleted.

If one consider parallel execution of hlpiensemble.py script, one should be aware of different names for tasks to prevent overwriting of results.

Output

The output is .csv file containing probabilities of interaction predicted by each algorithm for each pair of RNA and protein.

Time complexity

Testing scheme

It is tested how the length of RNA and protein, the number of sequences and number of cores influence performance. For length, 5 RNAs and 5 proteins are generated with length from 100 to 1000 with step 100, i.e. 10 RNA files and 10 protein files. Then all-to-all runs are performed, i.e. 100 runs. For number, 1 to 10 RNAs and proteins of length of 100 are generated and all-to-all runs are performed, i.e. 100 runs. For parallel processing, 10 RNAs of length of 100 and 10 proteins of length of 100 are taken and then processed with 1 to 10 cores, i.e. 10 runs.

How to reproduce testing

For evaluating execution time one has to generate sequence samples, test them and process results. To do so, one has to do following from the package directory:

cd time_samples
bash generate_samples.sh
cd ../time_results
bash length.sh
bash number.sh
bash parallel.sh

After execution one has to run time_results/time_results.ipynb notebook to produce plots.

Results

Due to 1 run per case there is lack of data and inconsistent results on time performance. It seems like length of sequences and their number do not influence execution time. However, running multiple cores in R negatively influences performance due to costs of parallelism applied.

Contributions

The initial author Fule Liu developed most of the prediction backend. @dmitrymyl adopted web server to CLI usage, including paths tweaks and master script, parallelism and time testing.

Citation

Please cite Huan Hu, Li Zhang, Haixin Ai, Hui Zhang, Yetian Fan, Qi Zhao & Hongsheng Liu (2018) HLPI-Ensemble: Prediction of human lncRNA-protein interactions based on ensemble strategy, RNA Biology, 15:6, 797-806, DOI: 10.1080/15476286.2018.1457935 in your paper if you use this software.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

HLPI-Ensemble

Requirements

Installation

Usage

How predictions are made

Execution scheme

Output

Time complexity

Testing scheme

How to reproduce testing

Results

Contributions

Citation

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
models		models
pse		pse
task		task
time_results		time_results
time_samples		time_samples
LICENSE		LICENSE
README.md		README.md
Serverprediction.py		Serverprediction.py
generate_sequence.py		generate_sequence.py
hlpiensemble.py		hlpiensemble.py

License

dmitrymyl/hlpiensemble

Folders and files

Latest commit

History

Repository files navigation

HLPI-Ensemble

Requirements

Installation

Usage

How predictions are made

Execution scheme

Output

Time complexity

Testing scheme

How to reproduce testing

Results

Contributions

Citation

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages