run-trust4 is very slow; what should I do to speed it up? My data is bulk BCR-seq #328

Open
2024-lucky opened this issue Nov 12, 2024 · 19 comments

Comments

@2024-lucky

[Tue Nov 5 13:52:15 2024] Read in and count kmers for 112200000 reads.
[Tue Nov 5 15:05:27 2024] Found 109640387 reads.
[Tue Nov 5 15:07:42 2024] Finish sorting the reads.
[Tue Nov 5 15:15:52 2024] Finish rough annotations.
[Tue Nov 5 15:17:52 2024] Processed 100000 reads (96984 are used for assembly).
[Tue Nov 5 15:18:08 2024] Processed 200000 reads (116845 are used for assembly).
[Tue Nov 5 15:18:09 2024] Processed 300000 reads (164767 are used for assembly).
[Tue Nov 5 15:18:09 2024] Processed 400000 reads (242688 are used for assembly).
...

[Sun Nov 10 01:47:45 2024] Processed 6800000 reads (5988662 are used for assembly).
[Sun Nov 10 07:57:21 2024] Processed 6900000 reads (6088107 are used for assembly).
[Sun Nov 10 12:22:23 2024] Processed 7000000 reads (6173861 are used for assembly).
[Sun Nov 10 20:38:25 2024] Processed 7100000 reads (6273293 are used for assembly).
[Mon Nov 11 03:37:36 2024] Processed 7200000 reads (6367994 are used for assembly).
[Mon Nov 11 08:49:15 2024] Processed 7300000 reads (6456989 are used for assembly).
[Mon Nov 11 15:24:26 2024] Processed 7400000 reads (6556401 are used for assembly).
[Mon Nov 11 20:34:51 2024] Processed 7500000 reads (6639358 are used for assembly).
[Tue Nov 12 04:52:51 2024] Processed 7600000 reads (6739051 are used for assembly).

@mourisl
Collaborator

mourisl commented Nov 12, 2024

Which version are you using? What was your running command?

@2024-lucky
Author

./run-trust4
-1 $DATA_PATH/BY24.clean.R1.fastq.gz
-2 $DATA_PATH/BY24.clean.R2.fastq.gz
-f $TRUST4_PATH/hg38_bcrtcr.fa
--ref $TRUST4_PATH/human_IMGT+C.fa
-t 20
--od $RESULTS_PATH/BY24_output/

This is my script.

@mourisl
Collaborator

mourisl commented Nov 12, 2024

Which version of TRUST4 is this? You can add the option "--repseq", which should improve the running time. Just want to confirm, is this UMI-based BCR-seq?

@2024-lucky
Author

Yes, this is UMI-based BCR-seq.

@2024-lucky
Author

./run-trust4 --version
[Tue Nov 12 12:02:29 2024] TRUST4 v1.1.5-r565 begins.
Yes, this is UMI-based BCR-seq.

@mourisl
Collaborator

mourisl commented Nov 12, 2024

If it is UMI-based and you know the read format, i.e. where the UMI sequence is located in the read, you can use the options "--barcodeLevel molecule --barcode xxx --readFormat XXX", so that TRUST4 treats the UMI as a barcode to speed up the data processing. Since your file is very large and you are using the GitHub version v1.1.5, you can also use options like "--skipReadRealign" and "--contigMinCov 3" to speed up the process further. "--contigMinCov" filters out UMIs with fewer than the specified number of reads, so "--contigMinCov 3" will only generate results for UMIs with at least 3 reads.
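
To make the read-count filter described above concrete, here is a minimal Python sketch of the idea: group reads by UMI and keep only UMIs supported by at least a given number of reads. It illustrates the behaviour as described in this comment and is not TRUST4's implementation; the function name `umis_with_min_reads` and the toy UMI strings are hypothetical.

```python
from collections import Counter


def umis_with_min_reads(umis, min_reads=3):
    """Return the set of UMIs that are supported by at least `min_reads` reads."""
    counts = Counter(umis)
    return {umi for umi, n in counts.items() if n >= min_reads}


# Toy input: one UMI string per read (hypothetical values).
read_umis = [
    "AACCGGTT", "AACCGGTT", "AACCGGTT",  # 3 reads
    "TTGGCCAA", "TTGGCCAA",              # 2 reads
    "GGGGAAAA",                          # 1 read
]

kept = umis_with_min_reads(read_umis, min_reads=3)
print(sorted(kept))
```

With a threshold of 3, only the first UMI survives, so the lower-coverage UMIs would not contribute any results.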

@2024-lucky
Author

./run-trust4
-1 $DATA_PATH/BY24.clean.R1.fastq.gz
-2 $DATA_PATH/BY24.clean.R2.fastq.gz
-f $TRUST4_PATH/hg38_bcrtcr.fa
--ref $TRUST4_PATH/human_IMGT+C.fa
-t 80
-k 8
--skipReadRealign
--skipMateExtension
--repseq
--contigMinCov 3
--barcodeLevel molecule
--barcode 8 \ # UMI is in the first 8 bases
--readFormat PE
--od $RESULTS_PATH/output
Running bash run_trust4_job_2.sh gives an error:
[Tue Nov 12 14:11:52 2024] TRUST4 v1.1.5-r565 begins.
Could not find file 8
I don't know how to change it.

@mourisl
Collaborator

mourisl commented Nov 12, 2024

You don't need the "--repseq" option for UMI-based BCR-seq. The --barcode option specifies the file containing the barcodes, and --readFormat describes where the barcode and read sequences are located in the "-1, -2, --barcode" files. Examples can be found in the README: https://github.com/liulab-dfci/TRUST4?tab=readme-ov-file#10x-genomics-data-and-barcode-based-single-cell-data
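
As a rough illustration of what a --readFormat string describes, the Python sketch below parses a spec such as "bc:0:15,r1:16:-1,r2:0:-1" and slices a read accordingly. It assumes each entry is `name:start:end` with 0-based inclusive coordinates and -1 meaning "to the end of the read", which is how the README's 16-base barcode example (bc:0:15) reads; the helper names `parse_read_format` and `extract` and the example read are hypothetical, and this is not TRUST4's own parser.

```python
def parse_read_format(spec):
    """Parse 'bc:0:15,r1:16:-1' into {'bc': (0, 15), 'r1': (16, -1)}."""
    fields = {}
    for item in spec.split(","):
        name, start, end = item.strip().split(":")
        fields[name] = (int(start), int(end))
    return fields


def extract(seq, start, end):
    """Slice with 0-based inclusive coordinates; end == -1 means 'to the end'."""
    return seq[start:] if end == -1 else seq[start:end + 1]


fmt = parse_read_format("bc:0:15,r1:16:-1,r2:0:-1")

# Hypothetical 28-base R1 read: 16 barcode bases followed by 12 insert bases.
r1 = "ACGTACGTACGTACGT" + "TTTTGGGGCCCC"

barcode = extract(r1, *fmt["bc"])
insert = extract(r1, *fmt["r1"])
print(barcode, insert)
```

Under this reading, a UMI occupying only the first 8 bases of R1 would correspond to something like "bc:0:7,r1:8:-1,r2:0:-1", with --barcode pointing at the R1 file rather than at a number.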

@2024-lucky
Author

Run the TRUST4 command

./run-trust4
-1 $DATA_PATH/BY24.clean.R1.fastq.gz
-2 $DATA_PATH/BY24.clean.R2.fastq.gz
-f $TRUST4_PATH/hg38_bcrtcr.fa
--ref $TRUST4_PATH/human_IMGT+C.fa
--od $RESULTS_PATH
-t 36
--skipReadRealign
--skipMateExtension
--repseq
--contigMinCov 3
--clean 1
--barcode $DATA_PATH/BY24.clean.R1.fastq.gz
--readFormat "bc:0:15,r1:16:-1,r2:0:-1"

Hi mourisl, this is my script, but the program is still very slow, like this:

[Sun Nov 17 16:11:04 2024] Found 109640387 reads.
[Sun Nov 17 16:14:12 2024] Finish sorting the reads.
[Sun Nov 17 17:34:03 2024] Finish rough annotations.

...
[Mon Nov 18 01:20:03 2024] Processed 3700000 reads (3125295 are used for assembly).
[Mon Nov 18 02:19:36 2024] Processed 3800000 reads (3215601 are used for assembly).
[Mon Nov 18 03:33:27 2024] Processed 3900000 reads (3306588 are used for assembly).
[Mon Nov 18 05:02:21 2024] Processed 4000000 reads (3400819 are used for assembly).
[Mon Nov 18 06:27:30 2024] Processed 4100000 reads (3491291 are used for assembly).
[Mon Nov 18 08:14:26 2024] Processed 4200000 reads (3582141 are used for assembly).
[Mon Nov 18 10:02:51 2024] Processed 4300000 reads (3673120 are used for assembly).
[Mon Nov 18 12:53:53 2024] Processed 4400000 reads (3766495 are used for assembly).

@mourisl
Collaborator

mourisl commented Nov 18, 2024

Yes, this is too slow. What is the command line of the "trust4" main program shown in the running log?

@2024-lucky
Author

I used nohup bash run_trust4_job.sh. This is the script:
./run-trust4
-1 $DATA_PATH/BY24.clean.R1.fastq.gz
-2 $DATA_PATH/BY24.clean.R2.fastq.gz
-f $TRUST4_PATH/hg38_bcrtcr.fa
--ref $TRUST4_PATH/human_IMGT+C.fa
--od $RESULTS_PATH
-t 36
--skipReadRealign
--skipMateExtension
--repseq
--contigMinCov 3
--clean 1
--barcode $DATA_PATH/BY24.clean.R1.fastq.gz
--readFormat "bc:0:15,r1:16:-1,r2:0:-1"

@2024-lucky
Author

[Sun Nov 17 16:11:04 2024] Found 109640387 reads.
[Sun Nov 17 16:14:12 2024] Finish sorting the reads.
[Sun Nov 17 17:34:03 2024] Finish rough annotations.
[Sun Nov 17 17:34:19 2024] Processed 100000 reads (96984 are used for assembly).
[Sun Nov 17 17:34:19 2024] Processed 200000 reads (116845 are used for assembly).
[Sun Nov 17 17:34:19 2024] Processed 300000 reads (164767 are used for assembly).
[Sun Nov 17 17:34:20 2024] Processed 400000 reads (242688 are used for assembly).
[Sun Nov 17 17:34:20 2024] Processed 500000 reads (329413 are used for assembly).
[Sun Nov 17 17:34:22 2024] Processed 600000 reads (422298 are used for assembly).
nohup: ignoring input
[Sun Nov 17 11:20:13 2024] TRUST4 v1.1.5-r565 begins.
[Sun Nov 17 11:20:13 2024] SYSTEM CALL: /mnt/disk5/jianrong_disk5/vdj/chen_VDJ/software/TRUST4-master/fastq-extractor -t 1 -f /home/jianrong/jianrong_disk5/vdj/chen_VDJ/software/TRUST4-master/hg38_bcrtcr.fa -o /home/jianrong/jianrong_disk5/vdj/chen_VDJ/result/TRUST_BY24_toassemble -1 /home/jianrong/jianrong_disk5/vdj/chen_VDJ/BY24_BulkVDJ/CleanData/BY24.clean.R1.fastq.gz -2 /home/jianrong/jianrong_disk5/vdj/chen_VDJ/BY24_BulkVDJ/CleanData/BY24.clean.R2.fastq.gz
[Sun Nov 17 11:20:13 2024] Start to extract candidate reads from read files.
[Sun Nov 17 12:38:24 2024] Finish extracting reads.
[Sun Nov 17 12:38:24 2024] SYSTEM CALL: /mnt/disk5/jianrong_disk5/vdj/chen_VDJ/software/TRUST4-master/trust4 -f /home/jianrong/jianrong_disk5/vdj/chen_VDJ/software/TRUST4-master/hg38_bcrtcr.fa -o /home/jianrong/jianrong_disk5/vdj/chen_VDJ/result/TRUST_BY24 -1 /home/jianrong/jianrong_disk5/vdj/chen_VDJ/result/TRUST_BY24_toassemble_1.fq -2 /home/jianrong/jianrong_disk5/vdj/chen_VDJ/result/TRUST_BY24_toassemble_2.fq
[Sun Nov 17 12:38:26 2024] Read in and count kmers for 100000 reads.
[Sun Nov 17 12:38:29 2024] Read in and count kmers for 200000 reads.
[Sun Nov 17 12:38:31 2024] Read in and count kmers for 300000 reads.
[Sun Nov 17 12:38:35 2024] Read in and count kmers for 400000 reads.
[Sun Nov 17 12:38:38 2024] Read in and count kmers for 500000 reads.
[Sun Nov 17 12:38:42 2024] Read in and count kmers for 600000 reads.
[Sun Nov 17 12:38:46 2024] Read in and count kmers for 700000 reads.
[Sun Nov 17 12:38:50 2024] Read in and count kmers for 800000 reads.
[Sun Nov 17 12:38:56 2024] Read in and count kmers for 900000 reads.
[Sun Nov 17 12:39:01 2024] Read in and count kmers for 1000000 reads.
[Sun Nov 17 12:39:04 2024] Read in and count kmers for 1100000 reads.
[Sun Nov 17 12:39:07 2024] Read in and count kmers for 1200000 reads.
[Sun Nov 17 12:39:10 2024] Read in and count kmers for 1300000 reads.
[Sun Nov 17 12:39:15 2024] Read in and count kmers for 1400000 reads.
[Sun Nov 17 12:39:18 2024] Read in and count kmers for 1500000 reads.
[Sun Nov 17 12:39:21 2024] Read in and count kmers for 1600000 reads.
[Sun Nov 17 12:39:24 2024] Read in and count kmers for 1700000 reads.
[Sun Nov 17 12:39:26 2024] Read in and count kmers for 1800000 reads.
[Sun Nov 17 12:39:30 2024] Read in and count kmers for 1900000 reads.
[Sun Nov 17 12:39:36 2024] Read in and count kmers for 2000000 reads.
[Sun Nov 17 12:39:39 2024] Read in and count kmers for 2100000 reads.
[Sun Nov 17 12:39:43 2024] Read in and count kmers for 2200000 reads.
[Sun Nov 17 12:39:47 2024] Read in and count kmers for 2300000 reads.
[Sun Nov 17 12:39:54 2024] Read in and count kmers for 2400000 reads.
[Sun Nov 17 12:39:58 2024] Read in and count kmers for 2500000 reads.
[Sun Nov 17 12:40:03 2024] Read in and count kmers for 2600000 reads.
[Sun Nov 17 12:40:07 2024] Read in and count kmers for 2700000 reads.
run_trust4_job.log

@2024-lucky
Author

[Sun Nov 17 16:11:04 2024] Found 109640387 reads.
[Sun Nov 17 16:14:12 2024] Finish sorting the reads.
[Sun Nov 17 17:34:03 2024] Finish rough annotations.
[Sun Nov 17 17:34:19 2024] Processed 100000 reads (96984 are used for assembly).
[Sun Nov 17 17:34:19 2024] Processed 200000 reads (116845 are used for assembly).
[Sun Nov 17 17:34:19 2024] Processed 300000 reads (164767 are used for assembly).
[Sun Nov 17 17:34:20 2024] Processed 400000 reads (242688 are used for assembly).
[Sun Nov 17 17:34:20 2024] Processed 500000 reads (329413 are used for assembly).
[Sun Nov 17 17:34:22 2024] Processed 600000 reads (422298 are used for assembly).
[Sun Nov 17 17:34:24 2024] Processed 700000 reads (518004 are used for assembly).
[Sun Nov 17 17:34:27 2024] Processed 800000 reads (601344 are used for assembly).
[Sun Nov 17 17:34:35 2024] Processed 900000 reads (688215 are used for assembly).
[Sun Nov 17 17:34:49 2024] Processed 1000000 reads (773828 are used for assembly).
[Sun Nov 17 17:35:14 2024] Processed 1100000 reads (865280 are used for assembly).
[Sun Nov 17 17:35:46 2024] Processed 1200000 reads (953925 are used for assembly).
[Sun Nov 17 17:36:36 2024] Processed 1300000 reads (1042795 are used for assembly).
[Sun Nov 17 17:37:50 2024] Processed 1400000 reads (1129429 are used for assembly).
[Sun Nov 17 17:39:20 2024] Processed 1500000 reads (1214814 are used for assembly).
[Sun Nov 17 17:41:46 2024] Processed 1600000 reads (1300869 are used for assembly).
[Sun Nov 17 17:44:53 2024] Processed 1700000 reads (1385015 are used for assembly).
[Sun Nov 17 17:48:38 2024] Processed 1800000 reads (1469792 are used for assembly).
[Sun Nov 17 17:53:29 2024] Processed 1900000 reads (1554178 are used for assembly).
[Sun Nov 17 17:58:46 2024] Processed 2000000 reads (1637712 are used for assembly).
[Sun Nov 17 18:05:29 2024] Processed 2100000 reads (1722121 are used for assembly).
[Sun Nov 17 18:13:57 2024] Processed 2200000 reads (1806344 are used for assembly).
[Sun Nov 17 18:24:41 2024] Processed 2300000 reads (1891016 are used for assembly).
[Sun Nov 17 18:37:12 2024] Processed 2400000 reads (1975661 are used for assembly).
[Sun Nov 17 18:52:37 2024] Processed 2500000 reads (2063588 are used for assembly).
[Sun Nov 17 19:10:10 2024] Processed 2600000 reads (2149792 are used for assembly).
[Sun Nov 17 19:30:03 2024] Processed 2700000 reads (2236971 are used for assembly).
[Sun Nov 17 19:49:22 2024] Processed 2800000 reads (2324378 are used for assembly).
[Sun Nov 17 20:07:44 2024] Processed 2900000 reads (2410479 are used for assembly).
[Sun Nov 17 20:28:30 2024] Processed 3000000 reads (2497734 are used for assembly).
[Sun Nov 17 20:56:04 2024] Processed 3100000 reads (2587591 are used for assembly).
[Sun Nov 17 21:23:51 2024] Processed 3200000 reads (2676031 are used for assembly).
[Sun Nov 17 21:58:42 2024] Processed 3300000 reads (2764604 are used for assembly).
[Sun Nov 17 22:35:59 2024] Processed 3400000 reads (2853675 are used for assembly).
[Sun Nov 17 23:22:29 2024] Processed 3500000 reads (2943887 are used for assembly).
[Mon Nov 18 00:21:08 2024] Processed 3600000 reads (3035447 are used for assembly).
[Mon Nov 18 01:20:03 2024] Processed 3700000 reads (3125295 are used for assembly).
[Mon Nov 18 02:19:36 2024] Processed 3800000 reads (3215601 are used for assembly).
[Mon Nov 18 03:33:27 2024] Processed 3900000 reads (3306588 are used for assembly).
[Mon Nov 18 05:02:21 2024] Processed 4000000 reads (3400819 are used for assembly).
[Mon Nov 18 06:27:30 2024] Processed 4100000 reads (3491291 are used for assembly).
[Mon Nov 18 08:14:26 2024] Processed 4200000 reads (3582141 are used for assembly).
[Mon Nov 18 10:02:51 2024] Processed 4300000 reads (3673120 are used for assembly).
[Mon Nov 18 12:53:53 2024] Processed 4400000 reads (3766495 are used for assembly).
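
The slowdown in a log like the one above can be quantified by measuring the wall-clock time between consecutive "Processed N reads" lines. The following Python sketch does that for lines in the timestamp format shown in this thread; the function name `batch_durations` is hypothetical, and the sample lines are copied from the log above.

```python
import re
from datetime import datetime

LINE_RE = re.compile(r"\[(.+?)\] Processed (\d+) reads")


def batch_durations(log_lines):
    """Return (reads_processed, seconds_since_previous_line) per 'Processed' line.

    The first matching line has no predecessor, so it yields no entry.
    """
    points = []
    for line in log_lines:
        m = LINE_RE.search(line)
        if m:
            ts = datetime.strptime(m.group(1), "%a %b %d %H:%M:%S %Y")
            points.append((int(m.group(2)), ts))
    return [
        (n, (t - prev_t).total_seconds())
        for (_, prev_t), (n, t) in zip(points, points[1:])
    ]


early = [
    "[Sun Nov 17 17:34:49 2024] Processed 1000000 reads (773828 are used for assembly).",
    "[Sun Nov 17 17:35:14 2024] Processed 1100000 reads (865280 are used for assembly).",
    "[Sun Nov 17 17:35:46 2024] Processed 1200000 reads (953925 are used for assembly).",
]
late = [
    "[Mon Nov 18 10:02:51 2024] Processed 4300000 reads (3673120 are used for assembly).",
    "[Mon Nov 18 12:53:53 2024] Processed 4400000 reads (3766495 are used for assembly).",
]

print(batch_durations(early))  # seconds per 100,000 reads early in the run
print(batch_durations(late))   # seconds per 100,000 reads late in the run
```

On these lines, a batch of 100,000 reads takes about half a minute around the 1 million mark and close to three hours around the 4.4 million mark, so the cost per batch grows sharply as the run progresses.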

@mourisl
Collaborator

mourisl commented Nov 18, 2024

As you can see, many options were not passed to the "trust4" main program, such as --barcode and the thread number (-t). I guess there might be a typo in the bash script, or you are still running it with the old parameter settings.
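
One way to check this from the log is to pull out the "SYSTEM CALL:" line for the "trust4" program and see which options actually appear in it. The Python sketch below assumes the log format shown in this thread; the function name `options_passed` and the shortened paths in the sample line are hypothetical.

```python
import shlex


def options_passed(log_lines, program="trust4"):
    """Return the argument list of the last SYSTEM CALL that runs `program`, or None."""
    found = None
    for line in log_lines:
        if "SYSTEM CALL:" not in line:
            continue
        args = shlex.split(line.split("SYSTEM CALL:", 1)[1])
        # Compare the executable's base name so that fastq-extractor is not matched.
        if args and args[0].rsplit("/", 1)[-1] == program:
            found = args
    return found


# Sample lines modelled on the log in this thread, with paths shortened.
log = [
    "[Sun Nov 17 11:20:13 2024] SYSTEM CALL: /path/to/fastq-extractor -t 1 -f ref.fa -o out/TRUST_BY24_toassemble -1 R1.fastq.gz -2 R2.fastq.gz",
    "[Sun Nov 17 12:38:24 2024] SYSTEM CALL: /path/to/trust4 -f ref.fa -o out/TRUST_BY24 -1 out_1.fq -2 out_2.fq",
]

args = options_passed(log)
missing = [opt for opt in ("-t", "--barcode") if opt not in args]
print("missing:", missing)
```

If options given in the wrapper script are reported as missing here, the script that actually ran was most likely not the one containing them.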

@2024-lucky
Author

If there were syntax errors in the script, it would not run at all. When I had syntax errors in my script before, the program could not run.

@2024-lucky
Author

I tried a new script and checked that there is no problem:
cat run_job_3.log
nohup: ignoring input
[Mon Nov 18 15:14:02 2024] TRUST4 v1.1.5-r565 begins.
[Mon Nov 18 15:14:02 2024] SYSTEM CALL: /home/data/TRUST4-master/fastq-extractor -t 120 -f /home/data/TRUST4-master/hg38_bcrtcr.fa -o /home/data/BY24_BulkVDJ/results_2/TRUST_BY24_toassemble --readFormat bc:0:15,r1:16:-1,r2:0:-1 -1 /home/data/BY24_BulkVDJ/CleanData/BY24.clean.R1.fastq.gz -2 /home/data/BY24_BulkVDJ/CleanData/BY24.clean.R2.fastq.gz --barcode /home/data/BY24_BulkVDJ/CleanData/BY24.clean.R1.fastq.gz
[Mon Nov 18 15:14:02 2024] Start to extract candidate reads from read files.

But it is still slow.

@2024-lucky
Author

I found that no matter how many threads I set up, it ultimately runs in a single thread.

@2024-lucky
Author

I think that the program may not be using multithreading correctly. In the logs, both fastq-extractor and trust4 indeed received the -t 120 parameter, but whether they correctly parsed and applied it internally needs to be checked in their documentation or source code.

@mourisl
Collaborator

mourisl commented Nov 18, 2024

What is the "trust4" command in the running log now?
