Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nanocompore SampComp stuck at 0% #222

Open
sidizhao opened this issue Jul 10, 2023 · 8 comments
Open

Nanocompore SampComp stuck at 0% #222

sidizhao opened this issue Jul 10, 2023 · 8 comments

Comments

@sidizhao
Copy link

Describe the bug
Hi, I've been trying to run SampComp on 6 samples of ONT Direct RNA-seq on a METTL3 KD cell line for some time, and have yet to get past the "parse transcript" step. I have 512G of RAM requested to run this and it just gets stuck for multiple days. Here's the log:

2023-07-08T01:55:22.910444+0000 WARNING - MainProcess | Running SampComp
2023-07-08T01:55:22.910969+0000 INFO - MainProcess | Checking and initialising SampComp
2023-07-08T01:55:22.911463+0000 DEBUG - MainProcess | 	package_name: nanocompore
2023-07-08T01:55:22.911790+0000 DEBUG - MainProcess | 	package_version: 1.0.4
2023-07-08T01:55:22.912016+0000 DEBUG - MainProcess | 	timestamp: 2023-07-08 01:55:22.912005
2023-07-08T01:55:22.912189+0000 DEBUG - MainProcess | 	fasta_fn: /storage1/fs1/christophermaher/Active/maherlab/sidizhao/circ_rna/long_read/m6a_methlytion_direct_rna/all-chrs.fa
2023-07-08T01:55:22.912390+0000 DEBUG - MainProcess | 	outpath: /storage1/fs1/christophermaher/Active/maherlab/getian/pr1/results/test
2023-07-08T01:55:22.912605+0000 DEBUG - MainProcess | 	outprefix: out
2023-07-08T01:55:22.913050+0000 DEBUG - MainProcess | 	overwrite: True
2023-07-08T01:55:22.913381+0000 DEBUG - MainProcess | 	comparison_methods: GMM,KS
2023-07-08T01:55:22.913537+0000 DEBUG - MainProcess | 	logit: True
2023-07-08T01:55:22.913793+0000 DEBUG - MainProcess | 	anova: False
2023-07-08T01:55:22.913928+0000 DEBUG - MainProcess | 	allow_warnings: True
2023-07-08T01:55:22.914056+0000 DEBUG - MainProcess | 	sequence_context: 0
2023-07-08T01:55:22.914176+0000 DEBUG - MainProcess | 	sequence_context_weights: uniform
2023-07-08T01:55:22.914547+0000 DEBUG - MainProcess | 	min_coverage: 30
2023-07-08T01:55:22.914672+0000 DEBUG - MainProcess | 	min_ref_length: 100
2023-07-08T01:55:22.914795+0000 DEBUG - MainProcess | 	downsample_high_coverage: 5000
2023-07-08T01:55:22.914919+0000 DEBUG - MainProcess | 	max_invalid_kmers_freq: 0.1
2023-07-08T01:55:22.915044+0000 DEBUG - MainProcess | 	select_ref_id: []
2023-07-08T01:55:22.915164+0000 DEBUG - MainProcess | 	exclude_ref_id: []
2023-07-08T01:55:22.915294+0000 DEBUG - MainProcess | 	nthreads: 3
2023-07-08T01:55:22.915415+0000 DEBUG - MainProcess | 	progress: True
2023-07-08T01:55:22.962418+0000 DEBUG - MainProcess | OrderedDict([('ctrl', {'ctrl_1': '/storage1/fs1/christophermaher/Active/maherlab/sidizhao/circ_rna/long_read/m6a_methlytion_direct_rna/nanocompore_files/nanocompore_output/FAR62703/FAR62703_eventalign_collapse.tsv', 'ctrl_2': '/storage1/fs1/christophermaher/Active/maherlab/sidizhao/circ_rna/long_read/m6a_methlytion_direct_rna/nanocompore_files/nanocompore_output/FAU62362/FAU62362_eventalign_collapse.tsv'}), ('kd', {'kd_1': '/storage1/fs1/christophermaher/Active/maherlab/sidizhao/circ_rna/long_read/m6a_methlytion_direct_rna/nanocompore_files/nanocompore_output/FAU65190/FAU65190_eventalign_collapse.tsv', 'kd_2': '/storage1/fs1/christophermaher/Active/maherlab/sidizhao/circ_rna/long_read/m6a_methlytion_direct_rna/nanocompore_files/nanocompore_output/PAM12362/PAM12362_eventalign_collapse.tsv', 'kd_3': '/storage1/fs1/christophermaher/Active/maherlab/sidizhao/circ_rna/long_read/m6a_methlytion_direct_rna/nanocompore_files/nanocompore_output/FAS65798/FAS65798_eventalign_collapse.tsv', 'kd_4': '/storage1/fs1/christophermaher/Active/maherlab/sidizhao/circ_rna/long_read/m6a_methlytion_direct_rna/nanocompore_files/nanocompore_output/FAS65713/FAS65713_eventalign_collapse.tsv'})])
2023-07-08T01:55:23.258677+0000 INFO - MainProcess | Reading eventalign index files
2023-07-08T01:55:59.601494+0000 DEBUG - MainProcess | 	Condition:ctrl Sample:ctrl_1 	High fraction of invalid kmers: 1,437,090	valid reads: 418,498
2023-07-08T01:56:29.285645+0000 DEBUG - MainProcess | 	Condition:ctrl Sample:ctrl_2 	High fraction of invalid kmers: 1,170,379	valid reads: 335,198
2023-07-08T01:57:09.931337+0000 DEBUG - MainProcess | 	Condition:kd Sample:kd_1 	High fraction of invalid kmers: 1,572,437	valid reads: 485,752
2023-07-08T02:00:49.691289+0000 DEBUG - MainProcess | 	Condition:kd Sample:kd_2 	High fraction of invalid kmers: 8,258,366	valid reads: 2,978,007
2023-07-08T02:01:12.363506+0000 DEBUG - MainProcess | 	Condition:kd Sample:kd_3 	High fraction of invalid kmers: 880,577	valid reads: 302,132
2023-07-08T02:01:40.454853+0000 DEBUG - MainProcess | 	Condition:kd Sample:kd_4 	High fraction of invalid kmers: 948,443	valid reads: 309,549
2023-07-08T02:01:40.455894+0000 INFO - MainProcess | 	References found in index: 203
2023-07-08T02:01:40.456229+0000 INFO - MainProcess | Filtering out references with low coverage
2023-07-08T02:01:41.073182+0000 DEBUG - MainProcess | 	kd kd_2 Reads: 218,264	ctrl ctrl_1 Reads: 146,950	kd kd_1 Reads: 144,722	ctrl ctrl_2 Reads: 136,951	kd kd_3 Reads: 132,693	kd kd_4 Reads: 130,008	invalid_ref_id: 137	valid_ref_id: 66
2023-07-08T02:01:41.073543+0000 INFO - MainProcess | 	References remaining after reference coverage filtering: 66
2023-07-08T02:01:45.479003+0000 INFO - MainProcess | Starting data processing
2023-07-08T02:01:45.584960+0000 DEBUG - Process-1 | Adding chr1 to in_q
2023-07-08T02:01:45.675927+0000 DEBUG - Process-2 | Worker thread started
2023-07-08T02:01:45.621621+0000 DEBUG - Process-1 | Adding chr10 to in_q
2023-07-08T02:01:46.053949+0000 DEBUG - Process-1 | Adding chr11 to in_q
2023-07-08T02:01:46.297462+0000 DEBUG - Process-1 | Adding chr12 to in_q
2023-07-08T02:01:46.342287+0000 DEBUG - Process-1 | Adding chr13 to in_q
2023-07-08T02:01:46.370333+0000 DEBUG - Process-1 | Adding chr14 to in_q
2023-07-08T02:01:46.414894+0000 DEBUG - Process-1 | Adding chr15 to in_q
2023-07-08T02:01:46.446543+0000 DEBUG - Process-1 | Adding chr16 to in_q
2023-07-08T02:01:46.498344+0000 DEBUG - Process-1 | Adding chr16_KI270728v1_random to in_q
2023-07-08T02:01:46.549909+0000 DEBUG - Process-1 | Adding chr17 to in_q
2023-07-08T02:01:46.589211+0000 DEBUG - Process-1 | Adding chr18 to in_q
2023-07-08T02:01:46.624302+0000 DEBUG - Process-1 | Adding chr19 to in_q
2023-07-08T02:01:46.653000+0000 DEBUG - Process-1 | Adding chr2 to in_q
2023-07-08T02:01:46.703102+0000 DEBUG - Process-1 | Adding chr20 to in_q
2023-07-08T02:01:46.747155+0000 DEBUG - Process-1 | Adding chr21 to in_q
2023-07-08T02:01:46.791885+0000 DEBUG - Process-1 | Adding chr22 to in_q
2023-07-08T02:01:46.839534+0000 DEBUG - Process-1 | Adding chr22_KI270733v1_random to in_q
2023-07-08T02:01:46.877947+0000 DEBUG - Process-1 | Adding chr3 to in_q
2023-07-08T02:01:46.922850+0000 DEBUG - Process-1 | Adding chr4 to in_q
2023-07-08T02:01:46.961869+0000 DEBUG - Process-1 | Adding chr5 to in_q
2023-07-08T02:01:46.991997+0000 DEBUG - Process-1 | Adding chr6 to in_q
2023-07-08T02:01:47.070883+0000 DEBUG - Process-2 | Worker thread processing new item from in_q: chr1
2023-07-08T02:01:47.065802+0000 DEBUG - Process-1 | Adding chr7 to in_q
2023-07-08T02:01:47.098081+0000 DEBUG - Process-1 | Adding chr8 to in_q
2023-07-08T02:01:47.098439+0000 DEBUG - Process-1 | Adding chr9 to in_q
2023-07-08T02:01:47.098590+0000 DEBUG - Process-1 | Adding chr4_GL000257v2_alt to in_q
2023-07-08T02:01:47.098738+0000 DEBUG - Process-1 | Adding chr5_KI270791v1_alt to in_q
2023-07-08T02:01:47.098868+0000 DEBUG - Process-1 | Adding chr5_GL339449v2_alt to in_q
2023-07-08T02:01:47.099001+0000 DEBUG - Process-1 | Adding chr6_GL000250v2_alt to in_q
2023-07-08T02:01:47.099134+0000 DEBUG - Process-1 | Adding chr7_KI270806v1_alt to in_q
2023-07-08T02:01:47.099274+0000 DEBUG - Process-1 | Adding chr7_GL383534v2_alt to in_q
2023-07-08T02:01:47.099406+0000 DEBUG - Process-1 | Adding chr8_KI270817v1_alt to in_q
2023-07-08T02:01:47.099532+0000 DEBUG - Process-1 | Adding chr15_GL383554v1_alt to in_q
2023-07-08T02:01:47.099654+0000 DEBUG - Process-1 | Adding chr15_KI270849v1_alt to in_q
2023-07-08T02:01:47.099776+0000 DEBUG - Process-1 | Adding chr16_KI270855v1_alt to in_q
2023-07-08T02:01:47.099896+0000 DEBUG - Process-1 | Adding chr16_KI270853v1_alt to in_q
2023-07-08T02:01:47.100018+0000 DEBUG - Process-1 | Adding chr17_KI270857v1_alt to in_q
2023-07-08T02:01:47.100149+0000 DEBUG - Process-1 | Adding chr17_GL000258v2_alt to in_q
2023-07-08T02:01:47.100283+0000 DEBUG - Process-1 | Adding chr19_KI270868v1_alt to in_q
2023-07-08T02:01:47.100407+0000 DEBUG - Process-1 | Adding chr19_KI270866v1_alt to in_q
2023-07-08T02:01:47.100528+0000 DEBUG - Process-1 | Adding chr19_GL949746v1_alt to in_q
2023-07-08T02:01:47.100648+0000 DEBUG - Process-1 | Adding chr21_GL383581v2_alt to in_q
2023-07-08T02:01:47.100767+0000 DEBUG - Process-1 | Adding chr22_KI270879v1_alt to in_q
2023-07-08T02:01:47.100885+0000 DEBUG - Process-1 | Adding chr22_GL383582v2_alt to in_q
2023-07-08T02:01:47.101000+0000 DEBUG - Process-1 | Adding chr5_KI270897v1_alt to in_q
2023-07-08T02:01:47.101115+0000 DEBUG - Process-1 | Adding chr6_GL000251v2_alt to in_q
2023-07-08T02:01:47.101233+0000 DEBUG - Process-1 | Adding chr15_KI270905v1_alt to in_q
2023-07-08T02:01:47.101363+0000 DEBUG - Process-1 | Adding chr17_KI270908v1_alt to in_q
2023-07-08T02:01:47.101481+0000 DEBUG - Process-1 | Adding chr19_GL949747v2_alt to in_q
2023-07-08T02:01:47.101598+0000 DEBUG - Process-1 | Adding chr22_KB663609v1_alt to in_q
2023-07-08T02:01:47.101713+0000 DEBUG - Process-1 | Adding chr6_GL000252v2_alt to in_q
2023-07-08T02:01:47.101829+0000 DEBUG - Process-1 | Adding chr19_GL949748v2_alt to in_q
2023-07-08T02:01:47.101956+0000 DEBUG - Process-1 | Adding chr22_KI270928v1_alt to in_q
2023-07-08T02:01:47.102074+0000 DEBUG - Process-1 | Adding chr6_GL000253v2_alt to in_q
2023-07-08T02:01:47.102191+0000 DEBUG - Process-1 | Adding chr19_GL949749v2_alt to in_q
2023-07-08T02:01:47.102318+0000 DEBUG - Process-1 | Adding chr6_GL000254v2_alt to in_q
2023-07-08T02:01:47.102434+0000 DEBUG - Process-1 | Adding chr19_GL949750v2_alt to in_q
2023-07-08T02:01:47.102550+0000 DEBUG - Process-1 | Adding chr6_GL000255v2_alt to in_q
2023-07-08T02:01:47.102667+0000 DEBUG - Process-1 | Adding chr19_GL949751v2_alt to in_q
2023-07-08T02:01:47.102788+0000 DEBUG - Process-1 | Adding chr6_GL000256v2_alt to in_q
2023-07-08T02:01:47.102910+0000 DEBUG - Process-1 | Adding chr19_GL949752v1_alt to in_q
2023-07-08T02:01:47.103025+0000 DEBUG - Process-1 | Adding chr19_GL949753v2_alt to in_q
2023-07-08T02:01:47.103144+0000 DEBUG - Process-1 | Adding chr19_KI270938v1_alt to in_q
2023-07-08T02:01:47.103263+0000 DEBUG - Process-1 | Adding chrM to in_q
2023-07-08T02:01:47.103387+0000 DEBUG - Process-1 | Adding chrUn_GL000220v1 to in_q
2023-07-08T02:01:47.103511+0000 DEBUG - Process-1 | Adding chrX to in_q
2023-07-08T02:01:47.103630+0000 DEBUG - Process-1 | Adding chrY to in_q
2023-07-08T02:01:47.103761+0000 DEBUG - Process-1 | Parsed transcripts:66
  0%|          | 0/66 [00:00<?, ? Processed References/s]

To Reproduce
I ran a bash script based on our linux computing cluster using the docker image quay.io/biocontainers/nanocompore:1.0.4--pyhdfd78af_0. Here's the commad:

cmd="nanocompore sampcomp \
    --file_list1 /storage1/fs1/christophermaher/Active/maherlab/sidizhao/circ_rna/long_read/m6a_methlytion_direct_rna/nanocompore_files/nanocompore_output/FAR62703/FAR62703_eventalign_collapse.tsv,/storage1/fs1/christophermaher/Active/maherlab/sidizhao/circ_rna/long_read/m6a_methlytion_direct_rna/nanocompore_files/nanocompore_output/FAU62362/FAU62362_eventalign_collapse.tsv \
    --file_list2 /storage1/fs1/christophermaher/Active/maherlab/sidizhao/circ_rna/long_read/m6a_methlytion_direct_rna/nanocompore_files/nanocompore_output/FAU65190/FAU65190_eventalign_collapse.tsv,/storage1/fs1/christophermaher/Active/maherlab/sidizhao/circ_rna/long_read/m6a_methlytion_direct_rna/nanocompore_files/nanocompore_output/PAM12362/PAM12362_eventalign_collapse.tsv,/storage1/fs1/christophermaher/Active/maherlab/sidizhao/circ_rna/long_read/m6a_methlytion_direct_rna/nanocompore_files/nanocompore_output/FAS65798/FAS65798_eventalign_collapse.tsv,/storage1/fs1/christophermaher/Active/maherlab/sidizhao/circ_rna/long_read/m6a_methlytion_direct_rna/nanocompore_files/nanocompore_output/FAS65713/FAS65713_eventalign_collapse.tsv --label1 ctrl \
    --label2 kd \
    --fasta /storage1/fs1/christophermaher/Active/maherlab/sidizhao/circ_rna/long_read/m6a_methlytion_direct_rna/all-chrs.fa \
    --outpath /storage1/fs1/christophermaher/Active/maherlab/getian/pr1/results/test --progress --overwrite --allow_warnings --log_level debug"

Would you be able to help me? Thank you.

@lmulroney
Copy link
Collaborator

Hi @sidizhao,

It looks like you ran nanocompore on a genomic reference instead of a transcriptomic reference. Sometimes Nanocompore can stall when the reference sequences are super long (greater than 50kb), and this is likely the reason that you're experiencing a long execution time. You can either kill the process and start it again and see if it gets through the stall that way (this sometimes works and we don't know why), or start the whole pipeline over again aligning to a transcriptome reference fasta. Given you have all the data together, it might be worth simply restarting it and seeing if that works, but I suspect that redoing the pipeline with a transcriptome reference is better. If you provide SampComp a bed file, it will do an internal liftover from transcriptome reference coordinates to genome reference coordinates.

I hope this helps,
Logan

@sidizhao
Copy link
Author

sidizhao commented Jul 11, 2023 via email

@lmulroney
Copy link
Collaborator

Hi @sidizhao,

Yes, by transcriptomic reference I mean a reference fasta file of each contiguous transcript isoform with no introns present, and one reference sequence per isoform.
This is the download link for the genocde transcriptome reference fasta.
https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_43/gencode.v43.transcripts.fa.gz

Alternatively, you can create a reference transcriptome fasta from the reference genome and a gtf file using bedtools get fasta.

And by bed file, I mean a bed file that matches the transcriptome reference fasta file in genomic coordinates. So you can use something like bedparse (https://github.com/tleonardi/bedparse) to convert a gtf file to bed12 format. You can find the gencode reference gtf file on the gencode home page.

Does this make sense?
Logan

@sidizhao
Copy link
Author

Yes. Thank you. Would it work if I keep the current genomic fasta file but add a bed12 file of only the transcripts? Or do I necessarily need to download the transcripts only fasta?

@lmulroney
Copy link
Collaborator

You essentially need to start over from the minimap2 step using the transcriptome reference fasta instead of the genome reference fasta. This will require that you redo eventalign and eventalign collapse as well from this new bam file. Importantly, you do not want to align in splice aware mode when using a transcriptome reference fasta.

You can find more detailed instructions here ( https://doi.org/10.1002/cpz1.683) or here (https://nanocompore.rna.rocks/) if you want a more in depth breakdown of the steps.

Let me know if you have more questions.

Logan

@sidizhao
Copy link
Author

Oh wow, I see. That is going to take a while since these Direct RNA-seq files take a long time on nanopolish. I will come back with more questions if it still doesn't work. Thank you so much.

@lmulroney
Copy link
Collaborator

You can try using f5c instead of nanopolish. It is a c implementation of nanopolish and is roughly 10 times faster. There are a few flags you need to use that are unique to f5c that are not used by nanopolish. The protocol paper I posted earlier goes through all the necessary differences using f5c compared to nanopolish if you decide to give it a try.

Briefly, you need to use --rna --min-mapq=0 --secondary=yes in addition to all the normal nanopolish commands

But I'm doing this from memory, so double check the help messages to make sure I have the spelling correct!!!

Logan

@sidizhao
Copy link
Author

sidizhao commented Jul 11, 2023 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants