Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stuck at simulation stage #193

Open
kainth-amoldeep opened this issue May 31, 2023 · 4 comments
Open

Stuck at simulation stage #193

kainth-amoldeep opened this issue May 31, 2023 · 4 comments
Labels

Comments

@kainth-amoldeep
Copy link

Hi, I am trying to use NanoSim 3.0.0. (downloaded source code from (https://github.com/bcgsc/NanoSim/releases/tag/v3.0.0)) for simulating ONT reads. I created my own profile using: ./read_analysis.py transcriptome -i ont_run.fq -rg ucsc_hg38.fasta -rt hg38_transcriptome_gff.fa -annot hg38_transcriptome.gff3 -t 38 . It seems to run fine with the following output:

2023-05-30 15:21:23: Read pre-process and unaligned reads analysis
2023-05-30 15:21:39: Alignment with minimap2 to reference transcriptome
2023-05-30 15:24:33: Alignment with minimap2 to reference genome
2023-05-30 16:00:19: Processing transcriptome alignment file: sam
2023-05-30 16:00:44: Processing genome alignment file: sam
2023-05-30 16:01:14: Parse the annotation file (GTF/GFF3)
2023-05-30 16:01:47: Modeling Intron Retention
2023-05-30 16:01:47: Reading intron coordinates from GFF file
2023-05-30 16:02:30: Read primary genome alignment for each read
2023-05-30 16:07:03: Read primary transcriptome alignment for each read
2023-05-30 16:09:49: Calculating probabilities for each intron retention event
2023-05-30 16:16:23: Aligned reads analysis
2023-05-30 16:16:40: Computing KDEds processed >> 1215001
2023-05-30 16:16:40: Computing 2D KDE for transcriptome ref length
2023-05-30 16:16:41: Computing KDE for aligned region
2023-05-30 16:16:41: Computing KDE for aligned reads
2023-05-30 16:16:42: Computing KDE for unaligned length
2023-05-30 16:16:42: Computing KDE for ht ratio
2023-05-30 16:16:43: Unaligned reads analysis
2023-05-30 16:16:43: match and error models
2023-05-30 16:19:13: Model fitting
2023-05-30 16:19:18: Mismatch fitting start
Mismatch parameters: 0.10299333509559308 0.787357964367962 0.2182370542984321 Residual 6.509615946814762e-05
2023-05-30 16:19:48: Mismatch fitting done
2023-05-30 16:19:48: Insertion fitting start
WARNING! Insertion parameters may not be optimal! 0.8374719781176181 0.9192989587536269 0.008942175282069826 0.9880211696155594 Residual 0.0030656834500610852
2023-05-30 16:21:14: Insertion fitting done
2023-05-30 16:21:14: Deletion fitting start
WARNING! Deletion parameters may not be optimal! 0.9569706479182295 1.032499061802694 0.2866297773531245 0.8752291450180834 Residual 0.0005721754355394459
2023-05-30 16:22:00: Deletion fitting done
2023-05-30 16:22:00: Finished!

But when I try to run simulator using ./simulator.py transcriptome -rt hg38_transcriptome_gff.fa -rg ucsc_hg38.fasta -c training -e expression_abundance.tsv -n 20000 -max 10000 -min 200 -b guppy -r cDNA_1D --fastq -o simulated_1D it gets stuck at:

2023-05-31 09:48:19: Read in reference
2023-05-31 09:48:21: Read in expression profile
2023-05-31 09:48:21: Read in reference genome and create .fai index file
2023-05-31 09:48:21: Read in IR markov model
2023-05-31 09:48:21: Read in GFF3 annotation file
2023-05-31 09:48:41: Read error profile
2023-05-31 09:48:41: Read KDF of unaligned reads
2023-05-31 09:48:41: Read KDF of aligned reads
2023-05-31 09:48:41: Read chimeric simulation information
2023-05-31 09:48:41: Start simulation of aligned reads

I have ensured that my expression_abundance.tsv file have transcript_id with "ENS".
I would really appreciate if you can help me with this. Thanks.

@kmnip
Copy link
Collaborator

kmnip commented Jun 1, 2023

I think it has to do with these settings: -n 20000 -max 10000 -min 200
It sounds like a similar issue in metagenome mode: #184

@SaberHQ Possibly a bug in transcriptome mode?

@kmnip kmnip added the bug label Jun 1, 2023
@SaberHQ
Copy link
Collaborator

SaberHQ commented Jun 1, 2023

Thanks for reporting this @kainth-amoldeep

@kmnip I do not think it is related to the infinite loop bug you mentioned (#184) because, in the transcriptome mode, NanoSim first selects a reference transcriptome based on the expression profiles and then based on the 2 dimensional KDE distributions, it picks an aligned length for the simulation. Therefore, max and min lengths are not used. They are only used in generating unaligned reads I believe.

That being said, dear @kainth-amoldeep would you mind trying again not using the min and max parameters and see if it works or not? In addition, you said you ensured that the transcript ids in the expression file start with "ENS". Can you also double-check if they are the same in your reference transcriptome file?

Did you calculate the expression profiles using the module provided by NanoSim? I mean did you run the NanoSim -quantify module?

@kmnip
Copy link
Collaborator

kmnip commented Jun 2, 2023

@SaberHQ Good points!

@kainth-amoldeep I suggest you can also try our development code on the master branch. We have had a number of bugfixes since v3.0.0.

@kainth-amoldeep
Copy link
Author

Hi,
Thanks for your comments and suggestions. I tried without min max and it was still stuck. @SaberHQ I made an expression abundance file on the basis of short read sequencing; may something has gone wrong in that in terms of nomenclature. However, I was able to simulate reads on the basis of a pretrained model. So, for now, it serves my purpose. Thanks a lot again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants