Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handling viral gff files #263

Open
rebeelouise opened this issue Dec 2, 2024 · 6 comments
Open

Handling viral gff files #263

rebeelouise opened this issue Dec 2, 2024 · 6 comments
Labels
input data Issue is caused by input data

Comments

@rebeelouise
Copy link

rebeelouise commented Dec 2, 2024

Hi!

I am trying to use isoquant within the polyTailor tool.

I am looking at viral sequences...

Isoquant does not like the viral gff files and throws the below error.

2024-12-02 19:46:39,045 - INFO -  === IsoQuant pipeline started === 
2024-12-02 19:46:39,046 - INFO - gffutils version: 0.13
2024-12-02 19:46:39,046 - INFO - pysam version: 0.22.1
2024-12-02 19:46:39,046 - INFO - pyfaidx version: 0.8.1.3
2024-12-02 19:46:39,048 - INFO - Checking input gene annotation
2024-12-02 19:46:39,050 - INFO - Gene annotation seems to be correct
2024-12-02 19:46:39,050 - INFO - Converting gene annotation file to .db format (takes a while)...
2024-12-02 19:46:39,077 - CRITICAL - IsoQuant failed with the following error, please, submit this issue to https://github.com/ablab/IsoQuant/issuesTraceback (most recent call last):
  File "/home/van_hohenheim/miniforge3/envs/polyTailor/lib/python3.10/site-packages/gffutils/create.py", line 622, in _populate_from_lines
    self._insert(f, c)
  File "/home/van_hohenheim/miniforge3/envs/polyTailor/lib/python3.10/site-packages/gffutils/create.py", line 566, in _insert
    cursor.execute(constants._INSERT, feature.astuple())
sqlite3.IntegrityError: UNIQUE constraint failed: features.id

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/van_hohenheim/miniforge3/envs/polyTailor/bin/isoquant.py", line 819, in <module>
    main(sys.argv[1:])
  File "/home/van_hohenheim/miniforge3/envs/polyTailor/bin/isoquant.py", line 813, in main
    run_pipeline(args)
  File "/home/van_hohenheim/miniforge3/envs/polyTailor/bin/isoquant.py", line 749, in run_pipeline
    args.genedb = convert_gtf_to_db(args)
  File "/home/van_hohenheim/miniforge3/envs/polyTailor/share/isoquant-3.6.2-0/src/gtf2db.py", line 144, in convert_gtf_to_db
    gtf_filename, genedb_filename = convert_db(gtf_filename, genedb_filename, gtf2db, args)
  File "/home/van_hohenheim/miniforge3/envs/polyTailor/share/isoquant-3.6.2-0/src/gtf2db.py", line 360, in convert_db
    convert_fn(gtf_filename, genedb_filename, args.complete_genedb, args.gtf_check)
  File "/home/van_hohenheim/miniforge3/envs/polyTailor/share/isoquant-3.6.2-0/src/gtf2db.py", line 133, in gtf2db
    gffutils.create_db(gtf, db, force=True, keep_order=True, merge_strategy='error',
  File "/home/van_hohenheim/miniforge3/envs/polyTailor/lib/python3.10/site-packages/gffutils/create.py", line 1401, in create_db
    c.create()
  File "/home/van_hohenheim/miniforge3/envs/polyTailor/lib/python3.10/site-packages/gffutils/create.py", line 543, in create
    self._populate_from_lines(self.iterator)
  File "/home/van_hohenheim/miniforge3/envs/polyTailor/lib/python3.10/site-packages/gffutils/create.py", line 624, in _populate_from_lines
    fixed, final_strategy = self._do_merge(f, self.merge_strategy)
  File "/home/van_hohenheim/miniforge3/envs/polyTailor/lib/python3.10/site-packages/gffutils/create.py", line 257, in _do_merge
    raise ValueError("Duplicate ID {0.id}".format(f))
ValueError: Duplicate ID cds-NP_073549.1
`

```
Have any virologists got any hacks? Or do the writers of this tool know how I can get around this sensibly??

For another reference I am getting the following messages:

```
`2024-12-02 19:14:12,804 - INFO -  === IsoQuant pipeline started === 
2024-12-02 19:14:12,804 - INFO - gffutils version: 0.13
2024-12-02 19:14:12,804 - INFO - pysam version: 0.22.1
2024-12-02 19:14:12,804 - INFO - pyfaidx version: 0.8.1.3
2024-12-02 19:14:12,807 - INFO - Gene annotation file found. Using /mnt/g/rebee/projects/nano3p_seq/polyTailor/isoquant/MNV.db
2024-12-02 19:14:12,807 - INFO - Loading gene database from /mnt/g/rebee/projects/nano3p_seq/polyTailor/isoquant/MNV.db
2024-12-02 19:14:12,818 - INFO - Loading reference genome from /mnt/g/rebee/projects/nano3p_seq/references/MNVNoV-400.reference.fasta
2024-12-02 19:14:12,822 - INFO - Processing 1 experiment
2024-12-02 19:14:12,822 - INFO - Processing experiment OUT
2024-12-02 19:14:12,822 - INFO - Experiment has 1 BAM file: align/algs.bam
2024-12-02 19:14:12,824 - INFO - Collecting read alignments
2024-12-02 19:14:12,864 - INFO - Processing chromosome NC_008311.1
2024-12-02 19:14:13,604 - WARNING - Gene gene-NoVGV_gp1 has no exons / transcripts, check your input annotation
2024-12-02 19:14:13,604 - WARNING - Genes gene-NoVGV_gp1, gene-NoVGV_gp2, gene-NoVGV_gp4, gene-NoVGV_gp3 have no exons, check you GTF file
2024-12-02 19:14:16,211 - INFO - Finished processing chromosome NC_008311.1
2024-12-02 19:14:16,244 - INFO - Counting multimapped reads
2024-12-02 19:14:16,244 - INFO - Loading read assignments from isoquant/OUT/aux/OUT.save_NC_008311.1
2024-12-02 19:14:16,539 - INFO - Resolving multimappers
2024-12-02 19:14:16,542 - INFO - Multimappers resolved
2024-12-02 19:14:16,546 - INFO - Alignments collected, overall alignment statistics:
2024-12-02 19:14:16,546 - INFO - primary: 22357
2024-12-02 19:14:16,547 - INFO - secondary: 1
2024-12-02 19:14:16,547 - INFO - supplementary: 133
2024-12-02 19:14:16,547 - INFO - unaligned: 2082329
2024-12-02 19:14:16,551 - INFO - Finishing read assignment, total assignments 22357, polyA percentage 53.5
2024-12-02 19:14:16,552 - INFO - Read assignments files saved to isoquant/OUT/aux/OUT.save*. 
2024-12-02 19:14:16,552 - INFO - To keep these intermediate files for debug purposes use --keep_tmp flag
2024-12-02 19:14:16,554 - INFO - Total assignments used for analysis: 22357, polyA tail detected in 11960 (53.5%)
2024-12-02 19:14:16,554 - INFO - Processing assigned reads OUT
2024-12-02 19:14:16,554 - INFO - Transcript models construction is turned on
2024-12-02 19:14:16,560 - INFO - Transcript construction options:
2024-12-02 19:14:16,560 - INFO -   Novel monoexonic transcripts will be reported: yes
2024-12-02 19:14:16,561 - INFO -   PolyA tails are required for multi-exon transcripts to be reported: no
2024-12-02 19:14:16,561 - INFO -   PolyA tails are required for 2-exon transcripts to be reported: yes
2024-12-02 19:14:16,561 - INFO -   PolyA tails are required for known monoexon transcripts to be reported: yes
2024-12-02 19:14:16,561 - INFO -   PolyA tails are required for novel monoexon transcripts to be reported: yes
2024-12-02 19:14:16,561 - INFO -   Splice site reporting level: only_stranded
2024-12-02 19:14:16,583 - INFO - Processing chromosome NC_008311.1
2024-12-02 19:14:16,608 - INFO - Loading read assignments from isoquant/OUT/aux/OUT.save_NC_008311.1
2024-12-02 19:14:16,616 - WARNING - Gene gene-NoVGV_gp1 has no exons / transcripts, check your input annotation
2024-12-02 19:14:18,062 - WARNING - Gene gene-NoVGV_gp1 has no exons / transcripts, check your input annotation
2024-12-02 19:14:18,062 - WARNING - Genes gene-NoVGV_gp1, gene-NoVGV_gp2, gene-NoVGV_gp4, gene-NoVGV_gp3 have no exons, check you GTF file
2024-12-02 19:14:18,084 - INFO - Finished processing chromosome NC_008311.1
2024-12-02 19:14:18,188 - INFO - Transcript model file isoquant/OUT/OUT.transcript_models.gtf
2024-12-02 19:14:18,192 - INFO - Extended annotation is saved to isoquant/OUT/OUT.extended_annotation.gtf
2024-12-02 19:14:18,192 - INFO - Transcript model statistics
2024-12-02 19:14:18,192 - INFO - novel_not_in_catalog: 3
2024-12-02 19:14:18,413 - INFO - Gene counts are stored in isoquant/OUT/OUT.gene_counts.tsv
2024-12-02 19:14:18,414 - INFO - Transcript counts are stored in isoquant/OUT/OUT.transcript_counts.tsv
2024-12-02 19:14:18,414 - INFO - Read assignments are stored in isoquant/OUT/OUT.read_assignments.tsv.gz
2024-12-02 19:14:18,414 - INFO - Read assignment statistics
2024-12-02 19:14:18,415 - INFO - intergenic: 22357
2024-12-02 19:14:18,437 - INFO - Processed experiment OUT
2024-12-02 19:14:18,437 - INFO - Processed 1 experiment
2024-12-02 19:14:18,437 - INFO -  === IsoQuant pipeline finished === 
`
```
@andrewprzh
Copy link
Collaborator

Dear @rebeelouise

In the first message gffutils detect identical IDs in your GFF:
ValueError: Duplicate ID cds-NP_073549.1

Duplicating IDs, in fact, violate GFF/GTF format in general (even for features of different type, e.g. gene and CDS may not have the same ID), so other tools might not like it as well. I suggest to change duplicating IDs.

Second message complains about genes not having transcript / exon children records. IsoQuant is primarily designed for eukaryotes and expects each gene to contain transcript records, and each transcript to contain exon records.

Also, if I may ask, what is the goal of your project and is it the right tool, since as I mentioned, IsoQuant is designed for eukaryotic organisms and working with alternative splicing?

All the best
Andrey

@andrewprzh andrewprzh added the input data Issue is caused by input data label Dec 3, 2024
@rebeelouise
Copy link
Author

Dear @rebeelouise

In the first message gffutils detect identical IDs in your GFF:

ValueError: Duplicate ID cds-NP_073549.1

Duplicating IDs, in fact, violate GFF/GTF format in general (even for features of different type, e.g. gene and CDS may not have the same ID), so other tools might not like it as well. I suggest to change duplicating IDs.

Second message complains about genes not having transcript / exon children records. IsoQuant is primarily designed for eukaryotes and expects each gene to contain transcript records, and each transcript to contain exon records.

Also, if I may ask, what is the goal of your project and is it the right tool, since as I mentioned, IsoQuant is designed for eukaryotic organisms and working with alternative splicing?

All the best

Andrey

Hi Andrey,

Thanks for your quick response!

This is often the case when using tools for viral work. Was hoping I could somehow get it to work regardless. I will speak with the writers of polyTailor to see if they are happy to work with me on adapting this to suit my application! Or find an alternative to isoquant for that step in their workflow!

It's being used to quantify polyA length and look at 3' end of the sequence of RNA. The viral genomes are polyA'd and also launch sgmRNAs from their genomes. I think I have seen people use isoquant on viral stuff before in the literature!

I guess if you're interested at any point in adding in this as an application - I'd be happy to talk!

@andrewprzh
Copy link
Collaborator

@rebeelouise

Adding viral functionality would be interesting, but it's hard to predict the timeline. I'll keep that in mind.

As to GTFs, it thinks it's possible to make a small converter that would correct viral GTFs to make it compatible.

@andrewprzh
Copy link
Collaborator

By the way, if you could, is it possible to briefly state what data are you using and what kind of analysis you are aiming at?
I am not very much into viral genomes :)

@rebeelouise
Copy link
Author

By the way, if you could, is it possible to briefly state what data are you using and what kind of analysis you are aiming at? I am not very much into viral genomes :)

Absolutely! Can I drop you more info via email? :)

@andrewprzh
Copy link
Collaborator

Sure! I don't post it in comments but it can be found in my profile.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
input data Issue is caused by input data
Projects
None yet
Development

No branches or pull requests

2 participants