Handling viral gff files #263

rebeelouise · 2024-12-02T19:53:48Z

Hi!

I am trying to use isoquant within the polyTailor tool.

I am looking at viral sequences...

IsoQuant does not like the viral GFF files and throws the error below.

```
2024-12-02 19:46:39,045 - INFO -  === IsoQuant pipeline started === 
2024-12-02 19:46:39,046 - INFO - gffutils version: 0.13
2024-12-02 19:46:39,046 - INFO - pysam version: 0.22.1
2024-12-02 19:46:39,046 - INFO - pyfaidx version: 0.8.1.3
2024-12-02 19:46:39,048 - INFO - Checking input gene annotation
2024-12-02 19:46:39,050 - INFO - Gene annotation seems to be correct
2024-12-02 19:46:39,050 - INFO - Converting gene annotation file to .db format (takes a while)...
2024-12-02 19:46:39,077 - CRITICAL - IsoQuant failed with the following error, please, submit this issue to https://github.com/ablab/IsoQuant/issuesTraceback (most recent call last):
  File "/home/van_hohenheim/miniforge3/envs/polyTailor/lib/python3.10/site-packages/gffutils/create.py", line 622, in _populate_from_lines
    self._insert(f, c)
  File "/home/van_hohenheim/miniforge3/envs/polyTailor/lib/python3.10/site-packages/gffutils/create.py", line 566, in _insert
    cursor.execute(constants._INSERT, feature.astuple())
sqlite3.IntegrityError: UNIQUE constraint failed: features.id

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/van_hohenheim/miniforge3/envs/polyTailor/bin/isoquant.py", line 819, in <module>
    main(sys.argv[1:])
  File "/home/van_hohenheim/miniforge3/envs/polyTailor/bin/isoquant.py", line 813, in main
    run_pipeline(args)
  File "/home/van_hohenheim/miniforge3/envs/polyTailor/bin/isoquant.py", line 749, in run_pipeline
    args.genedb = convert_gtf_to_db(args)
  File "/home/van_hohenheim/miniforge3/envs/polyTailor/share/isoquant-3.6.2-0/src/gtf2db.py", line 144, in convert_gtf_to_db
    gtf_filename, genedb_filename = convert_db(gtf_filename, genedb_filename, gtf2db, args)
  File "/home/van_hohenheim/miniforge3/envs/polyTailor/share/isoquant-3.6.2-0/src/gtf2db.py", line 360, in convert_db
    convert_fn(gtf_filename, genedb_filename, args.complete_genedb, args.gtf_check)
  File "/home/van_hohenheim/miniforge3/envs/polyTailor/share/isoquant-3.6.2-0/src/gtf2db.py", line 133, in gtf2db
    gffutils.create_db(gtf, db, force=True, keep_order=True, merge_strategy='error',
  File "/home/van_hohenheim/miniforge3/envs/polyTailor/lib/python3.10/site-packages/gffutils/create.py", line 1401, in create_db
    c.create()
  File "/home/van_hohenheim/miniforge3/envs/polyTailor/lib/python3.10/site-packages/gffutils/create.py", line 543, in create
    self._populate_from_lines(self.iterator)
  File "/home/van_hohenheim/miniforge3/envs/polyTailor/lib/python3.10/site-packages/gffutils/create.py", line 624, in _populate_from_lines
    fixed, final_strategy = self._do_merge(f, self.merge_strategy)
  File "/home/van_hohenheim/miniforge3/envs/polyTailor/lib/python3.10/site-packages/gffutils/create.py", line 257, in _do_merge
    raise ValueError("Duplicate ID {0.id}".format(f))
ValueError: Duplicate ID cds-NP_073549.1
```

Have any virologists got any hacks? Or do the authors of this tool know how I can get around this sensibly?

For further reference, I am also getting the following messages:

```
2024-12-02 19:14:12,804 - INFO -  === IsoQuant pipeline started === 
2024-12-02 19:14:12,804 - INFO - gffutils version: 0.13
2024-12-02 19:14:12,804 - INFO - pysam version: 0.22.1
2024-12-02 19:14:12,804 - INFO - pyfaidx version: 0.8.1.3
2024-12-02 19:14:12,807 - INFO - Gene annotation file found. Using /mnt/g/rebee/projects/nano3p_seq/polyTailor/isoquant/MNV.db
2024-12-02 19:14:12,807 - INFO - Loading gene database from /mnt/g/rebee/projects/nano3p_seq/polyTailor/isoquant/MNV.db
2024-12-02 19:14:12,818 - INFO - Loading reference genome from /mnt/g/rebee/projects/nano3p_seq/references/MNVNoV-400.reference.fasta
2024-12-02 19:14:12,822 - INFO - Processing 1 experiment
2024-12-02 19:14:12,822 - INFO - Processing experiment OUT
2024-12-02 19:14:12,822 - INFO - Experiment has 1 BAM file: align/algs.bam
2024-12-02 19:14:12,824 - INFO - Collecting read alignments
2024-12-02 19:14:12,864 - INFO - Processing chromosome NC_008311.1
2024-12-02 19:14:13,604 - WARNING - Gene gene-NoVGV_gp1 has no exons / transcripts, check your input annotation
2024-12-02 19:14:13,604 - WARNING - Genes gene-NoVGV_gp1, gene-NoVGV_gp2, gene-NoVGV_gp4, gene-NoVGV_gp3 have no exons, check you GTF file
2024-12-02 19:14:16,211 - INFO - Finished processing chromosome NC_008311.1
2024-12-02 19:14:16,244 - INFO - Counting multimapped reads
2024-12-02 19:14:16,244 - INFO - Loading read assignments from isoquant/OUT/aux/OUT.save_NC_008311.1
2024-12-02 19:14:16,539 - INFO - Resolving multimappers
2024-12-02 19:14:16,542 - INFO - Multimappers resolved
2024-12-02 19:14:16,546 - INFO - Alignments collected, overall alignment statistics:
2024-12-02 19:14:16,546 - INFO - primary: 22357
2024-12-02 19:14:16,547 - INFO - secondary: 1
2024-12-02 19:14:16,547 - INFO - supplementary: 133
2024-12-02 19:14:16,547 - INFO - unaligned: 2082329
2024-12-02 19:14:16,551 - INFO - Finishing read assignment, total assignments 22357, polyA percentage 53.5
2024-12-02 19:14:16,552 - INFO - Read assignments files saved to isoquant/OUT/aux/OUT.save*. 
2024-12-02 19:14:16,552 - INFO - To keep these intermediate files for debug purposes use --keep_tmp flag
2024-12-02 19:14:16,554 - INFO - Total assignments used for analysis: 22357, polyA tail detected in 11960 (53.5%)
2024-12-02 19:14:16,554 - INFO - Processing assigned reads OUT
2024-12-02 19:14:16,554 - INFO - Transcript models construction is turned on
2024-12-02 19:14:16,560 - INFO - Transcript construction options:
2024-12-02 19:14:16,560 - INFO -   Novel monoexonic transcripts will be reported: yes
2024-12-02 19:14:16,561 - INFO -   PolyA tails are required for multi-exon transcripts to be reported: no
2024-12-02 19:14:16,561 - INFO -   PolyA tails are required for 2-exon transcripts to be reported: yes
2024-12-02 19:14:16,561 - INFO -   PolyA tails are required for known monoexon transcripts to be reported: yes
2024-12-02 19:14:16,561 - INFO -   PolyA tails are required for novel monoexon transcripts to be reported: yes
2024-12-02 19:14:16,561 - INFO -   Splice site reporting level: only_stranded
2024-12-02 19:14:16,583 - INFO - Processing chromosome NC_008311.1
2024-12-02 19:14:16,608 - INFO - Loading read assignments from isoquant/OUT/aux/OUT.save_NC_008311.1
2024-12-02 19:14:16,616 - WARNING - Gene gene-NoVGV_gp1 has no exons / transcripts, check your input annotation
2024-12-02 19:14:18,062 - WARNING - Gene gene-NoVGV_gp1 has no exons / transcripts, check your input annotation
2024-12-02 19:14:18,062 - WARNING - Genes gene-NoVGV_gp1, gene-NoVGV_gp2, gene-NoVGV_gp4, gene-NoVGV_gp3 have no exons, check you GTF file
2024-12-02 19:14:18,084 - INFO - Finished processing chromosome NC_008311.1
2024-12-02 19:14:18,188 - INFO - Transcript model file isoquant/OUT/OUT.transcript_models.gtf
2024-12-02 19:14:18,192 - INFO - Extended annotation is saved to isoquant/OUT/OUT.extended_annotation.gtf
2024-12-02 19:14:18,192 - INFO - Transcript model statistics
2024-12-02 19:14:18,192 - INFO - novel_not_in_catalog: 3
2024-12-02 19:14:18,413 - INFO - Gene counts are stored in isoquant/OUT/OUT.gene_counts.tsv
2024-12-02 19:14:18,414 - INFO - Transcript counts are stored in isoquant/OUT/OUT.transcript_counts.tsv
2024-12-02 19:14:18,414 - INFO - Read assignments are stored in isoquant/OUT/OUT.read_assignments.tsv.gz
2024-12-02 19:14:18,414 - INFO - Read assignment statistics
2024-12-02 19:14:18,415 - INFO - intergenic: 22357
2024-12-02 19:14:18,437 - INFO - Processed experiment OUT
2024-12-02 19:14:18,437 - INFO - Processed 1 experiment
2024-12-02 19:14:18,437 - INFO -  === IsoQuant pipeline finished === 
```

andrewprzh · 2024-12-03T00:49:45Z

Dear @rebeelouise

In the first message, gffutils detects identical IDs in your GFF:
ValueError: Duplicate ID cds-NP_073549.1

Duplicate IDs, in fact, violate the GFF/GTF format in general (even for features of different types, e.g. a gene and a CDS may not share the same ID), so other tools might not like them either. I suggest changing the duplicate IDs.
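
One way to change the duplicate IDs is to rewrite the attribute column before handing the file to IsoQuant. The sketch below is a minimal, standard-library-only illustration and is not part of IsoQuant or gffutils; the function name `make_ids_unique` and the `_2`, `_3` suffix scheme are assumptions made for this example. As far as I know, GFF3 lets a feature that is split over several lines (for example a multi-segment CDS) repeat one ID, which may be what this annotation does, while the traceback above shows the database being built with `merge_strategy='error'`, which rejects any repeat.

```python
import re
from collections import Counter


def make_ids_unique(lines):
    """Return GFF3 lines in which every repeated ID= value gets a numeric suffix.

    Hypothetical helper: the first occurrence of an ID is kept as-is,
    later occurrences become <ID>_2, <ID>_3, and so on.
    """
    seen = Counter()
    out = []
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        # Pass comments, blank lines and malformed records through untouched.
        if line.startswith("#") or len(cols) != 9:
            out.append(line)
            continue
        match = re.search(r"(?:^|;)ID=([^;]+)", cols[8])
        if match:
            feature_id = match.group(1)
            seen[feature_id] += 1
            if seen[feature_id] > 1:
                new_id = f"{feature_id}_{seen[feature_id]}"
                cols[8] = cols[8][:match.start(1)] + new_id + cols[8][match.end(1):]
        newline = "\n" if line.endswith("\n") else ""
        out.append("\t".join(cols) + newline)
    return out
```

Renaming is only safe when no other record points at the renamed feature through `Parent=`, and the sketch does not check whether a generated ID collides with one already present in the file.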

The second message complains about genes that have no transcript / exon child records. IsoQuant is primarily designed for eukaryotes and expects each gene to contain transcript records, and each transcript to contain exon records.

Also, if I may ask, what is the goal of your project, and is IsoQuant the right tool for it? As I mentioned, it is designed for eukaryotic organisms and for working with alternative splicing.
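
To see in advance which genes would trigger that warning, the GFF3 `Parent=` links can be followed upward from every exon record. This is a rough standard-library sketch, not IsoQuant's own check; the names `parse_attrs` and `genes_without_exons` are made up for the example, and it assumes a GFF3-style attribute column (`key=value;key=value`).

```python
def parse_attrs(field):
    """Parse a GFF3 attribute column into a dict (hypothetical helper)."""
    return dict(kv.split("=", 1) for kv in field.strip().split(";") if "=" in kv)


def genes_without_exons(lines):
    """Return IDs of gene records that have no exon among their descendants."""
    parent_of = {}      # feature ID -> list of parent IDs
    genes = []          # gene IDs in file order
    exon_parents = []   # IDs that exons point to directly
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 9:
            continue
        attrs = parse_attrs(cols[8])
        feature_id = attrs.get("ID")
        parents = attrs["Parent"].split(",") if "Parent" in attrs else []
        if feature_id:
            parent_of[feature_id] = parents
        if cols[2] == "gene" and feature_id:
            genes.append(feature_id)
        if cols[2] == "exon":
            exon_parents.extend(parents)

    # Walk from each exon's parent up to the top-level feature.
    covered = set()
    stack = list(exon_parents)
    while stack:
        feature_id = stack.pop()
        if feature_id in covered:
            continue
        covered.add(feature_id)
        stack.extend(parent_of.get(feature_id, []))
    return [g for g in genes if g not in covered]
```

A gene that carries only a CDS child, as the warnings suggest for this annotation, would show up in the returned list.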

All the best
Andrey

rebeelouise · 2024-12-03T07:36:25Z

Hi Andrey,

Thanks for your quick response!

This is often the case when using tools for viral work. I was hoping I could somehow get it to work regardless. I will speak with the authors of polyTailor to see if they are happy to work with me on adapting this to suit my application, or find an alternative to IsoQuant for that step in their workflow!

It's being used to quantify poly(A) tail length and look at the 3' end of the RNA sequence. The viral genomes are polyadenylated and also produce subgenomic mRNAs (sgmRNAs). I think I have seen people use IsoQuant on viral data before in the literature!

I guess if you're interested at any point in adding in this as an application - I'd be happy to talk!

andrewprzh · 2024-12-03T16:48:09Z

@rebeelouise

Adding viral functionality would be interesting, but it's hard to predict the timeline. I'll keep that in mind.

As for GTFs, I think it's possible to make a small converter that would correct viral GTFs to make them compatible.
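
A rough idea of what such a converter could look like, as a standard-library sketch only: for each GFF3 gene record it emits GTF gene, transcript and exon records covering the same coordinates. The function name `gene_to_gtf_records` and the `.t1` transcript suffix are invented for this example, and the sketch assumes every viral gene can be treated as a single unspliced transcript with one exon, which may not hold for every annotation. Whether IsoQuant accepts the result has not been tested here.

```python
def gene_to_gtf_records(gff_line):
    """Turn one GFF3 gene line into GTF gene/transcript/exon lines.

    Assumption (not from IsoQuant): the gene is modelled as one
    mono-exonic transcript spanning the gene coordinates.
    Returns an empty list for anything that is not a gene record.
    """
    cols = gff_line.rstrip("\n").split("\t")
    if len(cols) != 9 or cols[2] != "gene":
        return []
    attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
    gene_id = attrs.get("ID")
    if gene_id is None:
        return []
    transcript_id = gene_id + ".t1"  # hypothetical naming scheme

    gene_attr = f'gene_id "{gene_id}";'
    tx_attr = f'{gene_attr} transcript_id "{transcript_id}";'
    exon_attr = f'{tx_attr} exon_number "1";'

    records = []
    for feature_type, attr in (
        ("gene", gene_attr),
        ("transcript", tx_attr),
        ("exon", exon_attr),
    ):
        # seqid, source, type, start, end, score, strand, frame, attributes
        fields = cols[:2] + [feature_type] + cols[3:7] + [".", attr]
        records.append("\t".join(fields))
    return records
```

Applied to every gene line of the annotation and written out one record per line, this yields a GTF in which each gene has a transcript child and each transcript has an exon child, which is the hierarchy described earlier in the thread.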

andrewprzh · 2024-12-03T16:49:30Z

By the way, could you briefly describe what data you are using and what kind of analysis you are aiming for?
I am not very familiar with viral genomes :)

rebeelouise · 2024-12-03T20:54:40Z

Absolutely! Can I drop you more info via email? :)

andrewprzh · 2024-12-03T22:57:11Z

Sure! I don't post it in comments but it can be found in my profile.

andrewprzh added the "input data" label (Issue is caused by input data) · Dec 3, 2024

rebeelouise mentioned this issue Dec 3, 2024

Slightly different use case to no avail! novoalab/polyTailor#1

Open
