Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

agat_sp_merge_annotations.pl fails to merge? #325

Closed
dariober opened this issue Jan 20, 2023 · 2 comments
Closed

agat_sp_merge_annotations.pl fails to merge? #325

dariober opened this issue Jan 20, 2023 · 2 comments

Comments

@dariober
Copy link

dariober commented Jan 20, 2023

Hi- thanks for the great set of tools! This is with agat 1.0.0 from bioconda. I cannot understand why the features in these two gff files are not merged by agat_sp_merge_annotations.pl:

cat b.gff3 
chr1	AUGUSTUS	gene	1000424	1039237	.	+	.	ID=A;
chr1	AUGUSTUS	mRNA	1000424	1039237	.	+	.	ID=A.t1;Parent=A;

cat g.gff3 
chr1	AUGUSTUS	gene	1000424	1039237	.	+	.	ID=B;
chr1	AUGUSTUS	mRNA	1000424	1039237	.	+	.	ID=B.t1;Parent=B;

I would expect the merged output to contain one gene and one mRNA only. Instead, the two genes remain distinct:

agat_sp_merge_annotations.pl -f b.gff3 -f g.gff3 -o m.gff3

cat m.gff3 
##gff-version 3
chr1	AUGUSTUS	gene	1000424	1039237	.	+	.	ID=A
chr1	AUGUSTUS	mRNA	1000424	1039237	.	+	.	ID=A.t1;Parent=A
chr1	AUGUSTUS	gene	1000424	1039237	.	+	.	ID=B
chr1	AUGUSTUS	mRNA	1000424	1039237	.	+	.	ID=B.t1;Parent=B

Output from agat_sp_merge_annotations below. Am I missing something?

Using standard /export/III-data/wcmp_bioinformatics/db291g/miniconda3/envs/genomeAnnotationPipeline/lib/perl5/site_perl/auto/share/dist/AGAT/config.yaml file
********************************************************************************
*                              - Start parsing -                               *
********************************************************************************
-------------------------- parse options and metadata --------------------------
=> Accessing the feature_levels YAML file
Using standard /export/III-data/wcmp_bioinformatics/db291g/miniconda3/envs/genomeAnnotationPipeline/lib/perl5/site_perl/auto/share/dist/AGAT/feature_levels.yaml file
=> Attribute used to group features when no Parent/ID relationship exists (i.e common tag):
	* locus_tag
	* gene_id
=> merge_loci option deactivated
=> Machine information:
	This script is being run by perl v5.32.1
	Bioperl location being used: /export/III-data/wcmp_bioinformatics/db291g/miniconda3/envs/genomeAnnotationPipeline/lib/perl5/site_perl/Bio/
	Operating system being used: linux 
=> Accessing Ontology
	No ontology accessible from the gff file header!
	We use the SOFA ontology distributed with AGAT:
		/export/III-data/wcmp_bioinformatics/db291g/miniconda3/envs/genomeAnnotationPipeline/lib/perl5/site_perl/auto/share/dist/AGAT/so.obo
	Read ontology /export/III-data/wcmp_bioinformatics/db291g/miniconda3/envs/genomeAnnotationPipeline/lib/perl5/site_perl/auto/share/dist/AGAT/so.obo:
		4 root terms, and 2596 total terms, and 1516 leaf terms
	Filtering ontology:
		We found 1861 terms that are sequence_feature or is_a child of it.
--------------------------------- parsing file ---------------------------------
=> Number of line in file: 2
=> Number of comment lines: 0
=> Fasta included: No
=> Number of features lines: 2
=> Number of feature type (3rd column): 2
	* Level1: 1 => gene
	* level2: 1 => mRNA
	* level3: 0 => 
	* unknown: 0 => 
=> Version of the Bioperl GFF parser selected by AGAT: 3
********************************************************************************
*                               - End parsing -                                *
*                              done in 2 seconds                               *
********************************************************************************

********************************************************************************
*                               - Start checks -                               *
********************************************************************************
---------------------------- Check1: feature types -----------------------------
----------------------------------- ontology -----------------------------------
All feature types in agreement with the Ontology.
------------------------------------- agat -------------------------------------
AGAT can deal with all the encountered feature types (3rd column)
------------------------------ done in 0 seconds -------------------------------

------------------------------ Check2: duplicates ------------------------------
None found
------------------------------ done in 0 seconds -------------------------------

-------------------------- Check3: sequential bucket ---------------------------
Nothing to check as sequential bucket!
------------------------------ done in 0 seconds -------------------------------

--------------------------- Check4: l2 linked to l3 ----------------------------
No problem found
------------------------------ done in 0 seconds -------------------------------

--------------------------- Check5: l1 linked to l2 ----------------------------
No problem found
------------------------------ done in 0 seconds -------------------------------

--------------------------- Check6: remove orphan l1 ---------------------------
We remove only those not supposed to be orphan
None found
------------------------------ done in 0 seconds -------------------------------

------------------------- Check7: all level3 locations -------------------------
------------------------------ done in 0 seconds -------------------------------

------------------------------ Check8: check cds -------------------------------
No problem found
------------------------------ done in 0 seconds -------------------------------

----------------------------- Check9: check exons ------------------------------
No exons created
No exons locations modified
No supernumerary exons removed
No level2 locations modified
------------------------------ done in 0 seconds -------------------------------

----------------------------- Check10: check utrs ------------------------------
No UTRs created
No UTRs locations modified
No supernumerary UTRs removed
------------------------------ done in 0 seconds -------------------------------

------------------------ Check11: all level2 locations -------------------------
No problem found
------------------------------ done in 0 seconds -------------------------------

------------------------ Check12: all level1 locations -------------------------
No problem found
------------------------------ done in 0 seconds -------------------------------

---------------------- Check13: remove identical isoforms ----------------------
None found
------------------------------ done in 0 seconds -------------------------------
********************************************************************************
*                                - End checks -                                *
*                              done in 0 seconds                               *
********************************************************************************

=> OmniscientI total time: 2 seconds
b.gff3 GFF3 file parsed
There is 1 mrna
There is 1 gene
********************************************************************************
*                              - Start parsing -                               *
********************************************************************************
-------------------------- parse options and metadata --------------------------
=> Accessing the feature_levels YAML file
Using standard /export/III-data/wcmp_bioinformatics/db291g/miniconda3/envs/genomeAnnotationPipeline/lib/perl5/site_perl/auto/share/dist/AGAT/feature_levels.yaml file
=> Attribute used to group features when no Parent/ID relationship exists (i.e common tag):
	* locus_tag
	* gene_id
=> merge_loci option deactivated
=> Machine information:
	This script is being run by perl v5.32.1
	Bioperl location being used: /export/III-data/wcmp_bioinformatics/db291g/miniconda3/envs/genomeAnnotationPipeline/lib/perl5/site_perl/Bio/
	Operating system being used: linux 
=> Accessing Ontology
	No ontology accessible from the gff file header!
	We use the SOFA ontology distributed with AGAT:
		/export/III-data/wcmp_bioinformatics/db291g/miniconda3/envs/genomeAnnotationPipeline/lib/perl5/site_perl/auto/share/dist/AGAT/so.obo
	Read ontology /export/III-data/wcmp_bioinformatics/db291g/miniconda3/envs/genomeAnnotationPipeline/lib/perl5/site_perl/auto/share/dist/AGAT/so.obo:
		4 root terms, and 2596 total terms, and 1516 leaf terms
	Filtering ontology:
		We found 1861 terms that are sequence_feature or is_a child of it.
--------------------------------- parsing file ---------------------------------
=> Number of line in file: 2
=> Number of comment lines: 0
=> Fasta included: No
=> Number of features lines: 2
=> Number of feature type (3rd column): 2
	* Level1: 1 => gene
	* level2: 1 => mRNA
	* level3: 0 => 
	* unknown: 0 => 
=> Version of the Bioperl GFF parser selected by AGAT: 3
********************************************************************************
*                               - End parsing -                                *
*                              done in 2 seconds                               *
********************************************************************************

********************************************************************************
*                               - Start checks -                               *
********************************************************************************
---------------------------- Check1: feature types -----------------------------
----------------------------------- ontology -----------------------------------
All feature types in agreement with the Ontology.
------------------------------------- agat -------------------------------------
AGAT can deal with all the encountered feature types (3rd column)
------------------------------ done in 0 seconds -------------------------------

------------------------------ Check2: duplicates ------------------------------
None found
------------------------------ done in 0 seconds -------------------------------

-------------------------- Check3: sequential bucket ---------------------------
Nothing to check as sequential bucket!
------------------------------ done in 0 seconds -------------------------------

--------------------------- Check4: l2 linked to l3 ----------------------------
No problem found
------------------------------ done in 0 seconds -------------------------------

--------------------------- Check5: l1 linked to l2 ----------------------------
No problem found
------------------------------ done in 0 seconds -------------------------------

--------------------------- Check6: remove orphan l1 ---------------------------
We remove only those not supposed to be orphan
None found
------------------------------ done in 0 seconds -------------------------------

------------------------- Check7: all level3 locations -------------------------
------------------------------ done in 0 seconds -------------------------------

------------------------------ Check8: check cds -------------------------------
No problem found
------------------------------ done in 0 seconds -------------------------------

----------------------------- Check9: check exons ------------------------------
No exons created
No exons locations modified
No supernumerary exons removed
No level2 locations modified
------------------------------ done in 0 seconds -------------------------------

----------------------------- Check10: check utrs ------------------------------
No UTRs created
No UTRs locations modified
No supernumerary UTRs removed
------------------------------ done in 0 seconds -------------------------------

------------------------ Check11: all level2 locations -------------------------
No problem found
------------------------------ done in 0 seconds -------------------------------

------------------------ Check12: all level1 locations -------------------------
No problem found
------------------------------ done in 0 seconds -------------------------------

---------------------- Check13: remove identical isoforms ----------------------
None found
------------------------------ done in 0 seconds -------------------------------
********************************************************************************
*                                - End checks -                                *
*                              done in 0 seconds                               *
********************************************************************************

=> OmniscientI total time: 2 seconds
g.gff3 GFF3 file parsed
There is 1 gene
There is 1 mrna

Total raw data of files together:
There is 2 mrna
There is 2 gene

Now merging overlaping loci, and removing identical isoforms
Use of uninitialized value $nb_cases in concatenation (.) or string at /export/projects/III-data/wcmp_bioinformatics/db291g/miniconda3/envs/genomeAnnotationPipeline/bin/agat_sp_merge_annotations.pl line 77.
 overlapping cases found. For each case 2 loci have been merged within a single locus

final result:
There is 2 mrna
There is 2 gene
Formating output to GFF3
@Juke34
Copy link
Collaborator

Juke34 commented Jan 20, 2023

Thank you for reporting it
Right AGAT is supposed to removes identical isoforms. You can modify this behaviour in the config.yaml (agat config --expose).
But there is an obvious problem here. I will have a look!

@Juke34
Copy link
Collaborator

Juke34 commented Jan 20, 2023

Right it is because the overlap is checked at CDS level, if no CDS we use the exons. Here there is none. I could add if there is no level3 feature (CDS, exon, etc) test equality at level2 (ncRNA, mRNA, etc), then level1 (gene, match, etc).

Juke34 added a commit that referenced this issue Feb 17, 2023
…unction merge_overlap_feature in merge_overlap_loci. Replace _check_overlap_name_diff and _check_identical_isoforms by merge_overlap_feature in agat_sp_fix_longest_ORF. Remove remove_l2_related_feature and test (replaced by remove_l2_and_relatives)

Improve check_gene_overlap_at_CDSthenEXON to work at l2 and l1 if no children, and rename it into check_feature_overlap_from_l3_to_l1.
Fix error in keep_only_uniq_from_list2 function to merge identical isoform and add merge_ suffix for merged attributes
@Juke34 Juke34 closed this as completed in 16c9288 Feb 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants