agat_sp_manage_features.pl includes empty interpro output #147

Closed
Neato-Nick opened this issue Jun 30, 2021 · 5 comments

Comments

@Neato-Nick

Neato-Nick commented Jun 30, 2021

I noticed that when an InterProScan hit has no InterPro entry, the empty value is still added to the Dbxref list as 'InterPro:-'. This is easy enough to 'sed' out of the GFF, and I don't think it invalidates the GFF, but I wanted to report it anyway. Hits to CDD are a common culprit.

example ipr output

PHRA102_6673.1  02e711a4621dd8379a18b3d8eb701f9e        410     CDD     cd06093 PX_domain       302     395     4.74892E-6      T       29-06-2021      -       -
PHRA102_6673.1  02e711a4621dd8379a18b3d8eb701f9e        410     Gene3D  G3DSA:3.30.1520.10      -       298     408     2.8E-8  T       29-06-2021      IPR036871       PX domain superfamily 
PHRA102_6673.1  02e711a4621dd8379a18b3d8eb701f9e        410     SUPERFAMILY     SSF64268        PX domain       302     395     1.96E-7 T       29-06-2021      IPR036871       PX domain superfamily

corresponding gff output (entry following Gene3D hit)

Phyram_PR-102_s0005     AUGUSTUS        mRNA    3335865 3337097 .       +       .       ID=PHRA102_6673.1;Parent=PHRA102_6673;Dbxref=CDD:cd06093,Gene3D:G3DSA:3.30.1520.10,InterPro:-,InterPro:IPR036871,SUPERFAMILY:SSF64268;Name=atl63;Ontology_term=GO:0035091;locus_tag=KRP23_6786;product=RING-H2 finger protein ATL63;uniprot_id=Q9LUZ9

Edit: To remove this from the output I used
sed -i -E -e 's/InterPro:-,|,InterPro:-//g' my.gff
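
The sed expression above covers the case where 'InterPro:-' sits next to another value. A minimal Python sketch of the same cleanup follows, written as an illustration rather than as anything AGAT provides. It also handles the case where the placeholder is the only Dbxref value, in which case the whole Dbxref attribute is dropped. The function names `drop_empty_dbxrefs` and `clean_gff_line` are hypothetical, and the code assumes a plain tab-separated GFF3 line with nine columns.

```python
PLACEHOLDER = "-"  # value InterProScan writes when a hit has no InterPro entry


def drop_empty_dbxrefs(attributes: str) -> str:
    """Remove Dbxref values whose identifier is the '-' placeholder.

    `attributes` is the 9th GFF3 column (semicolon-separated key=value pairs).
    If no Dbxref value survives, the Dbxref attribute is removed entirely.
    """
    cleaned = []
    for field in attributes.split(";"):
        key, sep, value = field.partition("=")
        if key == "Dbxref" and sep:
            # Keep a value only if the part after the last ':' is not '-'.
            kept = [v for v in value.split(",")
                    if v.rpartition(":")[2] != PLACEHOLDER]
            if not kept:
                continue  # nothing left, so skip the attribute altogether
            field = key + sep + ",".join(kept)
        cleaned.append(field)
    return ";".join(cleaned)


def clean_gff_line(line: str) -> str:
    """Apply drop_empty_dbxrefs to one GFF3 line; pass other lines through."""
    if line.startswith("#"):
        return line
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        return line  # not a feature line, leave it untouched
    columns[8] = drop_empty_dbxrefs(columns[8])
    newline = "\n" if line.endswith("\n") else ""
    return "\t".join(columns) + newline
```

To clean a file, one could read my.gff line by line, pass each line through `clean_gff_line`, and write the result to a new file.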

@Neato-Nick
Author

Neato-Nick commented Jun 30, 2021

I also have a separate question, still related to how the attributes column is parsed.

I noticed database references are added as "Dbxref:". Is this distinct from the "db_xref:" that GenBank uses, following INSDC standards? It's another thing that's easy for me to fix with a simple string substitution (or with _manage_attributes.pl ;) )

@Juke34
Collaborator

Juke34 commented Jul 1, 2021

Hi, we can definitely fix the problem and skip the - in the output.

Yes, true: we use Dbxref to be compliant with the GFF3 specification and with genome browsers like WebApollo.
INSDC uses the tag db_xref instead. It is exactly the same thing, except that INSDC only accepts references to specific databases in this attribute, while GFF3 does not care.
When submitting to an INSDC archive, we go through the ENA gate (some prefer NCBI) and use the EMBLmyGFF3 tool to prepare the required EMBL file. During the conversion we translate some attributes to match the terms INSDC expects (see here); for example, Dbxref is translated into db_xref.

@Neato-Nick
Author

Neato-Nick commented Jul 1, 2021 via email

@Juke34
Collaborator

Juke34 commented Jul 1, 2021

I'm curious, during the emblmygff3 conversion, do you also move the value of the uniprot_id= tag into the db_xref list?

Not by default, but everything is possible within EMBLmyGFF3 ^^ You just need to tune the proper "mapping file". In this case that is the translation_gff_attribute_to_embl_qualifier.json file, which you can access by running EMBLmyGFF3 --expose_translations. Then add the following entry:

"uniprot_id": {
    "source description": "uniprot database cross reference.",
    "target": "db_xref",
    "dev comment": "Nothing special to say here"
},
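
For anyone who prefers to script this change instead of editing the file by hand, here is a minimal sketch. It assumes the mapping file is a single JSON object keyed by GFF attribute name, with entries shaped like the one above; that layout is inferred from the snippet and not checked against EMBLmyGFF3 itself. The helper `add_attribute_mapping` and the stand-in `existing` dictionary are hypothetical.

```python
import json


def add_attribute_mapping(mapping: dict, gff_attribute: str,
                          embl_qualifier: str, description: str = "") -> dict:
    """Return a copy of `mapping` with one extra GFF-attribute entry.

    Assumption: entries use the keys shown in the snippet above
    ("source description", "target", "dev comment").
    """
    updated = dict(mapping)  # shallow copy, the input is left unchanged
    updated[gff_attribute] = {
        "source description": description,
        "target": embl_qualifier,
        "dev comment": "",
    }
    return updated


# Stand-in for the exposed file content (illustrative, not the real file).
# With the real file, one would json.load() it from wherever
# --expose_translations placed it, and json.dump() the result back.
existing = {"Dbxref": {"target": "db_xref"}}

updated = add_attribute_mapping(
    existing, "uniprot_id", "db_xref",
    description="uniprot database cross reference.",
)
print(json.dumps(updated, indent=4))
```

Returning a copy rather than modifying the loaded dictionary in place makes it easy to compare the original and updated mappings before writing anything back.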

@Neato-Nick
Author

Neato-Nick commented Jul 1, 2021 via email
