KeyError: 'transcript_id' with Ensembl human annotation #2

Closed
emi80 opened this issue Dec 7, 2017 · 21 comments

@emi80
Member

emi80 commented Dec 7, 2017

From @sridhar0605 originally posted in #1:

When using Homo_sapiens.GRCh37.75.gtf from Ensembl as the reference, I see this error:

Using default tag: latest
latest: Pulling from guigolab/ggsashimi
915665fee719: Pull complete
1a0814f59c8e: Pull complete
b3b71680ed5d: Pull complete
1c3c8afa6ada: Pull complete
2fbeb903a5b4: Pull complete
Digest: sha256:82590f821978568e948ad4861ce009fcb26e7543263bea9d7b78c17667f8d675
Status: Downloaded newer image for guigolab/ggsashimi:latest
Traceback (most recent call last):
  File "/sashimi-plot.py", line 592, in <module>
    transcripts, exons = read_gtf(args.gtf, args.coordinates)
  File "/sashimi-plot.py", line 278, in read_gtf
    transcript_id = d["transcript_id"]
KeyError: 'transcript_id'

A few lines from the GTF:

#!genome-build GRCh37.p13
#!genome-version GRCh37
#!genome-date 2009-02
#!genome-build-accession NCBI:GCA_000001405.14
#!genebuild-last-updated 2013-09
1	pseudogene	gene	11869	14412	.	+	.	gene_id "ENSG00000223972"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene";
1	processed_transcript	transcript	11869	14409	.	+	.	gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana";
1	processed_transcript	exon	11869	12227	.	+	.	gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "1"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana"; exon_id "ENSE00002234944";
1	processed_transcript	exon	12613	12721	.	+	.	gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "2"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana"; exon_id "ENSE00003582793";
1	processed_transcript	exon	13221	14409	.	+	.	gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "3"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana"; exon_id "ENSE00002312635";

@emi80
Member Author

emi80 commented Dec 7, 2017

The problem is that we assume the transcript_id attribute is present in every line of the GTF (except for comments of course). I see here that the gene line does not have it. One solution: you could preprocess the annotation and add the transcript_id field where it is not present.

@abreschi what do you suggest?

@abreschi
Collaborator

abreschi commented Dec 7, 2017

Hi! Sorry about this issue. Unfortunately the transcript_id is a required field in the GTF format (https://genome.ucsc.edu/FAQ/FAQformat.html#format4), even in gene rows. So, I would modify the Ensembl file like @emi80 said. Hope it helps.
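
A minimal sketch of the preprocessing suggested above, assuming it is acceptable to reuse each row's gene_id as a placeholder transcript_id (that choice, and the helper names add_missing_transcript_id and rewrite_gtf, are illustrative and not part of ggsashimi):

```python
def add_missing_transcript_id(line):
    """Return a GTF line that carries a transcript_id attribute.

    Rows that already have one, comments and malformed rows are returned
    unchanged. The placeholder value (the row's gene_id) is an assumption.
    """
    if line.startswith("#") or not line.strip():
        return line  # comment or blank line
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return line  # not a standard 9-column GTF row
    attributes = fields[8]
    if 'transcript_id "' in attributes:
        return line  # nothing to add
    # Pull the gene_id value out of the attribute column, if present.
    gene_id = "unknown"
    marker = 'gene_id "'
    start = attributes.find(marker)
    if start != -1:
        start += len(marker)
        gene_id = attributes[start:attributes.index('"', start)]
    fields[8] = attributes.rstrip().rstrip(";") + '; transcript_id "%s";' % gene_id
    return "\t".join(fields) + "\n"


def rewrite_gtf(src_path, dst_path):
    """Copy a GTF file, adding the placeholder attribute where it is missing."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(add_missing_transcript_id(line))
```

Calling rewrite_gtf("Homo_sapiens.GRCh37.75.gtf", "fixed.gtf") would write a copy in which the gene rows also carry the attribute; the output file name here is only an example.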

@sridhar0605

sridhar0605 commented Dec 7, 2017

Hello @emi80 @abreschi ,

Thank you for your reply. I will modify the file and try again. Currently I do not see any issue if I change the build to 37.67, so maybe it is build-specific.

I guess you can close this issue.

thanks

@ChaoTang-SCU

When I used the Ensembl GTF file, I got the same error. In the end, I found that a GTF containing only transcript and exon rows works well:
awk -F "\t" '$3=="exon"||$3=="transcript"' Homo_sapiens.GRCh38.87.gtf > Homo_sapiens.GRCh38.87.transccript.exon.gtf
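
For anyone who prefers to do the same filtering in Python, here is a rough equivalent of the awk one-liner. The function name is made up for illustration. Like the awk command, it also drops the "#!" header lines, because they have no third tab-separated column.

```python
def keep_transcript_and_exon_rows(lines, features=("transcript", "exon")):
    """Yield only the GTF rows whose third column is one of `features`."""
    for line in lines:
        fields = line.split("\t")
        if len(fields) >= 3 and fields[2] in features:
            yield line


def filter_gtf(src_path, dst_path):
    """Write the transcript and exon rows of src_path to dst_path."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(keep_transcript_and_exon_rows(src))
```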

@ManavalanG

ManavalanG commented Sep 6, 2018

Just wanted to add that the GENCODE GTF runs into the same issue.

@bellenger-l

When I used the Ensembl GTF file, I got the same error. In the end, I found that a GTF containing only transcript and exon rows works well:
awk -F "\t" '$3=="exon"||$3=="transcript"' Homo_sapiens.GRCh38.87.gtf > Homo_sapiens.GRCh38.87.transccript.exon.gtf

I'm sorry, but it didn't fix the issue for me. I get a new error:

Traceback (most recent call last):
  File "./sashimi-plot.py", line 612, in <module>
    transcripts, exons = read_gtf(args.gtf, args.coordinates)
  File "./sashimi-plot.py", line 283, in read_gtf
    d = dict(kv.strip().split(" ") for kv in tags.strip(";").split("; "))
ValueError: dictionary update sequence element #17 has length 7; 2 is required

I am using the Mus_musculus.GRCm38.83 annotation.

If you have a solution, I would appreciate it.

Thanks,
Lea
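
A note on this ValueError: the parsing line splits every attribute on single spaces, so any attribute whose quoted value itself contains spaces yields more than two pieces. The sketch below is one possible way to parse the attribute column more tolerantly. It is not the ggsashimi code, the transcript_support_level value is only an illustrative example of such an attribute, and unlike the original line it strips the surrounding quotes from the values.

```python
import re

# key, whitespace, then either a quoted value (may contain spaces) or a bare token
ATTRIBUTE = re.compile(r'\s*(\S+)\s+(?:"([^"]*)"|([^;\s]+))\s*;?')


def parse_attributes(tags):
    """Parse the 9th GTF column into a dict; the last value wins for repeated keys."""
    d = {}
    for match in ATTRIBUTE.finditer(tags):
        key, quoted, bare = match.groups()
        d[key] = quoted if quoted is not None else bare
    return d


example = ('gene_id "G1"; transcript_id "T1"; '
           'transcript_support_level "1 (assigned to previous version 5)";')
attributes = parse_attributes(example)
print(attributes["transcript_support_level"])
```

With the original splitting, the last attribute in this example breaks into 7 pieces, which matches the "has length 7; 2 is required" message.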

@dgarrimar
Collaborator

Dear Lea @bellenger-l,

You could try using GENCODE annotation files. The release corresponding to mouse Ensembl 83 is GENCODE M8. Alternatively, could you provide some lines of your GTF so we can check what the problem is? As stated in previous comments, make sure that the file follows the proper format. In particular, the transcript_id attribute should be present in every line of the GTF.

@bellenger-l

Dear @dgarrimar,

Thanks a lot! It works like a charm with the GENCODE annotation, but it doesn't print the different transcripts under the sashimi plots. I can't figure out which option does that.

Best,
Lea

@dgarrimar
Collaborator

In principle it should, could you send the command that you are using and the output that you generated? Thanks!

@bellenger-l

I'm sorry, I hadn't checked the GENCODE GTF: its chromosome names differ from those in the Ensembl GTF ("chr1" versus "1"). I removed "chr" from the first column and now I get the transcripts...

Thanks a lot for your help anyway,
Best
Lea
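
For reference, removing the prefix can be sketched as below (the function name is hypothetical). It only strips a leading "chr" from the first column, so it assumes the remaining names already match; the mitochondrial chromosome is one case where they may not, since GENCODE uses "chrM" and Ensembl uses "MT".

```python
def strip_chr_prefix(line):
    """Remove a leading 'chr' from the first column of a GTF row."""
    if line.startswith("#"):
        return line  # leave header/comment lines untouched
    chrom, sep, rest = line.partition("\t")
    if chrom.startswith("chr"):
        chrom = chrom[3:]
    return chrom + sep + rest


def convert_gtf(src_path, dst_path):
    """Write a copy of src_path whose chromosome names lack the 'chr' prefix."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(strip_chr_prefix(line))
```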

@kylinson

kylinson commented Mar 5, 2019

Just replace line 284 of the original code with transcript_id = d.get("transcript_id","transcript_id_missing").

@PhKoch

PhKoch commented Mar 28, 2019

@kylinson this hack didn't work for me. The following error occurred:

Traceback (most recent call last):
  File "./sashimi-plot.py", line 612, in <module>
    transcripts, exons = read_gtf(args.gtf, args.coordinates)
  File "./sashimi-plot.py", line 283, in read_gtf
    d = dict(kv.strip().split(" ") for kv in tags.strip(";").split("; "))
ValueError: dictionary update sequence element #10 has length 7; 2 is required

I'll edit my gtf as suggested earlier.

@archana433

archana433 commented May 11, 2020

@tangchao7498
I'm sorry, but it didn't fix the issue for me either; I also get a new error.
I am using the Mus_musculus.GRCm38.99 annotation.

Traceback (most recent call last):
  File "./sashimi-plot.py", line 612, in <module>
    transcripts, exons = read_gtf(args.gtf, args.coordinates)
  File "./sashimi-plot.py", line 283, in read_gtf
    d = dict(kv.strip().split(" ") for kv in tags.strip(";").split("; "))
ValueError: dictionary update sequence element #11 has length 7; 2 is required

@dgarrimar
Collaborator

dgarrimar commented May 11, 2020

Dear @archana433, have you tried using the GENCODE annotation? I believe this is the equivalent to the one you use. Give it a try and let me know!

@archana433

archana433 commented May 11, 2020

Thanks, it worked. Now I get this error:

Error in seq.default(start, max(start + 1, end - 4), by = 2425) : 
  'from' must be of length 1
Calls: rbind -> [ -> [.data.table -> seq -> seq.default
Execution halted

@dgarrimar
Collaborator

Great, as the annotation issue is solved, let's continue the discussion regarding this error in #33.

@antonioggsousa

antonioggsousa commented Jul 7, 2020

Hi,

I faced the same problem. I'm trying to run the Python script, but instead of changing the GTF file, I added a few lines of code that skip rows lacking "transcript_id" and join attribute values that contain a space (such as some gene names) with an underscore:

    #--------------------------------------------------------------------------------
    ## AGGS: skip lines without "transcript_id" tag
    if "transcript_id" not in tags:
        continue
    dict_list = []  # concatenate "gene_name" with space, e.g., "PDH-E1 ALPHA" into "PDH-E1_ALPHA"
    for ele in tags.strip(";").split("; "):
        l = ele.strip().split(" ")
        if len(l[1::]) > 1:
            gene_name = "_".join(ele.strip().split(" ")[1::])
            l = [l[0], gene_name]
        dict_list.append(l)
    dic_tuple = tuple(dict_list)
    d = dict(dic_tuple)
    #--------------------------------------------------------------------------------
    #d = dict(kv.strip().split(" ") for kv in tags.strip(";").split("; ")) #aggs: commented line

You might consider adding these few lines (lines 283-297) to your Python script. I know it is not very pythonic.

António

@KrotosBenjamin

I've fixed this issue by falling back to the gene ID with a try statement; the script still works as originally intended when transcript_id is present. This is easier than editing a GTF file.

Replace:
transcript_id = d["transcript_id"]

with the try statement below:

try:
    transcript_id = d["transcript_id"]
except KeyError:
    transcript_id = d["gene_id"]

@antonioggsousa

Thanks @KrotosBenjamin, that is by far more pythonic and elegant.

António

@stale

stale bot commented Jan 21, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale Issues with no recent activity label Jan 21, 2021
@dgarrimar
Collaborator

Following suggestions in PR #52 by @ygidtu, with minor modifications, GTFs with gene rows without the transcript_id attribute will not throw an error anymore. However, the transcript_id attribute will still be required in transcript/exon rows. We included a more informative error message for this case.
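
To illustrate the behaviour described above, here is a hedged sketch. It is not the code merged from PR #52; the function name, the set of feature types and the error wording are all assumptions made for the example.

```python
def get_transcript_id(feature, attributes):
    """Return the transcript_id of a parsed GTF row, or None if the row can be skipped.

    `feature` is the third GTF column and `attributes` is the parsed
    attribute dictionary of the same row.
    """
    transcript_id = attributes.get("transcript_id")
    if transcript_id is not None:
        return transcript_id
    if feature in ("transcript", "exon"):
        # These rows cannot be drawn without knowing their transcript.
        raise ValueError(
            'GTF %s row lacks the required "transcript_id" attribute: %r'
            % (feature, attributes)
        )
    return None  # e.g. a gene row: the caller simply skips it
```

A caller would skip rows for which the function returns None and let the ValueError surface for malformed transcript or exon rows.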

@stale stale bot removed the stale Issues with no recent activity label Jan 21, 2021