KeyError: 'transcript_id' with Ensembl human annotation #2

emi80 · 2017-12-07T11:04:10Z

From @sridhar0605 originally posted in #1:

When using Homo_sapiens.GRCh37.75.gtf from Ensembl as the reference, I see this error:

Using default tag: latest
latest: Pulling from guigolab/ggsashimi
915665fee719: Pull complete
1a0814f59c8e: Pull complete
b3b71680ed5d: Pull complete
1c3c8afa6ada: Pull complete
2fbeb903a5b4: Pull complete
Digest: sha256:82590f821978568e948ad4861ce009fcb26e7543263bea9d7b78c17667f8d675
Status: Downloaded newer image for guigolab/ggsashimi:latest
Traceback (most recent call last):
  File "/sashimi-plot.py", line 592, in <module>
    transcripts, exons = read_gtf(args.gtf, args.coordinates)
  File "/sashimi-plot.py", line 278, in read_gtf
    transcript_id = d["transcript_id"]
KeyError: 'transcript_id'

A few lines from the GTF:

#!genome-build GRCh37.p13
#!genome-version GRCh37
#!genome-date 2009-02
#!genome-build-accession NCBI:GCA_000001405.14
#!genebuild-last-updated 2013-09
1	pseudogene	gene	11869	14412	.	+	.	gene_id "ENSG00000223972"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene";
1	processed_transcript	transcript	11869	14409	.	+	.	gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana";
1	processed_transcript	exon	11869	12227	.	+	.	gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "1"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana"; exon_id "ENSE00002234944";
1	processed_transcript	exon	12613	12721	.	+	.	gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "2"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana"; exon_id "ENSE00003582793";
1	processed_transcript	exon	13221	14409	.	+	.	gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "3"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana"; exon_id "ENSE00002312635";


emi80 · 2017-12-07T11:10:25Z

The problem is that we assume the transcript_id attribute is present in every line of the GTF (except for comments of course). I see here that the gene line does not have it. One solution: you could preprocess the annotation and add the transcript_id field where it is not present.

@abreschi what do you suggest?

abreschi · 2017-12-07T20:16:32Z

Hi! Sorry about this issue. Unfortunately the transcript_id is a required field in the GTF format (https://genome.ucsc.edu/FAQ/FAQformat.html#format4), even in gene rows. So, I would modify the Ensembl file like @emi80 said. Hope it helps.
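
The preprocessing suggested above could look roughly like the following. This is a minimal sketch and not part of ggsashimi: the names `add_missing_transcript_id` and `fix_gtf` are made up for illustration, and reusing the row's gene_id value as the placeholder transcript_id is an assumption (the thread does not say which value to use).

```python
import re

GENE_ID_RE = re.compile(r'gene_id "([^"]*)"')


def add_missing_transcript_id(line):
    """Return a GTF line with a transcript_id attribute appended if it lacks one.

    Hypothetical helper: the placeholder value (the row's gene_id, or "NA"
    when there is no gene_id either) is an assumption made for this sketch.
    """
    if line.startswith("#") or not line.strip():
        return line  # leave comments and blank lines untouched
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9 or "transcript_id" in fields[8]:
        return line  # malformed row, or the attribute is already there
    match = GENE_ID_RE.search(fields[8])
    placeholder = match.group(1) if match else "NA"
    attrs = fields[8].rstrip()
    if not attrs.endswith(";"):
        attrs += ";"
    fields[8] = '%s transcript_id "%s";' % (attrs, placeholder)
    return "\t".join(fields) + "\n"


def fix_gtf(in_path, out_path):
    """Write a copy of in_path in which every feature row has a transcript_id."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(add_missing_transcript_id(line))
```

It could be called as `fix_gtf("Homo_sapiens.GRCh37.75.gtf", "fixed.gtf")`, where the output name is arbitrary. Rows that already carry a transcript_id are passed through unchanged.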

sridhar0605 · 2017-12-07T21:51:17Z

Hello @emi80 @abreschi ,

Thank you for your reply. I will modify the file and try again. Currently I do not see any issue if I change the build to 37.67, so it may be build-specific.

I guess you can close this issue.

thanks

ChaoTang-SCU · 2018-04-03T02:59:25Z

When I used an Ensembl GTF file, I got the same error. In the end, I found that a GTF containing only transcript and exon rows works well:
awk -F "\t" '$3=="exon"||$3=="transcript"' Homo_sapiens.GRCh38.87.gtf > Homo_sapiens.GRCh38.87.transccript.exon.gtf

ManavalanG · 2018-09-06T00:24:09Z

Just wanted to add that the GENCODE GTF runs into the same issue.

bellenger-l · 2019-02-20T15:55:04Z

> Here,
> When I used Ensembl GTF file, I also get the same error. Finally, I found that the GTF only have transcript and exon rows works well.
> awk -F "\t" '$3=="exon"||$3=="transcript"' Homo_sapiens.GRCh38.87.gtf > Homo_sapiens.GRCh38.87.transccript.exon.gtf

I'm sorry, but this didn't fix the issue for me. I get a new error:

Traceback (most recent call last):
  File "./sashimi-plot.py", line 612, in <module>
    transcripts, exons = read_gtf(args.gtf, args.coordinates)
  File "./sashimi-plot.py", line 283, in read_gtf
    d = dict(kv.strip().split(" ") for kv in tags.strip(";").split("; "))
ValueError: dictionary update sequence element #17 has length 7; 2 is required

I am using the Mus_musculus.GRCm38.83 annotation.

If you have a solution, I would appreciate it.

Thanks,
Lea

dgarrimar · 2019-02-21T10:00:31Z

Dear Lea @bellenger-l,

You could try using the GENCODE annotation files. The release corresponding to mouse Ensembl 83 is GENCODE M8. Alternatively, could you provide a few lines of your GTF so we can check what the problem is? As stated in previous comments, make sure that the file follows the proper format. In particular, the transcript_id attribute should be present in every line of the GTF.

bellenger-l · 2019-02-21T15:49:24Z

Dear @dgarrimar,

Thanks a lot! It works like a charm with the GENCODE annotation, but it doesn't draw the different transcripts under the sashimi plots. I can't figure out which option does that.

Best,
Lea

dgarrimar · 2019-02-21T16:06:53Z

In principle it should, could you send the command that you are using and the output that you generated? Thanks!

bellenger-l · 2019-02-21T16:24:30Z

I'm sorry, I hadn't checked the GENCODE GTF: its chromosome names differ from those in the Ensembl GTF ("chr1" versus "1"). I removed "chr" from the first column and now I get the transcripts...

Thanks a lot for your help anyway,
Best
Lea
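
The renaming described above (dropping "chr" from the first column so that the GTF matches "1"-style chromosome names) could be sketched as follows. The function name `strip_chr_prefix` is hypothetical, and the sketch only removes the prefix: chromosomes whose names differ by more than that (the mitochondrial chromosome, for example, may be named differently in the two conventions) are not handled.

```python
def strip_chr_prefix(line):
    """Drop a leading "chr" from the sequence name (first column) of a GTF line."""
    if line.startswith("#"):
        return line  # header/comment lines are left as they are
    seqname, sep, rest = line.partition("\t")
    if seqname.startswith("chr"):
        seqname = seqname[len("chr"):]
    return seqname + sep + rest


def rename_chromosomes(in_path, out_path):
    """Write a copy of the GTF at in_path with "chr" removed from column 1."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(strip_chr_prefix(line))
```

Whether to strip or add the prefix depends on the chromosome names used elsewhere in the analysis; the annotation has to follow the same convention.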

kylinson · 2019-03-05T08:09:25Z

Just replace line 284 of the original code with transcript_id = d.get("transcript_id","transcript_id_missing").

PhKoch · 2019-03-28T15:12:19Z

@kylinson this hack didn't work for me. The following error occurred:

Traceback (most recent call last):
  File "./sashimi-plot.py", line 612, in <module>
    transcripts, exons = read_gtf(args.gtf, args.coordinates)
  File "./sashimi-plot.py", line 283, in read_gtf
    d = dict(kv.strip().split(" ") for kv in tags.strip(";").split("; "))
ValueError: dictionary update sequence element #10 has length 7; 2 is required

I'll edit my gtf as suggested earlier.

archana433 · 2020-05-11T11:21:54Z

@tangchao7498
I'm sorry, but it didn't fix the issue for me either; I also get a new error.
I am using the Mus_musculus.GRCm38.99 annotation.

Traceback (most recent call last):
  File "./sashimi-plot.py", line 612, in <module>
    transcripts, exons = read_gtf(args.gtf, args.coordinates)
  File "./sashimi-plot.py", line 283, in read_gtf
    d = dict(kv.strip().split(" ") for kv in tags.strip(";").split("; "))
ValueError: dictionary update sequence element #11 has length 7; 2 is required

dgarrimar · 2020-05-11T12:03:33Z

Dear @archana433, have you tried using the GENCODE annotation? I believe this is the equivalent of the one you use. Give it a try and let me know!

archana433 · 2020-05-11T12:27:40Z

Thanks, it worked. Now I get this error:

Error in seq.default(start, max(start + 1, end - 4), by = 2425) : 
  'from' must be of length 1
Calls: rbind -> [ -> [.data.table -> seq -> seq.default
Execution halted

dgarrimar · 2020-05-11T12:39:03Z

Great, as the annotation issue is solved, let's continue the discussion regarding this error in #33.

antonioggsousa · 2020-07-07T13:34:29Z

Hi,

I faced the same problem. I'm running the Python script and, instead of changing the GTF file, I added a few lines of code to skip rows without "transcript_id" and to join gene names that contain a space with an underscore:

#--------------------------------------------------------------------------------
## AGGS: skip lines without "transcript_id" tag
if "transcript_id" not in tags:
    continue
dict_list = [] # concatenate "gene_name" with space, e.g., "PDH-E1 ALPHA" into "PDH-E1_ALPHA"
for ele in tags.strip(";").split("; "):
    l = ele.strip().split(" ")
    if len(l[1::]) > 1:
        gene_name = "_".join(ele.strip().split(" ")[1::])
        l = [l[0], gene_name]
    dict_list.append(l)
dic_tuple = tuple(dict_list)
d = dict(dic_tuple)
#--------------------------------------------------------------------------------
#d = dict(kv.strip().split(" ") for kv in tags.strip(";").split("; ")) #aggs: commented line

You might consider adding these lines (lines 283-297) to your Python script. I know it is not very pythonic.

António
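
The "length 7; 2 is required" errors reported earlier in this thread are consistent with attribute values that contain spaces, which the one-line split cannot handle. As an alternative to joining the pieces by hand, a regular expression can read each quoted value as a whole. The following is only a sketch: `parse_attributes` is a hypothetical name, the IDs in the example string are made up, and the pattern assumes every value is double-quoted (unquoted values would be silently skipped).

```python
import re

# A key, one space, then a double-quoted value that may itself contain spaces.
ATTRIBUTE_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(tags):
    """Parse a GTF attribute column into a dict of key -> unquoted value."""
    return dict(ATTRIBUTE_RE.findall(tags))


# Made-up attribute string; the gene name is the example given above.
tags = 'gene_id "G1"; transcript_id "T1"; gene_name "PDH-E1 ALPHA";'

# The original one-liner fails here, because the gene_name pair splits
# into three tokens instead of two.
try:
    dict(kv.strip().split(" ") for kv in tags.strip(";").split("; "))
except ValueError as err:
    print("naive split failed:", err)

d = parse_attributes(tags)
print(d["gene_name"])
```

Note that this version returns values without the surrounding quotes, whereas the original one-liner keeps them, so the rest of the script may need adjusting if it is used as a drop-in replacement.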

KrotosBenjamin · 2020-09-01T12:48:43Z

I've fixed this issue with gencodeID using a try statement, and the script still works as originally intended. This is easier than editing a GTF file.

Replace:
transcript_id = d["transcript_id"]

with the try statement below:

try:
    transcript_id = d["transcript_id"]
except KeyError:
    transcript_id = d["gene_id"]
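
For illustration, here is how that fallback behaves on attribute strings from the GTF excerpt at the top of this issue (the exon attributes are shortened). The helper names `parse_tags` and `transcript_id_or_gene_id` are hypothetical; `parse_tags` simply wraps the one-liner shown in the tracebacks above.

```python
def parse_tags(tags):
    """Parse the attribute column with the one-liner shown in the tracebacks."""
    return dict(kv.strip().split(" ") for kv in tags.strip(";").split("; "))


def transcript_id_or_gene_id(d):
    """Return d["transcript_id"], falling back to d["gene_id"] when absent."""
    try:
        return d["transcript_id"]
    except KeyError:
        return d["gene_id"]


gene_row_tags = ('gene_id "ENSG00000223972"; gene_name "DDX11L1"; '
                 'gene_source "ensembl_havana"; gene_biotype "pseudogene";')
exon_row_tags = ('gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; '
                 'exon_number "1";')

# Values keep their surrounding double quotes with this parser.
print(transcript_id_or_gene_id(parse_tags(gene_row_tags)))
print(transcript_id_or_gene_id(parse_tags(exon_row_tags)))
```

With this change a gene row is treated as if its transcript were named after the gene, so how such rows show up downstream depends on what the rest of the script does with them.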

antonioggsousa · 2020-09-01T13:18:52Z

Thanks @KrotosBenjamin, this is by far more pythonic and elegant.

António

stale · 2021-01-21T13:58:48Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

dgarrimar · 2021-01-21T18:26:05Z

Following suggestions in PR #52 by @ygidtu, with minor modifications, GTFs with gene rows without the transcript_id attribute will not throw an error anymore. However, the transcript_id attribute will still be required in transcript/exon rows. We included a more informative error message for this case.
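
The behaviour described above could be pictured like this. This is a hypothetical illustration only, not the code from PR #52 or from the commit that closed this issue: the function name `get_transcript_id`, its signature and the error text are all assumptions.

```python
def get_transcript_id(feature, d):
    """Return the transcript_id of a GTF row, or None for rows that may lack it.

    feature is the third GTF column (e.g. "gene", "transcript", "exon") and
    d is the parsed attribute dictionary of the same row.
    """
    if "transcript_id" in d:
        return d["transcript_id"]
    if feature in ("transcript", "exon"):
        # The attribute is still required here, so fail with a clear message.
        raise ValueError(
            'GTF %s row has no "transcript_id" attribute: %r' % (feature, d)
        )
    return None  # e.g. a gene row, which the caller can skip
```

A caller would skip rows for which the function returns None and let the ValueError surface for malformed transcript or exon rows.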

stale bot added the "stale" label (Issues with no recent activity) Jan 21, 2021

stale bot removed the "stale" label (Issues with no recent activity) Jan 21, 2021

dgarrimar closed this as completed in d7a9a9d Jan 22, 2021
