Ensembl fasta files no longer usable as references #1295

bwlang · 2019-02-20T20:26:16Z

As a result of this change, it's no longer possible to use GENCODE's transcript fasta as a reference database with mark duplicates, because parentheses are common in their transcript names.

I read a bit on the background and I'm not sure I understand why the SAM spec should be changed to disallow parentheses and brackets. I think @tfenne's comment is right on. It's too late to make these changes without downstream breakage.

Maybe the @HD-VN checking will help with this?
I switched back to picard 2.18.23 (just before 2.18.24 updated htsjdk).

Your environment

version of htsjdk - 2.18.2
version of java 11.01
version of linux: ubuntu 16.04

Steps to reproduce

attempt to mark duplicates of a bam file aligned to an encode list of transcripts

Expected behaviour

should mark duplicates

Actual behaviour

complains that contig names do not match regex

lbergelson · 2019-02-20T21:13:10Z

@bwlang Is this the Ensembl human transcripts? As far as I understood, they all look something like this: ENST00000514109.1

bwlang · 2019-02-20T21:19:05Z

I noticed this on mouse:

lbergelson · 2019-02-20T21:19:31Z

Ah, I don't see any in the gencode human transcripts.

lbergelson · 2019-02-20T21:23:34Z

Yeah, I see 4 transcripts in vM20 with () in them. That's unfortunate. We'll have to figure out how to deal with that.

lbergelson · 2019-02-22T16:44:32Z

We've sent a message to the people at Ensembl; hopefully we can work with them to fix the transcript names in future versions. Until then, we have to decide how we want to enable people to continue working with files that have pathological reference names.

bwlang · 2019-02-22T16:49:55Z

I think it's a bad idea to tighten the SAM spec at this point. The goal is to reduce pain, but I suspect net happiness is going to go down as a result.
If that decision is immutable, perhaps applying strictness only to those files that have been marked with a later SAM version is a practical approach.

jmarshall · 2019-02-22T17:06:30Z

> I think it's a bad idea to tighten the SAM spec at this point. The goal is to reduce pain

In the case of commas, the goal was to make the spec self-consistent and the format parsable. In the case of the various brackets, statistics were gathered (see samtools/hts-specs#333 et al) and a grand total of zero instances of () were observed over a period of months.

However, teething problems like this were not unanticipated. Please raise an issue over at hts-specs if you would like the SAM specification folks to consider relaxing the spec to allow parenthesis characters here.

What exact file is this from? gencode.vM20.transcripts.fa.gz?

ENSMUST00000159544.2|ENSMUSG00000086429.9|OTTMUSG00000034748.3|OTTMUST00000088412.2|Gt(ROSA)26Sor-203|Gt(ROSA)26Sor|822|antisense|
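For illustration, a minimal Python sketch of the kind of name check that rejects this header. The pattern is an assumption based on one reading of the SAM spec's reference-name rule (printable ASCII except backslash, comma, quotes, backtick and the various brackets, with no leading `*` or `=`); it is not copied from htsjdk, and `is_valid_rname` and `offending_chars` are hypothetical helper names.

```python
import re

# Assumption: character classes reconstructed from the SAM spec's prose rule
# for reference sequence names; not taken from htsjdk's source.
RNAME_PATTERN = re.compile(
    r"[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*"
)


def is_valid_rname(name: str) -> bool:
    """Return True if the whole name matches the assumed pattern."""
    return RNAME_PATTERN.fullmatch(name) is not None


def offending_chars(name: str) -> list:
    """List the distinct characters that are not allowed anywhere in a name."""
    # Prefixing a known-good character tests c against the non-initial class.
    return sorted({c for c in name if not RNAME_PATTERN.fullmatch("A" + c)})


header = (
    "ENSMUST00000159544.2|ENSMUSG00000086429.9|OTTMUSG00000034748.3|"
    "OTTMUST00000088412.2|Gt(ROSA)26Sor-203|Gt(ROSA)26Sor|822|antisense|"
)
print(is_valid_rname(header), offending_chars(header))
```

Under that assumed pattern, the parentheses are the only characters in this header that fail the check; the `|`, `.` and `-` characters all pass.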

I am curious though. Do you really enjoy having all that as a reference sequence name? Do people doing this sort of thing ever truncate the FASTA headers at the first | and work with ENSMUST00000159544.2 etc instead, which seems rather less unwieldy?
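The truncation suggested here could be sketched roughly as follows in Python. The function names are hypothetical, and the duplicate check is an added safeguard rather than something discussed in the thread; it assumes the first `|`-delimited field is meant to be unique.

```python
import gzip


def truncate_header(line: str) -> str:
    """Cut a FASTA header line at the first '|'; pass sequence lines through."""
    if not line.startswith(">"):
        return line
    name = line[1:].rstrip("\n").split("|", 1)[0]
    return ">" + name + "\n"


def truncate_fasta(lines):
    """Yield lines with shortened headers, refusing duplicate names."""
    seen = set()
    for line in lines:
        out = truncate_header(line)
        if out.startswith(">"):
            if out in seen:
                raise ValueError("duplicate name after truncation: " + out.strip())
            seen.add(out)
        yield out


def rewrite_fasta(in_path: str, out_path: str) -> None:
    """Read a gzipped FASTA and write a gzipped copy with short names."""
    with gzip.open(in_path, "rt") as src, gzip.open(out_path, "wt") as dst:
        dst.writelines(truncate_fasta(src))
```

Reads would then need to be aligned against the renamed reference (or the BAM header rewritten to match) for the shorter names to appear in `@SQ` lines.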

> perhaps applying strictness only on those files that have been marked with a later SAM version is a practical approach.

In fact that is what the spec recommends.

(OTOH htsjdk not doing so is what has brought your issue to light so quickly…)
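A rough Python sketch of version-gated strictness, for illustration only. It reads the `VN` field of the `@HD` header line and enforces the name check only at or above a caller-supplied threshold. The threshold is deliberately a required parameter: which version should trigger strict checking is not stated in this thread, so any value passed in is an assumption. The function names are hypothetical, and this is not how htsjdk behaves.

```python
import re
from typing import Optional, Tuple


def header_version(header_text: str) -> Optional[Tuple[int, int]]:
    """Return (major, minor) from the @HD VN field, or None if absent."""
    for line in header_text.splitlines():
        if line.startswith("@HD"):
            for field in line.split("\t")[1:]:
                if field.startswith("VN:"):
                    m = re.fullmatch(r"(\d+)\.(\d+)", field[3:])
                    if m:
                        return int(m.group(1)), int(m.group(2))
    return None


def should_enforce(header_text: str, strict_from: Tuple[int, int]) -> bool:
    """Enforce strict name checks only for files declaring a new enough version.

    strict_from is a hypothetical threshold chosen by the caller.
    """
    version = header_version(header_text)
    return version is not None and version >= strict_from
```

With this shape, a file with no `@HD` line or an older declared version would be read leniently, while a newer file with a name like `Gt(ROSA)26Sor` would still be rejected.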

bwlang · 2019-02-22T17:23:21Z

I think the statistics-gathering approach did not cover the diverse world of reference databases.

There is utility in using transcript names unmodified from sources like gencode. All that extra information does turn out to be useful sometimes.

Here's the link to the data:
ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M20/gencode.vM20.transcripts.fa.gz

jmarshall · 2019-02-22T17:29:12Z

Clearly what we need is more statistics 😄

Thanks for bringing this mouse transcript file to people's attention. Are there other examples of the diverse world of reference databases that you'd like to bring to our attention?

This was referenced Sep 3, 2020

Sequence name regex issue in SortSam broadinstitute/picard#1574

Closed

Parentheses in RNAME samtools/hts-specs#526

Closed
