-
Notifications
You must be signed in to change notification settings - Fork 243
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ensembl fasta files no long usable as references #1295
Comments
@bwlang Is this the ensemble human transcripts? As far as I understood they all looked something like this: |
I noticed this on mouse: ENSMUST00000159544.2|ENSMUSG00000086429.9|OTTMUSG00000034748.3|OTTMUST00000088412.2|Gt(ROSA)26Sor-203|Gt(ROSA)26Sor|822|antisense| |
Ah, I don't see any in the gencode human transcripts. |
Yeah, I see 4 transcripts in the v20M with |
We've sent a message to the people at ensemble, hopefully we can work to fix the transcript names in future versions. Until then we have to decide how we want to enable people to continue working with things that have pathological reference names. |
I think it's a bad idea to tighten the SAM spec a this point. The goal is to reduce pain - but I suspect net happiness is going to go down as a result. |
In the case of commas, the goal was to make the spec self-consistent and the format parsable. In the case of the various brackets, statistics were gathered (see samtools/hts-specs#333 et al) and a grand total of zero instances of However teething problems like this were not unanticipated. Please raise an issue over at hts-specs if you would like the SAM specification folks to consider relaxing the spec to allow parenthesis characters here. What exact file is this from? gencode.vM20.transcripts.fa.gz?
I am curious though. Do you really enjoy having all that as a reference sequence name? Do people doing this sort of thing ever truncate the FASTA headers at the first
In fact that is what the spec recommends. (OTOH htsjdk not doing so is what has brought your issue to light so quickly…) |
I think the statistics gathering approaching did not gather information covering the diverse world of reference databases. There is utility in using transcript names unmodified from sources like gencode. All that extra information does turn out to be useful sometimes. Here's the link to the data: |
Clearly what we need is more statistics 😄 Thanks for bringing this mouse transcript file to people's attention. Are there other examples of the diverse world of reference databases that you'd like to bring to our attention? |
As a result of this change, it's now longer possible to use gencode's transcript fasta as a reference database with mark duplicates because parentheses are common in their transcript names.
I read a bit on the background and I'm not sure I understand why the sam spec should be changed to disallow parentheses and brackets. I think @tfenne 's comment is right on. It's too late to make these changes without downstream breakage.
Maybe the @HD-VN checking will help with this?
I switched back to picard 2.18.23 (just before 2.18.24 update htsjdk)
Your environment
Steps to reproduce
attempt to mark duplicates of a bam file aligned to an encode list of transcripts
Expected behaviour
should mark duplicates
Actual behaviour
complains that contig names do not match regex
The text was updated successfully, but these errors were encountered: