Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extract Error: "Sequence contains non-DNA character '*' at position 393" #30

Open
jon4thin opened this issue Aug 5, 2024 · 11 comments
Open

Comments

@jon4thin
Copy link

jon4thin commented Aug 5, 2024

I am trying to run g2gtools extract with a g2gtools DB created from a GTF and a FASTA. It is running successfully but after 14 minutes it stops:

2024-08-05T19:38:02.180415541Z GTTCCAATACTGTGTTGCAGTGGGAGCCCAAACTTTCCCCAGTGTGAGTGCTCCCAGCAAGAAAGTGGCAAAGCAGATGGCCGCAGAGGAAGCCATGAAGGCCCTGCATGGGGAGGCGACCAACTCCATGGCTTCTGATAACCAG 2024-08-05T19:38:02.180441323Z [g2gtools debug] Exon ID=ENSE00003837056.1_L;Exon: ENSE00003837056.1_L chr1_L:154596815-154597005 (-1) #6 2024-08-05T19:38:02.180595316Z [g2gtools debug] chr1_L:154596815-154597005 (Length: 191) 2024-08-05T19:38:02.180600730Z CCTGAAGGTATGATCTCAGAGTCACTTGATAACTTGGAATCCATGATGCCCAACAAGGTCAGGAAGATTGGCGAGCTCGTGAGATACCTGAACACCAACCCTGTGGGTGGCCTTTTGGAGTACGCCCGCTCCCATGGCTTTGCTGCTGAATTCAAGTTGGTCGACCAGTCCGGACCTCCTCACGAGCCCAA 2024-08-05T19:38:02.180627776Z [g2gtools debug] Exon ID=ENSE00003836678.2_L;Exon: ENSE00003836678.2_L chr1_L:154589767-154590419 (-1) #7 2024-08-05T19:38:02.183394207Z Traceback (most recent call last): 2024-08-05T19:38:02.185504398Z File "/opt/conda/envs/g2gtools/bin/g2gtools", line 4, in <module> 2024-08-05T19:38:02.186001359Z __import__('pkg_resources').run_script('g2gtools==0.2.9', 'g2gtools') 2024-08-05T19:38:02.186006933Z File "/opt/conda/envs/g2gtools/lib/python2.7/site-packages/pkg_resources/__init__.py", line 666, in run_script 2024-08-05T19:38:02.187122669Z self.require(requires)[0].run_script(script_name, ns) 2024-08-05T19:38:02.187127761Z File "/opt/conda/envs/g2gtools/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1462, in run_script 2024-08-05T19:38:02.187499526Z exec(code, namespace, namespace) 2024-08-05T19:38:02.187942044Z File "/opt/conda/envs/g2gtools/lib/python2.7/site-packages/g2gtools-0.2.9-py2.7.egg/EGG-INFO/scripts/g2gtools", line 132, in <module> 2024-08-05T19:38:02.187946940Z G2GToolsApp() 2024-08-05T19:38:02.187949734Z File "/opt/conda/envs/g2gtools/lib/python2.7/site-packages/g2gtools-0.2.9-py2.7.egg/EGG-INFO/scripts/g2gtools", line 99, in __init__ 2024-08-05T19:38:02.187994284Z getattr(self, args.command)() 2024-08-05T19:38:02.187999699Z File "/opt/conda/envs/g2gtools/lib/python2.7/site-packages/g2gtools-0.2.9-py2.7.egg/EGG-INFO/scripts/g2gtools", line 105, in extract 2024-08-05T19:38:02.188002813Z g2gtools.g2g_commands.command_fasta_extract(sys.argv[2:], self.script_name + ' extract') 2024-08-05T19:38:02.188005779Z File "/opt/conda/envs/g2gtools/lib/python2.7/site-packages/g2gtools-0.2.9-py2.7.egg/g2gtools/g2g_commands.py", line 422, in command_fasta_extract 2024-08-05T19:38:02.189323301Z fasta.fasta_extract_transcripts(fasta_file, args.database, None, raw=args.raw) 2024-08-05T19:38:02.189330035Z File "/opt/conda/envs/g2gtools/lib/python2.7/site-packages/g2gtools-0.2.9-py2.7.egg/g2gtools/fasta.py", line 562, in fasta_extract_transcripts 2024-08-05T19:38:02.190316627Z partial_seq_str = str(g2g_utils.reverse_complement_sequence(partial_seq)) 2024-08-05T19:38:02.190322077Z File "/opt/conda/envs/g2gtools/lib/python2.7/site-packages/g2gtools-0.2.9-py2.7.egg/g2gtools/g2g_utils.py", line 229, in reverse_complement_sequence 2024-08-05T19:38:02.190975659Z return reverse_sequence(complement_sequence(sequence)) 2024-08-05T19:38:02.190981606Z File "/opt/conda/envs/g2gtools/lib/python2.7/site-packages/g2gtools-0.2.9-py2.7.egg/g2gtools/g2g_utils.py", line 214, in complement_sequence 2024-08-05T19:38:02.191010963Z raise ValueError("Sequence contains non-DNA character '{0}' at position {1:n}\n".format(val[position], position + 1)) 2024-08-05T19:38:02.191575326Z ValueError: Sequence contains non-DNA character '*' at position 393

Is '*' supposed to denote a stop codon? but then again, I only have 35 in my entire FASTA....

@mattjvincent
Copy link
Member

You seem to be using an older version. Python 2.7 and g2gtools 0.2.9. Can you update to Python 3 and g2gtools 2.0.0 and retry?

@jon4thin
Copy link
Author

@mattjvincent thank you so much for your reply. I have been using the docker image: kbchoi/emase:latest , I see that this has not been updated.....

  1. Ideally I would like to use a docker image, since I am running this on a velsera-based cloud platform. That is not avilable, correct?
  2. I cannot even find g2g tools 2.0.0. The most recent I could find is 1.0.1 on the github: https://github.com/churchill-lab/g2gtools

@mattjvincent
Copy link
Member

mattjvincent commented Aug 15, 2024

Installing from the main branch will install 2.0.0. I just created a release. Feel free to create a Docker image of your own.

There is one in the main branch in the top level directory.

@mattjvincent
Copy link
Member

@jon4thin ...

Try:

docker run churchilllab/g2gtools:2.0.0 g2gtools --help

You will have to mount directories and such, but this should get you started.

@jon4thin
Copy link
Author

jon4thin commented Aug 15, 2024

I will give it a shot right now and update shortly! I have never made a Docker image before so I will have to see how far i can get. tysvm for getting me going!

@jon4thin
Copy link
Author

Is the /opt/conda/envs/emase/bin/emase-zero command in kbchoi/emase:latest also outdated? I noticed that the flags in the documentation online and what I can do with the tool from the docker image is slightly different.

@jon4thin
Copy link
Author

sorry for the delaychurchilllab/g2gtools:2.0.0 works and it works on the Velsera CWL-based cloud compute platform as well! Let me know if there is any updated documentation (i.e. change in the command flags) and let me know if kbchoi/emase:latest also outdated! Thanks so much for your support <3~

@mattjvincent
Copy link
Member

EMASE's codebase was folded into GBRS. However, emase-zero has remain unchanged. There have been some minor flag changes since upgrading gbrs to Python 3, but we will have to compile a list.

@jon4thin
Copy link
Author

jon4thin commented Aug 20, 2024

Hey @mattjvincent , sorry to bother again. I updated my pipeline to use churchilllab/g2gtools:2.0.0 and while I received no issues with the * in the patch or transform or even gtf2db steps, g2g extract returned an error:

ValueError: Sequence contains non-DNA character '*' at position 393

I have the full error if needed.

Unless I am missing something, I had a conversation with a GATK person, and it seems like one can just grep/awk remove these records anyway:

Remove the * ALT Allele Reporting From a g.vcf Before (or After) Genotyping

It is just a little bit frustrating because the g2gtools -> alntools -> emase-zero pipeline requires 3 phases of manual removal of records:

  1. All HLA contigs from the first input VCF and FASTA
  2. All * alleles from the VCI
  3. All ENST's that are _PAR_Y and the original ENST (these throw an error in the emase-zero step) from the aligned SAM and the gene2transcript map.

This would be no problem if it was mentioned in a manual or something. It just takes a long time to figure out if you dont know before hand.

@mattjvincent
Copy link
Member

Interesting. The reason it fails is because g2gtools only uses the following bases/characters (upper or lower can be used):

ATGCYRSWKMBDHVN

I understand that * are stop codons. Might have to rethink how to handle this.

@jon4thin
Copy link
Author

The * that appear in .g.vcfs and vcfs denote the genotype at a SNP site when there is a upstream spanning deletion. These appear when there is a SNP somewhere but there is also, on the other allele or another patient, a deletion spanning that reason. Therefore, * is used to call the genotype at the SNP location. In this way, you can call a "genotype" at that position while referencing the fact that there is an upstream deletion. Therefore if you are constructing a FASTA sequence, you want to leave out the * records because that information is already contained in the upstream deletion.

There should be no stop codons denoted as * in g.vcf or vcf or fasta because these are DNA, not protein, sequences.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants