Sequence header SQ line can exceed 80 characters #65

Closed
peterjc opened this issue Nov 9, 2016 · 5 comments

@peterjc
Contributor

peterjc commented Nov 9, 2016

Testing gff3_to_embl on a bacterial assembly with lots of N bases generated this SQ line:

SQ   Sequence 5090820 BP; 1202378 A; 1343737 C; 1345174 G; 1198528 T; 1003 other;
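
To see why the line overflows, here is a minimal sketch of how such an SQ line can be assembled from a sequence. `format_sq_line` is a hypothetical helper written for illustration, not the function gff3toembl actually uses; the layout simply follows the SQ example in the EMBL user manual.

```python
from collections import Counter


def format_sq_line(sequence):
    """Build an EMBL-style SQ header line (hypothetical helper, not gff3toembl's code)."""
    counts = Counter(sequence.upper())
    known = {base: counts.get(base, 0) for base in "ACGT"}
    # Everything that is not A, C, G or T (for example N) is grouped as "other".
    other = len(sequence) - sum(known.values())
    return "SQ   Sequence {} BP; {} A; {} C; {} G; {} T; {} other;".format(
        len(sequence), known["A"], known["C"], known["G"], known["T"], other
    )


line = format_sq_line("ACGTNNACGT")
print(line)  # SQ   Sequence 10 BP; 2 A; 2 C; 2 G; 2 T; 2 other;
print(len(line))
```

Every extra digit in a count adds one character, so with seven-digit base counts the line quoted above comes to 81 characters.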

This triggered a slightly cryptic exception:

Traceback (most recent call last):
  File "/home/xxxx/bin/gff3_to_embl", line 4, in <module>
    __import__('pkg_resources').run_script('gff3toembl==1.1.0', 'gff3_to_embl')
  File "/home/xxxx/lib/python2.7/site-packages/pkg_resources/__init__.py", line 719, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/home/xxxx/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1511, in run_script
    exec(script_code, namespace, namespace)
  File "/home/xxxx/lib/python2.7/site-packages/gff3toembl-1.1.0-py2.7.egg/EGG-INFO/scripts/gff3_to_embl", line 38, in <module>
    
  File "build/bdist.linux-x86_64/egg/gff3toembl/EMBLWriter.py", line 96, in parse_and_run
  File "build/bdist.linux-x86_64/egg/gff3toembl/EMBLWriter.py", line 45, in create_output_file
  File "build/bdist.linux-x86_64/egg/gff3toembl/EMBLContig.py", line 26, in format
ValueError: Could not format contig, a line exceeded 80 characters in length

Simple hack for testing:

$ git diff
diff --git a/gff3toembl/EMBLContig.py b/gff3toembl/EMBLContig.py
index 4214a4e..15f8b8d 100644
--- a/gff3toembl/EMBLContig.py
+++ b/gff3toembl/EMBLContig.py
@@ -23,7 +23,9 @@ class EMBLContig(object):
     line_lengths = map(len, formatted_string.split('\n'))
     maximum_line_length = max(line_lengths)
     if maximum_line_length > 80:
-      raise ValueError("Could not format contig, a line exceeded 80 characters in length")
+      # raise ValueError("Could not format contig, a line exceeded 80 characters in length")
+      import sys
+      sys.stderr.write("WARNING: Exceeded 80 character per line limit\n")
     return formatted_string
 
   def add_header(self, **kwargs):
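
Rather than silencing the check entirely, one option would be to report which line overflowed and to let SQ header lines through. This is only a sketch: `check_line_lengths` and its `exempt_prefixes` parameter are hypothetical names, and it is not necessarily how the maintainers chose to fix it.

```python
def check_line_lengths(formatted_string, limit=80, exempt_prefixes=("SQ ",)):
    """Raise ValueError naming the first over-long line, skipping exempt prefixes.

    Hypothetical replacement for the length check in EMBLContig.format.
    """
    for number, line in enumerate(formatted_string.split("\n"), 1):
        if len(line) > limit and not line.startswith(exempt_prefixes):
            raise ValueError(
                "Could not format contig, line {} is {} characters long "
                "(limit {}): {!r}".format(number, len(line), limit, line)
            )
    return formatted_string
```

Including the line number and the offending text in the message would have made the original failure much easier to trace back to the SQ line.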

Sadly, whether and how this line should be wrapped is not stated explicitly in the EMBL user manual, ftp://ftp.ebi.ac.uk/pub/databases/embl/doc/usrman.txt:

3.4.17  The SQ Line
The SQ (SeQuence header) line marks the beginning of the sequence data and 
Gives a summary of its content. An example is:
     SQ   Sequence 1859 BP; 609 A; 314 C; 355 G; 581 T; 0 other; 
As shown, the line contains the length of the sequence in base pairs followed
by its base composition.  Bases other than A, C, G and T are grouped 
together as "other". (Note that "BP" is also used for single stranded RNA
sequences, which is not strictly accurate, but has been used for consistency
of format.) This information can be used as a check on accuracy or for
statistical  purposes. The word "Sequence" is present solely as a marker for
readability.

@andrewjpage
Member

Thanks for catching that. I looked up another published genome to double check, and its SQ line also exceeds 80 characters. I'll add a fix for it.

SQ   Sequence 65476681 BP; 21260528 A; 11272502 C; 11286742 G; 21320298 T; 336611 other;

@peterjc
Contributor Author

peterjc commented Nov 9, 2016

Interesting - could you share the accession of that example please? Could be a useful test for other parsers/writers as well. Thanks!
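
A parser test built on the long SQ line quoted above could look roughly like this. It is a minimal sketch: `parse_sq_line` and the regular expression are illustrative assumptions, not code from gff3toembl or from any existing parser.

```python
import re

# Pattern modelled on the SQ example in the EMBL user manual (an assumption).
SQ_PATTERN = re.compile(
    r"^SQ\s+Sequence (\d+) BP; (\d+) A; (\d+) C; (\d+) G; (\d+) T; (\d+) other;\s*$"
)


def parse_sq_line(line):
    """Return (length, counts) from an SQ header line, with no cap on line length."""
    match = SQ_PATTERN.match(line)
    if match is None:
        raise ValueError("Not a recognisable SQ line: {!r}".format(line))
    length, a, c, g, t, other = (int(group) for group in match.groups())
    counts = {"A": a, "C": c, "G": g, "T": t, "other": other}
    return length, counts


example = (
    "SQ   Sequence 65476681 BP; 21260528 A; 11272502 C; "
    "11286742 G; 21320298 T; 336611 other;"
)
length, counts = parse_sq_line(example)
assert len(example) > 80                 # longer than the 80 character limit
assert sum(counts.values()) == length    # base counts add up to the sequence length
```

The second assertion doubles as the accuracy check the manual mentions: the per-base counts should sum to the stated length.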

@andrewjpage
Member

I've pushed an update for this:
#66

@peterjc
Contributor Author

peterjc commented Nov 9, 2016

Great - that works nicely for me, thanks!
