Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor / rewrite / remove convert.py #24

Merged
merged 68 commits into from
Mar 19, 2015
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
68 commits
Select commit Hold shift + click to select a range
041eff2
Change variables names in EMBLConverter
bewt85 Mar 11, 2015
3f7b307
Remove code from visit region which doesn't do anything, probably
bewt85 Mar 11, 2015
d4773f3
Refactor out function to check if a feature is seen
bewt85 Mar 11, 2015
09ed237
Rename old EMBLWriter for refactor
bewt85 Mar 11, 2015
8483ddd
Update .gitignore to vim
bewt85 Mar 11, 2015
e65a761
WIP refactor header writing code
bewt85 Mar 11, 2015
d4e5f50
WIP add init for Header object
bewt85 Mar 11, 2015
e3f008d
WIP refactor EMBLHeader to take sequence length
bewt85 Mar 12, 2015
e814823
WIP Add EMBLFeature object
bewt85 Mar 12, 2015
72cdc81
WIP add coordinate formatter to EMBLFeature
bewt85 Mar 12, 2015
65b9d26
WIP Ignore ID and protein features
bewt85 Mar 12, 2015
602b8e1
WIP reformat eC_number feature_type
bewt85 Mar 12, 2015
59b938d
WIP Lookup which attribute creator for feature
bewt85 Mar 12, 2015
8c7e9e3
WIP create features for products
bewt85 Mar 12, 2015
6fb8f3d
WIP create locus_tag attributes for features
bewt85 Mar 12, 2015
4ea94f7
WIP create EC_number attributes for features
bewt85 Mar 12, 2015
9eebb17
WIP moved EC_number reformatting
bewt85 Mar 12, 2015
8c8f3a7
WIP create inference attribute for feature
bewt85 Mar 12, 2015
2ee0124
WIP add inference attribute to features
bewt85 Mar 13, 2015
a32fc83
WIP default attribute for features
bewt85 Mar 13, 2015
c07a7da
WIP: wrap multiline attributes for features
bewt85 Mar 13, 2015
13e7bea
WIP pick feature builder for feature_type
bewt85 Mar 13, 2015
91fbc1a
WIP create most features
bewt85 Mar 13, 2015
7812d91
WIP create translation_table attribute for feature
bewt85 Mar 13, 2015
a6bbc06
WIP create CDS features
bewt85 Mar 13, 2015
f85cf20
WIP create uninitialized feature objects
bewt85 Mar 13, 2015
9f38317
WIP setup initializer for EMBLFeature objects
bewt85 Mar 13, 2015
dcdbfc9
WIP don't create features for ID or protein_id types
bewt85 Mar 13, 2015
463478d
WIP add test for initializing ID and protein_id features
bewt85 Mar 13, 2015
1ccaf45
WIP update whitespace
bewt85 Mar 13, 2015
0b06962
WIP fix spelling mistake in tests and code
bewt85 Mar 16, 2015
6631a5f
WIP better file comparison in tests
bewt85 Mar 16, 2015
167031f
WIP fix bug: ignore some attributes
bewt85 Mar 16, 2015
9357ef2
WIP change formatter for transl_table attributes
bewt85 Mar 16, 2015
b0784bb
WIP don't create ncRNA features
bewt85 Mar 16, 2015
517875b
WIP EC_number attributes are sorted
bewt85 Mar 16, 2015
0ac8a38
WIP fix ordering of file comparison
bewt85 Mar 16, 2015
e977b30
WIP add test to ignore some attributes of features
bewt85 Mar 16, 2015
abf57fe
WIP start using new feature converter code
bewt85 Mar 16, 2015
0e154bc
Remove old dead code
bewt85 Mar 16, 2015
dbe6390
Add more parameters to EMBLConverter object
bewt85 Mar 16, 2015
c1b919b
WIP Sequence counts number of each nucleotide
bewt85 Mar 16, 2015
610ad08
WIP add sequence header
bewt85 Mar 16, 2015
34b022c
WIP split sequence body into little bits
bewt85 Mar 16, 2015
d276559
WIP format a sequence body
bewt85 Mar 16, 2015
79d9b92
WIP format a sequence object
bewt85 Mar 16, 2015
af68549
WIP initialize sequences
bewt85 Mar 16, 2015
022ec55
WIP add contig formatter
bewt85 Mar 16, 2015
d06293d
WIP add add_feature to EMBLContig
bewt85 Mar 16, 2015
529d198
WIP contig can add sequence data
bewt85 Mar 16, 2015
12cd240
WIP add_header to contig
bewt85 Mar 16, 2015
55a4223
Fix bug in tests
bewt85 Mar 16, 2015
297449f
WIP don't add a feature to a contig if it should be ignored
bewt85 Mar 16, 2015
ff8ba99
Fix Bug: contif features are a dictionary
bewt85 Mar 16, 2015
df407fd
WIP add sequence length to EMBLSequence
bewt85 Mar 16, 2015
7947e81
Almost replaced all convert.py code
bewt85 Mar 16, 2015
804a899
Fix bug in test case
bewt85 Mar 17, 2015
8ebbec8
Sort output, seems to work like the old version
bewt85 Mar 17, 2015
3d32d4f
Removed description, not being used anyway
bewt85 Mar 17, 2015
5e8f8d7
Remove parameters from EMBLConverter which it doesn't need
bewt85 Mar 17, 2015
fea2bc4
Remove all uses of convert.py
bewt85 Mar 17, 2015
23c4d69
Clean up imports
bewt85 Mar 17, 2015
8737fc7
Rename OldEMBLWriter back to EMBLWriter
bewt85 Mar 17, 2015
fb8b369
Add some helpful comments to EMBLContig
bewt85 Mar 17, 2015
ba2cb47
Remove last traces of convert.py
bewt85 Mar 17, 2015
908dabe
Treat source feature in header as a feature
bewt85 Mar 17, 2015
9d81a42
Raise ValueError if an output line is more than 80 characters long
bewt85 Mar 17, 2015
cab93e7
Update version number
bewt85 Mar 17, 2015
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 3 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -52,3 +52,6 @@ docs/_build/

# PyBuilder
target/

# Swap files
*.swp
375 changes: 375 additions & 0 deletions gff3toembl/EMBLContig.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,375 @@
import re
from textwrap import TextWrapper

class EMBLContig(object):
def __init__(self):
self.header = None
self.features = {}
self.sequence = None

def format(self):
try:
header = self.header.format()
except AttributeError:
raise ValueError("Could not format contig, no header data found")
feature_strings = [feature.format() for feature in self.sorted_features()]
features = "".join(feature_strings)
try:
sequence = self.sequence.format()
except AttributeError:
raise ValueError("Could not format contig, no sequence data found")
formatted_string = header + features + sequence
line_lengths = map(len, formatted_string.split('\n'))
maximum_line_length = max(line_lengths)
if maximum_line_length > 80:
raise ValueError("Could not format contig, a line exceeded 80 characters in length")
return formatted_string

def add_header(self, **kwargs):
if self.header != None:
raise ValueError("Contig already has header data")
header = EMBLHeader(**kwargs)
self.header = header

def add_feature(self, sequence_id, **kwargs):
feature = EMBLFeature(**kwargs)
unique_feature_reference = "{}_{}_{}_{}".format(sequence_id, feature.feature_type, feature.start, feature.end)
if unique_feature_reference in self.features:
# we're already seen a feature in this region so don't add another
return False
elif feature.format() == None:
# some feature types should be ignored; format() returns None in these cases
return False
else:
self.features[unique_feature_reference] = feature
return True

def add_sequence(self, sequence_string):
if self.sequence != None:
raise ValueError("Contig already has sequence data")
sequence = EMBLSequence(sequence_string)
self.sequence = sequence

def sorted_features(self):
# Features should be sorted by start and then by end irrespective of strand
def compare_features(feature_1, feature_2):
if feature_1.start < feature_2.start:
return -1
elif feature_1.start > feature_2.start:
return 1
else:
return feature_2.end - feature_1.end
return sorted(self.features.values(), cmp=compare_features)

class EMBLFeature(object):
inference_to_db_xref_map = {
'similar to AA sequence:UniProtKB': 'UniProtKB/Swiss-Prot',
'protein motif:Pfam': 'PFAM',
'protein motif:CLUSTERS': "CDD",
'protein motif:Cdd': "CDD",
'protein motif:TIGRFAMs': "TIGRFAM"
}

def __init__(self, feature_type, start, end, strand, feature_attributes,
locus_tag=None, translation_table=11):
# Picks a feature builder and builds the feature
# Most features are built with a default but some are either a little different or
# should just be ignored
feature_builder = self.pick_feature_builder(feature_type)
feature_builder(feature_type=feature_type, start=start, end=end, strand=strand,
feature_attributes=feature_attributes, locus_tag=locus_tag,
translation_table=translation_table)

def pick_feature_builder(self, feature_type):
feature_builders = {
'CDS': self.create_CDS_feature,
'source': self.create_source_feature,
'ncRNA': self.create_empty_feature
}
return feature_builders.get(feature_type, self.create_default_feature)

def create_default_feature(self, feature_type, start, end, strand, feature_attributes, locus_tag, translation_table):
self.feature_type = feature_type
self.start = start
self.end = end
self.strand = strand
self.locus_tag = locus_tag
self.translation_table = translation_table
self.attributes = []
for attribute_key, attribute_value in feature_attributes.items():
attribute_creator = self.lookup_attribute_creator(attribute_key)
new_attributes = attribute_creator(attribute_key, attribute_value)
self.attributes += new_attributes

def create_CDS_feature(self, **kwargs):
self.create_default_feature(**kwargs)
self.attributes += self.create_translation_table_attributes('transl_table', self.translation_table)

def create_source_feature(self, feature_type, start, end, strand, feature_attributes, locus_tag, translation_table):
self.feature_type = feature_type
self.start = start
self.end = end
self.strand = strand
self.locus_tag = locus_tag
self.translation_table = translation_table

organism = feature_attributes['organism']
db_xref = feature_attributes['db_xref']
note = feature_attributes['note']

# We hard code the order and composition of attributes for source features
# Source features are only created as part of the header
self.attributes = [("organism", organism), ("mol_type", "genomic DNA"), ("db_xref", db_xref), ("note", note)]

def create_empty_feature(self, **kwargs):
# Some features should be ignored. This is how this is done
self.format = lambda: None

def format(self):
coordinates = self.format_coordinates(self.start, self.end, self.strand)
header_string = "FT {feature_type: <16}{coordinates}".format( feature_type=self.feature_type,
coordinates=coordinates)
attribute_strings = [header_string]
for attribute_key,attribute_value in self.attributes:
attribute_strings.append(self.format_attribute(attribute_key, attribute_value))

return '\n'.join(attribute_strings) + '\n'

def format_attribute(self, key, value):
# Looks up a formatter for an attribute and formats the attribute
# Some attributes are formatted a little differently
formatter = self.lookup_attribute_formatter(key)
return formatter(key, value)

def lookup_attribute_formatter(self, attribute_type):
formatters = {
'transl_table': self.translation_table_attribute_formatter
}
return formatters.get(attribute_type, self.default_attribute_formatter)

def translation_table_attribute_formatter(self, key, value):
# transl_table attributes do not have their values in quotes
wrapper = TextWrapper()
wrapper.initial_indent='FT '
wrapper.subsequent_indent='FT '
wrapper.width=79
attribute_text_template='/{attribute_key}={attribute_value}'
attribute_text=attribute_text_template.format(attribute_key=key, attribute_value=value)
return wrapper.fill(attribute_text)

def default_attribute_formatter(self, key, value):
wrapper = TextWrapper()
wrapper.initial_indent='FT '
wrapper.subsequent_indent='FT '
wrapper.width=79
attribute_text_template='/{attribute_key}="{attribute_value}"'
attribute_text=attribute_text_template.format(attribute_key=key, attribute_value=value)
return wrapper.fill(attribute_text)

def format_coordinates(self, start, end, strand):
if strand == '-':
return "complement({start}..{end})".format(start=start, end=end)
else:
return "{start}..{end}".format(start=start, end=end)

def lookup_attribute_creator(self, attribute_key):
# These functions take attributes and reformat them into a list
# of (key, values) which are later formatted into strings by other
# methods. There is quite a lot of variation between these such as
# whether to keep more than one value for a given attribute type.
attribute_creator_table = {
'product': self.create_product_attributes,
'locus_tag': self.create_locus_tag_attributes,
'eC_number': self.create_EC_number_attributes,
'inference': self.create_inference_attributes,
'protein_id': self.ignore_attributes,
'ID': self.ignore_attributes
}
return attribute_creator_table.get(attribute_key, self.create_default_attributes)

def create_default_attributes(self, attribute_key, attribute_value):
all_attribute_values = attribute_value.split(',')
first_attribute_value = all_attribute_values[0]
return [(attribute_key, first_attribute_value)]

def create_product_attributes(self, attribute_key, attribute_value):
def remove_hypotheticals(value):
return value != 'hypothetical protein'
def replace_unknown_with_uncharacterised(value):
return value.replace("nknown","ncharacterised")
# attribute_value may be a comma deliminated list of values
# only some of which might be valid
attribute_values = attribute_value.split(',')
attribute_values = filter(remove_hypotheticals, attribute_values)
attribute_values = map(replace_unknown_with_uncharacterised, attribute_values)
chosen_value = attribute_values[0] if len(attribute_values) > 0 else 'Uncharacterised protein'
return [('product', chosen_value)]

def create_locus_tag_attributes(self, attribute_key, attribute_value):
if self.locus_tag == None:
return [('locus_tag', attribute_value)]
else:
attribute_value_suffix = attribute_value.split('_')[-1]
return [('locus_tag', "{}_{}".format(self.locus_tag, attribute_value_suffix))]

def create_EC_number_attributes(self, attribute_key, attribute_value):
attribute_values = attribute_value.split(',')
def deduplicate_values(values):
return list(set(values))
attribute_values = deduplicate_values(attribute_values)
return [('EC_number', value) for value in attribute_values]

def create_inference_attributes(self, attribute_key, attribute_value):
attribute_values = attribute_value.split(',')
attributes = []
for value in attribute_values:
if self.should_convert_to_db_xref(value):
attributes.append(('db_xref', self.convert_to_db_xref(value)))
else:
attributes.append(('inference', value))
return attributes

def ignore_attributes(self, attribute_key, attribute_value):
return []

def should_convert_to_db_xref(self, attribute_value):
for search_text in self.inference_to_db_xref_map:
if search_text in attribute_value:
return True
return False

def convert_to_db_xref(self, attribute_value):
for search_text, replacement_text in self.inference_to_db_xref_map.items():
if search_text in attribute_value:
return attribute_value.replace(search_text, replacement_text)
raise ValueError("Failed to convert inference attribute '%s' to db_xref" % attribute_value)

def create_translation_table_attributes(self, attribute_key, attribute_value):
return [('transl_table', attribute_value)]

class EMBLHeader(object):
def __init__(self,
authors="Pathogen Genomics",
classification="UNC",
genome_type="circular",
organism=None,
project="",
publication="Unpublished",
sequence_identifier="",
sequence_length="",
sequence_name=None,
taxon_id=None,
title="Draft assembly annotated with Prokka",
):
self.authors=authors
self.classification=classification
self.genome_type=genome_type
self.organism=organism
self.project=project
self.publication=publication
self.sequence_identifier=self.remove_non_word_characters(sequence_identifier)
self.sequence_length=sequence_length
self.sequence_name=sequence_name
self.taxon_id=taxon_id
self.title=title

source_attributes = self.build_source_attributes(organism, taxon_id, sequence_name)
self.source_feature = EMBLFeature(feature_type='source', start=1, end=sequence_length,
strand='+', feature_attributes=source_attributes)

self.header_template = """\
ID XXX; XXX; {genome_type}; genomic DNA; STD; {classification}; {sequence_length} BP.
XX
AC XXX;
XX
AC * _{sequence_identifier}
XX
PR Project:{project};
XX
DE XXX;
XX
RN [1]
RA {authors};
RT "{title}";
RL {publication}.
XX
FH Key Location/Qualifiers
FH
"""

def remove_non_word_characters(self, sequence_identifier):
return re.sub(r'\W+', '', sequence_identifier)

def format(self):
return self.header_template.format(**self.__dict__) + self.source_feature.format()

def build_source_attributes(self, organism, taxon_id, sequence_name):
def empty_string_if_none(value):
return value if value else ''
organism = empty_string_if_none(organism)
taxon_id = empty_string_if_none(taxon_id)
sequence_name = empty_string_if_none(sequence_name)
return {"organism": organism, "db_xref": "taxon:{}".format(taxon_id), "note": sequence_name}

class EMBLSequence(object):

def __init__(self, sequence_string):
nucleotide_counts = self.calculate_nucleotide_counts(sequence_string)
self.header = self.format_header(nucleotide_counts)
self.body = self.format_sequence_body(sequence_string)
self.length = len(sequence_string)

def format(self):
return self.header + '\n' + self.body

def calculate_nucleotide_counts(self, sequence):
sequence = sequence.lower()
counts = {}
counts['a'] = sequence.count('a')
counts['c'] = sequence.count('c')
counts['g'] = sequence.count('g')
counts['t'] = sequence.count('t')
count_of_acgt = sum(counts.values())
counts['other'] = len(sequence) - count_of_acgt
return counts

def format_header(self, nucleotide_counts):
template = "SQ Sequence {total} BP; {a} A; {c} C; {g} G; {t} T; {other} other;"
total_counts = sum(nucleotide_counts.values())
nucleotide_counts['total'] = total_counts
return template.format(**nucleotide_counts)

def format_sequence_body(self, sequence_string):
sequence_string = sequence_string.lower()
lines = self.split_sequence(sequence_string)
def format_a_line(line):
# a line looks like:
# (["1234567890", "12345", '', '', '', ''], 15)
# and should look like
# " 1234567890 12345 15"
blocks_of_sequence, end_of_line = line
format_arguments = blocks_of_sequence + [end_of_line]
return " {:<10} {:<10} {:<10} {:<10} {:<10} {:<10} {:>9}".format(*format_arguments)
formatted_lines = map(format_a_line, lines)
return '\n'.join(formatted_lines) + '\n'

def split_line_of_sequence(self, line_of_sequence):
# Turns "123456789012345" into ["1234567890", "12345", '', '', '', '']
splits = []
line_breaks = range(0, 60, 10)
for line_break in line_breaks:
split = line_of_sequence[line_break:line_break+10]
splits.append(split)
return splits

def split_sequence(self, sequence_string):
splits = []
sequence_length = len(sequence_string)
for start_of_line in range(0, sequence_length, 60):
# might not actually be the end of the line if the line isn't long enough
end_of_line = start_of_line + 60
line_of_sequence = sequence_string[start_of_line:end_of_line]
length_of_line = len(line_of_sequence)
end_of_line = start_of_line + length_of_line # actually end of the line
splits.append((self.split_line_of_sequence(line_of_sequence), end_of_line))
return splits
Loading