Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

translating protein from anvi-get-dna-sequences-for-hmm-hits #400

Closed
jbird9 opened this issue Sep 12, 2016 · 3 comments
Closed

translating protein from anvi-get-dna-sequences-for-hmm-hits #400

jbird9 opened this issue Sep 12, 2016 · 3 comments
Assignees

Comments

@jbird9
Copy link

jbird9 commented Sep 12, 2016

Most of the time concatenated gene trees are made from aa sequences.

Here is a quick ruby script to translate sequences to aa and keep the same output for.

#!/usr/bin/env ruby

require 'bio'

file = Bio::FastaFormat.open(ARGV.shift)
file.each do |entry|
    prots = []
    puts ">#{entry.definition}"
    seq = Bio::Sequence::NA.new("#{entry.seq}")
    codon_table = Bio::CodonTable[11]
    prot = seq.translate(1, codon_table)
    prots.push(prot)
    prot = seq.translate(2, codon_table)
    prots.push(prot)
    prot = seq.translate(3, codon_table)
    prots.push(prot)
    prot = seq.translate(-1, codon_table)
    prots.push(prot)
    prot = seq.translate(-2, codon_table)
    prots.push(prot)
    prot = seq.translate(-3, codon_table)
    prots.push(prot)
    full = prots.select{ |p| p[/^M._\_$/] }
    puts full
end

Disclaimer: I should probably add a check on the length of protein for avoid any overlap between starts with M ends with * and the real protein but it worked for my proposes. Let me know if there was already a solution to this I missed.

@jbird9 jbird9 closed this as completed Sep 12, 2016
@meren
Copy link
Member

meren commented Sep 12, 2016

Hi Jordan,

Would an anvi-get-aa-sequences-for-hmm-hits help? OK. Maybe I should implement one anyway. I am reopening this.

Take care,

@meren meren reopened this Sep 12, 2016
@meren meren self-assigned this Sep 12, 2016
@jbird9
Copy link
Author

jbird9 commented Sep 12, 2016

Yeah, I think other users might find it useful. It is probably best to integrate a anvi-get-aa-sequences-for-hmm-hits into the existing codebase due to issues with dependencies. My code requires ruby and the ruby gem bio. But for now users can use the following code to translate there anvi-get-dna-sequences-for-hmm-hits output into aa.

#!/usr/bin/env ruby

require 'bio'

def length_finder(input_array)
    a = []
    input_array.each do |n|
        a << n.length.to_f
    end
    a
end

file = Bio::FastaFormat.open(ARGV.shift)
file.each do |entry|
    n=0
    prots = []
    stopss = []
    protlens = []
    puts ">#{entry.definition}"
    seq = Bio::Sequence::NA.new("#{entry.seq}")
    codon_table = Bio::CodonTable[11]
    prot = seq.translate(1, codon_table)
    stops = prot.scan(/\*/).count
    stopss.push(stops.to_i)
    prots.push(prot)
    prot = seq.translate(2, codon_table)
    stops = prot.scan(/\*/).count
        stopss.push(stops.to_i)
    prots.push(prot)
    prot = seq.translate(3, codon_table)
    stops = prot.scan(/\*/).count
        stopss.push(stops.to_i)
    prots.push(prot)
    prot = seq.translate(-1, codon_table)
    stops = prot.scan(/\*/).count
        stopss.push(stops.to_i)
    prots.push(prot)
    prot = seq.translate(-2, codon_table)
    stops = prot.scan(/\*/).count
        stopss.push(stops.to_i)
    prots.push(prot)
    prot = seq.translate(-3, codon_table)
    stops = prot.scan(/\*/).count
        stopss.push(stops.to_i)
    prots.push(prot)
    lennuc = entry.definition.scan(/\d+$/).join('')
        protlen = lennuc.to_f / 3
    protlens = length_finder(prots)
    full = ''
    part = ''
    for n in 0..5 do
        if (stopss[n] == 1) && (protlens[n] == protlen) then
            puts prots[n]
        elsif (stopss[n] == 0) && (protlens[n] == protlen) then
            puts prots[n]
        end
    end
end

@meren
Copy link
Member

meren commented Nov 22, 2016

This is done! In the new version the program will be called anvi-get-sequences-for-hmm-hits, and it will return amino acid sequences instead of DNA sequences when the flag --get-aa-seqeunces is used.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants