Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Profiler now computes codon variability instead of AA #809

Merged
merged 25 commits into from
Apr 19, 2018
Merged

Conversation

meren
Copy link
Member

@meren meren commented Apr 18, 2018

This PR implements codon-level variability profiling.

The engine AA continues to work seamlessly in anvi-gen-variability-profile, and amino acid frequencies are computes from codon frequency data. A new engine, CDN, is now also available.

These changes required an upgrade in the db version, but the PR contains a well-tested migration script.

'help': "Anvi'o can perform accurate characterization of codon frequencies in genes during profiling. While having\
codon frequencies opens doors to powerful evolutionary insights in downstream analyses, due to its\
computational complexity, this feature comes 'off' by default. Using this flag you can rise against the\
authority as you always should, and make anvi'o to profile codons."}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of "to profile codons" just say "profile codons". Would've changed it myself, but I'm on my phone

@@ -419,7 +438,7 @@ def insert_additional_fields(self, entry_ids=[]):
and should it be GlxSer or SerGlx? There are three rules that define our conventions:

1. Competing_aas ALWAYS appear in alphabetical order. Even if Cys is most common, and
Ala is second most commond, competing_aas = AlaCys.
Ala is second most commond, competing_aas = AlaCys.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Common, not commond

@ekiefl
Copy link
Contributor

ekiefl commented Apr 19, 2018

I tested that the branch was doing what it was supposed to by comparing the output of gen-variability-profile to master.

git fetch
git checkout -b codon-variability origin/codon-variability
anvio
cd tests
bash run_variability_mock.sh new
cd sandbox/test-output/
cp  -r * /Users/evan/Academics/Research/Meren/CODE_TESTS/CODON_VARIABILITY_EQUIVALENCE

git checkout master
cd ../..
bash run_variability_mock.sh continue
cd sandbox/test-output/

# make sure bams are equivalent
shasum *.bam
shasum /Users/evan/Academics/Research/Meren/CODE_TESTS/CODON_VARIABILITY_EQUIVALENCE/*.bam

# copy master outputs to folder
cp variability_AA.txt variability_AA_master.txt
cp variability_NT.txt variability_NT_master.txt
mv variability_AA_master.txt /Users/evan/Academics/Research/Meren/CODE_TESTS/CODON_VARIABILITY_EQUIVALENCE/variability_AA_master.txt
mv variability_NT_master.txt /Users/evan/Academics/Research/Meren/CODE_TESTS/CODON_VARIABILITY_EQUIVALENCE/variability_NT_master.txt


cd /Users/evan/Academics/Research/Meren/CODE_TESTS/CODON_VARIABILITY_EQUIVALENCE/

Then I ran the following Python script to test their equivalence:

import pandas as pd

pd.options.display.max_columns = 50
pd.options.display.max_rows = 50

aa_m = pd.read_csv("variability_AA_master.txt", sep="\t")
aa_v = pd.read_csv("variability_AA.txt", sep="\t")

nt_m = pd.read_csv("variability_NT_master.txt", sep="\t")
nt_v = pd.read_csv("variability_NT.txt", sep="\t")

cdn_v = pd.read_csv("variability_CDN.txt", sep="\t")

column_name_differences1 = [x for x in aa_m.columns if x not in aa_v.columns]
column_name_differences2 = [y for y in aa_v.columns if y not in aa_m.columns]
include_aa = [x for x in aa_m.columns if x not in set(column_name_differences1 + column_name_differences2)]
print("column name differences for aa")
print(column_name_differences1)
print(column_name_differences2)
print("\n\n")

column_name_differences1 = [x for x in nt_m.columns if x not in nt_v.columns]
column_name_differences2 = [y for y in nt_v.columns if y not in nt_m.columns]
include_nt = [x for x in nt_m.columns if x not in set(column_name_differences1 + column_name_differences2)]
print("column name differences for nt")
print(column_name_differences1)
print(column_name_differences2)
print("\n\n")

column_name_differences1 = [x for x in cdn_v.columns if x not in aa_v.columns]
column_name_differences2 = [y for y in aa_v.columns if y not in cdn_v.columns]
print("column name differences amino acids and codons in the variability branch")
print(column_name_differences1)
print(column_name_differences2)
print("\n\n")



print("aa: are columns shared between both equal?\n\n")
true = True
for col in include_aa:
    if not aa_m[col].equals(aa_v[col]):
        print("{} columns are not equal".format(col))
        if not aa_m[col].round(5).equals(aa_v[col].round(5)):
            print("{} is not equal between aa:".format(col))
            print("variation is observed at:\nindex\tmaster\tcodon".format(col))
            for index in aa_m[col].index:
                if aa_m[col].loc[index] != aa_v[col].loc[index]:
                    print(index, aa_m[col].loc[index], aa_v[col].loc[index])
            true = False
        else:
            print("but are equal after rounding to 5 decimal places")
print("so are all equal? {}\n\n".format(true))


print("nt: are columns shared between both equal?")
for col in include_nt:
    true = True
    if not nt_m[col].equals(nt_v[col]):
        print("{} is not equal between nt:".format(col))
        print(nt_m[col].head())
        print(nt_v[col].head())
        true = False
print("so are all equal? {}\n\n".format(true))


print("is the conversion between codons and amino acids correct? converting between the two using a different method...")
import anvio.constants as constants
counts = {}
for aa, cdns in constants.AA_to_codons.items():
    counts[aa] = cdn_v[cdns].sum(axis = 1)
counts = pd.DataFrame(counts)
print("well? {}".format(counts.equals(aa_v[constants.amino_acids])))

After a couple of commits to codon-variability the output is now:

column name differences for aa
[]
[]



column name differences for nt
[]
[]



column name differences amino acids and codons in the variability branch
['AAA', 'AAC', 'AAG', 'AAT', 'ACA', 'ACC', 'ACG', 'ACT', 'AGA', 'AGC', 'AGG', 'AGT', 'ATA', 'ATC', 'ATG', 'ATT', 'CAA', 'CAC', 'CAG', 'CAT', 'CCA', 'CCC', 'CCG', 'CCT', 'CGA', 'CGC', 'CGG', 'CGT', 'CTA', 'CTC', 'CTG', 'CTT', 'GAA', 'GAC', 'GAG', 'GAT', 'GCA', 'GCC', 'GCG', 'GCT', 'GGA', 'GGC', 'GGG', 'GGT', 'GTA', 'GTC', 'GTG', 'GTT', 'TAA', 'TAC', 'TAG', 'TAT', 'TCA', 'TCC', 'TCG', 'TCT', 'TGA', 'TGC', 'TGG', 'TGT', 'TTA', 'TTC', 'TTG', 'TTT', 'competing_codons']
['Ala', 'Arg', 'Asn', 'Asp', 'Cys', 'Gln', 'Glu', 'Gly', 'His', 'Ile', 'Leu', 'Lys', 'Met', 'Phe', 'Pro', 'STP', 'Ser', 'Thr', 'Trp', 'Tyr', 'Val', 'BLOSUM62', 'BLOSUM90', 'competing_aas', 'BLOSUM62_weighted', 'BLOSUM90_weighted']



aa: are columns shared between both equal?


departure_from_reference columns are not equal
but are equal after rounding to 5 decimal places
so are all equal? True


nt: are columns shared between both equal?
so are all equal? True


is the conversion between codons and amino acids correct? converting between the two using a different method...
well? True

@meren if you're happy with the changes merge it :)

@meren
Copy link
Member Author

meren commented Apr 19, 2018

Great! Let's take this to master, and see if we run into any issues, then.

@meren meren merged commit b0dd51b into master Apr 19, 2018
@meren meren deleted the codon-variability branch September 20, 2018 14:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants