Profiler now computes codon variability instead of AA #809

meren · 2018-04-18T03:26:54Z

This PR implements codon-level variability profiling.

The engine AA continues to work seamlessly in anvi-gen-variability-profile, and amino acid frequencies are now computed from codon frequency data. A new engine, CDN, is also available.
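
To illustrate what "amino acid frequencies computed from codon frequency data" means, here is a minimal sketch. It is not the anvi'o implementation: the `CODON_TO_AA` dictionary covers only a few codons of the standard genetic code, and `codon_counts_to_aa_counts` is a hypothetical helper name used only for this example.

```python
from collections import Counter

# Partial codon -> amino acid map (standard genetic code), for illustration only.
CODON_TO_AA = {
    "GCA": "Ala", "GCC": "Ala", "GCG": "Ala", "GCT": "Ala",
    "TGC": "Cys", "TGT": "Cys",
    "ATG": "Met",
    "TAA": "STP", "TAG": "STP", "TGA": "STP",
}


def codon_counts_to_aa_counts(codon_counts):
    """Collapse per-codon counts into per-amino-acid counts (hypothetical helper)."""
    aa_counts = Counter()
    for codon, count in codon_counts.items():
        # Synonymous codons add up under the same amino acid.
        aa_counts[CODON_TO_AA[codon]] += count
    return dict(aa_counts)


example = {"GCA": 10, "GCC": 5, "TGT": 2, "ATG": 7, "TAA": 1}
print(codon_counts_to_aa_counts(example))
```

Because the mapping from codons to amino acids is many-to-one, the codon table carries strictly more information: amino acid counts can always be recovered from it, but not the other way around.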

These changes required an upgrade in the db version, but the PR contains a well-tested migration script.

--profile-AA-frequencies goes, and --profile-SCVs comes.

totally irrelevant summarizer fix in codon-variability branch. meren is doing embarrassing things. this fix is due to the fact that we no longer keep num mapped reads in the self table.

this migration script will remove the SAAVs table :( so it will hurt a lot of people.

Although the previous config files will need to be fixed now :/
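
For readers unfamiliar with what such a migration involves, the following is a rough sketch of the general pattern (drop an obsolete table, then bump a stored version). It is not the migration script in this PR: the table name `old_aa_table`, the version numbers, and the assumption that the version lives in a key/value table named `self` are all made up for the demo.

```python
import sqlite3


def migrate(conn, old_table, new_version):
    """Drop an obsolete table and bump the version kept in a key/value table."""
    cur = conn.cursor()
    # Table names cannot be bound as SQL parameters, so quote the identifier.
    quoted = '"{}"'.format(old_table.replace('"', '""'))
    cur.execute("DROP TABLE IF EXISTS {}".format(quoted))
    cur.execute("UPDATE self SET value = ? WHERE key = 'version'", (new_version,))
    conn.commit()


# Demo on an in-memory database with a made-up schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE self (key TEXT, value TEXT)")
conn.execute("INSERT INTO self VALUES ('version', '1')")
conn.execute("CREATE TABLE old_aa_table (entry_id INTEGER)")

migrate(conn, old_table="old_aa_table", new_version="2")

tables = [row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")]
version = conn.execute("SELECT value FROM self WHERE key = 'version'").fetchone()[0]
print(tables, version)
```

A migration of this kind is destructive: once the old table is dropped its contents are gone, which is why losing the SAAVs table is worth flagging to users before they upgrade.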

ShaiberAlon · 2018-04-19T00:13:07Z

anvio/__init__.py

+             'help': "Anvi'o can perform accurate characterization of codon frequencies in genes during profiling. While having\
+                      codon frequencies opens doors to powerful evolutionary insights in downstream analyses, due to its\
+                      computational complexity, this feature comes 'off' by default. Using this flag you can rise against the\
+                      authority as you always should, and make anvi'o to profile codons."}


Instead of "to profile codons" just say "profile codons". Would've changed it myself, but I'm on my phone.

ShaiberAlon · 2018-04-19T00:22:35Z

anvio/variabilityops.py

@@ -419,7 +438,7 @@ def insert_additional_fields(self, entry_ids=[]):
        and should it be GlxSer or SerGlx? There are three rules that define our conventions:

            1. Competing_aas ALWAYS appear in alphabetical order. Even if Cys is most common, and
-               Ala is second most commond, competing_aas = AlaCys.  
+               Ala is second most commond, competing_aas = AlaCys.


Common, not commond


ekiefl · 2018-04-19T17:38:14Z

I tested that the branch was doing what it was supposed to by comparing the output of gen-variability-profile to master.

git fetch
git checkout -b codon-variability origin/codon-variability
anvio
cd tests
bash run_variability_mock.sh new
cd sandbox/test-output/
cp -r * /Users/evan/Academics/Research/Meren/CODE_TESTS/CODON_VARIABILITY_EQUIVALENCE

git checkout master
cd ../..
bash run_variability_mock.sh continue
cd sandbox/test-output/

# make sure bams are equivalent
shasum *.bam
shasum /Users/evan/Academics/Research/Meren/CODE_TESTS/CODON_VARIABILITY_EQUIVALENCE/*.bam

# copy master outputs to folder
cp variability_AA.txt variability_AA_master.txt
cp variability_NT.txt variability_NT_master.txt
mv variability_AA_master.txt /Users/evan/Academics/Research/Meren/CODE_TESTS/CODON_VARIABILITY_EQUIVALENCE/variability_AA_master.txt
mv variability_NT_master.txt /Users/evan/Academics/Research/Meren/CODE_TESTS/CODON_VARIABILITY_EQUIVALENCE/variability_NT_master.txt


cd /Users/evan/Academics/Research/Meren/CODE_TESTS/CODON_VARIABILITY_EQUIVALENCE/

Then I ran the following Python script to test their equivalence:

import pandas as pd

pd.options.display.max_columns = 50
pd.options.display.max_rows = 50

aa_m = pd.read_csv("variability_AA_master.txt", sep="\t")
aa_v = pd.read_csv("variability_AA.txt", sep="\t")

nt_m = pd.read_csv("variability_NT_master.txt", sep="\t")
nt_v = pd.read_csv("variability_NT.txt", sep="\t")

cdn_v = pd.read_csv("variability_CDN.txt", sep="\t")

column_name_differences1 = [x for x in aa_m.columns if x not in aa_v.columns]
column_name_differences2 = [y for y in aa_v.columns if y not in aa_m.columns]
include_aa = [x for x in aa_m.columns if x not in set(column_name_differences1 + column_name_differences2)]
print("column name differences for aa")
print(column_name_differences1)
print(column_name_differences2)
print("\n\n")

column_name_differences1 = [x for x in nt_m.columns if x not in nt_v.columns]
column_name_differences2 = [y for y in nt_v.columns if y not in nt_m.columns]
include_nt = [x for x in nt_m.columns if x not in set(column_name_differences1 + column_name_differences2)]
print("column name differences for nt")
print(column_name_differences1)
print(column_name_differences2)
print("\n\n")

column_name_differences1 = [x for x in cdn_v.columns if x not in aa_v.columns]
column_name_differences2 = [y for y in aa_v.columns if y not in cdn_v.columns]
print("column name differences amino acids and codons in the variability branch")
print(column_name_differences1)
print(column_name_differences2)
print("\n\n")



print("aa: are columns shared between both equal?\n\n")
all_equal = True
for col in include_aa:
    if not aa_m[col].equals(aa_v[col]):
        print("{} columns are not equal".format(col))
        # tolerate tiny floating point differences before declaring a mismatch
        if not aa_m[col].round(5).equals(aa_v[col].round(5)):
            print("{} is not equal between aa:".format(col))
            print("variation is observed at:\nindex\tmaster\tcodon")
            for index in aa_m[col].index:
                if aa_m[col].loc[index] != aa_v[col].loc[index]:
                    print(index, aa_m[col].loc[index], aa_v[col].loc[index])
            all_equal = False
        else:
            print("but are equal after rounding to 5 decimal places")
print("so are all equal? {}\n\n".format(all_equal))


print("nt: are columns shared between both equal?")
# set the flag once, outside the loop, so a mismatch in an early column
# is not overwritten by a later column that matches
all_equal = True
for col in include_nt:
    if not nt_m[col].equals(nt_v[col]):
        print("{} is not equal between nt:".format(col))
        print(nt_m[col].head())
        print(nt_v[col].head())
        all_equal = False
print("so are all equal? {}\n\n".format(all_equal))


print("is the conversion between codons and amino acids correct? converting between the two using a different method...")
import anvio.constants as constants
counts = {}
for aa, cdns in constants.AA_to_codons.items():
    counts[aa] = cdn_v[cdns].sum(axis = 1)
counts = pd.DataFrame(counts)
print("well? {}".format(counts.equals(aa_v[constants.amino_acids])))

After a couple of commits to codon-variability the output is now:

column name differences for aa
[]
[]



column name differences for nt
[]
[]



column name differences amino acids and codons in the variability branch
['AAA', 'AAC', 'AAG', 'AAT', 'ACA', 'ACC', 'ACG', 'ACT', 'AGA', 'AGC', 'AGG', 'AGT', 'ATA', 'ATC', 'ATG', 'ATT', 'CAA', 'CAC', 'CAG', 'CAT', 'CCA', 'CCC', 'CCG', 'CCT', 'CGA', 'CGC', 'CGG', 'CGT', 'CTA', 'CTC', 'CTG', 'CTT', 'GAA', 'GAC', 'GAG', 'GAT', 'GCA', 'GCC', 'GCG', 'GCT', 'GGA', 'GGC', 'GGG', 'GGT', 'GTA', 'GTC', 'GTG', 'GTT', 'TAA', 'TAC', 'TAG', 'TAT', 'TCA', 'TCC', 'TCG', 'TCT', 'TGA', 'TGC', 'TGG', 'TGT', 'TTA', 'TTC', 'TTG', 'TTT', 'competing_codons']
['Ala', 'Arg', 'Asn', 'Asp', 'Cys', 'Gln', 'Glu', 'Gly', 'His', 'Ile', 'Leu', 'Lys', 'Met', 'Phe', 'Pro', 'STP', 'Ser', 'Thr', 'Trp', 'Tyr', 'Val', 'BLOSUM62', 'BLOSUM90', 'competing_aas', 'BLOSUM62_weighted', 'BLOSUM90_weighted']



aa: are columns shared between both equal?


departure_from_reference columns are not equal
but are equal after rounding to 5 decimal places
so are all equal? True


nt: are columns shared between both equal?
so are all equal? True


is the conversion between codons and amino acids correct? converting between the two using a different method...
well? True
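
The "equal after rounding to 5 decimal places" result for departure_from_reference is what you would expect when the same quantity is computed along two different floating point paths. As a side note, a tolerance-based comparison can express the same check more directly. The sketch below uses only the standard library; `compare_columns` is a hypothetical helper and the `1e-5` tolerance is chosen here just to mirror the rounding above.

```python
import math


def compare_columns(master, branch, abs_tol=1e-5):
    """Classify two equal-length numeric sequences (hypothetical helper)."""
    if len(master) != len(branch):
        return "different"
    if all(m == b for m, b in zip(master, branch)):
        return "identical"
    # Accept differences no larger than abs_tol in absolute terms.
    if all(math.isclose(m, b, rel_tol=0.0, abs_tol=abs_tol) for m, b in zip(master, branch)):
        return "equal within tolerance"
    return "different"


# 0.1 + 0.2 is not exactly 0.3 in binary floating point.
print(compare_columns([0.1 + 0.2, 0.5], [0.3, 0.5]))
print(compare_columns([0.3, 0.5], [0.3, 0.5]))
print(compare_columns([0.3, 0.5], [0.31, 0.5]))
```

With pandas columns, the values could be passed in as `series.tolist()`. Note that a tolerance check and rounding are not strictly equivalent: two values that straddle a rounding boundary can round differently even when they are very close.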

@meren if you're happy with the changes merge it :)

meren · 2018-04-19T18:39:21Z

Great! Let's take this to master, and see if we run into any issues, then.

meren added 20 commits April 17, 2018 20:59

update the community parameters

cf5ccc5

--profile-AA-frequencies goes, and --profile-SCVs comes.

cosmetics

ced1b44

SSMs for codons engine

3510d35

AAFrequencies -> CodonFrequencies

6270f0a

anvi-get-aa-frequencies -> anvi-get-codon-frequencies

7dd8c5a

update parameter.

a7afbfd

merger, profiler, and summarizer works with codon variability table

358f896

variable_aas_table goes, variable_codons_table comes

3014e16

Profile super knows about additional layer data

2d5ef0d

fixy fix

a1ae969

totally irrelevant summarizer fix in codon-variability branch. meren is doing embarrassing things. this fix is due to the fact that we no longer keep num mapped reads in the self table.

variabilityops can now work with AA and CDN engines seamlessly

2aa327b

examples in run all tests for all engines are now present.

1144726

much cleaner

0542d2d

bump the profile db version

ec66922

this migration script will remove the SAAVs table :( so it will hurt a lot of people.

fixy

ee19c39

language.

f2da6ab

update Snakemake files as well.

2c775a7

Although the previous config files will need to be fixed now :/

fixy fix.

25fd401

Merge branch 'master' into codon-variability

b490b14

gene_call_id. no.

1e201e4

meren requested review from ozcan, ekiefl and ShaiberAlon April 18, 2018 03:26

ekiefl and others added 3 commits April 18, 2018 11:34

add --engine cdn to run_variability_mock.sh

1880a9d

add missing columns, add conversion function to aa engine for dfr

b633ae8

better wording

819c92b

ShaiberAlon reviewed Apr 19, 2018


ekiefl added 2 commits April 19, 2018 12:22

vectorize row-by-row sum operation

683a9da

Merge branch 'codon-variability' of https://github.com/meren/anvio into codon-variability pull "better wording" changes

d1a6d7c

meren merged commit b0dd51b into master Apr 19, 2018

meren deleted the codon-variability branch September 20, 2018 14:00
