Skip to content

Latest commit

 

History

History
executable file
·
215 lines (154 loc) · 13.1 KB

tutorial_2.md

File metadata and controls

executable file
·
215 lines (154 loc) · 13.1 KB

retrive metrics of specific branches in ubiquitous genes


In this tutorial we will compare two nested models through a Likelihood Ratio Test (LRT from now on): the general model allows a single omega value across all the branches of our phylogeny while the alternative model allows a specific clade to have a different omega value from the rest of the phylogeny, for each gene; subsequently we will extract omega - along with other metrics - for a clade of interest. This analysis will focus only on ubiquitous genes, in which a single-copy gene is present for each species. After moving to the toy-dataset folder we can quickly revise what's needed to start our analysis:

  • a folder containing aligned genes, in fasta format and with the .fa extesion. These alignment shouldn't have any STOP codon, yet BASE will report and exclude the genes wehre they are found. The fasta alignment headers have to be identical to the tips names in the species-tree, if provided.

  • a rooted species tree - in the newick format .nwk. It has to include all the species considered and no branchlengths are needed as they will be optimized for each gene through our analysis. The tips names in the phylogenetic tree have to be identical to the headers of the fasta alignments.

  • two codeml .ctl files, which describe the models we want to leverage in our analysis.

  • a file which specifies the branch groups that can have a different omega value - here only for the alternative model.

In the .ctl files, every parameter can still be modified, but the seqfile =, outfile = , treefile = , omega = fields should be left empty. When using your own data, remember to properly set the genetic code in the .ctl files. As stated before, in this analysis we will compare two branch models: a model where there is one omega shared by all branches and (branch model 0) and a model where some branches have their own omega (branch model 2). We can specify this two models to the workflow using the --model_g and --model_a flags, in which "g" and "a" stend respectively for general and alternative. The clade where we want to allow a different omega compared to the rest of the tree can be specified using a label file like this, where all the species in the clade are listed on a single line, followed by the appropriate codeml label.

lart lubb tcan tusa $1

We can start the analysis using 4 cores - using the --cores flag - and restricting the analysis to ubiquitous genes - using --ubiquitous:

sh ../BASE.sh --analyze --input _ubiquitous_genes/ --output _ubiquitous_genes_clade --s_tree spp_tree.nwk 
--model_g m0.ctl --model_a m2.ctl --cores 4 --labels clade.label --ubiquitous

Some information are printed to the standard output, including potential errors, as can be seen from the second-last line:


  analysis started on mer 23 giu 2021, 14.44.11, CEST

  analyizing 1 replicate(s) of 21 genes using the species-tree: branch models 0 VS 2 & site models 0 VS 0 

  analyze:	100% 

  performing LRT 

  formatting output 

  the analysis has produced some warning(s): you will find the information relative to each failure in the file warnings_summary.txt

  analysis finished on mer 23 giu 2021, 14.49.06, CEST 

Inside the output folder _ubiquitous_genes_clade you will find a warnings_summary.txt. If you cat the file, you can read all the warnings generated by the analysis, such as: the gene OG3105.fa is non-ubiquitous and has been excluded by the analyses This is because this gene misses one of the 7 species which are present in the analysis species tree; we'll see how to included such genes further in the tutorials. The analysis generated a codeml output (.out) for each gene: theses files names state whether the general or alternative model was the best-fit and - if specified - the best replicate (more on that in following tutorials). We'll need these outputs for the following step, but a rather important result is already the summary of the LRTs itself. Type column -t _ubiquitous_genes_clade/likelihood_summary.txt to take a look at it:

gene    branch_model_g  site_model_g  model_g_np  model_g_LnL    rep_g  branch_model_a  site_model_a  model_a_np  model_a_LnL    rep_a  nRF   LRT      df  p.value  significance
OG3126  0               0             2           -4132.017209   1      2               0             3           -4131.614596   1      0     0.8052   1   0.3695   n/s
OG3158  0               0             2           -3818.030135   1      2               0             3           -3817.939497   1      0.25  0.1813   1   0.6703   n/s
OG3164  0               0             2           -5374.980383   1      2               0             3           -5374.757647   1      0     0.4455   1   0.5045   n/s
OG3196  0               0             2           -3365.98055    1      2               0             3           -3360.350168   1      0     11.2608  1   8e-04    *
OG3197  0               0             2           -3680.783168   1      2               0             3           -3679.832723   1      0     1.9009   1   0.168    n/s
OG3302  0               0             2           -9054.676043   1      2               0             3           -9053.740476   1      0     1.8711   1   0.1713   n/s
OG3342  0               0             2           -3444.803353   1      2               0             3           -3443.289908   1      0     3.0269   1   0.0819   n/s
OG3347  0               0             2           -9597.616313   1      2               0             3           -9597.162687   1      0     0.9073   1   0.3408   n/s
OG3359  0               0             2           -3147.226951   1      2               0             3           -3147.149083   1      0     0.1557   1   0.6931   n/s
OG3362  0               0             2           -5722.861684   1      2               0             3           -5719.895191   1      0     5.933    1   0.0149   n/s
OG3372  0               0             2           -3710.675722   1      2               0             3           -3710.061445   1      0     1.2286   1   0.2677   n/s
OG3387  0               0             2           -4067.125785   1      2               0             3           -4067.11138    1      0     0.0288   1   0.8652   n/s
OG3395  0               0             2           -3901.33739    1      2               0             3           -3886.103971   1      0     30.4668  1   0        ***
OG3399  0               0             2           -3003.011552   1      2               0             3           -3002.977544   1      0     0.068    1   0.7942   n/s
OG3600  0               0             2           -2867.120583   1      2               0             3           -2865.770236   1      0     2.7007   1   0.1003   n/s
OG3622  0               0             2           -3407.557493   1      2               0             3           -3407.557441   1      0     1e-04    1   0.9919   n/s
OG3640  0               0             2           -3342.883393   1      2               0             3           -3342.860035   1      0     0.0467   1   0.8289   n/s
OG3648  0               0             2           -10491.352418  1      2               0             3           -10491.348213  1      0     0.0084   1   0.9269   n/s
OG3682  0               0             2           -3703.868866   1      2               0             3           -3702.208706   1      0     3.3203   1   0.0684   n/s
OG3683  0               0             2           -2791.481942   1      2               0             3           -2790.341601   1      0     2.2807   1   0.131    n/s

This output is self-explanatory and the more important information in this table are the result of the LRTs found in the last two columns.


Subsequently we can proceed to the further step: --extract. it takes as input the folder generated by the previous step - which containt the codeml outputs - and will generate an equivalent number of .annotation files, which contain for each OG:

  • a tree with internal nodes matching codeml outputs
  • the species which are associated to each branch

Type:

sh ../BASE.sh --extract --input _ubiquitous_genes_clades/

and here goes the standard output:


  analysis started on mer 23 giu 2021, 14.50.01, CEST 

  annotating 20 codeml outputs 

  annotate:	100% 

  annotation ok 

  label file is missing - relaunch the analysis with label(s) to extract specific branches/clades

  analysis finished on mer 23 giu 2021, 14.51.43, CEST 

As the second-last line tells us, perfroming this step without specifying any branch/clade will just annotate codeml outputs; to retrive the metrics relative to our branch/clade of interest we need to specify them using a file like this. The latter is substantially similar to the file used for labelling the clade in the analyze step: it contains on each line all the species associated to our branch/clade of interested - separated by single spaces - followed by a custom name instead of the codeml label. Here clade_of_interest is used but it can be changed to any name. Let's take a look to our labels file:

lart lubb tcan tusa clade_of_interest

We also need to specify a minimum number of species to consider a branch/clade: the latter can be expressed with an absolute number - such as 3 - or via a proportion - such as 0.9 - but when we want to retrive metrics only when all the species of our branch/clade of interest are presen we can just specify --min_spp x.

Let's extract the metrics relative to our branch of interest, using the line:

sh ../BASE.sh --extract --input _ubiquitous_genes_clades/ --labels clade_of_interest --min_spp x

This will print to the screen:


  analysis started on mer 23 giu 2021, 14.55.48, CEST 

  annotation ok 

  extracting 1 branches from 20 codeml output 

  extract clade_of..	 100% 
 
  the analysis has produced some warning(s): you will find the information relative to each failure in the file warnings_summary.txt 

  analysis finished on mer 23 giu 2021, 14.56.48, CEST 

The warning reported is the same generated in the analyze step, so we shouldn't care about it. The two separate --extract commands could have been carried out in a single step, by specifying since the beginning the labels file. If one wants to extract additional branches this step will be faster, as the workflow will recognize the .annotation files and skip that step. By typing column -t _ubiquitous_OGs_clades/extract.clade_of_interest.min.spp.4.dNdS.summary we can take a look at the output; as you can see each line of the label file generates a summary output - named with the identifyer and the treshold of missing species. This is the ouput for our clade:

branch/clade      gene    spp_n  dNdS    t      dN      dS
clade_of_interst  OG3126  4      0.0646  0.555  0.0432  0.6686
clade_of_interst  OG3158  4      0.0946  1.213  0.1278  1.3518
clade_of_interst  OG3164  4      0.0866  0.211  0.0235  0.2711
clade_of_interst  OG3196  4      0.1759  0.347  0.0532  0.3024
clade_of_interst  OG3197  4      0.1076  0.712  0.0767  0.7128
clade_of_interst  OG3302  4      0.0974  0.465  0.0519  0.5326
clade_of_interst  OG3342  4      0.0837  0.588  0.0573  0.6849
clade_of_interst  OG3347  4      0.0697  0.572  0.0492  0.7064
clade_of_interst  OG3359  4      0.0272  0.352  0.0118  0.4355
clade_of_interst  OG3362  4      0.1145  0.530  0.0618  0.5399
clade_of_interst  OG3372  4      0.0231  0.484  0.0163  0.7062
clade_of_interst  OG3387  4      0.1054  0.704  0.0711  0.6749
clade_of_interst  OG3395  4      0.1059  0.140  0.0159  0.1497
clade_of_interst  OG3399  4      0.0919  0.628  0.0643  0.6994
clade_of_interst  OG3600  4      0.1416  0.375  0.0530  0.3741
clade_of_interst  OG3622  4      0.0684  0.807  0.0657  0.9611
clade_of_interst  OG3640  4      0.1045  0.963  0.1087  1.0395
clade_of_interst  OG3648  4      0.1708  0.651  0.1002  0.5867
clade_of_interst  OG3682  4      0.1242  0.342  0.0430  0.3461
clade_of_interst  OG3683  4      0.0993  0.263  0.0289  0.2915

This table is rather self explanatory and represent the endpoint of our analysis.


While throughout this tutorial a very simplemodel has been used, BASE allows to use all the possibility of codeml, including

  • use multiple labels in the same analyses - both branch (#) and clade ($) ones
  • compare a general and alternative model which both have labels
  • use branch, site and branch-site models

You can find more information on additional models and labelling strategies here.


Back to the tutorials list.