retrive metrics of specific branches in ubiquitous genes
In this tutorial we will compare two nested models through a Likelihood Ratio Test (LRT from now on): the general model allows a single omega value across all the branches of our phylogeny while the alternative model allows a specific clade to have a different omega value from the rest of the phylogeny, for each gene; subsequently we will extract omega - along with other metrics - for a clade of interest. This analysis will focus only on ubiquitous genes, in which a single-copy gene is present for each species. After moving to the toy-dataset folder we can quickly revise what's needed to start our analysis:
-
a folder containing aligned genes, in fasta format and with the
.fa
extesion. These alignment shouldn't have any STOP codon, yet BASE will report and exclude the genes wehre they are found. The fasta alignment headers have to be identical to the tips names in the species-tree, if provided. -
a rooted species tree - in the newick format
.nwk
. It has to include all the species considered and no branchlengths are needed as they will be optimized for each gene through our analysis. The tips names in the phylogenetic tree have to be identical to the headers of the fasta alignments. -
two codeml
.ctl
files, which describe the models we want to leverage in our analysis. -
a file which specifies the branch groups that can have a different omega value - here only for the alternative model.
In the .ctl
files, every parameter can still be modified, but the seqfile =
, outfile =
, treefile =
, omega =
fields should be left empty.
When using your own data, remember to properly set the genetic code in the .ctl
files.
As stated before, in this analysis we will compare two branch models:
a model where there is one omega shared by all branches and (branch model 0) and
a model where some branches have their own omega (branch model 2).
We can specify this two models to the workflow using the --model_g
and --model_a
flags, in which "g" and "a" stend respectively for general and alternative.
The clade where we want to allow a different omega compared to the rest of the tree can be specified using a label file like this,
where all the species in the clade are listed on a single line, followed by the appropriate codeml label.
lart lubb tcan tusa $1
We can start the analysis using 4 cores - using the --cores
flag - and restricting the analysis to ubiquitous genes - using --ubiquitous
:
sh ../BASE.sh --analyze --input _ubiquitous_genes/ --output _ubiquitous_genes_clade --s_tree spp_tree.nwk
--model_g m0.ctl --model_a m2.ctl --cores 4 --labels clade.label --ubiquitous
Some information are printed to the standard output, including potential errors, as can be seen from the second-last line:
analysis started on mer 23 giu 2021, 14.44.11, CEST
analyizing 1 replicate(s) of 21 genes using the species-tree: branch models 0 VS 2 & site models 0 VS 0
analyze: 100%
performing LRT
formatting output
the analysis has produced some warning(s): you will find the information relative to each failure in the file warnings_summary.txt
analysis finished on mer 23 giu 2021, 14.49.06, CEST
Inside the output folder _ubiquitous_genes_clade
you will find a warnings_summary.txt
.
If you cat
the file, you can read all the warnings generated by the analysis, such as:
the gene OG3105.fa is non-ubiquitous and has been excluded by the analyses
This is because this gene misses one of the 7 species which are present in the analysis species tree; we'll see how to included such genes further in the tutorials.
The analysis generated a codeml output (.out
) for each gene: theses files names state whether the general or alternative model was
the best-fit and - if specified - the best replicate (more on that in following tutorials). We'll need these outputs for the following step, but a
rather important result is already the summary of the LRTs itself. Type column -t _ubiquitous_genes_clade/likelihood_summary.txt
to take a look at it:
gene branch_model_g site_model_g model_g_np model_g_LnL rep_g branch_model_a site_model_a model_a_np model_a_LnL rep_a nRF LRT df p.value significance
OG3126 0 0 2 -4132.017209 1 2 0 3 -4131.614596 1 0 0.8052 1 0.3695 n/s
OG3158 0 0 2 -3818.030135 1 2 0 3 -3817.939497 1 0.25 0.1813 1 0.6703 n/s
OG3164 0 0 2 -5374.980383 1 2 0 3 -5374.757647 1 0 0.4455 1 0.5045 n/s
OG3196 0 0 2 -3365.98055 1 2 0 3 -3360.350168 1 0 11.2608 1 8e-04 *
OG3197 0 0 2 -3680.783168 1 2 0 3 -3679.832723 1 0 1.9009 1 0.168 n/s
OG3302 0 0 2 -9054.676043 1 2 0 3 -9053.740476 1 0 1.8711 1 0.1713 n/s
OG3342 0 0 2 -3444.803353 1 2 0 3 -3443.289908 1 0 3.0269 1 0.0819 n/s
OG3347 0 0 2 -9597.616313 1 2 0 3 -9597.162687 1 0 0.9073 1 0.3408 n/s
OG3359 0 0 2 -3147.226951 1 2 0 3 -3147.149083 1 0 0.1557 1 0.6931 n/s
OG3362 0 0 2 -5722.861684 1 2 0 3 -5719.895191 1 0 5.933 1 0.0149 n/s
OG3372 0 0 2 -3710.675722 1 2 0 3 -3710.061445 1 0 1.2286 1 0.2677 n/s
OG3387 0 0 2 -4067.125785 1 2 0 3 -4067.11138 1 0 0.0288 1 0.8652 n/s
OG3395 0 0 2 -3901.33739 1 2 0 3 -3886.103971 1 0 30.4668 1 0 ***
OG3399 0 0 2 -3003.011552 1 2 0 3 -3002.977544 1 0 0.068 1 0.7942 n/s
OG3600 0 0 2 -2867.120583 1 2 0 3 -2865.770236 1 0 2.7007 1 0.1003 n/s
OG3622 0 0 2 -3407.557493 1 2 0 3 -3407.557441 1 0 1e-04 1 0.9919 n/s
OG3640 0 0 2 -3342.883393 1 2 0 3 -3342.860035 1 0 0.0467 1 0.8289 n/s
OG3648 0 0 2 -10491.352418 1 2 0 3 -10491.348213 1 0 0.0084 1 0.9269 n/s
OG3682 0 0 2 -3703.868866 1 2 0 3 -3702.208706 1 0 3.3203 1 0.0684 n/s
OG3683 0 0 2 -2791.481942 1 2 0 3 -2790.341601 1 0 2.2807 1 0.131 n/s
This output is self-explanatory and the more important information in this table are the result of the LRTs found in the last two columns.
Subsequently we can proceed to the further step: --extract
. it
takes as input the folder generated by the previous step - which containt the codeml outputs - and will generate
an equivalent number of .annotation
files, which contain for each OG:
- a tree with internal nodes matching codeml outputs
- the species which are associated to each branch
Type:
sh ../BASE.sh --extract --input _ubiquitous_genes_clades/
and here goes the standard output:
analysis started on mer 23 giu 2021, 14.50.01, CEST
annotating 20 codeml outputs
annotate: 100%
annotation ok
label file is missing - relaunch the analysis with label(s) to extract specific branches/clades
analysis finished on mer 23 giu 2021, 14.51.43, CEST
As the second-last line tells us, perfroming this step without specifying any branch/clade will just annotate codeml outputs;
to retrive the metrics relative to our branch/clade of interest we need to specify them using a file like this.
The latter is substantially similar to the file used for labelling the clade in the analyze
step:
it contains on each line all the species associated to our branch/clade of interested - separated by single spaces - followed by a custom name instead of the codeml label.
Here clade_of_interest
is used but it can be changed to any name. Let's take a look to our labels file:
lart lubb tcan tusa clade_of_interest
We also need to specify a minimum number of species to consider a branch/clade: the latter can be expressed with an
absolute number - such as 3 - or via a proportion - such as 0.9 - but when we want to retrive metrics only when all the species of our branch/clade of interest
are presen we can just specify --min_spp x
.
Let's extract the metrics relative to our branch of interest, using the line:
sh ../BASE.sh --extract --input _ubiquitous_genes_clades/ --labels clade_of_interest --min_spp x
This will print to the screen:
analysis started on mer 23 giu 2021, 14.55.48, CEST
annotation ok
extracting 1 branches from 20 codeml output
extract clade_of.. 100%
the analysis has produced some warning(s): you will find the information relative to each failure in the file warnings_summary.txt
analysis finished on mer 23 giu 2021, 14.56.48, CEST
The warning reported is the same generated in the analyze step, so we shouldn't care about it.
The two separate --extract
commands could have been carried out in a single step, by specifying since the beginning the labels file.
If one wants to extract additional branches this step will be faster,
as the workflow will recognize the .annotation
files and skip that step.
By typing column -t _ubiquitous_OGs_clades/extract.clade_of_interest.min.spp.4.dNdS.summary
we can take a look at the output; as you can see
each line of the label file generates a summary output - named with the identifyer and the treshold of missing species. This is the ouput for our clade:
branch/clade gene spp_n dNdS t dN dS
clade_of_interst OG3126 4 0.0646 0.555 0.0432 0.6686
clade_of_interst OG3158 4 0.0946 1.213 0.1278 1.3518
clade_of_interst OG3164 4 0.0866 0.211 0.0235 0.2711
clade_of_interst OG3196 4 0.1759 0.347 0.0532 0.3024
clade_of_interst OG3197 4 0.1076 0.712 0.0767 0.7128
clade_of_interst OG3302 4 0.0974 0.465 0.0519 0.5326
clade_of_interst OG3342 4 0.0837 0.588 0.0573 0.6849
clade_of_interst OG3347 4 0.0697 0.572 0.0492 0.7064
clade_of_interst OG3359 4 0.0272 0.352 0.0118 0.4355
clade_of_interst OG3362 4 0.1145 0.530 0.0618 0.5399
clade_of_interst OG3372 4 0.0231 0.484 0.0163 0.7062
clade_of_interst OG3387 4 0.1054 0.704 0.0711 0.6749
clade_of_interst OG3395 4 0.1059 0.140 0.0159 0.1497
clade_of_interst OG3399 4 0.0919 0.628 0.0643 0.6994
clade_of_interst OG3600 4 0.1416 0.375 0.0530 0.3741
clade_of_interst OG3622 4 0.0684 0.807 0.0657 0.9611
clade_of_interst OG3640 4 0.1045 0.963 0.1087 1.0395
clade_of_interst OG3648 4 0.1708 0.651 0.1002 0.5867
clade_of_interst OG3682 4 0.1242 0.342 0.0430 0.3461
clade_of_interst OG3683 4 0.0993 0.263 0.0289 0.2915
This table is rather self explanatory and represent the endpoint of our analysis.
While throughout this tutorial a very simplemodel has been used, BASE allows to use all the possibility of codeml, including
- use multiple labels in the same analyses - both branch (#) and clade ($) ones
- compare a general and alternative model which both have labels
- use branch, site and branch-site models
You can find more information on additional models and labelling strategies here.
Back to the tutorials list.