cgb is a Python library for comparative genomics of transcriptional
regulation in Bacteria. This repository contains the core library and the
graphical interface code for the comparative genomics platform.
If using CGB in your research, please cite:
Kılıç, S., Sánchez-Osuna, M., Collado-Padilla, A., Barbé, J. and Erill, I. Flexible comparative genomics of prokaryotic transcriptional regulatory networks. BMC Genomics 21, 466 (2020). https://doi.org/10.1186/s12864-020-06838-x
Given binding site evidence from one or more reference organisms, and a set of
target genomes of interest, cgb can be used to
- predict operons in each target genome
- identify putative binding sites on promoter regions of each target genome
- compute the posterior probability of regulation of each operon
- detect orthologs between organisms and create orthologous groups of genes
- perform ancestral state reconstruction on orthologous groups to analyze the evolution of transcriptional regulation of each gene
cgb runs on Python 2.7 and depends on a few packages listed in
requirements.txt. All dependencies can be installed using pip:
pip install -r requirements.txt
clustalo
blast
import cgb
json_input_file = 'test_input.json' # See below for the format
cgb.go(json_input_file)
cgb expects the input in JSON format. Below is a sample input file followed
by descriptions for each field.
{
  "TF": "LexA",
  "motifs": [
    {
      "protein_accession": "NP_217236.2",
      "sites": [
        "AAATCGAACATGTGTTCGAGTA",
        "GTCTCGAACATGTGTTCGAGAA",
        "GTATCGAACAATTGTTCGATAT",
        "GAATCAAACATGTGTTCGACAG",
        "TATTCGAACATGTATTCGAGTA"
      ]
    },
    {
      "protein_accession": "WP_003857389.1",
      "sites": [
        "TATGCGAACGTTTTTTCTAAAT",
        "TGATCGCAATTGTGTGCTAAAA",
        "TATTAAAACACTTGTTCTAAAC",
        "TAGTCGAACATGTGAACGGTAT",
        "AATACTGACAGAGGTTCGAATA",
        "ATCTCGAACACTCGTACCATTT",
        "ATTTCGAACAGTTGTGCGTGTA",
        "TATTCGAAAACTTTTCCGATCA",
        "TCCTCAAAAAAGTGGTCTAATG"
      ]
    }
  ],
  "genomes": [
    {
      "name": "ace",
      "accession_numbers": ["NC_008578.1"]
    },
    {
      "name": "cgl",
      "accession_numbers": ["NC_003450.3"]
    },
    {
      "name": "lxy",
      "accession_numbers": ["NC_006087.1"]
    }
  ],
  "prior_regulation_probability": 0.03,
  "phylogenetic_weighting": true,
  "site_count_weighting": true,
  "posterior_probability_threshold": 0.5
}
Two mandatory input parameters are the list of reference motifs and target genomes.
- The motifs field contains one or more motifs. Each motif is described by two sub-fields: protein_accession and sites.
- The genomes field contains the list of target genomes to be used in the analysis. Each genome is described by two fields: name and accession_numbers. The accession_numbers field can hold multiple accession numbers, one for each chromosome/plasmid.
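The two mandatory fields can be checked before starting a run. The following is a minimal sketch, assuming only the field names described above; the helper check_cgb_input is hypothetical and is not part of cgb.

```python
import json

REQUIRED_MOTIF_KEYS = ("protein_accession", "sites")
REQUIRED_GENOME_KEYS = ("name", "accession_numbers")


def check_cgb_input(config):
    """Return a list of problems found in a parsed input dictionary.

    Hypothetical helper: it only checks the mandatory fields described
    in this README and does not reproduce any validation done by cgb.
    """
    problems = []
    motifs = config.get("motifs") or []
    genomes = config.get("genomes") or []
    if not motifs:
        problems.append("'motifs' must contain at least one motif")
    if not genomes:
        problems.append("'genomes' must contain at least one genome")
    for i, motif in enumerate(motifs):
        for key in REQUIRED_MOTIF_KEYS:
            if key not in motif:
                problems.append("motifs[%d] is missing '%s'" % (i, key))
    for i, genome in enumerate(genomes):
        for key in REQUIRED_GENOME_KEYS:
            if key not in genome:
                problems.append("genomes[%d] is missing '%s'" % (i, key))
        # One accession per chromosome/plasmid, so a list is expected.
        if not isinstance(genome.get("accession_numbers", []), list):
            problems.append("genomes[%d]: 'accession_numbers' must be a list" % i)
    return problems


example = json.loads("""
{
  "motifs": [{"protein_accession": "NP_217236.2",
              "sites": ["AAATCGAACATGTGTTCGAGTA"]}],
  "genomes": [{"name": "ace", "accession_numbers": ["NC_008578.1"]}]
}
""")
print(check_cgb_input(example))
```

An empty list means both mandatory fields are present and have the expected shape.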
Other input parameters are optional.
- prior_regulation_probability: the prior probability of regulation, used in the Bayesian estimation of the probability of regulation.
- phylogenetic_weighting: if true, the binding evidence from multiple reference organisms is weighted according to their phylogenetic distances to each target genome.
- site_count_weighting: if true, the binding evidence from each reference organism is weighted by the size of its binding site collection.
- posterior_probability_threshold: genes/operons with a posterior probability of regulation lower than the provided value are not reported.
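To illustrate how the prior and the threshold interact, here is a generic two-hypothesis Bayes update. It is a sketch under stated assumptions and not the computation performed by cgb. The function posterior_regulation, the operon names and the likelihood values are all hypothetical.

```python
def posterior_regulation(prior, lik_regulated, lik_background):
    """Posterior P(regulated | data) from Bayes' rule with two hypotheses.

    Hypothetical helper for illustration; the likelihoods are assumed
    to be supplied by some scoring model that is not shown here.
    """
    numerator = lik_regulated * prior
    denominator = numerator + lik_background * (1.0 - prior)
    return numerator / denominator


prior = 0.03       # prior_regulation_probability from the sample input
threshold = 0.5    # posterior_probability_threshold from the sample input

# Made-up likelihoods of the observed data under each hypothesis:
# (likelihood if regulated, likelihood if not regulated).
candidates = {
    "operon_A": (0.9, 0.01),
    "operon_B": (0.2, 0.1),
}

reported = {}
for name, (l_reg, l_bg) in candidates.items():
    p = posterior_regulation(prior, l_reg, l_bg)
    # Entries below the threshold are dropped, as described above.
    if p >= threshold:
        reported[name] = p

print(sorted(reported))
```

With a prior as low as 0.03, the evidence for regulation has to be much more likely than the background for the posterior to reach 0.5, so only operon_A is kept in this example.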
cgb saves all the output in the folder output, created in the working
directory.
- user_PSWM/ contains the user-provided binding motifs in JASPAR format.
- derived_PSWM/ contains binding motifs in JASPAR format, tailored for each target genome by combining the evidence from all reference motifs.
- identified_sites/ contains the identified binding sites and information such as their genomic locations, downstream regulated genes and their functions. Predicted binding site data is saved into CSV files, one for each target genome.
- operons/ contains the operon predictions for each target genome, saved as CSV files.
- orthologs.csv contains the groups of orthologous genes and their probabilities of regulation.
- phylogeny.png is a plot of the phylogenetic tree.
- ancestral_states.csv has the reconstructed state of each gene in all ancestral clades. For each target species and ancestral clade, the states are
  - P(1), the probability of TF binding
  - P(0), the probability of TF not binding
  - P(A), the probability of absence of the gene.
- plots/ contains visualizations of the results.
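After a run, the output folder can be checked against the entries listed above. This is a minimal sketch that relies only on the file and folder names given in this README; the helper missing_outputs is hypothetical and not part of cgb.

```python
import os

# Entries named in the output description above.
EXPECTED_DIRS = ["user_PSWM", "derived_PSWM", "identified_sites", "operons", "plots"]
EXPECTED_FILES = ["orthologs.csv", "phylogeny.png", "ancestral_states.csv"]


def missing_outputs(output_dir="output"):
    """Return the expected entries that are absent from output_dir."""
    missing = []
    for name in EXPECTED_DIRS:
        if not os.path.isdir(os.path.join(output_dir, name)):
            missing.append(name + "/")
    for name in EXPECTED_FILES:
        if not os.path.isfile(os.path.join(output_dir, name)):
            missing.append(name)
    return missing


# Lists every expected entry if no run has been made in this directory.
print(missing_outputs("output"))
```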