This project compares the genomes of Helicobacter pylori (H. pylori) isolates from diverse human populations. Using bioinformatic analyses, it aims to identify variations and similarities in virulence and non-virulence genes across geographical regions and to clarify the evolutionary relationships within this pathogen.
H. pylori is a pervasive bacterium that affects nearly half of the global population and is implicated in various gastro-duodenal diseases, including gastritis, peptic ulcer disease, and gastric adenocarcinoma. Despite its prevalence and significant health implications, the precise mechanisms underlying its pathogenicity remain unclear. This project explores the genomic landscape of H. pylori strains to identify the genetic factors that contribute to its virulence and geographical diversity.
Publicly available whole-genome sequences of 870 H. pylori isolates were downloaded in FASTA format from the PATRIC database, hosted by the Bacterial and Viral Bioinformatics Resource Center (BV-BRC) (https://www.bv-brc.org/; Wattam et al., 2014), on May 10, 2023, at 01:32 AM GMT. The following filters were applied:
- Host must be human.
- Must be whole genome sequences.
- Reads must be of good quality.
- Geographical regions of isolation should be Africa, Asia, North America, South America, Oceania, and Europe.
- The species must be H. pylori only.
- The isolation year must be from 2010 downwards.

Applying these filters returned 870 strains: 150 from Africa, 242 from Asia, 192 from North America, 11 from South America, 112 from Oceania and 163 from Europe.
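As a quick sanity check on the regional breakdown, the short Python sketch below tallies the counts reported above and confirms that they add up to the 870 downloaded genomes. It only restates numbers from the text; the function name `regional_shares` is introduced here for illustration.

```python
# Regional strain counts as reported in the text.
REGION_COUNTS = {
    "Africa": 150,
    "Asia": 242,
    "North America": 192,
    "South America": 11,
    "Oceania": 112,
    "Europe": 163,
}


def regional_shares(counts):
    """Return each region's share of the dataset as a percentage."""
    total = sum(counts.values())
    return {region: 100 * n / total for region, n in counts.items()}


if __name__ == "__main__":
    shares = regional_shares(REGION_COUNTS)
    for region, n in REGION_COUNTS.items():
        print(f"{region:<14}{n:>5}{shares[region]:>7.1f}%")
    print(f"{'Total':<14}{sum(REGION_COUNTS.values()):>5}")
```

The shares make the sampling imbalance explicit: South America contributes far fewer genomes than any other region, which is worth keeping in mind when comparing regions.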
Unless otherwise stated, the following tools were installed with Anaconda and run in a Linux environment (Ubuntu 22.04.2).
The genomes were annotated with Prokka v1.14.6, which identifies and labels genomic features and assigns functions to them (Seemann, 2014). The sequences were supplied in FASTA format. Prokka wrote its output files to a dedicated folder in several formats (.err, .ffn, .fsa, .gff, .sqn, .tsv, .faa, .fna, .gbk, .log, .tbl, .txt).
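The annotation step could be batched along the following lines. This is a minimal sketch, not the exact invocation used in the project: the directory names `genomes` and `prokka_out`, the `*.fasta` pattern, and the choice of the `--genus` and `--species` options are all assumptions. The script only builds and prints the commands.

```python
import shlex
from pathlib import Path


def prokka_command(fasta, out_root="prokka_out"):
    """Build a Prokka command line for one genome (nothing is executed)."""
    fasta = Path(fasta)
    sample = fasta.stem  # e.g. "strain_A" from "strain_A.fasta"
    return [
        "prokka",
        "--outdir", str(Path(out_root) / sample),  # one output folder per genome
        "--prefix", sample,                        # output files named after the genome
        "--genus", "Helicobacter",
        "--species", "pylori",
        str(fasta),
    ]


def annotation_commands(genome_dir, pattern="*.fasta"):
    """Return one Prokka command per FASTA file found in genome_dir."""
    return [prokka_command(p) for p in sorted(Path(genome_dir).glob(pattern))]


if __name__ == "__main__":
    # "genomes" is a hypothetical folder holding the downloaded FASTA files.
    for cmd in annotation_commands("genomes"):
        print(shlex.join(cmd))
```

Passing each command list to `subprocess.run(cmd, check=True)` would run the annotation; printing first makes it easy to review the commands before launching 870 jobs.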
A pangenome analysis of all 870 genomes was run with Roary v3.13.0 to identify the core and accessory genes present or absent in the genomes according to their geographical location of isolation (Page et al., 2015). The Prokka-generated .gff files were used as input. The output was rendered as plots, which were then interpreted.
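A Roary run over the Prokka .gff files might be assembled as in the sketch below. The options actually used are not stated in the text, so the flags here are assumptions: `-e` together with `-n` requests a core gene alignment built with MAFFT, `-p` sets the thread count, and `-f` names the output directory. The folder name `roary_out` and the file names in the example are hypothetical.

```python
import shlex


def roary_command(gff_files, out_dir="roary_out", threads=8):
    """Build a Roary command line over a set of Prokka GFF files."""
    gff_files = sorted(str(g) for g in gff_files)
    if not gff_files:
        raise ValueError("Roary needs at least one .gff file")
    return [
        "roary",
        "-f", out_dir,        # output directory
        "-e", "-n",           # core gene alignment, using MAFFT
        "-p", str(threads),   # number of threads
        *gff_files,
    ]


if __name__ == "__main__":
    # Hypothetical input files; in practice these come from the Prokka output folders.
    example = ["prokka_out/strain_B/strain_B.gff", "prokka_out/strain_A/strain_A.gff"]
    print(shlex.join(roary_command(example)))
```

Roary only writes the core gene alignment when `-e` is given, so some form of that option is needed if the alignment is to feed a downstream tree-building step.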
A phylogenetic tree in Newick format was built with FastTree from the core gene alignment file produced by Roary. The Python script 'roary_plots.py' was used alongside it to plot the tree for geographical comparison of evolutionary traits, and the iTOL online tool was used to visualise it (Letunic and Bork, 2021; Page et al., 2015).
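The tree-building step can be sketched as follows. The model options (`-nt -gtr`, a nucleotide alignment under the GTR model) follow the usage shown in the Roary documentation and are an assumption about what was run here, as are the file names `core_gene_alignment.aln`, `tree.newick` and `gene_presence_absence.csv`, which are Roary's default output names. FastTree writes the tree to standard output, so the helper redirects it to a file.

```python
import shlex
import subprocess


def fasttree_command(alignment="core_gene_alignment.aln"):
    """Command that infers a tree from a nucleotide alignment under GTR."""
    return ["FastTree", "-nt", "-gtr", alignment]


def build_tree(alignment, newick_out):
    """Run FastTree and save the Newick tree it prints to standard output."""
    with open(newick_out, "w") as handle:
        subprocess.run(fasttree_command(alignment), stdout=handle, check=True)


def roary_plots_command(newick, presence_absence="gene_presence_absence.csv"):
    """Command that plots the tree against the gene presence/absence matrix."""
    return ["python", "roary_plots.py", newick, presence_absence]


if __name__ == "__main__":
    # Print the commands rather than running them, so the sketch works
    # without FastTree or roary_plots.py installed.
    print(shlex.join(fasttree_command()), "> tree.newick")
    print(shlex.join(roary_plots_command("tree.newick")))
```

The resulting Newick file is plain text and can be uploaded to iTOL directly for annotation by region.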
From each clade of the phylogenetic tree constructed for each geographical region, 2 sequences were randomly selected: 12 from Africa and 10 from each of the other five regions. Together with the H. pylori reference genome 26695, this gave 63 sequences to represent the six geographical regions.
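The subsampling can be made reproducible with a seeded random draw, as in the sketch below. How the original selection was performed is not described, so this is one possible approach rather than the method used. The clade counts (six for Africa, five elsewhere) are inferred from the reported totals of 12 and 10 sequences at 2 per clade, and the genome identifiers are placeholders.

```python
import random

# Inferred from the reported totals (12 / 2 = 6 clades for Africa, 10 / 2 = 5
# for the other regions); the clade counts are not stated in the text.
CLADES_PER_REGION = {
    "Africa": 6, "Asia": 5, "North America": 5,
    "South America": 5, "Oceania": 5, "Europe": 5,
}


def pick_representatives(clades, per_clade=2, seed=0):
    """Randomly pick up to per_clade genomes from every clade, reproducibly."""
    rng = random.Random(seed)
    picked = []
    for clade in sorted(clades):
        members = sorted(clades[clade])  # sort so the draw does not depend on input order
        picked.extend(rng.sample(members, min(per_clade, len(members))))
    return picked


def demo_selection(members_per_clade=4):
    """Build placeholder clades for each region and select representatives."""
    selected = ["26695"]  # reference genome
    for region, n_clades in CLADES_PER_REGION.items():
        clades = {
            f"{region}_clade{c}": [f"{region}_c{c}_g{g}" for g in range(members_per_clade)]
            for c in range(1, n_clades + 1)
        }
        selected.extend(pick_representatives(clades))
    return selected


if __name__ == "__main__":
    print(len(demo_selection()), "sequences selected")
```

Fixing the seed means the same representatives are drawn on every run, which keeps the repeated annotation and pangenome steps comparable.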
Genome annotation, pangenome analysis and evolutionary analysis were repeated for the newly selected samples.
ABRicate v0.8 was used to screen all genomes, grouped by geographical region, for antibiotic resistance and virulence genes; the hits it returned indicated the virulence factors present. It was run against the VFDB database on July 15, 2023, at 09:05:05 PM GMT (Chen et al., 2016; Seemann, 2020).
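The screening step and a simple summary of its output might look like the sketch below. ABRicate writes a tab-separated report whose header includes `#FILE` and `GENE` columns; the parser relies only on those two. The sample report is invented for illustration and trimmed to a few columns, so it does not represent results from this project.

```python
import csv
import io
from collections import defaultdict


def abricate_command(fasta, db="vfdb"):
    """Command that screens one genome against the chosen ABRicate database."""
    return ["abricate", "--db", db, str(fasta)]


def genes_per_genome(report_text):
    """Map each input file in an ABRicate report to its sorted list of gene hits."""
    hits = defaultdict(set)
    reader = csv.DictReader(io.StringIO(report_text), delimiter="\t")
    for row in reader:
        hits[row["#FILE"]].add(row["GENE"])
    return {name: sorted(genes) for name, genes in hits.items()}


# Invented, trimmed example report (not real results).
SAMPLE_REPORT = (
    "#FILE\tSEQUENCE\tGENE\t%COVERAGE\t%IDENTITY\n"
    "strain_A.fasta\tcontig_1\tcagA\t100.00\t95.10\n"
    "strain_A.fasta\tcontig_2\tvacA\t99.80\t93.40\n"
    "strain_B.fasta\tcontig_1\tvacA\t100.00\t94.00\n"
)

if __name__ == "__main__":
    for genome, genes in genes_per_genome(SAMPLE_REPORT).items():
        print(genome, len(genes), ", ".join(genes))
```

Counting hits per genome in this way gives a rough basis for ranking genomes by virulence gene content within each region.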
The most virulent genomes from each geographical region and the reference genome were visualised with GenoVi, and the results were compared (Cumsille et al., 2023).
Arnold Abakah