bgen_parser is a simple, lightweight and (hopefully) efficient Python parser for the BGEN format. It is nothing more than a thin Python wrapper around the bgenix C++ library by Gavin Band.
The main motivation for developing this package was that, at the time, I couldn't find a BGEN parser that could parse the imputed genotypes of the UK Biobank in a reasonable time: the parsers I tried took too long on the initial load of the data. I needed a parser that would work in real time.
For example, to parse the imputed genotypes of the UK Biobank on chromosome 14:
import os
from bgen_parser import BgenParser
UKBB_IMPUTATION_V3_DIR = '/path/to/uk_biobank/EGAD00010001474'
chrom = '14'
bgen_file_path = os.path.join(UKBB_IMPUTATION_V3_DIR, 'ukb_imp_chr%s_v3.bgen' % chrom)
bgi_file_path = os.path.join(UKBB_IMPUTATION_V3_DIR, 'ukb_imp_chr%s_v3.bgen.bgi' % chrom)
sample_file_path = os.path.join(UKBB_IMPUTATION_V3_DIR, 'ukb26664_imp_chr%s_v3.sample' % chrom)
chrom_imputation_data = BgenParser(bgen_file_path, bgi_file_path, sample_file_path)
chrom_imputation_data.sample_ids # A pandas series with the sample IDs
chrom_imputation_data.variants # A pandas dataframe of all the variants
chrom_imputation_data.read_variant_probs(4) # Reads the genotype probabilities of the fifth variant (index 4), returning a numpy array of shape (n_samples, 3)
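Since read_variant_probs returns an array of shape (n_samples, 3), a common next step is to collapse the three genotype probabilities into a single expected dosage per sample. The following is a minimal sketch, not part of bgen_parser: the helper name expected_dosage is hypothetical, and it assumes the three columns hold the probabilities of carrying 0, 1 and 2 copies of the second allele, in that order. Please verify the column order against your own data before relying on it. A small synthetic array stands in for real output so the snippet runs on its own.

```python
import numpy as np

def expected_dosage(probs):
    """Expected allele count per sample from an (n_samples, 3) probability array.

    Hypothetical helper (not part of bgen_parser). ASSUMPTION: the columns are
    P(0 copies), P(1 copy), P(2 copies) of the second allele, in that order.
    """
    probs = np.asarray(probs, dtype=float)
    if probs.ndim != 2 or probs.shape[1] != 3:
        raise ValueError('expected an array of shape (n_samples, 3)')
    # Weight each genotype probability by its allele count and sum per sample.
    return probs @ np.array([0.0, 1.0, 2.0])

# Synthetic stand-in for chrom_imputation_data.read_variant_probs(4).
example_probs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.1, 0.3, 0.6],
])
dosage = expected_dosage(example_probs)
```

With real data, the same helper would be called as expected_dosage(chrom_imputation_data.read_variant_probs(4)), giving one value per entry of sample_ids.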
Dependencies:
- cython
- numpy
- pandas
The following instructions worked at the time they were written, but bgenix may well have changed since. If they don't work for you, please refer to the bgenix website for up-to-date instructions.
To install bgenix at ~/third_party/bgenix, do the following:
cd /tmp
wget http://bitbucket.org/gavinband/bgen/get/master.tar.gz
tar xvfz master.tar.gz
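mkdir -p ~/third_party
# The extracted directory name includes a commit hash, so it may differ from the one below
mv gavinband-bgen-44fcabbc5c38 ~/third_party/bgenix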
cd ~/third_party/bgenix
./waf configure
./waf
- Set the BGENIX_DIR environment variable to the directory in which you installed bgenix. For example, in csh it would look like:
setenv BGENIX_DIR /cs/phd/nadavb/third_party/bgenix
- Run:
python setup.py install
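If the installation fails, one thing worth ruling out is a missing or mistyped BGENIX_DIR. The following is a minimal sketch of such a check, not part of bgen_parser or its setup.py: the function name resolve_bgenix_dir is hypothetical, and it only confirms that the variable is set and points to an existing directory, not that bgenix was actually built there.

```python
import os

def resolve_bgenix_dir(environ=os.environ):
    """Return the absolute path stored in BGENIX_DIR, or raise if it is unusable.

    Hypothetical helper (not part of bgen_parser). It checks only that the
    variable is set and names an existing directory.
    """
    path = environ.get('BGENIX_DIR')
    if not path:
        raise RuntimeError('BGENIX_DIR is not set')
    # Expand a leading ~ and normalize to an absolute path.
    path = os.path.abspath(os.path.expanduser(path))
    if not os.path.isdir(path):
        raise RuntimeError('BGENIX_DIR does not point to a directory: %s' % path)
    return path
```

Calling resolve_bgenix_dir() in the same shell session, before running the install command, should either return the bgenix directory or raise an error that explains what is wrong.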
If you use bgen_parser as part of work contributing to a scientific publication, we ask that you cite our paper: Brandes, N., Linial, N. & Linial, M. PWAS: proteome-wide association study—linking genes and phenotypes by functional variation in proteins. Genome Biol 21, 173 (2020). https://doi.org/10.1186/s13059-020-02089-x