Lightweight python package to map transposon insertion sequencing (INSeq) data in bacteria.
This section is meant for users who know their way around terminal and conda
. To use pyinseq
,
create a virtual environment with python
3.7 and install pyinseq
using conda
.
$ conda install -n base -c conda-forge mamba
$ conda create -n pyinseq-py37 python=3.7
$ conda activate pyinseq-py37
(pyinseq-py37) $ conda install -c bioconda bowtie
(pyinseq-py37) $ pip install pyinseq
Verify your installation with --test
(pyinseq-py37) $ pyinseq --test
Now you can run pyinseq
!
(pyinseq-py37) $ pyinseq -i <input file> -s <sample file> -g <genbank file> -e <experiment name>
- Overview of Command Line Operation
- Output description
- Notes on output
- Background
- User guide
- Installation
- FAQ
- License
Basic operation and a short description of the files are listed here. Below are detailed descriptions and links to example input files.
$ pyinseq -i <input file> -s <sample file> -g <genbank file> -e <experiment name>
-i
/ --input
- Illumina reads file in FASTQ or gzipped-FASTQ format.
-s
/ --samples
- Sample list where each line is the
sample_name
<tab>barcode (4bp)
.
-g
/ --genome
- Concatenated GenBank File for all contigs for the genome.
-e
/ --experiment
- All results files will be created in a subfolder with this name.
--get_default_config
- Creates a default configuration file for running
pyinseq
--config_format
- File format for configuration file (
yaml
orjson
)
-c
/ --config
- Configuration file for running a
pyinseq
workflow. Every other argument will be ignored.
--test
- Runs pytest on installed
pyinseq
software.
--gff
- Generate genome output file in gff3 format.
-d
/ --disruption
- Five-prime fraction of gene (
0.0
-1.0
) that must be disrupted for the hit to be counted in the summary_gene_table. Often insertions at the 3' end of a gene do not disrupt function so it may be of interest to run the pipeline with a disruption value of0.8
or0.9
. [default: 0.9]
--min_count
- Minimum number of reads per insertion site. [default: 3]
--max_ratio
- Maximum ratio of left:right or right:left reads per insertion site. [default: 10]
--barcode_length
- Length of barcode index that is expected in samples. [default: 4]
--transposon_seq
- DNA sequence of transposon that flanks reads. [default: 'ACAGGTTG']
-t
/ --threads
- Number of cores to use for execution [default: the CPU count of the computer]
--snakemake_params
- Additional parameters that will get passed to
snakemake
.
File | Description |
---|---|
<experiment name>-config.yml |
Configuration file with run parameters |
results/summary_gene_table.txt |
summary for entire experiment (values in counts-per-million, cpm) |
results/<sample>_sites.txt (for each sample) |
Counts of each insertion in each sample |
results/<sample>_genes.txt (for each sample) |
Counts of each insertion mapped to genes |
results/<sample>_bowtie.txt (for each sample) |
Bowtie mapping results |
results/<sample>_trimmed.fastq (for each sample) |
Demultiplexed fastq reads trimmed for the chromosome sequence only |
results/log.txt |
text printed to console |
results/samples_info_yml |
basics stats for each sample |
results/genome_lookup/genome.fna |
genome fasta nucleotide file |
results/genome_lookup/genome.ftt |
genome feature table |
bowtie indexes | index files created from genome by bowtie |
results/raw_data/<sample>.fastq (for each sample) |
demultiplexed files for each sample/barcode |
results/raw_data/_other.fastq |
demultiplexed files for unrecognized barcodes |
T50
- The minimum number of transposon insertion sites in the sample that account for at least 50% of the samples's reads. Used as a crude measure to detect bottlenecks, when comparing output T50 to the library input T50, or when comparing biological or technical replicates.
samples_info.yml
- Contains basic information of each sample in the experiment.
log.txt
- Log file from pyinseq, including the messages from the
snakemake
execution. This includes the order of steps taken during the pyinseq execution.
<experiment name>
-config.yaml
- Configuration file with run parameters.
pyinseq
was inspired by the software published by Goodman et al. (2011). There are a number of differences, some of which are noted here. Most are fairly superficial in that they are intended to increase automation and reproducibility but do not materially affect the results. One exception is the first point, which does affect the resulting output.
-
As was conducted in Brooks et al. (2014) transposon site data are normalized across all replicons (chromosomes/plasmids/contigs) for calculation of a counts-per-million (cpm) per sample.
-
The user's file of reads is demultiplexed into separate .fastq files for each barcoded read.
-
The user provides a GenBank file, and
pyinseq
generates a fasta nucleotide file (.fna) and gene feature table (.ftt) that are formatted and named properly forbowtie
. -
The gene feature table (.ftt) is comparable to the protein feature table (.ptt) but it also includes RNA genes.
-
If
pyinseq
is installed fromconda
, thenbowtie
is installed fromconda
as a dependency. -
At the end of the analysis, results are aggregated into a tab-delimited table and sample info is summarized.
-
pyinseq
is written primarily in Python and usessnakemake
as the workflow manager. You probably figured that out already.
Illumina sequencing reads [-i
] for the experiment. File can be uncompressed (.fastq or .fq) or gzip-compressed (.fastq.gz or .fq.gz).
@DGL9ZZQ1:720:C6YD0ACXX:2:1101:1246:2185 1:N:0:
GAAGCGACACCGACGACGGTAACAGGTTGGATGATAAGTCCCCGGTCTTCG
+
CCCFFFDDHHHHHCGHHHIJDHIIJJFHHJIJJJJIJJDHIJJHIAGIJJJ
...
Sample file [-s
] describing the sample names and barcodes. Sample names should be restricted to letters, numbers, dash (-), and underscore (_), with a tab between the sample and the barcode in each row of a text file. It is recommended that the file be prepared in a text editor to ensure that additional hidden characters are not introduced. BBEdit is one option for Mac.
Microsoft Excel can export tab-delimited files (.tsv), but do not use Microsoft Word for this purpose.
E001_01 GAAG
E001_02 CTTT
GenBank file [-g
] listing the features and DNA sequence for the organism. If the organism has multiple chromosomes/contigs in the sequence the file they should be concatenated into a single file. Ensure that the double slash //
at the end of the file remains to separate each contig.
Files from NCBI GenBank often work where the corresponding files from NCBI RefSeq do not. Feel free to contact us with any questions here.
LOCUS CP000020 2897536 bp DNA circular BCT 02-APR-2008
DEFINITION Vibrio fischeri ES114 chromosome I, complete sequence.
ACCESSION CP000020
VERSION CP000020.2 GI:171902228
KEYWORDS .
SOURCE Vibrio fischeri ES114
...
FEATURES Location/Qualifiers
source 1..2897536
/organism="Vibrio fischeri ES114"
/mol_type="genomic DNA"
/strain="ES114"
/db_xref="taxon:312309"
/chromosome="I"
gene complement(313..747)
/gene="mioC"
/locus_tag="VF_0001"
/old_locus_tag="VF0001"
CDS complement(313..747)
/dnas_title="FMN-binding protein MioC"
/gene="mioC"
/locus_tag="VF_0001"
/old_locus_tag="VF0001"
/codon_start=1
/transl_table=11
/product="FMN-binding protein MioC"
/protein_id="AAW84496.1"
/db_xref="GI:59478709"
/translation="MKKVSIITGSTLGGAEYVGDHLADLLEEMDFSTDIHNQPNLDDI
DIDSLWLLVVSTHGAGDYPDNIKPFIQQLESVTQPLSSVEFAVVAIGDSSYDTFCAAG
KSLQNTLKEHGAIEKYPLLEIDVTQNSIPEEPAELWLKQHIC"
...
ORIGIN
1 aagatcactt aatatatata agatctttta aagagatctt ttattagatc tattatatag
61 atcgtcgatc tctgtggata agtgataaat gatcaatagg atcatatact ttagatggat
121 ccaaagttgt tatctttctt tgatcttcga tcggacagct tgaggacaaa agagttagtt
181 atccacaagg ggggagggcg ttagatctta ttcaatggat aactataact tgatcactgg
241 atctttctat agttatccac atagtaggta tcatctattt aataactttt atagatcgga
301 caacactttt tattaacaaa tgtgttgttt tagccacaat tctgctggtt cttcagggat
361 actattttga gttacatcta tctctaatag agggtacttt tcgatagcgc catgctcttt
421 taaggtattt tgaagtgatt ttcctgctgc gcagaaagtg tcataacttg aatcaccgat
481 agcaacaaca gcaaattcaa cgctcgataa tggctgagtc acgctttcta actgttgaat
541 aaatggttta atgttgtcag ggtaatcacc agcaccgtga gttgatacaa cgagtaacca
601 taagctatca atatcaatat catctaaatt tggttgatta tgaatatcgg tggaaaaatc
661 catttcttct aataaatcag caagatggtc accaacatac tcagcaccac ctagagtgct
721 tcctgtaata atagatactt ttttcatgaa tttatcctat aaaaatataa aaaatgggcc
781 tacataggcc cattattaat cttattaata ttggttttat ttaccaatac agaatgaagt
...
//
LOCUS CP000021 1330333 bp DNA circular BCT 02-APR-2008
DEFINITION Vibrio fischeri ES114 chromosome II, complete sequence.
...
Pyinseq uses snakemake to execute the following workflow:
pyinseq
- Demultiplex a file of FASTQ reads.
- Write separate trimmed versions of the files (no barcodes, no transposon sequence).
- Map the trimmed reads to the genome.
- Quantify insertions per site and per ORF in the genome.
- Output summary gene table and report files describing the dataset.
The benefit of using snakemake
is that it allows for parallel execution if more than one thread
is provided.
The modularity of snakemake
enables future feature implementation.
pyinseq
writes a configuration file (<experiment name>-config.yaml
) into the working directory, which holds the parameters for the run.
For the example command below, the pyinseq-config.yaml
file will store the parameters used for this run.
$ pyinseq -i reads.fastq -s samples.txt -g genome.gbk -e pyinseq --threads 4
$ cat demo-run-config.yaml
snakemake_params: []
barcode_length: 4
command: pyinseq
config: false
config_format: yaml
disruption: 0.9
experiment: demo-run
genome: genome.gbk
get_default_config: false
gff3: false
input: reads.fastq.gz
max_ratio: 10
min_count: 3
samples: samples.txt
threads: 4
transposon_seq: ACAGGTTG
If you installed conda in your terminal, you can use the option --use-conda
which will install bowtie during the executuion of the workflow.
$ pyinseq -i reads.fastq -s samples.txt -g genome.gbk -e demo-run --threads 4 --snakemake_params --use-conda
You can also get a default configuration file by using --get_default_config
and modify it using a text-editor.
$ pyinseq --get_default_config
$ ls
default-config-pyinseq.yaml
Make sure that all file paths in the configuration are correct
These commands are useful when combining samples from multiple Illumina runs in a single pyinseq
analysis.
Both commands will also create a configuration file specific to each subcommand.
pyinseq demultiplex
- Demultiplex a file of Illumina reads.
- Writes separate trimmed versions of the files (no barcode, no transposon sequence) unless the optional
--notrim
flag is added.
$ pyinseq demultiplex -i <input file> -s <sample file> -e <experiment name>
pyinseq genomeprep
- Prepare the fasta nucleotide (.fna) and feature table (.ftt) files from a GenBank file.
- Is a good quick test to run on new GenBank files.
- Generates bowtie indexes unless the
--noindex
flag is added. - Optionally create GFF3 file with the
-gff
flag (for use in other programs).
$ pyinseq genomeprep -g <genbank file> -e <experiment name>
Similar to
pyinseq
main usage, you can run these commands using a configuration file and pass additionalsnakemake
options.
pyinseq
was written and tested using a MacOS (or Linux-based) operating system.
- MacOS or Linux-based operating systems.
- Python 3.7
- bowtie (v1.0.0)
- snakemake
pyinseq
has not being tested on Windows operating systems, but as of Windows 10 there is support for terminals with Ubuntu
Also note that
pyinseq
usesbowtie
; not bowtie2, which is a different software.
Conda is a command-line package manager that can create virtual environments with
all the necessary dependencies to run pyinseq
. You can acquire conda by installing
Anaconda and all the packages it
brings, or by installing its lightweight version called
miniconda.
We recommend miniconda since it installs a minimal number packages,
and is actually all you need to run pyinseq
Once conda is installed, you can verify it by running:
$ conda --help
mamba
is now required for snakemake
. Install it system-wide with this command. The subsequent steps are shown using conda
since that is more common for the
scientific audience, but can largely be accomplished in mamba
if the user desires.
conda install -n base -c conda-forge mamba
A virtual environment is an isolated computational space where you can install dependencies and software without affecting the base operating system's configuration.
We can use conda
to create a virtual environment with python 3.7
$ conda create -n pyinseq-py37 python=3.7
To activate your environment:
$ conda activate pyinseq-py37
You should see the name of your environment surrounded by parentheses in your terminal prompt.
(pyinseq-py37) $
(pyinseq-py37) $ conda install -c bioconda bowtie
Install the stable version:
(pyinseq-py37) $ pip install pyinseq
Or, install the most recent version from GitHub:
(pyinseq-py37) $ pip install git+https://github.com/mjmlab/pyinseq
Verify your installation with --test
(pyinseq-py37) $ pyinseq --test
Verify that pyinseq installed correctly by running:
```bash
(pyinseq-py37) $ pyinseq --help
2021-05-26 13:10 - INFO - pyinseq - Process command line arguments
usage: pyinseq [-h] [--get_default_config] [--config_format CONFIG_FORMAT]
[-c CONFIG] [-t THREADS] [--snakemake_params ...] [-v]
[-i INPUT] [-s SAMPLES] [-e EXPERIMENT] [-g GENOME]
[-d DISRUPTION] [--min_count MIN_COUNT] [--max_ratio MAX_RATIO]
[--barcode_length BARCODE_LENGTH]
[--transposon_seq TRANSPOSON_SEQ] [--gff3]
{demultiplex,genomeprep} ...
...
Now you are ready to run pyinseq
!
You can test your installation of pyinseq
by using the option --test
.
(pyinseq-py37) $ pyinseq --test
If all tests pass then you are good to go!
Ensure that the file is in GenBank and not RefSeq format.
You can do this two ways:
- uninstalling pyinseq from the virtual environment
pip uninstall pyinseq
- or completely remove the conda virtual environment.
conda env remove -n pyinseq-py37
Note: will need to conda deactivate
first to leave the environment.
Please use the GitHub Issues.
Pyinseq is an open-source software licensed under BSD-3.