
FASTQ-to-CodFreq pipeline for HIV-1 and SARS-CoV-2


CodFreq

Codon Frequency Table Format

The HIVDB Sequence Reads Interpretation Program accepts codon frequency tables stored in the CodFreq format. The CodFreq format consists of five columns:

  1. gene (PR, RT, or IN);
  2. position;
  3. total number of reads covering this position;
  4. codon nucleotide triplet; and
  5. total number of reads of this codon.
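
As a rough illustration of the five columns, the sketch below parses CodFreq rows and computes each codon's share of the reads at its position. The tab delimiter, the handling of a possible header row, the names `CodFreqRow` and `parse_codfreq`, and the sample values are all assumptions made for this example; they are not taken from the pipeline.

```python
import csv
import io
from typing import Iterator, NamedTuple


class CodFreqRow(NamedTuple):
    """One row of a CodFreq table (the five columns listed above)."""
    gene: str
    position: int
    total: int   # reads at this position
    codon: str
    count: int   # reads of this codon

    @property
    def fraction(self) -> float:
        # Share of the reads at this position that carry this codon.
        return self.count / self.total if self.total else 0.0


def parse_codfreq(text: str, delimiter: str = "\t") -> Iterator[CodFreqRow]:
    """Yield rows from CodFreq text. The delimiter is an assumption."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    for fields in reader:
        # Skip blank lines and any row whose position is not a number
        # (for example a header row, if the file has one).
        if len(fields) != 5 or not fields[1].isdigit():
            continue
        gene, position, total, codon, count = fields
        yield CodFreqRow(gene, int(position), int(total), codon, int(count))


# Illustrative values only, not real sequencing data.
SAMPLE = "RT\t184\t1000\tATG\t900\nRT\t184\t1000\tGTG\t100\n"
rows = list(parse_codfreq(SAMPLE))
for row in rows:
    print(row.gene, row.position, row.codon, f"{row.fraction:.1%}")
```

If your files use commas instead of tabs, pass `delimiter=","` to `parse_codfreq`.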

Examples

This repository contains CodFreq files generated from publicly available SRA sequences. We have also included three selected files from studies that utilize Illumina sequencing. To analyze these files, first download one or more CodFreq example files. Then, submit them to the HIVDB Interpretation Program for analysis.

Create a .codfreq file from a .fastq/.fastq.gz file

  1. Install Docker CE (https://docs.docker.com/install/).

  2. Download the script:

    sudo curl -sL https://raw.githubusercontent.com/hivdb/codfreq/main/bin-wrapper/align-all-docker -o /usr/local/bin/fastq2codfreq
    sudo chmod +x /usr/local/bin/fastq2codfreq
  3. Download alignment profiles:

    mkdir profiles
    curl -sL https://raw.githubusercontent.com/hivdb/codfreq/main/profiles/HIV1.json -o profiles/HIV1.json
    curl -sL https://raw.githubusercontent.com/hivdb/codfreq/main/profiles/SARS2.json -o profiles/SARS2.json
  4. Use the following command to process FASTQ files and generate CodFreq files:

    fastq2codfreq -r profiles/HIV1.json -d path/to/fastq/folders

    The script automatically finds every file with a .fastq extension, aligns the reads to produce a .sam file, and then extracts the codon frequency table into a .codfreq file.

    The above command is adequate in most cases for both paired and unpaired FASTQ files generated by Illumina with filename patterns such as *_L001_R1_001.fastq.gz and *_L001_R1_002.fastq.gz. If your FASTQ files follow a different naming convention, see Advanced usages § Manually pairing FASTQ files.

Note: the fastq2codfreq script can only be run on a Unix-like system. If you are using Microsoft Windows 10, you need to install the Windows Subsystem for Linux to use this script.

Offline usage

The fastq2codfreq command can be used offline, although the usage differs slightly from the description above. The differences are as follows:

  • Docker's installation package, the fastq2codfreq script, and the alignment profiles can be transferred to the offline server using a portable drive.
  • The Docker image used by fastq2codfreq can be saved to a binary file and transferred to the offline server using a portable drive.
    # Run this command on a computer with Internet access
    docker save hivdb/codfreq-runner:latest | gzip > codfreq-runner.tar.gz
    
    # Run this command on the offline server
    docker load < codfreq-runner.tar.gz
  • The auto-update option of fastq2codfreq should also be disabled with the -s argument:
    fastq2codfreq -s -r profiles/HIV1.json -d path/to/fastq/folders

Advanced usages

Disable auto-pairing FASTQ files

The -m flag can be added to the fastq2codfreq command to disable auto-pairing of FASTQ files.

fastq2codfreq -m -r profiles/HIV1.json -d path/to/fastq/folders

Manually pairing FASTQ files

For paired FASTQ files, the process generates a single CodFreq file. By default, the program tries to match FASTQ files with similar names as pairs. To change this behavior, place a pairinfo.json file in the same folder as the FASTQ files. An example file is provided at examples/pairinfo.json.
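
To check in advance which files might be treated as pairs, a name-based grouping can be sketched as follows. This is a hypothetical heuristic and not the pipeline's actual matching rule: it assumes the read direction is marked by an `_R1` or `_R2` token, and the function name `group_fastq_pairs` and the sample filenames are invented for illustration.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Assumption: an "_R1" or "_R2" token followed by "_" or "." marks the read
# direction. The real pipeline may use a different rule.
READ_TOKEN = re.compile(r"_R([12])(?=[_.])")


def group_fastq_pairs(filenames: List[str]) -> Dict[str, List[str]]:
    """Group filenames that differ only in their R1/R2 token."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in sorted(filenames):
        # Replace the read token with a placeholder so that both mates
        # of a pair map to the same key.
        key = READ_TOKEN.sub("_R#", name, count=1)
        groups[key].append(name)
    return dict(groups)


# Hypothetical filenames.
pairs = group_fastq_pairs([
    "sampleA_L001_R1_001.fastq.gz",
    "sampleA_L001_R2_001.fastq.gz",
    "sampleB_L001_R1_001.fastq.gz",
])
for key, members in pairs.items():
    kind = "paired" if len(members) == 2 else "unpaired"
    print(kind, members)
```

Files that end up in an unexpected group are candidates for an explicit entry in pairinfo.json.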

Customize fastp options

By default, fastp is used to trim adapters and to filter out low-quality regions and reads that are too short. examples/fastp-config.json lists all fastp options supported by this pipeline. Refer to fastp's documentation for the usage and explanation of these options.

To apply your customized settings, create a fastp-config.json file and save it in the same folder as the FASTQ files. You can also disable adapter trimming, low Phred quality filtering, or length filtering by setting the corresponding disabling flags to true.

Primer trimming - FASTA

The CodFreq pipeline supports trimming primer sequences supplied in FASTA format by using cutadapt. examples/cutadapt-config.json lists all cutadapt options supported by this pipeline. Refer to cutadapt's reference guide for the usage and explanation of these options.

Three types of optional FASTA primer files can be placed in the same folder as the FASTQ files: primers3.fa, primers5.fa, and primers53.fa, which correspond to the “3’ adapters”, “5’ adapters”, and “5’ or 3’ adapters” described in cutadapt's user guide.

To enable primer trimming (FASTA), you must create a valid cutadapt-config.json file in the same folder as the FASTQ files.

Primer trimming - BED

The CodFreq pipeline supports trimming primer locations supplied in BED format by using ivar. examples/ivar-trim-config.json lists all ivar trim options supported by this pipeline. Refer to ivar's manual for the usage and explanation of these options.

A BED primer file named primers.bed can be placed in the same folder as the FASTQ files (example: examples/primers.bed). ivar requires the BED6 format, a tab-delimited file with the following six columns (no header): reference, start, end, name, score, and strand. We have reviewed the ivar 4.1 source code and confirmed that ivar uses only four of these columns: start, end, name, and strand. The other two (reference and score) can be filled with any values to complete the BED6 format.
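
Before running the pipeline, it can help to confirm that a primers.bed file really has six tab-separated columns and integer coordinates. The following is a minimal sketch of such a check. The names `BedPrimer` and `parse_bed6` and the sample primer rows are assumptions made for illustration; this is not part of the pipeline or of ivar.

```python
from typing import List, NamedTuple


class BedPrimer(NamedTuple):
    """One BED6 row: reference, start, end, name, score, strand."""
    reference: str
    start: int
    end: int
    name: str
    score: str
    strand: str


def parse_bed6(text: str) -> List[BedPrimer]:
    """Parse headerless, tab-delimited BED6 text and validate its shape."""
    primers = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        if len(fields) != 6:
            raise ValueError(
                f"line {lineno}: expected 6 tab-separated columns, "
                f"got {len(fields)}"
            )
        reference, start, end, name, score, strand = fields
        # int() raises ValueError if a coordinate is not a whole number.
        primers.append(
            BedPrimer(reference, int(start), int(end), name, score, strand)
        )
    return primers


# Illustrative rows only; the reference and score values are placeholders.
SAMPLE_BED = (
    "ref\t10\t30\tprimer_1_LEFT\t60\t+\n"
    "ref\t400\t420\tprimer_1_RIGHT\t60\t-\n"
)
primers = parse_bed6(SAMPLE_BED)
for p in primers:
    print(p.name, p.start, p.end, p.strand)
```

A file delimited by spaces instead of tabs fails the column-count check, which is a common cause of BED parsing problems.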

To enable primer trimming (BED), you must create a valid ivar-trim-config.json file in the same folder as the FASTQ files.

Other tools

Consolidate codon frequency table to amino acid frequency table

A script using only the standard Python library is provided to consolidate a codon frequency table (.codfreq or .codfreq.gz file) into an amino acid frequency table (.aafreq.csv file). The script merges rows of codons that can be translated into the same amino acid.
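
The merging step can be sketched as follows. This is an illustration of the idea and not the code of the provided script: it assumes the standard genetic code, maps any codon outside the 64 standard triplets to "X" (an assumption), and uses invented names (`CODON_TABLE`, `consolidate`) and sample counts.

```python
from collections import defaultdict
from itertools import product
from typing import Dict, Iterable, Tuple

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

Row = Tuple[str, int, int, str, int]  # gene, position, total, codon, count


def consolidate(rows: Iterable[Row]) -> Dict[Tuple[str, int, str], Tuple[int, int]]:
    """Sum codon counts per (gene, position, amino acid).

    Returns a mapping to (total reads at position, reads of amino acid).
    """
    counts: Dict[Tuple[str, int, str], int] = defaultdict(int)
    totals: Dict[Tuple[str, int], int] = {}
    for gene, position, total, codon, count in rows:
        # Assumption: codons that are not standard triplets become "X".
        aa = CODON_TABLE.get(codon.upper(), "X")
        counts[(gene, position, aa)] += count
        totals[(gene, position)] = total
    return {key: (totals[key[:2]], n) for key, n in counts.items()}


# Illustrative counts only: GTG and GTA both encode valine (V).
aafreq = consolidate([
    ("RT", 184, 1000, "ATG", 900),
    ("RT", 184, 1000, "GTG", 60),
    ("RT", 184, 1000, "GTA", 40),
])
for (gene, position, aa), (total, count) in sorted(aafreq.items()):
    print(gene, position, aa, total, count)
```

In this example the two valine codons collapse into a single row with 100 reads, while methionine keeps its 900 reads.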

This script requires Python 3.9 or later. This Python runtime is included in the latest version of macOS and in most Linux releases. To install the latest Python version, follow the instructions on the official Python website.

To use this script:

  1. Download the script:

    sudo curl -sL https://raw.githubusercontent.com/hivdb/codfreq/main/scripts/codfreq2aafreq.py -o /usr/local/bin/codfreq2aafreq
    sudo chmod +x /usr/local/bin/codfreq2aafreq
  2. Run the script:

    codfreq2aafreq dir/to/read/codfreqs dir/to/write/aafreqs