Usage (Binaries)

Saki Tojo edited this page May 12, 2023 · 1 revision

This page describes how to use the HANA binaries directly, for example from a bash script or a third-party pipeline.

Quick Start Guide

This guide demonstrates how to assemble the Nipponbare genome data. It assumes that the following two files are already available:

  • contigs_sim.fasta, which stores all the contigs
  • sample.bwa_mem.bam, which holds the mapping results from bwa

To assemble the data and generate the chromosome FASTA, run the steps below in order.

hana_extract -f contigs_sim.fasta -m sample.bwa_mem.bam -o bare -e GATC -t 16 -q 40

This should generate bare.hmr_nodes and bare.hmr_reads.

hana_draft -n bare.hmr_nodes -r bare.hmr_reads -o bare

This should generate bare.hmr_edges and bare.hmr_nodes_invalid.

hana_partition -n bare.hmr_nodes -e bare.hmr_edges -g 12 -o bare

This should generate bare_1g12.hmr_group to bare_12g12.hmr_group.

hana_ordering -n bare.hmr_nodes -e bare.hmr_edges -g bare_1g12.hmr_group -o bare_1g12.hmr_seq -t 16
hana_ordering -n bare.hmr_nodes -e bare.hmr_edges -g bare_2g12.hmr_group -o bare_2g12.hmr_seq -t 16
hana_ordering -n bare.hmr_nodes -e bare.hmr_edges -g bare_3g12.hmr_group -o bare_3g12.hmr_seq -t 16
hana_ordering -n bare.hmr_nodes -e bare.hmr_edges -g bare_4g12.hmr_group -o bare_4g12.hmr_seq -t 16
hana_ordering -n bare.hmr_nodes -e bare.hmr_edges -g bare_5g12.hmr_group -o bare_5g12.hmr_seq -t 16
hana_ordering -n bare.hmr_nodes -e bare.hmr_edges -g bare_6g12.hmr_group -o bare_6g12.hmr_seq -t 16
hana_ordering -n bare.hmr_nodes -e bare.hmr_edges -g bare_7g12.hmr_group -o bare_7g12.hmr_seq -t 16
hana_ordering -n bare.hmr_nodes -e bare.hmr_edges -g bare_8g12.hmr_group -o bare_8g12.hmr_seq -t 16
hana_ordering -n bare.hmr_nodes -e bare.hmr_edges -g bare_9g12.hmr_group -o bare_9g12.hmr_seq -t 16
hana_ordering -n bare.hmr_nodes -e bare.hmr_edges -g bare_10g12.hmr_group -o bare_10g12.hmr_seq -t 16
hana_ordering -n bare.hmr_nodes -e bare.hmr_edges -g bare_11g12.hmr_group -o bare_11g12.hmr_seq -t 16
hana_ordering -n bare.hmr_nodes -e bare.hmr_edges -g bare_12g12.hmr_group -o bare_12g12.hmr_seq -t 16

These should generate bare_1g12.hmr_seq to bare_12g12.hmr_seq.
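The twelve near-identical calls above can also be generated in a loop. The following is a minimal sketch, assuming the prefix (bare), group count (12), and thread count (16) from this guide; the helper name ordering_commands is hypothetical and not part of HANA. It only prints the commands so they can be reviewed before running.

```shell
#!/bin/sh
# Sketch: print one hana_ordering command per group produced by hana_partition.
# ordering_commands is a hypothetical helper, not a HANA tool.
ordering_commands() {
    # $1 = output prefix, $2 = number of groups, $3 = threads
    i=1
    while [ "$i" -le "$2" ]; do
        g="${1}_${i}g${2}"
        echo "hana_ordering -n ${1}.hmr_nodes -e ${1}.hmr_edges -g ${g}.hmr_group -o ${g}.hmr_seq -t ${3}"
        i=$((i + 1))
    done
}

# Print the commands for review only; nothing is executed here.
ordering_commands bare 12 16
```

To execute the commands rather than print them, the output can be piped to sh (ordering_commands bare 12 16 | sh).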

hana_orientation -n bare.hmr_nodes -r bare.hmr_reads -s bare_1g12.hmr_seq bare_2g12.hmr_seq bare_3g12.hmr_seq bare_4g12.hmr_seq bare_5g12.hmr_seq bare_6g12.hmr_seq bare_7g12.hmr_seq bare_8g12.hmr_seq bare_9g12.hmr_seq bare_10g12.hmr_seq bare_11g12.hmr_seq bare_12g12.hmr_seq

This should generate bare_1g12.hmr_chromo to bare_12g12.hmr_chromo.
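Typing the long per-group file list by hand is error-prone. The sketch below builds it from the naming pattern used in this guide (prefix_<i>g<N>.<extension>). The helper name group_files is hypothetical. The list passed to -s uses the .hmr_seq extension, following the hana_orientation usage text, which describes -s as taking sorted contig sequence files. The command is printed, not run.

```shell
#!/bin/sh
# Sketch: build the space-separated per-group file list for a given extension.
# group_files is a hypothetical helper, not a HANA tool.
group_files() {
    # $1 = output prefix, $2 = number of groups, $3 = extension without the dot
    i=1
    list=""
    while [ "$i" -le "$2" ]; do
        # ${list:+ } adds a separating space only when the list is non-empty
        list="${list}${list:+ }${1}_${i}g${2}.${3}"
        i=$((i + 1))
    done
    echo "$list"
}

# Print the orientation command for review.
echo "hana_orientation -n bare.hmr_nodes -r bare.hmr_reads -s $(group_files bare 12 hmr_seq)"
```

The same helper with the hmr_chromo extension gives the file list for the -c option of hana_build.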

hana_build -f contigs_sim.fasta -c bare_1g12.hmr_chromo bare_2g12.hmr_chromo bare_3g12.hmr_chromo bare_4g12.hmr_chromo bare_5g12.hmr_chromo bare_6g12.hmr_chromo bare_7g12.hmr_chromo bare_8g12.hmr_chromo bare_9g12.hmr_chromo bare_10g12.hmr_chromo bare_11g12.hmr_chromo bare_12g12.hmr_chromo -o chromosome.fasta

This should generate chromosome.fasta.
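When the steps are chained in a script, it can help to stop as soon as an expected output is missing. The following is a minimal sketch; the helper name require_outputs is hypothetical, and treating an empty file as a failure is an assumption, not documented HANA behavior.

```shell
#!/bin/sh
# Sketch: verify that every expected output file exists and is non-empty.
# require_outputs is a hypothetical helper, not a HANA tool.
require_outputs() {
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty output: $f" >&2
            return 1
        fi
    done
    return 0
}

# Example use after a step (left commented out so this sketch runs without HANA):
# hana_draft -n bare.hmr_nodes -r bare.hmr_reads -o bare \
#     && require_outputs bare.hmr_edges bare.hmr_nodes_invalid
```

Whether .hmr_nodes_invalid can legitimately be empty is not stated on this page, so that file may need to be excluded from the check.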

Pipeline Modules

Extract

Extracts the information needed by the later steps from the FASTA and BAM files. It is expected to generate the following files:

  • .hmr_nodes describes the contig length and restriction sites count.
  • .hmr_reads describes the necessary reads information (contig id, position, paired contig id, paired position).
  • .hmr_allele_table describes the contig indices of the allele table, if an allele table is provided.
usage: hana_extract [-h] -f FASTA -a ALLELE -m MAPPING 1, MAPPING 2... -o OUTPUT -e ENZYME -t THREADS -r ENZYME_RANGE -q MAPQ --search-buffer SEARCH_BUF_SIZE --mapping-buffer MAP_BUF_SIZE --no-flag --no-range

  -h, --help            Show this help message and exit
  -f FASTA, --fasta FASTA
                        Contig FASTA file (.fasta/.fasta.gz)
  -a ALLELE, --allele ALLELE
                        Contig allele table file (.ctg.table)
  -m MAPPING 1, MAPPING 2..., --mapping MAPPING 1, MAPPING 2...
                        Hi-C reads mapping files (.bam/.pairs)
  -o OUTPUT, --output OUTPUT
                        Output file prefix
  -e ENZYME, --enzyme ENZYME
                        Enzyme to find in the sequence
  -t THREADS, --threads THREADS
                        Number of threads (default: 1)
  -r ENZYME_RANGE, --range ENZYME_RANGE
                        Hi-C pairs valid distance around enzyme (default: 1000)
  -q MAPQ, --mapq MAPQ  BAM minimum mapping quality (default: 40)
  --search-buffer SEARCH_BUF_SIZE
                        FASTA searching buffer size per thread (default: 32)
  --mapping-buffer MAP_BUF_SIZE
                        Mapping parse buffer size (unit: K, default: 512)
  --no-flag             Skip the flag checking
  --no-range            Skip the enzyme range checking
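As a hedged illustration of the options above, the sketch below assembles a hana_extract call for two mapping files with a lower mapping quality cutoff. The file names (contigs.fasta.gz, lib1.bam, lib2.bam) and the prefix sample are hypothetical, as is the helper name extract_cmd. Passing several mapping files separated by spaces is an assumption, based on how the Quick Start passes multiple files to -s and -c. The command is printed, not run.

```shell
#!/bin/sh
# Sketch: compose a hana_extract command using only flags from the usage text.
# extract_cmd is a hypothetical helper, not a HANA tool.
extract_cmd() {
    # $1 = contig FASTA, $2 = output prefix, remaining arguments = mapping files
    fasta=$1
    prefix=$2
    shift 2
    # "$*" joins the remaining arguments with spaces (assumed -m separator)
    echo "hana_extract -f ${fasta} -m $* -o ${prefix} -e GATC -t 8 -q 30"
}

extract_cmd contigs.fasta.gz sample lib1.bam lib2.bam
```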

Draft

Based on the contig and read information, calculates the edge counts and weights used for partitioning. It is expected to generate the following files:

  • .hmr_edges describes the relationships between contigs in the .hmr_nodes file.
  • .hmr_nodes_invalid describes the contigs that failed to meet the constraints.
usage: hana_draft [-h] -n NODES -r READS -a ALLELE_TABLE -o OUTPUT -b BUFFER_SIZE --min-links MIN_LINKS --min-re MIN_RE --max-link-density MAX_DENSITY

  -h, --help            Show this help message and exit
  -n NODES, --nodes NODES
                        HMR contig node file (.hmr_nodes)
  -r READS, --reads READS
                        HMR paired-reads file (.hmr_reads)
  -a ALLELE_TABLE, --allele-table ALLELE_TABLE
                        Allele contig table (.hmr_allele_table)
  -o OUTPUT, --output OUTPUT
                        Output graph prefix
  -b BUFFER_SIZE, --buffer-size BUFFER_SIZE
                        HMR paired-reads buffer size (unit: K, default: 512)
  --min-links MIN_LINKS
                        Minimum number of links for contig pair (default: 3)
  --min-re MIN_RE       Minimum number of RE sites in a contig (default: 10)
  --max-link-density MAX_DENSITY
                        Maximum allowed link density (default: 2)

Partition

Clusters contigs based on the edge information generated in the draft step, using a Louvain-like community detection algorithm. It is expected to generate the following files, one per group:

  • .hmr_group describes the contigs that should be from the same chromosome.
usage: hana_partition [-h] -n NODES -e EDGES -a ALLELE_TABLE -g GROUP -o OUTPUT -b BUFFER_SIZE --non-informative-ratio NON_INFO_RATIO

  -h, --help            Show this help message and exit
  -n NODES, --nodes NODES
                        HMR contig node file (.hmr_nodes)
  -e EDGES, --edges EDGES
                        HMR edges file (.hmr_edges)
  -a ALLELE_TABLE, --allele ALLELE_TABLE
                        HMR allele table file (.hmr_allele_table)
  -g GROUP, --group GROUP
                        Number of groups to be partitioned
  -o OUTPUT, --output OUTPUT
                        Output partition file (.hmr_partition)
  -b BUFFER_SIZE, --buffer-size BUFFER_SIZE
                        HMR edge buffer size (unit: K, default: 512)
  --non-informative-ratio NON_INFO_RATIO
                        Skipped contigs recover cutoff (default: 3)

Ordering

This is the first step of the AllHiC optimize stage. It reads one group generated by the partition step and uses an evolutionary algorithm to determine the contig order. It is expected to generate the following file:

  • .hmr_seq describes the sequence of the contigs in the same group.
usage: hana_ordering [-h] -n NODES -e EDGE -g GROUP -o OUTPUT -t THREAS -b BUFFER_SIZE -s SEED --mutapb MUTATION --ngen NUM_OF_GENERATION --max-gen MAX_GENERATION --npop NUM_OF_POP

  -h, --help            Show this help message and exit
  -n NODES, --nodes NODES
                        HMR contig node file (.hmr_contig)
  -e EDGE, --edge EDGE  HMR edge file (.hmr_edge)
  -g GROUP, --group GROUP
                        HMR contig group file (.hmr_group)
  -o OUTPUT, --output OUTPUT
                        Output ordered contig sequence file (.hmr_seq)
  -t THREAS, --threads THREAS
                        Number of threads (default: 1)
  -b BUFFER_SIZE, --buffer-size BUFFER_SIZE
                        HMR edge buffer size (unit: K, default: 512)
  -s SEED, --seed SEED  Fixed random seed, 0 for no special seed (default: 0)
  --mutapb MUTATION     Mutation probability (default: 0.2)
  --ngen NUM_OF_GENERATION
                        Number of generations for convergence (default: 5000)
  --max-gen MAX_GENERATION
                        Limits of trial generations (default: 1000000)
  --npop NUM_OF_POP     Candidate sequences size (default: 100)
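Because the ordering step is evolutionary, a fixed seed may help when comparing runs. The sketch below composes a call with an explicit seed and a larger population, using only flags from the usage text above. The helper name seeded_ordering_cmd and the chosen values (seed 42, population 200) are illustrative assumptions. This page does not state whether a fixed seed gives identical results across different thread counts. The command is printed, not run.

```shell
#!/bin/sh
# Sketch: compose a hana_ordering command with a fixed random seed.
# seeded_ordering_cmd is a hypothetical helper, not a HANA tool.
seeded_ordering_cmd() {
    # $1 = output prefix, $2 = group basename (e.g. bare_1g12), $3 = seed
    echo "hana_ordering -n ${1}.hmr_nodes -e ${1}.hmr_edges -g ${2}.hmr_group -o ${2}.hmr_seq -s ${3} --npop 200"
}

seeded_ordering_cmd bare bare_1g12 42
```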

Orientation

This is the second step of the AllHiC optimize stage. It reads all the ordered group sequences and uses the gradient of each contig to decide its orientation. It is expected to generate the following files, one per group:

  • .hmr_chromo describes the sequence of the contigs and their directions.
usage: hana_orientation [-h] -n NODES -r READS -s SEQ 1, SEQ 2... -b BUFFER_SIZE

  -h, --help            Show this help message and exit
  -n NODES, --nodes NODES
                        HMR contig node file (.hmr_contig)
  -r READS, --reads READS
                        HMR paired-reads file (.hmr_reads)
  -s SEQ 1, SEQ 2..., --seq SEQ 1, SEQ 2...
                        HMR sorted contig sequence file (.hmr_seq)
  -b BUFFER_SIZE, --buffer-size BUFFER_SIZE
                        HMR paired-reads buffer size (unit: K, default: 512)

Build

This step converts the chromosome results into a FASTA file. It is expected to generate the following file:

  • .fasta contains the assembled sequences.
usage: hana_build [-h] -f FASTA -c CHROMO 1, CHROMO 2... -o OUTPUT
optional arguments:
  -h, --help            Show this help message and exit
  -f FASTA, --fasta FASTA
                        Contig FASTA file (.fasta/.fasta.gz)
  -c CHROMO 1, CHROMO 2..., --chromosome CHROMO 1, CHROMO 2...
                        Hi-C reads mapping files (.bam/.hmr_mapping)
  -o OUTPUT, --output OUTPUT
                        Output build fasta file (.fasta)