FALCON-Phase integrates PacBio long-read assemblies with Phase Genomics Hi-C data to create phased, diploid, chromosome-scale scaffolds.
Read about the method and its performance in our preprint.
Version 2 of this method will be maintained in the pb-assembly bioconda package. See here for details of the full suite of tools.
FALCON-Phase was developed to solve the problem of haplotype switching in diploid genome assemblies. It has been tested on mammalian assemblies and performs well on an F1 bull with 0.7% heterozygosity. It has not been thoroughly tested on samples with very low or very high heterozygosity. In principle, it can be applied to outbred diploid organisms with less than 5% divergence between maternal and paternal haplotypes; when run on organisms with higher heterozygsity determining homology between haplotypes is ineffective as currently implemented. For samples with low heterozygosity, you should determine how much of your genome has "unzipped" by comparing the lengths of your haplotigs to those of your primary contigs. If the total haplotig length is small, FALCON-Phase will have limited utility.
To run the pipeline you need a FALCON-Unzip assembly and Hi-C data. See PacBio recommendations for assembly coverage. For Hi-C we suggest 100 million read pairs per 1 Gb of genome length, with adjustments for genome complexity and library quality.
FALCON-Phase can be used to phase haplotype blocks within a contig and contigs within scaffolds. See our preprint and SFAF slides for details.
We did our best to minimize dependencies, but there are a number of standard bioinformatics tools required by the pipeline. The version numbers of the dependencies are listed below, but new/older versions should work, but are untested. The required binaries are specified in the config.json file.
- Python (3.6) - Running Snakemake
- NumPy (1.14.2) - Pipeline python scripts
- Snakemake (3.6) - Running the pipeline interactively or on a cluster
- BWA (v0.7.17) - Mapping the Hi-C to the minced contigs
- Mummer 4 (4.0.0) - Mapping the haplotigs (h contigs)
- BEDTools (2.27.1) - Creating AB pair index
- HSTLIB (1.5 or greater) - Internal dependency (bundled with FALCON-Phase)
- SAMTOOLS (1.5 or greater) - Indexing fasta files
- Biopython (1.6.0 or greater) - Running
bin/preprocess_diploid_asm_for_fc_phase.py
Cloning the repository downloads the pipeline, the source code, and HTSLIB.
git clone --recursive https://github.com/phasegenomics/FALCON-Phase.git ; cd FALCON-Phase/src ; make
After running this command you should see the executable binary FALCON-Phase/bin/falcon-phase
.
We have provided a small pseudo test dataset (sample info) to get the pipeline running.
-
Install the pipeline (as shown in the Install Process)
-
Clean the headers of the test FALCON-Unzip assembly files and generate the name_mapping.txt file. We include a script
bin/preprocess_diploid_asm_for_fc_phase.py
that can be used to generate this file and (try to) clean up contig names that have been modified. (there is also a deprecated perl script to do this for you:scrub_names.pl
in the bin directory. The script removes the "|arrow" suffix on the FASTA headers and makes new files with the ".cleaned" suffix on the base name.)
Usage:
bin/preprocess_diploid_asm_for_fc_phase.py -P p-contigs.fa -H h-contigs.fa -o cleaned_contigs # writes name_mappings.txt in working directory
scrub_names.pl p-contigs.fa h-contigs.fa > name_mapping.txt
The name_mapping.txt file maps each haplotig to a primary contig and is required to for the mummer steps and to build the snakemake DAG. It looks like this:
000000F 000000F_002
000000F 000000F_003
000001F 000001F_001
000001F 000001F_002
000001F 000001F_003
- Edit the config files
If you are running on a cluster that uses modules, you can load the binaries with a series of commands in the config.sh
file. Below is an example config.sh
.
module load snakemake
module load bwa/0.7.17
module load bedtools/2.27.1
module load samtools/1.7
module load mummer/4.0.0
Next, edit the config.json
in the pipeline folder, filling out the paths to the dependencies, and sample information. The paths to the required binaries in the config.json can be determined using which
(e.g. which samtools
returns the path to the binary).
The tables below explains the fields in the config file.
Key | Value | Explanation |
---|---|---|
env | config.sh | This file is sourced by the shell and loads enviromental variables |
CPU | 2 | default number of CPUs to use |
nucmer | /path/to/mummer/bin/nucmer | Path to nucmer |
delta-filter | /path/to/mummer/bin/delta-filter | Path to delta-filter |
show-coords | /path/to/mummer/bin/show-coords | Path to show-coords |
samtools | /path/to/samtools | Path to samtools |
hp | /path/to/FALCON-Phase/bin/coords2hp.py | Path to coords2hp.py |
hpfilt | /path/to/FALCON-Phase/bin/filt_hp.py | Path to filt_hp.py |
falcon_phase | /path/to/FALCON-Phase/bin/falcon-phase | Path to falcon-phase |
falcon_oi | /path/to/FALCON-Phase/bin/primary_contig_index.pl | Path to primary_contig_index.pl |
bedtools | /path/to/bedtools | Path to bedtools |
bwa | /path/to/bwa | Path to bwa |
bwa | cpu: 24 | number of CPUs for bwa mapping |
Key | Value | Explanation |
---|---|---|
name | test | The name of the sample, most output files will have this prefix |
output_format | pseudohap | The desired output format ("unzip" or "pseudohap") |
min_aln_len | 3000 | The minimal alignment length to consider during haplotig placement |
p_to_h | /path/to/name_mapping_tiny.txt | A file that maps the haplotig names to primary contig names |
p_ctgs | /path/to/cns_p_ctg.clean.fasta | Path to CLEANED primary contigs |
h_ctgs | /path/to/cns_h_ctg.clean.fasta | Path to CLEANED haplotigs |
r1 | /path/to/FALCON-Phase/test_dataset/S3HiC_R1.fastq | Hi-C read-pair 1 |
r2 | /path/to/FALCON-Phase/test_dataset/S3HiC_R2.fastq | Hi-C read-pair 2 |
enzyme | GATC | The restriction enzyme recognition seq used for Hi-C library prep |
iter | 10000 | The number of iterations for phasing algorithm, 10e7 is recommended |
- Phew! Take a deep breath, you're almost done.
5a. Run the snakemake
pipeline locally. To do so, on my computer, I need to be in a python 3.6 virtual environment with NumPy
libraries loaded. This differs between machines, so you'll need to get that sorted out. Here's the command to run the FALCON-Phase snakemake pipeline:
(py36)$ snakemake -s snakefile --verbose -p
Snakemake will start printing information to your screen about the process. This example dataset only takes a few minutes to run, but a "real" job will take longer and should be run in the background. Once everthing is done you should see four files in the output
direstory.
5b. Run the snakemake
on a cluster. Snakemake can be run on a cluster and submit jobs the scheduler. We have included an SGE cluster config file as an example. The table below explains the fields in the cluster.config.sge.json
.
Field | Key | Value | Explanation |
---|---|---|---|
default | S | /bin/bash | interpreting shell |
N | falcon-phase | job name | |
P | -cwd | directory | |
Q | default | queue name | |
CPU | -pe smp 2 | default number of cores requested | |
E | qsub_log/ | dir for stderr | |
O | qsub_log/ | dir for stdout | |
aln | CPU | -pe smp 24 | number of cores requested for bwa mapping |
NOTES: the number of CPUs specified in the cluster.config.sge.json
should match that in environmetal settings in config.json. Make sure you create the qsub_log
dir before launching the job!!
Below is the command to run snakemake on PacBio's SGE cluster. This command runs 50 concurrent jobs and pulls the other qsub
parameters from the cluster.config.sge.json
file.
snakemake -j 50 --cluster-config cluster.config.sge.json --cluster "qsub -S {cluster.S} -N {cluster.N} {cluster.P} -q {cluster.Q} {cluster.CPU} -e {cluster.E} -o {cluster.O} -V" -s snakefile --verbose -p --latency-wait 60
The FALCON-Phase pipeline has a number of steps that are implemented by rules in the snakefile
. Snakemake
implements the rules from bottom to top in the snakefile
. Details about each step can be found in our preprint and are summarized in the figure below:
FALCON-Phase is only compatible with assemblies generated by FALCON-Unzip. Primary contigs have names like 000123F and haplotigs have names like 000123F_001.
The haplotig placement file specifies where each haplotig aligns to its primary contig and is used to define phase blocks in the FALCON-Unzip assembly. Making the haplotig placement file involved several steps. First, nucmer
is used to align all FALCON-Unzip haplotigs to their primary contig. This step produces delta files for each primary contig.
job_dir/
βββ delta_files/
βββ test.000000F.delta/ # delta file of haplotigs aligned to primary contig 000000F
βββ test.000001F.delta/ # delta file of haplotigs aligned to primary contig 000001F
Delta files are filtered in the next step using delta-filter
:
job_dir/
βββ filtered_delta_files/
βββ test.000000F.delta.filt/ # filtered delta file of haplotigs aligned to primary contig 000000F
βββ test.000001F.delta.filt/ # filtered delta file of haplotigs aligned to primary contig 000001F
Alignment coordinate files are made with show-coords
:
job_dir/
βββ coords_files/
βββ test.000000F.coords/ # coords file for primary contig 000000F
βββ test.000001F.coords/ # coords file for primary contig 000001F
The haplotig placement file is made by calling two scripts, coords2hp.py
then filt_hp.py
. The file specifying the phase block pairing is contained in the haplotig_placement directory and is produced in the mincing stage (see below).
job_dir/
βββ haplotig_placement_file/
βββ test.hbird.hp.txt/ # unfiltered haplotig placement file
βββ test.hbird.filt_hp.txt/ # final haplotig placement file
βββ test.AB_pairs.txt/ # pairing of A-B haplotigs (phase blocks)
Once the haplotig placement file and A-B phase block pairings are done, the primary contigs are minced at phase block boundaries. Mincing allows Hi-C reads to be mapped to each pair of phase block so that the density of Hi-C connections can be used to assign blocks to the same phase.
job_dir/
βββ mince/
βββ test.A_haplotigs.bed/ # BED file for A minced haplotigs (original FALCON-Unzip haplotigs)
βββ test.B_haplotigs.bed/ # BED file for B minced haplotigs (corresponding phase blocks on FALCON-Unzip primary contigs)
βββ test.collapsed_haplotypes.bed/ # BED file for collapsed haplotigs (non-Unzipped regions of primary contigs)
βββ test.A_haplotigs.fasta/ # FASTA file for A minced haplotigs
βββ test.B_haplotigs.fasta/ # FASTA file for B minced haplotigs
βββ test.collapsed_haplotypes.fasta/ # FASTA file for collapsed haplotypes
βββ test.minced.fasta/ # Concatenated FASTA (in this order: A_haplotigs, B_haplotigs, collapsed)
βββ test.BC.bed/ # sorted BED of B and collapsed
βββ B_haplotigs_merged.bed/ # merging of overlapping B haplotigs (used to define collapsed regions)
Hi-C reads are mapped to the minced FASTA file using bwa mem
, streamed and processed in samtools
and filtered with a FALCON-Phase utility. This is usually the most time consuming step and typically runs overnight on a mammalian genome with 24-48 cores.
job_dir/
βββ hic_mapping/
βββ test.unfiltered.bam/ # Hi-C reads mapped to minced FASTA
βββ test.filtered.bam/ # filtered Hi-C reads mapped to minced FASTA
The binary matrix and index file are input to the FALCON-Phase algorithm to assign phase to each A-B phase block pair. The index file is created by the primary_contig_index.pl
script.
job_dir/
βββ phasing/
βββ test.binmat/ # binary matric of normalized Hi-C mappings
βββ test.ov_index.txt/ # index file of ordering of minced contigs on each primary contig and A-B phase block pairings
βββ test.results.txt/ # phase assignment for each A-B phase block pair
βββ test.seq.txt/ # summary of RE cut sites for each minced fragment
The results.txt
file has the following columns:
- primary contig ID
- phase 0 haplotig ID
- phase 1 haplotig ID
- proportion of simulations in this configuration
- normalized count of reads mapped to phase 0 haplotig
- normalized count of reads mapped to phase 1 haplotig
- numeric index in fasta for sequence 0
- numeric index in fasta for sequence 1
You have two options for the final output of FALCON-Phase, which is generated by the emit_haplotigs.pl
script. pseudohap
output is two phased full-length pseudo-haplotypes for each primary contig. unzip
style output is a primary contig file and haplotigs file with the phase-switch errors corrected. "Phase 0" output corresponds to the primary contigs and "Phase 1" to the haplotigs.
job_dir/
βββ output/
βββ test_data.phased.0.bed/ # list of minced contigs used to reconstruct phase 0 output
βββ test_data.phased.1.bed/ # list of minced contigs used to reconstruct phase 1 output
βββ test_data.phased.0.fasta/ # phase 0 output
βββ test_data.phased.1.fasta/ # phase 1 output
For the pseudohap
the final output FASTA headers look like:
>000000F_0
>000001F_0
>000000F_1
>000001F_1
Phase is specified by the _0
or _1
suffix on the original FALCON-Unzip primary contig IDs.
For the unzip
style output, the fasta headers are the same as what you are used to with FALCON-Unzip:
Primary contigs:
>000000F
>000001F
Haplotigs:
>000000F_001
>000000F_003
>000000F_002
>000001F_003
>000001F_001
>000001F_002