Falcon: a set of tools for fast alignment of long reads for consensus and assembly
The Falcon tool kit is a collection of simple code that I use for studying efficient assembly algorithms for haploid and diploid genomes. It has some back-end code implemented in C for speed and a simple front end written in Python for convenience.
Here is a brief description of the files in the package:
Several C files for implementing sequence matching, alignment and consensus:
kmer_lookup.c # k-mer matching code for quickly identifying potential hits
DW_banded.c # function for detailed sequence alignment
# It is based on Eugene Myers' paper
# "An O(ND) difference algorithm and its variations", 1986,
# http://dx.doi.org/10.1007/BF01840446
falcon.c # functions for generating consensus sequences from a set of multiple sequence alignments
common.h # header file for common declarations
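To illustrate what "k-mer matching for quickly identifying potential hits" can look like, here is a minimal Python sketch. It is not the code in kmer_lookup.c; the function names, the toy sequences, and the choice of k=4 are all assumptions made for illustration. The idea is to index every k-mer of a target sequence, look up the k-mers of a query, and count matches per diagonal (target position minus query position). A diagonal with many hits is a candidate region for detailed alignment.

```python
from collections import Counter, defaultdict


def build_kmer_index(target, k):
    """Map every k-mer in `target` to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(target) - k + 1):
        index[target[pos:pos + k]].append(pos)
    return index


def count_hits_by_diagonal(query, index, k):
    """Count shared k-mers per diagonal (target position minus query position)."""
    diagonals = Counter()
    for q_pos in range(len(query) - k + 1):
        # .get() avoids inserting empty entries into the defaultdict.
        for t_pos in index.get(query[q_pos:q_pos + k], ()):
            diagonals[t_pos - q_pos] += 1
    return diagonals


# Toy data (hypothetical): the query is a substring of the target at offset 5.
target = "ACGTACGTTGCAAGCT"
query = "CGTTGCAAG"
index = build_kmer_index(target, k=4)
hits = count_hits_by_diagonal(query, index, k=4)
best_diagonal, n_hits = hits.most_common(1)[0]
print(best_diagonal, n_hits)
```

A real implementation on noisy long reads would need to tolerate insertions and deletions, for example by grouping neighboring diagonals, which this sketch does not attempt.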
A Python wrapper library using Python's ctypes to call the C functions: falcon_kit.py
Some Python scripts for (1) overlapping reads, (2) generating consensus and (3) generating assembly contigs:
falcon_overlap.py # an overlapper
falcon_wrap.py # generates consensus from a group of reads
get_rdata.py # a utility for preparing data for falcon_wrap.py
falcon_asm.py # takes the overlap information and the sequences to generate assembled contigs
falcon_fixasm.py # a script that analyzes the assembly graph and breaks contigs at potential mis-assembly points
remove_dup_ctg.py # a utility to remove duplicated contigs from the assembly results
You need to install pbcore and networkx first. You might want to install the HBAR-DTK if you want to assemble genomes from raw PacBio data.
On a Linux box, you should be able to use the standard python setup.py install to compile the C code and install the Python package. There is no standard way to install the shared objects built from the C code inside a Python package, so I used a hack to make it work. It might have some unexpected behavior. Alternatively, you can simply install the .so files in a path where the operating system can find them (e.g. by setting the environment variable LD_LIBRARY_PATH) and remove all prefixes in the Python ctypes CDLL function calls.
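The difference between loading a shared object with and without a prefix can be sketched as follows. This is an illustration only, not the code in falcon_kit.py: the helper names, the file name falcon.so, and the directory /opt/falcon/lib are hypothetical. When ctypes.CDLL is given a bare file name with no directory part, the dynamic linker searches its usual locations, which on Linux include the directories listed in LD_LIBRARY_PATH.

```python
import os
from ctypes import CDLL


def shared_object_path(so_name, prefix=None):
    """Return the string to hand to CDLL.

    With a prefix, the library is loaded from that exact directory.
    Without one, the bare name is returned and the dynamic linker
    searches its usual locations (including LD_LIBRARY_PATH on Linux).
    """
    if prefix:
        return os.path.join(prefix, so_name)
    return so_name


def load_shared_object(so_name, prefix=None):
    """Load a shared object with ctypes; raises OSError if it is not found."""
    return CDLL(shared_object_path(so_name, prefix))


# Hypothetical names, shown only to contrast the two forms.
print(shared_object_path("falcon.so", "/opt/falcon/lib"))
print(shared_object_path("falcon.so"))
```

Calling load_shared_object("falcon.so") without a prefix would only succeed if a library of that name is actually on the linker's search path.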
Example for generating pre-assembled reads:
python get_rdata.py queries.fofn targets.fofn m4.fofn 72 0 16 8 64 50 50 | falcon_wrap.py > p-reads-0.fa
bestn : 72
group_id : 0
num_chunk : 16
min_cov : 8
max_cov : 64
trim_align : 50
trim_plr : 50
It is designed to be used with the m4 alignment information generated by blasr + HBAR_WF2.py (https://github.com/PacificBiosciences/HBAR-DTK)
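Because get_rdata.py takes its parameters positionally, it is easy to swap two numbers by accident. The sketch below builds the argument list from named values in the order given in the example above. The helper build_get_rdata_args and the constant PARAM_ORDER are hypothetical and not part of the package; only the parameter names and their order come from the example.

```python
import shlex

# Positional parameters, in the order used by the example command.
PARAM_ORDER = ("bestn", "group_id", "num_chunk", "min_cov",
               "max_cov", "trim_align", "trim_plr")


def build_get_rdata_args(queries, targets, m4, **params):
    """Return the argv list for get_rdata.py from named parameters."""
    missing = [name for name in PARAM_ORDER if name not in params]
    if missing:
        raise ValueError("missing parameters: " + ", ".join(missing))
    return (["python", "get_rdata.py", queries, targets, m4]
            + [str(params[name]) for name in PARAM_ORDER])


args = build_get_rdata_args(
    "queries.fofn", "targets.fofn", "m4.fofn",
    bestn=72, group_id=0, num_chunk=16,
    min_cov=8, max_cov=64, trim_align=50, trim_plr=50,
)
command = " ".join(shlex.quote(arg) for arg in args)
print(command)
```

The printed string is the left-hand side of the pipe in the example; the output still has to be piped into falcon_wrap.py as shown there.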
Example for generating overlap data:
falcon_overlap.py --min_len 4000 --n_core 24 --d_core 3 preads.fa > preads.ovlp
Example for generating an assembly:
falcon_asm.py preads.ovlp preads.fa
The following files will be generated by falcon_asm.py in the same directory:
full_string_graph.adj # the adjacent nodes of the edges in the full string graph
string_graph.gexf # the gexf file of the string graph for graph visualization
string_graph.adj # the adjacent nodes of the edges in the string graph after transitive reduction
edges_list # full edge list
paths # path for the unitigs
unit_edges.dat # path and sequence of the unitigs
uni_graph.gexf # unitig graph in gexf format
unitgs.fa # fasta file of the unitigs
all_tigs_paths # paths for all final contigs (= primary contigs + associated contigs)
all_tigs.fa # fasta file for all contigs
primary_tigs_paths # paths for all primary contigs
primary_tigs.fa # fasta file for the primary contigs
asm_graph.gexf # the assembly graph where the edges are the contigs
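A quick way to inspect the FASTA outputs listed above, such as primary_tigs.fa, is to summarize contig lengths. The sketch below is not part of the package; the function names are hypothetical, and it assumes ordinary FASTA files where each record starts with a ">" header line. It reports the number of contigs, the total length, and the N50 (the length of the contig at which the running total of lengths, sorted from longest to shortest, first reaches half of the total).

```python
import sys


def read_fasta_lengths(path):
    """Return a dict mapping each FASTA header to its sequence length."""
    lengths = {}
    name = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                name = line[1:].strip()
                lengths[name] = 0
            elif name is not None:
                lengths[name] += len(line)
    return lengths


def n50(lengths):
    """Return the N50 of an iterable of contig lengths (0 if empty)."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2.0
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


if __name__ == "__main__" and len(sys.argv) > 1:
    contigs = read_fasta_lengths(sys.argv[1])
    print("contigs:", len(contigs))
    print("total length:", sum(contigs.values()))
    print("N50:", n50(contigs.values()))
```

Saved under a name of your choice (for example tig_stats.py, a hypothetical name), it could be run as python tig_stats.py primary_tigs.fa.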
Although I have tested this tool kit on genomes up to 150Mb and obtained reasonably good assembly results, this tool kit is still highly experimental and is not meant to be used by novices. If you would like to try it out, you will very likely need to know more details about it and be able to tweak the code to adapt it to your computation cluster. I hope that I can provide more details and clean the code up a little in the future so it can be useful for more people.
The principle of the layout algorithm is also available at https://speakerdeck.com/jchin/string-graph-assembly-for-diploid-genomes-with-long-reads
A major part of the coding work was done on my own time and on my own MacBook(R) Air. However, as a PacBio(R) employee, most of the testing was done with data generated by PacBio and on PacBio's computational resources, so it is fair that the code is released under PacBio's version of an open source license. If you are from a competitor and try to take advantage of any open source code from PacBio, the only way you can really justify such practice is to release your real data in public and your code as open source too.
Also, releasing this code to the public is fully at my own discretion. If my employer has any concerns about this, I might have to take it down.
Standard PacBio Open Source License that is associated with this package:
#################################################################################$$
# Copyright (c) 2011-2014, Pacific Biosciences of California, Inc.
#
# All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted (subject to the limitations in the
# disclaimer below) provided that the following conditions are met:
#
# * Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
#
# * Redistributions in binary form must reproduce the above
# copyright notice, this list of conditions and the following
# disclaimer in the documentation and/or other materials provided
# with the distribution.
#
# * Neither the name of Pacific Biosciences nor the names of its
# contributors may be used to endorse or promote products derived
# from this software without specific prior written permission.
#
# NO EXPRESS OR IMPLIED LICENSES TO ANY PARTY'S PATENT RIGHTS ARE
# GRANTED BY THIS LICENSE. THIS SOFTWARE IS PROVIDED BY PACIFIC
# BIOSCIENCES AND ITS CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED
# WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
# OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL PACIFIC BIOSCIENCES OR ITS
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
# USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
# OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
# SUCH DAMAGE.
#################################################################################$$
--Jason Chin, Dec 16, 2013