Assembly evaluation tool
Asset can use sequencing data from four platforms (Pacbio, 10X, Bionano, HiC) to accumulate the support evidence for a de novo assembly. You can use Pipeline Guide to build your own pipeline.
- zlib
- minimap2
- bwa
- samtools
- RefAligner
- runner (optional, only necessary when you run my pipeline script run_asset.py)
git clone https://github.com/dfguan/asset.git
cd asset/src && make
There will be bin directory keeping all asset executables under asset after compiling successfully.
Given an assembly asm, use the following command to preprocess.
bin/detgaps $asm > $output_dir/gaps.bed
Given a pacbio files list pblist and the assembly asm, apply the following command to get Pacbio support regions.
for fl in $pblist
do
minimap2 -xmap-pb -t 12 $asm $fl > $fl.paf
done
bin/ast_pb $fl1.paf $fl2.paf $fl3.paf ... >$output_dir/pb.bed 2>ast_pb.log
Given a 10x read files list 10xlist (suppose in fastq.gz format) and the assembly asm, use the following command to get 10X support regions.
bwa index $asm
while read -r r1 r2
do
prefix=`basename $r1 .fastq.gz`
bin/10x -c -p $prefix $r1 $r2 # generate trimmed read files $prefix_{1,2}.fq.gz
bwa mem -t12 $asm $prefix_1.fq.gz $prefix_2.fq.gz | samtools view -b - > $prefix.bam
done < $10xlist
bin/ast_10x $output_dir/gaps.bed $bam1 $bam2 $bam3 ... >$output_dir/10x.bed 2>ast_10x.log
10x_trim is available at dfguan/utls.
Given a bionano consensus map files list bnlist (suppose in .cmap format) and the assembly asm, use the following command to get Bionano support regions.
solve_dir=/nfs/users/nfs_d/dg30/luster_dg30/dg30/projects/vgp/tools/Solve3.2.1_04122018/
for fl in $bnlist
do
fn=`basename $fl`
fn_pref=`echo $fn | cut -d_ -f1`
tech=`echo $fn | cut -d_ -f2`
enzyme=`echo $fn | cut -d_ -f3`
perl $solve_dir/HybridScaffold/04122018/scripts/fa2cmap.pl -n ${enzyme:0:4} -i ref -o $output_dir
cp $fl $output_dir
ref_prefix=${asm%.*}
ref_cmap=$output_dir/fa2cmap/"$ref_prefix"_"$enzyme"_0Kb_0labels.cmap
key_fn=$output_dir/fa2cmap/"$ref_prefix"_"$enzyme"_0Kb_0labels_key.txt
query_cmap=$output_dir/$fn
optn=${tech,,}
if [ "$enzyme" = "DLE1" ]
then
optn="DLE1_"$optn
fi
python2 $solve_dir/Pipeline/04122018/runCharacterize.py -t $solve_dir/RefAligner/7437.7523rel/RefAligner -q $query_cmap -r $ref_cmap -p $solve_dir/Pipeline/04122018/ -a $solve_dir/RefAligner/7437.7523rel/optArguments_nonhaplotype_"$optn".xml -n 4
map_path=$output_dir/alignref/${fn%.*}
rmap_fn="$map_path"_r.cmap
qmap_fn="$map_path"_q.cmap
xmap_fn="$map_path".xmap
bin/ast_bion $rmap_fn $qmap_fn $xmap_fn $key_fn > $output_dir/bionano_"$tech"_"$enzyme".bed 2>ast_bion_"$tech"_"$enzyme".log
done
fln=`wc -l $bnlist | awk {print $1}`
# only consider one or two bionano files
if [ "$fln" -eq 2 ]
then
bin/union bionano_*.bed > bn.bed
else
cp bionano_*.bed bn.bed
fi
Update soon
Given a HiC files list hiclist (suppose in fastq.gz format) and the assembly asm, use the following command to get Bionano support regions.
bin/split_fa $asm > split.fa
samtools faidx split.fa
bwa index split.fa
while read -r r1 r2
do
prefix=`basename $r1 .fq.gz`
dirn=`dirname $r1`
bwa mem -SP -B10 -t12 split.fa $r1 $r2 | samtools view -b - > $dirn/$prefix.bam
done < $hiclist
bin/col_conts *.bam > $output_dir/links.mat
bin/ast_hic2 split.fa.fai $output_dir/links.mat >$output_dir/hic2.bed 2>ast_hic.log
once you got the bed files from Pacbio (pb.bed), 10X (10x.bed), Bionano (bn.bed), HiC (hic.bed), run the following command to get accumulation bed file and detect break points.
bin/acc $output_dir/gaps.bed $output_dir/{pb,bn}.bed $output_dir/bn.bed > $output_dir/pb_bn.bed
bin/acc $output_dir/gaps.bed $output_dir/{10x,hic2,bn}.bed > $output_dir/10x_hic2_bn.bed
bin/pchlst -c $output_dir/gaps.bed $output_dir/pb_bn.bed > $output_dir/pchlst_ctg.bed
bin/pchlst $output_dir/gaps.bed $output_dir/10x_hic2_bn.bed > $output_dir/pchlst_scaf.bed
bin/union_brks $output_dir/gaps.bed $output_dir/pchlst_{ctg,scaf}.bed > $output_dir/pchlst_final.bed
Welcome to use, you can use github webpage to report an issue or email me dfguan9@gmail.com with any advice.