Often, genome assembly projects have illumina whole genome sequencing reads available for the assembled individual.
The k-mer spectrum of this read set can be used for independently evaluating assembly quality without the need of a high quality reference.
Merqury provides a set of tools for this purpose.
- gcc 10.2.0 or higher (for installing Meryl)
- Meryl v1.4.1
- Java run time environment (JRE)
- R with argparse, ggplot2, and scales (recommend R 4.0.3+)
- bedtools
- samtools
Note that igvtools is no longer used. The .tdf
files are replaced with .wig
files, compatable to IGV and UCSC genome browser.
Only the latest Merqury version supports the compressed
option used in Verkko.
- Meryl
Get a working Meryl in yourPATH
. Download Meryl release: https://github.com/marbl/meryl/releases/tag/v1.4.1
tar -xJf meryl-1.4.*.tar.xz
cd meryl-1.4.1/bin
export PATH=$pwd:$PATH
If the binary doesn't work, download the source and compile:
git clone https://github.com/marbl/meryl.git
cd meryl/src
make -j 24
export PATH=/path/to/meryl/…/bin:$PATH
See if we get help message with meryl
.
- Merqury
git clone https://github.com/marbl/merqury.git
cd merqury
export MERQURY=$PWD
Other dependency:
- Java run time environment (JRE)
- R with argparse, ggplot2, and scales (recommend R 4.0.3+)
- bedtools
- samtools
The v1.3 Merqury is compatable with Meryl v1.3, however does not support homopolymer-compressed kmers. In addition, multiple issues were fixed (e.g. use of large k-mers, better memory utilization, minor bugs in the logs etc.) since then. Therefore, we recommend to use the latest Meryl and Merqury. The Conda version below is currently deploying v1.3.
- Get a working Meryl in your PATH
Download Meryl release: https://github.com/marbl/meryl/releases/tag/v1.3
tar -xJf meryl-1.3.*.tar.xz
cd meryl-1.3/bin
export PATH=$pwd:$PATH
If the binary doesn't work, download the source and compile:
cd meryl/src
make -j 24
export PATH=/path/to/meryl/…/bin:$PATH
See if we get help message with meryl
.
- Download the release version and set env variable
$MERQURY
wget https://github.com/marbl/merqury/archive/v1.3.tar.gz
tar -zxvf v1.3.tar.gz
cd merqury-1.3
export MERQURY=$PWD
Add the “export” part to your environment for meryl
and MERQURY
(~/.bash_profile or ~/.profile).
Add installation dir paths for bedtools
and samtools
to your environment.
source
it.
Thanks to @EdHarry, a conda recipe is now available: https://anaconda.org/bioconda/merqury
On a new conda environment, run:
conda install -c conda-forge -c bioconda merqury
Or, if you have a different version of jdk installed or want to have a separate conda environnment for merqury:
conda create -n merqury -c conda-forge -c bioconda merqury openjdk=11
You will then need to activate the merqury environment before using it with:
conda activate merqury
Test running
Rscript $MERQURY/plot/plot_spectra_cn.R --help
In case R complains for version mismatches of the R packages, try
conda update --all
It seems like R in conda isn't maintained anymore. Try to modify channel priority in .condarc
.
- !! Merqury assumes all meryl dbs (dirs) are named with
.meryl
. !!
On a single machine:
ln -s $MERQURY/merqury.sh # Link merqury
./merqury.sh <read-db.meryl> [<mat.meryl> <pat.meryl>] <asm1.fasta> [asm2.fasta] <out>
Usage: merqury.sh <read-db.meryl> [<mat.meryl> <pat.meryl>] <asm1.fasta> [asm2.fasta] <out>
<read-db.meryl> : k-mer counts of the read set
<mat.meryl> : k-mer counts of the maternal haplotype (ex. mat.only.meryl or mat.hapmer.meryl)
<pat.meryl> : k-mer counts of the paternal haplotype (ex. pat.only.meryl or pat.hapmer.meryl)
<asm1.fasta> : Assembly fasta file (ex. pri.fasta, hap1.fasta or maternal.fasta)
[asm2.fasta] : Additional fasta file (ex. alt.fasta, hap2.fasta or paternal.fasta)
*asm1.meryl and asm2.meryl will be generated. Avoid using the same names as the hap-mer dbs
<out> : Output prefix
< >
: required
[ ]
: optional
Below is showing examples how to run Merqury using the prebuilt meryl dbs on a. thaliana F1 hybrid. The fasta files are the trio-binned assemblies from Koren et al.
### Download assemblies ###
wget https://gembox.cbcb.umd.edu/triobinning/athal_COL.fasta
wget https://gembox.cbcb.umd.edu/triobinning/athal_CVI.fasta
### Download prebuilt meryl dbs ###
# read.meryl of the F1 hybrid between COL-0 and CVI-0
wget https://obj.umiacs.umd.edu/marbl_publications/merqury/athal/a_thal.k18.meryl.tar.gz
# hap-mers for COL-0 haplotype
wget https://obj.umiacs.umd.edu/marbl_publications/merqury/athal/a_thal.col0.hapmer.meryl.tar.gz
# hap-mers for CVI-0 haplotype
wget https://obj.umiacs.umd.edu/marbl_publications/merqury/athal/a_thal.cvi0.hapmer.meryl.tar.gz
# Untar
for gz in *.tar.gz
do
tar -zxf $gz
done
# Run merqury
$MERQURY/merqury.sh F1.k18.meryl col0.hapmer.meryl cvi0.hapmer.meryl athal_COL.fasta athal_CVI.fasta test
# I don't have the hap-mers
$MERQURY/merqury.sh read-db.meryl asm1.fasta out_prefix
# Using the example above
$MERQURY/merqury.sh F1.k18.meryl athal_COL.fasta test-1
# I have the hap-mers
$MERQURY/merqury.sh read-db.meryl mat.meryl pat.meryl asm1.fasta out_prefix
# Using the example above
$MERQURY/merqury.sh F1.k18.meryl col0.hapmer.meryl cvi0.hapmer.meryl athal_COL.fasta test-1
# I don't have the hap-mers
$MERQURY/merqury.sh read-db.meryl asm1.fasta asm2.fasta out_prefix
# Using the example above
$MERQURY/merqury.sh F1.k18.meryl athal_COL.fasta athal_CVI.fasta test-2
# I have the hap-mers
$MERQURY/merqury.sh read-db.meryl mat.meryl pat.meryl asm1.fasta asm2.fasta out_prefix
# Using the example above
$MERQURY/merqury.sh F1.k18.meryl col0.hapmer.meryl cvi0.hapmer.meryl athal_COL.fasta athal_CVI.fasta test-2
- Note there is no need to run merqury per-assemblies. Give two fasta files, Merqury generates stats for each and combined.
Merqury starts with eval/spectra_cn.sh
.
When hap-mers are provided, merqury runs modules under trio/
in addition to eval/spectra_cn.sh
.
The following can run at the same time. Modules with dependency are followed by arrows (->).
eval/spectra_cn.sh
->trio/spectra_hap.sh
trio/hap_blob.sh
trio/phase_block.sh
per assembly ->trio/block_n_stats.sh
Meryl, the k-mer counter inside, uses the maximum cpus available.
Set OMP_NUM_THREADS=24
for example to use 24 threads.
On slurm environment, simply run:
ln -s $MERQURY/_submit_merqury.sh # Link merqury
./_submit_merqury.sh <read-db.meryl> [<mat.meryl> <pat.meryl>] <asm1.fasta> [asm2.fasta] <out>
Change the sbatch
to match your environment. (ex. partition)
eval/spectra_cn.sh
: k-mer completeness, qv, spectra-cn and spectra-asm plots, asm-only.bed
and.tdf
for tracking errorseval/qv.sh
: just get the qv stats and quit.trio/spectra_hap.sh
: hap-mer level spectra-cn plots, hap-mer completenesstrio/hap_blob.sh
: blob plots of the hap-mers in each contg/scaffoldtrio/phase_block.sh
: phase block statistics, phase block N* plots, hap-mer tracks (.bed
and.tdf
files)trio/block_n_stats.sh
: continuity plots (phase block N* or NG* plots, phase block vs. contig/scaffold plots)trio/switch_error.sh
: this is run part ofphase_blck.sh
, however can be re-run with desired short-range switch parameters. Runtrio/block_n_stats.sh
along with it to get the associated plots.
Run each script without any parameters if not sure what to do.
For example, ./trio/switch_error.sh
will give a help message and quit.
Following wiki pages have more detailed examples.
1. Prepare meryl dbs (details)
- Get the right k size
- Build k-mer dbs with meryl
- Build hap-mers for trios
2. Overall assembly evaluation (details)
- Reference free QV estimate
- k-mer completeness (recovery rate)
- Spectra copy number analysis
- Track error bases in the assembly
3. Phasing assessment with hap-mers (details)
- Inherited hap-mer plots
- Hap-mer blob plots
- Hap-mer completeness (recovery rate)
- Spectra copy number analysis per hap-mers
- Phased block statistics and switch error rates
- Track each haplotype block in the assembly
Meryl dbs from Illumina WGS and hapmers are available here for
- A. thaliana COL-0 x CVI-0 F1
- NA12878 (HG001)
- HG002
Please use this paper to cite Merqury:
Rhie, A., Walenz, B.P., Koren, S. et al. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol 21, 245 (2020). https://doi.org/10.1186/s13059-020-02134-9