consensusLJA is an assembler designed to generate high-quality consensus genome assemblies using PacBio high-fidelity (HiFi) reads alone. It aims to simplify the assembly process for non-human genomes by producing nearly complete haplotype-mixed consensus assemblies, which are more cost-effective and require fewer sequencing technologies than diploid assemblies.
- Generates consensus genome assemblies using only HiFi reads
- Provides nearly complete assemblies with a mosaic of two haplotypes
- Reduces the complexity and cost associated with diploid assembly generation
- Tailored for non-human genomes, though applicable to any genome sequencing project
To install consensusLJA (cLJA), follow the steps below to install the required dependencies, clone the repository, and build the tool.
- Install Dependencies:
  Install the required dependencies to ensure proper functionality:
  - gcc: tested on v11.4.0
  - minimap2: v2.21 (for creating BAM files)
  - samtools: v1.11 (for processing BAM files)
  - pysam: v0.22.1 (for processing BAM files)
  These dependencies can be installed using Conda, which will create a new environment named clja (if you have already installed these dependencies, skip this step):
      conda env create -f requirements.yml
- Install LJA assembler:
  cLJA uses the assembly graph generated by LJA, so the LJA assembler (the experimental branch) is required:
      git clone https://github.com/AntonBankevich/LJA.git -b experimental
      cd LJA
      cmake . && make
- Activate the Environment:
  Activate the clja Conda environment:
      conda activate clja
- Clone the Repository:
  Clone the consensusLJA repository:
      git clone https://github.com/ZhangZhenmiao/consensusLJA.git
- Build the Tool:
  Navigate to the repository and build the tool using make:
      cd consensusLJA && make
      chmod +x cLJA get_reference.py
- Add cLJA to Your PATH (Optional):
  Add the cLJA executable to your environment's PATH:
      export PATH="`pwd`":$PATH
  Note: You can add this line to your ~/.bashrc for convenience, so that cLJA is available in your PATH every time you start a new shell session (optional):
      echo -e '\nexport PATH="'$(pwd)'":$PATH' >> ~/.bashrc
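Once the steps above are done, a quick check that the tools are actually reachable can save a failed run later. The following is only a minimal sketch: the helper names `find_missing_tools` and `missing_modules` are hypothetical, and the default executable names are an assumption about how the tools appear on your PATH.

```python
import importlib.util
import shutil

# Assumption: these are the executable names as they appear on PATH.
DEFAULT_TOOLS = ("minimap2", "samtools", "cLJA")


def find_missing_tools(tools=DEFAULT_TOOLS, search_path=None):
    """Return the names in `tools` that are not found as executables.

    `search_path` is an optional PATH-style string; None means the current PATH.
    """
    return [tool for tool in tools if shutil.which(tool, path=search_path) is None]


def missing_modules(modules=("pysam",)):
    """Return the Python modules in `modules` that cannot be imported."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    problems = find_missing_tools() + missing_modules()
    if problems:
        print("Not found: " + ", ".join(problems))
    else:
        print("All dependencies found.")
```

Run it inside the activated Conda environment so that the same PATH and Python interpreter are checked that cLJA will use.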
With HiFi reads, first assemble using LJA. See more details in the LJA manual:
lja [options] -o <output_dir> --reads <reads_file> [--reads <reads_file2> ...]
The output directory will contain the following structure:
<output_dir>/
│
├── 00_CoverageBasedCorrection/
├── 01_TopologyBasedCorrection/
│ ├── final_dbg.dot
│ ├── final_dbg.fasta
├── 02_MDBG/
│ ├── mdbg_edge_seqs.fasta
├── 03_Polishing/
├── assembly.fasta
├── lja.log
├── mdbg.gfa
└── version.txt
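cLJA consumes three files from this tree: final_dbg.dot, final_dbg.fasta, and mdbg_edge_seqs.fasta. As a rough sketch, the lookup could be scripted so that a missing file is caught before cLJA is started. The function name `locate_clja_inputs` is hypothetical; the relative paths are taken from the listing above.

```python
from pathlib import Path

# Relative paths as shown in the LJA output listing above.
CLJA_INPUTS = {
    "dot": "01_TopologyBasedCorrection/final_dbg.dot",
    "fasta": "01_TopologyBasedCorrection/final_dbg.fasta",
    "multidbg": "02_MDBG/mdbg_edge_seqs.fasta",
}


def locate_clja_inputs(lja_output_dir):
    """Map each cLJA input name to its path inside an LJA output directory.

    Raises FileNotFoundError listing every expected file that is absent.
    """
    root = Path(lja_output_dir)
    paths = {name: root / rel for name, rel in CLJA_INPUTS.items()}
    missing = [str(path) for path in paths.values() if not path.is_file()]
    if missing:
        raise FileNotFoundError("Missing LJA output files: " + ", ".join(missing))
    return paths
```

For example, `locate_clja_inputs("LJA_output")["dot"]` would give the path to final_dbg.dot if the LJA run completed.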
After assembling with LJA, use cLJA to process the relevant files (LJA will be integrated into cLJA in later versions).
cLJA --dot=<string> --fasta=<string> --multidbg=<string> --output=<string>
- -d, --dot <string>:
  Specifies the graph.dot file, typically found under 01_TopologyBasedCorrection/final_dbg.dot. It represents the condensed de Bruijn graph of LJA.
  Example: --dot=final_dbg.dot, -d final_dbg.dot
- -f, --fasta <string>:
  Specifies the graph.fasta file, found under 01_TopologyBasedCorrection/final_dbg.fasta. It contains the sequences of the edges in the graph of LJA.
  Example: --fasta=final_dbg.fasta, -f final_dbg.fasta
- -m, --multidbg <string>:
  Specifies the multidbg file, which contains the edge sequences in the multiplex de Bruijn graph of LJA, found under 02_MDBG/mdbg_edge_seqs.fasta.
  Example: --multidbg=mdbg_edge_seqs.fasta, -m mdbg_edge_seqs.fasta
- -o, --output <string>:
  The output directory where the results will be stored. This directory must be new and should not contain any prior data, to avoid conflicts or overwrites.
  Example: --output=/path/to/output_directory, -o /path/to/output_directory
- -?, --help:
  Prints a help message that includes information about all the available options. Use this if you want a quick summary of how to use the tool.
  Example: --help
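When cLJA is called from a script, the options above can be assembled programmatically. The sketch below uses only the long options documented here and refuses to reuse an existing output directory, since the directory must be new. The function name `build_clja_command` is hypothetical, and the default executable name assumes cLJA is on your PATH.

```python
from pathlib import Path


def build_clja_command(dot, fasta, multidbg, output, executable="cLJA"):
    """Return the cLJA argument list, rejecting an output path that already exists."""
    if Path(output).exists():
        raise FileExistsError(f"Output directory already exists: {output}")
    return [
        executable,
        f"--dot={dot}",
        f"--fasta={fasta}",
        f"--multidbg={multidbg}",
        f"--output={output}",
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`, which raises an error if cLJA exits with a non-zero status.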
In the example folder under this Google Drive link, there is a small test dataset. The output directory from LJA has already been generated in LJA_output. You can run cLJA using the command below:
cLJA -d LJA_output/01_TopologyBasedCorrection/final_dbg.dot -f LJA_output/01_TopologyBasedCorrection/final_dbg.fasta -m LJA_output/02_MDBG/mdbg_edge_seqs.fasta -o cLJA_output
In this example, cLJA will process the final_dbg.dot, final_dbg.fasta, and mdbg_edge_seqs.fasta files and write the results to the cLJA_output directory. It will generate the final consensus assembly in the file cLJA_output/graph.final.fasta (the contigs are currently homopolymer-collapsed; later versions will uncompress them).
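Homopolymer-collapsed means that every run of identical consecutive bases is stored as a single base, so contig lengths and coordinates in graph.final.fasta will not match the uncompressed genome. The snippet below only illustrates that transformation on a toy sequence. It is not cLJA's own code, and the function name `homopolymer_compress` is hypothetical.

```python
from itertools import groupby


def homopolymer_compress(seq):
    """Collapse each run of identical consecutive characters to one character."""
    # groupby yields one (base, run) pair per maximal run of equal characters.
    return "".join(base for base, _run in groupby(seq))


print(homopolymer_compress("AAACCGTTTA"))  # → ACGTA
```

Applying the same function to a reference sequence puts it in the same compressed space, which may be convenient when comparing it with the current output.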
If you encounter any issues or errors, please feel free to open an issue on the repository.