QUESTIONS: gabriel [dot] reno [ at sign ] gmail.com
leeHom is a program to trim adapters and merge overlapping sequencing using a Bayesian reconstruction. This is particularly useful for ancient DNA and or DNA shorter fragment sizes like cell-free DNA.
Go to https://github.com/grenaud/leeHom and do one of the following:
- Clone the repository:
git clone --recursive https://github.com/grenaud/leeHom.git
- Download the ZIP archive and then manually download the submodules
Make sure you have cmake
installed. If required, install it by typing apt-get install cmake
for Ubuntu for example. You also need JsonCpp which can be installed via apt-get install libjsoncpp-dev
-
Build the submodules and main code by typing :
make
To launch the program simply type:
src/leeHom
If the input is fastq, the output will also be fastq. Normally you will get different files. Here is a table explaining their meaning/content. The [PREFIX] is the output prefix specified using -fqo.
File suffix | Meaning |
---|---|
[PREFIX].fq.gz | Sequences that were either trimmed or untrimmed confidently by leeHom |
[PREFIX].fail.fq.gz | Sequences where the most likely original sequence length was not several fold more likely than the second most likely sequence length. The most likely sequence length is are nonetheless found in the file. |
File suffix | Meaning |
---|---|
[PREFIX].fq.gz | Sequences that were trimmed and merged confidently by leeHom |
[PREFIX]_r1.fq.gz | Forward reads that were neither trimmed nor merged confidently by leeHom |
[PREFIX]_r2.fq.gz | Reverse reads that were neither trimmed nor merged confidently by leeHom |
[PREFIX].fail.fq.gz | Sequences that were trimmed and merged by leeHom and where the most likely original sequence length was not several fold more likely than the second most likely sequence length |
[PREFIX]_r1.fail.fq.gz | Forward reads that were neither trimmed nor merged by leeHom and where the most likely original sequence length was not several fold more likely than the second most likely sequence length |
[PREFIX]_r2.fail.fq.gz | Reverse reads that were neither trimmed nor merged by leeHom and where the most likely original sequence length was not several fold more likely than the second most likely sequence length |
By default, leehom prints a log messsage to stderr. leehom counts 2 paired-end reads or one single-end read as one cluster. This is how to interpret the log:
Field | Meaning |
---|---|
Total | Total number of clusters analyzed by leehom |
Merged (trimming) | Number of clusters where the adapter was found and, for paired-end reads, a likely overlap between the forward and reverse reads was found |
Merged (overlap) | When using a prior or --ancientdna and paired-end data, the number of clusters where an likely overlap between the ends of paired-end reads was found and reads were merged as a single molecule |
Kept PE/SR | Cluster where no sufficient evidence for merging paired-end or trimming single-end reads was found |
Trimmed SR | When using single-end reads, the number of reads where a match to the adapter was found |
Adapter dimers/chimeras | Number of cluster where a chimeric sequence (two adapters merged together without insert) was found and were flagged as QC failed |
Failed Key | If the user provides a key (sequences must start with this sequences is it was part of the adapter and not covered by the sequencing primer) the number of clusters where the key was not found will be reported here. |
If you have previous data and have a representative sample of reasonable size for your library, using a prior will give you the best reconstruction possible. leehom models the distribution as a log-normal one. To infer the parametes of the log-normal distribution, you can use src/approxDist.R to get the parameters and use those as prior in leehom. To create this data from aligned modern DNA or (preferably aligned) pre-trimmed ancient, you can launch the following command on the BAM file:
samtools view aligned.bam | gawk 'and($2,5)==0 { print length($10) }; and($2,2)==2 && $9>0 { print $9 }' | gzip > sizedata.dat.gz
(We recommend the use of this command with BWA especially as it uses the properly paired flag).
Given that the resulting file has one molecule length per line. You can use the following R script to get the maximum likelihood parameters for the log-normal distribution:
src/approxDist.R testData/sizedata.dat.gz
Loading required package: survival
Loading required package: splines
meanlog sdlog
5.1444983185 0.2885764766
(0.0009125589) (0.0006452766)
The first number (5.1444983185
) represents the "Location" for log-normal distribution whereas the second number (0.2885764766
) represents the "Scale".
Those parameters can now be used in leehom as --loc 5.1444983185 --scale 0.2885764766
.
If you have no previous data, you can launch leehom using a uniform prior (default parameters). If you have modern DNA with relatively long insert sizes, you can use leehom as is. If you have ancient DNA and/or suspect that molecules are short, use the --ancientdna
which will merge reads that share only a partial overlap if they have sufficient probabilistic support.
You can use a file descriptor for each. For example if they are stored as FASTA:
src/leeHom -f `cat /mydata/forward.fa | tail -n+2 | tr -d "\n"` -s `cat /mydata/reverse.fa | tail -n+2 | tr -d "\n"`
Contact your sequencing center or core and ask them to send you the sequence of the sequencing adapters. They should be known in advance since, currently, leehom cannot infer the sequence but uses this information as a prior.
For single-end, no. For paired-end, mate have to be consecutive in the file (i.e. First mate 1
, second mate 1
, First mate 2
, second mate 2
, etc.. ). This can be done by either using raw data from the sequencer or sorting by name.
BAM files can also be used to store unmapped reads. leeHom should be used prior to mapping since reads will map to a more accurate location if the adapters have been removed and overlapping stretches have been merged.
Here is a table with the meaning for each FF flag:
FF | Meaning |
---|---|
1 | single-end reads trimmed |
2 | paired-end merged overlapping portions |
3 | paired-end trimmed and merged overlapping portions |
A small test set of 50,000 ancient DNA sequences from the Altai Neandertal can be found here: testData/rawAncientDNA.bam
To launch the reconstruction program for this test data set, use the following command:
src/leeHom -f AGATCGGAAGAGCACACGTCTGAACTCCAG -s GGAAGAGCGTCGTGTAGGGAAAGAGTGTAG --ancientdna -o testData/reconsAncientDNA.bam testData/rawAncientDNA.bam
The reconstructed ancient DNA fragments will be in testData/reconsAncientDNA.bam
For fastq data:
src/leeHom -f AGATCGGAAGAGCACACGTCTGAACTCCAG -s GGAAGAGCGTCGTGTAGGGAAAGAGTGTAG --ancientdna -fq1 testData/rawAncientDNA.f1.gz -fq2 testData/rawAncientDNA.f2.gz -fqo testData/outfq
The output files will have the testData/outfq*
prefix.
Alternatively, you can specify the names of all 6 files:
src/leeHom -f AGATCGGAAGAGCACACGTCTGAACTCCAG -s GGAAGAGCGTCGTGTAGGGAAAGAGTGTAG --ancientdna -fq1 testData/rawAncientDNA.f1.gz -fq2 testData/rawAncientDNA.f2.gz -fqose testData/outfq.fq.gz -fqosef testData/outfq.fail.fq.gz -fqope1 testData/outfq_r1.fq.gz -fqope1f testData/outfq_r1.fail.fq.gz -fqope2 testData/outfq_r2.fq.gz -fqope2f testData/outfq_r2.fail.fq.gz