Skip to content

Commit

Permalink
Combining different polishers
Browse files Browse the repository at this point in the history
  • Loading branch information
rrwick committed Jan 31, 2018
1 parent 10bcacc commit fd72cfd
Show file tree
Hide file tree
Showing 7 changed files with 1,268 additions and 20 deletions.
45 changes: 26 additions & 19 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -22,17 +22,18 @@ This repository uses a bacterial genome to assess the read accuracy and consensu
* [Basecallers tested](#basecallers-tested)
* [Method](#method)
* [Results/discussion](#resultsdiscussion)
* [Speed](#speed)
* [Total yield](#total-yield)
* [Read identity](#read-identity)
* [Relative read length](#relative-read-length)
* [Assembly identity](#assembly-identity)
* [Read vs assembly identity](#read-vs-assembly-identity)
* [Nanopolish assembly identity](#nanopolish-assembly-identity)
* [Methylation](#methylation)
* [Medaka](#medaka)
* [Training sets](#training-sets)
* [Combining different basecallers](#combining-different-basecallers)
* [Speed](#speed)
* [Total yield](#total-yield)
* [Read identity](#read-identity)
* [Relative read length](#relative-read-length)
* [Assembly identity](#assembly-identity)
* [Read vs assembly identity](#read-vs-assembly-identity)
* [Nanopolish assembly identity](#nanopolish-assembly-identity)
* [Methylation](#methylation)
* [Medaka](#medaka)
* [Combining different basecallers](#combining-different-basecallers)
* [Combining different polishers](#combining-different-polishers)
* [Training sets](#training-sets)
* [Conclusions](#conclusions)
* [References](#references)

Expand Down Expand Up @@ -349,7 +350,7 @@ While Nanopolish can correct many of these errors, it would be better if the bas



# Medaka
### Medaka

Medaka is trying to solve a similar problem to Nanopolish: improving the consensus sequence accuracy using the alignment of multiple reads. It differs from Nanopolish in two significant ways. First, Medaka uses neural networks where Nanopolish uses HMMs. Second, it uses basecalled reads, not the raw signal ([though this is likely to change in the future](https://nanoporetech.github.io/medaka/future.html)). Here I test Medaka v0.2.0:

Expand All @@ -359,21 +360,27 @@ While Medaka could improve most assemblies, it was overall less effective than N



### Training sets
### Combining different basecallers

All supervised learning depends on a good training set, and basecalling is no exception. A nice example comes from the rgrgr_r94 model in Scrappie v1.1.0 and v1.1.1. The primary difference between these two versions is that in v1.1.0, only human DNA was used to train the basecaller, whereas v1.1.1 was trained with a mixed set of genomes ([described here](https://github.com/rrwick/Basecalling-comparison/issues/1) by Scrappie author Tim Massingham). I didn't include v1.1.0 in the above plots because it's a superseded version – it's here only to show the difference a training set makes. The difference in read identity is huge, but assembly identity had a subtler improvement:
This section previously looked at how well a combination of Albacore and Chiron reads assemble. The idea was that perhaps two different basecallers can somewhat 'cancel out' each other's systematic error, leading to a better assembly. This was the case with Albacore and Chiron v0.2, but Chiron v0.3 reads assemble so well that combining them with Albacore reads gives no improvement (it actually makes the assembly a bit worse).

<p align="center"><img src="images/scrappie_comparison.png" width="60%"></p>
I don't think this is relevant anymore, so I've removed it. You can see my earlier results in [a past version of this repository](https://github.com/rrwick/Basecalling-comparison/tree/d5ce4455c5c57d15abec1e625cafa56a7eef1a6e) if you're still interested.



### Combining different basecallers
### Combining different polishers

I tried assembly polishing with both Medaka and Nanopolish (methylation-aware) to see if a joint approach could yield better accuracies. I tried both Medaka followed by Nanopolish and vice versa, but neither combination could improve upon Nanopolish alone:

<p align="center"><img src="images/polishing_methods.png" width="35%"></p>

This section previously looked at how well a combination of Albacore and Chiron reads assemble. The idea was that perhaps two different basecallers can somewhat 'cancel out' each other's systematic error, leading to a better assembly. This was the case with Albacore and Chiron v0.2, but Chiron v0.3 reads assemble so well that combining them with Albacore reads gives no improvement (it actually makes the assembly a bit worse).

I don't think this is relevant anymore, so I've removed it. You can see my earlier results in [a past version of this repository](https://github.com/rrwick/Basecalling-comparison/tree/d5ce4455c5c57d15abec1e625cafa56a7eef1a6e) if you're still interested.

### Training sets

All supervised learning depends on a good training set, and basecalling is no exception. A nice example comes from the rgrgr_r94 model in Scrappie v1.1.0 and v1.1.1. The primary difference between these two versions is that in v1.1.0, only human DNA was used to train the basecaller, whereas v1.1.1 was trained with a mixed set of genomes ([described here](https://github.com/rrwick/Basecalling-comparison/issues/1) by Scrappie author Tim Massingham). I didn't include v1.1.0 in the above plots because it's a superseded version – it's here only to show the difference a training set makes. The difference in read identity is huge, but assembly identity had a subtler improvement:

<p align="center"><img src="images/scrappie_comparison.png" width="50%"></p>



Expand All @@ -395,7 +402,7 @@ Scrappie raw v1.3.0 (rgr_r94, rgrgr_r94 and rnnrf_r94 models) also did quite wel

Anyone interested in maximising assembly accuracy should be using Nanopolish. It improved all assemblies and took most up to about 99.9% (with the methylation-aware option). If you only care about assembly identity, Nanopolish makes your basecaller choice relatively unimportant.

While Medaka does not improve assemblies as well as Nanopolish, it operates on basecalled reads and requires only a fasta/fastq file, not the raw fast5 files. It may therefore be the best choice for assembly polishing when raw reads are not available. However, [the 'Future directions' section of Medaka's documentation](https://nanoporetech.github.io/medaka/future.html) indicates that signal-level processing may be in its future. Furthermore, Medaka uses neural networks, unlike Nanopolish's HMMs. The authors suggest that just as neural networks have outperformed HMMs in basecallers, they will also prove superior in consensus algorithms. Watch this space!
While Medaka does not improve assemblies as well as Nanopolish, it requires only a fasta/fastq file, not the raw fast5 files. It may therefore be the best choice for assembly polishing when raw reads are not available. However, the ['Future directions' section of Medaka's documentation](https://nanoporetech.github.io/medaka/future.html) indicates that signal-level processing may be in its future. Furthermore, Medaka uses neural networks, unlike Nanopolish's HMMs. The authors suggest that just as neural networks have outperformed HMMs in basecallers, they will also prove superior in consensus algorithms. Watch this space!



Expand Down
53 changes: 53 additions & 0 deletions analysis.sh
Original file line number Diff line number Diff line change
Expand Up @@ -54,6 +54,8 @@ mkdir -p 11_nanopolish_meth
mkdir -p 12_nanopolish_meth_data
mkdir -p 13_medaka
mkdir -p 14_medaka_data
mkdir -p 15_combined_polish
mkdir -p 16_combined_polish_data

# Create a table of basic info about each read.
python3 "$python_script_dir"/read_table.py 01_raw_fast5 > 04_read_data/read_data.tsv
Expand Down Expand Up @@ -94,6 +96,18 @@ for f in $read_files; do
medaka_assembly_alignment=14_medaka_data/"$set"_medaka.paf
medaka_assembly_data=14_medaka_data/"$set"_medaka.tsv

medaka_nanopolish_assembly_dir=15_combined_polish
medaka_nanopolish_assembly=15_combined_polish/"$set"_medaka_nanopolish_meth.fasta
medaka_nanopolish_assembly_pieces=16_combined_polish_data/"$set"_medaka_nanopolish_meth_pieces.fasta
medaka_nanopolish_assembly_alignment=16_combined_polish_data/"$set"_medaka_nanopolish_meth.paf
medaka_nanopolish_assembly_data=16_combined_polish_data/"$set"_medaka_nanopolish_meth.tsv

nanopolish_medaka_assembly_dir=15_combined_polish
nanopolish_medaka_assembly=15_combined_polish/"$set"_nanopolish_meth_medaka.fasta
nanopolish_medaka_assembly_pieces=16_combined_polish_data/"$set"_nanopolish_meth_medaka_pieces.fasta
nanopolish_medaka_assembly_alignment=16_combined_polish_data/"$set"_nanopolish_meth_medaka.paf
nanopolish_medaka_assembly_data=16_combined_polish_data/"$set"_nanopolish_meth_medaka.tsv

printf "\n\n\n\n"
echo "NORMALISE READ HEADERS: "$set
echo "--------------------------------------------------------------------------------"
Expand Down Expand Up @@ -175,4 +189,43 @@ for f in $read_files; do
python3 "$python_script_dir"/read_length_identity.py $medaka_assembly_pieces $medaka_assembly_alignment > $medaka_assembly_data
rm $medaka_assembly_pieces $medaka_assembly_alignment

# printf "\n\n\n\n"
# echo "NANOPOLISH (METHYLATION-AWARE) OF MEDAKA ASSEMBLY: "$set
# echo "--------------------------------------------------------------------------------"
# python3 "$python_script_dir"/nanopolish_slurm_wrapper.py $medaka_assembly $all_reads_fixed_names $raw_fast5_dir $medaka_nanopolish_assembly_dir $nanopolish_exec_dir $threads meth
# rm "$all_reads_fixed_names".index*
# rm "$medaka_assembly".fai

# printf "\n\n\n\n"
# echo "ASSESS MEDAKA THEN NANOPOLISH (METHYLATION-AWARE) ASSEMBLY: "$set
# echo "--------------------------------------------------------------------------------"
# python3 "$python_script_dir"/chop_up_assembly.py $medaka_nanopolish_assembly 10000 > $medaka_nanopolish_assembly_pieces
# minimap2 -x map10k -t $threads -c reference.fasta $medaka_nanopolish_assembly_pieces > $medaka_nanopolish_assembly_alignment
# python3 "$python_script_dir"/read_length_identity.py $medaka_nanopolish_assembly_pieces $medaka_nanopolish_assembly_alignment > $medaka_nanopolish_assembly_data
# rm $medaka_nanopolish_assembly_pieces $medaka_nanopolish_assembly_alignment

# printf "\n\n\n\n"
# echo "MEDAKA OF NANOPOLISH (METHYLATION-AWARE) ASSEMBLY: "$set
# echo "--------------------------------------------------------------------------------"
# if [[ $all_reads_fixed_names = *"fastq.gz" ]]; then
# temp_reads="$nanopolish_medaka_assembly_dir"/"$set".fastq
# else
# temp_reads="$nanopolish_medaka_assembly_dir"/"$set".fasta
# fi
# gunzip -c "$all_reads_fixed_names" > $temp_reads
# source $medaka
# medaka_consensus -i $temp_reads -d $assembly -o "$nanopolish_medaka_assembly_dir"/"$set"_nanopolish_medaka -p $pomoxis -t $threads
# deactivate
# cp "$nanopolish_medaka_assembly_dir"/"$set"_nanopolish_medaka/consensus.fasta "$nanopolish_medaka_assembly"
# rm $temp_reads
# rm -r "$nanopolish_medaka_assembly_dir"/"$set"_nanopolish_medaka

# printf "\n\n\n\n"
# echo "ASSESS MEDAKA ASSEMBLY: "$set
# echo "--------------------------------------------------------------------------------"
# python3 "$python_script_dir"/chop_up_assembly.py $nanopolish_medaka_assembly 10000 > $nanopolish_medaka_assembly_pieces
# minimap2 -x map10k -t $threads -c reference.fasta $nanopolish_medaka_assembly_pieces > $nanopolish_medaka_assembly_alignment
# python3 "$python_script_dir"/read_length_identity.py $nanopolish_medaka_assembly_pieces $nanopolish_medaka_assembly_alignment > $nanopolish_medaka_assembly_data
# rm $nanopolish_medaka_assembly_pieces $nanopolish_medaka_assembly_alignment

done
Binary file added images/polishing_methods.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
5 changes: 4 additions & 1 deletion nanopolish_slurm_wrapper.py
Original file line number Diff line number Diff line change
Expand Up @@ -45,7 +45,10 @@ def main():
set_name = assembly_filename.split('/')[-1].split('.fasta')[0]
print('\nPreparing to run Nanopolish for ' + set_name)

final_assembly = ('../' + set_name + '.fasta').replace('_assembly.fasta', '_nanopolish.fasta')
if assembly_filename.endswith('medaka.fasta'):
final_assembly = ('../' + set_name + '.fasta').replace('medaka.fasta', 'medaka_nanopolish.fasta')
else:
final_assembly = ('../' + set_name + '.fasta').replace('_assembly.fasta', '_nanopolish.fasta')
if methylation_aware:
final_assembly = final_assembly.replace('_nanopolish.fasta', '_nanopolish_meth.fasta')

Expand Down
81 changes: 81 additions & 0 deletions plot_results.R
Original file line number Diff line number Diff line change
Expand Up @@ -409,3 +409,84 @@ blank <- rectGrob(gp=gpar(col="white"))
scrappie_comparison_plot <- grid.arrange(p1, blank, p2, ncol=3, widths=c(0.425, 0.15, 0.425))
# ggsave(scrappie_comparison_plot, file='plots/scrappie_comparison.pdf', width = 6, height = 4)











# The following code is for a violin plot comparing different polishing strategies.
basecaller <- "Albacore v2.1.10"
no_spaces <- gsub(" ", "_", basecaller)

polishing_names <- c(paste(basecaller, "(nopolishing)"),
paste(basecaller, "+Medaka"),
paste(basecaller, "+Nanopolish"),
paste(basecaller, "+Medaka +Nanopolish"),
paste(basecaller, "+Nanopolish +Medaka"))
polishing_labels <- gsub(" ", "\n", polishing_names, fixed=TRUE)
polishing_labels <- gsub("(nopolishing)", "(no polishing)", polishing_labels, fixed=TRUE)
polishing_colours <- c("#CCCCCC", "#4F84CC", "#DD616C", "#846DC2", "#AB61BA")
names(polishing_colours) <- polishing_names
polishing_fill_scale <- scale_fill_manual(name = "Polishing", values = polishing_colours)

no_polish_data_filename <- paste("results/", tolower(no_spaces), "_assembly.tsv", sep="")
medaka_data_filename <- paste("results/", tolower(no_spaces), "_medaka.tsv", sep="")
nanopolish_data_filename <- paste("results/", tolower(no_spaces), "_nanopolish_meth.tsv", sep="")
medaka_nanopolish_data_filename <- paste("results/", tolower(no_spaces), "_medaka_nanopolish_meth.tsv", sep="")
nanopolish_medaka_data_filename <- paste("results/", tolower(no_spaces), "_nanopolish_meth_medaka.tsv", sep="")

length_column <- paste("Length_", no_spaces, sep="")
rel_length_column <- paste("Rel_len_", no_spaces, sep="")

basecaller_identities <- c()

identity_column <- paste("Identity_", no_spaces, "_no_polish", sep="")
no_polish_data <- load_tsv_data(no_polish_data_filename, c("Name", length_column, identity_column, rel_length_column))
basecaller_identities <- c(basecaller_identities, identity_column)

identity_column <- paste("Identity_", no_spaces, "_medaka", sep="")
medaka_data <- load_tsv_data(medaka_data_filename, c("Name", length_column, identity_column, rel_length_column))
basecaller_identities <- c(basecaller_identities, identity_column)

identity_column <- paste("Identity_", no_spaces, "_nanopolish", sep="")
nanopolish_data <- load_tsv_data(nanopolish_data_filename, c("Name", length_column, identity_column, rel_length_column))
basecaller_identities <- c(basecaller_identities, identity_column)

identity_column <- paste("Identity_", no_spaces, "_medaka_nanopolish", sep="")
medaka_nanopolish_data <- load_tsv_data(medaka_nanopolish_data_filename, c("Name", length_column, identity_column, rel_length_column))
basecaller_identities <- c(basecaller_identities, identity_column)

identity_column <- paste("Identity_", no_spaces, "_nanopolish_medaka", sep="")
nanopolish_medaka_data <- load_tsv_data(nanopolish_medaka_data_filename, c("Name", length_column, identity_column, rel_length_column))
basecaller_identities <- c(basecaller_identities, identity_column)

polish_data <- data.frame(Name = numeric())
polish_data <- merge(polish_data, no_polish_data, by=1, all=TRUE)
polish_data <- merge(polish_data, medaka_data, by=1, all=TRUE)
polish_data <- merge(polish_data, nanopolish_data, by=1, all=TRUE)
polish_data <- merge(polish_data, medaka_nanopolish_data, by=1, all=TRUE)
polish_data <- merge(polish_data, nanopolish_medaka_data, by=1, all=TRUE)

assembly_lengths <- polish_data[grepl("Length_", names(polish_data))]
polish_data["Length"] <- round(apply(assembly_lengths, 1, median, na.rm = TRUE))

polish_identities <- polish_data[,c("Name", "Length", basecaller_identities)]
colnames(polish_identities) <- c("Name", "Length", polishing_names)
polish_identities <- melt(polish_identities, id=c("Name", "Length"))
colnames(polish_identities) <- c("Read_name", "Length", "Polishing", "Identity")

polishing_plot <- ggplot(polish_identities, aes(x = Polishing, y = Identity, weight = Length, fill = Polishing)) +
geom_violin(draw_quantiles = c(0.5), bw=0.06, width=1.1) +
polishing_fill_scale + my_theme + guides(fill=FALSE) +
scale_y_continuous(expand = c(0, 0), breaks = seq(0, 100, 0.1), minor_breaks = seq(0, 100, 0.05), labels = scales::unit_format("%")) +
scale_x_discrete(labels=polishing_labels) +
coord_cartesian(ylim=c(99.2, 100)) +
labs(title = "", x = "", y = "assembly identity")
polishing_plot
ggsave(polishing_plot, file='plots/polishing_methods.pdf', width = 4.75, height = 4)

Loading

0 comments on commit fd72cfd

Please sign in to comment.