Combining different polishers

rrwick · Jan 31, 2018 · fd72cfd · fd72cfd
1 parent 10bcacc
commit fd72cfd
Show file tree

Hide file tree

Showing 7 changed files with 1,268 additions and 20 deletions.
diff --git a/README.md b/README.md
@@ -22,17 +22,18 @@ This repository uses a bacterial genome to assess the read accuracy and consensu
 * [Basecallers tested](#basecallers-tested)
 * [Method](#method)
 * [Results/discussion](#resultsdiscussion)
-  * [Speed](#speed)
-  * [Total yield](#total-yield)
-  * [Read identity](#read-identity)
-  * [Relative read length](#relative-read-length)
-  * [Assembly identity](#assembly-identity)
-  * [Read vs assembly identity](#read-vs-assembly-identity)
-  * [Nanopolish assembly identity](#nanopolish-assembly-identity)
-  * [Methylation](#methylation)
-  * [Medaka](#medaka)
-  * [Training sets](#training-sets)
-  * [Combining different basecallers](#combining-different-basecallers)
+    * [Speed](#speed)
+    * [Total yield](#total-yield)
+    * [Read identity](#read-identity)
+    * [Relative read length](#relative-read-length)
+    * [Assembly identity](#assembly-identity)
+    * [Read vs assembly identity](#read-vs-assembly-identity)
+    * [Nanopolish assembly identity](#nanopolish-assembly-identity)
+    * [Methylation](#methylation)
+    * [Medaka](#medaka)
+    * [Combining different basecallers](#combining-different-basecallers)
+    * [Combining different polishers](#combining-different-polishers)
+    * [Training sets](#training-sets)
 * [Conclusions](#conclusions)
 * [References](#references)
 
@@ -349,7 +350,7 @@ While Nanopolish can correct many of these errors, it would be better if the bas
 
 
 
-# Medaka
+### Medaka
 
 Medaka is trying to solve a similar problem to Nanopolish: improving the consensus sequence accuracy using the alignment of multiple reads. It differs from Nanopolish in two significant ways. First, Medaka uses neural networks where Nanopolish uses HMMs. Second, it uses basecalled reads, not the raw signal ([though this is likely to change in the future](https://nanoporetech.github.io/medaka/future.html)). Here I test Medaka v0.2.0:
 
@@ -359,21 +360,27 @@ While Medaka could improve most assemblies, it was overall less effective than N
 
 
 
-### Training sets
+### Combining different basecallers
 
-All supervised learning depends on a good training set, and basecalling is no exception. A nice example comes from the rgrgr_r94 model in Scrappie v1.1.0 and v1.1.1. The primary difference between these two versions is that in v1.1.0, only human DNA was used to train the basecaller, whereas v1.1.1 was trained with a mixed set of genomes ([described here](https://github.com/rrwick/Basecalling-comparison/issues/1) by Scrappie author Tim Massingham). I didn't include v1.1.0 in the above plots because it's a superseded version – it's here only to show the difference a training set makes. The difference in read identity is huge, but assembly identity had a subtler improvement:
+This section previously looked at how well a combination of Albacore and Chiron reads assemble. The idea was that perhaps two different basecallers can somewhat 'cancel out' each other's systematic error, leading to a better assembly. This was the case with Albacore and Chiron v0.2, but Chiron v0.3 reads assemble so well that combining them with Albacore reads gives no improvement (it actually makes the assembly a bit worse).
 
-<p align="center"><img src="images/scrappie_comparison.png" width="60%"></p>
+I don't think this is relevant anymore, so I've removed it. You can see my earlier results in [a past version of this repository](https://github.com/rrwick/Basecalling-comparison/tree/d5ce4455c5c57d15abec1e625cafa56a7eef1a6e) if you're still interested.
 
 
 
-### Combining different basecallers
+### Combining different polishers
+
+I tried assembly polishing with both Medaka and Nanopolish (methylation-aware) to see if a joint approach could yield better accuracies. I tried both Medaka followed by Nanopolish and vice versa, but neither combination could improve upon Nanopolish alone:
+
+<p align="center"><img src="images/polishing_methods.png" width="35%"></p>
 
-This section previously looked at how well a combination of Albacore and Chiron reads assemble. The idea was that perhaps two different basecallers can somewhat 'cancel out' each other's systematic error, leading to a better assembly. This was the case with Albacore and Chiron v0.2, but Chiron v0.3 reads assemble so well that combining them with Albacore reads gives no improvement (it actually makes the assembly a bit worse).
 
-I don't think this is relevant anymore, so I've removed it. You can see my earlier results in [a past version of this repository](https://github.com/rrwick/Basecalling-comparison/tree/d5ce4455c5c57d15abec1e625cafa56a7eef1a6e) if you're still interested.
 
+### Training sets
+
+All supervised learning depends on a good training set, and basecalling is no exception. A nice example comes from the rgrgr_r94 model in Scrappie v1.1.0 and v1.1.1. The primary difference between these two versions is that in v1.1.0, only human DNA was used to train the basecaller, whereas v1.1.1 was trained with a mixed set of genomes ([described here](https://github.com/rrwick/Basecalling-comparison/issues/1) by Scrappie author Tim Massingham). I didn't include v1.1.0 in the above plots because it's a superseded version – it's here only to show the difference a training set makes. The difference in read identity is huge, but assembly identity had a subtler improvement:
 
+<p align="center"><img src="images/scrappie_comparison.png" width="50%"></p>
 
 
 
@@ -395,7 +402,7 @@ Scrappie raw v1.3.0 (rgr_r94, rgrgr_r94 and rnnrf_r94 models) also did quite wel
 
 Anyone interested in maximising assembly accuracy should be using Nanopolish. It improved all assemblies and took most up to about 99.9% (with the methylation-aware option). If you only care about assembly identity, Nanopolish makes your basecaller choice relatively unimportant.
 
-While Medaka does not improve assemblies as well as Nanopolish, it operates on basecalled reads and requires only a fasta/fastq file, not the raw fast5 files. It may therefore be the best choice for assembly polishing when raw reads are not available. However, [the 'Future directions' section of Medaka's documentation](https://nanoporetech.github.io/medaka/future.html) indicates that signal-level processing may be in its future. Furthermore, Medaka uses neural networks, unlike Nanopolish's HMMs. The authors suggest that just as neural networks have outperformed HMMs in basecallers, they will also prove superior in consensus algorithms. Watch this space!
+While Medaka does not improve assemblies as well as Nanopolish, it requires only a fasta/fastq file, not the raw fast5 files. It may therefore be the best choice for assembly polishing when raw reads are not available. However, the ['Future directions' section of Medaka's documentation](https://nanoporetech.github.io/medaka/future.html) indicates that signal-level processing may be in its future. Furthermore, Medaka uses neural networks, unlike Nanopolish's HMMs. The authors suggest that just as neural networks have outperformed HMMs in basecallers, they will also prove superior in consensus algorithms. Watch this space!
 
 
 

diff --git a/analysis.sh b/analysis.sh
@@ -54,6 +54,8 @@ mkdir -p 11_nanopolish_meth
 mkdir -p 12_nanopolish_meth_data
 mkdir -p 13_medaka
 mkdir -p 14_medaka_data
+mkdir -p 15_combined_polish
+mkdir -p 16_combined_polish_data
 
 # Create a table of basic info about each read.
 python3 "$python_script_dir"/read_table.py 01_raw_fast5 > 04_read_data/read_data.tsv
@@ -94,6 +96,18 @@ for f in $read_files; do
     medaka_assembly_alignment=14_medaka_data/"$set"_medaka.paf
     medaka_assembly_data=14_medaka_data/"$set"_medaka.tsv
 
+    medaka_nanopolish_assembly_dir=15_combined_polish
+    medaka_nanopolish_assembly=15_combined_polish/"$set"_medaka_nanopolish_meth.fasta
+    medaka_nanopolish_assembly_pieces=16_combined_polish_data/"$set"_medaka_nanopolish_meth_pieces.fasta
+    medaka_nanopolish_assembly_alignment=16_combined_polish_data/"$set"_medaka_nanopolish_meth.paf
+    medaka_nanopolish_assembly_data=16_combined_polish_data/"$set"_medaka_nanopolish_meth.tsv
+
+    nanopolish_medaka_assembly_dir=15_combined_polish
+    nanopolish_medaka_assembly=15_combined_polish/"$set"_nanopolish_meth_medaka.fasta
+    nanopolish_medaka_assembly_pieces=16_combined_polish_data/"$set"_nanopolish_meth_medaka_pieces.fasta
+    nanopolish_medaka_assembly_alignment=16_combined_polish_data/"$set"_nanopolish_meth_medaka.paf
+    nanopolish_medaka_assembly_data=16_combined_polish_data/"$set"_nanopolish_meth_medaka.tsv
+
     printf "\n\n\n\n"
     echo "NORMALISE READ HEADERS: "$set
     echo "--------------------------------------------------------------------------------"
@@ -175,4 +189,43 @@ for f in $read_files; do
     python3 "$python_script_dir"/read_length_identity.py $medaka_assembly_pieces $medaka_assembly_alignment > $medaka_assembly_data
     rm $medaka_assembly_pieces $medaka_assembly_alignment
 
+    # printf "\n\n\n\n"
+    # echo "NANOPOLISH (METHYLATION-AWARE) OF MEDAKA ASSEMBLY: "$set
+    # echo "--------------------------------------------------------------------------------"
+    # python3 "$python_script_dir"/nanopolish_slurm_wrapper.py $medaka_assembly $all_reads_fixed_names $raw_fast5_dir $medaka_nanopolish_assembly_dir $nanopolish_exec_dir $threads meth
+    # rm "$all_reads_fixed_names".index*
+    # rm "$medaka_assembly".fai
+
+    # printf "\n\n\n\n"
+    # echo "ASSESS MEDAKA THEN NANOPOLISH (METHYLATION-AWARE) ASSEMBLY: "$set
+    # echo "--------------------------------------------------------------------------------"
+    # python3 "$python_script_dir"/chop_up_assembly.py $medaka_nanopolish_assembly 10000 > $medaka_nanopolish_assembly_pieces
+    # minimap2 -x map10k -t $threads -c reference.fasta $medaka_nanopolish_assembly_pieces > $medaka_nanopolish_assembly_alignment
+    # python3 "$python_script_dir"/read_length_identity.py $medaka_nanopolish_assembly_pieces $medaka_nanopolish_assembly_alignment > $medaka_nanopolish_assembly_data
+    # rm $medaka_nanopolish_assembly_pieces $medaka_nanopolish_assembly_alignment
+
+    # printf "\n\n\n\n"
+    # echo "MEDAKA OF NANOPOLISH (METHYLATION-AWARE) ASSEMBLY: "$set
+    # echo "--------------------------------------------------------------------------------"
+    # if [[ $all_reads_fixed_names = *"fastq.gz" ]]; then
+    #     temp_reads="$nanopolish_medaka_assembly_dir"/"$set".fastq
+    # else
+    #     temp_reads="$nanopolish_medaka_assembly_dir"/"$set".fasta
+    # fi
+    # gunzip -c "$all_reads_fixed_names" > $temp_reads
+    # source $medaka
+    # medaka_consensus -i $temp_reads -d $assembly -o "$nanopolish_medaka_assembly_dir"/"$set"_nanopolish_medaka -p $pomoxis -t $threads
+    # deactivate
+    # cp "$nanopolish_medaka_assembly_dir"/"$set"_nanopolish_medaka/consensus.fasta "$nanopolish_medaka_assembly"
+    # rm $temp_reads
+    # rm -r "$nanopolish_medaka_assembly_dir"/"$set"_nanopolish_medaka
+
+    # printf "\n\n\n\n"
+    # echo "ASSESS MEDAKA ASSEMBLY: "$set
+    # echo "--------------------------------------------------------------------------------"
+    # python3 "$python_script_dir"/chop_up_assembly.py $nanopolish_medaka_assembly 10000 > $nanopolish_medaka_assembly_pieces
+    # minimap2 -x map10k -t $threads -c reference.fasta $nanopolish_medaka_assembly_pieces > $nanopolish_medaka_assembly_alignment
+    # python3 "$python_script_dir"/read_length_identity.py $nanopolish_medaka_assembly_pieces $nanopolish_medaka_assembly_alignment > $nanopolish_medaka_assembly_data
+    # rm $nanopolish_medaka_assembly_pieces $nanopolish_medaka_assembly_alignment
+
 done
diff --git a/images/polishing_methods.png b/images/polishing_methods.png
diff --git a/nanopolish_slurm_wrapper.py b/nanopolish_slurm_wrapper.py
@@ -45,7 +45,10 @@ def main():
     set_name = assembly_filename.split('/')[-1].split('.fasta')[0]
     print('\nPreparing to run Nanopolish for ' + set_name)
 
-    final_assembly = ('../' + set_name + '.fasta').replace('_assembly.fasta', '_nanopolish.fasta')
+    if assembly_filename.endswith('medaka.fasta'):
+        final_assembly = ('../' + set_name + '.fasta').replace('medaka.fasta', 'medaka_nanopolish.fasta')
+    else:
+        final_assembly = ('../' + set_name + '.fasta').replace('_assembly.fasta', '_nanopolish.fasta')
     if methylation_aware:
         final_assembly = final_assembly.replace('_nanopolish.fasta', '_nanopolish_meth.fasta')
 

diff --git a/plot_results.R b/plot_results.R
@@ -409,3 +409,84 @@ blank <- rectGrob(gp=gpar(col="white"))
 scrappie_comparison_plot <- grid.arrange(p1, blank, p2, ncol=3, widths=c(0.425, 0.15, 0.425))
 # ggsave(scrappie_comparison_plot, file='plots/scrappie_comparison.pdf', width = 6, height = 4)
 
+
+
+
+
+
+
+
+
+
+
+# The following code is for a violin plot comparing different polishing strategies.
+basecaller <- "Albacore v2.1.10"
+no_spaces <- gsub(" ", "_", basecaller)
+
+polishing_names <- c(paste(basecaller, "(nopolishing)"),
+                     paste(basecaller, "+Medaka"),
+                     paste(basecaller, "+Nanopolish"),
+                     paste(basecaller, "+Medaka +Nanopolish"),
+                     paste(basecaller, "+Nanopolish +Medaka"))
+polishing_labels <- gsub(" ", "\n", polishing_names, fixed=TRUE)
+polishing_labels <- gsub("(nopolishing)", "(no polishing)", polishing_labels, fixed=TRUE)
+polishing_colours <- c("#CCCCCC", "#4F84CC", "#DD616C", "#846DC2", "#AB61BA")
+names(polishing_colours) <- polishing_names
+polishing_fill_scale <- scale_fill_manual(name = "Polishing", values = polishing_colours)
+
+no_polish_data_filename <- paste("results/", tolower(no_spaces), "_assembly.tsv", sep="")
+medaka_data_filename <- paste("results/", tolower(no_spaces), "_medaka.tsv", sep="")
+nanopolish_data_filename <- paste("results/", tolower(no_spaces), "_nanopolish_meth.tsv", sep="")
+medaka_nanopolish_data_filename <- paste("results/", tolower(no_spaces), "_medaka_nanopolish_meth.tsv", sep="")
+nanopolish_medaka_data_filename <- paste("results/", tolower(no_spaces), "_nanopolish_meth_medaka.tsv", sep="")
+
+length_column <- paste("Length_", no_spaces, sep="")
+rel_length_column <- paste("Rel_len_", no_spaces, sep="")
+
+basecaller_identities <- c()
+
+identity_column <- paste("Identity_", no_spaces, "_no_polish", sep="")
+no_polish_data <- load_tsv_data(no_polish_data_filename, c("Name", length_column, identity_column, rel_length_column))
+basecaller_identities <- c(basecaller_identities, identity_column)
+
+identity_column <- paste("Identity_", no_spaces, "_medaka", sep="")
+medaka_data <- load_tsv_data(medaka_data_filename, c("Name", length_column, identity_column, rel_length_column))
+basecaller_identities <- c(basecaller_identities, identity_column)
+
+identity_column <- paste("Identity_", no_spaces, "_nanopolish", sep="")
+nanopolish_data <- load_tsv_data(nanopolish_data_filename, c("Name", length_column, identity_column, rel_length_column))
+basecaller_identities <- c(basecaller_identities, identity_column)
+
+identity_column <- paste("Identity_", no_spaces, "_medaka_nanopolish", sep="")
+medaka_nanopolish_data <- load_tsv_data(medaka_nanopolish_data_filename, c("Name", length_column, identity_column, rel_length_column))
+basecaller_identities <- c(basecaller_identities, identity_column)
+
+identity_column <- paste("Identity_", no_spaces, "_nanopolish_medaka", sep="")
+nanopolish_medaka_data <- load_tsv_data(nanopolish_medaka_data_filename, c("Name", length_column, identity_column, rel_length_column))
+basecaller_identities <- c(basecaller_identities, identity_column)
+
+polish_data <- data.frame(Name = numeric())
+polish_data <- merge(polish_data, no_polish_data, by=1, all=TRUE)
+polish_data <- merge(polish_data, medaka_data, by=1, all=TRUE)
+polish_data <- merge(polish_data, nanopolish_data, by=1, all=TRUE)
+polish_data <- merge(polish_data, medaka_nanopolish_data, by=1, all=TRUE)
+polish_data <- merge(polish_data, nanopolish_medaka_data, by=1, all=TRUE)
+
+assembly_lengths <- polish_data[grepl("Length_", names(polish_data))]
+polish_data["Length"] <- round(apply(assembly_lengths, 1, median, na.rm = TRUE))
+
+polish_identities <- polish_data[,c("Name", "Length", basecaller_identities)]
+colnames(polish_identities) <- c("Name", "Length", polishing_names)
+polish_identities <- melt(polish_identities, id=c("Name", "Length"))
+colnames(polish_identities) <- c("Read_name", "Length", "Polishing", "Identity")
+
+polishing_plot <- ggplot(polish_identities, aes(x = Polishing, y = Identity, weight = Length, fill = Polishing)) +
+  geom_violin(draw_quantiles = c(0.5), bw=0.06, width=1.1) +
+  polishing_fill_scale + my_theme + guides(fill=FALSE) +
+  scale_y_continuous(expand = c(0, 0), breaks = seq(0, 100, 0.1), minor_breaks = seq(0, 100, 0.05), labels = scales::unit_format("%")) +
+  scale_x_discrete(labels=polishing_labels) +
+  coord_cartesian(ylim=c(99.2, 100)) +
+  labs(title = "", x = "", y = "assembly identity")
+polishing_plot
+ggsave(polishing_plot, file='plots/polishing_methods.pdf', width = 4.75, height = 4)
+