index.html

<!DOCTYPE html>
<html>

  <head>
    <meta charset='utf-8'>
    <meta http-equiv="X-UA-Compatible" content="chrome=1">
    <meta name="description" content="Meta-binning : a pipeline to reconstruct genome sequence from metagenomes">

    <link rel="stylesheet" type="text/css" media="screen" href="stylesheets/stylesheet.css">

    <title>Meta-binning</title>
  </head>

  <body>

    <!-- HEADER -->
    <div id="header_wrap" class="outer">
        <header class="inner">
          <a id="forkme_banner" href="https://github.com/mingleiR/meta-binning">View on GitHub</a>

          <h1 id="project_title">Meta-binning</h1>
          <h2 id="project_tagline">a pipeline to reconstruct genome sequence from metagenomes</h2>

            <section id="downloads">
              <a class="zip_download_link" href="https://github.com/mingleiR/meta-binning/zipball/master">Download this project as a .zip file</a>
              <a class="tar_download_link" href="https://github.com/mingleiR/meta-binning/tarball/master">Download this project as a tar.gz file</a>
            </section>
        </header>
    </div>

    <!-- MAIN CONTENT -->
    <div id="main_content_wrap" class="outer">
      <section id="main_content" class="inner">

<hr>
<h1>A pipeline to binning metagenomic sequence based KDE and normal mixture model</h1>
<hr>
<p>A typical process of metagenomics analysis includes sampling, sequencing, data cleaning, assembly and gene calling. Because of the incomplete nature of sampling and internal bias in sequencing, it is difficult to recovery the full genomes of each genome from the metagenomics. Binning provides necessary complementation in practice.</p>
<p>In metagenomics, binning is a process of grouping raw reads or assembled contigs and assigning them to operational taxiomic unit. Here, we mean the contigs assembled from the raw reads. There are two basic methods of binning sequence, and one is based on composition features, the other is based on sequence similarity. We present an effective binning approach combined two strategies above. In addtion, an operable and detailed step-by-step guide is provided.</p>
<ul>
<li>Step 1. Collect important features value
<ul>
<li>1-1. Trimming</li>
<li>1-2. De novo Assembly</li>
<li>1-3. Extraction of selected fetures data</li>
</ul>
</li>
<li>Step 2. Filter the metagenomic assembly</li>
<li>Step 3. Binning the metagenomic assembly using selected features
<ul>
<li>3-1. Compute the KDE using the weighted data and create contour graph</li>
<li>3-2. Determine the initial group of data based on different level values of contour</li>
<li>3-3. Fit normal mixture model for the assembly</li>
<li>3-4. Validate each genome bin</li>
</ul>
</li>
<li>Step 4. Taxonomy assignment and functional analysis for each genome bin.
<ul>
<li>4-1. Toxonomy anslysis using MEGAN</li>
<li>4-2. Phylogenetic anlysis based on 16S rRNA sequence</li>
<li>4-3. Clusters of Orthologous Groups (COG) analysis</li>
<li>4-4. Pathway analysis using GhostKOALA service</li>
</ul>
</li>
</ul>
<h2><a id="Step_1_Collect_important_features_value_29"></a>Step 1. Collect important features value</h2>
<p>After getting the sequencing data, we can utilize a lot of open source tools to analyze the data, including trimming, assembling and annotation. The following step is for reference only.</p>
<h3><a id="11_Trimming_32"></a>1-1. Trimming</h3>
<p>Although there are many useful toolkits to perform the quality control, We choose Trimmomatic to clean our sequencing datasets. The detailed parameters is following:</p>
<pre><code>  java -jar trimmomatic-<span class="hljs-number">0.32</span><span class="hljs-class">.jar</span> -threads <span class="hljs-number">8</span> -phred33 -trimlog PE_trim<span class="hljs-class">.log</span> \
        PE_1<span class="hljs-class">.fastq</span> PE_2<span class="hljs-class">.fastq</span> \  
        PE_1_trimmed_paired<span class="hljs-class">.fq</span> PE_1_trimmed_unpaired<span class="hljs-class">.fq</span> \  
        PE_2_trimmed_paired<span class="hljs-class">.fq</span> PE_2_trimmed_unpaired<span class="hljs-class">.fq</span> \  
        ILLUMINACLIP:<span class="hljs-variable">$HOME</span>/Software/Trimmomatic-<span class="hljs-number">0.32</span>/adapters/Index/TruSeq2-PE-GCCAAT<span class="hljs-class">.fa</span>:<span class="hljs-number">2</span>:<span class="hljs-number">30</span>:<span class="hljs-number">10</span>:<span class="hljs-number">1</span>:true \  
        LEADING:<span class="hljs-number">3</span> TRAILING:<span class="hljs-number">3</span> MAXINFO:<span class="hljs-number">40</span>:<span class="hljs-number">0.4</span> MINLEN:<span class="hljs-number">20</span>  
</code></pre>
<p>The dataset generated from the mate-pair library was trimmed like above commands.</p>
<h3><a id="12_De_novo_Assembly_46"></a>1-2. De novo Assembly</h3>
<p>There are two types of sequence assembly, that is overlap/layout/consensus approach and de Bruijn graph approach. Because of the characteristics of next-generation sequencing platform and the incompletion of reference genome, de nove assembly is appropriate for metagenomics assembly. According to our experience, we use SPAdes Genomes Assembler to assembly our data. Executinng SPAdes as the following command:</p>
<pre><code>  spades<span class="hljs-class">.py</span> --dataset data<span class="hljs-class">.yaml</span> -t <span class="hljs-number">8</span> -o ./ -k <span class="hljs-number">21</span>,<span class="hljs-number">35</span>,<span class="hljs-number">51</span>,<span class="hljs-number">69</span> --carefull
</code></pre>
<p>Before using the commands above, you need to create your data.yaml file as the SPAdes online manual told.</p>
<h3><a id="13_Extraction_of_selected_fetures_data_55"></a>1-3. Extraction of selected fetures data</h3>
<p>In order to bin metagenomics sequence, we have to collect some fetures about the assembled fragments. For each contigs, GC content, length and coverage are extracted from assembly file using our R script. The script outputs the file <code>contigs.info</code>  including 4 column, that is contig name, contig length, GC content and coverage.</p>
<p>The main analysis of this pipeline uses R language, so the following codes are R codes unless noted. So, just make sure you have started R and type the code at the R prompt. Another convenient way to execute the following codes is to use a powerful GUI of R, such as RStudio.</p>
<pre><code><span class="hljs-function"><span class="hljs-title">suppressMessages</span><span class="hljs-params">(library(Biostrings)</span></span>)  
assembly<span class="hljs-class">.file</span> &lt;- file.<span class="hljs-function"><span class="hljs-title">choose</span><span class="hljs-params">()</span></span>  
ctgsSeq &lt;- <span class="hljs-function"><span class="hljs-title">readDNAStringSet</span><span class="hljs-params">(assembly.file)</span></span>  
seqsName &lt;- <span class="hljs-function"><span class="hljs-title">names</span><span class="hljs-params">(ctgsSeq)</span></span>  
ctgs &lt;- data.<span class="hljs-function"><span class="hljs-title">frame</span><span class="hljs-params">(name= seqsName, stringsAsFactors=FALSE)</span></span>  
ctgs<span class="hljs-variable">$length</span> &lt;- <span class="hljs-function"><span class="hljs-title">width</span><span class="hljs-params">(ctgsSeq)</span></span>  
baseStat&lt;- <span class="hljs-function"><span class="hljs-title">alphabetFrequency</span><span class="hljs-params">(ctgsSeq)</span></span>  
ctgsGc &lt;- <span class="hljs-function"><span class="hljs-title">rowSums</span><span class="hljs-params">(baseStat[, <span class="hljs-number">2</span>:<span class="hljs-number">3</span>])</span></span>/<span class="hljs-function"><span class="hljs-title">rowSums</span><span class="hljs-params">(baseStat[, <span class="hljs-number">1</span>:<span class="hljs-number">4</span>])</span></span>  
ctgs<span class="hljs-variable">$gc</span> &lt;- as.<span class="hljs-function"><span class="hljs-title">numeric</span><span class="hljs-params">(format(ctgsGc, digits= <span class="hljs-number">6</span>)</span></span>)  
</code></pre>
<p>The output of some assembler, such as SPAdes, Velvet and AByss, provide the coverage value of each cotigs in the header. If there isn’t coverage information in the assembly data, we can map reads to contigs using BWA or Bowtie to get coverage information instead. Here the aligment is done using Bowite in the Linux enviroment.</p>
<p>Note: the kmer coverage is different from the read coverage (or the nucleotide coverage), but they are closely-related.</p>
<pre><code>bowtie-build --offrate <span class="hljs-number">1</span> assembly<span class="hljs-class">.fa</span> assembly<span class="hljs-class">.index</span>  
bowtie -as -v <span class="hljs-number">3</span> -<span class="hljs-tag">p</span> <span class="hljs-number">8</span> -X <span class="hljs-number">400</span> assembly<span class="hljs-class">.index</span> -<span class="hljs-number">1</span> read1<span class="hljs-class">.fq</span> -<span class="hljs-number">2</span> read2<span class="hljs-class">.fq</span> | samtools view -Sb &gt; alignment<span class="hljs-class">.bam</span>
express -o expr_out assembly<span class="hljs-class">.fa</span> alignment<span class="hljs-class">.bam</span>
</code></pre>
<p>In the R console, we read the aligment file to get coverage information of the assembly</p>
<pre><code>align<span class="hljs-class">.file</span> &lt;- file.<span class="hljs-function"><span class="hljs-title">choose</span><span class="hljs-params">()</span></span>
xprs &lt;- read.<span class="hljs-function"><span class="hljs-title">table</span><span class="hljs-params">(align.file, header= TRUE, as.is= TRUE)</span></span>
xprs &lt;- xprs[<span class="hljs-function"><span class="hljs-title">order</span><span class="hljs-params">(xprs<span class="hljs-variable">$target_id</span>)</span></span>, ]
ctgs &lt;- ctgs[<span class="hljs-function"><span class="hljs-title">order</span><span class="hljs-params">(ctgs<span class="hljs-variable">$name</span>)</span></span>, ]
ctgs<span class="hljs-variable">$coverage</span> &lt;- <span class="hljs-function"><span class="hljs-title">sqrt</span><span class="hljs-params">(xprs<span class="hljs-variable">$fpkm</span>)</span></span>
ctgs &lt;- <span class="hljs-function"><span class="hljs-title">subset</span><span class="hljs-params">(ctgs, coverage!=<span class="hljs-number">0</span>)</span></span>
</code></pre>
<h2><a id="Step_2_Clean_the_assembly_data_96"></a>Step 2. Clean the assembly data</h2>
<p>During assembly stage, most of assemblers filter the result with some default parameter, while others export all result. Of all contigs, those which have a very short length or high coverage are nonsense, because they might be repeats or other common regions in multiple organisms. To reduce the interference of these sequence, we filter the assembly by the length and coverage. In addition, GC contents and coverage are transformate normally to fit the model.</p>
<pre><code><span class="hljs-comment"># filter by length</span>
temp &lt;- ctgs
ctgs &lt;- subset(ctgs, <span class="hljs-built_in">length</span> &gt; <span class="hljs-number">2000</span>)  

<span class="hljs-comment"># write special weight function</span>
specialWeightFun &lt;- <span class="hljs-function"><span class="hljs-keyword">function</span>(<span class="hljs-title">df</span>){</span>
    <span class="hljs-built_in">num</span> &lt;- <span class="hljs-keyword">as</span>.<span class="hljs-keyword">numeric</span>(<span class="hljs-keyword">as</span>.matrix(df[, c(<span class="hljs-string">"coverage"</span>, <span class="hljs-string">"gc"</span>)]))
    <span class="hljs-built_in">len</span> &lt;- df[, <span class="hljs-string">"length"</span>]
    num2 &lt;- rep(<span class="hljs-built_in">num</span>, times= rep(<span class="hljs-built_in">len</span>, <span class="hljs-number">2</span>L))
    dim(num2) &lt;- c(<span class="hljs-built_in">sum</span>(<span class="hljs-built_in">len</span>), <span class="hljs-number">2</span>L)
    <span class="hljs-constant">return</span>(num2)
}

<span class="hljs-comment"># weight the data</span>
ctgsWeighted &lt;- specialWeightFun(ctgs)
coverageUpper &lt;- quantile(ctgsWeighted[, <span class="hljs-number">1</span>], <span class="hljs-number">0.99</span>, names= <span class="hljs-constant">FALSE</span>)

<span class="hljs-comment"># filter by coverage using the weighted data</span>
ctgsWeighted &lt;- ctgsWeighted[ctgsWeighted[, <span class="hljs-number">1</span>] &lt;= coverageUpper]
ctgs &lt;- ctgs[ctgs$coverage &lt;= coverageUpper]

<span class="hljs-comment"># transformation</span>
normalTrans &lt;- <span class="hljs-function"><span class="hljs-keyword">function</span>(<span class="hljs-title">x</span>) <span class="hljs-title">0</span><span class="hljs-number">.5</span>*<span class="hljs-title">log</span>(<span class="hljs-title">x</span>/(<span class="hljs-title">1-x</span>))</span>
ctgs$gc &lt;- normalTrans(ctgs$gc)
</code></pre>
<h2><a id="Step_3_Binning_the_metagenomic_assembly_using_selected_features_127"></a>Step 3. Binning the metagenomic assembly using selected features</h2>
<h3><a id="31_Compute_the_KDE_using_the_weighted_data_and_create_contour_graph_128"></a>3-1. Compute the KDE using the weighted data and create contour graph</h3>
<p>Our method exploits the advantage of  contour plot which can represent the relationships of three variables in two dimensions. Before drawing contour plot, we have to compute a 2D kernel density estimation.</p>
<pre><code><span class="hljs-function"><span class="hljs-title">library</span><span class="hljs-params">(KernSmooth)</span></span>
bandwidth &lt;- <span class="hljs-function"><span class="hljs-title">c</span><span class="hljs-params">(dpik(ctgs<span class="hljs-variable">$coverage</span>)</span></span>, <span class="hljs-function"><span class="hljs-title">dpik</span><span class="hljs-params">(ctgs<span class="hljs-variable">$gc</span>)</span></span>)
dens &lt;- <span class="hljs-function"><span class="hljs-title">bkde2D</span><span class="hljs-params">(ctgsWeighted[, <span class="hljs-number">1</span>:<span class="hljs-number">2</span>], bandwidth= bandwidth, gridsize= c(<span class="hljs-number">401</span>L, <span class="hljs-number">401</span>L)</span></span>, truncate= TRUE)
</code></pre>
<p>When drawing contour, you need to adjust the arguments controlling the levels of contour in order to obtain the overview of all contigs. The contour filled with color looks much fancy, that ignore some patterns of lower density areas in the contour.</p>
<pre><code><span class="hljs-function"><span class="hljs-title">contour</span><span class="hljs-params">(dens<span class="hljs-variable">$x1</span>, dens<span class="hljs-variable">$x2</span>, dens<span class="hljs-variable">$fhat</span>, nlevels= <span class="hljs-number">30</span>)</span></span>
<span class="hljs-function"><span class="hljs-title">contour</span><span class="hljs-params">(dens<span class="hljs-variable">$x1</span>, dens<span class="hljs-variable">$x2</span>, dens<span class="hljs-variable">$fhat</span>, levels= seq(min(dens<span class="hljs-variable">$fhat</span>)</span></span>, <span class="hljs-function"><span class="hljs-title">max</span><span class="hljs-params">(dens<span class="hljs-variable">$fhat</span>)</span></span>, by= <span class="hljs-number">0.5</span>))
myPalette &lt;- <span class="hljs-function"><span class="hljs-title">colorRampPalette</span><span class="hljs-params">(c(<span class="hljs-string">'white'</span>, <span class="hljs-string">'red'</span>, <span class="hljs-string">'orange'</span>)</span></span>, bias= <span class="hljs-number">2</span>)
filled.<span class="hljs-function"><span class="hljs-title">contour</span><span class="hljs-params">(dens<span class="hljs-variable">$x1</span>, dens<span class="hljs-variable">$x2</span>, dens<span class="hljs-variable">$fhat</span>, nlevels= <span class="hljs-number">20</span>, color= myPalette)</span></span>
</code></pre>
<h3><a id="32_Determine_the_initial_group_of_data_based_on_different_level_values_of_contour_146"></a>3-2. Determine the initial group of data based on different level values of contour</h3>
<p>With the aid of the contour, you usually find some patterns in the assembly. Each higher density area of contour contain most of contigs which might derive from an individual organism. Based on the assumption, we select the contigs in every higher density areas.</p>
<p>The process of selecting the contigs in each area may be time-consuming, it totally depends on the complexity of the metagenomics project. Sometimes, you need to combine more than one level value to get the optimal inital group.</p>
<pre><code><span class="hljs-comment"># try some level values manually, and determine the optimal one.</span>
library(sp)
levelValue &lt;- <span class="hljs-number">0</span>.<span class="hljs-number">5</span>
contour(dens<span class="hljs-variable">$x1</span>, dens<span class="hljs-variable">$x2</span>, dens<span class="hljs-variable">$fhat</span>, levels= levelValue)
cl &lt;- contourLines(dens<span class="hljs-variable">$x1</span>, dens<span class="hljs-variable">$x2</span>, dens<span class="hljs-variable">$fhat</span>, levels= levelValue)
ctgs<span class="hljs-variable">$group</span> &lt;- <span class="hljs-number">0</span>
library(sp)
<span class="hljs-keyword">for</span>( i <span class="hljs-keyword">in</span> <span class="hljs-number">1</span><span class="hljs-symbol">:length</span>(cl) ){
    gi &lt;- as.logical(point.<span class="hljs-keyword">in</span>.polygon(ctgs<span class="hljs-variable">$coverage</span>, ctgs<span class="hljs-variable">$gc</span>, cl[[i]]<span class="hljs-variable">$x</span>, cl[[i]]<span class="hljs-variable">$y</span>))
    ctgs[gi, <span class="hljs-string">'group'</span>] &lt;- i
}

<span class="hljs-comment"># display initial group</span>
library(ggplot2)
(p &lt;- ggplot(data= ctgs, aes(x= coverage, y= gc)))
p + geom_points(aes(color= factor(group))
</code></pre>
<h3><a id="33_Fit_normal_mixture_model_for_the_assembly_171"></a>3-3. Fit normal mixture model for the assembly</h3>
<p>The Gaussian mixture model (GMM) is a powerful model for data clustering. Here we model the metagenomic assembly as a mixture of multiple Gaussian distributions where each component correspondes one genome bin. And we use mclust package which implemented the expectation-maximization algorithm (EM) to accomplish the parameter estimation of GMM.</p>
<p>After finishing the clustering through GMM, we utilize the essential gene sets to draw back some assembly fragements from the unassigned sequences. The process of obtaining the information of essential gene contained in the assembly was doned by the shell script <code>cal_assemlby_ess-gene.sh</code>.</p>
<pre><code># perform the second binning
library(mclust)
z &lt;- unmap(ctgs<span class="hljs-variable">$group</span>)
Vinv &lt;- hypvol(ctgs[, c(<span class="hljs-string">"coverage"</span>, <span class="hljs-string">"gc"</span>)], reciprocal = TRUE)
wt &lt;- ctgs<span class="hljs-variable">$length</span> / <span class="hljs-keyword">max</span>(ctgs<span class="hljs-variable">$length</span>)
fit.weighted &lt;- me.weighted(<span class="hljs-string">"VVV"</span>, ctgs[, c(<span class="hljs-string">"coverage"</span>, <span class="hljs-string">"gc"</span>)], z = z, weights = wt, Vinv = Vinv)
ctgs<span class="hljs-variable">$class</span> &lt;- map(fit.weighted<span class="hljs-variable">$z</span>)-<span class="hljs-number">1</span>
table(ctgs<span class="hljs-variable">$class</span>)
tapply(ctgs<span class="hljs-variable">$length</span>, ctgs<span class="hljs-variable">$class</span>, sum)/<span class="hljs-number">10</span>^<span class="hljs-number">6</span>

# read the alignment info against <span class="hljs-number">109</span> essential genes and get 
# some fragments <span class="hljs-keyword">in</span> <span class="hljs-keyword">group</span> <span class="hljs-number">0</span> back.
#####
# ./calc_contig_ess_gene_info.sh
#####
ctgs.ess &lt;- read.table(<span class="hljs-string">"essential_gene.info"</span>, header=FALSE, as.is=TRUE)
colnames(ctgs.ess) &lt;- c(<span class="hljs-string">'count'</span>, <span class="hljs-string">'name'</span>)

idx &lt;- <span class="hljs-keyword">intersect</span>(ctgs[ctgs<span class="hljs-variable">$class</span>==<span class="hljs-number">0</span>,<span class="hljs-string">'name'</span>], ctgs.ess<span class="hljs-variable">$name</span>)
ctgs0.ess &lt;- ctgs[<span class="hljs-keyword">match</span>(idx, ctgs<span class="hljs-variable">$name</span>), ]
ctgs0.ess<span class="hljs-variable">$count</span> &lt;- ctgs.ess[<span class="hljs-keyword">match</span>(idx, ctgs.ess<span class="hljs-variable">$name</span>), <span class="hljs-string">'count'</span>]

ctgs0.dist &lt;- <span class="hljs-keyword">matrix</span>(<span class="hljs-number">0</span>, nrow=nrow(ctgs0.ess),
                     ncol= length(unique(ctgs<span class="hljs-variable">$class</span>)))
coor.<span class="hljs-keyword">scale</span> &lt;- (<span class="hljs-keyword">max</span>(ctgs<span class="hljs-variable">$coverage</span>)-<span class="hljs-keyword">min</span>(ctgs<span class="hljs-variable">$coverage</span>))/(<span class="hljs-keyword">max</span>(ctgs<span class="hljs-variable">$gc</span>)-<span class="hljs-keyword">min</span>(ctgs<span class="hljs-variable">$gc</span>))
ctgs0.ess &lt;- transform(ctgs0.ess, gc.s= gc<span class="hljs-variable">*coor</span>.<span class="hljs-keyword">scale</span>)
ctgs &lt;- transform(ctgs, gc.s= gc<span class="hljs-variable">*coor</span>.<span class="hljs-keyword">scale</span>)
each.<span class="hljs-keyword">group</span> &lt;- split(ctgs, f= ctgs<span class="hljs-variable">$class</span>)

<span class="hljs-keyword">for</span>(i <span class="hljs-keyword">in</span> seq.<span class="hljs-keyword">int</span>(ctgs0.ess<span class="hljs-variable">$name</span>)){
  tmp.num &lt;-lapply(each.<span class="hljs-keyword">group</span>, function(idx){
    tmp.cen &lt;- kmeans(idx[, c(<span class="hljs-string">"coverage"</span>, <span class="hljs-string">"gc.s"</span>)], centers=<span class="hljs-number">1</span>, iter.<span class="hljs-keyword">max</span>=<span class="hljs-number">100</span>)<span class="hljs-variable">$centers</span>
    tmp.dist &lt;- dist(<span class="hljs-keyword">matrix</span>(c(tmp.cen, ctgs0.ess<span class="hljs-variable">$coverage</span>[i], ctgs0.ess<span class="hljs-variable">$gc</span>.s[i]), nrow=<span class="hljs-number">2</span>, byrow=T))
    as.<span class="hljs-keyword">matrix</span>(tmp.dist)[<span class="hljs-number">2</span>]
  })
  ctgs0.dist[i, ] &lt;- unlist(tmp.num)
}
ctgs0.dist &lt;- ctgs0.dist[, -<span class="hljs-number">1</span>]
ctgs0.ess<span class="hljs-variable">$class</span>.dis &lt;- apply(ctgs0.dist, <span class="hljs-number">1</span>, which.<span class="hljs-keyword">min</span>)

# display the final binning
cols &lt;- c(<span class="hljs-string">"C1"</span>=<span class="hljs-string">"red"</span>, <span class="hljs-string">"C2"</span>=<span class="hljs-string">"green"</span>, <span class="hljs-string">"C3"</span>=<span class="hljs-string">"blue"</span>, <span class="hljs-string">"C4"</span>=<span class="hljs-string">"yellow"</span>, <span class="hljs-string">"C5"</span>=<span class="hljs-string">"blueviolet"</span>,
          <span class="hljs-string">"C6"</span>=<span class="hljs-string">"gold"</span>, <span class="hljs-string">"C7"</span>=<span class="hljs-string">"aquamarine"</span>, <span class="hljs-string">"C8"</span>=<span class="hljs-string">"darkolivegreen"</span>, <span class="hljs-string">"C9"</span>=<span class="hljs-string">"deepskyblue"</span>,
          <span class="hljs-string">"C0"</span>=<span class="hljs-string">"gray"</span>)
p &lt;- ggplot(ctgs, aes(coverage, gc, <span class="hljs-keyword">color</span> = class, <span class="hljs-keyword">size</span> = length)) + geom_point()
(p+scale_color_manual(values= cols) )

# split the assembly <span class="hljs-keyword">file</span> <span class="hljs-keyword">in</span> each <span class="hljs-keyword">group</span> and export them out <span class="hljs-keyword">in</span> Fasta <span class="hljs-keyword">format</span> to prepare <span class="hljs-keyword">for</span> the downstream analysis
suppressMessages(library(Biostrings))
ctgsSeq &lt;- readDNAStringSet(<span class="hljs-string">"assembly.fa"</span>, <span class="hljs-keyword">format</span>= <span class="hljs-string">'fasta'</span>)
eachGroup&lt;- split(ctgs<span class="hljs-variable">$name</span>, ctgs<span class="hljs-variable">$class</span>)
sapply(names(eachGroup), 
       function(x){seqs &lt;- ctgsSeq[eachGroup[[x]]];
                   writeXStringSet(seqs, filepath= paste0(<span class="hljs-string">"super_"</span>, x, <span class="hljs-string">".fa"</span>))
                   })
</code></pre>
<h3><a id="34_Validate_each_genome_bin_234"></a>3-4. Validate each genome bin</h3>
<p>After binning the assembly, the validation work needs to done for each genome bin, including the completeness and integrity.</p>
<pre><code>  prodigal -<span class="hljs-tag">a</span> temp_orfs<span class="hljs-class">.faa</span> -<span class="hljs-tag">i</span> assembly<span class="hljs-class">.fa</span> -m -o temp_prodigal<span class="hljs-class">.stdout</span> -<span class="hljs-tag">p</span> meta -<span class="hljs-tag">q</span>  
  perl -wlan -e <span class="hljs-string">'print $F[0];'</span> temp_orfs<span class="hljs-class">.faa</span> &gt; orfs<span class="hljs-class">.faa</span>  
  hmmsearch --tblout temp_hmm_orfs<span class="hljs-class">.txt</span> --cut_tc --notextw --cpu <span class="hljs-number">8</span> essential<span class="hljs-class">.hmm</span> orfs<span class="hljs-class">.faa</span> &gt; temp_hmm<span class="hljs-class">.stdout</span>  
  perl -wlan -e <span class="hljs-string">'/#/ or print "@F[0,3]";'</span> temp_hmm_orfs<span class="hljs-class">.txt</span> &gt; assembly_2_ess-gene<span class="hljs-class">.txt</span>  
</code></pre>
<p>The above code is copied from the Supplementary Information of Albertsen et al. Based on the file <code>assembly_2_ess-gene.txt</code>, we can easily infer the completeness of each genomic bin. Besides, we also provide a little shell script <code>ess-gene_counts.sh</code> to count the statistic about the essential genes for each genome bin.</p>
<h2><a id="Step_4_Assign_taxonomy_to_each_cluster_and_then_pay_attention_on_interesting_genomes_248"></a>Step 4. Assign taxonomy to each cluster and then pay attention on interesting genomes</h2>
<p>Taxonmical annotation and function annotation of the genes are performed as following. At first, the predicted proteins of each group were aligned to the NCBI-NR database using BLASTP (best hit with E&lt;10-5), The BLASTP outputs were then filtered (at least 40% amino acid sequence identity and 50% length hit) and the imported into MEGAN5 for taxonomic classification. The detailed parameters of these programs are listed here.</p>
<h3><a id="41_Toxonomy_anslysis_using_MEGAN_251"></a>4-1. Toxonomy anslysis using MEGAN</h3>
<pre><code>  <span class="hljs-comment"># run blastp </span>
  <span class="hljs-title">blastp</span> -query orfs.faa -db /path/to/local/nr_database/ -out blastp.out -evalue 1e-<span class="hljs-number">5</span> \
          -max_target_seqs <span class="hljs-number">1</span> -num_threads <span class="hljs-number">8</span> -outfmt \
          <span class="hljs-string">"6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore slen"</span>

  <span class="hljs-comment"># filter results</span>
  perl -wlan -e <span class="hljs-string">'BEGIN{$,="\t";} print <span class="hljs-variable">@F</span>[0..11] if((<span class="hljs-variable">$F</span>[3]/<span class="hljs-variable">$F</span>[12]&gt;=0.5) &amp;&amp; (<span class="hljs-variable">$F</span>[2]&gt;=40));'</span>
</code></pre>
<p>After preparing the input files, run MEGAN at the command line with proper memory settings. Please refer to the <a href="http://ab.inf.uni-tuebingen.de/data/software/megan5/download/manual.pdf">MEGAN5 manual</a> here.</p>
<h3><a id="42_Phylogenetic_anlysis_based_on_16S_rRNA_sequence_265"></a>4-2. Phylogenetic anlysis based on 16S rRNA sequence</h3>
<p>To test the result of MEGAN, the 16S ribosomal RNA genes of each group were predicted by RNAmmer and the 16S ribosomal RNA genes of their closest neighbors were downloaded from the NCBI, the alignments were run in MEGA5 (neighbor-joining) at the default settings with 1 000 bootstraps.</p>
<h3><a id="43_Clusters_of_Orthologous_Groups_COG_analysis_268"></a>4-3. Clusters of Orthologous Groups (COG) analysis</h3>
<p>The COG analysis is performed by the <a href="http://sourceforge.net/projects/cogtriangles/files/">COG software</a>, and the steps are followed by the readme file in the software.</p>
<h3><a id="44_Pathway_analysis_using_GhostKOALA_service_271"></a>4-4. Pathway analysis using GhostKOALA service</h3>
<p>To further compare the functional potential of each group, the predicted ORFs were analyzed using the <a href="http://www.kegg.jp/ghostkoala/">GhostKOALA service</a> on the KEGG website. When the results were returned through your email, open the links and then compare your interesting pathway.</p>

      </section>
    </div>

    <!-- FOOTER  -->
    <div id="footer_wrap" class="outer">
      <footer class="inner">
        <p class="copyright">Meta-binning maintained by <a href="https://github.com/mingleiR">mingleiR</a></p>
        <p>Published with <a href="https://pages.github.com">GitHub Pages</a></p>
      </footer>
    </div>

    

  </body>
</html>