<?xml version="1.0" encoding="utf-8"?>
<search>
  
  
    
    <entry>
      <title><![CDATA[2017 Deep Learning Study Tasks]]></title>
      <url>https://linbojin.github.io/2017/04/18/2017-Study-Tasks/</url>
      <content type="html"><![CDATA[<h3 id="Online-Courses"><a href="#Online-Courses" class="headerlink" title="Online Courses"></a>Online Courses</h3><ul>
<li>[ ] <a href="http://web.stanford.edu/class/cs224n/" target="_blank" rel="external">CS224n: Natural Language Processing with Deep Learning (Winter 2017)</a> | <a href="http://web.stanford.edu/class/cs224n/syllabus.html" target="_blank" rel="external">Syllabus</a> | <a href="https://www.youtube.com/playlist?list=PL3FW7Lu3i5Jsnh1rnUwq_TcylNr7EkRe6" target="_blank" rel="external">Lecture Videos</a></li>
<li>[ ] <a href="https://www.coursera.org/learn/machine-learning" target="_blank" rel="external">CS229: Machine Learning</a></li>
<li>[ ] <a href="http://cs231n.stanford.edu/index.html" target="_blank" rel="external">CS231n: Convolutional Neural Networks for Visual Recognition (Spring 2017)</a> | <a href="http://cs231n.stanford.edu/syllabus.html" target="_blank" rel="external">Syllabus</a> | <a href="">Lecture Videos</a></li>
</ul>
<h3 id="Reading-Books"><a href="#Reading-Books" class="headerlink" title="Reading Books"></a>Reading Books</h3><ul>
<li>[ ] Machine Learning (机器学习), Zhou Zhihua (周志华)</li>
<li>[ ] <a href="http://neuralnetworksanddeeplearning.com/" target="_blank" rel="external">Michael Nielsen’s Neural Networks and Deep Learning book (NNDL)</a></li>
<li>[ ] <a href="http://www.deeplearningbook.org/" target="_blank" rel="external">The Deep Learning Book by Ian Goodfellow</a></li>
<li>[ ] Pattern Recognition and Machine Learning</li>
</ul>
]]></content>
    </entry>
    
    <entry>
      <title><![CDATA[TensorFlow Study Notes 01: Installing on EC2]]></title>
      <url>https://linbojin.github.io/2017/03/04/tf01/</url>
      <content type="html"><![CDATA[<ul>
<li>Essentials</li>
<li>CUDA 8.0</li>
<li>cuDNN v5.1, for CUDA 8.0</li>
<li>TensorFlow 1.0.0</li>
</ul>
<p>Instance type: EC2 p2.xlarge, with 1 GPU (Nvidia K80), 61 GB RAM, $0.900 hourly<br>AMI: Ubuntu Server 16.04 LTS (HVM), SSD Volume Type - ami-a58d0dc5<br><a id="more"></a><br>How to launch the EC2 instance is skipped here; the steps below are run directly on the instance.</p>
<h2 id="安装dependencies-amp-build-tools"><a href="#安装dependencies-amp-build-tools" class="headerlink" title="安装dependencies &amp; build tools"></a>安装dependencies &amp; build tools</h2><figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div></pre></td><td class="code"><pre><div class="line">sudo apt-get update &amp;&amp; sudo apt-get -y upgrade</div><div class="line">sudo apt-get install -y build-essential git swig default-jdk zip zlib1g-dev</div><div class="line"> </div><div class="line"><span class="comment"># 确定gcc已经安装</span></div><div class="line">gcc --version</div><div class="line"> </div><div class="line"><span class="comment"># 判断是否有NVIDIA GPU</span></div><div class="line">lspci | grep -i nvidia</div></pre></td></tr></table></figure>
<p>p2.xlarge GPU 如下：<br><img src="/media/14886258446603.jpg" alt=""></p>
<p>We need to blacklist Nouveau, which conflicts with the NVIDIA driver.</p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div></pre></td><td class="code"><pre><div class="line"><span class="built_in">echo</span> <span class="_">-e</span> <span class="string">"blacklist nouveau\nblacklist lbm-nouveau\noptions nouveau modeset=0\nalias nouveau off\nalias lbm-nouveau off\n"</span> | sudo tee /etc/modprobe.d/blacklist-nouveau.conf</div><div class="line"><span class="built_in">echo</span> options nouveau modeset=0 | sudo tee <span class="_">-a</span> /etc/modprobe.d/nouveau-kms.conf</div><div class="line">sudo update-initramfs -u</div><div class="line">sudo reboot</div></pre></td></tr></table></figure>
<p>Install kernel headers</p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div></pre></td><td class="code"><pre><div class="line">sudo apt-get install -y linux-image-extra-virtual</div><div class="line">sudo reboot</div><div class="line"></div><div class="line">sudo apt-get install -y linux-source linux-headers-`uname -r`</div></pre></td></tr></table></figure>
<h2 id="安装Cuda-8-0"><a href="#安装Cuda-8-0" class="headerlink" title="安装Cuda 8.0"></a>安装Cuda 8.0</h2><figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div></pre></td><td class="code"><pre><div class="line">wget https://developer.nvidia.com/compute/cuda/8.0/Prod2/local_installers/cuda-repo-ubuntu1604-8-0-local-ga2_8.0.61-1_amd64-deb</div><div class="line"></div><div class="line">sudo dpkg -i cuda-repo-ubuntu1604-8-0-local-ga2_8.0.61-1_amd64-deb</div><div class="line">rm cuda-repo-ubuntu1604-8-0-local-ga2_8.0.61-1_amd64-deb</div><div class="line"></div><div class="line">sudo apt-get update</div><div class="line">sudo apt-get install -y cuda</div></pre></td></tr></table></figure>
<h3 id="配置环境变量"><a href="#配置环境变量" class="headerlink" title="配置环境变量"></a>配置环境变量</h3><figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div></pre></td><td class="code"><pre><div class="line">vim ~/.profile</div><div class="line"></div><div class="line"><span class="built_in">export</span> CUDA_HOME=/usr/<span class="built_in">local</span>/cuda</div><div class="line"><span class="built_in">export</span> CUDA_ROOT=/usr/<span class="built_in">local</span>/cuda</div><div class="line"><span class="built_in">export</span> PATH=<span class="variable">$PATH</span>:<span class="variable">$CUDA_ROOT</span>/bin</div><div class="line"><span class="built_in">export</span> LD_LIBRARY_PATH=<span class="variable">$LD_LIBRARY_PATH</span>:<span class="variable">$CUDA_ROOT</span>/lib64</div><div class="line"></div><div class="line">sudo reboot</div></pre></td></tr></table></figure>
<h3 id="验证Cuda安装成功"><a href="#验证Cuda安装成功" class="headerlink" title="验证Cuda安装成功"></a>验证Cuda安装成功</h3><figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div></pre></td><td class="code"><pre><div class="line">nvcc --version</div><div class="line"></div><div class="line"><span class="comment"># verify the driver is installed</span></div><div class="line">nvidia-smi</div><div class="line"></div><div class="line"><span class="built_in">cd</span> /usr/<span class="built_in">local</span>/cuda/</div><div class="line"><span class="built_in">cd</span> samples</div><div class="line">sudo make</div><div class="line"></div><div class="line"><span class="built_in">cd</span> ./1_Utilities/deviceQuery</div><div class="line">./deviceQuery</div></pre></td></tr></table></figure>
<p><img src="/media/14886272410398.jpg" alt=""><br><img src="/media/14886272706287.jpg" alt=""><br><img src="/media/14886278898974.jpg" alt=""></p>
<h2 id="安装cuDNN-v5-1"><a href="#安装cuDNN-v5-1" class="headerlink" title="安装cuDNN v5.1"></a>安装cuDNN v5.1</h2><p><a href="https://developer.nvidia.com/rdp/cudnn-download" target="_blank" rel="external">https://developer.nvidia.com/rdp/cudnn-download</a><br>cuDNN v5.1 Runtime Library for Ubuntu14.04 (Deb)<br>cuDNN v5.1 Developer Library for Ubuntu14.04 (Deb)</p>
<p>需要先加入Accelerated Computing Developer Program，然后下载到本地，再上传到ec2，然后安装</p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div></pre></td><td class="code"><pre><div class="line">sudo dpkg -i libcudnn5_5.1.5-1+cuda8.0_amd64.deb</div><div class="line">sudo dpkg -i libcudnn5-dev_5.1.5-1+cuda8.0_amd64.deb</div></pre></td></tr></table></figure>
<p>Install the libcupti-dev library, the NVIDIA CUDA Profiling Tools Interface, which provides advanced profiling support.</p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">sudo apt-get install libcupti-dev</div></pre></td></tr></table></figure>
<h2 id="安装Tensorflow"><a href="#安装Tensorflow" class="headerlink" title="安装Tensorflow"></a>安装Tensorflow</h2><figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div></pre></td><td class="code"><pre><div class="line">wget https://repo.continuum.io/archive/Anaconda2-4.3.0-Linux-x86_64.sh</div><div class="line">bash Anaconda2-4.3.0-Linux-x86_64.sh</div><div class="line"></div><div class="line">conda create -n tensorflow</div><div class="line"><span class="built_in">source</span> activate tensorflow</div><div class="line">pip install --ignore-installed --upgrade https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.0.0-cp27-none-linux_x86_64.whl</div></pre></td></tr></table></figure>
<p>运行成功，Tensorflow + GPU</p>
<p><img src="/media/14886284271425.jpg" alt=""></p>
<p>Ref:<br><a href="https://www.tensorflow.org/install/install_linux" target="_blank" rel="external">https://www.tensorflow.org/install/install_linux</a><br><a href="https://gist.github.com/erikbern/78ba519b97b440e10640" target="_blank" rel="external">https://gist.github.com/erikbern/78ba519b97b440e10640</a><br><a href="http://expressionflow.com/2016/10/09/installing-tensorflow-on-an-aws-ec2-p2-gpu-instance/" target="_blank" rel="external">http://expressionflow.com/2016/10/09/installing-tensorflow-on-an-aws-ec2-p2-gpu-instance/</a><br><a href="https://medium.com/@giltamari/tensorflow-getting-started-gpu-installation-on-ec2-9b9915d95d6f#.ef96jc7a4" target="_blank" rel="external">https://medium.com/@giltamari/tensorflow-getting-started-gpu-installation-on-ec2-9b9915d95d6f#.ef96jc7a4</a><br><a href="https://eatcodeplay.com/installing-tensorflow-with-python-3-on-ec2-gpu-instances-f9fa199eb3cc#.142acv4zq" target="_blank" rel="external">https://eatcodeplay.com/installing-tensorflow-with-python-3-on-ec2-gpu-instances-f9fa199eb3cc#.142acv4zq</a></p>
]]></content>
    </entry>
    
    <entry>
      <title><![CDATA[2016 Reading List]]></title>
      <url>https://linbojin.github.io/2016/10/04/2016-Reading-List/</url>
      <content type="html"><![CDATA[<p>November</p>
<ol>
<li>The Linux Command Line: <a href="http://linuxcommand.org/tlcl.php" target="_blank" rel="external">http://linuxcommand.org/tlcl.php</a></li>
<li>Scala by Example: <a href="http://www.scala-lang.org/docu/files/ScalaByExample.pdf" target="_blank" rel="external">http://www.scala-lang.org/docu/files/ScalaByExample.pdf</a></li>
</ol>
<div class="github-widget" data-repo="linbojin/linbojin.github.io"></div>

]]></content>
    </entry>
    
    <entry>
      <title><![CDATA[Java Basic Knowledge]]></title>
      <url>https://linbojin.github.io/2016/05/13/Java-Basic-Knowledge/</url>
      <content type="html"><![CDATA[<p>A Java program can be defined as a collection of objects that <strong>communicate via invoking each other’s methods</strong>.</p>
<ul>
<li>Java is an Object-Oriented Language:<ul>
<li>Polymorphism</li>
<li>Inheritance</li>
<li>Encapsulation</li>
<li>Abstraction</li>
<li>Classes: A class can be defined as a template/blueprint that describes the behaviors/states that objects of its type support.</li>
<li>Objects: Objects have states and behaviors. </li>
<li>Instance</li>
<li>Method</li>
<li>Message Passing</li>
</ul>
</li>
</ul>
<a id="more"></a>
<h4 id="Inheritance"><a href="#Inheritance" class="headerlink" title="Inheritance:"></a>Inheritance:</h4><p>In Java, classes can be derived from classes. Basically if you need to create a new class and here is already a class that has some of the code you require, then it is possible to <strong>derive your new class from the already existing code</strong>.<br>This concept allows you to <strong>reuse the fields and methods of the existing class without having to rewrite the code</strong> in a new class. In this scenario the existing class is called the <strong>superclass</strong> and the derived class is called the <strong>subclass</strong>.</p>
<h4 id="Interfaces"><a href="#Interfaces" class="headerlink" title="Interfaces:"></a>Interfaces:</h4><p>In Java language, an interface can be defined as a <strong>contract between objects on how to communicate with each other</strong>. Interfaces play a vital role when it comes to the concept of inheritance.<br>An interface defines the methods, a deriving class(subclass) should use. But <strong>the implementation of the methods is totally up to the subclass</strong>.</p>
<h4 id="Variable-Types-inside-Classes"><a href="#Variable-Types-inside-Classes" class="headerlink" title="Variable Types inside Classes:"></a>Variable Types inside Classes:</h4><ul>
<li><strong>Local variable</strong>: There is <strong>no default value</strong> for local variables, so a local variable must be declared and assigned an initial value before its first use. </li>
<li><strong>Instance variable</strong>: When space is allocated for an object on the heap, a slot for each instance variable value is created. <strong>Within static methods, instance variables should be accessed using the fully qualified name: ObjectReference.VariableName.</strong></li>
<li><strong>Class variable</strong> (with the static keyword): There is <strong>only one copy of each class variable per class</strong>, regardless of how many objects are created from it. It is accessed as <strong>ClassName.VariableName</strong>.</li>
</ul>
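<p>The three kinds of variables listed above can be seen side by side in one small class. This is only a sketch; <code>Counter</code> and its members are hypothetical names introduced for the example.</p>

```java
// Hypothetical class for illustration only.
class Counter {
    // Class variable: one copy shared by every Counter object.
    static int created = 0;

    // Instance variable: one slot per object, default value 0.
    int count;

    Counter() {
        created++;
    }

    int incrementBy(int step) {
        // Local variable: has no default value, so it is assigned before use.
        int result = count + step;
        count = result;
        return result;
    }

    // Inside a static method there is no "this", so the instance
    // variable is reached through an object reference.
    static int read(Counter counter) {
        return counter.count;
    }

    public static void main(String[] args) {
        Counter first = new Counter();
        Counter second = new Counter();
        first.incrementBy(5);
        System.out.println(Counter.read(first));  // value held by first
        System.out.println(Counter.read(second)); // second is unaffected
        System.out.println(Counter.created);      // shared across objects
    }
}
```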
<h4 id="Java-Data-Types"><a href="#Java-Data-Types" class="headerlink" title="Java Data Types:"></a>Java Data Types:</h4><ul>
<li>Primitive Data Types:<ul>
<li>byte:  8-bit signed, -128(-2^7) ~ 127 (inclusive)(2^7 -1)</li>
<li>short: 16-bit signed, -32,768(-2^15) ~ 32,767 (inclusive) (2^15 -1)</li>
<li>int:   32-bit signed, -2,147,483,648(-2^31) ~ 2,147,483,647(inclusive)(2^31 -1)</li>
<li>long:  64-bit signed, (-2^63) ~ (2^63-1)</li>
<li>float:  single-precision 32-bit floating point</li>
<li>double: double-precision 64-bit floating point, the default data type for decimal values.</li>
<li>boolean: represents one bit of information; the only values are true and false.</li>
<li>char: single 16-bit Unicode character, Minimum value is ‘\u0000’ (or 0), Maximum value is ‘\uffff’ (or 65,535 inclusive).</li>
</ul>
</li>
<li>Reference Data Types: Class</li>
</ul>
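<p>The ranges above do not need to be memorized, because the wrapper classes in <code>java.lang</code> expose them as constants. The sketch below prints a few of them; the class name <code>PrimitiveRangesDemo</code> and the helper <code>overflowedMax</code> are hypothetical.</p>

```java
// Hypothetical demo class; the MIN_VALUE/MAX_VALUE constants are standard java.lang fields.
class PrimitiveRangesDemo {
    // Integer arithmetic wraps around silently when the range is exceeded.
    static int overflowedMax() {
        int max = Integer.MAX_VALUE;
        return max + 1;
    }

    public static void main(String[] args) {
        System.out.println("byte:  " + Byte.MIN_VALUE + " ~ " + Byte.MAX_VALUE);
        System.out.println("short: " + Short.MIN_VALUE + " ~ " + Short.MAX_VALUE);
        System.out.println("int:   " + Integer.MIN_VALUE + " ~ " + Integer.MAX_VALUE);
        System.out.println("long:  " + Long.MIN_VALUE + " ~ " + Long.MAX_VALUE);
        System.out.println("char:  " + (int) Character.MIN_VALUE + " ~ " + (int) Character.MAX_VALUE);
        System.out.println("int max + 1 = " + overflowedMax());
    }
}
```

<p>The last line illustrates why the bounds matter: adding 1 to the largest <code>int</code> wraps around to the smallest one instead of raising an error.</p>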
<h4 id="Java-Class-Types"><a href="#Java-Class-Types" class="headerlink" title="Java Class Types:"></a>Java Class Types:</h4><ul>
<li>abstract classes</li>
<li>final classes</li>
<li>Inner classes</li>
<li>Anonymous classes.</li>
</ul>
<h4 id="Java-Modifiers"><a href="#Java-Modifiers" class="headerlink" title="Java Modifiers:"></a>Java Modifiers:</h4><p>Like other languages, it is possible to modify classes, methods, etc., by using modifiers. There are two categories of modifiers:</p>
<ul>
<li>Access Modifiers:  <ul>
<li><strong>default</strong>: <strong>Visible to the package</strong> </li>
<li>public: Visible to the world </li>
<li>protected: Visible to the package and all subclasses</li>
<li>private: Visible to the class only</li>
</ul>
</li>
<li>Non-access Modifiers:<ul>
<li>The <strong>static</strong> modifier for creating class <strong>methods</strong> and <strong>variables</strong></li>
<li>The <strong>final</strong> modifier for finalizing the implementations of <strong>classes</strong>, <strong>methods</strong>, and <strong>variables</strong>.</li>
<li>The <strong>abstract</strong> modifier for creating abstract <strong>classes</strong> and <strong>methods</strong>.</li>
<li>The <strong>synchronized</strong> and <strong>volatile</strong> modifiers, which are used for <strong>threads</strong>.</li>
</ul>
</li>
</ul>
<h4 id="Java-Basic-Operators"><a href="#Java-Basic-Operators" class="headerlink" title="Java Basic Operators"></a>Java Basic Operators</h4><p><a href="http://www.tutorialspoint.com/java/java_basic_operators.htm" target="_blank" rel="external">http://www.tutorialspoint.com/java/java_basic_operators.html</a><br>The Bitwise Operators:<br>works on bits and performs bit-by-bit operation.</p>
<figure class="highlight java"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div><div class="line">23</div><div class="line">24</div><div class="line">25</div><div class="line">26</div><div class="line">27</div><div class="line">28</div><div class="line">29</div></pre></td><td class="code"><pre><div class="line"><span class="keyword">public</span> <span class="class"><span class="keyword">class</span> <span class="title">Test</span> </span>&#123;</div><div class="line"></div><div class="line">  <span class="function"><span class="keyword">public</span> <span class="keyword">static</span> <span class="keyword">void</span> <span class="title">main</span><span class="params">(String args[])</span> </span>&#123;</div><div class="line">     <span class="keyword">int</span> a = <span class="number">60</span>;	<span class="comment">/* 60 = 0011 1100 */</span>  </div><div class="line">     <span class="keyword">int</span> b = <span class="number">13</span>;	<span class="comment">/* 13 = 0000 1101 */</span></div><div class="line">     <span class="keyword">int</span> c = <span class="number">0</span>;</div><div class="line"></div><div class="line">     c = a &amp; b;       <span class="comment">/* 12 = 0000 1100 */</span> </div><div class="line">     System.out.println(<span class="string">"a &amp; b = "</span> + c );</div><div class="line"></div><div class="line">     c = a | b;       <span class="comment">/* 61 = 0011 1101 
*/</span></div><div class="line">     System.out.println(<span class="string">"a | b = "</span> + c );</div><div class="line"></div><div class="line">     c = a ^ b;       <span class="comment">/* 49 = 0011 0001 */</span></div><div class="line">     System.out.println(<span class="string">"a ^ b = "</span> + c );</div><div class="line"></div><div class="line">     c = ~a;          <span class="comment">/*-61 = 1100 0011 */</span></div><div class="line">     System.out.println(<span class="string">"~a = "</span> + c );</div><div class="line"></div><div class="line">     c = a &lt;&lt; <span class="number">2</span>;     <span class="comment">/* 240 = 1111 0000 */</span></div><div class="line">     System.out.println(<span class="string">"a &lt;&lt; 2 = "</span> + c );</div><div class="line"></div><div class="line">     c = a &gt;&gt; <span class="number">2</span>;     <span class="comment">/* 15 = 1111 */</span></div><div class="line">     System.out.println(<span class="string">"a &gt;&gt; 2  = "</span> + c );</div><div class="line"></div><div class="line">     c = a &gt;&gt;&gt; <span class="number">2</span>;     <span class="comment">/* 15 = 0000 1111 */</span></div><div class="line">     System.out.println(<span class="string">"a &gt;&gt;&gt; 2 = "</span> + c );</div><div class="line">  &#125;</div><div class="line">&#125;</div></pre></td></tr></table></figure>
<p>The instanceof operator: </p>
<figure class="highlight java"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line"><span class="keyword">boolean</span> result = <span class="string">"str"</span> <span class="keyword">instanceof</span> String;</div></pre></td></tr></table></figure>
<h4 id="Java-Loop-Control"><a href="#Java-Loop-Control" class="headerlink" title="Java Loop Control"></a>Java Loop Control</h4><figure class="highlight java"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div></pre></td><td class="code"><pre><div class="line"><span class="keyword">while</span> (x &lt; <span class="number">10</span>)&#123;</div><div class="line">  do sth;</div><div class="line">  x++;</div><div class="line">&#125;</div><div class="line"></div><div class="line"><span class="keyword">for</span> (<span class="keyword">int</span> i=<span class="number">0</span>; i&lt;<span class="number">10</span>; i++) &#123;</div><div class="line">  do sth;</div><div class="line">&#125;</div><div class="line"></div><div class="line">do &#123;</div><div class="line">  sth;</div><div class="line">  x++;</div><div class="line">&#125; <span class="keyword">while</span> (x &lt; <span class="number">10</span>)</div></pre></td></tr></table></figure>
<p><strong>Enhanced for loop</strong><br><figure class="highlight java"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div></pre></td><td class="code"><pre><div class="line"><span class="keyword">int</span> [] numbers = &#123;<span class="number">10</span>, <span class="number">20</span>, <span class="number">30</span>, <span class="number">40</span>, <span class="number">50</span>&#125;;</div><div class="line"><span class="keyword">for</span>(<span class="keyword">int</span> x : numbers )&#123;</div><div class="line">  System.out.print( x );</div><div class="line">  System.out.print(<span class="string">","</span>);</div><div class="line">&#125;</div></pre></td></tr></table></figure></p>
<h4 id="Java-Decision-Making"><a href="#Java-Decision-Making" class="headerlink" title="Java Decision Making"></a>Java Decision Making</h4><figure class="highlight ceylon"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div></pre></td><td class="code"><pre><div class="line"><span class="keyword">if</span>(x)&#123;</div><div class="line">	xxx</div><div class="line">&#125; <span class="keyword">else</span> <span class="keyword">if</span> &#123;</div><div class="line">  xxx</div><div class="line">&#125; <span class="keyword">else</span> &#123;</div><div class="line">  xxx</div><div class="line">&#125;</div><div class="line"></div><div class="line"><span class="keyword">switch</span>(expression)&#123;</div><div class="line">    <span class="keyword">case</span> <span class="keyword">value</span> :</div><div class="line">       <span class="comment">//Statements</span></div><div class="line">       <span class="keyword">break</span>; <span class="comment">//optional</span></div><div class="line">    <span class="keyword">case</span> <span class="keyword">value</span> :</div><div class="line">       <span class="comment">//Statements</span></div><div class="line">       <span class="keyword">break</span>; <span class="comment">//optional</span></div><div class="line">    <span class="comment">//You can have any number of case statements.</span></div><div class="line">    <span class="keyword">default</span> : <span class="comment">//Optional</span></div><div class="line">       <span 
class="comment">//Statements</span></div><div class="line">&#125;</div></pre></td></tr></table></figure>
]]></content>
    </entry>
    
    <entry>
      <title><![CDATA[Advanced Spark Learning Resources]]></title>
      <url>https://linbojin.github.io/2016/04/16/Advanced-Spark-Learning-Resources/</url>
      <content type="html"><![CDATA[<h3 id="Advanced-Training-Tutorials-About-RDD-and-Spark-Internals"><a href="#Advanced-Training-Tutorials-About-RDD-and-Spark-Internals" class="headerlink" title="Advanced Training Tutorials About RDD and Spark Internals"></a>Advanced Training Tutorials About RDD and Spark Internals</h3><ul>
<li>Spark Summit EAST 2015, March 18-19, 2015, Spark Version 1.3 <ul>
<li>(<strong>Recommended</strong>) <a href="https://spark-summit.org/east-2015/training" target="_blank" rel="external">Advanced Apache Spark</a> – <a href="https://www.linkedin.com/in/blueplastic" target="_blank" rel="external">Sameer Farooqui</a><br><a href="https://spark-summit.org/2014/wp-content/uploads/2015/03/SparkSummitEast2015-AdvDevOps-StudentSlides.pdf" target="_blank" rel="external">Slide</a> | <a href="https://www.youtube.com/watch?v=7ooZ4S7Ay6Y&amp;index=2&amp;list=PL-x35fyliRwhrzM1Hq62WX4UeIIEqw3SU" target="_blank" rel="external">Video</a></li>
</ul>
</li>
</ul>
<a id="more"></a>
<ul>
<li>Spark Summit 2015, June 15-17, 2015, Spark Version 1.4<ul>
<li><a href="https://spark-summit.org/2015/training/#devops" target="_blank" rel="external">DevOps with Apache Spark Workshop</a> – <a href="https://www.linkedin.com/in/blueplastic" target="_blank" rel="external">Sameer Farooqui</a><br><a href="http://www.slideshare.net/SparkSummit/dev-ops-training" target="_blank" rel="external">Slide</a> | <a href="https://www.youtube.com/watch?v=l4ZYUfZuRbU&amp;list=PL-x35fyliRwioDix9XjD3HptH8ro55SuB&amp;index=6" target="_blank" rel="external">Video 1</a> | <a href="https://www.youtube.com/watch?v=G7PcSBhfSQo&amp;index=6&amp;list=PL-x35fyliRwioDix9XjD3HptH8ro55SuB" target="_blank" rel="external">Video 2</a></li>
</ul>
</li>
</ul>
<ul>
<li>Spark Summit 2014, June 30 - July 2, 2014, Spark Version 1.1<ul>
<li>Advanced Spark Internals and Tuning – <a href="https://www.linkedin.com/in/reynoldxin" target="_blank" rel="external">Reynold Xin</a><br><a href="https://databricks-training.s3.amazonaws.com/slides/advanced-spark-training.pdf" target="_blank" rel="external">Slides</a> | <a href="https://www.youtube.com/watch?v=HG2Yd-3r4-M&amp;list=PLTPXxbhUt-YWGNTaDj6HSjnHMxiTD1HCR&amp;index=1" target="_blank" rel="external">Video</a></li>
<li>Spark SQL – <a href="https://www.linkedin.com/in/michaelarmbrust" target="_blank" rel="external">Michael Armbrust</a><br><a href="https://databricks-training.s3.amazonaws.com/slides/SparkSQLTraining.Summit.July2014.pdf" target="_blank" rel="external">Slides</a> | <a href="https://www.youtube.com/watch?v=5TKSM1UdSXQ&amp;index=2&amp;list=PLTPXxbhUt-YWGNTaDj6HSjnHMxiTD1HCR" target="_blank" rel="external">Video</a></li>
</ul>
</li>
</ul>
<h3 id="Spark-SQL-DataFrame-DataSet-And-Tungsten"><a href="#Spark-SQL-DataFrame-DataSet-And-Tungsten" class="headerlink" title="Spark SQL, DataFrame, DataSet And Tungsten"></a>Spark SQL, DataFrame, DataSet And Tungsten</h3><ul>
<li><p><a href="https://spark-summit.org/2015/events/spark-dataframes-simple-and-fast-analysis-of-structured-data/" target="_blank" rel="external">Spark DataFrames: Simple and Fast Analysis of Structured Data</a> – Michael Armbrust<br><a href="http://www.slideshare.net/databricks/spark-dataframes-simple-and-fast-analytics-on-structured-data-at-spark-summit-2015?from_action=save" target="_blank" rel="external">Slide</a> | <a href="https://www.youtube.com/watch?v=xWkJCUcD55w&amp;list=PL-x35fyliRwgfhffEpywn4q23ykotgQJ6&amp;index=14" target="_blank" rel="external">Video</a></p>
</li>
<li><p><a href="https://spark-summit.org/east-2016/events/structuring-spark-dataframes-datasets-and-streaming/" target="_blank" rel="external">Structuring Spark: DataFrames, Datasets, and Streaming</a> – Michael Armbrust<br><a href="http://www.slideshare.net/SparkSummit/structuring-spark-dataframes-datasets-and-streaming-by-michael-armbrust" target="_blank" rel="external">Slide</a> | <a href="https://www.youtube.com/watch?v=i7l3JQRx7Qw&amp;feature=youtu.be" target="_blank" rel="external">Video</a> </p>
</li>
<li><p><a href="https://spark-summit.org/2015/events/keynote-9/" target="_blank" rel="external">From DataFrames to Tungsten: A Peek into Spark’s Future</a> – Reynold Xin<br><a href="http://www.slideshare.net/SparkSummit/reynold-xin" target="_blank" rel="external">Slide</a> | <a href="https://www.youtube.com/watch?v=VbSar607HM0&amp;list=PL-x35fyliRwgdKsaLFMwl-Q-vSd7-X6mi&amp;index=9" target="_blank" rel="external">Video</a></p>
</li>
<li><p><a href="https://spark-summit.org/2015/events/deep-dive-into-project-tungsten-bringing-spark-closer-to-bare-metal/" target="_blank" rel="external">Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal</a> – Josh Rosen<br><a href="http://www.slideshare.net/SparkSummit/deep-dive-into-project-tungsten-josh-rosen" target="_blank" rel="external">Slide</a> | <a href="https://www.youtube.com/watch?v=5ajs8EIPWGI&amp;index=5&amp;list=PL-x35fyliRwgfhffEpywn4q23ykotgQJ6" target="_blank" rel="external">Video</a></p>
</li>
</ul>
<h3 id="Spark-Research-Papers"><a href="#Spark-Research-Papers" class="headerlink" title="Spark Research Papers"></a>Spark Research Papers</h3><ul>
<li><p>Zaharia, Matei, et al. “<a href="http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf" target="_blank" rel="external"><strong>Spark: Cluster Computing with Working Sets</strong></a>”. HotCloud 10 (2010): 10-10.</p>
</li>
<li><p>Zaharia, Matei, et al. “<a href="http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf" target="_blank" rel="external"><strong>Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing</strong></a>”. Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation. USENIX Association, 2012.</p>
</li>
<li><p>Armbrust, Michael, et al. “<a href="https://amplab.cs.berkeley.edu/wp-content/uploads/2015/03/SparkSQLSigmod2015.pdf" target="_blank" rel="external"><strong>Spark SQL: Relational Data Processing in Spark</strong></a>”. Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM, 2015.</p>
</li>
<li><p>Zaharia, Matei, et al. “<a href="http://www.cs.berkeley.edu/~matei/papers/2013/sosp_spark_streaming.pdf" target="_blank" rel="external"><strong>Discretized Streams: Fault-Tolerant Streaming Computation at Scale</strong></a>”. Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles. ACM, 2013.</p>
</li>
<li><p>Xin, Reynold S., et al. “<a href="https://amplab.cs.berkeley.edu/wp-content/uploads/2013/05/grades-graphx_with_fonts.pdf" target="_blank" rel="external"><strong>GraphX: A Resilient Distributed Graph System on Spark</strong></a>”. First International Workshop on Graph Data Management Experiences and Systems. ACM, 2013.</p>
</li>
</ul>
]]></content>
    </entry>
    
    <entry>
      <title><![CDATA[Hadoop Guide Chapter 11 Administering Hadoop]]></title>
      <url>https://linbojin.github.io/2016/04/03/Hadoop-Guide-Chapter-11-Administering-Hadoop/</url>
      <content type="html"></content>
    </entry>
    
    <entry>
      <title><![CDATA[Hadoop Guide Chapter 10 Setting Up a Hadoop Cluster]]></title>
      <url>https://linbojin.github.io/2016/04/03/Hadoop-Guide-Chapter-10-Setting-Up-a-Hadoop-Cluster/</url>
      <content type="html"></content>
    </entry>
    
    <entry>
      <title><![CDATA[Setup Vim as an IDE]]></title>
      <url>https://linbojin.github.io/2016/04/01/Setup-Vim-as-an-IDE/</url>
      <content type="html"><![CDATA[<p>Simple steps to set up Vim as an IDE for Python, Scala, and so on.<br>If you are not familiar with Vim, you can read this post first: <a href="http://linbojin.github.io/2016/03/04/Getting-Started-with-Vim-by-Vimtutor/">Getting Started with Vim by Vimtutor</a>.</p>
<h3 id="Install-spf13-vim"><a href="#Install-spf13-vim" class="headerlink" title="Install spf13-vim"></a>Install spf13-vim</h3><p><a href="http://vim.spf13.com/#install" target="_blank" rel="external">spf13-vim</a> is a distribution of vim plugins and resources for Vim, GVim and MacVim. We first install it as the base IDE and then customize it:</p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div></pre></td><td class="code"><pre><div class="line">$ sudo yum update</div><div class="line">$ sudo yum install git</div><div class="line">$ vim --version             <span class="comment"># check that the version is &gt;= 7.4</span></div><div class="line">$ curl http://j.mp/spf13-vim3 -L -o - | sh</div></pre></td></tr></table></figure>
<a id="more"></a>
<p>Useful shortcuts:</p>
<ul>
<li><code>ctrl + e</code>: open/close NERDTree left tool</li>
<li><code>ctrl + p</code>: search and open files</li>
<li><code>m</code> (inside NERDTree): Open NERDTree Menu<ul>
<li><code>a</code>: add file or folder </li>
</ul>
</li>
</ul>
<h3 id="Customization"><a href="#Customization" class="headerlink" title="Customization"></a>Customization</h3><p>For customization, we can create three files:</p>
<figure class="highlight stylus"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div></pre></td><td class="code"><pre><div class="line">~/<span class="selector-class">.vimrc</span><span class="selector-class">.local</span> </div><div class="line">~/<span class="selector-class">.vimrc</span><span class="selector-class">.before</span><span class="selector-class">.local</span>  </div><div class="line">~/<span class="selector-class">.vimrc</span><span class="selector-class">.bundles</span><span class="selector-class">.local</span></div></pre></td></tr></table></figure>
<p>Make tab equal to 2 spaces:</p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div></pre></td><td class="code"><pre><div class="line">$ vim .vimrc.local</div><div class="line">autocmd FileType * setlocal expandtab tabstop=2 shiftwidth=2 softtabstop=2</div></pre></td></tr></table></figure>
<p>Select the languages you want to support; the related plugins will be installed:</p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div></pre></td><td class="code"><pre><div class="line">$ vim .vimrc.before.local    </div><div class="line"><span class="built_in">let</span> g:spf13_bundle_groups=[<span class="string">'general'</span>, <span class="string">'writing'</span>, <span class="string">'youcompleteme'</span>, <span class="string">'programming'</span>, <span class="string">'scala'</span>, <span class="string">'php'</span>, <span class="string">'ruby'</span>, <span class="string">'python'</span>, <span class="string">'javascript'</span>, <span class="string">'html'</span>, <span class="string">'misc'</span>]</div><div class="line"><span class="built_in">let</span> g:spf13_no_autochdir = 1</div><div class="line"><span class="built_in">let</span> g:ycm_path_to_python_interpreter = <span class="string">'/usr/bin/python'</span></div></pre></td></tr></table></figure>
<p>After configuration, run the following command inside Vim:<br><figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div></pre></td><td class="code"><pre><div class="line">$ vim</div><div class="line">:BundleInstall</div></pre></td></tr></table></figure></p>
<p>The youcompleteme plugin will print a warning at this point, because it has to be compiled manually in the next step.</p>
<h3 id="Compile-youcompleteme"><a href="#Compile-youcompleteme" class="headerlink" title="Compile youcompleteme"></a>Compile youcompleteme</h3><p><a href="https://github.com/Valloric/YouCompleteMe#fedora-linux-x64" target="_blank" rel="external">YouCompleteMe</a> is a powerful code-completion engine for Vim:</p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div></pre></td><td class="code"><pre><div class="line">$ sudo yum install automake gcc gcc-c++ kernel-devel cmake</div><div class="line">$ sudo yum install python-devel python3-devel</div><div class="line"></div><div class="line"><span class="comment"># Compiling YCM with semantic support for C-family languages</span></div><div class="line">$ <span class="built_in">cd</span> ~/.vim/bundle/YouCompleteMe</div><div class="line">$ ./install.py --clang-completer</div></pre></td></tr></table></figure>
<p>After it completes, do BundleInstall again:</p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div></pre></td><td class="code"><pre><div class="line">$ vim</div><div class="line"></div><div class="line">:BundleInstall</div></pre></td></tr></table></figure>
<h3 id="ScreenShot"><a href="#ScreenShot" class="headerlink" title="ScreenShot"></a>ScreenShot</h3><p>For Scala<br><img src="/media/Screen%20Shot%202016-04-01%20at%2015.55.15.png" alt="Scala"></p>
<p>For Python<br><img src="/media/Screen%20Shot%202016-04-01%20at%2015.59.57.png" alt="Python"></p>
]]></content>
    </entry>
    
    <entry>
      <title><![CDATA[Hadoop Guide Chapter 3: The Hadoop Distributed Filesystem]]></title>
      <url>https://linbojin.github.io/2016/03/23/Hadoop-Guide-Chapter-3-HDFS/</url>
      <content type="html"><![CDATA[<p>— <strong><em>HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.</em></strong> —</p>
<h3 id="The-design-of-HDFS"><a href="#The-design-of-HDFS" class="headerlink" title="The design of HDFS"></a>The design of HDFS</h3><p>Filesystems that <strong>manage the storage across a network of machines</strong> are called <strong>distributed filesystems</strong>. One of the biggest challenges is making the distributed filesystem <strong>tolerate node failure without suffering data loss</strong>. Hadoop’s distributed filesystem is called <strong><a href="http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html" target="_blank" rel="external">HDFS</a></strong>, which stands for Hadoop Distributed Filesystem.<a id="more"></a> HDFS is a filesystem designed for <strong>storing very large files</strong> with <strong>streaming data access patterns</strong>, running on <strong>clusters of commodity hardware</strong>.</p>
<p>HDFS is built around the idea that the most efficient data processing pattern is a <strong>write-once, read-many-times pattern</strong>. Because the namenode holds filesystem metadata in memory, <strong>the limit to the number of files in a filesystem is governed by the amount of memory on the namenode</strong>. As a rule of thumb, each file, directory, and block takes about 150 bytes. So, for example, if you had one million files, each taking one block, you would need at least 300 MB of memory (one million file objects plus one million block objects, at about 150 bytes each).</p>
<p>Files in HDFS may be written to by a single writer. Writes are always made <strong>at the end of the file, in append-only fashion</strong>.</p>
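<p>The rule of thumb above can be turned into a small estimate. This is only a sketch: the 150-byte figure comes from the text, while the function name and its parameters are hypothetical and chosen for illustration.</p>

```python
# Rule of thumb from the text: each file, directory, and block
# costs roughly 150 bytes of namenode memory.
BYTES_PER_OBJECT = 150


def namenode_memory_bytes(num_files, blocks_per_file=1, num_dirs=0):
    """Rough estimate of the namenode memory needed for a namespace.

    Hypothetical helper: it simply counts files, blocks, and directories
    and multiplies by the per-object cost.
    """
    num_blocks = num_files * blocks_per_file
    return (num_files + num_blocks + num_dirs) * BYTES_PER_OBJECT


# One million files, each taking one block, as in the example above.
estimate = namenode_memory_bytes(1_000_000)
print(estimate / 1_000_000, "MB")
```

<p>The estimate grows with the number of objects, not with the number of bytes stored, which is why many small files put more pressure on the namenode than a few large ones.</p>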
<h3 id="HDFS-Concepts"><a href="#HDFS-Concepts" class="headerlink" title="HDFS Concepts"></a>HDFS Concepts</h3><h4 id="Blocks"><a href="#Blocks" class="headerlink" title="Blocks"></a>Blocks</h4><p>A disk has a <strong>block size</strong>, which is <strong>the minimum amount of data that it can read or write</strong>. Filesystems for a single disk build on this by dealing with data in blocks, which are <strong>an integral multiple of the disk block size</strong>. Filesystem blocks are typically a few <strong>kilobytes in size</strong>, whereas disk blocks are <strong>normally 512 bytes</strong>.</p>
<p>Like in a filesystem for a single disk, <strong>files in HDFS are broken into block-sized chunks</strong>, which are <strong>stored as independent units</strong>. <strong>Unlike a filesystem for a single disk, a file in HDFS that is smaller than a single block does not occupy a full block’s worth of underlying storage</strong>.</p>
<p>HDFS has a much larger unit: <strong>128 MB</strong> by default. Why is a block in HDFS so large? The reason is to <strong>minimize the cost of seeks</strong>. If the block is large enough, <strong>the time it takes to transfer the data from the disk can be significantly longer than the time to seek to the start of the block</strong>. This argument shouldn’t be taken too far, however. Map tasks in MapReduce normally <strong>operate on one block at a time</strong>, so if you have too few tasks (fewer than nodes in the cluster), your jobs will run slower than they could otherwise.</p>
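<p>The seek-cost argument can be made concrete with a quick calculation. The figures below (a 10 ms seek, a 100 MB/s transfer rate, and a target of keeping seek time at 1% of transfer time) are assumptions for illustration, and the helper name is hypothetical.</p>

```python
def min_block_size_mb(seek_ms, transfer_mb_per_s, seek_fraction):
    """Smallest block size (in MB) for which the seek time is at most
    seek_fraction of the time spent transferring the block."""
    seek_s = seek_ms / 1000.0
    # Transfer must take at least seek_s / seek_fraction seconds.
    transfer_s = seek_s / seek_fraction
    return transfer_s * transfer_mb_per_s


# Assumed figures: 10 ms seek, 100 MB/s transfer, seek time 1% of transfer time.
print(min_block_size_mb(10, 100, 0.01), "MB")
```

<p>Under these assumptions the block size comes out at around 100 MB, the same order of magnitude as the 128 MB default.</p>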
<p><strong>There’s nothing that requires the blocks from a file to be stored on the same disk, so they can take advantage of any of the disks in the cluster.</strong> In fact, it would be possible, if unusual, to store a single file on an HDFS cluster whose blocks filled all the disks in the cluster. </p>
<p><strong>Having a block abstraction for a distributed filesystem brings several benefits</strong>:</p>
<ul>
<li>A file can be larger than any single disk in the network.</li>
<li>Making the unit of abstraction a block rather than a file simplifies the storage subsystem: it only deals with blocks, and because blocks are a fixed size, storage management is simpler.</li>
<li>Furthermore, blocks fit well with replication for providing fault tolerance and availability.</li>
</ul>
<p>If <strong>a block becomes unavailable, a copy can be read from another location in a way that is transparent to the client</strong>. A block that is no longer available due to corruption or machine failure <strong>can be replicated from its alternative locations to other live machines to bring the replication factor back to the normal level</strong>. This ties in with <strong>Data Integrity</strong>, that is, guarding against corrupt data.</p>
<p>Some applications can <strong>choose to set a high replication factor for the blocks in a popular file to spread the read load on the cluster</strong>. </p>
<p>Command to list the blocks that make up each file in HDFS:</p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">$ hdfs fsck &lt;path&gt; -files -blocks</div></pre></td></tr></table></figure>
<h4 id="Namenodes-and-Datanodes"><a href="#Namenodes-and-Datanodes" class="headerlink" title="Namenodes and Datanodes"></a>Namenodes and Datanodes</h4><p>An HDFS cluster has two types of nodes: <strong>a namenode and a number of datanodes</strong>.</p>
<p>The namenode <strong>manages the filesystem namespace</strong>. It maintains the <strong>filesystem tree and the metadata</strong> for all the files and directories. This information is stored persistently on the local disk in the form of two files: <strong>the namespace image and the edit log</strong>. The namenode also <strong>knows the datanodes on which all the blocks for a given file are located</strong>; however, it <strong>does not store block locations</strong> persistently, because this information is reconstructed from the datanodes when the system starts. The block mappings are stored in the namenode’s memory, not on disk.</p>
<p>Datanodes are the workhorses of the filesystem. They will <strong>report back to the namenode periodically with lists of blocks that they are storing</strong>.</p>
<p>If the namenode failed, <strong>all the files on the filesystem would be lost since there would be no way of knowing how to reconstruct the files from the blocks</strong> on the datanodes.</p>
<p>Hadoop provides two mechanisms to make the namenode resilient to failure, plus a high-availability option:</p>
<ul>
<li><strong>Back up the files</strong> that make up the persistent state of the filesystem metadata. Hadoop can <strong>be configured so that the namenode writes its persistent state to multiple filesystems</strong> (local disk or remote NFS mount). These writes are synchronous and atomic.</li>
<li>Run a <strong>secondary namenode</strong>, which, despite its name, <strong>does not act as a namenode</strong>. <strong>Its main role is to periodically merge the namespace image with the edit log to prevent the edit log from becoming too large.</strong> The secondary namenode usually <strong>runs on a separate physical machine</strong> because it requires plenty of CPU and as much memory as the namenode to perform the merge. It <strong>keeps a copy of the merged namespace image</strong>, which can be used in the event of the namenode failing. However, <strong>the state of the secondary namenode lags that of the primary</strong>, so in the event of total failure of the primary, <strong>data loss is almost certain</strong>. The usual course of action in this case is to copy the namenode’s metadata files that are on NFS to the secondary and run it as the new primary. </li>
<li>For <strong>High Availability</strong>, it is possible to run a <strong>hot standby namenode</strong> instead of a secondary.</li>
</ul>
<h4 id="Block-Caching"><a href="#Block-Caching" class="headerlink" title="Block Caching"></a>Block Caching</h4><p>Normally a datanode reads blocks from disk, but for <strong>frequently accessed files</strong> the blocks may be <strong>explicitly cached in the datanode’s memory</strong>, in an <strong>off-heap block cache</strong>.</p>
<h3 id="The-Command-Line-Interface"><a href="#The-Command-Line-Interface" class="headerlink" title="The Command-Line Interface"></a>The Command-Line Interface</h3><p>By default, HDFS keeps <strong>three replicas</strong> of each filesystem block. When running with a single datanode, HDFS can’t replicate blocks to three datanodes, so it would <strong>perpetually warn about blocks being under-replicated</strong>.<br>Let’s create a directory first just to see how it is displayed in the listing: </p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div></pre></td><td class="code"><pre><div class="line">$ hdfs dfs -mkdir books</div><div class="line">$ hdfs dfs -ls .</div><div class="line"></div><div class="line">output</div><div class="line">(file mode) (replication)</div><div class="line">drwxr-xr-x   - root supergroup    0 2016-03-16 13:22 books</div><div class="line">-rw-r--r--   1 root supergroup  119 2016-03-16 13:21 test.txt</div></pre></td></tr></table></figure>
<p>The entry in replication column is empty for directories because the concept of replication does not apply to them — <strong>directories are treated as metadata and stored by the namenode, not the datanodes</strong>. </p>
<p>File Permissions in HDFS<br>There are three types of permission in HDFS:</p>
<ul>
<li>The read permission (r) is required to read files or list the contents of a directory. </li>
<li>The write permission (w) is required to write a file or, for a directory, to create or delete files or directories in it. </li>
<li>The execute permission (x) is ignored for a file because you can’t execute a file on HDFS (unlike POSIX), and for a directory this permission is required to access its children. </li>
</ul>
<p>Each file and directory has <strong>an owner, a group, and a mode</strong>. The mode (e.g. <code>drwxr-xr-x</code>) is made up of:</p>
<ul>
<li><code>d</code> for dir or <code>-</code> for files</li>
<li>the permissions for the user who is the owner</li>
<li>the permissions for the users who are members of the group</li>
<li>the permissions for users who are neither the owner nor members of the group. </li>
</ul>
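<p>The structure of the mode string described above can be sketched in a few lines. This is an illustrative helper only; <code>parse_mode</code> and <code>can</code> are hypothetical names and not part of any Hadoop API.</p>

```python
def parse_mode(mode):
    """Split a mode string such as 'drwxr-xr-x' into its four parts."""
    if len(mode) != 10:
        raise ValueError("expected a 10-character mode string")
    return {
        "is_dir": mode[0] == "d",  # 'd' for a directory, '-' for a file
        "owner": mode[1:4],        # permissions for the owning user
        "group": mode[4:7],        # permissions for members of the group
        "other": mode[7:10],       # permissions for everyone else
    }


def can(perms, flag):
    """Return True if a three-character triple such as 'r-x' grants flag."""
    return flag in perms


parts = parse_mode("drwxr-xr-x")
print(parts)
print(can(parts["group"], "w"))
```

<p>For the <code>books</code> directory in the listing above, the owner may read, write, and access children, while group members and other users may only list the directory and access its children.</p>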
<p>By default, Hadoop <strong>runs with security disabled</strong>, which means that a client’s identity is not authenticated. There is a concept of a superuser, which is the identity of the namenode process. Permissions checks are not performed for the superuser.</p>
<h3 id="Hadoop-Filesystems"><a href="#Hadoop-Filesystems" class="headerlink" title="Hadoop Filesystems"></a>Hadoop Filesystems</h3><p>Hadoop has an <strong>abstract notion of filesystems</strong>, of which HDFS is just one implementation. The Java abstract class <strong>org.apache.hadoop.fs.FileSystem</strong> represents the client interface to a filesystem in Hadoop, and there are several concrete implementations:<br><img src="/media/Screen%20Shot%202016-03-28%20at%2016.03.25.png" alt="Hadoop Filesystems"><br>Hadoop provides many interfaces to its filesystems, and it generally uses the <strong>URI scheme to pick the correct filesystem instance</strong> to communicate with:</p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div></pre></td><td class="code"><pre><div class="line">$ hadoop fs -ls file:///</div><div class="line">$ hadoop fs -ls hdfs://localhost:9000/</div></pre></td></tr></table></figure>
<h4 id="Interfaces"><a href="#Interfaces" class="headerlink" title="Interfaces"></a>Interfaces</h4><p>Hadoop is written in Java, so most Hadoop filesystem interactions are mediated through the Java API. The <strong>filesystem shell</strong>, for example, is a Java application that uses the Java <code>FileSystem</code> class to provide filesystem operations. Here are two other commonly used filesystem interfaces to HDFS:</p>
<ul>
<li><p>NFS<br>  It is possible to <strong>mount HDFS on a local client’s filesystem</strong> using Hadoop’s <strong>NFSv3 gateway</strong>. You can then <strong>use Unix utilities</strong> (such as <code>ls</code> and <code>cat</code>) to <strong>interact with the filesystem</strong>, upload files, and in general use POSIX libraries to access the filesystem from any programming language. <strong>Appending to a file works, but random modifications of a file do not, since HDFS can only write to the end of a file</strong>.</p>
</li>
<li><p>FUSE<br>  Filesystem in Userspace (FUSE) allows filesystems that are implemented in user space to be integrated as Unix filesystems. Hadoop’s Fuse-DFS contrib module allows HDFS (or any Hadoop filesystem) to be mounted as a standard local filesystem. Fuse-DFS is implemented in C using libhdfs as the interface to HDFS. At the time of writing, the Hadoop NFS gateway is the more robust solution for mounting HDFS, so it should be preferred over Fuse-DFS. </p>
</li>
</ul>
<h3 id="Data-Flow"><a href="#Data-Flow" class="headerlink" title="Data Flow"></a>Data Flow</h3><h4 id="File-Read"><a href="#File-Read" class="headerlink" title="File Read"></a>File Read</h4><p>The image shows the main sequence of events when reading a file from an HDFS cluster.<br><img src="/media/Screen%20Shot%202016-03-28%20at%2017.22.01.png" alt="Anatomy of a File Read"></p>
<h4 id="File-Write"><a href="#File-Write" class="headerlink" title="File Write"></a>File Write</h4><p>The image shows the main sequence of events when writing a file to an HDFS cluster.<br><img src="/media/Screen%20Shot%202016-03-28%20at%2017.22.14.png" alt="Anatomy of a File Write"></p>
<h3 id="Parallel-copying-across-clusters-with-distcp"><a href="#Parallel-copying-across-clusters-with-distcp" class="headerlink" title="Parallel copying across clusters with distcp"></a>Parallel copying across clusters with distcp</h3><p>The HDFS access patterns that we have seen so far focus on <strong>single-threaded access</strong>. It’s possible to act on a collection of files. Hadoop comes with a useful program called <code>distcp</code> for copying data to and from Hadoop filesystems <strong>in parallel</strong>:</p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div></pre></td><td class="code"><pre><div class="line">$ hadoop distcp file1 file2      <span class="comment"># same as hadoop fs -cp file1 file2</span></div><div class="line">$ hadoop distcp dir1 dir2        <span class="comment"># if dir2 exists, the new structure will be dir2/dir1 </span></div><div class="line">$ hadoop distcp -overwrite dir1 dir2    <span class="comment"># dir2 will be overwritten</span></div><div class="line">$ hadoop distcp -update dir1 dir2       <span class="comment"># synchronize changes into dir2</span></div></pre></td></tr></table></figure>
<p><code>distcp</code> is implemented <strong>as a MapReduce job</strong> where the work of copying is done <strong>by the maps that run in parallel across the cluster</strong>. Each file is copied by a single map, and distcp tries to give each map approximately the same amount of data by bucketing files into roughly equal allocations. By default, up to <strong>20 maps are used</strong>, but this can be changed by specifying the <code>-m</code> argument to distcp.<br>A very common use case for distcp is for <strong>transferring data between two HDFS clusters</strong>:</p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div></pre></td><td class="code"><pre><div class="line"><span class="comment"># -delete: delete any files or directories from the destination </span></div><div class="line"><span class="comment">#          that are not present in the source </span></div><div class="line"><span class="comment"># -p:      file status attributes like permissions, block size, </span></div><div class="line"><span class="comment">#          and replication are preserved </span></div><div class="line">$ hadoop distcp -update -delete -p hdfs://namenode1/foo hdfs://namenode2/foo</div></pre></td></tr></table></figure>
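<p>The idea of giving each map roughly the same amount of data, with every file copied by exactly one map, can be illustrated with a simple greedy bucketing sketch. This is not distcp’s actual implementation; the function name, the file names, and the largest-first strategy are all assumptions made for illustration.</p>

```python
import heapq


def bucket_files(file_sizes, num_maps):
    """Assign each file to exactly one bucket, keeping totals roughly equal.

    Greedy strategy (an assumption, not distcp's algorithm): take files
    from largest to smallest and always add to the lightest bucket so far.
    """
    heap = [(0, i) for i in range(num_maps)]  # (bytes assigned, bucket index)
    buckets = [[] for _ in range(num_maps)]
    by_size = sorted(file_sizes.items(), key=lambda kv: kv[1], reverse=True)
    for name, size in by_size:
        total, idx = heapq.heappop(heap)
        buckets[idx].append(name)
        heapq.heappush(heap, (total + size, idx))
    return buckets


# Hypothetical file sizes in MB, split across two maps.
sizes = {"a": 40, "b": 30, "c": 20, "d": 10}
print(bucket_files(sizes, 2))
```

<p>With these sizes each of the two buckets ends up with 50 MB, so neither map becomes a straggler.</p>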
<h4 id="Keeping-an-HDFS-Cluster-Balanced"><a href="#Keeping-an-HDFS-Cluster-Balanced" class="headerlink" title="Keeping an HDFS Cluster Balanced"></a>Keeping an HDFS Cluster Balanced</h4><p>When copying data into HDFS, it’s important to consider cluster balance. <strong>HDFS works best when the file blocks are evenly spread across the cluster</strong>, so you want to ensure that distcp doesn’t disrupt this. For example, if you specified <code>-m 1</code>, a single map would do the copy, with the result that the first replica of each block would reside on the node running the map (until the disk filled up). So it’s best to start by running distcp with the default of 20 maps per node. </p>
<p>However, you can also use the <strong>balancer tool</strong> (known as <strong><a href="http://www.cloudera.com/documentation/archive/cdh/4-x/4-7-1/CDH4-Installation-Guide/cdh4ig_balancer.html" target="_blank" rel="external">Balancer</a></strong>) to subsequently even out the block distribution across the cluster. </p>
]]></content>
    </entry>
    
    <entry>
      <title><![CDATA[Hadoop Guide Chapter 1: Meet Hadoop]]></title>
      <url>https://linbojin.github.io/2016/03/22/Hadoop-Guide-Chapter-1-Meet-Hadoop/</url>
      <content type="html"><![CDATA[<p>—— <strong><em>More data usually beats better algorithms.</em></strong> ——</p>
<h3 id="A-Brief-History-of-Apache-Hadoop"><a href="#A-Brief-History-of-Apache-Hadoop" class="headerlink" title="A Brief History of Apache Hadoop"></a>A Brief History of Apache Hadoop</h3><p><a href="http://hadoop.apache.org/" target="_blank" rel="external">Hadoop</a> was created by <a href="https://en.wikipedia.org/wiki/Doug_Cutting" target="_blank" rel="external">Doug Cutting</a>, the creator of <a href="http://lucene.apache.org/" target="_blank" rel="external">Apache Lucene</a>, the widely used text search library. Hadoop has its origins in <a href="http://nutch.apache.org/" target="_blank" rel="external">Apache Nutch</a>, an open source web search engine, itself a part of the Lucene project.<br><a id="more"></a><br>Nutch was started in 2002, but its architecture wouldn’t scale to the billions of pages on the Web. In 2003, Google published a paper that described the architecture of Google’s distributed filesystem, called GFS, a design that would solve the Nutch developers’ storage needs for the very large files generated as a part of the web crawl and indexing process. In 2004, Nutch’s developers set about writing an open source implementation, the Nutch Distributed Filesystem (NDFS). In 2004, Google published another paper that introduced MapReduce to the world. Early in 2005, the Nutch developers had a working MapReduce implementation in Nutch, and by the middle of that year all the major Nutch algorithms had been ported to run using MapReduce and NDFS. </p>
<p>NDFS and the MapReduce implementation in Nutch were applicable beyond the realm of search, and in <strong>February 2006</strong> they moved out of Nutch to form an independent subproject of Lucene called <strong>Hadoop</strong>. At around the same time, Doug Cutting joined Yahoo!, which provided a dedicated team and the resources to turn Hadoop into a system that ran at web scale. This was demonstrated in February 2008 when Yahoo! announced that its production search index was being generated by a 10,000-core Hadoop cluster. In <strong>January 2008</strong>, Hadoop was made its <strong>own top-level project at Apache</strong>, confirming its success and its diverse, active community. </p>
<p>Today, Hadoop is widely used in mainstream enterprises. Hadoop’s role as a general purpose storage and analysis platform for big data has been recognized by the industry. Commercial Hadoop support is available from large, established enterprise vendors, including EMC, IBM, Microsoft, and Oracle, as well as from <strong>specialist Hadoop companies such as Cloudera, Hortonworks, and MapR</strong>. </p>
<h3 id="Apache-Hadoop-Ecosystem"><a href="#Apache-Hadoop-Ecosystem" class="headerlink" title="Apache Hadoop Ecosystem"></a>Apache Hadoop Ecosystem</h3><p><img src="/media/14589963307830.jpg" alt="Apache Hadoop Ecosystem"></p>
<h3 id="Data-Storage-and-Analysis"><a href="#Data-Storage-and-Analysis" class="headerlink" title="Data Storage and Analysis"></a>Data Storage and Analysis</h3><p><strong>Although the storage capacities of hard drives have increased massively over the years, access speeds (the rate at which data can be read from drives) have not kept up.</strong> One typical drive from 1990 could store 1370 MB of data and had a transfer speed of 4.4 MB/s, so you could read all the data from a full drive in around five minutes. Over 20 years later, 1-terabyte drives are the norm, but the transfer speed is around 100 MB/s, so it takes more than two and a half hours to read all the data off the disk.</p>
<p>MapReduce provides <strong>a programming model that abstracts the problem</strong> from disk reads and writes, transforming it into a <strong>computation over sets of keys and values</strong>.</p>
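<p>The two read times quoted above follow directly from capacity divided by transfer rate. A minimal check, assuming 1 terabyte is taken as 1,000,000 MB (the helper name is hypothetical):</p>

```python
def full_read_seconds(capacity_mb, transfer_mb_per_s):
    """Time in seconds to read an entire drive at a constant transfer rate."""
    return capacity_mb / transfer_mb_per_s


# Drive from 1990: 1370 MB at 4.4 MB/s.
old_drive = full_read_seconds(1370, 4.4)
# Later drive: 1 TB (taken as 1,000,000 MB) at 100 MB/s.
new_drive = full_read_seconds(1_000_000, 100)

print(round(old_drive / 60, 1), "minutes")
print(round(new_drive / 3600, 2), "hours")
```

<p>Capacity in this example grew by a factor of about 730, while the transfer rate grew by a factor of only about 23, which is why reading the whole drive takes so much longer.</p>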
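<p>To make “a computation over sets of keys and values” concrete, here is a minimal in-memory word count written in the MapReduce style. It is a single-process sketch of the programming model only, not Hadoop’s API; all function names are hypothetical.</p>

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in the input line."""
    for word in line.split():
        yield word.lower(), 1


def reduce_phase(key, values):
    """Reduce: combine all values emitted for one key."""
    return key, sum(values)


def run_job(lines):
    """Run map, group the values by key (the shuffle), then reduce."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_phase(line):
            grouped[key].append(value)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


print(run_job(["a b a", "b c"]))
```

<p>The programmer writes only the map and reduce functions; in a real cluster the framework handles reading the input, grouping by key, and distributing the work.</p>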
<p>MapReduce is a <strong>batch query</strong> processor, and the ability to run an ad hoc query against your whole dataset and get the results in a reasonable time is transformative. MapReduce is fundamentally a batch processing system, and is <strong>not suitable for interactive analysis</strong>.</p>
<h3 id="Beyond-Batch"><a href="#Beyond-Batch" class="headerlink" title="Beyond Batch"></a>Beyond Batch</h3><p>The first component to provide <strong>online access was <a href="https://hbase.apache.org/" target="_blank" rel="external">HBase</a></strong>, a <strong>key-value</strong> store that uses HDFS for its underlying storage. HBase provides <strong>both online read/write access of individual rows and batch operations for reading and writing data in bulk</strong>, making it a good solution for building applications on. </p>
<p>The real enabler for <strong>new processing models</strong> in Hadoop was the introduction of <strong>YARN</strong> (which stands for <em>Yet Another Resource Negotiator</em>) in Hadoop 2. YARN is a <strong>cluster resource management system</strong>, which allows any distributed program (not just MapReduce) to run on data in a Hadoop cluster. Here are examples: </p>
<ul>
<li>Interactive SQL: achieve <strong>low-latency responses</strong> for SQL queries on large datasets (<a href="https://hive.apache.org/" target="_blank" rel="external">Hive</a>, Impala).</li>
<li>Interactive Analytics at Scale: <a href="http://druid.io/" target="_blank" rel="external">Druid</a></li>
<li>Iterative processing: machine learning, Spark</li>
<li>Streaming processing: Streaming systems like <strong>Storm</strong>, <strong>Spark Streaming</strong> make it possible to run <strong>realtime</strong>, distributed computations on <strong>unbounded streams</strong> of data and emit results to Hadoop storage or external systems.</li>
<li>Search: <a href="http://lucene.apache.org/solr/" target="_blank" rel="external">Solr</a></li>
</ul>
<h3 id="Comparison-with-Other-Systems"><a href="#Comparison-with-Other-Systems" class="headerlink" title="Comparison with Other Systems"></a>Comparison with Other Systems</h3><h4 id="Relational-Database-Management-Systems"><a href="#Relational-Database-Management-Systems" class="headerlink" title="Relational Database Management Systems"></a>Relational Database Management Systems</h4><p>Seek time is improving more slowly than transfer rate. Seeking is the process of moving the disk’s head to a particular place on the disk to read or write data. It characterizes the latency of a disk operation, whereas the transfer rate corresponds to a disk’s bandwidth. </p>
<p>If the data access pattern is dominated by seeks, it will take longer to read or write large portions of the dataset than streaming through it, which operates at the transfer rate. On the other hand, for updating a small proportion of records in a database, a traditional B-Tree (the data structure used in relational databases, which is limited by the rate at which it can perform seeks) works well. For updating the majority of a database, a B-Tree is less efficient than MapReduce, which uses Sort/Merge to rebuild the database. Differences between MapReduce and an RDBMS:</p>
<ul>
<li><strong>MapReduce is a good fit</strong> for problems that need to <strong>analyze the whole dataset in a batch fashion</strong>, particularly for ad hoc analysis. </li>
<li>An <strong>RDBMS</strong> is <strong>good for point queries or updates</strong>, where <strong>the dataset has been indexed to deliver low-latency retrieval and update times of a relatively small amount of data</strong>.</li>
<li><p>MapReduce suits applications where the data is <strong>written once and read many times</strong>, whereas a relational database is good for datasets that are <strong>continually updated</strong>.</p>
<p>  <img src="/media/Screen%20Shot%202016-03-26%20at%2019.36.24.png" alt="Screen Shot 2016-03-26 at 19.36.24"></p>
</li>
</ul>
<p>Another difference between Hadoop and an RDBMS is the amount of structure in the datasets on which they operate. There are three kinds of data:</p>
<ul>
<li><strong>Structured data</strong> is organized into <strong>entities that have a defined format</strong>, such as XML documents or database tables that conform to a particular predefined schema.</li>
<li><strong>Semi-structured data</strong> is looser: there may be a schema, but it may be used only as a guide to the structure of the data, for example JSON.</li>
<li><strong>Unstructured data</strong> does not have any particular internal structure, for example, image data.</li>
</ul>
<p><strong>Hadoop works well on unstructured or semi-structured data</strong> because it is designed to <strong>interpret the data at processing time (so-called schema-on-read)</strong>. This provides flexibility and <strong>avoids the costly data loading phase of an RDBMS</strong>, since in Hadoop it is just a file copy.</p>
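<p>The idea of schema-on-read can be sketched in a few lines of plain Python. This is only an illustration, not Hadoop code: the sample records, the field names (<code>station</code>, <code>year</code>, <code>temp</code>) and the helper functions are all hypothetical. The raw lines are stored untouched, and the structure is imposed only when a job reads them.</p>

```python
import json

# Raw records are stored as-is; "loading" them is just a copy, with no validation.
RAW_LINES = [
    '{"station": "A1", "year": 1950, "temp": 22}',
    '{"station": "B7", "year": 1950}',                             # missing field
    'not json at all',                                             # malformed record
    '{"station": "A1", "year": 1949, "temp": 111, "quality": 1}',  # extra field
]


def read_temperature(line):
    """Interpret one raw line at read time; return None if it does not fit."""
    try:
        record = json.loads(line)
    except ValueError:
        return None
    if not isinstance(record, dict):
        return None
    year, temp = record.get("year"), record.get("temp")
    if isinstance(year, int) and isinstance(temp, (int, float)):
        return year, temp
    return None


def max_temperature_by_year(lines):
    """Compute the maximum temperature per year, skipping unreadable records."""
    result = {}
    for line in lines:
        parsed = read_temperature(line)
        if parsed is None:
            continue  # this job's interpretation does not apply to the record
        year, temp = parsed
        result[year] = max(temp, result.get(year, temp))
    return result


print(max_temperature_by_year(RAW_LINES))
```

<p>A different job could read the same raw lines with a different interpretation (for example, using the <code>quality</code> field) without reloading anything.</p>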
<h4 id="High-performance-computing-or-Grid-Computing"><a href="#High-performance-computing-or-Grid-Computing" class="headerlink" title="High-performance computing or Grid Computing"></a>High-performance computing or Grid Computing</h4><p>The HPC approach is to <strong>distribute the work across a cluster</strong> of machines, which <strong>access a shared filesystem</strong> hosted by a storage area network (SAN). This <strong>works well for predominantly compute-intensive jobs</strong>, but it becomes <strong>a problem when nodes need to access larger data volumes</strong>: the <strong>network bandwidth</strong> becomes the bottleneck and <strong>compute nodes become idle</strong>.</p>
<p>Hadoop tries to <strong>co-locate the data</strong> with the compute nodes, so data access is fast because it is local. This feature, known as <strong>data locality</strong>, is <strong>at the heart of data processing in Hadoop and is the reason for its good performance</strong>.</p>
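<p>A toy sketch of locality-aware task placement is shown below. It is not how the real Hadoop scheduler is implemented; the block names, node names and the greedy strategy are assumptions chosen only to illustrate the preference for running a task where its data already lives.</p>

```python
# Hypothetical block placement: block id mapped to the nodes holding a replica.
BLOCK_LOCATIONS = {
    "block-0": {"node1", "node2"},
    "block-1": {"node2", "node3"},
    "block-2": {"node1", "node3"},
}


def assign_tasks(block_locations, free_nodes):
    """Greedily give each block's task to a free node holding a replica.

    Returns {block: (node, is_local)}. When no replica holder is free, the
    task falls back to any free node, which implies a remote read.
    """
    free = sorted(free_nodes)  # fixed order keeps the sketch deterministic
    assignment = {}
    for block in sorted(block_locations):
        if not free:
            break  # no capacity left; remaining blocks have to wait
        local = [node for node in free if node in block_locations[block]]
        node = local[0] if local else free[0]
        free.remove(node)
        assignment[block] = (node, bool(local))
    return assignment


print(assign_tasks(BLOCK_LOCATIONS, {"node1", "node2", "node3"}))
print(assign_tasks(BLOCK_LOCATIONS, {"node4"}))
```

<p>In the first call every task can run on a node that already stores its block; in the second, the only free node holds no replica, so the data would have to travel over the network.</p>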
<p>Distributed processing frameworks like MapReduce <strong>spare the programmer from having to think about failure</strong>, since the implementation <strong>detects failed tasks and reschedules replacements</strong> on machines that are healthy. MapReduce is able to do this because it is a <strong>shared-nothing architecture</strong>, meaning that tasks have no dependence on one another. By contrast, <strong>Message Passing Interface (MPI) programs have to explicitly manage their own checkpointing and recovery</strong>, which gives more control to the programmer but makes them more difficult to write.</p>
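<p>The rescheduling idea can be sketched as follows. This is a simplified, single-process illustration rather than the MapReduce implementation: the worker names, the <code>flaky_square</code> task and the retry policy are hypothetical. Because the tasks share no state, rerunning a failed one on another worker is safe.</p>

```python
def run_with_retries(tasks, workers, run_task, max_attempts=3):
    """Run independent tasks, rescheduling a failed task on another worker.

    run_task(task, worker) returns a result or raises RuntimeError.
    """
    results = {}
    for index, task in enumerate(tasks):
        for attempt in range(max_attempts):
            # Rotate through the workers so a retry lands on a different machine.
            worker = workers[(index + attempt) % len(workers)]
            try:
                results[task] = run_task(task, worker)
                break
            except RuntimeError:
                continue  # the task failed on this worker; try the next one
        else:
            raise RuntimeError(f"task {task!r} failed {max_attempts} times")
    return results


def flaky_square(task, worker):
    """Hypothetical task: squares a number, but 'worker-b' is unhealthy."""
    if worker == "worker-b":
        raise RuntimeError("worker-b is down")
    return task * task


print(run_with_retries([1, 2, 3, 4], ["worker-a", "worker-b", "worker-c"], flaky_square))
```

<p>The caller never sees the failure on <code>worker-b</code>; the task assigned to it is simply rerun elsewhere. An MPI-style program would have to detect and recover from that failure itself.</p>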
]]></content>
    </entry>
    
    <entry>
      <title><![CDATA[Hadoop The Definitive Guide Reading Notes]]></title>
      <url>https://linbojin.github.io/2016/03/22/Hadoop-The-Definitive-Guide-Reading-Notes/</url>
      <content type="html"><![CDATA[<p><strong>Hadoop: The Definitive Guide, Fourth Edition</strong>: <a href="http://shop.oreilly.com/product/0636920033448.do" target="_blank" rel="external">http://shop.oreilly.com/product/0636920033448.do</a><br><strong>Code and Data</strong>: <a href="http://hadoopbook.com/code.html" target="_blank" rel="external">http://hadoopbook.com/code.html</a><br>Download <strong>ncdc weather dataset</strong>: <a href="https://gist.github.com/rehevkor5/2e407950ca687b36fc54" target="_blank" rel="external">https://gist.github.com/rehevkor5/2e407950ca687b36fc54</a><a id="more"></a><br>Building and Running:<br>    <figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div></pre></td><td class="code"><pre><div class="line">$ git <span class="built_in">clone</span> https://github.com/tomwhite/hadoop-book.git</div><div class="line">$ <span class="built_in">cd</span> hadoop-book</div><div class="line"></div><div class="line"><span class="comment"># do a full build and create example JAR files in the top-level directory</span></div><div class="line">$ mvn package -DskipTests</div><div class="line"></div><div class="line"><span class="comment"># run an example</span></div><div class="line">$ <span class="built_in">export</span> HADOOP_CLASSPATH=hadoop-examples.jar</div><div class="line">$ hadoop MaxTemperature /hadoop-book/ncdc/ output1</div></pre></td></tr></table></figure></p>
<h2 id="Reading-Notes"><a href="#Reading-Notes" class="headerlink" title="Reading Notes"></a>Reading Notes</h2><ul>
<li><a href="https://linbojin.github.io/2016/03/22/Hadoop-Guide-Chapter-1-Meet-Hadoop/">Chapter 1: Meet Hadoop</a></li>
<li><a href="https://linbojin.github.io/2016/03/23/Hadoop-Guide-Chapter-3-HDFS/">Chapter 3: HDFS</a></li>
<li><a href="https://linbojin.github.io/2016/04/03/Hadoop-Guide-Chapter-10-Setting-Up-a-Hadoop-Cluster/">Chapter 10: Setting Up a Hadoop Cluster</a></li>
<li><a href="https://linbojin.github.io/2016/04/03/Hadoop-Guide-Chapter-11-Administering-Hadoop/">Chapter 11: Administering Hadoop</a></li>
</ul>
<hr>
<h4 id="Skipped-sections"><a href="#Skipped-sections" class="headerlink" title="Skipped sections"></a>Skipped sections</h4><ul>
<li>Chapter 2 MapReduce</li>
<li>Chapter 3 HDFS <ul>
<li>HDFS Federation</li>
<li>HDFS High Availability</li>
<li>The Java Interface </li>
<li>Data Flow, Coherency Model </li>
</ul>
</li>
</ul>
]]></content>
    </entry>
    
    <entry>
      <title><![CDATA[Practical Vim: Dot Formula]]></title>
      <url>https://linbojin.github.io/2016/03/16/Practical-Vim-Dot-Formula/</url>
      <content type="html"><![CDATA[<p>—— <strong><em>Dot Formula: One keystroke to move and one keystroke to execute.</em></strong> ——</p>
<p>Dot Command: repeat the last change</p>
]]></content>
    </entry>
    
    <entry>
      <title><![CDATA[Install and Manage Node Versions with NVM]]></title>
      <url>https://linbojin.github.io/2016/03/07/Install-and-Manage-Node-Versions-with-NVM/</url>
      <content type="html"><![CDATA[<p>It’s very easy to install and manage multiple active Node.js versions with <a href="https://github.com/creationix/nvm" target="_blank" rel="external">Node Version Manager (NVM)</a>.</p>
<h3 id="Install-or-update-nvm"><a href="#Install-or-update-nvm" class="headerlink" title="Install or update nvm"></a>Install or update nvm</h3><p>First, make sure your system has a C++ compiler; on OS X, Xcode will work. Then install or update nvm with the following command:</p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div></pre></td><td class="code"><pre><div class="line"><span class="comment"># The script clones the nvm repository to ~/.nvm and adds the source line to your profile (~/.bash_profile, ~/.zshrc or ~/.profile).</span></div><div class="line"></div><div class="line">$ curl -o- https://raw.githubusercontent.com/creationix/nvm/v0.31.0/install.sh | bash</div></pre></td></tr></table></figure>
<a id="more"></a>
<h3 id="Manage-node-versions"><a href="#Manage-node-versions" class="headerlink" title="Manage node versions"></a>Manage node versions</h3><p>To download, compile, and install the latest v5.7.x release of node, do this:</p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">$ nvm install 5.7</div></pre></td></tr></table></figure>
<p>And then in any new shell just use the installed version:</p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">$ nvm use 5.7</div></pre></td></tr></table></figure>
<p>You can also get the path to the installed executable:</p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">$ nvm <span class="built_in">which</span> 5.7</div></pre></td></tr></table></figure>
<p>You can see what versions are installed:</p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">$  nvm ls</div></pre></td></tr></table></figure>
]]></content>
    </entry>
    
    <entry>
      <title><![CDATA[Getting Started with Vim by Vimtutor]]></title>
      <url>https://linbojin.github.io/2016/03/04/Getting-Started-with-Vim-by-Vimtutor/</url>
      <content type="html"><![CDATA[<p>———————- <strong><em>Vim: The God of Editors</em></strong> ———————-</p>
<p>This is a simple <a href="http://www.vim.org/" target="_blank" rel="external">Vim</a> tutorial based on the built-in Vim documentation. You can run the whole <strong>vimtutor</strong> by typing <code>vimtutor</code> in a shell, or <code>vimtutor -g</code> for the GUI version. It is intended to give a brief overview of the Vim editor, just enough to allow you to use the editor fairly easily.</p>
<h4 id="Lesson-1-Text-Editing-Commands"><a href="#Lesson-1-Text-Editing-Commands" class="headerlink" title="Lesson 1: Text Editing Commands"></a>Lesson 1: Text Editing Commands</h4><figure class="highlight vim"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div></pre></td><td class="code"><pre><div class="line"><span class="number">1</span>. The <span class="built_in">cursor</span> <span class="keyword">is</span> moved using either the arrow <span class="built_in">keys</span> <span class="built_in">or</span> the hjkl key<span class="variable">s:</span></div><div class="line">   	h (<span class="keyword">left</span>)       <span class="keyword">j</span> (down)       <span class="keyword">k</span> (<span class="keyword">up</span>)       <span class="keyword">l</span> (<span class="keyword">right</span>)</div><div class="line"><span class="number">2</span>. To start Vim from the <span class="keyword">shell</span> prompt <span class="built_in">type</span>:  <span class="keyword">vim</span> FILENAME <span class="symbol">&lt;ENTER&gt;</span></div><div class="line"><span class="number">3</span>. To <span class="keyword">exit</span> Vim <span class="built_in">type</span>:   <span class="symbol">&lt;ESC&gt;</span>  :q!  
<span class="symbol">&lt;ENTER&gt;</span>   <span class="keyword">to</span> trash <span class="keyword">all</span> <span class="keyword">changes</span>.</div><div class="line">            OR <span class="built_in">type</span>:   <span class="symbol">&lt;ESC&gt;</span>  :<span class="keyword">wq</span>  <span class="symbol">&lt;ENTER&gt;</span>   <span class="keyword">to</span> save the <span class="keyword">changes</span>.</div><div class="line">            OR <span class="built_in">type</span>:   <span class="symbol">&lt;ESC&gt;</span>  shift + zz     <span class="keyword">to</span> save the <span class="keyword">changes</span></div><div class="line"><span class="number">4</span>. To <span class="keyword">delete</span> the character at the <span class="built_in">cursor</span> <span class="built_in">type</span>:  <span class="keyword">x</span></div><div class="line"><span class="number">5</span>. To <span class="keyword">insert</span> <span class="built_in">or</span> <span class="keyword">append</span> text <span class="built_in">type</span>:</div><div class="line">         i   <span class="built_in">type</span> inserted text   <span class="symbol">&lt;ESC&gt;</span>      <span class="keyword">insert</span> before the <span class="built_in">cursor</span></div><div class="line">         A   <span class="built_in">type</span> appended text   <span class="symbol">&lt;ESC&gt;</span>      <span class="keyword">append</span> after the <span class="built_in">line</span></div></pre></td></tr></table></figure>
<a id="more"></a>
<h4 id="Lesson-2-Deletion-Commands"><a href="#Lesson-2-Deletion-Commands" class="headerlink" title="Lesson 2: Deletion Commands"></a>Lesson 2: Deletion Commands</h4><figure class="highlight sql"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div></pre></td><td class="code"><pre><div class="line">1. To <span class="keyword">delete</span> <span class="keyword">from</span> the <span class="keyword">cursor</span> up <span class="keyword">to</span> the <span class="keyword">next</span> word <span class="keyword">type</span>:    dw</div><div class="line"><span class="number">2.</span> <span class="keyword">To</span> <span class="keyword">delete</span> <span class="keyword">from</span> the <span class="keyword">cursor</span> <span class="keyword">to</span> the <span class="keyword">end</span> <span class="keyword">of</span> a line <span class="keyword">type</span>:   d$</div><div class="line"><span class="number">3.</span> <span class="keyword">To</span> <span class="keyword">delete</span> a whole line <span class="keyword">type</span>:    dd</div><div class="line"><span class="number">4.</span> <span class="keyword">To</span> <span class="keyword">repeat</span> a motion prepend it <span class="keyword">with</span> a <span class="built_in">number</span>:   <span class="number">2</span>w</div><div class="line"><span class="number">5.</span> The <span class="keyword">format</span> <span class="keyword">for</span> a <span class="keyword">change</span> command <span class="keyword">is</span>:  d2w / <span class="number">4</span>dd</div><div class="line">     <span 
class="keyword">operator</span>   [<span class="built_in">number</span>]   motion</div><div class="line">       <span class="keyword">where</span>:</div><div class="line">         <span class="keyword">operator</span> - <span class="keyword">is</span> what <span class="keyword">to</span> <span class="keyword">do</span>, such <span class="keyword">as</span>  d  <span class="keyword">for</span> <span class="keyword">delete</span></div><div class="line">         [<span class="built_in">number</span>] - <span class="keyword">is</span> an optional <span class="keyword">count</span> <span class="keyword">to</span> <span class="keyword">repeat</span> the motion</div><div class="line">         motion   - moves <span class="keyword">over</span> the <span class="built_in">text</span> <span class="keyword">to</span> operate <span class="keyword">on</span>, such <span class="keyword">as</span>  w, e, $, etc.</div><div class="line">   <span class="keyword">For</span> example: d2w: <span class="keyword">delete</span> <span class="number">2</span> words</div><div class="line">                d4d: <span class="keyword">delete</span> <span class="number">4</span> <span class="keyword">lines</span></div><div class="line"><span class="number">6.</span> <span class="keyword">To</span> <span class="keyword">move</span> <span class="keyword">to</span> the <span class="keyword">start</span> <span class="keyword">of</span> the line <span class="keyword">use</span> a zero:  <span class="number">0</span></div><div class="line"><span class="number">7.</span> <span class="keyword">To</span> <span class="keyword">undo</span> previous actions, <span class="keyword">type</span>:           u  (lowercase u)</div><div class="line"><span class="number">8.</span> <span class="keyword">To</span> <span class="keyword">undo</span> all the changes <span class="keyword">on</span> a line, <span class="keyword">type</span>:  U  (capital U)</div><div class="line"><span class="number">9.</span> <span 
class="keyword">To</span> <span class="keyword">undo</span> the <span class="keyword">undo</span><span class="string">'s, type:                 CTRL-R</span></div></pre></td></tr></table></figure>
<h4 id="Lesson-3-Replace-and-Change-Commands"><a href="#Lesson-3-Replace-and-Change-Commands" class="headerlink" title="Lesson 3: Replace and Change Commands"></a>Lesson 3: Replace and Change Commands</h4><figure class="highlight livecodeserver"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div></pre></td><td class="code"><pre><div class="line"><span class="number">1.</span> To <span class="built_in">put</span> back <span class="keyword">text</span> that has just been deleted, type   p .  </div><div class="line">  This puts <span class="keyword">the</span> deleted <span class="keyword">text</span> AFTER <span class="keyword">the</span> cursor (<span class="keyword">if</span> <span class="keyword">a</span> <span class="built_in">line</span> was deleted <span class="keyword">it</span> will go <span class="keyword">on</span> <span class="title">the</span> <span class="title">line</span> <span class="title">below</span> <span class="title">the</span> <span class="title">cursor</span>).</div><div class="line"><span class="number">2.</span> To <span class="built_in">replace</span> <span class="keyword">the</span> <span class="keyword">character</span> under <span class="keyword">the</span> cursor, type   r   <span class="keyword">and</span> <span class="keyword">then</span> <span class="keyword">the</span> <span class="built_in">new</span> <span class="keyword">character</span>. 
eg:</div><div class="line">    Type  <span class="number">3</span>rx <span class="built_in">to</span> <span class="built_in">replace</span> <span class="keyword">the</span> <span class="number">3</span> <span class="keyword">characters</span> <span class="keyword">by</span> <span class="string">'xxx'</span></div><div class="line"><span class="number">3.</span> The change operator allows you <span class="built_in">to</span> change <span class="built_in">from</span> <span class="keyword">the</span> cursor <span class="built_in">to</span> <span class="keyword">the</span> motion, eg:</div><div class="line">    Type  ce  <span class="built_in">to</span> change <span class="built_in">from</span> <span class="keyword">the</span> cursor <span class="built_in">to</span> <span class="keyword">the</span> <span class="function"><span class="keyword">end</span> <span class="title">of</span> <span class="title">the</span> <span class="title">word</span>, </span></div><div class="line">          c$  <span class="built_in">to</span> change <span class="built_in">to</span> <span class="keyword">the</span> <span class="function"><span class="keyword">end</span> <span class="title">of</span> <span class="title">a</span> <span class="title">line</span>.</span></div><div class="line"><span class="number">4.</span> The <span class="built_in">format</span> <span class="keyword">for</span> change is:</div><div class="line">    c   [<span class="built_in">number</span>]   motion</div></pre></td></tr></table></figure>
<h4 id="Lesson-4-Jump-Search-Substitute-Commands"><a href="#Lesson-4-Jump-Search-Substitute-Commands" class="headerlink" title="Lesson 4: Jump, Search, Substitute Commands"></a>Lesson 4: Jump, Search, Substitute Commands</h4><figure class="highlight livecodeserver"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div></pre></td><td class="code"><pre><div class="line"><span class="number">1.</span> CTRL-G      displays your location <span class="keyword">in</span> <span class="keyword">the</span> <span class="built_in">file</span> <span class="keyword">and</span> <span class="keyword">the</span> <span class="built_in">file</span> status</div><div class="line">   G           moves <span class="built_in">to</span> <span class="keyword">the</span> <span class="function"><span class="keyword">end</span> <span class="title">of</span> <span class="title">the</span> <span class="title">file</span>.</span></div><div class="line">   <span class="built_in">number</span> G    moves <span class="built_in">to</span> that <span class="built_in">line</span> <span class="built_in">number</span>.</div><div class="line">   gg          moves <span class="built_in">to</span> <span class="keyword">the</span> <span class="keyword">first</span> <span class="built_in">line</span>.</div><div class="line"></div><div class="line"><span class="number">2.</span> Typing  /  followed <span class="keyword">by</span> <span class="keyword">a</span> phrase searches FORWARD <span class="keyword">for</span> <span class="keyword">the</span> phrase.</div><div 
class="line">   Typing  ?  followed <span class="keyword">by</span> <span class="keyword">a</span> phrase searches BACKWARD <span class="keyword">for</span> <span class="keyword">the</span> phrase.</div><div class="line">   After <span class="keyword">a</span> search type  n  <span class="built_in">to</span> find <span class="keyword">the</span> next occurrence <span class="keyword">in</span> <span class="keyword">the</span> same direction</div><div class="line">                    <span class="keyword">or</span>  N  <span class="built_in">to</span> search <span class="keyword">in</span> <span class="keyword">the</span> opposite direction.</div><div class="line">   CTRL-O takes you back <span class="built_in">to</span> older positions, CTRL-I <span class="built_in">to</span> newer positions.</div><div class="line"></div><div class="line"><span class="number">3.</span> Typing  %  <span class="keyword">while</span> <span class="keyword">the</span> cursor is <span class="keyword">on</span> <span class="title">a</span> (,),[,],&#123;, <span class="title">or</span> &#125; <span class="title">goes</span> <span class="title">to</span> <span class="title">its</span> <span class="title">match</span>.</div><div class="line"></div><div class="line"><span class="number">4.</span> To substitute <span class="built_in">new</span> <span class="keyword">for</span> <span class="keyword">the</span> <span class="keyword">first</span> old <span class="keyword">in</span> <span class="keyword">a</span> <span class="built_in">line</span> type    :s/old/<span class="built_in">new</span></div><div class="line">   To substitute <span class="built_in">new</span> <span class="keyword">for</span> all <span class="string">'old'</span>s <span class="keyword">on</span> <span class="title">a</span> <span class="title">line</span> <span class="title">type</span>       :<span class="title">s</span>/<span class="title">old</span>/<span class="title">new</span>/<span class="title">g</span></div><div 
class="line">   To substitute phrases between <span class="literal">two</span> <span class="built_in">line</span> <span class="comment">#'s type       :#,#s/old/new/g</span></div><div class="line">   To substitute all occurrences <span class="keyword">in</span> <span class="keyword">the</span> <span class="built_in">file</span> type        :%s/old/<span class="built_in">new</span>/g</div><div class="line">   To ask <span class="keyword">for</span> confirmation <span class="keyword">each</span> <span class="built_in">time</span> <span class="built_in">add</span> <span class="string">'c'</span>             :%s/old/<span class="built_in">new</span>/gc</div></pre></td></tr></table></figure>
<h4 id="Lesson-5-Execute-External-Commands"><a href="#Lesson-5-Execute-External-Commands" class="headerlink" title="Lesson 5: Execute External Commands"></a>Lesson 5: Execute External Commands</h4><figure class="highlight livecodeserver"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div></pre></td><td class="code"><pre><div class="line"><span class="number">1.</span> :! <span class="keyword">command</span>  <span class="title">executes</span> <span class="title">an</span> <span class="title">external</span> <span class="title">command</span>.</div><div class="line">       :!ls            -  shows <span class="keyword">a</span> <span class="built_in">directory</span> listing.</div><div class="line">       :!rm FILENAME   -  removes <span class="built_in">file</span> FILENAME.</div><div class="line"><span class="number">2.</span> :w fname   writes <span class="keyword">the</span> current Vim <span class="built_in">file</span> <span class="built_in">to</span> disk <span class="keyword">with</span> name FILENAME.</div><div class="line"><span class="number">3.</span> v motion :w FILENAME  saves <span class="keyword">the</span> Visually selected <span class="keyword">lines</span> <span class="keyword">in</span> <span class="built_in">file</span> FILENAME.</div><div class="line"><span class="number">4.</span> :r fname   retrieves disk <span class="built_in">file</span> FILENAME <span class="keyword">and</span> puts <span class="keyword">it</span> below cursor position.</div><div class="line"><span class="number">5.</span> :r !ls     reads <span class="keyword">the</span> output <span class="keyword">of</span> ls <span class="keyword">command</span> <span class="title">and</span> <span class="title">puts</span> <span class="title">it</span> <span class="title">below</span> <span class="title">cursor</span> <span 
class="title">position</span>.</div></pre></td></tr></table></figure>
<h4 id="Lesson-6-Open-Append-Set-Commands"><a href="#Lesson-6-Open-Append-Set-Commands" class="headerlink" title="Lesson 6: Open Append Set Commands"></a>Lesson 6: Open Append Set Commands</h4><figure class="highlight sql"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div></pre></td><td class="code"><pre><div class="line">1. Type  o  to open a line BELOW the cursor and <span class="keyword">start</span> <span class="keyword">Insert</span> mode.</div><div class="line">   <span class="keyword">Type</span>  O  <span class="keyword">to</span> <span class="keyword">open</span> a line ABOVE the cursor.</div><div class="line"><span class="number">2.</span> <span class="keyword">Type</span>  a  <span class="keyword">to</span> <span class="keyword">insert</span> <span class="built_in">text</span> <span class="keyword">AFTER</span> the cursor.</div><div class="line">   <span class="keyword">Type</span>  A  <span class="keyword">to</span> <span class="keyword">insert</span> <span class="built_in">text</span> <span class="keyword">after</span> the <span class="keyword">end</span> <span class="keyword">of</span> the line.</div><div class="line"><span class="number">3.</span> The  e  command moves <span class="keyword">to</span> the <span class="keyword">end</span> <span class="keyword">of</span> a word.</div><div class="line"><span class="number">4.</span> The  y  <span class="keyword">operator</span> yanks (copies) <span class="built_in">text</span>,  p  puts (pastes) it.</div><div class="line"><span class="number">5.</span> Typing a capital  R  enters <span class="keyword">Replace</span> <span class="keyword">mode</span> <span class="keyword">until</span>  &lt;ESC&gt;  <span 
class="keyword">is</span> pressed.</div><div class="line"><span class="number">6.</span> Typing <span class="string">":set xxx"</span> <span class="keyword">sets</span> the <span class="keyword">option</span> <span class="string">"xxx"</span>.  <span class="keyword">Some</span> options <span class="keyword">are</span>:</div><div class="line">      <span class="string">'ic'</span> <span class="keyword">or</span> <span class="string">'ignorecase'</span>       <span class="keyword">ignore</span> <span class="keyword">upper</span>/<span class="keyword">lower</span> <span class="keyword">case</span> <span class="keyword">when</span> searching</div><div class="line">      <span class="string">'is'</span> <span class="keyword">or</span> <span class="string">'incsearch'</span>        <span class="keyword">show</span> <span class="keyword">partial</span> matches <span class="keyword">for</span> a <span class="keyword">search</span> phrase</div><div class="line">      <span class="string">'hls'</span> <span class="keyword">or</span> <span class="string">'hlsearch'</span>        highlight all matching phrases</div><div class="line"><span class="number">7.</span> Prepend <span class="string">"no"</span> <span class="keyword">to</span> <span class="keyword">switch</span> an <span class="keyword">option</span> <span class="keyword">off</span>:   :<span class="keyword">set</span> noic</div></pre></td></tr></table></figure>
<h4 id="Lesson-7-Getting-help"><a href="#Lesson-7-Getting-help" class="headerlink" title="Lesson 7: Getting help"></a>Lesson 7: Getting help</h4><figure class="highlight sql"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div></pre></td><td class="code"><pre><div class="line">1. Type  :<span class="keyword">help</span>          <span class="keyword">to</span> <span class="keyword">open</span> a <span class="keyword">help</span> window.</div><div class="line"><span class="number">2.</span> <span class="keyword">Type</span>  :<span class="keyword">help</span> cmd      <span class="keyword">to</span> find <span class="keyword">help</span> <span class="keyword">on</span>  cmd .</div><div class="line"><span class="number">3.</span> <span class="keyword">Type</span>  CTRL-W CTRL-W  <span class="keyword">to</span> jump <span class="keyword">to</span> another window</div><div class="line"><span class="number">4.</span> <span class="keyword">Type</span>  :q             <span class="keyword">to</span> <span class="keyword">close</span> the <span class="keyword">help</span> window</div><div class="line"><span class="number">5.</span> <span class="keyword">Create</span> a vimrc <span class="keyword">startup</span> script <span class="keyword">to</span> <span class="keyword">keep</span> your preferred settings.</div><div class="line">      :r $VIMRUNTIME/vimrc_example.vim</div><div class="line"><span class="number">6.</span> <span class="keyword">When</span> typing a  :  command, </div><div class="line">      press CTRL-D <span class="keyword">to</span> see possible completions.</div><div class="line">      Press &lt;TAB&gt; <span class="keyword">to</span> <span class="keyword">use</span> one completion.</div></pre></td></tr></table></figure>
<h4 id="Next-Step"><a href="#Next-Step" class="headerlink" title="Next Step"></a>Next Step</h4><p>These notes are far from complete, as Vim has many more commands. As a next step, you can read the built-in Vim user manual:</p>
<figure class="highlight sql"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div></pre></td><td class="code"><pre><div class="line">:<span class="keyword">help</span> <span class="keyword">user</span>-<span class="keyword">manual</span></div><div class="line">	Press  CTRL-]  <span class="keyword">to</span> jump <span class="keyword">to</span> a subject <span class="keyword">under</span> the cursor.</div><div class="line">	Press  CTRL-O  <span class="keyword">to</span> jump back (<span class="keyword">repeat</span> <span class="keyword">to</span> <span class="keyword">go</span> further back).</div></pre></td></tr></table></figure>
]]></content>
    </entry>
    
    <entry>
      <title><![CDATA[Hacking PySpark inside Jupyter Notebook]]></title>
      <url>https://linbojin.github.io/2016/01/27/Hacking-pyspark-in-Jupyter-Notebook/</url>
      <content type="html"><![CDATA[<p>Python is a wonderful programming language for data analytics. Normally, I prefer to write Python code inside <a href="http://jupyter.org/" target="_blank" rel="external">Jupyter Notebook</a> (previously known as <a href="http://ipython.org/" target="_blank" rel="external">IPython</a>), because it allows us to create and share documents that contain live code, equations, visualizations and explanatory text. <a href="http://spark.apache.org/" target="_blank" rel="external">Apache Spark</a> is a fast and general engine for large-scale data processing. <a href="http://spark.apache.org/docs/latest/api/python/pyspark.html" target="_blank" rel="external">PySpark</a> is the Python API for Spark. So <strong>writing PySpark code inside Jupyter</strong> is a good starting point if you are interested in data science:</p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">IPYTHON_OPTS=<span class="string">"notebook"</span> pyspark --master spark://localhost:7077 --executor-memory 7g</div></pre></td></tr></table></figure>
<p><img src="/media/Screen%20Shot%202016-01-27%20at%2021.44.39.png" alt="Hacking PySpark inside Jupyter Notebook"><br><a id="more"></a></p>
<h2 id="Install-Jupyter"><a href="#Install-Jupyter" class="headerlink" title="Install Jupyter"></a>Install Jupyter</h2><p>If you are a Python user, I <strong>highly recommend</strong> installing <a href="https://www.continuum.io" target="_blank" rel="external">Anaconda</a>. Anaconda conveniently installs Python, the Jupyter Notebook, and other commonly used packages for scientific computing and data science.<br>Go to <a href="https://www.continuum.io/downloads" target="_blank" rel="external">https://www.continuum.io/downloads</a> and follow the instructions for downloading and installing Anaconda (Jupyter is included):</p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div></pre></td><td class="code"><pre><div class="line">$ wget https://&#123;somewhere&#125;/Anaconda2-2.4.1-MacOSX-x86_64.sh</div><div class="line">$ bash Anaconda2-2.4.1-MacOSX-x86_64.sh</div><div class="line">$ python</div><div class="line">Python 2.7.11 |Anaconda 2.4.1 (x86_64)| (default, Dec  6 2015, 18:57:58)</div><div class="line">[GCC 4.2.1 (Apple Inc. build 5577)] on darwin</div><div class="line">Type <span class="string">"help"</span>, <span class="string">"copyright"</span>, <span class="string">"credits"</span> or <span class="string">"license"</span> <span class="keyword">for</span> more information.</div><div class="line">Anaconda is brought to you by Continuum Analytics.</div><div class="line">Please check out: http://continuum.io/thanks and https://anaconda.org</div><div class="line">&gt;&gt;&gt;</div></pre></td></tr></table></figure>
<p>You can easily run Jupyter notebook:</p>
<figure class="highlight delphi"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">$ jupyter notebook # Go <span class="keyword">to</span> http:<span class="comment">//localhost:8888</span></div></pre></td></tr></table></figure>
<p><img src="/media/Screen%20Shot%202016-01-27%20at%2022.08.51.png" alt="jupyter notebook"></p>
<h2 id="Install-Spark"><a href="#Install-Spark" class="headerlink" title="Install Spark"></a>Install Spark</h2><p>If you are not familiar with Spark, you can read the official Spark documentation:</p>
<ul>
<li><a href="http://spark.apache.org/docs/latest/" target="_blank" rel="external">Spark Overview</a></li>
<li><a href="http://spark.apache.org/docs/latest/programming-guide.html" target="_blank" rel="external">Spark Programming Guide</a></li>
</ul>
<p>Here are simple instructions for installing Spark:</p>
<figure class="highlight maxima"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div><div class="line">23</div></pre></td><td class="code"><pre><div class="line"># MacOS</div><div class="line">$ brew install apache-spark</div><div class="line"></div><div class="line"># Linux</div><div class="line">$ wget http://d3kbcqa49mib13.cloudfront.net/spark-<span class="number">1.6</span>.0-bin-hadoop2.6.tgz</div><div class="line">$ tar zxvf spark-<span class="number">1.6</span>.0-bin-hadoop2.6.tgz</div><div class="line">$ vim .bashrc</div><div class="line"> export PATH=/&#123;your_path&#125;/spark-<span class="number">1.6</span>.0-bin-hadoop2.6/sbin:$PATH</div><div class="line"> export PATH=/&#123;your_path&#125;/spark-<span class="number">1.6</span>.0-bin-hadoop2.6/bin:$PATH</div><div class="line">$ source .bashrc</div><div class="line"></div><div class="line"># Run PySpark shell </div><div class="line">$ pyspark</div><div class="line">Welcome to</div><div class="line">      ____              <span class="symbol">__</span></div><div class="line">     / <span class="symbol">__</span>/<span class="symbol">__</span>  ___ _____/ /<span class="symbol">__</span></div><div class="line">    <span class="symbol">_</span>\ \/ <span class="symbol">_</span> \/ <span class="symbol">_</span> `/ <span class="symbol">__</span>/  '<span class="symbol">_</span>/</div><div class="line">   /<span class="symbol">__</span> / .<span 
class="symbol">__</span>/\<span class="symbol">_</span>,<span class="symbol">_</span>/<span class="symbol">_</span>/ /<span class="symbol">_</span>/\<span class="symbol">_</span>\   version <span class="number">1.6</span>.0</div><div class="line">      /<span class="symbol">_</span>/</div><div class="line">	</div><div class="line">Using Python version <span class="number">2.7</span>.11 (default, Dec  <span class="number">6</span> <span class="number">2015</span> <span class="number">18</span>:<span class="number">08</span>:<span class="number">32</span>)</div><div class="line">SparkContext available as sc, HiveContext available as sqlContext.</div><div class="line">&gt;&gt;&gt;</div></pre></td></tr></table></figure>
<h2 id="Launch-PySpark-inside-IPython-jupyter"><a href="#Launch-PySpark-inside-IPython-jupyter" class="headerlink" title="Launch PySpark inside IPython(jupyter)"></a>Launch PySpark inside IPython(jupyter)</h2><p>Launch the PySpark shell in <strong>IPython</strong>:</p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div></pre></td><td class="code"><pre><div class="line">$ PYSPARK_DRIVER_PYTHON=ipython pyspark</div><div class="line">or </div><div class="line">$ IPYTHON=1 pyspark</div><div class="line"></div><div class="line"> Welcome to</div><div class="line">      ____              __</div><div class="line">     / __/__  ___ _____/ /__</div><div class="line">    _\ \/ _ \/ _ `/ __/  <span class="string">'_/</span></div><div class="line">   /__ / .__/\_,_/_/ /_/\_\   version 1.6.0</div><div class="line">      /_/</div><div class="line"></div><div class="line">Using Python version 2.7.11 (default, Dec  6 2015 18:08:32)</div><div class="line">SparkContext available as sc, HiveContext available as sqlContext.</div><div class="line"></div><div class="line">In [1]:</div></pre></td></tr></table></figure>
<p>Launch the PySpark shell in <strong>IPython Notebook</strong>, <a href="http://localhost:8888" target="_blank" rel="external">http://localhost:8888</a>:</p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div></pre></td><td class="code"><pre><div class="line">$ PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS=<span class="string">"notebook"</span> pyspark</div><div class="line">or </div><div class="line">$ IPYTHON_OPTS=<span class="string">"notebook"</span> pyspark</div><div class="line"></div><div class="line"><span class="comment"># You can also specify running memory </span></div><div class="line">$ IPYTHON_OPTS=<span class="string">"notebook"</span> pyspark --executor-memory 7g</div></pre></td></tr></table></figure>
<h2 id="Run-PySpark-on-a-cluster-inside-IPython-jupyter"><a href="#Run-PySpark-on-a-cluster-inside-IPython-jupyter" class="headerlink" title="Run PySpark on a cluster inside IPython(jupyter)"></a>Run PySpark on a cluster inside IPython(jupyter)</h2><p>This assumes you have deployed a Spark cluster in standalone mode and that the master IP is <code>localhost</code>.</p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div></pre></td><td class="code"><pre><div class="line">IPYTHON_OPTS=<span class="string">"notebook"</span> pyspark --master spark://localhost:7077 --executor-memory 7g</div><div class="line"></div><div class="line"><span class="comment"># you can add some python modules</span></div><div class="line">IPYTHON_OPTS=<span class="string">"notebook"</span> pyspark  \</div><div class="line">--master spark://localhost:7077  \</div><div class="line">--executor-memory 7g             \</div><div class="line">--py-files tensorflow-py2.7.egg</div></pre></td></tr></table></figure>
<p><img src="/media/Screen%20Shot%202016-01-27%20at%2023.24.46.png" alt="PySpark on cluster"></p>
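<p>The launch lines above all combine the same three pieces: driver environment variables, the <code>pyspark</code> command, and its options. Here is a minimal sketch of that assembly, assuming the <code>PYSPARK_DRIVER_PYTHON</code> form shown earlier (Spark 2.0 and later no longer accept <code>IPYTHON</code> or <code>IPYTHON_OPTS</code>). The helper names <code>build_pyspark_launch</code> and <code>as_shell_line</code> are made up for illustration, and nothing is actually started:</p>

```python
import shlex


def build_pyspark_launch(master, executor_memory, py_files=None):
    """Return (env, argv) for starting PySpark inside a notebook.

    Hypothetical helper: it only assembles the pieces, it does not run them.
    """
    env = {
        "PYSPARK_DRIVER_PYTHON": "ipython",
        "PYSPARK_DRIVER_PYTHON_OPTS": "notebook",
    }
    argv = ["pyspark", "--master", master, "--executor-memory", executor_memory]
    if py_files:
        # --py-files takes a comma-separated list of .zip, .egg or .py files
        argv += ["--py-files", ",".join(py_files)]
    return env, argv


def as_shell_line(env, argv):
    """Render the launch as a single shell line, quoting where needed."""
    assignments = [f"{key}={shlex.quote(value)}" for key, value in sorted(env.items())]
    return " ".join(assignments + [shlex.quote(arg) for arg in argv])


env, argv = build_pyspark_launch(
    "spark://localhost:7077", "7g", ["tensorflow-py2.7.egg"]
)
print(as_shell_line(env, argv))
```

<p>To really launch it from Python, one option would be to pass <code>argv</code> to <code>subprocess.call</code> with <code>env=dict(os.environ, **env)</code>; that part is left out here because it needs a working Spark installation.</p>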
]]></content>
    </entry>
    
    <entry>
      <title><![CDATA[NPM Playbook]]></title>
      <url>https://linbojin.github.io/2016/01/17/NPM-Playbook/</url>
      <content type="html"><![CDATA[<p><a href="https://www.npmjs.com/" target="_blank" rel="external">NPM</a> (node package manager) is a package management tool for <a href="https://nodejs.org/en/" target="_blank" rel="external">Node.js</a>.<br><a href="https://nodejs.org/en/" target="_blank" rel="external">Node.js</a> is an <a href="https://github.com/nodejs/node" target="_blank" rel="external">open source</a> JavaScript runtime built on <strong>Chrome’s V8 JavaScript engine</strong>. Node.js uses an <strong>event-driven</strong>, <strong>non-blocking I/O model</strong> that makes it lightweight and efficient. Note that Node.js is a <strong>server side runtime environment rather than a language</strong>. </p>
<h2 id="Initial-project"><a href="#Initial-project" class="headerlink" title="Initial project"></a>Initial project</h2><p><a href="https://docs.npmjs.com/files/package.json" target="_blank" rel="external"><strong>package.json</strong></a> is created first by <code>npm init</code>:</p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">$ npm init  <span class="comment"># create package.json</span></div></pre></td></tr></table></figure>
<a id="more"></a>
<ul>
<li><p>package.json     includes four main parts: <strong>basic module info</strong>, <strong>dependencies</strong>, <strong>devDependencies</strong>, <strong>scripts</strong>: </p>
  <figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div><div class="line">23</div><div class="line">24</div><div class="line">25</div><div class="line">26</div></pre></td><td class="code"><pre><div class="line">// package.json</div><div class="line">	</div><div class="line">&#123;</div><div class="line">  <span class="string">"name"</span>: <span class="string">"nodexpress"</span>,</div><div class="line">  <span class="string">"version"</span>: <span class="string">"1.0.0"</span>,</div><div class="line">  <span class="string">"description"</span>: <span class="string">""</span>,</div><div class="line">  <span class="string">"main"</span>: <span class="string">"app.js"</span>,</div><div class="line">  <span class="string">"scripts"</span>: &#123;</div><div class="line">    <span class="string">"test"</span>: <span class="string">"echo \"Error: no test specified\" &amp;&amp; exit 1"</span>,</div><div class="line">    <span class="string">"start"</span>: <span class="string">"node app.js"</span></div><div class="line">  &#125;,</div><div class="line">  <span class="string">"author"</span>: <span class="string">"tony"</span>,</div><div class="line">  <span class="string">"license"</span>: <span class="string">"ISC"</span>,</div><div class="line">  <span class="string">"dependencies"</span>: &#123;</div><div class="line">    <span class="string">"express"</span>: <span 
class="string">"^4.13.3"</span></div><div class="line">  &#125;,</div><div class="line">  <span class="string">"devDependencies"</span>: &#123;</div><div class="line">    <span class="string">"gulp"</span>: <span class="string">"^3.9.0"</span>,</div><div class="line">    <span class="string">"gulp-inject"</span>: <span class="string">"^1.5.0"</span>,</div><div class="line">    <span class="string">"gulp-jscs"</span>: <span class="string">"^2.0.0"</span>,</div><div class="line">    <span class="string">"gulp-jshint"</span>: <span class="string">"^1.11.2"</span>,</div><div class="line">    <span class="string">"gulp-nodemon"</span>: <span class="string">"^2.0.6"</span>,</div><div class="line">    <span class="string">"jshint-stylish"</span>: <span class="string">"^2.1.0"</span>,</div><div class="line">    <span class="string">"wiredep"</span>: <span class="string">"^3.0.0"</span></div><div class="line">  &#125;</div><div class="line">&#125;</div></pre></td></tr></table></figure>
</li>
</ul>
<ul>
<li><p>scripts inside <code>package.json</code> run shell commands</p>
  <figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div></pre></td><td class="code"><pre><div class="line">$ npm start     <span class="comment"># runs the "start" script (here: node app.js)</span></div><div class="line">$ npm run xxx   <span class="comment"># runs the custom script named xxx</span></div><div class="line"></div><div class="line">// package.json</div><div class="line"><span class="string">"scripts"</span>: &#123;</div><div class="line">    <span class="string">"test"</span>: <span class="string">"echo \"Error: no test specified\" &amp;&amp; exit 1"</span>,</div><div class="line">    <span class="string">"start"</span>: <span class="string">"node app.js"</span></div><div class="line">  &#125;,</div></pre></td></tr></table></figure>
</li>
</ul>
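<p>To make the mapping from command to script concrete, here is a small sketch that looks up what <code>npm start</code> or <code>npm run xxx</code> would execute. It only reads the <code>scripts</code> section and is not how npm itself is implemented; <code>script_name_for</code> and <code>resolve_script</code> are hypothetical names:</p>

```python
import json

# The "scripts" section from the package.json shown earlier.
PACKAGE_JSON = r"""
{
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1",
    "start": "node app.js"
  }
}
"""


def script_name_for(npm_args):
    """Map npm arguments to a script name.

    ["start"] and ["test"] are shortcuts; ["run", "xxx"] names the script directly.
    """
    if npm_args[0] == "run":
        return npm_args[1]
    return npm_args[0]


def resolve_script(package_json_text, name):
    """Return the shell command stored under scripts[name]."""
    scripts = json.loads(package_json_text).get("scripts", {})
    if name not in scripts:
        raise KeyError(f"missing script: {name}")
    return scripts[name]


print(resolve_script(PACKAGE_JSON, script_name_for(["start"])))        # node app.js
print(resolve_script(PACKAGE_JSON, script_name_for(["run", "test"])))
```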
<ul>
<li><p>versioning packages</p>
  <figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div></pre></td><td class="code"><pre><div class="line"><span class="string">"express"</span>: <span class="string">"^4.13.3"</span> =&gt; will install the latest <span class="string">"4.XX.X"</span> version</div><div class="line"><span class="string">"express"</span>: <span class="string">"~4.13.3"</span> =&gt; will install the latest <span class="string">"4.13.X"</span> version</div><div class="line"><span class="string">"express"</span>: <span class="string">"4.13.3"</span>  =&gt; will install exactly this version</div></pre></td></tr></table></figure>
</li>
</ul>
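<p>The caret and tilde rules above can be sketched as version ranges with an inclusive lower bound and an exclusive upper bound. This is a simplified illustration, assuming plain <code>MAJOR.MINOR.PATCH</code> versions only (no pre-release tags, no <code>x</code> wildcards); the function names and the list of available versions are made up for the example:</p>

```python
def parse(version):
    """Turn "4.13.3" into (4, 13, 3)."""
    return tuple(int(part) for part in version.split("."))


def bounds(spec):
    """Return (low, high): low is inclusive, high is exclusive."""
    if spec[0] == "^":
        low = parse(spec[1:])
        major, minor, patch = low
        # Caret keeps the left-most non-zero part fixed.
        if major != 0:
            high = (major + 1, 0, 0)
        elif minor != 0:
            high = (0, minor + 1, 0)
        else:
            high = (0, 0, patch + 1)
        return low, high
    if spec[0] == "~":
        # Tilde allows patch-level changes only.
        low = parse(spec[1:])
        return low, (low[0], low[1] + 1, 0)
    exact = parse(spec)
    return exact, exact[:2] + (exact[2] + 1,)


def satisfies(version, spec):
    low, high = bounds(spec)
    return low <= parse(version) < high


def best_match(available, spec):
    """Pick the highest available version that satisfies spec, or None."""
    matching = [v for v in available if satisfies(v, spec)]
    return max(matching, key=parse) if matching else None


# Illustrative list of published versions (not real registry data).
available = ["4.12.4", "4.13.3", "4.13.4", "4.16.0", "5.0.0"]
print(best_match(available, "^4.13.3"))  # 4.16.0
print(best_match(available, "~4.13.3"))  # 4.13.4
print(best_match(available, "4.13.3"))   # 4.13.3
```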
<ul>
<li><p>default settings for package.json</p>
  <figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div></pre></td><td class="code"><pre><div class="line">$ npm <span class="built_in">set</span> init-author-name <span class="string">'Tony'</span></div><div class="line">$ npm get init-author-name</div><div class="line">Tony</div><div class="line">$ npm config delete init-author-name</div><div class="line">$ cat ~/.npmrc   <span class="comment"># all the default settings will be saved inside this file.</span></div><div class="line">init-author-name=Tony</div></pre></td></tr></table></figure>
</li>
</ul>
<h2 id="Install-and-uninstall-packages"><a href="#Install-and-uninstall-packages" class="headerlink" title="Install and uninstall packages"></a>Install and uninstall packages</h2><p>Search nodejs packages: <a href="https://www.npmjs.com/" target="_blank" rel="external">https://www.npmjs.com/</a></p>
<ul>
<li><p>A project with <code>package.json</code> can install (update) modules easily:</p>
  <figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">$ npm install  <span class="comment"># automatically install dependencies listed in package.json</span></div></pre></td></tr></table></figure>
</li>
<li><p>Install a single module; for shorthands, see <a href="https://docs.npmjs.com/misc/config" target="_blank" rel="external">https://docs.npmjs.com/misc/config</a></p>
  <figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div></pre></td><td class="code"><pre><div class="line">$ npm install express --save       <span class="comment"># save dependency into package.json</span></div><div class="line">$ npm install gulp --save-dev      <span class="comment"># save devDependency into package.json</span></div><div class="line">$ npm install bower -g      <span class="comment"># install global, so can call bower inside terminal</span></div><div class="line"></div><div class="line">$ npm install express@<span class="string">"4.2.x"</span> --save     <span class="comment"># install specific version</span></div><div class="line">$ npm install underscore@<span class="string">"&gt;=1.1.0 &lt;1.4.0"</span> --save</div><div class="line"></div><div class="line">$ npm i express -S    <span class="comment"># Shorthands for --save</span></div><div class="line">$ npm i gulp -D      <span class="comment"># for --save-dev</span></div></pre></td></tr></table></figure>
</li>
<li><p>Remove</p>
  <figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div></pre></td><td class="code"><pre><div class="line">$ npm uninstall [-g] underscore --save  <span class="comment"># remove dependency from pkg.json</span></div><div class="line">$ npm prune                             <span class="comment"># remove extraneous pkgs (not listed in pkg.json) from node_modules</span></div><div class="line">$ npm prune --production                <span class="comment"># also remove devDependencies pkgs from node_modules</span></div></pre></td></tr></table></figure>
</li>
</ul>
<ul>
<li><p>Upgrade npm</p>
  <figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">$ npm i npm@latest -g</div></pre></td></tr></table></figure>
</li>
</ul>
<h2 id="List-installed-packages"><a href="#List-installed-packages" class="headerlink" title="List installed packages"></a>List installed packages</h2><figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div><div class="line">23</div><div class="line">24</div><div class="line">25</div><div class="line">26</div><div class="line">27</div><div class="line">28</div><div class="line">29</div><div class="line">30</div><div class="line">31</div><div class="line">32</div><div class="line">33</div><div class="line">34</div></pre></td><td class="code"><pre><div class="line">$ npm list --depth 0 </div><div class="line">nodexpress@1.0.0 /Users/tony/Hacker/nodexpress</div><div class="line">├── express@4.13.3</div><div class="line">├── gulp@3.9.0</div><div class="line">├── gulp-inject@1.5.0</div><div class="line">├── gulp-jscs@2.0.0</div><div class="line">├── gulp-jshint@1.12.0</div><div class="line">├── gulp-nodemon@2.0.6</div><div class="line">├── jshint-stylish@2.1.0</div><div class="line">└── wiredep@3.0.0</div><div class="line"></div><div class="line">$ npm list --depth 0 --prod <span class="literal">true</span>    <span class="comment"># production dependencies packages only</span></div><div class="line">nodexpress@1.0.0 /Users/tony/Hacker/nodexpress</div><div class="line">└── express@4.13.3</div><div class="line">  </div><div class="line">$ npm list --depth 0 --dev <span class="literal">true</span>     <span 
class="comment"># devDependencies packages only</span></div><div class="line">nodexpress@1.0.0 /Users/tony/Hacker/nodexpress</div><div class="line">├── gulp@3.9.0</div><div class="line">├── gulp-inject@1.5.0</div><div class="line">├── gulp-jscs@2.0.0</div><div class="line">├── gulp-jshint@1.12.0</div><div class="line">├── gulp-nodemon@2.0.6</div><div class="line">├── jshint-stylish@2.1.0</div><div class="line">└── wiredep@3.0.0</div><div class="line"></div><div class="line">$ npm list --global <span class="literal">true</span> --depth 0  <span class="comment"># global packages</span></div><div class="line">/usr/<span class="built_in">local</span>/lib</div><div class="line">├── bower@1.7.2</div><div class="line">├── gulp@3.9.0</div><div class="line">├── hexo-cli@0.2.0</div><div class="line">└── npm@3.3.12</div><div class="line"></div><div class="line">$ npm list --depth 0 --long <span class="literal">true</span>   <span class="comment"># detail info</span></div><div class="line">$ npm list --depth 0 --json <span class="literal">true</span>   <span class="comment"># detail info with json format</span></div></pre></td></tr></table></figure>
<h2 id="Package-official-github-repo"><a href="#Package-official-github-repo" class="headerlink" title="Package official github repo"></a>Package official github repo</h2><figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">$ npm repo underscore   <span class="comment"># Jump to browser and open github repo</span></div></pre></td></tr></table></figure>
]]></content>
    </entry>
    
    <entry>
      <title><![CDATA[Books of 2016]]></title>
      <url>https://linbojin.github.io/2016/01/15/Books-of-2016/</url>
      <content type="html"><![CDATA[<h3 id="January"><a href="#January" class="headerlink" title="January"></a>January</h3><ul>
<li>Learning Spark<ul>
<li>Book: <a href="http://www.amazon.com/Learning-Spark-Lightning-Fast-Data-Analysis/dp/1449358624" target="_blank" rel="external">http://www.amazon.com/Learning-Spark-Lightning-Fast-Data-Analysis/dp/1449358624</a> </li>
<li>Github: <a href="https://github.com/databricks/learning-spark" target="_blank" rel="external">https://github.com/databricks/learning-spark</a>  <br><br><img src="/media/14574490909410.jpg" alt="Learning Spark"></li>
</ul>
</li>
</ul>
<a id="more"></a>
<hr>
<h3 id="March"><a href="#March" class="headerlink" title="March"></a>March</h3><ul>
<li>Hadoop: The Definitive Guide 4th Edition<ul>
<li>Book: <a href="http://shop.oreilly.com/product/0636920033448.do" target="_blank" rel="external">http://shop.oreilly.com/product/0636920033448.do</a></li>
<li>Website: <a href="http://hadoopbook.com/" target="_blank" rel="external">http://hadoopbook.com/</a> <br><br><img src="/media/Screen%20Shot%202016-04-03%20at%2022.10.34.png" alt="Hadoop: The Definitive Guide"></li>
</ul>
</li>
</ul>
<hr>
<h3 id="Backlog"><a href="#Backlog" class="headerlink" title="Backlog"></a>Backlog</h3><ul>
<li><p>Practical Vim: Edit Text at the Speed of Thought 2nd Edition</p>
<ul>
<li>Book: <a href="https://pragprog.com/book/dnvim2/practical-vim-second-edition" target="_blank" rel="external">https://pragprog.com/book/dnvim2/practical-vim-second-edition</a></li>
<li>Source Code: <a href="https://pragprog.com/titles/dnvim2/source_code" target="_blank" rel="external">https://pragprog.com/titles/dnvim2/source_code</a> <br><br><img src="/media/14574488254700.jpg" alt="Practical Vim"></li>
</ul>
</li>
<li><p>The Linux Command Line (Second Internet Edition): <a href="http://linuxcommand.org/tlcl.php" target="_blank" rel="external">http://linuxcommand.org/tlcl.php</a><br><img src="/media/14555475523301.jpg" alt="TLCL"></p>
</li>
<li><p>Advanced Analytics with Spark</p>
<ul>
<li>Book: <a href="http://www.amazon.com/Advanced-Analytics-Spark-Patterns-Learning/dp/1491912766" target="_blank" rel="external">http://www.amazon.com/Advanced-Analytics-Spark-Patterns-Learning/dp/1491912766</a></li>
<li>Github: <a href="https://github.com/sryza/aas" target="_blank" rel="external">https://github.com/sryza/aas</a> <br><br><img src="/media/14538187148785.jpg" alt="Advanced Analytics with Spark"></li>
</ul>
</li>
<li><p>硅谷之谜: <a href="https://book.douban.com/subject/26665230/" target="_blank" rel="external">https://book.douban.com/subject/26665230/</a></p>
</li>
</ul>
<p>How Linux Works: What Every Superuser Should Know: <a href="http://www.amazon.com/How-Linux-Works-Superuser-Should-ebook/dp/B00PKTGLWM/ref=sr_1_1?s=digital-text&amp;ie=UTF8&amp;qid=1455547859&amp;sr=1-1&amp;keywords=linux" target="_blank" rel="external">http://www.amazon.com/How-Linux-Works-Superuser-Should-ebook/dp/B00PKTGLWM/ref=sr_1_1?s=digital-text&amp;ie=UTF8&amp;qid=1455547859&amp;sr=1-1&amp;keywords=linux</a></p>
]]></content>
    </entry>
    
    <entry>
      <title><![CDATA[Loopback API Framework]]></title>
      <url>https://linbojin.github.io/2016/01/14/Loopback-API-Framework/</url>
      <content type="html"><![CDATA[<p>The LoopBack framework is a set of Node.js modules that you can use independently or together to quickly build applications that expose REST APIs.</p>
<h3 id="Resources"><a href="#Resources" class="headerlink" title="Resources"></a>Resources</h3><p>Loopback: <a href="http://loopback.io/" target="_blank" rel="external">http://loopback.io/</a><br>Getting started: <a href="http://loopback.io/getting-started/" target="_blank" rel="external">http://loopback.io/getting-started/</a><br>Create a simple API: <a href="https://docs.strongloop.com/display/public/LB/Create+a+simple+API" target="_blank" rel="external">https://docs.strongloop.com/display/public/LB/Create+a+simple+API</a><br>LoopBack core concepts: <a href="https://docs.strongloop.com/display/public/LB/LoopBack+core+concepts" target="_blank" rel="external">https://docs.strongloop.com/display/public/LB/LoopBack+core+concepts</a></p>
<a id="more"></a>
<h3 id="Build-an-App-named-loopback-getting-started"><a href="#Build-an-App-named-loopback-getting-started" class="headerlink" title="Build an App named loopback-getting-started"></a>Build an App named loopback-getting-started</h3><figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div><div class="line">23</div><div class="line">24</div><div class="line">25</div><div class="line">26</div><div class="line">27</div><div class="line">28</div><div class="line">29</div><div class="line">30</div><div class="line">31</div><div class="line">32</div><div class="line">33</div><div class="line">34</div><div class="line">35</div><div class="line">36</div><div class="line">37</div><div class="line">38</div><div class="line">39</div><div class="line">40</div><div class="line">41</div><div class="line">42</div><div class="line">43</div><div class="line">44</div><div class="line">45</div><div class="line">46</div><div class="line">47</div><div class="line">48</div><div class="line">49</div><div class="line">50</div><div class="line">51</div><div class="line">52</div><div class="line">53</div><div class="line">54</div><div class="line">55</div><div class="line">56</div><div class="line">57</div><div class="line">58</div><div class="line">59</div><div class="line">60</div><div class="line">61</div><div class="line">62</div><div class="line">63</div><div class="line">64</div><div class="line">65</div><div 
class="line">66</div><div class="line">67</div><div class="line">68</div><div class="line">69</div><div class="line">70</div><div class="line">71</div><div class="line">72</div><div class="line">73</div><div class="line">74</div><div class="line">75</div><div class="line">76</div></pre></td><td class="code"><pre><div class="line">$ npm install -g strongloop</div><div class="line"></div><div class="line"><span class="comment"># Create app</span></div><div class="line">$ slc loopback</div><div class="line">     _-----_</div><div class="line">    |       |    .--------------------------.</div><div class="line">    |--(o)--|    |  Let<span class="string">'s create a LoopBack |</span></div><div class="line">   `---------´   |       application!       |</div><div class="line">    ( _´U`_ )    '--------------------------<span class="string">'</span></div><div class="line">    /___A___\</div><div class="line">     |  ~  |</div><div class="line">   __'.___.<span class="string">'__</span></div><div class="line"> ´   `  |° ´ Y `</div><div class="line">[?] What's the name of your application? loopback-getting-started</div><div class="line">[?] Enter name of the directory to contain the project: loopback-getting-started</div><div class="line"></div><div class="line"></div><div class="line"><span class="comment"># Create model people</span></div><div class="line">$ <span class="built_in">cd</span> loopback-getting-started</div><div class="line">$ slc loopback:model</div><div class="line">[?] Enter the model name: people</div><div class="line">[?] Select the data-source to attach people to: db (memory)</div><div class="line">[?] Select model<span class="string">'s base class PersistedModel</span></div><div class="line">[?] Expose people via the REST API? Yes</div><div class="line">[?] Custom plural form (used to build REST URL):</div><div class="line">[?] Common model or server only? 
common</div><div class="line">Let's add some people properties now.</div><div class="line"></div><div class="line">Enter an empty property name when done.</div><div class="line">[?] Property name: firstname</div><div class="line">   invoke   loopback:property</div><div class="line">[?] Property <span class="built_in">type</span>: string</div><div class="line">[?] Required? Yes</div><div class="line"></div><div class="line">Let<span class="string">'s add another people property.</span></div><div class="line">Enter an empty property name when done.</div><div class="line">[?] Property name: lastname</div><div class="line">   invoke   loopback:property</div><div class="line">[?] Property type: string</div><div class="line">[?] Required? Yes</div><div class="line"></div><div class="line">Let's add another people property.</div><div class="line">Enter an empty property name when done.</div><div class="line"></div><div class="line"><span class="comment"># Create another model CoffeeShop</span></div><div class="line">$ slc loopback:model</div><div class="line">[?] Enter the model name: CoffeeShop</div><div class="line">[?] Select the data-source to attach CoffeeShop to: db (memory)</div><div class="line">[?] Select model<span class="string">'s base class PersistedModel</span></div><div class="line">[?] Expose CoffeeShop via the REST API? Yes</div><div class="line">[?] Custom plural form (used to build REST URL):</div><div class="line">[?] Common model or server only? common</div><div class="line">Let's add some CoffeeShop properties now.</div><div class="line"></div><div class="line">Enter an empty property name when done.</div><div class="line">[?] Property name: name</div><div class="line">   invoke   loopback:property</div><div class="line">[?] Property <span class="built_in">type</span>: string</div><div class="line">[?] Required? 
Yes</div><div class="line"></div><div class="line">Let<span class="string">'s add another CoffeeShop property.</span></div><div class="line">Enter an empty property name when done.</div><div class="line">[?] Property name: city</div><div class="line">   invoke   loopback:property</div><div class="line">[?] Property type: string</div><div class="line">v Required? Yes</div><div class="line"></div><div class="line">Let's add another CoffeeShop property.</div><div class="line">Enter an empty property name when done.</div><div class="line">[?] Property name:</div><div class="line"></div><div class="line"></div><div class="line"><span class="comment"># Run the application</span></div><div class="line">$ node .</div><div class="line">Web server listening at: http://0.0.0.0:3000</div><div class="line">Browse your REST API at http://0.0.0.0:3000/explorer</div></pre></td></tr></table></figure>
<p><a href="https://docs.strongloop.com/display/public/LB/Project+layout+reference" target="_blank" rel="external">project structure</a></p>
<ul>
<li>server - Node application scripts and configuration files.</li>
<li>common/models (or server/models) - Sub-directory containing all model JSON and JavaScript files.</li>
<li>client - Client JavaScript, HTML, and CSS files.</li>
<li>common - Files common to client and server. </li>
</ul>
<p><img src="/media/Screen%20Shot%202016-01-14%20at%2015.49.45.png" alt="loopback project structure"></p>
<h3 id="Use-API-Explorer"><a href="#Use-API-Explorer" class="headerlink" title="Use API Explorer"></a>Use API Explorer</h3><p>Go to <a href="http://0.0.0.0:3000" target="_blank" rel="external">http://0.0.0.0:3000</a>:</p>
<figure class="highlight json"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">&#123;<span class="attr">"started"</span>:<span class="string">"2016-01-14T08:05:00.339Z"</span>,<span class="attr">"uptime"</span>:<span class="number">48.7</span>&#125;</div></pre></td></tr></table></figure>
<p>Go to <a href="http://0.0.0.0:3000/explorer" target="_blank" rel="external">http://0.0.0.0:3000/explorer</a>:</p>
<p><img src="/media/Screen%20Shot%202016-01-14%20at%2016.09.32.png" alt="Loopback API Explorer"></p>
<h4 id="Create-a-new-instance-of-the-model-and-persist-it-into-the-data-source"><a href="#Create-a-new-instance-of-the-model-and-persist-it-into-the-data-source" class="headerlink" title="Create a new instance of the model and persist it into the data source"></a>Create a new instance of the model and persist it into the data source</h4><p>API: POST  /CoffeeShops<br><img src="/media/14527611403060.jpg" alt="POST API"><br><img src="/media/14527612340194.jpg" alt="application&#39;s response after post data"></p>
<h4 id="Retrieve-the-data-inside-datasource"><a href="#Retrieve-the-data-inside-datasource" class="headerlink" title="Retrieve the data inside datasource"></a>Retrieve the data inside datasource</h4><p>API: GET  /CoffeeShops<br><img src="/media/Screen%20Shot%202016-01-14%20at%2016.49.38.png" alt="GET API"><br><img src="/media/Screen%20Shot%202016-01-14%20at%2016.50.03.png" alt="get data"></p>
]]></content>
    </entry>
    
    <entry>
      <title><![CDATA[Spark Source Codes 01 Submit and Run Jobs]]></title>
      <url>https://linbojin.github.io/2016/01/10/Spark-Source-Codes-01-Submit-and-Run-Jobs/</url>
      <content type="html"><![CDATA[<p>Standalone mode:</p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">$ <span class="built_in">cd</span> &#123;SPARK_HOME&#125;/libexec/sbin/</div></pre></td></tr></table></figure>
<h3 id="Start-Master-at-8080"><a href="#Start-Master-at-8080" class="headerlink" title="Start Master at 8080,"></a>Start Master at 8080</h3><p>org.apache.spark.deploy.master.Master<br>onStart()</p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div></pre></td><td class="code"><pre><div class="line"><span class="comment"># spark command: java -Xms1g -Xmx1g org.apache.spark.deploy.master.Master </span></div><div class="line"><span class="comment">#                --ip localhost --port 7077 --webui-port 8080</span></div><div class="line">$ ./start-master.sh </div><div class="line"></div><div class="line">Output Logs:</div><div class="line">16/01/10 20:45:23 INFO Master: Registered signal handlers <span class="keyword">for</span> [TERM, HUP, INT]</div><div class="line">16/01/10 20:45:23 WARN NativeCodeLoader: Unable to load native-hadoop library <span class="keyword">for</span> your platform... 
using <span class="built_in">builtin</span>-java classes <span class="built_in">where</span> applicable</div><div class="line">16/01/10 20:45:24 INFO SecurityManager: Changing view acls to: tony</div><div class="line">16/01/10 20:45:24 INFO SecurityManager: Changing modify acls to: tony</div><div class="line">16/01/10 20:45:24 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(tony); users with modify permissions: Set(tony)</div><div class="line">16/01/10 20:45:24 INFO Utils: Successfully started service <span class="string">'sparkMaster'</span> on port 7077.</div><div class="line">16/01/10 20:45:24 INFO Master: Starting Spark master at spark://localhost:7077</div><div class="line">16/01/10 20:45:24 INFO Master: Running Spark version 1.6.0</div><div class="line">16/01/10 20:45:24 INFO Utils: Successfully started service <span class="string">'MasterUI'</span> on port 8080.</div><div class="line">16/01/10 20:45:24 INFO MasterWebUI: Started MasterWebUI at http://192.168.0.112:8080</div><div class="line">16/01/10 20:45:24 INFO Utils: Successfully started service on port 6066.</div><div class="line">16/01/10 20:45:24 INFO StandaloneRestServer: Started REST server <span class="keyword">for</span> submitting applications on port 6066</div><div class="line">16/01/10 20:45:24 INFO Master: I have been elected leader! New state: ALIVE</div></pre></td></tr></table></figure>
<h3 id="Start-Worker-at-8081"><a href="#Start-Worker-at-8081" class="headerlink" title="Start Worker at 8081"></a>Start Worker at 8081</h3><p>onStart() =&gt; registerWithMaster()</p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div></pre></td><td class="code"><pre><div class="line"><span class="comment"># spark command: java -Xms1g -Xmx1g org.apache.spark.deploy.worker.Worker </span></div><div class="line"><span class="comment">#                --webui-port 8081 spark://localhost:7077</span></div><div class="line">$ ./start-slave.sh spark://localhost:7077</div><div class="line"></div><div class="line">Output Logs:</div><div class="line">16/01/10 20:50:45 INFO Worker: Registered signal handlers <span class="keyword">for</span> [TERM, HUP, INT]</div><div class="line">16/01/10 20:50:45 WARN NativeCodeLoader: Unable to load native-hadoop library <span class="keyword">for</span> your platform... 
using <span class="built_in">builtin</span>-java classes <span class="built_in">where</span> applicable</div><div class="line">16/01/10 20:50:45 INFO SecurityManager: Changing view acls to: tony</div><div class="line">16/01/10 20:50:45 INFO SecurityManager: Changing modify acls to: tony</div><div class="line">16/01/10 20:50:45 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(tony); users with modify permissions: Set(tony)</div><div class="line">16/01/10 20:50:46 INFO Utils: Successfully started service <span class="string">'sparkWorker'</span> on port 49576.</div><div class="line">16/01/10 20:50:46 INFO Worker: Starting Spark worker 192.168.0.112:49576 with 4 cores, 7.0 GB RAM</div><div class="line">16/01/10 20:50:46 INFO Worker: Running Spark version 1.6.0</div><div class="line">16/01/10 20:50:46 INFO Worker: Spark home: /usr/<span class="built_in">local</span>/Cellar/apache-spark/1.6.0/libexec</div><div class="line">16/01/10 20:50:46 INFO Utils: Successfully started service <span class="string">'WorkerUI'</span> on port 8081.</div><div class="line">16/01/10 20:50:46 INFO WorkerWebUI: Started WorkerWebUI at http://192.168.0.112:8081</div><div class="line">16/01/10 20:50:46 INFO Worker: Connecting to master localhost:7077...</div><div class="line">16/01/10 20:50:46 INFO Worker: Successfully registered with master spark://localhost:7077</div></pre></td></tr></table></figure>
<h3 id="Start-Spark-shell-over-cluster-on-http-localhost-4040"><a href="#Start-Spark-shell-over-cluster-on-http-localhost-4040" class="headerlink" title="Start Spark-shell over cluster on http://localhost:4040"></a>Start Spark-shell over cluster on <a href="http://localhost:4040" target="_blank" rel="external">http://localhost:4040</a></h3><figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">$ MASTER=spark://localhost:7077 spark-shell</div></pre></td></tr></table></figure>
<p><img src="/media/14526662171327.jpg" alt="14526662171327"></p>
<figure class="highlight scala"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">scala&gt; sc.textFile(<span class="string">"README.md"</span>).filter(_.contains(<span class="string">"Spark"</span>)).count</div></pre></td></tr></table></figure>
<p><img src="/media/14526662553694.jpg" alt="14526662553694"></p>
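<p>sc.textFile("")</p>
<p>RDD objects: build the operator DAG</p>
<p>DAGScheduler: splits the DAG into stages and handles errors between stages (resubmits failed stages)</p>
<p>==TaskSet==&gt;</p>
<p>TaskScheduler: handles errors inside a stage (retries failed tasks)</p>
<p>org.apache.spark.scheduler.TaskScheduler</p>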
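<p>A minimal sketch of how this flow can be observed from <code>spark-shell</code> (where <code>sc</code> is already defined). It assumes a <code>README.md</code> in the working directory, as in the example above; the word-count pipeline is only an illustration and is not taken from the Spark sources. <code>reduceByKey</code> introduces a shuffle, so the job should be split into two stages, which <code>toDebugString</code> shows as a change of indentation in the lineage:</p>
<figure class="highlight scala"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div></pre></td><td class="code"><pre><div class="line">// Run inside spark-shell, where sc already exists.</div><div class="line">// reduceByKey needs a shuffle, so a stage boundary is cut in front of it.</div><div class="line">val counts = sc.textFile("README.md").</div><div class="line">  flatMap(_.split(" ")).</div><div class="line">  map(word =&gt; (word, 1)).</div><div class="line">  reduceByKey(_ + _)</div><div class="line"></div><div class="line">// Print the RDD lineage; the indentation step marks the shuffle boundary.</div><div class="line">println(counts.toDebugString)</div><div class="line"></div><div class="line">// An action submits the job: stages are built first, then their tasks are launched.</div><div class="line">counts.count()</div></pre></td></tr></table></figure>
<p>The trailing dots keep the chained calls in one expression when the lines are typed into the shell one by one. The number of retries for a failing task inside a stage is bounded by the <code>spark.task.maxFailures</code> setting.</p>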
]]></content>
    </entry>
    
    <entry>
      <title><![CDATA[Reading Spark Source Code in IntelliJ IDEA]]></title>
      <url>https://linbojin.github.io/2016/01/09/Reading-Spark-Souce-Code-in-IntelliJ-IDEA/</url>
      <content type="html"><![CDATA[<p>IntelliJ IDEA is a good choice for reading the Spark source code. This tutorial shows how to set it up.</p>
<h3 id="Get-spark-repository"><a href="#Get-spark-repository" class="headerlink" title="Get spark repository"></a>Get spark repository</h3><ol>
<li>Fork the <a href="https://github.com/apache/spark" target="_blank" rel="external">Apache Spark</a> project to your GitHub account</li>
<li><p>Clone Spark to your local machine:</p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div></pre></td><td class="code"><pre><div class="line">$ git <span class="built_in">clone</span> git@github.com:username/spark.git</div><div class="line">$ <span class="built_in">cd</span> spark/</div></pre></td></tr></table></figure>
 <a id="more"></a></li>
<li><p>Add the Apache Spark repo as a remote (to keep up to date with it):</p>
 <figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div></pre></td><td class="code"><pre><div class="line">$ git remote add apache https://github.com/apache/spark.git</div><div class="line"><span class="comment"># check remote accounts</span></div><div class="line">$ git remote -v</div></pre></td></tr></table></figure>
</li>
<li><p>Sync your repo with Apache Spark:</p>
 <figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div></pre></td><td class="code"><pre><div class="line"><span class="comment"># Fetch the branches and their respective commits from the apache repo</span></div><div class="line">$ git fetch apache</div><div class="line"><span class="comment"># Update codes</span></div><div class="line">$ git pull apache master</div></pre></td></tr></table></figure>
</li>
<li><p>Push the new updates to your own GitHub repo:</p>
 <figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">$ git push origin master</div></pre></td></tr></table></figure>
</li>
<li><p>Create a new develop branch for your own work:</p>
 <figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">$ git checkout -b develop</div></pre></td></tr></table></figure>
</li>
<li><p>Push the develop branch to your GitHub repo:</p>
 <figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">$ git push -u origin develop</div></pre></td></tr></table></figure>
</li>
</ol>
<h3 id="Built-spark-in-Intellij-IDEA-15"><a href="#Built-spark-in-Intellij-IDEA-15" class="headerlink" title="Built spark in Intellij IDEA 15"></a>Build Spark in IntelliJ IDEA 15</h3><ol>
<li>Install <a href="https://www.jetbrains.com/idea/download/" target="_blank" rel="external">IntelliJ IDEA 15</a> as well as <a href="https://plugins.jetbrains.com/plugin/?id=1347" target="_blank" rel="external">IDEA Scala Plugin</a></li>
<li><p>Make sure you are on your own develop branch:</p>
 <figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">$ git checkout develop</div></pre></td></tr></table></figure>
</li>
<li><p>Open spark project in IDEA (directly open <strong>pom.xml</strong> file)<br>Menu -&gt; File -&gt; <strong>Open</strong> -&gt; {spark}/<strong>pom.xml</strong> </p>
</li>
<li><p>Set <code>java.version</code> in <strong>pom.xml</strong> to your installed Java version</p>
 <figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div></pre></td><td class="code"><pre><div class="line"><span class="comment"># pom.xml</span></div><div class="line">&lt;java.version&gt;1.8&lt;/java.version&gt;</div></pre></td></tr></table></figure>
</li>
<li><p>Build Spark with sbt</p>
 <figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">$ build/sbt assembly</div></pre></td></tr></table></figure>
</li>
<li><p>Verify that Spark was built successfully</p>
 <figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">$ ./bin/spark-shell</div></pre></td></tr></table></figure>
</li>
</ol>
<h3 id="Reading-spark-codes"><a href="#Reading-spark-codes" class="headerlink" title="Reading spark codes"></a>Reading Spark code</h3><p>It’s better to read and change code on your develop branch, and to keep the master branch in sync with the Apache Spark repo. You can then update your develop branch with the following commands:</p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div></pre></td><td class="code"><pre><div class="line">$ git checkout master</div><div class="line">$ git pull apache master</div><div class="line">$ git checkout develop</div><div class="line">$ git merge master</div></pre></td></tr></table></figure>
<p>Useful IDEA Shortcuts:</p>
<pre><code>command + o : search classes
command + b : go to declaration
command + [ : go back to the previous location
shift + command + F : search in files (Find in Path)
</code></pre><p>Several important classes:</p>
<pre><code>SparkContext.scala 
DAGScheduler.scala
TaskSchedulerImpl.scala
BlockManager.scala
</code></pre><hr>
<p>Ref: Building Spark: <a href="http://spark.apache.org/docs/latest/building-spark.html" target="_blank" rel="external">http://spark.apache.org/docs/latest/building-spark.html</a></p>
]]></content>
    </entry>
    
    <entry>
      <title><![CDATA[Build blog with Hexo and MWeb]]></title>
      <url>https://linbojin.github.io/2016/01/09/Build-blog-by-Hexo-and-MWeb/</url>
      <content type="html"><![CDATA[<h2 id="Requirements"><a href="#Requirements" class="headerlink" title="Requirements"></a>Requirements</h2><p><a href="https://github.com/" target="_blank" rel="external">Github account</a>: blog deployed by GitHub Pages<br><a href="https://hexo.io/" target="_blank" rel="external">Hexo</a>: blog framework to generate static blog files<br><a href="http://www.mweb.im/" target="_blank" rel="external">MWeb</a>: Markdown writing tool</p>
<h2 id="Setup-Github"><a href="#Setup-Github" class="headerlink" title="Setup Github"></a>Setup Github</h2><ol>
<li>Create a new repository named username.github.io</li>
<li><a href="https://username.github.io" target="_blank" rel="external">https://username.github.io</a> will be your personal blog address</li>
</ol>
<a id="more"></a>
<h2 id="Setup-Hexo"><a href="#Setup-Hexo" class="headerlink" title="Setup Hexo"></a>Setup Hexo</h2><p>Make sure you have installed node.js and git.</p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div></pre></td><td class="code"><pre><div class="line">$ npm install -g hexo-cli</div><div class="line">$ hexo init blog    </div><div class="line">$ <span class="built_in">cd</span> blog</div><div class="line">$ hexo new <span class="string">"my first blog"</span></div></pre></td></tr></table></figure>
<h2 id="Setup-MWeb"><a href="#Setup-MWeb" class="headerlink" title="Setup MWeb"></a>Setup MWeb</h2><ol>
<li>Install MWeb from the App Store</li>
<li>Launch MWeb and go into <code>External Mode</code> (Command + E)</li>
<li>Add new external source: <code>{blog}/source</code>, and choose <code>Media Save Path</code> as <code>Absolute</code> </li>
<li>Enjoy writing</li>
</ol>
<h2 id="Deploy-Blog"><a href="#Deploy-Blog" class="headerlink" title="Deploy Blog"></a>Deploy Blog</h2><ol>
<li><p>Install <a href="https://github.com/hexojs/hexo-deployer-git" target="_blank" rel="external">hexo-deployer-git</a></p>
 <figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">$ npm install hexo-deployer-git --save</div></pre></td></tr></table></figure>
</li>
<li><p>Edit <code>{blog}/_config.yml</code></p>
 <figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div></pre></td><td class="code"><pre><div class="line">$ vim _config.yml</div><div class="line"></div><div class="line"><span class="comment"># Deployment</span></div><div class="line">deploy:</div><div class="line">  <span class="built_in">type</span>: git</div><div class="line">  repo: git@github.com:username/username.github.io.git</div><div class="line">  branch: master</div></pre></td></tr></table></figure>
</li>
<li><p>Deploy after you make any changes</p>
 <figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">$ hexo deploy -g</div></pre></td></tr></table></figure>
</li>
<li><p>You can also test locally at <a href="http://0.0.0.0:4000/" target="_blank" rel="external">http://0.0.0.0:4000/</a>:</p>
 <figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">$ hexo server</div></pre></td></tr></table></figure>
</li>
</ol>
<h2 id="Tips"><a href="#Tips" class="headerlink" title="Tips"></a>Tips</h2><ol>
<li><p>Back up the <code>source</code> folder and add <strong>version control</strong>: Markdown source files live in <code>{blog}/source</code>, and this folder is not tracked by git. So I sync this important folder to the cloud:</p>
 <figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div></pre></td><td class="code"><pre><div class="line">$ cp -r &#123;blog&#125;/<span class="built_in">source</span> &#123;Dropbox&#125;/blog/<span class="built_in">source</span></div><div class="line">$ vim _config.yml <span class="comment"># point source_dir to the new synced path  </span></div><div class="line">	</div><div class="line"><span class="comment"># Directory</span></div><div class="line">source_dir: &#123;Dropbox&#125;/blog/<span class="built_in">source</span></div></pre></td></tr></table></figure>
</li>
<li><p>Add README.md to the repo: create README.md inside the <code>{source}</code> folder and add it to <code>skip_render</code> in <code>{blog}/_config.yml</code>, so that it is not rendered as a post and can introduce your repo on GitHub:</p>
 <figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div></pre></td><td class="code"><pre><div class="line">$ vim _config.yml</div><div class="line">	</div><div class="line"><span class="comment"># Directory</span></div><div class="line">skip_render: README.md</div></pre></td></tr></table></figure>
</li>
</ol>
<p>Ref: <a href="https://hexo.io/docs/index.html" target="_blank" rel="external">Hexo</a>, <a href="http://zh.mweb.im/mweb-1.4-add-floder-octpress-support.html" target="_blank" rel="external">MWeb for Octopress</a> </p>
]]></content>
    </entry>
    
    <entry>
      <title><![CDATA[Spark Introduction part 1 Coding]]></title>
      <url>https://linbojin.github.io/2016/01/09/Spark-Introduction-part-1-Coding/</url>
      <content type="html"><![CDATA[<h2 id="Basic-Functions"><a href="#Basic-Functions" class="headerlink" title="Basic Functions"></a>Basic Functions</h2><figure class="highlight scala"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div></pre></td><td class="code"><pre><div class="line">sc.parallelize(<span class="type">List</span>(<span class="number">1</span>,<span class="number">2</span>,<span class="number">3</span>,<span class="number">4</span>,<span class="number">5</span>,<span class="number">6</span>)).map(_ * <span class="number">2</span>).filter(_ &gt; <span class="number">5</span>).collect()</div><div class="line">*** res: <span class="type">Array</span>[<span class="type">Int</span>] = <span class="type">Array</span>(<span class="number">6</span>, <span class="number">8</span>, <span class="number">10</span>, <span class="number">12</span>) ***</div><div class="line"></div><div class="line"><span class="keyword">val</span> rdd = sc.parallelize(<span class="type">List</span>(<span class="number">1</span>,<span class="number">2</span>,<span class="number">3</span>,<span class="number">4</span>,<span class="number">5</span>,<span class="number">6</span>,<span class="number">7</span>,<span class="number">8</span>,<span class="number">9</span>,<span class="number">10</span>))</div><div class="line">rdd.reduce(_+_)</div><div class="line">*** res: <span class="type">Int</span> = <span class="number">55</span> ***</div></pre></td></tr></table></figure>
<a id="more"></a>
<h3 id="union-amp-intersection-amp-join-amp-lookup"><a href="#union-amp-intersection-amp-join-amp-lookup" class="headerlink" title="union &amp; intersection &amp; join &amp; lookup"></a>union &amp; intersection &amp; join &amp; lookup</h3><figure class="highlight scala"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div><div class="line">23</div></pre></td><td class="code"><pre><div class="line"><span class="keyword">val</span> rdd1 = sc.parallelize(<span class="type">List</span>((<span class="string">"a"</span>, <span class="number">1</span>), (<span class="string">"a"</span>, <span class="number">2</span>), (<span class="string">"b"</span>, <span class="number">1</span>), (<span class="string">"b"</span>, <span class="number">3</span>)))</div><div class="line"><span class="keyword">val</span> rdd2 = sc.parallelize(<span class="type">List</span>((<span class="string">"a"</span>, <span class="number">3</span>), (<span class="string">"a"</span>, <span class="number">4</span>), (<span class="string">"b"</span>, <span class="number">1</span>), (<span class="string">"b"</span>, <span class="number">2</span>)))</div><div class="line"></div><div class="line"><span class="keyword">val</span> unionRDD = rdd1.union(rdd2)</div><div class="line">unionRDD.collect() </div><div class="line">*** res: <span class="type">Array</span>((a,<span class="number">1</span>), (a,<span class="number">2</span>), (b,<span class="number">1</span>), (b,<span 
class="number">3</span>), (a,<span class="number">3</span>), (a,<span class="number">4</span>), (b,<span class="number">1</span>), (b,<span class="number">2</span>)) ***</div><div class="line"></div><div class="line"><span class="keyword">val</span> intersectionRDD = rdd1.intersection(rdd2)</div><div class="line">intersectionRDD.collect() </div><div class="line">*** res: <span class="type">Array</span>[(<span class="type">String</span>, <span class="type">Int</span>)] = <span class="type">Array</span>((b,<span class="number">1</span>)) ***</div><div class="line"></div><div class="line"><span class="keyword">val</span> joinRDD = rdd1.join(rdd2)</div><div class="line">joinRDD.collect()</div><div class="line">*** res: <span class="type">Array</span>[(<span class="type">String</span>, (<span class="type">Int</span>, <span class="type">Int</span>))] = <span class="type">Array</span>((a,(<span class="number">1</span>,<span class="number">3</span>)), (a,(<span class="number">1</span>,<span class="number">4</span>)), (a,(<span class="number">2</span>,<span class="number">3</span>)), (a,(<span class="number">2</span>,<span class="number">4</span>)), (b,(<span class="number">1</span>,<span class="number">1</span>)), (b,(<span class="number">1</span>,<span class="number">2</span>)), (b,(<span class="number">3</span>,<span class="number">1</span>)), (b,(<span class="number">3</span>,<span class="number">2</span>))) ***</div><div class="line"></div><div class="line">rdd1.lookup(<span class="string">"a"</span>)</div><div class="line">*** res: <span class="type">Seq</span>[<span class="type">Int</span>] = <span class="type">WrappedArray</span>(<span class="number">1</span>, <span class="number">2</span>) *** </div><div class="line"></div><div class="line">unionRDD.lookup(<span class="string">"a"</span>)</div><div class="line">*** res: <span class="type">Seq</span>[<span class="type">Int</span>] = <span class="type">WrappedArray</span>(<span class="number">1</span>, <span 
class="number">2</span>, <span class="number">3</span>, <span class="number">4</span>) ***</div><div class="line"></div><div class="line">joinRDD.lookup(<span class="string">"a"</span>)</div><div class="line">*** res: <span class="type">Seq</span>[(<span class="type">Int</span>, <span class="type">Int</span>)] = <span class="type">ArrayBuffer</span>((<span class="number">1</span>,<span class="number">3</span>), (<span class="number">1</span>,<span class="number">4</span>), (<span class="number">2</span>,<span class="number">3</span>), (<span class="number">2</span>,<span class="number">4</span>)) ***</div></pre></td></tr></table></figure>
<h3 id="chars-count-example"><a href="#chars-count-example" class="headerlink" title="chars count example"></a>chars count example</h3><figure class="highlight scala"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div></pre></td><td class="code"><pre><div class="line"><span class="keyword">val</span> rdd = sc.textFile(<span class="string">"/Users/tony/spark/spark-xiaoxiang-v1/chapter-01/char.data"</span>)</div><div class="line"></div><div class="line"><span class="keyword">val</span> charCount = rdd.flatMap(_.split(<span class="string">" "</span>))</div><div class="line">                   .map(char =&gt; (char.toLowerCase, <span class="number">1</span>))</div><div class="line">                   .reduceByKey(_+_)</div><div class="line">charCount.collect()</div><div class="line"></div><div class="line">charCount.saveAsTextFile(<span class="string">"/Users/tony/spark/spark-xiaoxiang-v1/chapter-01/result"</span>)</div><div class="line"></div><div class="line"><span class="keyword">val</span> charCountSort = rdd.flatMap(_.split(<span class="string">" "</span>))</div><div class="line">                       .map(char =&gt; (char.toLowerCase, <span class="number">1</span>))</div><div class="line">                       .reduceByKey(_+_)</div><div class="line">                       .map( p =&gt; (p._2, p._1) )</div><div class="line">                       .sortByKey(<span class="literal">false</span>)</div><div class="line">                       .map( p =&gt; (p._2, p._1) )</div><div class="line">charCountSort.collect()</div></pre></td></tr></table></figure>
<h2 id="Cluster-Programming"><a href="#Cluster-Programming" class="headerlink" title="Cluster Programming"></a>Cluster Programming</h2><p>/sbin: start or stop spark cluster<br>/bin:  start programs like spark-shell</p>
<h3 id="Cluster-configuration"><a href="#Cluster-configuration" class="headerlink" title="Cluster configuration"></a>Cluster configuration</h3><p><strong>scp the same configuration files to all the cluster nodes</strong></p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div><div class="line">23</div><div class="line">24</div><div class="line">25</div><div class="line">26</div><div class="line">27</div><div class="line">28</div></pre></td><td class="code"><pre><div class="line">spark-env.sh</div><div class="line">-----------------</div><div class="line"><span class="built_in">export</span> JAVA_HOME=</div><div class="line"><span class="built_in">export</span> SPARK_MASTER_IP=localhost</div><div class="line"><span class="built_in">export</span> SPARK_WORKER_CORES=</div><div class="line"><span class="built_in">export</span> SPARK_WORKER_INSTANCES=</div><div class="line"><span class="built_in">export</span> SPARK_WORKER_MEMORY=</div><div class="line"><span class="built_in">export</span> SPARK_MASTER_PORT=</div><div class="line"><span class="built_in">export</span> SPARK_JAVA_OPTS=<span class="string">"-verbose:gc -XX:-PrintGCDetails"</span></div><div class="line"><span class="comment"># if set up on server, uncomment the below line</span></div><div class="line"><span class="comment"># export SPARK_PUBLIC_DNS=ec2-54-179-156-156.ap-southeast-1.compute.amazonaws.com</span></div><div class="line"></div><div class="line">slaves</div><div class="line">----------</div><div class="line">xx.xx.xx.2</div><div class="line">xx.xx.xx.3</div><div class="line">xx.xx.xx.4</div><div class="line">xx.xx.xx.5</div><div class="line"></div><div class="line">spark-defaults.conf</div><div class="line">--------------------</div><div class="line">spark.eventLog.enabled           true</div><div class="line">spark.eventLog.dir               /tmp/spark-events</div><div class="line">spark.history.fs.logDirectory    /tmp/spark-log-directory</div><div class="line"></div><div class="line">spark.master spark://server:7077</div><div class="line">spark.local.dir /data/tmp_spark_dir/</div><div class="line">spark.executor.memory 10g</div></pre></td></tr></table></figure>
<h3 id="Run-spark-shell-on-cluster"><a href="#Run-spark-shell-on-cluster" class="headerlink" title="Run spark-shell on cluster"></a>Run spark-shell on cluster</h3><figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div></pre></td><td class="code"><pre><div class="line">MASTER=<span class="built_in">local</span>[4] ADD_JARS=code.jar ./spark-shell</div><div class="line"></div><div class="line">spark-shell --master spark://localhost:7077</div></pre></td></tr></table></figure>
<p>SparkConf - Configuration for Spark Applications</p>
<p>Properties can be set in three places, listed here from lowest to highest precedence:</p>
<ul>
<li><code>conf/spark-defaults.conf</code> - the defaults file</li>
<li><code>--conf</code> - the command-line option accepted by <code>spark-shell</code> and <code>spark-submit</code></li>
<li><code>SparkConf</code> - properties set programmatically in the application, which override the other two</li>
</ul>
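<p>A minimal sketch of that precedence order in plain Scala, runnable in the Scala REPL without Spark. The three maps and their values are made-up stand-ins for the three sources. Spark's real resolution logic is more involved; this only illustrates that a higher-precedence source wins when the same key is set more than once.</p>

```scala
// Toy model of property resolution; not Spark's actual implementation.
// Values below are illustrative only.
val defaultsFile = Map(
  "spark.master" -> "spark://server:7077",
  "spark.executor.memory" -> "10g"
)
val commandLine = Map(
  "spark.executor.memory" -> "4g",
  "spark.logConf" -> "true"
)
val setInCode = Map(
  "spark.executor.memory" -> "2g"
)

// On a key collision, ++ keeps the right-hand value, so later sources win.
val effective = defaultsFile ++ commandLine ++ setInCode

effective.toSeq.sortBy(_._1).foreach { case (k, v) => println(s"$k=$v") }
```

<p>In this sketch <code>spark.executor.memory</code> ends up as <code>2g</code>, because the value set in code overrides both the command line and the defaults file, while <code>spark.master</code> falls through from the defaults file.</p>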
<figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div><div class="line">23</div><div class="line">24</div><div class="line">25</div><div class="line">26</div><div class="line">27</div><div class="line">28</div><div class="line">29</div><div class="line">30</div><div class="line">31</div><div class="line">32</div><div class="line">33</div><div class="line">34</div><div class="line">35</div><div class="line">36</div><div class="line">37</div><div class="line">38</div></pre></td><td class="code"><pre><div class="line">$ spark-shell --conf spark.logConf=<span class="literal">true</span></div><div class="line">16/01/13 14:13:07 INFO SparkContext: Running Spark version 1.6.0</div><div class="line">16/01/13 14:13:07 INFO SparkContext: Spark configuration:</div><div class="line">spark.app.name=Spark shell</div><div class="line">spark.eventLog.dir=/tmp/spark-events</div><div class="line">spark.eventLog.enabled=<span class="literal">true</span></div><div class="line">spark.history.fs.logDirectory=/tmp/spark-log-directory</div><div class="line">spark.jars=</div><div class="line">spark.logConf=<span class="literal">true</span></div><div class="line">spark.master=<span class="built_in">local</span>[*]</div><div class="line">spark.repl.class.uri=http://172.30.64.148:57710</div><div class="line">spark.submit.deployMode=client</div><div class="line"></div><div class="line">scala&gt; sc.getConf.toDebugString</div><div 
class="line">res0: String =</div><div class="line">spark.app.id=<span class="built_in">local</span>-1452665588896</div><div class="line">spark.app.name=Spark shell</div><div class="line">spark.driver.host=172.30.64.148</div><div class="line">spark.driver.port=57711</div><div class="line">spark.eventLog.dir=/tmp/spark-events</div><div class="line">spark.eventLog.enabled=<span class="literal">true</span></div><div class="line">spark.executor.id=driver</div><div class="line">spark.externalBlockStore.folderName=spark<span class="_">-e</span>0f9f7c6-2759-44e9-bc5e-423fba7b16ad</div><div class="line">spark.history.fs.logDirectory=/tmp/spark-log-directory</div><div class="line">spark.jars=</div><div class="line">spark.logConf=<span class="literal">true</span></div><div class="line">spark.master=<span class="built_in">local</span>[*]</div><div class="line">spark.repl.class.uri=http://172.30.64.148:57710</div><div class="line">spark.submit.deployMode=client</div><div class="line"></div><div class="line">scala&gt; sc.getConf.getOption(<span class="string">"spark.local.dir"</span>)</div><div class="line">res1: Option[String] = None</div><div class="line"></div><div class="line">scala&gt; sc.getConf.getOption(<span class="string">"spark.app.name"</span>)</div><div class="line">res2: Option[String] = Some(Spark shell)</div><div class="line"></div><div class="line">scala&gt;  sc.getConf.get(<span class="string">"spark.master"</span>)</div><div class="line">res3: String = <span class="built_in">local</span>[*]</div></pre></td></tr></table></figure>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div></pre></td><td class="code"><pre><div class="line">hdfs dfs -getmerge /path result</div><div class="line">wc <span class="_">-l</span> result</div><div class="line">head result</div><div class="line">tail result</div></pre></td></tr></table></figure>
]]></content>
    </entry>
    
    <entry>
      <title><![CDATA[Spark Introduction part 1]]></title>
      <url>https://linbojin.github.io/2016/01/09/Spark-Introduction-part-1/</url>
      <content type="html"><![CDATA[<p>Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.</p>
<p>BDAS, the Berkeley Data Analytics Stack, is an open source software stack that integrates software components being built by the AMPLab to make sense of Big Data.</p>
<p><a href="https://amplab.cs.berkeley.edu/software/" target="_blank" rel="external"><img src="/media/14523324529237.jpg" alt="Berkeley Data Analytics Stack"></a></p>
<a id="more"></a>
<table>
<thead>
<tr>
<th>Spark Components</th>
<th>VS.</th>
<th>Hadoop Components</th>
</tr>
</thead>
<tbody>
<tr>
<td>Spark Core</td>
<td>&lt;——&gt;</td>
<td>Apache Hadoop MapReduce</td>
</tr>
<tr>
<td>Spark Streaming</td>
<td>&lt;——&gt;</td>
<td>Apache Storm</td>
</tr>
<tr>
<td>Spark SQL</td>
<td>&lt;——&gt;</td>
<td>Apache Hive</td>
</tr>
<tr>
<td>Spark GraphX</td>
<td>&lt;——&gt;</td>
<td>MPI (Taobao)</td>
</tr>
<tr>
<td>Spark MLlib</td>
<td>&lt;——&gt;</td>
<td>Apache Mahout </td>
</tr>
</tbody>
</table>
<p><mark>Why Spark is fast:</mark></p>
<ul>
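<li>In-memory computing: intermediate results can stay in memory instead of being written to disk between steps.</li>
<li>Directed Acyclic Graph (DAG) engine: the scheduler sees the whole computation graph in advance, so it can optimize the execution plan. It also uses delay scheduling to improve data locality.</li>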
</ul>
<h2 id="Resilient-Distributed-Dataset"><a href="#Resilient-Distributed-Dataset" class="headerlink" title="Resilient Distributed Dataset"></a>Resilient Distributed Dataset</h2><p>Internally, each RDD is characterized by five main properties:</p>
<ul>
<li>A list of <mark>partitions</mark></li>
<li>A <mark>function</mark> for computing each split</li>
<li>A list of <mark>dependencies</mark> on other RDDs</li>
<li>Optionally, a <mark>Partitioner</mark> for key-value RDDs (e.g. to say that the RDD is hash-partitioned)</li>
<li>Optionally, a list of <mark>preferred locations</mark> to compute each split on (e.g. block locations for an HDFS file)</li>
</ul>
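<p>The five properties can be sketched as a small trait in plain Scala. This is a toy model for illustration, not Spark's API: <code>MiniRDD</code>, <code>SeqRDD</code>, <code>MappedRDD</code> and <code>Partition</code> are hypothetical names, and everything runs in a single JVM.</p>

```scala
// Toy model of the five RDD properties; names are illustrative, not Spark's API.
final case class Partition(index: Int)

trait MiniRDD[T] {
  def partitions: Seq[Partition]                               // 1. list of partitions
  def compute(split: Partition): Iterator[T]                   // 2. function for computing each split
  def dependencies: Seq[MiniRDD[_]]                            // 3. dependencies on other datasets
  def partitioner: Option[Any => Int] = None                   // 4. optional partitioner for key-value data
  def preferredLocations(split: Partition): Seq[String] = Nil  // 5. optional preferred hosts per split

  // Gather every partition's elements into one local sequence.
  def collect(): Seq[T] = partitions.flatMap(p => compute(p))
}

// A source dataset: deals an in-memory sequence round-robin into numSlices partitions.
class SeqRDD[T](data: Seq[T], numSlices: Int) extends MiniRDD[T] {
  val partitions: Seq[Partition] = (0 until numSlices).map(i => Partition(i))
  def compute(split: Partition): Iterator[T] =
    data.zipWithIndex.collect { case (x, i) if i % numSlices == split.index => x }.iterator
  val dependencies: Seq[MiniRDD[_]] = Nil
}

// A derived dataset: applies f to each element of the matching parent partition.
class MappedRDD[T, U](parent: MiniRDD[T], f: T => U) extends MiniRDD[U] {
  def partitions: Seq[Partition] = parent.partitions
  def compute(split: Partition): Iterator[U] = parent.compute(split).map(f)
  val dependencies: Seq[MiniRDD[_]] = Seq(parent)
}

val numbers = new SeqRDD(Seq(1, 2, 3, 4, 5, 6), numSlices = 2)
val doubled = new MappedRDD(numbers, (x: Int) => x * 2)
println(doubled.collect().sorted.mkString(", "))
```

<p>Note that <code>doubled</code> stores no data of its own: it only knows its parent and the function to apply, which is what makes recomputation from dependencies possible.</p>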
<h2 id="Storage-Strategy"><a href="#Storage-Strategy" class="headerlink" title="Storage Strategy"></a>Storage Strategy</h2><figure class="highlight scala"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div></pre></td><td class="code"><pre><div class="line"><span class="class"><span class="keyword">class</span> <span class="title">StorageLevel</span> <span class="title">private</span>(<span class="params"></span></span></div><div class="line">    private var useDisk_ : <span class="type">Boolean</span>,</div><div class="line">    private var useMemory_ : <span class="type">Boolean</span>,</div><div class="line">    private var deserialized_ : <span class="type">Boolean</span>,</div><div class="line">    private var replication_ : <span class="type">Int</span> = 1)</div><div class="line"></div><div class="line"><span class="keyword">val</span> <span class="type">MEMORY_ONLY</span> = <span class="keyword">new</span> <span class="type">StorageLevel</span>(<span class="literal">false</span>, <span class="literal">true</span>, <span class="literal">true</span>) <span class="comment">// useDisk = false, useMemory = true, deserialized = true</span></div></pre></td></tr></table></figure>
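<p>To make the flag combinations concrete, here is a small sketch in plain Scala. <code>Level</code> is a hypothetical stand-in that mirrors the three-flag constructor shown above; it is not Spark's own class, and the named values are only examples of how the flags can be combined.</p>

```scala
// Hypothetical stand-in for the three-flag StorageLevel; not Spark's own class.
final case class Level(
    useDisk: Boolean,
    useMemory: Boolean,
    deserialized: Boolean,
    replication: Int = 1) {

  // Human-readable summary of where and in what form blocks are kept.
  def describe: String = {
    val where = (useMemory, useDisk) match {
      case (true, true)   => "memory, spilling to disk"
      case (true, false)  => "memory only"
      case (false, true)  => "disk only"
      case (false, false) => "not stored"
    }
    val form = if (deserialized) "deserialized objects" else "serialized bytes"
    s"$where, $form, replicated x$replication"
  }
}

val MemoryOnly    = Level(useDisk = false, useMemory = true, deserialized = true)
val MemoryOnlySer = Level(useDisk = false, useMemory = true, deserialized = false)
val MemoryAndDisk = Level(useDisk = true, useMemory = true, deserialized = true)
val MemoryOnly2   = MemoryOnly.copy(replication = 2)

Seq(MemoryOnly, MemoryOnlySer, MemoryAndDisk, MemoryOnly2).foreach(l => println(l.describe))
```

<p>The point of the design is that a storage level is just a combination of independent choices: which medium to use, whether to keep objects deserialized, and how many replicas to hold.</p>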
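<h2 id="RDD-transformation-amp-action"><a href="#RDD-transformation-amp-action" class="headerlink" title="RDD, transformation &amp; action"></a>RDD, transformation &amp; action</h2><p>Transformations are lazily evaluated: they only describe how to derive a new RDD, and no computation runs until an action asks for a result.</p>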
<p><img src="/media/14523268512595.jpg" alt="transformation and actions"></p>
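<p>Lazy evaluation can be tried out without Spark by using a plain Scala <code>Iterator</code> as a stand-in for an RDD (an assumption made for illustration only; an RDD is distributed and reusable, an iterator is neither). <code>map</code> and <code>filter</code> play the role of transformations, and <code>toList</code> plays the role of an action.</p>

```scala
import scala.collection.mutable.ArrayBuffer

// Records every function call so we can see when work actually happens.
val log = ArrayBuffer.empty[String]

// "Transformations": building the pipeline records what to do but runs nothing.
val pipeline = Iterator(1, 2, 3, 4)
  .map { x => log += s"map $x"; x * 10 }
  .filter { x => log += s"filter $x"; x > 20 }

// Nothing has been evaluated yet, so the log is still empty here.
val callsBeforeAction = log.size

// "Action": forcing the result pulls each element through the whole pipeline.
val result = pipeline.toList

println(s"calls before action: $callsBeforeAction")
println(s"result: $result")
println(s"calls after action: ${log.size}")
```

<p>Because nothing runs until the result is forced, the engine gets to see the whole chain of steps before executing any of them.</p>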
<h2 id="Lineage-amp-Dependency-amp-Fault-Tolerance"><a href="#Lineage-amp-Dependency-amp-Fault-Tolerance" class="headerlink" title="Lineage &amp; Dependency &amp; Fault Tolerance"></a>Lineage &amp; Dependency &amp; Fault Tolerance</h2><h3 id="Lineage"><a href="#Lineage" class="headerlink" title="Lineage"></a>Lineage</h3><p><mark>Lineage is the basis of Spark's fault tolerance:</mark> each RDD records how it was derived from its parents, so a lost partition can be recomputed rather than restored from a replica.</p>
<p>Lineage Graph<br><img src="/media/14523268935091.jpg" alt="lineage graph"></p>
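<p>A rough sketch of lineage-based recovery in plain Scala. The <code>Dataset</code>, <code>Source</code> and <code>Mapped</code> types are hypothetical and far simpler than Spark's RDDs; the sketch only shows that a derived dataset can rebuild any single partition by replaying its chain of parents.</p>

```scala
// Toy lineage: each derived dataset remembers its parent and the function applied.
sealed trait Dataset {
  def name: String
  def computePartition(i: Int): Vector[Int]
}

final case class Source(name: String, parts: Vector[Vector[Int]]) extends Dataset {
  def computePartition(i: Int): Vector[Int] = parts(i)
}

final case class Mapped(name: String, parent: Dataset, f: Int => Int) extends Dataset {
  // Recompute by asking the parent for the same partition, then applying f.
  def computePartition(i: Int): Vector[Int] = parent.computePartition(i).map(f)
}

// Walk the parent chain to list the lineage, newest step first.
def lineage(d: Dataset): List[String] = d match {
  case Source(n, _)    => List(n)
  case Mapped(n, p, _) => n :: lineage(p)
}

val source  = Source("source", Vector(Vector(1, 2), Vector(3, 4)))
val plusOne = Mapped("plusOne", source, _ + 1)
val squared = Mapped("squared", plusOne, x => x * x)

// Pretend partition 1 of `squared` was lost: rebuild just that partition.
val rebuilt = squared.computePartition(1)

println(lineage(squared).mkString(" from "))
println(rebuilt)
```

<p>Only the lost partition is recomputed; partition 0 is never touched during the rebuild.</p>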
<h3 id="Dependency"><a href="#Dependency" class="headerlink" title="Dependency"></a>Dependency</h3><ul>
<li><p>Narrow dependencies: each child partition depends on one parent partition</p>
<ul>
<li>The computation can be carried out on a single node, with no data moved across the network.</li>
</ul>
</li>
<li><p>Wide dependencies: each child partition depends on multiple parent partitions</p>
<ul>
<li>If one partition is lost, all of its parent partitions need to be recomputed.</li>
<li>Use <code>rdd.persist</code> to cache the intermediate results so they do not have to be recomputed.</li>
</ul>
</li>
</ul>
<p><img src="/media/14523269133572.jpg" alt="Dependency"></p>
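<p>The difference between the two kinds of dependency can be sketched with nested Scala collections, where the outer <code>Vector</code> stands for the list of partitions. This is a single-machine toy, not Spark code: the <code>target</code> hash function and the choice of two output partitions are assumptions made for the example.</p>

```scala
// Two parent partitions of key-value records.
val parent: Vector[Vector[(String, Int)]] = Vector(
  Vector("a" -> 1, "b" -> 2),
  Vector("a" -> 3, "c" -> 4)
)

// Narrow: output partition i is computed from input partition i alone.
val mapped = parent.map(part => part.map { case (k, v) => (k, v * 10) })

// Wide: records are regrouped by key, so every output partition may need
// records from every input partition (this regrouping is the shuffle).
val numOut = 2
def target(key: String): Int = math.abs(key.hashCode % numOut)

val shuffled = Vector.tabulate(numOut) { out =>
  parent.flatten.filter { case (k, _) => target(k) == out }
}

// After the shuffle, all values for a key sit in one partition and can be summed.
val summed = shuffled.map { part =>
  part.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).sum }
}

println(mapped)
println(summed)
```

<p>In the wide case, the two records for key <code>a</code> start in different parent partitions and end up in the same child partition, which is why losing that child partition requires going back to all of its parents.</p>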
<h2 id="Spark-1-0-updated"><a href="#Spark-1-0-updated" class="headerlink" title="Spark 1.0 updated"></a>Spark 1.0 updated</h2><p><code>spark-submit</code><br>history server (persistent UI)<br><code>spark-defaults.conf</code></p>
]]></content>
    </entry>
    
  
  
</search>