<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
<title><![CDATA[Romain NIO - Blog dealing with Data]]></title>
<link href="http://www.rnio.me/atom.xml" rel="self"/>
<link href="http://www.rnio.me/"/>
<updated>2016-03-12T13:12:04+01:00</updated>
<id>http://www.rnio.me/</id>
<author>
<name><![CDATA[Romain NIO]]></name>
</author>
<generator uri="http://octopress.org/">Octopress</generator>
<entry>
<title type="html"><![CDATA[Build hadoop native librairies]]></title>
<link href="http://www.rnio.me/blog/2015/06/16/build-hadoop-native-librairies/"/>
<updated>2015-06-16T19:21:41+02:00</updated>
<id>http://www.rnio.me/blog/2015/06/16/build-hadoop-native-librairies</id>
<content type="html"><![CDATA[<p>The Hadoop native librairies are compiled for 32 bits plateforms. If you are using Hadoop on x64, you have probably been faced to the following issue :</p>
<pre><code> WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
</code></pre>
<p>For performance reasons, it is better to recompile those libraries for your own platform.</p>
<p>It’s a good idea to compile on the same architecture as your Hadoop production platform, but of course avoid compiling on the production server itself. Not sure whether your Hadoop native libraries are compiled for a 32-bit platform? You can check with the following command:</p>
<pre><code>file $HADOOP_HOME/lib/native/libhadoop.so.1.0.0
</code></pre>
<p>Here is the result:</p>
<pre><code>libhadoop.so.1.0.0: ELF 32-bit
</code></pre>
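<p>Depending on your Hadoop version (the command is available on 2.x), you can also ask Hadoop itself which native libraries it is able to load; each library is reported as true or false together with the path it was loaded from:</p>
<pre><code>hadoop checknative -a
</code></pre>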
<h2 id="download-source">Download Source</h2>
<p>Visit http://mirrors.ircam.fr/pub/apache/hadoop/common/ and find the source tarball of your Hadoop version. Download it:</p>
<pre><code>wget http://mirrors.ircam.fr/pub/apache/hadoop/common/hadoop-2.4.1/hadoop-2.4.1.tar.gz
</code></pre>
<p>Install the build dependencies:</p>
<pre><code>sudo apt-get install cmake autoconf automake libtool gcc g++ make maven pkg-config zlib1g-dev libssl-dev openssl libcurl4-openssl-dev
</code></pre>
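<p>Before going further, it can be worth checking that the toolchain is actually in place:</p>
<pre><code>gcc --version
cmake --version
mvn -version
</code></pre>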
<p>Install protobuf :</p>
<pre><code>wget https://protobuf.googlecode.com/files/protobuf-2.5.0.tar.gz
gunzip protobuf-2.5.0.tar.gz
tar -xvf protobuf-2.5.0.tar
cd protobuf-2.5.0
sudo ./configure --prefix=/usr
sudo make
sudo make install
</code></pre>
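<p>Hadoop 2.4 requires exactly this protobuf version, so confirm the installation before compiling:</p>
<pre><code>protoc --version
</code></pre>
<p>It should print <code>libprotoc 2.5.0</code>.</p>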
<h2 id="compile-hadoop">Compile Hadoop</h2>
<p>Extract your source tarball:</p>
<pre><code>tar -xzf hadoop-2.4.1-src.tar.gz
</code></pre>
<p>Enter the source folder:</p>
<pre><code>cd hadoop-2.4.1-src/
</code></pre>
<p>Set your environment:</p>
<pre><code>export Platform=x64
</code></pre>
<p>Compile:</p>
<pre><code>mvn package -Pdist,native -DskipTests -Dtar
</code></pre>
<p>If you face issues while compiling, Google is your friend ;). If everything is OK, you will get this kind of output:</p>
<pre><code>[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 5:27.684s
[INFO] Finished at: Wed Jul 02 19:33:51 CEST 2014
[INFO] Final Memory: 165M/834M
[INFO] ------------------------------------------------------------------------
</code></pre>
<p>You can find the libraries in this folder:</p>
<pre><code>cd ./hadoop-dist/target/hadoop-2.4.1/lib/native
</code></pre>
<p>We can see all the built libraries (“ls -lh”):</p>
<pre><code>-rw-r--r-- 1 hadoop hadoop 1.1M Jul 2 19:07 libhadoop.a
lrwxrwxrwx 1 hadoop hadoop 18 Jul 2 19:07 libhadoop.so -> libhadoop.so.1.0.0
-rwxr-xr-x 1 hadoop hadoop 650K Jul 2 19:07 libhadoop.so.1.0.0
-rw-r--r-- 1 hadoop hadoop 1.4M Jul 2 19:07 libhadooppipes.a
-rw-r--r-- 1 hadoop hadoop 421K Jul 2 19:07 libhadooputils.a
-rw-r--r-- 1 hadoop hadoop 373K Jul 2 19:07 libhdfs.a
lrwxrwxrwx 1 hadoop hadoop 16 Jul 2 19:07 libhdfs.so -> libhdfs.so.0.0.0
-rwxr-xr-x 1 hadoop hadoop 245K Jul 2 19:07 libhdfs.so.0.0.0
</code></pre>
<p>At this point, you can check the platform of the libraries:</p>
<pre><code>file libhadoop.so.1.0.0
</code></pre>
<p>This time the result looks good:</p>
<pre><code>libhadoop.so.1.0.0: ELF 64-bit LSB shared object, x86-64
</code></pre>
<p>Save them and archive the package:</p>
<pre><code>tar -cvzf hadoop-native-libraries-2.4.1.tgz *
</code></pre>
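<p>Optionally, record a checksum so you can verify the archive after copying it to each node:</p>
<pre><code>sha256sum hadoop-native-libraries-2.4.1.tgz
</code></pre>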
<h2 id="copy-librairies-on-your-cluster">Copy librairies on your cluster</h2>
<p>This step needs to be repeated on every namenode and datanode.
You just need to copy all those files into $HADOOP_HOME/lib/native (e.g. /usr/local/hadoop/lib/native):</p>
<pre><code>$ rsync hadoop-native-libraries-2.4.1.tgz <your_hadoop_production_server>:/usr/local/hadoop/lib/native/
</code></pre>
<p>Enter your Hadoop native library folder (e.g. /usr/local/hadoop/lib/native):</p>
<pre><code>$ cd $HADOOP_HOME/lib/native
</code></pre>
<p>Extract the archive:</p>
<pre><code>$ tar -xzf hadoop-native-libraries-2.4.1.tgz
</code></pre>
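<p>Once extracted, you can remove the tarball from the native folder:</p>
<pre><code>$ rm hadoop-native-libraries-2.4.1.tgz
</code></pre>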
<h2 id="configure-your-environment">Configure your environment</h2>
<p>You probably have a dedicated Unix user for your Hadoop cluster. Add these lines to its ~/.bashrc (of course, edit the paths according to your configuration):</p>
<pre><code>export HADOOP_INSTALL=/usr/local/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_COMMON_LIB_NATIVE_DIR $HADOOP_OPTS"
</code></pre>
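<p>Open a new shell (or source the file) and check that the variable is set; it should contain a java.library.path pointing at your native folder:</p>
<pre><code>echo $HADOOP_OPTS
</code></pre>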
<h2 id="stop-and-restart-hadoop">Stop and restart Hadoop</h2>
<p>Stop the cluster:</p>
<pre><code>./stop-dfs.sh
./stop-yarn.sh
</code></pre>
<p>Source your bashrc again:</p>
<pre><code>source ~/.bashrc
</code></pre>
<p>Start Hadoop :</p>
<pre><code>./start-dfs.sh
./start-yarn.sh
</code></pre>
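<p>You can check that the daemons came back up with jps (shipped with the JDK); on a typical single-node setup you should see NameNode, DataNode, ResourceManager and NodeManager among the listed processes:</p>
<pre><code>jps
</code></pre>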
<p>Check that the warning has disappeared:</p>
<pre><code>$ hadoop fs -ls /user/hadoop
Found 1 item
-rwxr-xr-x 1 hadoop supergroup 8 2014-07-01 14:06 /user/hadoop/toto.txt
</code></pre>
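<p>If the warning is still there, you can raise the log level for a single command to see what the native code loader is actually doing (HADOOP_ROOT_LOGGER is a standard Hadoop environment variable):</p>
<pre><code>$ HADOOP_ROOT_LOGGER=DEBUG,console hadoop fs -ls /user/hadoop
</code></pre>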
]]></content>
</entry>
<entry>
<title type="html"><![CDATA[Split large file in bash]]></title>
<link href="http://www.rnio.me/blog/2014/06/05/split-large-file-unix-in-bash-command-line/"/>
<updated>2014-06-05T23:44:18+02:00</updated>
<id>http://www.rnio.me/blog/2014/06/05/split-large-file-unix-in-bash-command-line</id>
<content type="html"><![CDATA[<p>When you are dealing with large file, it’s complicated to share or manipulate them.</p>
<p>On linux the split command can be useful for you.</p>
<p>Basic usage:</p>
<pre><code>$ split [-l lines] [-b bytes] filename prefix
</code></pre>
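<p>The -l flag splits by number of lines, while -b splits by size. For example, with GNU split you can cut a (hypothetical) backup.tar into 100 MB chunks:</p>
<pre><code>$ split -b 100M backup.tar chunk-
</code></pre>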
<p>For example, if you want to split a large file named “clients.csv” into files of 100,000 records each, run the following command:</p>
<pre><code>$ split -l 100000 clients.csv splitted_clients-
</code></pre>
<p>“splitted_clients-” is the prefix applied to each generated file.</p>
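<p>Since split names the pieces with alphabetically ordered suffixes (splitted_clients-aa, splitted_clients-ab, …), a simple shell glob concatenates them back in the right order:</p>
<pre><code>$ cat splitted_clients-* > clients_restored.csv
</code></pre>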
]]></content>
</entry>
</feed>