BigDataHadoop

Big Data

Big data is a general term used to describe the voluminous amount of unstructured and semi-structured data a company creates: data that would take too much time and cost too much money to load into a relational database for analysis. Although big data doesn't refer to any specific quantity, the term is often used when speaking about petabytes and exabytes of data.

A primary goal for looking at big data is to discover repeatable business patterns. It's generally accepted that unstructured data, most of it located in text files, accounts for at least 80% of an organization's data. If left unmanaged, the sheer volume of unstructured data that's generated each year within an enterprise can be costly in terms of storage. Unmanaged data can also pose a liability if information cannot be located in the event of a compliance audit or lawsuit.

Big data analytics is often associated with cloud computing because the analysis of large data sets in real-time requires a framework like MapReduce to distribute the work among tens, hundreds or even thousands of computers.

Hadoop

Hadoop is "the" open source platform that can be used to manage big data.

(quoted from big data definition in Cloudera)

With data growing so rapidly, and with unstructured data accounting for 90% of the data today, the time has come for enterprises to re-evaluate their approach to data storage, management and analytics. Legacy systems will remain necessary for specific high-value, low-volume workloads, and will complement the use of Hadoop, optimizing the data management structure in your organization by putting the right big data workloads in the right systems.

Spring for Apache Hadoop

(quoted from the Spring for Apache Hadoop definition)

Spring for Apache Hadoop simplifies developing Apache Hadoop applications by providing a unified configuration model and easy-to-use APIs for working with HDFS, MapReduce, Pig, and Hive. It also provides integration with other Spring ecosystem projects, such as Spring Integration and Spring Batch, enabling you to develop solutions for big data ingest/export and Hadoop workflow orchestration.
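
As an example of what that unified configuration looks like, here is a minimal sketch of an application context that accesses HDFS through the hadoop XML namespace. It is not taken from JEM: the namenode address and the paths are placeholders, and fsh is the implicit FsShell variable that Spring for Apache Hadoop exposes to inline scripts.

<beans xmlns="http://www.springframework.org/schema/beans"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:hdp="http://www.springframework.org/schema/hadoop"
    xsi:schemaLocation="
        http://www.springframework.org/schema/beans
        http://www.springframework.org/schema/beans/spring-beans-3.1.xsd
        http://www.springframework.org/schema/hadoop
        http://www.springframework.org/schema/hadoop/spring-hadoop.xsd">

    <!-- points the Hadoop client at the namenode; the address is a placeholder -->
    <hdp:configuration>
        fs.default.name=hdfs://namenode-host:8020
    </hdp:configuration>

    <!-- a JavaScript snippet run against HDFS through the implicit FsShell ('fsh');
         the paths are illustrative -->
    <hdp:script id="copy-to-hdfs" language="javascript" run-at-startup="true">
        if (!fsh.test("/data/input")) {
            fsh.mkdir("/data/input");
        }
        fsh.copyFromLocal("file:///tmp/sample.txt", "/data/input/");
    </hdp:script>
</beans>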

JEM meets Hadoop

As described in the picture below, for a complete integration between JEM and Hadoop, JEM should be able to do two things:

  • Access (read and write) Hadoop data, and hence access HDFS
  • Submit jobs (MapReduce and, eventually, YARN applications) to the Hadoop platform

Before describing the possible implementation of the points above, just remember that, because JEM uses Spring Batch as one of its out-of-the-box JCL implementations, it can take advantage of all the features of Spring for Apache Hadoop.

That said, JEM currently has two possible ways to access HDFS:

  • Using Spring Batch with the Spring for Apache Hadoop integration.
  • Using the HDFS Nfs Gateway, which, starting from Apache Hadoop 2.3, allows mounting HDFS as a client file system.

As far as the submission of Hadoop jobs is concerned, JEM will once again take advantage of the integration between Spring for Apache Hadoop and Spring Batch. If you look at the picture below, you can observe that, for job submission, we are explicitly referring to the Hadoop 1.x distribution, and hence only to MapReduce jobs and not to YARN applications. This is because Spring for Apache Hadoop supports Hadoop 2.x only on a Java runtime 1.8+. At the moment the JEM community wants to stay on a Java runtime 1.6+, because most companies are still tied to that runtime. That said, once we move to a Java runtime 1.8+, JEM will also be integrated with Hadoop 2.x and will be able to handle YARN applications as well.

http://www.pepstock.org/resources/Jem-Hadoop.png

HDFS NFS Gateway

The ability of Apache Hadoop, starting from release 2.3, to expose HDFS as a file system via NFS is a very important feature for JEM, because it allows JEM to use HDFS as one of its Multiple data path. This means that a single job (inside the JEM environment) will be able, during its business logic, to handle data sets both from HDFS and from "traditional" file systems. To properly mount HDFS as a file system via NFS, see HDFS Nfs Gateway.
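
As a rough sketch of what the gateway setup involves (the property names below are the ones documented for the Hadoop 2.3/2.4 NFS gateway and were renamed in later releases; all values are illustrative), HDFS is configured to allow the export, and a client then mounts it with a standard NFS mount:

<!-- hdfs-site.xml snippet enabling the NFS gateway (illustrative values) -->
<property>
    <name>dfs.nfs3.dump.dir</name>
    <value>/tmp/.hdfs-nfs</value>
</property>
<property>
    <name>dfs.nfs.exports.allowed.hosts</name>
    <value>* rw</value>
</property>
<!-- with the portmap and nfs3 daemons running on the gateway host
     (here "gateway-host" is a placeholder), a client can mount the export:
     mount -t nfs -o vers=3,proto=tcp,nolock gateway-host:/ /mnt/hdfs -->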

Schedule a MapReduce job from JEM

To be able to submit a MapReduce job from JEM, some Java libraries need to be copied into the JEM home lib folder and into the JEM classpath folder of the Global File System. This process cannot be done automatically by JEM, because it depends on the Hadoop distribution that you want JEM to integrate with.

Let's see in detail what you need to do.

  1. Download the Hadoop 1.x client distribution compatible with your Hadoop 1.x installation.
  2. Download the libraries needed by your Hadoop 1.x client distribution.
  3. Copy the client distribution somewhere in the JEM classpath folder, let's say ${jem.classpath}/hadoop.
  4. Copy the libraries needed by the Hadoop client into the JEM classpath folder, let's say ${jem.classpath}/hadoop/lib.
  5. Download the spring-data-hadoop-xxx.jar that fits your Hadoop 1.x distribution. You can find detailed instructions here.
  6. Copy the spring-data-hadoop-xxx.jar from the previous point into the JEM Home/lib/ext folder.
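
These folders are then referenced by the JCL through the priorClassPath property of the JEM bean, as you can see in the full JCL at the end of this page:

<beans:bean id="jem.bean" class="org.pepstock.jem.springbatch.JemBean">
    <beans:property name="jobName" value="HADOOP" />
    <!-- the Hadoop client jars and their libraries must be loaded first in the classloader -->
    <beans:property name="priorClassPath"
        value="${jem.classpath}/hadoop/*;${jem.classpath}/hadoop/lib/*;${jem.classpath}/hadoop-application/*" />
</beans:bean>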

The PEPSTOCK community has tested the integration with Hortonworks HDP 1.3, using the sandbox that you can download here. So, if you are not interested in a specific Hadoop distribution, but you just want to test the JEM-Hadoop integration, you can:

  1. Download the Hadoop 1.2.1 client distribution here.
  2. Download the libraries needed by your Hadoop 1.2.1 client distribution here.
  3. Copy the client distribution somewhere in the JEM classpath folder, let's say ${jem.classpath}/hadoop.
  4. Copy the libraries needed by the Hadoop client into the JEM classpath folder, let's say ${jem.classpath}/hadoop/lib.
  5. Download the spring-data-hadoop-1.0.2.RELEASE.jar that fits the HDP 1.3 Hadoop distribution here.
  6. Copy the spring-data-hadoop-1.0.2.RELEASE.jar from the previous point into the JEM Home/lib/ext folder.

After you have downloaded and copied the needed jars, you also need a MapReduce job to launch. For this we will use the well-known word count MapReduce example that you will find in your distribution (hadoop-examples.jar) and that, for HDP 1.3, you can find here. Before launching the MapReduce job from JEM, copy hadoop-examples.jar somewhere in the JEM classpath folder, let's say ${jem.classpath}/hadoop-application.

You can now submit, in the JEM environment, the following JCL, which will launch the word count MapReduce job in the Hadoop environment:

<beans:beans xmlns="http://www.springframework.org/schema/batch"
    xmlns:beans="http://www.springframework.org/schema/beans" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xmlns:hdp="http://www.springframework.org/schema/hadoop"
	xsi:schemaLocation="
           http://www.springframework.org/schema/beans 
           http://www.springframework.org/schema/beans/spring-beans-3.1.xsd
           http://www.springframework.org/schema/batch 
           http://www.springframework.org/schema/batch/spring-batch-2.2.xsd
           http://www.springframework.org/schema/hadoop 
           http://www.springframework.org/schema/hadoop/spring-hadoop.xsd">

	<!-- 
		Application Context
	 -->
	<beans:bean id="transactionManager"
        class="org.springframework.batch.support.transaction.ResourcelessTransactionManager"/>

	<beans:bean id="jobRepository" 
  		class="org.springframework.batch.core.repository.support.MapJobRepositoryFactoryBean">
    	<beans:property name="transactionManager" ref="transactionManager"/>
	</beans:bean>

	<beans:bean id="jobLauncher"
		class="org.springframework.batch.core.launch.support.SimpleJobLauncher">
    	<beans:property name="jobRepository" ref="jobRepository" />
	</beans:bean>

	<!-- Hadoop client configuration: fs.default.name points at the namenode -->
	<hdp:configuration>
		fs.default.name=hdfs://192.168.163.135:8020
	</hdp:configuration>

	<!-- This is the best way to launch a Hadoop job, since every
		 exception will be correctly notified to the client. Other options,
		 such as tasklet-jar, are not a good way to submit a Hadoop job,
		 since exceptions thrown during the Hadoop job execution would not
		 be notified to the client. -->
    <hdp:job-tasklet id="hadoop-tasklet" job-ref="word_count_job" wait-for-completion="true" />

    <!-- the word count job definition: mapper and reducer classes come from
         hadoop-examples.jar; input-path and output-path are HDFS paths -->
    <hdp:job id="word_count_job" 
        input-path="/user/hue/jobsub/sample_data/" 
        output-path="/user/hue/jobsub/output/"
        mapper="org.apache.hadoop.examples.WordCount.TokenizerMapper"
        reducer="org.apache.hadoop.examples.WordCount.IntSumReducer"/>
	<!-- 
		JEM properties
		
		As you can see, we use the priorClassPath property: this is because the Hadoop
		lib/ folder needs to be loaded first in the classloader.
		
		Moreover, to be able to launch this job you need to:
		1) load into JEM Home/lib/ext the spring-data-hadoop-xxx.jar that fits your Hadoop 1.x distribution
		2) load into ${jem.classpath}/hadoop the client Hadoop 1.x distribution
		3) load into ${jem.classpath}/hadoop/lib the libraries needed by the client Hadoop distribution
		4) load into ${jem.classpath}/hadoop-application the hadoop-examples.jar containing, among others, the word count example

		You can find the needed jars inside the hadoop folder, and you need a Hortonworks HDP 1.3 Hadoop server.
	 -->

	<beans:bean id="jem.bean" class="org.pepstock.jem.springbatch.JemBean">
		<beans:property name="jobName" value="HADOOP" />
		<beans:property name="priorClassPath" value="${jem.classpath}/hadoop/*;${jem.classpath}/hadoop/lib/*;${jem.classpath}/hodoop-application/*" />
	</beans:bean>

	<!-- 
		the Spring Batch job: a single step running the Hadoop tasklet
	 -->
	<job id="TEST_HADOOP_SB_JOB">
    	<step id="script">
			<tasklet ref="hadoop-tasklet" />
		</step>

	</job>
	
</beans:beans>