Hadoop Validator

Alex Bain edited this page Jun 8, 2017 · 5 revisions

The Hadoop Plugin includes the Hadoop Validator, which provides Gradle tasks that perform local validation of your Hadoop jobs. In particular, the Hadoop Validator includes tasks for data validation, schema validation and syntax checking for Hadoop ecosystem jobs.

These tasks should deliver a significant boost in developer productivity by letting you validate your Hadoop jobs locally at build time. This helps you avoid waiting for a job to be submitted to the cluster, only to see it fail due to a trivial error.

Currently, the Hadoop Validator provides validation tasks for Apache Pig jobs only. However, the Hadoop Validator is built so that it can be easily extended to other Hadoop ecosystem job types, such as Apache Hive or Apache Spark.

Please note that the Hadoop Validator is currently an experimental feature.

The .hadoopValidatorProperties file

Many of the validation tasks depend on information stored in the .hadoopValidatorProperties file in the project directory. If this file does not exist, it will be automatically created by the Hadoop Plugin.

Run the Hadoop Validation Task

To execute the Hadoop Validator tasks for your project, run ./gradlew hadoopValidate. The Hadoop Validator will examine all the jobs configured with the Hadoop DSL for your project and attempt to validate them.

Currently, the hadoopValidate task executes the pigValidate task for Apache Pig jobs.

Apache Pig Validation Tasks

The pigValidate task finds the Apache Pig jobs configured with the Hadoop DSL for your project and executes the pigDataExists, pigDependencyExists and pigSyntaxValidator tasks described below.

pigDataExists Task

This task checks that the data files loaded by Apache Pig scripts exist in HDFS. The HDFS NameNode address must be declared in the .hadoopValidatorProperties file.
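
To make the idea concrete, here is a minimal Python sketch of the first half of such a check: collecting the paths named in LOAD statements of a Pig script. It is not the plugin's implementation. The function name find_load_paths and the regular expression are assumptions made for illustration, and the sketch ignores comments, double-quoted strings and unresolved parameters such as $input. Checking each path against HDFS would need an HDFS client and is left out.

```python
import re

# Hypothetical pattern: the LOAD keyword (any case) followed by a single-quoted path.
# It does not skip comments or handle unresolved $parameters.
LOAD_PATTERN = re.compile(r"\bLOAD\s+'([^']+)'", re.IGNORECASE)


def find_load_paths(pig_script: str) -> list[str]:
    """Return the quoted paths of all LOAD statements, in order of appearance."""
    return LOAD_PATTERN.findall(pig_script)


if __name__ == "__main__":
    script = (
        "users = LOAD '/data/users' USING PigStorage(',');\n"
        "events = load 'hdfs://nn:8020/data/events';\n"
    )
    for path in find_load_paths(script):
        print(path)
```

A real check would then resolve each collected path against the NameNode declared in the .hadoopValidatorProperties file and report the ones that do not exist.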

pigDependencyExists Task

This task checks for existence of jar dependencies declared in Apache Pig scripts. The jar dependencies can be local, located in HDFS or in an Ivy repository. For dependencies located in HDFS or in an Ivy repository, the HDFS NameNode address or Ivy repository URL must be declared in the .hadoopValidatorProperties file.
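
The three dependency locations can be illustrated with a short Python sketch that reads the REGISTER statements of a Pig script, classifies each one, and reports missing local jars. This is an assumption-laden illustration, not the plugin's code: the function names are hypothetical, and the sketch assumes that HDFS dependencies start with hdfs:// and Ivy dependencies start with ivy://. Verifying HDFS and Ivy dependencies requires network access and is omitted.

```python
import os
import re

# Hypothetical pattern: REGISTER at the start of a statement, path bare or single-quoted.
REGISTER_PATTERN = re.compile(
    r"^\s*REGISTER\s+'?([^\s';]+)'?", re.IGNORECASE | re.MULTILINE
)


def classify_dependency(path: str) -> str:
    """Classify a dependency by its URI scheme (assumed convention)."""
    if path.startswith("hdfs://"):
        return "hdfs"
    if path.startswith("ivy://"):
        return "ivy"
    return "local"


def find_dependencies(pig_script: str) -> list[tuple[str, str]]:
    """Return (path, kind) pairs for every REGISTER statement in the script."""
    return [(p, classify_dependency(p)) for p in REGISTER_PATTERN.findall(pig_script)]


def missing_local_jars(pig_script: str) -> list[str]:
    """Return the local jar paths that do not exist on this machine."""
    return [
        path
        for path, kind in find_dependencies(pig_script)
        if kind == "local" and not os.path.isfile(path)
    ]
```

Only the local case can be decided from the file system alone, which is why the other two cases depend on the NameNode address or Ivy repository URL from the .hadoopValidatorProperties file.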

pigSyntaxValidator Task

This task checks the syntax of Apache Pig scripts and generates a parameter substitution file, which is an input to the other Apache Pig validation tasks.