Hadoop DSL Language FAQ

convexquad edited this page Jul 27, 2015 · 14 revisions

How do I clear previously declared job dependencies or workflow targets?

See Clearing Job Dependencies and Workflow Targets While Cloning.

How do I set Hadoop job configuration parameters (like the amount of memory my jobs use)?

(Since version 0.3.6) If you are using a subclass of the HadoopJavaProcessJob class (such as hadoopJavaJob, hiveJob, javaJob, pigJob, kafkaPushJob or voldemortBuildPushJob), you can set Hadoop job configuration parameters using the following syntax:

hadoopJavaJob('jobName') {                     // Or another subtype of HadoopJavaProcessJob
  set confProperties: [
  'mapreduce.map.memory.mb' : 2048,          // Sets the amount of physical RAM allocated to the map tasks of your job. Should be in increments of 2048.
    'mapreduce.map.java.opts' : '-Xmx1536m',   // Sets the -Xmx for your map tasks. Should be less than mapreduce.map.memory.mb to accommodate stack and code size.
    'mapreduce.reduce.memory.mb' : 4096,       // Sets the amount of physical RAM allocated to the reduce tasks of your job. Should be in increments of 2048.
    'mapreduce.reduce.java.opts': '-Xmx3584m'  // Sets the -Xmx for your reduce tasks. Should be less than mapreduce.reduce.memory.mb to accommodate stack and code size.
  ]
}

Using set confProperties causes these properties to be written to your Azkaban job file with the prefix hadoop-inject. Azkaban automatically injects any property it sees with the hadoop-inject prefix into the Hadoop job Configuration object.
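For example, the confProperties shown above would appear in the compiled Azkaban job file roughly as follows (a sketch only; the exact layout of the generated file may vary by plugin version):

```properties
# Compiled Azkaban job file (excerpt, illustrative)
type=hadoopJava
hadoop-inject.mapreduce.map.memory.mb=2048
hadoop-inject.mapreduce.map.java.opts=-Xmx1536m
hadoop-inject.mapreduce.reduce.memory.mb=4096
hadoop-inject.mapreduce.reduce.java.opts=-Xmx3584m
```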

How can I affect the JVM properties of my Azkaban job?

(Since version 0.3.6) For any subclass of the JavaProcessJob class, you can set the classpath, JVM properties, -Xms and -Xmx of the client JVM started by Azkaban to run the job:

javaProcessJob('jobName') {
  uses 'com.linkedin.foo.HelloJavaProcessJob'  // Required. Sets java.class=com.linkedin.foo.HelloJavaProcessJob in the job file
  jvmClasspath './*:./lib/*'                   // Sets the classpath of the JVM started by Azkaban to run the job
  set jvmProperties: [                         // Sets jvm.args=-DpropertyName1=propertyValue1 -DpropertyName2=propertyValue2 in the job file.
    'jvmPropertyName1' : 'jvmPropertyValue1',  // These arguments are passed to the JVM started by Azkaban to run the job.
    'jvmPropertyName2' : 'jvmPropertyValue2'
  ]
  Xms 96                                       // Sets -Xms for the JVM started by Azkaban to run the job
  Xmx 384                                      // Sets -Xmx for the JVM started by Azkaban to run the job
}
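For reference, a job file compiled from the example above might look roughly like the following sketch (the exact property names and value formats written by the plugin may differ):

```properties
# Compiled Azkaban job file (excerpt, illustrative)
type=javaprocess
java.class=com.linkedin.foo.HelloJavaProcessJob
classpath=./*:./lib/*
jvm.args=-DjvmPropertyName1=jvmPropertyValue1 -DjvmPropertyName2=jvmPropertyValue2
Xms=96M
Xmx=384M
```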

These properties affect only the JVM of the client process that Azkaban starts to run the job. In particular, they do NOT affect the JVMs that run the map and reduce tasks of Hadoop jobs. If you are running a map-reduce job, you usually do not need to increase the amount of memory for the Azkaban client process.

To affect the JVM properties that run map and reduce tasks, you need to set the appropriate Hadoop job configuration parameter using set confProperties.
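To make the distinction concrete, here is a sketch that sizes both JVMs in one job (the job class and values are hypothetical):

```groovy
hadoopJavaJob('sizedJob') {
  uses 'com.example.MyMapReduceJob'           // Hypothetical job class
  Xmx 512                                     // Affects only the client JVM Azkaban starts for the job
  set confProperties: [
    'mapreduce.map.memory.mb' : 2048,         // Affects the containers that run your map tasks
    'mapreduce.map.java.opts' : '-Xmx1536m'   // Affects the JVMs that run your map tasks
  ]
}
```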

Can I split my DSL code into more than one file? For example, I would like to have one DSL file per workflow.

Yes, you can. For projects with large workflows, splitting them into one DSL file per workflow is a reasonable idea.

// In your build.gradle, you can apply all the DSL files you want
apply from: 'src/main/gradle/workflow1.gradle'
apply from: 'src/main/gradle/workflow2.gradle'
apply from: 'src/main/gradle/workflow3.gradle'

If I have my code split into more than one file, can I define something in one file and then refer to it in another?

Yes and no. Unfortunately, Groovy def variables (e.g. def foo = "bar") are limited to the scope of the file in which they are declared. However, you can use the Hadoop DSL definitionSet feature to work around this problem.
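For example, assuming the definitionSet and lookupDef forms from the Hadoop DSL reference (the property names and values here are hypothetical), you could share a value across files like this:

```groovy
// In workflow1.gradle: add a value to the default definition set
definitionSet defs: [
  'outputBasePath' : '/data/derived'          // Hypothetical shared value
]

// In workflow2.gradle: look the value up by name at build time
hadoopJavaJob('writerJob') {
  uses 'com.example.WriterJob'                // Hypothetical job class
  set properties: [
    'output.path' : "${lookupDef('outputBasePath')}/writerJob"
  ]
}
```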

For the Hadoop DSL, we have our own explicit name resolution, so you can refer to named Hadoop DSL elements (such as jobs and workflows) interchangeably across files.

Any named elements you declare in a particular scope can then be used in that scope, even from a different file. Since names are resolved at build time (see the section above on scope), you can even refer to a job or workflow that you haven't defined yet (but define later, perhaps even in another file). Here's a quick example:

// In the file workflow1.gradle
hadoopJavaJob('cacheLibraries') {
  // 'cacheLibraries' now bound in Hadoop DSL global scope
  uses 'com.linkedin.drama.mapred.CacheLibraryJob'
}
 
hadoop {
  buildPath "azkaban/conf/jobs"
  workflow('countByCountryFlow') {
    // Workflow 'countByCountryFlow' now bound in hadoop scope
    hadoopJavaJob('countByCountry') {
      uses 'com.linkedin.hello.mapreduce.CountByCountryJob'
      // Job 'countByCountry' now bound in the workflow scope for countByCountryFlow
    }
    targets 'countByCountry'
  }
}
 
// In the file workflow2.gradle
hadoop {
  workflow('anotherExampleFlow') {
    addJob('cacheLibraries') {         // Lookup 'cacheLibraries' in Hadoop DSL global scope works even across files
    }
    addJob('.hadoop.countByCountryFlow.countByCountry') {  // Do a fully-qualified lookup and clone of the countByCountry job from workflow1.gradle
    }
    targets 'countByCountry'
  }
}

How can I escape strings in the Hadoop DSL?

You can use Groovy language features and API functions to help you escape strings in the Hadoop DSL. Groovy distinguishes between single-quoted strings, double-quoted strings, triple-quoted strings, slashy strings, dollar-slashy strings and more. This should allow you to easily define things like Avro schemas without making your code impossible to read.

WANTED: If someone has a recommendation for the best way to define an Avro schema in the Hadoop DSL (or better ways to escape other kinds of commonly-used strings), please let us know.

// http://docs.groovy-lang.org/latest/html/gapi/groovy/json/StringEscapeUtils.html
def escape = { s -> groovy.json.StringEscapeUtils.escapeJava(s) }
 
// Read http://mrhaki.blogspot.com/2009/08/groovy-goodness-string-strings-strings.html for more information on all the types of Groovy strings!
def schema = escape("""{
    "type" : "record",
    "name" : "member_summary",
    "namespace" : "com.linkedin.data.derived",
    "fields" : [ {
      "name" : "memberId",
      "type" : [ "null", "long" ]
    } ]
  }""");
 
noOpJob('test') {
  set properties: [
    'test1' : escape("line1\nline2"),
    'test2' : '"mySchema": "foo": [ "bar": "bazz"]',
    'test3' : "\"${escape('"mySchema": "foo": [ "bar": "bazz"]')}\"",
    'test4' : "\"${schema}\""
  ]
}
 
// This results in the following output in the compiled Azkaban job file:
test1=line1\nline2
test2="mySchema": "foo": [ "bar": "bazz"]
test3="\"mySchema\": \"foo\": [ \"bar\": \"bazz\"]"
test4="{\n      \"type\" : \"record\",\n      \"name\" : \"member_summary\",\n      \"namespace\" : \"com.linkedin.data.derived\",\n      \"fields\" : [ {\n        \"name\" : \"memberId\",\n        \"type\" : [ \"null\", \"long\" ]\n      } ]\n    }"