---
subcategory: "Compute"
---

databricks_job Resource

The databricks_job resource allows you to manage Databricks Jobs to run non-interactive code in a databricks_cluster.

Example Usage

data "databricks_current_user" "me" {}
data "databricks_spark_version" "latest" {}
data "databricks_node_type" "smallest" {
  local_disk = true
}

resource "databricks_notebook" "this" {
  path     = "${data.databricks_current_user.me.home}/Terraform"
  language = "PYTHON"
  content_base64 = base64encode(<<-EOT
    # created from ${abspath(path.module)}
    display(spark.range(10))
    EOT
  )
}

resource "databricks_job" "this" {
  name = "Terraform Demo (${data.databricks_current_user.me.alphanumeric})"

  new_cluster {
    num_workers   = 1
    spark_version = data.databricks_spark_version.latest.id
    node_type_id  = data.databricks_node_type.smallest.id
  }

  notebook_task {
    notebook_path = databricks_notebook.this.path
  }
}

output "notebook_url" {
  value = databricks_notebook.this.url
}

output "job_url" {
  value = databricks_job.this.url
}

Jobs with Multiple Tasks

-> Note In Terraform configuration, it is recommended to define tasks in alphabetical order of their task_key arguments, so that you get a consistent and readable diff. Whenever tasks are added or removed, or task_key is renamed, you'll observe a change in the majority of tasks. This is because the current version of the provider treats task blocks as an ordered list. Alternatively, the task block could have been an unordered set, but then end-users would see the entire block replaced upon a change in a single property of the task.

It is possible to create jobs with multiple tasks using task blocks:

resource "databricks_job" "this" {
  name = "Job with multiple tasks"

  task {
    task_key = "a"

    new_cluster {
      num_workers   = 1
      spark_version = data.databricks_spark_version.latest.id
      node_type_id  = data.databricks_node_type.smallest.id
    }

    notebook_task {
      notebook_path = databricks_notebook.this.path
    }
  }

  task {
    task_key = "b"

    depends_on {
      task_key = "a"
    }

    existing_cluster_id = databricks_cluster.shared.id

    spark_jar_task {
      main_class_name = "com.acme.data.Main"
    }
  }
}

Every task block accepts almost all of the arguments available at the job level, with the addition of the task_key attribute and depends_on blocks to define cross-task dependencies.

Argument Reference

The following arguments are supported:

  • name - (Optional) An optional name for the job. The default value is Untitled.
  • new_cluster - (Optional) Same set of parameters as for databricks_cluster resource.
  • existing_cluster_id - (Optional) The ID of an existing cluster that will be used for all runs of this job. When running jobs on an existing cluster, you may need to manually restart the cluster if it stops responding. We strongly suggest using new_cluster for greater reliability.
  • always_running - (Optional) (Bool) Whether this job is always running, like a Spark Streaming application. When true, on every update the provider restarts the currently active run, or starts a new run if the job is not running. False by default. Job runs are started with the parameters specified in the spark_jar_task, spark_submit_task, spark_python_task, or notebook_task blocks.
  • library - (Optional) (Set) An optional list of libraries to be installed on the cluster that will execute the job. Please consult the libraries section of the databricks_cluster resource.
  • retry_on_timeout - (Optional) (Bool) An optional policy to specify whether to retry a job when it times out. The default behavior is to not retry on timeout.
  • max_retries - (Optional) (Integer) An optional maximum number of times to retry an unsuccessful run. A run is considered to be unsuccessful if it completes with a FAILED result_state or INTERNAL_ERROR life_cycle_state. The value -1 means to retry indefinitely and the value 0 means to never retry. The default behavior is to never retry.
  • timeout_seconds - (Optional) (Integer) An optional timeout applied to each run of this job. The default behavior is to have no timeout.
  • min_retry_interval_millis - (Optional) (Integer) An optional minimum interval in milliseconds between the start of the failed run and the subsequent retry run. The default behavior is that unsuccessful runs are immediately retried.
  • max_concurrent_runs - (Optional) (Integer) An optional maximum allowed number of concurrent runs of the job. Defaults to 1.
  • email_notifications - (Optional) (List) An optional set of email addresses notified when runs of this job begin and complete and when this job is deleted. The default behavior is to not send any emails. This field is a block and is documented below.
  • schedule - (Optional) (List) An optional periodic schedule for this job. The default behavior is that the job runs when triggered by clicking Run Now in the Jobs UI or sending an API request to runNow. This field is a block and is documented below.
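For illustration, here is a minimal sketch that combines several of the optional arguments above with the cluster and notebook from the first example; the retry and timeout values are arbitrary placeholders, not recommendations:

resource "databricks_job" "nightly" {
  name                = "Nightly ETL"
  max_concurrent_runs = 1
  max_retries         = 2
  retry_on_timeout    = true
  timeout_seconds     = 3600

  new_cluster {
    num_workers   = 1
    spark_version = data.databricks_spark_version.latest.id
    node_type_id  = data.databricks_node_type.smallest.id
  }

  notebook_task {
    notebook_path = databricks_notebook.this.path
  }
}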

schedule Configuration Block

  • quartz_cron_expression - (Required) A Cron expression using Quartz syntax that describes the schedule for a job. This field is required.
  • timezone_id - (Required) A Java timezone ID. The schedule for a job will be resolved with respect to this timezone. See Java TimeZone for details. This field is required.
  • pause_status - (Optional) Indicates whether this schedule is paused. Either "PAUSED" or "UNPAUSED". When the pause_status field is omitted and a schedule is provided, the server will default to "UNPAUSED" as a value for pause_status.
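A hedged sketch of a schedule block that runs the job every day at 10:00 in a given timezone and creates it in a paused state; the cron expression and timezone are placeholders:

  schedule {
    quartz_cron_expression = "0 0 10 * * ?"
    timezone_id            = "Europe/London"
    pause_status           = "PAUSED"
  }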

spark_jar_task Configuration Block

  • parameters - (Optional) (List) Parameters passed to the main method.
  • main_class_name - (Optional) The full name of the class containing the main method to be executed. This class must be contained in a JAR provided as a library. The code should use SparkContext.getOrCreate to obtain a Spark context; otherwise, runs of the job will fail.
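A minimal sketch of a job running a JAR task; the DBFS path and class name are placeholders for your own artifact, which is attached through a library block:

resource "databricks_job" "jar_example" {
  name = "JAR task example"

  new_cluster {
    num_workers   = 1
    spark_version = data.databricks_spark_version.latest.id
    node_type_id  = data.databricks_node_type.smallest.id
  }

  library {
    jar = "dbfs:/FileStore/jars/app.jar"
  }

  spark_jar_task {
    main_class_name = "com.acme.data.Main"
    parameters      = ["--date", "2022-01-01"]
  }
}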

spark_submit_task Configuration Block

You can invoke Spark submit tasks only on new clusters. In the new_cluster specification, libraries and spark_conf are not supported. Instead, use --jars and --py-files to add Java and Python libraries and --conf to set the Spark configuration. By default, the Spark submit job uses all available memory (excluding reserved memory for Databricks services). You can set --driver-memory, and --executor-memory to a smaller value to leave some room for off-heap usage. Please use spark_jar_task, spark_python_task or notebook_task wherever possible.

  • parameters - (Optional) (List) Command-line parameters passed to spark submit.
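A sketch of a spark_submit_task block; every value in parameters is a placeholder, with the application JAR followed by its own arguments at the end of the list:

  spark_submit_task {
    parameters = [
      "--class", "com.acme.data.Main",
      "--conf", "spark.executor.memory=4g",
      "dbfs:/FileStore/jars/app.jar",
      "--input", "/mnt/raw/events",
    ]
  }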

spark_python_task Configuration Block

  • python_file - (Required) The URI of the Python file to be executed. databricks_dbfs_file and S3 paths are supported. This field is required.
  • parameters - (Optional) (List) Command line parameters passed to the Python file.
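A sketch of a spark_python_task block; the DBFS path and parameters are placeholders:

  spark_python_task {
    python_file = "dbfs:/FileStore/scripts/etl.py"
    parameters  = ["--env", "staging"]
  }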

notebook_task Configuration Block

  • base_parameters - (Optional) (Map) Base parameters to be used for each run of this job. If the run is initiated by a call to run-now with parameters specified, the two parameters maps will be merged. If the same key is specified in base_parameters and in run-now, the value from run-now will be used. If the notebook takes a parameter that is not specified in the job’s base_parameters or the run-now override parameters, the default value from the notebook will be used. Retrieve these parameters in a notebook using dbutils.widgets.get.
  • notebook_path - (Required) The absolute path of the databricks_notebook to be run in the Databricks workspace. This path must begin with a slash. This field is required.
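A sketch of a notebook_task block with base parameters, reusing the notebook from the first example; the parameter names and values are placeholders and would be read in the notebook with dbutils.widgets.get("source"):

  notebook_task {
    notebook_path = databricks_notebook.this.path
    base_parameters = {
      "source"      = "terraform"
      "environment" = "staging"
    }
  }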

pipeline_task Configuration Block

  • pipeline_id - (Required) The pipeline's unique ID.
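A sketch that assumes the pipeline is also managed by Terraform through the provider's databricks_pipeline resource; any valid pipeline ID can be used instead:

  pipeline_task {
    pipeline_id = databricks_pipeline.this.id
  }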

python_wheel_task Configuration Block

  • entry_point - (Optional) Python function as entry point for the task
  • package_name - (Optional) Name of Python package
  • parameters - (Optional) Parameters for the task
  • named_parameters - (Optional) Named parameters for the task
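A sketch of a python_wheel_task block; the package name, entry point, and parameters are placeholders, and the wheel itself would be attached through a library block:

  python_wheel_task {
    package_name = "my_package"
    entry_point  = "run"
    named_parameters = {
      "env" = "staging"
    }
  }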

email_notifications Configuration Block

  • on_failure - (Optional) (List) list of emails to notify when the run fails
  • no_alert_for_skipped_runs - (Optional) (Bool) don't send alert for skipped runs
  • on_start - (Optional) (List) list of emails to notify when the run starts
  • on_success - (Optional) (List) list of emails to notify when the run successfully completes
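A sketch of an email_notifications block; the addresses are placeholders:

  email_notifications {
    on_start                  = ["jobs@example.com"]
    on_success                = ["jobs@example.com"]
    on_failure                = ["oncall@example.com"]
    no_alert_for_skipped_runs = true
  }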

Exported attributes

In addition to all arguments above, the following attributes are exported:

  • url - URL of the job on the given workspace

Access Control

By default, all users can create and modify jobs unless an administrator enables jobs access control. With jobs access control, individual permissions determine a user’s abilities.
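Job permissions are typically managed with the provider's databricks_permissions resource. A minimal sketch, assuming a "Data Engineers" group already exists in the workspace:

resource "databricks_permissions" "job_usage" {
  job_id = databricks_job.this.id

  access_control {
    group_name       = "Data Engineers"
    permission_level = "CAN_MANAGE_RUN"
  }
}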

Timeouts

The timeouts block allows you to specify create and update timeouts if you have an always_running job. Please run TF_LOG=DEBUG terraform apply whenever you observe timeout issues.

timeouts {
  create = "20m"
  update = "20m
}

Import

The resource job can be imported using the id of the job:

$ terraform import databricks_job.this <job-id>

Related Resources

The following resources are often used in the same context: