- Designed for collaborative use, such as ad hoc analysis, data exploration, and development.
- Multiple users can share these clusters.
- Cost-effective for tasks that benefit from resource sharing.
- May not guarantee immediate availability due to resource sharing.
- Specifically for running automated jobs.
- Terminate once the job is completed, reducing resource usage and cost.
- Dedicated to a specific task and optimized for performance.
- More suitable for workflows with strict Service Level Agreements (SLAs).
Aspect | All-purpose Clusters | Job Clusters |
---|---|---|
SLA Requirements | Versatile but less predictable for strict SLAs | Dedicated and optimized for performance |
Resource Usage | Cost-effective for exploration and development tasks | Efficient for time-sensitive tasks, releasing resources promptly |
Trade-offs | Flexibility but potential delay due to resource sharing | Prioritizes specific jobs but higher resource costs |
- Combines the advantages of Data Lakes and Data Warehouses.
- Stores data in Parquet format and transaction logs in JSON format.
dbutils
: Module for interacting with Databricks.dbutils.help()
: Lists available modules.dbutils.fs.help()
: Provides file system utilities, e.g.,dbutils.fs.ls('path')
.
describe database db_name;
: Retrieves the location of a database.spark.Table
: Registers a table through SparkSession.DESCRIBE DATABASE
orDESCRIBE SCHEMA
: Returns metadata of a database.
- Builds upon standard data formats.
- VACUUM command: Deletes unused data files older than a specified retention period.
- Supports ACID transactions and scalable metadata handling.
- Delta tables store data in Parquet format with transaction logs in JSON format.
- Supports Git operations like pull, fetch, and manage branches.
- Useful for managing development work and versioning.
dbfs:/user/hive/warehouse/db_hr.db
: Default location for databases.- PIVOT: Transforms rows into columns for better readability.
CREATE TABLE USING
: Creates external tables from external data sources (e.g., CSV).CREATE SCHEMA
: Alias forCREATE DATABASE
.
trigger(availableNow=True)
: Runs the stream in batch mode and stops after processing available data.- Auto Loader: Tracks discovered files using checkpointing for exactly-once ingestion guarantees.
- Default processing interval:
trigger(processingTime="500ms")
.
Aspect | Triggered Execution | Continuous Execution | Development | Production |
---|---|---|---|---|
Execution Trigger | Event-driven | Continuous data arrival | N/A | N/A |
Processing Characteristics | Batch or near-real-time | Real-time processing | Iterative development & testing | Stable, optimized execution |
Environment | N/A | N/A | Interactive notebooks or development environments | Dedicated clusters optimized for performance |
Resource Allocation | N/A | N/A | Limited resources for experimentation and testing | Dedicated resources for stable and scalable execution |
- Enables creating streaming live tables using the
STREAM()
function. - Use
COMMENT "Contains PII"
to indicate that a new table includes personally identifiable information (PII).
- COPY INTO: Suitable for ingesting thousands of files.
- Auto Loader: Preferred for ingesting millions of files or more and handles schema evolution.
- Orchestrates data processing tasks in a Directed Acyclic Graph (DAG).
- Allows repairing failed multi-task jobs by rerunning only the unsuccessful tasks and their dependents.
MERGE
command: Writes data into Delta tables while avoiding duplicate records.
CREATE FUNCTION blue() RETURNS STRING COMMENT 'Blue color code' LANGUAGE SQL RETURN '0000FF';
spark.readStream
: Reads streaming data.SELECT * FROM file_format.'path.type'
: Queries a table using file format and path.CREATE TABLE table_name DEEP CLONE source_table
: Creates a deep clone of a table.CREATE TABLE table_name SHALLOW CLONE source_table
: Creates a shallow clone of a table.
Data Explorer
: Manages data object permissions.- View types:
- Standard view:
CREATE VIEW view_name AS query
- Temporary view:
CREATE TEMP VIEW view_name AS query
- Global temporary view:
CREATE GLOBAL TEMP VIEW view_name AS query
- Standard view:
- Gold tables: Refined data suitable for business reporting (e.g., BI dashboards).
- Silver tables: Cleansed and conformed data.
- Bronze tables: Raw data with minimal transformation.
- Raw data: Unprocessed data directly ingested from source systems.
- The query below is in the Bronze layer, reading data from cloud storage into an uncleaned orders table.
(spark.readStream
.format("cloudFiles")
.option("cloudFiles.format", "json")
.load(ordersLocation)
.writeStream
.option("checkpointLocation", checkpointPath)
.table("uncleanedOrders")
)
- Follow the steps outlined in Databricks documentation.
- Parquet is a columnar storage file format, preferred for its efficiency in data storage and retrieval.
- Learn more about Parquet.
DESCRIBE HISTORY employees
: Retrieves history and version details of a table.- Query nested data (e.g., JSON):
SELECT customer_id, profile:name FROM customer
.
DESCRIBE DETAILS tablename
: Retrieves all details about the given table.
.trigger(once=True)
: Processes one batch of data..trigger(availableNow=True)
: Processes data immediately..trigger(processingTime='60 minutes')
: Sets a 1-hour processing interval.
- Expectations apply data quality checks on each record passing through a query.
spark.readStream.table("table_name")
- Incrementally processes new data files as they arrive in cloud storage.
- Provides a Structured Streaming source called cloudFiles.
-- Creates a Delta table
CREATE TABLE student (id INT, name STRING, age INT);
-- Use data from another table
CREATE TABLE student_copy AS SELECT * FROM student;
-- Creates a CSV table from an external directory
CREATE TABLE student USING CSV LOCATION '/mnt/csv_files';
-- Specify table comment and properties
CREATE TABLE student (id INT, name STRING, age INT)
COMMENT 'this is a comment'
TBLPROPERTIES ('foo'='bar');
-- Create partitioned table
CREATE TABLE student (id INT, name STRING, age INT)
PARTITIONED BY (age);
-- Create a table with a generated column
CREATE TABLE rectangles(a INT, b INT, area INT GENERATED ALWAYS AS (a * b));
CREATE FUNCTION convert_f_to_c(unit STRING, temp DOUBLE)
RETURNS DOUBLE
RETURN CASE
WHEN unit = "F" THEN (temp - 32) * (5/9)
ELSE temp
END;
SELECT convert_f_to_c(unit, temp) AS c_temp
FROM tv_temp;
- Demonstrating set operators using
number1
andnumber2
tables.
CREATE TEMPORARY VIEW number1(c) AS VALUES (3), (1), (2), (2), (3), (4);
CREATE TEMPORARY VIEW number2(c) AS VALUES (5), (1), (1), (2);
SELECT c FROM number1 EXCEPT SELECT c FROM number2;
SELECT c FROM number1 MINUS SELECT c FROM number2;
SELECT c FROM number1 EXCEPT ALL SELECT c FROM number2;
SELECT c FROM number1 MINUS ALL SELECT c FROM number2;
SELECT c FROM number1 INTERSECT SELECT c FROM number2;
SELECT c FROM number1 UNION SELECT c FROM number2;
SELECT filter(array(1, 2, 3, 4), i -> i % 2 == 0);
This SQL-like function filters an array to include only elements that satisfy a given condition. For example, the above query returns an array containing only the even numbers from the input array.
SELECT transform(array(1, 2, 3, 4), i -> i * 2);
This function transforms each element of an array according to a specified transformation rule. For instance, the above query doubles each element of the input array.
Supported alert destinations include:
- Slack
- Webhook
- MS Teams
- PagerDuty