
Databricks Overview

Cluster Types

All-purpose Clusters

  • Designed for collaborative use, such as ad hoc analysis, data exploration, and development.
  • Multiple users can share these clusters.
  • Cost-effective for tasks that benefit from resource sharing.
  • May not guarantee immediate availability due to resource sharing.

Job Clusters

  • Specifically for running automated jobs.
  • Terminate once the job is completed, reducing resource usage and cost.
  • Dedicated to a specific task and optimized for performance.
  • More suitable for workflows with strict Service Level Agreements (SLAs).

| Aspect | All-purpose Clusters | Job Clusters |
| --- | --- | --- |
| SLA requirements | Versatile but less predictable for strict SLAs | Dedicated and optimized for performance |
| Resource usage | Cost-effective for exploration and development tasks | Efficient for time-sensitive tasks, releasing resources promptly |
| Trade-offs | Flexibility, but potential delays due to resource sharing | Prioritizes specific jobs, but at higher resource cost |


Databricks Lakehouse

  • Combines the advantages of Data Lakes and Data Warehouses.
  • Stores data in Parquet format and transaction logs in JSON format.

Utilities

  • dbutils: Module for interacting with Databricks.
    • dbutils.help(): Lists available modules.
    • dbutils.fs.help(): Provides file system utilities, e.g., dbutils.fs.ls('path').

Database Commands

  • DESCRIBE DATABASE db_name; (or DESCRIBE SCHEMA db_name;): Returns metadata for a database, including its location.
  • spark.table("table_name"): Returns the named table as a DataFrame through the SparkSession.
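
A quick check, assuming a database named db_hr exists:

DESCRIBE DATABASE db_hr;
-- equivalently
DESCRIBE SCHEMA db_hr;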

(Diagram: Edit History)

Delta Lake

  • Builds upon standard data formats.
  • VACUUM command: Deletes unused data files older than a specified retention period (see the sketch after this list).
  • Supports ACID transactions and scalable metadata handling.
  • Delta tables store data in Parquet format with transaction logs in JSON format.
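
For example, assuming a Delta table named employees:

-- Remove files no longer referenced and older than the 7-day default retention
VACUUM employees;

-- The same period, stated explicitly
VACUUM employees RETAIN 168 HOURS;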

Databricks Repos

  • Supports Git operations such as pull, fetch, and branch management.
  • Useful for managing development work and versioning.

Data Exploration

  • dbfs:/user/hive/warehouse/db_hr.db: Default location for databases.
  • PIVOT: Transforms rows into columns for better readability (see the sketch after this list).
  • CREATE TABLE USING: Creates external tables from external data sources (e.g., CSV).
  • CREATE SCHEMA: Alias for CREATE DATABASE.
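
A minimal PIVOT sketch, assuming a hypothetical sales table with year, quarter, and revenue columns:

SELECT *
FROM (SELECT year, quarter, revenue FROM sales)
PIVOT (SUM(revenue) FOR quarter IN (1 AS q1, 2 AS q2, 3 AS q3, 4 AS q4));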

Spark Structured Streaming

  • trigger(availableNow=True): Runs the stream in batch mode and stops after processing available data.
  • Auto Loader: Tracks discovered files using checkpointing for exactly-once ingestion guarantees.
  • Default processing interval: trigger(processingTime="500ms").

| Aspect | Triggered Execution | Continuous Execution | Development | Production |
| --- | --- | --- | --- | --- |
| Execution trigger | Event-driven | Continuous data arrival | N/A | N/A |
| Processing characteristics | Batch or near-real-time | Real-time processing | Iterative development & testing | Stable, optimized execution |
| Environment | N/A | N/A | Interactive notebooks or development environments | Dedicated clusters optimized for performance |
| Resource allocation | N/A | N/A | Limited resources for experimentation and testing | Dedicated resources for stable and scalable execution |

Delta Live Tables (DLT)

  • Enables creating streaming live tables using the STREAM() function.
  • Use COMMENT "Contains PII" to indicate that a new table includes personally identifiable information (PII).
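
A sketch combining both points, assuming an upstream DLT table named orders_raw:

CREATE OR REFRESH STREAMING LIVE TABLE orders_silver
COMMENT "Contains PII"
AS SELECT * FROM STREAM(LIVE.orders_raw);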

Data Ingestion

  • COPY INTO: Suitable for ingesting thousands of files (see the sketch after this list).
  • Auto Loader: Preferred for ingesting millions of files or more and handles schema evolution.
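
A COPY INTO sketch, with hypothetical table and path names:

COPY INTO sales
FROM '/mnt/raw/sales'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true');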

Databricks Jobs

  • Orchestrates data processing tasks in a Directed Acyclic Graph (DAG).
  • Allows repairing failed multi-task jobs by rerunning only the unsuccessful tasks and their dependents.
  • MERGE command: Writes data into Delta tables while avoiding duplicate records.
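
A MERGE sketch that updates existing rows and inserts new ones, avoiding duplicates (customers and updates are hypothetical tables keyed by customer_id):

MERGE INTO customers AS t
USING updates AS s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;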

User-Defined Functions (UDF)

CREATE FUNCTION blue() RETURNS STRING COMMENT 'Blue color code' LANGUAGE SQL RETURN '0000FF';
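
Once created, the function can be called like any built-in:

SELECT blue();  -- returns '0000FF'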

Querying Data

  • spark.readStream: Reads streaming data.
  • SELECT * FROM file_format.`path`: Queries files directly by format and path, e.g., SELECT * FROM parquet.`/path/to/file`.
  • CREATE TABLE table_name DEEP CLONE source_table: Creates a deep clone of a table.
  • CREATE TABLE table_name SHALLOW CLONE source_table: Creates a shallow clone of a table.
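
For example (table names and the path are hypothetical):

SELECT * FROM json.`/mnt/raw/events/`;         -- query files directly by path
CREATE TABLE events_bkp DEEP CLONE events;     -- copies metadata and data files
CREATE TABLE events_dev SHALLOW CLONE events;  -- copies metadata only; data files are referenced, not copied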

Databricks SQL

  • Data Explorer: Manages data object permissions.
  • View types:
    • Standard view: CREATE VIEW view_name AS query
    • Temporary view: CREATE TEMP VIEW view_name AS query
    • Global temporary view: CREATE GLOBAL TEMP VIEW view_name AS query
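
Global temporary views are registered in the global_temp schema and must be queried through it; for example (orders is a hypothetical table):

CREATE GLOBAL TEMP VIEW recent_orders AS
  SELECT * FROM orders WHERE order_date > '2024-01-01';

SELECT * FROM global_temp.recent_orders;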

Medallion Architecture

  • Gold tables: Refined data suitable for business reporting (e.g., BI dashboards).
  • Silver tables: Cleansed and conformed data.
  • Bronze tables: Raw data with minimal transformation.
  • Raw data: Unprocessed data directly ingested from source systems.


Sample Query

  • The query below is in the Bronze layer, reading data from cloud storage into an uncleaned orders table.
(spark.readStream
    .format("cloudFiles")                     # Auto Loader source
    .option("cloudFiles.format", "json")      # raw files arrive as JSON
    .load(ordersLocation)                     # cloud storage input path
 .writeStream
    .option("checkpointLocation", checkpointPath)  # tracks progress for exactly-once processing
    .table("uncleanedOrders")                 # Bronze Delta table
)

(Screenshot: connecting to GitHub)

Parquet File Format

  • Parquet is a columnar storage file format, preferred for its efficiency in data storage and retrieval.
  • Learn more about Parquet.

Time Travel

  • DESCRIBE HISTORY employees: Retrieves the history and version details of a table, which can then be used for time-travel queries (see the example after this list).
  • Query nested data (e.g., JSON): SELECT customer_id, profile:name FROM customer.
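
For example, querying earlier versions of the hypothetical employees table:

SELECT * FROM employees VERSION AS OF 1;               -- by version number
SELECT * FROM employees TIMESTAMP AS OF '2024-01-01';  -- by timestamp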


(Diagram: garbage collection)

Querying Table Details

  • DESCRIBE DETAIL table_name: Returns detailed metadata for the given table, such as format, location, and size.

Triggers in Spark Structured Streaming

  • .trigger(once=True): Processes all available data in a single batch, then stops.
  • .trigger(availableNow=True): Processes all available data in multiple micro-batches, then stops.
  • .trigger(processingTime='60 minutes'): Runs a micro-batch every hour.

Managing Data Quality with DLT

  • Expectations apply data quality checks on each record passing through a query.
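
A sketch of an expectation that drops records failing a quality check (orders_raw and the constraint name are hypothetical):

CREATE OR REFRESH LIVE TABLE orders_valid (
  CONSTRAINT valid_order_id EXPECT (order_id IS NOT NULL) ON VIOLATION DROP ROW
)
AS SELECT * FROM LIVE.orders_raw;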

(Diagram: DLT data quality)

Reading Data from DLT

spark.readStream.table("table_name")

Auto Loader

  • Incrementally processes new data files as they arrive in cloud storage.
  • Provides a Structured Streaming source called cloudFiles.

Creating Tables in Databricks

-- Creates a Delta table
CREATE TABLE student (id INT, name STRING, age INT);

-- Use data from another table
CREATE TABLE student_copy AS SELECT * FROM student;

-- Creates a CSV table from an external directory
CREATE TABLE student USING CSV LOCATION '/mnt/csv_files';

-- Specify table comment and properties
CREATE TABLE student (id INT, name STRING, age INT)
    COMMENT 'this is a comment'
    TBLPROPERTIES ('foo'='bar');

-- Create partitioned table
CREATE TABLE student (id INT, name STRING, age INT)
    PARTITIONED BY (age);

-- Create a table with a generated column
CREATE TABLE rectangles(a INT, b INT, area INT GENERATED ALWAYS AS (a * b));

UDF Example

CREATE FUNCTION convert_f_to_c(unit STRING, temp DOUBLE)
RETURNS DOUBLE
RETURN CASE
  WHEN unit = "F" THEN (temp - 32) * (5/9)
  ELSE temp
END;

SELECT convert_f_to_c(unit, temp) AS c_temp
FROM tv_temp;

Set Operators

  • Demonstrating set operators using the number1 and number2 views (result rows shown in comments; output order is not guaranteed).

CREATE TEMPORARY VIEW number1(c) AS VALUES (3), (1), (2), (2), (3), (4);
CREATE TEMPORARY VIEW number2(c) AS VALUES (5), (1), (1), (2);

SELECT c FROM number1 EXCEPT SELECT c FROM number2;      -- 3, 4
SELECT c FROM number1 MINUS SELECT c FROM number2;       -- 3, 4 (MINUS is a synonym for EXCEPT)
SELECT c FROM number1 EXCEPT ALL SELECT c FROM number2;  -- 2, 3, 3, 4 (keeps leftover duplicates)
SELECT c FROM number1 MINUS ALL SELECT c FROM number2;   -- 2, 3, 3, 4
SELECT c FROM number1 INTERSECT SELECT c FROM number2;   -- 1, 2
SELECT c FROM number1 UNION SELECT c FROM number2;       -- 1, 2, 3, 4, 5 (duplicates removed)

Array Manipulation Functions

Filter Function

SELECT filter(array(1, 2, 3, 4), i -> i % 2 == 0);  -- returns [2, 4]

This higher-order function filters an array, keeping only the elements that satisfy the given lambda predicate. Here, the query returns an array containing only the even numbers from the input array.

Transform Function

SELECT transform(array(1, 2, 3, 4), i -> i * 2);  -- returns [2, 4, 6, 8]

This higher-order function applies a lambda to each element of an array. Here, the query doubles each element of the input array.


Alert Destinations

Supported alert destinations include:

  • Email
  • Slack
  • Webhook
  • MS Teams
  • PagerDuty