Spark Snowflake Profiling

Just because we write a job in Spark doesn't mean it is automatically parallelized and distributed.

The code in this benchmark repository runs 4 implementations of a Spark job submitted to a local docker-composed Spark cluster. Each implementation queries data from Snowflake in a different way, providing a benchmark for identifying the best method.

Assessment Criteria

The best way to understand whether your Spark job is optimal (i.e. parallel and distributed) is to check the Spark UI of the driver node.

We want our Spark jobs to look like this (parallel and distributed):

Instead of our Spark jobs looking like this (non-parallel and non-distributed):

Sometimes, you may end up with a Spark job that looks like this (unevenly parallel and distributed), which is just as bad as the non-parallel and non-distributed case.

TLDR

Using the Snowflake Spark Driver without setting any special parameters results in a parallel data pull that is far superior to the Snowflake JDBC Driver even with its parallelization parameters set. This is the best method.
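
For reference, below is a minimal PySpark sketch of this approach, assuming the Snowflake Spark connector and Snowflake JDBC driver jars are on the classpath. The placeholder values mirror snowflake-job/snowflake_config.json and are illustrative, not taken from this repository's actual job code.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("snowflake-spark-driver-pull").getOrCreate()

# Connection options for the Snowflake Spark connector.
sf_options = {
    "sfURL": "<snowflake_instance>.<aws_region>.snowflakecomputing.com:443",
    "sfUser": "<snowflake_username>",
    "sfPassword": "<snowflake_password>",
    "sfDatabase": "<snowflake_db>",
    "sfSchema": "<snowflake_schema>",
    "sfWarehouse": "<snowflake_warehouse>",
    "sfRole": "<snowflake_role>",
}

# No partitioning parameters are set; the connector unloads the result set in
# chunks and the Spark workers read those chunks in parallel.
df = (
    spark.read.format("net.snowflake.spark.snowflake")
    .options(**sf_options)
    .option("dbtable", "<snowflake_table>")
    .load()
)
df.count()  # Force the pull so it shows up in the Spark UI.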

Methods and Findings

Snowflake Spark Driver - Sequential Pull

Snowflake Spark Driver - Parallel Pull

Snowflake JDBC Driver - Sequential Pull

  • Based on Example
  • Results
    • As expected, only a single Spark worker pulled from Snowflake.
    • No parallelization occurred, and this was the slowest method.
  • Assessment
    • The workload was neither parallel nor distributed (see the sketch below).
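
A minimal PySpark sketch of a sequential JDBC pull of this kind is shown below; it assumes the Snowflake JDBC driver jar is on the classpath, and the URL, credentials, and table name are placeholders rather than values from this repository.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("snowflake-jdbc-sequential-pull").getOrCreate()

# Without partitioning options, Spark opens a single JDBC connection,
# so one worker pulls the entire result set.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:snowflake://<snowflake_instance>.<aws_region>.snowflakecomputing.com/"
                   "?db=<snowflake_db>&schema=<snowflake_schema>&warehouse=<snowflake_warehouse>")
    .option("driver", "net.snowflake.client.jdbc.SnowflakeDriver")
    .option("user", "<snowflake_username>")
    .option("password", "<snowflake_password>")
    .option("dbtable", "db.schema.table")
    .load()
)
df.count()  # Force the pull so it shows up in the Spark UI.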

Snowflake JDBC Driver - Parallel Pull

  • Based on Example
  • Results
    • As expected, this resulted in a parallel data pull using multiple Spark workers.
    • The tasks were spread over several workers, for example
      • Worker 1: select * from db.schema.table where key >= 0 and key < 1000000
      • Worker 2: select * from db.schema.table where key >= 1000000 and key < 2000000
      • Worker 3: select * from db.schema.table where key >= 2000000 and key < 3000000
    • Unexpectedly, the test showed that this data pull took as long as the sequential pull for this particular dataset and query.
      • The distribution of values in the index column did not allow the load to be spread evenly across the partitioned pulls, especially when using the "limit" clause.
      • Thus, even though the tasks were spread over many workers, some tasks covered only a small subset of the data, while one task downloaded nearly all of it.
  • Assessment
    • The workload was unevenly parallel but was indeed distributed (see the sketch below).
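
A minimal PySpark sketch of a partitioned JDBC pull like the one above follows; it assumes the Snowflake JDBC driver jar is on the classpath, and the column name "key" and the bounds simply mirror the example worker queries listed in the results, not this repository's actual Makefile query.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("snowflake-jdbc-parallel-pull").getOrCreate()

# partitionColumn/lowerBound/upperBound/numPartitions make Spark split the read
# into range-bounded queries over the key column, similar to the worker queries
# listed above, with one query issued per partition.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:snowflake://<snowflake_instance>.<aws_region>.snowflakecomputing.com/"
                   "?db=<snowflake_db>&schema=<snowflake_schema>&warehouse=<snowflake_warehouse>")
    .option("driver", "net.snowflake.client.jdbc.SnowflakeDriver")
    .option("user", "<snowflake_username>")
    .option("password", "<snowflake_password>")
    .option("dbtable", "db.schema.table")
    .option("partitionColumn", "key")
    .option("lowerBound", "0")
    .option("upperBound", "3000000")
    .option("numPartitions", "3")
    .load()
)
df.count()  # Force the pull so it shows up in the Spark UI.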

Configuring the benchmarks

  • Create the file snowflake-job/snowflake_config.json
{
  "username": "<snowflake_username>",
  "password": "<snowflake_password>",
  "database": "<snowflake_db>",
  "role": "<snowflake_role>",
  "schema": "<snowflake_schema>",
  "url": "<snowflake_instance>.<aws_region>.snowflakecomputing.com:443",
  "warehouse": "<snowflake_warehouse>",
  "table": "<showflake>table"
}
  • Edit the query in Makefile
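
As a sketch of how these values might be consumed, the snippet below loads snowflake-job/snowflake_config.json and maps its keys onto Snowflake Spark connector options; the exact mapping used by this repository's job code may differ.

import json

# Load the benchmark configuration created above.
with open("snowflake-job/snowflake_config.json") as f:
    cfg = json.load(f)

# Map the config keys onto Snowflake Spark connector option names.
sf_options = {
    "sfURL": cfg["url"],
    "sfUser": cfg["username"],
    "sfPassword": cfg["password"],
    "sfDatabase": cfg["database"],
    "sfSchema": cfg["schema"],
    "sfWarehouse": cfg["warehouse"],
    "sfRole": cfg["role"],
}
table = cfg["table"]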

Executing the benchmarks

# Start the docker cluster
make spark-start

# Run all of the pull mechanisms
make snowflake all

# While running, view the spark-client Spark UI at http://localhost:8081

# Check the logs to see how long each pull took.
tail logs/*

Useful Links