The purpose of this lab is to create and build a Spark project using Scala and SBT. By following these steps, you’ll learn how to set up the project, understand its structure, and use SBT commands to compile and package your application.
### **Step 1: Create the Project**

- **Create a New Project**:
  - Follow the tutorial at https://github.com/osekoo/spark-scala.g8 to generate a new Spark project using the Giter8 template engine.
  - Run the following command:
    ```bash
    sbt new osekoo/spark-scala.g8
    ```
  - Provide the required details, such as the project name (`wordcount`), Scala version (`2.12.18`), and Spark version (`3.5.2`).
- **Open the Project in an IDE**:
  - Use your favorite IDE (e.g., Visual Studio Code or IntelliJ IDEA) to open the project as described in the tutorial. This will make editing and managing your code easier.
### **Step 2: Project Structure**

The project template generates the following files and folders:

#### **1. `build.sbt`**
- **Purpose**: The main configuration file for the SBT build tool.
- **Content**:
  - Project settings (name, version, Scala version).
  - Dependencies for Spark and other libraries.
- **Example**:

```scala
name := "wordcount"

version := "0.1"

scalaVersion := "2.12.18"

val sparkVersion = "3.5.2"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion,
  "org.apache.spark" %% "spark-mllib" % sparkVersion % "provided"
)
```
#### **2. `MainApp.scala`**
- **Purpose**: The main entry point of your Spark application.
- **Content**:
  - A basic Spark job that processes data.
- **Example**:
```scala
import org.apache.spark.sql.SparkSession

object MainApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("Word Count")
      .getOrCreate() // create a Spark session

    spark.sparkContext.setLogLevel("ERROR") // set the log level to ERROR (possible values: ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, WARN)

    val size = args(0).toInt // number of lines to generate
    wordCount(spark, size)   // call the wordCount function
    spark.stop()             // stop the Spark session
  }

  private def wordCount(spark: SparkSession, size: Int): Unit = {
    val fruits = Seq("apple", "banana", "carrot", "orange", "kiwi", "melon", "pineapple") // list of fruits
    val colors = Seq("red", "yellow", "orange", "green", "brown", "blue", "purple")       // list of colors

    // build `size` lines, each made of 5 to 14 randomly colored fruits joined by commas
    val data = (1 to size).map(_ =>
      (1 to scala.util.Random.nextInt(10) + 5)
        .map(_ => s"${colors(scala.util.Random.nextInt(colors.length))} ${fruits(scala.util.Random.nextInt(fruits.length))}")
        .mkString(", ")
    )

    // print the first 10 items of the dataset
    println("\n============================ Dataset ============================")
    data.take(10).foreach(println)
    println("=================================================================\n")

    // create an RDD from the dataset
    val rdd = spark.sparkContext.parallelize(data)
    val wordCounts = rdd
      .flatMap(line => line.split("[ ,]"))  // split each line into words
      .filter(word => word.nonEmpty)        // filter out empty words
      .map(word => (word, 1))               // create a tuple of (word, 1)
      .reduceByKey((a, b) => a + b)         // sum the counts
      .sortBy(a => a._2, ascending = false) // sort by count in descending order

    println("\n============================ Word count result ============================")
    wordCounts.collect().foreach(println) // print the result
    println("===========================================================================\n")
  }
}
```
#### **3. `spark-env`**
- **Purpose**: A script to start the local Spark cluster.
- **Content**:
- Starts Spark master and worker processes.
- **Usage**: Run it to start the Spark cluster (a rough sketch of what such a script does is shown below).
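The exact contents depend on the template version (it may, for example, rely on Docker instead), so the following is only a hedged sketch of what starting a local standalone cluster typically involves, assuming `SPARK_HOME` points to a local Spark installation:

```bash
#!/usr/bin/env bash
# Rough sketch only -- the template's generated spark-env script may differ.
# Start a standalone master (web UI on port 8080) and one worker attached to it.
"$SPARK_HOME/sbin/start-master.sh" --host localhost --port 7077 --webui-port 8080
"$SPARK_HOME/sbin/start-worker.sh" spark://localhost:7077
```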
#### **4. `spark-submit-job`**
- **Purpose**: A script to submit and run the Spark job.
- **Usage**: Run it to submit the Spark job and execute it on the local cluster; it is essentially a wrapper around `spark-submit` (see the sketch below).
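The generated script may differ in its exact options (these are covered in detail in the `spark-submit` section later in this lab), but it boils down to something like:

```bash
#!/usr/bin/env bash
# Rough sketch only -- the template's generated spark-submit-job script may differ.
# Submit the packaged word-count JAR to the local standalone cluster,
# forwarding any extra arguments (e.g. the number of lines to generate).
spark-submit \
  --master "spark://localhost:7077" \
  --class "MainApp" \
  "target/scala-2.12/wordcount_2.12-0.1.jar" \
  "$@"
```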
### **Step 3: SBT Commands**
SBT (Scala Build Tool) is essential for managing your project. Below are the key SBT commands you’ll use, followed by a typical build sequence for this lab's project:
1. **`sbt compile`**:
- Compiles the Scala source files.
- Checks for syntax errors and ensures dependencies are correctly resolved.
2. **`sbt package`**:
- Packages your project into a JAR file.
- The resulting JAR is located in the `target/scala-<version>/` directory.
3. **`sbt assembly`** (Optional):
- Creates a "fat JAR" containing all dependencies.
- Useful for deploying the application to a cluster.
4. **`sbt clean`**:
- Removes previously compiled files and artifacts.
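For the `wordcount` project generated above, a typical build cycle looks like the following (the JAR name assumes the project name and versions chosen during project creation):

```bash
sbt clean              # remove previously compiled files and artifacts
sbt compile            # compile the Scala sources and resolve dependencies
sbt package            # produce the application JAR
ls target/scala-2.12/  # the JAR appears here as wordcount_2.12-0.1.jar
# sbt assembly (optional) additionally requires the sbt-assembly plugin to be enabled in the project.
```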
### **Step 4: Run the Application**
#### **1. Package the Application**
Run the following command to compile and package the project:
```bash
sbt package
```

#### **2. Start the Spark Cluster**
Start the cluster using the `spark-env` script:
```bash
./spark-env
```

#### **3. Submit the Job**
From the Spark environment, run the application using the `spark-submit-job` script:
```bash
./spark-submit-job.sh
```

#### **4. Monitor the Application**
While the application is running, open the Spark UI to monitor job execution:
- URL: http://localhost:8080
- The UI shows details about stages, tasks, and executors.

#### **5. Stop the Cluster**
After the application run completes, stop the Spark cluster:
```bash
exit
```
### **About `spark-submit`**

`spark-submit` is a command-line tool provided by Apache Spark to submit and run Spark applications on the various cluster managers Spark supports (e.g., standalone, YARN, Mesos, Kubernetes) or locally on your machine.
It acts as the entry point for executing your compiled Spark application JAR, passing it configuration parameters, application-specific arguments, and resource details. In short, `spark-submit`:

- Reads the application JAR: loads the JAR file containing your application code.
- Submits the job to the Spark cluster: sends the application to the Spark cluster (or local Spark environment) based on the specified `--master` configuration.
- Allocates resources: determines how many cores, how much memory, and how many executors are needed.
- Runs the application: executes your Spark code and provides logs for debugging or monitoring.

The basic syntax for `spark-submit` is:
```bash
spark-submit [options] <application-jar> [application-arguments]
```
The most commonly used options are:

- **Cluster Manager**
  - `--master`: Specifies where to run the application (e.g., local, standalone cluster, or YARN). Examples:
    - `--master local[*]`: Run locally with all available cores.
    - `--master spark://<master-host>:7077`: Submit to a standalone cluster.
- **Deploy Mode**
  - `--deploy-mode`: Defines how the driver runs.
    - `client`: Driver runs on the submitting machine.
    - `cluster`: Driver runs on the cluster (for non-local clusters).
- **Application Resources**
  - `--num-executors`: Number of executors to use.
  - `--executor-cores`: Number of cores per executor.
  - `--executor-memory`: Memory allocated per executor (e.g., `2G`).
- **Application Main Class**
  - `--class`: Specifies the main class of your application.
- **Files and Dependencies**
  - `--jars`: Additional JAR files required by the application.
  - `--files`: Files to distribute to executors.
  - `--packages`: Maven coordinates of additional dependencies.
- **Logging and Debugging**
  - `--verbose`: Displays detailed output.
  - `--conf`: Sets custom Spark configurations (e.g., `spark.executor.extraJavaOptions`).
For example, the following command submits the word-count application to the local standalone cluster:

```bash
spark-submit \
  --deploy-mode client \
  --master "spark://localhost:7077" \
  --executor-cores 4 \
  --executor-memory 2G \
  --num-executors 4 \
  --class "MainApp" \
  "target/scala-2.12/wordcount_2.12-0.1.jar" \
  1000
```

- Submits the application in client mode to the standalone cluster at `spark://localhost:7077`, requesting 4 executors with 4 cores and 2 GB of memory each.
- The main class is `MainApp`.
- The JAR file is located at `target/scala-2.12/wordcount_2.12-0.1.jar`.
- The trailing `1000` is the application argument, which `MainApp` reads as the number of lines to generate.
Another example, distributing an input file and an extra library to the executors:

```bash
spark-submit \
  --master spark://spark-master:7077 \
  --class MainApp \
  --files data/input.csv \
  --jars external-lib.jar \
  target/scala-2.12/mysparkapp_2.12-0.1.jar
```

- Distributes `data/input.csv` to the executors.
- Includes `external-lib.jar` as an additional dependency.
When to use which tool:

- **Development**: Use `sbt` or IDE tools to run locally during development.
- **Testing**: Use `spark-submit` with `--master local` for testing (see the example below).
- **Production**: Use `spark-submit` to deploy applications to standalone clusters or resource managers like YARN or Kubernetes.
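For instance, a quick local test of the packaged word-count job might look like this; the trailing `1000` is an illustrative application argument (the number of lines `MainApp` generates):

```bash
# Local test run: no cluster needed; local[*] uses all available cores.
spark-submit \
  --master "local[*]" \
  --class MainApp \
  target/scala-2.12/wordcount_2.12-0.1.jar \
  1000
```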
In this lab, the `spark-submit-job` script internally uses `spark-submit` to start the Spark application; you can view or customize the `spark-submit` command directly in that script.
This demonstrates how `spark-submit` is essential for running Spark applications in various environments, offering flexibility and control over resource allocation and execution.
### **Questions**

- **Project Structure**:
  - What is the purpose of the `build.sbt` file?
  - How does `MainApp.scala` interact with Spark?
- **SBT Commands**:
  - What is the difference between `sbt package` and `sbt assembly`?
- **Spark Monitoring**:
  - What information does the Spark UI provide?
  - How can you optimize your Spark job based on Spark UI metrics?
- **Scripts**:
  - What does the `spark-env` script automate?
  - What does the `spark-submit-job` script automate?
By completing this lab, you will:
- Set up and build a Spark project using Scala and SBT.
- Understand the project structure and key files.
- Use SBT commands to compile and package your Spark application.
- Run and monitor your application using Spark's built-in tools.
Feel free to ask questions or explore additional details to enhance your understanding!