- Allow users to choose which .NET framework to build for
- Building through Visual Studio Code
- Building fully automatically through .NET Core CLI
If you already have all the pre-requisites, skip to the build steps below.
- Download and install the .NET Core 3.1 SDK. Installing the SDK will add the dotnet toolchain to your path.
- Install Visual Studio 2019 (version 16.4 or later). The Community version is completely free. When configuring your installation, include these components at minimum:
  - .NET desktop development
    - All Required Components
    - .NET Framework 4.6.1 Development Tools
  - .NET Core cross-platform development
    - All Required Components
- Install Java 1.8
  - Select the appropriate version for your operating system, e.g., jdk-8u201-windows-x64.exe for a Windows x64 machine.
  - Install using the installer and verify you are able to run java from your command line.
- Install Apache Maven 3.6.3+
  - Download Apache Maven 3.6.3.
  - Extract to a local directory, e.g., c:\bin\apache-maven-3.6.3\
  - Add Apache Maven to your PATH environment variable, e.g., c:\bin\apache-maven-3.6.3\bin
  - Verify you are able to run mvn from your command line.
- Install Apache Spark 2.3+
  - Download Apache Spark 2.3+ and extract it into a local folder (e.g., c:\bin\spark-2.3.2-bin-hadoop2.7\) using 7-zip.
  - Add Apache Spark to your PATH environment variable, e.g., c:\bin\spark-2.3.2-bin-hadoop2.7\bin
  - Add a new environment variable SPARK_HOME, e.g., C:\bin\spark-2.3.2-bin-hadoop2.7\
  - Verify you are able to run spark-shell from your command line. Sample console output:

        Welcome to
              ____              __
             / __/__  ___ _____/ /__
            _\ \/ _ \/ _ `/ __/  '_/
           /___/ .__/\_,_/_/ /_/\_\   version 2.3.2
              /_/

        Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_201)
        Type in expressions to have them evaluated.
        Type :help for more information.

        scala> sc
        res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@6eaa6b0c
- Install WinUtils
  - Download the winutils.exe binary from the WinUtils repository. Select the version of Hadoop that the Spark distribution was compiled with, e.g., use hadoop-2.7.1 for Spark 2.3.2.
  - Save the winutils.exe binary to a directory of your choice, e.g., c:\hadoop\bin
  - Set HADOOP_HOME to the directory that contains winutils.exe, without the trailing bin. For instance, using the command line: set HADOOP_HOME=c:\hadoop
  - Set the PATH environment variable to include %HADOOP_HOME%\bin. For instance, using the command line: set PATH=%HADOOP_HOME%\bin;%PATH%
Please make sure you are able to run dotnet, java, mvn, and spark-shell from your command line before you move to the next section. Feel there is a better way? Please open an issue and feel free to contribute.
Note: A new instance of the command-line may be required if any environment variables were updated.
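The check described above can also be scripted. The following is a minimal sketch in Python, which is not part of Spark .NET; the function names are hypothetical, and the tool and variable names are simply the ones listed in the steps above. It assumes Python 3 is available on the machine.

```python
import os
import shutil

# Tool names from the checklist above; variable names from the Spark and WinUtils steps.
REQUIRED_TOOLS = ["dotnet", "java", "mvn", "spark-shell"]
REQUIRED_ENV_VARS = ["SPARK_HOME", "HADOOP_HOME"]


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be resolved on PATH."""
    return [tool for tool in tools if which(tool) is None]


def missing_env_vars(names=REQUIRED_ENV_VARS, environ=None):
    """Return the environment variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    problems = missing_tools() + missing_env_vars()
    if problems:
        print("Missing pre-requisites: " + ", ".join(problems))
    else:
        print("All pre-requisites found.")
```

Run it from a fresh command-line instance so that recently updated environment variables are visible to the script.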
For the rest of this section, we assume that you have cloned the Spark .NET repo to your machine, e.g., c:\github\dotnet-spark\
git clone https://github.com/dotnet/spark.git c:\github\dotnet-spark
When you submit a .NET application, Spark .NET has the necessary logic written in Scala that informs Apache Spark how to handle your requests (e.g., a request to create a new Spark Session, or a request to transfer data from the .NET side to the JVM side). This logic can be found in the Spark .NET Scala Source Code.
Regardless of whether you are using .NET Framework or .NET Core, you will need to build the Spark .NET Scala extension layer. This is easy to do:
cd src\scala
mvn clean package
You should see JARs created for the supported Spark versions:
microsoft-spark-2-3\target\microsoft-spark-2-3_2.11-<version>.jar
microsoft-spark-2-4\target\microsoft-spark-2-4_2.11-<version>.jar
microsoft-spark-3-0\target\microsoft-spark-3-0_2.12-<version>.jar
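Which JAR you pass to spark-submit depends on the Spark version you installed. As a rough illustration, the sketch below (Python, not part of Spark .NET) maps a Spark version to the matching module directory and looks for the built JAR. The function names and the `scala_root` parameter are hypothetical; the directory layout and Scala suffixes are taken only from the three JAR names listed above.

```python
from pathlib import Path

# Scala version per Spark line, as they appear in the JAR names listed above.
SCALA_VERSION = {"2-3": "2.11", "2-4": "2.11", "3-0": "2.12"}


def module_key(spark_version):
    """Map a Spark version such as '2.3.2' to a module key such as '2-3'."""
    major, minor = spark_version.split(".")[:2]
    key = f"{major}-{minor}"
    if key not in SCALA_VERSION:
        raise ValueError(f"No JAR is listed for Spark {spark_version}")
    return key


def find_jars(scala_root, spark_version):
    """Return the built JARs for a Spark version under src\\scala (empty if none)."""
    key = module_key(spark_version)
    target = Path(scala_root) / f"microsoft-spark-{key}" / "target"
    pattern = f"microsoft-spark-{key}_{SCALA_VERSION[key]}-*.jar"
    return sorted(target.glob(pattern))
```

For example, `find_jars(r"c:\github\dotnet-spark\src\scala", "2.3.2")` would look under microsoft-spark-2-3\target, and it returns an empty list if `mvn clean package` has not been run yet.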
- Open src\csharp\Microsoft.Spark.sln in Visual Studio and build the Microsoft.Spark.CSharp.Examples project under the examples folder (this will in turn build the .NET bindings project as well). If you want, you can write your own code in the Microsoft.Spark.Examples project:

      // Instantiate a session
      var spark = SparkSession
          .Builder()
          .AppName("Hello Spark!")
          .GetOrCreate();

      var df = spark.Read().Json(args[0]);

      // Print schema
      df.PrintSchema();

      // Apply a filter and show results
      df.Filter(df["age"] > 21).Show();
Once the build is successful, you will see the appropriate binaries produced in the output directory.
Sample console output:

        Directory: C:\github\dotnet-spark\artifacts\bin\Microsoft.Spark.CSharp.Examples\Debug\net461

    Mode                LastWriteTime         Length Name
    ----                -------------         ------ ----
    -a----         3/6/2019  12:18 AM         125440 Apache.Arrow.dll
    -a----        3/16/2019  12:00 AM          13824 Microsoft.Spark.CSharp.Examples.exe
    -a----        3/16/2019  12:00 AM          19423 Microsoft.Spark.CSharp.Examples.exe.config
    -a----        3/16/2019  12:00 AM           2720 Microsoft.Spark.CSharp.Examples.pdb
    -a----        3/16/2019  12:00 AM         143360 Microsoft.Spark.dll
    -a----        3/16/2019  12:00 AM          63388 Microsoft.Spark.pdb
    -a----        3/16/2019  12:00 AM          34304 Microsoft.Spark.Worker.exe
    -a----        3/16/2019  12:00 AM          19423 Microsoft.Spark.Worker.exe.config
    -a----        3/16/2019  12:00 AM          11900 Microsoft.Spark.Worker.pdb
    -a----        3/16/2019  12:00 AM          23552 Microsoft.Spark.Worker.xml
    -a----        3/16/2019  12:00 AM         332363 Microsoft.Spark.xml
    ------------------------------------------- More framework files -------------------------------------
Note: We are currently working on automating .NET Core builds for Spark .NET. Until then, we appreciate your patience in performing some of the steps manually.
- Build the Worker

      cd C:\github\dotnet-spark\src\csharp\Microsoft.Spark.Worker\
      dotnet publish -f netcoreapp3.1 -r win-x64

  Sample console output:

      PS C:\github\dotnet-spark\src\csharp\Microsoft.Spark.Worker> dotnet publish -f netcoreapp3.1 -r win-x64
      Microsoft (R) Build Engine version 16.0.462+g62fb89029d for .NET Core
      Copyright (C) Microsoft Corporation. All rights reserved.

        Restore completed in 299.95 ms for C:\github\dotnet-spark\src\csharp\Microsoft.Spark\Microsoft.Spark.csproj.
        Restore completed in 306.62 ms for C:\github\dotnet-spark\src\csharp\Microsoft.Spark.Worker\Microsoft.Spark.Worker.csproj.
        Microsoft.Spark -> C:\github\dotnet-spark\artifacts\bin\Microsoft.Spark\Debug\netstandard2.0\Microsoft.Spark.dll
        Microsoft.Spark.Worker -> C:\github\dotnet-spark\artifacts\bin\Microsoft.Spark.Worker\Debug\netcoreapp3.1\win-x64\Microsoft.Spark.Worker.dll
        Microsoft.Spark.Worker -> C:\github\dotnet-spark\artifacts\bin\Microsoft.Spark.Worker\Debug\netcoreapp3.1\win-x64\publish\
- Build the Samples

      cd C:\github\dotnet-spark\examples\Microsoft.Spark.CSharp.Examples\
      dotnet publish -f netcoreapp3.1 -r win-x64

  Sample console output:

      PS C:\github\dotnet-spark\examples\Microsoft.Spark.CSharp.Examples> dotnet publish -f netcoreapp3.1 -r win-x64
      Microsoft (R) Build Engine version 16.0.462+g62fb89029d for .NET Core
      Copyright (C) Microsoft Corporation. All rights reserved.

        Restore completed in 44.22 ms for C:\github\dotnet-spark\src\csharp\Microsoft.Spark\Microsoft.Spark.csproj.
        Restore completed in 336.94 ms for C:\github\dotnet-spark\examples\Microsoft.Spark.CSharp.Examples\Microsoft.Spark.CSharp.Examples.csproj.
        Microsoft.Spark -> C:\github\dotnet-spark\artifacts\bin\Microsoft.Spark\Debug\netstandard2.0\Microsoft.Spark.dll
        Microsoft.Spark.CSharp.Examples -> C:\github\dotnet-spark\artifacts\bin\Microsoft.Spark.CSharp.Examples\Debug\netcoreapp3.1\win-x64\Microsoft.Spark.CSharp.Examples.dll
        Microsoft.Spark.CSharp.Examples -> C:\github\dotnet-spark\artifacts\bin\Microsoft.Spark.CSharp.Examples\Debug\netcoreapp3.1\win-x64\publish\
Once you build the samples, you run them through spark-submit regardless of whether you are targeting .NET Framework or .NET Core. Make sure you have followed the pre-requisites section and installed Apache Spark.
- Set the DOTNET_WORKER_DIR or PATH environment variable to include the path where the Microsoft.Spark.Worker binary has been generated (e.g., c:\github\dotnet-spark\artifacts\bin\Microsoft.Spark.Worker\Debug\net461 for .NET Framework, c:\github\dotnet-spark\artifacts\bin\Microsoft.Spark.Worker\Debug\netcoreapp3.1\win-x64\publish for .NET Core).
- Open PowerShell and go to the directory where your app binary has been generated (e.g., c:\github\dotnet-spark\artifacts\bin\Microsoft.Spark.CSharp.Examples\Debug\net461 for .NET Framework, c:\github\dotnet-spark\artifacts\bin\Microsoft.Spark.CSharp.Examples\Debug\netcoreapp3.1\win-x64\publish for .NET Core).
- Running your app follows the basic structure:

      spark-submit.cmd `
        [--jars <any-jars-your-app-is-dependent-on>] `
        --class org.apache.spark.deploy.dotnet.DotnetRunner `
        --master local `
        <path-to-microsoft-spark-jar> `
        <path-to-your-app-exe> <argument(s)-to-your-app>

Here are some examples you can run:
- Microsoft.Spark.Examples.Sql.Batch.Basic

      spark-submit.cmd `
        --class org.apache.spark.deploy.dotnet.DotnetRunner `
        --master local `
        C:\github\dotnet-spark\src\scala\microsoft-spark-<version>\target\microsoft-spark-<version>.jar `
        Microsoft.Spark.CSharp.Examples.exe Sql.Batch.Basic %SPARK_HOME%\examples\src\main\resources\people.json

- Microsoft.Spark.Examples.Sql.Streaming.StructuredNetworkWordCount

      spark-submit.cmd `
        --class org.apache.spark.deploy.dotnet.DotnetRunner `
        --master local `
        C:\github\dotnet-spark\src\scala\microsoft-spark-<version>\target\microsoft-spark-<version>.jar `
        Microsoft.Spark.CSharp.Examples.exe Sql.Streaming.StructuredNetworkWordCount localhost 9999

- Microsoft.Spark.Examples.Sql.Streaming.StructuredKafkaWordCount (maven accessible)

      spark-submit.cmd `
        --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.2 `
        --class org.apache.spark.deploy.dotnet.DotnetRunner `
        --master local `
        C:\github\dotnet-spark\src\scala\microsoft-spark-<version>\target\microsoft-spark-<version>.jar `
        Microsoft.Spark.CSharp.Examples.exe Sql.Streaming.StructuredKafkaWordCount localhost:9092 subscribe test

- Microsoft.Spark.Examples.Sql.Streaming.StructuredKafkaWordCount (jars provided)

      spark-submit.cmd `
        --jars path\to\net.jpountz.lz4\lz4-1.3.0.jar,path\to\org.apache.kafka\kafka-clients-0.10.0.1.jar,path\to\org.apache.spark\spark-sql-kafka-0-10_2.11-2.3.2.jar,path\to\org.slf4j\slf4j-api-1.7.6.jar,path\to\org.spark-project.spark\unused-1.0.0.jar,path\to\org.xerial.snappy\snappy-java-1.1.2.6.jar `
        --class org.apache.spark.deploy.dotnet.DotnetRunner `
        --master local `
        C:\github\dotnet-spark\src\scala\microsoft-spark-<version>\target\microsoft-spark-<version>.jar `
        Microsoft.Spark.CSharp.Examples.exe Sql.Streaming.StructuredKafkaWordCount localhost:9092 subscribe test
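All of the commands above share one shape: optional `--jars` or `--packages`, the fixed `--class` and `--master` options, the Spark .NET JAR, then the app and its arguments. The sketch below (Python, not part of Spark .NET) assembles that argument list. The function name and its parameters are hypothetical; only the flags and the class name come from the commands above.

```python
def spark_submit_args(jar, app, app_args=(), jars=(), packages=(), master="local"):
    """Assemble the spark-submit argument list in the order used above."""
    args = ["spark-submit.cmd"]
    if jars:
        # Multiple JARs are passed as one comma-separated value.
        args += ["--jars", ",".join(jars)]
    if packages:
        # Maven coordinates are also comma-separated.
        args += ["--packages", ",".join(packages)]
    args += [
        "--class", "org.apache.spark.deploy.dotnet.DotnetRunner",
        "--master", master,
        jar,
        app,
    ]
    args += list(app_args)
    return args


if __name__ == "__main__":
    # Placeholder paths, kept in the same elided form as the examples above.
    print(" ".join(spark_submit_args(
        r"<path-to-microsoft-spark-jar>",
        "Microsoft.Spark.CSharp.Examples.exe",
        ["Sql.Streaming.StructuredNetworkWordCount", "localhost", "9999"],
    )))
```

Building the command as a list keeps each argument separate, so paths containing spaces do not need manual quoting if the list is later handed to a process launcher.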
Feel this experience is complicated? Help us by taking up Simplify User Experience for Running an App