Sgx-Spark

This is Apache Spark with modifications to run security-sensitive code inside Intel SGX enclaves. The implementation leverages sgx-lkl, a library OS that allows Java-based applications to run inside SGX enclaves.

Docker quick start

This guide shows how to run Sgx-Spark in a few simple steps using Docker. Most of the setup and deployment is wrapped in Docker containers, so compilation and deployment should be straightforward.

Preparing the Sgx-Spark Docker environment

  • Clone this Sgx-Spark repository

  • Build the Sgx-Spark base image. The name of the resulting Docker image is sgxspark. This process might take a while (30-60 minutes):

      sgx-spark/dockerfiles$ docker build -t sgxspark .
    
  • Prepare the disk image that will be required by sgx-lkl. Due to restrictions of Docker, this step currently cannot be performed as part of the above Docker build; it runs on the host and is therefore platform-dependent. The process has been successfully tested on Ubuntu 16.04 and Arch Linux:

      sgx-spark/lkl$ make prepare-image
    
  • Create a Docker network that will be used for communication between the Docker containers. Note that for a user-defined network, Docker provides an embedded DNS server, so workers can find the Spark master by name.

      sgx-spark$ docker network create sgxsparknet
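
    Optionally, verify that the network was created (standard Docker CLI, nothing Sgx-Spark-specific):

      sgx-spark$ docker network inspect sgxsparknet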
    

Running Sgx-Spark jobs using Docker

From within the directory sgx-spark/dockerfiles, run the Sgx-Spark master node, the Sgx-Spark worker node, and the actual Sgx-Spark job as follows.

  • Run the Sgx-Spark master node:

      sgx-spark/dockerfiles$ docker run \
      --user user \
      --env-file $(pwd)/docker-env \
      --net sgxsparknet \
      --name sgxspark-docker-master \
      -p 7077:7077 \
      -p 8082:8082 \
      -ti sgxspark /sgx-spark/master.sh
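
    The master runs in the foreground of this terminal. To follow its log output from a second shell, standard Docker commands suffice:

      sgx-spark/dockerfiles$ docker logs -f sgxspark-docker-master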
    
  • Run the Sgx-Spark worker node:

      sgx-spark/dockerfiles$ docker run \
      --user user \
      --memory="4g" \
      --shm-size="8g" \
      --env-file $(pwd)/docker-env \
      --net sgxsparknet \
      --privileged \
      -v $(pwd)/../lkl:/spark-image:ro \
      -ti sgxspark /sgx-spark/worker-and-enclave.sh
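
    To confirm that both the master and the worker container are up, you can list the running containers started from the sgxspark image in another shell:

      sgx-spark/dockerfiles$ docker ps --filter ancestor=sgxspark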
    
  • Run the Sgx-Spark job as follows.

    At the time of writing, the three jobs below are known to be fully supported:

    • WordCount

        sgx-spark/dockerfiles$ docker run \
        --user user \
        --memory="4g" \
        --shm-size="8g" \
        --env-file $(pwd)/docker-env \
        --net sgxsparknet \
        --privileged \
        -v $(pwd)/../lkl:/spark-image:ro \
        -e SPARK_JOB_CLASS=org.apache.spark.examples.MyWordCount \
        -e SPARK_JOB_NAME=WordCount \
        -e SPARK_JOB_ARG0=README.md \
        -e SPARK_JOB_ARG1=output \
        -ti sgxspark /sgx-spark/driver-and-enclave.sh
      
    • KMeans

        sgx-spark/dockerfiles$ docker run \
        --user user \
        --memory="4g" \
        --shm-size="8g" \
        --env-file $(pwd)/docker-env \
        --net sgxsparknet \
        --privileged \
        -v $(pwd)/../lkl:/spark-image:ro \
        -e SPARK_JOB_CLASS=org.apache.spark.examples.mllib.KMeansExample \
        -e SPARK_JOB_NAME=KMeans \
        -e SPARK_JOB_ARG0=data/mllib/kmeans_data.txt \
        -ti sgxspark /sgx-spark/driver-and-enclave.sh
      
    • LineCount

        sgx-spark/dockerfiles$ docker run \
        --user user \
        --memory="4g" \
        --shm-size="8g" \
        --env-file $(pwd)/docker-env \
        --net sgxsparknet \
        --privileged \
        -v $(pwd)/../lkl:/spark-image:ro \
        -e SPARK_JOB_CLASS=org.apache.spark.examples.LineCount \
        -e SPARK_JOB_NAME=LineCount \
        -e SPARK_JOB_ARG0=SgxREADME.md \
        -ti sgxspark /sgx-spark/driver-and-enclave.sh
      

Native compilation, installation and deployment

To run Sgx-Spark natively, proceed as follows.

Install package dependencies

Install all required package dependencies. On Ubuntu 16.04, they can be installed as follows:

$ sudo apt-get update
$ sudo apt-get install -y --no-install-recommends scala libtool autoconf curl xutils-dev git build-essential maven openjdk-8-jdk ssh bc python autogen wget autotools-dev sudo automake
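
The list above installs OpenJDK 8, which Sgx-Spark is built against. Before building, you can confirm that this JDK is the active one (the exact output format varies by distribution):

$ java -version   # should report a 1.8.x JDK
$ mvn -version    # shows which JDK Maven will use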

Compile and install Google Protocol Buffer 2.5.0

Hadoop, and thus Spark, depends on version 2.5.0 of Google Protocol Buffers (GPB):

  • Make sure to uninstall any other versions of GPB

  • Install GPB v2.5.0. Instructions for Ubuntu 16.04 are as follows:

      $ cd /tmp
      /tmp$ wget https://github.com/google/protobuf/releases/download/v2.5.0/protobuf-2.5.0.tar.gz
      /tmp$ tar xvf protobuf-2.5.0.tar.gz
      /tmp$ cd protobuf-2.5.0
      /tmp/protobuf-2.5.0$ ./autogen.sh && ./configure && make && sudo make install
      /tmp/protobuf-2.5.0$ sudo apt-get install -y --no-install-recommends libprotoc-dev
    

    Instructions for Arch Linux are available at https://stackoverflow.com/a/29799354/2273470.
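
After installation, you can confirm that the expected version is picked up from the PATH:

$ protoc --version   # should print: libprotoc 2.5.0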

Compile sgx-lkl

Since Sgx-Spark builds on sgx-lkl, sgx-lkl must be downloaded and compiled before Sgx-Spark can be used. At the time of writing (June 14, 2018), sgx-lkl should be compiled from the cleanup-musl branch (see the sketch below). Please follow the sgx-lkl documentation and make sure that your installation of sgx-lkl can run simple Java applications before proceeding.
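
For example, the checkout could look as follows; the clone URL and the exact build invocation are deliberately left as placeholders, so consult the sgx-lkl documentation for the authoritative steps:

$ git clone <sgx-lkl-repository-url> sgx-lkl   # URL: see the sgx-lkl documentation
$ cd sgx-lkl
$ git checkout cleanup-musl
$ # build and test sgx-lkl as described in its own documentation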

Compile Sgx-Spark

Build the Sgx-Spark sources:

sgx-spark$ build/mvn -DskipTests package

  • As part of this compilation process, a modified Hadoop library is also built. Copy the Hadoop JAR file into the Sgx-Spark jars directory:

      sgx-spark$ cp hadoop-2.6.5-src/hadoop-common-project/hadoop-common/target/hadoop-common-2.6.5.jar assembly/target/scala-2.11/jars/
    
  • Sgx-Spark ships with a native C library (libringbuff.so) that enables shared-memory-based communication between two JVMs. Compile and install it as follows:

      sgx-spark/C$ make install
    

Prepare the Sgx-Spark disk images that will be run using sgx-lkl

  • Adjust the file sgx-spark/lkl/Makefile for your environment:

    The variable SGX_LKL must point to your sgx-lkl directory (see "Compile sgx-lkl" above).

  • Build the Sgx-Spark disk image required for sgx-lkl:

      sgx-spark/lkl$ make clean all
    

Run Sgx-Spark using sgx-lkl

Finally, we are ready to run (i) the Sgx-Spark master node, (ii) the Sgx-Spark worker node, (iii) the worker's enclave, (iv) the Sgx-Spark client, and (v) the client's enclave. In the following commands, replace <hostname> with the master node's actual hostname and <sgx-lkl> with the path to your sgx-lkl installation.

Note: After running each example, make sure to (i) restart all processes and (ii) delete all shared memory files in /dev/shm.

  • If you run all the nodes locally, you need to add the following line to variables.sh:

      export SPARK_LOCAL_IP=127.0.0.1
    
  • Run the Master node

      sgx-spark$ ./master.sh
    
  • Run the Worker node

      sgx-spark$ ./worker.sh
    
  • Run the enclave for the Worker node

      sgx-spark$ ./worker-enclave.sh
    
  • Run the enclave for the driver program. This is the process that will output the job results!

      sgx-spark$ ./driver-enclave.sh
    
  • Finally, submit a Spark job. The result will be printed by the driver-enclave process started in the previous step.

    • WordCount

        sgx-spark$ ./submitwordcount.sh
      
    • KMeans

        sgx-spark$ ./submitkmeans.sh
      
    • LineCount

        sgx-spark$ ./submitlinecount.sh
      

Native execution of the same Spark installation

To run the above installation without SGX, start your environment as follows:

  • Start the Master node as above

  • Start the Worker node as above, but change the environment variable SGX_ENABLED from true to false

  • Do not start the enclaves

  • Submit the Spark job as above, but again change the environment variable SGX_ENABLED from true to false (see the sketch below)
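
A minimal sketch of such a non-SGX run, assuming the launch scripts pick up SGX_ENABLED from the calling environment (if the variable is instead hard-coded in variables.sh or in the scripts themselves, change it there):

# run each command in its own terminal, as for the SGX-enabled setup above;
# no *-enclave.sh processes are started in this mode
sgx-spark$ ./master.sh
sgx-spark$ SGX_ENABLED=false ./worker.sh
sgx-spark$ SGX_ENABLED=false ./submitwordcount.sh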

Important developer notes

Code changes and recompilation

There are a few important things to keep in mind when developing Sgx-Spark:

  • Whenever you change parts of the code, you must recompile the Spark code:

      sgx-spark$ mvn package -DskipTests
    

    In some (not clearly reproducible) situations, the above command does not recompile all of the changed files. In that case, issue:

      sgx-spark$ mvn clean package -DskipTests
    
  • After making changes to the Sgx-Spark code and recompiling the Java/Scala code (see above), you always need to rebuild the disk image in lkl/ that will be used by sgx-lkl:

      sgx-spark/lkl$ make clean all
    
  • If you changed parts of the Hadoop code (in directory hadoop-2.6.5-src), you will also need to copy the resulting JAR file:

      sgx-spark$ cp hadoop-2.6.5-src/hadoop-common-project/hadoop-common/target/hadoop-common-2.6.5.jar assembly/target/scala-2.11/jars/
    
  • Lastly, do not forget to remove all related shared memory files in /dev/shm/ before running your next experiment!
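
    For example (the exact file names depend on your configuration, so inspect the directory before deleting anything):

      $ ls /dev/shm/
      $ rm /dev/shm/<shared memory files created by the previous Sgx-Spark run>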

Running without sgx-lkl

Development with sgx-lkl can be tedious. For development purposes, a special flag allows the enclave side of Sgx-Spark to run in a regular JVM rather than on top of sgx-lkl. To make use of this feature, run the enclave JVMs using the scripts worker-enclave-nosgx.sh and driver-enclave-nosgx.sh.

Under the hood, these scripts set the environment variable DEBUG_IS_ENCLAVE_REAL=false (the default is true) and provide the JVM with a value for the environment variable SGXLKL_SHMEM_FILE. Note that the value of SGXLKL_SHMEM_FILE must be the same as the one provided to the corresponding Worker (worker.sh) or Driver (driver.sh).
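
For illustration, assuming the scripts honour SGXLKL_SHMEM_FILE from the calling environment, a worker-side pair could be started as follows. The path below is purely a hypothetical example; any value works as long as both sides of a pair use the same one:

# run in separate terminals
sgx-spark$ SGXLKL_SHMEM_FILE=/dev/shm/sgx-spark-worker ./worker.sh
sgx-spark$ SGXLKL_SHMEM_FILE=/dev/shm/sgx-spark-worker ./worker-enclave-nosgx.sh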