Merge pull request apache#78 from mesosphere/add-pyspark-documentation
Documented Python support and Spark shell.
susanxhuynh authored Oct 26, 2016
2 parents 6433570 + 5e09c7c commit d54418e
Showing 1 changed file with 48 additions and 3 deletions.
51 changes: 48 additions & 3 deletions docs/user-docs.md
@@ -18,6 +18,7 @@ DC/OS Spark includes:
* [Mesos Cluster Dispatcher][2]
* [Spark History Server][3]
* DC/OS Spark CLI
* Interactive Spark shell

## Benefits

@@ -59,6 +60,10 @@ dispatcher and the history server

$ dcos spark run --submit-args="--class org.apache.spark.examples.SparkPi http://downloads.mesosphere.com.s3.amazonaws.com/assets/spark/spark-examples_2.10-1.4.0-SNAPSHOT.jar 30"

1. Run a Python Spark job:

$ dcos spark run --submit-args="https://downloads.mesosphere.com/spark/examples/pi.py 30"

1. View your job:

Visit the Spark cluster dispatcher at
@@ -508,6 +513,10 @@ more][13].

$ dcos spark run --submit-args="--class MySampleClass http://external.website/mysparkapp.jar 30"

Or, for a Python job:

$ dcos spark run --submit-args="http://external.website/mysparkapp.py 30"

`dcos spark run` is a thin wrapper around the standard Spark
`spark-submit` script. You can submit arbitrary pass-through options
to this script via the `--submit-args` option.
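
For example, a sketch of passing a Spark configuration property straight through to `spark-submit` (the `spark.executor.memory` value here is illustrative):

$ dcos spark run --submit-args="--conf spark.executor.memory=4g --class MySampleClass http://external.website/mysparkapp.jar 30"
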
@@ -555,6 +564,42 @@ To set Spark properties with a configuration file, create a
`spark-defaults.conf` file and set the environment variable
`SPARK_CONF_DIR` to the containing directory. [Learn more][15].
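
A minimal sketch of such a file, with illustrative property values (the directory path is an assumption):

spark.executor.memory   4g
spark.eventLog.enabled  false

Then point `SPARK_CONF_DIR` at that directory before submitting jobs:

$ export SPARK_CONF_DIR=/path/to/conf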

<a name="pysparkshell"></a>
# Interactive Spark Shell

You can run Spark commands interactively in the Spark shell, which is
available in either Scala or Python.

1. SSH into a node in the DC/OS cluster. [Learn how to SSH into your cluster and get the agent node ID](https://dcos.io/docs/latest/administration/access-node/sshcluster/).

$ dcos node ssh --master-proxy --mesos-id=<agent-node-id>

1. Pull and run the Spark Docker image.

$ docker pull mesosphere/spark:1.0.4-2.0.1

$ docker run -it --net=host mesosphere/spark:1.0.4-2.0.1 /bin/bash

1. Run the Scala Spark shell from within the Docker image.

$ ./bin/spark-shell --master mesos://<internal-master-ip>:5050 --conf spark.mesos.executor.docker.image=mesosphere/spark:1.0.4-2.0.1 --conf spark.mesos.executor.home=/opt/spark/dist

Or, run the Python Spark shell.

$ ./bin/pyspark --master mesos://<internal-master-ip>:5050 --conf spark.mesos.executor.docker.image=mesosphere/spark:1.0.4-2.0.1 --conf spark.mesos.executor.home=/opt/spark/dist

1. Run Spark commands interactively.

In the Scala shell:

scala> val textFile = sc.textFile("/opt/spark/dist/README.md")
scala> textFile.count()

In the Python shell:

>>> textFile = sc.textFile("/opt/spark/dist/README.md")
>>> textFile.count()
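
A slightly longer sketch in the Scala shell, building on the `textFile` defined above (the filter predicate is illustrative; the count depends on the file's contents):

scala> val sparkLines = textFile.filter(line => line.contains("Spark"))
scala> sparkLines.count()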

<a name="uninstall"></a>
# Uninstall

@@ -628,14 +673,14 @@ output:
<a name="limitations"></a>
# Limitations

* DC/OS Spark only supports submitting jars. It does not support
Python or R.
* DC/OS Spark only supports submitting jars and Python scripts. It
does not support R.

* Spark jobs run in Docker containers. The first time you run a
Spark job on a node, it might take longer than you expect because of
the `docker pull`.

* Spark shell is not supported. For interactive analytics, we
* For interactive analytics, we
recommend Zeppelin, which supports visualizations and dynamic
dependency management.
