
Commit

Merge pull request apache#331 from holdenk/master
Add a script to download sbt if not present on the system

As per the discussion on the dev mailing list, this script uses the system sbt if present and otherwise attempts to download the sbt launcher. If the download fails, a fallback error message instructs the user to install sbt manually. While the URLs it fetches from aren't controlled by the Spark project directly, they are stable and are the current authoritative sources.
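
In outline, the script follows the pattern sketched below (a condensed illustration only; `LAUNCHER_URL` and the jar path are placeholders here, and the full `sbt/sbt` script with the actual repository URLs appears in the diff further down):

    #!/bin/bash
    # Sketch of the fallback pattern: prefer a system-wide sbt; otherwise
    # fetch the sbt launcher jar once and run it with java.
    JAR=sbt/sbt-launch.jar

    if hash sbt 2>/dev/null; then
      # A system-wide sbt exists; simply delegate to it.
      sbt "$@"
    else
      # No system sbt: download the launcher if it is not already cached.
      if [ ! -f "$JAR" ]; then
        curl --progress-bar "$LAUNCHER_URL" > "$JAR" || {
          echo "Failed to fetch sbt; please install it manually from http://www.scala-sbt.org/"
          exit 1
        }
      fi
      java -jar "$JAR" "$@"
    fi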
rxin committed Jan 7, 2014
2 parents b97ef21 + 60a7a6b commit a862caf
Showing 17 changed files with 81 additions and 33 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -4,6 +4,7 @@
*.iml
*.iws
.idea/
sbt/*.jar
.settings
.cache
/build/
18 changes: 10 additions & 8 deletions README.md
@@ -13,9 +13,11 @@ This README file only contains basic setup instructions.
## Building

Spark requires Scala 2.10. The project is built using Simple Build Tool (SBT),
which can be obtained [here](http://www.scala-sbt.org). To build Spark and its example programs, run:
which can be obtained [here](http://www.scala-sbt.org). If SBT is installed we
will use the system version of sbt otherwise we will attempt to download it
automatically. To build Spark and its example programs, run:

sbt assembly
./sbt/sbt assembly

Once you've built Spark, the easiest way to start using it is the shell:

@@ -41,7 +43,7 @@ locally with one thread, or "local[N]" to run locally with N threads.
Testing first requires [Building](#Building) Spark. Once Spark is built, tests
can be run using:

`sbt test`
`./sbt/sbt test`

## A Note About Hadoop Versions

@@ -55,22 +57,22 @@ For Apache Hadoop versions 1.x, Cloudera CDH MRv1, and other Hadoop
versions without YARN, use:

# Apache Hadoop 1.2.1
$ SPARK_HADOOP_VERSION=1.2.1 sbt assembly
$ SPARK_HADOOP_VERSION=1.2.1 sbt/sbt assembly

# Cloudera CDH 4.2.0 with MapReduce v1
$ SPARK_HADOOP_VERSION=2.0.0-mr1-cdh4.2.0 sbt assembly
$ SPARK_HADOOP_VERSION=2.0.0-mr1-cdh4.2.0 sbt/sbt assembly

For Apache Hadoop 2.2.X, 2.1.X, 2.0.X, 0.23.x, Cloudera CDH MRv2, and other Hadoop versions
with YARN, also set `SPARK_YARN=true`:

# Apache Hadoop 2.0.5-alpha
$ SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt assembly
$ SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt/sbt assembly

# Cloudera CDH 4.2.0 with MapReduce v2
$ SPARK_HADOOP_VERSION=2.0.0-cdh4.2.0 SPARK_YARN=true sbt assembly
$ SPARK_HADOOP_VERSION=2.0.0-cdh4.2.0 SPARK_YARN=true sbt/sbt assembly

# Apache Hadoop 2.2.X and newer
$ SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true sbt assembly
$ SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true sbt/sbt assembly

When developing a Spark application, specify the Hadoop version by adding the
"hadoop-client" artifact to your project's dependencies. For example, if you're
2 changes: 1 addition & 1 deletion bin/pyspark
@@ -31,7 +31,7 @@ if [ ! -f "$FWDIR/RELEASE" ]; then
ls "$FWDIR"/assembly/target/scala-$SCALA_VERSION/spark-assembly*hadoop*.jar >& /dev/null
if [[ $? != 0 ]]; then
echo "Failed to find Spark assembly in $FWDIR/assembly/target" >&2
echo "You need to build Spark with sbt assembly before running this program" >&2
echo "You need to build Spark with sbt/sbt assembly before running this program" >&2
exit 1
fi
fi
2 changes: 1 addition & 1 deletion bin/run-example
@@ -55,7 +55,7 @@ if [ -e "$EXAMPLES_DIR"/target/spark-examples*[0-9Tg].jar ]; then
fi
if [[ -z $SPARK_EXAMPLES_JAR ]]; then
echo "Failed to find Spark examples assembly in $FWDIR/examples/target" >&2
echo "You need to build Spark with sbt assembly before running this program" >&2
echo "You need to build Spark with sbt/sbt assembly before running this program" >&2
exit 1
fi

2 changes: 1 addition & 1 deletion bin/spark-class
@@ -104,7 +104,7 @@ if [ ! -f "$FWDIR/RELEASE" ]; then
jars_list=$(ls "$FWDIR"/assembly/target/scala-$SCALA_VERSION/ | grep "spark-assembly.*hadoop.*.jar")
if [ "$num_jars" -eq "0" ]; then
echo "Failed to find Spark assembly in $FWDIR/assembly/target/scala-$SCALA_VERSION/" >&2
echo "You need to build Spark with 'sbt assembly' before running this program." >&2
echo "You need to build Spark with 'sbt/sbt assembly' before running this program." >&2
exit 1
fi
if [ "$num_jars" -gt "1" ]; then
4 changes: 2 additions & 2 deletions docs/README.md
@@ -27,10 +27,10 @@ To mark a block of code in your markdown to be syntax highlighted by jekyll duri

## API Docs (Scaladoc and Epydoc)

You can build just the Spark scaladoc by running `sbt doc` from the SPARK_PROJECT_ROOT directory.
You can build just the Spark scaladoc by running `sbt/sbt doc` from the SPARK_PROJECT_ROOT directory.

Similarly, you can build just the PySpark epydoc by running `epydoc --config epydoc.conf` from the SPARK_PROJECT_ROOT/pyspark directory.

When you run `jekyll` in the docs directory, it will also copy over the scaladoc for the various Spark subprojects into the docs directory (and then also into the _site directory). We use a jekyll plugin to run `sbt doc` before building the site so if you haven't run it (recently) it may take some time as it generates all of the scaladoc. The jekyll plugin also generates the PySpark docs using [epydoc](http://epydoc.sourceforge.net/).
When you run `jekyll` in the docs directory, it will also copy over the scaladoc for the various Spark subprojects into the docs directory (and then also into the _site directory). We use a jekyll plugin to run `sbt/sbt doc` before building the site so if you haven't run it (recently) it may take some time as it generates all of the scaladoc. The jekyll plugin also generates the PySpark docs using [epydoc](http://epydoc.sourceforge.net/).

NOTE: To skip the step of building and copying over the Scala and Python API docs, run `SKIP_API=1 jekyll`.
4 changes: 2 additions & 2 deletions docs/_plugins/copy_api_dirs.rb
@@ -26,8 +26,8 @@
curr_dir = pwd
cd("..")

puts "Running sbt doc from " + pwd + "; this may take a few minutes..."
puts `sbt doc`
puts "Running sbt/sbt doc from " + pwd + "; this may take a few minutes..."
puts `sbt/sbt doc`

puts "Moving back into docs dir."
cd("docs")
2 changes: 1 addition & 1 deletion docs/api.md
@@ -3,7 +3,7 @@ layout: global
title: Spark API documentation (Scaladoc)
---

Here you can find links to the Scaladoc generated for the Spark sbt subprojects. If the following links don't work, try running `sbt doc` from the Spark project home directory.
Here you can find links to the Scaladoc generated for the Spark sbt subprojects. If the following links don't work, try running `sbt/sbt doc` from the Spark project home directory.

- [Spark](api/core/index.html)
- [Spark Examples](api/examples/index.html)
2 changes: 1 addition & 1 deletion docs/hadoop-third-party-distributions.md
@@ -12,7 +12,7 @@ with these distributions:
When compiling Spark, you'll need to
[set the SPARK_HADOOP_VERSION flag](index.html#a-note-about-hadoop-versions):

SPARK_HADOOP_VERSION=1.0.4 sbt assembly
SPARK_HADOOP_VERSION=1.0.4 sbt/sbt assembly

The table below lists the corresponding `SPARK_HADOOP_VERSION` code for each CDH/HDP release. Note that
some Hadoop releases are binary compatible across client versions. This means the pre-built Spark
6 changes: 3 additions & 3 deletions docs/index.md
@@ -17,7 +17,7 @@ Spark runs on both Windows and UNIX-like systems (e.g. Linux, Mac OS). All you n

Spark uses [Simple Build Tool](http://www.scala-sbt.org), which is bundled with it. To compile the code, go into the top-level Spark directory and run

sbt assembly
sbt/sbt assembly

For its Scala API, Spark {{site.SPARK_VERSION}} depends on Scala {{site.SCALA_VERSION}}. If you write applications in Scala, you will need to use this same version of Scala in your own program -- newer major versions may not work. You can get the right version of Scala from [scala-lang.org](http://www.scala-lang.org/download/).

@@ -56,12 +56,12 @@ Hadoop, you must build Spark against the same version that your cluster uses.
By default, Spark links to Hadoop 1.0.4. You can change this by setting the
`SPARK_HADOOP_VERSION` variable when compiling:

SPARK_HADOOP_VERSION=2.2.0 sbt assembly
SPARK_HADOOP_VERSION=2.2.0 sbt/sbt assembly

In addition, if you wish to run Spark on [YARN](running-on-yarn.html), set
`SPARK_YARN` to `true`:

SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt assembly
SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt/sbt assembly

Note that on Windows, you need to set the environment variables on separate lines, e.g., `set SPARK_HADOOP_VERSION=1.2.1`.

2 changes: 1 addition & 1 deletion docs/python-programming-guide.md
@@ -69,7 +69,7 @@ The script automatically adds the `bin/pyspark` package to the `PYTHONPATH`.
The `bin/pyspark` script launches a Python interpreter that is configured to run PySpark applications. To use `pyspark` interactively, first build Spark, then launch it directly from the command line without any options:

{% highlight bash %}
$ sbt assembly
$ sbt/sbt assembly
$ ./bin/pyspark
{% endhighlight %}

8 changes: 4 additions & 4 deletions docs/quick-start.md
@@ -12,7 +12,7 @@ See the [programming guide](scala-programming-guide.html) for a more complete re
To follow along with this guide, you only need to have successfully built Spark on one machine. Simply go into your Spark directory and run:

{% highlight bash %}
$ sbt assembly
$ sbt/sbt assembly
{% endhighlight %}

# Interactive Analysis with the Spark Shell
@@ -146,7 +146,7 @@ If you also wish to read data from Hadoop's HDFS, you will also need to add a de
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "<your-hdfs-version>"
{% endhighlight %}

Finally, for sbt to work correctly, we'll need to layout `SimpleApp.scala` and `simple.sbt` according to the typical directory structure. Once that is in place, we can create a JAR package containing the application's code, then use `sbt run` to execute our program.
Finally, for sbt to work correctly, we'll need to layout `SimpleApp.scala` and `simple.sbt` according to the typical directory structure. Once that is in place, we can create a JAR package containing the application's code, then use `sbt/sbt run` to execute our program.

{% highlight bash %}
$ find .
@@ -157,8 +157,8 @@ $ find .
./src/main/scala
./src/main/scala/SimpleApp.scala

$ sbt package
$ sbt run
$ sbt/sbt package
$ sbt/sbt run
...
Lines with a: 46, Lines with b: 23
{% endhighlight %}
6 changes: 3 additions & 3 deletions docs/running-on-yarn.md
@@ -12,7 +12,7 @@ was added to Spark in version 0.6.0, and improved in 0.7.0 and 0.8.0.
We need a consolidated Spark JAR (which bundles all the required dependencies) to run Spark jobs on a YARN cluster.
This can be built by setting the Hadoop version and `SPARK_YARN` environment variable, as follows:

SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt assembly
SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt/sbt assembly

The assembled JAR will be something like this:
`./assembly/target/scala-{{site.SCALA_VERSION}}/spark-assembly_{{site.SPARK_VERSION}}-hadoop2.0.5.jar`.
@@ -25,7 +25,7 @@ The build process now also supports new YARN versions (2.2.x). See below.
- The assembled jar can be installed into HDFS or used locally.
- Your application code must be packaged into a separate JAR file.

If you want to test out the YARN deployment mode, you can use the current Spark examples. A `spark-examples_{{site.SCALA_VERSION}}-{{site.SPARK_VERSION}}` file can be generated by running `sbt assembly`. NOTE: since the documentation you're reading is for Spark version {{site.SPARK_VERSION}}, we are assuming here that you have downloaded Spark {{site.SPARK_VERSION}} or checked it out of source control. If you are using a different version of Spark, the version numbers in the jar generated by the sbt package command will obviously be different.
If you want to test out the YARN deployment mode, you can use the current Spark examples. A `spark-examples_{{site.SCALA_VERSION}}-{{site.SPARK_VERSION}}` file can be generated by running `sbt/sbt assembly`. NOTE: since the documentation you're reading is for Spark version {{site.SPARK_VERSION}}, we are assuming here that you have downloaded Spark {{site.SPARK_VERSION}} or checked it out of source control. If you are using a different version of Spark, the version numbers in the jar generated by the sbt package command will obviously be different.

# Configuration

@@ -72,7 +72,7 @@ The command to launch the YARN Client is as follows:
For example:

# Build the Spark assembly JAR and the Spark examples JAR
$ SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt assembly
$ SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt/sbt assembly

# Configure logging
$ cp conf/log4j.properties.template conf/log4j.properties
2 changes: 1 addition & 1 deletion docs/scala-programming-guide.md
@@ -31,7 +31,7 @@ In addition, if you wish to access an HDFS cluster, you need to add a dependency
artifactId = hadoop-client
version = <your-hdfs-version>

For other build systems, you can run `sbt assembly` to pack Spark and its dependencies into one JAR (`assembly/target/scala-{{site.SCALA_VERSION}}/spark-assembly-{{site.SPARK_VERSION}}-hadoop*.jar`), then add this to your CLASSPATH. Set the HDFS version as described [here](index.html#a-note-about-hadoop-versions).
For other build systems, you can run `sbt/sbt assembly` to pack Spark and its dependencies into one JAR (`assembly/target/scala-{{site.SCALA_VERSION}}/spark-assembly-{{site.SPARK_VERSION}}-hadoop*.jar`), then add this to your CLASSPATH. Set the HDFS version as described [here](index.html#a-note-about-hadoop-versions).

Finally, you need to import some Spark classes and implicit conversions into your program. Add the following lines:

9 changes: 6 additions & 3 deletions make-distribution.sh
@@ -44,13 +44,16 @@ DISTDIR="$FWDIR/dist"
# Get version from SBT
export TERM=dumb # Prevents color codes in SBT output

if ! test `which sbt` ;then
VERSIONSTRING=$FWDIR/sbt/sbt "show version"

if [ $? == -1 ] ;then
echo -e "You need sbt installed and available on your path."
echo -e "Download sbt from http://www.scala-sbt.org/"
exit -1;
fi

VERSION=$(sbt "show version" | tail -1 | cut -f 2 | sed 's/^\([a-zA-Z0-9.-]*\).*/\1/')
VERSION=$(echo "${VERSIONSTRING}" | tail -1 | cut -f 2 | sed 's/^\([a-zA-Z0-9.-]*\).*/\1/')
echo "Version is ${VERSION}"

# Initialize defaults
SPARK_HADOOP_VERSION=1.0.4
@@ -92,7 +95,7 @@ export SPARK_HADOOP_VERSION
export SPARK_YARN
cd $FWDIR

"sbt" "assembly/assembly"
"sbt/sbt" "assembly/assembly"

# Make directories
rm -rf "$DISTDIR"
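As an aside on the make-distribution.sh hunk above, the following minimal standalone sketch (the sample input line and variable contents are assumptions, shown only to illustrate the pipeline) walks through how the `tail | cut | sed` chain reduces sbt's `show version` output to a bare version string:

    #!/bin/bash
    # Assumed shape of the last line of `sbt "show version"` output:
    # "[info]" followed by a tab and the project version.
    VERSIONSTRING="$(printf '[info]\t0.9.0-incubating-SNAPSHOT')"

    # tail -1  : keep only the last line of the captured output
    # cut -f 2 : take the second tab-separated field (the version itself)
    # sed ...  : keep only the leading run of version characters
    VERSION=$(echo "${VERSIONSTRING}" | tail -1 | cut -f 2 | sed 's/^\([a-zA-Z0-9.-]*\).*/\1/')

    echo "Version is ${VERSION}"   # prints: Version is 0.9.0-incubating-SNAPSHOT
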
1 change: 0 additions & 1 deletion project/build.properties
@@ -14,5 +14,4 @@
# See the License for the specific language governing permissions and
# limitations under the License.
#

sbt.version=0.12.4
43 changes: 43 additions & 0 deletions sbt/sbt
@@ -0,0 +1,43 @@
#!/bin/bash
# This script launches sbt for this project. If present it uses the system
# version of sbt. If there is no system version of sbt it attempts to download
# sbt locally.
SBT_VERSION=`awk -F "=" '/sbt\\.version/ {print $2}' ./project/build.properties`
URL1=http://typesafe.artifactoryonline.com/typesafe/ivy-releases/org.scala-sbt/sbt-launch/${SBT_VERSION}/sbt-launch.jar
URL2=http://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/sbt-launch/${SBT_VERSION}/sbt-launch.jar
JAR=sbt/sbt-launch-${SBT_VERSION}.jar

printf "Checking for system sbt ["
if hash sbt 2>/dev/null; then
printf "FOUND]\n"
# Use System SBT
sbt "$@"
else
printf "NOT FOUND]\n"
# Download sbt or use already downloaded
if [ ! -d .sbtlib ]; then
mkdir .sbtlib
fi
if [ ! -f ${JAR} ]; then
# Download
printf "Attempting to fetch sbt\n"
if hash curl 2>/dev/null; then
curl --progress-bar ${URL1} > ${JAR} || curl --progress-bar ${URL2} > ${JAR}
elif hash wget 2>/dev/null; then
wget --progress=bar ${URL1} -O ${JAR} || wget --progress=bar ${URL2} -O ${JAR}
else
printf "You do not have curl or wget installed, please install sbt manually from http://www.scala-sbt.org/\n"
exit -1
fi
fi
if [ ! -f ${JAR} ]; then
# We failed to download
printf "Our attempt to download sbt locally to ${JAR} failed. Please install sbt manually from http://www.scala-sbt.org/\n"
exit -1
fi
printf "Launching sbt from ${JAR}\n"
java \
-Xmx1200m -XX:MaxPermSize=350m -XX:ReservedCodeCacheSize=256m \
-jar ${JAR} \
"$@"
fi
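
With this script in place, the build and test commands referenced in the updated docs can be run from the project root, for example (assuming the script keeps its executable bit, as git normally preserves):

    # Build Spark and its examples against the default Hadoop version
    ./sbt/sbt assembly

    # Run the test suite
    ./sbt/sbt test

    # Build against Hadoop 2.2.0 with YARN support
    SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true ./sbt/sbt assembly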
