
Running CaffeOnSpark on Standalone Spark Cluster (OSX)

  1. Clone CaffeOnSpark code.
git clone https://github.com/yahoo/CaffeOnSpark.git --recursive
export CAFFE_ON_SPARK=$(pwd)/CaffeOnSpark
  2. Install Apache Hadoop 2.6 per http://hadoop.apache.org/releases.html, and Apache Spark 1.6.0 per the instructions at http://spark.apache.org/downloads.html.
${CAFFE_ON_SPARK}/scripts/local-setup-hadoop.sh
export HADOOP_HOME=$(pwd)/hadoop-2.6.4
export PATH=${HADOOP_HOME}/bin:${PATH}
${CAFFE_ON_SPARK}/scripts/local-setup-spark.sh
export SPARK_HOME=$(pwd)/spark-1.6.0-bin-hadoop2.6
export PATH=${SPARK_HOME}/bin:${PATH}
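
To confirm that both installations are on your PATH before proceeding, print their versions:

hadoop version
spark-submit --version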
  3. Install Caffe prerequisites per http://caffe.berkeleyvision.org/installation.html or from http://installing-caffe-the-right-way.wikidot.com/start.

For CPU mode: make sure that all the dependent libraries are compiled with libc++.
Check this using otool, e.g.:
otool -L /usr/local/Cellar/opencv/2.4.12_2/lib/libopencv_objdetect.dylib
You should see libc++ linked, like:
/usr/lib/libc++.1.dylib (compatibility version 1.0.0, current version 120.0.0)

If any dependent lib links against libstdc++ instead, you need to recompile that lib from source with libc++.
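
As a convenience, a small shell loop can scan several Homebrew-installed dependencies at once; the glob below is an assumption — adjust it to where your libs actually live:

# flag any dependency still linked against libstdc++ (paths are examples)
for lib in /usr/local/opt/*/lib/*.dylib; do
    otool -L "$lib" | grep -q 'libstdc++' && echo "needs rebuild with libc++: $lib"
done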

  4. Create a CaffeOnSpark/caffe-public/Makefile.config
    Check that your $JAVA_HOME is set:
pushd ${CAFFE_ON_SPARK}/caffe-public/
cp Makefile.config.example Makefile.config
echo "INCLUDE_DIRS += ${JAVA_HOME}/include" >> Makefile.config
popd

Change or specify the paths in INCLUDE_DIRS and LIBRARY_DIRS to point at the dependent libs per your local installation. This is a critical step in making sure everything compiles well.
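
For example, with Homebrew-installed dependencies under /usr/local, appending entries in the same style as the JAVA_HOME line above may be sufficient; the exact paths are assumptions and depend on your installation:

pushd ${CAFFE_ON_SPARK}/caffe-public/
echo "INCLUDE_DIRS += /usr/local/include" >> Makefile.config
echo "LIBRARY_DIRS += /usr/local/lib" >> Makefile.config
popd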

Uncomment settings as needed:

CPU_ONLY := 1  # for CPU-only mode (no GPU)

For CPU_ONLY, as stated above, make sure all dependent libs are compiled with libc++.
For GPU mode on OSX:
If using CUDA > 7.0, nothing special is required other than commenting out CPU_ONLY.
But on OS X >= 10.9 with CUDA < 7.0, you may need to compile all dependent libs with libstdc++.

Comment out INFINIBAND in all cases on OSX unless you have a libverbs driver for it (not tested).

  5. Build CaffeOnSpark
export DYLD_LIBRARY_PATH=${CAFFE_ON_SPARK}/caffe-public/distribute/lib:${CAFFE_ON_SPARK}/caffe-distri/distribute/lib
export DYLD_LIBRARY_PATH=${DYLD_LIBRARY_PATH}:/usr/local/cuda/lib:/usr/local/mkl/lib/intel64/
export LD_LIBRARY_PATH=${DYLD_LIBRARY_PATH}
pushd ${CAFFE_ON_SPARK}
make buildosx
popd

Please make sure to set the right paths per your local installation for the CUDA libs (if you chose GPU mode) and the MKL libs (if you use MKL).
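
After the build, you can sanity-check that the native libraries and the assembled jar (used by spark-submit in the training step below) were produced:

ls ${CAFFE_ON_SPARK}/caffe-distri/distribute/lib
ls ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar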

  6. Install the MNIST dataset
${CAFFE_ON_SPARK}/scripts/setup-mnist.sh

Adjust ${CAFFE_ON_SPARK}/data/lenet_memory_train_test.prototxt to use absolute paths, such as:

source: "file:///home/afeng/CaffeOnSpark/data/mnist_train_lmdb/"
source: "file:///home/afeng/CaffeOnSpark/data/mnist_test_lmdb/"

Adjust data/lenet_memory_solver.prototxt with the appropriate mode:

solver_mode: CPU #GPU if you use GPU nodes
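
You can quickly verify both edits with grep:

grep source ${CAFFE_ON_SPARK}/data/lenet_memory_train_test.prototxt
grep solver_mode ${CAFFE_ON_SPARK}/data/lenet_memory_solver.prototxt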
  7. Launch standalone Spark cluster

Start master:

${SPARK_HOME}/sbin/start-master.sh

The Spark log for the above command will contain the Spark master URL, which starts with the prefix "spark://".

Start one or more workers and connect them to the master via the master Spark URL. Go to the Master Web UI and make sure that the expected number of workers has been launched.

export MASTER_URL=spark://$(hostname):7077
export SPARK_WORKER_INSTANCES=2 
export CORES_PER_WORKER=1 
export TOTAL_CORES=$((${CORES_PER_WORKER}*${SPARK_WORKER_INSTANCES})) 
${SPARK_HOME}/sbin/start-slave.sh -c $CORES_PER_WORKER -m 3G ${MASTER_URL}
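
On OSX, you can open the Master Web UI directly from the shell; port 8080 is the Spark standalone default (an assumption if you configured a different port):

open http://$(hostname):8080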
  8. Train a DNN using CaffeOnSpark with 2 Spark executors over an Ethernet connection. If you have an Infiniband interface, use "-connection infiniband" instead.

Before launching CaffeOnSpark, check that your hostname and the other hosts you connect to are resolvable.
You may need to add your host name and peer host names to /etc/hosts.
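
A quick resolvability check is to ping your own hostname (and each peer host):

ping -c 1 $(hostname)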

pushd ${CAFFE_ON_SPARK}/data
rm -rf ${CAFFE_ON_SPARK}/mnist_lenet.model
rm -rf ${CAFFE_ON_SPARK}/lenet_features_result
spark-submit --master ${MASTER_URL} \
    --files ${CAFFE_ON_SPARK}/data/lenet_memory_solver.prototxt,${CAFFE_ON_SPARK}/data/lenet_memory_train_test.prototxt \
    --conf spark.cores.max=${TOTAL_CORES} \
    --conf spark.task.cpus=${CORES_PER_WORKER} \
    --conf spark.driver.extraLibraryPath="${DYLD_LIBRARY_PATH}" \
    --conf spark.executorEnv.DYLD_LIBRARY_PATH="${DYLD_LIBRARY_PATH}" \
    --class com.yahoo.ml.caffe.CaffeOnSpark  \
    ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
        -train \
        -features accuracy,loss -label label \
        -conf lenet_memory_solver.prototxt \
        -clusterSize ${SPARK_WORKER_INSTANCES} \
        -devices 1 \
        -connection ethernet \
        -model file:${CAFFE_ON_SPARK}/mnist_lenet.model \
        -output file:${CAFFE_ON_SPARK}/lenet_features_result
ls -l ${CAFFE_ON_SPARK}/mnist_lenet.model
cat ${CAFFE_ON_SPARK}/lenet_features_result/*

Please check the Spark Worker Web UI to see the progress of training. You should see standard Caffe logs, as illustrated below.

I0215 04:45:41.444522 26306 solver.cpp:237] Iteration 0, loss = 2.45106
I0215 04:45:41.444772 26306 solver.cpp:253]     Train net output #0: loss = 2.45106 (* 1 = 2.45106 loss)
I0215 04:45:41.444911 26306 sgd_solver.cpp:106] Iteration 0, lr = 0.01
I0215 04:46:10.320430 26306 solver.cpp:237] Iteration 100, loss = 0.337411
I0215 04:46:10.320597 26306 solver.cpp:253]     Train net output #0: loss = 0.337411 (* 1 = 0.337411 loss)
I0215 04:46:10.320667 26306 sgd_solver.cpp:106] Iteration 100, lr = 0.00992565
I0215 04:46:37.602695 26306 solver.cpp:237] Iteration 200, loss = 0.2749
I0215 04:46:37.602886 26306 solver.cpp:253]     Train net output #0: loss = 0.2749 (* 1 = 0.2749 loss)
I0215 04:46:37.602932 26306 sgd_solver.cpp:106] Iteration 200, lr = 0.00985258
I0215 04:46:59.177289 26306 solver.cpp:237] Iteration 300, loss = 0.165734
I0215 04:46:59.177484 26306 solver.cpp:253]     Train net output #0: loss = 0.165734 (* 1 = 0.165734 loss)
I0215 04:46:59.177533 26306 sgd_solver.cpp:106] Iteration 300, lr = 0.00978075
I0215 04:47:27.075026 26306 solver.cpp:237] Iteration 400, loss = 0.26131
I0215 04:47:27.075108 26306 solver.cpp:253]     Train net output #0: loss = 0.26131 (* 1 = 0.26131 loss)
I0215 04:47:27.075125 26306 sgd_solver.cpp:106] Iteration 400, lr = 0.00971013

The feature result file should look like:

{"SampleID":"00009597","accuracy":[1.0],"loss":[0.028171852],"label":[2.0]}
{"SampleID":"00009598","accuracy":[1.0],"loss":[0.028171852],"label":[6.0]}
{"SampleID":"00009599","accuracy":[1.0],"loss":[0.028171852],"label":[1.0]}
{"SampleID":"00009600","accuracy":[0.97],"loss":[0.0677709],"label":[5.0]}
{"SampleID":"00009601","accuracy":[0.97],"loss":[0.0677709],"label":[0.0]}
{"SampleID":"00009602","accuracy":[0.97],"loss":[0.0677709],"label":[1.0]}
{"SampleID":"00009603","accuracy":[0.97],"loss":[0.0677709],"label":[2.0]}
{"SampleID":"00009604","accuracy":[0.97],"loss":[0.0677709],"label":[3.0]}
{"SampleID":"00009605","accuracy":[0.97],"loss":[0.0677709],"label":[4.0]}
  9. Access CaffeOnSpark from Python

Get started with Python on CaffeOnSpark.

  10. Shut down the Spark cluster
${SPARK_HOME}/sbin/stop-slave.sh
${SPARK_HOME}/sbin/stop-master.sh
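
To verify the shutdown, jps (shipped with the JDK) should no longer list the standalone Master or Worker daemons:

jps | egrep 'Master|Worker'   # no output means the cluster is down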