Ondřej Moravčík edited this page May 4, 2015 · 16 revisions

There are three ways to configure ruby-spark and Spark. Configuration can be changed only before the context is created; after that it is read-only.

By environment variable

SPARK_RUBY_SERIALIZER="oj" ruby-spark shell
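
The example above suggests that a property key such as spark.ruby.serializer maps to an environment variable by upper-casing it and replacing dots with underscores. The following is a minimal sketch of that mapping, assuming the rule generalizes from the single example shown. `env_name_for` is a hypothetical helper and is not part of ruby-spark.

```ruby
# Hypothetical helper: derive the environment variable name for a property key.
# Assumption: the naming rule is inferred from the single documented pair
# SPARK_RUBY_SERIALIZER <-> spark.ruby.serializer; it is not ruby-spark code.
def env_name_for(key)
  key.upcase.tr('.', '_')
end

puts env_name_for('spark.ruby.serializer') # prints SPARK_RUBY_SERIALIZER
```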

By property file

Content of property file:

# This is just a comment
spark.ruby.serializer         oj
spark.ruby.batch_size         4096
spark.ruby.executor.options   -W1

To load the file from the shell:

ruby-spark shell --properties-file conf.conf

To load the file from Ruby:

Spark.config.from_file(FILE_PATH)
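
To make the property-file format concrete, here is a rough sketch of how such a file could be read in plain Ruby. It assumes that lines starting with "#" are comments and that the key is separated from the value by the first run of whitespace. `parse_properties` is a hypothetical helper, not the parser ruby-spark uses.

```ruby
# Sketch of a parser for the property-file format shown above.
# Assumptions: "#" starts a comment line, and the key is separated from the
# value by the first run of whitespace. This is not ruby-spark's own parser.
def parse_properties(text)
  text.each_line.with_object({}) do |line, props|
    line = line.strip
    next if line.empty? || line.start_with?('#')

    key, value = line.split(/\s+/, 2)
    props[key] = value.to_s
  end
end

sample = <<~CONF
  # This is just a comment
  spark.ruby.serializer         oj
  spark.ruby.batch_size         4096
  spark.ruby.executor.options   -W1
CONF

props = parse_properties(sample)
puts props['spark.ruby.batch_size'] # prints 4096 (values stay Strings here)
```

Values are kept as strings in this sketch; any conversion to integers would be a separate step.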

Direct

This must be done before the context is started.

Spark.config do
  set_app_name "RubySpark"
  set_master "local[*]"
  set 'spark.ruby.batch_size', 100
  set 'spark.ruby.serializer', 'oj'
end

Spark.config.set('spark.ruby.batch_size', 100)
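
The block form and the read-only rule can be illustrated with a small stand-alone sketch. `SketchConfig`, `configure`, `get` and `lock!` are hypothetical names chosen for this illustration; the sketch only assumes that the block is evaluated against the config object and that writes are rejected once the context exists. It does not reproduce ruby-spark's implementation.

```ruby
# Hypothetical stand-in for a configuration object; not ruby-spark's class.
class SketchConfig
  def initialize
    @values = {}
    @read_only = false
  end

  # Evaluate the block against this object so bare `set ...` calls work.
  def configure(&block)
    instance_eval(&block)
    self
  end

  def set(key, value)
    raise 'Configuration is read-only' if @read_only

    @values[key] = value
  end

  def get(key)
    @values[key]
  end

  # Would be called when the (hypothetical) context starts.
  def lock!
    @read_only = true
  end
end

config = SketchConfig.new
config.configure do
  set 'spark.ruby.batch_size', 100
  set 'spark.ruby.serializer', 'oj'
end
config.lock!

begin
  config.set('spark.ruby.batch_size', 200)
rescue RuntimeError => e
  puts e.message # prints "Configuration is read-only"
end
```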

During data uploading

$sc.parallelize(1..10, 3, serializer: "oj")

Ruby configuration in Spark

| Key | Default value | Description |
| --- | --- | --- |
| spark.ruby.parallelize_strategy | inplace | What happens to an Array during the parallelize method. inplace: the array is converted using the chosen serializer. deep_copy: the array is cloned first to prevent changing the original data. |
| spark.ruby.serializer | marshal | Default serializer. marshal: Ruby's default (slowest). oj: faster than marshal but does not work on JRuby. message_pack: fastest, but cannot serialize large numbers and some objects. |
| spark.ruby.batch_size | 1024 | Number of items that are serialized and sent as one item. |
| spark.ruby.worker.type | process | Type of workers. process: new workers are created by the fork function. thread: each worker is represented as a thread. |
| spark.ruby.executor.uri | | Location of the ruby-spark gem. If this field is empty, the executor uses the installed gem. |
| spark.ruby.executor.command | %s | Command template for Ruby script execution. Can be useful if you are using a Ruby version manager. Rbenv example: bash --norc -i -c "export HOME=/home/user; cd; source .bashrc; %s". The template must contain '%s', which represents the original ruby command. |
| spark.ruby.executor.options | | Ruby options for scripts. |
| spark.ruby.executor.env.[VARIABLE] | | Environment variables for Ruby scripts. |
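
Three of the options above (batch_size, the deep_copy strategy, and the executor command template) can be pictured with plain Ruby. The sketch below is only an illustration of the ideas described in the table: the helper names and the `ruby worker.rb` command are hypothetical, and none of this is taken from ruby-spark's internals.

```ruby
# Plain-Ruby sketches of three options from the table. All helper names are
# hypothetical, and this is not ruby-spark's internal code.

# spark.ruby.batch_size: group items and serialize each group as one unit.
def serialize_in_batches(items, batch_size)
  items.each_slice(batch_size).map { |batch| Marshal.dump(batch) }
end

# spark.ruby.parallelize_strategy = deep_copy: clone the array first so the
# caller's data is left untouched by later changes.
def deep_copy(array)
  Marshal.load(Marshal.dump(array))
end

# spark.ruby.executor.command: '%s' is replaced by the original ruby command.
def build_command(template, ruby_command)
  raise ArgumentError, "template must contain '%s'" unless template.include?('%s')

  template.sub('%s') { ruby_command }
end

batches = serialize_in_batches((1..10).to_a, 4)
puts batches.size # prints 3

original = [[1, 2], [3]]
copy = deep_copy(original)
copy[0] << 99
p original # prints [[1, 2], [3]]

puts build_command('bash -c "%s"', 'ruby worker.rb') # prints bash -c "ruby worker.rb"
```

A larger batch size means fewer, bigger serialized units; the sketch uses Marshal only because it ships with Ruby.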