
Batching parameters #344

Closed
sskgit opened this issue Mar 6, 2017 · 16 comments

Comments

@sskgit

sskgit commented Mar 6, 2017

What is the best way to use batching parameters like max_batch_size, batch_timeout_micros, num_batch_threads, and the other parameters? I tried passing them while running the query client.

In the example below I have 100 images and I want to batch them in groups of 10. The query runs over all images instead of 10 at a time.
bazel-bin/tensorflow_serving/example/demo_batch --server=localhost:9000 --max_batch_size = 10

Also, for batch scheduling, how can I make it run every 10 seconds after the first batch is done? I'd appreciate some ideas.

@chrisolston
Contributor

Check out batching/README.md.

@sskgit
Author

sskgit commented Mar 6, 2017

Thanks @chrisolston. I went through the README.md for BatchingSession and BasicBatchScheduler. Most of the parameters I need are in these two files.

However, based on the query client (inception_client.py), it doesn't look like it calls basic_batch_scheduler and/or BasicBatchScheduler. So the question is: how are these session and scheduling parameters passed to the query client? It looks like I am missing something...

@chrisolston
Contributor

With the ModelServer binary, the batching takes place in the server, not the client. I just checked, and unfortunately the batching parameters are currently hard-coded to the defaults. It would be easy to extend model_servers/main.cc to accept the parameters via a textual proto file (analogous to model_config_file) containing a BatchingParameters proto, if you're up for making a (simple) code contribution :)

As a work-around, you can hard-code some specific values into the BatchingParameters proto in main.cc.
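
For anyone trying this workaround, here is a rough, untested sketch of what the hard-coded values could look like, assuming the wrapped-value fields of BatchingParameters from session_bundle_config.proto (the same field names that appear later in this thread); SetHardCodedBatchingParameters is just a made-up helper name, and the exact integration point in main.cc is not shown:

#include "tensorflow_serving/servables/tensorflow/session_bundle_config.pb.h"

// Hypothetical helper (not from main.cc): fills a BatchingParameters proto
// with fixed values. Each field is a wrapped value, so it is set through
// mutable_*()->set_value().
void SetHardCodedBatchingParameters(
    tensorflow::serving::BatchingParameters* params) {
  params->mutable_max_batch_size()->set_value(16);
  params->mutable_batch_timeout_micros()->set_value(10000);
  params->mutable_max_enqueued_batches()->set_value(100);
  params->mutable_num_batch_threads()->set_value(4);
}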

@chrisolston
Contributor

On second thought, since it's only a few lines of code I will add the flag. I'm working on the change internally at Google, and if things go smoothly it will propagate to open-source in a week or so.

@sskgit
Author

sskgit commented Mar 7, 2017

@chrisolston Thanks for the clarification. Will try the workaround for now.

@abuvaneswari

As of now, tensorflow_model_server seems to support batching_parameters_file as an argument. Can someone point me to a template or specification for this file? I searched around and could not find anything.

@chrisolston
Contributor

As the flag documentation states, it's an ascii protobuf for the BatchingParameters proto (defined in session_bundle_config.proto). You can find information about the ascii protobuf format elsewhere. It basically looks like:
message {
  field1: value1
  field2: value2
}
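
As a purely illustrative sketch (arbitrary values, not recommendations), a batching parameters file using the BatchingParameters fields people use later in this thread could look like:

max_batch_size { value: 32 }
batch_timeout_micros { value: 1000 }
max_enqueued_batches { value: 100 }
num_batch_threads { value: 4 }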

@abuvaneswari

What are the default settings for the batching parameters when enable_batching=true? I looked at the following config file, played with the parameter settings, and ran the model server against each of the settings, but the results do not match what I get when I do not supply any file to batching_parameters_file (but still have enable_batching=true):
serving/tensorflow_serving/servables/tensorflow/testdata/batching_config.txt

@chrisolston
Contributor

@sreddybr3

sreddybr3 commented Nov 26, 2017

I am running TFS (a custom build for GPU) with the standard InceptionV3 model.

Amazon EC2 P2.xlarge instance:

export CUDA_HOME=/usr/local/cuda \
       TF_NEED_CUDA=1 \
       TF_CUDA_CLANG=0 \
       GCC_HOST_COMPILER_PATH=/usr/bin/gcc \
       TF_CUDA_VERSION=8.0 \
       CUDA_TOOLKIT_PATH=/usr/local/cuda \
       TF_CUDNN_VERSION=6 \
       CUDNN_INSTALL_PATH=/usr/local/cuda \
       CC_OPT_FLAGS="-c opt --copt=-mavx --copt=-msse4.2 --copt=-msse4.1 --copt=-msse3 --copt=-mavx2 --copt=-mfma --copt=-mfpmath=both --config=opt --config=cuda" \
       PYTHON_BIN_PATH="/usr/bin/python" \
       USE_DEFAULT_PYTHON_LIB_PATH=1 \
       TF_NEED_JEMALLOC=1 \
       TF_NEED_GCP=0 \
       TF_NEED_HDFS=0 \
       TF_ENABLE_XLA=0 \
       TF_NEED_OPENCL=0 \
       TF_NEED_MKL=0 \
       TF_NEED_MPI=0 \
       TF_NEED_VERBS=0
  • build tensorflow_model_server
bazel build -c opt --copt=-mavx --copt=-msse4.2 --copt=-msse4.1 --copt=-msse3 --copt=-mavx2 --copt=-mfma --copt=-mfpmath=both --config=opt --config=cuda \
    --crosstool_top=@local_config_cuda//crosstool:toolchain \
    tensorflow_serving/model_servers:tensorflow_model_server
  • run server using batching config file
bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server --port=9000 --model_name=inception --model_base_path=/inception-export --enable_batching --batching_parameters_file=batching.conf

batching.conf file contents

max_batch_size { value: 16 }
batch_timeout_micros { value: 100000 }
max_enqueued_batches { value: 1000000 }
num_batch_threads { value: 4 }

I have set num_batch_threads equal to the number of CPU cores, i.e. 4.
Varied batch_timeout_micros between 0, 1000, 10000, 100000, and 500000.
Varied max_batch_size between 8, 16, 32, 64, 128, and 256.

The TensorFlow Serving initialisation logs show that the GPU is visible and utilised.

nvidia-smi --query-gpu=timestamp,name,pci.bus_id,driver_version,pstate,pcie.link.gen.max,pcie.link.gen.current,temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used --format=csv -l 1

The command above shows how the GPU is being utilised while the performance script is running.

Test 1: Single process with 100 thread(s) per process and 1 request per thread
Results:
Without batching (average response time over multiple executions is 2.75 sec):

avg. resp. time (msec) | failure rate % | model
2898.545 | 0.00% | inception

With batching using the above-mentioned config (average response time over multiple executions is 1.1 sec):

avg. resp. time (msec) | failure rate % | model
1179.988 | 0.00% | inception

Test 2: Single process with 100 thread(s) per process and 10 requests per thread
Results:
Without batching (average response time over multiple executions is 4.5 sec):

avg. resp. time (msec) | failure rate % | model
4525.389 | 0.00% | inception

With batching using the above-mentioned config (average response time over multiple executions is 1.4 sec):

avg. resp. time (msec) | failure rate % | model
1377.638 | 0.00% | inception

The numbers look better with batching. We have a TF benchmark for training: https://www.tensorflow.org/performance/benchmarks
Do we have any benchmark for TensorFlow Serving using the InceptionV3 model?

@pharrellyhy

Hi @sreddybr3,
I'm trying to use batching to speed up inference. In my setup, TensorFlow is not built in optimized mode, but that should be fine for batching. In my test case the input shape is [32, 112, 112, 3], so in batching.conf I set max_batch_size to 32. This takes the same amount of time to finish the test (say, 500 requests), and if I increase max_batch_size the performance is even worse. I also tweaked num_batch_threads, which doesn't seem to help much. Do you have any thoughts? Thanks!

@misterpeddy
Member

This is a great discussion and points to the need for documentation of batching for the ModelServer binary. I have opened #1379 to add docs; if anyone has thoughts or suggestions, please comment on that issue :)

@TheR3d1

TheR3d1 commented Jul 5, 2019

[quotes @sreddybr3's benchmark comment above in full]

I am trying to use a batching config file via --batching_parameters_file. Where does this file have to be placed in order to be used?

@aaur0

aaur0 commented Aug 18, 2019

[quotes @sreddybr3's benchmark comment and @TheR3d1's question above about where to place the --batching_parameters_file file]

Just give the absolute path of the file. In the case of Docker, you will have to mount your local folder or have the file in the container itself.

@mikezhang95

Can we reload the batching config after the server starts? I see the model config has such a function.

@DachuanZhao

Same problem. I run TF Serving like this:

sudo docker run -p 8501:8501 -d --name="tf_serving" \
--mount type=bind,source=/mnt1/zhaodachuan/tf_model/push/lr,target=/models/push_lr \
-v /mnt1/zhaodachuan/tf-serving/config_file/batch_size.config:/models/config/batch_size.config \
-e MODEL_NAME=push_lr -t tensorflow/serving --enable_batching=true \
--batching_parameters_file=/models/config/batch_size.config

And my batch_size.config is:

max_batch_size { value: 1000000 }
batch_timeout_micros { value: 0 }
max_enqueued_batches { value: 1000000 }
num_batch_threads { value: 8 }

It runs just as slowly as when I don't use --enable_batching=true.
