Marginal Improvement Between INT8 and FP16 #168

alexriggio · 2023-03-31T16:05:52Z

I quantized a BERT model for binary text classification to INT8, but inference is only marginally faster than with FP16.

I measured latency on both an A4000 and an A100 GPU:

A4000 --> TensorRT INT8: 34.48 ms, TensorRT FP16: 38.72 ms
A100  --> TensorRT INT8: 11.53 ms, TensorRT FP16: 11.75 ms
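
To put a number on "marginal", here is a minimal sketch that computes the relative speedup from the latencies above. The dictionary layout and the `speedup` helper are my own illustrative names, not part of any library.

```python
# Latencies in milliseconds, copied from the measurements above.
latencies_ms = {
    "A4000": {"int8": 34.48, "fp16": 38.72},
    "A100": {"int8": 11.53, "fp16": 11.75},
}


def speedup(fp16_ms: float, int8_ms: float) -> float:
    """Return how many times faster INT8 is than FP16 (1.0 means no gain)."""
    return fp16_ms / int8_ms


for gpu, t in latencies_ms.items():
    ratio = speedup(t["fp16"], t["int8"])
    # Latency reduction as a percentage of the FP16 baseline.
    saved_pct = (1.0 - t["int8"] / t["fp16"]) * 100.0
    print(f"{gpu}: INT8 is {ratio:.2f}x FP16 ({saved_pct:.1f}% lower latency)")
```

This works out to roughly a 1.12x speedup on the A4000 and about 1.02x on the A100.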

Quantization was disabled for the following components:

disable bert.encoder.layer.1.intermediate.dense._input_quantizer
disable bert.encoder.layer.2.attention.output.layernorm_quantizer_0
disable bert.encoder.layer.2.attention.output.layernorm_quantizer_1
disable bert.encoder.layer.2.output.layernorm_quantizer_0
disable bert.encoder.layer.2.output.layernorm_quantizer_1
disable bert.encoder.layer.3.attention.output.dense._input_quantizer
disable bert.encoder.layer.10.attention.self.key._input_quantizer
disable bert.encoder.layer.11.attention.output.dense._input_quantizer
disable bert.encoder.layer.11.output.dense._input_quantizer
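
For a quick overview of which encoder layers are affected, the list above can be grouped by layer index. This is only a sketch for reading the list; the regular expression and variable names are illustrative assumptions and are not taken from the quantization tooling.

```python
import re
from collections import defaultdict

# Quantizer names copied from the list above (without the "disable" prefix).
disabled = [
    "bert.encoder.layer.1.intermediate.dense._input_quantizer",
    "bert.encoder.layer.2.attention.output.layernorm_quantizer_0",
    "bert.encoder.layer.2.attention.output.layernorm_quantizer_1",
    "bert.encoder.layer.2.output.layernorm_quantizer_0",
    "bert.encoder.layer.2.output.layernorm_quantizer_1",
    "bert.encoder.layer.3.attention.output.dense._input_quantizer",
    "bert.encoder.layer.10.attention.self.key._input_quantizer",
    "bert.encoder.layer.11.attention.output.dense._input_quantizer",
    "bert.encoder.layer.11.output.dense._input_quantizer",
]

# Capture the encoder layer index and the remainder of the module path.
pattern = re.compile(r"bert\.encoder\.layer\.(\d+)\.(.+)")

by_layer = defaultdict(list)
for name in disabled:
    match = pattern.fullmatch(name)
    if match:
        by_layer[int(match.group(1))].append(match.group(2))

for layer in sorted(by_layer):
    print(f"layer {layer}: {len(by_layer[layer])} quantizer(s) disabled")
    for suffix in by_layer[layer]:
        print(f"  {suffix}")
```

Nine quantizers across five of the encoder layers (1, 2, 3, 10, 11) are disabled, four of them in layer 2.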

The debug logs from the A4000 run are attached here:

trt_logs_int8_quantization.txt

Also, it looks like there is no option to quantize the embeddings. Is there a particular reason not to quantize them?

Any insight into these results is greatly appreciated. Thanks.

Versions:
Python: 3.10.9
transformer-deploy: 0.5.4
TensorRT: 8.4.1.5
ONNX Runtime (GPU): 1.12.0
CUDA: 11.7
