
fix checkpoint load error and stop updating parameters in evaluation stage #3124

Merged · 4 commits · Nov 30, 2020

Conversation

@eedalong (Contributor) commented Nov 25, 2020

#3119
When testing QAT in NNI, users would run into these problems:

  1. checkpoint loading fails, which causes incremental training to fail

  2. in the evaluation stage, the parameters for weight quantization and output quantization should not be updated

  3. when we load a checkpoint and then evaluate immediately, we see bad model performance because the internal variable `steps` is reset

This PR fixes these problems.
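
For problem 2, here is a minimal sketch of the intended behavior (illustrative names, not NNI's actual API): the statistics that feed the quantization parameters are only updated while the module is in training mode, so evaluation leaves them untouched.

```python
import torch
import torch.nn as nn

class FakeQuantize(nn.Module):
    """Hypothetical fake-quantization module, for illustration only."""
    def __init__(self, bits: int = 8):
        super().__init__()
        self.bits = bits
        self.register_buffer("tracked_min", torch.Tensor([0.0]))
        self.register_buffer("tracked_max", torch.Tensor([0.0]))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training:  # skip statistics updates in the evaluation stage
            self.tracked_min = torch.min(x.min(), self.tracked_min)
            self.tracked_max = torch.max(x.max(), self.tracked_max)
        # crude fake-quantization derived from the tracked range
        scale = (self.tracked_max - self.tracked_min) / ((1 << self.bits) - 1)
        return x if scale.item() == 0 else torch.round(x / scale) * scale

fq = FakeQuantize().eval()  # in eval mode, tracked_min/max stay fixed
```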

@eedalong changed the title from "fix checkpoint load error and stop updating parameters in training stage" to "fix checkpoint load error and stop updating parameters in evaluation stage" on Nov 25, 2020
@eedalong (Contributor, Author) commented:
Here `scale` and `zero_point` are defined as 1-dimension tensors, but they might get changed to 0-dimension tensors during calculation by the function call `update_quantization_param(bits, rmin, rmax)`.

`rmin` and `rmax` are calculated by `torch.min()` and `torch.max()`, so they are 0-dimension tensors. In `update_quantization_param(bits, rmin, rmax)`, the minimum is computed by `rmin = min(rmin, 0)`; after this step `rmin` is still a 0-dimension tensor, e.g. `tensor(1.23)` rather than `tensor([1.23])`.

Then when we save a checkpoint, we might save 0-dimension tensors for `scale` and `zero_point`. And now, boom!

Checkpoint loading fails due to a parameter size mismatch: `scale` and `zero_point` are defined as 1-dimension tensors, but the checkpoint contains 0-dimension tensors. This is fixed in commit 1ed0163.
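
A minimal standalone repro of the mismatch (illustrative, not NNI code): a buffer declared with shape `[1]` cannot be loaded from a checkpoint where it was saved with shape `[]`.

```python
import torch
import torch.nn as nn

class Demo(nn.Module):
    def __init__(self):
        super().__init__()
        # declared as a 1-dimension tensor, like NNI's scale/zero_point buffers
        self.register_buffer("scale", torch.Tensor([1.0]))

m = Demo()
m.scale = torch.tensor(1.23)  # simulate the collapse to a 0-dimension tensor
torch.save(m.state_dict(), "ckpt.pth")

fresh = Demo()
try:
    fresh.load_state_dict(torch.load("ckpt.pth"))
except RuntimeError as err:
    print(err)  # size mismatch for scale: checkpoint has torch.Size([]),
                # the current model expects torch.Size([1])
```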

```python
module.tracked_min_biased, module.tracked_min = update_ema(module.tracked_min_biased, current_min,
                                                           module.ema_decay, self.bound_model.steps)
module.tracked_max_biased, module.tracked_max = update_ema(module.tracked_max_biased, current_max,
                                                           module.ema_decay, self.bound_model.steps)
module.scale, module.zero_point = update_quantization_param(output_bits, module.tracked_min, module.tracked_max)
```
A contributor commented on this hunk:
At present, we have to re-calculate the scale and zero_point of both the output and the weight, because activation and weight quantization use the same `scale` and `zero_point` fields on the module. If we quantized both activation and weight during testing without updating them, the stored parameters would belong to only one of the two, and the result could be wrong. This design, in which scale and zero_point share the same parameter, doesn't make sense; we plan to refactor it in a following release.
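
To illustrate the aliasing the reviewer describes (a hypothetical sketch, not NNI's code): with one shared buffer pair, whichever quantization step runs last wins, and the other's parameters are lost.

```python
import torch
import torch.nn as nn

class SharedScale(nn.Module):
    """Hypothetical module with ONE shared scale buffer, as in the current design."""
    def __init__(self):
        super().__init__()
        self.register_buffer("scale", torch.Tensor([1.0]))

m = SharedScale()
m.scale = torch.Tensor([0.05])  # written by weight quantization
m.scale = torch.Tensor([0.20])  # overwritten by output quantization
print(m.scale)                  # tensor([0.2000]) -- the weight's scale is gone
```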

@eedalong (Author) replied:

Um, indeed activation and weight use the same field, but once a user has configured output quantization for a layer, `module.zero_point` and `module.scale` will always hold the activation's zero_point and scale, so it will not cause anything wrong.

```diff
 for layer, config in modules_to_compress:
-    layer.module.register_buffer("zero_point", None)
-    layer.module.register_buffer("scale", None)
+    layer.module.register_buffer("zero_point", torch.Tensor([0.0]))
```
A contributor asked:
If we just used a 0-dimension definition here, wouldn't it be simpler, since we wouldn't need to add the conversion?

@eedalong (Author) replied on Nov 29, 2020:
Defining them as 0-dimension tensors does save some work; defining them as 1-dimension tensors is just to keep them consistent with the other buffered variables. The other buffered variables also each hold a single real value, so there is no reason why scale and zero_point, which are likewise just two real numbers, should be defined as 0-dimension tensors. It might confuse users.

@eedalong (Author) commented:
Added CUDA conversion in the quantization calculation: when using multi-GPU training we might hit a tensor type mismatch, i.e. an error from adding a CPU tensor and a CUDA tensor, although I have no idea why this happens.

```python
rmax = torch.max(rmax, torch.Tensor([0]).cuda())
qmin = torch.Tensor([0]).cuda()
qmax = torch.Tensor([(1 << bits) - 1]).cuda()
else:
```
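
A device-agnostic alternative for the same step (a sketch, not the code merged here): constructing the constants on the input tensor's own device avoids mixing CPU and CUDA tensors without hard-coding `.cuda()`.

```python
import torch

def quantization_range(rmax: torch.Tensor, bits: int):
    # Create every constant on rmax's device so CPU/CUDA never mix.
    zero = torch.zeros(1, device=rmax.device)
    rmax = torch.max(rmax, zero)  # the representable range must include 0
    qmin = torch.zeros(1, device=rmax.device)
    qmax = torch.full((1,), float((1 << bits) - 1), device=rmax.device)
    return rmax, qmin, qmax
```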
A contributor asked:
Why does this place use 1-dimension? It seems `torch.min`, `torch.max` and `update_ema` should return 0-dimension tensors, although the call at line 277 will return a 1-dimension tensor because it uses `self.bound_model.steps`. Maybe it would be better to unify these places.

@eedalong (Author) replied:

If we declare scale and zero_point as 1-dimension tensors in QAT but this place changes them to 0-dimension, then loading a checkpoint right after saving would give a size mismatch: scale and zero_point are declared as 1-dimension tensors, but in the checkpoint they were saved as 0-dimension tensors.

@eedalong (Author) added:

So what I do here is keep the size of scale and zero_point consistent throughout the calculation.
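
A small demonstration of why this works (illustrative, not NNI code): elementwise `torch.min` broadcasts a 0-dimension tensor against shape `[1]` and returns shape `[1]`, so clamping against a 1-dimension constant keeps `scale` and `zero_point` at a consistent size.

```python
import torch

rmin = torch.min(torch.rand(10))             # reduction -> 0-dimension tensor
print(rmin.shape)                            # torch.Size([])
rmin = torch.min(rmin, torch.Tensor([0.0]))  # broadcast against shape [1]
print(rmin.shape)                            # torch.Size([1])
```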

@liuzhe-lz merged commit fc0ff8c into microsoft:master on Nov 30, 2020