fix checkpoint load error and stop updating parameters in evaluation stage #3124
Conversation
Here scale and zero_point are defined as 1-dimension tensors, but their size might change to 0-dimension during calculation through the call update_quantization_param(bits, rmin, rmax). rmin and rmax are calculated by torch.min() and torch.max(), which return 0-dimension tensors, and inside update_quantization_param(bits, rmin, rmax) the minimum is taken with rmin = min(rmin, 0); after this step rmin is still a 0-dimension tensor, such as Tensor(1.23) rather than Tensor([1.23]). Then when we save a checkpoint, we might save 0-dimension tensors for scale and zero_point, and boom, the checkpoint load fails due to a parameter size mismatch: scale and zero_point are defined as 1-dimension tensors, but the checkpoint contains 0-dimension tensors. This is fixed in commit 1ed0163.
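A minimal sketch of this failure mode (illustrative only, not the NNI source; the module class M is a hypothetical stand-in):

```python
import torch

rmin = torch.min(torch.tensor([-1.23, 0.5]))  # torch.min returns a 0-dim tensor: tensor(-1.2300)
rmin = min(rmin, 0)                           # Python's min keeps the same 0-dim tensor
print(rmin.shape)                             # torch.Size([]) -- still 0-dim, not torch.Size([1])

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.register_buffer("scale", torch.Tensor([1.0]))  # declared 1-dim, shape [1]

m = M()
m.load_state_dict({"scale": torch.tensor(1.0)})  # 0-dim in checkpoint -> RuntimeError: size mismatch
```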
                                module.ema_decay, self.bound_model.steps)
module.tracked_max_biased, module.tracked_max = update_ema(module.tracked_max_biased, current_max,
                                                           module.ema_decay, self.bound_model.steps)
module.scale, module.zero_point = update_quantization_param(output_bits, module.tracked_min, module.tracked_max)
At present, we have to re-calculate the scale and zero_point of output and weight, because the activation and the weight both use the same scale and zero_point fields of the module. If we quantize both activation and weight during testing without updating, the stored parameters would correspond to only one of them, and the result could be wrong. But this design of sharing one scale and zero_point doesn't make sense; we plan to refactor it in a following release.
Um, indeed activation and weight use the same fields, but once a user has configured quantization of the layer's output, module.zero_point and module.scale will always hold the activation's zero_point and scale, so it will not cause anything wrong.
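For illustration, a tiny sketch (hypothetical names and values, not NNI code) of what sharing one pair of fields means:

```python
from types import SimpleNamespace

module = SimpleNamespace()
module.scale, module.zero_point = 0.02, 128.0  # written during weight quantization
module.scale, module.zero_point = 0.05, 110.0  # overwritten during output quantization
# the module only holds the last writer's parameters; when output quantization is
# configured, the fields consistently end up holding the activation's values
print(module.scale, module.zero_point)  # 0.05 110.0
```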
for layer, config in modules_to_compress:
layer.module.register_buffer("zero_point", None)
layer.module.register_buffer("scale", None)
layer.module.register_buffer("zero_point", torch.Tensor([0.0]))
If we just used a 0-dimension definition here, wouldn't it be simpler, since we wouldn't need to add the conversion?
Defining them as 0-dimension does save some work; defining them as 1-dimension tensors is just for staying consistent with the other buffered variables. The other buffered variables are also just real values but are kept as 1-dimension tensors, so there's no reason why scale and zero_point, which are likewise just two real numbers, should be defined as 0-dimension tensors. It might confuse users.
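For reference, the two shapes under discussion (illustrative):

```python
import torch

print(torch.Tensor([0.0]).shape)  # torch.Size([1]) -- 1-dim, as registered in the buffer above
print(torch.tensor(0.0).shape)    # torch.Size([])  -- 0-dim, what torch.min()/torch.max() return
```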
Add a CUDA transformation in the quantize calculation, because when using multi-GPU training we might hit a tensor type mismatch, i.e. an error from adding a CPU tensor and a CUDA tensor, although I have no idea why this happens. (A device-aware alternative is sketched later in this thread.)
rmax = torch.max(rmax, torch.Tensor([0]).cuda())
qmin = torch.Tensor([0]).cuda()
qmax = torch.Tensor([(1 << bits) - 1]).cuda()
else:
Why does this place use 1-dimension tensors? It seems torch.min, torch.max, and update_ema should return 0-dimension, although the call in line 277 will return 1-dimension due to the use of self.bound_model.steps. Maybe it would be better to unify these places.
If we declare scale and zero_point as 1-dimension tensors in QAT but change them to 0-dimension here, then when we load a checkpoint after saving we would get a shape mismatch, because scale and zero_point are declared as 1-dimension tensors but are saved in the checkpoint as 0-dimension tensors.
So what I do here is keep the size of scale and zero_point consistent throughout all calculations.
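A minimal sketch of such an update that keeps 1-dim shapes throughout and creates its constants on the input's device rather than hard-coding .cuda() (illustrative; the formula is standard asymmetric quantization, not necessarily the PR's exact code):

```python
import torch

def update_quantization_param(bits, rmin, rmax):
    # broadcasting against a shape-[1] zero keeps the result 1-dim even when
    # rmin/rmax arrive as 0-dim tensors from torch.min()/torch.max()
    zero = torch.zeros(1, device=rmin.device)  # same device as the input, CPU or CUDA
    rmin = torch.min(rmin, zero)
    rmax = torch.max(rmax, zero)
    qmin, qmax = 0, (1 << bits) - 1
    scale = torch.clamp((rmax - rmin) / (qmax - qmin), min=1e-8)
    zero_point = qmin - torch.round(rmin / scale)
    return scale, zero_point  # both shape [1], matching the registered buffers
```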
#3119
When testing QAT in NNI, users would run into these problems:

- checkpoint loading fails, which causes incremental training to fail
- in the evaluation stage, the parameters of weight quantization and output quantization should not be updated (see the sketch after this list)
- when we load a checkpoint and then evaluate immediately, we see bad model performance because the inner variable `steps` gets reset

This PR fixes these problems.
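As an illustration of the evaluation-stage point, a self-contained sketch of the guard (FakeQuantize and its fields are hypothetical names, not NNI's API):

```python
import torch

class FakeQuantize(torch.nn.Module):
    def __init__(self, bits=8, ema_decay=0.99):
        super().__init__()
        self.bits, self.ema_decay = bits, ema_decay
        # 1-dim buffers so checkpoints round-trip with the registered shapes
        self.register_buffer("scale", torch.Tensor([1.0]))
        self.register_buffer("zero_point", torch.Tensor([0.0]))
        self.register_buffer("tracked_min", torch.Tensor([0.0]))
        self.register_buffer("tracked_max", torch.Tensor([1.0]))

    def forward(self, x):
        if self.training:  # in evaluation the statistics and parameters stay frozen
            d = self.ema_decay
            self.tracked_min = d * self.tracked_min + (1 - d) * x.min().reshape(1)
            self.tracked_max = d * self.tracked_max + (1 - d) * x.max().reshape(1)
            qmax = (1 << self.bits) - 1
            self.scale = torch.clamp((self.tracked_max - self.tracked_min) / qmax, min=1e-8)
            self.zero_point = torch.round(-self.tracked_min / self.scale)
        q = torch.clamp(torch.round(x / self.scale + self.zero_point), 0, (1 << self.bits) - 1)
        return (q - self.zero_point) * self.scale  # dequantized ("fake quantized") output

fq = FakeQuantize()
fq.eval()                   # evaluation stage: forward() no longer updates any parameter
y = fq(torch.randn(2, 3))
```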