
fix checkpoint load error and stop updating parameters in evaluation stage #3124

Merged · 4 commits · Nov 30, 2020

Conversation

@eedalong (Contributor) commented Nov 25, 2020

#3119
When testing QAT in NNI, users would run into these problems:

  1. checkpoint loading fails, which causes incremental training to fail

  2. in the evaluation stage, the parameters for weight quantization and output quantization should not be updated

  3. when we load a checkpoint and then evaluate immediately, we see bad model performance because the internal variable `steps` is reset

This PR fixes these problems.
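
For problem 2, here is a minimal sketch of the intended behavior (illustrative names, not NNI's actual API): the statistics that feed the quantization parameters are only updated while the module is in training mode, so evaluation leaves them untouched.

```python
import torch
import torch.nn as nn

class FakeQuantize(nn.Module):
    """Hypothetical fake-quantization module, for illustration only."""
    def __init__(self, bits: int = 8):
        super().__init__()
        self.bits = bits
        self.register_buffer("tracked_min", torch.Tensor([0.0]))
        self.register_buffer("tracked_max", torch.Tensor([0.0]))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training:  # skip statistics updates in the evaluation stage
            self.tracked_min = torch.min(x.min(), self.tracked_min)
            self.tracked_max = torch.max(x.max(), self.tracked_max)
        # crude fake-quantization derived from the tracked range
        scale = (self.tracked_max - self.tracked_min) / ((1 << self.bits) - 1)
        return x if scale.item() == 0 else torch.round(x / scale) * scale

fq = FakeQuantize().eval()  # in eval mode, tracked_min/max stay fixed
```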

@eedalong changed the title from "fix checkpoint load error and stop updating parameters in training stage" to "fix checkpoint load error and stop updating parameters in evaluation stage" on Nov 25, 2020
@eedalong (Contributor, Author) commented:
Here `scale` and `zero_point` are defined as 1-dimension tensors, but they might get changed to 0-dimension tensors during calculation by the function call `update_quantization_param(bits, rmin, rmax)`.

`rmin` and `rmax` are calculated by `torch.min()` and `torch.max()`, so they are 0-dimension tensors. In `update_quantization_param(bits, rmin, rmax)`, the minimum is computed by `rmin = min(rmin, 0)`; after this step `rmin` is still a 0-dimension tensor, e.g. `tensor(1.23)` rather than `tensor([1.23])`.

Then when we save a checkpoint, we might save 0-dimension tensors for `scale` and `zero_point`. And now, boom!

Checkpoint loading fails due to a parameter size mismatch: `scale` and `zero_point` are defined as 1-dimension tensors, but the checkpoint contains 0-dimension tensors. This is fixed in commit 1ed0163.
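
A minimal standalone repro of the mismatch (illustrative, not NNI code): a buffer declared with shape `[1]` cannot be loaded from a checkpoint where it was saved with shape `[]`.

```python
import torch
import torch.nn as nn

class Demo(nn.Module):
    def __init__(self):
        super().__init__()
        # declared as a 1-dimension tensor, like NNI's scale/zero_point buffers
        self.register_buffer("scale", torch.Tensor([1.0]))

m = Demo()
m.scale = torch.tensor(1.23)  # simulate the collapse to a 0-dimension tensor
torch.save(m.state_dict(), "ckpt.pth")

fresh = Demo()
try:
    fresh.load_state_dict(torch.load("ckpt.pth"))
except RuntimeError as err:
    print(err)  # size mismatch for scale: checkpoint has torch.Size([]),
                # the current model expects torch.Size([1])
```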

```python
module.tracked_min_biased, module.tracked_min = update_ema(module.tracked_min_biased, current_min,
                                                           module.ema_decay, self.bound_model.steps)
module.tracked_max_biased, module.tracked_max = update_ema(module.tracked_max_biased, current_max,
                                                           module.ema_decay, self.bound_model.steps)
module.scale, module.zero_point = update_quantization_param(output_bits, module.tracked_min, module.tracked_max)
```
A contributor commented on this hunk:
At present, we have to re-calculate the scale and zero_point of both the output and the weight, because activation and weight quantization use the same `scale` and `zero_point` fields on the module. If we quantized both activation and weight during testing without updating them, the stored parameters would belong to only one of the two, and the result could be wrong. This design, in which scale and zero_point share the same parameter, doesn't make sense; we plan to refactor it in a following release.
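
To illustrate the aliasing the reviewer describes (a hypothetical sketch, not NNI's code): with one shared buffer pair, whichever quantization step runs last wins, and the other's parameters are lost.

```python
import torch
import torch.nn as nn

class SharedScale(nn.Module):
    """Hypothetical module with ONE shared scale buffer, as in the current design."""
    def __init__(self):
        super().__init__()
        self.register_buffer("scale", torch.Tensor([1.0]))

m = SharedScale()
m.scale = torch.Tensor([0.05])  # written by weight quantization
m.scale = torch.Tensor([0.20])  # overwritten by output quantization
print(m.scale)                  # tensor([0.2000]) -- the weight's scale is gone
```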

@eedalong (Author) replied:

Um, indeed activation and weight use the same field, but once a user has configured output quantization for a layer, `module.zero_point` and `module.scale` will always hold the activation's zero_point and scale, so it will not cause anything wrong.

```diff
 for layer, config in modules_to_compress:
-    layer.module.register_buffer("zero_point", None)
-    layer.module.register_buffer("scale", None)
+    layer.module.register_buffer("zero_point", torch.Tensor([0.0]))
```
A contributor asked:
If we just used a 0-dimension definition here, wouldn't it be simpler, since we wouldn't need to add the conversion?

@eedalong (Author) replied on Nov 29, 2020:
Defining them as 0-dimension tensors does save some work; defining them as 1-dimension tensors is just to keep them consistent with the other buffered variables. The other buffered variables also each hold a single real value, so there is no reason why scale and zero_point, which are likewise just two real numbers, should be defined as 0-dimension tensors. It might confuse users.

@eedalong (Author) commented:
Added CUDA conversion in the quantization calculation: when using multi-GPU training we might hit a tensor type mismatch, i.e. an error from adding a CPU tensor and a CUDA tensor, although I have no idea why this happens.

```python
rmax = torch.max(rmax, torch.Tensor([0]).cuda())
qmin = torch.Tensor([0]).cuda()
qmax = torch.Tensor([(1 << bits) - 1]).cuda()
else:
```
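
A device-agnostic alternative for the same step (a sketch, not the code merged here): constructing the constants on the input tensor's own device avoids mixing CPU and CUDA tensors without hard-coding `.cuda()`.

```python
import torch

def quantization_range(rmax: torch.Tensor, bits: int):
    # Create every constant on rmax's device so CPU/CUDA never mix.
    zero = torch.zeros(1, device=rmax.device)
    rmax = torch.max(rmax, zero)  # the representable range must include 0
    qmin = torch.zeros(1, device=rmax.device)
    qmax = torch.full((1,), float((1 << bits) - 1), device=rmax.device)
    return rmax, qmin, qmax
```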
A contributor asked:
Why does this place use 1-dimension? It seems `torch.min`, `torch.max` and `update_ema` should return 0-dimension tensors, although the call at line 277 will return a 1-dimension tensor because it uses `self.bound_model.steps`. Maybe it would be better to unify these places.

@eedalong (Author) replied:

If we declare scale and zero_point as 1-dimension tensors in QAT but this place changes them to 0-dimension, then loading a checkpoint right after saving would give a size mismatch: scale and zero_point are declared as 1-dimension tensors, but in the checkpoint they were saved as 0-dimension tensors.

@eedalong (Author) added:

So what I do here is keep the size of scale and zero_point consistent throughout the calculation.
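
A small demonstration of why this works (illustrative, not NNI code): elementwise `torch.min` broadcasts a 0-dimension tensor against shape `[1]` and returns shape `[1]`, so clamping against a 1-dimension constant keeps `scale` and `zero_point` at a consistent size.

```python
import torch

rmin = torch.min(torch.rand(10))             # reduction -> 0-dimension tensor
print(rmin.shape)                            # torch.Size([])
rmin = torch.min(rmin, torch.Tensor([0.0]))  # broadcast against shape [1]
print(rmin.shape)                            # torch.Size([1])
```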

@liuzhe-lz merged commit fc0ff8c into microsoft:master on Nov 30, 2020