Skip to content

Commit

Permalink
【Hackathon No.66】为 PaddleScience 增加损失函数权重自适应功能 (#178)
Browse files Browse the repository at this point in the history
* add rfc of gradnorm

* update code

* fix format of code

* replace tab with space

* update
  • Loading branch information
Asthestarsfalll authored Jul 27, 2022
1 parent c1c6a87 commit d403256
Showing 1 changed file with 290 additions and 0 deletions.
290 changes: 290 additions & 0 deletions rfcs/Science/20220710_api_design_for_GradNorm.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,290 @@
# PaddleScience.network.GradNorm设计文档

| API名称 | paddlescience.network.GradNorm |
| ---------------------------------------- | --------------------------------------- |
| 提交作者<input type="checkbox" class="rowselector hidden"> | Asthestarsfalll |
| 提交时间<input type="checkbox" class="rowselector hidden"> | 2022-07-10 |
| 版本号 | V1.0 |
| 依赖飞桨版本<input type="checkbox" class="rowselector hidden"> | develop |
| 文件名 | 20220710_api_design_for_GradNorm.md<br> |

# 一、概述

## 1、相关背景

PINNs方法中损失函数由 PDE Loss、初值 Loss、边值 Loss及 data Loss 组成,其中每项损失函数有不同的权重。

[GradNorm](https://arxiv.org/abs/1711.02257)方法可以根据损失对各个任务的梯度来动态分配损失的权重,从而达到均衡各个损失梯度、稳定不同任务收敛速度的效果。

顾名思义,GradNorm即为利用梯度来对梯度进行归一化,论文中选用了所有损失的最后一层共享层,记为$W$。

其整体计算流程如下:

首先初始化可学习的损失权重,记为$w$,选取超参$\alpha$ 用于控制学习速度, $\alpha$ 越大,对训练速度的平衡限制越强。

在训练过程的第一步时,储存所有loss,记为$L(0)$,用于后续步骤。

1. 求和所有loss$\sum_iL_i(t)$,反向传播,获得所有参数的梯度,这一步需要保留计算图;
2. 取出最后一个共享层的参数$w$,分别求解每个loss相对于$W$的梯度$ \nabla WL_i(t)$ ,与对应的权重相乘取l2范数:$||\nabla Ww_i(t)L_i(t)||_2$, 记为$G^{(i)}_w(t) $ ;
3. 计算当前loss相对于初始loss的比例:$\widetilde{L}_i=L_i(t)/L_i(0)$ ,计算相对反向训练速度(inverse training rate)$r_i(t)=\widetilde{L}_i/E(\widetilde{L}_i)$ ,这里需要将其当做一个常数,因此在实现过程中需要转换为numpy计算;
4. 计算$\overline{G}_W(t) =E(G_W^i(t))$,此处同样作为常数。
5. 计算Norm Loss:$\sum_i|G_W^i(t)-\overline{G}_W(t)\times (r_i(t))^{\alpha}|_1$;
6. 通过Norm Loss求解相对于相对于$w$的梯度,手动更新$w$的梯度;
7. 更新参数即可。

## 2、功能目标

为 PaddleScience 新增 GradNorm Loss 权重自适应功能,集成入 PaddleScience 作为 API 调用。

## 3、意义

使得PaddleScience支持损失函数权重自适应的功能。

# 二、飞桨现状

Paddle拥有实现GradNorm所需的基础API。

# 三、业内方案调研

## pytorch-grad-norm

核心代码如下:

[初始化权重及获取最后一层共享层](https://github.com/brianlan/pytorch-grad-norm/blob/master/model.py#L24-L43)

```python
class RegressionTrain(torch.nn.Module):
def __init__(self, model):
# initialize the module using super() constructor
super(RegressionTrain, self).__init__()
# assign the architectures
self.model = model
# assign the weights for each task
self.weights = torch.nn.Parameter(torch.ones(model.n_tasks).float())
# loss function
self.mse_loss = MSELoss()

def forward(self, x, ts):
B, n_tasks = ts.shape[:2]
ys = self.model(x)

# check if the number of tasks is equal to this size
assert(ys.size()[1] == n_tasks)
task_loss = []
for i in range(n_tasks):
task_loss.append( self.mse_loss(ys[:,i,:], ts[:,i,:]) )
task_loss = torch.stack(task_loss)

return task_loss

def get_last_shared_layer(self):
return self.model.get_last_shared_layer()
```

[Norm Loss计算过程](https://github.com/brianlan/pytorch-grad-norm/blob/master/train.py#L67-L138)

```python
if t == 0:
# set L(0)
if torch.cuda.is_available():
initial_task_loss = task_loss.data.cpu()
else:
initial_task_loss = task_loss.data
initial_task_loss = initial_task_loss.numpy()

# get the total loss
loss = torch.sum(weighted_task_loss)
# clear the gradients
optimizer.zero_grad()
# do the backward pass to compute the gradients for the whole set of weights
# This is equivalent to compute each \nabla_W L_i(t)
loss.backward(retain_graph=True)

# set the gradients of w_i(t) to zero because these gradients have to be updated using the GradNorm loss
#print('Before turning to 0: {}'.format(model.weights.grad))
model.weights.grad.data = model.weights.grad.data * 0.0
#print('Turning to 0: {}'.format(model.weights.grad))


# switch for each weighting algorithm:
# --> grad norm
if args.mode == 'grad_norm':

# get layer of shared weights
W = model.get_last_shared_layer()

# get the gradient norms for each of the tasks
# G^{(i)}_w(t)
norms = []
for i in range(len(task_loss)):
# get the gradient of this task loss with respect to the shared parameters
gygw = torch.autograd.grad(task_loss[i], W.parameters(), retain_graph=True)
# compute the norm
norms.append(torch.norm(torch.mul(model.weights[i], gygw[0])))
norms = torch.stack(norms)
#print('G_w(t): {}'.format(norms))


# compute the inverse training rate r_i(t)
# \curl{L}_i
if torch.cuda.is_available():
loss_ratio = task_loss.data.cpu().numpy() / initial_task_loss
else:
loss_ratio = task_loss.data.numpy() / initial_task_loss
# r_i(t)
inverse_train_rate = loss_ratio / np.mean(loss_ratio)
#print('r_i(t): {}'.format(inverse_train_rate))


# compute the mean norm \tilde{G}_w(t)
if torch.cuda.is_available():
mean_norm = np.mean(norms.data.cpu().numpy())
else:
mean_norm = np.mean(norms.data.numpy())
#print('tilde G_w(t): {}'.format(mean_norm))


# compute the GradNorm loss
# this term has to remain constant
constant_term = torch.tensor(mean_norm * (inverse_train_rate ** args.alpha), requires_grad=False)
if torch.cuda.is_available():
constant_term = constant_term.cuda()
#print('Constant term: {}'.format(constant_term))
# this is the GradNorm loss itself
grad_norm_loss = torch.tensor(torch.sum(torch.abs(norms - constant_term)))
#print('GradNorm loss {}'.format(grad_norm_loss))

# compute the gradient for the weights
model.weights.grad = torch.autograd.grad(grad_norm_loss, model.weights)[0]
```

# 四、对比分析

实现清晰,与论文一致,按照上述代码实现即可。

# 五、设计思路与实现方案

## 命名与参数设计

API设计为`paddlescience.network.GradNorm(net, n_loss, alpha, weight_attr=None)`

## 底层OP设计

无需设计底层OP。

## API实现方案

paddlescience.network.GradNorm实现于paddlescience\network\grad_norm.py文件中,初步实现如下:

```python
import numpy as np
import paddle
import paddle.nn
from paddle.nn.initializer import Assign
from .network_base import NetworkBase

class GradNorm(NetworkBase):
r"""
Gradient normalization for adaptive loss balancing.
Parameters:
net(NetworkBase): The network which must have "get_shared_layer" method.
n_loss(int): The number of loss, must be greater than 1.
alpha(float): The hyperparameter which controls learning rate.
weight_attr(list, tuple): The inital weights for "loss_weights". If not specified, "loss_weights" will be initialized with 1.
"""
def __init__(self, net, n_loss, alpha, weight_attr=None):
super().__init__()
if n_loss <= 1:
raise ValueError("'n_loss' must be greater than 1, but got {}".format(n_loss))
if alpha < 0:
raise ValueError("'alpha' is a positive number, but got {}".format(alpha))
if weight_attr is not None:
if len(weight_attr) != n_loss:
raise ValueError("weight_attr must have same length with loss weights.")

self.net = net
self.loss_weights = self.create_parameter(
shape=[n_loss], attr=Assign(weight_attr if weight_attr else [1] * n_loss),
dtype=self._dtype, is_bias=False)
self.set_grad()
self.alpha = float(alpha)
self.initial_losses = None

def nn_func(self, ins):
return self.net.nn_func(ins)

def __getattr__(self, __name):
try:
return super().__getattr__(__name)
except:
return getattr(self.net, __name)

def get_grad_norm_loss(self, losses):
if isinstance(losses, list):
losses = paddle.concat(losses)
if self.initial_losses is None:
self.initial_losses = losses.numpy()

W = self.net.get_last_shared_layer()
if self.loss_weights.grad is not None
self.loss_weights.grad.set_value(paddle.zeros_like(self.loss_weights))

norms = []
for i in range(losses.shape[0]):
grad = paddle.autograd.grad(losses[i], W, retain_graph=True)
norms.append(paddle.norm(self.loss_weights[i]*grad[0], p=2))
norms = paddle.concat(norms)

loss_ratio = losses.numpy() / self.initial_losses
inverse_train_rate = loss_ratio / np.mean(loss_ratio)
mean_norm = np.mean(norms.numpy())

constant_term = paddle.to_tensor(mean_norm * np.power(inverse_train_rate, self.alpha), dtype=self._dtype)

grad_norm_loss = paddle.norm(norms - constant_term, p=1)
self.loss_weights.grad.set_value(paddle.autograd.grad(grad_norm_loss, self.loss_weights)[0])

return grad_norm_loss

def reset_initial_losses(self):
self.initial_losses = None

def set_grad(self):
x = paddle.ones_like(self.loss_weights)
x *= self.loss_weights
x.backward()
self.loss_weights.grad.set_value(paddle.zeros_like(self.loss_weights))

def get_weights(self):
return self.loss_weights.numpy()
```

这里对实现的要点进行说明:

1. GradNorm接受一个网络实例,需要为`NetworkBase`的子类,且必须要有`get_shared_layer`方法,该方法可以返回任意共享层,论文中为最后一层。为了拓展功能,后续可以考虑如何开放接口让用户手动选择。
2. 需要输入loss的数量,当`weight_attr`未指定时,权重会全部初始化未1,否则按照给定权值初始化。
3. 重写了`__getattr__`方法,可以通过`.`直接调用传入网络的方法,使得用户在使用GradNorm前后都可以使用相同的方式调用网络的方法,而无需修改。
4. `set_grad()` 用于初始化权重的梯度,由于在paddle中无法直接对梯度使用赋值语句,需要使用`Tensor.grad.set_value()`,而刚初始化时grad为None,无法调用set_value方法,因此需要使用`backward()` 来获得初始权重。

# 六、测试和验收的考量

1. 保证前向反向与论文中所描述的一致;
2. 保证训练、测试等流程不出现错误。

# 七、可行性分析和排期规划

已完成部分工作,规定时间内可全部完成。

# 八、影响面

需要为网络添加一个`get_shared_layer()`方法。

此外当`GradNorm` 拥有与传入网络实例相同名称的属性或方法时,以GradNorm为准,可能会对已有代码造成一些影响,修改后即可使用。

# 名词解释


# 附件及参考资料

[GradNorm: Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks](https://arxiv.org/pdf/1711.02257.pdf)

0 comments on commit d403256

Please sign in to comment.