[Hackathon No.66] Add adaptive loss weighting to PaddleScience #178

Merged
merged 5 commits into from
Jul 27, 2022
290 changes: 290 additions & 0 deletions rfcs/Science/20220710_api_design_for_GradNorm.md
@@ -0,0 +1,290 @@
# Design Document for paddlescience.network.GradNorm

| API name | paddlescience.network.GradNorm |
| ---------------------------------------- | --------------------------------------- |
| Author | Asthestarsfalll |
| Submission date | 2022-07-10 |
| Version | V1.0 |
| Required Paddle version | develop |
| File name | 20220710_api_design_for_GradNorm.md |

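# I. Overview

## 1. Background

In PINNs (physics-informed neural networks), the loss function is composed of a PDE loss, an initial-condition loss, a boundary-condition loss, and a data loss, each with its own weight.

[GradNorm](https://arxiv.org/abs/1711.02257) dynamically adjusts the loss weights based on the gradient of each task's loss. This balances the gradient magnitudes of the individual losses and keeps the different tasks converging at similar rates.

As the name suggests, GradNorm normalizes gradient magnitudes. The paper measures them on the last layer shared by all losses, whose parameters are denoted $W$.

The overall procedure is as follows.

First, initialize the learnable loss weights, denoted $w$, and choose a hyperparameter $\alpha$ that controls how strongly the training rates are balanced: the larger $\alpha$ is, the stronger the constraint that pulls the tasks toward a common training rate.

At the first training step, store all loss values as $L(0)$ for use in later steps. Then, at each step $t$: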

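1. Compute the weighted sum of all losses, $\sum_i w_i(t)L_i(t)$, and backpropagate to obtain the gradients of all parameters. The computation graph must be retained at this step;
2. Take the parameters $W$ of the last shared layer, compute the gradient of each loss with respect to $W$, $\nabla_W L_i(t)$, multiply it by the corresponding weight, and take the L2 norm: $G_W^{(i)}(t) = \|\nabla_W w_i(t)L_i(t)\|_2$;
3. Compute the ratio of the current loss to the initial loss, $\widetilde{L}_i(t)=L_i(t)/L_i(0)$, and the relative inverse training rate $r_i(t)=\widetilde{L}_i(t)/E[\widetilde{L}_i(t)]$. This quantity must be treated as a constant, so the implementation computes it in NumPy;
4. Compute the mean gradient norm $\overline{G}_W(t) = E[G_W^{(i)}(t)]$, which is also treated as a constant;
5. Compute the Norm Loss: $\sum_i\left|G_W^{(i)}(t)-\overline{G}_W(t)\times [r_i(t)]^{\alpha}\right|_1$;
6. Compute the gradient of the Norm Loss with respect to $w$ and manually assign it as the gradient of $w$;
7. Update the parameters.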
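The arithmetic in steps 2 to 6 can be sketched without any deep-learning framework. The following is a minimal pure-Python sketch, not the implementation proposed in this document. It assumes the per-task gradient norms $\|\nabla_W L_i(t)\|_2$ are already known and that all weights are positive, so that $G_W^{(i)}(t) = w_i \|\nabla_W L_i(t)\|_2$ and the gradient of the Norm Loss with respect to $w_i$ has a closed form. The function name and the sample numbers are hypothetical.

```python
def gradnorm_loss_and_weight_grads(weights, raw_grad_norms, losses,
                                   initial_losses, alpha):
    """Return (norm_loss, d norm_loss / d w_i) for positive weights.

    raw_grad_norms[i] is assumed to hold ||grad_W L_i(t)||_2.
    """
    n = len(weights)

    # Step 2: G_i = ||w_i * grad_W L_i||_2 = w_i * ||grad_W L_i||_2 for w_i > 0.
    G = [w * g for w, g in zip(weights, raw_grad_norms)]

    # Step 3: loss ratios and relative inverse training rates (constants).
    ratios = [cur / init for cur, init in zip(losses, initial_losses)]
    mean_ratio = sum(ratios) / n
    r = [x / mean_ratio for x in ratios]

    # Step 4: mean gradient norm (constant).
    mean_G = sum(G) / n

    # Step 5: L1 distance between each G_i and its target.
    targets = [mean_G * ri ** alpha for ri in r]
    norm_loss = sum(abs(gi - ti) for gi, ti in zip(G, targets))

    # Step 6: with targets held constant, d|G_i - t_i|/dw_i = sign(G_i - t_i) * g_i.
    def sign(x):
        return (x > 0) - (x < 0)

    grads = [sign(gi - ti) * g
             for gi, ti, g in zip(G, targets, raw_grad_norms)]
    return norm_loss, grads


# Hypothetical two-task example: task 0 has the larger gradient norm and has
# also trained faster (its loss ratio is smaller).
loss, grads = gradnorm_loss_and_weight_grads(
    weights=[1.0, 1.0],
    raw_grad_norms=[4.0, 1.0],
    losses=[0.5, 0.8],
    initial_losses=[1.0, 1.0],
    alpha=1.0,
)
print(round(loss, 4), grads)  # 4.1538 [4.0, -1.0]
```

In this example a gradient-descent step on $w$ lowers the weight of task 0 and raises the weight of task 1, which moves both weighted gradient norms toward their targets.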

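## 2. Goals

Add GradNorm adaptive loss weighting to PaddleScience, integrated so that it can be called as an API.

## 3. Significance

PaddleScience gains support for adaptive loss-function weights.

# II. Current State of Paddle

Paddle already provides the basic APIs needed to implement GradNorm.

# III. Survey of Existing Solutions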

## pytorch-grad-norm

The core code is as follows:

[Initializing the weights and getting the last shared layer](https://github.com/brianlan/pytorch-grad-norm/blob/master/model.py#L24-L43)

```python
class RegressionTrain(torch.nn.Module):
    def __init__(self, model):
        # initialize the module using super() constructor
        super(RegressionTrain, self).__init__()
        # assign the architectures
        self.model = model
        # assign the weights for each task
        self.weights = torch.nn.Parameter(torch.ones(model.n_tasks).float())
        # loss function
        self.mse_loss = MSELoss()

    def forward(self, x, ts):
        B, n_tasks = ts.shape[:2]
        ys = self.model(x)

        # check if the number of tasks is equal to this size
        assert(ys.size()[1] == n_tasks)
        task_loss = []
        for i in range(n_tasks):
            task_loss.append( self.mse_loss(ys[:,i,:], ts[:,i,:]) )
        task_loss = torch.stack(task_loss)

        return task_loss

    def get_last_shared_layer(self):
        return self.model.get_last_shared_layer()
```

[Computing the Norm Loss](https://github.com/brianlan/pytorch-grad-norm/blob/master/train.py#L67-L138)

```python
if t == 0:
    # set L(0)
    if torch.cuda.is_available():
        initial_task_loss = task_loss.data.cpu()
    else:
        initial_task_loss = task_loss.data
    initial_task_loss = initial_task_loss.numpy()

# get the total loss
loss = torch.sum(weighted_task_loss)
# clear the gradients
optimizer.zero_grad()
# do the backward pass to compute the gradients for the whole set of weights
# This is equivalent to compute each \nabla_W L_i(t)
loss.backward(retain_graph=True)

# set the gradients of w_i(t) to zero because these gradients have to be updated using the GradNorm loss
#print('Before turning to 0: {}'.format(model.weights.grad))
model.weights.grad.data = model.weights.grad.data * 0.0
#print('Turning to 0: {}'.format(model.weights.grad))


# switch for each weighting algorithm:
# --> grad norm
if args.mode == 'grad_norm':

    # get layer of shared weights
    W = model.get_last_shared_layer()

    # get the gradient norms for each of the tasks
    # G^{(i)}_w(t)
    norms = []
    for i in range(len(task_loss)):
        # get the gradient of this task loss with respect to the shared parameters
        gygw = torch.autograd.grad(task_loss[i], W.parameters(), retain_graph=True)
        # compute the norm
        norms.append(torch.norm(torch.mul(model.weights[i], gygw[0])))
    norms = torch.stack(norms)
    #print('G_w(t): {}'.format(norms))


    # compute the inverse training rate r_i(t)
    # \curl{L}_i
    if torch.cuda.is_available():
        loss_ratio = task_loss.data.cpu().numpy() / initial_task_loss
    else:
        loss_ratio = task_loss.data.numpy() / initial_task_loss
    # r_i(t)
    inverse_train_rate = loss_ratio / np.mean(loss_ratio)
    #print('r_i(t): {}'.format(inverse_train_rate))


    # compute the mean norm \tilde{G}_w(t)
    if torch.cuda.is_available():
        mean_norm = np.mean(norms.data.cpu().numpy())
    else:
        mean_norm = np.mean(norms.data.numpy())
    #print('tilde G_w(t): {}'.format(mean_norm))


    # compute the GradNorm loss
    # this term has to remain constant
    constant_term = torch.tensor(mean_norm * (inverse_train_rate ** args.alpha), requires_grad=False)
    if torch.cuda.is_available():
        constant_term = constant_term.cuda()
    #print('Constant term: {}'.format(constant_term))
    # this is the GradNorm loss itself
    grad_norm_loss = torch.tensor(torch.sum(torch.abs(norms - constant_term)))
    #print('GradNorm loss {}'.format(grad_norm_loss))

    # compute the gradient for the weights
    model.weights.grad = torch.autograd.grad(grad_norm_loss, model.weights)[0]
```

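# IV. Comparative Analysis

The implementation above is clear and consistent with the paper, so the Paddle implementation can follow it directly.

# V. Design and Implementation

## Naming and Parameter Design

The API is designed as `paddlescience.network.GradNorm(net, n_loss, alpha, weight_attr=None)`.

## Low-Level OP Design

No new low-level OP is required.

## API Implementation

`paddlescience.network.GradNorm` is implemented in `paddlescience/network/grad_norm.py`. A preliminary implementation is as follows: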

```python
import numpy as np
import paddle
import paddle.nn
from paddle.nn.initializer import Assign
from .network_base import NetworkBase


class GradNorm(NetworkBase):
    r"""
    Gradient normalization for adaptive loss balancing.

    Parameters:
        net(NetworkBase): The network, which must have a "get_shared_layer" method.
        n_loss(int): The number of losses, must be greater than 1.
        alpha(float): The hyperparameter that controls how strongly the training rates are balanced.
        weight_attr(list, tuple): The initial weights for "loss_weights". If not specified, "loss_weights" will be initialized with 1.
    """
    def __init__(self, net, n_loss, alpha, weight_attr=None):
        super().__init__()
        if n_loss <= 1:
            raise ValueError("'n_loss' must be greater than 1, but got {}".format(n_loss))
        if alpha < 0:
            raise ValueError("'alpha' must be non-negative, but got {}".format(alpha))
        if weight_attr is not None:
            if len(weight_attr) != n_loss:
                raise ValueError("'weight_attr' must have the same length as 'n_loss'.")

        self.net = net
        self.loss_weights = self.create_parameter(
            shape=[n_loss], attr=Assign(weight_attr if weight_attr else [1] * n_loss),
            dtype=self._dtype, is_bias=False)
        # Create a zero gradient so that grad.set_value() can be used later.
        self.set_grad()
        self.alpha = float(alpha)
        self.initial_losses = None

    def nn_func(self, ins):
        return self.net.nn_func(ins)

    def __getattr__(self, name):
        # Fall back to the wrapped network for attributes GradNorm lacks.
        try:
            return super().__getattr__(name)
        except AttributeError:
            return getattr(self.net, name)

    def get_grad_norm_loss(self, losses):
        if isinstance(losses, list):
            losses = paddle.concat(losses)
        # Store L(0) on the first call.
        if self.initial_losses is None:
            self.initial_losses = losses.numpy()

        # Parameters of the shared layer (a Tensor or a list of Tensors).
        W = self.net.get_shared_layer()
        if self.loss_weights.grad is not None:
            self.loss_weights.grad.set_value(paddle.zeros_like(self.loss_weights))

        # G_W^{(i)}(t): L2 norm of the weighted gradient of each loss w.r.t. W.
        norms = []
        for i in range(losses.shape[0]):
            grad = paddle.autograd.grad(losses[i], W, retain_graph=True)
            norms.append(paddle.norm(self.loss_weights[i]*grad[0], p=2))
        norms = paddle.concat(norms)

        # r_i(t) and the mean norm are constants, so compute them in NumPy.
        loss_ratio = losses.numpy() / self.initial_losses
        inverse_train_rate = loss_ratio / np.mean(loss_ratio)
        mean_norm = np.mean(norms.numpy())

        constant_term = paddle.to_tensor(mean_norm * np.power(inverse_train_rate, self.alpha), dtype=self._dtype)

        grad_norm_loss = paddle.norm(norms - constant_term, p=1)
        # Manually assign the gradient of the loss weights.
        self.loss_weights.grad.set_value(paddle.autograd.grad(grad_norm_loss, self.loss_weights)[0])

        return grad_norm_loss

    def reset_initial_losses(self):
        self.initial_losses = None

    def set_grad(self):
        # Run one backward pass so that loss_weights.grad exists, then zero it.
        x = paddle.ones_like(self.loss_weights)
        x *= self.loss_weights
        x.backward()
        self.loss_weights.grad.set_value(paddle.zeros_like(self.loss_weights))

    def get_weights(self):
        return self.loss_weights.numpy()
```

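The key points of the implementation are:

1. GradNorm takes a network instance, which must be a subclass of `NetworkBase` and must have a `get_shared_layer` method. This method may return any shared layer; the paper uses the last one. As a later extension, an interface could be exposed to let users choose the layer manually.
2. The number of losses must be provided. When `weight_attr` is not specified, all weights are initialized to 1; otherwise they are initialized to the given values.
3. `__getattr__` is overridden so that methods of the wrapped network can be called directly with `.`. Users can call the network's methods in the same way before and after wrapping it in GradNorm, without modifying their code.
4. `set_grad()` initializes the gradient of the weights. In Paddle a gradient cannot be assigned with a plain assignment statement, so `Tensor.grad.set_value()` must be used. Right after initialization `grad` is `None` and `set_value` cannot be called, so `backward()` is run once to create an initial gradient, which is then set to zero.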
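The attribute forwarding described in point 3 can be illustrated with plain Python objects. This is a minimal sketch of the pattern only: `InnerNet` and `Wrapper` are hypothetical stand-ins for the user network and for `GradNorm`, and no Paddle classes are involved. It also shows the precedence rule that follows from the pattern: a name defined on the wrapper is never forwarded.

```python
class InnerNet:
    """Hypothetical stand-in for a user network."""

    def __init__(self):
        self.name = "inner"

    def get_shared_layer(self):
        return "shared-layer-weights"

    def get_weights(self):
        return "inner weights"


class Wrapper:
    """Hypothetical stand-in for GradNorm: forwards unknown attributes."""

    def __init__(self, net):
        self.net = net

    def get_weights(self):
        # Defined on the wrapper, so it shadows InnerNet.get_weights.
        return "wrapper weights"

    def __getattr__(self, name):
        # Python calls __getattr__ only after normal lookup has failed.
        if name == "net":
            # Avoid infinite recursion if 'net' has not been set yet.
            raise AttributeError(name)
        return getattr(self.net, name)


wrapped = Wrapper(InnerNet())
print(wrapped.get_shared_layer())  # forwarded to InnerNet
print(wrapped.get_weights())       # resolved on the wrapper itself
```

Because lookup on the wrapper always wins, any attribute that exists on both objects must be reached through `wrapped.net` when the inner version is wanted.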

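# VI. Testing and Acceptance Considerations

1. Ensure that the forward and backward computations match the description in the paper;
2. Ensure that training, testing, and related workflows run without errors.

# VII. Feasibility Analysis and Schedule

Part of the work is already complete, and the rest can be finished within the allotted time.

# VIII. Impact

A `get_shared_layer()` method needs to be added to the network.

In addition, when `GradNorm` has an attribute or method with the same name as one on the wrapped network instance, the `GradNorm` version takes precedence. This may affect existing code, which will work again once it is adjusted accordingly.

# Glossary


# Attachments and References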

[GradNorm: Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks](https://arxiv.org/pdf/1711.02257.pdf)