
why is detach necessary #116

Closed
rogertrullo opened this issue Mar 20, 2017 · 17 comments
Comments

@rogertrullo

Hi, I am wondering why is detach necessary in this line:

output = netD(fake.detach())

I understand that we want to update the gradients of netD without changing the ones of netG. But if the optimizer is only using the parameters of netD, then only its weights will be updated. Am I missing something here?
Thanks in advance!
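A minimal sketch of the question's setup (using hypothetical `nn.Linear` stand-ins for the real DCGAN nets, just to show the mechanics of `detach()` in the D update):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins for the DCGAN nets, for illustration only
netG = nn.Linear(4, 8)   # "generator": noise -> fake sample
netD = nn.Linear(8, 1)   # "discriminator": sample -> score

noise = torch.randn(2, 4)
fake = netG(noise)

# D update: detach so the backward pass stops at `fake` and never
# computes gradients w.r.t. netG's weights
output = netD(fake.detach())
loss_d = output.mean()
loss_d.backward()

print(netD.weight.grad is not None)  # True - D got gradients
print(netG.weight.grad is None)      # True - G's gradient work was skipped
```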

@soumith
Member

soumith commented Mar 20, 2017

you are correct. it is done for speed, not correctness. The computation of gradients wrt the weights of netG can be fully avoided in the backward pass if the graph is detached where it is.

@soumith soumith closed this as completed Mar 20, 2017
@rogertrullo
Author

Thanks for the quick reply @soumith!

@sunshineatnoon

I missed detach when implementing DCGAN in PyTorch, and it gave me this error:

RuntimeError: Trying to backward through the graph second time, but the buffers have already been freed. Please specify retain_variables=True when calling backward for the first time.

@dantp-ai

dantp-ai commented Jul 1, 2018

@soumith This is not true. Detaching fake from the graph is necessary to avoid having to forward-pass the noise through G again when we actually update the generator. If we do not detach, then although fake is not needed for D's gradient update, it is still part of the computational graph, and because the backward pass frees the graph's intermediate buffers (retain_graph=False by default), fake's graph won't be available when G is updated.
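The mechanism described above can be reproduced in a few lines (hypothetical stand-in nets, not the actual DCGAN models): without detach, D's backward frees G's part of the graph, so the G step's backward fails.

```python
import torch
import torch.nn as nn

netG = nn.Linear(2, 3)   # hypothetical stand-in generator
netD = nn.Linear(3, 1)   # hypothetical stand-in discriminator
fake = netG(torch.randn(1, 2))

# D step WITHOUT detach: backward frees G's part of the graph too
netD(fake).sum().backward()

# G step reuses `fake`, whose saved buffers have already been freed
try:
    netD(fake).sum().backward()
except RuntimeError as err:
    print(type(err).__name__)  # RuntimeError - backward through the graph a second time
```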

@tusharkr

What @plopd said is absolutely right. Detaching fake from the graph is necessary, and not doing so will lead to an error.

@soumith
Member

soumith commented Dec 13, 2018

ah yes, what I said above is only true if we also retain_graph=True. My bad, I stand corrected.

@MeirGavish

Hello,
I am a little confused by this and would greatly appreciate help in understanding.
As I understand it, detach() prevents further computations from being tracked. (I suppose it also prevents previous computations from being taken into account in the backward pass?)

Either way, wouldn't you want to track the next computation, the operation of D over fake, for the backward pass of D?
If you wanted to prevent tracking of the generator, wouldn't it make sense to detach before applying G, and then restore tracking for D right at the point where detach is now called (with requires_grad_(True))?
Thank you
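A small sketch that may help with this confusion: detach() only severs the history *behind* the detached tensor. Operations applied *after* the detach are still tracked whenever they involve parameters that require grad, so D's backward pass works as usual. (The names below are hypothetical stand-ins, not the DCGAN code.)

```python
import torch
import torch.nn as nn

w = nn.Parameter(torch.ones(3))          # stands in for netD's weights
x = torch.randn(3, requires_grad=True)   # stands in for netG's output

out = (w * x.detach()).sum()  # detach cuts the history BEHIND x only
out.backward()

print(w.grad is not None)  # True - the op AFTER detach is still tracked
print(x.grad is None)      # True - nothing flows back past the detach point
```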

@colinshenc

@plopd what you are saying doesn't make any sense to me

@Einstellung

Let me explain. The role of detach is to freeze part of the gradient computation. Whether we are updating the discriminator or the generator, both updates go through log D(G(z)). For the discriminator, freezing G does not affect its gradient update: the inner function G(z) is treated as a constant, which does not change the gradient of the outer function D. Conversely, if D were frozen, gradients could not flow back to G at all, so we do not freeze D when training the generator. Instead, during the generator step we do compute gradients through D, but we never update D's weights (only optimizerG.step() is called), so the discriminator is unchanged while the generator trains.
You may ask: if that is so, why add detach when training the discriminator? Isn't it redundant? It is not: freezing G's gradient computation speeds up training, so we use it wherever we can. When training the generator, because of log D(G(z)) there is no way to freeze D's gradients, so no detach is written there.
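The two-step training logic described above can be sketched as follows (hypothetical stand-in nets and losses, to show only the detach/step mechanics):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
netG = nn.Linear(2, 3)   # hypothetical stand-in generator
netD = nn.Linear(3, 1)   # hypothetical stand-in discriminator
optD = torch.optim.SGD(netD.parameters(), lr=0.1)
optG = torch.optim.SGD(netG.parameters(), lr=0.1)

fake = netG(torch.randn(4, 2))

# D step: detach treats G's output as a constant, skipping G's gradient work
optD.zero_grad()
netD(fake.detach()).mean().backward()
optD.step()

# G step: no detach - gradients must flow THROUGH D back into G.
# D's gradients are computed, but only optG.step() is called,
# so D's weights do not move.
optG.zero_grad()
d_weights_before = netD.weight.detach().clone()
netD(fake).mean().backward()
optG.step()

print(torch.equal(netD.weight, d_weights_before))  # True - D unchanged
print(netG.weight.grad is not None)                # True - G got gradients
```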

@shiyuanyin

@Einstellung
hi
I have a question about the G model's gradient update.
First, there are three independent networks: the discriminator network D, the generator network G, and a source network S.
1. The discriminator first takes the merged feature values as input and outputs LogSoftmax(); the loss is a binary classification loss with labels 1 and 0, and the gradients are updated with backward().
2. Then the discriminator is applied again, taking the generator G's output features as input and producing LogSoftmax().
What I don't understand is how this gets connected to the generator G. They are two independent networks, so how does the discriminator's gradient update get linked to G???
The G network's input is the target data and labels, and its output is an fc feature.
The D network's input is G's features and the concatenation of G's and S's features; its output is LogSoftmax().

@shiyuanyin

S feature ->
G feature ->
[image]

@Einstellung

@shiyuanyin
I don't really understand what you are saying. You should use G's result to update D.

@shiyuanyin

@Einstellung

> I don't really know what you said. You should use G result to update D

    ############################
    # (2) Update G network: maximize log(D(G(z)))
    ###########################
    netG.zero_grad()
    label.data.fill_(real_label)  # fake labels are real for generator cost
    output = netD(fake)
    errG = criterion(output, label)
    errG.backward()
    D_G_z2 = output.data.mean()
    optimizerG.step()

########
output = netD(fake), output is related to D model , not G model, when backward(D get the input G gradient) ,the gadient how to give the G, because is not connect
##I think the process, netG.backward(the model D grad,to input G feature)
this is right???

@disanda

disanda commented Apr 19, 2020

detach just reduces the work of computing G()'s gradients during the training step of D(), because G() will be trained in the next step anyway.

@tjysdsg

tjysdsg commented Sep 26, 2021

> @soumith This is not true. Detaching fake from the graph is necessary to avoid forward-passing the noise through G when we actually update the generator. If we do not detach, then, although fake is not needed for gradient update of D, it will still be added to the computational graph and as a consequence of backward pass which clears all the variables in the graph (retain_graph=False by default), fake won't be available when G is updated.

If I understand correctly, then if i created a new noise input for G, there's no need for the detach() call?

@Shanmukh0028

> @soumith This is not true. Detaching fake from the graph is necessary to avoid forward-passing the noise through G when we actually update the generator. If we do not detach, then, although fake is not needed for gradient update of D, it will still be added to the computational graph and as a consequence of backward pass which clears all the variables in the graph (retain_graph=False by default), fake won't be available when G is updated.
>
> If I understand correctly, then if i created a new noise input for G, there's no need for the detach() call?

Do you mean creating fake1 = netG(noise), which is the same as fake before it was detached? I have the same doubt; can someone please clarify this?
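A sketch suggesting the answer is yes: a fresh forward pass rebuilds the graph, so no detach() is needed for the G step (though you pay for recomputing G's gradients in the D step). Hypothetical stand-in nets, for illustration only:

```python
import torch
import torch.nn as nn

netG = nn.Linear(2, 3)   # hypothetical stand-in generator
netD = nn.Linear(3, 1)   # hypothetical stand-in discriminator
noise = torch.randn(1, 2)

# D step without detach: wasteful (it also computes gradients w.r.t. netG),
# and backward() frees this graph's buffers
netD(netG(noise)).sum().backward()

# G step with a FRESH forward pass: fake1 carries a brand-new graph,
# so neither detach() nor retain_graph=True is needed
netG.zero_grad()
fake1 = netG(noise)
netD(fake1).sum().backward()
print(netG.weight.grad is not None)  # True - no "backward a second time" error
```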

@batman47steam

I think that's because if you don't use fake.detach() in output = netD(fake.detach()).view(-1), then fake is just an intermediate variable in the whole computational graph, which is tracked from netG() through netD(). When errD.backward() is called, all gradient buffers except those of leaf nodes are released, which means no gradient information about netG() remains in the computational graph. Then when you call errG.backward(), it raises an error.
