Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Windows 2016 windows-restart provisioner not waiting for reboot after patching #6733

Closed
nwesoccer opened this issue Sep 18, 2018 · 37 comments · Fixed by #6792
Closed

Windows 2016 windows-restart provisioner not waiting for reboot after patching #6733

nwesoccer opened this issue Sep 18, 2018 · 37 comments · Fixed by #6792

Comments

@nwesoccer
Copy link

We start with a base image for windows 2016 and windows 2016 core directly from iso. This packer template clones that vm and applies windows patches, reboots, applies windows patches again (because windows tends to require this multiple times), then reboots. The reboot following each of the windows 2016 patches cause the vm to go into the "Getting Windows ready. Don't turn off your computer" state. The windows-restart provisioner completes prior to the machine even shutting down. We have confirmed this behavior by adding -debug and stepping slowly through and watching the vm.

In the following screenshot you can see the machine is still on even though the windows-restart provisioner claims the machine successfully restarted. If we wait there for 5 or so minutes, the vm starts a progress counter, once it gets to 100% it reboots.

screen shot 2018-09-18 at 4 57 55 pm

@gusztavvargadr
Copy link

@nwesoccer does this happen also on the first reboot? You might want to try setting the WinRM service to delayed start.

@nwesoccer
Copy link
Author

@gusztavvargadr It happens after windows patches have been applied no matter which reboot it is. It's not about WinRM coming up to fast, the problem is that the server takes a while to go down because it's in the "Getting Windows ready. Don't turn off your computer". For some reason this state allows packer to pass the TryCheckReboot at https://github.com/hashicorp/packer/blob/master/provisioner/windows-restart/provisioner.go#L126 even though the machine has not even gone down yet. So the subsequent provisioners start executing thinking that the reboot is done, then the reboot actually happens mid-sebsequent provisioning steps. Best case scenario the build fails, worst case is that it succeeds and never rebooted because then the patching will not be fully applied.

@SwampDragons
Copy link
Contributor

As soon as I think I've finally nailed this provisioner to the wall, someone walks in with another bug 😭

I'll see if I can add whatever error code is generated by "Getting Windows ready. Don't turn off your computer" and make sure it doesn't pass the check.

@SwampDragons SwampDragons added this to the v1.3.2 milestone Sep 19, 2018
@kpshek
Copy link

kpshek commented Sep 20, 2018

I'm seeing this issue too. As a temporary/dirty hack, I've put a shell-local provisioner immediately after my windows-restart that merely sleeps for a couple of minutes.

{
  "type": "shell-local",
  "command": "echo \"Sleeping for 2 minutes...\"; sleep 120"
},

edit: The built-in pause_before provisioner configuration is a much better implementation of this hack. Inserting a pause_before on the next provisioner following the windows-restart has worked around this bug (at least in my case).

{
  "type": "windows-restart"
},
{
  "type": "powershell",
  "pause_before": "120s",
  "inline": ["dir c:\\"]
}

@kpshek
Copy link

kpshek commented Sep 20, 2018

Not sure if it helps, but this blog has documented the four places you can check for to see if there is a pending reboot in progress: http://ilovepowershell.com/2015/09/10/how-to-check-if-a-server-needs-a-reboot/

@kpshek
Copy link

kpshek commented Sep 20, 2018

@SwampDragons - I was able to pair down my template that has this same issue to a fairly simple repro: https://gist.github.cerner.com/ks8252/b75f146ab5a62e0337b9c4418c6aa015

In that gist I've included both the Packer template and my Packer log (obtained via PACKER_LOG=1) which shows the issue.

The relevant bit in the log is:

...
2018/09/20 12:40:14 [INFO] (telemetry) Starting provisioner windows-restart
==> googlecompute: Restarting Machine
2018/09/20 12:40:14 packer: 2018/09/20 12:40:14 [INFO] starting remote command: shutdown /r /f /t 0 /c "packer restart"
2018/09/20 12:40:15 packer: 2018/09/20 12:40:15 [INFO] command 'shutdown /r /f /t 0 /c "packer restart"' exited with code: 0
2018/09/20 12:40:15 packer: 2018/09/20 12:40:15 [INFO] RPC endpoint: Communicator ended with: 0
2018/09/20 12:40:15 [INFO] 0 bytes written for 'stdout'
2018/09/20 12:40:15 [INFO] 0 bytes written for 'stderr'
2018/09/20 12:40:15 [INFO] RPC client: Communicator ended with: 0
2018/09/20 12:40:15 [INFO] RPC endpoint: Communicator ended with: 0
2018/09/20 12:40:15 packer: 2018/09/20 12:40:15 [INFO] 0 bytes written for 'stdout'
2018/09/20 12:40:15 packer: 2018/09/20 12:40:15 [INFO] 0 bytes written for 'stderr'
2018/09/20 12:40:15 packer: 2018/09/20 12:40:15 [INFO] RPC client: Communicator ended with: 0
==> googlecompute: Waiting for machine to restart...
2018/09/20 12:40:16 packer: 2018/09/20 12:40:16 Check if machine is rebooting...
2018/09/20 12:40:16 packer: 2018/09/20 12:40:16 [INFO] starting remote command: shutdown /r /f /t 60 /c "packer restart test"
2018/09/20 12:40:16 packer: 2018/09/20 12:40:16 [INFO] command 'shutdown /r /f /t 60 /c "packer restart test"' exited with code: 1115
2018/09/20 12:40:16 packer: 2018/09/20 12:40:16 [INFO] RPC endpoint: Communicator ended with: 1115
2018/09/20 12:40:16 [INFO] 0 bytes written for 'stdout'
2018/09/20 12:40:16 [INFO] 40 bytes written for 'stderr'
2018/09/20 12:40:16 [INFO] RPC client: Communicator ended with: 1115
2018/09/20 12:40:16 [INFO] RPC endpoint: Communicator ended with: 1115
    googlecompute: A system shutdown is in progress.(1115)
2018/09/20 12:40:16 packer: 2018/09/20 12:40:16 [INFO] 0 bytes written for 'stdout'
2018/09/20 12:40:16 packer: 2018/09/20 12:40:16 [INFO] 40 bytes written for 'stderr'
2018/09/20 12:40:16 packer: 2018/09/20 12:40:16 [INFO] RPC client: Communicator ended with: 1115
2018/09/20 12:40:16 packer: 2018/09/20 12:40:16 Reboot already in progress, waiting...
2018/09/20 12:40:26 packer: 2018/09/20 12:40:26 Waiting for machine to reboot with timeout: 5m0s
2018/09/20 12:40:26 packer: 2018/09/20 12:40:26 Waiting for machine to become available...
2018/09/20 12:40:26 packer: 2018/09/20 12:40:26 Checking that communicator is connected with: 'powershell.exe -EncodedCommand ZQBjAGgAbwAgACIAJAB7AGUAbgB2ADoAQwBPAE0AUABVAFQARQBSAE4AQQBNAEUAfQAgAHIAZQBzAHQAYQByAHQAZQBkAC4AIgA='
2018/09/20 12:40:31 packer: 2018/09/20 12:40:31 [INFO] starting remote command: powershell.exe -EncodedCommand ZQBjAGgAbwAgACIAJAB7AGUAbgB2ADoAQwBPAE0AUABVAFQARQBSAE4AQQBNAEUAfQAgAHIAZQBzAHQAYQByAHQAZQBkAC4AIgA=
2018/09/20 12:40:32 packer: 2018/09/20 12:40:32 [INFO] command 'powershell.exe -EncodedCommand ZQBjAGgAbwAgACIAJAB7AGUAbgB2ADoAQwBPAE0AUABVAFQARQBSAE4AQQBNAEUAfQAgAHIAZQBzAHQAYQByAHQAZQBkAC4AIgA=' exited with code: 0
2018/09/20 12:40:32 packer: 2018/09/20 12:40:32 [INFO] RPC endpoint: Communicator ended with: 0
2018/09/20 12:40:32 [INFO] 28 bytes written for 'stdout'
2018/09/20 12:40:32 [INFO] 392 bytes written for 'stderr'
2018/09/20 12:40:32 [INFO] RPC client: Communicator ended with: 0
2018/09/20 12:40:32 [INFO] RPC endpoint: Communicator ended with: 0
    googlecompute: PACKER-5BA3DA90 restarted.
2018/09/20 12:40:32 packer: 2018/09/20 12:40:32 [INFO] 28 bytes written for 'stdout'
2018/09/20 12:40:32 packer: 2018/09/20 12:40:32 [INFO] 392 bytes written for 'stderr'
2018/09/20 12:40:32 packer: 2018/09/20 12:40:32 [INFO] RPC client: Communicator ended with: 0
    googlecompute: #< CLIXML
    googlecompute: <Objs Version="1.1.0.1" xmlns="http://schemas.microsoft.com/powershell/2004/04"><Obj S="progress" RefId="0"><TN RefId="0"><T>System.Management.Automation.PSCustomObject</T><T>System.Object</T></TN><MS><I64 N="SourceId">1</I64><PR N="Record"><AV>Preparing modules for first use.</AV><AI>0</AI><Nil /><PI>-1</PI><PC>-1</PC><T>Completed</T><SR>-1</SR><SD> </SD></PR></MS></Obj></Objs>
==> googlecompute: Machine successfully restarted, moving on
2018/09/20 12:40:32 [INFO] (telemetry) ending windows-restart

...<snip uploading PS1 on the next provisioner\>...

</SD></PR></MS></Obj></Objs>2018/09/20 12:40:40 [INFO] starting remote command: powershell -executionpolicy bypass -file "C:/Windows/Temp/packer-elevated-shell-5ba3db92-32c8-a706-d39b-a1220bbaacf7.ps1"
2018/09/20 12:40:51 packer: 2018/09/20 12:40:51 [INFO] command 'powershell -executionpolicy bypass -file "C:/Windows/Temp/packer-elevated-shell-5ba3db92-32c8-a706-d39b-a1220bbaacf7.ps1"' exited with code: 2147943515
2018/09/20 12:40:51 packer: 2018/09/20 12:40:51 [INFO] RPC endpoint: Communicator ended with: 2147943515
2018/09/20 12:40:51 [INFO] 0 bytes written for 'stdout'
2018/09/20 12:40:51 [INFO] 0 bytes written for 'stderr'
2018/09/20 12:40:51 [INFO] RPC client: Communicator ended with: 2147943515
2018/09/20 12:40:51 [INFO] RPC endpoint: Communicator ended with: 2147943515
2018/09/20 12:40:51 packer: 2018/09/20 12:40:51 [INFO] 0 bytes written for 'stdout'
2018/09/20 12:40:51 packer: 2018/09/20 12:40:51 [INFO] 0 bytes written for 'stderr'
2018/09/20 12:40:51 packer: 2018/09/20 12:40:51 [INFO] RPC client: Communicator ended with: 2147943515
2018/09/20 12:40:51 [INFO] (telemetry) ending powershell
...

If you put a "pause_before": "2m", on the second powershell provisioner, the image builds fine.

@SwampDragons
Copy link
Contributor

Thanks, Kevin!

@nwesoccer
Copy link
Author

I would love to use the pause_before, however sometimes it takes 5-10 minutes to finally shut down after the patches are applied so it's a pretty big waste of time. It's also prone to issues, if we pause for 10 minutes and it takes 11 mins we've found that the packer run could be "successful" even though the full reboot to finish applying patches did not yet occur. For now stepping through with -debug is time consuming but ensure we get everything applied.

@brettjacobson
Copy link

I'm also experiencing this problem when trying to use PSWindowsUpdate 2.0.0.4 to perform windows updates on Windows Server 2016 (or even just showing the available updates). This problem ONLY happens on Hyper-V; the exact same stuff works just fine under Virtual Box. This occurs with both packer 1.3.1 and 1.2.5.

The symptoms are that soon after my script issues an Install-WindowsUpdate or Show-WindowsUpdate, the next provisioner step executes (which of course, is a windows-restart!).
Windows Updates in our environment take anywhere from 3 minutes to 30 minutes, so adding a sleep statement is terribly unreliable.

Whatever PSWindowsUpdate is doing under Hyper-V, seems to trick Packer into thinking that the step is completed (successfully, too!), when in fact it still has a long way to go.

@gusztavvargadr
Copy link

gusztavvargadr commented Sep 27, 2018

@nwesoccer @brettjacobson what is your detailed setup if I might ask? I'm building using a quite similar setup and it works fine for me.

I have Windows 10 Pro latest for the Hyper-V and VirtualBox host, with Packer 1.2.5. Updates and windows-restart work fine with Windows 10 Enterprise, Windows 2016 Standard and Standard Core guests. For the guests I'm using the publicly available evaluation ISOs.

The only Windows Update customization I make is I turn off automatic Windows Updates as the first thing after the first boot to make sure it kicks off only when I request it.

@brettjacobson
Copy link

@gusztavvargadr For my case, I'm stuck using Windows 10, 1703 (thanks, Corp IT) as my host. so I fear this may be a situation where packer is no longer supporting Hyper-V as it was in 1703. It seems vagrant requires Hyper-V from 1709, now, so would not surprise me. Tomorrow I'm spinning up a Server 1709 box to try it out on (which will probably answer my questions).

slightly off-topic; how are you running both Hyper-V and VirtualBox on your host? I thought they were mutually exclusive?

@shurick81
Copy link

It works just fine on packer 1.2.4:

==> virtualbox-iso: Restarting Machine
==> virtualbox-iso: Waiting for machine to restart...
    virtualbox-iso: A system shutdown is in progress.(1115)
    virtualbox-iso: A system shutdown is in progress.(1115)
    virtualbox-iso: A system shutdown is in progress.(1115)
    virtualbox-iso: A system shutdown is in progress.(1115)
    virtualbox-iso: A system shutdown is in progress.(1115)
    virtualbox-iso: A system shutdown is in progress.(1115)
    virtualbox-iso: #< CLIXML
    virtualbox-iso: <Objs Version="1.1.0.1" xmlns="http://schemas.microsoft.com/powershell/2004/04"><Obj S="progress" Re
fId="0"><TN RefId="0"><T>System.Management.Automation.PSCustomObject</T><T>System.Object</T></TN><MS><I64 N="SourceId">1
</I64><PR N="Record"><AV>Preparing modules for first use.</AV><AI>0</AI><Nil /><PI>-1</PI><PC>-1</PC><T>Completed</T><SR
>-1</SR><SD> </SD></PR></MS></Obj></Objs>
    virtualbox-iso: WIN-KJJFDKIOGEB restarted.

The same when I upgrade to packer 1.2.5.

But then after I upgraded to 1.3.0, it does not detect ongoing installation properly:

==> virtualbox-iso: Restarting Machine
==> virtualbox-iso: Waiting for machine to restart...
    virtualbox-iso: A system shutdown is in progress.(1115)
    virtualbox-iso: WIN-SVJIF4QN5H7 restarted.
    virtualbox-iso: #< CLIXML
    virtualbox-iso: <Objs Version="1.1.0.1" xmlns="http://schemas.microsoft.com/powershell/2004/04"><Obj S="progress" Re
fId="0"><TN RefId="0"><T>System.Management.Automation.PSCustomObject</T><T>System.Object</T></TN><MS><I64 N="SourceId">1
</I64><PR N="Record"><AV>Preparing modules for first use.</AV><AI>0</AI><Nil /><PI>-1</PI><PC>-1</PC><T>Completed</T><SR
>-1</SR><SD> </SD></PR></MS></Obj></Objs>
==> virtualbox-iso: Machine successfully restarted, moving on

And the same on 1.3.1

@gusztavvargadr
Copy link

@brettjacobson yes, you can't run both Hyper-V and VirtualBox at the same time, but you can turn off Hyper-V and then VirtualBox works after a reboot: https://stackoverflow.com/questions/30496116/how-to-disable-hyper-v-in-command-line.

@brettjacobson
Copy link

brettjacobson commented Sep 29, 2018

@azr I'm not certain this is a problem with the windows-restart provisioner per se. It appears to me that its a problem where the regular powershell provisioner is thinking a step is completed (and thus, moving on to a windows-restart step which is always next with windows updates) prematurely. its like whatever Windows update processes are doing during the step are tricking packer+hyperv+powershell into thinking the step is complete when it is still active.

I personally am only seeing this behavior with HyperV; with virtual box it behaves as expected (ie, waits for the long windows update process to finish before moving on to the next provisioner step; which in these scenarios are always windows-restart).

@brettjacobson
Copy link

I think this issue is getting a little confusing. OP, @nwesoccer (and others) can you confirm if you believe this to be true:
The problem is the windows update step is being declared finished prematurely, NOT that the windows-restart step is not waiting for things properly.
we also need clarification if this is HyperV only situation, or other hypervisors as well (since it works fine for me on VirtualBox, but NOT hyperv)

@shurick81
Copy link

The tests that I brought above is from VirtualBox, on Windows 10 host machine.

@Occams
Copy link

Occams commented Oct 2, 2018

As soon as I think I've finally nailed this provisioner to the wall, someone walks in with another bug 😭

I'll see if I can add whatever error code is generated by "Getting Windows ready. Don't turn off your computer" and make sure it doesn't pass the check.

@SwampDragons
Hey there. We're also experiencing this issue after upgrading to 1.3.1.
I took the liberty to have a look at the code. Seems to me that an unnecessary break statement found it's way into the branch that invokes sleeping after checking the reboot condition. Thus, it never loops again as intended.

if cmd.ExitStatus == 1115 || cmd.ExitStatus == 1190 || cmd.ExitStatus == 1717 {
// Reboot already in progress but not completed
log.Printf("Reboot already in progress, waiting...")
time.Sleep(10 * time.Second)
break
}

@Poddo
Copy link
Contributor

Poddo commented Oct 2, 2018

Also having this problem. Using "pause_before" seems to be the best workaround, but it is impossible to know the best value to use. Sometimes you only need a few minutes, others it could be 30 depending on system specs and which updates are installing.

@briggsy87
Copy link

Also having this problem.
Using "pause_before" seems to be the best workaround, but as mentioned above, is impractical, as the time needed is dependent on the tasks done prior to the restart being needed.

@Occams Great find, I compared the code to the commit prior to the latest one and it seems to make sense as to why this is not functioning correctly. It will move onto the next step after a hard 10s, no matter what state the reboot is in, due to the break escaping the for loop.

@briggsy87
Copy link

For what it's worth, I commented out the break mentioned above and recompiled the binary, tested against my previously failing code where the windows-restart provisioner now properly loops when receiving the 1115 return code, until the VM has actually actually restarted.

azure-arm: Waiting for machine to restart...
azure-arm: A system shutdown is in progress.(1115)
azure-arm: A system shutdown is in progress.(1115)
azure-arm: A system shutdown is in progress.(1115)
azure-arm: A system shutdown is in progress.(1115)
azure-arm: A system shutdown is in progress.(1115)
azure-arm: restarted.

@Poddo
Copy link
Contributor

Poddo commented Oct 3, 2018

Tested and working in my environment also.

#6792

@azr
Copy link
Contributor

azr commented Oct 3, 2018

Hey @Occams, @Poddo & everyone thanks a lot for finding/bringing a fix, can anyone that had the issue please confirm that it fixed it for them. I will merge it then.

Here are some binaries for you to try out:

packer_linux_amd64.zip

packer_windows_amd64.zip

mac: packer_darwin_amd64.zip

@nwesoccer
Copy link
Author

Sorry for not responding promptly.

@brettjacobson I'm not exactly sure how to verify which it is.
@azr I'm not familiar with how to compile packer for testing the fix. I'm running on a mac with macOS High Sierra. I'd be happy to test it out.

@azr
Copy link
Contributor

azr commented Oct 3, 2018

I linked binaries in my comment up there so you don't have to do this.


But if you want to build packer manually, follow these steps

  • personally I brew install go to install go as it's quicker/simpler
  • then you will have to checkout Update provisioner.go #6792, we use a git alias for this:
[alias]
	pr = "!f() { git fetch -fu ${2:-origin} refs/pull/$1/head:pr/$1 && git checkout pr/$1; }; f"

to run git pr 6792

  • et voilà, now you just have to build a binary following steps in the first link

@nwesoccer
Copy link
Author

@azr Whoops, my brain didn't put two and two together to map darwin to mac. Will give it a try.

@azr
Copy link
Contributor

azr commented Oct 3, 2018

No worries, I should have been clearer 🙂 ( I edited it in case )

@marcinbojko
Copy link
Contributor

marcinbojko commented Oct 3, 2018

1.3.2-dev works. no more forced pauses.
I've tested with W2k16 STD image, also 1803 is in progress

@azr
Copy link
Contributor

azr commented Oct 3, 2018

Great news, thanks everyone; merge incoming. 🎉

@brettjacobson
Copy link

I think the problems I am having were due to my Host Windows 10 being build 1703 (with an old Hyper-V). Currently everything is working on another machine with 1803, and I'll be trying 1709 next week. This is with packer 1.3.1. (the problem is IT will only let me have 1703 at the moment)

@shurick81
Copy link

shurick81 commented Oct 6, 2018

https://github.com/hashicorp/packer/files/2441352/packer_windows_amd64.zip worked for me. It failed before on my 1703 host but with this version verything works just fine. I also had the same issues on another host, 1803, but I did not try the fix there. @brettjacobson, it can be some timing difference between slower and faster hosts maybe? How exactly is it going on your 1803, is packer really waiting for the updates installation and the output is something like this?

==> virtualbox-iso: Restarting Machine
==> virtualbox-iso: Waiting for machine to restart...
    virtualbox-iso: A system shutdown is in progress.(1115)
    virtualbox-iso: A system shutdown is in progress.(1115)
    virtualbox-iso: A system shutdown is in progress.(1115)
    virtualbox-iso: A system shutdown is in progress.(1115)
    virtualbox-iso: A system shutdown is in progress.(1115)
    virtualbox-iso: A system shutdown is in progress.(1115)
    virtualbox-iso: #< CLIXML

@brettjacobson
Copy link

@shurick81 with the dev version of packer linked above, I'm seeing the output exactly as you describe on both 1803 and 1709 hosts. But at least it is detecting the restart is finished, and goes on to the next step. This fix seems to work for my scenarios.

@SwampDragons
Copy link
Contributor

I could add some code to clean up the output, but that's a cosmetic fix and can probably wait; thanks everyone for all of your input on this issue.

@ghost
Copy link

ghost commented Oct 11, 2018

@shurick81 i've tried your release but i get this error

packer version
Packer v1.3.2-dev

packer validate C:\sre\packer\master.json
Template validation failed. Errors are shown below.
Errors validating build 'amazon-ebs'. 1 error(s) occurred:

  • Unknown region: eu-west-1

@SwampDragons
Copy link
Contributor

@fdaforno Interesting; the master branch currently contains some changes to the code that validates aws regions: b0774d1

If you let me know your host architecture I can revert that and see if that makes things work for you.

@chryzsh
Copy link

chryzsh commented Oct 12, 2018

I think I've nailed down my issue to something about my WinRM config, but also the fact that I noticed the VM does not get an IP from DHCP after the automatic reboot performed by windows update. Still have no idea how to fix that though.

@SwampDragons
Copy link
Contributor

@fdaforno I can't reproduce your validation error. :-/

@ghost
Copy link

ghost commented Mar 30, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@ghost ghost locked and limited conversation to collaborators Mar 30, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.