Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Windows service #1070

Merged
merged 11 commits into from
Nov 15, 2017
Merged

Windows service #1070

merged 11 commits into from
Nov 15, 2017

Conversation

samuelkarp
Copy link
Contributor

@samuelkarp samuelkarp commented Nov 12, 2017

Summary

This set of commits enables the agent to be run as a Windows service and write logs into the Windows event log.

amazon-ecs-agent-windows-scm
amazon-ecs-agent-windows-event-viewer

Note: This is marked WIP as it will fail to build because I haven't vendored the new dependencies (two more packages inside golang.org/x/sys/windows). I'm waiting for #1009 to be merged so I can manage dependencies with dep.

Implementation details

  • Added a new command-line argument to run as a Windows service
  • Refactored the termination handler to use a function provided as part of the agent struct
  • Implemented a new log receiver for seelog

TODO

  • Vendor dependencies
  • Set restart settings
  • Set dependency on Docker
  • Determine correct recovery settings
  • Determine if these need to be set in Go code or in PowerShell startup?

Testing

  • Builds on Linux (make release)
  • Builds on Windows (go build -out amazon-ecs-agent.exe ./agent) (as long as you have golang.org/x/sys/windows on your GOPATH)
  • Unit tests on Linux (make test) pass
  • Unit tests on Windows (go test -timeout=25s ./agent/...) pass (as long as you have golang.org/x/sys/windows on your GOPATH)
  • Integration tests on Linux (make run-integ-tests) pass
  • Integration tests on Windows (.\scripts\run-integ-tests.ps1) pass
  • Functional tests on Linux (make run-functional-tests) pass
  • Functional tests on Windows (.\scripts\run-functional-tests.ps1) pass
  • Starts the agent
  • Can be Started, Stopped, and Restarted from PowerShell and services.msc
  • Writes to Windows event log
  • Service stops gracefully on its own when Windows is shutting down or rebooting
  • Service starts correctly on boot when service is set to automatic startup or delayed start

New tests cover the changes: yes

Description for the changelog

Feature - The Amazon ECS agent can now be installed as a Windows service

Licensing

This contribution is under the terms of the Apache 2.0 License: yes

Copy link
Contributor

@tehshwann tehshwann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome. Thanks!

Few comments... mostly text nits.


const (
eventLogName = "AmazonECSAgent"
eventLogID = 999
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a comment here explaining that this is a default value so others know its overridden later on a case-by-case basis?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now it's a constant without being overridden. I'm going to leave it as-is for now and we can come back to making it play more nicely with Windows later.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would help to list the eventid in the Readme or in a comment so that it is easy to search. Even though windows event log id's are specific to the application and not always well documented, administrators like to be able to find the id quickly in a search.

@@ -86,7 +87,7 @@ func SetLevel(logLevel string) {
// GetLevel gets the log level
func GetLevel() string {
levelLock.RLock()
defer levelLock.RLock()
defer levelLock.RUnlock()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: RUnlock -> Runlock or RunLock??

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This look like it's a https://godoc.org/sync#RWMutex, so RUnlock is a reader taking a lock.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RUnlock is unlocking the read-lock. This is unrelated to Windows; I just saw gometalinter complain about it while I was working in the package.

README.md Outdated
```powershell
PS C:\> # Set up directories the agent uses
PS C:\> New-Item -Type directory -Path $ProgramFiles\Amazon\ECS
PS C:\> New-Item -Type directory -Path $ProgramData\Amazon\ECS
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New-Item -Type directory -Path **$ENV:**ProgramData\Amazon\ECS -Force

README.md Outdated

```powershell
PS C:\> # Set up directories the agent uses
PS C:\> New-Item -Type directory -Path $ProgramFiles\Amazon\ECS
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New-Item -Type directory -Path **$ENV:**ProgramFiles\Amazon\ECS -Force

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is New-Item -Type dir -Path ... -Force like mkdir -p?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming you meant -d, (dir) yes.

README.md Outdated
PS C:\> Invoke-RestMethod -OutFile $zipFile -Uri $agentZipUri
PS C:\> # Put the executables in the executable directory.
PS C:\> Expand-Archive -Path $zipFile -DestinationPath $ecsExeDir -Force
PS C:\> # Uncomment the following line to enable task IAM roles.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isnt super clear cause the following line is a comment. Maybe add a bool here for users to change so its more clear? Also... I think you want to cd into the exec dir in either case??

Example:

Set-Location $ecsExeDir
/# Set '$EnableTaskIAMRoles' to '$true' if you wish to use Task IAM Roles.
[bool]$EnableTaskIAMRoles = $false
if ($EnableTaskIAMRoles) {
.\hostsetup.ps1
}

acceptInsecureCertUsage = "Disable SSL certificate verification. We do not recommend setting this option"
licenseUsage = "Print the LICENSE and NOTICE files and exit"
blacholeEC2MetadataUsage = "Blackhole the EC2 Metadata requests. Setting this option can cause the ECS Agent to fail to work properly. We do not recommend setting this option"
ecsAttributesUsage = "Print the Agent's ECS Attributes based on its environment"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we removing this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This got reordered.

// ReceiveMessage receives a log line from seelog and emits it to the Windows event log
func (r *eventLogReceiver) ReceiveMessage(message string, level seelog.LogLevel, context seelog.LogContextInterface) error {
switch level {
case seelog.DebugLvl, seelog.InfoLvl:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: should this be indented?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, this is Go's spec :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gofmt does this. I recommend running gofmt on save every time you modify a Go file.

@@ -61,11 +66,11 @@ func FinalSave(saver statemanager.Saver, taskEngine engine.TaskEngine) error {
engineDisabled := make(chan error)

disableTimer := time.AfterFunc(engineDisableTimeout, func() {
engineDisabled <- errors.New("Timed out waiting for TaskEngine to settle")
engineDisabled <- errors.New("final save: timed out waiting for TaskEngine to settle")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Final save:...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is conventional idiomatic go casing for the error message to be lower cased. Here's an example:

agent error: this package errored fetching data: I couldn't reach the server.

This can come from 3 different sources and would look weird if it was capitalized after each :.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The convention is described here.

@@ -75,10 +80,10 @@ func FinalSave(saver statemanager.Saver, taskEngine engine.TaskEngine) error {

stateSaved := make(chan error)
saveTimer := time.AfterFunc(finalSaveTimeout, func() {
stateSaved <- errors.New("Timed out trying to save to disk")
stateSaved <- errors.New("final save: timed out trying to save to disk")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Final...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above: idiomatic error messages.

})
go func() {
log.Debug("Saving state before shutting down")
seelog.Debug("Saving state before shutting down")
stateSaved <- saver.ForceSave()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excuse my not-Go-savviness... Should this have a trap/catch with a log so in the event that the state was not saved correctly... its at least logged?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The saver.ForceSave sends its results on down the stateSaved channel and its up to the other side to determine what to do about that. However.. we could be better on how we log failures and its context.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The log line is above, on line 83. That function will get triggered on a timeout so we have a record that the save failed.

@adnxn
Copy link
Contributor

adnxn commented Nov 13, 2017

@samuelkarp #1009 merged =]

case <-agentDone:
// This means that the agent stopped on its own. This is where it's appropriate to light the event log on fire
// and set off all the alarms.
seelog.Error("Exiting")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add more context in this exit log message?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There isn't a ton of context here except that the agent stopped on its own... What would you like me to add here?

}

// sleepCtx provides a cancelable sleep
func sleepCtx(ctx context.Context, duration time.Duration) {
Copy link
Member

@jahkeup jahkeup Nov 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can instead use the stdlib (or golang.org/x/net) context.WithDeadline function and wait on ctx.Done() of that context.

@samuelkarp samuelkarp changed the title [WIP] Windows service Windows service Nov 14, 2017
Copy link
Contributor

@tehshwann tehshwann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do the things!

Copy link
Contributor

@cabbruzzese cabbruzzese left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Found some small suggestions and nits.


```powershell
PS C:\> # Set up directories the agent uses
PS C:\> New-Item -Type directory -Path $ProgramFiles\Amazon\ECS
PS C:\> New-Item -Type directory -Path $ProgramData\Amazon\ECS
PS C:\> New-Item -Type directory -Path ${env:ProgramFiles}\Amazon\ECS -Force
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to leave this as-is for now; we can come back and fix it later.

wg.Wait()
}

func TestHandler_Execute_WindowsStops(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A service has 30 seconds to report back a stopped status, and a 20 seconds to respond to a shutdown event. Can you add an assert to make sure it does not exceed these timeouts?
https://msdn.microsoft.com/en-us/library/windows/desktop/ms685149(v=vs.85).aspx

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unit tests are run with a 25s timeout right now. This completes in about a second. Our unit tests aren't really able to simulate the worst-case right now, so I'd rather not add an assert that won't tell us anything or try to make the unit test different. Instead, I think this can be addressed in a higher-level acceptance test (probably outside the agent itself).


const (
eventLogName = "AmazonECSAgent"
eventLogID = 999
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would help to list the eventid in the Readme or in a comment so that it is easy to search. Even though windows event log id's are specific to the application and not always well documented, administrators like to be able to find the id quickly in a search.

@samuelkarp
Copy link
Contributor Author

@cabbruzzese Not sure why I can't seem to reply inline to your comment about the event log ID. Is it important to include it in the README even though it's a constant right now?

@cabbruzzese
Copy link
Contributor

Was just looking for the event id to have better visibility on search engines for future debugging. Not a blocker.

Copy link
Contributor

@cabbruzzese cabbruzzese left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

>> .\hostsetup.ps1
>> }
PS C:\> # Install the agent service
PS C:\> New-Service -Name "AmazonECS" `
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it confusing the call the ECS Agent service "AmazonECS"? This is the agent. Would it be better to call it "AmazonECSAgent"?

Sorry if this topic came up before, feel free to ignore & point me to any existing discussion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I named it "AmazonECS" so that we'd have the flexibility to swap out a bit in terms of implementation without requiring customers to change the name that they're using or making the name confusing. Specifically, this enables us to potentially transition to running the agent inside a container and using a service on the host to manage the container (similar to what we do with ecs-init on Linux). If we called it "AmazonECSAgent", we could still do that but it could create confusion because the service is separate from the agent.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AmazonECSService?

@samuelkarp
Copy link
Contributor Author

Service starts correctly on boot when service is set to automatic startup or delayed start

I had success when setting the service Startup Type to "Automatic (Delayed Start)". Setting it to "Automatic" caused it to try to start before networking was available and then fail because it couldn't retrieve the instance identity document. For now, I would say that we want to recommend "Automatic (Delayed Start)" and as a follow up later we should implement retries for getting the instance identity document.

Copy link
Contributor

@petderek petderek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Had some questions but overall it looks good

// start starts the Agent execution
start() int
// setTerminationHandler sets the termination handler
setTerminationHandler(sighandlers.TerminationHandler)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

neat!

h.ecsAgent.start() // should block forever, unless there is an error
// TODO: distinguish between recoverable and unrecoverable errors
indicator.agentStopped()
cancel()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be a defered cancel?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly. I will be changing some of this in the next PR.

<-derivedCtx.Done()
}

type termHandlerIndicator struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the way this indicator struct is being used. Out of curiosity -- is this implementing anything?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not that I'm aware of, but I didn't do a very deep search yet.


func (t *termHandlerIndicator) wait() {
t.mu.Lock()
invoked := t.handlerInvoked
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't completely understand how the 'invoked' state is consumed. Is this the only place 'invoked' is used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep. This is just to determine whether we need to wait on the channel or not.

>> .\hostsetup.ps1
>> }
PS C:\> # Install the agent service
PS C:\> New-Service -Name "AmazonECS" `
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AmazonECSService?


// ReceiveMessage receives a log line from seelog and emits it to the Windows event log
func (r *eventLogReceiver) ReceiveMessage(message string, level seelog.LogLevel, context seelog.LogContextInterface) error {
switch level {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How tightly is this coupled to seelog? Are we adding significant work for if we want to swap out logging implementations?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is Seelog's interface, so it's somewhat tightly coupled. However, the Windows event log interface is pretty simple so it should be easy to change this if we switch to a different logger as long as that logger has a pluggable backend.

t.agentRunning = false
}

func (t *termHandlerIndicator) done() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would recommend this be called 'finish' or 'end'. This isn’t a strong preference, just that 'done()' is used by the context class for listening instead of terminating.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, will change in the next PR.

@samuelkarp samuelkarp changed the base branch from dev to windows November 15, 2017 19:17
@samuelkarp
Copy link
Contributor Author

Merging this into the windows branch for now. I will have a follow-up PR to adjust what we're doing with the failure/recovery policy and address any additional comments.

@samuelkarp samuelkarp merged commit d83b49a into aws:windows Nov 15, 2017
@samuelkarp samuelkarp added this to the 1.16.0 milestone Nov 22, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants