Skip to content
This repository has been archived by the owner on Nov 8, 2022. It is now read-only.

Snapd does not appropriately handle a plugin hanging on {collect, process, publish} #1142

Closed
IRCody opened this issue Aug 15, 2016 · 3 comments

Comments

@IRCody
Copy link
Contributor

IRCody commented Aug 15, 2016

If a plugin hangs on any step in the workflow, snapd does not appear to be stopping the plugin/failing the task appropriately. To test you can add this line to the CollectMetrics() function inside the mock2 collector:

time.Sleep(time.Hour)

Then load the mock plugins (from snap/build/bin run: ./snapd -t 0 -l 1 -a ../plugin) and load the example/tasks/mock-file.json task file. The task will hang after 1 "hit". Snapd does not appear to be failing the task correctly.

My expectation would be that a plugin hanging would result in a failed run after some specified amount of time (I think you could base it off the interval for a reasonable automatic guess).

@geauxvirtual
Copy link
Contributor

This is the intention of the deadline setting for a task, so that a task (or jobs within the task) can be killed or not run if the task run time exceeds the configured deadline. Testing with sleep.Time(5 * time.Second) will show that the collect job will take 5s to return, and the subsequent process and publish jobs will not run because the job time exceeded the default configured deadline of 5s.

The interval in a task is just how frequent a user wants the task to run, but this does not equate to how quickly a plugin or plugins could collect, process, or publish the metrics listed in the task manifest.

@IRCody
Copy link
Contributor Author

IRCody commented Aug 18, 2016

This is the intention of the deadline setting for a task, so that a task (or jobs within the task) can be killed or not run if the task run time exceeds the configured deadline. Testing with sleep.Time(5 * time.Second) will show that the collect job will take 5s to return, and the subsequent process and publish jobs will not run because the job time exceeded the default configured deadline of 5s.

I hadn't dug into it enough to try that case. The behavior that prompted this was a plugin that hangs indefinitely was "freezing" the workflow in that 1.) the task was never being failed, 2.) No other runs of this task were happening. The end result was task list showing some number of hits, 0 misses, 0 failures.

I'm not sure what the value is of the deadline if it works how your saying. My expectation was that a deadline was meant to stop runaway workflows, but if snap explicitly waits for it to return before enforcing the deadline then any runaway step in the workflow pauses that task indefinitely.

My expectation would be that deadline would be enforced aggressively, as in this is the absolute maximum time a step in the workflow has to execute, and if it doesn't finish in that time we stop it and mark it as a failure.

The interval in a task is just how frequent a user wants the task to run, but this does not equate to how quickly a plugin or plugins could collect, process, or publish the metrics listed in the task manifest.

I was just suggesting if you had to pick a number to timeout a step in the workflow, some multiple of the interval seems like a good starting place. I also wasn't aware of some of the details you added above re: deadline.

@geauxvirtual
Copy link
Contributor

I agree, deadline should be honored throughout the workflow and not just during the transitions during the workflow from collector to processor or publisher and processor to publisher.

IRCody added a commit to IRCody/snap that referenced this issue Sep 8, 2016
Adds handling of the timeout to the native go client which enforces that
each request for {collect,process,publish} finishes in this amount of
time.

Ups the default client timeout to 10 seconds.
IRCody added a commit to IRCody/snap that referenced this issue Sep 9, 2016
Adds handling of the timeout to the native go client which enforces that each
request for {collect, process, publish} finishes in this amount of time.

Ups the default client timeout to 10 seconds.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

3 participants