
Resilient watches #67

Closed
wants to merge 5 commits into from

Conversation

schmichael (Contributor)

Got some help in #coreos figuring out what to do about 401 errors. It turns out the answer is to restart the watch with a fresh Get. That ensures metafora will never miss events.

The diff looks like an absolute mess. Basically all I did was wrap another for loop around the initial Get when we start watching, and have 401s continue that outer loop.

bleh


func (e *FatalError) Error() string { return e.Err.Error() }
func (*FatalError) Fatal() bool { return true }

Contributor

I think this whole error type would look cleaner with embedding:

package main

import (
	"errors"
	"fmt"
)

type FatalError struct {
	error
}

func (*FatalError) Fatal() bool { return true }

func main() {
	e := &FatalError{errors.New("some error")}
	fmt.Printf("error: %v\n", e)
	fmt.Printf("error: is fatal: %v\n", e.Fatal())
}

Try: http://play.golang.org/p/UHCiHEnDcJ. It would eliminate the needless pass-through and a few lines of code.

Contributor Author

Good call! Fixed.

@mdmarek (Contributor) commented Nov 12, 2014

👍

}
}
w.cordCtx.Log(metafora.LogLevelError, "%s Unexpected error unmarshalling etcd response: %+v", w.path, err)
return
Contributor

Would it be safer to just do a continue startWatch here too, so the node doesn't go down? Maybe throw in a sleep and a retry count as well, to prevent it from thrashing?

Contributor Author

Metafora itself does that here: https://github.com/lytics/metafora/blob/master/metafora.go#L202-L212

My current strategy has been to only handle specific errors that I know how to handle here (timeouts and 401/EcodeExpiredIndex), and leave catchall error handling up to the consumer which can have configurable error handling logic added.

That being said, maybe it's currently backwards: maybe coordinator implementations should handle all non-fatal errors internally, and metafora's core consumer should just shut down when coordinators actually return errors?

Feel free to open an issue if you have an opinion. I can't think of a clear winning strategy at the moment.

@epsniff (Contributor) commented Nov 12, 2014

👍

@schmichael (Contributor, Author)

Merged a rebased version of this branch/PR.
