m_etcd: Make ClaimTTL and Lost interval documented and easily configurable #139

Open
schmichael opened this issue Aug 7, 2015 · 0 comments

Description

m_etcd.ClaimTTL is critical to ensuring a task is executed exactly once in a Metafora cluster.

Currently the claim TTL is 120s and the claim is refreshed every 90s. If a refresh fails and the task is lost, the handler has at most 30s (the TTL minus the refresh interval) to exit before the exactly-once guarantee no longer holds and the task becomes eligible for simultaneous execution elsewhere in the cluster.
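The arithmetic behind the 30s figure can be sketched as below. This is only an illustration: `GraceWindow`, `CurrentTTL`, and `CurrentRefresh` are hypothetical names, not part of Metafora, and the sketch assumes the handler learns of the failure at the moment the refresh is attempted.

```go
package main

import (
	"fmt"
	"time"
)

// Values quoted in this issue. The constant names are hypothetical.
const (
	CurrentTTL     = 120 * time.Second
	CurrentRefresh = 90 * time.Second
)

// GraceWindow returns the worst-case time a handler has to exit after a
// single failed refresh, before the claim's TTL runs out.
// Hypothetical helper for illustration only.
func GraceWindow(ttl, refresh time.Duration) (time.Duration, error) {
	if refresh <= 0 || refresh >= ttl {
		return 0, fmt.Errorf("refresh interval %v must be positive and shorter than TTL %v", refresh, ttl)
	}
	return ttl - refresh, nil
}

func main() {
	grace, err := GraceWindow(CurrentTTL, CurrentRefresh)
	if err != nil {
		panic(err)
	}
	fmt.Println(grace) // prints "30s"
}
```

Any delay in noticing the failed refresh shortens this window further, which is one reason the acceptable values differ between handlers.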

Claim TTL and refresh interval should be configurable because:

  • They are critical to Metafora's correct operation.
  • Acceptable values vary by handler and task.

Solution: Configurable TTL, documented refresh calculation

  • Make TTL configurable on m_etcd.EtcdCoordinator instead of via a global.
  • Document the refresh calculation in taskmgr.go and/or make it configurable.
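One possible shape for per-coordinator configuration is sketched below. Everything here is an assumption for discussion: the `Config` type, its fields, and `Normalize` are hypothetical and do not reflect the existing `m_etcd.EtcdCoordinator` API. The 3/4 ratio is chosen only because it reproduces the current 120s/90s pair; the actual calculation in taskmgr.go may differ.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// DefaultClaimTTL mirrors the 120s value quoted in this issue.
const DefaultClaimTTL = 120 * time.Second

// Config is a hypothetical per-coordinator configuration that would
// replace the package-level global.
type Config struct {
	ClaimTTL        time.Duration // zero means "use DefaultClaimTTL"
	RefreshInterval time.Duration // zero means "derive from ClaimTTL"
}

// Normalize fills in defaults and rejects combinations that would leave
// no time between a failed refresh and claim expiry.
func (c Config) Normalize() (Config, error) {
	if c.ClaimTTL == 0 {
		c.ClaimTTL = DefaultClaimTTL
	}
	if c.ClaimTTL < 0 {
		return c, errors.New("claim TTL must be positive")
	}
	if c.RefreshInterval == 0 {
		// Assumption: refresh at 3/4 of the TTL (120s -> 90s).
		c.RefreshInterval = c.ClaimTTL * 3 / 4
	}
	if c.RefreshInterval < 0 || c.RefreshInterval >= c.ClaimTTL {
		return c, errors.New("refresh interval must be positive and shorter than claim TTL")
	}
	return c, nil
}

func main() {
	cfg, err := Config{}.Normalize()
	if err != nil {
		panic(err)
	}
	fmt.Println(cfg.ClaimTTL, cfg.RefreshInterval)
}
```

Validating at construction time keeps a misconfigured coordinator from starting with a refresh interval that can never beat the TTL.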

Future Improvements

  • The coordinator could inform the task handler when the claim will expire via Stop() or metadata on a statemachine.Message. This would allow a handler to detect that simultaneous execution may have occurred and choose to roll back a transaction, not flush data, avoid checkpointing, etc., if possible.
  • Claim TTL / Refresh Interval may be more appropriate to define per task, since a safe interval depends on how long a task handler takes to exit. Ideally a handler would checkpoint at intervals shorter than the Claim TTL, so if the claim expires it can simply skip its final checkpoint, since another node may be executing that task.
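
If the coordinator exposed the claim's expiry to the handler, the "skip the final checkpoint" decision might look roughly like the sketch below. The `Claim` type, `SafeToCheckpoint`, and the example constants are hypothetical; nothing like them exists in Metafora today, and the safety margin is an assumed allowance for clock skew and write latency.

```go
package main

import (
	"fmt"
	"time"
)

// Example values. The names are hypothetical.
const (
	ExampleTTL    = 120 * time.Second
	ExampleMargin = 5 * time.Second
)

// Claim is a hypothetical view of a task claim as a handler might see it.
type Claim struct {
	LastRefresh time.Time     // time of the last successful refresh
	TTL         time.Duration // claim TTL in effect at that refresh
}

// Expiry returns the instant after which another node may claim the task.
func (c Claim) Expiry() time.Time {
	return c.LastRefresh.Add(c.TTL)
}

// SafeToCheckpoint reports whether a checkpoint started at now, allowing
// for margin, would complete before the claim can expire.
func (c Claim) SafeToCheckpoint(now time.Time, margin time.Duration) bool {
	return now.Add(margin).Before(c.Expiry())
}

func main() {
	start := time.Date(2015, 8, 7, 0, 0, 0, 0, time.UTC)
	claim := Claim{LastRefresh: start, TTL: ExampleTTL}

	// 100s after the last refresh: still well inside the TTL.
	fmt.Println(claim.SafeToCheckpoint(start.Add(100*time.Second), ExampleMargin)) // true
	// 118s after the last refresh: too close to expiry, skip the checkpoint.
	fmt.Println(claim.SafeToCheckpoint(start.Add(118*time.Second), ExampleMargin)) // false
}
```

A handler that is told to stop could consult such a check before its final checkpoint and skip it when the result is false.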
@schmichael schmichael added the etcd label Aug 7, 2015
schmichael added a commit that referenced this issue Aug 7, 2015
I also made #139 because we should really offer users clearer
configuration options when it comes to maintaining exactly once
semantics.