Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cleanup Meta Ticket #84

Closed
11 of 14 tasks
armon opened this issue Feb 9, 2016 · 23 comments
Closed
11 of 14 tasks

Cleanup Meta Ticket #84

armon opened this issue Feb 9, 2016 · 23 comments
Milestone

Comments

@armon
Copy link
Member

armon commented Feb 9, 2016

Here are the list of issues, grouped together if they might make sense as a single PR.

State Races

  • Raft state should use locks for any fields that are accessed together (e.g. index and term)

Multi-Row Fetching

  • Replace single row lookups with multi row lookups (LogStore / LogCache) (look at cases around log truncation)
  • Verify the current term has not changed when preparing/processing the AppendEntries message Rework goroutines and synchronization #136

Follower Replication:

  • replicateTo should verify leadership is current during looping
  • Check for any hot loops that do not break on stopCh

Change Inflight Tracking

  • Remove majorityQuorum
  • Inflight tracker should map Node -> Last Commit Index (match index)
  • Votes should be ignored from peers that are not part of peer set
  • precommit may not be necessary with new inflight (likely will be cleaned up via Discussion: reworking membership changes #117)

Improve Membership Tracking

Crashes / Restart Issues

New Tests

/cc: @superfell @ongardie @sean- @ryanuber @slackpad

@ongardie
Copy link
Contributor

Thanks for writing that up, @armon. I'll take a shot at "Change inflight tracking".

@ongardie
Copy link
Contributor

ongardie commented Apr 1, 2016

I'll start on 'precommit may not be necessary with new inflight' (though I only have about 20min right now, ugh).

@slackpad
Copy link
Contributor

slackpad commented Jun 28, 2016

Ok I'm going to go ahead and merge #117. Here's the list of remaining tasks I identified from that. Most of these are already marked inline with TODO comments:

@slackpad
Copy link
Contributor

@ongardie got one trivial PR up #127 and will be sweeping through the unit tests tomorrow fixing the integration tests as well as the commented out ones.

@ongardie
Copy link
Contributor

Cool, thanks @slackpad. I've got a busy day tomorrow but can help out later in the week.

@slackpad
Copy link
Contributor

Cool I will keep adding the unit test work to the same PR in that case.

On Jun 27, 2016, at 11:50 PM, Diego Ongaro notifications@github.com wrote:

Cool, thanks @slackpad. I've got a busy day tomorrow but can help out later in the week.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.

@slackpad
Copy link
Contributor

Ok I think #127 is in good shape for merge. Tomorrow I'm going to spend some time on these two items:

Work through live upgrade scenario now that we have a new config format and ignore old config messages (will we need to support the legacy LogAddPeer and LogRemovePeer to make the transition work?)
We need a mechanism for manual peer updates after an outage - there's currently no equivalent of "editing peers.json" so it doesn't seem possible to recover from all outage scenarios; there's one broken test related to this as well: TestRaft_SnapshotRestore_PeerChange

Will experiment a little and write up a spec so we can figure out the right way forward with these. We will need a way for Consul to take the existing peers.json and call some new Raft APIs in order to start up the cluster as well, since the peers.json had the only state related to the peers if a snapshot had been taken. I don't think I want to reintroduce the peer store into Raft, but Consul will need to parse this file outside of Raft and call some bootstrap-y thing to get going. I think this problem is closely related to the items above.

@ongardie-sfdc
Copy link

@slackpad, small request: could you number the post-117 checklist above? It'd help me track things.

@slackpad
Copy link
Contributor

slackpad commented Jul 1, 2016

@ongardie numbered the todo list

@ongardie-sfdc
Copy link

BTW I'm trying to refactor a ton of stuff that'll help or fix the following:

  • "5 - Add staging -> voter promotion logic"
  • "16 - Resolve "what is this" in replicate()"
  • "17 - Resolve in updateLastAppended()"
  • "19 - Think through the possibility of having two replication threads going to the same peer - see these comments for details"

so save those for me for now :)

@slackpad
Copy link
Contributor

slackpad commented Jul 9, 2016

@ongardie thanks for the heads up. I'm working on 1, 2, 3 which should not conflict with your stuff too much. Should have a PR up soon so you can see what I'm thinking wrt. to these. Have a good weekend!

@ongardie-sfdc
Copy link

@slackpad, now that I know you don't have too much outstanding, can we move runFSM out to fsm.go and runSnapshots, shouldSnapshot, takeSnapshot out to snapshot.go? See where I'm going with this? Every goroutine that has its own independent state goes in a different file. Later on, they get kicked off of the Raft struct so that they're not reaching into things they shouldn't.

@slackpad
Copy link
Contributor

slackpad commented Jul 9, 2016

Yep - I was starting to think about that, too. I'll go ahead and make that change and commit it now so we have a good basis for further work.

@slackpad
Copy link
Contributor

slackpad commented Jul 9, 2016

@ongardie the split is done under #135 - went ahead and merged it.

@ongardie-sfdc
Copy link

Thanks @slackpad!

@ongardie-sfdc
Copy link

Some bookkeeping:

  • In the top checklist, "Verify the current term has not changed when preparing/processing the AppendEntries message" will be resolved by Rework goroutines and synchronization #136 (please annotate).
  • In the post-117 checklist, (9 quorumSize), (16 replicate), (17 updateLastAppended) will be resolved by Rework goroutines and synchronization #136 (please annotate).
  • New: Update or delete membership.md now that the work happened.
  • New: Add a note to type configurations that latest and committed may be the same, and often are.
  • In the post-117 checklist, expand (11 getLastLog) to panic any time the stores fail and we're not sure what ought to happen.

I think that's all I'm sitting on.

@slackpad
Copy link
Contributor

slackpad commented Aug 1, 2016

@ongardie-sfdc thanks for the update and the detailed reviews! I reconciled all your changes with the checklists (added new numbered items for the two new ones).

@slackpad
Copy link
Contributor

slackpad commented Aug 1, 2016

After merging the version and recovery stuff into issue-84-integration, I cut https://github.com/hashicorp/raft/tree/library-v2-stage-one. I'll integrate and ship the next version of Consul against this branch and keep it open for small fixes I can pick from here if needed. This'll free up the issue-84-integration branch for the goroutine changes and a possible period of instability as that settles in.

After giving the community a heads up wrt. the interface changes we can take issue-84-integration into master, and possibly make library-v2-stage-two as an interim bake and integrate target for future Consul versions.

@superfell
Copy link
Contributor

Am seeing some issues with AddPeer after last nights merge, Is there a test somewhere that adds a new peer to a cluster, I did a scan though but didn't spot one for that. @ongardie @slackpad

@slackpad
Copy link
Contributor

slackpad commented Aug 2, 2016

Hi @superfell - there's a test here that hits it - https://github.com/hashicorp/raft/blob/issue-84-integration/raft_test.go#L2217-L2258. It's likely that you are calling AddPeer() but haven't set the ProtcolVersion back via configuration to a version that supports it - if you set it to 1 things should work as before.

@slackpad
Copy link
Contributor

slackpad commented Aug 2, 2016

Also, if you don't need to support any of the old stuff, I'd keep the ProtocolVersion at 3 and swap out your AddPeer() call with a call to AddVoter().

@superfell
Copy link
Contributor

Ahh, the setting of protocolVer in DefaultConfig() was messing with me.

@slackpad
Copy link
Contributor

slackpad commented Oct 7, 2016

Everything is done or tracked elsewhere - closing!

@slackpad slackpad closed this as completed Oct 7, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants