Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix TSDB creation conflicting with transfer #1818

Merged
merged 10 commits into from
Dec 11, 2019

Conversation

pracucci
Copy link
Contributor

@pracucci pracucci commented Nov 13, 2019

What this PR does:
In this PR:

  • Do not allow to query PENDING ingesters (consider them unhealthy)
  • Do not create TSDB if not exist when querying an ingester
  • Do not allow to create TSDB if the ingester is not in ACTIVE state

Suggest to enable "Hide whitespace changes" when reviewing for an easier diff.

Why?
I found an issue leading to data loss and/or TSDB wal corruption, which should be fixed by this PR. What happens:

  • Right now, a TSDB is created also when querying (not need, fixed)
  • The querier consider healthy an ingester in the PENDING state, so a query request can hit an ingester even if the ACTIVE state has not been reached yet (not desired, fixed)

If a query hit a new ingester in the PENDING state before the transfer starts (it switches to JOINING then), the query creates a TSDB opening the wal segment 00000000. The subsequent transfer will overwrite the wal segments, while the 00000000 file reference is still kept by the opened TSDB.

Moreover, if it get transferred a wal where a checkpoint already occurred, the 00000000 segment is not part of the transfer but the segment 0 is created by the opened TSDB which leads to an inconsistent segments numbering on disk (ie. 0, checkpoint.6, 7, 8, 9) which cause TSDB to fail a subsequent re-opening.

Which issue(s) this PR fixes:
No reported issue.

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

pkg/ingester/transfer.go Show resolved Hide resolved
pkg/ingester/transfer.go Show resolved Hide resolved
Copy link
Contributor

@codesome codesome left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't have a chance to look at the tests, but other changes LGTM

// a transfer in occurs.
if !force {
ingesterState := i.lifecycler.GetState()
if ingesterState != ring.ACTIVE {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: you can combine these 2, also combine it with the above if !force check to be single if statement.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, done.

@@ -184,7 +184,7 @@ func (d *Desc) TokensFor(id string) (tokens, other []uint32) {
func (i *IngesterDesc) IsHealthy(op Operation, heartbeatTimeout time.Duration) bool {
if op == Write && i.State != ACTIVE {
return false
} else if op == Read && i.State == JOINING {
} else if op == Read && i.State != ACTIVE && i.State != LEAVING {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC pending ingesters shouldn't have any tokens; how did you trigger this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should occur in the paths where we use forAllIngesters(). I think one path is:

  1. Querier receive a LabelNames request
  2. Querier forwards the request to the distributor
  3. The distributor runs client.LabelNames() in distributor.forAllIngesters()
  4. distributor.forAllIngesters() calls ring.GetAll()
  5. ring.GetAll() returns all ingesters which satisfy the IsHealthy()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bboreham What's your take on this? If this change is a problem, I could rollback it in order to move one with this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for enumerating these.

pkg/ingester/ingester_v2.go Show resolved Hide resolved
Comment on lines 120 to 121
testData := testData

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not required?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, removed.

@pracucci pracucci force-pushed the fix-tsdb-transfer branch 2 times, most recently from 748dcbb to 7a81a93 Compare November 22, 2019 12:43
@owen-d
Copy link
Contributor

owen-d commented Nov 29, 2019

Nice work! I appreciate the refactoring on the read path (getTSDB instead of getOrCreateTSDB).

@codesome
Copy link
Contributor

codesome commented Dec 4, 2019

You got some merge conflicts now

@pracucci
Copy link
Contributor Author

pracucci commented Dec 4, 2019

@gouthamve I've fixed the conflicts after rebasing. May you help me moving forward with this PR, please?

@codesome
Copy link
Contributor

codesome commented Dec 6, 2019

Active development going on here :) you got more conflicts

Signed-off-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Marco Pracucci <marco@pracucci.com>
…n JOINING state

Signed-off-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Marco Pracucci <marco@pracucci.com>
…ll require a dedicated discussion

Signed-off-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Marco Pracucci <marco@pracucci.com>
Copy link
Contributor

@jtlisi jtlisi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jtlisi jtlisi merged commit 26ee2e2 into cortexproject:master Dec 11, 2019
@pracucci pracucci deleted the fix-tsdb-transfer branch December 12, 2019 08:39
@pracucci pracucci mentioned this pull request Jan 22, 2020
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants