
delete series api and purger to purge data requested for deletion #2103

Merged

Conversation

Contributor

@sandeepsukhani sandeepsukhani commented Feb 10, 2020

What this PR does:
Adds support for deleting series. This PR includes only the following pieces from PR #1906:

  1. Delete Requests Handler: Accepts requests for creating new delete requests and lists all of a user's delete requests.
  2. Data Purger: Does the actual deletion of data once delete requests become more than 24 hours old.

Implementation details

  • New entities

    • DeleteStore - Handles storing, updating and fetching delete requests.
    • DataPurger - Builds delete plans for requests older than 24h and executes them using a configurable worker pool.
    • StorageClient - Used by DataPurger for storing delete plans in protobuf format.
  • Delete Request Lifecycle
    A delete request can be in one of the following states at a time:

  1. Received - No action has been taken on the request yet.
  2. BuildingPlan - The request has been picked up for processing and plans are being built for it.
  3. Deleting - Plans have been built and delete operations are running.
  4. Processed - All the requested data has been deleted.

Delete requests keep moving forward from one state to another. A request is moved from Received to BuildingPlan only once it is 24+ hours old; after that it keeps moving forward as it gets processed (the sketch after this list illustrates the states and plans). We do not want to make this period configurable, since users might set it extremely low, and we do not want to go down the rabbit hole of deleting live data from ingesters.

  • Delete Plans
    Workers performing delete operations could die partway through, which could break queries if chunks are deleted before the index, or leave stale chunks behind if the index is deleted first. To avoid such issues, a delete plan is built that includes the labels and chunks that are supposed to be deleted. To allow deletion to run in parallel, delete plans are sharded by day, i.e. each plan includes all the labels and chunks to be deleted for a single day.
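
To make the lifecycle and the per-day sharding concrete, here is a minimal Go sketch. It is illustrative only and not the actual types in this PR (the real plan is stored as protobuf, and all field names here are assumptions):

```go
package purgersketch

import "time"

// DeleteRequestStatus mirrors the four states described above.
type DeleteRequestStatus string

const (
	StatusReceived     DeleteRequestStatus = "received"
	StatusBuildingPlan DeleteRequestStatus = "buildingPlan"
	StatusDeleting     DeleteRequestStatus = "deleting"
	StatusProcessed    DeleteRequestStatus = "processed"
)

// DeleteRequest is roughly what the DeleteStore persists for each delete call.
type DeleteRequest struct {
	RequestID string
	UserID    string
	CreatedAt time.Time // requests are only planned once CreatedAt is 24h+ old
	StartTime time.Time
	EndTime   time.Time
	Selectors []string // Prometheus-style series selectors to delete
	Status    DeleteRequestStatus
}

// DeletePlan is one shard of a delete request covering a single day, so that
// independent workers can execute shards in parallel.
type DeletePlan struct {
	PlanDay time.Time // start of the day this shard covers
	Series  []ChunksToDelete
}

// ChunksToDelete lists the chunks of one series that fall on the plan's day.
type ChunksToDelete struct {
	Labels   map[string]string
	ChunkIDs []string
}
```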

New APIs

  • /delete_series - Accepts a Prometheus-style delete request for deleting series
  • /get_all_delete_requests - Lists all delete requests for the user
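
A hedged usage sketch for the new endpoints (not code from this PR): it assumes the handler takes the Prometheus-style match[], start and end parameters and that the endpoints are mounted under the /api/v1/admin/tsdb prefix discussed in the review below; the base URL, HTTP method, tenant header and time format are illustrative assumptions.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

func main() {
	// Prometheus-style delete parameters (assumed; adjust to the actual handler).
	params := url.Values{}
	params.Add("match[]", `{job="demo", instance=~"host.*"}`)
	params.Add("start", "2020-01-01T00:00:00Z")
	params.Add("end", "2020-01-02T00:00:00Z")

	// Hypothetical Cortex address and API prefix.
	u := "http://localhost:9009/api/v1/admin/tsdb/delete_series?" + params.Encode()

	req, err := http.NewRequest(http.MethodPost, u, nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("X-Scope-OrgID", "tenant-1") // tenant whose series are deleted

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(body))

	// Listing pending/processed delete requests would use the same prefix:
	//   GET /api/v1/admin/tsdb/get_all_delete_requests
}
```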

Tradeoffs

  • Parallel execution of delete requests from the same user is not supported - parallel execution could invalidate another delete request's delete plan if both touch the same chunks.

Notes

  1. This feature is kept behind a feature flag until we stabilize it.
  2. For now, this PR implements the new methods/clients only for boltdb and filesystem to avoid making the PR even bigger. Support for other stores will be added in separate PRs.

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@sandeepsukhani
Contributor Author

Forgot to make it a draft PR while I work on tests 🤦‍♂️

@sandeepsukhani sandeepsukhani force-pushed the delete-series-store-purger branch 4 times, most recently from b736036 to 1d23841 on February 16, 2020 17:12
Contributor

@jtlisi jtlisi left a comment

Overall this looks good. I have a few comments.

@sandeepsukhani sandeepsukhani force-pushed the delete-series-store-purger branch 2 times, most recently from d3446bc to be92f93 on February 18, 2020 08:28
Contributor

@jtlisi jtlisi left a comment

A few more minor nits.

@sandeepsukhani sandeepsukhani mentioned this pull request Feb 19, 2020
@sandeepsukhani sandeepsukhani force-pushed the delete-series-store-purger branch 2 times, most recently from e4fab35 to 5511880 on February 20, 2020 10:31
@sandeepsukhani sandeepsukhani mentioned this pull request Feb 20, 2020
@sandeepsukhani sandeepsukhani force-pushed the delete-series-store-purger branch 3 times, most recently from 0428f10 to 022c0e7 on February 23, 2020 06:10
Contributor

@jtlisi jtlisi left a comment

I have some minor nits but overall this LGTM

@@ -104,6 +104,20 @@ Where default_value is the value to use if the environment variable is undefined
# The table_manager_config configures the Cortex table-manager.
[table_manager: <table_manager_config>]

data_purger:
Contributor

Do you mind adding a new root block to the doc generator, so that the data_purger_config gets its own dedicated section in this reference file? It's a 3-line config entry at the top of tools/doc-generator/main.go (look at the other root blocks as a reference).

Contributor

We try to keep consistency between the YAML config and the CLI flags. The YAML config block is called data_purger, while the CLI flags have the prefix -purger. Have you considered renaming the YAML config block from data_purger to purger?

Contributor Author

Right, I missed doing that somehow. Pushed the changes. Thanks!

Contributor

@pstibrany pstibrany left a comment

Good job Sandeep! I've left some questions and comments you may want to look at, some more important than others.

option (gogoproto.marshaler_all) = true;
option (gogoproto.unmarshaler_all) = true;

message Plan {
Contributor

The messages in this proto file are not immediately clear to me, and I would appreciate some comments to understand how they are supposed to be used and how they work.

Contributor

(Update: it made a bit more sense after reading purger.go, but still – some comments would be nice.)

Comment on lines 389 to 394
if err != nil {
return err
}

Contributor

This is not needed, because the named return variable err is used here. It's a pattern we use in many places in this file.
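
For readers less familiar with the pattern being referenced, a small generic illustration (the function name is hypothetical, not the actual modules.go code): with a named error return, a bare return already propagates whatever was assigned to err, so an immediately preceding if err != nil { return err } adds nothing.

```go
package main

import "errors"

// initDataPurger is a hypothetical stand-in for a module init function that
// uses a named return value, as modules.go does.
func initDataPurger() (err error) {
	err = errors.New("example failure")
	// Redundant here, because the bare return below already returns err:
	// if err != nil {
	//     return err
	// }
	return // returns the named err, whether nil or not
}

func main() {
	if err := initDataPurger(); err != nil {
		println("init failed:", err.Error())
	}
}
```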

return
}

adminRouter := t.server.HTTP.PathPrefix(cfg.HTTPPrefix + "/api/v1/admin/tsdb").Subrouter()
Contributor

Is this some well-known API endpoint? Isn't "tsdb" going to be confusing as this is only used by chunks storage?

Contributor Author

While we want to keep the APIs the same as what Prometheus provides, I agree with your point. I also think we will eventually have to support the delete API for blocks storage as well, so for consistency with the Prometheus APIs I think we can keep it like this. Thoughts?

Contributor

Sounds good.

All: {
deps: []moduleName{Querier, Ingester, Distributor, TableManager},
deps: []moduleName{Querier, Ingester, Distributor, TableManager, DataPurger},
Contributor

Can multiple data-purgers run at the same time? (If someone runs multiple single-binary instances of Cortex)

Contributor Author

Yes, but I think the chances of that are low. We can address that issue in future PRs if needed. For now we can make it clear in the docs that only one Purger should be running at a time.

store: boltdb

purger:
object_store_type: filesystem
Contributor

Isn't this redundant? Can't we use the object_store from the schema, or does it make sense to use a different value here?

Contributor Author

@sandeepsukhani sandeepsukhani Feb 27, 2020

Since we can have different schemas for different time ranges, we would have to infer which schema the user wants to use. If we go with the active/latest one, then moving to a different store by adding a new schema would abandon the old schemas, which is why I wanted it to be explicit. Does that make sense?

Contributor

I see, didn't realize that. Thanks!

Comment on lines 345 to 347
if deleteRequest.CreatedAt.Add(24 * time.Hour).After(model.Now()) {
continue
}
Contributor

Should this check be first, to avoid debug-logging in the previous check? [nit]

@@ -0,0 +1,32 @@
syntax = "proto3";

package purgeplan;
Contributor

Why an extra package just for the protobuf file?

Contributor Author

I wanted to keep it isolated, but this won't be needed outside of the purger package, so I can move it there.


executePlansChan chan deleteRequestWithLogger
workerJobChan chan workerJob
workerJobExecutionStatusChan chan workerJobExecutionStatus
Contributor

I don't see why workerJobExecutionStatusChan is needed at all. The same cleanup can be done by the worker itself, after it is finished with the plan. Why does it need to be done on a separate goroutine?

Furthermore, as it is now, when Stop is called, the goroutine reading from workerJobExecutionStatusChan will exit quickly (because it is not doing anything that takes a long time) while workers are still executing plans, but when they finish there is no channel reader anymore, so the result of plan execution will not be logged or handled in any way.

Contributor Author

While there can be multiple plans, a worker takes care of just one plan, so there has to be another entity that does the cleanup after all the plans are executed successfully. I will take care of the exit issue that you pointed out.

Contributor

While there can be multiple plans, a worker takes care of just one plan, so there has to be another entity that does the cleanup after all the plans are executed successfully.

I understand this, but I think the worker can simply check whether it did the last piece on its own, without going through another goroutine.

Contributor Author

Since we are building per-day plans, the last one could be the smallest and could finish quickly, which means we might mark a delete request as completed before all of its plans have executed successfully.
We also do not want to mark a delete request as completed when any of its plans other than the last one fails to execute.

Contributor

That's fine... I think your checks are fine, I'm just objecting to using another goroutine and an extra channel. The worker can do the same checks and remove the entry from the map only if there is no other plan running. If you extract that code into a new method as suggested elsewhere, I would just let the worker call the method instead of spawning a new goroutine.
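
A compact sketch of the worker-side cleanup being suggested here, under stated assumptions: pendingPlansCount is borrowed from the field name mentioned later in the review, while the mutex, the request-ID key and the markRequestProcessed helper are hypothetical stand-ins rather than the actual PR code.

```go
package purgersketch

import "sync"

type purger struct {
	mtx               sync.Mutex
	pendingPlansCount map[string]int // delete request ID -> plans still running
}

// planFinished is called by each worker right after it executes one per-day
// plan. Only the worker that finishes the last outstanding plan marks the
// request as processed, so no extra goroutine or status channel is needed.
func (p *purger) planFinished(requestID string) {
	p.mtx.Lock()
	p.pendingPlansCount[requestID]--
	last := p.pendingPlansCount[requestID] == 0
	if last {
		delete(p.pendingPlansCount, requestID)
	}
	p.mtx.Unlock()

	if last {
		p.markRequestProcessed(requestID)
	}
}

// markRequestProcessed is a placeholder for updating the delete request's
// state to Processed in the delete store.
func (p *purger) markRequestProcessed(requestID string) {}
```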

Contributor Author

Sorry, I got your point. I have done the changes and pushed them. Thanks!

@sandeepsukhani
Contributor Author

@pstibrany Thank you so much for the review!
I have done the requested changes.

Contributor

@pstibrany pstibrany left a comment

The design of the DataPurger feels very Java-like to me – with a static number of workers and goroutines, and communication through shared maps like inProcessRequestIDs and pendingPlansCount.

What about changing it so that the purger selects the next job to work on, then spawns NumWorkers goroutines to handle all of its plans, waits until they are finished, and then updates the request in the database with the result? It seems to me that the entire handling of a single request could be much simpler that way, instead of tracking the number of in-flight plans. What do you think?

Update: Notes from Slack communication:

  • By static I meant that they are created once and then run forever, regardless of whether they have work to do or not.
  • An alternative would be to have a method that handles the entire request. This method would spawn NumWorkers goroutines to handle the plans for the request, and once all plans are finished, the goroutines would die and the method would finish the request in the database.
  • That way there is no need for a shared map with the number of in-flight plans, and perhaps also no need for a shared map for users (see the sketch after this comment).

In any case, thanks for addressing my feedback!
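
A minimal, self-contained sketch of the alternative design described above (spawn goroutines per request, wait for them, then finish the request), with all type and function names as hypothetical stand-ins for the real Cortex code:

```go
package main

import (
	"fmt"
	"sync"
)

// Minimal stand-ins for the real types; only the control flow matters here.
type deletePlan struct{ day string }
type deleteRequest struct{ id string }

// executePlan is a placeholder for the real per-day delete execution.
func executePlan(req deleteRequest, plan deletePlan) error {
	fmt.Printf("request %s: deleting data for %s\n", req.id, plan.day)
	return nil
}

// handleDeleteRequest handles one delete request end to end: it spawns
// numWorkers goroutines just for this request, feeds them the per-day plans,
// waits until all of them are done, and only then reports success so the
// caller can mark the request as Processed in the delete store.
func handleDeleteRequest(req deleteRequest, plans []deletePlan, numWorkers int) error {
	planChan := make(chan deletePlan)
	errChan := make(chan error, len(plans))

	var wg sync.WaitGroup
	for i := 0; i < numWorkers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for plan := range planChan {
				errChan <- executePlan(req, plan)
			}
		}()
	}

	for _, plan := range plans {
		planChan <- plan
	}
	close(planChan)

	wg.Wait() // workers exit once the plan channel is drained
	close(errChan)

	for err := range errChan {
		if err != nil {
			return err // leave the request unprocessed so it can be retried
		}
	}
	return nil // every per-day plan succeeded
}

func main() {
	plans := []deletePlan{{day: "2020-02-10"}, {day: "2020-02-11"}}
	if err := handleDeleteRequest(deleteRequest{id: "req-1"}, plans, 2); err != nil {
		fmt.Println("delete request failed:", err)
	}
}
```

With this shape, the WaitGroup scopes everything to a single request, so the shared in-flight-plan counters are no longer needed for tracking completion.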

logger log.Logger
}

type workerJobExecutionStatus struct {
Contributor

I think we can get rid of this struct now.

Contributor

Can this be resolved @sandeepsukhani?

Contributor

@pstibrany pstibrany left a comment

Approving as the current design will work as well. According to Sandeep, more design changes are coming in subsequent PRs.

Contributor

@gouthamve gouthamve left a comment

Looks good Sandeep, good work. I mostly have a bunch of nits, after which I'll give the LGTM.

delete_store manages delete requests and purge plan records in stores
purger builds delete plans (delete requests sharded by day) and executes them in parallel
only one request per user is in the execution phase at a time
delete requests get picked up for deletion after they are more than a day old

Signed-off-by: Sandeep Sukhani <sandeep.d.sukhani@gmail.com>
…e only component that needs it

Signed-off-by: Sandeep Sukhani <sandeep.d.sukhani@gmail.com>
@gouthamve gouthamve merged commit 1a9d546 into cortexproject:master Mar 3, 2020