Kill segments by versions #15994

Merged
merged 21 commits into from
Mar 13, 2024

Contributor

@abhishekrb19 abhishekrb19 commented Feb 28, 2024

Summary:

Prior to this patch, kill tasks deleted all versions of unused segments within the specified interval. With this patch, users can delete specific versions of unused data while retaining the rest by specifying an optional list of versions in the kill task payload. If the list is left unspecified, the default behavior remains unchanged, i.e., all versions of unused segments in the interval are deleted.
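
As a rough illustration of the payload described above, the sketch below builds a kill task body by hand. The field name `versions`, the datasource name `wiki`, and the helper class itself are assumptions made for illustration; the PR text only says the list is optional. The sketch does no JSON escaping, so it assumes plain alphanumeric inputs.

```java
import java.util.List;
import java.util.stream.Collectors;

// Sketch only: assembles a kill task JSON body without a JSON library.
// The "versions" field name is an assumption, not taken from the PR text.
final class KillTaskPayloadSketch {
  // `versions` may be null, meaning "delete all versions" (the unchanged default).
  static String build(String dataSource, String interval, List<String> versions) {
    StringBuilder sb = new StringBuilder();
    sb.append("{\"type\":\"kill\"");
    sb.append(",\"dataSource\":\"").append(dataSource).append('"');
    sb.append(",\"interval\":\"").append(interval).append('"');
    if (versions != null) {
      // Quote each version and join them into a JSON array.
      String joined = versions.stream()
                              .map(v -> "\"" + v + "\"")
                              .collect(Collectors.joining(","));
      sb.append(",\"versions\":[").append(joined).append(']');
    }
    return sb.append('}').toString();
  }
}
```

Calling `KillTaskPayloadSketch.build("wiki", "2019/2020", null)` omits the list entirely, which corresponds to the default behavior of deleting every unused version in the interval.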

Motivation:

  • A user may want to use this functionality for data compliance reasons: for example, an ingestion job created some bad data that needs to be deleted right away, while other versions of the data are kept around for some time.
  • Manage storage costs: Keep only the last 'n' versions of unused segments in the deep storage.

Note that adding support for an optional list of versions to the /markAsUsed and /markAsUnused segment management APIs would be a complementary addition. I didn't make those changes in this PR to keep it simple to review; I will follow up on that later.

Release note

Kill task accepts an optional list of unused segment versions to delete.


Key changed/added classes in this PR
  • KillUnusedSegmentsTask.java
  • IndexerSQLMetadataStorageCoordinator.java
  • RetrieveUnusedSegmentsAction.java
  • KillUnusedSegmentsTaskTest.java
  • IndexerSQLMetadataStorageCoordinatorTest.java
  • RetrieveSegmentsActionsTest.java

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster.

Kill tasks by default kill all versions of unused segments in the specified
interval. Users wanting to delete specific versions (for example, for data
compliance reasons) and keep the rest of the versions can specify the optional
version in the kill task payload.
@abhishekrb19 abhishekrb19 changed the title Kill task version support Kill segments by version Feb 28, 2024
@abhishekrb19 abhishekrb19 requested review from zachjsh, kfaraz and AmatyaAvadhanula and removed request for zachjsh and kfaraz February 28, 2024 18:08
@abhishekrb19 abhishekrb19 marked this pull request as draft March 8, 2024 07:29
@abhishekrb19 abhishekrb19 changed the title Kill segments by version Kill segments by versions Mar 11, 2024
@abhishekrb19 abhishekrb19 marked this pull request as ready for review March 11, 2024 05:17
Contributor

@kfaraz kfaraz left a comment

Left some comments. On the whole, I would prefer that we overload the methods where possible to add a new flavour that accepts versions. Otherwise, we have to pass a lot of nulls around, which is a little confusing.

Retain the old interface method, make it a default method, and route it to
the variant that accepts nullable versions. Update usages to call the
default method where versions don't matter.
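
The pattern described in this commit message can be sketched roughly as follows. The interface, class, and method names here are hypothetical simplifications, not the actual Druid coordinator API; segments are reduced to plain strings to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for a metadata coordinator interface.
interface UnusedSegmentRetriever {
  // New variant: `versions` may be null, meaning "do not filter by version".
  List<String> retrieveUnusedSegments(String dataSource, String interval, List<String> versions);

  // Old signature retained as a default method that routes to the new variant,
  // so callers that don't care about versions never pass null themselves.
  default List<String> retrieveUnusedSegments(String dataSource, String interval) {
    return retrieveUnusedSegments(dataSource, interval, null);
  }
}

// Toy implementation that ignores dataSource and interval; it filters on version only.
final class InMemoryRetriever implements UnusedSegmentRetriever {
  // Each entry is {segmentId, version}.
  private final List<String[]> segments = new ArrayList<>();

  void add(String segmentId, String version) {
    segments.add(new String[]{segmentId, version});
  }

  @Override
  public List<String> retrieveUnusedSegments(String dataSource, String interval, List<String> versions) {
    List<String> result = new ArrayList<>();
    for (String[] segment : segments) {
      if (versions == null || versions.contains(segment[1])) {
        result.add(segment[0]);
      }
    }
    return result;
  }
}
```

Implementations only need to supply the three-argument method, while existing two-argument call sites keep compiling unchanged.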
Contributor

@kfaraz kfaraz left a comment

Left a couple of final comments, otherwise all looks good.
Haven't gone through the tests in detail. But should we also add a test to verify that we don't kill a segment whose load spec is currently being used by a segment of some other version?

final String versionsStr = versions.stream()
                                   .map(version -> "'" + version + "'")
                                   .collect(Collectors.joining(","));
sb.append(StringUtils.format(" AND version IN (%s)", versionsStr));
Contributor

Should we enforce some limit on the number of versions here? I think in practice most users wouldn't be specifying too many versions at once.

Contributor Author

Good point. However, we don't enforce an upper bound on some related parameters like limit and batchSize, so I'm unsure if we want to do it for the size of versions. Also, the maximum size for IN would depend on the underlying metadata store itself, so my suggestion would be to roll this out without any size restriction right off the bat, and revisit this later if needed.
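
For readers weighing the trade-off in this thread, here is a minimal sketch of an alternative to concatenating quoted values: one bound placeholder per version, plus an optional cap. This is not what the PR does; the class name, the `maxVersions` parameter, and the decision to reject an empty list are all assumptions made for illustration.

```java
import java.util.Collections;
import java.util.List;

// Sketch only: builds a parameterized IN condition instead of inlining quoted values.
final class VersionInClauseSketch {
  // Returns "" for null (no version filter); otherwise " AND version IN (?,?,...)".
  // `maxVersions` is a hypothetical cap; the PR deliberately enforces none.
  static String inClause(List<String> versions, int maxVersions) {
    if (versions == null) {
      return "";
    }
    if (versions.isEmpty()) {
      // "IN ()" is rejected by most SQL dialects, so fail early instead.
      throw new IllegalArgumentException("versions must not be empty");
    }
    if (versions.size() > maxVersions) {
      throw new IllegalArgumentException("too many versions: " + versions.size());
    }
    // One "?" per version; the caller binds the actual values afterwards.
    String placeholders = String.join(",", Collections.nCopies(versions.size(), "?"));
    return " AND version IN (" + placeholders + ")";
  }
}
```

The caller would then bind each version to its placeholder through whatever database access layer is in use, which keeps the version strings out of the SQL text itself.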

final DateTime maxUsedStatusLastUpdatedTime1 = DateTimes.nowUtc();

// Delay for 1s, mark the segments as unused and then capture the last updated time cutoff again
Thread.sleep(1000);
Contributor Author

@abhishekrb19 abhishekrb19 Mar 12, 2024

I think we can actually remove the sleep here and in the other existing tests by directly marking the segments as unused using the test connector. That way, the test would have control over the last updated time and we can set it to any time we choose. I can clean up this pattern along with other miscellaneous testing stuff in a follow-up patch.
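
The idea of giving the test control over the last updated time could look roughly like the sketch below. `FakeSegmentStore` is a hypothetical in-memory stand-in, not the Druid test connector, and the "at or before the cutoff" comparison is an assumption about how the cutoff is applied.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for the segments table used by a test.
final class FakeSegmentStore {
  // Maps segment id to the time its used status was last updated.
  private final Map<String, Instant> unusedSince = new HashMap<>();

  // Test hook: mark a segment unused with an explicit timestamp, so no sleep is needed.
  void markUnused(String segmentId, Instant lastUpdated) {
    unusedSince.put(segmentId, lastUpdated);
  }

  // Returns ids of unused segments last updated at or before the cutoff, sorted by id.
  List<String> unusedAtOrBefore(Instant cutoff) {
    List<String> result = new ArrayList<>();
    for (Map.Entry<String, Instant> entry : unusedSince.entrySet()) {
      if (!entry.getValue().isAfter(cutoff)) {
        result.add(entry.getKey());
      }
    }
    Collections.sort(result);
    return result;
  }
}
```

Because the timestamps are supplied by the test rather than read from the wall clock, two cutoffs an hour apart can be exercised without any real delay.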

Contributor

@kfaraz kfaraz left a comment

Thanks for the changes, @abhishekrb19 !
The test improvements can be done in a follow up.

Out of curiosity, do you plan to make similar changes for markUsed / markUnused APIs too?

Intervals.of("2019/2020"),
null,
null,
null
);
Contributor

Indentation seems off when looking at the args in the previous lines.

Comment on lines +346 to +353
final DateTime now = DateTimes.nowUtc();
final String v1 = now.toString();
final String v2 = now.minusHours(2).toString();
final String v3 = now.minusHours(3).toString();

final DataSegment segment1 = newSegment(Intervals.of("2019-01-01/2019-02-01"), v1, ImmutableMap.of("foo", "1"));
final DataSegment segment2 = newSegment(Intervals.of("2019-02-01/2019-03-01"), v2, ImmutableMap.of("foo", "1"));
final DataSegment segment3 = newSegment(Intervals.of("2019-03-01/2019-04-01"), v3, ImmutableMap.of("foo", "1"));
Contributor

Many of the tests seem to be using similar segments, versions, etc. Do you think some of this can go into the setup method?

DATA_SOURCE,
umbrellaInterval,
null,
null
);
Contributor

Indentation seems off. Closing brace should have smaller indentation than the preceding args.

@abhishekrb19
Contributor Author

Thanks for the thorough reviews, @kfaraz!

> Thanks for the changes, @abhishekrb19 !
> The test improvements can be done in a follow up.
> Out of curiosity, do you plan to make similar changes for markUsed / markUnused APIs too?

Yes, I'm going to do similar changes to the markUsed / markUnused APIs next. I will also follow up on the test improvements in a separate patch.

@abhishekrb19 abhishekrb19 merged commit fb7bb09 into apache:master Mar 13, 2024
83 checks passed
@abhishekrb19 abhishekrb19 deleted the kill_by_version branch March 13, 2024 04:07
@adarshsanjeev adarshsanjeev added this to the 30.0.0 milestone May 6, 2024