
Kinesis adaptive memory management #15360

Merged

Merged 31 commits into apache:master on Jan 19, 2024

Conversation

zachjsh (Contributor) commented Nov 10, 2023

Description

Our Kinesis consumer works by using the GetRecords API in some number of `fetchThreads`, each fetching up to `recordsPerFetch` records and inserting them into a shared buffer that can hold `recordBufferSize` records. The logic is described in our documentation at: https://druid.apache.org/docs/27.0.0/development/extensions-core/kinesis-ingestion/#determine-fetch-settings
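As a rough sketch of that fetch model (illustrative only; the names mirror the config options, not Druid's actual classes):

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative sketch: fetchThreads workers each call GetRecords and push into a
// shared bounded buffer of recordBufferSize records, which the ingestion side drains.
class KinesisFetchModel
{
  void start(int fetchThreads, int recordsPerFetch, int recordBufferSize)
  {
    final BlockingQueue<byte[]> recordBuffer = new ArrayBlockingQueue<>(recordBufferSize);
    final ExecutorService fetchPool = Executors.newFixedThreadPool(fetchThreads);
    for (int i = 0; i < fetchThreads; i++) {
      fetchPool.submit(() -> {
        try {
          while (true) {
            // Each call fetches up to recordsPerFetch records from a shard.
            for (byte[] record : getRecords(recordsPerFetch)) {
              recordBuffer.put(record); // blocks while the shared buffer is full
            }
          }
        }
        catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      });
    }
  }

  private List<byte[]> getRecords(int limit)
  {
    throw new UnsupportedOperationException("stand-in for the Kinesis GetRecords call");
  }
}
```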

There is a problem with the logic that this PR fixes: the memory limits rely on a hard-coded "estimated record size" of 10 KB if `deaggregate: false` and 1 MB if `deaggregate: true`. There have been cases where a supervisor had `deaggregate: true` set even though it wasn't needed, leading to under-utilization of memory and poor ingestion performance.

Users don't always know whether their records are aggregated. And even if they could figure it out, it's better if they don't have to. So we'd like to eliminate the `deaggregate` parameter, which means we need to manage memory adaptively based on actual record sizes.

We take advantage of the fact that GetRecords doesn't return more than 10 MB (https://docs.aws.amazon.com/streams/latest/dev/service-sizes-and-limits.html).

This PR:

- Eliminates `recordsPerFetch`; we now always use the maximum limit of 10,000 records (the default when no limit is set).
- Eliminates `deaggregate`; deaggregation is now always enabled.
- Caps `fetchThreads` to ensure that even if each fetch returns the maximum (10 MB), we don't exceed our budget (100 MB or 5% of heap). In practice this means `fetchThreads` will never be more than 10. Tasks usually don't have that many processors available to them anyway, so this shouldn't change the number of threads for many deployments. (See the sketch below.)
- Adds `recordBufferSizeBytes` as a bytes-based limit, rather than a records-based limit, for the shared queue; we know the byte size of Kinesis records at this point. The default is 100 MB or 10% of heap, whichever is smaller.
- Adds `maxBytesPerPoll` as a bytes-based limit on how much data we poll from the shared buffer at a time. The default is 1,000,000 bytes.
- Deprecates `recordBufferSize` in favor of `recordBufferSizeBytes`; a warning is logged if `recordBufferSize` is specified.
- Deprecates `maxRecordsPerPoll` in favor of `maxBytesPerPoll`; a warning is logged if `maxRecordsPerPoll` is specified.

Fixed an issue where, when the record buffer was full, the fetchRecords logic threw away the rest of the GetRecords result after `recordBufferOfferTimeout` and started a new shard iterator. This was excessively churny. Instead, we now wait an unbounded amount of time for the queue to stop being full; if the queue remains full, we end up right back waiting for it after the restarted fetch.

There was also an unchecked call to `newQ::offer` in `filterBufferAndResetBackgroundFetch`, which could have caused data loss. We now check the return value and fail if it is false.
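As a rough sketch of the sizing rules above (a minimal illustration; `heapSizeBytes` and the constant names are assumptions based on this description, not necessarily the exact code):

```java
// Maximum bytes a single GetRecords call can return, per the Kinesis service limits.
static final long GET_RECORDS_MAX_BYTES_PER_CALL = 10_000_000L;

// Fetch-thread budget: 100 MB or 5% of heap, whichever is smaller. Since each
// in-flight fetch may hold up to 10 MB, fetchThreads can never exceed 10.
static int computeMaxFetchThreads(long heapSizeBytes)
{
  final long memoryToUse = Math.min(100_000_000L, (long) (heapSizeBytes * 0.05));
  return Math.max(1, (int) (memoryToUse / GET_RECORDS_MAX_BYTES_PER_CALL));
}

// Shared-queue default: 100 MB or 10% of heap, whichever is smaller.
static long defaultRecordBufferSizeBytes(long heapSizeBytes)
{
  return Math.min(100_000_000L, (long) (heapSizeBytes * 0.10));
}
```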

Release Note

The Kinesis ingestion memory tuning config has been greatly simplified and now takes a more adaptive approach. Summary of the changes:

- Eliminates `recordsPerFetch`; the maximum limit of 10,000 records is now always used (the default when no limit is set).
- Eliminates `deaggregate`; deaggregation is now always enabled.
- Caps `fetchThreads` to ensure that even if each fetch returns the maximum (10 MB), we don't exceed our budget (100 MB or 5% of heap). In practice this means `fetchThreads` will never be more than 10. Tasks usually don't have that many processors available to them anyway, so this shouldn't change the number of threads for many deployments.
- Adds `recordBufferSizeBytes` as a bytes-based limit, rather than a records-based limit, for the shared queue. The default is 100 MB or 10% of heap, whichever is smaller.
- Adds `maxBytesPerPoll` as a bytes-based limit on how much data is polled from the shared buffer at a time. The default is 1,000,000 bytes.
- Deprecates `recordBufferSize` in favor of `recordBufferSizeBytes`; a warning is logged if `recordBufferSize` is specified.
- Deprecates `maxRecordsPerPoll` in favor of `maxBytesPerPoll`; a warning is logged if `maxRecordsPerPoll` is specified.

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@zachjsh zachjsh marked this pull request as ready for review November 14, 2023 08:34
@zachjsh zachjsh changed the title from "Kinesis adaptive memory management WIP" to "Kinesis adaptive memory management" Nov 14, 2023
@zachjsh zachjsh requested a review from gianm November 14, 2023 08:46
-For estimation purposes, Druid uses a figure of 10 KB for regular records and 1 MB for [aggregated records](#deaggregation).
-- `maxRecordsPerPoll`: 100 for regular records, 1 for [aggregated records](#deaggregation).
+- `recordBufferSizeBytes`: 100 MB or an estimated 10% of available heap, whichever is smaller.
+- `maxRecordsPerPoll`: 1.
Contributor:

Should this be higher? I wonder if this is too low in the case of non-aggregated records

Contributor (Author):

I wondered the same, actually. To be honest, I'm not sure; I think validating this would require extensive performance testing.

zachjsh (Author), Nov 27, 2023:

Changed it so that it polls at least one record, and at most 1,000,000 bytes when there is more than one record, which is what we were targeting before.
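For illustration, a minimal sketch of that drain behavior (generic and hypothetical; the real implementation lives in `MemoryBoundLinkedBlockingQueue`, mentioned later in this review):

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.function.ToLongFunction;

// Drain items until maxBytesPerPoll is reached, but always take at least one
// item so that a single oversized record cannot stall the poll forever.
// Assumes a single consumer, so peek/poll are not racing another drainer.
static <T> int drain(BlockingQueue<T> queue, List<T> sink, ToLongFunction<T> sizeOf, long maxBytesPerPoll)
{
  long bytes = 0;
  int drained = 0;
  T item;
  while ((item = queue.peek()) != null) {
    final long size = sizeOf.applyAsLong(item);
    if (drained > 0 && bytes + size > maxBytesPerPoll) {
      break; // the next item would exceed the byte budget
    }
    queue.poll();
    sink.add(item);
    bytes += size;
    drained++;
  }
  return drained;
}
```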

Contributor:

So does that mean we should update the `maxRecordsPerPoll`: 1 here?

Contributor (Author):

updated

int maxFetchThreads = Math.max(
    1,
    (int) (memoryToUse / 10_000_000L)
);
Contributor:

nit: maybe use a constant for the 10MB limit with a comment that explains the limit comes from the Kinesis library

Contributor (Author):

updated

@@ -81,7 +80,7 @@ public KinesisIndexTaskTuningConfig(
       Long handoffConditionTimeout,
       Boolean resetOffsetAutomatically,
       Boolean skipSequenceNumberAvailabilityCheck,
-      Integer recordBufferSize,
+      @Nullable Integer recordBufferSizeBytes,
Contributor:

Do you think it'd make sense to log a warning if the eliminated property is provided?

Contributor (Author):

good thought, will add.

Contributor (Author):

added


-      scheduleBackgroundFetch(recordBufferFullWait);
-      return;
+      recordBufferOfferWaitMillis = recordBufferFullWait;
Contributor:

How come the shardIterator doesn't need to be reset here as before?

zachjsh (Author), Nov 21, 2023:

Previously, when the record buffer was full here, the fetchRecords logic threw away the rest of the GetRecords result after `recordBufferOfferTimeout` and started a new shard iterator. This seemed excessively churny. Instead, we wait an unbounded amount of time for the queue to stop being full. If the queue remains full, we'll end up right back waiting for it after the restarted fetch.

-    Class<?> kclUserRecordclass = Class.forName("com.amazonaws.services.kinesis.clientlibrary.types.UserRecord");
     MethodHandles.Lookup lookup = MethodHandles.publicLookup();
     try {
+      Class<?> kclUserRecordclass = Class.forName("com.amazonaws.services.kinesis.clientlibrary.types.UserRecord");
Contributor:

Are the points about the licensing above still correct? Looks like amazon-kinesis-client is Apache licensed now: https://github.com/awslabs/amazon-kinesis-client/blob/master/LICENSE.txt

Contributor (Author):

Removed the licensing comments.

Contributor:

Oh, it even looks like since #12370, amazon-kinesis-client with an Apache license is a regular dependency. So this reflective stuff is no longer needed. Please either rewrite it to use regular Java calls, or if you don't rewrite it, include a comment describing the situation. Something like:

The deaggregate function is implemented by the amazon-kinesis-client, whose license was formerly not compatible with Apache. The code here avoids the license issue by using reflection, but is no longer necessary since amazon-kinesis-client is now Apache-licensed and is now a dependency of Druid. This code could safely be modified to use regular calls rather than reflection.

Contributor (Author):

updated
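For context, a direct (non-reflective) call could look roughly like this (assuming KCL 1.x, where `UserRecord.deaggregate` is a static method; a sketch, not the code in this PR):

```java
import com.amazonaws.services.kinesis.clientlibrary.types.UserRecord;
import com.amazonaws.services.kinesis.model.Record;
import java.util.List;

// Expands KPL-aggregated Kinesis records into their constituent user records.
static List<UserRecord> deaggregate(List<Record> fetchedRecords)
{
  return UserRecord.deaggregate(fetchedRecords);
}
```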

       .filter(x -> !partitions.contains(x.getData().getStreamPartition()))
-      .forEachOrdered(newQ::offer);
+      .forEachOrdered(x -> {
+        if (!newQ.offer(x)) {
Contributor:

Is this a new failure mode? What would've happened in the old code if the queue size was exceeded?

Contributor (Author):

It is a new failure mode. I believe if the data was not added here, it could have resulted in data loss. Any other suggestions here? I was a little concerned about this too, but I think potential data loss is worse.

Contributor (Author):

Added a comment saying that this shouldn't really happen in practice, but the check is there for safety.

@zachjsh zachjsh requested a review from jon-wei November 27, 2023 23:55
gianm (Contributor) commented Nov 29, 2023:

Tagged "release notes" since various memory-related configs are changed.


        polledRecords,
        expectedSize,
        MAX_BYTES_PER_POLL,
Contributor:

It looks like maxRecordsPerPoll isn't doing anything anymore. Is that right? If so let's get rid of it.

Contributor (Author):

Removed, and added `maxBytesPerPoll`, which is used instead now.

      .filter(x -> !partitions.contains(x.getData().getStreamPartition()))
      .forEachOrdered(x -> {
        if (!newQ.offer(x)) {
          // this should never really happen in practice but adding check here for safety.
Contributor:

Checks that should never happen, but are there for safety, should be `DruidException.defensive`

Contributor (Author):

thanks! updated
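For reference, a minimal sketch of the defensive pattern being suggested (assuming `org.apache.druid.error.DruidException`; the surrounding code is illustrative):

```java
if (!newQ.offer(x)) {
  // Should never happen in practice: newQ is sized to hold the filtered contents
  // of the old queue. Fail loudly rather than silently dropping a record.
  throw DruidException.defensive("Failed to offer record to the new queue; this should never happen");
}
```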

@@ -294,22 +296,18 @@ private Runnable fetchRecords()

// If the buffer was full and we weren't able to add the message, grab a new stream iterator starting
// from this message and back off for a bit to let the buffer drain before retrying.
if (!records.offer(currRecord, recordBufferOfferTimeout, TimeUnit.MILLISECONDS)) {
Contributor:

The comment above is no longer accurate -- we aren't grabbing new stream iterators anymore when the buffer is full.

Contributor (Author):

removed.

@zachjsh zachjsh requested a review from gianm November 29, 2023 20:14
@@ -78,6 +82,14 @@ public KinesisIndexTask(
       );
       this.useListShards = useListShards;
       this.awsCredentialsConfig = awsCredentialsConfig;
+      if (tuningConfig.getRecordBufferSizeConfigured() != null) {
Contributor:

Please move these two checks to `run()` rather than the constructor, because we don't need to log this stuff every time a task object is constructed. (That happens at various points on the Overlord due to various API calls and internal machinations, and will create a lot of log spam.)

zachjsh (Author), Dec 1, 2023:

Good catch. Moved.

        (int) (memoryToUse / GET_RECORDS_MAX_BYTES_PER_CALL)
    );
    if (fetchThreads > maxFetchThreads) {
      log.warn("fetchThreads [%d] being lowered to [%d]", fetchThreads, maxFetchThreads);
Contributor:

This warning should only get logged if configuredFetchThreads != null. There's no reason to log it if runtimeInfo.getAvailableProcessors() * 2 is lower than maxFetchThreads.

Contributor (Author):

Good catch, updated.
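A minimal sketch of the resulting behavior (variable names such as `configuredFetchThreads` are assumptions based on this discussion, not the exact code):

```java
final int fetchThreads;
if (configuredFetchThreads != null) {
  if (configuredFetchThreads > maxFetchThreads) {
    // Warn only when an explicitly configured value has to be lowered.
    log.warn("fetchThreads [%d] being lowered to [%d]", configuredFetchThreads, maxFetchThreads);
  }
  fetchThreads = Math.min(configuredFetchThreads, maxFetchThreads);
} else {
  // Derived default: clamp the heuristic silently, no warning needed.
  fetchThreads = Math.min(runtimeInfo.getAvailableProcessors() * 2, maxFetchThreads);
}
```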

@JsonProperty("fetchDelayMillis") Integer fetchDelayMillis,
@JsonProperty("awsAssumedRoleArn") String awsAssumedRoleArn,
@JsonProperty("awsExternalId") String awsExternalId,
@Nullable @JsonProperty("autoScalerConfig") AutoScalerConfig autoScalerConfig,
@JsonProperty("deaggregate") boolean deaggregate
Contributor:

The recordsPerFetch and deaggregate properties should stay here for better compatibility during rolling updates and rollbacks. (We don't want to lose track of them prior to a potential rollback.)

So let's instead mark them deprecated, but keep them.

Contributor (Author):

added back and marked as deprecated.

        polledRecords,
        expectedSize,
        maxBytesPerPoll,
Contributor:

What happens if a single record is larger than `maxBytesPerPoll`? Would this get stuck and make no progress?

Contributor (Author):

Good question. It always drains at least one record; I clarified that in the docs. I also added a test for this, see `org.apache.druid.java.util.common.MemoryBoundLinkedBlockingQueueTest#test_drain_queueWithFirstItemSizeGreaterThanLimit_succeeds`.

@zachjsh zachjsh requested review from gianm and jon-wei December 1, 2023 17:46
@zachjsh zachjsh merged commit 9d4e805 into apache:master Jan 19, 2024
83 checks passed
@zachjsh zachjsh deleted the kinesis-adaptive-memory-management branch January 19, 2024 19:30
@adarshsanjeev adarshsanjeev added this to the 30.0.0 milestone May 6, 2024
4 participants