Cache segment metadata on the Overlord to speed up segment allocation and other task actions #17653

Open: kfaraz wants to merge 7 commits into master

kfaraz
Contributor

@kfaraz kfaraz commented Jan 22, 2025

Description

The Overlord performs several metadata operations on the tables druid_segments and druid_pendingSegments,
such as segment commit, allocation, upgrade and mark used / unused.

Segment allocation, in particular, involves several reads/writes to the metadata store and can often become a bottleneck,
causing ingestion to slow down. This effect is particularly pronounced when streaming ingestion is enabled for
multiple datasources or if there is a lot of late arriving data.

This patch adds an in-memory segment cache to the Overlord to speed up all segment metadata operations.

Assumptions

  • A segment metadata transaction involves only a single datasource.
  • A segment metadata transaction does not read what it writes.

Design

Summary

  • Add Overlord runtime property druid.manager.segments.useCache to enable cache
  • Keep cache disabled by default
  • When cache is enabled, read metadata from cache and write metadata to both metadata store and cache
  • Poll metadata store periodically in case any metadata update did not make it to the cache (should not happen under stable operational conditions)

Segment metadata transaction with cache enabled

  • If not leader, do not proceed
  • If cache is not READY, fall back to old flow
  • Acquire a lock on cache to ensure that another thread does not update it while we are reading from it.
  • Start transaction
  • Get leader term
  • Perform computations
  • For every read, just read from the cache
  • For every write
    • check if leadership has been lost or the term has changed (this safeguards against losing leadership during the transaction)
    • if yes, roll back the transaction
    • if not, remember the write action so that it can be committed to the cache later
  • If transaction has succeeded, commit pending writes to cache
  • Close transaction
  • Release lock
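
The flow above can be sketched roughly as follows. This is a minimal illustration, not the code in this patch: `CachedTransactionSketch`, `Leadership`, and `writeAll` are hypothetical names, plain maps stand in for the metadata store and the cache, and the "fall back to old flow when the cache is not READY" step is left out for brevity.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;

/** Illustrative only: none of these names are taken from the patch. */
final class CachedTransactionSketch {
  /** Stand-in for leader election; a negative term means "not leader". */
  interface Leadership {
    int currentTerm();
  }

  private final ReentrantLock cacheLock = new ReentrantLock();
  private final Map<String, String> cache = new HashMap<>();
  private final Map<String, String> metadataStore; // stand-in for the SQL metadata store
  private final Leadership leadership;

  CachedTransactionSketch(Map<String, String> metadataStore, Leadership leadership) {
    this.metadataStore = metadataStore;
    this.leadership = leadership;
  }

  /** Reads are served from the cache only. */
  String read(String segmentId) {
    cacheLock.lock();
    try {
      return cache.get(segmentId);
    } finally {
      cacheLock.unlock();
    }
  }

  /** Returns true only if every write reached both the store and the cache. */
  boolean writeAll(Map<String, String> updates) {
    cacheLock.lock(); // no other thread may update the cache meanwhile
    try {
      final int startTerm = leadership.currentTerm();
      if (startTerm < 0) {
        return false; // not leader, do not proceed
      }
      Map<String, String> storeWrites = new LinkedHashMap<>(); // uncommitted store writes
      List<Runnable> pendingCacheWrites = new ArrayList<>();
      for (Map.Entry<String, String> update : updates.entrySet()) {
        if (leadership.currentTerm() != startTerm) {
          return false; // leadership lost or term changed: roll back, nothing applied yet
        }
        final String id = update.getKey();
        final String payload = update.getValue();
        storeWrites.put(id, payload);
        pendingCacheWrites.add(() -> cache.put(id, payload)); // remember, apply later
      }
      metadataStore.putAll(storeWrites);         // commit to the metadata store
      pendingCacheWrites.forEach(Runnable::run); // then commit pending writes to the cache
      return true;
    } finally {
      cacheLock.unlock();
    }
  }
}
```

Deferring the cache writes until the store commit has gone through keeps the cache from ever holding data that the metadata store rejected.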

Lifecycle of cache

  • Enable cache if druid.manager.segments.useCache=true on the Overlord
  • If Overlord gains leadership, start() cache
  • Cache then polls the metadata store once every druid.manager.segments.pollDuration to do the following:
    • Retrieve all segment IDs and their last updated timestamps
    • If the cache has stale or no information for any unused segment, update it
    • If the cache has stale or no information for any used segment, fetch entire segment payload
    • Retrieve all pending segments and update cache if it has stale information
  • When first poll is complete, mark cache as READY for transactions
  • If Overlord loses leadership, stop() cache
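
As a rough illustration of the lifecycle above, the sketch below compares last updated timestamps from the metadata store against what the cache holds and reports which segment IDs need a refresh. `PollSyncSketch`, its `State` enum, and the use of `long` timestamps are assumptions made for this example; the patch may track state and timestamps differently.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Illustrative only: a toy model of start(), poll and stop(). */
final class PollSyncSketch {
  enum State { STOPPED, STARTED, READY } // hypothetical state names

  private State state = State.STOPPED;
  private final Map<String, Long> cachedUpdatedTimes = new HashMap<>();

  /** Called when the Overlord gains leadership. */
  void start() {
    state = State.STARTED;
  }

  /** Called when the Overlord loses leadership. */
  void stop() {
    state = State.STOPPED;
  }

  boolean isReady() {
    return state == State.READY;
  }

  /**
   * Takes the segment IDs and last updated timestamps read from the metadata
   * store and returns the IDs for which the cache had stale or no information.
   */
  Set<String> poll(Map<String, Long> storeUpdatedTimes) {
    Set<String> refreshed = new HashSet<>();
    if (state == State.STOPPED) {
      return refreshed; // cache is not running, nothing to sync
    }
    for (Map.Entry<String, Long> entry : storeUpdatedTimes.entrySet()) {
      Long cachedTime = cachedUpdatedTimes.get(entry.getKey());
      if (cachedTime == null || cachedTime < entry.getValue()) {
        // A real cache would fetch the full payload here for used segments.
        cachedUpdatedTimes.put(entry.getKey(), entry.getValue());
        refreshed.add(entry.getKey());
      }
    }
    state = State.READY; // first completed poll makes the cache usable
    return refreshed;
  }
}
```

Under stable conditions every write already reaches the cache, so most polls in this model would return an empty set.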

Contents of cache

The cache maintains the following fields for every datasource.

Field | Needed in
Map<String, DataSegmentPlus> idToUsedSegment | Reads/writes for used segments
Set<String> unusedSegmentIds | Checking set of existing segment IDs to avoid duplicate insert
Map<Interval, Map<String, Integer>> intervalVersionToHighestUnusedPartitionNumber | Segment allocation to avoid duplicate IDs
Map<Interval, Map<String, PendingSegmentRecord>> intervalToPendingSegments | Segment allocation
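
To make the role of these fields concrete, here is a simplified sketch of a per-datasource cache. It is an assumption-laden toy: intervals, payloads and versions are plain strings instead of `Interval`, `DataSegmentPlus` and `PendingSegmentRecord`, pending segments are omitted, and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Illustrative only: mirrors three of the fields listed above with simplified types. */
final class DatasourceCacheSketch {
  private final Map<String, String> idToUsedSegment = new HashMap<>();
  private final Set<String> unusedSegmentIds = new HashSet<>();
  private final Map<String, Map<String, Integer>> intervalVersionToHighestUnusedPartitionNumber =
      new HashMap<>();

  void addUsedSegment(String id, String payload) {
    unusedSegmentIds.remove(id);
    idToUsedSegment.put(id, payload);
  }

  void addUnusedSegment(String id, String interval, String version, int partitionNumber) {
    idToUsedSegment.remove(id);
    unusedSegmentIds.add(id);
    // Keep only the highest partition number seen for this interval and version.
    intervalVersionToHighestUnusedPartitionNumber
        .computeIfAbsent(interval, key -> new HashMap<>())
        .merge(version, partitionNumber, Integer::max);
  }

  /** Used to avoid inserting a segment ID that already exists. */
  boolean exists(String id) {
    return idToUsedSegment.containsKey(id) || unusedSegmentIds.contains(id);
  }

  /** Returns -1 if no unused segment is known for the interval and version. */
  int highestUnusedPartitionNumber(String interval, String version) {
    Map<String, Integer> versionToHighest =
        intervalVersionToHighestUnusedPartitionNumber.get(interval);
    if (versionToHighest == null) {
      return -1;
    }
    return versionToHighest.getOrDefault(version, -1);
  }
}
```

In this model, allocation would pick a partition number above both the used segments and `highestUnusedPartitionNumber`, so a new ID never collides with an unused segment whose payload is not cached.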

Code changes

Class / Description
SegmentsMetadataManagerConfig.useCache
  • Enable/disable cache on the Overlord
DatasourceSegmentMetadataReader
  • Interface to perform segment metadata read operations
DatasourceSegmentMetadataWriter
  • Interface to perform segment metadata write operations
HeapMemorySegmentMetadataCache
  • Poll committed and pending segments from the metadata store
DatasourceSegmentCache
  • Cache committed and pending segments of a single datasource
SegmentMetadataTransaction
  • Encapsulate all read/write operations performed within a transaction.
  • This abstraction allows the code to redirect all read/write operations within a transaction to either the cache or to the metadata store itself
SqlSegmentMetadataTransaction
  • Perform reads/writes directly on the metadata store if the cache is disabled or not ready
CachedSegmentMetadataTransaction
  • Perform reads only from the cache and writes to both the metadata store and the cache
SqlSegmentMetadataTransactionFactory
  • Create transaction based on state of cache
IndexerSQLMetadataStorageCoordinator
  • Perform all metadata transactions using transaction factory
  • Move metadata read methods to SqlSegmentsMetadataQuery
  • Move metadata write methods to SqlSegmentMetadataTransaction
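
The choice the transaction factory makes can be summarized in a few lines. The sketch below is hypothetical: `TransactionFactorySketch`, `CacheState` and `TransactionKind` are names invented for this example and only model the decision, not the actual SqlSegmentMetadataTransactionFactory.

```java
/** Illustrative only: models which kind of transaction is created for a cache state. */
final class TransactionFactorySketch {
  enum CacheState { DISABLED, SYNCING, READY } // hypothetical state names

  enum TransactionKind { SQL_ONLY, CACHED }

  /** A cached transaction is used only once the cache is READY; otherwise use the old flow. */
  static TransactionKind choose(CacheState cacheState) {
    return cacheState == CacheState.READY
           ? TransactionKind.CACHED
           : TransactionKind.SQL_ONLY;
  }
}
```

Keeping this decision in one place means callers such as IndexerSQLMetadataStorageCoordinator do not need to know whether the cache is in use.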

Testing

  • Update IndexerSQLMetadataStorageCoordinatorTest to run both with and without cache
  • Add DatasourceSegmentCacheTest

Pending items

  • Add more UTs for the cache classes
  • Wire up existing UTs to use cache
  • Update existing ITs to work with cache
  • Benchmarking

Release note

Add Overlord runtime property druid.manager.segments.useCache (default value false).
Set this to true to turn on segment metadata caching on the Overlord. This allows segment metadata operations
such as reads and segment allocation to be sped up significantly.


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@kfaraz kfaraz marked this pull request as ready for review January 28, 2025 12:34
// Assume that the metadata write operation succeeded
// Do not update the cache just yet, add to the list of pending writes
pendingCacheWrites.add(writer -> {
  T ignored = action.apply(writer);

Check notice (Code scanning / CodeQL): Unread local variable

Variable 'T ignored' is never read.