Support persisting TableMetadata in the metastore #433
LGTM. One suggestion: add a comment so readers of the code can more easily understand why we might expect a stale cached metadata location.
Thanks for tackling this effort!
However, I think the efficiency of this approach wrt resource usage (heap/network/database pressure) could be drastically improved.
There seems to be a potential inconsistency w/ the cached metadata, which shouldn't happen at all IMO.
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class MetadataCacheManager {
This is rather a collection of static utility methods than a "cache manager"? Maybe give it a better name.
BTW: If it only has utility methods, add a private no-arg constructor.
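For illustration, a minimal sketch of the utility-class shape being suggested. The class name `MetadataCacheUtil` and the helper `fitsWithin` are hypothetical and not taken from the PR; only the private no-arg constructor is the point here.

```java
/** Hypothetical name; the class under review is MetadataCacheManager. */
final class MetadataCacheUtil {

  // Private no-arg constructor: the class only holds static helpers,
  // so it should never be instantiated.
  private MetadataCacheUtil() {
    throw new AssertionError("no instances");
  }

  /** Illustrative static helper (not from the PR): is the string within the limit? */
  static boolean fitsWithin(String json, long maxChars) {
    return json.length() <= maxChars;
  }

  public static void main(String[] args) {
    System.out.println(fitsWithin("{\"format-version\":2}", 64));
  }
}
```

Marking the class `final` as well prevents subclasses from working around the private constructor.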
Sounds good, do you have an alternative name you prefer?
Since we are treating the metastore as a cache for the metadata in object storage, and this class manages that cache, I chose the current name.
String.format("Caching metadata for %s", tableLikeEntity.getTableIdentifier()));
TableLikeEntity newTableLikeEntity =
new TableLikeEntity.Builder(tableLikeEntity)
.setMetadataContent(tableLikeEntity.getMetadataLocation(), json)
Storing JSON uncompressed and as a string is a waste of resources on the network, on the heap, and in the database.
I'd suggest using the Smile serialization format and likely compressing that as well.
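To give a rough feel for the compression half of this suggestion, here is a minimal sketch using only the JDK's `java.util.zip` GZIP streams. Smile itself is not shown, since it needs Jackson's Smile data format module. The class name, method names, and the sample JSON are all made up for illustration; none of this is code from the PR.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class JsonCompressionSketch {
  private JsonCompressionSketch() {}

  /** GZIP-compresses the UTF-8 bytes of the given JSON string. */
  static byte[] gzip(String json) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
      gz.write(json.getBytes(StandardCharsets.UTF_8));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return out.toByteArray();
  }

  /** Reverses gzip(): decompresses and decodes back to a string. */
  static String gunzip(byte[] compressed) {
    try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
      return new String(gz.readAllBytes(), StandardCharsets.UTF_8);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    // Made-up, repetitive JSON standing in for a metadata.json payload.
    StringBuilder sb = new StringBuilder("{\"snapshots\":[");
    for (int i = 0; i < 200; i++) {
      if (i > 0) sb.append(',');
      sb.append("{\"snapshot-id\":").append(i).append(",\"operation\":\"append\"}");
    }
    String json = sb.append("]}").toString();
    byte[] packed = gzip(json);
    System.out.println("raw bytes:  " + json.getBytes(StandardCharsets.UTF_8).length);
    System.out.println("gzip bytes: " + packed.length);
    System.out.println("round trip ok: " + gunzip(packed).equals(json));
  }
}
```

How much this saves on real table metadata depends on the content; the sample above is deliberately repetitive.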
I think this is probably not a blocker for this initial PR; we can always add another format later on. Personally, I would rather we use a more structured schema (e.g. a more relational one) that supports partial metadata loading, rather than just compressing the current JSON-based approach.
Regarding "Storing JSON uncompressed and as a string": notably, the metastore does exactly this today for properties and internal properties.
PolarisCallContext callContext,
PolarisMetaStoreManager metaStoreManager,
PolarisResolutionManifestCatalogView resolvedEntityView) {
String json = TableMetadataParser.toJson(metadata);
TBH, this requires a lot of duplicate expensive operations:
- it loads the metadata JSON from the object store
- parses JSON into a TableMetadata
- serializes the TableMetadata to the same string representation as in step 1
- converts the String into a byte[] representation - just to check the length
The latter steps are likely performed for every access to a table-metadata that exceeds the METADATA_CACHE_MAX_BYTES_NO_CACHING setting, adding a lot of new heap pressure.
I'm concerned about this too, but it's only there for users who have a non-infinite and non-zero METADATA_CACHE_MAX_BYTES. Even still, the overhead is not small. Do you think just checking the string length is sufficient?
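On the string-length question, one possible middle ground is sketched below. For a Java `String`, the UTF-8 size is at least `length()` and at most `3 * length()`, so the string length can settle most cases on its own, and the remaining ones can be counted without allocating a `byte[]`. The class and method names are hypothetical; this is not code from the PR.

```java
final class Utf8Size {
  private Utf8Size() {}

  /** Counts the UTF-8 encoded size of s without materializing a byte[]. */
  static long utf8Length(CharSequence s) {
    long bytes = 0;
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (c < 0x80) {
        bytes += 1;
      } else if (c < 0x800) {
        bytes += 2;
      } else if (Character.isHighSurrogate(c)
          && i + 1 < s.length()
          && Character.isLowSurrogate(s.charAt(i + 1))) {
        bytes += 4; // a valid surrogate pair encodes as one 4-byte sequence
        i++;
      } else if (Character.isSurrogate(c)) {
        bytes += 1; // unpaired surrogate: String.getBytes(UTF_8) substitutes '?'
      } else {
        bytes += 3;
      }
    }
    return bytes;
  }

  /** True if the UTF-8 size of s is larger than maxBytes. */
  static boolean exceeds(String s, long maxBytes) {
    if (s.length() > maxBytes) {
      return true; // every char contributes at least one byte
    }
    if (3L * s.length() <= maxBytes) {
      return false; // no char contributes more than three bytes
    }
    return utf8Length(s) > maxBytes; // only the ambiguous band needs a scan
  }

  public static void main(String[] args) {
    String json = "{\"location\":\"s3://bucket/table\"}";
    System.out.println(utf8Length(json) + " bytes, exceeds 16? " + exceeds(json, 16));
  }
}
```

For mostly-ASCII JSON the string length is already very close to the byte size, so checking `length()` alone may be an acceptable approximation if the limit does not need to be exact.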
 * Load the cached {@link Table} or fall back to `fallback` if one doesn't exist. If the metadata
 * is not currently cached, it may be added to the cache.
 */
public static TableMetadata loadTableMetadata(
What's the reason why view metadata isn't handled as well?
My hope is to limit the scope of the initial PR and expand from there. Do you think view support should be included in the initial change though?
Description
This adds a new flag METADATA_CACHE_MAX_BYTES which allows the catalog to store table metadata in the metastore and vend it from there when loadTable is called.
Entries are cached based on the metadata location. Currently, the entire metadata.json content is cached.
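As a rough illustration of what caching "based on the metadata location" can look like, here is a simplified in-memory sketch. It is not the PR's implementation: `TableEntry`, its fields, and `loadMetadataJson` are hypothetical stand-ins for the persisted entity and the load path, and the string length is used as a cheap stand-in for the byte size.

```java
import java.util.function.Supplier;

final class LocationKeyedCacheSketch {
  private LocationKeyedCacheSketch() {}

  /** Hypothetical stand-in for the table entity persisted in the metastore. */
  static final class TableEntry {
    String metadataLocation; // current metadata.json location of the table
    String cachedLocation;   // location the cached content was read from (may be stale)
    String cachedJson;       // cached metadata.json content, or null if nothing is cached
  }

  /**
   * Returns the cached content if it was cached for the table's current metadata
   * location; otherwise loads via fallback and caches the result when it is small enough.
   */
  static String loadMetadataJson(TableEntry entry, long maxBytes, Supplier<String> fallback) {
    boolean fresh =
        entry.cachedJson != null && entry.metadataLocation.equals(entry.cachedLocation);
    if (fresh) {
      return entry.cachedJson; // cache hit for the current location
    }
    // Miss, or the table has since moved to a new metadata file: read from storage.
    String json = fallback.get();
    if (json.length() <= maxBytes) {
      entry.cachedLocation = entry.metadataLocation;
      entry.cachedJson = json;
    } else {
      // Too large to cache: drop any stale content and still return the loaded value.
      entry.cachedLocation = null;
      entry.cachedJson = null;
    }
    return json;
  }

  public static void main(String[] args) {
    TableEntry entry = new TableEntry();
    entry.metadataLocation = "s3://example-bucket/tbl/metadata/v1.metadata.json";
    Supplier<String> objectStore = () -> "{\"format-version\":2}";
    loadMetadataJson(entry, 1024, objectStore); // first call populates the cache
    System.out.println("cached for: " + entry.cachedLocation);
  }
}
```

Keying on the location means a commit that writes a new metadata file invalidates the cached copy implicitly, because the stored location no longer matches the table's current one.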
Features not included in this PR:
There is partial support for (1) here and I want to extend it, but the goal is to structure things in a way that will allow us to implement (2) and (3) in the future as well.
How Has This Been Tested?
Existing tests vend table metadata correctly when caching is enabled.
Added a small test in BasePolarisCatalogTest to cover the basic semantics of caching.
Manual testing with eclipselink -- I observed the entities getting created in Postgres and saw large metadata being cached:
With MySQL, small metadata is persisted:
However, large metadata may cause internalproperties to exceed the size limit, and nothing will be cached. Calls still return safely.