[Iceberg] Add manifest file caching for HMS-based deployments #24481
Conversation
Force-pushed from 7db7896 to 666d248
Sorry, I may not have the correct context on this, but is it possible to add some tests too?
@@ -67,7 +67,7 @@ public class IcebergConfig
     private EnumSet<ColumnStatisticType> hiveStatisticsMergeFlags = EnumSet.noneOf(ColumnStatisticType.class);
     private String fileIOImpl = HadoopFileIO.class.getName();
-    private boolean manifestCachingEnabled;
+    private boolean manifestCachingEnabled = true;
Is this intended?
Yes, this is intentional. Performance is significantly worse with it disabled, and I don't think there are any known downsides to enabling it by default other than an increased memory footprint.
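For reference, a minimal sketch of how a default-on flag like this is typically wired in an airlift-style config class. The property name `iceberg.io.manifest.cache-enabled` and the class name are assumptions for illustration and may not match the PR exactly:

```java
import com.facebook.airlift.configuration.Config;

public class IcebergConfigSketch
{
    // enabled by default; operators can opt out via catalog properties
    private boolean manifestCachingEnabled = true;

    public boolean getManifestCachingEnabled()
    {
        return manifestCachingEnabled;
    }

    @Config("iceberg.io.manifest.cache-enabled")   // hypothetical property name
    public IcebergConfigSketch setManifestCachingEnabled(boolean manifestCachingEnabled)
    {
        this.manifestCachingEnabled = manifestCachingEnabled;
        return this;
    }
}
```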
public ManifestFileCache createManifestFileCache(IcebergConfig config, MBeanExporter exporter)
{
    Cache<ManifestFileCacheKey, ManifestFileCachedContent> delegate = CacheBuilder.newBuilder()
            .maximumWeight(config.getManifestCachingEnabled() ? config.getMaxManifestCacheSize() : 0)
If caching is disabled, I think we should not have any cache at all instead of configuring it with a weight of 0.
I have updated the PR to remove use of the 0 weight and bypass the cache entirely if it is not enabled.
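For illustration, a rough sketch of what bypassing the cache might look like. `HdfsCachedInputFile` and `ManifestFileCache` are the PR's classes, but the constructor shape and the `openUncached` helper are assumptions, not the PR's actual code:

```java
import org.apache.iceberg.io.InputFile;

public class CachingFileIOSketch
{
    private final ManifestFileCache cache;

    public CachingFileIOSketch(ManifestFileCache cache)
    {
        this.cache = cache;
    }

    public InputFile newInputFile(String path)
    {
        InputFile delegate = openUncached(path);
        if (!cache.isEnabled()) {
            // caching disabled: hand back the plain file, no wrapper and no zero-weight cache
            return delegate;
        }
        return new HdfsCachedInputFile(delegate, cache);   // assumed constructor shape
    }

    private InputFile openUncached(String path)
    {
        // stands in for the real HDFS/S3 environment lookup
        throw new UnsupportedOperationException("illustrative stub");
    }
}
```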
Can you post some metrics about cache hit ratios/evictions for a canonical read-heavy workload? Maybe something like partitioned/unpartitioned TPC-DS?
presto-iceberg/src/main/java/com/facebook/presto/iceberg/ManifestFileCache.java (resolved)
long fileLength = delegate.getLength();
if (fileLength <= cache.getMaxFileLength() && cache.isEnabled()) {
    try {
        ManifestFileCachedContent content = readFully(delegate, fileLength);
N00b question, but are the (Avro) manifest files always or mostly read fully and then deserialized? Or are range-reads supported?
This is a good question. For the most part, when dealing with manifests, the files are read fully. There are cases where the full content is not used, e.g. when reading partition specs in Avro format you only need to read the file metadata.
However, in order to plan an entire query you need to read all of the (valid) manifest files fully. You won't realistically ever need just the partition specs, and the partition specs are contained within one of those files anyway.
Additionally, when caching is enabled on catalogs other than HMS, this is the same approach taken in the Iceberg library.
        implements InputFile
{
    private static final Logger LOG = Logger.get(HdfsCachedInputFile.class);
    private static final long BUFFER_CHUNK_SIZE = 2 * 1024 * 1024;
Is the default here 2 MB for a specific reason? Can we make this Integer.MAX_VALUE?
I think Integer.MAX_VALUE is too large. The Iceberg code uses around 4 MB, I believe. There is a balance to strike in the chunk sizes: since a ByteBuffer internally stores a contiguous array of bytes, you need that much contiguous memory available. When you break the content into chunks, it's easier for the allocator to find smaller blocks, and it puts less pressure on the GC when under load. I think going as high as 16 MB would probably be OK. My gut says 2 MB is probably fine, but I am open to tweaking it.
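To make the trade-off concrete, here is a small sketch of reading a file fully into fixed-size chunks instead of one contiguous buffer, assuming the 2 MB chunk size discussed above. This illustrates the idea and is not necessarily the PR's exact readFully implementation:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

final class ChunkedReadSketch
{
    private static final int BUFFER_CHUNK_SIZE = 2 * 1024 * 1024;

    static List<ByteBuffer> readFully(InputStream in, long fileLength)
            throws IOException
    {
        List<ByteBuffer> chunks = new ArrayList<>();
        long remaining = fileLength;
        while (remaining > 0) {
            // each allocation is at most one chunk, so no single huge contiguous array is needed
            int chunkSize = (int) Math.min(remaining, BUFFER_CHUNK_SIZE);
            byte[] chunk = new byte[chunkSize];
            int offset = 0;
            while (offset < chunkSize) {
                int read = in.read(chunk, offset, chunkSize - offset);
                if (read < 0) {
                    throw new IOException("Unexpected end of stream");
                }
                offset += read;
            }
            chunks.add(ByteBuffer.wrap(chunk));
            remaining -= chunkSize;
        }
        return chunks;
    }
}
```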
It's better for this to come from a config.
Agree to set a slightly smaller value here.
I have not tested on a partitioned dataset yet, but on our internal unpartitioned TPC-DS SF1k dataset executed in the "ds_power" configuration (1 query at a time, q1 through q99), the cache hit rate was 96.8%. The total numbers, if I recall correctly, were cache hits somewhere between 10-12k, while misses were just a few hundred. When testing locally on an SF10 dataset generated from the
@Provides
public ManifestFileCache createManifestFileCache(IcebergConfig config, MBeanExporter exporter)
{
    Cache<ManifestFileCacheKey, ManifestFileCachedContent> delegate = CacheBuilder.newBuilder()
Should we just use Caffeine as the caching library since iceberg-core already brings it in ? It appears to have better performance and is recommended by the Guava team too
I had the same thought too. Caching performance would likely improve as well, because eviction decisions in Caffeine use global weight rather than per-segment weight as in Guava. However, most of the Presto codebase uses Guava caches. Since Caffeine and Guava caches are different types, it would not be compatible with the current infrastructure such as the CacheStatsMBean object. Additionally, we use Guava's SimpleForwardingCache, which is not available in Caffeine, so I would have to roll my own. Not a terrible amount of effort, but I think there's enough work there to push that effort into a separate PR.
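As a sketch of why staying on Guava keeps the existing plumbing intact: the cache can extend Guava's ForwardingCache.SimpleForwardingCache and still be exported through CacheStatsMBean. The class below is illustrative; only ManifestFileCache, ManifestFileCacheKey, and ManifestFileCachedContent come from this PR, and the constructor shape is an assumption:

```java
import com.google.common.cache.Cache;
import com.google.common.cache.ForwardingCache.SimpleForwardingCache;

public class ManifestFileCacheSketch
        extends SimpleForwardingCache<ManifestFileCacheKey, ManifestFileCachedContent>
{
    private final boolean enabled;
    private final long maxFileLength;

    public ManifestFileCacheSketch(
            Cache<ManifestFileCacheKey, ManifestFileCachedContent> delegate,
            boolean enabled,
            long maxFileLength)
    {
        super(delegate);   // SimpleForwardingCache forwards all Cache methods, including stats()
        this.enabled = enabled;
        this.maxFileLength = maxFileLength;
    }

    public boolean isEnabled()
    {
        return enabled;
    }

    public long getMaxFileLength()
    {
        return maxFileLength;
    }
}
```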
Some more concrete data on how much manifest caching improves planning times.

[Image: absolute analysis time comparison]

[Image: analysis time ratio comparing caching to no caching -- 1.0 means the time was equivalent without caching. Lower is better.]

Additionally, here's some raw data which includes all the cache statistics on the manifest cache. Unfortunately, we don't have data about the eviction counts. Here are the most pertinent, IMO:

"cachestats.hitcount": 25801,
"cachestats.hitrate": 0.9825209444021326,
"cachestats.misscount": 459,
"cachestats.size": 22,
"filesizedistribution.alltime.avg": 11953.193899782134,
"filesizedistribution.alltime.count": 459.0,
"filesizedistribution.alltime.max": 18990,
"filesizedistribution.alltime.maxerror": 0.0,
"filesizedistribution.alltime.min": 4528,
"filesizedistribution.alltime.p01": 4602,
"filesizedistribution.alltime.p05": 6793,
"filesizedistribution.alltime.p10": 7322,
"filesizedistribution.alltime.p25": 8417,
"filesizedistribution.alltime.p50": 12048,
"filesizedistribution.alltime.p75": 14475,
"filesizedistribution.alltime.p90": 18084,
"filesizedistribution.alltime.p95": 18949,
"filesizedistribution.alltime.p99": 18990, One thing to note, is that the cache is completely fresh for q1, 2, 3 etc. So we have higher query planning times in the beginning of the DS-power run while the cache is getting populated. You can see once we've read most tables' metadata the analysis time consistently starts dropping around q6/7/8 |
Force-pushed from 2563391 to 2c9c425
Nit: suggested rephrase of the release note to follow the "Order of changes" phrasing in the Release Notes Guidelines.
A little unsure about this. Please correct me if I'm wrong, but should we just implement the method? The reference code in the Iceberg lib can be found here. So it seems that the following code in
This is a good question. I initially was going to use this method but decided it would not work well. The reason we can't use the Iceberg library's caching code is that (1) there are no metrics available, so we can't track the hit/miss counts or report them in the query's runtime metrics (this is currently a limitation with non-Hive catalogs), and (2) we wouldn't be able to cache across queries, because the cache key in the Iceberg library is a single IO instance. In Presto's current implementation, we create a new IO instance for every new query.
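To illustrate the keying difference, a tiny sketch of a process-wide cache keyed by manifest file path, so entries survive across queries even though each query constructs a new IO instance. A plain String path stands in for the PR's ManifestFileCacheKey, and the weight limit is arbitrary:

```java
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.Weigher;

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;

public final class SharedManifestCacheSketch
{
    private static final Weigher<String, byte[]> BY_CONTENT_LENGTH = (path, content) -> content.length;

    // one process-wide cache, keyed by file path rather than by FileIO instance
    private static final Cache<String, byte[]> MANIFEST_CACHE = CacheBuilder.newBuilder()
            .maximumWeight(64L * 1024 * 1024)
            .weigher(BY_CONTENT_LENGTH)
            .build();

    public static byte[] readManifest(String path, Callable<byte[]> loader)
            throws ExecutionException
    {
        // every query's HdfsFileIO can consult the same cache, so hits accumulate across queries
        return MANIFEST_CACHE.get(path, loader);
    }

    private SharedManifestCacheSketch() {}
}
```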
LGTM % tests
Thanks for the explanation. LGTM, only a couple of nits and a small question.
presto-iceberg/src/main/java/com/facebook/presto/iceberg/HdfsFileIO.java (resolved, outdated)
presto-iceberg/src/main/java/com/facebook/presto/iceberg/HdfsCachedInputFile.java (resolved, outdated)
Force-pushed from 2c9c425 to 28cf822
Force-pushed from 28cf822 to 9a73101
Description
Adds manifest file caching to the Iceberg connector for HMS-based deployments.
Motivation and Context
In order to optimize and plan Iceberg queries, we call the planFiles() API multiple times throughout the query optimization lifecycle. Each call requires reading and parsing metadata files which usually live on an external filesystem such as S3. For large tables there could be hundreds of these files, usually ranging from a few kilobytes up to a few megabytes in size. When they are not cached in memory within Presto, this can lead to significant E2E query latency degradation.

Impact
TBD
Test Plan
TBD
Contributor checklist
Release Notes