Add IndexInput#prefetch. #13337
This adds `IndexInput#prefetch`, which is an optional operation that instructs the `IndexInput` to start fetching bytes from storage in the background. These bytes will be picked up by follow-up calls to the `IndexInput#readXXX` methods. In the future, this will help Lucene move from a maximum of one I/O operation per search thread to one I/O operation per search thread per `IndexInput`. Typically, when running a query on two terms, the I/O into the terms dictionary is sequential today. In the future, we would ideally do these I/Os in parallel using this new API. Note that this will require API changes to some classes including `TermsEnum`.
I settled on this API because it's simple and wouldn't require making all Lucene APIs asynchronous to take advantage of extra I/O concurrency, which I worry would make the query evaluation logic too complicated.
Currently, only `NIOFSDirectory` implements this new API. I played with `MMapDirectory` as well and found an approach that worked better in the benchmark I've been playing with, but I'm not sure it makes sense to implement this API on this directory as it either requires adding an explicit buffer on `MMapDirectory`, or forcing data to be loaded into the page cache even though the OS may have decided that it's not a good idea due to too few cache hits.
This change will require follow-ups to start using this new API when working with terms dictionaries, postings, etc.
Relates apache#13179
I created the following benchmark to simulate lookups in a terms dictionary that cannot fit in the page cache.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;
import java.util.concurrent.ThreadLocalRandom;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.IndexOutput;
import org.apache.lucene.store.NIOFSDirectory;

public class PrefetchBench {

  private static final int NUM_TERMS = 3;
  private static final long FILE_SIZE = 100L * 1024 * 1024 * 1024; // 100GB
  private static final int NUM_BYTES = 16;
  public static int DUMMY;

  @SuppressWarnings("unchecked")
  public static void main(String[] args) throws IOException {
    Path filePath = Paths.get(args[0]);
    Path dirPath = filePath.getParent();
    String fileName = filePath.getFileName().toString();
    Random r = ThreadLocalRandom.current();

    try (Directory dir = new NIOFSDirectory(dirPath)) {
      // Create the file with random content on the first run only
      if (Arrays.asList(dir.listAll()).contains(fileName) == false) {
        try (IndexOutput out = dir.createOutput(fileName, IOContext.DEFAULT)) {
          byte[] buf = new byte[8196];
          for (long i = 0; i < FILE_SIZE; i += buf.length) {
            r.nextBytes(buf);
            out.writeBytes(buf, buf.length);
          }
        }
      }

      for (boolean dataFitsInCache : new boolean[] {false, true}) {
        // Open the file that was passed on the command line
        try (IndexInput i0 = dir.openInput(fileName, IOContext.DEFAULT)) {
          byte[][] b = new byte[NUM_TERMS][];
          for (int i = 0; i < NUM_TERMS; ++i) {
            b[i] = new byte[NUM_BYTES];
          }
          IndexInput[] inputs = new IndexInput[NUM_TERMS];
          if (dataFitsInCache) {
            // 16MB slice that should easily fit in the page cache
            inputs[0] = i0.slice("slice", 0, 16 * 1024 * 1024);
          } else {
            inputs[0] = i0;
          }
          for (int i = 1; i < NUM_TERMS; ++i) {
            inputs[i] = inputs[0].clone();
          }
          final long length = inputs[0].length();
          // latencies[0]: iterations with prefetching, latencies[1]: without
          List<Long>[] latencies = new List[2];
          latencies[0] = new ArrayList<>();
          latencies[1] = new ArrayList<>();
          for (int iter = 0; iter < 10_000; ++iter) {
            // Alternate between prefetching and not prefetching
            final boolean prefetch = (iter & 1) == 0;

            final long start = System.nanoTime();

            for (IndexInput ii : inputs) {
              final long offset = r.nextLong(length - NUM_BYTES);
              ii.seek(offset);
              if (prefetch) {
                ii.prefetch();
              }
            }

            for (int i = 0; i < NUM_TERMS; ++i) {
              inputs[i].readBytes(b[i], 0, b[i].length);
            }

            final long end = System.nanoTime();

            // Prevent the JVM from optimizing away the reads
            DUMMY = Arrays.stream(b).mapToInt(Arrays::hashCode).sum();

            // Record the latency in microseconds
            latencies[iter & 1].add((end - start) / 1_000);
          }

          latencies[0].sort(null);
          latencies[1].sort(null);

          System.out.println("Data " + (dataFitsInCache ? "fits" : "does not fit") + " in the page cache");
          long prefetchP50 = latencies[0].get(latencies[0].size() / 2);
          long prefetchP90 = latencies[0].get(latencies[0].size() * 9 / 10);
          long noPrefetchP50 = latencies[1].get(latencies[1].size() / 2);
          long noPrefetchP90 = latencies[1].get(latencies[1].size() * 9 / 10);

          System.out.println("  With prefetching:    P50=" + prefetchP50 + "us P90=" + prefetchP90 + "us");
          System.out.println("  Without prefetching: P50=" + noPrefetchP50 + "us P90=" + noPrefetchP90 + "us");
        }
      }
    }
  }
}
```

It assumes 3 terms that need to read 16 bytes each from the terms dictionary. We compare the time it takes to read these 16 bytes using today's sequential approach vs. taking advantage of the new
Using |
@rmuir asked if we could add support for this on
|
Hi, give me some time to review. I got the concept! I also have some questions about the NIOFS one because I don't like using twice as many file handles just for prefetching. MMap: the page size problem is well known. You are not the first one to hit this. I worked around it with a hack, too: lucene/lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInputProvider.java Lines 117 to 119 in 40cae08
The problem with the page size is: it is known everywhere in the JDK and available via Unsafe and various other places, but all of them are private. I temporarily tried to add a workaround that uses another glibc call to retrieve the page size, but this fails due to differences in enum constants on different platforms (the SC_PAGESIZE constant has a different value on various Linux/macOS versions) because it's a C enum and has no fixed value. So we cannot easily get the page size without hardcoding an integer constant which is not even defined in any header file. The alternative is to use the deprecated/outdated getpagesize() function of libc... But I don't want to use it as it's not POSIX-standardized... What do you think about this workaround: use 4K as the page size, but guard with try/catch and do nothing on IOException (if unaligned)? I was about to open an issue on the JDK to somehow allow getting the page size from the JVM in the MemorySegment API because you need it quite often when using them. |
We can also use the HotSpot bean to get the page size, but this fails on OpenJ9 or any third-party JVM. So we could try to get the page size from the HotSpot bean in Constants.java and save it in an `OptionalInt`. If it is undefined, preloading gets disabled: see those examples in Constants: lucene/lucene/core/src/java/org/apache/lucene/util/Constants.java Lines 88 to 98 in 40cae08
|
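To make the alignment concern above concrete, here is a minimal sketch of the arithmetic involved when a prefetch hint has to cover whole pages. It is not code from this PR: the class and method names are hypothetical, and the 4096-byte value is only the fallback discussed above, not a page size that is guaranteed on every platform. It also assumes that file offsets and mapped addresses share the same page alignment, which only holds when the mapping itself starts on a page boundary.

```java
/** Hypothetical helper, not part of Lucene: expands a byte range to whole pages. */
final class PageAlign {

  // Assumption: 4 KiB pages, the fallback value discussed in the thread.
  static final long ASSUMED_PAGE_SIZE = 4096;

  /** Rounds offset down to the start of its page. pageSize must be a power of two. */
  static long alignDown(long offset, long pageSize) {
    return offset & -pageSize; // same as offset & ~(pageSize - 1)
  }

  /**
   * Returns {alignedOffset, alignedLength} such that the aligned range starts on a page boundary
   * and covers every page touched by [offset, offset + length).
   */
  static long[] alignedRange(long offset, long length, long pageSize) {
    long start = alignDown(offset, pageSize);
    long end = alignDown(offset + length + pageSize - 1, pageSize);
    return new long[] {start, end - start};
  }
}
```

With this helper, a 16-byte read that straddles a page boundary expands to two pages, which is why small reads can still trigger more I/O than their length suggests.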
Thanks for taking a look Uwe, and suggesting approaches for the page size issue! By the way, feel free to push directly to the branch.
I don't like it either. And actually tests don't like it either, as I've seen more issues with |
Can't we use the same filechannel and do a positional read in another thread (not async)? I like the trick to use a virtual thread, because by that we have no additional thread and instead it hooks into the next I/O call and triggers our read. As this is the case, what's the problem of doing a blocking call (positional) in the virtual thread? If the virtual thread gets blocked it will hand over the call to another virtual thread. |
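The suggestion above (reuse the same channel and issue a blocking positional read on another thread) can be sketched with plain JDK classes. This is only an illustration under stated assumptions, not the code in this PR: `PositionalPrefetcher` is a hypothetical name, it tracks a single pending range, and the executor is supplied by the caller. On Java 21 the caller could pass `Executors.newVirtualThreadPerTaskExecutor()` to get the virtual-thread behavior described above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;

/** Hypothetical helper, not Lucene code: prefetches one range with a positional read. */
final class PositionalPrefetcher {
  private final FileChannel channel;
  private final ExecutorService executor;
  private CompletableFuture<ByteBuffer> pending;
  private long pendingOffset;
  private int pendingLength;

  PositionalPrefetcher(FileChannel channel, ExecutorService executor) {
    this.channel = channel;
    this.executor = executor;
  }

  /** Starts reading [offset, offset + length) on the executor and returns immediately. */
  void prefetch(long offset, int length) {
    pendingOffset = offset;
    pendingLength = length;
    pending = CompletableFuture.supplyAsync(() -> readFully(offset, length), executor);
  }

  /** Returns the requested bytes, reusing the prefetched buffer when the range matches. */
  byte[] read(long offset, int length) {
    ByteBuffer buf;
    if (pending != null && pendingOffset == offset && pendingLength == length) {
      buf = pending.join(); // waits only if the background read has not finished yet
    } else {
      buf = readFully(offset, length);
    }
    pending = null;
    byte[] out = new byte[buf.remaining()];
    buf.get(out);
    return out;
  }

  // Positional reads do not move the channel's own position, so the channel can be shared.
  private ByteBuffer readFully(long offset, int length) {
    ByteBuffer buf = ByteBuffer.allocate(length);
    try {
      while (buf.hasRemaining()) {
        int n = channel.read(buf, offset + buf.position());
        if (n < 0) {
          break; // end of file
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    buf.flip();
    return buf;
  }
}
```

The sketch keeps a single file handle, which addresses the concern about doubling the number of open files, at the cost of an explicit buffer per pending prefetch.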
Does not work, we only can get huge page size, not native page size.
It looks like we need to have 2 options:
|
Hi,
at least libc's So the only viable solution would be to expose and implement this method in NativeAccess via PosixNativeAccess. Should I give it a try? |
Please go ahead. |
I gave it a try in the last commit, is this what you had in mind? The benchmark suggests that this still works well, slightly better than the previous approach, now on-par with
|
Yes, that was my idea. I also quickly implemented a fix for the page size problem. I haven't tested it (I'm on Windows at the moment). If you like, you could quickly check the return value on Linux and Mac (it's too late now). I will also remove the NativeAccess passed around in MMapDir and just make it a static constant in MemorySegmentIndexInput. |
It works on my Linux box and returns 4096. 🎉 |
We could now also fix my hack regarding smaller chunk sizes and just ensure the chunk size is greater than the page size to enable madvise (so alignment fits pages for the mmap call). |
If madvise does the trick for mmapdir, why not try POSIX_FADV_WILLNEED for the niofs case? |
We can't get the file handle in Java (still open issue). |
…s (TestMultiMMap)
hmm, ok. I felt like we were able to get it somewhere thru the guts of nio/2 filesystem apis, maybe I am wrong? |
No it's a known issue. Maurizio and Alan Bateman agreed to fix it. |
ability to madvise/fadvise without resorting to native code would be awesome too. I don't know how it may translate to windows. but it seems like it does exactly what this PR wants to do:
|
Thanks Uwe, maybe the correct solution is to simply add the api and implement with I feel the bytebuffer/thread dancing in bufferedindexinput is too much. I can't reason about |
…ndling of allowed ReadAdvice enum constants
I reverted changes to |
Some questions about the API, curious to get your thoughts:
|
I had the same problem. I don't like the changes in IOContext (I reverted some of them and used a method to disallow specific ReadAdvice values). Indeed, a new method that is for the WillNeed case only is a good idea. Actually, WillNeed is different from the readahead settings, so possibly we can separate them in NativeAccess.
I have no preference.
This was my first idea when looking at the first method. With the current code it is completely undefined how much the prefetch should read. Making it absolute is not bad, but it should be limited to the RandomAccessIndexInput interface. So basically we can have both variants. The sequential one should be part of IndexInput and must be implemented. The other should not be in IndexInput, only in the random slice. Basically the current code does not preload across boundaries, but implementing that is easy and can be done in MemorySegmentIndexInput in the same way as reading arrays, also across boundaries. It would just be 2 calls. |
To me it makes sense to be very specific with the region that is needed. otherwise all this madvising doesn't make a lot of sense... turning off the OS default readahead and supplying our own values that might be gigabytes? better to not call madvise at all :) I see this stuff as an operating system hint to get better performance, not actually reading data. So requiring a user to slice()/clone() just to give a hint seems like the wrong way. That has too much overhead for some directories (e.g. buffers). also, i'm a little concerned about low-level parallelization of e.g. individual stored documents. seems like a lot of overhead! if you need 10,000 documents ranges, at least make a single |
@@ -50,6 +51,7 @@ abstract class MemorySegmentIndexInput extends IndexInput implements RandomAcces
final int chunkSizePower;
final Arena arena;
final MemorySegment[] segments;
final Optional<NativeAccess> nativeAccess;
As this is a singleton, we can make it static and initialize it here. There's no need to pass the optional through constructors and have it in every clone.
We may also make it static final on the provider, but that's unrelated.
To me looks fine. I only have a minor comment.
In general we may still look at Robert's suggestion. If we plan to send a preload for many slices, we should think of adding another random API to public void preload(Tuple<long offset, long length> pair...) This one should first call |
This sounds like a good idea. If a user wants to return 10k stored documents, I wonder if we should also split this into smaller batches to avoid running into a case when some pages from the cache gets claimed by something else before we have a chance to retrieve all these stored documents.
Thanks Uwe, this sounds like a good suggestion. I'll start looking into using this API for terms dictionary lookups of boolean queries of term queries, which I don't think would need it since we'd do a single seek per clone anyway. But when we later move to vectors, stored fields and term vectors this could be useful. |
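The batching idea raised above for large requests (for example 10k stored documents) can be sketched as follows. This is a hedged illustration, not an API proposed in this PR: `BatchedPrefetch`, `RangeOp` and the batch size are hypothetical, and the `prefetch` and `read` callbacks stand in for whatever the `IndexInput` would expose. The point is only the ordering: hint a small batch, read it, then move to the next batch, so that pages are less likely to be evicted between the hint and the read.

```java
/** Hypothetical sketch, not Lucene code: prefetch and read ranges in small batches. */
final class BatchedPrefetch {

  /** Stand-in for an operation on a byte range, such as a prefetch hint or a read. */
  interface RangeOp {
    void apply(long offset, long length);
  }

  /**
   * For each batch of at most batchSize ranges, first issues all prefetch hints, then performs
   * the reads, before moving on to the next batch.
   */
  static void readInBatches(
      long[] offsets, long[] lengths, int batchSize, RangeOp prefetch, RangeOp read) {
    if (batchSize <= 0) {
      throw new IllegalArgumentException("batchSize must be positive");
    }
    if (offsets.length != lengths.length) {
      throw new IllegalArgumentException("offsets and lengths must have the same size");
    }
    for (int from = 0; from < offsets.length; from += batchSize) {
      int to = Math.min(from + batchSize, offsets.length);
      for (int i = from; i < to; i++) {
        prefetch.apply(offsets[i], lengths[i]); // hints for the whole batch go out first
      }
      for (int i = from; i < to; i++) {
        read.apply(offsets[i], lengths[i]); // reads then overlap with in-flight I/O
      }
    }
  }
}
```

A suitable batch size would have to be found by benchmarking; nothing in the thread fixes a value.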
As Robert pointed out and benchmarks confirmed, there is some (small) overhead to calling `madvise` via the foreign function API; benchmarks suggest it is on the order of 1-2us. This is not much for a single call, but may become non-negligible across many calls. Until now, we only looked into using prefetch() for terms, skip data and postings start pointers, which are a single prefetch() operation per segment per term. But we may want to start using it in cases that could result in more calls to `madvise`, e.g. if we start using it for stored fields and a user requests 10k documents. In apache#13337, Robert wondered if we could take advantage of `mincore()` to reduce the overhead of `IndexInput#prefetch()`, which is what this PR is doing. For now, this is trying to not add new APIs. Instead, `IndexInput#prefetch` tracks consecutive hits on the page cache and calls `madvise` less and less frequently under the hood as the number of cache hits increases.
…AM. (#13381) As Robert pointed out and benchmarks confirmed, there is some (small) overhead to calling `madvise` via the foreign function API, benchmarks suggest it is in the order of 1-2us. This is not much for a single call, but may become non-negligible across many calls. Until now, we only looked into using prefetch() for terms, skip data and postings start pointers which are a single prefetch() operation per segment per term. But we may want to start using it in cases that could result into more calls to `madvise`, e.g. if we start using it for stored fields and a user requests 10k documents. In #13337, Robert wondered if we could take advantage of `mincore()` to reduce the overhead of `IndexInput#prefetch()`, which is what this PR is doing via `MemorySegment#isLoaded()`. `IndexInput#prefetch` tracks consecutive hits on the page cache and calls `madvise` less and less frequently under the hood as the number of consecutive cache hits increases.
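One way to read "calls `madvise` less and less frequently as the number of consecutive cache hits increases" is an exponential backoff on a hit counter. The sketch below illustrates that idea under stated assumptions and is not presented as the code merged in #13381: `PrefetchBackoff` is a hypothetical name, `isLoaded` stands in for a residency check such as `MemorySegment#isLoaded()`, and `advise` stands in for the `madvise` call.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical sketch of a backoff on consecutive page cache hits; not the Lucene code. */
final class PrefetchBackoff {
  private int consecutiveHits;

  /**
   * @param isLoaded stand-in for a page cache residency check
   * @param advise stand-in for the madvise call
   * @return true if advise was invoked
   */
  boolean prefetch(BooleanSupplier isLoaded, Runnable advise) {
    int count = consecutiveHits++;
    // Only do any work when the counter is zero or a power of two: 0, 1, 2, 4, 8, ...
    if ((count & (count - 1)) != 0) {
      return false;
    }
    if (isLoaded.getAsBoolean()) {
      return false; // cache hit: keep counting so that checks become rarer
    }
    consecutiveHits = 0; // cache miss: go back to checking on every call
    advise.run();
    return true;
  }
}
```

When the data is hot, 100 calls result in only 8 residency checks and no `madvise` call; when the data is cold, every call resets the counter and issues the hint, so the cold path behaves as if there were no backoff.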
This adds `IndexInput#prefetch`, which is an optional operation that instructs the `IndexInput` to start fetching bytes from storage in the background. These bytes will be picked up by follow-up calls to the `IndexInput#readXXX` methods. In the future, this will help Lucene move from a maximum of one I/O operation per search thread to one I/O operation per search thread per `IndexInput`. Typically, when running a query on two terms, the I/O into the terms dictionary is sequential today. In the future, we would ideally do these I/Os in parallel using this new API. Note that this will require API changes to some classes including `TermsEnum`. I settled on this API because it's simple and wouldn't require making all Lucene APIs asynchronous to take advantage of extra I/O concurrency, which I worry would make the query evaluation logic too complicated. This change will require follow-ups to start using this new API when working with terms dictionaries, postings, etc. Relates apache#13179 Co-authored-by: Uwe Schindler <uschindler@apache.org>