
mmap()-backed istream implementation #150

Closed
wants to merge 3 commits

Conversation

apaz-cli
Contributor

This is for issue #91.

Treat this as a first draft. There are definitely some things that need to be changed, and they will be changed shortly.

I have not benchmarked. I'm submitting the PR early to get feedback.

Questions/Observations for discussion:

  1. Should llama_istream be a part of utils.cpp?
  2. MADV_SEQUENTIAL is probably wrong, as brought up in #91 (Should use mmap for model loading). I'll benchmark to make sure, then fix that before it's merged.
  3. When I wrote llama_istream, the intent was that you wouldn't have to check USE_MMAP at the site where it's constructed. #43 (Reduce model loading time) throws a wrench in that. I could create a thin wrapper class rather than a `using` alias, though that might be a little confusing when people merge.
  4. What I'm doing is object-oriented, which doesn't fit the coding style very well. However, std::istream's copy constructor is explicitly deleted, so the only way to get one out of a function is as an rvalue return value (or as a global variable, which I think would be worse), and the convention elsewhere is that the return value carries the error.
  5. Subclassing istream was WAY more painful than I expected. For some reason, if I don't reimplement seekoff() and seekpos() this way, tellg() seems to return garbage. I couldn't tell you why, nor can I find any hint of why on the internet. The signature is also very finicky. If you have any insight into why it needs to be exactly this way, let me know.
  6. I'm not completely sure which mmap() arguments to use. If you have a better suggestion, let me know.

@apage43
Contributor

apage43 commented Mar 15, 2023

My comment about MADV_SEQUENTIAL assumed you were trying to implement a zero-copy approach, where the inference-time code uses the model directly from the mapping rather than still copying everything into private memory first. (To be fair, that is easier said than done if the file format is not intended to be used this way; you would likely run into alignment issues.)

It's probably not as important here, and may actually help, since you read most of the data in the mmap region (at least the weights) only once. (It does seem like the loader scans through the file a few times, skipping over weights, to do the name-based lookups, but that's not a huge deal.)

I'm also not sure much is gained by creating an mmap-backed istream. If you are going to copy the data anyway, mmap doesn't have many advantages over normal read I/O and adds significant complexity: for large reads, mmap can mean many more context switches than regular I/O, just less visible ones, since they're triggered by the MMU via page faults instead of explicitly by user code.

Generally, the advantages of mmaping files come from avoiding copies:

  1. From the file into private process memory: you can read the exact memory the kernel originally read the file into.
  2. Among processes: multiple processes mapping the same file hit the same physical memory, except in the case of MAP_PRIVATE.
  3. Across launches: your process exiting doesn't immediately evict everything from the kernel page cache, so the next launch gets the same underlying memory if it hasn't been evicted. This one especially is lost if you have to copy anyway, since keeping a second in-memory copy of the model in private process memory forces the kernel to evict more from the page cache.

@jart jart self-requested a review March 15, 2023 23:22
@jart jart added enhancement New feature or request performance Speed related topics labels Mar 15, 2023
@jart
Contributor

jart commented Mar 16, 2023

Thank you for sending us this change! This change would result in an 18% speedup of loading in the average case, but what @apage43 says is true: it doesn't make sense to use mmap() as a read() replacement, because the true power of mmap() is that it gives us the ability to not read the file at all. I've just written a blog post about a command line utility of mine, https://justine.lol/rusage/, which uses what we've done here as an example use case and, at the bottom, adds further clarity to these points.

We need to go back to the drawing board on this. An mmap() change should reduce startup latency not by 18%, but by 100%. In order to make that change, we need to find a way to have the tensor data structures that exist on disk be the same as what llama.cpp needs at runtime in memory. I propose that we discuss our ideas for how we might do this in #91.

Thank you again for your contribution. I hope you'll continue to keep working with us on the issue, while we figure this out!

@jart jart closed this Mar 16, 2023
@sw sw mentioned this pull request Mar 18, 2023
@apaz-cli apaz-cli deleted the master branch April 8, 2023 05:53