Llava 1.6 support #5267
Conversation
One note: this impacts the current code, as it only looks for the mmproj in the last of the checkpoint files. |
will now search for projector
I've updated it with a quick solution to search for those two checkpoint paths.
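A minimal Python sketch of what searching across multiple checkpoint paths for the projector could look like (the file pattern and the "mm_projector" key prefix are assumptions for illustration, not the exact code in the surgery script):

import glob
import torch  # assumption: the checkpoints are regular PyTorch .bin shards

def find_projector_tensors(model_dir: str, prefix: str = "mm_projector") -> dict:
    # Scan every checkpoint shard instead of only the last one,
    # and collect all tensors whose name contains the projector prefix.
    projector = {}
    for path in sorted(glob.glob(f"{model_dir}/pytorch_model*.bin")):
        ckpt = torch.load(path, map_location="cpu")
        projector.update({k: v for k, v in ckpt.items() if prefix in k})
    return projector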
|
It looks like Ollama, which uses llama.cpp as its backend, already supports llava:34b-v1.6. |
I have the same question... |
With these tools you can convert llava-1.6 into a llama.cpp GGUF file, and it will work for inference. Right now llama.cpp will create the usual 14 patches of a rectangular, padded 336-pixel image. |
Thanks for the reply. So right now, |
Yeah, I get that, but I was wondering whether Ollama forked the llama.cpp they're using and already completed their own implementation to match the LLaVA 1.6 architecture and image preprocessing. |
If it is just the pre-processing that is missing, we can merge this and open a separate issue with the specific goal of implementing that pre-processing. It might be a good idea to first confirm that using a correctly pre-processed image (from the reference implementation) yields good results with this code. |
Basically, does the resize logic from the clip_image_preprocess function in clip.cpp need to be modified to match the resize logic from the process_image function in conversation.py?

# LLaVA/llava/conversation.py
def process_image(self, image, image_process_mode, return_pil=False, image_format='PNG', max_len=1344, min_len=672):
    if image_process_mode == "Pad":
        def expand2square(pil_img, background_color=(122, 116, 104)):
            width, height = pil_img.size
            if width == height:
                return pil_img
            elif width > height:
                result = Image.new(pil_img.mode, (width, width), background_color)
                result.paste(pil_img, (0, (width - height) // 2))
                return result
            else:
                result = Image.new(pil_img.mode, (height, height), background_color)
                result.paste(pil_img, ((height - width) // 2, 0))
                return result
        image = expand2square(image)
    elif image_process_mode in ["Default", "Crop"]:
        pass
    elif image_process_mode == "Resize":
        image = image.resize((336, 336))
    else:
        raise ValueError(f"Invalid image_process_mode: {image_process_mode}")
    if max(image.size) > max_len:
        max_hw, min_hw = max(image.size), min(image.size)
        aspect_ratio = max_hw / min_hw
        shortest_edge = int(min(max_len / aspect_ratio, min_len, min_hw))
        longest_edge = int(shortest_edge * aspect_ratio)
        W, H = image.size
        if H > W:
            H, W = longest_edge, shortest_edge
        else:
            H, W = shortest_edge, longest_edge
        image = image.resize((W, H))
    if return_pil:
        return image
    else:
        buffered = BytesIO()
        image.save(buffered, format=image_format)
        img_b64_str = base64.b64encode(buffered.getvalue()).decode()
        return img_b64_str

// llama.cpp/examples/llava/clip.cpp
bool clip_image_preprocess(struct clip_ctx * ctx, const clip_image_u8 * img, clip_image_f32 * res, const bool pad2square) {
    if (!ctx->has_vision_encoder) {
        printf("This gguf file seems to have no vision encoder\n");
        return false;
    }

    // the logic below is to pad the shorter side to the longer side with a background color: rgb(122, 116, 104)
    // see https://github.com/haotian-liu/LLaVA/blob/e854a2bf85118c504f6f16bf5c3c7c92f8fa8c6b/llava/conversation.py#L113-L156

    clip_image_u8 * temp = clip_image_u8_init(); // we will keep the input image data here temporarily

    if (pad2square && img->nx != img->ny) {
        int longer_side = std::max(img->nx, img->ny);
        temp->nx = longer_side;
        temp->ny = longer_side;
        temp->buf.resize(3 * longer_side * longer_side);
        const uint8_t bc[3] = {122, 116, 104}; // background color in RGB from LLaVA

        // fill with background color
        for (size_t i = 0; i < temp->buf.size(); i++) {
            temp->buf[i] = bc[i % 3];
        }

        // copy from the input image
        for (int y = 0; y < img->ny; y++) {
            for (int x = 0; x < img->nx; x++) {
                const int i = 3 * (y * img->nx + x);
                const int j = 3 * (y * temp->nx + x);
                temp->buf[j]   = img->buf[i];
                temp->buf[j+1] = img->buf[i+1];
                temp->buf[j+2] = img->buf[i+2];
            }
        }
    } else {
        temp->nx = img->nx;
        temp->ny = img->ny;
        temp->buf.resize(img->buf.size());
        memcpy(temp->buf.data(), img->buf.data(), temp->buf.size());
    }

    const int nx = temp->nx;
    const int ny = temp->ny;

    const int nx2 = ctx->vision_model.hparams.image_size;
    const int ny2 = ctx->vision_model.hparams.image_size;

    res->nx = nx2;
    res->ny = ny2;
    res->buf.resize(3 * nx2 * ny2);

    const float scale = std::max(nx, ny) / (float)ctx->vision_model.hparams.image_size;

    const int nx3 = int(nx / scale + 0.5f);
    const int ny3 = int(ny / scale + 0.5f);

    const auto & m3 = ctx->image_mean; // {0.48145466f, 0.4578275f, 0.40821073f};
    const auto & s3 = ctx->image_std;  // {0.26862954f, 0.26130258f, 0.27577711f};

    for (int y = 0; y < ny3; y++) {
        for (int x = 0; x < nx3; x++) {
            for (int c = 0; c < 3; c++) {
                // linear interpolation
                const float sx = (x + 0.5f) * scale - 0.5f;
                const float sy = (y + 0.5f) * scale - 0.5f;

                const int x0 = std::max(0, (int)std::floor(sx));
                const int y0 = std::max(0, (int)std::floor(sy));

                const int x1 = std::min(x0 + 1, nx - 1);
                const int y1 = std::min(y0 + 1, ny - 1);

                const float dx = sx - x0;
                const float dy = sy - y0;

                const int j00 = 3 * (y0 * nx + x0) + c;
                const int j01 = 3 * (y0 * nx + x1) + c;
                const int j10 = 3 * (y1 * nx + x0) + c;
                const int j11 = 3 * (y1 * nx + x1) + c;

                const float v00 = temp->buf[j00];
                const float v01 = temp->buf[j01];
                const float v10 = temp->buf[j10];
                const float v11 = temp->buf[j11];

                const float v0 = v00 * (1.0f - dx) + v01 * dx;
                const float v1 = v10 * (1.0f - dx) + v11 * dx;

                const float v = v0 * (1.0f - dy) + v1 * dy;

                const uint8_t v2 = std::min(std::max(std::round(v), 0.0f), 255.0f);

                const int i = 3 * (y * nx3 + x) + c;

                res->buf[i] = ((float(v2) / 255.0f) - m3[c]) / s3[c];
            }
        }
    }
    clip_image_u8_free(temp);

    return true;
}
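For context on the difference being discussed: llava-1.5 only pads to a square and resizes, while llava-1.6 additionally selects a best-fitting grid resolution from a set of "pinpoints" and tiles the image into 336x336 patches. Below is a rough Python sketch of that resolution selection, loosely following the reference repository; the function name and the pinpoint list are illustrative, not the llama.cpp API.

# Rough sketch of the llava-1.6 "anyres" resolution selection (illustration only).
# From a fixed list of grid resolutions, pick the one that preserves the most of
# the aspect-ratio-scaled image while wasting the least padding area.
def select_best_resolution(orig_w, orig_h,
                           pinpoints=((336, 672), (672, 336), (672, 672),
                                      (1008, 336), (336, 1008))):
    best, best_fit, min_waste = None, 0, float("inf")
    for w, h in pinpoints:
        scale = min(w / orig_w, h / orig_h)
        fit = min(int(orig_w * scale) * int(orig_h * scale), orig_w * orig_h)
        waste = w * h - fit
        if fit > best_fit or (fit == best_fit and waste < min_waste):
            best, best_fit, min_waste = (w, h), fit, waste
    return best

# e.g. a 1000x750 photo maps to the 672x672 grid under these example pinpoints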
PR: Testing: |
I'm almost done, but got stuck for today on the 5-dimensional permutations that arrange the final embeddings. I tried to create my own slim tensor manipulation class; it's probably buggy.
|
Since dims 3 and 4 are not permuted, you can reshape to a 4D tensor, apply permute + cont, and then reshape back to a 5D tensor |
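To illustrate the suggestion, here is a minimal numpy sketch; the shapes and the example permutation are made up, and numpy stands in for the corresponding ggml reshape/permute/cont calls:

import numpy as np

# 5-D tensor whose last two axes keep their order
a = np.random.rand(2, 3, 4, 5, 6)

# desired 5-D permutation: only the leading dims move
want = a.transpose(1, 0, 2, 3, 4)

# same result using only a 4-D permute: fold the untouched trailing axes into one,
# permute, make the data contiguous ("cont"), then unfold again
b = a.reshape(2, 3, 4, 5 * 6)                        # 5-D -> 4-D
b = np.ascontiguousarray(b.transpose(1, 0, 2, 3))    # permute + cont
b = b.reshape(3, 2, 4, 5, 6)                         # 4-D -> 5-D

assert np.array_equal(b, want)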
I'm cleaning up the code and hope to have a first PR update by tomorrow. Even without the grid-embeddings permutation I am getting results better than any other llava variant I have tested. Below are results for 34B with 3-bit (LLM) and 6-bit (ViT) quantization; I'm not satisfied until the permutation works, but it's already quite good. It's about GPT4V/Cog-VLM level, and with the permutation bug solved it will exceed that.
Below is a comparison of the same llava-1.6 using the previous inference, same settings but fp16 ViT:
|
Clip: Bugfix for normalization (it did not load the 3 std and mean values)
Clip: bicubic resize function
Clip: added save-to-bmp/pil for debugging and conversion from/to 32/8 images
Clip: added normalization with FP16 precision simulation (image tensors match the HF implementation, can be switched off, only used for llava-1.6)
Clip: added newline tensor, mergetype kv, image-grid kv, new resize-pad function with resolution from gridpoints
Clip: clip_image_preprocess now returns a float * vector instead of float, this way llava-1.5 and 1.6 are supported
llava: added ggml cpu graph for embedding patching, added preliminary spatial_unpad support, added a lot of comments that need to be cleaned up when all is final
convert-image-encoder: fixed image-grid flattening
I just pushed; I was not able to finalize it completely and will be mostly busy over the weekend.
I hope everything is compiling; it's 9 AM, so I could not test the PR. I have simplified the original Python implementation due to the lack of 5-dimensional tensors in GGML; I first tested it on the Python side and it did not result in a noticeable drop in output quality. I am uploading the quantized projectors again on HF, as they need an update due to the grid-array bugfix. The GGUFs of the LLMs are still compatible.
Running: General problems:
Implementation problems:
Here is the current Hotfix in llava.cpp to reverse the permutation again:
Details on the modification of the original implementation:
|
WOW thank you so much for implementing LLaVA-1.6 in llama.cpp!!! One quick note about the prompt for our 34b model: Ideally the correct format should be
And please let me know if there is anything I could do to help (but not about tensor in cpp😭). |
Yup, the state of the clip/llava implementation is not great - hopefully nobody uses this in production. I've added |
Sounds great! |
Should be ok to merge - let's give it some time until tomorrow and if no issues are reported we can merge it
Thanks, looking forward to seeing closure on that one. It was a big pain to get working :-) |
It should now work with llava-1.5 as well
It would be very useful to add detailed instructions for LLaVA v1.6 in the README. I tried writing them, but I realized I don't know which is the correct CLIP model to use - I think I'm using the small one with the 576-token embedding. Not sure. Anyway, if anyone figures out all the steps, please open a PR |
I've updated the readme with detailed instructions and hints on llava-1.6. In general: I have used llava-surgery-v2 only a little bit outside llava-1.6; it is meant as a full replacement for llava-surgery.py. Maybe after a while the original llava-surgery can be removed, once we know the new one works for everything |
Hi all, I tried running this on the latest master: ./bin/server -m ../models/mistral-7b-q_5_k.gguf --mmproj ../models/mmproj-mistral7b-f16-q6_k.gguf -ngl 50 -c 6000 --host 0.0.0.0 --port 8007 --no-mmap and got: llama_new_context_with_model: graph splits (measure): 3. I traced it down to a memory management error. Fixed it here: Great work on this btw, llava 1.6 is fantastic! |
Similarly, a possible fix for the bad-access crash on wide images: #5493 And I got the same impression in testing: definitely a huge step up :) |
@ggerganov @cmp-nct This is amazing! Thanks! |
Make sure to update to the latest commits; several bugs have been corrected. For server I just found this bug report: for some reason server processes the image itself instead of using the processing functions already available in llava.cpp; maybe that's historic weight. |
Just curious... |
I've not actually tested it, but last time I looked, main did not support that. |
Thanks so much for the explanation. You're totally right. I just tried main, and it doesn't work. I got llava-cli with llava 1.6 13B to work on a free Colab T4, and hopefully someone can fix server to work properly with llava 1.6. That would be amazing! https://colab.research.google.com/gist/chigkim/c44dcc37af26f1cb3af03a2209d7c50a/llava16.ipynb More importantly, I believe Ollama also uses the llama.cpp server. |
I had two issues with the
diff --git a/examples/llava/convert-image-encoder-to-gguf.py b/examples/llava/convert-image-encoder-to-gguf.py
index c69f89ac..94754a47 100644
--- a/examples/llava/convert-image-encoder-to-gguf.py
+++ b/examples/llava/convert-image-encoder-to-gguf.py
@@ -117,7 +117,7 @@ else:
     with open(dir_model + "/config.json", "r", encoding="utf-8") as f:
         config = json.load(f)
     if args.clip_model_is_vision:
-        v_hparams = config
+        v_hparams = config["vision_config"]
         t_hparams = None
     else:
         v_hparams = config["vision_config"]
The readme was updated, but I think it was only merged today. The entire llava-1.6 change, including the last-minute refactors, was a bit much for me to push at once; I had dozens of variants locally, and that's how that issue sneaked in. At this point llava-cli appears to work flawlessly; server does not. Server needs an update to use the llava.cpp preprocessing functions, and it also needs an update to allow a flexible system prompt and finetune syntax. It's minor high-level work anyone should be able to do. |
The preprocessing step should be implemented in #5553; thanks for the insight into the problem on the other threads.
Could you explain more about this? I'm happy to try to include it as well |
@svenstaro, updated |
* Create llava-survery-v2.py
* Update convert-image-encoder-to-gguf.py
* Update convert-image-encoder-to-gguf.py
* Rename llava-survery-v2.py to llava-surgery-v2.py
* Update convert-image-encoder-to-gguf.py: will now search for projector
* Update convert-image-encoder-to-gguf.py: whoops
* Update llava-surgery-v2.py
* Clip: Bugfix for normalization (it did not load the 3 std and mean values); bicubic resize function; added save-to-bmp/pil for debugging and conversion from/to 32/8 images; added normalization with FP16 precision simulation (image tensors match the HF implementation, can be switched off, only used for llava-1.6); added newline tensor, mergetype kv, image-grid kv, new resize-pad function with resolution from gridpoints; clip_image_preprocess now returns a float * vector instead of float, this way llava 1.5 and 1.6 are supported. llava: added ggml cpu graph for embedding patching, added preliminary spatial_unpad support, added a lot of comments that need to be cleaned when all is final. convert-image-encoder: fixed image-grid flattening
* whitespace corrections
* ws
* Tensors are now properly permuted. Before, the embeddings were inserted 1:1; now they are split into the 24x24 patches as in the reference.
* ws
* added verbose_prompt support into cli; added stopwords for llava-1.6 into cli
* moved llava functions to llava.cpp, made clip.h a C-compatible API, replaced vector-style functions with pointers, added a debug define to remove functions from compilation while not needed
* ws
* convert : skip unknown tensors (need for LLaVA)
* llava : update readme
* llava : fix compile warnings
* llava : style
* convert : add --skip-unknown CLI arg
* server : remove clip structs
* bugfix for non llava-1.6: it should now work with llava-1.5 as well
* clip : minor code rearrange
* llava : update readme a bit
---------
Co-authored-by: John <cmt-nct@users.noreply.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
First steps - I got impressive results with llava-1.6-13B on the license_demo example already, despite many open issues.
Todo:
The biggest and most important difference still missing is the "spatial_unpad" logic (a rough sketch follows below).
The conversion script I added can convert the nested array into a flat 2D array with valid image shapes, but it ignores them at this point.
The new tensor for the image separation is part of the projector - for compatibility with pytorch I removed it from the llava.clip extract.
llava-surgery-v2.py should be compatible with cogvlm, llava-1.6 and llava-1.5
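For reference, the "spatial_unpad" step mentioned above roughly removes the letterbox padding from the tiled features again. Here is a minimal numpy sketch that mirrors the reference repository's unpad_image; treat it as an illustration of the idea, not the llama.cpp implementation:

import numpy as np

def unpad_image(tensor, original_size):
    # tensor: (channels, height, width), e.g. a feature map laid out on the grid
    # original_size: (orig_w, orig_h) of the source image before padding/resizing
    orig_w, orig_h = original_size
    cur_h, cur_w = tensor.shape[1], tensor.shape[2]

    if orig_w / orig_h > cur_w / cur_h:
        # image was relatively wider: padding was added at the top and bottom
        scale = cur_w / orig_w
        pad = (cur_h - int(orig_h * scale)) // 2
        return tensor[:, pad:cur_h - pad, :]
    else:
        # image was relatively taller (or equal): padding was added left and right
        scale = cur_h / orig_h
        pad = (cur_w - int(orig_w * scale)) // 2
        return tensor[:, :, pad:cur_w - pad]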
For Mistral, using the llava-cli binary:
Add this:
-p "<image>\nUSER:\nProvide a full description.\nASSISTANT:\n"
The Mistral template for llava-1.6 seems to be no system prompt and a USER/ASSISTANT role.
For Vicunas the default settings work.
For the 34B this should work:
Add this:
-e -p "<|im_start|>system\nAnswer the questions.<|im_end|><|im_start|>user\n<image>\nWhat can be said about this image?<|im_end|><|im_start|>assistant\n"
Do not expect great results before the proper image preprocessing has been added.
Downloads:
I've extracted the embedded ViT and quantized it for all 4 variants (though not all quantization levels).
They are being uploaded here: https://huggingface.co/cmp-nct/llava-1.6-gguf/upload/main
Please note: until preprocessing is done, expect poor results.