-
Notifications
You must be signed in to change notification settings - Fork 921
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CLIP (ViT) #315
CLIP (ViT) #315
Conversation
a440c73
to
d7264fb
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gboduljak this looks amazing! Thanks for the contribution.
I started the review, haven't finished it yet but I am submitting the 1st pass and will get back to it again later today to review the conversion, preprocessing and hub download parts.
The current comments are mostly stylistic and image attribution. So far the PR looks great so feel free to address the review comments when the 2nd pass of review is submitted.
As you mentioned one could use |
Thank you for the code review and the constructive feedback. I will address your requested changes after your second pass. |
This looks interesting to me. Please take a look at |
@angeloskath please see ml-explore/mlx-data#36 :) |
@gboduljak Great work! I can't wait for this to be merged, it will enable us to port Llava-like models to mlx 🚀 |
@angeloskath Could you take a second look at this PR? I would like to complete this soon :) |
@gboduljak can you rebase on main and resolve the conflict? I can spend some time on this also as I'd like to get it in in the nearish future. |
@awni done :) |
@gboduljak my high level comment is there is a lot of code in this example. Examples are intended to be pretty minimal so they can be 1. instructive 2. easily hackable/customizable, and 3. easy for us to maintain. If we want to include this in
Another path is to make this a separate package that is external to mlx-examples which is also great! And maybe makes more sense for your goals with this, I am not sure. If we push to include it in mlx-examples though let's aim to simplify as much as possible. Let me know which route your prefer and if you want to aim to get it in mlx-examples (which I support) are you up for taking a stab at simplifying it a bit? PS I know some of our examples are not as simple as they should / could be, but those are more the exceptions 😄 and they may be moved out of the examples repo in the future. |
The largest amount of the code is preprocessing, which is not really related to CLIP implementation. My goal was to reduce dependency on I suggest we remove preprocessing completely and rely on To save your time, please take a look at the model implementation ( |
I'm in favor of removing it's all in NumPy anyway there is no need to reimplement it here. I don't love having a dependency on Torch though but maybe we take it out once we have the image preprocessing in mlx-data.
|
I think it is mostly Let me take a look at this later today and come up with some simplified implementation (e.g. removing preprocessing and trying to implement something basic in |
Awesome thank you!!! The normalization is pretty easy to do in |
@gboduljak let me know when this is ready for re-review. |
@awni here is the status update. I removed preprocessing for images and I just wrapped image preprocessing from The implementation of CLIP model, the weight conversion script and the tests are final and I do not intend to change them. These will not change during the image preprocessing re-implementation. Thus, please review |
65a445c
to
9651660
Compare
Hi @awni, @angeloskath :) @angeloskath I addressed your first pass comments. This PR is now ready for another review. I removed existing preprocessing code and I implemented a very simple port of CLIPImageProcessor. The 'simple port' is implemented using I tried using def normalize(x: mx.array, mean: mx.array, std: mx.array) -> mx.array:
x = x.astype("float32") / 255.0
return (x - mean) / std
mean = mx.array(conf["image_mean"]).reshape((1, 1, 3))
std = mx.array(conf["image_std"]).reshape((1, 1, 3))
norm = partial(normalize, mean=mean, std=std)
dset = (
# Make a buffer (finite length container of samples) from the python list
dx.buffer_from_vector([{'image': b'assets/cat.jpeg'},{'image': b'assets/dog.jpeg'}])
# Shuffle and transform to a stream
.to_stream()
# CLIP image pipeline
.load_image("image")
.image_resize_smallest_side("image", 224)
.image_center_crop("image", 224, 224)
# Accumulate into batches
.batch(2)
# Normalize
.key_transform("image", norm)
# Finally, fetch batches in background threads
.prefetch(prefetch_size=2, num_threads=2)
)
cat, dog = [sample["image"] for sample in dset] However, I was getting outputs very different from the reference CLIPImageProcessor. I suspect this is because CLIPImageProcessor uses Bicubic sampler for resizing. I am not sure whether we have Bicubic sampler in |
Sorry for the digression, but thank you so much for working on this! |
@nkasmanoff: * clip image processor * added example usage
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great.
Hopefully we 'll be able to remove a bunch of those dependencies (image preprocessing and BPE tokenizer) using mlx-data in the near future :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🚀
* probably approximatelly correct CLIPTextEncoder * implemented CLIPEncoderLayer as built-in nn.TransformerEncoderLayer * replaced embedding layer with simple matrix * implemented ViT * added ViT tests * fixed tests * added pooler_output for text * implemented complete CLIPModel * implemented init * implemented convert.py and from_pretrained * fixed some minor bugs and added the README.md * removed tokenizer unused comments * removed unused deps * updated ACKNOWLEDGEMENTS.md * Feat: Image Processor for CLIP (#1) @nkasmanoff: * clip image processor * added example usage * refactored image preprocessing * deleted unused image_config.py * removed preprocessing port * added dependency to mlx-data * fixed attribution and moved photos to assets * implemented a simple port of CLIPImageProcessor * review changes * PR review changes * renamed too verbose arg * updated README.md * nits in readme / conversion * simplify some stuff, remove unneeded inits * remove more init stuff * more simplify * make test a unit test * update main readme * readme nits --------- Co-authored-by: Noah Kasmanoff <nkasmanoff@gmail.com> Co-authored-by: Awni Hannun <awni@apple.com>
This is an implementation of CLIP (ViT). Closes #143.
Implemented:
CLIPTokenizer
CLIPImageProcessor
convert.py
This PR is ready for review. My only concern is the correctness of CLIP initialization. I am not sure how to test that.
I would like to generate
HuggingFace
weights and configuration after the review. It is already possible to generate and uploadHuggingFace
weights and configuration. However, I decided to wait because those may change if the code changes.