Vision/Multimodal #791

Open
bhack opened this issue Apr 18, 2024 · 16 comments

Comments

@bhack

bhack commented Apr 18, 2024

With all the growing activity and focus on multimodal models, is this library restricted to tuning text-only LLMs?
Are there plans to support fine-tuning vision or, more generally, multimodal models?

@ebsmothers
Contributor

Hi @bhack thanks for the question! We haven't added any multimodal models yet as we are working to get good coverage of text-only methods first, but it's definitely something we are considering for the future. Out of curiosity, are there any multimodal models or techniques you'd be interested in seeing specifically?

@bhack
Author

bhack commented Apr 18, 2024

There was a recent and interesting survey at:
https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey

@bhack
Author

bhack commented Apr 18, 2024

As for my personal preference, I would like to be able to effectively fine-tune, with a support library like this one, models like (or similar to):
https://github.com/FoundationVision/GLEE

@matbee-eth

matbee-eth commented Apr 23, 2024

Moondream2 would be a good one to begin with! It uses Phi-2 and a forked SigLIP as the vision encoder with a projector.
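
For a sense of what the glue code looks like, here's a purely illustrative PyTorch sketch of that kind of early-fusion projector (the class name and dimensions are made up for the example; they only roughly match SigLIP-so400m and Phi-2 hidden sizes):

```python
import torch
import torch.nn as nn

# Illustrative only: a frozen vision encoder (e.g. SigLIP) produces patch
# embeddings, an MLP projects them into the LLM's token-embedding space, and
# the projected "image tokens" are concatenated with the text embeddings.
class VisionProjector(nn.Module):
    def __init__(self, vision_dim: int = 1152, llm_dim: int = 2560):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_embeds: torch.Tensor) -> torch.Tensor:
        # [batch, num_patches, vision_dim] -> [batch, num_patches, llm_dim]
        return self.proj(patch_embeds)

patch_embeds = torch.randn(1, 729, 1152)  # dummy SigLIP-style patch features
text_embeds = torch.randn(1, 32, 2560)    # dummy Phi-2-style token embeddings
fused = torch.cat([VisionProjector()(patch_embeds), text_embeds], dim=1)
print(fused.shape)  # torch.Size([1, 761, 2560])
```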

@ebsmothers
Contributor

Thanks @bhack and @matbee-eth for the suggestions! At this exact moment we do not have the bandwidth to take these on but we will keep them both in mind for the near future (re Moondream2 we are currently working on adding Phi-3). In the meantime, please let us know if either of you would be willing to contribute on this front.

@matbee-eth

Thanks @bhack and @matbee-eth for the suggestions! At this exact moment we do not have the bandwidth to take these on but we will keep them both in mind for the near future (re Moondream2 we are currently working on adding Phi-3). In the meantime, please let us know if either of you would be willing to contribute on this front.

Do you know of any PRs that cover end-to-end implementation details for doing such a thing? Just to assess whether it takes novel work or if it's just conforming to some sort of protocol/design.

@bhack
Author

bhack commented Jul 30, 2024

Fine-tuning other Meta foundation models would also be nice, like the recent https://github.com/facebookresearch/segment-anything-2

@joecummings
Contributor

Fine-tuning other Meta foundation models would also be nice, like the recent facebookresearch/segment-anything-2

Thanks for the input! We're still a small team, so we're working hard to provide great memory savings and performance for LLMs first, but this is 100% on our radar.

Just out of curiosity - what kind of finetuning would you want to do with SAM2? Do you have any hard data or HW constraints?

@bhack
Author

bhack commented Jul 30, 2024

Yes, generally it could be hard-mining cases, high-res inputs, HW constraints, etc., so I think that in vision we really have the same kinds of fine-tuning needs.
I really hope we can share some common infra/components between LLM, vision and multimodal instead of building three different frameworks.
But this will really depend on how well torchtune can abstract some concepts.
Also, if I understand correctly that you are prioritizing LLMs, I think a multimodal/vision model could be useful as an early canary test to lower the risk of a heavier refactoring at a later stage.
I think you could ask some vision/multimodal teams internally to collaborate and create more critical mass around the project.

@bhack
Author

bhack commented Jul 30, 2024

E.g. see how many comments there were on the original SAM repo just related to fine-tuning:
facebookresearch/segment-anything#5

@bhack
Author

bhack commented Jul 30, 2024

Also, just to give another example: your WIP RLHF with PPO (#1005), or other approaches like it, could still be useful in vision/multimodal settings: https://encord.com/blog/guide-to-rlhf/

So I think this is why it is important to have some canary tests on other domains to better validate the design.

@bhack
Author

bhack commented Nov 6, 2024

@RdoubleA Where could we track the vision part?

@RdoubleA
Contributor

RdoubleA commented Nov 6, 2024

We currently support multimodal models (vision + text); you can take a look at Llama 3.2 Vision, which uses a CLIP image encoder: https://github.com/pytorch/torchtune/blob/main/torchtune/models/llama3_2_vision/_model_builders.py

The components used there can be reused to add future multimodal models, especially for DeepFusion (https://github.com/pytorch/torchtune/blob/main/torchtune/modules/model_fusion/_fusion.py). I am currently working on enabling EarlyFusion in #1904, which would support models like LLaVA.
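
To make the DeepFusion idea concrete, here is a rough, self-contained PyTorch sketch of the pattern (not torchtune code; the class, gate, and dimensions are illustrative): instead of inserting image tokens into the text sequence, gated cross-attention layers inside the decoder attend from the text hidden states to the vision encoder's output.

```python
import torch
import torch.nn as nn

# Illustrative only: a gated cross-attention block of the kind interleaved
# into the decoder for "deep fusion". The tanh gate starts at zero, so the
# pretrained text model is untouched until the fusion layers are trained.
class FusionCrossAttention(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_hidden: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        attended, _ = self.attn(text_hidden, image_feats, image_feats)
        return text_hidden + torch.tanh(self.gate) * attended

text_hidden = torch.randn(1, 32, 512)    # decoder hidden states for 32 text tokens
image_feats = torch.randn(1, 1601, 512)  # vision encoder output for one image
print(FusionCrossAttention()(text_hidden, image_feats).shape)  # torch.Size([1, 32, 512])
```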

Is there any model or feature you are particularly interested in?

RdoubleA reopened this Nov 6, 2024
@bhack
Author

bhack commented Nov 10, 2024

SAM2 now has official training code from Meta. Do you think we could have a pure vision model like this one?

@felipemello1
Contributor

felipemello1 commented Nov 11, 2024

Hey @bhack, thanks for asking! Currently we don't have it on our roadmap :/. Our main focus is LLMs, including multimodality. It is possible that, as we build for multimodality, some vision models will become natively supported. But SAM2 is not in our plans just yet.

@bhack
Author

bhack commented Nov 13, 2024

OK, any vision-only model would be fine if you could include one in the roadmap.
