Adds support for PaliGemma 2.
andsteing authored and andresusanopinto committed Dec 5, 2024
1 parent 46b2456 commit 8e9b05b
Showing 33 changed files with 5,281 additions and 563 deletions.
3 changes: 2 additions & 1 deletion README.md
@@ -93,7 +93,8 @@ codebase:
- (partial) [PaLI-3 Vision Language Models: Smaller, Faster, Stronger](https://arxiv.org/abs/2310.09199), by Xi Chen, Xiao Wang, Lucas Beyer, Alexander Kolesnikov, Jialin Wu, Paul Voigtlaender, Basil Mustafa, Sebastian Goodman, Ibrahim Alabdulmohsin, Piotr Padlewski, Daniel Salz, Xi Xiong, Daniel Vlasic, Filip Pavetic, Keran Rong, Tianli Yu, Daniel Keysers, Xiaohua Zhai, Radu Soricut.
- [LocCa](https://arxiv.org/abs/2403.19596), by
Bo Wan, Michael Tschannen, Yongqin Xian, Filip Pavetic, Ibrahim Alabdulmohsin, Xiao Wang, André Susano Pinto, Andreas Steiner, Lucas Beyer, Xiaohua Zhai.
- [PaliGemma](https://arxiv.org/abs/2407.07726), by *wow many authors*.\
- [PaliGemma](https://arxiv.org/abs/2407.07726),
[PaliGemma 2](https://arxiv.org/abs/2412.03555), by *wow many authors*.\
- Resources: [readme](big_vision/configs/proj/paligemma/README.md),
[model](big_vision/models/proj/paligemma/paligemma.py),
[transfer configs](big_vision/configs/proj/paligemma/transfers),
18 changes: 11 additions & 7 deletions big_vision/configs/common_fewshot.py
@@ -18,7 +18,7 @@


def get_fewshot_lsr(target_resolution=224, resize_resolution=256,
runlocal=False, **kw):
runlocal=False, pp=None, **kw):
"""Returns a standard-ish fewshot eval configuration."""
kw.setdefault('representation_layer', 'pre_logits')
kw.setdefault('shots', (1, 5, 10, 25))
@@ -45,12 +45,16 @@ def get_fewshot_lsr(target_resolution=224, resize_resolution=256,
} if not runlocal else {
'pets': ('oxford_iiit_pet', 'train', 'test'),
}
config.pp_train = (f'decode|resize({resize_resolution})|'
f'central_crop({target_resolution})|'
f'value_range(-1,1)|keep("image", "label")')
config.pp_eval = (f'decode|resize({resize_resolution})|'
f'central_crop({target_resolution})|'
f'value_range(-1,1)|keep("image", "label")')

pp = pp or '|'.join([
'decode',
f'resize({resize_resolution})',
f'central_crop({target_resolution})',
'value_range(-1,1)'
])
pp += '|keep("image", "label")'
config.pp_train = pp
config.pp_eval = pp
config.display_first = [('imagenet', 10)] if not runlocal else [('pets', 10)]

return config
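
For illustration, a minimal usage sketch of the new `pp` argument (a sketch only, not part of this commit; the preprocessing ops shown are assumptions, and `|keep("image", "label")` is appended by the function itself):

```python
# Hypothetical usage of get_fewshot_lsr with a custom preprocessing string.
from big_vision.configs.common_fewshot import get_fewshot_lsr

config = get_fewshot_lsr(
    target_resolution=224,
    resize_resolution=256,
    # Replaces the default decode|resize|central_crop|value_range chain;
    # '|keep("image", "label")' is still appended by get_fewshot_lsr.
    pp='decode|resize_small(256)|central_crop(224)|value_range(-1,1)',
)
```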
87 changes: 50 additions & 37 deletions big_vision/configs/proj/paligemma/README.md
@@ -8,27 +8,28 @@ the [Gemma language model](https://ai.google.dev/gemma).
PaliGemma is designed as a versatile model for transfer to a wide range of
vision-language tasks such as image and short video caption, visual question
answering, text reading, object detection and object segmentation. Together with
the pretrained and transfer checkpoints at multiple resolutions, we provide a
checkpoint transferred to a mixture of tasks that can be used for off-the-shelf
exploration.
the pretrained checkpoints (PaliGemma and PaliGemma 2) we also provide transfer
checkpoints at multiple resolutions and a checkpoint transferred to a mixture of
tasks that can be used for off-the-shelf exploration (PaliGemma only).

## Quick Reference

This is the reference repository of the model; you may also want to check out the resources on

- [ArXiv](https://arxiv.org/abs/2407.07726): Technical report.
- [Kaggle](https://www.kaggle.com/models/google/paligemma):
All pre-trained / mix checkpoints and model card.
- [Kaggle-FT](https://www.kaggle.com/models/google/paligemma-ft):
All fine-tuned checkpoints and model card.
- [VertexAI Model Garden](https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/363):
Paligemma models on GCP.
- [Hugging Face](https://huggingface.co/google/paligemma-3b-pt-224):
PyTorch port of paligemma models.
- [Light finetuning colab](https://colab.research.google.com/github/google-research/big_vision/blob/main/big_vision/configs/proj/paligemma/finetune_paligemma.ipynb):
Lightweight colab for fine-tuning PaliGemma. It can be run on a single T4 GPU (16GB)
available on free Colab.
- [HuggingFace demo](https://hf.co/spaces/google/paligemma): live demo.
- Technical reports on ArXiv: [PaliGemma](https://arxiv.org/abs/2407.07726),
[PaliGemma 2](https://arxiv.org/abs/2412.03555)
- Pre-trained / mix checkpoints and model card on Kaggle:
[PaliGemma](https://www.kaggle.com/models/google/paligemma),
[PaliGemma transfers](https://www.kaggle.com/models/google/paligemma-ft),
[PaliGemma 2](https://www.kaggle.com/models/google/paligemma-2)
- Google Cloud VertexAI Model Garden:
[PaliGemma](https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/363)
- PyTorch and JAX models on Hugging Face:
[PaliGemma](https://huggingface.co/collections/google/paligemma-release-6643a9ffbf57de2ae0448dda),
[PaliGemma 2](https://huggingface.co/collections/google/paligemma-2-release-67500e1e1dbfdd4dee27ba48)
- Light fine-tuning using `big_vision` on a single (free) T4 GPU:
[Colab](https://colab.research.google.com/github/google-research/big_vision/blob/main/big_vision/configs/proj/paligemma/finetune_paligemma.ipynb)
- Demo: [HuggingFace PaliGemma space](https://hf.co/spaces/google/paligemma)

### Citation BibTeX

@@ -39,22 +39,31 @@ This is the reference repository of the model, you may also want to check out th
year={2024},
journal={arXiv preprint arXiv:2407.07726}
}
@article{steiner2024paligemma2,
title={{PaliGemma 2: A Family of Versatile VLMs for Transfer}},
author={Andreas Steiner and André Susano Pinto and Michael Tschannen and Daniel Keysers and Xiao Wang and Yonatan Bitton and Alexey Gritsenko and Matthias Minderer and Anthony Sherbondy and Shangbang Long and Siyang Qin and Reeve Ingle and Emanuele Bugliarello and Sahar Kazemzadeh and Thomas Mesnard and Ibrahim Alabdulmohsin and Lucas Beyer and Xiaohua Zhai},
year={2024},
journal={arXiv preprint arXiv:2412.03555}
}
```

## Model description

### Overview

PaliGemma-3B is Vision-Language model that was inspired by the PaLI-3 recipe.
It is built on SigLIP visual encoder (specifically, SigLIP-So400m/14) and the
Gemma 2B language model. PaliGemma takes as input one or more images,
which are turned into "soft tokens" by the SigLIP encoder, and input text
(codenamed the "prefix") that is tokenized by Gemma's tokenizer. The image
tokens and prefix tokens are concatenated (in this order) and passed to the
Gemma decoder with full block-attention, which then generates an output text
(the "suffix") auto-regressively with masked attention.
PaliGemma is a Vision-Language model that was inspired by the PaLI-3 recipe. It
is built on the SigLIP visual encoder (specifically, SigLIP-So400m/14) and the
Gemma language model. PaliGemma takes as input one or more images, which are
turned into "soft tokens" by the SigLIP encoder, and input text (codenamed the
"prefix") that is tokenized by Gemma's tokenizer. The image tokens and prefix
tokens are concatenated (in this order) and passed to the Gemma decoder with
full block-attention, which then generates an output text (the "suffix")
auto-regressively with masked attention.

![PaliGemma model](paligemma2.png)

![PaliGemma model](paligemma.png)
Note that PaliGemma uses the Gemma 2B model, while PaliGemma 2 uses the
Gemma 2 {2B,9B,27B} models.
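
As a rough illustration of this attention pattern, here is a toy numpy sketch (not code from this repository; token counts are made up) of a prefix-LM mask where image and prefix tokens attend to each other fully and the suffix is causally masked:

```python
# Toy sketch of a PaliGemma-style prefix-LM attention mask:
# image + prefix tokens use full block-attention, the suffix is causal.
import numpy as np

def prefix_lm_mask(num_img, num_prefix, num_suffix):
  n_full = num_img + num_prefix          # tokens with full attention
  n = n_full + num_suffix                # total sequence length
  mask = np.zeros((n, n), dtype=bool)
  mask[:, :n_full] = True                # every token sees image + prefix
  for i in range(n_full, n):             # suffix attends causally to itself
    mask[i, n_full:i + 1] = True
  return mask

print(prefix_lm_mask(num_img=2, num_prefix=2, num_suffix=3).astype(int))
```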

### Training stages

@@ -98,12 +108,8 @@ other codebases.
## Checkpoints

The PaliGemma models are released under the same open license as the Gemma
models, and hence require manual acknowledgement of the license terms on kaggle:
https://www.kaggle.com/models/google/paligemma. The reference checkpoints are
available on
[Kaggle](https://www.kaggle.com/models/google/paligemma),
[VertexAI Model Garden](https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/363) and
[Hugging Face](https://huggingface.co/google/paligemma-3b-pt-224).
models, and hence require manual acknowledgement of the license terms. See the
[Quick Reference](#quick-reference) above for download links.

### Pretrained checkpoints

@@ -130,6 +136,8 @@ should happen in float32 or mixed precision.

### Mixture checkpoint

(Currently only available for PaliGemma)

This checkpoint is trained on a mixture of all our transfer tasks,
with a balancing intended to make it "nice to use" out of the box for
predictions. This model is multilingual and should
@@ -152,6 +160,8 @@ structured `detect {things}` and `segment {things}` prompts as in the base model

### Transfers results and checkpoints

(DOCCI only available for PaliGemma 2, others only available for PaliGemma)

We provide checkpoints transferred to most of the tasks we evaluated
transfer on; see the [kaggle page](https://www.kaggle.com/models/google/paligemma).
These are intended for use when a specialised model corresponding
@@ -244,16 +254,17 @@ Checkpoints can be downloaded from Kaggle. You need to create an account and ack
export KAGGLE_USERNAME=
export KAGGLE_KEY=
# See https://www.kaggle.com/models/google/paligemma for a full list of models.
export MODEL_NAME=paligemma-3b-pt-224
export CKPT_FILE=paligemma-3b-pt-224.npz
# See https://www.kaggle.com/models/google/paligemma-2 for a full list of models.
export MODEL_NAME=paligemma2-3b-pt-224
export CKPT_FILE=paligemma2-3b-pt-224.npz.b16
mkdir ckpts/
cd ckpts/
# Store as a "vanity name" from models/proj/paligemma/paligemma.py
curl -L -u $KAGGLE_USERNAME:$KAGGLE_KEY\
-o pt_224.npz \
https://www.kaggle.com/api/v1/models/google/paligemma/jax/$MODEL_NAME/1/download/$CKPT_FILE
-o pt_3b_224.bf16.npz \
https://www.kaggle.com/api/v1/models/google/paligemma-2/jax/$MODEL_NAME/1/download/$CKPT_FILE
```
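
Alternatively, a hedged sketch of the same download from Python via the `kagglehub` package (the package must be installed and authenticated; the model handle below is an assumption, check the Kaggle page for the exact variation names):

```python
# Sketch only: fetch a PaliGemma 2 checkpoint with kagglehub instead of curl.
# The handle is an assumption; see the Kaggle model page for exact names.
import kagglehub

path = kagglehub.model_download('google/paligemma-2/jax/paligemma2-3b-pt-224')
print('Checkpoint downloaded to:', path)
```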

As an example, we provide the `forkme.py` config that is based on the easily-adjustable jsonl data source:
@@ -267,4 +278,6 @@ If you want to use TFDS-based data, check out other transfer configs. Remember t

## Model Development Contributions

See the [technical report](https://arxiv.org/abs/2407.07726)'s Appendix.
See the Appendices of the technical reports:
[PaliGemma](https://arxiv.org/abs/2407.07726),
[PaliGemma 2](https://arxiv.org/abs/2412.03555).