PaLI: A jointly-scaled multilingual language-image model X Chen, X Wang, S Changpinyo, AJ Piergiovanni, P Padlewski, D Salz, ... ICLR 2023 (Oral), 2022 | 535 | 2022 |
LiT: Zero-Shot Transfer with Locked-image Text Tuning X Zhai, X Wang, B Mustafa, A Steiner, D Keysers, A Kolesnikov, L Beyer CVPR 2022, 2021 | 498 | 2021 |
Scaling vision transformers to 22 billion parameters M Dehghani, J Djolonga, B Mustafa, P Padlewski, J Heek, J Gilmer, ... ICML 2023 (Oral), 2023 | 389 | 2023 |
Simple Open-Vocabulary Object Detection with Vision Transformers M Minderer, A Gritsenko, A Stone, M Neumann, D Weissenborn, ... ECCV 2022, 2022 | 382* | 2022 |
Measuring compositional generalization: A comprehensive method on realistic data D Keysers, N Schärli, N Scales, H Buisman, D Furrer, S Kashubin, ... ICLR 2020, 2019 | 364 | 2019 |
PaLI-X: On scaling up a multilingual vision and language model X Chen, J Djolonga, P Padlewski, B Mustafa, S Changpinyo, J Wu, ... CVPR 2024, 2023 | 124 | 2023 |
PaLI-3 vision language models: Smaller, faster, stronger X Chen, X Wang, L Beyer, A Kolesnikov, J Wu, P Voigtlaender, B Mustafa, ... arXiv preprint arXiv:2310.09199, 2023 | 50 | 2023 |
PaliGemma: A versatile 3B VLM for transfer L Beyer, A Steiner, AS Pinto, A Kolesnikov, X Wang, D Salz, M Neumann, ... arXiv preprint arXiv:2407.07726, 2024 | 18 | 2024 |
Three Towers: Flexible Contrastive Learning with Pretrained Image Models J Kossen, M Collier, B Mustafa, X Wang, X Zhai, L Beyer, A Steiner, ... NeurIPS 2023, 2023 | 7 | 2023 |
No filter: Cultural and socioeconomic diversity in contrastive vision-language models A Pouget, L Beyer, E Bugliarello, X Wang, AP Steiner, X Zhai, ... NeurIPS 2024, 2024 | 5 | 2024 |
CLIP the Bias: How Useful is Balancing Data in Multimodal Learning? I Alabdulmohsin, X Wang, A Steiner, P Goyal, A D'Amour, X Zhai ICLR 2024, 2024 | 5 | 2024 |
A study of autoregressive decoders for multi-tasking in computer vision L Beyer, B Wan, G Madan, F Pavetic, A Steiner, A Kolesnikov, AS Pinto, ... arXiv preprint arXiv:2303.17376, 2023 | 5 | 2023 |
LocCa: Visual Pretraining with Location-aware Captioners B Wan, M Tschannen, Y Xian, F Pavetic, I Alabdulmohsin, X Wang, ... NeurIPS 2024, 2024 | 2 | 2024 |
Locked-Model Multimodal Contrastive Tuning D Keysers, X Zhai, X Wang, L Beyer, B Mustafa, A Steiner, A Kolesnikov US Patent App. 18/051,106, 2024 | | 2024 |