| DINO | Emerging properties in self-supervised vision transformers | 2021 | link | link |
| DINOv2 | Dinov2: Learning robust visual features without supervision | 2023 | link | link |
| BEiT | Beit: Bert pre-training of image transformers | 2021 | link | link |
| CLIP | Learning transferable visual models from natural language supervision | 2021 | link | link |
| ALIGN | Scaling up visual and vision-language representation learning with noisy text supervision | 2021 | link | NA |
| FLAVA | Flava: A foundational language and vision alignment model | 2022 | link | link |
| Florence | Florence: A new foundation model for computer vision | 2021 | link | NA |
| OFA | Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework | 2022 | link | link |
| Unified-IO | Unified-io: A unified model for vision, language, and multi-modal tasks | 2022 | link | link |
| AIM | Scalable Pre-training of Large Autoregressive Image Models | 2024 | link | link |
| AIMv2 | Multimodal Autoregressive Pre-training of Large Vision Encoders | 2024 | link | link |
| BLIP | Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation | 2022 | link | link |
| BLIP-2 | Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models | 2023 | link | link |
| SigLIP | Sigmoid loss for language image pre-training | 2023 | link | link |
| SigLIP 2 | Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features | 2025 | link | link |
| OpenCLIP | Reproducible Scaling Laws for Contrastive Language-Image Learning | 2023 | link | link |
| SAM | Segment anything | 2023 | link | link |
| SAM 2 | Sam 2: Segment anything in images and videos | 2024 | link | link |
| DALL-E | Zero-shot text-to-image generation | 2021 | link | NA |
| DALL-E 2 | Hierarchical text-conditional image generation with clip latents | 2022 | link | NA |
| DALL-E 3 | Improving image generation with better captions | 2023 | link | NA |
| Stable Diffusion | High-resolution image synthesis with latent diffusion models | 2022 | link | link |
| Imagen 3 | Imagen 3 | 2024 | link | NA |
| Edify | Edify Image: High-Quality Image Generation with Pixel Space Laplacian Diffusion Models | 2024 | link | NA |
| LlamaGen | Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation | 2024 | link | link |
| GPT-4V | GPT-4V | 2024 | link | NA |
| MiniGPT-4 | Minigpt-4: Enhancing vision-language understanding with advanced large language models | 2023 | link | link |
| Flamingo | Flamingo: a Visual Language Model for Few-Shot Learning | 2022 | link | NA |
| LLaVA | Improved baselines with visual instruction tuning | 2024 | link | link |
| Video-LLaVA | Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | 2023 | link | link |
| Pixtral | Pixtral 12B | 2024 | link | link |
| Phi-3.5-Vision | Phi-3 technical report: A highly capable language model locally on your phone | 2024 | link | link |
| VILA | Vila: On pre-training for visual language models | 2024 | link | link |
| NVILA | NVILA: Efficient frontier visual language models | 2024 | link | link |
| VILA-U | Vila-u: a unified foundation model integrating visual understanding and generation | 2024 | link | link |
| TokenFlow | Tokenflow: Unified image tokenizer for multimodal understanding and generation | 2024 | link | link |
| VAR | Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction | 2024 | link | link |
| InstructBLIP | Instructblip: Towards general-purpose vision-language models with instruction tuning | 2023 | link | link |
| Yi-VL | Yi: Open foundation models by 01.AI | 2024 | link | link |
| Qwen-VL | Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond | 2023 | link | link |
| Qwen2-VL | Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution | 2024 | link | link |
| Qwen2.5-VL | Qwen2.5-VL Technical Report | 2025 | link | link |
| InternVL | Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks | 2024 | link | link |
| InternVL 1.5 | How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites | 2024 | link | link |
| InternVL3 | Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models | 2025 | link | link |
| InternVideo2 | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | 2024 | link | link |
| LLaVA-OneVision | LLaVA-OneVision: Easy Visual Task Transfer | 2024 | link | link |
| LLaVA-NeXT | Llava-next-interleave: Tackling multi-image, video, and 3d in large multimodal models | 2024 | link | link |
| CogVLM2 | Cogvlm2: Visual language models for image and video understanding | 2024 | link | link |
| Bunny | Efficient multimodal learning from data-centric perspective | 2024 | link | link |
| Chameleon | Chameleon: Mixed-Modal Early-Fusion Foundation Models | 2024 | link | link |
| Apollo | Apollo: An Exploration of Video Understanding in Large Multimodal Models | 2024 | link | link |
| DeepSeek-VL | DeepSeek-VL: Towards Real-World Vision-Language Understanding | 2024 | link | link |
| DeepSeek-VL2 | DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding | 2024 | link | link |
| Emu3 | Emu3: Next-Token Prediction is All You Need | 2024 | link | link |
| Janus | Janus: Decoupling visual encoding for unified multimodal understanding and generation | 2024 | link | link |
| JanusFlow | JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation | 2024 | link | link |
| Janus-Pro | Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling | 2025 | link | link |
| Movie Gen | Movie Gen: A Cast of Media Foundation Models | 2024 | link | NA |
| Mochi | [blog] Mochi 1: A new SOTA in open text-to-video | 2024 | link | link |
| Imagen Video | Imagen video: High definition video generation with diffusion models | 2022 | link | NA |
| Make-A-Video | Make-A-Video: Text-to-Video Generation without Text-Video Data | 2023 | link | link |
| Tune-A-Video | Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation | 2023 | link | link |
| PixelDance | Make pixels dance: High-dynamic video generation | 2024 | link | link |
| CogVideoX | CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer | 2024 | link | link |
| FlashVideo | FlashVideo: Flowing Fidelity to Detail for Efficient High-Resolution Video Generation | 2025 | link | link |
| Goku | Goku: Flow Based Video Generative Foundation Models | 2025 | link | link |
| Step-Video-T2V | Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model | 2025 | link | link |
| Sora | [blog] Sora: Creating Video from Text | 2024 | link | NA |