# ✨ Foundation Models

In this section, we review recent advances in foundation models and highlight state-of-the-art systems. We categorise foundation models into four categories:

## Large Language Models (LLMs)

| Model | Paper Title | Year | Paper | Code |
|-------|-------------|------|-------|------|
| GPT | Improving language understanding by generative pre-training | 2018 | link | link |
| GPT-2 | Language models are unsupervised multitask learners | 2019 | link | link |
| GPT-3 | Language models are few-shot learners | 2020 | link | link |
| GPT-4 | GPT-4 technical report | 2023 | link | NA |
| o1 | OpenAI o1 System Card | 2024 | link | NA |
| o3-mini | OpenAI o3-mini System Card | 2025 | link | NA |
| BERT | BERT: Pre-training of deep bidirectional transformers for language understanding | 2018 | link | link |
| T5 | Exploring the limits of transfer learning with a unified text-to-text transformer | 2020 | link | link |
| FLAN-T5 | Scaling instruction-finetuned language models | 2024 | link | link |
| OPT | OPT: Open pre-trained transformer language models | 2022 | link | NA |
| Falcon | The Falcon series of open language models | 2023 | link | link |
| Mistral | Mistral 7B | 2023 | link | link |
| Mixtral | Mixtral of experts | 2023 | link | link |
| LLaMA | LLaMA: Open and efficient foundation language models | 2023 | link | link |
| LLaMA 2 | Llama 2: Open foundation and fine-tuned chat models | 2023 | link | link |
| Vicuna | Judging LLM-as-a-judge with MT-Bench and Chatbot Arena | 2023 | link | link |
| Gemma | Gemma: Open models based on Gemini research and technology | 2024 | link | link |
| Gemma 2 | Gemma 2: Improving open language models at a practical size | 2024 | link | link |
| Nemotron-4 | Nemotron-4 340B technical report | 2024 | link | link |
| Qwen | Qwen technical report | 2023 | link | link |
| Qwen 2.5 | Qwen2.5 technical report | 2024 | link | link |
| Qwen 3 | Qwen3 technical report | 2025 | link | link |
| Phi-4 | Phi-4 technical report | 2024 | link | link |
| DeepSeek | DeepSeek LLM: Scaling open-source language models with longtermism | 2024 | link | link |
| DeepSeek-V2 | DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model | 2024 | link | link |
| DeepSeek-V3 | DeepSeek-V3 technical report | 2024 | link | link |
| DeepSeek-R1 | DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning | 2025 | link | link |
| ReasonFlux | ReasonFlux: Hierarchical LLM reasoning via scaling thought templates | 2025 | link | link |
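Despite their differences in scale and training data, the GPT/LLaMA/Qwen-style models above share one core mechanism: autoregressive next-token prediction, decoded one token at a time. The following toy sketch illustrates greedy decoding; the vocabulary and the `toy_logits` scorer are invented stand-ins for a real tokenizer and a Transformer decoder forward pass.

```python
# Toy sketch of autoregressive (next-token) greedy decoding.
# VOCAB and toy_logits are hypothetical stand-ins, not any real model.
VOCAB = ["<bos>", "foundation", "models", "are", "trained", "at", "scale", "<eos>"]

def toy_logits(token_ids):
    """Hypothetical scorer: deterministically favours the next id, cyclically."""
    scores = [-1e9] * len(VOCAB)
    scores[(token_ids[-1] + 1) % len(VOCAB)] = 0.0
    return scores

def greedy_generate(prompt_ids, max_new_tokens=10):
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = toy_logits(ids)
        next_id = max(range(len(logits)), key=logits.__getitem__)  # argmax token
        ids.append(next_id)
        if VOCAB[next_id] == "<eos>":   # stop at the end-of-sequence token
            break
    return ids

print(" ".join(VOCAB[i] for i in greedy_generate([0])))
# -> <bos> foundation models are trained at scale <eos>
```

Real systems replace greedy argmax with temperature, top-k, or nucleus sampling, but the token-by-token loop is the same.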

## Vision-Language Models (VLMs)

| Model | Paper Title | Year | Paper | Code |
|-------|-------------|------|-------|------|
| DINO | Emerging properties in self-supervised vision transformers | 2021 | link | link |
| DINOv2 | DINOv2: Learning robust visual features without supervision | 2023 | link | link |
| BEiT | BEiT: BERT pre-training of image transformers | 2021 | link | link |
| CLIP | Learning transferable visual models from natural language supervision | 2021 | link | link |
| ALIGN | Scaling up visual and vision-language representation learning with noisy text supervision | 2021 | link | NA |
| FLAVA | FLAVA: A foundational language and vision alignment model | 2022 | link | link |
| Florence | Florence: A new foundation model for computer vision | 2021 | link | NA |
| OFA | OFA: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework | 2022 | link | link |
| Unified-IO | Unified-IO: A unified model for vision, language, and multi-modal tasks | 2022 | link | link |
| AIM | Scalable pre-training of large autoregressive image models | 2024 | link | link |
| AIMv2 | Multimodal autoregressive pre-training of large vision encoders | 2024 | link | link |
| BLIP | BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation | 2022 | link | link |
| BLIP-2 | BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models | 2023 | link | link |
| SigLIP | Sigmoid loss for language image pre-training | 2023 | link | link |
| SigLIP 2 | SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features | 2025 | link | link |
| OpenCLIP | Reproducible scaling laws for contrastive language-image learning | 2023 | link | link |
| SAM | Segment anything | 2023 | link | link |
| SAM 2 | SAM 2: Segment anything in images and videos | 2024 | link | link |
| DALL-E | Zero-shot text-to-image generation | 2021 | link | NA |
| DALL-E 2 | Hierarchical text-conditional image generation with CLIP latents | 2022 | link | NA |
| DALL-E 3 | Improving image generation with better captions | 2023 | link | NA |
| Stable Diffusion | High-resolution image synthesis with latent diffusion models | 2022 | link | link |
| Imagen 3 | Imagen 3 | 2024 | link | NA |
| Edify | Edify Image: High-quality image generation with pixel space Laplacian diffusion models | 2024 | link | NA |
| LlamaGen | Autoregressive model beats diffusion: Llama for scalable image generation | 2024 | link | link |
| GPT-4V | GPT-4V(ision) System Card | 2023 | link | NA |
| MiniGPT-4 | MiniGPT-4: Enhancing vision-language understanding with advanced large language models | 2023 | link | link |
| Flamingo | Flamingo: A visual language model for few-shot learning | 2022 | link | NA |
| LLaVA | Improved baselines with visual instruction tuning | 2024 | link | link |
| Video-LLaVA | Video-LLaVA: Learning united visual representation by alignment before projection | 2023 | link | link |
| Pixtral | Pixtral 12B | 2024 | link | link |
| Phi-3.5-Vision | Phi-3 technical report: A highly capable language model locally on your phone | 2024 | link | link |
| VILA | VILA: On pre-training for visual language models | 2024 | link | link |
| NVILA | NVILA: Efficient frontier visual language models | 2024 | link | link |
| VILA-U | VILA-U: A unified foundation model integrating visual understanding and generation | 2024 | link | link |
| TokenFlow | TokenFlow: Unified image tokenizer for multimodal understanding and generation | 2024 | link | link |
| VAR | Visual autoregressive modeling: Scalable image generation via next-scale prediction | 2024 | link | link |
| InstructBLIP | InstructBLIP: Towards general-purpose vision-language models with instruction tuning | 2023 | link | link |
| Yi-VL | Yi: Open foundation models by 01.AI | 2024 | link | link |
| Qwen-VL | Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond | 2023 | link | link |
| Qwen2-VL | Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution | 2024 | link | link |
| Qwen2.5-VL | Qwen2.5-VL technical report | 2025 | link | link |
| InternVL | InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks | 2024 | link | link |
| InternVL 1.5 | How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites | 2024 | link | link |
| InternVL3 | InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models | 2025 | link | link |
| InternVideo2 | InternVideo2: Scaling foundation models for multimodal video understanding | 2024 | link | link |
| LLaVA-OneVision | LLaVA-OneVision: Easy visual task transfer | 2024 | link | link |
| LLaVA-NeXT | LLaVA-NeXT-Interleave: Tackling multi-image, video, and 3D in large multimodal models | 2024 | link | link |
| CogVLM2 | CogVLM2: Visual language models for image and video understanding | 2024 | link | link |
| Bunny | Efficient multimodal learning from data-centric perspective | 2024 | link | link |
| Chameleon | Chameleon: Mixed-modal early-fusion foundation models | 2024 | link | link |
| Apollo | Apollo: An exploration of video understanding in large multimodal models | 2024 | link | link |
| DeepSeek-VL | DeepSeek-VL: Towards real-world vision-language understanding | 2024 | link | link |
| DeepSeek-VL2 | DeepSeek-VL2: Mixture-of-experts vision-language models for advanced multimodal understanding | 2024 | link | link |
| Emu3 | Emu3: Next-token prediction is all you need | 2024 | link | link |
| Janus | Janus: Decoupling visual encoding for unified multimodal understanding and generation | 2024 | link | link |
| JanusFlow | JanusFlow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation | 2024 | link | link |
| Janus-Pro | Janus-Pro: Unified multimodal understanding and generation with data and model scaling | 2025 | link | link |
| Movie Gen | Movie Gen: A cast of media foundation models | 2024 | link | NA |
| Mochi | [blog] Mochi 1: A new SOTA in open text-to-video | 2024 | link | link |
| Imagen Video | Imagen Video: High definition video generation with diffusion models | 2022 | link | NA |
| Make-A-Video | Make-A-Video: Text-to-video generation without text-video data | 2023 | link | link |
| Tune-A-Video | Tune-A-Video: One-shot tuning of image diffusion models for text-to-video generation | 2023 | link | link |
| PixelDance | Make pixels dance: High-dynamic video generation | 2024 | link | link |
| CogVideoX | CogVideoX: Text-to-video diffusion models with an expert transformer | 2024 | link | link |
| FlashVideo | FlashVideo: Flowing fidelity to detail for efficient high-resolution video generation | 2025 | link | link |
| Goku | Goku: Flow based video generative foundation models | 2025 | link | link |
| Step-Video-T2V | Step-Video-T2V technical report: The practice, challenges, and future of video foundation model | 2025 | link | link |
| Sora | [blog] Sora: Creating video from text | 2024 | link | NA |
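Many of the dual-encoder models above (CLIP, ALIGN, OpenCLIP) are trained with a symmetric contrastive (InfoNCE) objective that pulls matched image-text pairs together and pushes mismatched pairs apart. A minimal NumPy sketch, where random matrices stand in for the outputs of an image encoder and a text encoder:

```python
import numpy as np

# Sketch of the CLIP-style symmetric contrastive objective. The embeddings
# below are random stand-ins for encoder outputs; all values are illustrative.
rng = np.random.default_rng(0)

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def clip_loss(img_emb, txt_emb, temperature=0.07):
    img, txt = l2_normalize(img_emb), l2_normalize(txt_emb)
    logits = img @ txt.T / temperature   # (N, N): matched pairs on the diagonal
    n = len(logits)

    def xent(lg):                        # cross-entropy toward the diagonal
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[np.arange(n), np.arange(n)].mean()

    # average the image->text and text->image directions
    return 0.5 * (xent(logits) + xent(logits.T))

img_emb = rng.normal(size=(4, 16))
txt_emb = img_emb + 0.05 * rng.normal(size=(4, 16))   # well-aligned pairs
# aligned pairs should score a lower loss than deliberately shuffled ones
print(clip_loss(img_emb, txt_emb) < clip_loss(img_emb, txt_emb[::-1]))  # -> True
```

SigLIP replaces this batch-wise softmax with an independent sigmoid loss per image-text pair, which removes the need for a global normalisation over the batch.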

## Audio-Language Models (ALMs)

| Model | Paper Title | Year | Paper | Code |
|-------|-------------|------|-------|------|
| Wav2Vec 2.0 | wav2vec 2.0: A framework for self-supervised learning of speech representations | 2020 | link | link |
| HuBERT | HuBERT: Self-supervised speech representation learning by masked prediction of hidden units | 2021 | link | link |
| WavLM | WavLM: Large-scale self-supervised pre-training for full stack speech processing | 2022 | link | link |
| Whisper | Robust speech recognition via large-scale weak supervision | 2023 | link | link |
| USM | Google USM: Scaling automatic speech recognition beyond 100 languages | 2023 | link | NA |
| UniAudio | UniAudio: An audio foundation model toward universal audio generation | 2023 | link | link |
| MERT | MERT: Acoustic music understanding model with large-scale self-supervised training | 2023 | link | link |
| CLAP | CLAP: Learning audio concepts from natural language supervision | 2023 | link | link |
| SenseVoice | FunAudioLLM: Voice understanding and generation foundation models for natural interaction between humans and LLMs | 2024 | link | link |
| CosyVoice | CosyVoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens | 2024 | link | link |
| VALL-E | Neural codec language models are zero-shot text to speech synthesizers | 2023 | link | NA |
| SpeechT5 | SpeechT5: Unified-modal encoder-decoder pre-training for spoken language processing | 2021 | link | link |
| SLM | SLM: Bridge the thin gap between speech and text foundation models | 2023 | link | NA |
| AudioGPT | AudioGPT: Understanding and generating speech, music, sound, and talking head | 2024 | link | link |
| SpeechGPT | SpeechGPT: Empowering large language models with intrinsic cross-modal conversational abilities | 2023 | link | link |
| AudioPaLM | AudioPaLM: A large language model that can speak and listen | 2023 | link | NA |
| SALMONN | SALMONN: Towards generic hearing abilities for large language models | 2024 | link | link |
| WavLLM | WavLLM: Towards robust and adaptive speech large language model | 2024 | link | link |
| Pengi | Pengi: An audio language model for audio tasks | 2023 | link | link |
| LTU | Listen, Think, and Understand | 2024 | link | link |
| GAMA | GAMA: A large audio-language model with advanced audio understanding and complex reasoning abilities | 2024 | link | link |
| Qwen2-Audio | Qwen2-Audio technical report | 2024 | link | link |
| SeamlessM4T | SeamlessM4T: Massively multilingual & multimodal machine translation | 2023 | link | link |
| Step-Audio | Step-Audio: Unified understanding and generation in intelligent speech interaction | 2025 | link | link |
| MusicLM | MusicLM: Generating music from text | 2023 | link | NA |
| AudioLDM | AudioLDM: Text-to-audio generation with latent diffusion models | 2023 | link | link |
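Most speech foundation models in this table consume a log-mel spectrogram rather than a raw waveform: the signal is framed, windowed, Fourier-transformed, and projected onto a mel filterbank. The sketch below implements that front end in NumPy; the frame length, hop, and filter count are illustrative, not the exact settings of any particular model.

```python
import numpy as np

# Sketch of a log-mel spectrogram front end (Whisper-style models use a
# variant of this as input). Parameter values here are illustrative only.
def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # triangular filters spaced evenly on the mel scale
    mels = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for j in range(left, center):
            fb[i, j] = (j - left) / max(center - left, 1)
        for j in range(center, right):
            fb[i, j] = (right - j) / max(right - center, 1)
    return fb

def log_mel(signal, sr=16000, n_fft=400, hop=160, n_mels=40):
    # frame the signal, apply a Hann window, take the power spectrum
    frames = [signal[s:s + n_fft] for s in range(0, len(signal) - n_fft + 1, hop)]
    window = np.hanning(n_fft)
    power = np.abs(np.fft.rfft(np.array(frames) * window, axis=1)) ** 2
    mel = power @ mel_filterbank(n_mels, n_fft, sr).T
    return np.log(mel + 1e-10)

t = np.linspace(0, 1, 16000, endpoint=False)
feats = log_mel(np.sin(2 * np.pi * 440 * t))   # one second of a 440 Hz tone
print(feats.shape)  # -> (98, 40)
```

Self-supervised models such as wav2vec 2.0 and HuBERT instead learn their own features from raw audio, but the framing-and-windowing idea is the same.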

## Large Multi-modal Models (LMMs)

| Model | Paper Title | Year | Paper | Code |
|-------|-------------|------|-------|------|
| AudioCLIP | AudioCLIP: Extending CLIP to image, text and audio | 2022 | link | link |
| 4M | 4M: Massively multimodal masked modeling | 2023 | link | link |
| ImageBind | ImageBind: One embedding space to bind them all | 2023 | link | link |
| PandaGPT | PandaGPT: One model to instruction-follow them all | 2023 | link | link |
| NExT-GPT | NExT-GPT: Any-to-any multimodal LLM | 2023 | link | link |
| Video-LLaMA | Video-LLaMA: An instruction-tuned audio-visual language model for video understanding | 2023 | link | link |
| VideoLLaMA 2 | VideoLLaMA 2: Advancing spatial-temporal modeling and audio understanding in video-LLMs | 2024 | link | link |
| video-SALMONN | video-SALMONN: Speech-enhanced audio-visual large language models | 2024 | link | link |
| Gemini Pro | Gemini: A family of highly capable multimodal models | 2023 | link | NA |
| LLaMA 3 | The Llama 3 herd of models | 2024 | link | link |
| Qwen2.5-Omni | Qwen2.5-Omni technical report | 2025 | link | link |
| GPT-4o | GPT-4o System Card | 2024 | link | NA |
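Early-fusion LMMs such as Chameleon (and, reportedly, GPT-4o-style models) tokenize every modality into a single discrete vocabulary and interleave the tokens into one sequence for a single autoregressive transformer. A toy sketch of that unified token stream; the vocabulary sizes and modality tags are invented for illustration:

```python
# Toy sketch of early-fusion token interleaving for a multi-modal LLM.
# Vocabulary sizes and tags below are hypothetical, not any real model's.
TEXT_VOCAB = 32000   # e.g. a BPE text tokenizer
IMAGE_CODES = 8192   # e.g. codes from a VQ image tokenizer
AUDIO_CODES = 4096   # e.g. codes from a neural audio codec

def offset(kind, token):
    """Map per-modality token ids into one disjoint, unified id space."""
    base = {"text": 0,
            "image": TEXT_VOCAB,
            "audio": TEXT_VOCAB + IMAGE_CODES}[kind]
    assert 0 <= token  # each modality keeps its own local id range
    return base + token

def interleave(segments):
    """segments: list of (kind, [token ids]) -> one unified token stream."""
    return [offset(kind, t) for kind, toks in segments for t in toks]

stream = interleave([("text", [5, 17]), ("image", [3, 900]), ("text", [42])])
print(stream)  # -> [5, 17, 32003, 32900, 42]
```

Late-fusion designs (e.g. ImageBind, PandaGPT) take the opposite route: separate per-modality encoders whose outputs are aligned in a shared embedding space rather than a shared token vocabulary.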
