| DINO | Emerging properties in self-supervised vision transformers | 2021 | link | link |
| DINOv2 | Dinov2: Learning robust visual features without supervision | 2023 | link | link |
| BEiT | Beit: Bert pre-training of image transformers | 2021 | link | link |
| CLIP | Learning transferable visual models from natural language supervision | 2021 | link | link |
| ALIGN | Scaling up visual and vision-language representation learning with noisy text supervision | 2021 | link | NA |
| FLAVA | Flava: A foundational language and vision alignment model | 2022 | link | link |
| Florence | Florence: A new foundation model for computer vision | 2021 | link | NA |
| OFA | Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework | 2022 | link | link |
| Unified-IO | Unified-io: A unified model for vision, language, and multi-modal tasks | 2022 | link | link |
| AIM | Scalable Pre-training of Large Autoregressive Image Models | 2024 | link | link |
| AIMv2 | Multimodal Autoregressive Pre-training of Large Vision Encoders | 2024 | link | link |
| BLIP | Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation | 2022 | link | link |
| BLIP-2 | Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models | 2023 | link | link |
| SigLIP | Sigmoid loss for language image pre-training | 2023 | link | link |
| SigLIP 2 | Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features | 2025 | link | link |
| OpenCLIP | Reproducible Scaling Laws for Contrastive Language-Image Learning | 2023 | link | link |
| SAM | Segment anything | 2023 | link | link |
| SAM 2 | Sam 2: Segment anything in images and videos | 2024 | link | link |
| DALL-E | Zero-shot text-to-image generation | 2021 | link | NA |
| DALL-E 2 | Hierarchical text-conditional image generation with clip latents | 2022 | link | NA |
| DALL-E 3 | Improving image generation with better captions | 2023 | link | NA |
| Stable Diffusion | High-resolution image synthesis with latent diffusion models | 2022 | link | link |
| Imagen 3 | Imagen 3 | 2024 | link | NA |
| Edify | Edify Image: High-Quality Image Generation with Pixel Space Laplacian Diffusion Models | 2024 | link | NA |
| LlamaGen | Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation | 2024 | link | link |
| GPT-4V | GPT-4V | 2024 | link | NA |
| MiniGPT-4 | Minigpt-4: Enhancing vision-language understanding with advanced large language models | 2023 | link | link |
| Flamingo | Flamingo: a Visual Language Model for Few-Shot Learning | 2022 | link | NA |
| LLaVA | Improved baselines with visual instruction tuning | 2024 | link | link |
| Video-LLaVA | Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | 2023 | link | link |
| Pixtral | Pixtral 12B | 2024 | link | link |
| Phi-3.5-Vision | Phi-3 technical report: A highly capable language model locally on your phone | 2024 | link | link |
| VILA | Vila: On pre-training for visual language models | 2024 | link | link |
| NVILA | NVILA: Efficient frontier visual language models | 2024 | link | link |
| VILA-U | Vila-u: a unified foundation model integrating visual understanding and generation | 2024 | link | link |
| TokenFlow | Tokenflow: Unified image tokenizer for multimodal understanding and generation | 2024 | link | link |
| VAR | Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction | 2024 | link | link |
| InstructBLIP | Instructblip: Towards general-purpose vision-language models with instruction tuning | 2023 | link | link |
| Yi-VL | Yi: Open foundation models by 01.AI | 2024 | link | link |
| Qwen-VL | Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond | 2023 | link | link |
| Qwen2-VL | Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution | 2024 | link | link |
| Qwen2.5-VL | Qwen2.5-VL Technical Report | 2025 | link | link |
| InternVL | Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks | 2024 | link | link |
| InternVL 1.5 | How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites | 2024 | link | link |
| InternVL3 | Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models | 2025 | link | link |
| InternVideo2 | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | 2024 | link | link |
| LLaVA-OneVision | LLaVA-OneVision: Easy Visual Task Transfer | 2024 | link | link |
| LLaVA-NeXT | Llava-next-interleave: Tackling multi-image, video, and 3d in large multimodal models | 2024 | link | link |
| CogVLM2 | Cogvlm2: Visual language models for image and video understanding | 2024 | link | link |
| Bunny | Efficient multimodal learning from data-centric perspective | 2024 | link | link |
| Chameleon | Chameleon: Mixed-Modal Early-Fusion Foundation Models | 2024 | link | link |
| Apollo | Apollo: An Exploration of Video Understanding in Large Multimodal Models | 2024 | link | link |
| DeepSeek-VL | DeepSeek-VL: Towards Real-World Vision-Language Understanding | 2024 | link | link |
| DeepSeek-VL2 | DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding | 2024 | link | link |
| Emu3 | Emu3: Next-Token Prediction is All You Need | 2024 | link | link |
| Janus | Janus: Decoupling visual encoding for unified multimodal understanding and generation | 2024 | link | link |
| JanusFlow | JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation | 2024 | link | link |
| Janus-Pro | Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling | 2025 | link | link |
| Movie Gen | Movie Gen: A Cast of Media Foundation Models | 2024 | link | NA |
| Mochi | [blog] Mochi 1: A new SOTA in open text-to-video | 2024 | link | link |
| Imagen Video | Imagen video: High definition video generation with diffusion models | 2022 | link | NA |
| Make-A-Video | Make-A-Video: Text-to-Video Generation without Text-Video Data | 2023 | link | link |
| Tune-A-Video | Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation | 2023 | link | link |
| PixelDance | Make pixels dance: High-dynamic video generation | 2024 | link | link |
| CogVideoX | CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer | 2024 | link | link |
| FlashVideo | FlashVideo: Flowing Fidelity to Detail for Efficient High-Resolution Video Generation | 2025 | link | link |
| Goku | Goku: Flow Based Video Generative Foundation Models | 2025 | link | link |
| Step-Video-T2V | Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model | 2025 | link | link |
| Sora | [blog] Sora: Creating Video from Text | 2024 | link | NA |