Vision
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
Paper
• 2406.16860
• Published • 63
Understanding Alignment in Multimodal LLMs: A Comprehensive Study
Paper
• 2407.02477
• Published • 24
LongVILA: Scaling Long-Context Visual Language Models for Long Videos
Paper
• 2408.10188
• Published • 52
Building and better understanding vision-language models: insights and future directions
Paper
• 2408.12637
• Published • 133
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
Paper
• 2408.13257
• Published • 26
CogVLM2: Visual Language Models for Image and Video Understanding
Paper
• 2408.16500
• Published • 57
VisionTS: Visual Masked Autoencoders Are Free-Lunch Zero-Shot Time Series Forecasters
Paper
• 2408.17253
• Published • 39
General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model
Paper
• 2409.01704
• Published • 83
Guide-and-Rescale: Self-Guidance Mechanism for Effective Tuning-Free Real Image Editing
Paper
• 2409.01322
• Published • 96
NVLM: Open Frontier-Class Multimodal LLMs
Paper
• 2409.11402
• Published • 74
Phidias: A Generative Model for Creating 3D Content from Text, Image, and 3D Conditions with Reference-Augmented Diffusion
Paper
• 2409.11406
• Published • 27
MMCOMPOSITION: Revisiting the Compositionality of Pre-trained Vision-Language Models
Paper
• 2410.09733
• Published • 8
LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture
Paper
• 2409.02889
• Published • 54
Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions
Paper
• 2412.08737
• Published • 54
Progressive Multimodal Reasoning via Active Retrieval
Paper
• 2412.14835
• Published • 73
AutoCLIP: Auto-tuning Zero-Shot Classifiers for Vision-Language Models
Paper
• 2309.16414
• Published • 19
InfiniteYou: Flexible Photo Recrafting While Preserving Your Identity
Paper
• 2503.16418
• Published • 36
ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning
Paper
• 2503.19470
• Published • 19
A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor?
Paper
• 2409.15277
• Published • 38
Unified Vision-Language-Action Model
Paper
• 2506.19850
• Published • 28