Papers
- Layer by Layer: Uncovering Hidden Representations in Language Models
- s1: Simple test-time scaling
- SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer
- SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
- Titans: Learning to Memorize at Test Time
- ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
- Real-Time Video Generation with Pyramid Attention Broadcast
- Diffusion Models without Classifier-free Guidance
- CLIP Behaves like a Bag-of-Words Model Cross-modally but not Uni-modally
- I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models
- Deliberation in Latent Space via Differentiable Cache Augmentation
- Training Large Language Models to Reason in a Continuous Latent Space
- Hopping Too Late: Exploring the Limitations of Large Language Models on Multi-Hop Queries
- Distributional Reasoning in LLMs: Parallel Reasoning Processes in Multi-Hop Reasoning
- In-context Autoencoder for Context Compression in a Large Language Model
- DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models
- VoCo-LLaMA: Towards Vision Compression with Large Language Models
- Progressive Compositionality in Text-to-Image Generative Models
- GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation
- Fixed Point Diffusion Models
- Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
- Visual Lexicon: Rich Image Features in Language Space
- DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation
- Scaling Language-Free Visual Representation Learning
- Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction
- Harmonizing Visual Representations for Unified Multimodal Understanding and Generation
- T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT
- Boosting Generative Image Modeling via Joint Image-Feature Synthesis
- Multi-modal Synthetic Data Training and Model Collapse: Insights from VLMs and Diffusion Models
- Mean Flows for One-step Generative Modeling
- Fractal Generative Models
- FlowTok: Flowing Seamlessly Across Text and Image Tokens
- DDT: Decoupled Diffusion Transformer
- Introducing Multiverse: The First AI Multiplayer World Model
- Lumina-Image 2.0: A Unified and Efficient Image Generative Framework
- Circuit Tracing: Revealing Computational Graphs in Language Models
- On the Biology of a Large Language Model
- PixelFlow: Pixel-Space Generative Models with Flow
- No Other Representation Component Is Needed: Diffusion Transformers Can Provide Representation Guidance by Themselves
- Visual Planning: Let’s Think Only with Images
- BLIP3-o: A Family of Fully Open Unified Multimodal Models—Architecture, Training and Dataset
- Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging
- Emerging Properties in Unified Multimodal Pretraining
- Latent Flow Transformer
- Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
- Efficient Pretraining Length Scaling
- Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space
- MMaDA: Multimodal Large Diffusion Language Models
- Harnessing the Universal Geometry of Embeddings
- Diffusion Meets Flow Matching: Two Sides of the Same Coin
- Elucidating the Design Space of Diffusion-Based Generative Models
- Noise Schedules Considered Harmful
- Unified Autoregressive Visual Generation and Understanding with Continuous Tokens
- Exploring the Latent Capacity of LLMs for One-Step Text Generation
- DataRater: Meta-Learned Dataset Curation
- A Fourier Space Perspective on Diffusion Models
- An Alchemist’s Notes on Deep Learning
- Spurious Rewards: Rethinking Training Signals in RLVR
- Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing
- DetailFlow: 1D Coarse-to-Fine Autoregressive Image Generation via Next-Detail Prediction
- Hunyuan-Game: Industrial-grade Intelligent Game Creation Model
- Mathematical Theory of Deep Learning
- Atlas: Learning to Optimally Memorize the Context at Test Time
- Navigating the Latent Space Dynamics of Neural Models
- DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
- un2CLIP: Improving CLIP’s Visual Detail Capturing Ability via Inverting unCLIP
- Are Unified Vision-Language Models Necessary: Generalization Across Understanding and Generation
- VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models
- Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control
- Stochastic Interpolants: A Unifying Framework for Flows and Diffusions
- Dual-Process Image Generation
- Continuous Thought Machines
- UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation
- How much do language models memorize?
- Object Concepts Emerge from Motion
- Why Gradients Rapidly Increase Near the End of Training
- WorldExplorer: Towards Generating Fully Navigable 3D Scenes
- The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
- Data Mixing Can Induce Phase Transitions in Knowledge Acquisition
- Physics of Language Models
- Flow-GRPO: Training Flow Matching Models via Online RL
- Contrastive Flow Matching
- Demystifying Reasoning Dynamics with Mutual Information: Thinking Tokens are Information Peaks in LLM Reasoning
- ★STARFLOW: Scaling Latent Normalizing Flows for High-resolution Image Synthesis
- Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion
- Understanding Transformer from the Perspective of Associative Memory
- Flowing from Words to Pixels: A Noise-Free Framework for Cross-Modality Evolution
- Hidden in plain sight: VLMs overlook their visual representations
- Enhancing Motion Dynamics of Image-to-Video Models via Adaptive Low-Pass Guidance
- Cartridges: Lightweight and general-purpose long context representations via self-study
- A Comprehensive Study of Decoder-Only LLMs for Text-to-Image Generation
- Inductive Moment Matching
- V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
- Exploring Diffusion Transformer Designs via Grafting
- Diffuse and Disperse: Image Generation with Representation Regularization
- Highly Compressed Tokenizer Can Generate Without Training
- Edit Flows: Flow Matching with Edit Operations
- Language-Image Alignment with Fixed Text Encoders
- Seek in the Dark: Reasoning via Test-Time Instance-Level Policy Gradient in Latent Space
- The Illusion of the Illusion of Thinking
- Ambient Diffusion Omni: Training Good Models with Bad Data
- Text-to-LoRA: Instant Transformer Adaption
- Visual Pre-Training on Unlabeled Images using Reinforcement Learning
- On the Closed-Form of Flow Matching: Generalization Does Not Arise from Target Stochasticity
- Reasoning by Superposition: A Theoretical Perspective on Chain of Continuous Thought
- Pisces: An Auto-regressive Foundation Model for Image Understanding and Generation
- Attention Retrieves, MLP Memorizes: Disentangling Trainable Components in the Transformer
- Human-like object concept representations emerge naturally in multimodal large language models
- From Bytes to Ideas: Language Modeling with Autoregressive U-Nets
- Rethinking Score Distillation as a Bridge Between Image Distributions
- Generative Multimodal Models are In-Context Learners
- Randomized Autoregressive Visual Generation
- How Visual Representations Map to Language Feature Space in Multimodal LLMs
- CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation
- UniFork: Exploring Modality Alignment for Unified Multimodal Understanding and Generation
- Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens
- Imagine while Reasoning in Space: Multimodal Visualization-of-Thought
- OmniGen2: Exploration to Advanced Multimodal Generation
- Describing Differences in Image Sets with Natural Language
- Vision-Language Models Create Cross-Modal Task Representations
- ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image Generation
- BlenderFusion: 3D-Grounded Visual Editing and Generative Compositing
- Perception Encoder: The best visual embeddings are not at the output of the network
- Streamline Without Sacrifice - Squeeze out Computation Redundancy in LMM
- From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning
- Improving Progressive Generation with Decomposable Flow Matching
- Diffusion Tree Sampling: Scalable inference-time alignment of diffusion models
- Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
- Drag-and-Drop LLMs: Zero-Shot Prompt-to-Weights
- VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks
- Inference-time Scaling of Diffusion Models through Classical Search
- Reflect-DiT: Inference-Time Scaling for Text-to-Image Diffusion Transformers via In-Context Reflection
- Generative Blocks World: Moving Things Around in Pictures
- Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal Data
- Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps
- MAGIC: Near-Optimal Data Attribution for Deep Learning
- Perception-R1: Pioneering Perception Policy with Reinforcement Learning
- REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers
- Can We Generate Images with CoT? Let’s Verify and Reinforce Image Generation Step by Step
- Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction
- Ovis-U1: Unified Understanding, Generation, and Editing
- DeepVerse: 4D Autoregressive Video Generation as a World Model
- GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
- Learning to Instruct for Visual Instruction Tuning
- Flexible Language Modeling in Continuous Space with Transformer-based Autoregressive Flows
- Matryoshka Representation Learning
- Adaptive Length Image Tokenization via Recurrent Allocation
- Test-Time Scaling of Diffusion Models via Noise Trajectory Search
- Matryoshka Multimodal Models
- Representation Entanglement for Generation: Training Diffusion Transformers Is Much Easier Than You Think
- SSR: Enhancing Depth Perception in Vision-Language Models via Rationale-Guided Spatial Reasoning
- Reasoning to Learn from Latent Thoughts
- HART: Efficient Visual Generation with Hybrid Autoregressive Transformer
- ControlThinker: Unveiling Latent Semantics for Controllable Image Generation through Visual Reasoning
- Not Only Text: Exploring Compositionality of Visual Representations in Vision-Language Models
- X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models
- Unified Multimodal Understanding via Byte-Pair Visual Encoding
- In-Context Learning State Vector with Inner and Momentum Optimization
- In-context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering
- Linear Spaces of Meanings: Compositional Structures in Vision-Language Models
- DICE: Distilling Classifier-Free Guidance into Text Embeddings
- Uncovering the Disentanglement Capability in Text-to-Image Diffusion Models
- Uncovering the Text Embedding in Text-to-Image Diffusion Models
- VLM-R3: Region Recognition, Reasoning, and Refinement for Enhanced Multimodal Chain-of-Thought
- Thinking with Generated Images
- GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning
- Apollo: An Exploration of Video Understanding in Large Multimodal Models
- TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability
- Latent Concept Disentanglement in Transformer-based Language Models
- JAM: Controllable and Responsible Text Generation via Causal Reasoning and Latent Vector Manipulation
- Describe Anything: Detailed Localized Image and Video Captioning
- Enough Coin Flips Can Make LLMs Act Bayesian
- Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling
- When Does Perceptual Alignment Benefit Vision Representations?
- Activation Reward Models for Few-Shot Model Alignment
- Fast and Simplex: 2-Simplicial Attention in Triton
- Steering Llama 2 via Contrastive Activation Addition
- Extracting Latent Steering Vectors from Pretrained Language Models
- Steering Language Models With Activation Engineering
- Towards Revealing the Mystery behind Chain of Thought: A Theoretical Perspective
- Do Large Language Models Latently Perform Multi-Hop Reasoning?
- Towards Understanding Chain-of-Thought Prompting: An Empirical Study of What Matters
- Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting
- Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning
- Hierarchical Text-Conditional Image Generation with CLIP Latents
- High Fidelity Visualization of What Your Self-Supervised Representation Knows About
- On the Importance of Embedding Norms in Self-Supervised Learning
- Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces
- Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere
- Contrastive Learning Inverts the Data Generating Process
- Mitigating the Discrepancy Between Video and Text Temporal Sequences: A Time-Perception Enhanced Video Grounding method for LLM
- Temporal Chain of Thought: Long-Video Understanding by Thinking in Frames
- Direct Discriminative Optimization: Your Likelihood-Based Visual Generative Model is Secretly a GAN Discriminator
- GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing
- Think before you speak: Training Language Models With Pause Tokens
- Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding
- OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation
- Look Twice Before You Answer: Memory-Space Visual Retracing for Hallucination Mitigation in Multimodal Large Language Models
- Let’s Think Dot by Dot: Hidden Computation in Transformer Language Models
- Energy-Based Transformers are Scalable Learners and Thinkers
- Null-text Inversion for Editing Real Images using Guided Diffusion Models
- A General Framework for Inference-time Scaling and Steering of Diffusion Models
- Video-T1: Test-Time Scaling for Video Generation
- The Parallelism Tradeoff: Limitations of Log-Precision Transformers
- The Generative AI Paradox: “What It Can Create, It May Not Understand”
- Birth of a Transformer: A Memory Viewpoint
- FlowAR: Scale-wise Autoregressive Image Generation Meets Flow Matching
- What happens to diffusion model likelihood when your model is conditional?
- On the rankability of visual embeddings
- Mitigating Overthinking in Large Reasoning Models via Manifold Steering
- Understanding Pre-training and Fine-tuning from Loss Landscape Perspectives
- Modern Methods in Associative Memory
- Perception Tokens Enhance Visual Reasoning in Multimodal Language Models
- A Survey on Latent Reasoning
- Compressed Chain of Thought: Efficient Reasoning through Dense Representations
- Parallel Continuous Chain-of-Thought with Jacobi Iteration
- Soft Reasoning: Navigating Solution Spaces in Large Language Models through Controlled Embedding Exploration
- Do LLMs Really Think Step-by-step In Implicit Reasoning?
- How Do LLMs Perform Two-Hop Reasoning in Context?
- Iteration Head: A Mechanistic Study of Chain-of-Thought
- Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
- How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning
- Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization
- Steering Your Diffusion Policy with Latent Space Reinforcement Learning
- DanceGRPO: Unleashing GRPO on Visual Generation
- EVA-CLIP: Improved Training Techniques for CLIP at Scale
- Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models
- Does Data Scaling Lead to Visual Compositional Generalization?
- Small Batch Size Training for Language Models: When Vanilla SGD Works, and Why Gradient Accumulation Is Wasteful
- Prompting as Scientific Inquiry
- Single-pass Adaptive Image Tokenization for Minimum Program Search
- Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling
- Context Tuning for In-Context Optimization
- MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent
- Not All Explanations for Deep Learning Phenomena Are Equally Valuable
- VAGEN: Training VLM agents with multi-turn reinforcement learning
Websites
Notes