Papers
- Layer by Layer: Uncovering Hidden Representations in Language Models
- s1: Simple test-time scaling
- SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer
- SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
- Titans: Learning to Memorize at Test Time
- ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
- Real-Time Video Generation with Pyramid Attention Broadcast
- Diffusion Models without Classifier-free Guidance
- CLIP Behaves like a Bag-of-Words Model Cross-modally but not Uni-modally
- I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models
- Deliberation in Latent Space via Differentiable Cache Augmentation
- Training Large Language Models to Reason in a Continuous Latent Space
- Hopping Too Late: Exploring the Limitations of Large Language Models on Multi-Hop Queries
- Distributional Reasoning in LLMs: Parallel Reasoning Processes in Multi-Hop Reasoning
- In-context Autoencoder for Context Compression in a Large Language Model
- DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models
- VoCo-LLaMA: Towards Vision Compression with Large Language Models
- Progressive Compositionality in Text-to-Image Generative Models
- GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation
- Fixed Point Diffusion Models
- Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
- Visual Lexicon: Rich Image Features in Language Space
- DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation
- Scaling Language-Free Visual Representation Learning
- Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction
- Harmonizing Visual Representations for Unified Multimodal Understanding and Generation
- T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT
- Boosting Generative Image Modeling via Joint Image-Feature Synthesis
- Multi-modal Synthetic Data Training and Model Collapse: Insights from VLMs and Diffusion Models
- Mean Flows for One-step Generative Modeling
- Fractal Generative Models
- FlowTok: Flowing Seamlessly Across Text and Image Tokens
- DDT: Decoupled Diffusion Transformer
- Introducing Multiverse: The First AI Multiplayer World Model
- Lumina-Image 2.0: A Unified and Efficient Image Generative Framework
- Circuit Tracing: Revealing Computational Graphs in Language Models
- On the Biology of a Large Language Model
- PixelFlow: Pixel-Space Generative Models with Flow
- No Other Representation Component Is Needed: Diffusion Transformers Can Provide Representation Guidance by Themselves
- Visual Planning: Let’s Think Only with Images
- BLIP3-o: A Family of Fully Open Unified Multimodal Models - Architecture, Training and Dataset
- Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging
- Emerging Properties in Unified Multimodal Pretraining
- Latent Flow Transformer
- Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
- Efficient Pretraining Length Scaling
- MMaDA: Multimodal Large Diffusion Language Models
- Harnessing the Universal Geometry of Embeddings
- Diffusion Meets Flow Matching: Two Sides of the Same Coin
- Elucidating the Design Space of Diffusion-Based Generative Models
- Noise Schedules Considered Harmful
- Unified Autoregressive Visual Generation and Understanding with Continuous Tokens
- Exploring the Latent Capacity of LLMs for One-Step Text Generation
- DataRater: Meta-Learned Dataset Curation
- A Fourier Space Perspective on Diffusion Models
- An Alchemist’s Notes on Deep Learning
- Spurious Rewards: Rethinking Training Signals in RLVR
- Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing
- DetailFlow: 1D Coarse-to-Fine Autoregressive Image Generation via Next-Detail Prediction
- Hunyuan-Game: Industrial-grade Intelligent Game Creation Model
- Mathematical Theory of Deep Learning
- Atlas: Learning to Optimally Memorize the Context at Test Time
- Navigating the Latent Space Dynamics of Neural Models
- DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
- un2CLIP: Improving CLIP’s Visual Detail Capturing Ability via Inverting unCLIP
- Are Unified Vision-Language Models Necessary: Generalization Across Understanding and Generation
- VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models
- Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control
- Stochastic Interpolants: A Unifying Framework for Flows and Diffusions
- Dual-Process Image Generation
- Continuous Thought Machines
- UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation
- How much do language models memorize?
- Object Concepts Emerge from Motion
- Why Gradients Rapidly Increase Near the End of Training
- WorldExplorer: Towards Generating Fully Navigable 3D Scenes
- The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
- Data Mixing Can Induce Phase Transitions in Knowledge Acquisition
- Physics of Language Models
- Flow-GRPO: Training Flow Matching Models via Online RL
- Contrastive Flow Matching
- Demystifying Reasoning Dynamics with Mutual Information: Thinking Tokens are Information Peaks in LLM Reasoning
- ★ STARFlow: Scaling Latent Normalizing Flows for High-resolution Image Synthesis
- Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion
- Understanding Transformer from the Perspective of Associative Memory
- Flowing from Words to Pixels: A Noise-Free Framework for Cross-Modality Evolution
- Hidden in plain sight: VLMs overlook their visual representations
- Enhancing Motion Dynamics of Image-to-Video Models via Adaptive Low-Pass Guidance
- Cartridges: Lightweight and general-purpose long context representations via self-study
- A Comprehensive Study of Decoder-Only LLMs for Text-to-Image Generation
- Inductive Moment Matching
- V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
- Exploring Diffusion Transformer Designs via Grafting
- Diffuse and Disperse: Image Generation with Representation Regularization
- Highly Compressed Tokenizer Can Generate Without Training
- Edit Flows: Flow Matching with Edit Operations
- Language-Image Alignment with Fixed Text Encoders
- The Illusion of the Illusion of Thinking
- Ambient Diffusion Omni: Training Good Models with Bad Data
- Text-to-LoRA: Instant Transformer Adaption
- Visual Pre-Training on Unlabeled Images using Reinforcement Learning
- On the Closed-Form of Flow Matching: Generalization Does Not Arise from Target Stochasticity
- Reasoning by Superposition: A Theoretical Perspective on Chain of Continuous Thought
- Pisces: An Auto-regressive Foundation Model for Image Understanding and Generation
- Attention Retrieves, MLP Memorizes: Disentangling Trainable Components in the Transformer
- Human-like object concept representations emerge naturally in multimodal large language models
- From Bytes to Ideas: Language Modeling with Autoregressive U-Nets
- Rethinking Score Distillation as a Bridge Between Image Distributions
- Generative Multimodal Models are In-Context Learners
- Randomized Autoregressive Visual Generation
- How Visual Representations Map to Language Feature Space in Multimodal LLMs
- CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation
- UniFork: Exploring Modality Alignment for Unified Multimodal Understanding and Generation
- Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens
- Imagine while Reasoning in Space: Multimodal Visualization-of-Thought
- OmniGen2: Exploration to Advanced Multimodal Generation
- Describing Differences in Image Sets with Natural Language
- Vision-Language Models Create Cross-Modal Task Representations
- ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image Generation
- BlenderFusion: 3D-Grounded Visual Editing and Generative Compositing
- Perception Encoder: The best visual embeddings are not at the output of the network
- Streamline Without Sacrifice - Squeeze out Computation Redundancy in LMM
- From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning
- Improving Progressive Generation with Decomposable Flow Matching
- Diffusion Tree Sampling: Scalable inference-time alignment of diffusion models
- Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
- Drag-and-Drop LLMs: Zero-Shot Prompt-to-Weights
- VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks
- Inference-time Scaling of Diffusion Models through Classical Search
- Reflect-DiT: Inference-Time Scaling for Text-to-Image Diffusion Transformers via In-Context Reflection
- Generative Blocks World: Moving Things Around in Pictures
- Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal Data
- Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps
- MAGIC: Near-Optimal Data Attribution for Deep Learning
- Perception-R1: Pioneering Perception Policy with Reinforcement Learning
- REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers
- Can We Generate Images with CoT? Let’s Verify and Reinforce Image Generation Step by Step
- Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction
- Ovis-U1: Unified Understanding, Generation, and Editing
- DeepVerse: 4D Autoregressive Video Generation as a World Model
- GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
- Learning to Instruct for Visual Instruction Tuning
- Flexible Language Modeling in Continuous Space with Transformer-based Autoregressive Flows
- Matryoshka Representation Learning
- Adaptive Length Image Tokenization via Recurrent Allocation
- Test-Time Scaling of Diffusion Models via Noise Trajectory Search
- Matryoshka Multimodal Models
- Representation Entanglement for Generation: Training Diffusion Transformers Is Much Easier Than You Think
- SSR: Enhancing Depth Perception in Vision-Language Models via Rationale-Guided Spatial Reasoning
- Reasoning to Learn from Latent Thoughts
- HART: Efficient Visual Generation with Hybrid Autoregressive Transformer
- ControlThinker: Unveiling Latent Semantics for Controllable Image Generation through Visual Reasoning
- Not Only Text: Exploring Compositionality of Visual Representations in Vision-Language Models
- X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models
- Unified Multimodal Understanding via Byte-Pair Visual Encoding
- In-Context Learning State Vector with Inner and Momentum Optimization
- In-context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering
- Linear Spaces of Meanings: Compositional Structures in Vision-Language Models
- DICE: Distilling Classifier-Free Guidance into Text Embeddings
- Uncovering the Disentanglement Capability in Text-to-Image Diffusion Models
- Uncovering the Text Embedding in Text-to-Image Diffusion Models
- VLM-R3: Region Recognition, Reasoning, and Refinement for Enhanced Multimodal Chain-of-Thought
- Thinking with Generated Images
- GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning
- Apollo: An Exploration of Video Understanding in Large Multimodal Models
- TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability
- Latent Concept Disentanglement in Transformer-based Language Models
- JAM: Controllable and Responsible Text Generation via Causal Reasoning and Latent Vector Manipulation
- Describe Anything: Detailed Localized Image and Video Captioning
- Enough Coin Flips Can Make LLMs Act Bayesian
- Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling
- When Does Perceptual Alignment Benefit Vision Representations?
- Activation Reward Models for Few-Shot Model Alignment
- Fast and Simplex: 2-Simplicial Attention in Triton
- Steering Llama 2 via Contrastive Activation Addition
- Extracting Latent Steering Vectors from Pretrained Language Models
- Steering Language Models With Activation Engineering
- Towards Revealing the Mystery behind Chain of Thought: A Theoretical Perspective
- Do Large Language Models Latently Perform Multi-Hop Reasoning?
- Towards Understanding Chain-of-Thought Prompting: An Empirical Study of What Matters
- Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting
- Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning
- Hierarchical Text-Conditional Image Generation with CLIP Latents
- High Fidelity Visualization of What Your Self-Supervised Representation Knows About
- On the Importance of Embedding Norms in Self-Supervised Learning
- Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces
- Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere
- Contrastive Learning Inverts the Data Generating Process
- Mitigating the Discrepancy Between Video and Text Temporal Sequences: A Time-Perception Enhanced Video Grounding method for LLM
- Temporal Chain of Thought: Long-Video Understanding by Thinking in Frames
- Direct Discriminative Optimization: Your Likelihood-Based Visual Generative Model is Secretly a GAN Discriminator
- GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing
- Think before you speak: Training Language Models With Pause Tokens
- Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding
- OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation
- Look Twice Before You Answer: Memory-Space Visual Retracing for Hallucination Mitigation in Multimodal Large Language Models
- Let’s Think Dot by Dot: Hidden Computation in Transformer Language Models
- Energy-Based Transformers are Scalable Learners and Thinkers
- Null-text Inversion for Editing Real Images using Guided Diffusion Models
- A General Framework for Inference-time Scaling and Steering of Diffusion Models
- Video-T1: Test-Time Scaling for Video Generation
- The Parallelism Tradeoff: Limitations of Log-Precision Transformers
- The Generative AI Paradox: “What It Can Create, It May Not Understand”
- Birth of a Transformer: A Memory Viewpoint
- FlowAR: Scale-wise Autoregressive Image Generation Meets Flow Matching
- What happens to diffusion model likelihood when your model is conditional?
- On the rankability of visual embeddings
- Mitigating Overthinking in Large Reasoning Models via Manifold Steering
- Understanding Pre-training and Fine-tuning from Loss Landscape Perspectives
- Modern Methods in Associative Memory
- Perception Tokens Enhance Visual Reasoning in Multimodal Language Models
- A Survey on Latent Reasoning
- Soft Reasoning: Navigating Solution Spaces in Large Language Models through Controlled Embedding Exploration
- Do LLMs Really Think Step-by-step In Implicit Reasoning?
- How Do LLMs Perform Two-Hop Reasoning in Context?
- Iteration Head: A Mechanistic Study of Chain-of-Thought
- Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
- How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning
- Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization
- Steering Your Diffusion Policy with Latent Space Reinforcement Learning
- DanceGRPO: Unleashing GRPO on Visual Generation
- EVA-CLIP: Improved Training Techniques for CLIP at Scale
- Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models
- Does Data Scaling Lead to Visual Compositional Generalization?
- Small Batch Size Training for Language Models: When Vanilla SGD Works, and Why Gradient Accumulation Is Wasteful
- Prompting as Scientific Inquiry
- Single-pass Adaptive Image Tokenization for Minimum Program Search
- Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling
- Context Tuning for In-Context Optimization
- MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent
- Not All Explanations for Deep Learning Phenomena Are Equally Valuable
- VAGEN: Training VLM agents with multi-turn reinforcement learning
- Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Image Generation
- Streaming 4D Visual Geometry Transformer
- ThinkingViT: Matryoshka Thinking Vision Transformer for Elastic Inference
- The Expressive Power of Transformers with Chain of Thought
- Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space
- Seek in the Dark: Reasoning via Test-Time Instance-Level Policy Gradient in Latent Space
- Compressed Chain of Thought: Efficient Reasoning through Dense Representations
- Parallel Continuous Chain-of-Thought with Jacobi Iteration
- Efficient Reasoning with Hidden Thinking
- Enhancing Latent Computation in Transformers with Latent Tokens
- Test-Time Training Done Right
- Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation
- Multi-Actor Generative Artificial Intelligence as a Game Engine
- Feature Learning beyond the Lazy-Rich Dichotomy: Insights from Representational Geometry
- MindJourney: Test-Time Scaling with World Models for Spatial Reasoning
- Test-Time Scaling with Reflective Generative Model
- Reflective Planning: Vision-Language Models for Multi-Stage Long-Horizon Robotic Manipulation
- Franca: Nested Matryoshka Clustering for Scalable Visual Representation Learning
- Kaleido Diffusion: Improving Conditional Diffusion Models with Autoregressive Latent Modeling
- Planting a SEED of Vision in Large Language Model
- STAR: Scale-wise Text-conditioned AutoRegressive image generation
- Compositional Discrete Latent Code for High Fidelity, Productive Diffusion Models
- How Far Are We from Intelligent Visual Deductive Reasoning?
- Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning
- Latent Denoising Makes Good Visual Tokenizers
- Kimi K2: Open Agentic Intelligence
- Transition Matching: Scalable and Flexible Generative Modeling
- CoT-lized Diffusion: Let’s Reinforce T2I Generation Step-by-step
- UniRL: Self-Improving Unified Multimodal Models via Supervised and Reinforcement Learning
- Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning
- HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation
- Back to the Features: DINO as a Foundation for Video World Models
- TTS-VAR: A Test-Time Scaling Framework for Visual Auto-Regressive Generation
- X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great Again
- ReinFlow: Fine-tuning Flow Matching Policy with Online Reinforcement Learning
- Flow Matching Policy Gradients
- Qwen3 Technical Report
- DAPO: An Open-Source LLM Reinforcement Learning System at Scale
- Group Sequence Policy Optimization
- Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning
- Hierarchical Reasoning Model
- DINOv3
- The Promise of RL for Autoregressive Image Editing
- Qwen-Image Technical Report
- On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification
- Next Visual Granularity Generation
- LongPerceptualThoughts: Distilling System-2 Reasoning for System-1 Perception
- SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models
- Demystifying Long Chain-of-Thought Reasoning in LLMs
- Unified Reward Model for Multimodal Understanding and Generation
- Understanding Diffusion Objectives as the ELBO with Simple Data Augmentation
- Kimi-VL Technical Report
- MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE
- Contrastive Representations for Temporal Reasoning
- Draw-In-Mind: Learning Precise Image Editing via Chain-of-Thought Imagination
- T2I-ReasonBench: Benchmarking Reasoning-Informed Text-to-Image Generation
- Easier Painting Than Thinking: Can Text-to-Image Models Set the Stage, but Not Direct the Play?
- RL’s Razor: Why Online Reinforcement Learning Forgets Less
- Coefficients-Preserving Sampling for Reinforcement Learning with Flow Matching
- Can Understanding and Generation Truly Benefit Together – or Just Coexist?
- Reusing Samples in Variance Reduction
- LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models
- Importance Weighted Autoencoders
- Directly Aligning the Full Diffusion Trajectory with Fine-Grained Human Preference
- DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps
- DiffusionNFT: Online Diffusion Reinforcement with Forward Process
- DSPO: Direct Score Preference Optimization for Diffusion Model Alignment
- Selective Underfitting in Diffusion Models
- VUGEN: Visual Understanding priors for GENeration
- Video models are zero-shot learners and reasoners
- Improving the Diffusability of Autoencoders
- Into the Rabbit Hull: From Task-Relevant Concepts in DINO to Minkowski Geometry
- STAGE: Stable and Generalizable GRPO for Autoregressive Image Generation
- The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models
- Reasoning with Exploration: An Entropy Perspective on Reinforcement Learning for LLMs
- Adapting Self-Supervised Representations as a Latent Space for Efficient Generation
- LIMA: Less Is More for Alignment
- Learning an Image Editing Model without Image Editing Pairs
- Latent Sketchpad: Sketching Visual Thoughts to Elicit Multimodal Reasoning in MLLMs
- Scaling Latent Reasoning via Looped Language Models
- ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning
Websites
Notes