Papers
- Layer by Layer: Uncovering Hidden Representations in Language Models
- s1: Simple test-time scaling
- SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer
- SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
- Titans: Learning to Memorize at Test Time
- ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
- Real-Time Video Generation with Pyramid Attention Broadcast
- Diffusion Models without Classifier-free Guidance
- CLIP Behaves like a Bag-of-Words Model Cross-modally but not Uni-modally
- I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models
- Deliberation in Latent Space via Differentiable Cache Augmentation
- Training Large Language Models to Reason in a Continuous Latent Space
- Hopping Too Late: Exploring the Limitations of Large Language Models on Multi-Hop Queries
- Distributional Reasoning in LLMs: Parallel Reasoning Processes in Multi-Hop Reasoning
- In-context Autoencoder for Context Compression in a Large Language Model
- DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models
- VoCo-LLaMA: Towards Vision Compression with Large Language Models
- Progressive Compositionality in Text-to-Image Generative Models
- GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation
- Fixed Point Diffusion Models
- Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
- Visual Lexicon: Rich Image Features in Language Space
- DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation
- Scaling Language-Free Visual Representation Learning
- Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction
- Harmonizing Visual Representations for Unified Multimodal Understanding and Generation
- T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT
- Boosting Generative Image Modeling via Joint Image-Feature Synthesis
- Multi-modal Synthetic Data Training and Model Collapse: Insights from VLMs and Diffusion Models
- Mean Flows for One-step Generative Modeling
- Fractal Generative Models
- FlowTok: Flowing Seamlessly Across Text and Image Tokens
- DDT: Decoupled Diffusion Transformer
- Introducing Multiverse: The First AI Multiplayer World Model
- Lumina-Image 2.0: A Unified and Efficient Image Generative Framework
- Circuit Tracing: Revealing Computational Graphs in Language Models
- On the Biology of a Large Language Model
- PixelFlow: Pixel-Space Generative Models with Flow
- No Other Representation Component Is Needed: Diffusion Transformers Can Provide Representation Guidance by Themselves
- Visual Planning: Let’s Think Only with Images
- BLIP3-o: A Family of Fully Open Unified Multimodal Models - Architecture, Training and Dataset
- Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging
- Emerging Properties in Unified Multimodal Pretraining
- Latent Flow Transformer
- Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
- Efficient Pretraining Length Scaling
- MMaDA: Multimodal Large Diffusion Language Models
- Harnessing the Universal Geometry of Embeddings
- Diffusion Meets Flow Matching: Two Sides of the Same Coin
- Elucidating the Design Space of Diffusion-Based Generative Models
- Noise Schedules Considered Harmful
- Unified Autoregressive Visual Generation and Understanding with Continuous Tokens
- Exploring the Latent Capacity of LLMs for One-Step Text Generation
- DataRater: Meta-Learned Dataset Curation
- A Fourier Space Perspective on Diffusion Models
- An Alchemist’s Notes on Deep Learning
- Spurious Rewards: Rethinking Training Signals in RLVR
- Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing
- DetailFlow: 1D Coarse-to-Fine Autoregressive Image Generation via Next-Detail Prediction
- Hunyuan-Game: Industrial-grade Intelligent Game Creation Model
- Mathematical Theory of Deep Learning
- Atlas: Learning to Optimally Memorize the Context at Test Time
- Navigating the Latent Space Dynamics of Neural Models
- DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
- un2CLIP: Improving CLIP’s Visual Detail Capturing Ability via Inverting unCLIP
- Are Unified Vision-Language Models Necessary: Generalization Across Understanding and Generation
- VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models
- Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control
- Stochastic Interpolants: A Unifying Framework for Flows and Diffusions
- Dual-Process Image Generation
- Continuous Thought Machines
- UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation
- How much do language models memorize?
- Object Concepts Emerge from Motion
- Why Gradients Rapidly Increase Near the End of Training
- WorldExplorer: Towards Generating Fully Navigable 3D Scenes
- The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
- Data Mixing Can Induce Phase Transitions in Knowledge Acquisition
- Physics of Language Models
- Flow-GRPO: Training Flow Matching Models via Online RL
- Contrastive Flow Matching
- Demystifying Reasoning Dynamics with Mutual Information: Thinking Tokens are Information Peaks in LLM Reasoning
- ★ STARFlow: Scaling Latent Normalizing Flows for High-resolution Image Synthesis
- Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion
- Understanding Transformer from the Perspective of Associative Memory
- Flowing from Words to Pixels: A Noise-Free Framework for Cross-Modality Evolution
- Hidden in plain sight: VLMs overlook their visual representations
- Enhancing Motion Dynamics of Image-to-Video Models via Adaptive Low-Pass Guidance
- Cartridges: Lightweight and general-purpose long context representations via self-study
- A Comprehensive Study of Decoder-Only LLMs for Text-to-Image Generation
- Inductive Moment Matching
- V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
- Exploring Diffusion Transformer Designs via Grafting
- Diffuse and Disperse: Image Generation with Representation Regularization
- Highly Compressed Tokenizer Can Generate Without Training
- Edit Flows: Flow Matching with Edit Operations
- Language-Image Alignment with Fixed Text Encoders
- The Illusion of the Illusion of Thinking
- Ambient Diffusion Omni: Training Good Models with Bad Data
- Text-to-LoRA: Instant Transformer Adaption
- Visual Pre-Training on Unlabeled Images using Reinforcement Learning
- On the Closed-Form of Flow Matching: Generalization Does Not Arise from Target Stochasticity
- Reasoning by Superposition: A Theoretical Perspective on Chain of Continuous Thought
- Pisces: An Auto-regressive Foundation Model for Image Understanding and Generation
- Attention Retrieves, MLP Memorizes: Disentangling Trainable Components in the Transformer
- Human-like object concept representations emerge naturally in multimodal large language models
- From Bytes to Ideas: Language Modeling with Autoregressive U-Nets
- Rethinking Score Distillation as a Bridge Between Image Distributions
- Generative Multimodal Models are In-Context Learners
- Randomized Autoregressive Visual Generation
- How Visual Representations Map to Language Feature Space in Multimodal LLMs
- CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation
- UniFork: Exploring Modality Alignment for Unified Multimodal Understanding and Generation
- Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens
- Imagine while Reasoning in Space: Multimodal Visualization-of-Thought
- OmniGen2: Exploration to Advanced Multimodal Generation
- Describing Differences in Image Sets with Natural Language
- Vision-Language Models Create Cross-Modal Task Representations
- ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image Generation
- BlenderFusion: 3D-Grounded Visual Editing and Generative Compositing
- Perception Encoder: The best visual embeddings are not at the output of the network
- Streamline Without Sacrifice - Squeeze out Computation Redundancy in LMM
- From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning
- Improving Progressive Generation with Decomposable Flow Matching
- Diffusion Tree Sampling: Scalable inference-time alignment of diffusion models
- Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
- Drag-and-Drop LLMs: Zero-Shot Prompt-to-Weights
- VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks
- Inference-time Scaling of Diffusion Models through Classical Search
- Reflect-DiT: Inference-Time Scaling for Text-to-Image Diffusion Transformers via In-Context Reflection
- Generative Blocks World: Moving Things Around in Pictures
- Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal Data
- Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps
- MAGIC: Near-Optimal Data Attribution for Deep Learning
- Perception-R1: Pioneering Perception Policy with Reinforcement Learning
- REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers
- Can We Generate Images with CoT? Let’s Verify and Reinforce Image Generation Step by Step
- Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction
- Ovis-U1: Unified Understanding, Generation, and Editing
- DeepVerse: 4D Autoregressive Video Generation as a World Model
- GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
- Learning to Instruct for Visual Instruction Tuning
- Flexible Language Modeling in Continuous Space with Transformer-based Autoregressive Flows
- Matryoshka Representation Learning
- Adaptive Length Image Tokenization via Recurrent Allocation
- Test-Time Scaling of Diffusion Models via Noise Trajectory Search
- Matryoshka Multimodal Models
- Representation Entanglement for Generation: Training Diffusion Transformers Is Much Easier Than You Think
- SSR: Enhancing Depth Perception in Vision-Language Models via Rationale-Guided Spatial Reasoning
- Reasoning to Learn from Latent Thoughts
- HART: Efficient Visual Generation with Hybrid Autoregressive Transformer
- ControlThinker: Unveiling Latent Semantics for Controllable Image Generation through Visual Reasoning
- Not Only Text: Exploring Compositionality of Visual Representations in Vision-Language Models
- X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models
- Unified Multimodal Understanding via Byte-Pair Visual Encoding
- In-Context Learning State Vector with Inner and Momentum Optimization
- In-context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering
- Linear Spaces of Meanings: Compositional Structures in Vision-Language Models
- DICE: Distilling Classifier-Free Guidance into Text Embeddings
- Uncovering the Disentanglement Capability in Text-to-Image Diffusion Models
- Uncovering the Text Embedding in Text-to-Image Diffusion Models
- VLM-R3: Region Recognition, Reasoning, and Refinement for Enhanced Multimodal Chain-of-Thought
- Thinking with Generated Images
- GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning
- Apollo: An Exploration of Video Understanding in Large Multimodal Models
- TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability
- Latent Concept Disentanglement in Transformer-based Language Models
- JAM: Controllable and Responsible Text Generation via Causal Reasoning and Latent Vector Manipulation
- Describe Anything: Detailed Localized Image and Video Captioning
- Enough Coin Flips Can Make LLMs Act Bayesian
- Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling
- When Does Perceptual Alignment Benefit Vision Representations?
- Activation Reward Models for Few-Shot Model Alignment
- Fast and Simplex: 2-Simplicial Attention in Triton
- Steering Llama 2 via Contrastive Activation Addition
- Extracting Latent Steering Vectors from Pretrained Language Models
- Steering Language Models With Activation Engineering
- Towards Revealing the Mystery behind Chain of Thought: A Theoretical Perspective
- Do Large Language Models Latently Perform Multi-Hop Reasoning?
- Towards Understanding Chain-of-Thought Prompting: An Empirical Study of What Matters
- Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting
- Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning
- Hierarchical Text-Conditional Image Generation with CLIP Latents
- High Fidelity Visualization of What Your Self-Supervised Representation Knows About
- On the Importance of Embedding Norms in Self-Supervised Learning
- Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces
- Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere
- Contrastive Learning Inverts the Data Generating Process
- Mitigating the Discrepancy Between Video and Text Temporal Sequences: A Time-Perception Enhanced Video Grounding method for LLM
- Temporal Chain of Thought: Long-Video Understanding by Thinking in Frames
- Direct Discriminative Optimization: Your Likelihood-Based Visual Generative Model is Secretly a GAN Discriminator
- GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing
- Think before you speak: Training Language Models With Pause Tokens
- Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding
- OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation
- Look Twice Before You Answer: Memory-Space Visual Retracing for Hallucination Mitigation in Multimodal Large Language Models
- Let’s Think Dot by Dot: Hidden Computation in Transformer Language Models
- Energy-Based Transformers are Scalable Learners and Thinkers
- Null-text Inversion for Editing Real Images using Guided Diffusion Models
- A General Framework for Inference-time Scaling and Steering of Diffusion Models
- Video-T1: Test-Time Scaling for Video Generation
- The Parallelism Tradeoff: Limitations of Log-Precision Transformers
- The Generative AI Paradox: “What It Can Create, It May Not Understand”
- Birth of a Transformer: A Memory Viewpoint
- FlowAR: Scale-wise Autoregressive Image Generation Meets Flow Matching
- What happens to diffusion model likelihood when your model is conditional?
- On the rankability of visual embeddings
- Mitigating Overthinking in Large Reasoning Models via Manifold Steering
- Understanding Pre-training and Fine-tuning from Loss Landscape Perspectives
- Modern Methods in Associative Memory
- Perception Tokens Enhance Visual Reasoning in Multimodal Language Models
- A Survey on Latent Reasoning
- Soft Reasoning: Navigating Solution Spaces in Large Language Models through Controlled Embedding Exploration
- Do LLMs Really Think Step-by-step In Implicit Reasoning?
- How Do LLMs Perform Two-Hop Reasoning in Context?
- Iteration Head: A Mechanistic Study of Chain-of-Thought
- Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
- How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning
- Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization
- Steering Your Diffusion Policy with Latent Space Reinforcement Learning
- DanceGRPO: Unleashing GRPO on Visual Generation
- EVA-CLIP: Improved Training Techniques for CLIP at Scale
- Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models
- Does Data Scaling Lead to Visual Compositional Generalization?
- Small Batch Size Training for Language Models: When Vanilla SGD Works, and Why Gradient Accumulation Is Wasteful
- Prompting as Scientific Inquiry
- Single-pass Adaptive Image Tokenization for Minimum Program Search
- Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling
- Context Tuning for In-Context Optimization
- MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent
- Not All Explanations for Deep Learning Phenomena Are Equally Valuable
- VAGEN: Training VLM agents with multi-turn reinforcement learning
- Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Image Generation
- Streaming 4D Visual Geometry Transformer
- ThinkingViT: Matryoshka Thinking Vision Transformer for Elastic Inference
- The Expressive Power of Transformers with Chain of Thought
- Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space
- Seek in the Dark: Reasoning via Test-Time Instance-Level Policy Gradient in Latent Space
- Compressed Chain of Thought: Efficient Reasoning through Dense Representations
- Parallel Continuous Chain-of-Thought with Jacobi Iteration
- Efficient Reasoning with Hidden Thinking
- Enhancing Latent Computation in Transformers with Latent Tokens
- Test-Time Training Done Right
- Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation
- Multi-Actor Generative Artificial Intelligence as a Game Engine
- Feature Learning beyond the Lazy-Rich Dichotomy: Insights from Representational Geometry
- MindJourney: Test-Time Scaling with World Models for Spatial Reasoning
- Test-Time Scaling with Reflective Generative Model
- Reflective Planning: Vision-Language Models for Multi-Stage Long-Horizon Robotic Manipulation
- Franca: Nested Matryoshka Clustering for Scalable Visual Representation Learning
- Kaleido Diffusion: Improving Conditional Diffusion Models with Autoregressive Latent Modeling
- Planting a SEED of Vision in Large Language Model
- STAR: Scale-wise Text-conditioned AutoRegressive image generation
- Compositional Discrete Latent Code for High Fidelity, Productive Diffusion Models
- How Far Are We from Intelligent Visual Deductive Reasoning?
- Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning
- Latent Denoising Makes Good Visual Tokenizers
- Kimi K2: Open Agentic Intelligence
- Transition Matching: Scalable and Flexible Generative Modeling
- CoT-lized Diffusion: Let’s Reinforce T2I Generation Step-by-step
- UniRL: Self-Improving Unified Multimodal Models via Supervised and Reinforcement Learning
- Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning
- HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation
- Back to the Features: DINO as a Foundation for Video World Models
- TTS-VAR: A Test-Time Scaling Framework for Visual Auto-Regressive Generation
- X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great Again
- ReinFlow: Fine-tuning Flow Matching Policy with Online Reinforcement Learning
- Flow Matching Policy Gradients
- Qwen3 Technical Report
- DAPO: An Open-Source LLM Reinforcement Learning System at Scale
- Group Sequence Policy Optimization
- Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning
- Hierarchical Reasoning Model
- DINOv3
- The Promise of RL for Autoregressive Image Editing
- Qwen-Image Technical Report
- On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification
- Next Visual Granularity Generation
- LongPerceptualThoughts: Distilling System-2 Reasoning for System-1 Perception
- SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models
- Demystifying Long Chain-of-Thought Reasoning in LLMs
- Unified Reward Model for Multimodal Understanding and Generation
- Understanding Diffusion Objectives as the ELBO with Simple Data Augmentation
- Kimi-VL Technical Report
- MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE
- Contrastive Representations for Temporal Reasoning
- Draw-In-Mind: Learning Precise Image Editing via Chain-of-Thought Imagination
- T2I-ReasonBench: Benchmarking Reasoning-Informed Text-to-Image Generation
- Easier Painting Than Thinking: Can Text-to-Image Models Set the Stage, but Not Direct the Play?
- RL’s Razor: Why Online Reinforcement Learning Forgets Less
- Coefficients-Preserving Sampling for Reinforcement Learning with Flow Matching
- Can Understanding and Generation Truly Benefit Together – or Just Coexist?
- Reusing Samples in Variance Reduction
- LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models
- Importance Weighted Autoencoders
- Directly Aligning the Full Diffusion Trajectory with Fine-Grained Human Preference
- DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps
- DiffusionNFT: Online Diffusion Reinforcement with Forward Process
- DSPO: Direct Score Preference Optimization for Diffusion Model Alignment
- Selective Underfitting in Diffusion Models
- VUGEN: Visual Understanding priors for GENeration
- Video models are zero-shot learners and reasoners
- Improving the Diffusability of Autoencoders
- Into the Rabbit Hull: From Task-Relevant Concepts in DINO to Minkowski Geometry
- STAGE: Stable and Generalizable GRPO for Autoregressive Image Generation
- The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models
- Reasoning with Exploration: An Entropy Perspective on Reinforcement Learning for LLMs
- Adapting Self-Supervised Representations as a Latent Space for Efficient Generation
- LIMA: Less Is More for Alignment
- Learning an Image Editing Model without Image Editing Pairs
- Latent Sketchpad: Sketching Visual Thoughts to Elicit Multimodal Reasoning in MLLMs
- Scaling Latent Reasoning via Looped Language Models
- ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning
Websites
Notes