Notes · Research Reading · 2025

Papers

  • Layer by Layer: Uncovering Hidden Representations in Language Models
  • s1: Simple test-time scaling
  • SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer
  • SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
  • Titans: Learning to Memorize at Test Time
  • ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding
  • DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
  • Real-Time Video Generation with Pyramid Attention Broadcast
  • Diffusion Models without Classifier-free Guidance
  • CLIP Behaves like a Bag-of-Words Model Cross-modally but not Uni-modally
  • I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models
  • Deliberation in Latent Space via Differentiable Cache Augmentation
  • Training Large Language Models to Reason in a Continuous Latent Space
  • Hopping Too Late: Exploring the Limitations of Large Language Models on Multi-Hop Queries
  • Distributional Reasoning in LLMs: Parallel Reasoning Processes in Multi-Hop Reasoning
  • In-context Autoencoder for Context Compression in a Large Language Model
  • DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models
  • VoCo-LLaMA: Towards Vision Compression with Large Language Models
  • Progressive Compositionality in Text-to-Image Generative Models
  • GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation
  • Fixed Point Diffusion Models
  • Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
  • Visual Lexicon: Rich Image Features in Language Space
  • DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation
  • Scaling Language-Free Visual Representation Learning
  • Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction
  • Harmonizing Visual Representations for Unified Multimodal Understanding and Generation
  • T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT
  • Boosting Generative Image Modeling via Joint Image-Feature Synthesis
  • Multi-modal Synthetic Data Training and Model Collapse: Insights from VLMs and Diffusion Models
  • Mean Flows for One-step Generative Modeling
  • Fractal Generative Models
  • FlowTok: Flowing Seamlessly Across Text and Image Tokens
  • DDT: Decoupled Diffusion Transformer
  • Introducing Multiverse: The First AI Multiplayer World Model
  • Lumina-Image 2.0: A Unified and Efficient Image Generative Framework
  • Circuit Tracing: Revealing Computational Graphs in Language Models
  • On the Biology of a Large Language Model
  • PixelFlow: Pixel-Space Generative Models with Flow
  • No Other Representation Component Is Needed: Diffusion Transformers Can Provide Representation Guidance by Themselves
  • Visual Planning: Let’s Think Only with Images
  • BLIP3-o: A Family of Fully Open Unified Multimodal Models—Architecture, Training and Dataset
  • Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging
  • Emerging Properties in Unified Multimodal Pretraining
  • Latent Flow Transformer
  • Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
  • Efficient Pretraining Length Scaling
  • MMaDA: Multimodal Large Diffusion Language Models
  • Harnessing the Universal Geometry of Embeddings
  • Diffusion Meets Flow Matching: Two Sides of the Same Coin
  • Elucidating the Design Space of Diffusion-Based Generative Models
  • Noise Schedules Considered Harmful
  • Unified Autoregressive Visual Generation and Understanding with Continuous Tokens
  • Exploring the Latent Capacity of LLMs for One-Step Text Generation
  • DataRater: Meta-Learned Dataset Curation
  • A Fourier Space Perspective on Diffusion Models
  • An Alchemist’s Notes on Deep Learning
  • Spurious Rewards: Rethinking Training Signals in RLVR
  • Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing
  • DetailFlow: 1D Coarse-to-Fine Autoregressive Image Generation via Next-Detail Prediction
  • Hunyuan-Game: Industrial-grade Intelligent Game Creation Model
  • Mathematical Theory of Deep Learning
  • Atlas: Learning to Optimally Memorize the Context at Test Time
  • Navigating the Latent Space Dynamics of Neural Models
  • DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
  • un2CLIP: Improving CLIP’s Visual Detail Capturing Ability via Inverting unCLIP
  • Are Unified Vision-Language Models Necessary: Generalization Across Understanding and Generation
  • VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models
  • Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control
  • Stochastic Interpolants: A Unifying Framework for Flows and Diffusions
  • Dual-Process Image Generation
  • Continuous Thought Machines
  • UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation
  • How much do language models memorize?
  • Object Concepts Emerge from Motion
  • Why Gradients Rapidly Increase Near the End of Training
  • WorldExplorer: Towards Generating Fully Navigable 3D Scenes
  • The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
  • Data Mixing Can Induce Phase Transitions in Knowledge Acquisition
  • Physics of Language Models
  • Flow-GRPO: Training Flow Matching Models via Online RL
  • Contrastive Flow Matching
  • Demystifying Reasoning Dynamics with Mutual Information: Thinking Tokens are Information Peaks in LLM Reasoning
  • ★ STARFlow: Scaling Latent Normalizing Flows for High-resolution Image Synthesis
  • Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion
  • Understanding Transformer from the Perspective of Associative Memory
  • Flowing from Words to Pixels: A Noise-Free Framework for Cross-Modality Evolution
  • Hidden in plain sight: VLMs overlook their visual representations
  • Enhancing Motion Dynamics of Image-to-Video Models via Adaptive Low-Pass Guidance
  • Cartridges: Lightweight and general-purpose long context representations via self-study
  • A Comprehensive Study of Decoder-Only LLMs for Text-to-Image Generation
  • Inductive Moment Matching
  • V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
  • Exploring Diffusion Transformer Designs via Grafting
  • Diffuse and Disperse: Image Generation with Representation Regularization
  • Highly Compressed Tokenizer Can Generate Without Training
  • Edit Flows: Flow Matching with Edit Operations
  • Language-Image Alignment with Fixed Text Encoders
  • The Illusion of the Illusion of Thinking
  • Ambient Diffusion Omni: Training Good Models with Bad Data
  • Text-to-LoRA: Instant Transformer Adaption
  • Visual Pre-Training on Unlabeled Images using Reinforcement Learning
  • On the Closed-Form of Flow Matching: Generalization Does Not Arise from Target Stochasticity
  • Reasoning by Superposition: A Theoretical Perspective on Chain of Continuous Thought
  • Pisces: An Auto-regressive Foundation Model for Image Understanding and Generation
  • Attention Retrieves, MLP Memorizes: Disentangling Trainable Components in the Transformer
  • Human-like object concept representations emerge naturally in multimodal large language models
  • From Bytes to Ideas: Language Modeling with Autoregressive U-Nets
  • Rethinking Score Distillation as a Bridge Between Image Distributions
  • Generative Multimodal Models are In-Context Learners
  • Randomized Autoregressive Visual Generation
  • How Visual Representations Map to Language Feature Space in Multimodal LLMs
  • CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation
  • UniFork: Exploring Modality Alignment for Unified Multimodal Understanding and Generation
  • Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens
  • Imagine while Reasoning in Space: Multimodal Visualization-of-Thought
  • OmniGen2: Exploration to Advanced Multimodal Generation
  • Describing Differences in Image Sets with Natural Language
  • Vision-Language Models Create Cross-Modal Task Representations
  • ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image Generation
  • BlenderFusion: 3D-Grounded Visual Editing and Generative Compositing
  • Perception Encoder: The best visual embeddings are not at the output of the network
  • Streamline Without Sacrifice - Squeeze out Computation Redundancy in LMM
  • From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning
  • Improving Progressive Generation with Decomposable Flow Matching
  • Diffusion Tree Sampling: Scalable inference-time alignment of diffusion models
  • Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
  • Drag-and-Drop LLMs: Zero-Shot Prompt-to-Weights
  • VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks
  • Inference-time Scaling of Diffusion Models through Classical Search
  • Reflect-DiT: Inference-Time Scaling for Text-to-Image Diffusion Transformers via In-Context Reflection
  • Generative Blocks World: Moving Things Around in Pictures
  • Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal Data
  • Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps
  • MAGIC: Near-Optimal Data Attribution for Deep Learning
  • Perception-R1: Pioneering Perception Policy with Reinforcement Learning
  • REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers
  • Can We Generate Images with CoT? Let’s Verify and Reinforce Image Generation Step by Step
  • Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction
  • Ovis-U1: Unified Understanding, Generation, and Editing
  • DeepVerse: 4D Autoregressive Video Generation as a World Model
  • GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
  • Learning to Instruct for Visual Instruction Tuning
  • Flexible Language Modeling in Continuous Space with Transformer-based Autoregressive Flows
  • Matryoshka Representation Learning
  • Adaptive Length Image Tokenization via Recurrent Allocation
  • Test-Time Scaling of Diffusion Models via Noise Trajectory Search
  • Matryoshka Multimodal Models
  • Representation Entanglement for Generation: Training Diffusion Transformers Is Much Easier Than You Think
  • SSR: Enhancing Depth Perception in Vision-Language Models via Rationale-Guided Spatial Reasoning
  • Reasoning to Learn from Latent Thoughts
  • HART: Efficient Visual Generation with Hybrid Autoregressive Transformer
  • ControlThinker: Unveiling Latent Semantics for Controllable Image Generation through Visual Reasoning
  • Not Only Text: Exploring Compositionality of Visual Representations in Vision-Language Models
  • X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models
  • Unified Multimodal Understanding via Byte-Pair Visual Encoding
  • In-Context Learning State Vector with Inner and Momentum Optimization
  • In-context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering
  • Linear Spaces of Meanings: Compositional Structures in Vision-Language Models
  • DICE: Distilling Classifier-Free Guidance into Text Embeddings
  • Uncovering the Disentanglement Capability in Text-to-Image Diffusion Models
  • Uncovering the Text Embedding in Text-to-Image Diffusion Models
  • VLM-R3: Region Recognition, Reasoning, and Refinement for Enhanced Multimodal Chain-of-Thought
  • Thinking with Generated Images
  • GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning
  • Apollo: An Exploration of Video Understanding in Large Multimodal Models
  • TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability
  • Latent Concept Disentanglement in Transformer-based Language Models
  • JAM: Controllable and Responsible Text Generation via Causal Reasoning and Latent Vector Manipulation
  • Describe Anything: Detailed Localized Image and Video Captioning
  • Enough Coin Flips Can Make LLMs Act Bayesian
  • Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling
  • When Does Perceptual Alignment Benefit Vision Representations?
  • Activation Reward Models for Few-Shot Model Alignment
  • Fast and Simplex: 2-Simplicial Attention in Triton
  • Steering Llama 2 via Contrastive Activation Addition
  • Extracting Latent Steering Vectors from Pretrained Language Models
  • Steering Language Models With Activation Engineering
  • Towards Revealing the Mystery behind Chain of Thought: A Theoretical Perspective
  • Do Large Language Models Latently Perform Multi-Hop Reasoning?
  • Towards Understanding Chain-of-Thought Prompting: An Empirical Study of What Matters
  • Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting
  • Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning
  • Hierarchical Text-Conditional Image Generation with CLIP Latents
  • High Fidelity Visualization of What Your Self-Supervised Representation Knows About
  • On the Importance of Embedding Norms in Self-Supervised Learning
  • Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces
  • Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere
  • Contrastive Learning Inverts the Data Generating Process
  • Mitigating the Discrepancy Between Video and Text Temporal Sequences: A Time-Perception Enhanced Video Grounding method for LLM
  • Temporal Chain of Thought: Long-Video Understanding by Thinking in Frames
  • Direct Discriminative Optimization: Your Likelihood-Based Visual Generative Model is Secretly a GAN Discriminator
  • GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing
  • Think before you speak: Training Language Models With Pause Tokens
  • Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding
  • OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation
  • Look Twice Before You Answer: Memory-Space Visual Retracing for Hallucination Mitigation in Multimodal Large Language Models
  • Let’s Think Dot by Dot: Hidden Computation in Transformer Language Models
  • Energy-Based Transformers are Scalable Learners and Thinkers
  • Null-text Inversion for Editing Real Images using Guided Diffusion Models
  • A General Framework for Inference-time Scaling and Steering of Diffusion Models
  • Video-T1: Test-Time Scaling for Video Generation
  • The Parallelism Tradeoff: Limitations of Log-Precision Transformers
  • The Generative AI Paradox: “What It Can Create, It May Not Understand”
  • Birth of a Transformer: A Memory Viewpoint
  • FlowAR: Scale-wise Autoregressive Image Generation Meets Flow Matching
  • What happens to diffusion model likelihood when your model is conditional?
  • On the rankability of visual embeddings
  • Mitigating Overthinking in Large Reasoning Models via Manifold Steering
  • Understanding Pre-training and Fine-tuning from Loss Landscape Perspectives
  • Modern Methods in Associative Memory
  • Perception Tokens Enhance Visual Reasoning in Multimodal Language Models
  • A Survey on Latent Reasoning
  • Soft Reasoning: Navigating Solution Spaces in Large Language Models through Controlled Embedding Exploration
  • Do LLMs Really Think Step-by-step In Implicit Reasoning?
  • How Do LLMs Perform Two-Hop Reasoning in Context?
  • Iteration Head: A Mechanistic Study of Chain-of-Thought
  • Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
  • How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning
  • Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization
  • Steering Your Diffusion Policy with Latent Space Reinforcement Learning
  • DanceGRPO: Unleashing GRPO on Visual Generation
  • EVA-CLIP: Improved Training Techniques for CLIP at Scale
  • Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models
  • Does Data Scaling Lead to Visual Compositional Generalization?
  • Small Batch Size Training for Language Models: When Vanilla SGD Works, and Why Gradient Accumulation Is Wasteful
  • Prompting as Scientific Inquiry
  • Single-pass Adaptive Image Tokenization for Minimum Program Search
  • Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling
  • Context Tuning for In-Context Optimization
  • MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent
  • Not All Explanations for Deep Learning Phenomena Are Equally Valuable
  • VAGEN: Training VLM agents with multi-turn reinforcement learning
  • Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Image Generation
  • Streaming 4D Visual Geometry Transformer
  • ThinkingViT: Matryoshka Thinking Vision Transformer for Elastic Inference
  • The Expressive Power of Transformers with Chain of Thought
  • Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space
  • Seek in the Dark: Reasoning via Test-Time Instance-Level Policy Gradient in Latent Space
  • Compressed Chain of Thought: Efficient Reasoning through Dense Representations
  • Parallel Continuous Chain-of-Thought with Jacobi Iteration
  • Efficient Reasoning with Hidden Thinking
  • Enhancing Latent Computation in Transformers with Latent Tokens
  • Test-Time Training Done Right
  • Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation
  • Multi-Actor Generative Artificial Intelligence as a Game Engine
  • Feature Learning beyond the Lazy-Rich Dichotomy: Insights from Representational Geometry
  • MindJourney: Test-Time Scaling with World Models for Spatial Reasoning
  • Test-Time Scaling with Reflective Generative Model
  • Reflective Planning: Vision-Language Models for Multi-Stage Long-Horizon Robotic Manipulation
  • Franca: Nested Matryoshka Clustering for Scalable Visual Representation Learning
  • Kaleido Diffusion: Improving Conditional Diffusion Models with Autoregressive Latent Modeling
  • Planting a SEED of Vision in Large Language Model
  • STAR: Scale-wise Text-conditioned AutoRegressive image generation
  • Compositional Discrete Latent Code for High Fidelity, Productive Diffusion Models
  • How Far Are We from Intelligent Visual Deductive Reasoning?
  • Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning
  • Latent Denoising Makes Good Visual Tokenizers
  • Kimi K2: Open Agentic Intelligence
  • Transition Matching: Scalable and Flexible Generative Modeling
  • CoT-lized Diffusion: Let’s Reinforce T2I Generation Step-by-step
  • UniRL: Self-Improving Unified Multimodal Models via Supervised and Reinforcement Learning
  • Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning
  • HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation
  • Back to the Features: DINO as a Foundation for Video World Models
  • TTS-VAR: A Test-Time Scaling Framework for Visual Auto-Regressive Generation
  • X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great Again
  • ReinFlow: Fine-tuning Flow Matching Policy with Online Reinforcement Learning
  • Flow Matching Policy Gradients
  • Qwen3 Technical Report
  • DAPO: An Open-Source LLM Reinforcement Learning System at Scale
  • Group Sequence Policy Optimization
  • Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning
  • Hierarchical Reasoning Model
  • DINOv3
  • The Promise of RL for Autoregressive Image Editing
  • Qwen-Image Technical Report
  • On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification
  • Next Visual Granularity Generation
  • LongPerceptualThoughts: Distilling System-2 Reasoning for System-1 Perception
  • SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models
  • Demystifying Long Chain-of-Thought Reasoning in LLMs
  • Unified Reward Model for Multimodal Understanding and Generation
  • Understanding Diffusion Objectives as the ELBO with Simple Data Augmentation
  • Kimi-VL Technical Report
  • MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE
  • Contrastive Representations for Temporal Reasoning
  • Draw-In-Mind: Learning Precise Image Editing via Chain-of-Thought Imagination
  • T2I-ReasonBench: Benchmarking Reasoning-Informed Text-to-Image Generation
  • Easier Painting Than Thinking: Can Text-to-Image Models Set the Stage, but Not Direct the Play?
  • RL’s Razor: Why Online Reinforcement Learning Forgets Less
  • Coefficients-Preserving Sampling for Reinforcement Learning with Flow Matching
  • Can Understanding and Generation Truly Benefit Together – or Just Coexist?
  • Reusing Samples in Variance Reduction
  • LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models
  • Importance Weighted Autoencoders
  • Directly Aligning the Full Diffusion Trajectory with Fine-Grained Human Preference
  • DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps
  • DiffusionNFT: Online Diffusion Reinforcement with Forward Process
  • DSPO: Direct Score Preference Optimization for Diffusion Model Alignment
  • Selective Underfitting in Diffusion Models
  • VUGEN: Visual Understanding priors for GENeration
  • Video models are zero-shot learners and reasoners
  • Improving the Diffusability of Autoencoders
  • Into the Rabbit Hull: From Task-Relevant Concepts in DINO to Minkowski Geometry
  • STAGE: Stable and Generalizable GRPO for Autoregressive Image Generation
  • The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models
  • Reasoning with Exploration: An Entropy Perspective on Reinforcement Learning for LLMs
  • Adapting Self-Supervised Representations as a Latent Space for Efficient Generation
  • LIMA: Less Is More for Alignment
  • Learning an Image Editing Model without Image Editing Pairs
  • Latent Sketchpad: Sketching Visual Thoughts to Elicit Multimodal Reasoning in MLLMs
  • Scaling Latent Reasoning via Looped Language Models
  • ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning

Websites

Notes