Notes · Research Reading · 2025

Papers

  • Layer by Layer: Uncovering Hidden Representations in Language Models
  • s1: Simple test-time scaling
  • SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer
  • SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
  • Titans: Learning to Memorize at Test Time
  • ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding
  • DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
  • Real-Time Video Generation with Pyramid Attention Broadcast
  • Diffusion Models without Classifier-free Guidance
  • CLIP Behaves like a Bag-of-Words Model Cross-modally but not Uni-modally
  • I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models
  • Deliberation in Latent Space via Differentiable Cache Augmentation
  • Training Large Language Models to Reason in a Continuous Latent Space
  • Hopping Too Late: Exploring the Limitations of Large Language Models on Multi-Hop Queries
  • Distributional Reasoning in LLMs: Parallel Reasoning Processes in Multi-Hop Reasoning
  • In-context Autoencoder for Context Compression in a Large Language Model
  • DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models
  • VoCo-LLaMA: Towards Vision Compression with Large Language Models
  • Progressive Compositionality in Text-to-Image Generative Models
  • GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation
  • Fixed Point Diffusion Models
  • Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
  • Visual Lexicon: Rich Image Features in Language Space
  • DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation
  • Scaling Language-Free Visual Representation Learning
  • Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction
  • Harmonizing Visual Representations for Unified Multimodal Understanding and Generation
  • T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT
  • Boosting Generative Image Modeling via Joint Image-Feature Synthesis
  • Multi-modal Synthetic Data Training and Model Collapse: Insights from VLMs and Diffusion Models
  • Mean Flows for One-step Generative Modeling
  • Fractal Generative Models
  • FlowTok: Flowing Seamlessly Across Text and Image Tokens
  • DDT: Decoupled Diffusion Transformer
  • Introducing Multiverse: The First AI Multiplayer World Model
  • Lumina-Image 2.0: A Unified and Efficient Image Generative Framework
  • Circuit Tracing: Revealing Computational Graphs in Language Models
  • On the Biology of a Large Language Model
  • PixelFlow: Pixel-Space Generative Models with Flow
  • No Other Representation Component Is Needed: Diffusion Transformers Can Provide Representation Guidance by Themselves
  • Visual Planning: Let’s Think Only with Images
  • BLIP3-o: A Family of Fully Open Unified Multimodal Models—Architecture, Training and Dataset
  • Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging
  • Emerging Properties in Unified Multimodal Pretraining
  • Latent Flow Transformer
  • Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
  • Efficient Pretraining Length Scaling
  • Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space
  • MMaDA: Multimodal Large Diffusion Language Models
  • Harnessing the Universal Geometry of Embeddings
  • Diffusion Meets Flow Matching: Two Sides of the Same Coin
  • Elucidating the Design Space of Diffusion-Based Generative Models
  • Noise Schedules Considered Harmful
  • Unified Autoregressive Visual Generation and Understanding with Continuous Tokens
  • Exploring the Latent Capacity of LLMs for One-Step Text Generation
  • DataRater: Meta-Learned Dataset Curation
  • A Fourier Space Perspective on Diffusion Models
  • An Alchemist’s Notes on Deep Learning
  • Spurious Rewards: Rethinking Training Signals in RLVR
  • Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing
  • DetailFlow: 1D Coarse-to-Fine Autoregressive Image Generation via Next-Detail Prediction
  • Hunyuan-Game: Industrial-grade Intelligent Game Creation Model
  • Mathematical Theory of Deep Learning
  • Atlas: Learning to Optimally Memorize the Context at Test Time
  • Navigating the Latent Space Dynamics of Neural Models
  • DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
  • un2CLIP: Improving CLIP’s Visual Detail Capturing Ability via Inverting unCLIP
  • Are Unified Vision-Language Models Necessary: Generalization Across Understanding and Generation
  • VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models
  • Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control
  • Stochastic Interpolants: A Unifying Framework for Flows and Diffusions
  • Dual-Process Image Generation
  • Continuous Thought Machines
  • UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation
  • How much do language models memorize?
  • Object Concepts Emerge from Motion
  • Why Gradients Rapidly Increase Near the End of Training
  • WorldExplorer: Towards Generating Fully Navigable 3D Scenes
  • The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
  • Data Mixing Can Induce Phase Transitions in Knowledge Acquisition
  • Physics of Language Models
  • Flow-GRPO: Training Flow Matching Models via Online RL
  • Contrastive Flow Matching
  • Demystifying Reasoning Dynamics with Mutual Information: Thinking Tokens are Information Peaks in LLM Reasoning
  • ★ STARFlow: Scaling Latent Normalizing Flows for High-resolution Image Synthesis
  • Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion
  • Understanding Transformer from the Perspective of Associative Memory
  • Flowing from Words to Pixels: A Noise-Free Framework for Cross-Modality Evolution
  • Hidden in plain sight: VLMs overlook their visual representations
  • Enhancing Motion Dynamics of Image-to-Video Models via Adaptive Low-Pass Guidance
  • Cartridges: Lightweight and general-purpose long context representations via self-study
  • A Comprehensive Study of Decoder-Only LLMs for Text-to-Image Generation
  • Inductive Moment Matching
  • V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
  • Exploring Diffusion Transformer Designs via Grafting
  • Diffuse and Disperse: Image Generation with Representation Regularization
  • Highly Compressed Tokenizer Can Generate Without Training
  • Edit Flows: Flow Matching with Edit Operations
  • Language-Image Alignment with Fixed Text Encoders
  • Seek in the Dark: Reasoning via Test-Time Instance-Level Policy Gradient in Latent Space
  • The Illusion of the Illusion of Thinking
  • Ambient Diffusion Omni: Training Good Models with Bad Data
  • Text-to-LoRA: Instant Transformer Adaption
  • Visual Pre-Training on Unlabeled Images using Reinforcement Learning
  • On the Closed-Form of Flow Matching: Generalization Does Not Arise from Target Stochasticity
  • Reasoning by Superposition: A Theoretical Perspective on Chain of Continuous Thought
  • Pisces: An Auto-regressive Foundation Model for Image Understanding and Generation
  • Attention Retrieves, MLP Memorizes: Disentangling Trainable Components in the Transformer
  • Human-like object concept representations emerge naturally in multimodal large language models
  • From Bytes to Ideas: Language Modeling with Autoregressive U-Nets
  • Rethinking Score Distillation as a Bridge Between Image Distributions
  • Generative Multimodal Models are In-Context Learners
  • Randomized Autoregressive Visual Generation
  • How Visual Representations Map to Language Feature Space in Multimodal LLMs
  • CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation
  • UniFork: Exploring Modality Alignment for Unified Multimodal Understanding and Generation
  • Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens
  • Imagine while Reasoning in Space: Multimodal Visualization-of-Thought
  • OmniGen2: Exploration to Advanced Multimodal Generation
  • Describing Differences in Image Sets with Natural Language
  • Vision-Language Models Create Cross-Modal Task Representations
  • ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image Generation
  • BlenderFusion: 3D-Grounded Visual Editing and Generative Compositing
  • Perception Encoder: The best visual embeddings are not at the output of the network
  • Streamline Without Sacrifice - Squeeze out Computation Redundancy in LMM
  • From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning
  • Improving Progressive Generation with Decomposable Flow Matching
  • Diffusion Tree Sampling: Scalable inference-time alignment of diffusion models
  • Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
  • Drag-and-Drop LLMs: Zero-Shot Prompt-to-Weights
  • VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks
  • Inference-time Scaling of Diffusion Models through Classical Search
  • Reflect-DiT: Inference-Time Scaling for Text-to-Image Diffusion Transformers via In-Context Reflection
  • Generative Blocks World: Moving Things Around in Pictures
  • Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal Data
  • Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps
  • MAGIC: Near-Optimal Data Attribution for Deep Learning
  • Perception-R1: Pioneering Perception Policy with Reinforcement Learning
  • REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers
  • Can We Generate Images with CoT? Let’s Verify and Reinforce Image Generation Step by Step
  • Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction
  • Ovis-U1: Unified Understanding, Generation, and Editing
  • DeepVerse: 4D Autoregressive Video Generation as a World Model
  • GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
  • Learning to Instruct for Visual Instruction Tuning
  • Flexible Language Modeling in Continuous Space with Transformer-based Autoregressive Flows
  • Matryoshka Representation Learning
  • Adaptive Length Image Tokenization via Recurrent Allocation
  • Test-Time Scaling of Diffusion Models via Noise Trajectory Search
  • Matryoshka Multimodal Models
  • Representation Entanglement for Generation: Training Diffusion Transformers Is Much Easier Than You Think
  • SSR: Enhancing Depth Perception in Vision-Language Models via Rationale-Guided Spatial Reasoning
  • Reasoning to Learn from Latent Thoughts
  • HART: Efficient Visual Generation with Hybrid Autoregressive Transformer
  • ControlThinker: Unveiling Latent Semantics for Controllable Image Generation through Visual Reasoning
  • Not Only Text: Exploring Compositionality of Visual Representations in Vision-Language Models
  • X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models
  • Unified Multimodal Understanding via Byte-Pair Visual Encoding
  • In-Context Learning State Vector with Inner and Momentum Optimization
  • In-context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering
  • Linear Spaces of Meanings: Compositional Structures in Vision-Language Models
  • DICE: Distilling Classifier-Free Guidance into Text Embeddings
  • Uncovering the Disentanglement Capability in Text-to-Image Diffusion Models
  • Uncovering the Text Embedding in Text-to-Image Diffusion Models
  • VLM-R3: Region Recognition, Reasoning, and Refinement for Enhanced Multimodal Chain-of-Thought
  • Thinking with Generated Images
  • GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning
  • Apollo: An Exploration of Video Understanding in Large Multimodal Models
  • TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability
  • Latent Concept Disentanglement in Transformer-based Language Models
  • JAM: Controllable and Responsible Text Generation via Causal Reasoning and Latent Vector Manipulation
  • Describe Anything: Detailed Localized Image and Video Captioning
  • Enough Coin Flips Can Make LLMs Act Bayesian
  • Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling
  • When Does Perceptual Alignment Benefit Vision Representations?
  • Activation Reward Models for Few-Shot Model Alignment
  • Fast and Simplex: 2-Simplicial Attention in Triton
  • Steering Llama 2 via Contrastive Activation Addition
  • Extracting Latent Steering Vectors from Pretrained Language Models
  • Steering Language Models With Activation Engineering
  • Towards Revealing the Mystery behind Chain of Thought: A Theoretical Perspective
  • Do Large Language Models Latently Perform Multi-Hop Reasoning?
  • Towards Understanding Chain-of-Thought Prompting: An Empirical Study of What Matters
  • Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting
  • Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning
  • Hierarchical Text-Conditional Image Generation with CLIP Latents
  • High Fidelity Visualization of What Your Self-Supervised Representation Knows About
  • On the Importance of Embedding Norms in Self-Supervised Learning
  • Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces
  • Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere
  • Contrastive Learning Inverts the Data Generating Process
  • Mitigating the Discrepancy Between Video and Text Temporal Sequences: A Time-Perception Enhanced Video Grounding method for LLM
  • Temporal Chain of Thought: Long-Video Understanding by Thinking in Frames
  • Direct Discriminative Optimization: Your Likelihood-Based Visual Generative Model is Secretly a GAN Discriminator
  • GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing
  • Think before you speak: Training Language Models With Pause Tokens
  • Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding
  • OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation
  • Look Twice Before You Answer: Memory-Space Visual Retracing for Hallucination Mitigation in Multimodal Large Language Models
  • Let’s Think Dot by Dot: Hidden Computation in Transformer Language Models
  • Energy-Based Transformers are Scalable Learners and Thinkers
  • Null-text Inversion for Editing Real Images using Guided Diffusion Models
  • A General Framework for Inference-time Scaling and Steering of Diffusion Models
  • Video-T1: Test-Time Scaling for Video Generation
  • The Parallelism Tradeoff: Limitations of Log-Precision Transformers
  • The Generative AI Paradox: “What It Can Create, It May Not Understand”
  • Birth of a Transformer: A Memory Viewpoint
  • FlowAR: Scale-wise Autoregressive Image Generation Meets Flow Matching
  • What happens to diffusion model likelihood when your model is conditional?
  • On the rankability of visual embeddings
  • Mitigating Overthinking in Large Reasoning Models via Manifold Steering
  • Understanding Pre-training and Fine-tuning from Loss Landscape Perspectives
  • Modern Methods in Associative Memory
  • Perception Tokens Enhance Visual Reasoning in Multimodal Language Models
  • A Survey on Latent Reasoning
  • Compressed Chain of Thought: Efficient Reasoning through Dense Representations
  • Parallel Continuous Chain-of-Thought with Jacobi Iteration
  • Soft Reasoning: Navigating Solution Spaces in Large Language Models through Controlled Embedding Exploration
  • Do LLMs Really Think Step-by-step In Implicit Reasoning?
  • How Do LLMs Perform Two-Hop Reasoning in Context?
  • Iteration Head: A Mechanistic Study of Chain-of-Thought
  • Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
  • How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning
  • Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization
  • Steering Your Diffusion Policy with Latent Space Reinforcement Learning
  • DanceGRPO: Unleashing GRPO on Visual Generation
  • EVA-CLIP: Improved Training Techniques for CLIP at Scale
  • Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models
  • Does Data Scaling Lead to Visual Compositional Generalization?
  • Small Batch Size Training for Language Models: When Vanilla SGD Works, and Why Gradient Accumulation Is Wasteful
  • Prompting as Scientific Inquiry
  • Single-pass Adaptive Image Tokenization for Minimum Program Search
  • Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling
  • Context Tuning for In-Context Optimization
  • MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent
  • Not All Explanations for Deep Learning Phenomena Are Equally Valuable
  • VAGEN: Training VLM agents with multi-turn reinforcement learning

Websites

Notes