Notes · Research Reading · 2025

Papers

  • Layer by Layer: Uncovering Hidden Representations in Language Models
  • s1: Simple test-time scaling
  • SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer
  • SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
  • Can We Generate Images with CoT? Let’s Verify and Reinforce Image Generation Step by Step
  • Titans: Learning to Memorize at Test Time
  • ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding
  • DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
  • Real-Time Video Generation with Pyramid Attention Broadcast
  • Diffusion Models without Classifier-free Guidance
  • CLIP Behaves like a Bag-of-Words Model Cross-modally but not Uni-modally
  • I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models
  • Deliberation in Latent Space via Differentiable Cache Augmentation
  • Training Large Language Models to Reason in a Continuous Latent Space
  • Hopping Too Late: Exploring the Limitations of Large Language Models on Multi-Hop Queries
  • Distributional Reasoning in LLMs: Parallel Reasoning Processes in Multi-Hop Reasoning
  • In-context Autoencoder for Context Compression in a Large Language Model
  • DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models
  • VoCo-LLaMA: Towards Vision Compression with Large Language Models
  • Progressive Compositionality in Text-to-Image Generative Models
  • GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation
  • Fixed Point Diffusion Models
  • Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
  • Visual Lexicon: Rich Image Features in Language Space
  • DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation
  • Scaling Language-Free Visual Representation Learning
  • Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction
  • Harmonizing Visual Representations for Unified Multimodal Understanding and Generation
  • T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT
  • Boosting Generative Image Modeling via Joint Image-Feature Synthesis
  • Multi-modal Synthetic Data Training and Model Collapse: Insights from VLMs and Diffusion Models
  • Mean Flows for One-step Generative Modeling
  • Fractal Generative Models
  • FlowTok: Flowing Seamlessly Across Text and Image Tokens
  • DDT: Decoupled Diffusion Transformer
  • Introducing Multiverse: The First AI Multiplayer World Model
  • Lumina-Image 2.0: A Unified and Efficient Image Generative Framework
  • Circuit Tracing: Revealing Computational Graphs in Language Models
  • On the Biology of a Large Language Model
  • PixelFlow: Pixel-Space Generative Models with Flow
  • No Other Representation Component Is Needed: Diffusion Transformers Can Provide Representation Guidance by Themselves
  • Visual Planning: Let’s Think Only with Images
  • BLIP3-o: A Family of Fully Open Unified Multimodal Models—Architecture, Training and Dataset
  • Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging
  • Emerging Properties in Unified Multimodal Pretraining
  • Latent Flow Transformer
  • Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
  • Efficient Pretraining Length Scaling
  • Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space
  • MMaDA: Multimodal Large Diffusion Language Models
  • Harnessing the Universal Geometry of Embeddings
  • Diffusion Meets Flow Matching: Two Sides of the Same Coin
  • Elucidating the Design Space of Diffusion-Based Generative Models
  • Noise Schedules Considered Harmful
  • Unified Autoregressive Visual Generation and Understanding with Continuous Tokens
  • Exploring the Latent Capacity of LLMs for One-Step Text Generation
  • DataRater: Meta-Learned Dataset Curation
  • A Fourier Space Perspective on Diffusion Models
  • An Alchemist’s Notes on Deep Learning
  • Spurious Rewards: Rethinking Training Signals in RLVR
  • Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing
  • DetailFlow: 1D Coarse-to-Fine Autoregressive Image Generation via Next-Detail Prediction
  • Hunyuan-Game: Industrial-grade Intelligent Game Creation Model
  • Mathematical Theory of Deep Learning
  • Atlas: Learning to Optimally Memorize the Context at Test Time
  • Navigating the Latent Space Dynamics of Neural Models
  • DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
  • un2CLIP: Improving CLIP’s Visual Detail Capturing Ability via Inverting unCLIP
  • Are Unified Vision-Language Models Necessary: Generalization Across Understanding and Generation
  • VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models
  • Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control
  • Stochastic Interpolants: A Unifying Framework for Flows and Diffusions
  • Dual-Process Image Generation
  • Continuous Thought Machines
  • UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation
  • How much do language models memorize?
  • Object Concepts Emerge from Motion

Websites