Papers
- Layer by Layer: Uncovering Hidden Representations in Language Models
- s1: Simple test-time scaling
- SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer
- SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
- Can We Generate Images with CoT? Let’s Verify and Reinforce Image Generation Step by Step
- Titans: Learning to Memorize at Test Time
- ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
- Real-Time Video Generation with Pyramid Attention Broadcast
- Diffusion Models without Classifier-free Guidance
- CLIP Behaves like a Bag-of-Words Model Cross-modally but not Uni-modally
- I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models
- Deliberation in Latent Space via Differentiable Cache Augmentation
- Training Large Language Models to Reason in a Continuous Latent Space
- Hopping Too Late: Exploring the Limitations of Large Language Models on Multi-Hop Queries
- Distributional Reasoning in LLMs: Parallel Reasoning Processes in Multi-Hop Reasoning
- In-context Autoencoder for Context Compression in a Large Language Model
- DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models
- VoCo-LLaMA: Towards Vision Compression with Large Language Models
- Progressive Compositionality in Text-to-Image Generative Models
- GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation
- Fixed Point Diffusion Models
- Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
- Visual Lexicon: Rich Image Features in Language Space
- DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation
- Scaling Language-Free Visual Representation Learning
- Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction
- Harmonizing Visual Representations for Unified Multimodal Understanding and Generation
- T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT
- Boosting Generative Image Modeling via Joint Image-Feature Synthesis
- Multi-modal Synthetic Data Training and Model Collapse: Insights from VLMs and Diffusion Models
- Mean Flows for One-step Generative Modeling
- Fractal Generative Models
- FlowTok: Flowing Seamlessly Across Text and Image Tokens
- DDT: Decoupled Diffusion Transformer
- Introducing Multiverse: The First AI Multiplayer World Model
- Lumina-Image 2.0: A Unified and Efficient Image Generative Framework
- Circuit Tracing: Revealing Computational Graphs in Language Models
- On the Biology of a Large Language Model
- PixelFlow: Pixel-Space Generative Models with Flow
- No Other Representation Component Is Needed: Diffusion Transformers Can Provide Representation Guidance by Themselves
- Visual Planning: Let’s Think Only with Images
- BLIP3-o: A Family of Fully Open Unified Multimodal Models—Architecture, Training and Dataset
- Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging
- Emerging Properties in Unified Multimodal Pretraining
- Latent Flow Transformer
- Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
- Efficient Pretraining Length Scaling
- Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space
- MMaDA: Multimodal Large Diffusion Language Models
- Harnessing the Universal Geometry of Embeddings
- Diffusion Meets Flow Matching: Two Sides of the Same Coin
- Elucidating the Design Space of Diffusion-Based Generative Models
- Noise Schedules Considered Harmful
- Unified Autoregressive Visual Generation and Understanding with Continuous Tokens
- Exploring the Latent Capacity of LLMs for One-Step Text Generation
- DataRater: Meta-Learned Dataset Curation
- A Fourier Space Perspective on Diffusion Models
- An Alchemist’s Notes on Deep Learning
- Spurious Rewards: Rethinking Training Signals in RLVR
- Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing
- DetailFlow: 1D Coarse-to-Fine Autoregressive Image Generation via Next-Detail Prediction
- Hunyuan-Game: Industrial-grade Intelligent Game Creation Model
- Mathematical Theory of Deep Learning
- Atlas: Learning to Optimally Memorize the Context at Test Time
- Navigating the Latent Space Dynamics of Neural Models
- DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
- un2CLIP: Improving CLIP’s Visual Detail Capturing Ability via Inverting unCLIP
- Are Unified Vision-Language Models Necessary: Generalization Across Understanding and Generation
- VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models
- Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control
- Stochastic Interpolants: A Unifying Framework for Flows and Diffusions
- Dual-Process Image Generation
- Continuous Thought Machines
- UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation
- How much do language models memorize?
- Object Concepts Emerge from Motion
Websites