Notes · Research Reading · 2024
Papers
- Deep Unsupervised Learning using Nonequilibrium Thermodynamics
- Denoising Diffusion Probabilistic Models
- Denoising Diffusion Implicit Models
- Diffusion Models Beat GANs on Image Synthesis
- Annealed Importance Sampling
- Generative Modeling by Estimating Gradients of the Data Distribution
- High-Resolution Image Synthesis with Latent Diffusion Models
- Score-based Generative Modeling through Stochastic Differential Equations
- Variational Diffusion Models
- Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains
- Classifier-Free Diffusion Guidance
- Adding Conditional Control to Text-to-Image Diffusion Models
- Scalable Diffusion Models with Transformers
- DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation
- Diffusion Autoencoders: Toward a Meaningful and Decodable Representation
- Return of Unconditional Generation: A Self-supervised Representation Generation Method
- LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models
- Emerging Properties in Self-Supervised Vision Transformers
- DINOv2: Learning Robust Visual Features without Supervision
- Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
- LLM-grounded Video Diffusion Models
- 3D Gaussian Splatting for Real-Time Radiance Field Rendering
- PhysDreamer: Physics-Based Interaction with 3D Objects via Video Generation
- Referee Can Play: An Alternative Approach to Conditional Generation via Model Inversion
- Denoising Autoregressive Representation Learning
- I-Design: Personalized LLM Interior Designer
- RGB↔X: Image decomposition and synthesis using material- and lighting-aware diffusion models
- V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs
- Deconstructing denoising diffusion models for self-supervised learning
- The Platonic Representation Hypothesis
- Chameleon: Mixed-Modal Early-Fusion Foundation Models
- MotionCraft: Physics-based Zero-Shot Video Generation
- DEEM: Diffusion Models Serve as the Eyes of Large Language Models for Image Perception
- Guiding a Diffusion Model with a Bad Version of Itself
- σ-GPTs: A New Approach to Autoregressive Models
- Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
- Verbalized Machine Learning: Revisiting Machine Learning with Language Models
- Video Diffusion Models
- Training Diffusion Models with Reinforcement Learning
- MuLan: Multimodal-LLM Agent for Progressive and Interactive Multi-Object Diffusion
- Bayesian Learning via Stochastic Gradient Langevin Dynamics
- Hierarchical Text-Conditional Image Generation with CLIP Latents
- Make-A-Video: Text-to-Video Generation without Text-Video Data
- Structure and Content-Guided Video Synthesis with Diffusion Models
- Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators
- FlexiViT: One Model for All Patch Sizes
- Unifying Diffusion Models’ Latent Space, with Applications to CycleDiffusion and Guidance
- Enhancing Temporal Consistency in Video Editing by Reconstructing Videos with 3D Gaussian Splatting
- KOSMOS-G: Generating Images in Context with Multimodal Large Language Models
- DreamFusion: Text-to-3D using 2D Diffusion
- Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
- Structure-from-Motion Revisited
- Stochastic Interpolants: A Unifying Framework for Flows and Diffusions
- Generative Image Dynamics
- Newtonian Image Understanding: Unfolding the Dynamics of Objects in Static Images
- Learning to See Physics via Visual De-animation
- MoDE: CLIP Data Experts via Clustering
- ImageInWords: Unlocking Hyper-Detailed Image Descriptions
- Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
- Graphic Design with Large Multimodal Model
- Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion
- Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs
- Self-correcting LLM-controlled Diffusion Models
- Reinforced Self-Training (ReST) for Language Modeling
- Automatically Generating UI Code from Screenshot: A Divide-and-Conquer-Based Approach
- Execution-based Code Generation using Deep Reinforcement Learning
- DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing
- Agentless: Demystifying LLM-based Software Engineering Agents
- DivCon: Divide and Conquer for Progressive Text-to-Image Generation
- VersaT2I: Improving Text-to-Image Models with Versatile Reward
- On Mechanistic Knowledge Localization in Text-to-Image Generative Models
- Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning
- Efficiently Modeling Long Sequences with Structured State Spaces
- LLM4GEN: Leveraging Semantic Representation of LLMs for Text-to-Image Generation
- S4ND: Modeling Images and Videos as Multidimensional Signals Using State Spaces
- Pelican: Correcting Hallucination in Vision-LLMs via Claim Decomposition and Program of Thought Verification
- ViperGPT: Visual Inference via Python Execution for Reasoning
- One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale
- Diffusion-Based Scene Graph to Image Generation with Masked Contrastive Pre-Training
- BACON: Supercharge Your VLM with Bag-of-Concept Graph to Mitigate Hallucinations
- Frozen Transformers in Language Models are Effective Visual Encoder Layers
- GenArtist: Multimodal LLM as an Agent for Unified Image Generation and Editing
- Transparent Image Layer Diffusion using Latent Transparency
- Self-Training Large Language Models for Improved Visual Program Synthesis With Visual Reinforcement
- Adaptive Patching for High-resolution Image Segmentation with Transformers
- Towards Understanding the Working Mechanism of Text-to-Image Diffusion Model
- BeyondScene: Higher-Resolution Human-Centric Scene Generation With Pretrained Diffusion
- Video Style Transfer by Consistent Adaptive Patch Sampling
- SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs
- LM4LV: A Frozen Large Language Model for Low-level Vision Tasks
- Video In-context Learning
- Transformer Alignment in Large Language Models
- Toward a Diffusion-Based Generalist for Dense Vision Tasks
- UICoder: Finetuning Large Language Models to Generate User Interface Code through Automated Feedback
- Unified Auto-Encoding with Masked Diffusion
- Prompting Hard or Hardly Prompting: Prompt Inversion for Text-to-Image Diffusion Models
- Learning UI-to-Code Reverse Generator Using Visual Critic Without Rendering
- Rolling Diffusion Models
- Prompt Highlighter: Interactive Control for Multi-Modal LLMs
- Pathformer: Multi-scale Transformers with Adaptive Pathways for Time Series Forecasting
- Adaptive Fourier Neural Operators: Efficient Token Mixers for Transformers
- When Representations Align: Universality in Representation Learning Dynamics
- Diffusion Models as Plug-and-Play Priors
- Deep Reward Supervisions for Tuning Text-to-Image Diffusion Models
- Aligning Text-to-Image Diffusion Models with Reward Backpropagation
- Directly Fine-Tuning Diffusion Models on Differentiable Rewards
- Score Distillation Sampling with Learned Manifold Corrective
- SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities
- Discriminative Probing and Tuning for Text-to-Image Generation
- Transformer Layers as Painters
- EGC: Image Generation and Classification via a Diffusion Energy-Based Model
- Your Diffusion Model is Secretly a Zero-Shot Classifier
- SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
- Learning Generative Models via Discriminative Approaches
- Understanding Hallucinations in Diffusion Models through Mode Interpolation
- SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations
- TokenCompose: Text-to-Image Diffusion with Token-level Supervision
- GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
- Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis
- Rethinking Conditional Diffusion Sampling with Progressive Guidance
- GradCheck: Analyzing Classifier Guidance Gradients for Conditional Diffusion Sampling
- PixelAsParam: A Gradient View on Diffusion Sampling with Guidance
- Analyzing Multimodal Objectives Through the Lens of Generative Diffusion Guidance
- Universal Guidance for Diffusion Models
- The Generative AI Paradox: “What It Can Create, It May Not Understand”
- Diffusion Models already have a Semantic Latent Space
- Exploring Compositional Visual Generation with Latent Classifier Guidance
- Draw Your Art Dream: Diverse Digital Art Synthesis with Multimodal Guided Diffusion
- VQGAN-CLIP: Open Domain Image Generation and Editing with Natural Language Guidance
- Contrastive Prompts Improve Disentanglement in Text-to-Image Diffusion Models
- Image Content Generation with Causal Reasoning
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model
- RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
- Diffusion Model Alignment Using Direct Preference Optimization
- A Dense Reward View on Aligning Text-to-Image Diffusion with Preference
- DiffusionCLIP: Text-Guided Diffusion Models for Robust Image Manipulation
- Blended Diffusion for Text-driven Editing of Natural Images
- BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
- When and Why Vision-Language Models Behave like Bags-of-Words, and What to Do about It?
- Testing Relational Understanding in Text-Guided Image Generation
- TIAM - A Metric for Evaluating Alignment in Text-to-Image Generation
- Exploring the Role of Large Language Models in Prompt Encoding for Diffusion Models
- DOCCI: Descriptions of Connected and Contrasting Images
- Deep Visual-Semantic Alignments for Generating Image Descriptions
- TextGrad: Automatic “Differentiation” via Text
- NoiseCLR: A Contrastive Learning Approach for Unsupervised Discovery of Interpretable Directions in Diffusion Models
- Your Diffusion Model is Secretly a Noise Classifier and Benefits from Contrastive Training
- Self-Guided Diffusion Models
- Self-Play Fine-Tuning of Diffusion Models for Text-to-Image Generation
- Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
- Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
- Denoising Diffusion Autoencoders are Unified Self-supervised Learners
- Diffusion Feedback Helps CLIP See Better
- End-to-End Diffusion Latent Optimization Improves Classifier Guidance
- Do DALL-E and Flamingo Understand Each Other?
- Towards Accurate Guided Diffusion Sampling through Symplectic Adjoint Method
- ReNO: Enhancing One-step Text-to-Image Models through Reward-based Noise Optimization
- Self-Improving Robust Preference Optimization
- Self-Guided Generation of Minority Samples Using Diffusion Models
- Class-Conditional self-reward mechanism for improved Text-to-Image models
- Not All Noises Are Created Equally: Diffusion Noise Selection and Optimization
- DreamLLM: Synergistic Multimodal Comprehension and Creation
- LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders
- Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning
- Not All Layers of LLMs Are Necessary During Inference
- ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
- Unifying Two-Stream Encoders with Transformers for Cross-Modal Retrieval
- Several questions of visual generation in 2024
- An Empirical Study of Training End-to-End Vision-and-Language Transformers
- LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
- LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
- Flow Matching for Generative Modeling
- Improving Image Generation with Better Captions
- PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis
- Stretching Each Dollar: Diffusion Training from Scratch on a Micro-Budget
- Elucidating the Design Space of Diffusion-Based Generative Models
- InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
- Contextualized Diffusion Models for Text-Guided Image and Video Generation
- Language Rectified Flow: Advancing Diffusion Language Generation with Probabilistic Flows
- Multimodal Masked Autoencoders Learn Transferable Representations
- Fast Training of Diffusion Models with Masked Transformers
- Improving Compositional Text-to-image Generation with Large Vision-Language Models
- VISIT: Visualizing and Interpreting the Semantic Information Flow of Transformers
- Generative Representational Instruction Tuning
- Understanding the Impact of Negative Prompts: When and How Do They Take Effect?
- VisionLLaMA: A Unified LLaMA Backbone for Vision Tasks
- Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers
- MARS: Mixture of Auto-Regressive Models for Fine-grained Text-to-image Synthesis
- Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
- PaliGemma: A versatile 3B VLM for transfer
- From Pixels to Prose: A Large Dataset of Dense Image Captions
- Playground v3: Improving Text-to-Image Alignment with Deep-Fusion Large Language Models
- OmniGen: Unified Image Generation
- Diffusion of Thought: Chain-of-Thought Reasoning in Diffusion Language Models
- DreamSim: Learning New Dimensions of Human Visual Similarity using Synthetic Data
- VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks
- CREPE: Can Vision-Language Foundation Models Reason Compositionally?
- Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens
Notes
- 生成扩散模型漫谈
- 生成扩散模型 by Hammour Yue
- A Path to the Variational Diffusion Loss
- Stochastic Differential Equations and Diffusion Models by Vanilla Bug
- 一文解释 Diffusion Model
- What are Diffusion Models?
- Why KL?
- KL is all you need
- The Illustrated Stable Diffusion by Jay Alammar
- Understanding Diffusion Models: A Unified Perspective by Calvin Luo
- Mathematical Foundation of Diffusion Generative Models
- Diffusion Models for Video Generation by Lilian Weng
- Explained Latent Consistency Models
- The Illustrated VQGAN
- From Autoencoder to Beta-VAE
- Flow-based Deep Generative Models
- Structure from Motion
- 球谐函数介绍
Books
- Introduction to Stochastic Differential Equations
- Introduction to Stochastic Calculus with Applications
- An Informal Introduction to Stochastic Calculus with Applications