Notes · Research Reading · 2024

Papers

  • Deep Unsupervised Learning using Nonequilibrium Thermodynamics
  • Denoising Diffusion Probabilistic Models
  • Denoising Diffusion Implicit Models
  • Diffusion Models Beat GANs on Image Synthesis
  • Annealed Importance Sampling
  • Generative Modeling by Estimating Gradients of the Data Distribution
  • High-Resolution Image Synthesis with Latent Diffusion Models
  • Score-Based Generative Modeling through Stochastic Differential Equations
  • Variational Diffusion Models
  • Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains
  • Classifier-Free Diffusion Guidance
  • Adding Conditional Control to Text-to-Image Diffusion Models
  • Scalable Diffusion Models with Transformers
  • DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation
  • Diffusion Autoencoders: Toward a Meaningful and Decodable Representation
  • Return of Unconditional Generation: A Self-supervised Representation Generation Method
  • LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models
  • Emerging Properties in Self-Supervised Vision Transformers
  • DINOv2: Learning Robust Visual Features without Supervision
  • Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
  • LLM-grounded Video Diffusion Models
  • 3D Gaussian Splatting for Real-Time Radiance Field Rendering
  • PhysDreamer: Physics-Based Interaction with 3D Objects via Video Generation
  • Referee Can Play: An Alternative Approach to Conditional Generation via Model Inversion
  • Denoising Autoregressive Representation Learning
  • I-Design: Personalized LLM Interior Designer
  • RGB↔X: Image decomposition and synthesis using material- and lighting-aware diffusion models
  • V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs
  • Deconstructing Denoising Diffusion Models for Self-Supervised Learning
  • The Platonic Representation Hypothesis
  • Chameleon: Mixed-Modal Early-Fusion Foundation Models
  • MotionCraft: Physics-based Zero-Shot Video Generation
  • DEEM: Diffusion Models Serve as the Eyes of Large Language Models for Image Perception
  • Guiding a Diffusion Model with a Bad Version of Itself
  • σ-GPTs: A New Approach to Autoregressive Models
  • Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
  • Verbalized Machine Learning: Revisiting Machine Learning with Language Models
  • Video Diffusion Models
  • Training Diffusion Models with Reinforcement Learning
  • MuLan: Multimodal-LLM Agent for Progressive and Interactive Multi-Object Diffusion
  • Bayesian Learning via Stochastic Gradient Langevin Dynamics
  • Hierarchical Text-Conditional Image Generation with CLIP Latents
  • Make-A-Video: Text-to-Video Generation without Text-Video Data
  • Structure and Content-Guided Video Synthesis with Diffusion Models
  • Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators
  • FlexiViT: One Model for All Patch Sizes
  • Unifying Diffusion Models’ Latent Space, with Applications to CycleDiffusion and Guidance
  • Enhancing Temporal Consistency in Video Editing by Reconstructing Videos with 3D Gaussian Splatting
  • KOSMOS-G: Generating Images in Context with Multimodal Large Language Models
  • DreamFusion: Text-to-3D using 2D Diffusion
  • Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
  • Structure-from-Motion Revisited
  • Stochastic Interpolants: A Unifying Framework for Flows and Diffusions
  • Generative Image Dynamics
  • Newtonian Image Understanding: Unfolding the Dynamics of Objects in Static Images
  • Learning to See Physics via Visual De-animation
  • MoDE: CLIP Data Experts via Clustering
  • ImageInWords: Unlocking Hyper-Detailed Image Descriptions
  • Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
  • Graphic Design with Large Multimodal Model
  • Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion
  • Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs
  • Self-correcting LLM-controlled Diffusion Models
  • Reinforced Self-Training (ReST) for Language Modeling
  • Automatically Generating UI Code from Screenshot: A Divide-and-Conquer-Based Approach
  • Execution-based Code Generation using Deep Reinforcement Learning
  • DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing
  • Agentless: Demystifying LLM-based Software Engineering Agents
  • DivCon: Divide and Conquer for Progressive Text-to-Image Generation
  • VersaT2I: Improving Text-to-Image Models with Versatile Reward
  • On Mechanistic Knowledge Localization in Text-to-Image Generative Models
  • Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning
  • Efficiently Modeling Long Sequences with Structured State Spaces
  • LLM4GEN: Leveraging Semantic Representation of LLMs for Text-to-Image Generation
  • S4ND: Modeling Images and Videos as Multidimensional Signals Using State Spaces
  • Pelican: Correcting Hallucination in Vision-LLMs via Claim Decomposition and Program of Thought Verification
  • ViperGPT: Visual Inference via Python Execution for Reasoning
  • One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale
  • Diffusion-Based Scene Graph to Image Generation with Masked Contrastive Pre-Training
  • BACON: Supercharge Your VLM with Bag-of-Concept Graph to Mitigate Hallucinations
  • Frozen Transformers in Language Models are Effective Visual Encoder Layers
  • GenArtist: Multimodal LLM as an Agent for Unified Image Generation and Editing
  • Transparent Image Layer Diffusion using Latent Transparency
  • Self-Training Large Language Models for Improved Visual Program Synthesis With Visual Reinforcement
  • Adaptive Patching for High-resolution Image Segmentation with Transformers
  • Towards Understanding the Working Mechanism of Text-to-Image Diffusion Model
  • BeyondScene: Higher-Resolution Human-Centric Scene Generation With Pretrained Diffusion
  • Video Style Transfer by Consistent Adaptive Patch Sampling
  • SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs
  • LM4LV: A Frozen Large Language Model for Low-level Vision Tasks
  • Video In-context Learning
  • Transformer Alignment in Large Language Models
  • Toward a Diffusion-Based Generalist for Dense Vision Tasks
  • UICoder: Finetuning Large Language Models to Generate User Interface Code through Automated Feedback
  • Unified Auto-Encoding with Masked Diffusion
  • Prompting Hard or Hardly Prompting: Prompt Inversion for Text-to-Image Diffusion Models
  • Learning UI-to-Code Reverse Generator Using Visual Critic Without Rendering
  • Rolling Diffusion Models
  • Prompt Highlighter: Interactive Control for Multi-Modal LLMs
  • Pathformer: Multi-scale Transformers with Adaptive Pathways for Time Series Forecasting
  • Adaptive Fourier Neural Operators: Efficient Token Mixers for Transformers
  • When Representations Align: Universality in Representation Learning Dynamics
  • Diffusion Models as Plug-and-Play Priors
  • Deep Reward Supervisions for Tuning Text-to-Image Diffusion Models
  • Aligning Text-to-Image Diffusion Models with Reward Backpropagation
  • Directly Fine-Tuning Diffusion Models on Differentiable Rewards
  • Score Distillation Sampling with Learned Manifold Corrective
  • SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities
  • Discriminative Probing and Tuning for Text-to-Image Generation
  • Transformer Layers as Painters
  • EGC: Image Generation and Classification via a Diffusion Energy-Based Model
  • Your Diffusion Model is Secretly a Zero-Shot Classifier
  • SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
  • Learning Generative Models via Discriminative Approaches
  • Understanding Hallucinations in Diffusion Models through Mode Interpolation
  • SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations
  • TokenCompose: Text-to-Image Diffusion with Token-level Supervision
  • GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
  • Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis
  • Rethinking Conditional Diffusion Sampling with Progressive Guidance
  • GradCheck: Analyzing Classifier Guidance Gradients for Conditional Diffusion Sampling
  • PixelAsParam: A Gradient View on Diffusion Sampling with Guidance
  • Analyzing Multimodal Objectives Through the Lens of Generative Diffusion Guidance
  • Universal Guidance for Diffusion Models
  • The Generative AI Paradox: “What It Can Create, It May Not Understand”
  • Diffusion Models already have a Semantic Latent Space
  • Exploring Compositional Visual Generation with Latent Classifier Guidance
  • Draw Your Art Dream: Diverse Digital Art Synthesis with Multimodal Guided Diffusion
  • VQGAN-CLIP: Open Domain Image Generation and Editing with Natural Language Guidance
  • Contrastive Prompts Improve Disentanglement in Text-to-Image Diffusion Models
  • Image Content Generation with Causal Reasoning
  • Direct Preference Optimization: Your Language Model is Secretly a Reward Model
  • RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
  • Diffusion Model Alignment Using Direct Preference Optimization
  • A Dense Reward View on Aligning Text-to-Image Diffusion with Preference
  • DiffusionCLIP: Text-Guided Diffusion Models for Robust Image Manipulation
  • Blended Diffusion for Text-driven Editing of Natural Images
  • BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
  • When and Why Vision-Language Models Behave like Bags-of-Words, and What to Do about It?
  • Testing Relational Understanding in Text-Guided Image Generation
  • TIAM - A Metric for Evaluating Alignment in Text-to-Image Generation
  • Exploring the Role of Large Language Models in Prompt Encoding for Diffusion Models
  • DOCCI: Descriptions of Connected and Contrasting Images
  • Deep Visual-Semantic Alignments for Generating Image Descriptions
  • TextGrad: Automatic “Differentiation” via Text
  • NoiseCLR: A Contrastive Learning Approach for Unsupervised Discovery of Interpretable Directions in Diffusion Models
  • Your Diffusion Model is Secretly a Noise Classifier and Benefits from Contrastive Training
  • Self-Guided Diffusion Models
  • Self-Play Fine-Tuning of Diffusion Models for Text-to-Image Generation
  • Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
  • Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
  • Denoising Diffusion Autoencoders are Unified Self-supervised Learners
  • Diffusion Feedback Helps CLIP See Better
  • End-to-End Diffusion Latent Optimization Improves Classifier Guidance
  • Do DALL-E and Flamingo Understand Each Other?
  • Towards Accurate Guided Diffusion Sampling through Symplectic Adjoint Method
  • ReNO: Enhancing One-step Text-to-Image Models through Reward-based Noise Optimization
  • Self-Improving Robust Preference Optimization
  • Self-Guided Generation of Minority Samples Using Diffusion Models
  • Class-Conditional self-reward mechanism for improved Text-to-Image models
  • Not All Noises Are Created Equally: Diffusion Noise Selection and Optimization
  • DreamLLM: Synergistic Multimodal Comprehension and Creation
  • LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders
  • Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning
  • Not All Layers of LLMs Are Necessary During Inference
  • ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
  • Unifying Two-Stream Encoders with Transformers for Cross-Modal Retrieval
  • Several questions of visual generation in 2024
  • An Empirical Study of Training End-to-End Vision-and-Language Transformers
  • LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
  • LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
  • Flow Matching for Generative Modeling
  • Improving Image Generation with Better Captions
  • PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis
  • Stretching Each Dollar: Diffusion Training from Scratch on a Micro-Budget
  • Elucidating the Design Space of Diffusion-Based Generative Models
  • InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
  • Contextualized Diffusion Models for Text-Guided Image and Video Generation
  • Language Rectified Flow: Advancing Diffusion Language Generation with Probabilistic Flows
  • Multimodal Masked Autoencoders Learn Transferable Representations
  • Fast Training of Diffusion Models with Masked Transformers
  • Improving Compositional Text-to-image Generation with Large Vision-Language Models
  • VISIT: Visualizing and Interpreting the Semantic Information Flow of Transformers
  • Generative Representational Instruction Tuning
  • Understanding the Impact of Negative Prompts: When and How Do They Take Effect?
  • VisionLLaMA: A Unified LLaMA Backbone for Vision Tasks
  • Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers
  • MARS: Mixture of Auto-Regressive Models for Fine-grained Text-to-image Synthesis
  • Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
  • PaliGemma: A versatile 3B VLM for transfer
  • From Pixels to Prose: A Large Dataset of Dense Image Captions
  • Playground v3: Improving Text-to-Image Alignment with Deep-Fusion Large Language Models
  • OmniGen: Unified Image Generation
  • Diffusion of Thoughts: Chain-of-Thought Reasoning in Diffusion Language Models
  • DreamSim: Learning New Dimensions of Human Visual Similarity using Synthetic Data
  • VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks
  • CREPE: Can Vision-Language Foundation Models Reason Compositionally?
  • Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens

Notes

Books

  • Introduction to Stochastic Differential Equations
  • Introduction to Stochastic Calculus with Applications
  • An Informal Introduction to Stochastic Calculus with Applications

Code