Minh-Quan Le

Computer Vision Lab

Stony Brook, NY, USA 11794

I am currently a third-year Ph.D. student in Computer Science at Stony Brook University, NY, USA, advised by Prof. Dimitris Samaras.

Before joining SBU, I obtained my Bachelor’s Degree in Computer Science - Honors Program at the University of Science, Vietnam National University - HCMC, under the supervision of Prof. Minh-Triet Tran, Prof. Tam Nguyen, and Dr. Trung-Nghia Le.

My research interests lie in Computer Vision and Machine Learning, with a focus on post-training methods for visual generative models and vision-language models.

news

Apr 30, 2026	My 2nd paper with Microsoft, PISCES, done during my internship, has been accepted to ICML 2026.
Sep 08, 2025	I join Google as a Student Researcher.
Mar 25, 2025	I’m joining the Computer Science Laboratory (LIX) of École Polytechnique, Paris, as a visiting student.
Jan 22, 2025	Our paper Hummingbird, done during my internship at Microsoft, has been accepted to ICLR 2025!
Oct 28, 2024	1 paper CamoFA has been accepted to WACV 2025!
Jul 01, 2024	1 paper ∞-Brush has been accepted to ECCV 2024!
May 28, 2024	I start my research internship at Microsoft, ROAR.
Feb 26, 2024	1 paper has been accepted to CVPR 2024!
Dec 08, 2023	My first A* paper MaskDiff has been accepted to AAAI 2024 (Oral).
Aug 28, 2023	I start my Ph.D. at the Department of Computer Science, Stony Brook University.

selected publications

ICML

PISCES: Annotation-free Text-to-Video Post-Training via Optimal Transport-Aligned Rewards

Minh-Quan Le^*, Gaurav Mittal^*, Cheng Zhao, David Gu, and 2 more authors

In Forty-Third International Conference on Machine Learning, 2026

Text-to-video (T2V) generation aims to synthesize videos with high visual quality and temporal consistency that are semantically aligned with input text. Reward-based post-training has emerged as a promising direction to improve the quality and semantic alignment of generated videos. However, recent methods either rely on large-scale human preference annotations or operate on misaligned embeddings from pre-trained vision-language models, leading to limited scalability or suboptimal supervision. We present 𝙿𝙸𝚂𝙲𝙴𝚂, an annotation-free post-training algorithm that addresses these limitations via a novel Dual Optimal Transport (OT)-aligned Rewards module. To align reward signals with human judgment, 𝙿𝙸𝚂𝙲𝙴𝚂 uses OT to bridge text and video embeddings at both distributional and discrete token levels, enabling reward supervision to fulfill two objectives: (i) a Distributional OT-aligned Quality Reward that captures overall visual quality and temporal coherence; and (ii) a Discrete Token-level OT-aligned Semantic Reward that enforces semantic, spatio-temporal correspondence between text and video tokens. To our knowledge, 𝙿𝙸𝚂𝙲𝙴𝚂 is the first to improve annotation-free reward supervision in generative post-training through the lens of OT. Experiments on both short- and long-video generation show that 𝙿𝙸𝚂𝙲𝙴𝚂 outperforms both annotation-based and annotation-free methods on VBench across Quality and Semantic scores, with human preference studies further validating its effectiveness. We show that the Dual OT-aligned Rewards module is compatible with multiple optimization paradigms, including direct backpropagation and reinforcement learning fine-tuning.
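
As a rough illustration of how optimal transport can score token-level alignment between text and video embeddings, here is a minimal sketch using generic entropic OT (Sinkhorn iterations). It is not the PISCES reward: the function names, the cosine cost, the uniform marginals, and the `epsilon` value are all assumptions made for this example.

```python
import math

def cosine_cost(text_tokens, video_tokens):
    """Cost matrix C[i][j] = 1 - cosine similarity (an assumed cost choice)."""
    def norm(vec):
        return math.sqrt(sum(x * x for x in vec))
    cost = []
    for t in text_tokens:
        row = []
        for v in video_tokens:
            dot = sum(a * b for a, b in zip(t, v))
            row.append(1.0 - dot / (norm(t) * norm(v)))
        cost.append(row)
    return cost

def sinkhorn_plan(cost, epsilon=0.1, n_iters=200):
    """Entropic OT plan between uniform marginals via Sinkhorn scaling."""
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # uniform mass on text tokens (assumption)
    b = [1.0 / m] * m  # uniform mass on video tokens (assumption)
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

def ot_alignment_score(text_tokens, video_tokens, epsilon=0.1):
    """Negative transport cost: higher means the token sets align better."""
    cost = cosine_cost(text_tokens, video_tokens)
    plan = sinkhorn_plan(cost, epsilon)
    total = sum(plan[i][j] * cost[i][j]
                for i in range(len(cost)) for j in range(len(cost[0])))
    return -total

# Toy 2-D embeddings: one video matches the text tokens, one opposes them.
text = [[1.0, 0.0], [0.0, 1.0]]
video_matched = [[1.0, 0.0], [0.0, 1.0]]
video_opposed = [[-1.0, 0.0], [0.0, -1.0]]
score_matched = ot_alignment_score(text, video_matched)
score_opposed = ot_alignment_score(text, video_opposed)
```

In this toy case the matched video receives the higher score, since the transport plan can pair each text token with a video token at near-zero cost.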

ICLR

Hummingbird: High Fidelity Image Generation via Multimodal Context Alignment

Minh-Quan Le^*, Gaurav Mittal^*, Tianjian Meng, A S M Iftekhar, and 4 more authors

In The Thirteenth International Conference on Learning Representations, 2025

While diffusion models are powerful in generating high-quality, diverse synthetic data for object-centric tasks, existing methods struggle with scene-aware tasks such as Visual Question Answering (VQA) and Human-Object Interaction (HOI) Reasoning, where it is critical to preserve scene attributes in generated images consistent with a multimodal context, i.e. a reference image with accompanying text guidance query. To address this, we introduce Hummingbird, the first diffusion-based image generator which, given a multimodal context, generates highly diverse images w.r.t. the reference image while ensuring high fidelity by accurately preserving scene attributes, such as object interactions and spatial relationships from the text guidance. Hummingbird employs a novel Multimodal Context Evaluator that simultaneously optimizes our formulated Global Semantic and Fine-grained Consistency Rewards to ensure generated images preserve the scene attributes of reference images in relation to the text guidance while maintaining diversity. As the first model to address the task of maintaining both diversity and fidelity given a multimodal context, we introduce a new benchmark formulation incorporating MME Perception and Bongard HOI datasets. Benchmark experiments show Hummingbird outperforms all existing methods by achieving superior fidelity while maintaining diversity, validating Hummingbird’s potential as a robust multimodal context-aligned image generator in complex visual tasks.

ECCV

∞-Brush: Controllable Large Image Synthesis with Diffusion Models in Infinite Dimensions

Minh-Quan Le^*, Alexandros Graikos^*, Srikar Yellapragada, Rajarsi Gupta, and 2 more authors

In European Conference on Computer Vision, 2024

Synthesizing high-resolution images from intricate, domain-specific information remains a significant challenge in generative modeling, particularly for applications in large-image domains such as digital histopathology and remote sensing. Existing methods face critical limitations: conditional diffusion models in pixel or latent space cannot exceed the resolution on which they were trained without losing fidelity, and computational demands increase significantly for larger image sizes. Patch-based methods offer computational efficiency but fail to capture long-range spatial relationships due to their overreliance on local information. In this paper, we introduce a novel conditional diffusion model in infinite dimensions, ∞-Brush, for controllable large image synthesis. We propose a cross-attention neural operator to enable conditioning in function space. Our model overcomes the constraints of traditional finite-dimensional diffusion models and patch-based methods, offering scalability and superior capability in preserving global image structures while maintaining fine details. To the best of our knowledge, ∞-Brush is the first conditional diffusion model in function space that can controllably synthesize images at arbitrary resolutions of up to 4096 × 4096 pixels. The code is available at https://github.com/cvlab-stonybrook/infinity-brush.

AAAI

MaskDiff: Modeling Mask Distribution with Diffusion Probabilistic Model for Few-Shot Instance Segmentation

Minh-Quan Le, Tam V Nguyen, Trung-Nghia Le, Thanh-Toan Do, and 2 more authors

In Proceedings of the AAAI Conference on Artificial Intelligence, 2024

Selected as Oral Presentation (top 2.3%).

Few-shot instance segmentation extends the few-shot learning paradigm to the instance segmentation task, which tries to segment instance objects from a query image with a few annotated examples of novel categories. Conventional approaches have attempted to address the task via prototype learning, known as point estimation. However, this mechanism depends on prototypes (e.g. mean of K-shot) for prediction, leading to performance instability. To overcome the disadvantage of the point estimation mechanism, we propose a novel approach, dubbed MaskDiff, which models the underlying conditional distribution of a binary mask, which is conditioned on an object region and K-shot information. Inspired by augmentation approaches that perturb data with Gaussian noise for populating low data density regions, we model the mask distribution with a diffusion probabilistic model. We also propose to utilize classifier-free guided mask sampling to integrate category information into the binary mask generation process. Without bells and whistles, our proposed method consistently outperforms state-of-the-art methods on both base and novel classes of the COCO dataset while simultaneously being more stable than existing methods. The source code is available at: https://github.com/minhquanlecs/MaskDiff.

CVPR

Learned representation-guided diffusion models for large-image generation

Alexandros Graikos^*, Srikar Yellapragada^*, Minh-Quan Le, Saarthak Kapse, and 3 more authors

In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

To synthesize high-fidelity samples, diffusion models typically require auxiliary data to guide the generation process. However, it is impractical to procure the painstaking patch-level annotation effort required in specialized domains like histopathology and satellite imagery; it is often performed by domain experts and involves hundreds of millions of patches. Modern-day self-supervised learning (SSL) representations encode rich semantic and visual information. In this paper, we posit that such representations are expressive enough to act as proxies to fine-grained human labels. We introduce a novel approach that trains diffusion models conditioned on embeddings from SSL. Our diffusion models successfully project these features back to high-quality histopathology and remote sensing images. In addition, we construct larger images by assembling spatially consistent patches inferred from SSL embeddings, preserving long-range dependencies. Augmenting real data by generating variations of real images improves downstream classifier accuracy for patch-level and larger, image-scale classification tasks. Our models are effective even on datasets not encountered during training, demonstrating their robustness and generalizability. Generating images from learned embeddings is agnostic to the source of the embeddings. The SSL embeddings used to generate a large image can either be extracted from a reference image, or sampled from an auxiliary model conditioned on any related modality (e.g. class labels, text, genomic data). As proof of concept, we introduce the text-to-large image synthesis paradigm where we successfully synthesize large pathology and satellite images out of text descriptions.

selected preprints

arXiv

Concurrent Image Understanding and Generation: Self-Correcting Coupled Markov Jump Processes

Minh-Quan Le, Armand Comas, Alexandros Lattas, Stylianos Moschoglou, and 6 more authors

2026

Human cognition does not separate understanding and generation. A teacher at a whiteboard speaks and draws together; each modality reshapes the other. In this paper, we bring this coupled loop to artificial systems. Masked Diffusion Models (MDMs) are ideally suited to this task, yet existing samplers either decode text and image in an interleaved order or independently update them in parallel branches that share only previous-step history, but not the other modality’s latest decisions within the same step; combined with MDMs’ inability to remask, cross-modal contradictions are neither detected nor repaired. We introduce Self-Correcting Coupled Markov Jump Processes (SC-CMJP), a framework in which one modality’s transition rates are functionals of the other modality’s confidence score, as weighted by cross-modal attention. Furthermore, a remasking jump retracts commitments the moment cross-modal evidence turns against them. In conjunction with SC-CMJP, we introduce 𝙲𝙾𝟸𝙹𝚞𝚖𝚙 (Self-COrrecting COupled Jump), a novel training-free single-pass sampler for joint multimodal generation. For training and evaluation purposes, we have created and will release three large-scale joint multimodal generation corpora: JEdit-1M, JMaze-200K, and JNono-200K, with matching in- and out-of-distribution benchmarks. 𝙲𝙾𝟸𝙹𝚞𝚖𝚙 achieves the best joint performance for image understanding and editing as well as visual reasoning (maze and nonogram solving). The performance of the sampler scales monotonically with the number of denoising steps, which is evidence that the benefits of cross-modal coupling compound across the trajectory.

arXiv

What about gravity in video generation? Post-Training Newton’s Laws with Verifiable Rewards

Minh-Quan Le, Yuanzhi Zhu, Vicky Kalogeiton, and Dimitris Samaras

2025

Recent video diffusion models can synthesize visually compelling clips, yet often violate basic physical laws (objects float, accelerations drift, and collisions behave inconsistently), revealing a persistent gap between visual realism and physical realism. We propose 𝙽𝚎𝚠𝚝𝚘𝚗𝚁𝚎𝚠𝚊𝚛𝚍𝚜, the first physics-grounded post-training framework for video generation based on verifiable rewards. Instead of relying on human or VLM feedback, 𝙽𝚎𝚠𝚝𝚘𝚗𝚁𝚎𝚠𝚊𝚛𝚍𝚜 extracts measurable proxies from generated videos using frozen utility models: optical flow serves as a proxy for velocity, while high-level appearance features serve as a proxy for mass. These proxies enable explicit enforcement of Newtonian structure through two complementary rewards: a Newtonian kinematic constraint enforcing constant-acceleration dynamics, and a mass conservation reward preventing trivial, degenerate solutions. We evaluate 𝙽𝚎𝚠𝚝𝚘𝚗𝚁𝚎𝚠𝚊𝚛𝚍𝚜 on five Newtonian Motion Primitives (free fall, horizontal/parabolic throw, and ramp sliding down/up) using our newly constructed large-scale benchmark, 𝙽𝚎𝚠𝚝𝚘𝚗𝙱𝚎𝚗𝚌𝚑-𝟼𝟶𝙺. Across all primitives and on both visual and physics metrics, 𝙽𝚎𝚠𝚝𝚘𝚗𝚁𝚎𝚠𝚊𝚛𝚍𝚜 consistently improves physical plausibility, motion smoothness, and temporal coherence over prior post-training methods. It further maintains strong performance under out-of-distribution shifts in height, speed, and friction. Our results show that physics-grounded verifiable rewards offer a scalable path toward physics-aware video generation.
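
To make the constant-acceleration idea concrete, here is a minimal sketch of one way such a check could be written. Under constant acceleration, velocity is linear in time, so its second finite difference across frames is zero. This is not the paper's reward formulation: the function names and the 1-D per-frame velocity input (standing in for an optical-flow proxy) are assumptions for illustration.

```python
def constant_acceleration_residual(velocities):
    """Mean squared second difference of a per-frame velocity sequence.

    With constant acceleration, v[t+1] - 2*v[t] + v[t-1] is zero for every
    interior frame t, so a residual of 0 means perfectly linear velocity.
    """
    if len(velocities) < 3:
        raise ValueError("need at least 3 frames to form a second difference")
    second_diffs = [
        velocities[t + 1] - 2 * velocities[t] + velocities[t - 1]
        for t in range(1, len(velocities) - 1)
    ]
    return sum(d * d for d in second_diffs) / len(second_diffs)

def kinematic_reward(velocities):
    """Hypothetical reward: higher (closer to 0) is more physically consistent."""
    return -constant_acceleration_residual(velocities)

# Velocity grows by the same amount each frame: constant acceleration.
steady = [0, 2, 4, 6, 8]
# Per-frame velocity changes vary: acceleration drifts.
drifting = [0, 2, 5, 6, 10]

residual_steady = constant_acceleration_residual(steady)
residual_drifting = constant_acceleration_residual(drifting)
```

Note that a static video (all velocities zero) also has zero residual under this check, which is consistent with the abstract pairing the kinematic constraint with a second reward that rules out trivial solutions.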