September 30, 2024
UNC Computer Science researchers had 11 publications accepted by the 2024 European Conference on Computer Vision (ECCV). ECCV is a biennial conference covering computer vision and machine learning and is considered one of the top three conferences worldwide in computer vision.
The accepted papers came from the research groups of seven different UNC CS faculty members: Distinguished Professor Mohit Bansal, Assistant Professor Gedas Bertasius, Assistant Professor Tianlong Chen, Professor Marc Niethammer, Distinguished Professor Stephen M. Pizer, Assistant Professor Roni Sengupta, and Assistant Professor Huaxiu Yao.
The 11 publications cover topics including vision-language tasks, medical image analysis, video retrieval, video relighting, and safety benchmarks for learning models. Many of the papers are collaborations with external institutions, including computer science and related departments at universities in North America, Europe, and Asia; Harvard Medical School; and tech companies Meta and Enable Medicine.
Below, we’ve compiled a list of the accepted papers with brief summaries and links to learn more.
Accepted Papers
Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training
David Wan, Jaemin Cho, Elias Stengel-Eskin, Mohit Bansal
Highlighting particularly relevant regions of an image can improve the performance of vision-language models (VLMs) on vision-language (VL) tasks, but incorporating this kind of visual guidance usually requires costly training on curated data. This paper introduces Contrastive Region Guidance (CRG), a training-free guidance method that enables open-source VLMs to respond to visual prompts. CRG contrasts the model's outputs produced with and without the visual prompt, factoring out the biases the model exhibits when it answers without the visual evidence needed to produce a correct answer.
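For intuition, the contrastive step can be sketched in a few lines. This is an illustrative outline of the idea rather than the authors' implementation, and the vlm_logits callable is a hypothetical stand-in for a VLM's next-token scoring interface.

```python
def contrastive_region_guidance(vlm_logits, image, image_region_masked, prompt, alpha=1.0):
    """Guide decoding by contrasting two forward passes of the same VLM.

    `image_region_masked` is the input image with the highlighted region
    blacked out, so the second pass reveals the model's region-agnostic bias.
    """
    with_region = vlm_logits(image, prompt)                   # model sees the visual prompt
    without_region = vlm_logits(image_region_masked, prompt)  # region evidence removed
    # Amplify evidence attributable to the highlighted region and
    # subtract the bias the model shows without it.
    return (1 + alpha) * with_region - alpha * without_region
```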
Facial Affective Behavior Analysis with Instruction Tuning
Yifan Li (Michigan State), Anh Dao (Michigan State), Wentao Bao (Michigan State), Zhen Tan (Arizona State), Tianlong Chen, Huan Liu (Arizona State), Yu Kong (Michigan State)
Facial affective behavior analysis (FABA) is the task of understanding human affective states, such as emotions, from facial images. Multimodal large language models (MLLMs) have been useful for general vision understanding tasks, but the scarcity of datasets and benchmarks and low training efficiency have made them less effective for FABA. This project introduced an instruction-following dataset for selected FABA tasks, a new FABA benchmark that measures both recognition and generation ability, and a new MLLM called EmoLA.
How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs
Haoqin Tu* (UC Santa Cruz), Chenhang Cui*, Zijun Wang (UC Santa Cruz), Yiyang Zhou, Bingchen Zhao (University of Edinburgh), Junlin Han (University of Oxford), Wangchunshu Zhou (AIWaves), Huaxiu Yao, Cihang Xie (UC Santa Cruz)
*equal contribution
This collaborative paper introduces a comprehensive safety evaluation suite covering both out-of-distribution generalization and adversarial robustness for vision LLMs (VLLMs) on visual reasoning tasks. An evaluation of 21 current VLLMs found that they struggle with out-of-distribution text, though not with out-of-distribution images unless the visual information is limited. It also found that VLLMs can be easily misled by attacks that deceive only the vision encoder, and that their vision-language training often compromises the safety protocols of the underlying language model.
Leveraging Near-Field Lighting for Monocular Depth Estimation from Endoscopy Videos
Akshay Paruchuri, Samuel Ehrenstein, Shuxian Wang, Inbar Fried, Stephen M. Pizer, Marc Niethammer, Roni Sengupta
The use of a single camera, the lack of strong geometric features, and the effects of light reflection make it challenging to accurately estimate per-pixel depth in endoscopy videos. This paper uses near-field lighting, emitted by the endoscope and reflected by the tissue surface, as an additional depth cue, together with other newly introduced techniques, to generate more accurate depth maps that support more accurate diagnosis and surgical planning.
Mew: Multiplexed Immunofluorescence Image Analysis through an Efficient Multiplex Network
Sukwon Yun, Jie Peng (USTC, Hefei, China), Alexandro E. Trevino (Enable Medicine), Chanyoung Park (KAIST, Daejeon, South Korea), Tianlong Chen
Multiplexed immunofluorescence (mIF) imaging is an important technique for the simultaneous detection and visualization of multiple protein targets within a single tissue sample. Unfortunately, mIF struggles to deal with cellular heterogeneity and to scale to handle image data encompassing a large number of cells. To overcome these limitations, this paper introduces Mew, a framework designed to efficiently process mIF images through the lens of a multiplex network.
NePhi: Neural Deformation Fields for Approximately Diffeomorphic Medical Image Registration
Lin Tian, Hastings Greer, Raúl San José Estépar (Harvard Medical School), Roni Sengupta, Marc Niethammer
Training dense prediction neural networks, including 3D medical image registration networks, is very memory-intensive: as input size increases, memory consumption scales cubically, which limits the resolution of training images and the quality of predicted results. This paper introduces NePhi, a generalizable neural deformation model that requires less memory, improves inference speed and accuracy, and exhibits the excellent deformation regularity that is highly desirable for medical image registration.
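The core idea of a neural deformation field can be sketched as a small coordinate network; this is a minimal PyTorch illustration under assumed sizes and conditioning, not NePhi's actual architecture.

```python
import torch
import torch.nn as nn

class DeformationFieldSketch(nn.Module):
    """A tiny coordinate MLP: (3D point, latent code for an image pair) -> warped point.

    Because the deformation is queried pointwise, memory does not grow with a
    dense voxel grid of displacement vectors.
    """
    def __init__(self, latent_dim=64, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),  # displacement vector
        )

    def forward(self, coords, latent):
        # coords: (N, 3) query points; latent: (latent_dim,) code for the image pair
        z = latent.expand(coords.shape[0], -1)
        return coords + self.mlp(torch.cat([coords, z], dim=-1))
```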
Personalized Video Relighting With an At-Home Light Stage
Jun Myeong Choi, Max Christman, Roni Sengupta
This paper presents a real-time, personalized video relighting algorithm that allows a user superimposed on a new background to be relit in a temporally consistent manner, regardless of the user's pose, expression, or actual lighting conditions. The primary new development is a neural relighting architecture that effectively separates the intrinsic appearance features (the geometry and reflectance of the face) from the source lighting and then combines them with the target lighting to generate a relit image.
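The separate-then-recombine idea can be illustrated with a toy encoder-decoder; the layer sizes, lighting representation, and interfaces below are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class RelightingSketch(nn.Module):
    """Encode a frame into lighting-invariant intrinsic features, then decode
    them together with a target lighting code into a relit frame."""
    def __init__(self, feat_dim=256, light_dim=32):
        super().__init__()
        self.intrinsic_encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(feat_dim + light_dim, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, frame, target_light):
        # Intrinsic features: geometry/reflectance with source lighting stripped away.
        feats = self.intrinsic_encoder(frame)
        # Broadcast the target lighting code over the spatial grid and decode.
        light = target_light[:, :, None, None].expand(-1, -1, *feats.shape[-2:])
        return self.decoder(torch.cat([feats, light], dim=1))  # relit frame
```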
Propose, Assess, Search: Harnessing LLMs for Goal-Oriented Planning in Instructional Videos
Md Mohaiminul Islam*, Tushar Nagarajan (Meta), Huiyu Wang (Meta), Fu-Jen Chu (Meta), Kris Kitani (Meta), Gedas Bertasius, Xitong Yang (Meta)
*work undertaken during an internship at Meta
Intelligent assistants must anticipate the series of actions needed to complete a task over time. That necessitates comprehensive task knowledge that is typically acquired by extensive training on a specific dataset, resulting in poor generalization to tasks outside the dataset. This paper introduces VidAssist, which leverages large language models (LLMs) as both the knowledge base and the assessment tool for generating and evaluating action plans, thus overcoming the challenges of acquiring procedural knowledge from small-scale, low-diversity datasets.
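The propose-assess-search pattern in the title can be sketched as a small beam search in which an LLM both proposes candidate next actions and scores partial plans; the llm_propose and llm_score callables are hypothetical placeholders, not the VidAssist API.

```python
def plan_actions(goal, observed_steps, llm_propose, llm_score, horizon=4, beam_width=3):
    """Search over action plans using an LLM as proposer and assessor.

    llm_propose(goal, observed, partial_plan) -> list of candidate next actions
    llm_score(goal, observed, partial_plan)   -> numeric plausibility score
    """
    beams = [([], 0.0)]  # (partial plan, cumulative score)
    for _ in range(horizon):
        candidates = []
        for partial, score in beams:
            for action in llm_propose(goal, observed_steps, partial):
                extended = partial + [action]
                candidates.append((extended, score + llm_score(goal, observed_steps, extended)))
        # Keep only the most promising partial plans.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]  # highest-scoring full plan
```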
RGNet: A Unified Clip Retrieval and Grounding Network for Long Videos
Tanveer Hannan (LMU Munich, Germany, and Munich Center for Machine Learning), Md Mohaiminul Islam, Thomas Seidl (LMU Munich and Munich Center for Machine Learning), Gedas Bertasius
Locating specific moments within long videos (20 to 120 minutes) presents a significant challenge, and short-video grounding methods do not adapt well to longer videos. This paper proposes RGNet, which deeply integrates clip retrieval and grounding into a single network capable of processing long videos at multiple levels of granularity, such as clips and frames. Its core component is a novel transformer encoder, RG-Encoder, that unifies the two stages through shared features and mutual optimization.
Siamese Vision Transformers are Scalable Audio-visual Learners
Yan-Bo Lin, Gedas Bertasius
Traditional audio-visual methods rely on independent audio and visual backbones, which is costly and does not scale well. This paper investigates using an audio-visual siamese network (AVSiam) for efficient and scalable audio-visual pretraining. The new framework uses a single shared vision transformer backbone to process audio and visual inputs, improving its parameter efficiency, reducing the GPU memory footprint, and allowing for scaling to larger datasets and model sizes.
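The shared-backbone idea can be sketched as follows; the patch sizes, model dimensions, and pooling are illustrative assumptions rather than AVSiam's exact configuration.

```python
import torch
import torch.nn as nn

class SharedBackboneAV(nn.Module):
    """One transformer encoder shared by both modalities: video frames and audio
    spectrograms get separate patch embeddings but identical backbone weights."""
    def __init__(self, dim=384, depth=6, heads=6):
        super().__init__()
        self.visual_patch = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # RGB patches
        self.audio_patch = nn.Conv2d(1, dim, kernel_size=16, stride=16)   # spectrogram patches
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, depth)               # shared weights

    def encode(self, patches):
        tokens = patches.flatten(2).transpose(1, 2)  # (batch, num_patches, dim)
        return self.backbone(tokens).mean(dim=1)     # pooled clip-level embedding

    def forward(self, frames, spectrogram):
        return self.encode(self.visual_patch(frames)), self.encode(self.audio_patch(spectrogram))
```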
4Diff: 3D-Aware Diffusion Model for Third-to-First Viewpoint Translation
Feng Cheng* (Meta and UNC), Mi Luo* (UT Austin), Huiyu Wang (Meta), Alex Dimakis (UT Austin), Lorenzo Torresani (Meta), Gedas Bertasius, Kristen Grauman (Meta and UT Austin)
*equal contribution
A major component of human learning is the ability to watch a task performed by someone else and visualize performing it yourself. This collaborative paper presents 4Diff, a 3D-aware diffusion model to generate first-person (egocentric) view images from corresponding third-person (exocentric) images. 4Diff capitalizes on egocentric point cloud rasterization and 3D-aware rotary cross-attention to achieve state-of-the-art results on viewpoint translation tasks and generalizes well to environments not encountered during training.