CoDi-2 highlight poster at CVPR 2024
June 28, 2024

UNC Computer Science students and faculty had eight papers accepted to the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), one of the most prestigious and selective conferences in its field. Two of the papers were selected as highlights by the conference.

Contributions came from five different research groups, led by current faculty members Mohit Bansal, Gedas Bertasius, Marc Niethammer, and Huaxiu Yao, as well as Tianlong Chen, who will join the department in July for the Fall 2024 semester. The publications include multiple industry collaborations with Microsoft and Meta, a global project spanning more than 20 institutions and 740 participants, and intradepartmental team-ups within UNC CS.

Two papers were selected for highlight: “CoDi-2: In-Context Interleaved and Interactive Any-to-Any Generation” and “TFMQ-DM: Temporal Feature Maintenance Quantization for Diffusion Models.”

CoDi-2 improves on the functionality of CoDi, short for Composable Diffusion, a collaboration with Microsoft Research. CoDi-2 is a versatile and interactive Multimodal Large Language Model (MLLM) that can follow complex multimodal instructions, conduct in-context learning, reason, chat, edit, and more, in an any-to-any input-output modality paradigm. In other words, a user can write instructions and provide any combination of input modalities, such as text, images, video, and audio, and have CoDi-2 carry out a range of generative tasks to produce multimodal output.

Temporal Feature Maintenance Quantization for Diffusion Models, or TFMQ-DM, addresses a shortcoming of diffusion models, a prevalent framework for image generation: their extended inference times and substantial memory requirements limit broad applicability. The new framework preserves as much temporal information as possible and ensures end-to-end generation quality.

Below is a list of all eight papers, with very brief descriptions and a link to more information on each.

All Papers from UNC CS Personnel

Authors from UNC CS are bolded

CoDi-2: In-Context Interleaved and Interactive Any-to-Any Generation 

Zineng Tang, Ziyi Yang, Mahmoud Khademi, Yang Liu, Chenguang Zhu, Mohit Bansal

An expansion on the functionality of the original version, CoDi-2 is a collaboration with Microsoft Research to develop a generative model that can take any combination of input modalities and generate any combination of output modalities.


Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zachary Chavis, Joya Chen, Feng Cheng, Fu-Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian David Forigua Diaz, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Mohaiminul Islam, Suyog Jain, Rawal Khirodkar, Devansh Kukreja, Kevin Liang, Jia-Wei Liu, Sagnik Majumder, Yongsen Mao, Miguel Martin, Effrosyni Mavroudi, Tushar Nagarajan, Francesco Ragusa, Santhosh Kumar Ramakrishnan, Luigi Seminara, Arjun Somayazulu, Yale Song, Shan Su, Zihui Xue, Edward Zhang, Jinxu Zhang, Angela Castillo, Changan Chen, Fu Xinzhu, Ryosuke Furuta, Cristina González, Gupta, Jiabo Hu, Yifei Huang, Yiming Huang, Weslie Khoo, Anush Kumar, Robert Kuo, Sach Lakhavani, Miao Liu, Mi Luo, Zhengyi Luo, Brighid Meredith, Austin Miller, Oluwatumininu Oguntola, Xiaqing Pan, Penny Peng, Shraman Pramanick, Merey Ramazanova, Fiona Ryan, Wei Shan, Kiran Somasundaram, Chenan Song, Audrey Southerland, Masatoshi Tateno, Huiyu Wang, Yuchen Wang, Takuma Yagi, Mingfei Yan, Xitong Yang, Zecheng Yu, Shengxin Zha, Chen Zhao, Ziwei Zhao, Zhifan Zhu, Jeff Zhuo, Pablo Arbelaez, Gedas Bertasius, Dima Damen, Jakob Engel, Giovanni Maria Farinella, Antonino Furnari, Bernard Ghanem, Judy Hoffman, C.V. Jawahar, Richard Newcombe, Hyun Soo Park, James Rehg, Yoichi Sato, Manolis Savva, Jianbo Shi, Mike Zheng Shou, Michael Wray

The Ego-Exo4D project is a collaboration between several academic institutions and Meta AI to create a first-of-its-kind, large-scale, multimodal, multiview dataset that enhances AI’s perception, responsiveness, and understanding of human skill in real-world settings. Where current AI systems primarily learn from static, third-person images and videos, Ego-Exo4D captures multimodal recordings of the same skilled activities from both first-person (ego) and third-person (exo) viewpoints, paired with feedback and insight from skilled experts.


LoCoNet: Long-Short Context Network for Active Speaker Detection

Xizi Wang, Feng Cheng, Gedas Bertasius

LoCoNet tackles the challenge of automatically detecting the active speaker in video by leveraging both long-term intra-speaker context and short-term inter-speaker context. The model was developed in tandem with researchers from Indiana University.


Molecular Data Programming: Towards Molecule Pseudo-labeling with Systematic Weak Supervision

Xin Juan, Kaixiong Zhou, Ninghao Liu, Tianlong Chen, Xin Wang

This project, a collaboration between Chen and researchers from Jilin University in China, the University of Georgia, and the Massachusetts Institute of Technology, enhances molecular machine learning by improving pseudo-labeling of molecules in training data.


Multimodal Representation Learning by Alternating Unimodal Adaptation

Xiaohui Zhang, Jaehong Yoon, Mohit Bansal, Huaxiu Yao

Multimodal learning methods often struggle when some modalities appear more dominant than others during training. Multimodal Learning with Alternating Unimodal Adaptation reframes the conventional joint learning process as an alternating unimodal learning process, thereby minimizing interference between modalities.


Rethinking Interactive Image Segmentation with Low Latency, High Quality, and Diverse Prompts

Qin Liu, Jaemin Cho, Mohit Bansal, Marc Niethammer

In this project, the team delved into the architecture of both generalist and specialist image segmentation models to facilitate the development of generalist models with high segmentation quality.


TFMQ-DM: Temporal Feature Maintenance Quantization for Diffusion Models

Yushi Huang*, Ruihao Gong*, Jing Liu, Tianlong Chen, Xianglong Liu

Chen and his collaborators at Beihang University in China, AI company SenseTime Research, and Monash University in Australia proposed a Temporal Feature Maintenance Quantization (TFMQ) framework, which augments the prevalent diffusion model framework for image generation to preserve as much temporal information as possible and ensure end-to-end generation quality.


Video ReCap: Recursive Captioning of Hour-Long Videos

Md Mohaiminul Islam, Vu Bao Ngan Ho, Xitong Yang, Tushar Nagarajan, Lorenzo Torresani, Gedas Bertasius

Where most video captioning models are designed to process short video clips and generate text describing only low-level concepts, this collaboration with researchers from Meta AI developed Video ReCap, a recursive video captioning model that can process video inputs of dramatically different lengths (from 1 second to 2 hours) and output video captions at multiple hierarchy levels.