AV4D Workshop | ICCV 2023

Abstract

We see and hear things every second of our lives. Before sounds arrive at our ears, they are produced by objects situated in space and then transformed by their surroundings as a function of the environment's geometry, materials, etc. The binaural sound we perceive not only tells us about the semantic properties of the sound, e.g., a telephone ringing or a baby crying, but also helps us infer the spatial location of the sounding object. Both the acoustic and the spatial properties are captured by the visual stream, and they require models to go beyond 2D understanding of images (3D with audio) and to study the spatial (3D) aspect of audio in visuals (4D with audio). This is of vital importance for applications such as egocentric video understanding, robotic perception, AR/VR, etc. In support of robotic perception, where embodied agents can move around with both visual and auditory sensing, audio-visual simulations have also recently been developed to facilitate research in this direction. The goal of this workshop is to share recent progress in audio-visual studies along the spatial-temporal (4D) dimensions and to discuss which directions the field should investigate next.

The AV4D workshop will bring together researchers from different subareas of visual learning of sound in spaces, including computer vision, robotics, machine learning, room acoustics, and graphics, to examine the challenges and opportunities emerging from visual learning of sounds embodied in spaces. We will review the current state of the field and identify the research infrastructure needed to enable stronger collaboration among researchers working in these subareas.

Invited Speakers

Jiajun Wu

(Stanford)

Natalia Neverova

(FAIR, Meta AI)

Ming C. Lin

(University of Maryland)

Chuang Gan

(UMass Amherst)

Jim Glass

(MIT)

Hang Zhao

(Tsinghua University)

Call for Papers

We invite submissions of 2-4 page extended abstracts on topics related to (but not limited to):

Visual learning of spatial audio

Visual learning of room acoustics

Visual learning of impact sounds

Audio-visual self-supervised and semi-supervised learning

Audio-visual speaker localization and diarization

Audio-visual source separation

Audio-visual embodied learning

Audio-visual simulation

Robotic perception with vision and sound

Audio-visual for AR/VR

A submission should take the form of an extended abstract (2-4 pages long, excluding references) in PDF format using the ICCV style. We will accept submissions of (1) papers that have not been previously published or accepted for publication in substantially similar form; (2) papers that have been published or accepted for publication in recent venues, including journals, conferences, workshops, and arXiv; and (3) research proposals for future work with a focus on well-defined concepts and ideas. All submissions will be reviewed under a single-blind policy. Accepted extended abstracts will not appear in the ICCV proceedings and hence will not affect future publication of the work. We will publish all accepted extended abstracts on the workshop webpage.

CMT submissions website: https://cmt3.research.microsoft.com/AV4D2023

Key Dates:

Extended abstract submission deadline: August 31st, 2023 (11:59 pm Pacific time)

Notification to authors: September 12th, 2023

Camera-ready version deadline: September 24th, 2023 (11:59 pm Pacific time)

Workshop date: October 3rd, 2023

Accepted Papers

Title	Authors, Presenter (Bolded) and Format (in-person/pre-recorded video/virtual)	ID
Bi-directional Image-Speech Retrieval Through Geometric Consistency	Xinyuan Qian, Wei Xue, Qiquan Zhang, Ruijie Tao, Yiming Wang (in-person), Kainan Chen, Haizhou Li	A1
Listen and Move: Improving GANs Coherency in Agnostic Sound-to-Video Generation	Rafael Redondo (in-person)	A2
Video-guided speech inpainting transformer	Juan Felipe Montesinos, Daniel Michelsanti (in-person), Gloria Haro, Zheng-Hua Tan, Jesper Jensen	A3
Let There Be Sound: Reconstructing High Quality Speech from Silent Videos	Ji-Hoon Kim, Jaehun Kim, Joon Son Chung (in-person)	A4
Neural Acoustic Context Field: Rendering Realistic Room Impulse Response With Neural Fields	Susan Liang (virtual), Chao Huang, Yapeng Tian, Anurag Kumar, Chenliang Xu	A5
Separating Invisible Sounds Toward Universal Audio-Visual Scene-Aware Sound Separation	Yiyang Su (virtual), Ali Vosoughi, Shijian Deng, Yapeng Tian, Chenliang Xu	A6
Prompting Segmentation with Sound is Generalizable Audio-Visual Source Localizer	Yaoting Wang (virtual), Liu Weisong, Guangyao Li, Jian Ding, Di Hu, Xi Li	A7
Position-Aware Audio-Visual Separation for Spatial Audio	Yuxin Ye (virtual), Wenming Yang, Yapeng Tian	A8
Sound to Visual Scene Generation by Audio-to-Visual Latent Alignment	Kim Sung-Bin (in-person), Arda Senocak, Hyunwoo Ha, Andrew Owens, Tae-Hyun Oh	B1
SynthVSR: Scaling Up Visual Speech Recognition With Synthetic Supervision	Xubo Liu (in-person), Egor Lakomkin, Konstantinos Vougioukas, Pingchuan Ma, Honglie Chen, Ruiming Xie, Morrie Doulaty, Niko Moritz, Jachym Kolar, Stavros Petridis, Maja Pantic, Christian Fuegen	B3
Sound Source Localization is All about Cross-Modal Alignment	Arda Senocak (in-person), Hyeonggon Ryu	B4
MISAR: A Multimodal Instructional System with Augmented Reality	Jing Bi (virtual), Nguyen Manh Nguyen, Ali Vosoughi, Chenliang Xu	B5
Towards Better Egocentric Action Understanding in a Multi-Input Multi-Output View	Wenxuan Hou (virtual), Ruoxuan Feng, Yixin Xu, Yapeng Tian, Di Hu	B6
Towards Robust Active Speaker Detection	Siva Sai Nagender Vasireddy (virtual), Chenxu Zhang, Xiaohu Guo, Yapeng Tian	B7
Leveraging Foundation Models for Unsupervised Audio Visual Segmentation	Swapnil Bhosale (virtual), Haosen Yang, Diptesh Kanojia, Xiatian Zhu	B8

Presentation Instructions

We will have two paper presentation sessions: 9:30 - 10:30 and 14:15 - 15:15. During each paper session, a short Q&A session will follow every four paper presentations.

You can present a poster during the lunch and coffee breaks. Posters should be A0 size (84.1 × 118.9 cm) in landscape orientation. Please set up your poster on the poster board during breaks.

Remote presenters: please upload your five-minute presentation video to CMT before the conference.

In-person speakers: please prepare a five-minute presentation and test your laptop with the technician during breaks.

Schedule

08:50 - 09:00	Opening Remarks	Changan Chen (UT Austin)
09:00 - 09:30	Invited Talk: Combining Language and Perception	Jim Glass (MIT)
Abstract: The last decade has seen remarkable advances in deep learning methods that have dramatically influenced research in many areas of computer science, including speech and natural language processing and computer vision. An ability to learn latent embedding spaces across modalities and to learn from unlabeled data via self-supervised techniques has facilitated connections between perception and language, two of the original pillars of artificial intelligence. In this talk I review some of the recent progress our group has made towards weakly supervised learning from multimodal inputs including audio, speech, text, images, and video. Just as humans learn to perceive, understand, and act within the world by being immersed in a sea of multimodal sensory signals, we argue that multimodal input will also enable the creation of much richer and more transferable semantic concepts than unimodal training alone. I will describe several multimodal learning models that we have developed as part of our ongoing research in this area and describe recent efforts to connect these perceptual models with large language models to enable language-capable perceptual models.
09:30 - 10:30	Paper Session A	Chair: Changan Chen
10:30 - 11:15	Coffee Break and Poster Session
11:15 - 11:45	Invited Talk: Audio-Visual Physics-Aware Learning	Chuang Gan (UMass Amherst)
11:45 - 12:15	Invited Talk: Synchronized Video-to-audio Synthesis	Hang Zhao (Tsinghua University)
Abstract: Synchronized video-based speech and audio synthesis can find applications in many industries, including video conferencing, film-making, etc. This talk covers two topics: (1) video-based text-to-speech synthesis for video dubbing; (2) video-based natural sound synthesis for Foley.
12:15 - 13:15	Lunch Break
13:15 - 13:45	Invited Talk: Audio-Visual Reconstruction	Ming Lin (UMD)
Abstract: Deep neural networks trained on single- or multi-view images have enabled 3D reconstruction of objects and scenes using RGB and RGBD approaches for robotics and other 3D vision-based applications. However, existing methods still struggle in some challenging scenarios for 3D reconstruction, such as transparent and highly reflective surfaces, difficulty capturing material properties, etc. In this talk, I present recent advances in audio-visual reconstruction using a combination of audio, images and videos to recover material properties, 3D geometry, and 3D scenes for robot navigation and AR/VR applications. These approaches offer new insights for understanding complexity in reconstructing 3D dynamic worlds. I'll conclude by discussing some possible future directions and challenges.
13:45 - 14:15	Invited Talk: Replaying immersive audio-visual memories: data and challenges	Natalia Neverova (Meta AI)
Abstract: Re-creating immersive audio-visual memories on headsets has been a long-standing dream for many. In this talk, we will introduce a new Replay benchmark for this task and discuss the challenges. Replay is a collection of multi-view, multi-modal videos of humans interacting socially. Each scene is filmed in high production quality, from different viewpoints with several static cameras, as well as wearable action cameras, and recorded with a large array of microphones at different positions in the room. The Replay dataset has many potential applications, such as novel-view synthesis, 3D reconstruction, novel-view acoustic synthesis, human body and face analysis, and training generative models.
14:15 - 15:15	Paper Session B	Chair: Mingfei Chen
15:15 - 16:00	Coffee Break and Poster Session
16:00 - 16:30	Invited Talk: Multi-Sensory Neural Objects: Modeling, Datasets, and Applications	Jiajun Wu (Stanford)
16:30 - 17:00	Invited Paper Session
AutoAD II: The Sequel – Who, When, and What in Movie Audio Description	Tengda Han
AdVerb: Visually Guided Audio Dereverberation	Sanjoy Chowdhury
Be Everywhere - Hear Everything (BEE): Audio Scene Reconstruction by Sparse Audio-Visual Samples	Mingfei Chen
Sound Localization from Motion: Jointly Learning Sound Direction and Camera Rotation	Andrew Owens
17:00 - 17:10	Closing Remarks

Organizers

Changan Chen
(UT Austin)

Ruohan Gao
(Stanford)

Andrew Owens
(UMich)

David Harwath
(UT Austin)

Chuang Gan
(IBM)

Antonio Torralba
(MIT)

Andrea Vedaldi
(Oxford)

Kristen Grauman
(UT Austin & Meta AI)