We see and hear things every second of our lives. Before sounds arrive at our ears, they are produced by objects situated in a space and then transformed by their surroundings as a function of the environment's geometry, materials, etc. The binaural sound we perceive not only tells us about the semantic properties of the sound, e.g., a telephone ringing or a baby crying, but also helps us infer the spatial location of the sounding object. Both the acoustic and the spatial properties are captured by the visual stream, and they require models to go beyond 2D understanding of images (3D with audio) and study the spatial (3D) aspect of audio in visuals (4D with audio). This is of vital importance for applications such as egocentric video understanding, robotic perception, and AR/VR. In support of robotic perception, where embodied agents can move around with both visual and auditory sensing, audio-visual simulations have also recently been developed to facilitate research in this direction. The goal of this workshop is to share recent progress in audio-visual studies along the spatial-temporal (4D) dimensions and to discuss which directions the field should investigate next.

The AV4D workshop will bring together researchers in different subareas of visual learning of sound in spaces, including computer vision, robotics, machine learning, room acoustics, and graphics, to examine the challenges and opportunities emerging from visual learning of sounds embodied in spaces. We will review the current state of the field and identify the research infrastructure needed to enable stronger collaboration between researchers working in these subareas.

Call for Papers

We invite submissions of 2-4 page extended abstracts on topics related to (but not limited to):
  • Visual learning of spatial audio
  • Visual learning of room acoustics
  • Visual learning of impact sounds
  • Audio-visual self-supervised and semi-supervised learning
  • Audio-visual speaker localization and diarization
  • Audio-visual source separation
  • Audio-visual embodied learning
  • Audio-visual simulation
  • Robotic perception with vision and sound
  • Audio-visual for AR/VR
A submission should take the form of an extended abstract (2-4 pages long, excluding references) in PDF format using the ICCV style. We will accept submissions of (1) papers that have not been previously published or accepted for publication in substantially similar form; (2) papers that have been published or accepted for publication in recent venues, including journals, conferences, workshops, and arXiv; and (3) research proposals for future work with a focus on well-defined concepts and ideas. All submissions will be reviewed under a single-blind policy. Accepted extended abstracts will not appear in the ICCV proceedings and hence will not affect future publication of the work. We will publish all accepted extended abstracts on the workshop webpage.

CMT submissions website: https://cmt3.research.microsoft.com/AV4D2023

Key Dates:

  • Extended abstract submission deadline: August 31st, 2023 (11:59 pm Pacific time)
  • Notification to authors: September 12th, 2023
  • Camera-ready version deadline: September 24th, 2023 (11:59 pm Pacific time)
  • Workshop date: October 3rd, 2023

Accepted Papers

Title | Authors (the presenter's name is followed by the presentation format: in-person/pre-recorded video/virtual) | ID
Bi-directional Image-Speech Retrieval Through Geometric Consistency | Xinyuan Qian, Wei Xue, Qiquan Zhang, Ruijie Tao, Yiming Wang (in-person), Kainan Chen, Haizhou Li | A1
Listen and Move: Improving GANs Coherency in Agnostic Sound-to-Video Generation | Rafael Redondo (in-person) | A2
Video-guided speech inpainting transformer | Juan Felipe Montesinos, Daniel Michelsanti (in-person), Gloria Haro, Zheng-Hua Tan, Jesper Jensen | A3
Let There Be Sound: Reconstructing High Quality Speech from Silent Videos | Ji-Hoon Kim, Jaehun Kim, Joon Son Chung (in-person) | A4
Neural Acoustic Context Field: Rendering Realistic Room Impulse Response With Neural Fields | Susan Liang (virtual), Chao Huang, Yapeng Tian, Anurag Kumar, Chenliang Xu | A5
Separating Invisible Sounds Toward Universal Audio-Visual Scene-Aware Sound Separation | Yiyang Su (virtual), Ali Vosoughi, Shijian Deng, Yapeng Tian, Chenliang Xu | A6
Prompting Segmentation with Sound is Generalizable Audio-Visual Source Localizer | Yaoting Wang (virtual), Liu Weisong, Guangyao Li, Jian Ding, Di Hu, Xi Li | A7
Position-Aware Audio-Visual Separation for Spatial Audio | Yuxin Ye (virtual), Wenming Yang, Yapeng Tian | A8
Sound to Visual Scene Generation by Audio-to-Visual Latent Alignment | Kim Sung-Bin (in-person), Arda Senocak, Hyunwoo Ha, Andrew Owens, Tae-Hyun Oh | B1
SynthVSR: Scaling Up Visual Speech Recognition With Synthetic Supervision | Xubo Liu (in-person), Egor Lakomkin, Konstantinos Vougioukas, Pingchuan Ma, Honglie Chen, Ruiming Xie, Morrie Doulaty, Niko Moritz, Jachym Kolar, Stavros Petridis, Maja Pantic, Christian Fuegen | B3
Sound Source Localization is All about Cross-Modal Alignment | Arda Senocak (in-person), Hyeonggon Ryu | B4
MISAR: A Multimodal Instructional System with Augmented Reality | Jing Bi (virtual), Nguyen Manh Nguyen, Ali Vosoughi, Chenliang Xu | B5
Towards Better Egocentric Action Understanding in a Multi-Input Multi-Output View | Wenxuan Hou (virtual), Ruoxuan Feng, Yixin Xu, Yapeng Tian, Di Hu | B6
Towards Robust Active Speaker Detection | Siva Sai Nagender Vasireddy (virtual), Chenxu Zhang, Xiaohu Guo, Yapeng Tian | B7
Leveraging Foundation Models for Unsupervised Audio Visual Segmentation | Swapnil Bhosale (virtual), Haosen Yang, Diptesh Kanojia, Xiatian Zhu | B8

Presentation Instructions

  • We will have two paper presentation sessions: 09:30 - 10:30 and 14:15 - 15:15. During each session, there will be a short Q&A after every four paper presentations.
  • You can present a poster during the lunch and coffee breaks. Posters should be A0 size (84.1 × 118.9 cm) in landscape orientation. Please set up your poster on the poster board during breaks.
  • Remote attendees: please upload your five-minute presentation video to CMT before the conference.
  • For in-person speakers, prepare a five-minute presentation and test your laptop with the technician during breaks.


Schedule

08:50 - 09:00 Opening Remarks Changan Chen (UT Austin)
09:00 - 09:30 Invited Talk: Combining Language and Perception Jim Glass (MIT)
Abstract: The last decade has seen remarkable advances in deep learning methods that have dramatically influenced research in many areas of computer science, including speech and natural language processing and computer vision. An ability to learn latent embedding spaces across modalities and to learn from unlabeled data via self-supervised techniques has facilitated connections between perception and language, two of the original pillars of artificial intelligence. In this talk I review some of the recent progress our group has made towards weakly supervised learning from multimodal inputs including audio, speech, text, images, and video. Just as humans learn to perceive, understand, and act within the world by being immersed in a sea of multimodal sensory signals, we argue that multimodal input will also enable the creation of much richer and more transferable semantic concepts than unimodal training alone. I will describe several multimodal learning models that we have developed as part of our ongoing research in this area and describe recent efforts to connect these perceptual models with large language models to enable language-capable perceptual models.
09:30 - 10:30 Paper Session A Chair: Changan Chen
10:30 - 11:15 Coffee Break and Poster Session
11:15 - 11:45 Invited Talk: Audio-Visual Physics-Aware Learning Chuang Gan (UMass Amherst)
11:45 - 12:15 Invited Talk: Synchronized Video-to-audio Synthesis Hang Zhao (Tsinghua University)
Abstract: Synchronized video-based speech and audio synthesis can find applications in many industries, including video conferencing, film-making, etc. This talk covers two topics: (1) video-based text-to-speech synthesis for video dubbing; (2) video-based natural sound synthesis for Foley.
12:15 - 13:15 Lunch Break
13:15 - 13:45 Invited Talk: Audio-Visual Reconstruction Ming Lin (UMD)
Abstract: Deep neural networks trained on single- or multi-view images have enabled 3D reconstruction of objects and scenes using RGB and RGBD approaches for robotics and other 3D vision-based applications. However, existing methods still struggle in some challenging 3D reconstruction scenarios, such as transparent and highly reflective surfaces, difficulty capturing material properties, etc. In this talk, I present recent advances in audio-visual reconstruction using a combination of audio, images and videos to recover material properties, 3D geometry, and 3D scenes for robot navigation and AR/VR applications. These approaches offer new insights for understanding complexity in reconstructing 3D dynamic worlds. I'll conclude by discussing some possible future directions and challenges.
13:45 - 14:15 Invited Talk: Replaying immersive audio-visual memories: data and challenges Natalia Neverova (Meta AI)
Abstract: Re-creating immersive audio-visual memories on headsets has been a long-standing dream for many. In this talk, we will introduce a new Replay benchmark for this task and discuss the challenges. Replay is a collection of multi-view, multi-modal videos of humans interacting socially. Each scene is filmed in high production quality, from different viewpoints with several static cameras, as well as wearable action cameras, and recorded with a large array of microphones at different positions in the room. The Replay dataset has many potential applications, such as novel-view synthesis, 3D reconstruction, novel-view acoustic synthesis, human body and face analysis, and training generative models.
14:15 - 15:15 Paper Session B Chair: Mingfei Chen
15:15 - 16:00 Coffee Break and Poster Session
16:00 - 16:30 Invited Talk: Multi-Sensory Neural Objects: Modeling, Datasets, and Applications Jiajun Wu (Stanford)
16:30 - 17:00 Invited Paper Session
AutoAD II: The Sequel – Who, When, and What in Movie Audio Description (Tengda Han)
AdVerb: Visually Guided Audio Dereverberation (Sanjoy Chowdhury)
Be Everywhere - Hear Everything (BEE): Audio Scene Reconstruction by Sparse Audio-Visual Samples (Mingfei Chen)
Sound Localization from Motion: Jointly Learning Sound Direction and Camera Rotation (Andrew Owens)
17:00 - 17:10 Closing Remarks