We see and hear things every second of our lives. Before a sound arrives at our ears, it is produced by an object situated in a space and then transformed by its surroundings as a function of the environment's geometry, materials, etc. The binaural sound we perceive not only tells us about the semantic properties of the sound (e.g., a telephone ringing or a baby crying) but also helps us infer the spatial location of the sounding object. Both the acoustic and the spatial properties are captured by the visual stream, and they require models to go beyond 2D understanding of images (3D with audio) and to study the spatial (3D) aspect of audio in visuals (4D with audio). This is of vital importance for applications such as egocentric video understanding, robotic perception, and AR/VR. To support robotic perception, where embodied agents move around with both visual and auditory sensing, audio-visual simulations have also recently been developed to facilitate research in this direction. The goal of this workshop is to share recent progress in audio-visual studies along the spatial-temporal (4D) dimensions and to discuss which directions the field should investigate next.

The AV4D workshop will bring together researchers from different subareas of visual learning of sound in spaces, including computer vision, robotics, machine learning, room acoustics, and graphics, to examine the challenges and opportunities emerging from visual learning of sounds embodied in spaces. We will review the current state of the field and identify the research infrastructure needed to enable stronger collaboration between researchers working in these subareas.

Call for Papers

We invite submissions of 2-4 page extended abstracts on topics related to (but not limited to):
  • Visual learning of spatial audio
  • Visual learning of room acoustics
  • Visual learning of impact sounds
  • Audio-visual self-supervised and semi-supervised learning
  • Audio-visual speaker localization and diarization
  • Audio-visual source separation
  • Audio-visual embodied learning
  • Audio-visual simulation
  • Robotic perception with vision and sound
  • Audio-visual for AR/VR
A submission should take the form of an extended abstract (2-4 pages long, excluding references) in PDF format using the ECCV style. We will accept submissions of (1) papers that have not been previously published or accepted for publication in substantially similar form; (2) papers that have been published or accepted for publication in recent venues, including journals, conferences, workshops, and arXiv; and (3) research proposals for future work with a focus on well-defined concepts and ideas. All submissions will be reviewed under a single-blind policy. Accepted extended abstracts will not appear in the ECCV proceedings and hence will not affect future publication of the work. We will publish all accepted extended abstracts on the workshop webpage.

CMT submissions website: https://cmt3.research.microsoft.com/AV4D2022

Key Dates:

  • Extended abstract submission deadline: August 15th, 2022 (11:59 pm Pacific time)
  • Notification to authors: September 6th, 2022
  • Camera-ready version deadline: September 15th, 2022 (11:59 pm Pacific time)
  • Workshop date: October 23rd, 2022

Accepted Papers

ID | Title | Authors (the presenter is followed by the presentation format: in-person, video, or virtual)
A1 | Estimating Visual Information From Audio Through Manifold Learning | Fabrizio Pedersoli, Dryden Wiebe (video), Amin Banitalebi-Dehkordi, Yong Zhang, George Tzanetakis, and Kwang M. Yi
A2 | MIMOSA: Human-in-the-Loop Generation of Spatial Audio from Videos with Monaural Audio | Zheng Ning* (virtual), Zheng Zhang*, Jerrick Ban, Kaiwen Jiang, Ruohong Gan, Yapeng Tian, and Toby Jia-Jun Li
A3 | Egocentric Deep Multi-Channel Audio-Visual Active Speaker Localization | Hao Jiang (video), Calvin Murdock, and Vamsi Krishna Ithapu
A4 | AVSBench: A Pixel-level Audio-Visual Segmentation Benchmark | Jinxing Zhou, Jianyuan Wang, Jiayi Zhang, Weixuan Sun, Jing Zhang, Stan Birchfield, Dan Guo, Lingpeng Kong, Meng Wang, and Yiran Zhong (video)
A5 | Sound Localization by Self-Supervised Time Delay Estimation | Ziyang Chen, David F. Fouhey, and Andrew Owens (in-person)
A6 | Active Audio-Visual Separation of Dynamic Sound Sources | Sagnik Majumder (in-person) and Kristen Grauman
B1 | Sound Adversarial Audio-Visual Navigation | Yinfeng Yu (virtual), Changan Chen, and Fuchun Sun
B2 | Benchmarking Weakly-Supervised Audio-Visual Sound Localization | Shentong Mo (video) and Pedro Morgado
B3 | Semantic-Aware Multi-modal Grouping for Weakly-Supervised Audio-Visual Video Parsing | Shentong Mo (video) and Yapeng Tian
B4 | Invisible-to-Visible: Privacy-Aware Human Segmentation using Airborne Ultrasound via Collaborative Learning Probabilistic U-Net | Risako Tanigawa (in-person), Yasunori Ishii, Kazuki Kozuka, and Takayoshi Yamashita
B5 | Don’t Listen to What You Can’t See: The Importance of Negative Examples for Audio-Visual Sound Separation | Efthymios Tzinis, Scott Wisdom (in-person), and John Hershey
B6 | Self-supervised Learning of Audio Representations from Audio-Visual Data using Spatial Alignment | Shanshan Wang (in-person), Archontis Politis, Annamaria Mesaros, and Tuomas Virtanen

Presentation Instructions

  • We'll have two paper presentation sessions: 10:15 - 11:15 and 15:00 - 16:00. Each session will be a mix of in-person and video presentations, followed by a short Q&A covering all of the papers presented in that session. We'll also release recordings on our website for offline viewing.
  • Posters should be A0 size (84.1 × 118.9 cm) in landscape orientation.
  • For virtual speakers, please make sure you register at ECCV for the workshop and can log onto the web platform. You are requested to join the Zoom room 20 minutes prior to your session.
  • For in-person speakers, please prepare your presentation in Microsoft PowerPoint with a 16:9 aspect ratio and hand it to the technician in the hall on a USB stick 2 hours before your session (30 minutes before for morning sessions).
  • If you are presenting a poster, please set it up on the poster board during the breaks.
  • Read the complete official speaker instructions here.


Schedule

Time (local, with PDT in parentheses) | Event | Speaker or Papers | Talk Title
09:00 - 09:15 (23:00 - 23:15 PDT) | Opening Remarks | Changan Chen (University of Texas at Austin)
09:15 - 09:45 (23:15 - 23:45 PDT) | Invited Talk | Pedro Morgado (University of Wisconsin-Madison) | Multi-modal Representation Learning from and for Realistic Audio-Visual Data
09:45 - 10:15 (23:45 - 00:15 PDT) | Invited Talk | Richard Newcombe (Meta Reality Labs) | Introduction to Multimodal Perception with Project Aria
10:15 - 10:55 (00:15 - 00:55 PDT) | Paper Session A | A1 - A6 (Session Chair: Ruohan Gao)
10:55 - 11:15 (00:55 - 01:15 PDT) | Paper Session A Q&A
11:15 - 11:45 (01:15 - 01:45 PDT) | Coffee Break
11:45 - 12:15 (01:45 - 02:15 PDT) | Invited Talk | Josh McDermott (MIT) | Learning to Localize Sounds
12:15 - 12:45 (02:15 - 02:45 PDT) | Invited Talk | Tuomas Virtanen (Tampere University) | Spherical Audio-Visual Learning
12:45 - 14:00 (02:45 - 04:00 PDT) | Lunch Break
14:00 - 14:30 (04:00 - 04:30 PDT) | Invited Talk | Alexander Richard (Meta Reality Labs) | 3D Audio Rendering for Social Telepresence
14:30 - 15:00 (04:30 - 05:00 PDT) | Invited Talk | Yapeng Tian (University of Texas at Dallas) | Human-Multisensory AI Collaboration: Opportunities and Challenges
15:00 - 15:40 (05:00 - 05:40 PDT) | Paper Session B | B1 - B6 (Session Chair: Andrew Owens)
15:40 - 16:00 (05:40 - 06:00 PDT) | Paper Session B Q&A
16:00 - 16:30 (06:00 - 06:30 PDT) | Coffee Break
16:30 - 17:00 (06:30 - 07:00 PDT) | Invited Talk | Dinesh Manocha (University of Maryland) | Learning-based Audio Simulation
17:00 - 17:30 (07:00 - 07:30 PDT) | Invited Talk | John Hershey (Google Research) | Parsing Nature at Its Seams: Unsupervised and Audio-Visual Sound Separation
17:30 - 17:45 (07:30 - 07:45 PDT) | Closing Remarks