Abstract

We see and hear things every second of our lives. Before a sound reaches our ears, it is first produced by an object situated in space and is then transformed by the surrounding environment as a function of the room geometry, materials, and so on. The binaural sound we perceive not only tells us about the semantic properties of the sound, e.g., a telephone ringing or a baby crying, but also helps us infer the spatial location of the sounding object. Both of these acoustic and spatial properties are also captured by the visual stream, and modeling them requires going beyond 2D understanding of images (3D with audio) to studying the spatial (3D) aspect of audio in visual scenes (4D with audio). This is of vital importance for applications such as egocentric video understanding, robotic perception, and AR/VR. In support of robotic perception, where embodied agents move around with both visual and auditory sensing, audio-visual simulators have also recently been developed to facilitate research in this direction. The goal of this workshop is to share recent progress in audio-visual research along the spatial-temporal (4D) dimensions and to discuss which directions the field should investigate next.
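To make the transformation described above concrete, here is a minimal illustrative sketch (not part of the workshop materials): the sound a listener perceives can be approximated by convolving a dry source signal with a binaural room impulse response (BRIR) that encodes the room geometry, materials, and the listener's position. The file names and the BRIR below are hypothetical placeholders.

# Minimal sketch: spatialize a dry (anechoic) source with a binaural room impulse response.
# The input files and BRIR are hypothetical; this only illustrates the idea that perceived
# binaural sound is the source signal transformed by its surrounding space.
import numpy as np
import soundfile as sf
from scipy.signal import fftconvolve

dry, sr = sf.read("telephone_ring_dry.wav")      # mono source signal, shape (T,)
brir, _ = sf.read("room_brir_listener_pos.wav")  # binaural RIR, shape (T_rir, 2)

# Convolve the source with the left- and right-ear impulse responses.
left = fftconvolve(dry, brir[:, 0])
right = fftconvolve(dry, brir[:, 1])
binaural = np.stack([left, right], axis=-1)

# Normalize and save the spatialized two-channel result.
binaural /= np.max(np.abs(binaural)) + 1e-9
sf.write("telephone_ring_binaural.wav", binaural, sr)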

The AV4D workshop will bring together researchers from different subareas of visual learning of sound in spaces, including computer vision, robotics, machine learning, room acoustics, and graphics, to examine the challenges and opportunities emerging from visual learning of sounds embodied in spaces. We will review the current state of the field and identify the research infrastructure needed to enable stronger collaboration among researchers working across these subareas.

Call for Papers

We invite submissions of 2-4 page extended abstracts on topics related to (but not limited to) the following:
  • Visual learning of spatial audio
  • Visual learning of room acoustics
  • Visual learning of impact sounds
  • Audio-visual self-supervised and semi-supervised learning
  • Audio-visual speaker localization and diarization
  • Audio-visual source separation
  • Audio-visual embodied learning
  • Audio-visual simulation
  • Robotic perception with vision and sound
  • Audio-visual learning for AR/VR
A submission should take the form of an extended abstract (2-4 pages excluding references) in PDF format using the ECCV style. We will accept submissions of (1) papers that have not been previously published or accepted for publication in substantially similar form; (2) papers that have been published or accepted for publication at recent venues, including journals, conferences, workshops, and arXiv; and (3) research proposals for future work with a focus on well-defined concepts and ideas. All submissions will be reviewed under a single-blind policy. Accepted extended abstracts will not appear in the ECCV proceedings and hence will not affect future publication of the work. We will publish all accepted extended abstracts on the workshop webpage.


CMT submissions website: https://cmt3.research.microsoft.com/AV4D2022


Key Dates:

  • Extended abstract submission deadline: August 15th, 2022 (11:59 pm Pacific time)
  • Notification to authors: September 6th, 2022
  • Camera-ready version deadline: September 15th, 2022 (11:59 pm Pacific time)
  • Workshop date: October 23rd, 2022

Accepted Papers

Each paper is listed with its ID, title, and authors; the presenter is the author marked with the presentation format (in-person / pre-recorded video / virtual).

A1. Estimating Visual Information From Audio Through Manifold Learning
    Fabrizio Pedersoli, Dryden Wiebe (video), Amin Banitalebi-Dehkordi, Yong Zhang, George Tzanetakis, and Kwang M. Yi
A2. MIMOSA: Human-in-the-Loop Generation of Spatial Audio from Videos with Monaural Audio
    Zheng Ning* (virtual), Zheng Zhang*, Jerrick Ban, Kaiwen Jiang, Ruohong Gan, Yapeng Tian, and Toby Jia-Jun Li
A3. Egocentric Deep Multi-Channel Audio-Visual Active Speaker Localization
    Hao Jiang (video), Calvin Murdock, and Vamsi Krishna Ithapu
A4. AVSBench: A Pixel-level Audio-Visual Segmentation Benchmark
    Jinxing Zhou, Jianyuan Wang, Jiayi Zhang, Weixuan Sun, Jing Zhang, Stan Birchfield, Dan Guo, Lingpeng Kong, Meng Wang, and Yiran Zhong (video)
A5. Sound Localization by Self-Supervised Time Delay Estimation
    Ziyang Chen, David F. Fouhey, and Andrew Owens (in-person)
A6. Active Audio-Visual Separation of Dynamic Sound Sources
    Sagnik Majumder (in-person) and Kristen Grauman
B1. Sound Adversarial Audio-Visual Navigation
    Yinfeng Yu (virtual), Changan Chen, and Fuchun Sun
B2. Benchmarking Weakly-Supervised Audio-Visual Sound Localization
    Shentong Mo (video) and Pedro Morgado
B3. Semantic-Aware Multi-modal Grouping for Weakly-Supervised Audio-Visual Video Parsing
    Shentong Mo (video) and Yapeng Tian
B4. Invisible-to-Visible: Privacy-Aware Human Segmentation using Airborne Ultrasound via Collaborative Learning Probabilistic U-Net
    Risako Tanigawa (in-person), Yasunori Ishii, Kazuki Kozuka, and Takayoshi Yamashita
B5. Don't Listen to What You Can't See: The Importance of Negative Examples for Audio-Visual Sound Separation
    Efthymios Tzinis, Scott Wisdom (in-person), and John Hershey
B6. Self-supervised Learning of Audio Representations from Audio-Visual Data using Spatial Alignment
    Shanshan Wang (in-person), Archontis Politis, Annamaria Mesaros, and Tuomas Virtanen

Presentation Instructions

  • We'll have two paper presentation sessions: 10:15 - 11:15 and 15:00 - 16:00. Each session will be a mix of in-person and video presentations, followed by a short Q&A covering all papers in that session. We'll also release recordings on our website for offline viewing.
  • Posters should be A0 size (84.1 × 118.9 cm) in landscape orientation.
  • For virtual speakers, please make sure you are registered for the workshop at ECCV and can log onto the web platform. You are requested to join the Zoom room 20 minutes prior to your session.
  • For in-person speakers, please prepare your presentation in Microsoft PowerPoint with a 16:9 aspect ratio and submit it on a USB stick to the technician in the hall 2 hours before your session (30 minutes for morning sessions).
  • For poster presenters, please set up your poster on the poster board during the breaks.
  • Read the complete official speaker instructions here.

Schedule

09:00 - 09:15 (23:00 - 23:15 PDT) Opening Remarks Changan Chen (University of Texas at Austin)
09:15 - 09:45 (23:15 - 23:45 PDT) Invited Talk Pedro Morgado (University of Wisconsin-Madison)
Multi-modal Representation Learning from and for Realistic Audio-Visual Data
09:45 - 10:15 (23:45 - 00:15 PDT) Invited Talk Richard Newcombe (Meta Reality Labs)
Introduction to Multimodal Perception with Project Aria
10:15 - 10:55 (00:15 - 00:55 PDT) Paper Session A A1 - A6 (Session Chair: Ruohan Gao)
10:55 - 11:15 (00:55 - 01:15 PDT) Paper Session A Q&A
11:15 - 11:45 (01:15 - 01:45 PDT) Coffee Break
11:45 - 12:15 (01:45 - 02:15 PDT) Invited Talk Josh McDermott (MIT)
Learning to Localize Sounds
12:15 - 12:45 (02:15 - 02:45 PDT) Invited Talk Tuomas Virtanen (Tampere University)
Spherical Audio-Visual Learning
12:45 - 14:00 (02:45 - 04:00 PDT) Lunch Break
14:00 - 14:30 (04:00 - 04:30 PDT) Invited Talk Alexander Richard (Meta Reality Labs)
3D Audio Rendering for Social Telepresence
14:30 - 15:00 (04:30 - 05:00 PDT) Invited Talk Yapeng Tian (University of Texas at Dallas)
Human-Multisensory AI Collaboration: Opportunities and Challenges
15:00 - 15:40 (05:00 - 05:40 PDT) Paper Session B B1 - B6 (Session Chair: Andrew Owens)
15:40 - 16:00 (05:40 - 06:00 PDT) Paper Session B Q&A
16:00 - 16:30 (06:00 - 06:30 PDT) Coffee Break
16:30 - 17:00 (06:30 - 07:00 PDT) Invited Talk Dinesh Manocha (University of Maryland)
Learning-based Audio Simulation
17:00 - 17:30 (07:00 - 07:30 PDT) Invited Talk John Hershey (Google Research)
Parsing Nature at Its Seams: Unsupervised and Audio-Visual Sound Separation
17:30 - 17:45 (07:30 - 07:45 PDT) Closing Remarks