We see and hear things every second of our lives. Before sounds arrive at our ears, they are first produced by objects situated in space and then transformed by the surrounding environment as a function of its geometry, materials, and other properties. The binaural sound we perceive not only tells us about the semantic properties of the sound source, e.g., a telephone ringing or a dog barking, but also helps us infer the spatial location of the sounding object. Both of these acoustic and spatial properties are also captured by the visual stream, which requires models to go beyond 2D understanding of images (3D with audio) and study the spatial (3D) aspect of audio in visual scenes (4D with audio). This is of vital importance for applications such as egocentric video understanding, robotic perception, and AR/VR. In support of robotic perception, where embodied agents move around with both visual and auditory sensing, audio-visual simulators have also recently been developed to facilitate research in this direction. The goal of this workshop is to share recent progress in audio-visual studies along the spatial-temporal (4D) dimensions, and to discuss which directions the field should investigate next.

The AV4D workshop will bring together researchers from different subareas of visual learning of sound in spaces, including computer vision, robotics, machine learning, room acoustics, and graphics, to examine the challenges and opportunities emerging from visual learning of sounds embodied in spaces. We will review the current state of the field and identify the research infrastructure needed to enable stronger collaboration between researchers working across these subareas.

Call for Papers

We invite submissions of 2-4 page extended abstracts on topics related to (but not limited to):
  • Visual learning of spatial audio
  • Visual learning of room acoustics
  • Visual learning of impact sounds
  • Audio-visual self-supervised and semi-supervised learning
  • Audio-visual speaker localization and diarization
  • Audio-visual source separation
  • Audio-visual embodied learning
  • Audio-visual simulation
  • Robotic perception with vision and sound
  • Audio-visual learning for AR/VR
A submission should take the form of an extended abstract (2-4 pages, excluding references) in PDF format using the ECCV style. We will accept submissions of (1) papers that have not been previously published or accepted for publication in substantially similar form; (2) papers that have been published or accepted for publication at recent venues, including journals, conferences, workshops, and arXiv; and (3) research proposals for future work with a focus on well-defined concepts and ideas. All submissions will be reviewed under a single-blind policy. Accepted extended abstracts will not appear in the ECCV proceedings and hence will not affect future publication of the work. We will publish all accepted extended abstracts on the workshop webpage.

Key Dates:

  • Extended abstract submission deadline: July 31st, 2022
  • Notification to authors: August 31st, 2022
  • Workshop date: TBD, 2022


Schedule:

07:55 am - 08:00 am Introduction and Opening Remarks
08:00 am - 08:30 am Invited Talk
08:30 am - 09:00 am Invited Talk
09:00 am - 09:30 am Paper Session A (A1 - A5)
09:30 am - 09:40 am Paper Session A Q&A
09:40 am - 10:00 am Break
10:00 am - 10:30 am Invited Talk
10:30 am - 11:00 am Invited Talk
11:00 am - 12:00 pm Panel Discussion
12:00 pm - 12:30 pm Break
12:30 pm - 01:00 pm Invited Talk
01:00 pm - 01:30 pm Invited Talk
01:30 pm - 02:00 pm Paper Session B (B1 - B4)
02:00 pm - 02:10 pm Paper Session B Q&A
02:10 pm - 02:30 pm Break
02:30 pm - 03:00 pm Invited Talk
03:00 pm - 03:30 pm Invited Talk
03:30 pm - 03:35 pm Closing Remarks