We see and hear things every second of our lives. Before sounds arrive at our ears, they are produced by objects situated in space and then transformed by the surrounding environment as a function of its geometry, materials, and other properties. The binaural sound we perceive not only tells us about the semantic properties of the sound, e.g., a telephone ringing or a dog barking, but also helps us infer the spatial location of the sounding object. Both of these acoustic and spatial properties are also captured by the visual stream, requiring models to go beyond 2D understanding of images (3D with audio) and to study the spatial (3D) aspect of audio in visual scenes (4D with audio). This is of vital importance for applications such as egocentric video understanding, robotic perception, and AR/VR.
In support of robotic perception, where embodied agents move around with both visual and auditory sensing, audio-visual simulators have also recently been developed to facilitate research in this direction. The goal of this workshop is to share recent progress in audio-visual studies along the spatial-temporal (4D) dimensions and to discuss which directions the field should investigate next.
The AV4D workshop will bring together researchers from the different subareas of visual learning of sound in spaces, including computer vision, robotics, machine learning, room acoustics, and graphics, to examine the challenges and opportunities emerging from visual learning of sounds embodied in spaces. We will review the current state of the field and identify the research infrastructure needed to enable stronger collaboration among researchers working across these subareas.