3D visual grounding aims to localize the target object in a 3D point clo...
3D visual grounding involves finding a target object in a 3D scene that
...
Speech recognition builds a bridge between the multimedia streaming
(aud...
Direct speech-to-speech translation (S2ST) aims to convert speech from o...
Multi-modal Contrastive Representation (MCR) learning aims to encode
dif...
Speech-to-SQL (S2SQL) aims to convert spoken questions into SQL queries ...
Multimedia communications facilitate global interaction among people.
H...
Out-of-distribution (OOD) detection is an important task to ensure the
r...