Abstract: Audio-Visual Question Answering (AVQA) requires complex reasoning across auditory and visual modalities. While recent advancements leverage sophisticated spatio-temporal representations, ...