Abstract: The field of Large Visual-Language Models (LVLMs) has made significant strides in integrating visual recognition and language understanding. However, their application in multimodal ...
Abstract: Visual grounding tasks aim to localize image regions based on natural language references. In this work, we explore whether generative VLMs predominantly trained on image-text data could be ...
In vision-language models (VLMs), visual tokens typically incur significant computational overhead despite their lower information density compared to text tokens. To address this, ...