Abstract: The field of Large Visual-Language Models (LVLMs) has made significant strides in integrating visual recognition and language understanding. However, its application in multimodal ...
Abstract: Visual grounding tasks aim to localize image regions based on natural language references. In this work, we explore whether generative VLMs predominantly trained on image-text data could be ...
In vision-language models (VLMs), visual tokens usually consume a significant amount of computational overhead, despite their lower information density compared to text tokens. To address this, ...