arXiv:2304.08485

Visual Instruction Tuning

Published on Apr 17, 2023
Authors:
Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee

AI-generated summary

LLaVA, a large multimodal model that connects a vision encoder to an LLM and is instruction-tuned on language-image data generated by language-only GPT-4, delivers strong accuracy and impressive multimodal chat capabilities.

Abstract

Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. In this paper, we present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and LLM for general-purpose visual and language understanding. Our early experiments show that LLaVA demonstrates impressive multimodal chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions, and yields an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%. We make GPT-4 generated visual instruction tuning data, our model and code base publicly available.
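The core architectural idea in the abstract — features from a pretrained vision encoder are mapped by a trainable projection into the LLM's word-embedding space, so the language model consumes image patches as if they were ordinary tokens — can be sketched in a few lines of PyTorch. The sketch below is illustrative only: the toy vision encoder, the tiny transformer standing in for the LLM, and all dimensions are assumptions made for demonstration, not the released LLaVA implementation (which pairs a CLIP ViT with Vicuna).

```python
# Minimal LLaVA-style connector sketch (illustrative, not the authors' code):
# a frozen vision encoder yields patch features, a learned linear projection
# maps them into the LLM embedding space, and the projected visual tokens are
# prepended to the text token embeddings before the language model.
import torch
import torch.nn as nn


class ToyVisionEncoder(nn.Module):
    """Stand-in for a pretrained ViT (e.g. CLIP); returns per-patch features."""

    def __init__(self, num_patches=16, vision_dim=256):
        super().__init__()
        self.num_patches = num_patches
        self.patch_embed = nn.Linear(3 * 16 * 16, vision_dim)

    def forward(self, images):
        # images: (batch, 3, 64, 64) -> sixteen 16x16 patches per image
        b = images.shape[0]
        patches = images.unfold(2, 16, 16).unfold(3, 16, 16)      # (b, 3, 4, 4, 16, 16)
        patches = patches.reshape(b, 3, self.num_patches, 16 * 16)
        patches = patches.permute(0, 2, 1, 3).reshape(b, self.num_patches, -1)
        return self.patch_embed(patches)                          # (b, 16, vision_dim)


class ToyLLaVA(nn.Module):
    def __init__(self, vocab_size=1000, llm_dim=512, vision_dim=256):
        super().__init__()
        self.vision_encoder = ToyVisionEncoder(vision_dim=vision_dim)
        self.projector = nn.Linear(vision_dim, llm_dim)           # the trainable connector
        self.token_embed = nn.Embedding(vocab_size, llm_dim)
        layer = nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True)
        self.llm = nn.TransformerEncoder(layer, num_layers=2)     # stand-in for the LLM
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, images, input_ids):
        with torch.no_grad():                                     # vision encoder kept frozen
            vis = self.vision_encoder(images)
        vis_tokens = self.projector(vis)                          # (b, 16, llm_dim)
        txt_tokens = self.token_embed(input_ids)                  # (b, seq, llm_dim)
        seq = torch.cat([vis_tokens, txt_tokens], dim=1)          # image tokens prepended
        return self.lm_head(self.llm(seq))                        # next-token logits


# Usage: one forward pass on random data.
model = ToyLLaVA()
images = torch.randn(2, 3, 64, 64)
input_ids = torch.randint(0, 1000, (2, 12))
print(model(images, input_ids).shape)  # torch.Size([2, 28, 1000])
```

In the paper's training recipe the vision encoder stays frozen; the projection is first trained for feature alignment and then tuned together with the LLM on the GPT-4-generated instruction data, which is what keeps the model end-to-end trainable at modest cost.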

Community

Unveiling LLaVA: The Next-Gen Visual Language Assistant

Links 🔗:

👉 Subscribe: https://www.youtube.com/@Arxflix
👉 Twitter: https://x.com/arxflix
👉 LMNT (Partner): https://lmnt.com/

By Arxflix

Models citing this paper 21

Datasets citing this paper 11

Spaces citing this paper 28

Collections including this paper 10