LLaVA (Large Language and Vision Assistant) is a large multimodal model built toward GPT-4-level multimodal capabilities: it is trained end-to-end, connecting a vision encoder with an LLM for general-purpose visual and language understanding. Early experiments show that LLaVA demonstrates impressive multimodal chat abilities, sometimes exhibiting the behavior of multimodal GPT-4 on unseen images and instructions, and it yields an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%.
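
To picture how the vision encoder and LLM are connected, think of image features being projected into the LLM's token-embedding space and fed to the language model alongside the text tokens. The sketch below is a minimal, illustrative PyTorch approximation of that idea; the module names, toy encoder/decoder stand-ins, and dimensions are assumptions for demonstration, not LLaVA's actual implementation (which pairs a pretrained CLIP vision tower with a pretrained LLM such as Vicuna).

```python
# Minimal sketch of a LLaVA-style architecture: image features from a vision
# encoder are projected into the LLM's token-embedding space, prepended to the
# text embeddings, and the whole stack is trained end-to-end.
# All names and dimensions here are illustrative assumptions.
import torch
import torch.nn as nn

class LlavaStyleModel(nn.Module):
    def __init__(self, patch_dim=588, vision_dim=768, llm_dim=512, vocab_size=32000):
        super().__init__()
        self.vision_encoder = nn.Linear(patch_dim, vision_dim)  # stand-in for a pretrained vision tower
        self.projector = nn.Linear(vision_dim, llm_dim)         # maps image features into the LLM's space
        self.token_embed = nn.Embedding(vocab_size, llm_dim)    # LLM input embeddings
        self.llm = nn.TransformerEncoder(                       # stand-in for the decoder-only LLM
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(llm_dim, vocab_size)           # next-token prediction head

    def forward(self, image_patches, input_ids):
        # image_patches: (batch, num_patches, patch_dim); input_ids: (batch, seq_len)
        vis_tokens = self.projector(self.vision_encoder(image_patches))  # (B, P, llm_dim)
        txt_tokens = self.token_embed(input_ids)                         # (B, T, llm_dim)
        sequence = torch.cat([vis_tokens, txt_tokens], dim=1)            # image tokens precede text
        return self.lm_head(self.llm(sequence))                          # (B, P+T, vocab_size)

# Smoke test with random inputs.
model = LlavaStyleModel()
patches = torch.randn(2, 16, 588)   # 16 patches of 3x14x14 pixels, flattened
ids = torch.randint(0, 32000, (2, 8))
print(model(patches, ids).shape)    # torch.Size([2, 24, 32000])
```

The key design point this illustrates is the lightweight projector sitting between the two pretrained components: only a small connector has to learn the mapping from visual features to language tokens, which is what makes end-to-end visual instruction tuning tractable.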
#LLaVA #Large #Multimodal #Model #EndtoEnd #Training