VLE: Vision-Language Multimodal Pre-Training Model - Homepage, Documentation and Downloads
VLE (Vision-Language Encoder) is an image-text multimodal understanding model built on pre-trained text and image encoders. It can be applied to multimodal discriminative tasks such as visual question answering and image-text retrieval. In particular, VLE achieves the best performance among publicly available models on the Visual Commonsense Reasoning (VCR) task, which places stronger demands on language […]
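To make the "pre-trained text encoder + pre-trained image encoder" idea concrete, below is a minimal sketch of a discriminative vision-language model assembled from two off-the-shelf unimodal encoders. This is not VLE's actual architecture or API; the class name `SimpleVisionLanguageEncoder`, the concatenation-based fusion, and the chosen checkpoints (`bert-base-uncased`, `openai/clip-vit-base-patch16`) are illustrative assumptions only.

```python
import torch
import torch.nn as nn
from transformers import BertModel, CLIPVisionModel


class SimpleVisionLanguageEncoder(nn.Module):
    """Toy fusion of a pretrained text encoder and a pretrained image encoder.

    Illustrative sketch only; it is NOT the VLE architecture. It shows the
    general recipe of building a discriminative vision-language model (e.g.
    for VQA-style classification) on top of two pretrained unimodal encoders.
    """

    def __init__(self,
                 text_name: str = "bert-base-uncased",
                 vision_name: str = "openai/clip-vit-base-patch16",
                 num_labels: int = 2):
        super().__init__()
        # Pre-trained unimodal encoders, loaded from the Hugging Face hub.
        self.text_encoder = BertModel.from_pretrained(text_name)
        self.vision_encoder = CLIPVisionModel.from_pretrained(vision_name)
        fused_dim = (self.text_encoder.config.hidden_size
                     + self.vision_encoder.config.hidden_size)
        # Simple classification head over the concatenated pooled features.
        self.classifier = nn.Sequential(
            nn.Linear(fused_dim, fused_dim),
            nn.GELU(),
            nn.Linear(fused_dim, num_labels),
        )

    def forward(self, input_ids, attention_mask, pixel_values):
        # Pooled sentence-level and image-level representations.
        text_feat = self.text_encoder(
            input_ids=input_ids, attention_mask=attention_mask).pooler_output
        image_feat = self.vision_encoder(
            pixel_values=pixel_values).pooler_output
        # Late fusion by concatenation, then classify.
        fused = torch.cat([text_feat, image_feat], dim=-1)
        return self.classifier(fused)
```

In practice, text inputs would be tokenized with the matching tokenizer (e.g. `BertTokenizer`) and images preprocessed with `CLIPImageProcessor` before being passed to `forward`; VLE itself uses a more elaborate cross-modal fusion than the plain concatenation shown here.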