Vision-and-Language BERT (ViLBERT) is a BERT-based model for learning task-agnostic joint representations of image content and natural language. ViLBERT extends the popular BERT architecture to a multi-modal, two-stream model that processes visual and textual inputs in separate streams, which interact through co-attentional transformer layers.
Source: ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
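The co-attentional layer is the core architectural idea: each stream's queries attend to the other stream's keys and values, so image regions are conditioned on the text and vice versa. Below is a minimal sketch of one such block, assuming PyTorch; the `CoAttentionLayer` name, dimensions, and sublayer arrangement are illustrative, not the authors' reference implementation.

```python
# Hypothetical sketch of a ViLBERT-style co-attentional transformer block.
import torch
import torch.nn as nn


class CoAttentionLayer(nn.Module):
    """One co-attentional block: each stream queries the other's keys/values."""

    def __init__(self, dim: int = 768, num_heads: int = 12, ff_dim: int = 3072):
        super().__init__()
        # Cross-attention in both directions between the two streams.
        self.attn_v = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_t = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_v1, self.norm_t1 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        # Per-stream feed-forward sublayers, as in a standard transformer block.
        self.ff_v = nn.Sequential(nn.Linear(dim, ff_dim), nn.GELU(), nn.Linear(ff_dim, dim))
        self.ff_t = nn.Sequential(nn.Linear(dim, ff_dim), nn.GELU(), nn.Linear(ff_dim, dim))
        self.norm_v2, self.norm_t2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, vis: torch.Tensor, txt: torch.Tensor):
        # Visual stream: queries = image regions, keys/values = text tokens.
        v_attn, _ = self.attn_v(vis, txt, txt)
        # Linguistic stream: queries = text tokens, keys/values = image regions.
        # Both attentions read the pre-update states (parallel exchange).
        t_attn, _ = self.attn_t(txt, vis, vis)
        vis = self.norm_v1(vis + v_attn)
        txt = self.norm_t1(txt + t_attn)
        # Feed-forward with residual connections in each stream.
        vis = self.norm_v2(vis + self.ff_v(vis))
        txt = self.norm_t2(txt + self.ff_t(txt))
        return vis, txt


# Example: 36 image-region features and 20 text tokens, batch of 2.
vis = torch.randn(2, 36, 768)   # region features (e.g. from an object detector)
txt = torch.randn(2, 20, 768)   # BERT token embeddings
vis, txt = CoAttentionLayer()(vis, txt)
print(vis.shape, txt.shape)     # torch.Size([2, 36, 768]) torch.Size([2, 20, 768])
```

In the full model, several of these blocks are stacked (interleaved with standard within-stream transformer layers), so each modality is repeatedly refined in the context of the other.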
Tasks in which papers use ViLBERT, with each task's share of all papers that use the method:

| Task | Papers | Share |
|---|---|---|
| Visual Question Answering (VQA) | 9 | 9.47% |
| Visual Question Answering | 8 | 8.42% |
| Question Answering | 7 | 7.37% |
| Retrieval | 7 | 7.37% |
| Image Captioning | 4 | 4.21% |
| Visual Commonsense Reasoning | 4 | 4.21% |
| Referring Expression | 3 | 3.16% |
| Language Modelling | 3 | 3.16% |
| Visual Dialog | 3 | 3.16% |