Multimodal Large Language Models
- National Science Review 2024: A Survey on Multimodal Large Language Models, arXiv
- Findings of ACL 2024: The Revolution of Multimodal Large Language Models: A Survey, arXiv
Multimodal Reasoning Foundational Works
- ICCV 2015, VQA: VQA: Visual Question Answering, arXiv
- CVPR 2017, CLEVR: CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning, arXiv
- CVPR 2019, GQA: GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering, arXiv
- ICLR 2018, MAC: Compositional Attention Networks for Machine Reasoning, arXiv
Neural-Symbolic Visual Reasoning
- CVPR 2016: Neural Module Networks, arXiv
- ICCV 2017: Inferring and Executing Programs for Visual Reasoning, arXiv
- NeurIPS 2018: Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding, arXiv, Note
- CVPR 2023 Best Paper, VisProg: Visual Programming: Compositional Visual Reasoning without Training, arXiv, GitHub, Note