Multimodal Large Language Models

  • National Science Review 2024: A Survey on Multimodal Large Language Models, arXiv
  • ACL 2024: The Revolution of Multimodal Large Language Models: A Survey, arXiv

Multimodal Reasoning Foundational Works

  • ICCV 2015, VQA: VQA: Visual Question Answering, arXiv
  • CVPR 2017, CLEVR: CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning, arXiv
  • CVPR 2019, GQA: GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering, arXiv
  • ICLR 2018, MAC: Compositional Attention Networks for Machine Reasoning, arXiv

Neural-Symbolic Visual Reasoning

  • CVPR 2016: Neural Module Networks, arXiv
  • ICCV 2017: Inferring and Executing Programs for Visual Reasoning, arXiv
  • NeurIPS 2018: Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding, arXiv, Note
  • CVPR 2023 Best Paper, VisProg: Visual Programming: Compositional Visual Reasoning without Training, arXiv, GitHub, Note