Scale Can’t Overcome Pragmatics: Instilling Spatial and Compositional Reasoning into Multimodal Language Models
Presented by Ranjay
Ranjay’s research spans computer vision, natural language processing, robotics, and human-computer interaction. His work has been recognized at CVPR, ACL, NeurIPS, and other leading conferences, and his publications have been featured in Science, Forbes, and The Wall Street Journal. He holds a Ph.D. in CS from Stanford and degrees in ECE and CS from Cornell.
Abstract
Compositionality is key to human vision and language, allowing us to interpret new scenes and sentences by combining familiar elements. While past research incorporated compositional and spatial priors into machine learning systems, large-scale models trained on internet data have largely overlooked them. This talk formalizes compositionality through the lens of cognitive science and evaluates whether models like GPT-4 and Gemini exhibit it, revealing near-random performance on compositional reasoning tasks. We explore architectural and training modifications inspired by neuroscience and cognitive science to enhance compositional reasoning and address gaps in training data, showing how high-quality human annotations can help build stronger vision-language models.