Weak-to-Strong Generalization
Presented by Pavel
Pavel's research focuses on reasoning, reinforcement learning, and AI alignment. He previously worked on advanced problem-solving models at OpenAI and contributed to Claude 3.7 and GPT-4-level systems. He completed his PhD at NYU in 2023 and will return in Fall 2025 as an Assistant Professor, joining the Tandon CSE and Courant CS departments.
Abstract
As AI systems grow more capable, aligning them becomes increasingly challenging, especially once their behavior outpaces human understanding. This talk explores weak-to-strong generalization: can weak models effectively supervise stronger ones? Through experiments in NLP, chess, and reward modeling, it shows how even limited supervision can unlock surprising performance gains, and why alignment techniques like RLHF may not scale without new approaches.
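The core experimental setup behind weak-to-strong generalization can be illustrated with a toy sketch (this is an illustrative assumption, not the talk's actual experiments): a "weak supervisor" produces noisy labels, a "strong student" is trained on those labels, and we measure how much of the gap between the supervisor and a ground-truth-trained ceiling the student recovers (the "performance gap recovered", PGR).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary task: true label = 1 iff w_true . x > 0 (hypothetical setup).
n, d = 4000, 10
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y_true = (X @ w_true > 0).astype(float)

# "Weak supervisor": its labels agree with the truth only ~75% of the time.
flip = rng.random(n) < 0.25
y_weak = np.where(flip, 1.0 - y_true, y_true)

def train_logreg(X, y, lr=0.5, steps=300):
    """Plain gradient-descent logistic regression (stand-in for the 'student')."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ w, -30, 30)))
        w -= lr * X.T @ (p - y) / len(y)
    return w

def accuracy(w, X, y):
    return float(((X @ w > 0) == (y > 0.5)).mean())

w_student = train_logreg(X, y_weak)   # strong student trained on weak labels
w_ceiling = train_logreg(X, y_true)   # same student trained on ground truth

weak_acc = float((y_weak == y_true).mean())
student_acc = accuracy(w_student, X, y_true)
ceiling_acc = accuracy(w_ceiling, X, y_true)
# Performance gap recovered: 0 = no better than the weak labels, 1 = at ceiling.
pgr = (student_acc - weak_acc) / (ceiling_acc - weak_acc)
print(f"weak {weak_acc:.2f}  student {student_acc:.2f}  "
      f"ceiling {ceiling_acc:.2f}  PGR {pgr:.2f}")
```

Because the label noise here is symmetric, the student's inductive bias lets it learn a boundary close to the truth from weak labels alone, so it scores well above its supervisor; that "exceeding the teacher" effect is the phenomenon the talk examines at scale.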