How does the visual system develop category-selective regions for faces, bodies, scenes, and words? And how can we tell whether our deep neural network (DNN) models actually capture the feature tuning that arises in these brain areas? In this talk, I will first show that contrastive learning over large-scale natural image sets naturally gives rise to category-selective units for faces, scenes, bodies, and words, even in the absence of any category-specific learning rules or inductive biases. These emergent selective units have dissociable functional roles in object recognition when lesioned, and they predict responses in the corresponding selective areas of human ventral visual cortex. These findings support a unifying account of category representation that bridges longstanding debates between modular and distributed theories of high-level vision. Building on this framework, I will then introduce 'parametric neural control' as a novel, more stringent test of DNN-brain alignment. Many DNN encoding models show near-equal performance in predicting visual responses while relying on fundamentally different features and computations. We demonstrate this using an interpretability technique called feature accentuation (Hamblin, Fel, et al., 2024), which enables us to synthesize stimulus sets that systematically vary along model-specific encoding axes, and then to test each model's ability to precisely modulate neural responses in macaque inferotemporal cortex. Strikingly, we find that DNNs with equivalent encoding scores on natural images can show marked differences in their capacity to control neural responses using these targeted image manipulations. This approach therefore provides a new means to arbitrate between models, requiring a stronger commitment to feature tuning properties in local parts of the natural image manifold. Overall, these studies provide an updated deep learning paradigm for understanding the emergence and function of high-level visual representations in greater detail.
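To make the first claim concrete, the sketch below shows one common way emergent category selectivity might be quantified in a trained DNN: compare each unit's responses to a preferred category (e.g., faces) against non-preferred images with a d-prime selectivity index. This is a minimal illustration, not the talk's actual analysis pipeline; the backbone, layer choice, threshold, and stimuli are all placeholder assumptions.

```python
# Minimal sketch: identify candidate face-selective units in a DNN by a
# d-prime contrast between face and non-face activations. All specifics
# (backbone, layer, threshold, stimuli) are illustrative assumptions.

import torch
import torchvision.models as models

# Assumption: an ImageNet-style backbone stands in for a contrastively trained model.
backbone = models.resnet50(weights=None).eval()
features = torch.nn.Sequential(*list(backbone.children())[:-1])  # drop the classifier head

def activations(images: torch.Tensor) -> torch.Tensor:
    """Penultimate-layer activations, shape (n_images, n_units)."""
    with torch.no_grad():
        return features(images).flatten(1)

def dprime(pref: torch.Tensor, nonpref: torch.Tensor) -> torch.Tensor:
    """d' = (mean_pref - mean_nonpref) / sqrt(0.5 * (var_pref + var_nonpref))."""
    num = pref.mean(0) - nonpref.mean(0)
    denom = torch.sqrt(0.5 * (pref.var(0) + nonpref.var(0)) + 1e-8)
    return num / denom

# Placeholder stimuli: in practice these would be localizer-style image sets.
faces = torch.randn(32, 3, 224, 224)
objects = torch.randn(32, 3, 224, 224)

d = dprime(activations(faces), activations(objects))
face_selective_units = (d > 0.85).nonzero().flatten()  # threshold is illustrative
print(f"{len(face_selective_units)} candidate face-selective units")
```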
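The 'parametric neural control' logic can likewise be sketched in miniature: given a fixed encoding model (DNN features plus a linear readout predicting one neural site), nudge a natural seed image up or down the model's predicted-response axis by gradient steps on the pixels, yielding a graded stimulus set. Note this is not the feature accentuation method of Hamblin, Fel, et al. (2024), which adds constraints to keep synthesized images near the natural image manifold; the function names, step sizes, and readout here are hypothetical.

```python
# Bare-bones sketch of synthesizing a stimulus set along a model-specific
# encoding axis. Simplified gradient ascent/descent on pixels; not the
# published feature accentuation procedure. All names are illustrative.

import torch
import torchvision.models as models

backbone = models.resnet50(weights=None).eval()
features = torch.nn.Sequential(*list(backbone.children())[:-1])
for p in features.parameters():
    p.requires_grad_(False)

# Assumption: a linear readout from 2048-d features to one site's response,
# fit beforehand (e.g., by ridge regression) on natural-image data.
readout = torch.nn.Linear(2048, 1)
for p in readout.parameters():
    p.requires_grad_(False)

def predicted_response(image: torch.Tensor) -> torch.Tensor:
    return readout(features(image).flatten(1)).squeeze()

def accentuate(seed: torch.Tensor, target_shift: float,
               steps: int = 50, lr: float = 0.01) -> torch.Tensor:
    """Push the seed image toward a target change in predicted response."""
    img = seed.clone().requires_grad_(True)
    base = predicted_response(seed).detach()
    opt = torch.optim.Adam([img], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = (predicted_response(img) - (base + target_shift)) ** 2
        loss.backward()
        opt.step()
    return img.detach()

seed = torch.rand(1, 3, 224, 224)  # placeholder for a natural seed image
# Graded stimuli spanning the model's encoding axis for this site:
stimulus_set = [accentuate(seed, shift) for shift in (-2.0, -1.0, 0.0, 1.0, 2.0)]
```

Presenting such model-specific stimulus sets to the same neural site is what allows encoding models with matched prediction scores on natural images to be pulled apart by how well they actually control the measured responses.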