New Technologies in Mathematics Seminar
Speaker: Abhishek Panigrahi, Dept. of Computer Science, Princeton University
Title: On the Power of the Forward Pass through Transformer Architectures
Abstract: Highly trained transformers are capable of interesting computations as they perform inference on an input. The exact mechanisms these models use during their forward passes are an active area of study. This talk examines two such phenomena.
In the first half, we explore how and why pre-trained language models, specifically moderately sized BERT models, can effectively learn linguistic structures such as parse trees during pre-training. Using synthetic data generated from probabilistic context-free grammars (PCFGs), we show how moderate-sized transformers can perform forward-backward parsing, also known as the inside-outside algorithm, during inference. We further examine the role of the pre-training loss in enabling the model to learn to parse.
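For readers less familiar with the algorithm referenced above, the sketch below shows the inside pass of the inside-outside algorithm for a PCFG in Chomsky normal form. It is a minimal illustration for orientation, not the construction analyzed in the talk; the rule encodings and function names are assumptions made purely for exposition.

```python
from collections import defaultdict

def inside_probabilities(sentence, lexical_rules, binary_rules):
    """Inside pass for a PCFG in Chomsky normal form.

    lexical_rules: dict mapping (A, word) -> P(A -> word)
    binary_rules:  dict mapping (A, B, C) -> P(A -> B C)
    Returns beta[(i, j)][A] = probability that A derives words i..j (inclusive).
    """
    n = len(sentence)
    beta = defaultdict(lambda: defaultdict(float))

    # Base case: spans of length 1 are covered by lexical rules A -> w_i.
    for i, word in enumerate(sentence):
        for (A, w), p in lexical_rules.items():
            if w == word:
                beta[(i, i)][A] += p

    # Recursive case: combine adjacent spans with binary rules A -> B C.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):  # split point between left and right child
                for (A, B, C), p in binary_rules.items():
                    left = beta[(i, k)].get(B, 0.0)
                    right = beta[(k + 1, j)].get(C, 0.0)
                    if left and right:
                        beta[(i, j)][A] += p * left * right
    return beta
```

The inside quantities here correspond to the "forward" half of the computation; the outside ("backward") pass is an analogous second dynamic program over the same chart.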
In the second half, we consider in-context learning in large language models, where they learn to reason on the fly. An ongoing hypothesis is that transformers simulate gradient descent at inference time to perform in-context learning. We propose the Transformer in Transformer (TinT) framework, which creates explicit transformer architectures that can simulate and fine-tune a smaller pre-trained transformer model during inference. For example, a 1.3-billion-parameter TinT model can simulate and fine-tune a 125-million-parameter model in a single forward pass. This framework suggests that large transformers might execute intricate sub-routines during inference, and it provides insights into how their capabilities might be enhanced through careful design.
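As a toy illustration of the gradient-descent hypothesis mentioned above (and not of the TinT construction itself), the snippet below checks numerically that, for in-context linear regression, an unnormalized linear-attention-style readout matches the prediction obtained after one explicit gradient step from zero initialization. The dimensions, step size, and data here are illustrative assumptions.

```python
import numpy as np

# Toy check: for linear regression with in-context pairs (x_i, y_i), one gradient
# step from w = 0 on L(w) = 0.5 * sum_i (w @ x_i - y_i)**2 with step size eta gives
# w = eta * sum_i y_i * x_i, so the query prediction eta * sum_i y_i * (x_i @ x_q)
# is exactly an unnormalized linear-attention readout over the context.

rng = np.random.default_rng(0)
d, n, eta = 4, 8, 0.1

X = rng.normal(size=(n, d))        # in-context inputs
w_true = rng.normal(size=d)
y = X @ w_true                     # in-context targets
x_q = rng.normal(size=d)           # query input

# Explicit gradient-descent step from w = 0.
grad = X.T @ (X @ np.zeros(d) - y)          # equals -X.T @ y at w = 0
w_one_step = np.zeros(d) - eta * grad
pred_gd = w_one_step @ x_q

# Attention-style readout: scores x_i @ x_q, values y_i, no softmax normalization.
pred_attn = eta * np.sum((X @ x_q) * y)

assert np.allclose(pred_gd, pred_attn)
print(pred_gd, pred_attn)
```

The point of the check is only that the attention-style readout and the explicit gradient step coincide algebraically in this linear toy setting; the TinT construction discussed in the talk operates on full pre-trained transformer models rather than this simplified model.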