New Technologies in Mathematics Seminar
Speaker: Ronen Eldan, Microsoft Research
Title: The TinyStories Dataset: How Small Can Language Models Be and Still Speak Coherent English?
Abstract: While generative language models exhibit powerful capabilities at large scale, they struggle to produce coherent and fluent text when either the model or the number of training steps is too small: existing models below a few billion parameters often fail to generate coherent text beyond a few sentences. We hypothesize that one of the main reasons for this strong reliance on scale is the vast breadth and abundance of patterns in the datasets used to train those models. This motivates the following question: can we design a dataset that preserves the essential elements of natural language, such as grammar, vocabulary, facts, and reasoning, but that is much smaller and more refined in its breadth and diversity?
In this talk, we introduce TinyStories, a synthetic dataset of short stories, generated by GPT-3.5 and GPT-4, that contain only words a typical 3- to 4-year-old understands. We show that TinyStories can be used to train and analyze language models that are much smaller than state-of-the-art models (below 10 million parameters) or have much simpler architectures (a single transformer block), yet still produce fluent, consistent, and diverse stories several paragraphs long, with almost perfect grammar and certain reasoning capabilities. We also show that these trained models are substantially more interpretable than larger ones: we can visualize and analyze their attention and activation patterns and show how these relate to the generation process and the story content. We hope that TinyStories will facilitate the development, analysis, and research of language models, especially for low-resource or specialized domains, and shed light on the emergence of language capabilities in LMs.
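
For readers who want to experiment before or after the talk, below is a minimal, hedged sketch of how one might load the dataset and sample from one of the small models, assuming (this is not part of the talk materials) that both are published on the Hugging Face Hub under the identifiers roneneldan/TinyStories and roneneldan/TinyStories-1M, and that the models reuse the EleutherAI/gpt-neo-125M tokenizer.

    # Sketch only: dataset/model identifiers and the "text" field name are assumptions,
    # not confirmed by this announcement.
    from datasets import load_dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load the synthetic short-story dataset (identifier assumed).
    stories = load_dataset("roneneldan/TinyStories", split="train")
    print(stories[0]["text"][:200])  # peek at the start of one story

    # Load a small (~1M-parameter) model trained on TinyStories (identifier assumed).
    tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125M")
    model = AutoModelForCausalLM.from_pretrained("roneneldan/TinyStories-1M")

    # Generate a continuation of a simple story prompt.
    prompt = "Once upon a time there was a little girl named Lily."
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, top_p=0.95)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Even a model this small, trained only on the narrow TinyStories distribution, should produce a grammatical continuation of the prompt, which is the phenomenon the talk examines.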