An introduction to mixture of experts in deep learning

Member Seminar

Speaker: Samy Jelassi

Title: An introduction to mixture of experts in deep learning

Abstract: Scale has opened new frontiers in natural language processing, but at a high cost. Mixture-of-Experts (MoE) models have been proposed as a path to even larger and more capable language models. They select different parameters for each incoming example, which decouples the parameter count from the compute per example and yields very large but efficient models. In this talk, I will review the concept of mixture of experts, provide a basic description of the Switch Transformer model, characterize some behaviors of these models, and conclude by highlighting some open problems in the field. This talk is mainly based on the following papers: https://arxiv.org/pdf/2101.03961.pdf and https://arxiv.org/pdf/2209.01667.pdf.
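
To make the idea of selecting different parameters per example concrete, here is a minimal sketch of top-1 ("switch"-style) routing in plain Python. It is an illustration under simplifying assumptions, not the implementation from the papers above: each expert is a single linear map rather than a feed-forward block, weights are random, and the names Top1MoE, total_params and active_params_per_token are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class Top1MoE:
    """Toy mixture-of-experts layer with top-1 routing (illustrative only)."""

    def __init__(self, dim, num_experts, seed=0):
        rng = random.Random(seed)

        def rand_matrix(rows, cols):
            return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

        # Router: one row of weights per expert, producing one logit per expert.
        self.router = rand_matrix(num_experts, dim)
        # Assumption: each expert is a single dim x dim linear map.
        self.experts = [rand_matrix(dim, dim) for _ in range(num_experts)]

    def forward(self, token):
        # Score every expert, then run only the highest-scoring one.
        probs = softmax(matvec(self.router, token))
        chosen = max(range(len(probs)), key=probs.__getitem__)
        out = matvec(self.experts[chosen], token)
        # Scale the expert output by its router probability.
        return [probs[chosen] * y for y in out], chosen

    def total_params(self):
        # Expert parameters stored in the layer (router excluded).
        return sum(len(e) * len(e[0]) for e in self.experts)

    def active_params_per_token(self):
        # Expert parameters actually used for one token.
        return len(self.experts[0]) * len(self.experts[0][0])


layer = Top1MoE(dim=4, num_experts=8)
output, chosen = layer.forward([1.0, 0.5, -0.5, 2.0])
print(chosen, layer.total_params(), layer.active_params_per_token())
```

In this toy layer, adding experts increases total_params while active_params_per_token stays fixed, which is the decoupling of parameter count from per-example compute that the abstract describes.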