Evaluating Language Models for Mathematics through Interactions

12/06/2023 2:00 pm - 3:00 pm
CMSA Room G10
Address: CMSA, 20 Garden Street, Cambridge, MA 02138 USA

New Technologies in Mathematics Seminar

Speakers: Katherine Collins and Albert Jiang, Department of Computer Science and Technology, University of Cambridge

Title: Evaluating Language Models for Mathematics through Interactions

Abstract: There is much excitement about the opportunity to harness the power of large language models (LLMs) when building problem-solving assistants. However, the standard methodology of evaluating LLMs on static input–output pairs is insufficient for making informed decisions about which LLMs can be sensibly utilised, and under what assistive settings. Static assessment fails to take into account the essential interactive element of their deployment, and therefore limits our understanding of language model capabilities. In this talk, we present our recent work introducing CheckMate, an adaptable prototype platform for humans to interact with and evaluate LLMs. We discuss our study with CheckMate evaluating three language models (InstructGPT, ChatGPT, and GPT-4) as assistants in proving undergraduate-level mathematics, with a mixed cohort of participants ranging from undergraduate students to professors of mathematics. We release the resulting interaction and rating dataset, MathConverse. By analysing MathConverse, we derive a taxonomy of human behaviours and uncover that, despite a generally positive correlation, there are notable instances of divergence between correctness and perceived helpfulness in LLM generations, amongst other findings. Further, we identify useful scenarios and existing issues of GPT-4 in mathematical reasoning through a series of case studies contributed by expert mathematicians. We conclude with actionable takeaways for ML practitioners and mathematicians: models which communicate uncertainty, respond well to user corrections, and are more interpretable and concise may constitute better assistants; interactive evaluation is a promising way to navigate the capability of these models; and humans should be aware of language models’ algebraic fallibility and discern where they are appropriate to use.