
Large Language Models and Math: A Review of Approaches and Progress

11 min read · Nov 12, 2024

Existing Challenges in Math for LLMs

LLMs like ChatGPT are widely used and have made great progress, but they still struggle on tasks that require reasoning skills. There is currently a lot of interest in the research community in finding approaches for reasoning tasks, and big prize competitions have been launched.

Recently, DeepMind made a breakthrough by reaching silver-medal level at the International Mathematics Olympiad with AlphaProof and AlphaGeometry. This post reviews recent progress that has been made in AI for solving mathematical problems.

There are two main kinds of approaches to solving math problems with AI:

  • Natural language-based approaches, where billions of tokens are available for training (the DeepSeekMath corpus consists of 120B math-related tokens sourced from Common Crawl). However, these models can hallucinate and produce plausible but incorrect solutions in natural language.
Benchmark of several natural language-based approaches on MATH.
  • Formal theorem-proving approaches have the advantage that answers can be formally verified for correctness. Hence, when the model finds an answer, it is guaranteed to be correct. However, data is scarce and costly to collect (about 27k human-written theorems in the Mathlib library, written in the Lean4 language). Formal theorem-proving approaches therefore rely on generating formal data (synthetic or translated from natural language) to augment the training corpus. Recently, the AlphaProof model from DeepMind (together with AlphaGeometry for geometry problems) reached the silver medal at the International Mathematics Olympiad competition using this approach, formalizing around 1 million natural language problems for pre-training.
Benchmark of several formal theorem-proving approaches on miniF2F and ProofNet. Pass rate is the percentage of problems correctly solved.
Pipeline for training AlphaProof

Natural language-based approaches

General pipeline for training/inference of a natural language-based Math model

Training

1️⃣ Pre-training is the initial training phase where the model learns general language and broad world knowledge, in this case general knowledge about math. Pre-training is a self-supervised task where the LLM is trained on a large text corpus: the text is cut arbitrarily and the LLM has to complete it. For math, a general LLM is fine-tuned on a large corpus of mathematical text. DeepSeekMath starts from an LLM specialized in coding (DeepSeek-Coder-Base-v1.5) and trains it on 120 billion math tokens from the internet (a major contribution of the paper is a pipeline to harvest high-quality math data from the internet at scale). Qwen2.5-Math further augments the collected math data with a synthetic dataset generated by another LLM (Qwen2-72B-Instruct), reaching a pre-training corpus of 1 trillion math tokens.
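As a rough illustration of this step, here is a minimal continued pre-training sketch using Hugging Face transformers; the model id, file name and hyperparameters are placeholders, not the actual DeepSeekMath or Qwen recipe:

```python
# Minimal sketch of continued pre-training on a math text corpus (illustrative only).
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "deepseek-ai/deepseek-coder-1.3b-base"  # small stand-in for a coder base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Plain math text gathered and filtered beforehand (hypothetical file).
dataset = load_dataset("text", data_files={"train": "math_corpus.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
    batched=True, remove_columns=["text"],
)

# mlm=False -> plain next-token (causal) objective; the collator also builds the labels.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="math-cpt", per_device_train_batch_size=4,
                           num_train_epochs=1, learning_rate=2e-5),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
```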

2️⃣ Supervised Fine-Tuning (SFT) aims to refine the pre-trained model to perform specific tasks or align with desired behaviours. SFT is supervised training where the data is in question/answer format. This data is much more costly to obtain than a math text corpus. In our case, the supervised task is: given a math problem, provide a solution (a problem/solution pair). Formatting the solution to add more reasoning details, or giving the model access to an external tool such as Python, improves data quality and model performance. The main formats used are: Chain-of-Thought (CoT), where the solution is detailed in steps; Program-of-Thought (PoT), where the reasoning steps are separated from the computational steps, which are delegated to an external tool (a Python interpreter); and Tool-Integrated Reasoning (TIR), which extends PoT in an interactive way.
DeepSeekMath uses 776K training samples. Qwen2.5-Math uses 2500K CoT samples and 400K TIR English/Chinese samples (half of which are synthetically generated).

Comparison of three reasoning formats: CoT, PAL (Program-Aided Language, which generates intermediate steps and Python code) and TIR
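As a rough illustration of these formats, here are two made-up training samples; the <tool>/<output> tags are illustrative placeholders, since each paper uses its own template:

```python
# Made-up SFT samples illustrating the CoT and TIR solution formats.
cot_sample = {
    "problem": "Natalia sold 48 clips in April and half as many in May. How many clips in total?",
    "solution": "In May she sold 48 / 2 = 24 clips. "
                "In total she sold 48 + 24 = 72 clips. The answer is 72.",
}

tir_sample = {
    "problem": "What is the sum of the first 100 positive integers?",
    "solution": "We can compute this directly.\n"
                "<tool>\nprint(sum(range(1, 101)))\n</tool>\n"
                "<output>\n5050\n</output>\n"
                "The answer is 5050.",
}
```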

3️⃣ The Reinforcement Learning (RL) phase is designed to further align the model with human preferences and values. RL has been proven effective in further improving the mathematical reasoning ability of LLMs after SFT. For the RL phase, a reward model needs to be trained. The goal of the reward model is to score solutions, making it possible to rank LLM answers. There are two main types: the Outcome Reward Model (ORM), which assigns a single value to the whole solution (0 if incorrect, 1 if correct), and the Process Reward Model (PRM), which assigns a score to each reasoning step of the solution. ORM and PRM training data can be generated from the SFT dataset without human annotations. To train the reward model, DeepSeekMath uses 144K questions and Qwen2.5-Math uses 361K English and 257K Chinese mathematical problems.
Once the reward model is trained, it can be used in an RL procedure to further improve the SFT math model from step 2️⃣. Widely used RL procedures are PPO (Proximal Policy Optimization), first introduced by OpenAI in 2017, and GRPO (Group Relative Policy Optimization), which has recently been adopted in the math domain. PPO also needs to train a value model, but GRPO skips this step by estimating a baseline from group scores, making it computationally more efficient.

PPO vs GRPO RL procedures
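To see why GRPO is cheaper, here is a hedged sketch of the group-relative advantage computation it uses in place of a value model (the full objective also adds a PPO-style clipped ratio and a KL penalty, omitted here):

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantages in the spirit of GRPO (sketch).

    Instead of a learned value model (as in PPO), the baseline is the mean reward
    of a group of sampled answers to the *same* question, and rewards are
    normalized by the group's standard deviation.
    """
    rewards = np.asarray(group_rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: six sampled solutions to one problem, scored by an outcome reward model.
print(grpo_advantages([1, 0, 0, 1, 1, 0]))  # positive for correct samples, negative otherwise
```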

Inference

Once the model is trained, inference is a call to the LLM that generates the tokens of the solution. The solution is not verified (it could be correct or incorrect). Additionally, if the model was trained with the TIR format, it can call an external tool (a Python interpreter) during inference.
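As an illustration of what a TIR-style inference loop can look like, here is a minimal sketch; `generate` is a placeholder for any LLM completion function, and the <tool>/<output> tags are assumptions rather than the actual Qwen or DeepSeek template:

```python
import re
import subprocess

def run_python(code: str, timeout: int = 10) -> str:
    """Execute a model-written code snippet in a subprocess and capture its output."""
    proc = subprocess.run(["python", "-c", code], capture_output=True,
                          text=True, timeout=timeout)
    return proc.stdout + proc.stderr

def tir_inference(problem: str, generate, max_rounds: int = 4) -> str:
    """Tool-integrated reasoning loop (sketch).

    `generate(prompt)` is any LLM completion function that stops either after a
    closing </tool> tag or at a final answer; the tag names are illustrative.
    """
    transcript = f"Problem: {problem}\nSolution:"
    for _ in range(max_rounds):
        completion = generate(transcript)
        transcript += completion
        match = re.search(r"<tool>(.*?)</tool>", completion, re.DOTALL)
        if match is None:          # no tool call -> the model gave its final answer
            break
        output = run_python(match.group(1))
        transcript += f"\n<output>\n{output}\n</output>\n"  # feed the result back to the model
    return transcript
```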

Models

Examples are DeepSeekMath and Qwen2.5-Math (both open source).

DeepSeekMath is a 7B-parameter model, and the main contributions of the paper are: 1) a pipeline to gather massive amounts of high-quality math data from the internet; 2) a new, more efficient RL procedure called GRPO.

Qwen2.5-Math is a 72B-parameter model that achieves impressive results on the MATH benchmark, beating all other models. The main contributions of the paper are: 1) the extensive use of synthetic data generated with a previous math model (Qwen2-Math-Instruct) to augment the pre-training data; 2) iteratively training and improving the reward model using data generated with the SFT model.

Formal theorem-proving approaches

General pipeline for training/inference of a formal theorem proving model

Formal language

A formal language is a structured system of symbols and rules that encodes information rigorously, allowing computers and humans to process logical statements unambiguously. Well-known formal languages for mathematics are Coq, Isabelle, Metamath… Recently, Lean4 has gained broader adoption in the mathematics community. Lean4 acts both as a programming language and as an interactive theorem prover. This dual role lets users construct proofs as they would write a program, using high-level “tactics” that allow them to break down complex proofs interactively and incrementally. Today, an actively maintained Lean4 library called Mathlib contains more than 27k formalized theorems.
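To give a flavour of the language, here is a small Lean4 proof written with tactics (a toy example relying on Mathlib, not taken from any training set):

```lean
import Mathlib

-- A tiny example: the sum of two even natural numbers is even.
-- (Mathlib already provides this as `Even.add`; the tactic proof is just for illustration.)
theorem even_add_even (a b : ℕ) (ha : Even a) (hb : Even b) : Even (a + b) := by
  obtain ⟨x, hx⟩ := ha      -- hx : a = x + x
  obtain ⟨y, hy⟩ := hb      -- hy : b = y + y
  exact ⟨x + y, by rw [hx, hy]; ring⟩
```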

AlphaProof training pipeline

Training

In formal theorem proving, the system answers in a verifiable format (for example, the Lean4 language). The answer can then be verified by the theorem prover.

1️⃣ Pre-training is done in the same way as for natural language-based approaches to get general knowledge about math.

2️⃣ Supervised Fine-Tuning (SFT) in the formal case aims to obtain a model that can output answers in formal language. This model will then be used to guide proof search, which, in turn, can produce new proofs used to further fine-tune the model. The challenge here is to get sufficient formal data to fine-tune the model. In the OpenAI and Meta theorem provers, the Mathlib library is processed to extract goal/tactic pairs (24k theorems and 144k goal/tactic pairs for the latter). Recently, DeepMind went to a whole new level, formalizing around 1 million informal statements into 100 million formal statements (to achieve the 100x factor they probably also formalized intermediate statements; this is not detailed in the report) and achieving a silver medal at the International Mathematics Olympiad with AlphaProof. The auto-formalizer, which translates informal problems into formal statements, is the central piece that makes it possible to scale the amount of formal data, but no details about it are provided.

Example of source/target pairs used for SFT. The source string is the pretty-printed tactic state; the target string is the tactic invocation as entered by a human author of the source code.
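For illustration, a goal/tactic pair of this kind could be serialized as follows (a hypothetical example; the exact format differs between projects):

```python
# Illustrative goal/tactic training pair (the exact serialization differs between projects).
sft_pair = {
    # Source: the pretty-printed tactic state (hypotheses, then the goal after ⊢).
    "source": "n : ℕ\n⊢ n + 0 = n",
    # Target: the tactic a human author applied at that state.
    "target": "simp",
}
```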

3️⃣ The Proof Search phase is central to training automated theorem-proving systems, as it defines how efficiently and effectively the model explores possible proof paths. Approaches include Monte Carlo Tree Search (MCTS) adaptations such as AlphaZero-style search and HyperTree Proof Search (HTPS), or simpler approaches such as best-first search (OpenAI). The search operates over a tree structure, where each node represents a state or subgoal in the proof, and edges between nodes represent logical steps or tactics that advance the proof from one state to another. The root of the tree is the initial theorem we are trying to prove, while the leaves represent potential end states where the theorem is either proven (successful leaf) or cannot be completed (unsuccessful leaf). This search is very challenging since the number of possible proof steps is potentially infinite. The idea is to guide the search with the model trained in step 2️⃣ (which proposes new tactics given a goal, hence expanding the tree) and possibly a value model (which assigns to a tactic a value related to how likely it is that the goal can be solved with it). The value model can share weights with the model from step 2️⃣ while having a different objective.

  • In AlphaZero-style MCTS, a model repeatedly simulates paths to refine which moves or tactics offer the highest potential for success. This process combines exploration (trying new moves) and exploitation (favouring promising paths), using a value function to guide the search by estimating outcomes. However, MCTS is designed for a finite action space, which is not the case in theorem proving, so some adaptation must have been made to apply it in the AlphaProof context.
Diagram of MCTS guided by LLM. Policy Model is the LLM generating “tactics” given a goal. Critic Model assigns a score to a path.
  • HTPS is an MCTS-inspired search algorithm specifically designed for theorem proving; it operates over hypergraphs to manage multiple, interdependent goals. It prioritizes expanding the multiple subgoals created by a single tactic, improving efficiency by ensuring that all necessary steps of a proof are addressed together.
  • Best-first search (OpenAI), by contrast, ranks proof goals in a priority queue based on their likelihood of success (computed as the cumulative logprob of all nodes in the path; logprob stands for log probability and is negative), generally expanding the most promising goals first. It tends to behave like a breadth-first search, since deeper goals have more parent tactics and therefore a lower cumulative logprob than shallow nodes. A minimal sketch of this procedure is given after this list.
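To make the best-first variant concrete, here is a minimal sketch in Python. The `propose` and `apply_tactic` callables are placeholders for the LLM and the proof assistant; this is an illustration of the idea, not any system's actual implementation, and it glosses over the bookkeeping needed when a single tactic opens several subgoals that must all be closed (exactly the case HTPS's hypergraph handles).

```python
import heapq

def best_first_proof_search(root_goal, propose, apply_tactic, budget=10_000):
    """Best-first proof search over goals (sketch).

    `propose(goal)` asks the LLM for candidate tactics with their log-probabilities.
    `apply_tactic(goal, tactic)` asks the proof assistant to run the tactic; it returns
    the list of resulting subgoals ([] means the goal is closed) or None on error.
    Goals are expanded in order of cumulative logprob, so shallow goals are naturally
    preferred: every extra step adds a negative logprob to the path.
    """
    counter = 0  # tie-breaker so the heap never has to compare goal objects
    queue = [(0.0, counter, root_goal, [])]  # (-cumulative_logprob, counter, goal, tactics so far)
    while queue and budget > 0:
        neg_logprob, _, goal, proof = heapq.heappop(queue)
        for tactic, logprob in propose(goal):
            budget -= 1
            subgoals = apply_tactic(goal, tactic)
            if subgoals is None:        # the tactic failed in the proof assistant
                continue
            if not subgoals:            # this goal is closed
                return proof + [tactic]
            for sub in subgoals:        # enqueue subgoals with the updated cumulative logprob
                counter += 1
                heapq.heappush(queue, (neg_logprob - logprob, counter, sub, proof + [tactic]))
    return None  # budget exhausted without closing the goal
```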

4️⃣ Self-improvement consists of repeating steps 2️⃣ (SFT) and 3️⃣ (Proof Search). After proof search, newly found proofs are added to the training data and the model is further improved. OpenAI uses Expert Iteration, while HTPS performs online learning. The main difference between Expert Iteration and HTPS is the rate at which the LLM is refreshed: in Expert Iteration, the model is updated after training on all available SFT data (i.e., a full cycle), whereas in HTPS the model is updated online as soon as new data is available (i.e., much more frequently).
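A rough sketch of the Expert Iteration loop, with `search` and `fine_tune` as placeholder callables standing for the proof search of step 3️⃣ and the fine-tuning of step 2️⃣:

```python
def expert_iteration(model, theorems, search, fine_tune, n_rounds=8):
    """Expert Iteration style self-improvement loop (sketch with placeholder callables).

    Each round: run proof search with the current model, keep the newly found
    (machine-verified) proofs, then fine-tune the model on all proofs found so far.
    """
    proof_data = []
    for _ in range(n_rounds):
        new_proofs = [proof for thm in theorems
                      if (proof := search(model, thm)) is not None]
        proof_data.extend(new_proofs)          # verified proofs are correct by construction
        model = fine_tune(model, proof_data)   # full-cycle update, unlike HTPS's online updates
    return model
```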

Inference

Inference involves a full proof search, which makes many calls to the LLM and to the theorem-proving environment, since each node expansion in the tree requires a call to both. Inference is therefore much more costly than in natural language-based approaches, but the output proof is verified to be correct. For instance, AlphaProof and AlphaGeometry took three days to solve four out of six Olympiad problems.

Models

Examples are AlphaProof (DeepMind), GPT-f (OpenAI), and HTPS (Meta).

DeepSeek-Prover-V1.5 is a slightly different variant. The main idea is that, in order to guide the search in the tree, the LLM generates a whole proof (instead of a single proof step as in other formal theorem-proving approaches), and this proof is truncated at the first Lean4 error. This makes the search more efficient (if the LLM is able to generate good whole proofs), since several tactics can be added to a path simultaneously. The proof search algorithm, called RMaxTS, is a variant of Monte Carlo tree search.
It is open source and achieves a pass rate of 63.5% on miniF2F, setting a new state of the art in the field.
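The truncation step itself is simple. Here is a minimal sketch, with `check` standing in for a call to the Lean4 checker (a placeholder interface, not the actual DeepSeek-Prover code):

```python
def truncate_at_first_error(proof_lines, check):
    """Keep the longest verified prefix of a model-generated whole proof (sketch).

    `check(lines)` is a placeholder for a call to the Lean4 checker returning the
    (0-based) index of the first failing line, or None if the whole proof verifies.
    The surviving prefix then becomes the starting point for further tree search.
    """
    first_error = check(proof_lines)
    if first_error is None:
        return proof_lines, True             # the whole proof already verifies
    return proof_lines[:first_error], False  # drop everything from the first error onwards
```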

Other approaches using LLMs

LLaMA-Berry is a natural language-based approach combined with proof search, where the traditional role of a solver, which would normally operate in a formal theorem-proving environment, is replaced with a combination of MCTS, self-refinement, and preference-based evaluations.

  • In the selection phase, the node to expand is selected (each node consists of a full proof written in natural language) using the Upper Confidence Bound applied to Trees (UCT) algorithm.
  • In the expansion phase, new answers (i.e., new nodes) are generated through a Self-Refine process, where a Critic model (an LLM) finds flaws in the answer (mathematical errors or logical faults) and the answer is rewritten by the model given this feedback.
Self-refinement process
  • In the evaluation phase, a Pairwise Preference Reward Model (PPRM), inspired by Reinforcement Learning from Human Feedback (RLHF), replaces the solver’s role of evaluating correctness by introducing a preference-based evaluation. Instead of absolute scores, PPRM assigns comparative scores to pairs of solutions, deciding which is more promising.
  • In the backpropagation phase, the quality score of a solution is sent back up through the previous steps to update the scores of earlier moves, helping guide future decisions toward more promising paths. These scores are used by the UCT algorithm (a standard form of which is given below).
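For reference, the standard UCT score used in the selection phase balances exploitation and exploration; here Q(s) is the quality estimate of node s (derived here from the PPRM-based rankings), N(·) is a visit count, and c is an exploration constant:

```latex
\mathrm{UCT}(s) \;=\; Q(s) \;+\; c\,\sqrt{\frac{\ln N(\mathrm{parent}(s))}{N(s)}}
```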

Conclusion

The recent progress in leveraging Large Language Models (LLMs) for mathematical problem-solving is remarkable, with the best results on olympiad-level problems achieved through formal theorem-proving techniques. Formal theorem proving seems to be the way to go but still faces major challenges:

  • Scarcity of formal data (problems and solutions written in formal language). Today, there are research directions on ways to gather formal data. The first is to translate informal statements into formal ones, a process called “auto-formalization” (as done for AlphaProof). The second is to generate synthetic “high-quality” data. Today this is done by using proof search to discover new proofs of existing problems, but little is done to generate new problems. An interesting approach in this direction has been proposed by François Fleuret.
  • More efficient algorithms for proof search. Proof search in formal approaches is computationally intensive, requiring efficient exploration of a vast (potentially infinite) search space. Techniques like LLaMA-Berry, which combine natural language reasoning with structured proof search, might offer new ways to address this issue.

By addressing the challenges in mathematical problem-solving, AI researchers are developing generalizable methods that could influence reasoning research beyond mathematics.

