Arithmetic Word Problem Compendium: Using a New Math Word Problem Dataset as a Benchmark for LLMs
Quite a few word problems could fit in there. But maybe not as many as we can make.
Introduction
We've used all the data.
Companies are looking for ways to make LLMs better, more intelligent, and more resilient, even as readily available data runs out. LLMs have also not fared well in mathematics or step-by-step reasoning. Newer reasoning models aim to address this shortcoming, but options can be limited when it comes to training and testing.
There is also a desire to make small language models that can handle the reasoning internally, without needing the inference-time scaling that powers the larger models.
Lately, synthetic data has been proposed as a solution to the problem, but there is still the threat of model collapse, where a model starts to perform poorly when trained on data it generates.
To that end, we've created a new dataset of arithmetic word problems using a templating system that spans multiple domains, is extensible and customizable, and has non-trivial reasoning requirements. Here we'll take a look at how it performs as a benchmark for text-to-text models, and at the possibilities it opens up for the future. (You can go check it out on Hugging Face or Kaggle!)
The templating system
In order not to overtrain on synthetic data, we created a series of templates that produce the word problems. The templates cover various domains, including:
- Agriculture (soil temperature changes, etc.)
- Athletics (training hours, distances, etc.)
- Construction (elevation changes, work hours, etc.)
- Culinary (cooking temperature changes, calories per serving, etc.)
- Education (GPA changes, etc.)
- Entertainment (show ratings, stage lighting, etc.)
- Finance (stock prices, account balances, etc.)
These templates are designed to produce arithmetic word problems built on a running total: a quantity relevant to the domain is repeatedly added to, subtracted from, multiplied, or divided by another quantity.
For instance:
```
"The kitchen's temperature gauge reads -23.9 degrees Celsius. First to heat the temperature, Charlotte adds 23.3 degrees. Later, the required temperature becomes 2.6 times the present value. Following that, the next step increases the temperature by 17.5 degrees. What is the temperature now? Use only 1 decimal place for your answer and any calculations shown."
```
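To give a feel for how a problem like this comes together, here is a minimal sketch of a running-total template in Python. The phrasing fragments, ranges, and helper names are illustrative assumptions, not the actual generator used to build the dataset.

```
import random

# Illustrative sentence fragments for a culinary running-total template.
# These phrases, ranges, and names are hypothetical, not the dataset's own.
PHRASES = {
    "add": "Then {who} adds {x} degrees.",
    "subtract": "Then {who} lowers the temperature by {x} degrees.",
    "multiply": "Later, the required temperature becomes {x} times the present value.",
}

def make_problem(who="Charlotte", steps=3, decimals=1):
    start = round(random.uniform(-30.0, 30.0), decimals)
    ops = random.choices(list(PHRASES), k=steps)
    numbers, sentences = [start], []
    for op in ops:
        x = round(random.uniform(1.1, 25.0), decimals)
        numbers.append(x)
        sentences.append(PHRASES[op].format(who=who, x=x))
    question = (
        f"The kitchen's temperature gauge reads {start} degrees Celsius. "
        + " ".join(sentences)
        + f" What is the temperature now? Use only {decimals} decimal place"
        " for your answer and any calculations shown."
    )
    return {"question": question, "numbers": numbers, "operators": ops}

print(make_problem()["question"])
```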
Each entry has a number of properties. For instance:
- id: the unique identifier for the problem
- question: the word problem
- metadata: the metadata for the problem
The metadata includes:
- discrete: whether the problem is discrete or continuous
- domain: the domain of the problem
- numbers: the numbers in the problem
- object_type: the type of object in the problem
- solution: the solution to the problem
- decimals: the number of decimal places to round the answer to
- operators: the operators in the problem
The entry above looks like this:
```
{"id": "problem_345", "question": "The kitchen's temperature gauge reads -23.9 degrees Celsius. First to heat the temperature, Charlotte adds 23.3 degrees. Later, the required temperature becomes 2.6 times the present value. Following that, the next step increases the temperature by 17.5 degrees. What is the temperature now? Use only 1 decimal place for your answer and any calculations shown.", "metadata": {"discrete": false, "domain": "culinary", "numbers": [-23.9, 23.3, 2.6, 17.5], "object_type": "cooking_temperature_change", "solution": 15.9, "decimals": 1, "operators": ["add", "multiply", "add"]}}
```
The entry also has a corresponding word-based solution to answer the problem.
```
"Here's how we can solve this problem:
Addition: -23.9 degrees + 23.3 = -0.6
When -0.6 degrees are multiplied by 2.6, the result is -1.6
-1.6 degrees plus 17.5 equals 15.9
This calculation leads to the final answer of 15.9."
```
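As a rough illustration of how an entry's metadata ties together, here is a small sketch that replays the stored operators over the stored numbers and checks the result against the solution field. It assumes, as the worked solution above suggests, that each intermediate step is rounded to the stated number of decimal places.

```
# Replay the metadata of problem_345 above and confirm its stored solution.
OPS = {
    "add": lambda a, b: a + b,
    "subtract": lambda a, b: a - b,
    "multiply": lambda a, b: a * b,
    "divide": lambda a, b: a / b,
}

def replay(metadata):
    numbers, decimals = metadata["numbers"], metadata["decimals"]
    total = numbers[0]
    for op, operand in zip(metadata["operators"], numbers[1:]):
        # Round after every step, matching the per-step rounding in the worked solution.
        total = round(OPS[op](total, operand), decimals)
    return total

metadata = {
    "numbers": [-23.9, 23.3, 2.6, 17.5],
    "operators": ["add", "multiply", "add"],
    "decimals": 1,
    "solution": 15.9,
}
assert replay(metadata) == metadata["solution"]  # -0.6 -> -1.6 -> 15.9
```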
The solutions use a variety of step-by-step reasoning styles to make sure the model doesn't overfit to one pattern.
The templates all cover a variety of unique challenges and edge cases, including:
- Multiplication and division by fractions, and making sure the language appropriately describes whether the total will rise or fall.
- Negative numbers, and, again, making sure the language appropriately describes whether the total will rise or fall.
- Ensuring the numbers are rounded correctly.
- Choosing ranges of numbers that are appropriate for the domain. (You don't want a cook who is managing 10,000 dinner plates.)
- Subject-verb agreement, such that the text reads "1 was" instead of "1 were", but also "20 were" instead of "20 was".
- Making sure that the subject of a problem, if discrete, does not result in a fraction at any step. (A sketch of these last two checks appears below.)
The templates also provide interesting questions in areas like temperature changes, weights, and even meta-arithmetic, where percentage points are used. Altogether, an LLM trained or tested on the dataset will see a wide variety of problems and situations.
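The last two checks in the list above are simple enough to sketch. The helpers below are illustrative only; the function names and rules are our shorthand, not the templates' actual code.

```
def verb_for_count(n):
    # "1 was" instead of "1 were", but "20 were" instead of "20 was".
    return "was" if n == 1 else "were"

def stays_discrete(start, operators, operands):
    # Reject operator/operand sequences whose running total ever turns fractional,
    # so a problem about dinner plates never passes through 2.5 plates.
    total = start
    for op, x in zip(operators, operands):
        if op == "add":
            total += x
        elif op == "subtract":
            total -= x
        elif op == "multiply":
            total *= x
        elif op == "divide":
            total /= x
        if total != int(total):
            return False
    return True

print(verb_for_count(1), verb_for_count(20))   # was were
print(stays_discrete(10, ["divide"], [4]))     # False: 10 / 4 = 2.5
```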
Extensibility
A fortunate outcome of the templating system is its easy extensibility. We can generate millions of unique problems.
It's trivial to extend the number of steps in the problems, and we can vary the objects, operand ranges, and mathematical operations. We can add new domains, new quantities, and new vocabulary for problem solving. Names can be interchanged, pronouns adjusted, voice switched between active and passive, and so on.
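One way to picture that extensibility is as a declarative domain description that a generator consumes. The structure below is a hypothetical example we made up for illustration, with its fields chosen to mirror the metadata shown earlier; the real templates may organize things differently.

```
# A made-up "retail" domain, sketched to mirror the metadata fields shown earlier.
RETAIL_DOMAIN = {
    "domain": "retail",
    "object_types": {
        "inventory_count": {"discrete": True, "range": (20, 500), "unit": "items"},
        "daily_revenue": {"discrete": False, "range": (50.0, 5000.0), "unit": "dollars"},
    },
    "names": ["Priya", "Marcus", "Sofia"],
    "operators": ["add", "subtract", "multiply", "divide"],
    "max_steps": 4,
    "verbs": {
        "add": ["receives", "restocks"],
        "subtract": ["sells", "ships out"],
    },
}
```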
Quality assurance
We assure the quality of the data in a number of ways. A simple one is to check the grammar. There we can find issues of subject-verb agreement, simple typos, and more. We found numerous issues over the course of creating the templates, even small things like writing "a 8-fold increase" when it should be "an 8-fold increase". We also have ways to disregard certain checks, like when we want to mix up the language with British spellings, or when the grammar check itself is incorrect, as when one should say "a -8-fold decrease" (because the sign is read aloud as "negative") instead of "an -8-fold decrease".
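A toy version of that article check might look like the following. The heuristic is ours and intentionally narrow; it only needs to capture the "an 8-fold increase" versus "a -8-fold decrease" distinction described above.

```
def article_for(multiplier):
    # Read the sign aloud: -8 becomes "negative 8", which starts with a consonant sound.
    spoken = ("negative " if multiplier < 0 else "") + str(abs(multiplier))
    # Crude vowel-sound test: a leading 8 is read "eight...", so it takes "an".
    # (Rarer cases like 11 or 18 would need their own handling.)
    return "an" if spoken.startswith("8") else "a"

print(article_for(8))    # an  -> "an 8-fold increase"
print(article_for(-8))   # a   -> "a -8-fold decrease"
print(article_for(3))    # a   -> "a 3-fold increase"
```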
The math is also verified in a number of places. One would think this would be an easy check, but we found errors on occasion, such as when intermediate steps were giving fractional results even though the object required a discrete number. Others were more subtle, such as when a number was represented as 0.1344444449 instead of 0.135, which led, under the rounding instructions, to an incorrect answer of 0.13 instead of 0.14.
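That last class of bug is the familiar floating-point one, so one sanity check worth sketching is to compute, or at least cross-check, the reference answers in decimal rather than binary floating point. The numbers below simply reproduce the 0.13-versus-0.14 example; the check itself is our illustration, not the dataset's own tooling.

```
from decimal import Decimal, ROUND_HALF_UP

drifted = 0.1344444449       # what accumulated float arithmetic can drift to
intended = Decimal("0.135")  # what the problem actually means

print(round(drifted, 2))                                            # 0.13
print(intended.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))   # 0.14
```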
A final way of assuring quality was to check the reasoning of the LLMs themselves when they failed to answer a question correctly. Then we were able to go back and distinguish the cases where the LLM had a point and the question was ambiguous from the times when the LLM got an actual calculation wrong, or when the problem was fairly straightforward. The "fairly straightforward" judgement was often vindicated by a different model reasoning through the problem correctly.
As an aside, at one point we took all of the questions the LLM got wrong, and then for each one, gave the correct answer and steps to the LLM and asked it if the mistake was because the problem was unclear, or if it was a simple error that the model should have gotten right. For every single mistake, the model assured us that the question was completely clear and that the fault lay entirely with the model. The sycophancy in the machine was alive and well.
Performance
We've recorded the models' incorrect results in our public repositories, but the summary results on our 1,000-problem dataset are as follows (a sketch of the scoring loop appears after the numbers):
- GPT-3.5 Turbo: 81.70% accuracy
- GPT-4 Turbo: 93.90% accuracy
- o3-mini: 96.40% accuracy
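For reference, here is a minimal sketch of how an accuracy figure like those above could be computed. The ask_model helper is a placeholder for whatever client you use, and the answer extraction and tolerance are assumptions, not our exact harness.

```
import json
import re

def ask_model(question):
    # Placeholder: call the model under test (API client, local model, etc.).
    raise NotImplementedError

def last_number(text):
    # Take the last number in the reply as the model's final answer.
    return float(re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))[-1])

def score(path):
    correct = total = 0
    with open(path) as f:
        for line in f:
            entry = json.loads(line)
            meta = entry["metadata"]
            expected = round(meta["solution"], meta["decimals"])
            tolerance = 10 ** -meta["decimals"] / 2  # half a unit in the last place
            reply = ask_model(entry["question"])
            correct += int(abs(last_number(reply) - expected) <= tolerance)
            total += 1
    return correct / total
```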
The vast majority of the incorrect results from o3-mini were very close to the correct answer.
Sometimes the model would get a calculation wrong, often when it involved decimals. For instance, o3-mini said: "684.43 × 1.32 = 903.41 dollars (rounded to 2 decimal places)" when the answer is 903.4476, which rounds to 903.45.
More often, the model would not follow the instructions of a question when it was told to "Give your answer and any work steps rounded to 2 decimal places." Despite the explicit instruction, intermediate steps would include more than the requested number of decimal places, resulting in incorrect rounding for that step or the steps after it.
Other times, unique phrasing would cause misunderstanding. For example, "Then, following calculations, the cement becomes one-5.65 of the present number of parts" entails multiplication by 1/5.65 (that is, division by 5.65), but it would instead be interpreted as multiplication by 5.65. The model performed correctly when the quantity was phrased as 1/5.65, but the phrasing with "one-" followed by the number was frequently misunderstood.
Fine-tuning or training a model on this dataset might require evaluation of the intermediate steps, but for the purposes of a benchmark, judging the final answer was enough, as the mistakes in intermediate steps would result in an incorrect calculation in the end.
Conclusion
This dataset can be a valuable benchmark for the community, especially because the questions and answers are template-generated and so cannot have been memorized from data elsewhere. It can also encourage the development of resilient reasoning.
It can be useful for testing the capabilities of small language models that aim to perform well on reasoning or mathematical tasks, as opposed to relying on a massive model's output, which may be impractical to generate as well as to validate.
It can serve as a playground for seeing how well new prompting techniques or agentic approaches enhance problem solving.
Though the current dataset is English-only, the templates themselves could be translated into other languages, and allow for reasoning tasks in even more contexts.
We hope researchers will find the dataset useful, and push the state of the art for mathematical problem solving and beyond.