A month or two ago I wrote this post which expressed my frustration with various issues around private datasets as a way of measuring the mathematical abilities of language models. More generally I was frustrated about the difficulty of being able to judge closed source software owned by a tech company when it’s extremely difficult to do science (i.e. perform reproducible experiments) on it. The post was written in anger, and I’ve calmed down a bit now. In the post I foolishly basically said “If you want a job done well you should do it yourself so let me try”.
In this post I will firstly go over my frustrations again (i.e. whinge a bit), and then I will report on my own (basically failed) attempt to do what Epoch AI did on the cheap. I’ll then reflect on lessons learnt and I would be interested in people’s suggestions about what to do next. My first post did open up a conversation, with several people making comments to me which I found very helpful and clarifying, and hopefully I’ll pass on some of these useful thoughts below.
Language models and mathematics
My frustrations with language models as a tool for mathematics remain. Since I wrote the first post I have been experimenting with them on questions around the mathematics of Fermat’s Last Theorem. A phenomenon I saw several times was that when it came to specific details, models were extremely good at inserting unjustified or invalid assumptions. Here is an extremely dumb example to start us off.
Earlier this year I asked several language models what the dimension of the space of modular forms of level Gamma_0(5^100) and weight 3^100 was. Obviously to many readers this question will be gobbledegook so let me explain the point. There is a moderately complicated formula for the dimension of the space of modular forms of level Gamma_0(N) and weight k, which all language models will know by now. That formula goes: “if k is odd, then the answer is 0, else the answer is [some messy equation depending on N and k]”. Every model I asked this question to just blindly applied the equation; none of them checked to see if 3^100 was odd or not (and of course it is, because it’s the product of 100 odd numbers). In short, the answer to this question is “it’s obviously zero”. No model got it right, and I tried four (the formula gives an answer which is a gigantic number). Anyone who finds an LLM which gets this right, please let me know in the comments: it surprised me that such basic mistakes were still being made but perhaps I didn’t try enough systems (and I’m aware that they’re getting better all the time).
I found several other examples of this phenomenon, where answers to specific questions used techniques which relied on assumptions which were invalid. It is completely consistent in my mind that language models are going to struggle if they are training on textbooks or papers which start with sentences like “in this book/paper, all rings and schemes are assumed to be Noetherian”. The model takes ideas from the source and what is to stop it applying them in situations where this assumption is invalid? My understanding of these systems is that they do not even really “understand” that the assumption is there. The modular forms example above showed me that they can even lose track of an assumption which is being made only perhaps half a page earlier.
We have seen the great successes of language models when applied to self-contained high-school or undergraduate level level problems but right now I really don’t see how we can move beyond this with the ideas we have; applying invalid assumptions is really just a form of confabulation (hallucination) which is occurring because these machines still have no understanding, they are regurgitating techniques which they have seen based on pattern-matching as opposed to mathematical rigour. I think these examples of systems implicitly making unjustified assumptions in order to proceed are exposing a problem which will be very difficult to resolve. Although I am not an expert, and my opinion is arguably biased, I think that integration of LLMs with computer theorem provers is required to take this stuff to the next level (just as integration of LLMs with traditional programming languages such as python solved the problem of early LLMs being completely unable to accurately multiply two ten-digit numbers).
Language models are answering the wrong questions
A major problem that I think we have right now in the area of language models and mathematics, is that scientists need to quantify success so that it can be measured, and the currently favoured way of doing this is hugely flawed. Both the AIMO prize competitions and the FrontierMath dataset involve private datasets of questions which are AI-hard, but the answer is a whole number. The dataset I proposed to make also had this property. But any mathematician would remark that this is completely unrepresentative of what mathematics actually is. Mathematicians try to find patterns, and then they try to prove that their observations are correct. They make conjectures, and they prove theorems or find counterexamples. We have seen examples of neural networks making conjectures and finding counterexamples (for example here, here, here, here, and there are many other examples) but language models are not involved at all in these works; these papers are all using neural networks to analyse mathematical data and do pattern-spotting. The new Epoch AI Frontier Math Tier 4 proposal is pushing this idea as far as it will go, asking for questions which test modern research methods but whose answers need to be a number; this is effectively ruling out many areas of mathematics which cannot be shoehorned into such a framework.
I personally would like to see better ways of testing language models on research level mathematics, but right now this is difficult because if the question is “prove this theorem” (which would be a huge step forward from “work out this number”) but the answer is a page of mathematical text then it’s currently very difficult for a machine to judge this output. Such output thus needs to be read and judged by a human, and this is slow and expensive: it doesn’t scale. This is another reason why I think that formal mathematics needs to be involved in order to progress: theorem provers are able to accurately assess the correctness of mathematical proofs. The downside is that the proof needs to be written in the language of the theorem prover, making the task of producing it far more difficult for an AI.
The current status of the “Think of a number” experiment.
Having whinged about how “what is this number” is the wrong question, I will now report on my attempt to make a database of hard number theory questions of the form “what is this number”. Around six weeks ago I wrote the blog post linked to above, and I circulated it to many colleagues of mine in the number theory community. There was an initial flurry of interest and I got some super problems: within only a few days I had 15 very promising-looking questions. The vast majority were in number theory, essentially none of them could be solved using only the techniques in a typical undergraduate degree. Some of them were accessible using what one could call “1st year graduate student level” techniques; others needed genuinely profound results. The answers were numbers, or could easily be translated into numbers (e.g. they were lists of numbers).
Perhaps due to deficiencies in my own explanation of what I was looking for, the problems I was sent fell into two rather distinct groups. Firstly, there were problems which could be solved by human reasoning alone, i.e. using pen and paper, as long as you knew the relevant techniques and theorems in the literature. And secondly there were problems which boiled down to computations which were completely impossible to do by hand but which could be done by writing bespoke code in standard computer algebra packages (such as pari-gp or magma or SageMath) and then leaving your computer on (for a few seconds or perhaps a few hours).
Interestingly, some of the questions I was sent had non-unique solutions. This was not a development I had been expecting. They were of the form “find a number with the following property” (or variants of this, for example “find an elliptic curve over a finite field with the following property”; such an object can be encoded by a number). For these questions, a new problem arises in checking the answer. In all cases, checking the answer was much easier than finding an answer, but sometimes checking the answer was something which took a few seconds or minutes of running bespoke code on a computer algebra system, which provides an extra headache if one wants to do such a thing automatically. I had not anticipated such questions coming in.
Some of the problems came with solutions, but some did not. Of the ones which did not, for some of them I could envisage spending many hours solving them manually. Ouch. I had not expected this twist either: my perhaps naive plan was clearly to get other people to do the work for me for free and here this had come back to bite me. In fact one comment made to me after the original post was by a computer scientist who said “how have these tech companies tricked you and your colleagues into doing work which benefits them, for free?”. Ouch.
But anyway, so far so good: we were only a few days in, I had 15 questions, an unexpectedly large amount of homework, some technical problems which I had not envisaged, but in general things were looking really interesting.
And then the submissions literally completely dried up. I had one more submission just before the 20th Feb deadline taking me to 16.
And I still haven’t done the homework, so for several of these questions nobody actually knows the answers at all. However as the submitter of one of these questions pointed out, “whilst I don’t know the answer, it would still be very easy to look at a machine’s reasoning and point out the mistakes in it”.
So to summarise, I have some data, but I am not sure that it’s enough to make an interesting thing yet; and to make it an interesting thing I would (a) need more data and (b) need to do a lot of time-consuming work which right now I am not sure I am motivated to do 😦 I guess the main problem is that even if I did do this work, it is not clear to me that the results would be of any interest.
I am not entirely clear about how to move forwards from here but several people have asked me about the status of the problem so I felt obliged to post this update. For those that submitted questions: I’ve not told them to anyone else, and I would note that Epoch AI seem to be offering $7500 per problem which they accept onto their dataset (see the link to Tier 4 above).
Lessons learnt
Epoch AI have made what sounds like a very impressive database, but it cost them a lot of money. I tried to do the same thing for free based on the good will of my number theory colleagues and I got nowhere near.
In retrospect I think that my reaction in anger to “current datasets are seriously flawed, let’s make another one” was very naive in the sense that even if I had got 20 or even 50 hard number theory problems with unique solutions, such a dataset still has many of the problems which I am complaining about. There are all sorts of problems with private datasets (e.g. they can only be used once per tech company as far as I can see, although AIMO’s approach using kaggle works great for open source models). And they are not measuring what mathematicians are doing (proving theorems); instead they are measuring what the tech companies are capable of doing (producing numbers).
Several people said to me about the original post “this is clearly a mathematician in denial, scared for their career because the machines are coming”. I can see that my thoughts can be interpreted in this way. But right now I am so sceptical about current approaches when it comes to machines proving theorems that I’m going to stick to my guns. I think my current take on things is: progress is incredibly rapid, but these systems are still so far away from being a game-changer to a researcher when it comes to helping with technical details. And past performance is not a guarantee of future results.
There have been some successes with computers helping mathematicians in crazy new ways. Neural networks being used as data-analysis tools have helped research mathematicians to make progress in certain areas of mathematics where questions are of a certain form (e.g. “find a combinatorial counterexample to this claim” or “spot a new pattern in this database of numbers”). However much of mathematics is not of this form. Language models have helped research mathematicians; I have an ever-growing list of examples of language models making helpful observations or helping with brainstorming ideas. The percentage hit rate right now is extremely low (e.g. “ask the LLM 20 questions and it says one useful thing for one of them”). Even a 5% hit rate is exciting — don’t get me wrong! It is great that my community is learning how to use these tools, and it is great that these tools are occasionally proving useful to research mathematicians. But google is also a useful tool for a research mathematician and nobody is running around claiming that search engines are going to solve mathematics. Sure LLMs will get better. But will they become a game-changer? Right now this is not at all clear to me.
Something which I think would help to take LLMs to the next level would be far more rigorous ways of measuring their abilities on technical questions. Right now the FrontierMath dataset seems to be the best thing we have. It is not adequate for many reasons, but right now is difficult for me to see how to do better with current technology. As I continue to say, integration of LLMs with theorem provers seems to me to be a very natural next step. But this is expensive and difficult, and recent successes such as AlphaProof are still only engaging with mathematical questions which can be understood by a schoolchild.
It has been suggested to me that one big bottleneck with the approach of integrating LLMs and ITPs is that the size of the formal mathematics libraries available is too small. “The bottleneck is mathlib“, someone from a tech company told me recently; Lean’s formal mathematics library is “only” millions of lines of code rather than billions, even if you take into account the over 1300 open pull requests which need reviewing. The formal mathematics community is too small and mathematics departments are not good at rewarding work in this area. Perhaps a useful intermediate goal is figuring out how AI can help to increase the size of a formal mathematics library, so other AIs can learn from it. Sergei Gukov suggests around 32 minutes into the video on the Epoch AI Tier 4 blog post that autoformalization (that is, getting a machine to read in LaTeX and output code in a language such as Lean) will be the key. How difficult this will be remains to be seen; I have seen progress recently but whether it will scale is another question. One interesting test problem: how long will it be before we can feed in the 10,000 page human-written proof of the classification of finite simple groups into a computer and get formalized code out? Whether or not this helps AI to solve modern problems, it would certainly be a spectacular milestone.














