Think of a number: an update

A month or two ago I wrote this post which expressed my frustration with various issues around private datasets as a way of measuring the mathematical abilities of language models. More generally I was frustrated about the difficulty of being able to judge closed source software owned by a tech company when it’s extremely difficult to do science (i.e. perform reproducible experiments) on it. The post was written in anger, and I’ve calmed down a bit now. In the post I foolishly basically said “If you want a job done well you should do it yourself so let me try”.

In this post I will firstly go over my frustrations again (i.e. whinge a bit), and then I will report on my own (basically failed) attempt to do what Epoch AI did on the cheap. I’ll then reflect on lessons learnt and I would be interested in people’s suggestions about what to do next. My first post did open up a conversation, with several people making comments to me which I found very helpful and clarifying, and hopefully I’ll pass on some of these useful thoughts below.

Language models and mathematics

My frustrations with language models as a tool for mathematics remain. Since I wrote the first post I have been experimenting with them on questions around the mathematics of Fermat’s Last Theorem. A phenomenon I saw several times was that when it came to specific details, models were extremely good at inserting unjustified or invalid assumptions. Here is an extremely dumb example to start us off.

Earlier this year I asked several language models what the dimension of the space of modular forms of level Gamma_0(5^100) and weight 3^100 was. Obviously to many readers this question will be gobbledegook so let me explain the point. There is a moderately complicated formula for the dimension of the space of modular forms of level Gamma_0(N) and weight k, which all language models will know by now. That formula goes: “if k is odd, then the answer is 0, else the answer is [some messy equation depending on N and k]”. Every model I asked this question to just blindly applied the equation; none of them checked to see if 3^100 was odd or not (and of course it is, because it’s the product of 100 odd numbers). In short, the answer to this question is “it’s obviously zero”. No model got it right, and I tried four (the formula gives an answer which is a gigantic number). Anyone who finds an LLM which gets this right, please let me know in the comments: it surprised me that such basic mistakes were still being made but perhaps I didn’t try enough systems (and I’m aware that they’re getting better all the time).

I found several other examples of this phenomenon, where answers to specific questions used techniques which relied on assumptions which were invalid. It is completely consistent in my mind that language models are going to struggle if they are training on textbooks or papers which start with sentences like “in this book/paper, all rings and schemes are assumed to be Noetherian”. The model takes ideas from the source and what is to stop it applying them in situations where this assumption is invalid? My understanding of these systems is that they do not even really “understand” that the assumption is there. The modular forms example above showed me that they can even lose track of an assumption which is being made only perhaps half a page earlier.

We have seen the great successes of language models when applied to self-contained high-school or undergraduate level level problems but right now I really don’t see how we can move beyond this with the ideas we have; applying invalid assumptions is really just a form of confabulation (hallucination) which is occurring because these machines still have no understanding, they are regurgitating techniques which they have seen based on pattern-matching as opposed to mathematical rigour. I think these examples of systems implicitly making unjustified assumptions in order to proceed are exposing a problem which will be very difficult to resolve. Although I am not an expert, and my opinion is arguably biased, I think that integration of LLMs with computer theorem provers is required to take this stuff to the next level (just as integration of LLMs with traditional programming languages such as python solved the problem of early LLMs being completely unable to accurately multiply two ten-digit numbers).

Language models are answering the wrong questions

A major problem that I think we have right now in the area of language models and mathematics, is that scientists need to quantify success so that it can be measured, and the currently favoured way of doing this is hugely flawed. Both the AIMO prize competitions and the FrontierMath dataset involve private datasets of questions which are AI-hard, but the answer is a whole number. The dataset I proposed to make also had this property. But any mathematician would remark that this is completely unrepresentative of what mathematics actually is. Mathematicians try to find patterns, and then they try to prove that their observations are correct. They make conjectures, and they prove theorems or find counterexamples. We have seen examples of neural networks making conjectures and finding counterexamples (for example here, here, here, here, and there are many other examples) but language models are not involved at all in these works; these papers are all using neural networks to analyse mathematical data and do pattern-spotting. The new Epoch AI Frontier Math Tier 4 proposal is pushing this idea as far as it will go, asking for questions which test modern research methods but whose answers need to be a number; this is effectively ruling out many areas of mathematics which cannot be shoehorned into such a framework.

I personally would like to see better ways of testing language models on research level mathematics, but right now this is difficult because if the question is “prove this theorem” (which would be a huge step forward from “work out this number”) but the answer is a page of mathematical text then it’s currently very difficult for a machine to judge this output. Such output thus needs to be read and judged by a human, and this is slow and expensive: it doesn’t scale. This is another reason why I think that formal mathematics needs to be involved in order to progress: theorem provers are able to accurately assess the correctness of mathematical proofs. The downside is that the proof needs to be written in the language of the theorem prover, making the task of producing it far more difficult for an AI.

The current status of the “Think of a number” experiment.

Having whinged about how “what is this number” is the wrong question, I will now report on my attempt to make a database of hard number theory questions of the form “what is this number”. Around six weeks ago I wrote the blog post linked to above, and I circulated it to many colleagues of mine in the number theory community. There was an initial flurry of interest and I got some super problems: within only a few days I had 15 very promising-looking questions. The vast majority were in number theory, essentially none of them could be solved using only the techniques in a typical undergraduate degree. Some of them were accessible using what one could call “1st year graduate student level” techniques; others needed genuinely profound results. The answers were numbers, or could easily be translated into numbers (e.g. they were lists of numbers).

Perhaps due to deficiencies in my own explanation of what I was looking for, the problems I was sent fell into two rather distinct groups. Firstly, there were problems which could be solved by human reasoning alone, i.e. using pen and paper, as long as you knew the relevant techniques and theorems in the literature. And secondly there were problems which boiled down to computations which were completely impossible to do by hand but which could be done by writing bespoke code in standard computer algebra packages (such as pari-gp or magma or SageMath) and then leaving your computer on (for a few seconds or perhaps a few hours).

Interestingly, some of the questions I was sent had non-unique solutions. This was not a development I had been expecting. They were of the form “find a number with the following property” (or variants of this, for example “find an elliptic curve over a finite field with the following property”; such an object can be encoded by a number). For these questions, a new problem arises in checking the answer. In all cases, checking the answer was much easier than finding an answer, but sometimes checking the answer was something which took a few seconds or minutes of running bespoke code on a computer algebra system, which provides an extra headache if one wants to do such a thing automatically. I had not anticipated such questions coming in.

Some of the problems came with solutions, but some did not. Of the ones which did not, for some of them I could envisage spending many hours solving them manually. Ouch. I had not expected this twist either: my perhaps naive plan was clearly to get other people to do the work for me for free and here this had come back to bite me. In fact one comment made to me after the original post was by a computer scientist who said “how have these tech companies tricked you and your colleagues into doing work which benefits them, for free?”. Ouch.

But anyway, so far so good: we were only a few days in, I had 15 questions, an unexpectedly large amount of homework, some technical problems which I had not envisaged, but in general things were looking really interesting.

And then the submissions literally completely dried up. I had one more submission just before the 20th Feb deadline taking me to 16.

And I still haven’t done the homework, so for several of these questions nobody actually knows the answers at all. However as the submitter of one of these questions pointed out, “whilst I don’t know the answer, it would still be very easy to look at a machine’s reasoning and point out the mistakes in it”.

So to summarise, I have some data, but I am not sure that it’s enough to make an interesting thing yet; and to make it an interesting thing I would (a) need more data and (b) need to do a lot of time-consuming work which right now I am not sure I am motivated to do 😦 I guess the main problem is that even if I did do this work, it is not clear to me that the results would be of any interest.

I am not entirely clear about how to move forwards from here but several people have asked me about the status of the problem so I felt obliged to post this update. For those that submitted questions: I’ve not told them to anyone else, and I would note that Epoch AI seem to be offering $7500 per problem which they accept onto their dataset (see the link to Tier 4 above).

Lessons learnt

Epoch AI have made what sounds like a very impressive database, but it cost them a lot of money. I tried to do the same thing for free based on the good will of my number theory colleagues and I got nowhere near.

In retrospect I think that my reaction in anger to “current datasets are seriously flawed, let’s make another one” was very naive in the sense that even if I had got 20 or even 50 hard number theory problems with unique solutions, such a dataset still has many of the problems which I am complaining about. There are all sorts of problems with private datasets (e.g. they can only be used once per tech company as far as I can see, although AIMO’s approach using kaggle works great for open source models). And they are not measuring what mathematicians are doing (proving theorems); instead they are measuring what the tech companies are capable of doing (producing numbers).

Several people said to me about the original post “this is clearly a mathematician in denial, scared for their career because the machines are coming”. I can see that my thoughts can be interpreted in this way. But right now I am so sceptical about current approaches when it comes to machines proving theorems that I’m going to stick to my guns. I think my current take on things is: progress is incredibly rapid, but these systems are still so far away from being a game-changer to a researcher when it comes to helping with technical details. And past performance is not a guarantee of future results.

There have been some successes with computers helping mathematicians in crazy new ways. Neural networks being used as data-analysis tools have helped research mathematicians to make progress in certain areas of mathematics where questions are of a certain form (e.g. “find a combinatorial counterexample to this claim” or “spot a new pattern in this database of numbers”). However much of mathematics is not of this form. Language models have helped research mathematicians; I have an ever-growing list of examples of language models making helpful observations or helping with brainstorming ideas. The percentage hit rate right now is extremely low (e.g. “ask the LLM 20 questions and it says one useful thing for one of them”). Even a 5% hit rate is exciting — don’t get me wrong! It is great that my community is learning how to use these tools, and it is great that these tools are occasionally proving useful to research mathematicians. But google is also a useful tool for a research mathematician and nobody is running around claiming that search engines are going to solve mathematics. Sure LLMs will get better. But will they become a game-changer? Right now this is not at all clear to me.

Something which I think would help to take LLMs to the next level would be far more rigorous ways of measuring their abilities on technical questions. Right now the FrontierMath dataset seems to be the best thing we have. It is not adequate for many reasons, but right now is difficult for me to see how to do better with current technology. As I continue to say, integration of LLMs with theorem provers seems to me to be a very natural next step. But this is expensive and difficult, and recent successes such as AlphaProof are still only engaging with mathematical questions which can be understood by a schoolchild.

It has been suggested to me that one big bottleneck with the approach of integrating LLMs and ITPs is that the size of the formal mathematics libraries available is too small. “The bottleneck is mathlib“, someone from a tech company told me recently; Lean’s formal mathematics library is “only” millions of lines of code rather than billions, even if you take into account the over 1300 open pull requests which need reviewing. The formal mathematics community is too small and mathematics departments are not good at rewarding work in this area. Perhaps a useful intermediate goal is figuring out how AI can help to increase the size of a formal mathematics library, so other AIs can learn from it. Sergei Gukov suggests around 32 minutes into the video on the Epoch AI Tier 4 blog post that autoformalization (that is, getting a machine to read in LaTeX and output code in a language such as Lean) will be the key. How difficult this will be remains to be seen; I have seen progress recently but whether it will scale is another question. One interesting test problem: how long will it be before we can feed in the 10,000 page human-written proof of the classification of finite simple groups into a computer and get formalized code out? Whether or not this helps AI to solve modern problems, it would certainly be a spectacular milestone.

Posted in Machine Learning, Research formalisation | Tagged , , , , , | 16 Comments

What is a quotient?

Undergraduate mathematicians usually have a hard time defining functions from quotients in Lean, because they have been taught a specific model for quotients in their classes, which is not the model that Lean uses. This post is an attempt to explain what’s going on.

What are the natural numbers?

Before I start on what quotients are, let’s talk about what the natural numbers are. What is 37? What is it made of?

If you asked Euclid or Euler or Gauss or Riemann this question, they would think you were talking nonsense. 37 is a number, and it’s not made of anything. If you asked Peano, you would perhaps get a more nuanced answer: Peano might say that it doesn’t matter what 37 is made of, all that matters is that the natural numbers satisfy Peano’s axioms, because this is all you need to set up the theory (and if you don’t believe him, you should play the natural number game).

But if you were to ask someone who knows something about the foundations of mathematics, they would say that it is possible to give an answer, but only after you have decided which foundations you are using. If you build mathematics within set theory then the natural numbers are a set, as is 37, and under the usual conventions we have that 37 is the set {0,1,2,3,…,36}. If you build things within type theory then the natural numbers are typically an inductive type, and 37 is the term succ (succ (succ (succ ... (succ 0)))...) of this type. Finally, if you build mathematics within category theory then the natural numbers are typically an object of a certain category, and 37 will be a certain morphism from a terminal object into this natural number object.

But the point here is that the actual details of what 37 “is” do not matter. Gauss was proving quadratic reciprocity before any of these foundational theories were even proposed. All that matters is that the natural numbers, however they are modelled, satisfy Peano’s axioms. Peano’s axioms characterise the natural numbers uniquely up to unique isomorphism. In other words, if you have two models of the natural numbers and they both satisfy Peano’s axioms then those models are in a very strong sense “the same thing”.

Similarly, it doesn’t matter what the real numbers actually are, they can be a set, a type or an object of a category, and their elements can be Cauchy sequences, Dedekind cuts, Bourbaki uniform space completions, Eudoxus reals or whatever your favourite construction of the real numbers is; all that matters is that they are a complete archimedean ordered field. This hypothesis characterises the reals uniquely up to unique isomorphism, and you can build all of modern analysis from it (indeed my first analysis lecturer did just this: they developed all the basic theory of sequences, sums and integrals just from this assumption).

What is a quotient?

When I was a kid, I was into the maths olympiad scene, and I knew about the integers mod n. For example I knew that the integers mod 10 were {0,1,2,3,4,5,6,7,8,9} with addition and multiplication redefined so that if you go above 10 then you just subtract 10 until you’re back less than 10. For example 9 + 3 = 2 in the integers mod 10, because 10=0 and 11=1 and so on. This was my childhood model of the integers mod 10. But then I went to the University of Cambridge and there I was told by my algebra lecturer that this model was wrong. The integers mod 10, or \mathbb{Z}/10\mathbb{Z} as I was now expected to call them, were a quotient group, and so by definition the elements were cosets. Turns out that 3 wasn’t an element of the integers mod 10 after all, turns out that the thing I was calling 3 was actually the infinite set {...-17,-7,3,13,23,33,43,...}, or 3+10\mathbb{Z} or [3].

At the time, I was confused, because I had hard evidence that my model for the integers mod 10 worked fine, and nobody had hauled me up on this when marking my solutions to olympiad questions. But it was explained to me that my idea of choosing a canonical set of coset representatives did not generalise well, and that I would be better off modelling quotients as sets of sets. Whatever. It took me decades to understand that this claim was not the end of the story (and I thank Patrick Massot for pointing this out to me).

The point of this post is to explain that, just as it doesn’t matter what the natural numbers are as long as they satisfy Peano’s axioms, and just as it doesn’t matter what the real numbers are as long as they are a complete ordered archimedean field, it also doesn’t matter what the elements of a quotient group are, as long as a certain axiom is satisfied, which will characterise quotient groups uniquely up to unique isomorphism.

“Well-defined” functions.

So what is the “axiom for quotients”? It is typically very well-hidden in an undergraduate mathematics curriculum, and is usually explained in terms which heavily rely on the “set of sets” model for quotients. The axiom is often not clearly stated (I looked back in my undergraduate notes and it never seems to be made explicit). However I do understand why not; the axiom is actually a universal property, which is quite an abstract concept. One would like to talk about things like quotient groups as early as possible in the mathematics curriculum (for example, to state the first isomorphism theorem for groups); but students are typically only told about universal properties in more advanced algebra courses covering, for example, tensor products and localisations (both of which are, uncoincidentally, constructed as quotients).

The axiom we seek is typically hidden behind this strange idea that a function is “well-defined”. Let’s keep running with our example of the additive group of integers mod 10. If we want to define a function to this set, it’s very easy. There is a canonical “reduction mod 10” function from the integers to the integers mod 10; let’s call it q.

If we now have some other set X and want to give a function f from X to \mathbb{Z}/10\mathbb{Z}, then one way to do so would simply be to give a function \tilde{f} from X to \mathbb{Z}, and then just compose this with q.

Given \tilde{f}, we can just make f by composing with q.

If X happens to be a group and \tilde{f} happens to be a group homomorphism, then f will also be a group homomorphism, because q is. So we have a perfectly good way of defining maps into the quotient, which doesn’t rely on anything fancy. The fun starts when we want to define a map out of the quotient. This is where we have to check that something is “well-defined”.

Let’s work through a mathematically simple example to remind us of the “well-defined yoga” which we put undergraduates through. To know a positive integer mod 10 is to know its last digit, and if you know the last digit of a number then you can figure out if it’s even or odd. So there should be some natural map s from the integers mod 10 to the set {Even,Odd}. Let’s write this map with a dotted arrow and then discuss how we can go about defining it.

We want to define the map s and check it’s “well-defined”.

Let’s now run through the way we are formally taught how to construct s. First we choose an integer mod 10, for example 8, or, as my algebra lecturer would have called it, [8]. Then we remember that secretly this element is actually an infinite set {...-12,-2,8,18,28,...}. So now we choose a random element in this set, for example 28. Now 28 is clearly even, so we define s([8])=\mathrm{Even}.

But there’s a catch! The catch is that at some point in the procedure we made a completely random choice of an element of an infinite set. What if we had chosen 108 or 1000008 instead? Well, here’s a funny thing: both 108 and 1000008 are also even, as indeed is every element of the equivalence class, so it didn’t actually matter which element we chose; we always got the same answer. This means that our definition of s([8]) is “well-defined”. Indeed, we can dress up the key point of this argument as a theorem:

Theorem. If x and y are integers which are congruent mod 10, then x is even if and only if y is even, and x is odd if and only if y is odd.

Proof. If x and y are congruent mod 10, then x - y is a multiple of 10 and thus a multiple of 2, so x and y have the same parity: they’re either both even, or both odd. QED.

We defined s([8]); let’s now do the general case. Given t, an integer mod 10, t is secretly an equivalence class of integers, where two integers are equivalent if they’re congruent mod 10. Choose an element x of this equivalence class and see if it’s even or odd; define s(t) to be the answer. If we had instead chosen a different element y in t then by the theorem above we would have got the same answer for s(t); thus s(t) is “well-defined” and thus exists.

The question is: what actually just happened?

The universal property

What we actually did was the following. There’s certainly a well-defined map \tilde{s} from the integers to {Even,Odd}. So what we actually have is this situation:

And what we want to do is to construct the diagonal map s such that \tilde{s}=s\circ q:

But, in contrast to when we were making f from \tilde{f}, we can’t define s by composing q and \tilde{s} because they’re not composable this time: the source of one of these maps isn’t equal to the target of the other. We already defined s in the traditional way above, using this “well-defined” argument, which boiled down to the theorem we stated and proved earlier. Now we have introduced \tilde{s} we can restate this theorem as

Theorem. If x and y are congruent mod 10 then \tilde{s}(x)=\tilde{s}(y).

In other words,

Theorem. If x and y are equivalent, then \tilde{s}(x) and \tilde{s}(y) are equal.

This theorem was exactly what we needed to construct the function s.

So the axiom must be: to give a function from \mathbb{Z}/10\mathbb{Z} to a set Y is to give a function from \mathbb{Z} to Y which is constant on equivalence classes. That is what all this “well-defined” stuff boils down to.

How it’s done in Lean

Let’s now consider the general case, and how it’s done in Lean. Given a type X and an equivalence relation r on X, one can make the quotient Quot r of X by this equivalence relation (so in the example above, r is “congruence mod 10” and Quot r is \mathbb{Z}/10\mathbb{Z}), and there’s a function Quot.mk r : X → Quot r (i.e. Quot.mk r is the function we were calling q above). The terms of type Quot r are not equivalence classes; they are opaque (i.e., we don’t know what they are). If you really want the equivalence class associated to an element of the quotient, you can have it; just take its preimage under Quot.mk r. But are you sure you actually want it?

One time when we think we want the equivalence class is when defining a function from the quotient, using the “well-defined yoga” that we just ploughed through above. Lean’s version of this yoga is called Quot.lift. This is a function which eats a function \tilde{s} : X → Y plus a proof that it is constant on equivalence classes, and “descends” it to a function s : Quot r → Y such that \tilde{s} = s \circ q. Furthermore the Lean developers have designed things so that the proof that \tilde{s}(x)=s(q(x)) is just rfl, i.e., it’s true by definition. Unfortunately they also decided to write their diagrams upside-down so we are stuck with calling s the “lift” of \tilde{s}.

This axiom, that to give a function X/~ → Y is to give a function X → Y constant on equivalence classes, and that the obvious triangle commutes, is the universal property of quotients. Just like the earlier examples of axioms characterising constructions, it is all we need to know about quotients. We don’t need to assume that an element of a quotient is a coset, all we need to know is the axiom, and we can use this, and nothing more than this, to develop the entire theory of quotient groups, quotient rings and so on; indeed this is what happens in Lean’s mathematics library. There are also many more convenient consequences of this axiom which are available to a Lean user: for example QuotientGroup.lift is the construction which given a group homomorphism G → H and a proof that the normal subgroup N of G is in the kernel of this homomorphism, descends (or “lifts” :-/) this homomorphism to a homomorphism G/N → H. To give another example, QuotientGroup.map is the function which eats a group homomorphism φ : G → H, normal subgroups N of G and M of H and a proof that N ⊆ φ⁻¹(M), and produces a group homomorphism G/N → H/M. But all of these more convenient ways to construct maps from a quotient are ultimately built from Quot.lift.

If you want to see more details of how to use quotients in Lean, I show how to define the integers from the naturals here, and the rationals from the integers here.

Appendix: what Lean’s quotients are made of

If you’re still a bit confused by the above, you might not want to read this last part.

One reason why Lean does not model quotients as sets of sets is that Lean uses type theory as its foundation, so there are types G and terms g, and Lean writes g : G when a set theorist would write g\in G. But g is a term, not a type, so x : g makes no sense. Set theory allows you to make sets of sets, but type theory does not allow you to make types of types; types contain terms, terms don’t contain anything because they’re not containers, and that’s how it works. We can make sets in type theory and then make sets of sets, but given that groups and rings and fields and so on are all set up as types in Lean’s mathematics library, there is little point using Lean’s sets just to make a model of a quotient when it can be done using the axiomatic approach above. However Lean still has to somehow actually define the quotient of a type X by an equivalence relation: what is the definition under the hood?

I was amused to learn that in Lean, internally the terms of X/~ are just defined to be the terms of X! But equality is redefined to be ~. This can be done because unlike in set theory, where an axiom tells you when two sets are equal, equality is a much more flexible concept in Lean’s type theory; it is defined as an inductive type rather than being a fundamental part of the logic. If you think this approach is strange, I would argue that actually it probably aligns far better with Gauss’ model of the integers mod 10 than the “set of sets” approach commonly adopted by mathematics departments nowadays. Surely we’re happy with the idea that 2 = 12 in \mathbb{Z}/10\mathbb{Z}? Lean just takes this literally, rather than muddying the waters.

Posted in formalising mathematics course, M1F, undergrad maths | Tagged , , , , | 6 Comments

Think of a number.

My feed was recently clogged up with news articles reporting that Sam Altman thinks that AGI is here, or will be here next year, or whatever. I will refrain from giving even more air to this nonsense by linking to the stories. This kind of irresponsible hype-generation drives me nuts (although it also drives up stock prices so I can see why the tech bros are motivated to do it). Sure AI can have a good crack at undergraduate mathematics right now, and sure that’s pretty amazing. But our universities are full of students who can also have a good crack at undergraduate mathematics so in some sense we have achieved very little (and certainly nothing which is of any use to me as a working mathematician in 2025). If AI cannot do mathematics at PhD-student level (i.e., if it can’t start thinking for itself) then AGI is certainly not here, whatever “AGI” even means.

In an attempt to move beyond the hype and to focus on where we are right now, I’m going to try an experiment. It might fail. But if you’re a number theorist reading this, you can help me make it succeed.

Can AI do mathematics?

The null hypothesis as far as I am concerned is “not really”. More precisely, I have seen absolutely no evidence that language models are doing anything other than paraphrasing and attempting to generalise what they’ve seen on the internet, not always correctly. Don’t get me wrong — this gets you a long way! Indeed many undergraduate mathematics exams at university level are designed so that students who have a basic knowledge of the ideas in the course will pass the exam. The exam will test this knowledge by asking the student either to regurgitate the ideas, or to apply them in situations analogous to those seen in the problem sheets which came with the course. You can pass these exams by intelligent pattern-matching (if you have an encyclopedic knowledge of the material, which a machine would have). Furthermore, undergraduate pure mathematics is stagnant. The courses (other than my Lean course) which we teach in the pure mathematics degree at Imperial College London are essentially exactly the same as those which were offered to me when I was an undergraduate in Cambridge in the late 1980s, and very little post-1960 mathematics is taught at undergraduate level, even at the top institutions in the world, because it simply takes too long to get there. This is exactly why language models can pass undergraduate mathematics exams. But as I already said, this is no use to me as a working mathematician and it is certainly not AGI.

How do we test AI?

If you look at currently available mathematics databases, they are essentially all focused on mathematics which is at undergraduate or olympiad-style level. Perhaps the tech bros are deluded in thinking that because their models are getting high marks on such databases then they’re doing well. I want to firstly stress the fundamental notion that these databases are completely unrepresentative of mathematics. More recently we had the FrontierMath dataset, which was designed by mathematicians and which looked like a real step in the right direction, even though the answers to the questions were not proofs or ideas, but just numbers (which is also a long way from what research mathematics looks like); the dataset is private, but the 5 sample questions which were revealed to us were clearly beyond undergraduate level. People (including me) were tricked into saying publically that any machine which can do something nontrivial here would really be a breakthrough. The next thing we know, OpenAI were claiming that their system could get 25 percent on this database. Oof. But shortly after that, Epoch AI (who put he dataset together) revealed that 25 percent of the questions were undergraduate or olympiad level, and this week it transpires that OpenAI were behind the funding of the dataset and reportedly were even given access to some of the questions (added later: in fact Elliot Glazer from Epoch AI has confirmed this on Reddit); all of a sudden 25 percent doesn’t sound so great any more.

So maybe we need to do it all again :-/

Let’s make a database

I want people (i.e. researchers in number theory at PhD level or beyond) to help me put together a secret database of hard number theory problems. I already have 5 but I need at least 20 and ideally far more, so I need help. The problems need to be beyond undergraduate level in the sense that undergraduates will not have been taught all the skills necessary to solve them. Because LLMs are not up to writing proofs, the answers unfortunately need to be non-negative integers: there is simply no point asking an LLM basic research level questions and then ploughing through the crap they produce and giving it 0/10; anyone who thinks that LLMs are actually capable of writing proofs should show me one that can do the Putnam because all efforts I saw on the 2024 exam were abysmal and this is undergraduate level: as always, the difficulty here is that soon after the questions and solutions are made public, the models train on them and the experiment is thus invalidated. In particular there is no point asking a model how to solve the 2024 Putnam questions now, one would expect perfect solutions, parroted off the internet by the parrots which we are being told are close to AGI. Note that restricting to questions for which the answer is a non-negative integer is again moving in a direction which is really really far from what researchers actually do, but unfortunately this is the level we are at right now.

Once we have a decent amount of questions, which perhaps means at least 20 and maybe means 50, I’ll announce this and then any AI company who wants a go can get in touch and I’ll send them the questions but with no solutions and they can send me back the answers and then I’ll announce the company and the score they got publically. Each company is allowed one go. When a few have had a go, I’ll make the questions public.

What will the questions look like?

I’ll finish this post with a more technical description of what I’m looking for. I need to be able to solve the question myself so they should be in number theory (broadly interpreted) or a nearby area. The answer should be a nonnegative integer (expressible in base 10 so perhaps < 10^100). Ideally they should be expressible in LaTeX without any fancy diagrams. The questions may or may not be comprehensible to an undergraduate but the proofs should be beyond all but the smartest ones, ideally because they need material which is not taught to undergraduates. The questions should not be mindless variants of standard questions which are already available on the internet — although variants which actually need some understanding of what is going on are fine; we are trying to test the difference between understanding and the stochastic parrot model which is my model of an LLM: we are interested in testing the hypothesis that a language model can think mathematically. Language models can use computers and the internet so questions where you need to use a computer or online database to answer them is fine, although the question should not simply be “look up a number in a database” as this is too easy. Ideally the question needs an idea which is not explicit in the question. Ideally the answer is not easily guessable, because I have heard reports of LLMs getting questions on the FrontierMath dataset right, backed up by completely spurious reasoning. Please don’t test your question by typing it into an LLM to see the answer. Any ideas? Contact me at my Imperial email.

Let’s give it 1 month, so applications close on 20th Feb and I’ll report back soon after that date, hopefully to say that we have a decent-sized database and the experiment can commence. Thanks in advance to all number theorists who respond.

Posted in Machine Learning, number theory | Tagged , , , , , , | 12 Comments

Can AI do maths yet? Thoughts from a mathematician.

So the big news this week is that o3, OpenAI’s new language model, got 25% on FrontierMath. Let’s start by explaining what this means.

What is o3? What is FrontierMath?

A language model, as probably most people know, is one of these things like ChatGPT where you can ask it a question and it will write some sentences which are an attempt to give you an answer. There were language models before ChatGPT, and on the whole they couldn’t even write coherent sentences and paragraphs. ChatGPT was really the first public model which was coherent. There have been many other models since. Right now they’re still getting better really fast. How much longer this will go on for nobody knows, but there are lots of people pouring lots of money into this game so it would be a fool who bets on progress slowing down any time soon. o3 is a new language model.

FrontierMath is a secret dataset of “hundreds” of hard maths questions, curated by Epoch AI, and announced last month. “Hundreds” is a quote from the paper (first line of the abstract), but I’ve heard a rumour that when the paper came out there were under 200 questions, although I’ve heard another rumour that apparently more are have been added since. As an academic mathematician who spent their entire life collaborating openly on research problems and sharing my ideas with other people, it frustrates me a little that already in this paragraph we’ve seen more questions than answers — I am not even to give you a coherent description of some basic facts about this dataset, for example, its size. However there is a good reason for the secrecy. Language models train on large databases of knowledge, so the moment you make a database of maths questions public, the language models will train on it. And then if you ask such a model a question from the database they’ll probably just rattle off the answer which they already saw.

How hard is the FrontierMath dataset?

So what are the questions in the FrontierMath dataset like? Here’s what we know. They’re not “prove this theorem!” questions, they’re “find this number!” questions. More precisely, the paper says “Problems had to possess definitive, computable answers that could be automatically verified”, and in the five sample problems which were made public from the dataset (Appendix A of the paper, pages 14 to 23) the solutions are all positive whole numbers (one answer is 9811, another is 367707, and the final three solutions are even larger — clearly these questions are designed in such a way that random guesswork is extremely unlikely to succeed). The sample questions are nontrivial, even to a research mathematician. I understood the statements of all five questions. I could do the third one relatively quickly (I had seen the trick before that the function mapping a natural n to alpha^n was p-adically continuous in n iff the p-adic valuation of alpha-1 was positive) and I knew exactly how to do the 5th one (it’s a standard trick involving the Weil conjectures for curves) but I didn’t bother doing the algebra to work out the exact 13-digit answer. The first and second question I knew I couldn’t do, and I figured I might be able to make progress on the 4th one if I put some real effort in, but ultimately I didn’t attempt it, I just read the solution. I suspect that a typical smart mathematics undergraduate would struggle to do even one of these questions. To do the first one you would, I imagine, have to be at least a PhD student in analytic number theory. The FrontierMath paper contains some quotes from mathematicians about the difficulty level of the problems. Tao (Fields Medal) says “These are extremely challenging” and suggests that they can only be tackled by a domain expert (and indeed the two sample questions which I could solve are in arithmetic, my area of expertise; I failed to do all of the ones outside my area). Borcherds (also Fields Medal) however is quoted in the paper as saying that machines producing numerical answers “aren’t quite the same as coming up with original proofs”.

So why make such a dataset? The problem is that grading solutions to “hundreds” of answers to “prove this theorem!” questions is expensive (one would not trust a machine to do grading at this level, at least in 2024, so one would have to pay human experts), whereas checking whether hundreds of numbers in one list correspond to hundreds of numbers in another list can be done in a fraction of a second by a computer. As Borcherds pointed out, mathematics researchers spend most of the time trying to come up with proofs or ideas, rather than numbers, however the FrontierMath dataset is still extremely valuable because the area of AI for mathematics is desperately short of hard datasets, and creating a dataset such as this is very hard work (or equivalently very expensive). This recent article by Frieder et al talks in a lot more depth about the shortcomings in datasets for AI in mathematics.

So there was an article about the FrontierMath dataset in Science and I was quoted in it as saying “If you have a system that can ace that database, then it’s game over for mathematicians.” Just to be clear: I had nothing to do with the dataset, I’ve only seen the five public questions, and was basing my comments on those. I also said “In my opinion, currently, AI is a long way away from being able to do those questions … but I’ve been wrong before”. And then this week there’s an announcement that the language model o3 got a score of 25 percent on the dataset. I was shocked.

What exactly has happened here?

Why was I shocked? Because my mental model on where “AI” is currently, when it comes to doing mathematics, is “undergrad or pre-undergrad”. It’s getting very good at “Olympiad-style” problems of the sort given to bright high-schoolers. Within a year it’s absolutely clear that AI systems will be passing undergraduate mathematics exams (not least because when you’re setting an undergraduate mathematics exam you ideally need to make sure that you don’t fail 50 percent of the class, so you throw in a couple of standard questions which are very similar to questions that the students have seen already, to ensure that those with a basic understanding of the course will pass the exam. Machines will easily be able to ace such questions). But the jump from that to having innovative ideas at advanced undergrad/early PhD level beyond recycling standard ideas seems to me to be quite a big one. For example I was very unimpressed by the ChatGPT answers to the recent Putnam exam posted here — as far as I can see only question B4 was answered adequately by the machine, most other answers are worth one or two out of 10 at most. So I was expecting this dataset to remain pretty unattackable for a couple of years.

My initial excitement was tempered however by a post from Elliot Glazer from Epoch AI on Reddit where he claimed that in fact 25 percent of the problems in the dataset were “IMO/undergrad style problems”. This claim is a little confusing because I would be hard pressed to apply such adjectives to any of the five publically-released problems in the dataset; even the simplest one used the Weil conjectures for curves (or a brute force argument which is probably just about possible but would be extremely painful, as it involves factoring 10^12 degree 3 polynomials over a finite field, although this could certainly be parallelised). This of course raises questions in my mind about what the actual level of the problems in this secret dataset is (or equivalently whether the five public questions are actually a representative sample), but this is not knowledge which we’re likely to have access to. Given this new piece of information that 25 percent of the problems are undergraduate level, perhaps I will revert to being unsurprised again, but will look forward to being surprised when AI is getting nearer 50 percent on the dataset, because performance at “qual level” (as Elliot describes it — the next 50 percent of the questions) is exactly what I’m waiting to see from these systems — for me this would represent a big breakthrough.

Prove this theorem!

However, as Borcherds points out, even if we ended up with a machine which was super-human at “find this number!” questions, it would still have limited applicability in many areas of research mathematics, where the key question of interest is usually how to “prove this theorem!”. In my mind, the biggest success story in 2024 is DeepMind’s AlphaProof, which solved four out of the six 2024 IMO (International Mathematics Olympiad) problems. These were either “prove this theorem!” or “find a number and furthermore prove that it’s the right number” questions and for three of them, the output of the machine was a fully formalized Lean proof. Lean is an interactive theorem prover with a solid mathematics library mathlib containing many of the techniques needed to solve IMO problems and a lot more besides; DeepMind’s system’s solutions were human-checked and verified to be “full marks” solutions. However, we are back at high school level again; whilst the questions are extremely hard, the solutions use only school-level techniques. In 2025 I’m sure we’ll see machines performing at gold level standard in the IMO. However this now forces us to open up the “grading” can of worms which I’ve already mentioned once, and I’ll finish this post by talking a little more about it.

Who is marking the machines?

July 2025. I can envisage the following situation. As well as hundreds of the world’s smartest schoolchildren entering the IMO, there will be machines entering. Hopefully not too many though. Because the systems will be of two types. There will be systems submitting answers in the language of a computer proof checker like Lean (or Rocq, Isabelle, or many others). And there will be language models submitting answers in human language. The big difference between these two submissions are that: if a marker verifies that the statement of the question has been correctly translated into the computer proof checker, then all they need to do is to check that the proof compiles and then they basically know that it is a “full marks” solution. For the language models we will have a situation like the poor Putnam solutions above — the computer will write something, it will look convincing, but a human is going to have to read it carefully and grade it, and there is certainly no guarantee that it will be a “full marks” solution. Borcherds is right to remind the AI community that “prove this theorem!” is what we really want to see as mathematicians, and language models are currently at least an order of magnitude less accurate than expert humans when it comes to logical reasoning. I am dreading the inevitable onslaught in a year or two of language model “proofs” of the Riemann hypothesis which will just contain claims which are vague or inaccurate in the middle of 10 pages of correct mathematics which the human will have to wade through to find the line which doesn’t hold up. On the other hand, theorem provers are at least an order of magnitude more accurate: every time I’ve seen Lean not accept a human argument in the mathematical literature, the human has been wrong.

In fact, as mathematicians, we would like to see more than “prove this theorem!”. We would like to see “prove this theorem, correctly, and explain what makes the proof work in a way which we humans understand”. With the language model approach I worry (a lot) about “correctly” and with the theorem prover approach I worry about “in a way which we humans understand”. There is still a huge amount to be done. Progress is currently happening really quickly. But we are a long way away. When will we “beat the undergraduate barrier”? Nobody knows.

Posted in Machine Learning, Olympiad stuff, Research formalisation | Tagged , , , , , | 16 Comments

Fermat’s Last Theorem — how it’s going

So I’m two months into trying to teach a proof of Fermat’s Last Theorem (FLT) to a computer. Most of “how it’s going” is quite tedious and technical to explain: to cut a long story short, Wiles proved an “R=T” theorem and much of the work so far has gone into teaching the computer what R and T are; we still have not finished either definition. However my PhD student Andrew Yang has already proved the abstract commutative algebra result which we need (“if abstract rings R and T satisfy lots of technical conditions then they’re equal”), and this is an exciting first step. The current state of the write-up is here, and the system we are using is Lean and its mathematical library mathlib, maintained by the Lean prover community. If you know a bit of Lean and a bit of number theory then feel free to read the contribution guidelines, checkout the project dashboard and claim an issue. As I say, we’re two months in. However we already have one interesting story, which I felt was worth sharing. Who knows if it’s an indication of what is to come.

We’re not formalising the old-fashioned 1990s proof of FLT. Since then, work of many people (Diamond/Fujiwara, Kisin, Taylor, Scholze and several others) led to the proof being generalised and simplified, and part of the motivation of my work is not to just get FLT over the line but to prove these more general and powerful results. Why? Because if the AI mathematics revolution actually happens (which it might) and if Lean turns out to be an important component (which it might) then computers will be in a better position to start helping humans to push back the boundaries of modern number theory because of this formalization work, as they’ll have access to key modern definitions in a form which they understand. One concept which was not used in Wiles’ original proof but which is being used in the proof we’re formalizing, is crystalline cohomology, a theory developed in the 60s and 70s in Paris, with the foundations laid down by Berthelot following ideas of Grothendieck. The basic idea here is that the classical exponential and logarithm functions play a key role in differential geometry (relating Lie algebras and Lie groups, for example), and in particular in understanding de Rham cohomology, but they do not work in more arithmetic situations (for example in characteristic p); the theory of “divided power structures”, developed in the 1960s in a series of beautiful papers by Roby, play a crucial role in constructing an analogue of these functions which can be used in the arithmetic case. tl;dr: we need to teach the computer crystalline cohomology, so first we need to teach it divided powers.

Antoine Chambert-Loir and Maria Ines de Frutos Fernandez have been teaching the theory of divided powers to Lean, and over the summer Lean did that irritating thing which it sometimes does: it complained about the human presentation of an argument in the standard literature, and on closer inspection it turned out that the human argument left something to be desired. In particular a key lemma in Roby’s work seems to be incorrect. When Antoine told me this in a DM, he remarked that he would suppose I thought this was funny, and indeed the very long string of laughing emojis which he got back as a response to his message confirmed this. However Antoine, being rather more professional than me, argued that instead of me tweeting about the issue (which I can’t do anyway because I left Twitter and joined bluesky yesterday), we should in fact attempt to fix the problem. We went about this in rather different ways. Antoine put it on his job list to look at, and I completely ignored the problem and just started occasionally mentioning to people that the proof was in trouble, in a weak sense. I say “in a weak sense” because this observation has to be put into some context. According to the way I currently view mathematics (as a formalist), the entire theory of crystalline cohomology vanished from the literature at the moment Antoine discovered the issue, with massive collateral damage (for example huge chunks of Scholze’s work just disappeared, entire books and papers vaporised etc). But this disappearance is only temporary. Crystalline cohomology is in no practical sense “wrong”. The theorems were still undoubtedly correct, it’s just that the proofs were as far as I am concerned incomplete (or at least, the ones Antoine and Maria Ines were following were), and unfortunately it is now our job to fix them. The thing I want to stress is that it was absolutely clear to both me and Antoine that the proofs of the main results were of course going to be fixable, even if an intermediate lemma was false, because crystalline cohomology has been used so much since the 1970s that if there were a problem with it, it would have come to light a long time ago. Every expert I’ve spoken to is in complete agreement on this point (and several even went so far as to claim that I’m making a fuss about nothing, but perhaps they don’t understand what formalization actually means in practice: you can’t just say “I’m sure it’s fixable” — you have to actually fix it). One added twist is that Roby, Grothendieck and Berthelot have all died, so we could not go back to the original experts and ask directly for help.

[For those that are interested in more technical details, here they are: Berthelot’s thesis does not develop the theory of divided powers from scratch, he uses Roby’s “Les algebres a puissances divisees”, published in Bull Sci Math, 2ieme serie, 89, 1965, pages 75-91. Lemme 8 (on p86) of that paper seems to be false and it’s not obvious how to repair the proof; the proof of the lemma misquotes another lemma of Roby from his 1963 Ann Sci ENS paper; the correct statement is Gamma_A(M) tensor_A R = Gamma_R(M tensor_A R) but one of the tensor products accidentally falls off in the application. This breaks Roby’s proof that the divided power algebra of a module has divided powers, and thus stops us from defining the ring A_{cris}.]

So as I say, Antoine worked on fixing the problem, whereas I just worked on gossiping about it to the experts, and I made the mistake of telling Tadashi Tokieda about it in a coffeeshop in Islington, he duly went back to Stanford and mentioned it to Brian Conrad, and the next thing I knew Conrad was in my inbox asking me what was all this about crystalline cohomology being wrong. I explained the technical details of the issue, Conrad agreed that there seemed to be a problem and he went off to think about it. Several hours later he got back to me and pointed out that another, different, proof of the claim that the universal divided power algebra of a module had divided powers was in the appendix to the Berthelot-Ogus book on crystalline cohomology, and that as far as Conrad was concerned this approach should be fine. The proof was back on!

And that is pretty much the end of the story, other than the fact that last month I visited Berkeley and I had lunch with Arthur Ogus, who I’ve known since I was a post-doc there in the 90s. I’d promised Arthur a story of how he’d saved Fermat’s Last Theorem, and over the meal I told him about how his appendix had dug me out of a hole. His response was “Oh! That appendix has several errors in it! But it’s OK, I think I know how to fix them.”

This story really highlights, to me, the poor job which humans do of documenting modern mathematics. There appear to be so many things which are “known to the experts” but not correctly documented. The experts are in agreement that the important ideas are robust enough to withstand knocks like this, but the details of what is actually going on might not actually be where you expect them to be. For me, this is just one of many reasons why humans might want to consider getting mathematics written down properly, i.e. in a formal system, where the chances of error are orders of magnitude smaller. However most mathematicians are not formalists, and for those people I need to justify my work in a different way. For those mathematicians, I argue that teaching machines our arguments is a crucial step towards getting machines to do it themselves. Until then, we seemed to be doomed to fix up human errors manually.

The story does have a happy ending though — two weeks ago Maria Ines gave a talk about formalization of divided powers in the Cambridge Formalization of Mathematics seminar (which was started by Angeliki Koutsoukou-Argyraki a couple of years ago — thanks Angeliki!), and my understanding of Maria Ines’ talk is that these issues have now been sorted out. So we are actually back on track. Until the next time the literature lets us down…

Posted in Fermat's Last Theorem, Research formalisation | Tagged , , , , , , , , | 18 Comments

Lean in 2024

A huge amount happened in the Lean theorem prover community in 2023; this blog post looks back at some of these events, plus some of what we have to look forward to in 2024.

Modern mathematics

I personally am a member of the Lean community because of its phenomenal mathematics library mathlib, which was born around six and a half years ago and has now become a very fast-growing database of mathematics, containing many (but by no means all) of the definitions used by modern research mathematicians today. 2023 was the year where we really began to accrue evidence that formalisation of certain parts of serious modern mathematics is feasible in “real time”. Quanta’s video on 2023’s biggest breakthroughs in mathematics talks about several results; let’s take a look at their status when it comes to formalisation in Lean.

The first result in the Quanta video is the result by Campos–Griffiths–Morris–Sahasrabudhe giving an exponential improvement for upper bounds for Ramsey numbers. The previous post on this blog, by Bhavik Mehta, discusses his formalisation of the result; he formalised the 50+ page paper in five months (before the paper had even been refereed).

The second result in the video is the work by Smith, Myers, Kaplan and Goodman-Strauss on aperiodic monotiles. Myers has developed enough of the theory of planar geometry to formalise the solution to a 2019 IMO problem, and announced earlier this week his intention to formalise the result in Lean in 2024.

The third and final result mentioned in Quanta’s round-up was the Kelley–Maka result on bounds for how big a set of positive integers must be before it must contain a 3-term arithmetic progression. A formalisation of the result is being actively worked on by Yael Dillies; they have made substantial progress.

However, when it comes to formalising mathematics in real time, we now have an even more spectacular data point: Terry Tao led a team which formalised the main result in his breakthrough new paper with Gowers, Green and Manners proving Katalin Marton’s polynomial Freiman-Ruzsa conjecture. The 33 page paper was formalised in just three weeks (before the paper had even been submitted)! Tao used Patrick Massot’s Lean blueprint software to make this web page which contains a LaTeX/Lean hybrid document, containing an explanation of the proof which is comprehensible to both a human and a computer. He also made very effective use of the Lean Zulip, with summaries posted every few days of what was ready to be worked upon; ultimately 25 people contributed.

Based on these and other stories coming out of the mathlib community, one can come to the conclusion that there are parts of the modern theory of combinatorics (and in particular additive combinatorics) for which asking for a complete computer formalisation is not an unreasonable thing to do. Mehta and Bloom also formalised Bloom’s proof of the Erdos-Graham unit fractions conjecture, which for me is evidence that some parts of analytic number theory are also becoming amenable to this technique.

Fermat’s Last Theorem

However not all of modern mathematics is ready to be eaten by these systems. For example in September I was awarded a grant by the EPSRC (the UK mathematics funding body) to spend five years of my life formalising a proof of Fermat’s Last Theorem in Lean! Lean has the definitions of elliptic curves and modular forms required to start us off, but we are still working on other definitions, such as automorphic representations, finite flat group schemes and the like. I cannot yet link to a blueprint because the project does not officially start until October 2024 and right now I’m more worried about teaching my undergraduate Lean course, but I am part way through writing one and right now the graph looks like this:

People familiar with Tao’s blueprint graph will know that green means “done” and blue means “ready to be done”. The new orange colour means “a long way away”. Already we see the first major bifurcation: we need Mazur’s theorem on the torsion subgroups of elliptic curves over the rationals, a huge project, and completely independent of the R=T theorems we’ll need to push Wiles’ ideas over the line. The route we’ll be taking is not the original Wiles/Taylor-Wiles argument, but a more modern one due to Richard Taylor, who is collaborating with me on the mathematics of the project.

The Focused Research Organisation

In July 2023 the Lean Focused Research Organisation (FRO) was launched. Focused Research Organisations are a new type of nonprofit for science, backed by Convergent Research. Lean’s FRO is funded by the Simons Foundation, Sloan, and Richard Merkin. This has enabled a team of people to work full-time on supporting the ever-growing Lean infrastructure and focus on many things (such as making the mathematics library compile faster, and adding infrastructure and tactics to enhance the user experience: I give some examples below). The FRO is hiring, by the way. One example of the ways that the FRO has made my own life easier: Scott Morrison has developed omega for Lean 4, a tactic for proving obvious-to-mathematicians lemmas about integers and naturals; these things will no longer slow down my undergraduate students.

In fact July 2023 was an amazing month for the community, because as well as the announcement of the FRO, it marked the end of the project to port all of Lean’s mathematics library from the now end-of-life Lean 3 to Lean 4.

Growing user base in computer science

But let’s get back to 2024. Sometimes it felt to me that Lean 3 was viewed as “a tool for mathematics”, with its mathematics library the flagship project. But both the FRO and other researchers in computer science are contributing code and tools to make Lean much more than this. Between 9th and 12th January 2024, the Lean Together 2024 conference showcased some of what was currently going on. As well as talks by mathematicians on growing the mathematics library, there were plenty of talks by computer scientists on tools or projects which they have been working on with Lean. For example, David Thraine Christianson from the FRO presented Verso, his domain-specific language for documentation. Static blog posts such as Tao’s Lean 4 proof tour can now be enhanced using this tool. Take a look at Joachim Breitner’s blog post about recursive definitions on the Lean-lang web page: this contains “live” code (hover over it!) and the entire post was generated from a Lean file; if the language updates and the code breaks then a warning will be generated. Perhaps in the future, infrastructure developed using Alex Best’s leaff tool (also presented at the conference) could even be used to suggest fixes.

Other non-mathematical highlights for me at the conference: Emina Torlak gave a presentation about how Amazon are using Lean in their Cedar project; Evgenia Karunus and Anton Kovsharov spoke about Paperproof, their visualisation tool for Lean proofs (which undergraduates seem to really like when I show it off in talks), and Kaiyu Yang spoke about LeanCopilot, an LLM tool trained on Lean’s mathematics library, whose main goal is to generate Lean proofs. I have always been skeptical of people who want to claim “Lean + AI = AGI”; in fact I’ve been maximally on the other side of the argument, with my opinion being that LLMs are right now of essentially no use to me as a working mathematician. But because I thought my undergraduates might like what I’d seen in the talk, I added LeanCopilot as a dependency for the repository I’m using in my undergraduate course, and it really can do a bunch of the problems (sometimes in totally different ways to my model solutions). I also tried it on a question someone asked on the Lean Zulip chat and it solved that too — and quickly. I think it will be interesting to see whether future versions of these tools will actually be of use to me in the Fermat project I mentioned above (which is funded until 2029; who knows what language models will be capable of then).

What else is to come in 2024?

Lean’s widgets offer the user a very flexible way of presenting data in an interactive graphical way. One application: I am very much hoping that this year will be the year that I will be able to start doing high school level geometry interactively in Lean; I give talks in schools about the Natural Number Game but I would like to write some new material about how to interact with statements such as “angle at centre equals twice angle at circumference” (proving it, using it) in Lean: there are some subtleties here which are not usually explained to schoolchildren (such as the difference between oriented and unoriented angles) and which would be exposed by a formal approach.

As far as mathematics goes, my team at Imperial are currently working on Tate modules, Selmer groups, the local and global Langlands program, commutative algebra, Lie algebras and more. When I start working full time on Fermat’s Last Theorem this will push the development of more of this kind of mathematics in Lean. There seems to be a coherent plan developing on the Lean Zulip for getting a basic theory of L-functions formalised (thanks to David Loeffler, Michael Stoll and others); one natural goal would be results like Cebotarev’s density theorem, used in Fermat. Heather Macbeth in 2023 formalised the definition of Riemannian manifolds; I feel like differential geometry is another area which now has huge potential for growth. The FRO have made it clear that they want to support this push for more mathematics in Lean; any mathematician interested in joining the community might want to start by working through the free online mathematics in Lean textbook.

For plans in computer science-related topics in 2024, I would invite you to consult the FRO roadmap . One thing I like about that page is the broad vision which the FRO has: one of the goals is “enhancing its interface and functionality to cater to a wide range of users, from mathematicians and software developers to verification engineers, AI researchers, and the academic community.”

As always, the Lean community web pages are another place where people can begin to learn about Lean (including installation instructions and so on); anyone who wants to try a fast interactive type-based programming language for their project, or who wants to formalise some mathematics, is welcome to jump in, and ask questions on the Zulip chat. Full names are preferred, and the two rules which have emerged in our community are: stay on topic, and be nice. Beginner questions are welcome in the #new-members stream.

Posted in Uncategorized | Tagged , , | 3 Comments

Formalising modern research mathematics in real time

(This is a guest post by Bhavik Mehta)

On March 16, 2023, a paper by Campos, Griffiths, Morris, and Sahasrabudhe appeared on the arXiv, announcing an exponential improvement to the upper bound on Ramsey numbers, an open problem since 1935. Around the same time, posts by Terence Tao, Timothy Gowers and Gil Kalai appeared, all discussing this announcement. These were soon followed by articles in Quanta, Nature and New Scientist. The new paper was clearly exciting news.

About six months after this, I finished formally verifying the titular result of Campos, Griffiths, Morris, and Sahasrabudhe’s beautiful paper in the Lean Theorem Prover. In this post I’ll discuss the gap between informal mathematics and formalised mathematics, demonstrating that very recent and involved arguments in combinatorics can often be formalised in real time. The code for this formalisation is available here: https://github.com/b-mehta/exponential-ramsey/blob/main/src/main_results.lean, and this specific file shows an overview of the relevant definitions and theorems. Another version can be found in a branch of Lean’s mathlib, which additionally shows that the proof has been checked by Lean, and further verified by an external type-checker.

This isn’t the first recent result in combinatorics to be formalised in Lean soon after its announcement: together with Thomas Bloom I formalised a 2021 paper of his within a year of it being uploaded, and Ellenberg-Gijswijt’s 2016 result giving upper bounds on the cap-set problem (which later appeared in the Annals) was formalised by Dahmen, Hölzl, and Lewis in 2019. Of course, no post about formalisation in combinatorics would be complete without a mention of the the four colour theorem, proved in 1976 (with the final gaps closed in 1989) and formalised in Coq in 2005.

5 years, 50 pages, 5 months

Quite unlike the remarkably short proof of Ellenberg-Gijswijt result on cap-sets, the new result here has a particularly involved proof which was a 5-year long project of the authors, and the paper comes to over 50 pages in total. In comparison, the formalisation took me about 5 months, including prerequisites, and I ended up with around 20k lines of Lean code.

For the sake of comparison with other recent large-scale formalisations, the Liquid Tensor Experiment took just over 18 months with a large amount of person-hours, and the portion of the lecture notes that was formalised was around 10 pages; although that project had a massive amount of prerequisites to formalise also. The cap-set proof (4 pages long) similarly had a large proportion of prerequisities to deal with, as that project was written at a time when Lean’s mathlib had not much support for polynomials, which is of course necessary for a proof which uses the polynomial method.

The formalised exponential Ramsey proof also relies on mathlib, most heavily the support for manipulating finite sets, but vitally on asymptotic estimates as well, since almost all the statements are about “sufficiently large” naturals. Another key aspect of the proof is probabilistic constructions, in which we show the existence of an object because it has non-zero probability of appearing if we try to make it randomly. Fortunately, the probability spaces needed here are all finite, so they can be formally represented by sums on finite sets.1

The end of the proof requires surprisingly precise numerical approximations, as the final part of the argument fundamentally relies on the fact that the red and blue regions in the diagram below don’t intersect. In the original paper, it is enough to say that this can be checked by computer (though the authors give suggestions towards a slightly more detailed proof as well), but formally some more careful analysis is needed.

A graph of a red and blue region which are very close to intersecting.

In terms of the global structure of the proof, the formal version follows the informal version closely, which is a testament to both the clarity of the paper, and the strength of Lean and mathlib to formalise practical arguments. In particular, no blueprint was needed for this formalisation, and it was in some sense done “on the fly”. I understood each part of the proof as it came to translate it, rather than writing a separate semi-formal version first. Indeed, when I started, my goal wasn’t to verify the entire paper. It was just to understand the argument because I thought it was an interesting result!

Mathematical typos

An obvious question when formalising a maths paper is how much the proof needed to change to be formalised: in other words whether there were any flaws in the paper. This question is surprisingly subtle, but a simple summary is that most of the mistakes I identified were easily fixable. Importantly all mistakes were fixable in a way that preserved the spirit of the argument, so the Lean version of the proof can reasonably be said to be a translation of the informal proof.

I like to consider certain such small mistakes “mathematical typos”. It’s fair to call them errors, and just like typos in English prose can confuse a non-native speaker, mathematical typos can impede meaning for a reader who isn’t familiar with mathematics. However, with a little experience, it’s not hard to see what the intended meaning was, and fixing it isn’t too bad. A mathematical typo certainly doesn’t constitute a flaw in the proof, and almost all the problems I found fall in this category.

That said, there were two places where I needed some help from the authors to figure out how to fix something from the paper, but it still remains true that the spirit of the argument was unchanged.2

What are Ramsey numbers?

The existence of Ramsey numbers was shown by Frank Ramsey in a 1930 paper whose goal was to show decidability of a particular fragment of first-order logic. Shortly afterward, in 1935, a paper of Erdős and Szekeres popularised Ramsey numbers, giving rise to the fundamental branch of combinatorics now known as Ramsey theory. Ramsey theory itself can fill entire lecture series and textbooks, but in this post I’ll focus on the simplest and most well-studied case: two-colour Ramsey numbers.

The definition of two-colour Ramsey numbers is fairly easy to give. For any numbers k and l, R(k, l) is the smallest number of people at a party you need, to find k people who all know each other or l people who all don’t know each other. More mathematically, for any red/blue colouring of the edges of the complete graph on R(k, l) vertices, you can find k vertices with only red edges between them or l vertices with only blue edges between them.

Erdős and Szekeres in their 1935 paper gave an upper bound of R(k, l) \leq \binom{k + l - 2}{k - 1}, (implying that the diagonal Ramsey number R(k, k) can be upper-bounded by 4^k). The research community attempted improvements to the problem of decreasing their upper bound, and in 1988 the first polynomial improvement was found by Thomason. His method was taken further by Conlon in 2009 and later by Sah, eventually reaching a bound of the form R(k, k) \leq 4^{k - (\log k)^2}. Conlon’s paper, giving the first super-polynomial improvement to the Erdős-Szekeres bound, was especially noteworthy and appeared in the Annals of Mathematics. But despite all of these successes, the question of decreasing the base of the exponent in the bound was still open. Is there a constant C < 4 for which R(k, k) \leq C ^ k?

It is worth saying something briefly about the lower bound for R(k, k). It was first shown by Erdős in 1947 that R(k, k) \geq \sqrt{2}^k using a probabilistic argument which was itself groundbreaking, and gave rise to the field of probabilistic combinatorics. While a lower order improvement has been made to this, the base \sqrt{2} has not been improved, giving a wide gap between the lower and upper bounds of \sqrt{2} and 4.

Finally, in March 2023, the paper of Campos, Griffiths, Morris and Sahasrabudhe was made public, giving the desired exponential improvement, a massive strengthening of Conlon’s result. They show that there is a constant c > 0 such that for all k, we have R(k, k) \leq (4 - c) ^ k. This paper is in the publication process, and this is the result which I formalised.

As part of this formalisation project, I additionally formalised a general form of Erdős’s lower bound and gave some upper and lower bounds for the off-diagonal Ramsey numbers R(3, k).3

“Suppose aliens invade the earth and threaten to obliterate it in a year’s time unless human beings can find the Ramsey number for red five and blue five. We could marshal the world’s best minds and fastest computers, and within a year we could probably calculate the value. If the aliens demanded the Ramsey number for red six and blue six, however, we would have no choice but to launch a preemptive attack.” – Paul Erdős, relayed by Graham and Spencer.

This famous quote of Erdős discusses the difficulty of computing R(5, 5) and R(6, 6), both of which are still unknown, and the main result of this post doesn’t get us any closer. Nonetheless, as part of this project, I formalised bounds on some of the known small values of Ramsey numbers, including R(4, 4).4

What’s next?

To get to the primary result – an exponential improvement on the upper bound for Ramsey numbers – the authors give two proofs, their second proof providing a slightly better constant. Here I only formalised the first, meaning a future project could be to improve the constant in the Lean proof. This should be easier now because a large portion of the technical difficulties have been overcome. Furthermore, any future claimed improvements on the number should be much quicker to verify, since the original framework of this paper can be extended.

This formalisation was done in Lean 3, for multiple reasons, but most importantly that when I started the prerequisite mathlib lemmas were not yet implemented in Lean 4. I’m in the process of trying to port the project to Lean 4.5

Finally, if you want to get involved in the Lean community, contribute to mathlib, use Lean to check some part of your research, or just try to understand a new argument like I did, the community website is a great place to start.


  1. While mathlib does have support for probability spaces in general, working with them in this case ultimately requires viewing them as finite sums anyway, so for convenience it’s simpler here to avoid the formal probability world. ↩︎
  2. For brevity I’ll avoid too much detail, but one example is in the proof of Theorem 9.1, where we cannot quite guarantee the red density is as large as claimed. Allowing for some small error term in \eta however, the proof can be recovered using that |U| can be made large enough. ↩︎
  3. The bounds on the off-diagonal Ramsey numbers I formalised aren’t the best-known ones, there’s really exciting recent work on lower bounds for R(3, k) and R(4, k) which I’d love to see formalised. ↩︎
  4. The problem of determining Schur numbers falls into Ramsey theory, and Schur numbers are also difficult to compute. But a recent result of Marijn Heule computes Schur number five using SAT solvers, and checks its result in a proof-checker, which itself has been formally verified in ACL2. It would be really exciting to see similar methods applied to Ramsey number five as well! ↩︎
  5. It should – in principle – be possible to port the formalisation to Lean 4, but at the moment it seems more difficult than one might hope. ↩︎
Posted in Research formalisation | Tagged , | Leave a comment

Lean 2022 round-up

A brief survey post containing some of the things which happened in the Lean community in 2022.

The Liquid Tensor Experiment

In December 2020, Fields Medallist Peter Scholze challenged the formal computer proof community to verify one of his theorems with Clausen. By June 2021 a key technical lemma had been formalised by the Lean community and there was a certain amount of media attention (e.g. here and here), to the extent that many people assumed that the challenge was over. It was not. There was the small matter of developing the theory of abelian categories, derived functors, Ext groups and various other standard tools in research mathematics. This took another year. The project was finished at the Lean for the Curious Mathematician workshop in ICERM, where Johan Commelin got the formalisation of the full Clausen–Scholze result compiling. Adam Topaz was another big player here. See the Lean community blog post announcing the result for more details, and a follow-up post discussing the question of how someone who doesn’t want to read much Lean code might attempt to check that the theorem has been correctly stated. There is also the blueprint which describes the precise mathematical details of the route to the goal.

The Sphere Eversion Project

People who know some mathematics but have not used proof assistants might have a mental model of the following form: “discrete stuff should be easy to formalise, continuous stuff is probably harder because there are pictures involved”. Patrick Massot, Floris van Doorn and Oliver Nash ably demonstrated that continuous picture-based mathematics — in this case differential topology — was now in scope for theorem provers, with their paper “Formalising the h-principle and sphere eversion“. The work formalises a modern proof of an instance of Gromov’s h-principle, and deduces Smales’ result that you can turn a sphere inside-out without creasing or snapping it, if you allow the sphere to pass through itself. The work also comes with a blueprint which explains the mathematical route the authors took. The work was accepted for publication at CPP2023 and there will be a talk on the project on 16th January 2023 at CPP.

Unit fractions

In late 2021, Thomas Bloom announced a proof of an old conjecture of Erdos and Graham on unit fractions; any positive density subset of the positive integers contains a finite subset the sum of whose reciprocals is 1. This result still surprises me — for example can you really find a finite set of distinct positive integers all of which end in 7 and such that the sum of their reciprocals is 1? Apparently you can. Bhavik Mehta and Bloom started work on a full formalisation of Bloom’s result (including the prerequisites, such as an instance of the Hardy-Littlewood circle method). They finished in June 2022, before Bloom had received a referee’s report for the paper. Here’s a link to the blueprint.

p-adic Hodge theory

Fontaine kick-started modern p-adic Hodge theory with his observation that certain p-adic period rings played a key role. María Inés de Frutos Fernández has developed enough of the theory of local fields to define the first two rings in the theory: ℂₚ and B_{HT}, together with the Galois action. An attack on local class field theory is now viable.

Kaplansky’s unit conjecture

In 2021, Gardam disproved an old conjecture of Kaplansky. In 2022 Anand Rao Tadipatri and Siddhartha Gadgil formalised Gardam’s disproof in Lean (and they’re speaking about the work in the London Learning Lean Zoom seminar on Feb 9th), but in contrast to the three projects above, they used Lean 4, meaning that they had no access to Lean 3’s mathematics library mathlib. Talking of which…

The port has started!

The thing powering mathematics in Lean is the Lean 3 mathematics library mathlib. The problem is that we now have a viable Lean 4, so the library needs to be ported, and Lean 4 is not backwards compatible with Lean 3; it’s better. A first pass at the problem was made by computer but it seems that we need humans to finish it off. The port started in earnest in December, and right now we are 12 percent of the way through. Will it get harder or easier as we proceed? Nobody knows! That’s one of the fun things about doing research. Anyone interested in learning Lean 4 by porting some basic mathematics from Lean 3 (finite sets, subgroups, topology etc all still need to be done) is welcome to take a look at the mathlib 4 porting wiki. With the advent of cached compiled olean files the job of porting got much easier in the last week or so, as you no longer have to compile anyone else’s branch manually. I am very much hoping that the port will be complete by the summer, and the way to make this happen is to get more people on board. Please consider joining us if you want to make a contribution to open source software in mathematics!

Posted in Uncategorized | Tagged , , , , | 4 Comments

Beyond the Liquid Tensor Experiment

This blog post is a report on how a bunch of mathematicians recently did something which five years ago I would have said was completely out of reach. Note that it’s my (Kevin Buzzard’s) personal take on the work; I am not speaking for all the authors or the Lean community or the mathlib community or any other organization. Note also that I was personally involved with the work.

The background

In December 2020, Fields Medallist Peter Scholze challenged the computer proof community to formally verify an important new theorem of himself and Dustin Clausen. Scholze issued the challenge here, and the challenge itself is now published in Experimental Mathematics, one of very few mathematical journals that is currently showing an interest in the area of formalisation. The name of the challenge was the Liquid Tensor Experiment (LTE), and yes it was a homage to the Liquid Tension Experiment, a band which Scholze is fond of. The theorem behind the challenge is a highly technical result, and I will not go into the details; check out the links I just mentioned for more on the mathematics involved. It would not be unreasonable to describe the theorem as a complex result about complex mathematical objects.

I have a bucket list of milestones for the area of mathematical formalisation, and formalising a complex theorem about complex objects was on this list. Why? One could argue that many of the major computer formalisations (the Kepler conjecture, four colour theorem and odd order theorem) were complex theorems about “simple objects” (spheres, planar graphs and finite groups, respectively). A formalisation which should not be placed in the same category as any of these milestone results, but which could perhaps be regarded as complementary to these major achievements, is my formalisation with Commelin and Massot of the definition and very basic properties of perfectoid spaces — this amounted to very trivial theorems about complex objects. For me, the formalisation of a proof of a complex theorem about complex objects was a missing part of the puzzle.

But not any more, because Johan Commelin, assisted by Adam Topaz and a team of other mathematicians, announced earlier this month that their Lean formalisation of Scholze’s challenge was complete. The final touches of the formalisation were put together at ICERM in Providence RI, USA, during the Lean for the Curious Mathematician 2022 workshop in July. These workshops are for working mathematicians with little or no Lean experience and who are interested in finding out what’s going on and how to get involved. If this sounds like you, then the videos for the ICERM workshop are now online at the above link. If you’re interested in seeing Commelin’s announcement of the completion of the project, it’s just over 40 minutes into my talk on the status of algebraic geometry in Lean.

More on the challenge and its resolution

The reference for the proof which we followed was Scholze’s lectures on analytic geometry, the notes for a course he gave in 2019/20 in Bonn. The theorem we formalised was Theorem 9.1 of these notes (see also Remark 8.15). On p59 of the pdf there is a five line argument reducing Theorem 9.1 to Theorem 9.4, and then the next six pages contain a proof of Theorem 9.4. I’ve claimed above that Theorem 9.1 is a complex theorem about complex objects; however Theorem 9.4 is a more low-level statement, which unravels to a technically messy but essentially completely low-level claim. It took around one month to formalise the statement of Theorem 9.4, and a further five months to formalise its proof. This averages out at about one page of mathematics per month. You can see a map of the formalisation of the proof of Theorem 9.4 here. Clicking on a rectangle (definition) or an oval (theorem) will pop up a LaTeX statement of the result and will offer you a link to the corresponding Lean code. We finished the proof of 9.4 about a year ago, and to my mild surprise the media got interested; we were even featured in Nature. What probably caused the interest was that one of Scholze’s motivations to encourage a formalisation was that he was not convinced that too many humans had, or would, plough through the proof of Theorem 9.4; the formalisation gave him the assurance that his work was correct.

All that remained then for the challenge, was to formalise the five lines in the paper which showed that Theorem 9.4 implied Theorem 9.1. I’ll reproduce them here:

“By the preceding discussion, one can compute RHom_{\mathbb{Z}[T^{-1}]}(\overline{\mathcal{M}}_{r'}(S), \widehat{V}) as the derived inverse limit of C_c^{\bullet} over all c > 0; equivalently, all c \geq c_0. Theorem 9.4 implies that
for any m \geq 0 the pro-system of cohomology groups H^m(C_c^{\bullet}) is pro-zero (as H^m(C^{\bullet}_{kc})\to H^m(C_c^{\bullet}) is zero). Thus, the derived inverse limit vanishes, as desired.”

These five lines took a further year to formalise, and that’s what has just been completed. The reason it took so long was that the Lean community had to build a working theory of abelian categories, cohomology, Ext groups, derived functors etc etc as prerequisites, and whilst some of these subjects had been approached before in theorem provers, we had to get everything formalised in one system at once and everything playing well with each other. It is worth noting however that even though I’m claiming that the proof of the theorem is “complex”, the actual length of the paper proof is nowhere near as long as, for example, the proof of the Kepler conjecture. However I do regard the result as an indication that we’re going in the right direction, pushing these systems to do things they’ve never done before (although of course I was involved in the formalisation, so one should bear this in mind when evaluating my claims here).

There’s something called the “de Bruijn factor” of a formalisation, which is the ratio of the number of lines of computer code to the number of lines of human mathematics. One major goal in the area of computer formalisation is to “get the de Bruijn factor down to 1”, that is, to make formalising mathematics in a computer no more difficult than formalising it in LaTeX. However the Commelin-Topaz-… formalisation of the Liquid Tensor Experiment seems to make a mockery of this concept, because around 30,000 lines of code went into formalising the five page proof of Theorem 9.4, and around 60,000 lines of code went into formalising the five lines above. At the very least it indicates that humans can squeeze a huge amount of mathematics into one line of text. Had the required basics (abelian categories etc) been in Lean’s mathematics library mathlib already, the formalisation would have taken far less time and far fewer lines. One thing we did benefit from was Scott Morrison’s work — Scott has built a huge amount of category theory in mathlib over the previous six years, providing us with a stable foundation to build on.

One question which a professional mathematician with no Lean experience could ask when trying to evaluate the work is: how to be sure that the Lean formalisation of the challenge actually corresponds to the mathematical version of the challenge? For example how can we tell that the symbol in the Lean code actually corresponds to the real numbers, and there is no cheating in the background? Commelin and Topaz have attempted to deal with this issue by making an examples directory in the repository, where they prove basic theorems about the objects involved (for example they show that is a complete ordered field and that any complete ordered field is uniquely isomorphic to , as evidence that Lean’s really does correspond to the real numbers and there is no trickery going on). They also created evidence for Lean’s Ext groups, profinite spaces, condensed abelian groups and so on; if you understand the mathematical definitions of these objects then you can see an indication that the project’s definitions coincide with the usual mathematical definitions, even if you cannot directly read the Lean code.

What did we learn?

Peter Scholze already explains in a second blog post two quite interesting things that we learnt during the process. Firstly, we learnt that the stable homotopy theory used in the original proof (via the Breen-Deligne resolution) can be eliminated. The way this was discovered was quite interesting; the formalisers were avoiding having to formalise this part of the proof but we were forced to start formalising the statements of what we needed; once these were clarified Commelin and Scholze realised that actually they could get away with a weaker theorem; initially we were calling the trick the “Commelin complex” but later on we discovered that the construction (but not the theorem needed about it) was made by MacLane many decades ago. And secondly Scholze seemed to learn something conceptual about the proof — he described it as “what makes the proof work”. You can see his thoughts on the matter at this mathoverflow question. But in some sense this is old news (the second blog post was published over a year ago). What else did we learn?

In my mind, one of the most important things we learnt is that if you get a group of mathematicians who are focussed on formalising an arbitrary piece of modern mathematics, then it can get done, however complex the theory is. In other words, we have a substantial piece of evidence showing that the software that computer scientists have developed for doing this task has now “caught up with humans”. One could regard this as a triumph. One could also regard it as a pretty weak statement. The computer did not “do” the mathematics — the humans did it; first Scholze and Clausen did it, and then Scholze write the argument down (in pdf notes for the course in Bonn, linked to above) in a form suitable for mathematicians, and then Commelin and Massot wrote a more detailed mathematical blueprint in a “hybrid” part-Lean part-LaTeX format (you can view the blueprint here; it goes into much of the detail of the proof), and then finally a team of people transcribed what they had written from a human language into a computer language. One thing we learn is that the computer language (Lean’s dependent type theory, in this case) is rich and expressive enough to be able to formalise recent and complex work of a Fields Medallist. One could ask whether it is feasible to formalise this work in a weaker logic, for example in Higher Order Logic, the logic used by Isabelle/HOL and the other HOL systems. I don’t think anyone actually knows the answer to this question; I have conjectured that the answer is “no” but there are expert HOL users who seem to conjecture that the answer is “yes”. My real fear is that we will never find out. Manuel Eberl is making fabulous progress formalising what I would call “standard MSc mathematics” such as the basic analytic theory of modular forms in a HOL system, but what we are missing is a HOL evangelist from the mathematical community (a kind of HOL version of me) and I don’t know how to create such a person. Indeed in some sense I am making the problem harder, because my constant evangelism for Lean to mathematicians has the consequence that research mathematicians who get interested in formalisation tend to come to Lean first, and the situation is becoming more and more like a monopoly. There are plus and minus points to this; one minus point is that nobody seems to be working on getting HOL to do things that “it wasn’t designed to do” (for example a modern treatment of sheaf cohomology, etale cohomology etc via derived functors).

We also learn that formalisation can be, at times, very slow going. However one should not lose faith here. The reason it took 1 year to formalise five lines of mathematics was that a huge amount of theory-building had to be embarked upon beforehand. That theory only needs to be built once. Lean now has a working theory of short and long exact sequences, derived functors, abelian categories, complexes and so on; furthermore there is evidence that the formalisation of these definitions is usable in practice, because we used them. Indeed this is another thing that we are learning in the Lean community; instead of thinking “what MSc course shall we formalise next” we are thinking “what major result shall we start working towards” because this not only drives development but it ensures that developments of theories are actually usable.

We learn that formalisation can be parallelised, although we already knew this long ago from the work of Gonthier et al on the odd order theorem, and the work of Hales et al on the Kepler conjecture (both papers have 10+ authors). Different people can be put to work on different areas of a formalisation project and do not even have to know about, or understand, what others are working on. The forthcoming paper on the formalisation will have a nice long list of authors, and this is something I’m very happy about; furthermore anyone interested in seeing the precise details of what everyone did can just look at the list of commits and committers on github — the details are completely open to everybody. This is open source mathematics.

The future

So where are we going and what is the point? These are very reasonable questions. One possible reaction that a mathematician might have to the work is that it is pointless, because it is a computer verification of a result which was “basically known to be true anyway”. This is missing the point. Why are we teaching computers modern mathematics? Because I believe that one day future computer systems will be helping us, and whilst I cannot see into the future, consider the following question. Scholze and Clausen are arguing that condensed abelian groups are an important new definition in mathematics. How can AI developers create a chatbot which helps graduate students to learn about the category of condensed abelian groups, until professional mathematicians have taught computers what a condensed abelian group is? I know that some people in the machine learning crowd would answer things like “use language models, train on the papers about condensed abelian groups” but I am not yet convinced that this alone will turn into a useful tool. Lean (and other theorem provers) offer a framework where one can create a library of useful theorems, examples, and counterexamples. We do not even have to type in all the proofs. If “we” (the mathematical community, for example PhD students in arithmetic geometry) start translating important statements, examples, counterexamples, references and so on, and somehow combine this with language models, we will at least make some kind of interesting thing, which I imagine will get more helpful to humans over time. Some dream of a time where computers are better at proving theorems than humans. For me this is still complete science fiction; I have seen no evidence that we are anywhere near this milestone, and have no real vision as to how we are ever going to get there. But one can instead dream of a time where computers are helping humans to do research level mathematics, and work like this makes me believe that this goal is something I might see in my lifetime.

Posted in Algebraic Geometry, liquid tensor experiment | Tagged , | 4 Comments

The Future of Interactive Theorem Proving?

This is a guest post, written by Zhangir Azerbayev. Zhangir is an undergraduate at Yale, majoring in computer science and mathematics. He completed this work while visiting Carnegie Mellon’s Hoskinson Center for Formal Mathematics.

Introduction

The history of interactive theorem proving can be told as a story of allowing the user to interact with the system at gradually higher levels of abstraction, getting further away from the axioms and closer to informal mathematics. Automath, the first interactive prover, floated very close to the axioms indeed. Mizar was the first attempt to make formal proofs comprehensible to mainstream mathematicians. Contemporary systems such as Coq and Isabelle add many useful features such as type classes and tactics, and most users don’t have to think about what the axioms even are. Today, there is still a learning curve to Lean, but once one learns the syntax, mathlib proofs are not terribly far from an informal version—at least compared to Automath. If we extrapolate this trend to its logical conclusion, we might imagine an interactive theorem prover where every single line of code is hidden behind the interface, and the user interacts with the machine purely in natural language. Recent advances in machine learning suggest such a system is not as far in the future as it may sound.

As a very small step towards realizing this vision, myself and Edward Ayers have developed Lean Chat, a VS-code extension that provides an interface for autoformalizing natural language theorem statements in Lean. Under the hood, these translation are generated by OpenAI’s Codex, a text generation model trained on Github source code.

The interface is straightforward: enter a theorem statement in good-old \LaTeX, and the app will have a go at formalizing it. If the generated formal statement is correct, fantastic! You can copy and past the statement into your Lean source file. If it is incorrect, you can give the app feedback on what to fix in a natural, conversational way. Let’s see a few examples.

Examples

Let’s begin with a simple statement about groups.

Lean would be happy with this, and we could accept this if we’re feeling lenient. But the predicate is_group_hom is deprecated, so let’s tell the app to use the up-to-date syntax.

I’m happy with that.

Let’s give the app a slightly harder example, this time the statement of an exercise from James Munkres’ famous topology textbook.

Perfect on the first try!

The next example is due to Johan Commelin: let’s ask for a statement of Burnside’s theorem, an important result in finite group theory which is not in mathlib.

Quite a lot is going on here. The app is able to figure out that the correct way of stating a group is finite is [group G] [fintype G], and the correct way of denoting the order of a group is fintype.card G. Moreover, it is able to interpret quite high-level instructions such as “no need for the existential”, suggesting the app has at least partial grasp of the semantic connection between formal and informal mathematics. Slipping up on minor syntactic conventions, like writing solvable instead of is_solvable, is one of Lean Chat’s most common mistakes. These errors can almost always be fixed with dialogue.

Unfortunately, Lean Chat is far from perfect, and often fails in unpredictable and somewhat comical ways. This next example was also discovered by Johan Commelin:

For some completely unclear reason, the app decides to consider a Lie algebra over \mathbb{R} instead of my explicit instruction to use a field k, and that’s even before addressing the fact that tangent_space_at_identity is not a mathlib definition. I tried a few different responses to the app’s initial attempt, and couldn’t get it to do anything helpful.

However, the app knows at least something about Lie algebras. The next example is due to Oliver Nash.

One particular class of problem Lean Chat struggles with is Olympiad-style questions. The next example was discovered by David Renshaw:

This example is particularly embarrassing, because despite my specific instructions, the app steadfastly sticks by its initial guess.

How does this work?

Under the hood, Lean Chat’s formalizations are generated by OpenAI’s Codex language model. The Codex model is architecturally the same as the famous GPT-3, and has a simple objective: given a string of text, predict the next word. Iterating this next-word prediction allows the model to generate coherent text. Both Codex and GPT-3 are neural networks, which means they are a function parametrized by billions of real numbers, and these parameters are tuned by gradient descent to maximize next-word prediction performance on a corpus of training data. GPT-3’s training data is virtually all the text on the internet, and Codex is trained on all the source code on Github. This explains why Codex has some modest ability at writing Lean, although Lean code is a very tiny fraction of its training data.

If you take seriously the idea that neural networks are a lot like the brain, as I do, it is unsurprising that there exists some configuration of parameters that allows the network to competently use language. However, it is almost entirely a mystery why gradient descent is able to find this configuration of weights.

These language models are pattern matchers: if you ask them to complete the sentence “Not a creature was stirring, not even a _”, it will finish with the only pattern it’s seen in it’s training data: “mouse”. We can exploit this pattern matching ability to steer our language model into doing a desired task using a technique called few-shot prompting. Suppose we tell GPT-3 to complete the following text

    English: How are you this morning?
    French: Comment vas tu ce matin?

    English: I arrived in London late because I missed my connecting flight.
    French: Je suis arrivé à Londres avec un jour de retard car j'ai raté mon vol de correspondance.

    English: Grothendieck was an algebraic geometer
    French:

Because the language model was trained to complete patterns, given this text it will generate “Grothendieck était un géomètre algébrique”. What Lean Chat does under the hood is exactly the same idea. Given the user’s input, we wrap it in the following prompt:

Natural language version: \"If $z_1, \\dots, z_n$ are complex, then $|z_1 + z_2 + \\dots + z_n|\\leq |z_1| + |z_2| + \\dots + |z_n|$.\" Translate the natural language version to a Lean mathlib version:

theorem abs_sum_leq_sum_abs (n : ℕ) (f : ℕ → ℂ) :
  abs (∑ i in finset.range n, f i) ≤ ∑ i in finset.range n, abs (f i) :=

... (two more examples go here)...

Natural language version: \"Let $X$ be a topological space; let $A$ be a subset of $X$. Suppose that for each $x\\in A$ there is an open set $U$ containing $x$ such that $U\\subset A$. Show that $A$ is open in $X$.\" Translate the natural language version to a Lean mathlib version:

theorem subset_of_open_subset_is_open (X : Type*) [topological_space X]
  (A : set X) (hA : ∀ x ∈ A, ∃ U : set X, is_open U ∧ x ∈ U ∧ U ⊆ A):
  is_open A :=

Natural language: [Your input goes here]. Translate the natural language version to a Lean mathlib version:

The Codex language model completes the pattern to generate the response you see in the app. To incorporate the user’s feedback, we simply concatenate the initial few-shot prompt, the model’s incorrect response, and the instruction “[user’s feedback goes here]. Try again:”, then feed the text back into the neural network.

Note that we don’t provide the model with any examples of failed and subsequently corrected statements, since then the model might start intentionally getting its first try wrong!

What does the future look like?

Lean chat is only the very beginning for autoformalization. Codex is an off-the-shelf model available through OpenAI’s API. Its creators likely never imagined it would be used for formal mathematics, and as such it is not at all optimized for that use case. The few-shot autoformalization capabilities of Codex were only noticed quite recently, and the community working at the intersection of machine learning and formal math has a lot of ideas for how to improve these systems. I expect that in the comings months my examples of Lean Chat’s failures will become completely outdated.

A natural next step after progress on autoformalizing theorem statements is to work on autoformalizing proofs. This is much more daunting that autoformalizing statements, but is definitely within reach. Current language model-based automatic theorem provers take as input the tactic state, and predict the next proof step. One could imagine building a prover that takes as input both a natural language proof and the tactic state to predict the next proof step, but otherwise works in exactly the same way.

Armed with these concrete ideas of how one might approach autoformalizing theorem statements and proofs, building a natural language interface to interactive theorem provers is still an ambitious project, but it is not science fiction either.

Posted in Machine Learning | Tagged , , , , | Leave a comment