In a rigorous mathematical test, four AI models, including ChatGPT 5.5 Pro, were evaluated against human performance. None of the models answered all ten questions correctly. The best-performing model, developed by ETH Zurich, solved six of the ten problems. The test, part of the independent project First Proof, aimed to assess AI capabilities in mathematical research. The questions had not been published before, so the models could not rely on prior training data. A group of 30 mathematicians verified the responses. Only publicly available models participated, which limited involvement to OpenA
Bias read (Center): The article presents factual results of an AI benchmarking test without overtly favoring any side. It describes the methodology, participants, and outcomes neutrally.





