A recent study has uncovered significant limitations in the ability of leading artificial intelligence models to handle complex historical questions. Although these models excel at tasks such as coding and podcast generation, they struggle with questions at the level of an advanced history exam. The research team, affiliated with the Complexity Science Hub in Austria, introduced a new benchmark called Hist-LLM, which tests the answers of large language models (LLMs) against historical facts verified in the extensive Seshat Global History Databank.
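For readers curious how this kind of evaluation works mechanically, the sketch below scores a model's answers against facts encoded in a databank. It is a minimal illustration only: the question format, the "present/absent" answer style, and the ask_model callable are assumptions for demonstration, not the actual Hist-LLM evaluation code or the real structure of the Seshat databank.

```python
# Hypothetical sketch of a fact-checking benchmark; not the actual
# Hist-LLM harness. Question wording, answer format, and ask_model()
# are illustrative assumptions.

def score_benchmark(questions, ask_model):
    """Return the fraction of questions where the model's answer
    matches the databank's verified value."""
    correct = 0
    for q in questions:
        answer = ask_model(q["prompt"])  # e.g. a call to an LLM API
        if answer.strip().lower() == q["verified_answer"].lower():
            correct += 1
    return correct / len(questions)

# One example question drawn from a structured historical databank.
questions = [
    {
        "prompt": "Did the Roman Principate use coinage? "
                  "Answer 'present' or 'absent'.",
        "verified_answer": "present",  # fact encoded in the databank
    },
]

# Stub model that always answers 'present', standing in for a real LLM.
accuracy = score_benchmark(questions, lambda prompt: "present")
print(f"Accuracy: {accuracy:.0%}")
```

A benchmark of this kind reduces each model to a single accuracy figure, which is what allows the study to compare models against one another and against a random-guessing baseline.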
The findings, presented at the prestigious NeurIPS conference, revealed that even the best-performing model, GPT-4 Turbo, achieved only about 46% accuracy, barely above random chance. Maria del Rio-Chanona, a co-author of the study and associate professor at University College London, emphasized that while LLMs are adept at supplying basic historical facts, they fall short on nuanced, doctoral-level questions. The study also documented cases where LLMs answered incorrectly because they extrapolated from more prominent historical data, overlooking less well-known but crucial details.
The research underscores the importance of human expertise in specialized fields like history. Peter Turchin, the lead researcher, noted that in certain domains LLMs are not yet a substitute for human historians. Even so, the study's authors remain optimistic that these models can assist historians in the future. By refining the benchmark with more diverse data and more complex questions, the researchers aim to improve how well LLMs handle historical inquiry. Ultimately, the work highlights both the current limitations of AI and its promise as a support for scholarly research.