Do AI Models Truly Understand Science—or Just Summarize It?


Science & Technology (Commonwealth Union) – The potential of Artificial Intelligence (AI) to process and interpret vast amounts of scientific data has been seen as a game-changer; however, its effectiveness and accuracy have often been a key area of focus.

To keep pace with developments and advance their work, scientists need ready access to – and familiarity with – thousands of published research papers. Large language models (LLMs) appear to offer a useful way to navigate this enormous body of scientific literature, but an important question remains: can they reliably deliver complete and scientifically accurate responses to complex questions in highly specialized disciplines?

To investigate this, physicists from Cornell collaborated with researchers from Google and assembled a panel of 12 experts to evaluate six LLM systems – including ChatGPT, Claude and others. The team assessed how well these models could interpret scientific literature at a specialist level, focusing on the field of high-temperature cuprates, a group of superconducting materials. Their findings showed that some systems performed better than others. The research also identified several limitations in current LLM capabilities and outlined areas developers may need to improve in future AI models.

Eun-Ah Kim, the Hans A. Bethe Professor of Physics in Cornell’s College of Arts and Sciences and the study’s corresponding author, pointed out that the study examines whether LLMs can read and interpret scientific literature the way a subject expert would.

She further indicated that the work is particularly relevant now because there is widespread curiosity about what LLMs can and cannot achieve, especially regarding artificial general intelligence (AGI). Their findings, she said, highlight clear gaps in current LLM abilities, indicating that the models are still far from reaching AGI.

The study, titled “Expert Evaluation of LLM World Models: A High-Tc Superconductivity Case Study,” was published March 10 in the Proceedings of the National Academy of Sciences. The lead author is Haoyu Guo, a Bethe/KIC postdoctoral fellow at Cornell’s Laboratory of Atomic and Solid State Physics (LAASP).

As a graduate student, Guo focused on cuprate high-Tc superconductors, the very field highlighted in the current study. He noted that the central difficulty was the enormous volume of experimental data collected over decades.

He further pointed out that he wanted to see if an LLM could assist students or early-career researchers entering a new area of study—beyond just cuprates.

To explore this, the team assembled a database of 1,726 scientific papers, carefully curated by experts, documenting the history of high-temperature cuprate research. They also developed a set of 67 questions, crafted by a broader panel of specialists, designed to test deep comprehension of the literature.

Using these resources, the researchers evaluated four LLMs—ChatGPT-4, Claude 3.5, Perplexity, and Gemini Advanced Pro 1.5—alongside NotebookLM, a Google tool that answers questions based on supplied documents. They also tested a custom retrieval-augmented generation (RAG) system, capable of pulling relevant images and text from the curated papers.
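The paper does not describe the custom RAG system’s implementation, but the general pattern of retrieval-augmented generation is well established: rank a curated document collection by similarity to the question, then assemble the top matches into a grounded prompt for the model. The following minimal sketch illustrates that idea only; the toy bag-of-words similarity, the sample texts, and all function names are illustrative placeholders, not the team’s actual system, which a real pipeline would replace with neural embeddings and an LLM call.

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding" (word counts, punctuation stripped).
    # A real RAG system would use a neural embedding model here.
    return Counter(w.strip("?.,!") for w in text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)  # Counter returns 0 for missing keys
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question, papers, k=2):
    # Rank the curated papers by similarity to the question; keep the top k.
    q = embed(question)
    ranked = sorted(papers, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]

def build_prompt(question, papers):
    # Assemble a grounded prompt: retrieved excerpts plus the question.
    # The assembled prompt would then be sent to an LLM for answering.
    context = "\n---\n".join(retrieve(question, papers))
    return f"Answer using only these excerpts:\n{context}\n\nQuestion: {question}"

# Illustrative stand-ins for the curated literature, not real paper text.
papers = [
    "The pseudogap phase in cuprate superconductors shows anomalous behavior.",
    "Iron-based superconductors exhibit multiband pairing.",
    "Doping dependence of Tc in cuprates follows a dome-shaped curve.",
]
prompt = build_prompt("How does Tc depend on doping in cuprates?", papers)
```

Grounding the prompt in retrieved excerpts, rather than letting the model search the open internet, is the design choice that Guo credits below for the stronger performance of NotebookLM and the custom system.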

Human experts graded the responses from each system without knowing which system produced them.

The results showed that systems incorporating curated content—Google’s NotebookLM and the custom RAG system—performed the strongest.

“LLMs operating on trusted data sources – papers we collected ourselves, not from the LLM searching the Internet – tend to perform better,” Guo said. “Among these, NotebookLM performs better when I have a set of papers that I want to understand better.”


While all the LLMs impressed with their ability to extract text, Kim said they were “totally incapable” of interpreting data visualizations, a major shortcoming since analyzing charts and figures is a critical skill for students reading scientific papers.

The custom model, which can pull images directly from curated documents, performed much better in this area.

Guo highlighted key improvements needed for future LLMs: accurate sourcing of claims to prevent fabricated references, the ability to synthesize multiple aspects of complex problems, and stronger comprehension of plots and figures.
