Do AI Models Truly Understand Science—or Just Summarize It?


Science & Technology (Commonwealth Union) – The potential of Artificial Intelligence (AI) to process and interpret vast amounts of scientific data has been seen as a game-changer; however, its effectiveness and accuracy have often been a key area of focus.

To keep pace with developments and advance their work, scientists need ready access to – and familiarity with – thousands of published research papers. Large language models (LLMs) appear to offer a useful way to navigate this enormous body of scientific literature, but an important question remains: can they reliably deliver complete and scientifically accurate responses to complex questions in highly specialized disciplines?

To investigate this, physicists from Cornell collaborated with researchers from Google and assembled a panel of 12 experts to evaluate six LLM systems – including ChatGPT, Claude and others. The team assessed how well these models could interpret scientific literature at a specialist level, focusing on the field of high-temperature cuprates, a group of superconducting materials. Their findings showed that some systems performed better than others. The research also identified several limitations in current LLM capabilities and outlined areas developers may need to improve in future AI models.

Eun-Ah Kim, the Hans A. Bethe Professor of Physics in Cornell’s College of Arts and Sciences and the study’s corresponding author, pointed out that the study examines whether LLMs can read and interpret scientific literature the way a subject expert would.

She further indicated that the work is particularly relevant now because there is widespread curiosity about what LLMs can and cannot achieve, especially regarding artificial general intelligence (AGI). Their findings, she said, highlight clear gaps in current LLM abilities, indicating that the models are still far from reaching AGI.

The study, titled “Expert Evaluation of LLM World Models: A High-Tc Superconductivity Case Study,” was published March 10 in the Proceedings of the National Academy of Sciences. The lead author is Haoyu Guo, a Bethe/KIC postdoctoral fellow at Cornell’s Laboratory of Atomic and Solid State Physics (LAASP).

As a graduate student, Guo focused on cuprate high-Tc superconductors, the very field highlighted in the current study. He noted that the central difficulty was the enormous volume of experimental data collected over decades.

He further pointed out that he wanted to see if an LLM could assist students or early-career researchers entering a new area of study—beyond just cuprates.

To explore this, the team assembled a database of 1,726 scientific papers, carefully curated by experts, documenting the history of high-temperature cuprate research. They also developed a set of 67 questions, crafted by a broader panel of specialists, designed to test deep comprehension of the literature.

Using these resources, the researchers evaluated four LLMs—ChatGPT-4, Claude 3.5, Perplexity, and Gemini Advanced Pro 1.5—alongside NotebookLM, a Google tool that answers questions based on supplied documents. They also tested a custom retrieval-augmented generation (RAG) system, capable of pulling relevant images and text from the curated papers.
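The paper does not describe the custom RAG system’s implementation, but the general pattern of retrieval-augmented generation is well established: rank a curated document collection by similarity to the question, then assemble the top matches into a grounded prompt for the model. The following minimal sketch illustrates that idea only; the toy bag-of-words similarity, the sample texts, and all function names are illustrative placeholders, not the team’s actual system, which a real pipeline would replace with neural embeddings and an LLM call.

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding" (word counts, punctuation stripped).
    # A real RAG system would use a neural embedding model here.
    return Counter(w.strip("?.,!") for w in text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)  # Counter returns 0 for missing keys
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question, papers, k=2):
    # Rank the curated papers by similarity to the question; keep the top k.
    q = embed(question)
    ranked = sorted(papers, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]

def build_prompt(question, papers):
    # Assemble a grounded prompt: retrieved excerpts plus the question.
    # The assembled prompt would then be sent to an LLM for answering.
    context = "\n---\n".join(retrieve(question, papers))
    return f"Answer using only these excerpts:\n{context}\n\nQuestion: {question}"

# Illustrative stand-ins for the curated literature, not real paper text.
papers = [
    "The pseudogap phase in cuprate superconductors shows anomalous behavior.",
    "Iron-based superconductors exhibit multiband pairing.",
    "Doping dependence of Tc in cuprates follows a dome-shaped curve.",
]
prompt = build_prompt("How does Tc depend on doping in cuprates?", papers)
```

Grounding the prompt in retrieved excerpts, rather than letting the model search the open internet, is the design choice that Guo credits below for the stronger performance of NotebookLM and the custom system.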

Human experts graded the responses from each system without knowing which system produced them.

The results showed that systems incorporating curated content—Google’s NotebookLM and the custom RAG system—performed the strongest.

“LLMs operating on trusted data sources – papers we collected ourselves, not from the LLM searching the Internet – tend to perform better,” Guo said. “Among these, NotebookLM performs better when I have a set of papers that I want to understand better.”


While all the LLMs impressed with their ability to extract text, Kim said they were “totally incapable” of interpreting data visualizations, a major shortcoming since analyzing charts and figures is a critical skill for students reading scientific papers.

The custom model, which can pull images directly from curated documents, performed much better in this area.

Guo highlighted key improvements needed for future LLMs: accurate sourcing of claims to prevent fabricated references, the ability to synthesize multiple aspects of complex problems, and stronger comprehension of plots and figures.
