Do large language models differ in their pharmacology-related response quality for oral and maxillofacial surgery? a blinded expert benchmark study

Article info

Authors: Mustafa Isleyen, Asenur Aydemir
Journal: BMC Oral Health
Published: 2026-07-04
DOI: 10.1186/s12903-026-09182-w
Source: OpenAlex

Abstract (English)

Background: Pharmacology represents the lowest-performing subcategory in oral and maxillofacial surgery (OMFS) evaluations of large language models (LLMs), yet no study has simultaneously compared the leading commercial LLMs across multiple pharmacological domains and question formats. This study evaluated ChatGPT 5.3, Gemini 3.1 Pro, and Claude 4.6 Sonnet in OMFS pharmacology.

Methods: Thirty-six OMFS pharmacology questions spanning five clinical domains (antibiotic prophylaxis, analgesics, drug–drug interactions, anesthetic pharmacology, special populations) and three formats (open-ended, multiple-choice, true/false; n = 12 each) were submitted to each LLM using a standardized role-conditioning prompt. The 108 responses were independently and blindly evaluated by two oral and maxillofacial surgeons (one specialist and one resident) on three 5-point Likert criteria. Inter-rater reliability was quantified using ICC(2,1) and Cohen’s κ_w. Inter-model differences were assessed using Friedman tests; format effects were assessed using Kruskal–Wallis tests with Bonferroni-corrected post-hoc comparisons.

Results: Inter-rater reliability was excellent (ICC = 0.828; κ_w = 0.827; exact agreement 91.0%). A robust hierarchy emerged: Claude > Gemini > ChatGPT (χ²(2) = 47.91, p < 0.001, W = 0.665), with all pairwise comparisons significant. Gemini and Claude did not differ significantly in any format section, indicating clinical equivalence. ChatGPT exhibited a significant decline on open-ended, integrative-reasoning items (H(2) = 17.04, p < 0.001, ε² = 0.456), absent in Gemini and Claude. Significant positive correlations among the evaluation criteria within the ChatGPT data indicated convergence among the three scoring dimensions.

Conclusion: Claude 4.6 Sonnet and Gemini 3.1 Pro achieved near-maximal scores on this structured pharmacology benchmark, while ChatGPT 5.3 showed a significant decline in open-ended reasoning. Current LLMs should be regarded as adjunctive tools requiring expert verification for high-risk OMFS pharmacological decisions.
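For readers who want to sanity-check the effect sizes quoted in the abstract, the short sketch below recomputes them from the reported test statistics. It rests on assumptions the abstract does not state: that the Friedman test treated the 36 questions as blocks across the 3 models, and that the format-effect size was obtained as (H - k + 1) / (n - k), a formula more often labeled eta-squared. The function names are illustrative only and do not come from the paper.

```python
def kendalls_w(chi2: float, n_blocks: int, k_conditions: int) -> float:
    """Kendall's W derived from a Friedman chi-square statistic."""
    return chi2 / (n_blocks * (k_conditions - 1))


def kruskal_effect_size(h: float, n_obs: int, k_groups: int) -> float:
    """Rank-based effect size for Kruskal-Wallis: (H - k + 1) / (n - k)."""
    return (h - k_groups + 1) / (n_obs - k_groups)


# Assumed design: 36 questions (blocks), 3 models or 3 formats (groups).
w = kendalls_w(47.91, n_blocks=36, k_conditions=3)
effect = kruskal_effect_size(17.04, n_obs=36, k_groups=3)

print(round(w, 3))       # 0.665
print(round(effect, 3))  # 0.456
```

Under these assumptions both values agree with the abstract (W = 0.665 and 0.456), which suggests, but does not confirm, that this is how they were derived.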

Read full article

This post was published automatically by the Ortho OA Fetcher plugin. Images (if any) are from PubMed Central. The content is taken from open access sources and machine-translated; it is for reference only.

