Comparing ChatGPT 3.5 and 4.0 in Low Back Pain Patient Education: Addressing Strengths, Limitations, and Psychosocial Challenges


Tabanli A., DEMİRKIRAN N. D.

World Neurosurgery, vol.196, 2025 (SCI-Expanded)

  • Publication Type: Article
  • Volume: 196
  • Publication Date: 2025
  • DOI: 10.1016/j.wneu.2025.123755
  • Journal Name: World Neurosurgery
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Academic Search Premier, CAB Abstracts, Index Islamicus, MEDLINE, Veterinary Science Database
  • Keywords: Artificial intelligence, ChatGPT 3.5, ChatGPT 4.0, Low back pain
  • Kütahya Health Sciences University Affiliated: Yes

Abstract

Background: Artificial intelligence tools like ChatGPT have gained attention for their potential to support patient education by providing accessible, evidence-based information. This study compares the performance of ChatGPT 3.5 and ChatGPT 4.0 in answering common patient questions about low back pain, focusing on response quality, readability, and adherence to clinical guidelines, while also addressing the models' limitations in managing psychosocial concerns.

Methods: Thirty frequently asked patient questions about low back pain were categorized into 4 groups: Diagnosis, Treatment, Psychosocial Factors, and Management Approaches. Responses generated by ChatGPT 3.5 and 4.0 were evaluated on 3 key metrics: 1) response quality, rated on a scale of 1 (excellent) to 4 (unsatisfactory); 2) DISCERN criteria, evaluating reliability and adherence to clinical guidelines, with scores ranging from 1 (low reliability) to 5 (high reliability); and 3) readability, assessed using 7 readability formulas, including the Flesch-Kincaid Grade Level and the Gunning Fog Index.

Results: ChatGPT 4.0 significantly outperformed ChatGPT 3.5 in response quality across all categories, with a mean score of 1.03 compared with 2.07 for ChatGPT 3.5 (P < 0.001). ChatGPT 4.0 also demonstrated higher DISCERN scores (4.93 vs. 4.00, P < 0.001). However, both versions struggled with Psychosocial Factors questions, where responses were rated lower than for Diagnosis, Treatment, and Management questions (P = 0.04).

Conclusions: The limitations of ChatGPT 3.5 and 4.0 in addressing psychosocial concerns highlight the need for clinician oversight, particularly for emotionally sensitive issues. Enhancing artificial intelligence's capability to manage psychosocial aspects of patient care should be a priority in future iterations.
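For readers unfamiliar with the readability formulas named in the Methods, the minimal Python sketch below computes the Flesch-Kincaid Grade Level and the Gunning Fog Index using their standard published coefficients. The vowel-group syllable counter is a naive illustrative assumption, not the instrument used in the study; the study applied 7 formulas in total, of which only these two are shown.

```python
import re

def count_syllables(word: str) -> int:
    """Naive vowel-group syllable counter (a simplifying assumption;
    validated readability tools use dictionaries or better heuristics)."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and count > 1:
        count -= 1  # rough silent-'e' adjustment
    return max(count, 1)

def readability(text: str) -> dict:
    # Sentences approximated by runs of terminal punctuation.
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(len(words), 1)
    syllables = sum(count_syllables(w) for w in words)
    # Gunning Fog treats words of 3+ syllables as "complex".
    complex_words = sum(1 for w in words if count_syllables(w) >= 3)

    # Flesch-Kincaid Grade Level (standard published coefficients).
    fk = 0.39 * (n_words / sentences) + 11.8 * (syllables / n_words) - 15.59
    # Gunning Fog Index (standard published formula).
    fog = 0.4 * ((n_words / sentences) + 100 * (complex_words / n_words))
    return {"flesch_kincaid_grade": round(fk, 2), "gunning_fog": round(fog, 2)}

if __name__ == "__main__":
    sample = ("Low back pain is common. Most episodes improve within a few "
              "weeks with activity modification and simple pain relief.")
    print(readability(sample))
```

Both formulas map text onto an approximate US school-grade level, which is why they are commonly used to judge whether patient education materials meet recommended reading levels.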