International Orthodontics, vol.24, no.1, 2026 (ESCI, Scopus)
Introduction: The objective of this study was to assess the repeatability of responses to orthodontic questions generated by multiple large language models across repeated time points.

Methods: This experimental study assessed the answers provided by ChatGPT-3.5, ChatGPT-4.0, Gemini, and Gemini Advanced to 40 frequently asked orthodontic questions. Each model was prompted with the same questions at three time points (T0: day 0; T1: day 7; T2: day 14). Two blinded orthodontic experts independently rated the responses on a 3-point accuracy scale. Cohen's Kappa and the intraclass correlation coefficient (ICC) were applied to assess inter-rater agreement and repeatability, respectively. In addition, the Friedman test with Bonferroni-corrected post-hoc analysis and Spearman correlation were used for temporal comparisons.

Results: Cohen's Kappa values between raters ranged from 0.624 to 0.749, indicating substantial inter-rater agreement. ICC values for repeatability ranged from 0.666 (Gemini) to 0.960 (ChatGPT-3.5). The Friedman test revealed significant differences in model accuracy at T0 and T2 (P < 0.001). Post-hoc analysis showed that ChatGPT-3.5 differed significantly from Gemini and Gemini Advanced. Spearman correlations between time points were positive but weak (ρ = 0.284 to 0.383, P < 0.001).

Conclusions: The study revealed statistically significant differences in repeatability among the AI models. Despite high accuracy, some models exhibited limited consistency over time. These findings underscore the importance of evaluating both accuracy and temporal stability when integrating AI systems into clinical orthodontic communication.
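As background for the agreement statistic reported above: Cohen's Kappa corrects the observed proportion of rater agreement for the agreement expected by chance from each rater's marginal label frequencies. A minimal Python sketch follows; the ratings shown are illustrative placeholders on a 3-point scale like the study's, not the study's actual data.

```python
from collections import Counter

def cohen_kappa(rater1, rater2):
    """Cohen's Kappa: chance-corrected agreement between two raters."""
    assert len(rater1) == len(rater2) and rater1
    n = len(rater1)
    # Observed proportion of exact agreement
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Chance agreement from each rater's marginal label frequencies
    c1, c2 = Counter(rater1), Counter(rater2)
    p_e = sum((c1[lab] / n) * (c2[lab] / n) for lab in set(rater1) | set(rater2))
    return (p_o - p_e) / (1 - p_e)

# Hypothetical accuracy scores (1-3) from two raters over ten responses
rater_a = [1, 1, 2, 2, 3, 3, 1, 2, 3, 1]
rater_b = [1, 1, 2, 3, 3, 3, 1, 2, 2, 1]
print(round(cohen_kappa(rater_a, rater_b), 3))  # prints 0.697
```

Values between 0.61 and 0.80, like the 0.624 to 0.749 range reported here, are conventionally interpreted as substantial agreement on the Landis and Koch scale.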