TY - JOUR
T1 - Performance of ChatGPT, Bard, Claude, and Bing on the Peruvian National Licensing Medical Examination
T2 - a cross-sectional study
AU - Torres-Zegarra, Betzy Clariza
AU - Rios-Garcia, Wagner
AU - Ñaña-Cordova, Alvaro Micael
AU - Arteaga-Cisneros, Karen Fatima
AU - Benavente Chalco, Xiomara Cristina
AU - Bustamante Ordoñez, Marina Atena
AU - Gutierrez Rios, Carlos Jesus
AU - Ramos Godoy, Carlos Alberto
AU - Teresa Panta Quezada, Kristell Luisa
AU - Gutierrez-Arratia, Jesus Daniel
AU - Flores-Cohaila, Javier Alejandro
N1 - Publisher Copyright:
© 2023 Korea Health Personnel Licensing Examination Institute.
PY - 2023
Y1 - 2023
AB - Purpose: We aimed to describe the performance and evaluate the educational value of the justifications provided by artificial intelligence chatbots, including GPT-3.5, GPT-4, Bard, Claude, and Bing, on the Peruvian National Licensing Medical Examination (P-NLME). Methods: This was a cross-sectional analytical study. On July 25, 2023, each multiple-choice question (MCQ) from the P-NLME was entered into each chatbot (GPT-3.5, GPT-4, Bing, Bard, and Claude) 3 times. Then, 4 medical educators categorized the MCQs in terms of medical area, item type, and whether the MCQ required Peru-specific knowledge. They also assessed the educational value of the justifications from the 2 top performers (GPT-4 and Bing). Results: GPT-4 scored 86.7% and Bing scored 82.2%, followed by Bard and Claude, while the historical performance of Peruvian examinees was 55%. Among the factors associated with correct answers, only MCQs requiring Peru-specific knowledge had lower odds of being answered correctly (odds ratio, 0.23; 95% confidence interval, 0.09–0.61); the remaining factors showed no significant associations. In the assessment of the educational value of the justifications provided by GPT-4 and Bing, no significant differences were found between them in certainty, usefulness, or potential use in the classroom. Conclusion: Among the chatbots, GPT-4 and Bing were the top performers, with Bing performing better on Peru-specific MCQs. Moreover, the educational value of the justifications provided by GPT-4 and Bing could be deemed appropriate. However, it is essential to begin addressing the educational value of these chatbots, rather than merely their performance on examinations.
KW - Artificial intelligence
KW - Educational measurement
KW - Medical education
KW - Peru
UR - http://www.scopus.com/inward/record.url?scp=85177454993&partnerID=8YFLogxK
U2 - 10.3352/jeehp.2023.20.30
DO - 10.3352/jeehp.2023.20.30
M3 - Article
C2 - 37981579
AN - SCOPUS:85177454993
SN - 1975-5937
VL - 20
JO - Journal of Educational Evaluation for Health Professions
JF - Journal of Educational Evaluation for Health Professions
ER -