Health care and medical education are increasingly being transformed by artificial intelligence (AI), which enables machines to replicate human cognitive functions such as learning, reasoning, and problem-solving. The emergence of large language models (LLMs), including ChatGPT and similar systems, has created new opportunities in clinical decision support, diagnostics, and the training of medical students. Previous studies have shown that some AI systems can achieve diagnostic and triage accuracy comparable to that of physicians and perform well on standardized medical examinations.
Common benchmarks of medical knowledge and clinical reasoning are licensing examinations, including the United States Medical Licensing Examination (USMLE). Several experiments have tested AI systems on these exams, and models such as ChatGPT-4 have frequently achieved high scores. Nevertheless, previous studies have predominantly focused on text-based queries and single models, leaving limited data on how multiple contemporary AI systems perform on the same dataset or on their ability to interpret image-based medical questions.
To address these gaps, researchers directly compared five modern AI models (Grok, ChatGPT-4, Copilot, Gemini, and DeepSeek) using the 2024 National Board of Medical Examiners (NBME) Free 120 question set, which closely mirrors the style and difficulty of USMLE Step 1. The study evaluated each model's accuracy on text-based, image-based, case-based, and information-based questions and assessed response consistency across repeated prompts as a measure of reliability.
This cross-sectional observational study was conducted between February 10 and March 5, 2025. The researchers examined 119 USMLE-style questions from the NBME Free 120 set after excluding one audio-based question. All five AI models received every question in a standardized prompt format, and each model answered each question three times to assess reliability and consistency. Questions were classified as text-based or image-based and as case-based or information-based. Chi-square and Fisher's exact tests were used for statistical comparisons, with Bonferroni adjustments applied to pairwise comparisons.
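To illustrate this kind of analysis, here is a minimal Python sketch, assuming each model's results reduce to a 2x2 correct/incorrect table and using the overall scores reported below; this is an illustrative reconstruction, not the authors' actual analysis code.

```python
# Hedged sketch: Bonferroni-adjusted pairwise Fisher's exact tests between models.
# The data layout and the choice of a 2x2 correct/incorrect table are assumptions.
from itertools import combinations
from scipy.stats import fisher_exact

N_QUESTIONS = 119
# Total correct answers per model, as reported in the study.
scores = {"Grok": 109, "Copilot": 101, "Gemini": 100, "ChatGPT-4": 95, "DeepSeek": 86}

pairs = list(combinations(scores, 2))  # 10 pairwise comparisons among 5 models
n_comparisons = len(pairs)

for a, b in pairs:
    table = [
        [scores[a], N_QUESTIONS - scores[a]],  # model a: correct, incorrect
        [scores[b], N_QUESTIONS - scores[b]],  # model b: correct, incorrect
    ]
    _, p = fisher_exact(table)
    p_adj = min(1.0, p * n_comparisons)  # Bonferroni adjustment
    print(f"{a} vs {b}: raw P = {p:.3f}, adjusted P = {p_adj:.3f}")
```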
Grok achieved the highest total score of 109/119 (91.6%), followed by Copilot with 101/119 (84.9%) and Gemini with 100/119 (84.0%). ChatGPT-4 scored 95/119 (79.8%), and DeepSeek had the lowest total score of 86/119 (72.3%). DeepSeek's lower performance was primarily due to its inability to interpret visual media, resulting in zero accuracy on image-based questions. On text-only questions (n=96), DeepSeek achieved 86/96 (89.6%) accuracy.
Grok handled complex question types best, with 21/23 (91.3%) accuracy on image-based questions and 70/78 (89.7%) accuracy on case-based questions. The difference between Grok and DeepSeek on case-based items was statistically significant (P = 0.01). Across subject areas, the models scored highest in biostatistics and epidemiology (mean 5.8/6, 96.7%) and lowest in musculoskeletal, skin, and connective tissue topics (mean 4.4/7, 62.9%). In the consistency analysis, Grok demonstrated 100% response consistency. Copilot showed the highest self-correction rate, with 112/119 (94.1%) consistency, and reached a final accuracy of 107/119 (89.9%) on the third attempt.
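As a rough sketch of how such a consistency metric might be computed (the data layout and scoring rule here are assumptions for illustration; the study does not publish its analysis code), one could count a question as consistent when all three responses agree and judge final accuracy on the third attempt:

```python
# Hypothetical sketch: response consistency across three repeated attempts.
from typing import List

def consistency_rate(attempts: List[List[str]]) -> float:
    """Fraction of questions where all repeated answers are identical."""
    consistent = sum(1 for answers in attempts if len(set(answers)) == 1)
    return consistent / len(attempts)

def final_accuracy(attempts: List[List[str]], key: List[str]) -> float:
    """Accuracy of the last (third) response against the answer key."""
    correct = sum(1 for answers, truth in zip(attempts, key) if answers[-1] == truth)
    return correct / len(key)

# Toy data: 3 questions, 3 attempts each (answer choices A-E).
attempts = [["A", "A", "A"], ["B", "C", "C"], ["D", "D", "D"]]
key = ["A", "C", "E"]
print(consistency_rate(attempts))     # 0.67: the model changed its answer on Q2
print(final_accuracy(attempts, key))  # 0.67: third attempt correct on Q1 and Q2
```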
These results highlight substantial differences in the capabilities of existing AI models on medical knowledge assessments. Grok was the best performer, excelling in both visual interpretation and reasoning-intensive clinical questions. Copilot and Gemini were competitive, whereas ChatGPT-4 ranked fourth despite its popularity.
The study also highlights the importance of multimodal capabilities in medical AI. Models unable to interpret images performed significantly worse on visual questions, despite strong text-based reasoning skills. Another key observation was the high response consistency across most models, suggesting that AI systems are becoming increasingly stable. Additionally, Copilot’s ability to revise and improve its answers with each attempt underscores the potential importance of self-correction mechanisms in future AI development.
Overall, the findings suggest that while ChatGPT remains one of the most widely used AI applications, newer models like Grok and Copilot are rapidly improving in medical knowledge and clinical reasoning. Continuous benchmarking will be necessary to ensure the safe and effective integration of AI technologies into medical education and clinical training.
Reference: El Natour D, Abou Alfa M, Chaaban A, et al. Performance of 5 AI models on United States Medical Licensing Examination Step 1 questions: comparative observational study. JMIR AI. 2026;5:e76928. doi:10.2196/76928