AI in healthcare is a hot topic, but the harder question is how to put it to good use in everyday clinical services. We explored this by testing AI models' ability to act as a clinician copilot during live consultations.

Crucially, every recommendation was still validated by a human doctor before it reached the patient. The question was simple: which model actually helps give safer, faster, better care?

What the Copilot Needed To Do

  • Ask relevant follow-up questions and investigate
  • Suggest diagnoses and differential possibilities
  • Draft prescriptions and treatment plans, including medication choices

How Did We Do It?

We ran a head-to-head evaluation of three large language models across multiple specialties, including General Medicine, Gynaecology, Orthopaedics, and Dentistry:

  • Google Gemini 3.1 Pro Preview
  • Qwen 3.5 Plus 02-15
  • GPT-5.4

How the Blind Test Worked

  1. During each consultation, two models answered the same clinical scenario.
  2. The clinician saw both responses without model names and chose the better one, or marked both as good or both as bad.
  3. Each doctor-AI exchange, meaning a single question and response, counted as one turn.
  4. After filtering incomplete sessions and non-clinician testers, we analyzed 413 evaluated turns.

For each turn, we recorded the outcome as a win, a loss, a tie, both good, or both bad. Doctors judged both models good in nearly half of all turns, which suggests the overall quality bar is already quite high.
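As a rough illustration, here is a minimal Python sketch of how such turn records can be tallied. The `Turn` fields and outcome labels are our own illustration, not the production schema:

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Turn:
    model_a: str   # first anonymised response shown to the doctor
    model_b: str   # second anonymised response
    outcome: str   # "a_wins", "b_wins", "tie", "both_good", or "both_bad"

def tally(turns: list[Turn]) -> Counter:
    """Count per-model wins plus the shared-outcome buckets."""
    counts: Counter = Counter()
    for t in turns:
        if t.outcome == "a_wins":
            counts[t.model_a] += 1
        elif t.outcome == "b_wins":
            counts[t.model_b] += 1
        else:
            counts[t.outcome] += 1  # tie / both_good / both_bad
    return counts
```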

However, just counting wins is not enough because the models did not play the same number of matches against each other. That is where more serious ranking methods come in.

Elo and Glicko-2 Come Into Play

To fairly compare models, we used two rating systems borrowed from competitive games.

Elo rating

Every model starts at a baseline score of 1500. Beating a strong opponent increases your rating more than beating a weak one.
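For concreteness, here is a minimal sketch of a standard Elo update. The K-factor of 32 is a common default, not necessarily the constant used in our analysis:

```python
def elo_update(r_winner: float, r_loser: float, k: float = 32.0) -> tuple[float, float]:
    """One Elo update: gains are dampened when the win was already expected."""
    expected_win = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected_win)
    return r_winner + delta, r_loser - delta

# Two models starting at the 1500 baseline: the winner gains a full 16 points.
print(elo_update(1500, 1500))  # (1516.0, 1484.0)
# An already-strong model gains less for beating a weaker one:
print(elo_update(1549, 1462))  # winner gains roughly 12 points
```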

Glicko-2 rating

A similar idea, but it also tracks uncertainty via a rating deviation (RD). A smaller RD indicates greater confidence in the rating.
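To show how RD enters the picture, here is a simplified expected-score computation in the style of the original Glicko-1 formulas (Glicko-2 applies the same idea on a rescaled rating axis):

```python
import math

Q = math.log(10) / 400.0

def g(rd: float) -> float:
    """Dampening factor: a large opponent RD shrinks the effective rating gap."""
    return 1.0 / math.sqrt(1.0 + 3.0 * Q**2 * rd**2 / math.pi**2)

def expected_score(r: float, r_opp: float, rd_opp: float) -> float:
    """Estimated win probability that discounts uncertain opponent ratings."""
    return 1.0 / (1.0 + 10 ** (-g(rd_opp) * (r - r_opp) / 400.0))

# The same 105-point gap (1623 vs 1518) means less against a
# high-uncertainty opponent:
print(expected_score(1623, 1518, rd_opp=63))   # ~0.64
print(expected_score(1623, 1518, rd_opp=350))  # ~0.60
```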

Results

Model                         | Elo        | Glicko-2 (RD)
------------------------------|------------|--------------
Google Gemini 3.1 Pro Preview | 1549 (+49) | 1623 (RD 62)
Qwen 3.5 Plus 02-15           | 1489 (-11) | 1581 (RD 62)
GPT-5.4                       | 1462 (-38) | 1518 (RD 63)

Gemini clearly comes out on top. However, Qwen is close enough that it cannot be dismissed. GPT-5.4 consistently lags both in direct head-to-head comparisons. To understand the close race between Qwen and Gemini, we also looked at token usage, response length, and thinking effort.

Quality vs Cost: Tokens and Response Length

Average usage per turn showed clear tradeoffs.

Model   | Avg input tokens | Output profile                                  | Interpretation
--------|------------------|-------------------------------------------------|---------------
GPT-5.4 | 5190             | Very short, around 27 tokens / 95 chars         | Cheap, but often too brief for clinical depth and empathy
Gemini  | 5368             | Moderate, around 540 tokens / 202 chars         | Balanced detail without becoming too verbose
Qwen    | 5870             | Longest, 2300+ tokens / 235 chars in text stats | Strong quality, but heavier on compute

Gemini sits in the "Goldilocks zone" with enough detail to be clinically useful, without being over-verbose or too expensive.
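To make the tradeoff concrete, a rough per-turn cost can be estimated from the averages above. The per-million-token prices below are deliberately identical placeholders, not the models' actual rates, so the comparison isolates pure token volume:

```python
# Placeholder prices (USD per million tokens) -- NOT actual model rates.
PRICE_IN, PRICE_OUT = 1.0, 4.0

USAGE = {  # average tokens per turn, from the table above
    "GPT-5.4": (5190, 27),
    "Gemini":  (5368, 540),
    "Qwen":    (5870, 2300),
}

for model, (tok_in, tok_out) in USAGE.items():
    cost = (tok_in * PRICE_IN + tok_out * PRICE_OUT) / 1e6
    print(f"{model}: ~${cost:.4f} per turn")
```

Even at identical per-token rates, Qwen's 2300+ output tokens make it roughly twice as expensive per turn as Gemini, while GPT-5.4's brevity is cheap but, as the quality results show, often too thin.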

Thinking Time, Latency, and Deployment Reality

We also looked at auto-thinking mode, where models do extra internal reasoning before responding. Qwen and Gemini ended in a near dead heat, with 39 vs 35 wins respectively. This suggests that giving both models more thinking time makes them similarly strong, with Qwen being a good alternative if verbose answers are acceptable.
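A quick sanity check supports calling this a dead heat. Under an exact two-sided binomial test (our own check, not part of the original scoring), 39 wins out of 74 decisive turns is entirely consistent with a 50/50 split:

```python
from math import comb

def binom_two_sided_p(k: int, n: int) -> float:
    """Exact two-sided binomial test against p = 0.5."""
    tail = max(k, n - k)
    # For p = 0.5 the distribution is symmetric, so doubling one tail works.
    return min(1.0, 2 * sum(comb(n, i) for i in range(tail, n + 1)) / 2**n)

# 39 Qwen wins vs 35 Gemini wins in auto-thinking mode
print(binom_two_sided_p(39, 74))  # ~0.73, nowhere near significance
```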

Even so, Qwen would likely lose to Gemini on latency in the Indian market if local data centers remain unavailable for the China-originating model.

Artificial Analysis reports Gemini 3.1 Pro Preview time-to-first-token at around 32 s on Google AI Studio and 46.6 s on Vertex, while GPT-5.4 comes in at around 137.36 s on Azure and 156.09 s on OpenAI. For Qwen 3.5 Plus, the VALS benchmark page reports a latency index of 570.93 s.

Putting It All Together

  • Quality: Gemini and Qwen are the two serious contenders; GPT-5.4 consistently underperforms.
  • Efficiency: Gemini achieves high ratings with far fewer output tokens than Qwen.
  • Latency and deployment reality: Given current infrastructure and likely latency, Gemini is a more practical choice for Indian clinical settings.

References

  1. Artificial Analysis: Gemini 3.1 Pro Preview providers
  2. Artificial Analysis: GPT-5.4 providers
  3. VALS: Alibaba Qwen 3.5 Plus Thinking