AI in healthcare is a hot topic, but the harder question is how to put it to good use in everyday clinical services. We explored this by testing AI models' ability to act as a clinician copilot during live consultations.

Crucially, every recommendation was still validated by a human doctor before it reached the patient. The question was simple: which model actually helps give safer, faster, better care?

What the Copilot Needed To Do

  • Ask relevant follow-up questions and investigate
  • Suggest diagnoses and differential possibilities
  • Draft prescriptions and treatment plans, including medication choices

How Did We Do It?

We ran a head-to-head evaluation of three large language models across multiple specialties, including General Medicine, Gynaecology, Orthopaedics, and Dentistry:

  • Google Gemini 3.1 Pro Preview
  • Qwen 3.5 Plus 02-15
  • GPT-5.4

How the Blind Test Worked

  1. During each consultation, two models answered the same clinical scenario.
  2. The clinician saw both responses without model names and chose the better one, or marked both as good or both as bad.
  3. Each doctor-AI exchange, meaning a single question and response, counted as one turn.
  4. After filtering incomplete sessions and non-clinician testers, we analyzed 413 evaluated turns.

For each turn, we recorded the outcome as a win, a loss, a tie, both good, or both bad. Doctors judged both models good in nearly half of all turns, which suggests the overall quality bar is already quite high.
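As a rough illustration, here is a minimal Python sketch of how such turn records can be tallied. The `Turn` fields and outcome labels are our own illustration, not the production schema:

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Turn:
    model_a: str   # first anonymised response shown to the doctor
    model_b: str   # second anonymised response
    outcome: str   # "a_wins", "b_wins", "tie", "both_good", or "both_bad"

def tally(turns: list[Turn]) -> Counter:
    """Count per-model wins plus the shared-outcome buckets."""
    counts: Counter = Counter()
    for t in turns:
        if t.outcome == "a_wins":
            counts[t.model_a] += 1
        elif t.outcome == "b_wins":
            counts[t.model_b] += 1
        else:
            counts[t.outcome] += 1  # tie / both_good / both_bad
    return counts
```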

However, just counting wins is not enough because the models did not play the same number of matches against each other. That is where more serious ranking methods come in.

Elo and Glicko-2 Come Into Play

To fairly compare models, we used two rating systems borrowed from competitive games.

Elo rating

Every model starts at a baseline score of 1500. Beating a strong opponent increases your rating more than beating a weak one.
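For concreteness, here is a minimal sketch of a standard Elo update. The K-factor of 32 is a common default, not necessarily the constant used in our analysis:

```python
def elo_update(r_winner: float, r_loser: float, k: float = 32.0) -> tuple[float, float]:
    """One Elo update: gains are dampened when the win was already expected."""
    expected_win = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected_win)
    return r_winner + delta, r_loser - delta

# Two models starting at the 1500 baseline: the winner gains a full 16 points.
print(elo_update(1500, 1500))  # (1516.0, 1484.0)
# An already-strong model gains less for beating a weaker one:
print(elo_update(1549, 1462))  # winner gains roughly 12 points
```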

Glicko-2 rating

A similar idea, but it also tracks uncertainty via a rating deviation (RD). A smaller RD indicates greater confidence in the rating.
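To show how RD enters the picture, here is a simplified expected-score computation in the style of the original Glicko-1 formulas (Glicko-2 applies the same idea on a rescaled rating axis):

```python
import math

Q = math.log(10) / 400.0

def g(rd: float) -> float:
    """Dampening factor: a large opponent RD shrinks the effective rating gap."""
    return 1.0 / math.sqrt(1.0 + 3.0 * Q**2 * rd**2 / math.pi**2)

def expected_score(r: float, r_opp: float, rd_opp: float) -> float:
    """Estimated win probability that discounts uncertain opponent ratings."""
    return 1.0 / (1.0 + 10 ** (-g(rd_opp) * (r - r_opp) / 400.0))

# The same 105-point gap (1623 vs 1518) means less against a
# high-uncertainty opponent:
print(expected_score(1623, 1518, rd_opp=63))   # ~0.64
print(expected_score(1623, 1518, rd_opp=350))  # ~0.60
```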

Results

Model                         | Elo        | Glicko-2 (RD)
------------------------------|------------|--------------
Google Gemini 3.1 Pro Preview | 1549 (+49) | 1623 (RD 62)
Qwen 3.5 Plus 02-15           | 1489 (-11) | 1581 (RD 62)
GPT-5.4                       | 1462 (-38) | 1518 (RD 63)

Gemini clearly comes out on top. However, Qwen is close enough that it cannot be dismissed. GPT-5.4 consistently lags both in direct head-to-head comparisons. To understand the close race between Qwen and Gemini, we also looked at token usage, response length, and thinking effort.

Quality vs Cost: Tokens and Response Length

Average usage per turn showed clear tradeoffs.

Model   | Avg input tokens | Output profile                                  | Interpretation
--------|------------------|-------------------------------------------------|---------------
GPT-5.4 | 5190             | Very short, around 27 tokens / 95 chars         | Cheap, but often too brief for clinical depth and empathy
Gemini  | 5368             | Moderate, around 540 tokens / 202 chars         | Balanced detail without becoming too verbose
Qwen    | 5870             | Longest, 2300+ tokens / 235 chars in text stats | Strong quality, but heavier on compute

Gemini sits in the "Goldilocks zone" with enough detail to be clinically useful, without being over-verbose or too expensive.
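To make the tradeoff concrete, a rough per-turn cost can be estimated from the averages above. The per-million-token prices below are deliberately identical placeholders, not the models' actual rates, so the comparison isolates pure token volume:

```python
# Placeholder prices (USD per million tokens) -- NOT actual model rates.
PRICE_IN, PRICE_OUT = 1.0, 4.0

USAGE = {  # average tokens per turn, from the table above
    "GPT-5.4": (5190, 27),
    "Gemini":  (5368, 540),
    "Qwen":    (5870, 2300),
}

for model, (tok_in, tok_out) in USAGE.items():
    cost = (tok_in * PRICE_IN + tok_out * PRICE_OUT) / 1e6
    print(f"{model}: ~${cost:.4f} per turn")
```

Even at identical per-token rates, Qwen's 2300+ output tokens make it roughly twice as expensive per turn as Gemini, while GPT-5.4's brevity is cheap but, as the quality results show, often too thin.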

Thinking Time, Latency, and Deployment Reality

We also looked at auto-thinking mode, where models do extra internal reasoning before responding. Qwen and Gemini ended in a near dead heat, with 39 vs 35 wins respectively. This suggests that giving both models more thinking time makes them similarly strong, with Qwen being a good alternative if verbose answers are acceptable.
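A quick sanity check supports calling this a dead heat. Under an exact two-sided binomial test (our own check, not part of the original scoring), 39 wins out of 74 decisive turns is entirely consistent with a 50/50 split:

```python
from math import comb

def binom_two_sided_p(k: int, n: int) -> float:
    """Exact two-sided binomial test against p = 0.5."""
    tail = max(k, n - k)
    # For p = 0.5 the distribution is symmetric, so doubling one tail works.
    return min(1.0, 2 * sum(comb(n, i) for i in range(tail, n + 1)) / 2**n)

# 39 Qwen wins vs 35 Gemini wins in auto-thinking mode
print(binom_two_sided_p(39, 74))  # ~0.73, nowhere near significance
```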

Even so, Qwen would likely lose to Gemini on latency in the Indian market if local data centers remain unavailable for the China-originating model.

Artificial Analysis reports Gemini 3.1 Pro Preview time-to-first-token at around 32 s on Google AI Studio and 46.6 s on Vertex, while GPT-5.4 comes in at around 137.36 s on Azure and 156.09 s on OpenAI. For Qwen 3.5 Plus, the VALS benchmark page reports a latency index of 570.93 s.

Putting It All Together

  • Quality: Gemini and Qwen are the two serious contenders; GPT-5.4 consistently underperforms.
  • Efficiency: Gemini achieves high ratings with far fewer output tokens than Qwen.
  • Latency and deployment reality: Given current infrastructure and likely latency, Gemini is a more practical choice for Indian clinical settings.

References

  1. Artificial Analysis: Gemini 3.1 Pro Preview providers
  2. Artificial Analysis: GPT-5.4 providers
  3. VALS: Alibaba Qwen 3.5 Plus Thinking