Kos-1 Lite: SOTA Medical Model

Today we're announcing Kos-1 Lite, our first medical reasoning model. Kos-1 Lite achieves SOTA on HealthBench Hard at 46.6%.

Try interacting with the model here.

Motivation

General-purpose LLMs are not fit for clinical assistance. They are sycophantic when they should push back and verbose when a focused answer is needed, and they lack the groundedness required to deliver sensitive medical guidance. This is expected: a model tuned to follow instructions closely enough to perform well on SWE-Bench-style coding tasks necessarily accepts some level of sycophancy. In domains like medicine, this tradeoff is unacceptable.

To fill this gap, we trained Kos-1 Lite. It is a frontier reasoning model that prioritizes clinical accuracy, reduced sycophancy, triage and deferral, and compassionate bedside manner.

Results

Kos-1 Lite achieves SOTA on HealthBench Hard and trails only GPT-5 on HealthBench. These results come at a fraction of the serving cost of frontier (>1-trillion-parameter) models.

HealthBench⁹ was developed by OpenAI with 262 physicians across 60 countries to evaluate clinical reasoning beyond multiple-choice recall, using realistic open-ended conversations. The benchmark is challenging and unsaturated: frontier models like Claude Opus 4.6 and Gemini 3.1 Pro score just 20.4% and 19.8%, respectively, on HealthBench Hard.

HealthBench Hard

| Model | Score (%) |
| --- | --- |
| TLDC Kos-1 Lite | 46.6 |
| GPT 5 High | 46.2 |
| Baichuan M3 | 44.4 |
| GPT 5.2 High | 42.0 |
| GPT 5.1 High | 40.5 |
| GPT 5.4 | 40.1 |
| AntAngelMed | 39.6 |
| Baichuan M2 | 34.7 |
| GPT 5.5* | 33.8 |
| Grok 4 | 22.7 |
| DeepSeek V3.2 Thinking | 21.7 |
| Claude Opus 4.6 | 20.4 |
| Gemini 3.1 Pro | 19.8 |

HealthBench

| Model | Score (%) |
| --- | --- |
| GPT 5 High | 67.2 |
| TLDC Kos-1 Lite | 66.6 |
| Baichuan M3 | 65.1 |
| GPT 5.1 High | 64.0 |
| GPT 5.2 High | 63.3 |
| AntAngelMed | 62.5 |
| Baichuan M2 | 60.1 |
| GPT 5.5 | 58.4 |
| Grok 4 | 53.7 |
| DeepSeek V3.2 Thinking | 53.0 |
| Gemini 3.1 Pro | 51.0 |
| Claude Opus 4.6 | 50.8 |

Examples

Here we compare Kos-1 Lite with GPT-5.2 on real medical queries.

Prompt

What is the next best step in a 45-year-old male with long-standing Ulcerative Colitis presents with fatigue, pruritus, jaundice, elevated alkaline phosphatase, and MRCP showing multifocal intrahepatic/extrahepatic biliary strictures and dilatations (beading)?

Kos-1 Lite

Confirm primary sclerosing cholangitis (PSC) diagnosis and stratify risk: order AMA (to exclude PBC), IgG4 (rule out IgG4-SC), ANA/SMA (assess AIH overlap), viral hepatitis panel, CBC, LFTs (bilirubin trend critical), albumin, INR, fat-soluble vitamins (A/D/E/K); urgent hepatology referral for MRCP review, EUS with fine-needle sampling if dominant stricture (>10mm), and colonoscopy with chromoendoscopy for UC dysplasia (every 1-2 years per PSC-UC guidelines).

Why this sequence?

Serology/imaging: AMA-negative supports PSC (PBC shows smooth ducts on MRCP); elevated IgG4 mandates steroid trial if confirmed. MRCP "beading" is diagnostic for PSC, but EUS/ERCP with brush cytology is reserved for dominant strictures (↑ cholangiocarcinoma risk)—CA 19-9 has low sensitivity/specificity (false positives in strictures/cholecystitis), so use only with imaging changes.
UC assessment: PSC-UC has higher CRC risk (especially pancolitis); colonoscopy must evaluate for dysplasia (targeted biopsies of irregular mucosa). Check CRP/ESR for UC activity—worsening colitis may accelerate PSC progression.
Complication screening: INR/albumin assess synthetic function; platelet count/spleen size (US/Elastography) screen for portal hypertension. Annual MRCP for biliary changes; if bilirubin rises >50% or new symptoms (fever/RUQ pain—suggest cholangitis—urgently ERCP for drainage).
Management nuances: Avoid high-dose UDCA (>15-20mg/kg)—increases CRC risk per recent trials. Pruritus: first-line cholestyramine; refractory cases: rifampin/naltrexone. Vitamin D replacement if deficient (prevents osteoporosis). Liver transplant evaluation if Child-Pugh B/C or recurrent cholangitis.

Critical caveats:

No liver biopsy for routine PSC diagnosis (low yield; use only if AIH overlap suspected—elevated AST/ALT with interface hepatitis).
If MRCP shows "skip lesions" with normal gallbladder—classic PSC; if gallbladder wall thickening—consider IgG4-SC or malignancy.
Immediate red flags: rapid jaundice rise, weight loss, new lymphadenopathy—prioritize EUS-FNA/cholangioscopy for cholangiocarcinoma (PSC increases risk 10-15x; median survival <1yr if advanced).
Never dilate strictures without cytology—bile duct injury worsens risk. Coordinate with interventional radiology/hepatology for biliary interventions.

Why not ERCP upfront? Non-therapeutic ERCP risks pancreatitis/cholecystitis—reserve for therapeutic needs (stricture dilation, stone removal). MRCP sensitivity/specificity >90% for PSC; if equivocal, use MRCP with diffusion-weighted imaging for malignancy. Serial LFTs every 3-6mo track progression—ALP >2x ULN with rising bilirubin signals advanced disease.

GPT-5.2

This presentation is most consistent with primary sclerosing cholangitis (PSC) in a patient with ulcerative colitis, and the next best step is ERCP to evaluate for a dominant stricture and to obtain brush cytology/biopsy to rule out cholangiocarcinoma. ERCP also allows therapeutic dilation and/or stenting if an obstructing stricture is present.

Why ERCP is next (given MRCP already shows “beading”)

MRCP findings of multifocal strictures with segmental dilatation (“beading”) are diagnostic for PSC in the right clinical context.
ERCP is not needed to confirm PSC when MRCP is classic, but it becomes the next step when there are features suggesting clinically significant obstruction or malignancy, such as:
- Jaundice (especially new/worsening)
- Pruritus and cholestatic labs (marked alkaline phosphatase)
- Concern for a dominant extrahepatic stricture (common in PSC and can cause obstructive jaundice)
- Need to exclude cholangiocarcinoma, which PSC patients are at increased risk for

What you do during ERCP

Identify and characterize any dominant stricture (commonly defined as ≤1.5 mm in the common bile duct or ≤1.0 mm in the hepatic duct, though exact cutoffs vary by source).
Brushings for cytology (often with adjunct testing such as FISH where available) to assess for cholangiocarcinoma.
Balloon dilation (first-line endoscopic therapy for dominant strictures in many practices) and/or temporary stenting when indicated.

Key parallel steps (important, but not the “next best” in this stem)

Cholangiocarcinoma surveillance/evaluation often includes CA 19-9 and careful imaging interpretation, but tissue/cytology via ERCP is the key step when jaundice and a potentially dominant stricture are present.
In PSC with UC, ensure colorectal cancer surveillance colonoscopy (typically at PSC diagnosis and then at frequent intervals), but this is not the immediate next step for the hepatobiliary presentation.

If you share whether he has fever/RUQ pain (concern for cholangitis), the bilirubin level trend, and whether MRCP suggests a dominant stricture or mass, I can tailor the urgency and the exact ERCP goals.

Training Objective

We trained Kos-1 Lite against a reinforcement learning objective to achieve accurate clinical reasoning and a precise bedside manner.

For a group of rollouts $\{o_i\}_{i=1}^{G}$ sampled from rollout policy $\pi_{\text{roll}}$, we maximize:

$$
\mathcal{J}(\theta) = \frac{1}{\sum_{i=1}^{G}|o_i|}\sum_{i=1}^{G}\sum_{t=1}^{|o_i|}\left[w_{i,t} \cdot \text{sg}\!\left(\min\!\left(\frac{\pi_\theta(o_{i,t} \mid o_{i,<t})}{\pi_{\text{old}}(o_{i,t} \mid o_{i,<t})},\;\varepsilon_{\text{high}}\right)\right) \cdot \hat{A}_i \cdot \log \pi_\theta(o_{i,t} \mid o_{i,<t})\right]
$$

where $\text{sg}(\cdot)$ is the stop-gradient operator, $\hat{A}_i$ is the advantage of rollout $o_i$, and:

$$
w_{i,t} = \min\!\left(\frac{\pi_{\text{old}}(o_{i,t} \mid o_{i,<t})}{\pi_{\text{roll}}(o_{i,t} \mid o_{i,<t})},\; C_{\text{IS}}\right)
$$
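As a reading aid, the objective can be evaluated token by token as in the minimal sketch below. It is an illustration of the formula only, not the training code: the function name, the nested-list input layout, and the default values of `eps_high` and `c_is` are assumptions made for this example. Plain Python has no autodiff, so the stop-gradient appears only as a comment.

```python
import math


def cispo_tis_objective(logp_theta, logp_old, logp_roll, advantages,
                        eps_high=5.0, c_is=2.0):
    """Value of J(theta) for one group of rollouts.

    logp_*[i][t] is the log-probability of token t of rollout i under the
    current, old, and rollout policies; advantages[i] is the advantage of
    rollout i. eps_high and c_is are placeholder values (assumptions).
    """
    total_tokens = sum(len(seq) for seq in logp_theta)
    total = 0.0
    for i, seq in enumerate(logp_theta):
        for t, lp in enumerate(seq):
            # Truncated importance weight w_{i,t} = min(pi_old / pi_roll, C_IS).
            w = min(math.exp(logp_old[i][t] - logp_roll[i][t]), c_is)
            # Clipped ratio min(pi_theta / pi_old, eps_high). In an autodiff
            # framework this factor would sit under stop-gradient, so only
            # log pi_theta carries gradient.
            clipped_ratio = min(math.exp(lp - logp_old[i][t]), eps_high)
            total += w * clipped_ratio * advantages[i] * lp
    # Token-level normalization: divide by the group's total token count.
    return total / total_tokens
```

When all three policies agree, both ratios equal 1 and the expression reduces to the advantage-weighted mean token log-probability over the group.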

We incorporated the following:

- Decoupled three-policy objective¹ with CISPO² clipping on $\frac{\pi_\theta}{\pi_{\text{old}}}$ and token-level truncated importance sampling³,⁴ on $\frac{\pi_{\text{old}}}{\pi_{\text{roll}}}$ (the weight $w_{i,t}$) to correct training-inference mismatch
- No sampling adjustments⁵: rollouts use temp=1, top-p=1, top-k=-1
- Rollout Routing Replay (R3)⁶ to stabilize expert routing for MoE
- Mean-centered advantages without std normalization⁷, token-level loss⁸ to prevent long rollouts from being down-weighted, and zero-variance filtering⁸ to exclude groups with identical reward across rollouts
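The advantage computation and group filtering in the last item can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the `tol` parameter, and the dict-of-lists input layout are hypothetical and do not come from the training code.

```python
def group_advantages(rewards, tol=1e-8):
    """Mean-centered advantages for one prompt's group of rollout rewards.

    Returns None when every rollout has (numerically) the same reward, so the
    caller can drop the group. Otherwise returns r_i - mean(r), with no
    division by the standard deviation.
    """
    if max(rewards) - min(rewards) <= tol:
        return None
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def filter_and_center(groups):
    """Map prompt id -> advantages, skipping zero-variance groups."""
    kept = {}
    for prompt_id, rewards in groups.items():
        adv = group_advantages(rewards)
        if adv is not None:
            kept[prompt_id] = adv
    return kept
```

A group whose rollouts all receive the same reward would have zero advantage everywhere and contribute no learning signal, which is why such groups are dropped before the loss is computed.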

Length Control

Unlike the output of agentic models that complete tasks over long horizons, health advice is delivered to patients and providers as text to read. We created a novel length bonus that controls thinking and answer length separately. Beyond latency and KV-cache constraints on total context, the time to read the answer (after </think>) was a major UX problem. We intend to publish details with controlled ablations in follow-up work.

We build on DAPO's⁸ Soft Overlong Punishment, a linear penalty that decays over a window between $L_{\text{free}}$ and $L_{\text{max}}$. We extend it with three changes:

- We assign a bonus rather than a penalty. With a penalty, a correct long answer can score lower than an incorrect short one, letting the model reward-hack into short wrong answers. A non-negative bonus preserves the reward ordering.
- We apply the bonus independently to total tokens (thinking + output) and to output tokens only, with separate tuning for each. The model can reason at length internally while being rewarded for compressing its user-facing answer.
- Both bonuses are zeroed when the reward score falls below a quality threshold. Without this, the model learns to write short, wrong answers.
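Since the details are not yet published, the sketch below is only one possible reading of the three changes above. The linear shape of the bonus, the function names, and every numeric value (window bounds, bonus sizes, quality threshold) are assumptions chosen for illustration.

```python
def window_bonus(length, l_free, l_max, max_bonus):
    """Non-negative bonus: full up to l_free, linear decay to zero at l_max."""
    if length <= l_free:
        return max_bonus
    if length >= l_max:
        return 0.0
    return max_bonus * (l_max - length) / (l_max - l_free)


def shaped_reward(quality, total_tokens, output_tokens,
                  quality_threshold=0.5,
                  total_cfg=(4096, 8192, 0.1),
                  output_cfg=(512, 1024, 0.1)):
    """Quality score plus gated length bonuses.

    total_cfg and output_cfg are (l_free, l_max, max_bonus) tuples for total
    tokens (thinking + output) and output tokens only. All values here are
    placeholders, not the settings used in training.
    """
    if quality < quality_threshold:
        # Both bonuses are zeroed below the quality threshold.
        return quality
    return (quality
            + window_bonus(total_tokens, *total_cfg)
            + window_bonus(output_tokens, *output_cfg))
```

In this sketch a response below the threshold keeps its raw quality score regardless of length, so shortening a poor answer earns nothing, while a response above it is never penalized for length and only earns less bonus as it grows.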

Training details

We post-trained a medium-sized MoE language model (~100B parameters) using Miles (SGLang/Megatron).

What's Next

We are supplementing early benchmarking with physician and user review across a broader range of clinically relevant tasks and safety evaluations. We are excited to release upgraded checkpoints as training continues, and will continue to prioritize clinical accuracy, physician support and user friendliness.

Kos is the birthplace of Hippocrates.

References

¹ Hilton, J., Cobbe, K., & Schulman, J. (2021). Batch Size-Invariance for Policy Optimization. arXiv:2110.00641

² MiniMax (2025). MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention. arXiv:2506.13585

³ Liu, J. & Li, Y. (2025). When Speed Kills Stability: Demystifying RL Collapse from the Training-Inference Mismatch. Link

⁴ Yao, F. & Liu, L. (2025). Your Efficient RL Framework Secretly Brings You Off-Policy RL Training. Link

⁵ Fang, Q. & Khazi, D. (2025). Mismatch Praxis: Rollout Settings and IS Corrections. Link

⁶ Ma, Y. et al. (2025). R3: Stabilizing MoE Reinforcement Learning by Aligning Training and Inference Routers. arXiv:2510.11370

⁷ Liu, Z. et al. (2025). Understanding R1-Zero-Like Training: A Critical Perspective. arXiv:2503.20783

⁸ Yu, Q. et al. (2025). DAPO: An Open-Source LLM Reinforcement Learning System at Scale. arXiv:2503.14476

⁹ Arora, R. K. et al. (2025). HealthBench: Evaluating Large Language Models Towards Improved Human Health. arXiv:2505.08775