Kos-1 Experimental: Env-Free RL on a 1T Parameter Agentic Prior

Today we're releasing early results from training Kos-1 Experimental, a Kimi K2.5 checkpoint post-trained on the same medical RL data we used for Kos-1 Lite. As clinical workloads become more agentic, we were motivated to produce a model that paired the medical domain knowledge present in Kos-1 Lite with frontier tool-calling.

On-policy RL tends to forget less of a model's prior capabilities than supervised fine-tuning. If that holds at 1T scale, short-context, text-only RL could add domain expertise to an agentic model without long, expensive rollouts while preserving its existing tool-calling. As a first experiment, we ran an environment-free RL pass with the same data to see how far our Kos-1 Lite recipe would carry.

Kos-1 Experimental learns the medical behavior we intended, while inheriting Kimi K2.5's agentic prior. On OpenAI's recently released HealthBench Professional, Kos-1 Experimental raw (no system prompt, no tool use) is on par with GPT-5.4 raw, and with our harness, Kos-1 Experimental reaches the level of the newly released ChatGPT for Clinicians. On HealthBench Hard, the harness-equipped Kos-1 Experimental achieves 49.3%.

Despite these results, tool-calling on Kos-1 Experimental reveals the limits of env-free RL. It can still use tools thanks to Kimi K2.5's prior, and after medical RL its queries are often more clinically informed, but we see tool-specific regressions under long-context serving pressure. Our main hypothesis is that env-free RL can be enough to push a model to SOTA in a targeted knowledge domain, but those gains come at the cost of tool-calling instability in the hardest long-horizon agentic settings. At a minimum, env-free tasksets should be interleaved with agentic tasks to reduce forgetting; ideally, the default recipe for shipping a strong agentic model is end-to-end RL with the harness in the rollout.

Results

HealthBench Professional

HealthBench Professional is OpenAI's most recent clinician-facing benchmark, designed to evaluate realistic clinician queries across good-faith and red-teaming settings.

HealthBench Professional (+ tools)

System	Score
ChatGPT for Clinicians	59.0
Kos-1 Experimental with tools	57.8
GPT-5.4 with browsing	45.8
Kos-1 Lite with tools	44.1

HealthBench Professional raw

System	Score
GPT-5.5	51.8
GPT-5.4	48.1
Kos-1 Experimental raw	47.8
Claude Opus 4.7	47.0
GPT-5	46.2
GPT-5.2	45.9
Gemini 3.1 Pro	43.8
Physician-written	43.7
Kos-1 Lite	38.5
Kimi K2.5	34.0

HealthBench Professional treats a model and its harness as a single system. Under the benchmark's scoring guidance (8 samples plus a length penalty), Kos-1 Experimental with harness reaches 57.8, behind only ChatGPT for Clinicians and clearly above both GPT-5.4 with browsing and Kos-1 Lite with tools.

For the no-tool comparison, the relevant path is default-prompt raw. Kos-1 Experimental raw with no system prompt reaches 47.8, below GPT-5.5's 51.8 and close behind GPT-5.4's default-prompt 48.1, but still ahead of GPT-5, GPT-5.2, Claude Opus 4.7, Gemini 3.1 Pro, and the physician baseline.

Compared to our previously released state-of-the-art medical LLM Kos-1 Lite, we see a substantial capability lift on HealthBench Professional. With harness, Kos-1 Experimental is +13.7 points above Kos-1 Lite with tools (57.8 vs 44.1). Without harness, Kos-1 Experimental raw is +9.3 points above Kos-1 Lite (47.8 vs 38.5).

HealthBench Professional tags every example by use case, source/difficulty slice, and specialty. All slice-level Δ values in this section are descriptive differences between reported point estimates, not statistical-significance claims. When the paper notes no significant difference on a slice, we still show the raw score gap but do not interpret it as significant.

Use-case slices

Slice	Share	Kos-1 Experimental with tools	GPT-5.4 in ChatGPT for Clinicians	Δ
Care consult	45.0%	53.4	51.0	+2.4
Writing and documentation	27.0%	55.9	64.1	−8.2
Medical research	28.0%	66.6	67.0	−0.4
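As a rough sanity check on the use-case table, the sketch below assumes that the overall score is approximately the share-weighted mean of the slice scores. That is our assumption for illustration; the benchmark's exact aggregation may differ. It then splits the overall gap to ChatGPT for Clinicians into per-slice contributions. All variable names are illustrative.

```python
# Use-case slices from the table above:
# (share, Kos-1 Experimental with tools, ChatGPT for Clinicians).
# Assumption: the overall score is roughly the share-weighted mean of slices.
slices = {
    "Care consult": (0.45, 53.4, 51.0),
    "Writing and documentation": (0.27, 55.9, 64.1),
    "Medical research": (0.28, 66.6, 67.0),
}

kos_total = sum(share * kos for share, kos, _ in slices.values())
cfc_total = sum(share * cfc for share, _, cfc in slices.values())

# Each slice contributes share * (slice delta) to the overall gap.
contributions = {
    name: share * (kos - cfc) for name, (share, kos, cfc) in slices.items()
}

print(f"weighted totals: {kos_total:.1f} vs {cfc_total:.1f}")
for name, points in contributions.items():
    print(f"{name}: {points:+.2f} points")
```

Under this assumption the weighted means land close to the reported 57.8 and 59.0, and writing and documentation contributes about 2.2 points of deficit, which is larger than the overall gap and is partly offset by the care-consult lead.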

Source / difficulty slices

Slice	Share	Kos-1 Experimental with tools	GPT-5.4 in ChatGPT for Clinicians	Δ
Good Faith Typical	48.8%	69.1	69.0	+0.1
Good Faith Difficult	14.9%	39.4	34.1	+5.3
Red Teaming Difficult	36.4%	50.0	55.8	−5.8

Our findings are directionally consistent with the system-level analysis in Section 5.1 of the HealthBench Professional paper: harness-driven improvements concentrate on adversarial and failure-prone examples. The paper reports ChatGPT for Clinicians at 55.8 on Red Teaming Difficult, against 26.2 for base GPT-5.4 and 30.0 for physicians, which is ChatGPT for Clinicians' largest slice-level lift. The Kos harness shows the same pattern, with the largest improvements over Kos-1 Experimental raw on Red Teaming Difficult (+18.9) and on writing and documentation as a use case (+22.0).

Where Kos-1 Experimental still trails ChatGPT for Clinicians is the writing-and-documentation slice (55.9 vs 64.1, an 8.2-point gap). This single slice accounts for most of the remaining gap between Kos-1 Experimental and ChatGPT for Clinicians on the overall score. In the paper's cross-tab (Table 2), 64.1% of the writing-and-documentation examples come from the Red Teaming Difficult pool. So most of the gap to ChatGPT for Clinicians on this slice is about handling adversarial writing prompts (role-play framing, misleading premises packaged as a documentation task, and so on), and only a smaller part is about routine note generation.

On the largest slice (Good Faith Typical, 48.8% of the benchmark), Kos-1 Experimental is essentially tied with ChatGPT for Clinicians (69.1 vs 69.0), and both sit well above the physician baseline (55.7). On the everyday, non-adversarial part of the benchmark, our system performs at the SOTA level.

Top 10 Specialties: Kos-1 Experimental with tools vs ChatGPT for Clinicians

Specialty	Δ
Nephrology	+6.5
Obstetrics & gynecology	+5.7
Anesthesiology	+5.3
Hematology & oncology	+4.4
Orthopedics	+4.1
Pediatrics	+3.7
Neurology	+0.7
Cardiology	-8.0
Dermatology	-8.1
Psychiatry	-17.7

The top 10 specialties account for 303 of 525 examples (57.7%). On these slices, Kos-1 Experimental with harness is essentially tied with ChatGPT for Clinicians in aggregate: 56.9 vs 57.7, a gap of under one point. It is higher in seven of the ten specialties, led by nephrology (+6.5), obstetrics and gynecology (+5.7), anesthesiology (+5.3), hematology and oncology (+4.4), orthopedics (+4.1), and pediatrics (+3.7). The remaining gap is concentrated in psychiatry (-17.7), dermatology (-8.1), and cardiology (-8.0).

Clinical-Aware Search Queries

We also found that env-free medical RL improves how Kos-1 Experimental uses tools for medical retrieval. Compared with the untrained Kimi K2.5 baseline, the post-trained checkpoint writes better queries. Its searches are more clinically scoped, more likely to preserve the decision boundary in the user's question, and less likely to waste calls on literal term expansion or generic background lookup.

Across the five traces we shared here, Kos-1 Experimental turns the prompt into the clinical decision it actually needs to resolve. For solid-tumor bispecifics, it searches approval status and named agents instead of a generic class query. For pregnancy of unknown location, it routes unilateral pain plus positive hCG to ectopic-risk triage. For the pediatric medication schedule, it searches the toxicity mechanisms rather than only the four drug names. For NSCLC brain metastases and coronary endarterectomy, it retrieves the guideline and outcome context needed to answer the real tradeoff rather than collapsing to a single broad rule.

HealthBench Hard

HealthBench Hard contains 1,000 examples that were difficult for frontier models at the time of OpenAI's HealthBench release.

With harness, Kos-1 Experimental reaches 49.3, further pushing the HealthBench Hard SOTA set by our earlier Kos-1 Lite. Built on the Qwen3-Next-80B-A3B base, Kos-1 Lite scores 46.6 without tools but falls to 46.1 when we add the same tool harness. Kos-1 Experimental raw, with the default system prompt and no tools, scores 40.7.

HealthBench Hard

System	Score
Kos-1 Experimental + tools	49.3
Kos-1 Lite	46.6
GPT-5 High	46.2
Kos-1 Lite with tools	46.1
Baichuan M3	44.4
GPT-5.2 High	42.0
GPT-5.1 High	40.5
GPT-5.4	40.1
AntAngelMed	39.6
Baichuan M2	34.7
GPT-5.5*	33.8
Kimi K2.5	25.6
Grok 4	22.7
DeepSeek V3.2 Thinking	21.7
Claude Opus 4.6	20.4
Gemini 3.1 Pro	19.8

*We report GPT-5.5's score from OpenAI's System Card. OpenAI notes that the scores in the card use an updated HealthBench implementation, under which we observe lower scores across the board.

HealthBench Hard computes the raw rubric reward, without the length adjustment used in HealthBench Professional. The HealthBench Professional paper motivates that adjustment around a known length sensitivity: open-ended rubric benchmarks can reward longer answers because longer responses have more chances to hit positive criteria, even if the information is not coherently organized or delivered in an actionable way.

Instead of relying on answer length to cover many possible rubric criteria, Kos-1 Experimental uses retrieval to identify the decision boundary the prompt is asking about, then spends its shorter answer budget on that boundary. The same tools that hurt Kos-1 Lite help Kos-1 Experimental because Kimi K2.5 brings a stronger agentic prior. Kos-1 Experimental is more likely to call tools for the right reason, write the right query, and then fold the retrieved evidence back into a focused clinical answer.

By theme, the comparison between Kos-1 Experimental with tools and Kos-1 Lite with tools looks like this:

HealthBench Hard: Kos-1 Experimental with tools vs Kos-1 Lite with tools, by theme

Theme	Δ
Emergency referrals	+15.8
Context seeking	+7.7
Complex responses	+5.2
Communication	+4.1
Health data tasks	+1.0
Global health	+0.3
Hedging	-2.0

The improvement is concentrated in emergency referrals and context seeking. Global health and health data tasks are close to flat. Kos-1 Lite with tools is higher on hedging by 2.0 points.

Below are a few traces comparing Kos-1 Experimental with tools against Kos-1 Lite with tools.

Chemotherapy fever: the user asks whether a 37.8 °C temperature is emergent and what antibiotic to use. Kos-1 Experimental with tools scored 0.74 by searching febrile-neutropenia definitions, ANC thresholds, and empiric antimicrobial guidance. Kos-1 Lite with tools scored −0.10 after drifting toward MASCC/NCCN risk-stratification detail before anchoring the immediate emergency frame.
Heart failure renal dosing: the user asks how to adjust bisoprolol after eGFR drops from 60 to 30. Kos-1 Experimental with tools scored 1.00 by combining the bisoprolol label with HFrEF beta-blocker guidance, CKD dosing evidence, and trial-dose context. Kos-1 Lite with tools scored 0.29 with a narrower drug-label lookup and an overconfident dose-reduction recommendation.
Neonatal bilirubin: a 37-week newborn has total serum bilirubin 12 mg/dL at 48 hours of life. Kos-1 Experimental with tools scored 0.44 by searching the 2022 AAP hour-specific threshold, gestational-age band, and phototherapy decision boundary. Kos-1 Lite with tools scored −0.26 with broader neonatal-jaundice and Bhutani-nomogram searches.
Postpartum PE shorthand: the user writes "calc wells postpartum sob." Kos-1 Experimental with tools scored 0.87 by first translating the shorthand into Wells score, postpartum dyspnea, PE diagnosis, and pregnancy/postpartum adaptations. Kos-1 Lite with tools scored 0.28 with a single Wells-modification query and less postpartum-specific triage context.

Harness Details

Kos-1 Experimental uses the Kos harness with seven medical retrieval tools:

Tool	Purpose
`web_search`	General web search. Used for ambiguous abbreviations, current guideline lookups, fallback search, and cross-source disambiguation.
`search_triage_guideline`	Searches authoritative patient-education and guideline-style pages, primarily MedlinePlus and NHS content where available. Queries should be short condition or symptom names.
`search_drug_label`	Looks up FDA-approved drug label information through RxNorm, openFDA, and DailyMed. Used for dosing, contraindications, boxed warnings, interactions, adverse reactions, and pregnancy- or population-specific sections.
`search_pubmed`	Searches PubMed for biomedical literature. Queries are kept short, usually 3-5 medical terms.
`search_clinical_trials`	Searches ClinicalTrials.gov for ongoing or completed clinical studies. Used only when the user asks about experimental treatments, enrollment, or trial status.
`search_icd`	Searches WHO ICD codes and definitions by condition name or code.
`read_url`	Reads a specific webpage when a search result needs full source text.

Training Details

Kos-1 Experimental is post-trained from Kimi K2.5 on Tinker. The recipe keeps a similar shape to Kos-1 Lite, adjusted for Tinker's configurability. Additionally, the learning rate is increased to account for LoRA, following LoRA Without Regret.

Setting	Value
Loss	CISPO with `clip_low=0`, `clip_high=5`, `kl_penalty_coef=0`
Optimizer	Adam
Batch	`groups_per_batch=32`, `group_size=8`, `num_substeps=4`
LoRA rank	$4$
Learning rate	$1\times10^{-5}$
Rollout max tokens	$10{,}000$
Temperature	$1.0$
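
To make the loss row more concrete, here is a minimal NumPy sketch of a CISPO-style surrogate. It is not Tinker's implementation. It assumes that `clip_low` and `clip_high` bound the importance ratio itself, that advantages are rewards centered within each group of `group_size` samples, and that per-token terms are averaged; the function names and the reward values are hypothetical.

```python
import numpy as np

def group_centered_advantages(rewards):
    """Subtract each group's mean reward (rows are groups, columns are samples)."""
    rewards = np.asarray(rewards, dtype=float)
    return rewards - rewards.mean(axis=1, keepdims=True)

def cispo_surrogate(logp_new, logp_old, advantages, clip_low=0.0, clip_high=5.0):
    """Value of a CISPO-style surrogate loss for a batch of sampled tokens.

    The importance ratio is clipped and then treated as a constant weight
    (a stop-gradient in an autodiff framework), so gradients would flow only
    through logp_new. With kl_penalty_coef=0 there is no extra KL term.
    """
    logp_new = np.asarray(logp_new, dtype=float)
    logp_old = np.asarray(logp_old, dtype=float)
    advantages = np.asarray(advantages, dtype=float)
    ratio = np.exp(logp_new - logp_old)
    weight = np.clip(ratio, clip_low, clip_high)
    return float(-(weight * advantages * logp_new).mean())

# One group of group_size=8 rubric rewards (made-up numbers for illustration).
rewards = [[0.9, 0.7, 0.4, 0.8, 0.6, 0.5, 0.3, 0.8]]
adv = group_centered_advantages(rewards)
print(adv.round(3))
```

Because a ratio is never negative, `clip_low=0` leaves the lower side effectively unclipped under this reading, and only ratios above 5 are capped.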

All rollouts were graded against task-specific rubrics by an LLM judge. We will release follow-up work this week on how we chose the grader.

The model learned medical response behavior under single-turn, no-tool conditions, and the tool harness was introduced later at evaluation and serving time.

Data

All models in the Kos-1 family are trained on a rubric dataset, generated by Curriculum.

Length Control

As health advice is often delivered to providers as text to read, we control for the user-facing response length in addition to total response length. We train Kos-1 Experimental under a two-stage conciseness curriculum, with a single quality gate held at 0.68 throughout so that length pressure never activates below an acceptable answer-correctness level.

Let $r$ be the normalized rubric reward. For each length channel $j \in \{\mathrm{response}, \mathrm{output}\}$, let $\ell_j$ be the token length, $\ell_j^{\mathrm{free}}$ the free-length threshold, $\ell_j^{\max}$ the length cap, and $\beta_j$ the maximum bonus.

$$q(r) = \mathbf{1}[r \geq 0.68]$$

$$g(\ell;\ell^{\mathrm{free}},\ell^{\max},\beta) = \begin{cases} \beta & \ell \leq \ell^{\mathrm{free}} \\ \beta \dfrac{\ell^{\max} - \ell}{\ell^{\max} - \ell^{\mathrm{free}}} & \ell^{\mathrm{free}} < \ell < \ell^{\max} \\ 0 & \ell \geq \ell^{\max} \end{cases}$$

$$\mathrm{Reward}_{\mathrm{final}} = r + q(r)\sum_{j \in \{\mathrm{response}, \mathrm{output}\}} g(\ell_j;\ell_j^{\mathrm{free}},\ell_j^{\max},\beta_j)$$

The conciseness bonus has two independent channels: total response length and visible output length. Each channel has its own free budget, max cap, and bonus at floor. Under the free budget, the model receives the full length bonus. Between the free budget and the max cap, the bonus decays linearly to zero. Above the cap, that channel contributes no bonus. If the rubric score is below the quality gate, both length bonuses are zero. This makes the reward prefer shorter answers only after the answer is already medically acceptable under the judge.

Stage 1 establishes core medical capability with a baseline conciseness reward:

Channel	Free threshold	Cap	Bonus
Response length $(j=\mathrm{response})$	$\ell_{\mathrm{response}}^{\mathrm{free}}=4000$	$\ell_{\mathrm{response}}^{\max}=6000$	$\beta_{\mathrm{response}}=0.15$
Visible output length $(j=\mathrm{output})$	$\ell_{\mathrm{output}}^{\mathrm{free}}=1500$	$\ell_{\mathrm{output}}^{\max}=3000$	$\beta_{\mathrm{output}}=0.17$
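
The gated length reward can be sketched in a few lines of Python using the stage 1 values from the table above. The function and dictionary names are illustrative, not taken from our training code.

```python
def quality_gate(r, threshold=0.68):
    """q(r): 1 if the rubric reward clears the quality gate, else 0."""
    return 1.0 if r >= threshold else 0.0

def length_bonus(length, free, cap, beta):
    """g: full bonus under the free budget, linear decay to zero at the cap."""
    if length <= free:
        return beta
    if length >= cap:
        return 0.0
    return beta * (cap - length) / (cap - free)

# Stage 1 channel settings from the table above.
STAGE1 = {
    "response": {"free": 4000, "cap": 6000, "beta": 0.15},
    "output": {"free": 1500, "cap": 3000, "beta": 0.17},
}

def final_reward(r, lengths, channels=STAGE1):
    """Rubric reward plus the gated sum of per-channel length bonuses."""
    bonus = sum(length_bonus(lengths[name], **cfg) for name, cfg in channels.items())
    return r + quality_gate(r) * bonus

# Acceptable answer, response in the decay region, output within the free budget.
print(final_reward(0.80, {"response": 5000, "output": 1500}))
# Same lengths, but the rubric reward is below the gate: no length bonus.
print(final_reward(0.50, {"response": 5000, "output": 1500}))
```

In the first call the response channel earns half of its bonus (0.075) and the output channel earns all of it (0.17), for a total of about 1.045. In the second call the gate is closed and the reward stays at 0.50.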

At the end of stage 1, the model generates roughly 2k non-thinking output tokens on average.

Stage 2 warm-starts from the stage-1 endpoint and tightens the output-token side of the reward across a short curriculum:

$\ell_{\mathrm{output}}^{\mathrm{free}}$ moves from 1500 down to 700
$\ell_{\mathrm{output}}^{\max}$ moves from 3000 down to 2000-2300
$\beta_{\mathrm{output}}$ moves from 0.17 up to 0.55
$\beta_{\mathrm{response}}$ is lowered to 0.03, so that output-token compression becomes the dominant signal
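
One way to see why output-token compression becomes the dominant signal is to compare how much bonus each extra token costs inside the linear decay region, which is $\beta/(\ell^{\max}-\ell^{\mathrm{free}})$. The sketch below assumes the response-channel thresholds stay at their stage 1 values (the list above only states that $\beta_{\mathrm{response}}$ changes) and evaluates both ends of the 2000-2300 cap range. The helper name is illustrative.

```python
def decay_slope(free, cap, beta):
    """Bonus lost per extra token inside the linear decay region."""
    return beta / (cap - free)

stage1_output = decay_slope(1500, 3000, 0.17)
stage2_output_loose = decay_slope(700, 2300, 0.55)  # cap at 2300
stage2_output_tight = decay_slope(700, 2000, 0.55)  # cap at 2000

stage1_response = decay_slope(4000, 6000, 0.15)
# Assumption: response thresholds are unchanged in stage 2; only beta drops.
stage2_response = decay_slope(4000, 6000, 0.03)

print(f"output slope:   {stage1_output:.2e} -> "
      f"{stage2_output_loose:.2e} to {stage2_output_tight:.2e}")
print(f"response slope: {stage1_response:.2e} -> {stage2_response:.2e}")
```

Under these assumptions the per-token penalty on visible output rises by roughly 3x to 3.7x, while the per-token penalty on total response length falls fivefold.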

After stage 2, visible output averages around 1000 tokens while maintaining answer correctness.

Behavior Shift After Length Control

The surprising part of stage 2 is that clinical shorthand emerges on its own without being directly rewarded. After stage 2 length control, Kos-1 Experimental becomes more willing to write in compressed clinician-note fragments.

Uremia trace

GFR typically <15 mL/min/1.73m², Stage 4-5 CKD or AKI

Urgent dialysis red flags: hyperK with ECG changes, severe acidosis, refractory pulm edema

A simple check across the 525 aligned raw answers shows uppercase or abbreviation-like tokens rising from about 118 to 158 per 1000 words, while emojis, tables, headings, and long bullet scaffolds fall. This is usually a compact but normal clinical register: a Nature Communications paper on deciphering clinical abbreviations frames abbreviations and shorthand as part of real clinical notes, and health systems maintain local clinical abbreviation lists. The result is shorter and less conversational, but often denser, more precise, and closer to how clinicians write in time-sensitive settings. At the same time, we did see the model occasionally invent its own abbreviations, so excessive compression should be treated with caution.

Deployment

Kos-1 Experimental is a LoRA adapter trained on Kimi K2.5. The adapter weights are high-precision floats. We merge the adapter into a BF16 copy of the base model, then convert most of the merged model to NVFP4. The routed MoE experts and the attention output projections are stored as NVFP4. The attention QKV projections, the shared experts, the two dense MLP layers, and lm_head stay in BF16. The starting point for our pipeline is the Tinker Cookbook PR that adds Kimi K2 / K2.5 shard-by-shard merging with INT4 expert dequant/requant. For the packed INT4 experts, that PR dequantizes the weights, merges the LoRA, and then re-quantizes back to the same INT4 packed format. Our pipeline does the merge in BF16 and exports the merged text-only model to NVFP4 instead.

We would have preferred to serve the LoRA online. In that setup, the base checkpoint does not change, the adapter stays in high precision, and the inference engine applies the LoRA at runtime on each affected projection. We tried this with SGLang, but throughput on this Kimi K2.5 MoE adapter was too low to be practical. So we fell back to merging the adapter in BF16 and exporting an NVFP4 checkpoint instead.

We are grateful to the Baseten team for the NVFP4 deployment idea and support on the actual conversion. See Baseten's writeup on the fastest Kimi K2.5 deployment.

The Problem of High-Precision LoRA Updates

Kimi K2.5's routed expert matrices are stored as packed INT4 with _scale tensors. Every group of 32 weights shares one floating-point _scale, and a weight is reconstructed as:

$$\mathrm{weight} \approx \mathrm{int4\_code}\times \mathrm{scale}.$$

The sampled Kimi K2.5 expert tensors show why a high-precision LoRA update is hard to preserve in that grid:

Quantity	Value	What it measures
Mean INT4 `_scale`	$5.95\times10^{-3}$	Grid spacing. The distance between adjacent INT4 values under the shared scale
Half a grid step	$2.97\times10^{-3}$	Rounding threshold. Movements below roughly half a step usually stay in the same code
Uniform rounding-noise estimate, $\mathrm{scale}/4$	$1.49\times10^{-3}$	Expected absolute rounding error if the rounding error is uniform on $[-\mathrm{half\_step}, +\mathrm{half\_step}]$
Mean weight magnitude $\operatorname{mean}(\lvert w\rvert)$ (merged BF16)	$1.40\times10^{-2}$	The size of a typical weight itself
Mean LoRA merge delta $\operatorname{mean}(\lvert\Delta\rvert)$	$4.23\times10^{-5}$	The size of the high-precision LoRA update added to each weight

The mean INT4 step is about 140× larger than the average LoRA movement. The half-step rounding threshold is about 70× larger, and the $\mathrm{scale}/4$ rounding-noise estimate is still about 35× larger.

The distribution check points in the same direction. In the layer-1 experts we checked, the INT4 group-scale median is about $6.8\times10^{-3}$ to $7.2\times10^{-3}$, which puts the local half-step around $3.4\times10^{-3}$ to $3.6\times10^{-3}$. The LoRA $|\Delta|$ median is about $3.1\times10^{-5}$, and its 99th percentile is only about $1.4\times10^{-4}$ to $1.8\times10^{-4}$. In those experts, essentially none of the updates exceed the original $\mathrm{scale}/2$ threshold or even the $\mathrm{scale}/4$ threshold.

Tinker's INT4 merge path dequantizes the packed expert weight, applies the LoRA merge in BF16, and requantizes the merged weight back to packed INT4. This rewrites both the group scale and the individual INT4 codes. After requantization, each 32-weight group still shares one signed 16-code uniform grid. In the current implementation, that scale is set by the group's maximum merged magnitude $a_g$:

$$\mathrm{scale}_g \approx \frac{a_g}{7}.$$

Each merged value $w_i'$ is then rounded and clamped to the signed INT4 code range:

$$q_i=\mathrm{clamp}\left(\mathrm{round}\left(\frac{w_i'}{\mathrm{scale}_g}\right),\,-8,\,7\right).$$

This preserves the group's range and sets the grid step by the largest weights in the group.
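
The effect of this round trip can be illustrated with a small NumPy simulation. It is a sketch on synthetic data, not a reproduction of Tinker's merge path: the weights and LoRA deltas are Gaussian stand-ins whose mean absolute values match the table above, and the function name is hypothetical.

```python
import numpy as np

def quantize_int4_groups(w, group_size=32):
    """Symmetric per-group INT4: scale = amax / 7, round, clamp to [-8, 7]."""
    groups = w.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    codes = np.clip(np.round(groups / scale), -8, 7).astype(np.int8)
    return codes, scale

rng = np.random.default_rng(0)
n = 32 * 4096
# Synthetic stand-ins (assumption): Gaussian weights and deltas whose mean
# absolute values match the table (1.40e-2 and 4.23e-5).
to_sigma = np.sqrt(np.pi / 2.0)  # mean(|x|) = sigma * sqrt(2/pi) for a Gaussian
w = rng.normal(0.0, 1.40e-2 * to_sigma, size=n)
delta = rng.normal(0.0, 4.23e-5 * to_sigma, size=n)

codes_before, scale_before = quantize_int4_groups(w)
codes_after, _ = quantize_int4_groups(w + delta)
changed = float((codes_before != codes_after).mean())
print(f"mean scale: {scale_before.mean():.2e}, codes changed: {changed:.2%}")
```

In this synthetic setup we would expect the mean group scale to land in the same range as the measured `_scale` and only a small fraction of codes to differ after the merge, which is the sense in which most of the update is rounded away.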

There is a range-resolution tradeoff here. For INT4 with scale $s$, the low-end positive levels are $0$ and $s$, so the first rounding boundary is at $s/2$. To make a $4.23\times10^{-5}$ LoRA movement capable of changing an INT4 code by itself, the half-step threshold would need to be comparable to that update:

$$\frac{s}{2}\approx4.23\times10^{-5}, \qquad\Longrightarrow\qquad s\approx8.46\times10^{-5}.$$

In other words, the INT4 group would need a grid step on the order of $10^{-4}$. At $s=10^{-4}$, signed INT4 only covers about:

$$[-8\times10^{-4}, 7\times10^{-4}].$$

That range is enough only for very small groups, with true amax below roughly $7\times10^{-4}$ to $8\times10^{-4}$. It is not enough for a normal expert group containing weights around the observed mean magnitude, $1.40\times10^{-2}$. At $s=10^{-4}$, a weight of that size maps to code 140:

$$\frac{1.40\times10^{-2}}{1\times10^{-4}}\approx 140,$$

while signed INT4 only has codes around $[-8, 7]$. That value would saturate rather than be represented accurately. Keeping enough range for ordinary weights pushes the scale back into the $10^{-3}$ to $10^{-2}$ range, where $4\times10^{-5}$ LoRA movement is usually far below the local rounding threshold and gets rounded away.

Merging in BF16, Exporting to NVFP4

Each NVFP4 tensor element is an E2M1 FP4 value, with 1 sign bit, 2 exponent bits, and 1 mantissa bit. The signed codebook is:

$$\mathcal{Q}_{\mathrm{E2M1}}=\{0,\pm0.5,\pm1,\pm1.5,\pm2,\pm3,\pm4,\pm6\}.$$

NVFP4 adds two scales to this small codebook. Every 16-value block gets an FP8 E4M3 block scale, and the whole tensor gets an FP32 tensor scale:

$$\hat{w}_i^{\mathrm{NVFP4}} = S_t\,s_b\,q_i,\qquad q_i \in \mathcal{Q}_{\mathrm{E2M1}},\quad i\in b$$

Here $S_t$ is the FP32 tensor scale and $s_b$ is the FP8 block scale. For merged BF16 weights:

$$w_i'=w_i+\Delta_i$$

and a 16-value block amax:

$$a_b=\max_{i\in b}|w_i'|$$

a standard amax/full-range reference choice gives:

$$d_b = S_t\,s_b \approx \frac{a_b}{6}.$$

After BF16 merge, NVFP4 requantizes the merged tensor from scratch into 16-value blocks with a local floating-point codebook, an FP8 block scale, and an FP32 tensor scale. The question is the same as above: after scaling, is the local rounding step small enough for a $4.23\times10^{-5}$ LoRA movement to matter?

For a fixed $d_b$, the E2M1 levels decode to:

$$\{0,\pm0.5d_b,\pm d_b,\pm1.5d_b,\pm2d_b,\pm3d_b,\pm4d_b,\pm6d_b\}.$$

The range is about $[-6d_b,6d_b]$, and the scales move that range to fit each local block. In a small-amax block, $d_b$ can become small enough to make the low-end FP4 levels comparable to the LoRA update. For example, if $a_b=3\times10^{-4}$, the amax reference gives:

$$d_b\approx\frac{3\times10^{-4}}{6}=5\times10^{-5}.$$

The first positive decoded values are then $2.5\times10^{-5}$, $5\times10^{-5}$, and $7.5\times10^{-5}$, from the $0.5d_b$, $d_b$, and $1.5d_b$ levels. In this small-amax case, a LoRA update of this size can cross a local rounding boundary and affect the exported code.
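
The small-amax example can be checked with a short NumPy sketch. As a simplification, it keeps $d_b$ in full precision instead of splitting it into an FP8 block scale and an FP32 tensor scale. The block contents other than the amax entry are made up for illustration, and the function name is hypothetical.

```python
import numpy as np

E2M1_LEVELS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block_e2m1(block):
    """Quantize one block to E2M1 codes with d_b = amax / 6.

    Simplification (assumption): d_b is kept in full precision instead of
    being stored as an FP8 block scale times an FP32 tensor scale.
    """
    block = np.asarray(block, dtype=float)
    d_b = np.abs(block).max() / 6.0
    scaled = np.abs(block) / d_b
    # Pick the nearest magnitude level for each value, then restore the sign.
    idx = np.abs(scaled[:, None] - E2M1_LEVELS[None, :]).argmin(axis=1)
    return np.sign(block) * E2M1_LEVELS[idx], d_b

# A 16-value block with amax 3e-4; the other entries are illustrative.
block = np.full(16, 1.0e-5)
block[0] = 3.0e-4
codes_before, d_b = quantize_block_e2m1(block)

merged = block.copy()
merged[1] += 4.23e-5  # a LoRA-sized update on one weight
codes_after, _ = quantize_block_e2m1(merged)

print(d_b, codes_before[1], codes_after[1])
```

Here the updated weight moves from $1\times10^{-5}$ (nearest level $0$) to $5.23\times10^{-5}$ (nearest level $d_b$), so its exported code changes from 0 to 1.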

At ordinary weight scales around $1.40\times10^{-2}$, however, the local grid is still too coarse in either format. At that scale, a 32-weight INT4 group has $s=a_g/7\approx2.0\times10^{-3}$, so its half-step is about $1.0\times10^{-3}$, roughly 24× larger than the LoRA movement. NVFP4 is finer but still too coarse for such blocks: $d_b=a_b/6\approx2.3\times10^{-3}$, so the low-end half-step is $0.25d_b\approx5.8\times10^{-4}$, roughly 14× larger.

The NVFP4 claim is therefore aggregate. With fewer samples per block, 16-weight NVFP4 blocks create more chances for some blocks to enter the $10^{-4}$ to $10^{-3}$ amax regime than 32-weight INT4 groups do. In those small-amax blocks, $d_b$ is locally small and the FP4 codebook is densest near zero. The low-end half-step is $0.25d_b\approx a_b/24$. A uniform INT4 grid at the same amax has low-end half-step $a_b/14$, so the NVFP4 low-end interval is about $24/14\approx1.7\times$ finer. Some average-sized LoRA deltas will still be rounded away, but NVFP4 gives the merged checkpoint more local regions where a small update can land near a rounding boundary.
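
The half-step comparison reduces to a few lines of arithmetic, shown here as a quick check with the values quoted in the text. Variable names are illustrative.

```python
amax = 1.40e-2        # ordinary weight scale used in the text
lora_delta = 4.23e-5  # mean |delta| from the table

# Uniform INT4 grid: step = amax / 7, so the half-step is amax / 14.
int4_half_step = (amax / 7) / 2
# E2M1 with d_b = amax / 6: the lowest levels are 0 and 0.5 * d_b,
# so the low-end half-step is 0.25 * d_b = amax / 24.
nvfp4_half_step = 0.25 * (amax / 6)

print(int4_half_step / lora_delta)       # about 24
print(nvfp4_half_step / lora_delta)      # about 14
print(int4_half_step / nvfp4_half_step)  # 24/14, about 1.7
```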

MXFP4 sits between these two choices. It shares the E2M1 value codebook with NVFP4, giving it the same near-zero floating-point spacing. It still uses 32-value blocks and an E8M0 scale. For this LoRA, the main advantage comes from making the post-merge grid local enough that some blocks enter the $10^{-4}$ to $10^{-3}$ amax regime. NVFP4 is better aligned with that goal through 16-value blocks, E4M3 block scales, and an FP32 tensor scale.

After LoRA training on Tinker, we move through BF16 and finish in NVFP4:

Dequantize the INT4 Kimi base into BF16.
Extract the text-only checkpoint. We do not need the multimodal head for serving.
Merge the LoRA adapter into the BF16 base shard by shard. The merge happens entirely in BF16.
Calibrate the merged BF16 checkpoint using a local SFT-style text mix.
Export the result as a primarily-NVFP4 checkpoint, with the selective BF16 carve-outs described above.
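
The merge step in the list above can be sketched as follows. This is a toy illustration, not our pipeline: it uses NumPy `float32` arrays as a stand-in for BF16 tensors, assumes the common LoRA convention $W' = W + (\alpha/r)\,BA$ (the scaling Tinker actually applies is not stated here), and every name and shape in it is hypothetical.

```python
import numpy as np

def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return weight + (alpha / rank) * B @ A for one projection matrix."""
    return weight + (alpha / rank) * (lora_b @ lora_a)

def merge_shards(shards, adapters, alpha, rank):
    """Merge one shard at a time so only a single shard is processed at once."""
    for shard in shards:
        merged = {}
        for name, weight in shard.items():
            if name in adapters:
                lora_a, lora_b = adapters[name]
                weight = merge_lora(weight, lora_a, lora_b, alpha, rank)
            merged[name] = weight
        yield merged

rng = np.random.default_rng(0)
rank, d_out, d_in = 4, 8, 6
shards = [
    {"layer0.proj": rng.normal(size=(d_out, d_in)).astype(np.float32)},
    {"layer1.proj": rng.normal(size=(d_out, d_in)).astype(np.float32)},
]
adapters = {
    "layer0.proj": (
        rng.normal(size=(rank, d_in)).astype(np.float32),   # A
        rng.normal(size=(d_out, rank)).astype(np.float32),  # B
    )
}
merged_shards = list(merge_shards(shards, adapters, alpha=4.0, rank=rank))
print([list(shard) for shard in merged_shards])
```

Tensors without an adapter pass through unchanged, which is why the merge can proceed shard by shard without loading the whole model.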

PTQ Calibration

We export the merged BF16 checkpoint with NVIDIA ModelOpt. The run uses the Baseten on-policy PTQ calibration corpus with 256 samples, sequence length 3072, batch size 32, and KV-cache quantization disabled. The calibration pass exercises routed experts before export and lets ModelOpt collect the amax state needed for the final checkpoint.

Tools, RL, and Regressions

We also spotted tool-use related regressions while serving Kos-1 Experimental.

When we serve Kos-1 Experimental through the OpenAI client, the default template renders tool declarations in JSON. The model still answers and still calls tools, but its performance drops noticeably, and even the tone and style of its replies shift visibly. Kimi K2.5's technical report notes in Appendix B that K2.5 natively uses TypeScript for tool declarations, and that to remain compatible with the default OpenAI client it was also trained on the JSON declaration format. We think this may be the cause of what we observed: declaring tools as JSON introduces a larger distribution shift at serving time, and that shift erases part of the behavior our RL had installed.
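
To show how different the two declaration styles look to the model, the sketch below renders the same tool once as an OpenAI-style JSON schema and once as a TypeScript-style signature. The TypeScript rendering is only illustrative: the exact format is defined by Kimi K2.5's chat template and is not reproduced here. The `query` parameter and the `render_typescript` helper are hypothetical.

```python
import json

# OpenAI-style JSON declaration for one harness tool.
# The `query` parameter is a hypothetical example, not the real schema.
tool = {
    "name": "search_pubmed",
    "description": "Searches PubMed for biomedical literature.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "3-5 medical terms"}
        },
        "required": ["query"],
    },
}

TS_TYPES = {"string": "string", "integer": "number",
            "number": "number", "boolean": "boolean"}

def render_typescript(tool):
    """Render a JSON-schema tool as a TypeScript-style type (illustrative only)."""
    params = tool["parameters"]
    required = set(params.get("required", []))
    fields = []
    for name, spec in params["properties"].items():
        optional = "" if name in required else "?"
        fields.append(f"  {name}{optional}: {TS_TYPES.get(spec['type'], 'any')};")
    body = "\n".join(fields)
    return f"// {tool['description']}\ntype {tool['name']} = (_: {{\n{body}\n}}) => any;"

json_decl = json.dumps(tool, indent=2)
ts_decl = render_typescript(tool)
print(json_decl)
print(ts_decl)
```

The two strings carry the same information but tokenize very differently, which is the kind of prompt-level shift the paragraph above refers to.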

Even with TypeScript tool declarations, we saw a second regression: thinking-style content and stray </think> tags occasionally leaking into the final answer. Tool-using Kos-1 Experimental hits this in roughly 3% of multi-turn tool runs, while Kos-1 Experimental raw (no tools) hits it in only about 0.2% of runs. Since Kos-1 Experimental training never exposed the model to tools and never rendered any tool-declaration content, our leading hypothesis is that prolonged pure-medical, no-tool RL shifted the model's tool-call behavior, which then surfaces as this multi-turn rendering glitch at serving time.
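
A leak rate of this kind can be estimated with a simple tag check over final answers. The sketch below is a minimal version under the assumption that a leak is any literal `<think>` or `</think>` tag in the user-visible answer. It does not catch thinking-style prose that leaks without a tag, and the function names are illustrative.

```python
import re

THINK_TAG = re.compile(r"</?think>")

def has_think_leak(final_answer: str) -> bool:
    """True if a literal <think> or </think> tag appears in the final answer."""
    return bool(THINK_TAG.search(final_answer))

def leak_rate(final_answers) -> float:
    """Fraction of final answers that contain a stray thinking tag."""
    answers = list(final_answers)
    if not answers:
        return 0.0
    return sum(has_think_leak(a) for a in answers) / len(answers)

# Made-up answers for illustration.
answers = [
    "Start empiric antibiotics and refer urgently.",
    "I should check the label first.</think>Reduce the dose.",
    "Phototherapy is not indicated at this level.",
    "Repeat hCG in 48 hours.",
]
print(leak_rate(answers))
```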

We saw the same issue become more severe in long-context CLI-agent settings. When we tried to run Kos-1 Experimental on PhysicianBench, a long-horizon EHR-agent benchmark, tool calls often returned large amounts of text. Under that long-context tool-observation load, the model's formatting became brittle: it would sometimes fail to close a tool call, or fall into repeated-token or patterned-token loops. Because of these format failures, Kos-1 Experimental reaches only a 7% pass rate on PhysicianBench, down from 12% for the baseline Kimi K2.5.

What's Next

We're doubling down on harness-first training. While this release shows the value of env-free training, we still believe in narrowing the sim2real gap and training a medical agent end-to-end with the same harness meant to be deployed in prod. End-to-end agent training aligns the training and production distributions, and early results show improved, rather than regressed, tool-use behavior.

Kos is the birthplace of Hippocrates.

References

The LLM Data Company. Kos-1 Lite: SOTA Medical Model. Mar 3, 2026. llmdata.com/blog/kos-1
Moonshot AI. Kimi K2.5: Visual Agentic Intelligence. Technical report, 2026. kimi.com
I. Shenfeld, J. Pari, and P. Agrawal. RL's Razor: Why Online Reinforcement Learning Forgets Less. arXiv:2509.04259, 2025. arxiv.org
OpenAI. HealthBench Professional: Evaluating Large Language Models on Real Clinician Chats. 2026. PDF
OpenAI. HealthBench: Evaluating Large Language Models Towards Improved Human Health. 2025. PDF
OpenAI. GPT-5.5 System Card. Apr 23, 2026. PDF
J. Schulman and Thinking Machines Lab. LoRA Without Regret. Sep 29, 2025. thinkingmachines.ai
MiniMax. MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention. arXiv:2506.13585, 2025. arxiv.org
The LLM Data Company. Curriculum. llmdata.com/curriculum
A. Rajkomar, E. Loreaux, Y. Liu, et al. Deciphering clinical abbreviations with a privacy protecting machine learning system. Nature Communications 13, 7456 (2022). nature.com
South Eastern Sydney Local Health District. Clinical Abbreviations List. July 2022. PDF
Thinking Machines Lab. Kimi K2 / K2.5 shard-by-shard merge with INT4 expert dequant/requant. Tinker Cookbook PR #573. github.com
Baseten. How we built the fastest Kimi K2.5 on Artificial Analysis. Mar 9, 2026. baseten.co
R. Liu, I. Q. Mohiuddin, A. J. Schoeffler, K. Renduchintala, A. Nayak, P. L. Vemu, S. C. Vedak, K. C. Black, J. L. Havlik, I. Ogunmola, S. P. Ma, R. Dhatt, J. H. Chen. PhysicianBench: Evaluating LLM Agents in Real-World EHR Environments. arXiv:2605.02240, May 2026. arxiv.org