The LLM Data Company

See our latest release

Kos-1 Experimental: Env-free RL at 1T Params

Post-Training Models for Production Agents

“With different training the same ostensive teaching of these words would have effected a quite different understanding.”

LUDWIG WITTGENSTEIN

The LLM Data Company trains specialized models end-to-end inside the production harness they will run in. The result is Pareto-dominant specialists that beat frontier models at a fraction of the serving cost.
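For readers who want the term pinned down, "Pareto-dominant" can be read as: at least as good on every axis, and strictly better on at least one. The sketch below checks that condition for two axes, quality and serving cost. The `ModelPoint` type, the `pareto_dominates` helper, and the numbers are hypothetical and for illustration only; they are not measured results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPoint:
    """A model summarized on two axes (both fields are illustrative assumptions)."""
    name: str
    quality: float       # higher is better, e.g. task success rate
    serving_cost: float  # lower is better, e.g. dollars per task


def pareto_dominates(a: ModelPoint, b: ModelPoint) -> bool:
    """Return True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.quality >= b.quality and a.serving_cost <= b.serving_cost
    strictly_better = a.quality > b.quality or a.serving_cost < b.serving_cost
    return no_worse and strictly_better


# Hypothetical numbers, for illustration only; not measured results.
specialist = ModelPoint("specialist", quality=0.90, serving_cost=0.10)
generalist = ModelPoint("generalist", quality=0.85, serving_cost=1.00)

print(pareto_dominates(specialist, generalist))
print(pareto_dominates(generalist, specialist))
```

A model that is cheaper but worse, or better but more expensive, dominates nothing under this definition; the claim above is that a specialist can win on both axes at once.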

Today, enterprises are deploying general, expensive models in their production agents, and the cost to replace them with post-trained, specialized models has been prohibitive.

Frontier models are designed to be deployed in thousands of harnesses. To achieve that generality, labs train on a variety of contrived RL environments, hoping the skills transfer to whatever the customer eventually does with the model. The resulting model is large, general, and increasingly expensive to serve as workloads become agentic and run for hours or days.

For practical domains, the bottleneck is data. In verifiable domains, like math and code, training signal is cheap to generate. In practical domains, there is no clean answer to train on. The domains where specialized models are most valuable are precisely those where building them has been slowest and most expensive: gated by human labeling and taking months to iterate. Frontier labs spend billions per year, a level of spending that is untenable for most enterprises.

Training agent-specific models improves performance for real production use cases, and the data needed to create them can now be produced at scale.

Models are more performant and less expensive when they are trained specifically for your use case, within your harness. The sim2real gap narrows when your model weights have seen what your agent will see in production.

Solving the data problem is less obvious. Unblocking training for enterprise teams requires rethinking, from first principles, what models need to learn. Throwing more manual effort at the problem does not scale in quality, volume, or domain specificity.

Models learn from tasks and rewards, so we built Curriculum, an autoresearch platform that curates tasks and rewards for on-policy RL, paired with the harness the model will run in. This technique has an existence proof: we've used it to train our own SOTA medical models and have partnered with frontier agent companies.

We work with focused teams looking to replace generalist models with specialists and eliminate the false dichotomy between cheaper and better.

RESEARCH

Read our latest work:

MAY 2026

Kos-1 Experimental: Env-free RL at 1T Params
