Curriculum

Data autoresearch for training SOTA agents.

Use Curriculum to post-train your knowledge work agent in your harness.

Existence proof

Kos-1 Lite achieved medical SOTA trained entirely on Curriculum. Curriculum also drove our work with Perplexity on DRACO.

Tasks for your production harness.

An agent is a model plus a harness. The harness enriches an LLM with prompts, tools, skills, and scaffolding so it can perform production tasks. Foundation models are trained to generalize to any harness. We build tasks scoped to your harness and domain, so you can use RL to train a model that is cheaper and more performant in your workflow than a generic foundation model.

Tasks are scoped to the capabilities your harness expects. They require the right domain knowledge, capabilities, and final output, but unlike common RL environments, they don’t prescribe a fixed toolset or MCP server. As your harness matures and its preferred tools change (e.g., from MCP to programmatic), the same taskset can still be used to train the weights of the model steering it. The harness is the training environment, and performance gains target the workflow you ship.
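As a rough sketch of what "scoped to the harness, not the toolset" could look like in code: a task carries only an instruction and an output contract, and the harness decides which tools to use. Every name here (Task, Harness, ToolHarness, run_taskset, the two toy harnesses) is hypothetical and is not Curriculum's actual interface.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass(frozen=True)
class Task:
    """Hypothetical task record: no tool list, only what must be produced."""
    task_id: str
    instruction: str
    required_output_keys: tuple[str, ...]  # the output contract


class Harness(Protocol):
    """Anything with a name and a run() method counts as a harness here."""
    name: str

    def run(self, task: Task) -> dict[str, str]: ...


@dataclass
class ToolHarness:
    """Toy harness that routes every task through the single tool it was given."""
    name: str
    tool: Callable[[str], str]

    def run(self, task: Task) -> dict[str, str]:
        answer = self.tool(task.instruction)
        return {key: answer for key in task.required_output_keys}


def run_taskset(harness: Harness, tasks: list[Task]) -> dict[str, bool]:
    """Run each task and check only the output contract, never the tools used."""
    results = {}
    for task in tasks:
        output = harness.run(task)
        results[task.task_id] = all(k in output for k in task.required_output_keys)
    return results


tasks = [Task("t1", "Summarize the refund policy.", ("summary",))]

# Two harness versions with different tool implementations, same taskset.
mcp_style = ToolHarness("v1-mcp", tool=lambda text: f"[mcp] {text}")
programmatic = ToolHarness("v2-programmatic", tool=lambda text: f"[code] {text}")

results_v1 = run_taskset(mcp_style, tasks)
results_v2 = run_taskset(programmatic, tasks)
```

Because the task never names a tool, swapping the tool implementation leaves the taskset untouched; only the harness object changes.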

You own / Curriculum produces

Task

Instruction

Difficulty-aware task.

Harness

Your agent runtime

Prompts, tools, scaffolding, output contract.

Model

Rubric

Reward

Dense score computed by Rubric grader.

Resource sandbox

Replayable state

Per-rollout databases, files, and APIs the harness reads to enable deterministic replay.

For stateful agents, we replace production dependencies with resource sandboxes. The harness keeps its client code; only the resource bindings change, from production URLs to sandbox URLs. For each rollout, the agent workspace is initialized from a snapshotted filesystem. Live systems stay untouched, and each attempt sees replayable state.

For example, a tool’s implementation sees postgres://prod.orders-db.internal replaced with postgres://sandbox.orders-db.br_a1b2c3, and at the start of each rollout a snapshotted filesystem is mounted into the agent workspace.
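A minimal sketch of the binding swap, under several assumptions: the ResourceBindings type, the sandbox_bindings and lookup_order_tool functions, and the second rollout id are all made up for illustration, and copying a directory stands in for mounting a snapshotted filesystem. No real database is contacted.

```python
from dataclasses import dataclass
from pathlib import Path
import shutil
import tempfile


@dataclass(frozen=True)
class ResourceBindings:
    """Hypothetical bundle of everything a tool needs to reach external state."""
    orders_db_url: str
    workspace: Path


def sandbox_bindings(rollout_id: str, snapshot_dir: Path) -> ResourceBindings:
    """Build per-rollout bindings: a sandbox DB URL plus a fresh copy of the snapshot."""
    workspace = Path(tempfile.mkdtemp(prefix=f"rollout_{rollout_id}_"))
    # Copying stands in for mounting a snapshotted filesystem.
    shutil.copytree(snapshot_dir, workspace, dirs_exist_ok=True)
    return ResourceBindings(
        orders_db_url=f"postgres://sandbox.orders-db.{rollout_id}",
        workspace=workspace,
    )


def lookup_order_tool(bindings: ResourceBindings, order_id: str) -> str:
    """Tool code is the same in production and sandbox; only the bindings differ."""
    return f"would query order {order_id} via {bindings.orders_db_url}"


# Create a tiny snapshot, then start two rollouts from it.
snapshot = Path(tempfile.mkdtemp(prefix="snapshot_"))
(snapshot / "notes.txt").write_text("initial state")

first = sandbox_bindings("br_a1b2c3", snapshot)
(first.workspace / "notes.txt").write_text("modified by rollout 1")

second = sandbox_bindings("br_d4e5f6", snapshot)
```

The second rollout starts from the snapshot again, so edits made during the first rollout do not leak into it.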

Dense rubrics reward the non-verifiable.

Frontier-grade rubrics are binary, self-contained, and general enough to cover the space of correct answers, so wrong responses measurably fail and diverse correct answers are rewarded.

Curriculum rubrics define a valid solution space instead of a myopic canonical answer. Criteria cover evidence, reasoning, constraints, and failure modes. Positive criteria reward necessary behavior, and negative criteria penalize active errors.
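One way to picture positive and negative binary criteria producing a dense score is the sketch below. It is an assumption-laden illustration, not Curriculum's grader: the Criterion type, the weights, the normalization, and the example rubric are invented, and simple keyword checks stand in for whatever judge actually decides whether a criterion is met.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Criterion:
    description: str
    check: Callable[[str], bool]  # binary: met or not met
    weight: float                 # > 0 rewards, < 0 penalizes


def score(response: str, rubric: list[Criterion]) -> float:
    """Sum weights of met criteria, divide by total positive weight, clamp to [0, 1]."""
    max_positive = sum(c.weight for c in rubric if c.weight > 0)
    if max_positive == 0:
        raise ValueError("rubric needs at least one positive criterion")
    earned = sum(c.weight for c in rubric if c.check(response))
    return max(0.0, min(1.0, earned / max_positive))


# Invented example rubric; keyword checks are stand-ins for a real judge.
rubric = [
    Criterion("States that churn rose in Q3",
              lambda r: "churn" in r.lower() and "q3" in r.lower(), 2.0),
    Criterion("Attributes the rise to the pricing change",
              lambda r: "pricing" in r.lower(), 2.0),
    Criterion("Wrongly blames seasonality",
              lambda r: "seasonal" in r.lower(), -2.0),
]

good = "Churn rose in Q3, driven by the pricing change."
shallow = "Churn rose in Q3."
wrong = "Churn rose in Q3 due to seasonal effects."

scores = {name: score(text, rubric)
          for name, text in [("good", good), ("shallow", shallow), ("wrong", wrong)]}
# scores == {"good": 1.0, "shallow": 0.5, "wrong": 0.0}
```

Each criterion stays binary, yet the aggregate separates a complete answer, a shallow one, and one that commits an active error.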

Most hand-written rubrics fail by confusing one good answer with the whole solution space. A model can solve the task correctly in a different way, or sound convincing while missing the real driver. Curriculum rubrics are written around the boundary of correctness: the facts that must be present, the ranges that allow legitimate variation, the reasoning links that separate good from shallow, and the active errors that should lower reward. They also avoid common grading failures: overlapping criteria that double-count the same behavior, subjective criteria that make pass/fail nondeterministic, and criteria that only work if the judge model has privileged information the rubric never states.
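To make "ranges that allow legitimate variation" concrete, here is a small hypothetical criterion that accepts any stated percentage within a tolerance band instead of one canonical figure. The function name, the band, and the regex-based extraction are assumptions for illustration only.

```python
import re
from typing import Callable


def in_range_criterion(low: float, high: float) -> Callable[[str], bool]:
    """Return a binary check that passes if any percentage in the text lies in [low, high]."""
    def check(response: str) -> bool:
        values = [float(m) for m in re.findall(r"(\d+(?:\.\d+)?)\s*%", response)]
        return any(low <= v <= high for v in values)
    return check


# Accept anything from 11.5% to 12.5% rather than exactly "12%".
growth_in_range = in_range_criterion(11.5, 12.5)

rounded = growth_in_range("Revenue grew about 12% year over year.")
precise = growth_in_range("Revenue grew 11.8%.")
off = growth_in_range("Revenue grew 20%.")
missing = growth_in_range("Revenue grew strongly.")
```

The check is still pass/fail and states its own bounds, so a judge needs no information beyond what the criterion itself contains.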

Connect with TLDC.

We work with focused teams looking to train models for their production agents. If you’re looking to push your agent along the performance-cost Pareto frontier, we’d like to talk.

daanish@llmdata.com