Curriculum

Data autoresearch for training SOTA agents.

Use Curriculum to author bespoke RL data generation pipelines that convert compute to dense, calibrated rubrics over a diverse task set.

Talk to us

Existence proof

Kos-1 Lite, trained entirely on Curriculum data, achieved medical SOTA. Curriculum also drove our work with Perplexity on DRACO.

Tasks for your production harness.

An agent is composed of a model and a harness. The harness enriches an LLM with prompts, tools, skills, and scaffolding so it can perform production tasks. Foundation models have been trained to generalize to any harness. We build tasks scoped to your harness and domain, so you can use RL to train a model that is cheaper and more performant than a generic foundation model.

Tasks are scoped to the capabilities your harness expects. They require the right domain knowledge, capabilities, and final output, but unlike common RL environments, they don’t prescribe a fixed toolset or MCP server. As your harness matures and your preferred tools change (e.g., from MCP to programmatic), the same taskset can still be used to train the weights of the model steering it. The harness is the training environment, and performance gains target the workflow you ship.
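To make the harness-agnostic idea concrete, here is a minimal sketch of one task being scored under two versions of a harness. Every name in it (Task, rollout, the two toy harnesses, and the toy grader) is hypothetical and chosen for illustration; it is not Curriculum's actual interface.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Task:
    """Hypothetical task record: an instruction with no tool list attached."""
    instruction: str


# Assumed signatures: a harness maps an instruction to a final output,
# and a grader maps that output to a scalar reward.
Harness = Callable[[str], str]
Grader = Callable[[str], float]


def rollout(task: Task, harness: Harness, grader: Grader) -> float:
    """Run the task through whichever harness is current, then grade the output."""
    output = harness(task.instruction)
    return grader(output)


def harness_mcp(instruction: str) -> str:
    # Toy stand-in for a harness that reaches its tools over MCP.
    return f"[mcp] refund issued for {instruction}"


def harness_programmatic(instruction: str) -> str:
    # Toy stand-in for the same harness after moving to programmatic tools.
    return f"[programmatic] refund issued for {instruction}"


def toy_grader(output: str) -> float:
    # Grades only the final output, never the tools used to produce it.
    return 1.0 if "refund issued" in output else 0.0


task = Task("order 1001")
rewards = [rollout(task, h, toy_grader) for h in (harness_mcp, harness_programmatic)]
```

Because the task carries only the instruction and the grader sees only the final output, swapping the toolset inside the harness leaves the task and its reward definition unchanged.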

You own:

Harness (your agent runtime): prompts, tools, scaffolding, output contract.

Model.

Curriculum produces:

Task (instruction): difficulty-aware task.

Rubric (reward): dense score computed by the rubric grader.

Resource sandbox (replayable state): per-rollout databases, files, and APIs the harness reads to enable deterministic replay.

For stateful agents, we replace production dependencies with resource sandboxes. The harness keeps its client code; only the resource bindings change, from production URLs to sandbox URLs. The agent workspace is initialized from a snapshotted filesystem for each rollout. Live systems stay untouched, and each attempt sees replayable state.

For example, a tool’s implementation sees postgres://prod.orders-db.internal replaced with postgres://sandbox.orders-db.br_a1b2c3, and a snapshotted filesystem is mounted into the agent workspace at the start of each rollout.
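The binding swap described above can be sketched in a few lines. This is an illustration under stated assumptions: the function names are hypothetical, and the rewrite rule (replace the leading "prod" label with "sandbox" and the trailing label with a per-rollout identifier) is inferred from the single example URL pair, not taken from a documented scheme.

```python
def to_sandbox(prod_url: str, rollout_id: str) -> str:
    """Rewrite one production URL into a per-rollout sandbox URL.

    Assumed pattern, inferred from the example above:
    scheme://prod.<service>.<zone>  ->  scheme://sandbox.<service>.<rollout_id>
    Ports, paths, and credentials are not handled in this sketch.
    """
    scheme, sep, host = prod_url.partition("://")
    if not sep:
        raise ValueError(f"not a URL: {prod_url!r}")
    labels = host.split(".")
    if len(labels) < 3 or labels[0] != "prod":
        raise ValueError(f"unexpected production host: {host!r}")
    service = ".".join(labels[1:-1])
    return f"{scheme}://sandbox.{service}.{rollout_id}"


def rebind_for_rollout(bindings: dict[str, str], rollout_id: str) -> dict[str, str]:
    """Return new bindings for one rollout; the production mapping is not mutated."""
    return {name: to_sandbox(url, rollout_id) for name, url in bindings.items()}


# Hypothetical binding name; the URL is the one from the example above.
PROD_BINDINGS = {"ORDERS_DB": "postgres://prod.orders-db.internal"}
sandboxed = rebind_for_rollout(PROD_BINDINGS, "br_a1b2c3")
```

The harness client code would read the same binding name in both settings, so only the mapping handed to it differs between production and a training rollout.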

Autoresearch

A single Curriculum task of roughly 100 tokens can take on the order of 100,000 tokens to generate. Traditional synthetic data methods generate and verify individual samples. We instead use models to dynamically generate, verify, ablate, and diversify the data pipelines themselves.

Dense rubrics reward the non-verifiable.

Frontier-grade rubrics are binary and self-contained, and they generalize across the space of correct answers, so wrong responses measurably fail and diverse correct answers are rewarded.

Curriculum rubrics define a valid solution space instead of a myopic canonical answer. Criteria cover evidence, reasoning, constraints, and failure modes. Positive criteria reward necessary behavior, and negative criteria penalize active errors.

Most hand-written rubrics fail by confusing one good answer with the whole solution space. A model can solve the task correctly in a different way, or sound convincing while missing the real driver. Curriculum rubrics are written around the boundary of correctness: the facts that must be present, the ranges that allow legitimate variation, the reasoning links that separate good from shallow, and the active errors that should lower reward. They also avoid common grading failures: overlapping criteria that double-count the same behavior, subjective criteria that make pass/fail nondeterministic, and criteria that only work if the judge model has privileged information the rubric never states.
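One way to turn binary, weighted criteria into a dense reward is sketched below. The three weights (+20, +8, -100) are taken from the medical example on this page, with the criterion text shortened. The aggregation rule is an assumption, since the page does not specify one: sum the weights of the criteria the judge marks as met, divide by the total positive weight, and clip to [0, 1]. The Criterion class, the ids, and the score function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    id: str      # hypothetical identifier
    text: str    # binary statement the judge marks as met or not met
    weight: int  # positive rewards necessary behavior, negative penalizes active errors


def score(criteria: list[Criterion], met_ids: set[str]) -> float:
    """Assumed aggregation: met weight over total positive weight, clipped to [0, 1]."""
    raw = sum(c.weight for c in criteria if c.id in met_ids)
    max_positive = sum(c.weight for c in criteria if c.weight > 0)
    if max_positive == 0:
        return 0.0
    return min(1.0, max(0.0, raw / max_positive))


RUBRIC = [
    Criterion("brash", "Identifies BRASH syndrome or describes its feedback loop", 20),
    Criterion("calcium", "Recommends IV calcium as the first intervention", 8),
    Criterion("continue_metformin", "Recommends continuing metformin", -100),
]

full_credit = score(RUBRIC, {"brash", "calcium"})
partial_credit = score(RUBRIC, {"brash"})
unsafe_answer = score(RUBRIC, {"brash", "calcium", "continue_metformin"})
```

Under this rule a single heavily weighted negative criterion can zero out an otherwise strong response, which matches the intent that active errors lower reward. Other normalizations are equally plausible.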

Medical Rubric Example (55 criteria)

Clinical Reasoning (18 criteria)

[+20] Identifies BRASH syndrome by name or describes the positive-feedback loop where bradycardia worsens renal perfusion which worsens hyperkalemia which worsens bradycardia

[+10] States trimethoprim blocks ENaC or potassium secretion in the collecting duct, causing hyperkalemia

[+10] States trimethoprim causes type 4 RTA or non-anion-gap metabolic acidosis via impaired distal H+ and K+ secretion

[+10] States sulfamethoxazole has sulfonylurea-like activity stimulating insulin release, contributing to hypoglycemia especially in renal failure

[+10] States carvedilol provides AV-nodal blockade that synergizes with hyperkalemia to produce profound bradycardia by blocking sympathetic compensation

[+10] States beta-blockade impairs beta-2-mediated intracellular potassium uptake, worsening hyperkalemia

[+10] States lisinopril suppresses aldosterone, further impairing renal potassium excretion

[+10] States ACE inhibitors impair GFR maintenance during low-perfusion states by reducing efferent arteriolar tone

[+10] States metformin accumulates in severe AKI and can cause or contribute to lactic acidosis (MALA) with lactate 4.8 and pH 7.18

[+10] States empagliflozin carries risk of euglycemic DKA and that beta-hydroxybutyrate must be checked given low glucose and anion gap acidosis

[+10] States combined SGLT2 inhibitor and loop diuretic use contributes to volume depletion and prerenal AKI

[+10] Identifies a mixed high-anion-gap and non-anion-gap metabolic acidosis based on the lab values provided

[+10] Calculates or states the anion gap as approximately 17 using Na 131 minus Cl 100 minus bicarb 14

[+10] States atropine is typically ineffective for bradycardia in this setting because the mechanism is AV-nodal blockade and hyperkalemia rather than vagal

[+10] States carvedilol impairs hepatic glycogenolysis or blunts counterregulatory adrenergic response, worsening or prolonging hypoglycemia

[-25] States the acidosis is solely from renal failure without recognizing the mixed high-AG and non-AG components or their distinct etiologies

[+8] Explains at least 3 specific reasons why treating only the potassium is insufficient (e.g., hypoglycemia risk with insulin, BRASH loop persistence, ongoing infection, offending drugs still on board)

[+8] Notes trimethoprim blocks tubular creatinine secretion causing potential pseudo-elevation of Cr, while acknowledging true AKI is also present given parallel BUN rise

Clinical Judgment (11 criteria)

[+20] States that standard 10-unit insulin dose is dangerous given baseline glucose of 62, severe AKI prolonging insulin clearance, and beta-blockade masking hypoglycemia

[-100] Recommends continuing metformin in a patient with Cr 5.9, lactate 4.8, and pH 7.18

[-50] Recommends 10 units regular insulin IV without specifying dextrose preloading or dose reduction for a patient with glucose 62 and severe AKI

[-25] Recommends sodium polystyrene sulfonate (Kayexalate) in a hypoperfused or shocked patient without acknowledging bowel necrosis risk

[-25] Dismisses glucose of 62 as clinically insignificant or states dextrose given with insulin adequately addresses it without recommending preloading or frequent monitoring

[-50] States that correcting potassium alone will resolve the shock and bradycardia without need for vasopressor support or breaking the BRASH loop

[+8] Recommends emergent nephrology consult for renal replacement therapy given K 7.8 with ECG changes, pH 7.18, Cr 5.9, and shock

[+8] Recommends cautious fluid resuscitation with small boluses (e.g., 250 mL) guided by bedside ultrasound or clinical reassessment given EF 35% and shock

[+8] Recommends ICU admission or continuous cardiac monitoring for this critically ill patient

[-10] Reassures that kidneys will likely recover without dialysis despite K 7.8 with ECG changes, pH 7.18, Cr 5.9, and hemodynamic instability

[-10] Advises against any fluid resuscitation or advises large unmonitored fluid boluses without acknowledging the competing risks of EF 35% and hypovolemic shock

Information Seeking (4 criteria)

[+8] Uses hedging language such as 'possible' or 'cannot confirm without' when discussing acute interstitial nephritis from TMP-SMX since urine eosinophils and biopsy are unavailable

[+8] Acknowledges uncertainty about relative contributions of type A (hypoperfusion) vs type B (metformin) lactic acidosis given that only a single lactate value of 4.8 is available without a trend

[+8] Uses hedging language when discussing euglycemic DKA from empagliflozin since beta-hydroxybutyrate result is pending and not yet available

[-25] Definitively diagnoses acute interstitial nephritis without hedging when no urine eosinophils, peripheral eosinophilia, or biopsy results are available

Care Delivery (22 criteria)

[+10] Recommends epinephrine as the vasopressor of choice because it provides chronotropy, inotropy, and beta-2-mediated potassium shift

[+8] Recommends IV calcium (calcium chloride or calcium gluconate) as the first intervention for cardiac membrane stabilization

[+8] Recommends dextrose administration (e.g., D50W bolus) before or concurrent with insulin to address baseline hypoglycemia of 62

[+8] Recommends reduced insulin dose (e.g., 5 units instead of standard 10 units) given hypoglycemia, AKI, and beta-blockade

[+8] Recommends point-of-care glucose monitoring at least every 15-30 minutes after insulin administration

[+8] Recommends placing transcutaneous pacing pads given QRS 142 ms and HR 52 with risk of progression to asystole

[+8] Recommends holding at least trimethoprim-sulfamethoxazole, lisinopril, carvedilol, metformin, and empagliflozin

[+8] Recommends checking serum beta-hydroxybutyrate to evaluate for euglycemic DKA from empagliflozin

[+8] Recommends chest X-ray or lung imaging to characterize the bibasilar crackles as pulmonary edema vs pneumonia vs other

[+8] Recommends serial lactate measurements with a specific interval (e.g., every 2-4 hours) to assess perfusion response and differentiate type A from type B lactic acidosis

[+8] Recommends empiric broad-spectrum antibiotics for suspected urosepsis such as ceftriaxone 2g IV or piperacillin-tazobactam renally adjusted

[+8] Recommends obtaining blood cultures and urine culture before or concurrent with antibiotic administration

[+8] Recommends sodium bicarbonate administration for severe acidemia (pH 7.18) with a target pH of at least 7.25

[+8] Recommends renal ultrasound to exclude obstructive uropathy in a male patient with UTI and severe AKI

[+8] Recommends UA with microscopy to evaluate for muddy brown casts (ATN) or WBC casts (AIN) to help phenotype the AKI

[+8] Recommends checking and repleting magnesium (target at least 2.0 mg/dL) given arrhythmia risk and electrolyte derangements

[+8] Recommends repeat BMP, VBG or ABG, and ECG within 1-2 hours after initial potassium-shifting therapy to assess response

[+8] Affirms the PGY-1's clinical skepticism of the senior's potassium-only plan as well-founded rather than dismissing it or deferring to seniority

[+8] Presents the most urgent action or safety-critical information within the first two paragraphs

[+8] Organizes management steps in a prioritized sequence starting with cardiac stabilization before potassium shifting and other interventions

[-10] Repeats the same information or phrasing multiple times within the response

[-10] Includes extended discussion of topics unrelated to the drug interaction cascade or the senior's plan (e.g., long-term CHF optimization, gout management, or outpatient diabetes regimen)

Rubric coverage at this density takes tens of hours to hand-author for a single task. Curriculum handles compliant rubric generation at scale, autonomously.

Get in touch

Connect with TLDC.

We work with focused teams looking to unblock their data bottlenecks. If you’re looking to push your agent along the performance-cost Pareto frontier, we’d like to talk.

daanish@llmdata.com