Research & Insights

Technical writing for the people who build frontier models.

Methodological frameworks, data quality analysis, and operational thinking on AI post-training infrastructure — written by practitioners, for researchers.

Supply Chain Security in RLHF Pipelines: What the Mercor Breach Actually Revealed

The breach that exposed Mercor's production environment in early 2026 was described in press coverage as a "sophisticated supply chain attack." That framing, while technically accurate, obscures the more important story: the attack succeeded not because it was sophisticated, but because the architecture it targeted was designed to fail in exactly this way.

When a single vendor holds the proprietary training methodologies of competing AI labs in a shared infrastructure environment, cascade failure is not a risk to be managed. It is a design specification.

Understanding why requires examining what "supply chain attack" actually means in the context of AI training infrastructure, and why the standard enterprise security responses (better monitoring, faster patch cycles, stronger perimeter controls) cannot fix a structural vulnerability.

The LiteLLM Vector: What Actually Happened

The attack entered Mercor's environment through a compromised version of LiteLLM, an open-source library used to route calls between different AI model providers. LiteLLM is a legitimate, widely used tool. Its compromise was not a Mercor-specific failure; it was a supply chain attack against the broader AI development ecosystem.

What made it catastrophic for Mercor specifically was not the vulnerability in LiteLLM. It was the architecture that LiteLLM was deployed into: a shared production environment where a single compromised dependency had access to data and communications across all client relationships simultaneously.

The blast radius of a supply chain attack is a function of the access model of the environment it enters. In an architecture where every client shares the same infrastructure, the blast radius is every client. In an architecture where every client operates in its own isolated environment, the blast radius is bounded to one tenant.

The Root Cause: Shared Infrastructure as a Design Choice

Shared infrastructure is not a security oversight. It is an economic choice. Running isolated environments per client costs more — in cloud infrastructure spend, in operational complexity, in engineering resources required to maintain separation. For a vendor optimizing for growth and margin, shared infrastructure is the rational choice.

The problem is that the cost of shared infrastructure is not borne by the vendor. It is borne by the labs whose training methodologies, evaluation rubrics, and contractor interactions are exposed when the inevitable breach occurs. This is a classic externalized risk problem: the entity making the architectural decision that creates the risk is not the entity that pays when the risk materializes.

The structural fix is not better monitoring of a shared environment. It is an architecture where each client's data, secrets, and communications live in a separate environment that a breach elsewhere cannot reach. Not a separate VPC. Not a separate subnet. A separate cloud account with separate billing, separate IAM, separate encryption keys, and separate audit logs, with the boundary enforced at the infrastructure provider level.

The Software Bill of Materials Gap

A Software Bill of Materials (SBOM) is a formal, machine-readable inventory of every software component in a production system, covering both direct and transitive dependencies. The 2021 Executive Order on Improving the Nation's Cybersecurity (EO 14028) made SBOMs part of the security expectations for software sold to US federal agencies.

The LiteLLM vulnerability was a known CVE. An SBOM process with automated CVE monitoring would have flagged it. The question is not whether the vulnerability existed; it did, in a widely used library. The question is whether the engineering culture and tooling to detect it before exploitation were in place.
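
As a rough illustration of what automated CVE monitoring against an SBOM involves, the sketch below assumes a CycloneDX-style JSON document and a hypothetical in-memory advisory table. The component names, versions, and advisory identifier are invented for the example; a real pipeline would pull advisories from a vulnerability database and match version ranges rather than exact versions.

```python
import json

# Hypothetical CycloneDX-style SBOM fragment; names and versions are illustrative only.
SBOM_JSON = """
{
  "bomFormat": "CycloneDX",
  "specVersion": "1.5",
  "components": [
    {"type": "library", "name": "example-router", "version": "1.2.0"},
    {"type": "library", "name": "example-http", "version": "2.31.0"}
  ]
}
"""

# Hypothetical advisory feed: (package name, affected version) -> advisory id.
ADVISORIES = {
    ("example-router", "1.2.0"): "EXAMPLE-ADVISORY-0001",
}


def find_flagged_components(sbom_text, advisories):
    """Return (name, version, advisory_id) for each SBOM component with a known advisory."""
    sbom = json.loads(sbom_text)
    flagged = []
    for component in sbom.get("components", []):
        key = (component["name"], component["version"])
        if key in advisories:
            flagged.append((component["name"], component["version"], advisories[key]))
    return flagged


flagged = find_flagged_components(SBOM_JSON, ADVISORIES)
for name, version, advisory in flagged:
    print(f"BLOCK DEPLOY: {name}=={version} matches {advisory}")
```

Run on every code change and on a daily schedule, a check of this shape turns the SBOM from a static document into a deployment gate.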

For AI training data vendors, whose production environments contain some of the most competitively sensitive methodologies in the technology industry, SBOM enforcement should be a prerequisite for production deployment — not an aspirational goal.

What Secure RLHF Infrastructure Actually Requires

The following controls represent the minimum viable security architecture for a vendor holding AI lab training data. Each one addresses a specific failure mode documented in the Mercor breach and in prior security incidents in the AI training supply chain.

  • Per-client cloud account isolation: Each lab's data lives in a dedicated cloud account (AWS account, GCP project, or Azure subscription). Cross-account network routing is disabled at the provider level.
  • SBOM enforcement with automated CVE monitoring: Every production dependency is documented. Automated scans run on every code change and daily against all running packages. Critical CVEs trigger a sub-48-hour remediation SLA.
  • Hardware Security Module (HSM) key management: Encryption keys managed in isolated HSM instances. Keys never appear in application code, environment variables, or configuration files.
  • Zero-trust network architecture: Every service-to-service communication requires explicit authentication. No implicit trust based on network location.
  • Immutable audit logging: All data access events logged to an immutable store. Logs cannot be altered by any internal actor — including the security team.
  • Independent penetration testing: Annual third-party penetration test with results shared with client labs. A vendor that only conducts internal security reviews has an obvious conflict of interest in reporting findings.
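
The per-client isolation control can be checked mechanically. The sketch below is a minimal illustration under assumed inputs: a hypothetical resource inventory that records which cloud account each resource lives in and which client it serves, plus a hypothetical list of cross-account network links. None of these structures come from a real cloud provider API.

```python
from collections import defaultdict

# Hypothetical inventory: each resource records its cloud account and the client it serves.
RESOURCES = [
    {"id": "db-1", "account": "acct-lab-a", "client": "lab-a"},
    {"id": "queue-1", "account": "acct-lab-a", "client": "lab-a"},
    {"id": "db-2", "account": "acct-shared", "client": "lab-b"},
    {"id": "db-3", "account": "acct-shared", "client": "lab-c"},
]

# Hypothetical list of cross-account network links (peering, shared routes).
CROSS_ACCOUNT_LINKS = [("acct-lab-a", "acct-shared")]


def isolation_violations(resources, links):
    """Return human-readable violations of one-account-per-client isolation."""
    clients_by_account = defaultdict(set)
    for resource in resources:
        clients_by_account[resource["account"]].add(resource["client"])

    violations = []
    for account, clients in sorted(clients_by_account.items()):
        # An account serving more than one client means a breach there reaches all of them.
        if len(clients) > 1:
            violations.append(f"{account} hosts multiple clients: {sorted(clients)}")
    for src, dst in links:
        violations.append(f"cross-account route exists: {src} -> {dst}")
    return violations


violations = isolation_violations(RESOURCES, CROSS_ACCOUNT_LINKS)
for violation in violations:
    print("VIOLATION:", violation)
```

An empty result is the property the first control asks for: a compromised dependency in one account has no path to another client's data.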

The Governance Gap That Monitoring Cannot Fix

The most important insight from the Mercor breach is not technical. It is governance-related. The labs whose data was exposed had no visibility into Mercor's security architecture before the breach. They were not consulted when LiteLLM was adopted as a dependency. They had no audit rights that would have revealed the shared infrastructure model. They had no financial remedy when the breach occurred.

This is a principal-agent problem. When a vendor is not accountable to its clients for its security decisions, those decisions will optimize for the vendor's interests. Structural accountability (architecture approval rights, unannounced audit rights, financial penalties for breach) is not a nice-to-have feature. It is the mechanism that aligns vendor security incentives with client security requirements.

The AI training supply chain needs vendors whose security architecture was designed with client input, whose security practices are verifiable by clients at any time, and whose contracts include financial consequences when security failures occur. Those three requirements eliminate most of the current vendor market and define exactly the gap that a new entrant can fill.

Inter-Rater Reliability by Domain: Why Cohen's Kappa Thresholds Must Be Task-Specific

Inter-rater reliability (IRR) is the most important quality metric in human evaluation data — and the most commonly misapplied. The standard practice at most AI training data vendors is to set a single platform-wide minimum IRR threshold and apply it uniformly across all task types and domains. This approach is methodologically unsound and produces misleading quality signals that can mask serious evaluation problems in high-stakes domains.

A Cohen's Kappa of 0.72 on a creative writing quality task is excellent. The same score on a clinical diagnostic accuracy task means evaluators disagree often enough to raise patient safety concerns for any model trained on the data. Treating them as equivalent is not a quality standard; it is the absence of one.

What Cohen's Kappa Actually Measures

Cohen's Kappa (κ) measures agreement between two raters beyond what would be expected by chance. A κ of 0 indicates no agreement beyond chance, a κ of 1.0 indicates perfect agreement, and negative values indicate agreement worse than chance. A commonly cited set of interpretation benchmarks is: below 0.40 as poor, 0.40-0.59 as fair, 0.60-0.74 as good, and 0.75-1.0 as excellent.
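
For concreteness, κ is defined as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement rate and p_e is the agreement expected by chance from each rater's label frequencies. The sketch below implements that definition for two raters; the labels are illustrative, not real evaluation data.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items: (p_o - p_e) / (1 - p_e)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items where the two labels match.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies, summed over labels.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1 - p_e)


# Illustrative labels: 10 items, the raters agree on 8 of them.
rater_a = ["pass", "pass", "pass", "pass", "pass", "pass", "fail", "fail", "fail", "fail"]
rater_b = ["pass", "pass", "pass", "pass", "pass", "fail", "fail", "fail", "fail", "pass"]
print(round(cohens_kappa(rater_a, rater_b), 3))  # 0.583
```

Here the raters agree on 80% of items, but because chance alone would produce 52% agreement with these label frequencies, κ is only about 0.58. Raw agreement rates therefore overstate reliability.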

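These benchmarks do not come from Cohen's original 1960 paper, which introduced the statistic without interpretation bands. They are rules of thumb proposed later in the clinical and psychological measurement literature (Landis and Koch's 1977 scale and Cicchetti's 1994 guidelines are among the most cited) and have been widely applied in psychology, medicine, and linguistics. Their adoption in AI evaluation has been largely uncritical, which is a problem, because the appropriate kappa threshold depends heavily on task characteristics that vary substantially across AI evaluation domains.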

The Domain-Specificity Problem

Three characteristics of an evaluation task determine what constitutes an acceptable IRR threshold: the objective verifiability of the correct answer, the consequences of inter-rater disagreement for downstream model behavior, and the expected natural disagreement rate among genuine domain experts.

Objective verifiability varies enormously. A code correctness evaluation — does this function produce the correct output for the specified inputs — has a verifiable ground truth. A creative writing quality evaluation does not. Applying the same kappa threshold to both ignores the fundamental difference in task structure.

Downstream consequences matter for threshold-setting in ways that are rarely formalized. A low-kappa evaluation of clinical reasoning tasks, incorporated into training data for a medical AI system, produces a model whose clinical reasoning is trained on noise. The consequences of that noise are different in magnitude from the consequences of noise in, say, a poem quality evaluation dataset.

Natural expert disagreement rates vary by domain and are often misinterpreted as evaluation quality problems when they are actually valid signals of genuine task complexity. Experienced radiologists reading identical mammograms disagree on diagnosis at rates that would produce a κ below many platforms' quality thresholds. The solution is not to discard this signal — it is to design the evaluation task to capture the disagreement as a feature rather than treating it as a defect.

Recommended Domain-Specific Thresholds

The following threshold framework reflects both the methodological literature and empirical calibration work across our expert pools. These are minimum thresholds for dataset delivery — not internal targets.

  • Medical/Clinical (κ ≥ 0.80): High objective verifiability on many subtasks, combined with high downstream consequence, justifies a strict threshold. Tasks where genuine expert disagreement is expected (ambiguous diagnostic presentations) should be flagged separately rather than suppressed.
  • Legal reasoning (κ ≥ 0.78): Statute interpretation and case outcome prediction have meaningful ground truth anchors, but expert disagreement on edge cases is valid and should be preserved rather than forced into agreement.
  • Software engineering (κ ≥ 0.82): Code correctness has the highest objective verifiability of any domain. The threshold is accordingly strict. Security vulnerability identification requires domain-specific calibration given the specialized expertise required.
  • General reasoning/instruction following (κ ≥ 0.74): Lower threshold reflects genuine task ambiguity. Rubric clarity is the primary lever for improving IRR in this domain — not expert selection.
  • Creative/writing quality (κ ≥ 0.70): Subjective by design. IRR monitoring here primarily serves to detect rubric drift and evaluator confusion rather than objective quality failures.
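
The thresholds above can be enforced as a delivery gate. The following is a minimal sketch: the domain identifiers, batch ids, and kappa values are hypothetical, and only the threshold numbers are taken from the list.

```python
# Minimum delivery thresholds from the list above, keyed by hypothetical domain identifiers.
MIN_KAPPA = {
    "medical_clinical": 0.80,
    "legal_reasoning": 0.78,
    "software_engineering": 0.82,
    "general_reasoning": 0.74,
    "creative_writing": 0.70,
}


def batches_below_threshold(batch_kappas):
    """Return (batch_id, domain, kappa, threshold) for batches that miss their domain minimum."""
    failures = []
    for batch_id, domain, kappa in batch_kappas:
        threshold = MIN_KAPPA[domain]  # a KeyError on an unknown domain is deliberate
        if kappa < threshold:
            failures.append((batch_id, domain, kappa, threshold))
    return failures


# Illustrative batches: the same kappa passes in one domain and fails in another.
batches = [
    ("batch-001", "creative_writing", 0.72),
    ("batch-002", "medical_clinical", 0.72),
    ("batch-003", "software_engineering", 0.85),
]
for failure in batches_below_threshold(batches):
    print("HOLD DELIVERY:", failure)
```

The same κ of 0.72 clears the creative writing minimum and holds the clinical batch, which is the behavior a single platform-wide threshold cannot express.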

The Rubric Clarity Lever

The most common response to low IRR scores at AI training data vendors is to change the expert pool — remove the evaluators who are most frequently in the minority position. This is almost always the wrong intervention. Low IRR is most commonly caused not by inadequate experts but by inadequate rubrics.

When two credentialed domain experts consistently disagree on an evaluation, the first question should be: is the rubric specific enough to distinguish between their positions? In most cases, the answer is no. The rubric is using language that the experts are interpreting differently, and the disagreement reflects rubric ambiguity rather than evaluator error.

The correct intervention is a calibration session: bring the disagreeing experts together, examine the specific cases where they diverged, identify the rubric language that is being interpreted differently, and update the rubric with additional worked examples that resolve the ambiguity. This process produces both a better rubric and a better-calibrated expert pool — two benefits that the "replace the evaluator" approach fails to deliver.

Implications for Lab Procurement

When evaluating AI training data vendors, labs should ask not just "what is your IRR threshold?" but "what are your domain-specific IRR thresholds, how were they derived, and what is your intervention protocol when a batch falls below threshold?" A vendor that cannot answer these questions specifically does not have a quality measurement program — it has a quality measurement gesture.

The distinction matters because IRR scores are the primary quality signal in post-training data, and a misleading IRR score is in some ways worse than no score at all. It provides false assurance that the data meets a quality standard when it does not.

The Expert Vetting Problem: Why AI Resume Screening Selects Against the People You Most Need

The most valuable evaluators for frontier AI model training are not the people with the most conventional credentials. They are people with deep, hard-won expertise in specific domains — expertise that often looks unusual by the pattern-matching standards of an automated resume screener.

The retired federal judge with 30 years of statutory interpretation experience. The rural emergency physician whose clinical reasoning has been sharpened by years of diagnostic decision-making without specialist backup. The self-taught systems programmer whose career doesn't map neatly to a progression of recognizable employer names. These are the evaluators whose judgments provide the most signal-dense training data for models learning to reason about law, medicine, and code.

AI resume screening selects against all of them.

How AI Resume Screening Works — and Why It Fails for Expert Selection

AI resume screening systems work by learning statistical patterns from large populations of resumes that were historically associated with positive hiring outcomes. They are effective for identifying candidates who resemble previous successful hires in a given role — which is exactly what makes them counterproductive for domain expert selection in AI training.

The population of "successful hires" that AI screeners learn from is almost entirely drawn from conventional employment markets: people hired through standard channels, with credentials from recognized institutions, with career progressions that follow expected patterns. The screeners are implicitly trained on a definition of expertise that prioritizes credential legibility over actual domain mastery.

The problem is not that AI screeners are inaccurate. The problem is that they are accurately identifying a specific type of person — one whose expertise is signaled through conventional credential channels — and filtering out everyone else. In many contexts that is appropriate. In domain expert selection for AI training, it systematically excludes the most valuable evaluators.

The Non-Standard Expert Problem

Consider what a genuine expert's resume looks like across several domains:

Medical: A retired attending physician with 25 years of clinical experience may have a resume with a single employer, an MD from a state medical school, and no publications. Their expertise is in their clinical judgment — built from tens of thousands of patient encounters — not in their credential profile. An AI screener looking for "research experience" or "recent institutional affiliation" will filter them out.

Legal: A former state appellate court judge who retired to private practice may not have BigLaw experience, a law review publication record, or a clerkship on a federal circuit. Their expertise in statutory interpretation and appellate reasoning is exceptional. Their resume looks like a career in state government.

Software engineering: A senior systems programmer whose most significant work is in open-source projects and whose career includes stints at companies that no longer exist will not pattern-match to the "5 years at a FAANG-equivalent" heuristic that AI screeners use as a proxy for engineering depth.

The Calibration Session as the Real Qualification Gate

The solution to the non-standard expert problem is not to improve AI screening. It is to use AI screening as a first-pass efficiency tool for obvious disqualifiers — not as the primary gate — and to make the calibration session the real qualification mechanism.

A calibration session involves asking an expert candidate to evaluate a set of gold standard tasks with known correct answers, using the specific rubric they will be working with. This test directly measures the thing we actually care about: can this person apply the evaluation rubric to this type of task with acceptable accuracy and consistency?
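
Scoring a calibration session is simple to sketch. The example below assumes gold answers and candidate answers are stored as task-id-to-label mappings and that the pass mark is a configurable accuracy threshold; the task ids, labels, and the 0.8 threshold are all hypothetical. It measures accuracy only; consistency across repeated items would need a separate check.

```python
def calibration_result(gold_labels, candidate_labels, min_accuracy):
    """Compare a candidate's labels to gold answers.

    Returns (accuracy, passed, missed_task_ids).
    """
    if not gold_labels:
        raise ValueError("gold task set must not be empty")
    if set(gold_labels) != set(candidate_labels):
        raise ValueError("candidate must label exactly the gold task set")
    missed = sorted(task for task, gold in gold_labels.items() if candidate_labels[task] != gold)
    accuracy = (len(gold_labels) - len(missed)) / len(gold_labels)
    return accuracy, accuracy >= min_accuracy, missed


# Hypothetical gold set and candidate answers.
gold = {"t1": "A", "t2": "B", "t3": "A", "t4": "B", "t5": "A"}
candidate = {"t1": "A", "t2": "B", "t3": "A", "t4": "A", "t5": "A"}

accuracy, passed, missed = calibration_result(gold, candidate, min_accuracy=0.8)
print(accuracy, passed, missed)
```

The list of missed tasks matters as much as the score: it is the agenda for the rubric orientation that follows.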

A retired federal judge who has never taken an AI evaluation test before may need 30 minutes of rubric orientation. After that orientation, their performance on the calibration task set will reflect their actual domain expertise — not their credential legibility. The calibration session finds the right evaluators regardless of what their resume looks like.

The Deceptive Compensation Compound Effect

The AI screening problem does not exist in isolation. It compounds with the deceptive compensation practices documented at current market vendors. When expert candidates discover that the $35/hour rate advertised at application is actually a per-task rate that produces an effective hourly rate of $13-27, the people most likely to stay despite this discovery are not the most accomplished domain experts — they are people with fewer alternatives.

Accomplished domain experts — the retired judges, the senior physicians, the experienced engineers — have sufficient alternative income sources to decline an engagement with deceptive compensation. The compensation model therefore compounds the screening model's filtering effect: the screener selects for credential-legible candidates, and the compensation model then selects, within that group, for those with the fewest alternatives.

The result is an expert pool that is less expert, less motivated, and less stable than the advertised composition would suggest — and the training data quality reflects all three of those deficiencies.

What a Sound Expert Selection Model Looks Like

Human-reviewed credentialing is the non-negotiable foundation. For medical domain tasks: active license verification against state medical board records, not resume self-attestation. For legal tasks: bar admission verification and subject matter confirmation through a brief structured interview. For financial tasks: credential database verification for CFA, CPA, and relevant licensing.

Beyond credential verification, the calibration session is the primary selection mechanism. It measures what we actually need to measure — evaluation quality on the specific task type — and it is inherently immune to resume pattern-matching bias because it does not depend on resume content at all.

Finally, transparent compensation with a guaranteed minimum effective hourly floor selects for evaluators who chose the engagement based on the actual terms, not based on advertised terms that were later contradicted. An expert who understood and accepted the actual compensation terms is more committed to the engagement than one who feels deceived by it.

Data Provenance Standards for Post-Training: A Practitioner's Framework

Data provenance — the documented history of a dataset's origin, collection methodology, and transformations — is a well-established concept in scientific research, legal proceedings, and enterprise data governance. In AI post-training data, it remains almost entirely absent. Most labs cannot answer basic provenance questions about the datasets that most directly shape their models' behavior.

This gap matters for three distinct reasons that are often conflated: reproducibility, which is a research integrity requirement; IP ownership, which is a legal requirement; and regulatory compliance, which is an emerging requirement under the EU AI Act and analogous frameworks.

What Provenance Documentation Requires

A complete data provenance record for a post-training dataset must answer the following questions with sufficient specificity to reconstruct the dataset from scratch if needed:

  • Who evaluated it? Not just "credentialed domain experts" — but what credentials were verified, when they were verified, and by what method.
  • What rubric was used? The specific version of the rubric, with a changelog documenting any changes from prior versions and the rationale for each change.
  • When was it evaluated? Timestamps for each individual evaluation decision, not just for batch delivery.
  • What was the inter-rater reliability? IRR scores at the batch level and, for complex tasks, at the task-type level.
  • Were there any anomalies? Gold standard injection accuracy, flagged items, re-evaluation events, and the resolution of each.
  • What are the known limitations? Expert pool demographic composition, geographic distribution, any selection biases introduced by the credentialing process, and any known gaps between the task distribution in the dataset and the target distribution.

The datasets most directly shaping frontier AI model behavior are among the least documented artifacts in the AI research ecosystem. A pre-training dataset assembled from web scrapes has more provenance documentation than most RLHF evaluation datasets.
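
One way to make the provenance questions listed above concrete is to give each of them a field in the delivered data. The sketch below is an assumed schema, not an established standard: every class name, field name, and sample value is hypothetical.

```python
from dataclasses import dataclass, asdict
import json


@dataclass(frozen=True)
class CredentialCheck:
    credential_type: str   # what was verified
    verified_on: str       # ISO 8601 date of verification
    method: str            # how it was verified


@dataclass(frozen=True)
class EvaluationRecord:
    item_id: str
    evaluator_id: str
    credential: CredentialCheck
    rubric_version: str    # points at a rubric changelog entry
    evaluated_at: str      # timestamp of this individual decision
    raw_score: int
    is_gold_item: bool = False
    anomalies: tuple = ()  # e.g. ("re-evaluated",)


@dataclass(frozen=True)
class BatchManifest:
    batch_id: str
    irr_by_task_type: dict
    known_limitations: tuple
    records: tuple


# Illustrative values only.
record = EvaluationRecord(
    item_id="item-0001",
    evaluator_id="eval-17",
    credential=CredentialCheck("state medical license", "2025-01-15", "state board lookup"),
    rubric_version="clinical-v3",
    evaluated_at="2025-02-01T14:30:00+00:00",
    raw_score=4,
)
manifest = BatchManifest(
    batch_id="batch-002",
    irr_by_task_type={"differential_diagnosis": 0.81},
    known_limitations=("all evaluators licensed in a single country",),
    records=(record,),
)
manifest_json = json.dumps(asdict(manifest), indent=2)
print(manifest_json)
```

If a field in this schema cannot be filled in for a delivered batch, that gap is itself a provenance finding.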

The Datasheets for Datasets Standard

Gebru et al.'s 2018 "Datasheets for Datasets" paper proposed a standardized documentation format for machine learning datasets, analogous to the component datasheets used in electronics manufacturing. The framework asks dataset creators to document motivation, composition, collection process, preprocessing, uses, distribution, and maintenance.

Despite widespread adoption in the academic ML community, datasheets remain rare in the AI training data vendor market. This is partly an incentive problem: comprehensive documentation takes time and resources, and current clients have not historically required it as a condition of contract. This is changing as labs mature their procurement processes and as regulatory frameworks begin to require training data documentation.

For human evaluation datasets specifically, the Datasheets for Datasets framework requires extension to capture evaluator-specific provenance that the original framework did not anticipate. The key additions are expert credential documentation at the individual evaluation level, rubric version tracking, and IRR reporting by task category.

The Reproducibility Requirement

Reproducibility in AI training data means something specific: given the documentation for a dataset, a second team should be able to produce a statistically equivalent dataset using the same methodology. This requires that the documentation captures not just what was done but how it was done in sufficient detail to replicate the process.

Current vendor practice falls short of this standard in three consistent ways. Rubrics are delivered at contract signing and never updated in documentation even when they are updated in practice. Expert pool composition is described in aggregate terms ("senior engineers with FAANG experience") that do not permit replication. And quality filtering criteria — which evaluations were excluded and why — are often applied informally without documentation.

The immutable audit log is the technical foundation for reproducibility. If every evaluation decision is logged with evaluator identifier, rubric version, timestamp, and raw scores before consensus aggregation, the dataset can be exactly reconstructed from the log. This is a technical requirement, not a documentation requirement — it must be built into the data collection system from the start.
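
A hash chain is one common way to make an append-only log tamper-evident. The sketch below is illustrative: the field names and sample entries are hypothetical. Chaining makes modification detectable rather than impossible, so a production system would pair it with write-once storage.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64


def _entry_hash(prev_hash, payload):
    """Hash the previous entry's hash together with a canonical encoding of the payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(log, payload):
    """Append a payload to the log, chaining it to the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS_HASH
    log.append({"payload": payload, "prev_hash": prev_hash,
                "hash": _entry_hash(prev_hash, payload)})


def verify_chain(log):
    """Return True only if every entry still matches its recorded hash and predecessor."""
    prev_hash = GENESIS_HASH
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _entry_hash(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True


# Illustrative entries: evaluator id, rubric version, timestamp, raw score before aggregation.
log = []
append_entry(log, {"evaluator": "eval-17", "rubric": "clinical-v3",
                   "ts": "2025-02-01T14:30:00+00:00", "raw_score": 4})
append_entry(log, {"evaluator": "eval-22", "rubric": "clinical-v3",
                   "ts": "2025-02-01T14:31:10+00:00", "raw_score": 3})
print(verify_chain(log))            # True

log[0]["payload"]["raw_score"] = 5  # simulate tampering with a historical entry
print(verify_chain(log))            # False
```

Because each entry commits to its predecessor, changing any historical score invalidates every hash after it, which is what lets a client audit the log independently.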

IP Ownership and Pre-Contract Work

One of the most consequential provenance gaps in current vendor practice is the documentation of when evaluation work was performed relative to contract execution. As documented in the Mercor Labor Dossier, a significant number of evaluators at current vendors begin producing work before their contracts — and specifically before their IP assignment agreements — are executed.

Under US intellectual property law, an independent contractor generally retains ownership of work product created before a valid IP assignment agreement is executed. This means that training datasets containing evaluations produced in the pre-contract period contain data of uncertain IP ownership. The lab that paid for those evaluations may not hold clean legal title to them.

Complete provenance documentation includes the contract execution timestamp for every evaluator and flags any evaluations produced before contract execution. This is a legal compliance requirement, not a quality measurement issue — but it requires provenance infrastructure to track.
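
The flagging step described here reduces to a timestamp comparison. The sketch below assumes contract execution times and evaluation times are stored as ISO 8601 strings with explicit UTC offsets; the identifiers and dates are hypothetical.

```python
from datetime import datetime


def pre_contract_evaluations(contract_executed_at, evaluations):
    """Return ids of evaluations timestamped before the evaluator's contract was executed."""
    flagged = []
    for evaluation in evaluations:
        executed = contract_executed_at.get(evaluation["evaluator_id"])
        # A missing contract record is treated as pre-contract work and flagged.
        if executed is None:
            flagged.append(evaluation["evaluation_id"])
            continue
        evaluated = datetime.fromisoformat(evaluation["evaluated_at"])
        if evaluated < datetime.fromisoformat(executed):
            flagged.append(evaluation["evaluation_id"])
    return flagged


# Illustrative data: eval-17 worked once before signing; eval-99 has no contract on file.
contracts = {"eval-17": "2025-02-03T09:00:00+00:00"}
evaluations = [
    {"evaluation_id": "e1", "evaluator_id": "eval-17", "evaluated_at": "2025-02-01T14:30:00+00:00"},
    {"evaluation_id": "e2", "evaluator_id": "eval-17", "evaluated_at": "2025-02-04T10:00:00+00:00"},
    {"evaluation_id": "e3", "evaluator_id": "eval-99", "evaluated_at": "2025-02-04T10:05:00+00:00"},
]
flagged_ids = pre_contract_evaluations(contracts, evaluations)
print(flagged_ids)
```

Treating a missing contract record as a flag, rather than skipping it, keeps the check conservative: the client reviews anything whose ownership cannot be shown.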

The EU AI Act Dimension

The EU AI Act, which entered into force in August 2024 and whose obligations apply in phases over the following years, establishes documentation requirements for high-risk AI systems that extend to training data. Systems classified as high-risk (including those used in medical, legal, educational, employment, and critical infrastructure contexts) must maintain documentation of training data sources, methodologies, and quality measures.

The documentation requirements are not yet fully specified for post-training data specifically, but the direction is clear: regulatory compliance for high-risk AI systems will require the kind of provenance documentation that almost no current training data vendor provides. Labs building AI systems in high-risk categories — which includes most of the applications that Anthropic, Google DeepMind, and Microsoft are developing — need training data provenance infrastructure in place before their regulatory audits, not after.

A Practical Implementation Framework

The following framework represents the minimum viable provenance infrastructure for a post-training data engagement. Each element is technically achievable with current tooling — the gap is not capability but implementation discipline.

  • Immutable audit logging: Every evaluation event logged to an append-only store. No deletion or modification of historical log entries by any system actor.
  • Rubric version control: Every rubric change creates a new version with a mandatory changelog. Affected historical data is automatically flagged when a rubric changes.
  • Expert credential manifest: Per-dataset documentation of credential type verified, verification date, and verification method for every evaluator who contributed to the dataset.
  • Contract execution tracking: Timestamp of contract execution for every evaluator. Evaluations produced before contract execution flagged for client review.
  • IRR reporting by task category: Not just overall IRR — IRR broken down by task type, with flagging of categories that fall below domain-specific thresholds.
  • Data card delivery: Every dataset delivered with a complete data card meeting the Datasheets for Datasets standard, extended for human evaluation data.

What Constitutional AI Alignment Actually Requires from a Training Data Vendor

Anthropic's Constitutional AI (CAI) framework is one of the most rigorous published approaches to aligning large language model behavior with a set of explicitly stated values. It uses a set of principles — a "constitution" — to guide both the generation of training data and the reinforcement learning process that shapes model behavior.

What is less often discussed is the extent to which the quality of CAI-aligned training data depends on the human evaluators who produce the preference signals that drive RLHF. A constitutional principle that is not operationalized in an evaluation rubric that human evaluators can apply consistently is a principle that may not reliably shape model behavior.

Constitutional AI is not a training algorithm that eliminates the need for high-quality human evaluation data. It is a framework that makes the quality requirements for that data more demanding, not less — because the signal being produced needs to reflect not just factual accuracy but principled judgment about competing values.

What CAI Asks of Human Evaluators

In a standard RLHF pipeline, human evaluators are asked to compare two model outputs and indicate which one is better according to specified criteria — typically helpfulness, harmlessness, and honesty. This is a comparative judgment task that can be operationalized through relatively straightforward rubrics.

CAI evaluation asks something more complex. Evaluators must assess model outputs against a set of principles that can conflict with each other in specific cases. A model output that is maximally helpful to the user may require providing information that could, in some contexts, cause harm. A model output that is maximally harmless may be unhelpful to the point of being useless. Constitutional evaluation requires evaluators to make principled judgments about how to trade off competing values in specific contexts.

This is a task that requires evaluators who understand the constitutional principles, can reason about how they apply in edge cases, and can make consistent judgments about value trade-offs. It cannot be performed reliably by evaluators who are completing tasks as quickly as possible to maximize their per-task earnings, or by evaluators who have received a vague briefing on the rubric and are left to interpret it independently.

The Rubric Operationalization Gap

Constitutional principles are stated at a level of abstraction that requires operationalization before they can be used as evaluation rubrics. "Avoid being harmful" is a constitutional principle, not an evaluation criterion. It does not tell an evaluator what to do when a model output provides accurate information about a topic that could be used for harm by some users but is genuinely useful to the majority.

The operationalization of constitutional principles into task-specific rubrics is methodological work that requires both deep understanding of the principles and deep understanding of the specific task domain. A rubric for evaluating constitutional alignment in clinical advice responses must be developed differently from a rubric for evaluating constitutional alignment in responses to legal questions or in responses to requests for persuasive writing.

This operationalization work cannot be outsourced to the evaluators themselves. It requires collaborative rubric development between the lab's alignment researchers who understand the constitutional framework and the training data vendor's methodology team who understand how to translate abstract principles into measurable evaluation criteria.

The Calibration Requirement for Value-Laden Tasks

IRR is harder to achieve on constitutional evaluation tasks than on factual accuracy tasks, for principled reasons. When evaluators are asked to make judgments about competing values, genuine value disagreements among evaluators will produce lower kappa scores even when all evaluators are performing the task carefully and in good faith.

This creates a methodological challenge: some disagreement in constitutional evaluations reflects genuine moral disagreement rather than rubric misapplication or evaluator error. The appropriate response is not to force agreement through rubric tightening — forcing agreement on genuinely contested value questions produces false signal that misrepresents the underlying diversity of human moral judgment.

The appropriate response is to distinguish, in the calibration process, between disagreements that reflect rubric ambiguity and disagreements that reflect genuine value diversity. Rubric ambiguity disagreements should be resolved through rubric clarification. Value diversity disagreements should be preserved in the training data as a signal — with appropriate metadata flagging — rather than collapsed into an artificial consensus.
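
In data terms, preserving disagreement means storing the vote distribution and the calibration outcome instead of a majority label. The sketch below is one possible shape for that aggregation step. The status strings and the two disagreement categories are hypothetical names for the distinction drawn above, and the classification itself is assumed to come from a human calibration review.

```python
from collections import Counter


def aggregate_item(votes, disagreement_type=None):
    """Aggregate evaluator votes for one item without forcing consensus.

    disagreement_type is assigned during calibration review and is expected to be
    "rubric_ambiguity", "value_diversity", or None if not yet reviewed.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    counts = Counter(votes)
    distribution = {label: count / len(votes) for label, count in sorted(counts.items())}
    if len(counts) == 1:
        return {"label": votes[0], "distribution": distribution, "status": "consensus"}
    if disagreement_type == "rubric_ambiguity":
        status = "re-evaluate after rubric update"
    elif disagreement_type == "value_diversity":
        status = "preserved as contested"
    else:
        status = "needs calibration review"
    # No single label is emitted for split items; the distribution is the signal.
    return {"label": None, "distribution": distribution, "status": status}


contested = aggregate_item(["A", "A", "B"], "value_diversity")
print(contested)
```

Downstream training code can then decide how to use contested items, for example as soft labels, rather than inheriting a consensus that never existed.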

Security Requirements for CAI Training Data

Constitutional AI training data has security requirements beyond standard RLHF data. The evaluation rubrics that operationalize Anthropic's constitutional principles represent proprietary alignment methodology. The patterns of evaluator disagreement in CAI evaluation reveal information about where the constitutional framework produces ambiguous guidance. Both are competitively sensitive.

A vendor that holds CAI evaluation rubrics in a shared infrastructure environment with data from competing labs creates a specific risk: that the methodology Anthropic has developed for aligning Claude's behavior could be exposed to competitors through a security breach. This is not a hypothetical risk — it is exactly what the Mercor breach demonstrated was possible.

CAI training data should be held in an architecture where the rubrics, the evaluation data, and the evaluator-level signals are all stored in a client-isolated environment with access controls that prevent any cross-client exposure. The vendor should not have the ability to use insights from one lab's CAI evaluation data to inform rubric development for another lab — even indirectly, even in aggregate form.

What a CAI-Ready Training Data Vendor Looks Like

The requirements are specific and demanding. The vendor must have the methodology capability to translate constitutional principles into task-specific evaluation rubrics in collaboration with Anthropic's alignment team. It must have the evaluator management infrastructure to run the calibration sessions required to achieve acceptable IRR on value-laden comparative tasks. It must have the data architecture to preserve principled evaluator disagreements as signal rather than suppressing them as noise. And it must have the security architecture to hold CAI training data in isolation from every other client relationship.

Most of these requirements are not met by the current vendor market — not because the technical capabilities do not exist, but because the market has not historically been asked to meet them. Labs that care about the quality of their constitutional alignment training data should be asking whether their vendor meets these requirements, and should be willing to change vendors if the answer is no.