Data Methodology — Provenance AI

The Production Pipeline

From rubric design to delivered dataset.

The Provenance AI production pipeline was designed around the questions that DeepMind, Anthropic, and every serious AI research team eventually ask: Can we reproduce this dataset exactly? Can we trace every evaluation back to its source? Can we verify the credentials of every expert who touched it? The answer at every vendor in the current market is no. Our answer is yes, by architecture.

Foundation

Rubric Co-Design with Your Research Team

Every task type begins with a rubric co-design session involving the lab's research team. Acceptance criteria are explicit, measurable, and written before any expert sees an assignment. Rubrics are version-controlled — every change creates a new version with a changelog, and historical data produced under prior versions is flagged automatically. We do not use "feels good" quality thresholds. We use operationalized acceptance criteria with worked examples for each scoring level.
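
As a rough illustration of the versioning rule above, the sketch below keeps a changelog per rubric version and flags records produced under any prior version. The class name, field names, and changelog text are hypothetical and do not describe Provenance AI's actual system.

```python
from dataclasses import dataclass, field


@dataclass
class RubricHistory:
    """Hypothetical version log for one task type's rubric."""
    versions: list = field(default_factory=list)  # (version, changelog) pairs

    def publish(self, version: str, changelog: str) -> None:
        # Every change creates a new version with its own changelog entry.
        self.versions.append((version, changelog))

    @property
    def current(self) -> str:
        return self.versions[-1][0]

    def is_stale(self, produced_under: str) -> bool:
        # Data produced under any version other than the current one is flagged.
        return produced_under != self.current


history = RubricHistory()
history.publish("v2.2", "Illustrative entry: initial scoring levels")
history.publish("v2.3", "Illustrative entry: added worked examples")

records = [
    {"id": 1, "rubric_version": "v2.2"},
    {"id": 2, "rubric_version": "v2.3"},
]
flagged = [r["id"] for r in records if history.is_stale(r["rubric_version"])]
print(flagged)
```

Here only the record scored under the older version is flagged, so downstream consumers can decide whether to re-score or exclude it.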

Expert Selection

Credential Verification Before Task Assignment

Expert credentials are verified before any task assignment — not after. For medical domain tasks: license verification against state medical board records, specialty certification confirmation, current DEA registration status where applicable. For legal tasks: active bar admission verification, practice area confirmation. For financial tasks: CFA/CPA credential verification. Background checks are valid for 30 days maximum — if a check expires before deployment, renewal is a hard system gate that blocks task assignment until complete.
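
The hard gate described above could be sketched as follows. The function name and arguments are hypothetical, and treating day 30 itself as still valid is an assumption the text does not settle.

```python
from datetime import date, timedelta

BACKGROUND_CHECK_VALIDITY = timedelta(days=30)  # maximum validity stated in the text


def can_assign(credentials_verified: bool, check_date: date, today: date) -> bool:
    """Hypothetical gate: both conditions must hold before any task is assigned."""
    # Assumption: a check that is exactly 30 days old still counts as valid.
    check_is_current = today - check_date <= BACKGROUND_CHECK_VALIDITY
    return credentials_verified and check_is_current


print(can_assign(True, date(2026, 3, 1), date(2026, 3, 31)))  # check is 30 days old
print(can_assign(True, date(2026, 3, 1), date(2026, 4, 1)))   # check is 31 days old
```

Because the gate returns a single boolean, an expired check blocks assignment in the same way as an unverified credential.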

Calibration

Gold Standard Practice Before Live Production

No expert produces live data until they have demonstrated rubric comprehension by scoring at or above the minimum IRR threshold on a set of gold standard practice tasks with known correct answers. New experts complete a minimum of 50 gold standard tasks across a representative sample of the task distribution before live assignment. Experts who score below the minimum threshold receive additional calibration training and must re-qualify before live production. This eliminates the "learning curve" problem that produces the lowest-quality data in the first weeks of any engagement.
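
A minimal sketch of the qualification decision, assuming the expert's agreement with the gold answers has already been computed as a kappa value. The status names and the function are hypothetical. The 50-task minimum comes from the text above.

```python
MIN_GOLD_TASKS = 50  # minimum number of gold standard practice tasks, per the text


def calibration_status(tasks_completed: int, kappa_vs_gold: float, min_kappa: float) -> str:
    """Hypothetical helper returning an expert's calibration state."""
    if tasks_completed < MIN_GOLD_TASKS:
        return "in_practice"   # not enough practice tasks yet
    if kappa_vs_gold < min_kappa:
        return "recalibrate"   # additional training, then re-qualify
    return "qualified"         # eligible for live assignment


print(calibration_status(32, 0.91, 0.80))
print(calibration_status(50, 0.74, 0.80))
print(calibration_status(60, 0.85, 0.80))
```

The task-count check runs first so that a high score on too few practice tasks cannot qualify an expert early.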

Production

Continuous IRR Monitoring During Active Tasks

During live production, 5% of all tasks are gold standard injections — tasks with known correct answers used to monitor for expert quality drift continuously. If any expert's gold standard performance drops below threshold during an active engagement, their tasks are flagged for senior review before delivery. Every evaluation is scored for inter-rater reliability in real time. Batches below the contractual minimum IRR threshold are held and re-evaluated — they are never delivered to the lab.
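
The injection and drift check above might look roughly like this. The function names are hypothetical, shuffling the batch so gold tasks cannot be spotted by position is an assumption, and the gold standard accuracy threshold is left as a parameter because the text does not state it.

```python
import random

GOLD_RATE = 0.05  # share of all tasks that are gold standard injections, per the text


def inject_gold(live_tasks, gold_pool, rng):
    """Mix gold tasks into a batch so they make up about GOLD_RATE of the total."""
    n_gold = round(len(live_tasks) * GOLD_RATE / (1 - GOLD_RATE))
    batch = list(live_tasks) + rng.sample(gold_pool, n_gold)
    rng.shuffle(batch)  # assumption: position should not reveal which tasks are gold
    return batch


def needs_senior_review(gold_results, threshold):
    """Flag an expert whose accuracy on injected gold tasks is below threshold."""
    if not gold_results:
        return False  # no gold tasks seen yet, nothing to judge
    return sum(gold_results) / len(gold_results) < threshold


live = [("live", i) for i in range(95)]
gold = [("gold", i) for i in range(20)]
batch = inject_gold(live, gold, random.Random(0))
print(len(batch), sum(kind == "gold" for kind, _ in batch))
```

With 95 live tasks, five gold tasks are added, so gold makes up 5% of the 100-task batch.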

Delivery

Dataset + Full Provenance Report

Every delivered dataset includes: the annotated data in the agreed format, a full data card meeting Datasheets for Datasets standards, an expert credential manifest (name-anonymized but credential-level verified), IRR scores by task category, rubric version documentation, and a complete audit trail of every evaluation decision. The dataset is exactly reproducible from the audit trail — a core requirement for research-grade data.
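
The text does not say how the audit trail is stored. One common way to make such a log tamper-evident is a hash chain, where each entry's hash covers the previous entry's hash. The sketch below is an illustration under that assumption only. The entry fields and function names are hypothetical.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder hash for the first entry's predecessor


def _entry_hash(prev_hash, decision):
    # Serialize with sorted keys so the same decision always hashes the same way.
    payload = json.dumps(decision, sort_keys=True)
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_entry(log, decision):
    """Append one evaluation decision, chaining it to the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS_HASH
    log.append({
        "decision": decision,
        "prev_hash": prev_hash,
        "hash": _entry_hash(prev_hash, decision),
    })


def verify(log):
    """Recompute the chain; any edited or reordered entry breaks it."""
    prev_hash = GENESIS_HASH
    for entry in log:
        expected = _entry_hash(prev_hash, entry["decision"])
        if entry["prev_hash"] != prev_hash or entry["hash"] != expected:
            return False
        prev_hash = entry["hash"]
    return True


log = []
append_entry(log, {"item": 1, "expert": "E-01", "score": 4, "rubric_version": "v2.3"})
append_entry(log, {"item": 1, "expert": "E-02", "score": 4, "rubric_version": "v2.3"})
print(verify(log))
```

Replaying the verified entries in order is what would allow a dataset to be rebuilt from its log. Changing any recorded score afterwards causes `verify` to return `False`.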

Quality Standards

Inter-rater reliability thresholds by domain.

Inter-rater reliability (IRR) is measured using Cohen's Kappa for categorical tasks and intraclass correlation coefficient (ICC) for ordinal tasks. The following minimum thresholds are contractual obligations — not internal targets. Batches that do not meet them are re-evaluated at our cost before delivery.

Domain	Task Type	Minimum Cohen's Kappa	Target Cohen's Kappa
Medical / Clinical	Diagnostic reasoning evaluation, clinical accuracy scoring	κ ≥ 0.80	κ ≥ 0.88
Legal	Legal reasoning accuracy, case outcome prediction, statute interpretation	κ ≥ 0.78	κ ≥ 0.85
Software Engineering	Code correctness, security vulnerability identification, code quality scoring	κ ≥ 0.82	κ ≥ 0.90
Financial	Financial analysis accuracy, regulatory compliance scoring	κ ≥ 0.76	κ ≥ 0.84
STEM Research	Scientific claim verification, methodology evaluation	κ ≥ 0.75	κ ≥ 0.83
General Reasoning	Instruction following, factual accuracy, helpfulness scoring	κ ≥ 0.74	κ ≥ 0.82
Creative / Writing	Quality assessment, style adherence, coherence scoring	κ ≥ 0.70	κ ≥ 0.78
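
For reference, Cohen's Kappa for two raters is κ = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement rate and p_e is the agreement expected by chance from each rater's label frequencies. The sketch below computes it and compares the result with the minimums in the table above. The `batch_decision` helper and its return values are hypothetical, and the sketch covers two raters only. How agreement is aggregated across more than two raters is not specified in the text.

```python
from collections import Counter

# Contractual minimum kappa per domain, copied from the table above.
MINIMUM_KAPPA = {
    "Medical / Clinical": 0.80,
    "Legal": 0.78,
    "Software Engineering": 0.82,
    "Financial": 0.76,
    "STEM Research": 0.75,
    "General Reasoning": 0.74,
    "Creative / Writing": 0.70,
}


def cohens_kappa(rater_a, rater_b):
    """Cohen's Kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def batch_decision(domain, rater_a, rater_b):
    """Hypothetical helper: hold any batch below the domain's minimum kappa."""
    kappa = cohens_kappa(rater_a, rater_b)
    return "deliver" if kappa >= MINIMUM_KAPPA[domain] else "hold"


print(cohens_kappa([1, 1, 0, 0], [1, 1, 0, 1]))
print(batch_decision("Legal", [1, 1, 0, 0], [1, 1, 0, 1]))
```

In the example the raters agree on three of four items (p_o = 0.75) while chance agreement is p_e = 0.5, so κ = 0.5 and the batch is held.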

Expert Network

Credentialing standards by domain.

The following credential minimums apply to all active domain experts. Human-reviewed credentialing is the primary gate — AI resume screening is a first-pass efficiency tool only, and any automated flag is reviewed by a domain expert manager before any rejection decision is made.

Medical / Clinical

MD, DO, RN with specialty certification, or PhD in biomedical field with clinical research experience. Active license verification required.

Min. κ ≥ 0.80 · License verified monthly

Legal

JD with active bar admission in relevant jurisdiction. Minimum 3 years practice experience in relevant subject matter. Federal clerkship or BigLaw background preferred.

Min. κ ≥ 0.78 · Bar status verified quarterly

Financial

CFA, CPA, or minimum 5 years at a recognized financial institution in a relevant analytical role. Series licensing verified where applicable.

Min. κ ≥ 0.76 · Credentials verified at onboarding

Software Engineering

Senior or Staff engineer level with minimum 6 years experience. FAANG-equivalent or recognized open-source contribution history. GitHub profile review required.

Min. κ ≥ 0.82 · Portfolio reviewed at onboarding

STEM Research

PhD or active postdoctoral researcher in relevant field. Publication record reviewed. Subject matter expertise confirmed against task domain.

Min. κ ≥ 0.75 · Publication record verified

Multilingual

Native or C2-level proficiency in target language with professional translation or linguistic analysis experience. In-language calibration required.

Min. κ ≥ 0.74 · Proficiency tested at onboarding

Sample Data Card

What every delivery includes.

Every dataset delivered by Provenance AI includes a data card in the following format, meeting Datasheets for Datasets and Data Nutrition Label standards. The following is an anonymized representative example.

// Provenance AI Data Card: Representative Sample
dataset_id: PAI-2026-MED-0042
task_type: Clinical reasoning evaluation, diagnostic accuracy scoring
client_id: [REDACTED, client-isolated]
rubric_version: v2.3 (changelog attached)
total_evaluations: 5,000
evaluations_per_item: 3 (consensus scoring)
expert_count: 24 active evaluators
credential_tier: MD/DO, active license verified
background_check_date: All within 30-day validity window at deployment
irr_cohens_kappa: κ = 0.847 (exceeds contractual minimum κ ≥ 0.80)
gold_standard_accuracy: 94.2% (gold standard injection: 5% of tasks)
collection_period: 2026-03-01 to 2026-04-15
audit_log_id: Full immutable log, available on request
reproducibility: Dataset exactly reproducible from audit trail
known_limitations: Expert pool skews toward US-licensed practitioners. International clinical reasoning variations may be underrepresented.
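
A lab receiving a card like the one above could load it into a mapping and check the reported kappa against the stated minimum. This is a minimal sketch that assumes each field is a key and a value separated by the first colon on the line. The function names are hypothetical and the excerpt includes only a few of the sample fields.

```python
import re

SAMPLE_CARD = """\
// Provenance AI Data Card (excerpt)
dataset_id: PAI-2026-MED-0042
irr_cohens_kappa: κ = 0.847 (exceeds contractual minimum κ ≥ 0.80)
collection_period: 2026-03-01 to 2026-04-15
"""


def parse_card(text):
    """Parse key/value lines into a dict, skipping blanks and // comments."""
    card = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("//"):
            continue
        key, _, value = line.partition(":")  # split on the first colon only
        card[key.strip()] = value.strip()
    return card


def kappa_meets_minimum(card):
    """Compare the reported kappa with the minimum quoted in the same field."""
    # Assumption: the field holds exactly two decimals, reported then minimum.
    reported, minimum = (float(x) for x in re.findall(r"\d+\.\d+", card["irr_cohens_kappa"]))
    return reported >= minimum


card = parse_card(SAMPLE_CARD)
print(card["dataset_id"], kappa_meets_minimum(card))
```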

Training data that meets research publication standards, not just delivery standards.
