Data Methodology
Training data that meets research publication standards, not just delivery standards.
Every dataset delivered by Provenance AI comes with a full methodology report, expert credential manifest, inter-rater reliability scores, and a complete audit trail. This page documents exactly how we produce it.
The Production Pipeline
From rubric design to delivered dataset.
The Provenance AI production pipeline was designed around three questions that DeepMind, Anthropic, and every serious AI research team eventually ask: Can we reproduce this dataset exactly? Can we trace every evaluation back to its source? Can we verify the credentials of every expert who touched it? The answer at every vendor in the current market is no. Our answer is yes, by architecture.
1
Foundation
Rubric Co-Design with Your Research Team
Every task type begins with a rubric co-design session involving the lab's research team. Acceptance criteria are explicit, measurable, and written before any expert sees an assignment. Rubrics are version-controlled — every change creates a new version with a changelog, and historical data produced under prior versions is flagged automatically. We do not use "feels good" quality thresholds. We use operationalized acceptance criteria with worked examples for each scoring level.
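The versioning rule above can be pictured with a minimal Python sketch. The names `RubricVersion`, `RubricRegistry`, `publish`, and `is_stale` are hypothetical and chosen only for illustration; they do not describe the production system.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RubricVersion:
    version: int
    criteria: dict   # scoring level -> operationalized acceptance criterion
    changelog: str   # what changed relative to the previous version


@dataclass
class RubricRegistry:
    """Hypothetical append-only store of rubric versions."""
    versions: list = field(default_factory=list)

    def publish(self, criteria: dict, changelog: str) -> RubricVersion:
        # Every change creates a new version; nothing is edited in place.
        new = RubricVersion(len(self.versions) + 1, dict(criteria), changelog)
        self.versions.append(new)
        return new

    @property
    def current(self) -> int:
        return self.versions[-1].version

    def is_stale(self, produced_under: int) -> bool:
        # Data scored under an earlier version is flagged, not discarded.
        return produced_under < self.current


registry = RubricRegistry()
registry.publish({1: "Misses the primary diagnosis"}, "Initial rubric")
registry.publish({1: "Misses the primary diagnosis or cites no evidence"},
                 "Tightened level 1 wording")
print(registry.is_stale(produced_under=1))  # data from version 1 is flagged
```

Keeping versions append-only is what lets a later reader tell which criteria were in force when a given evaluation was made.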
2
Expert Selection
Credential Verification Before Task Assignment
Expert credentials are verified before any task assignment — not after. For medical domain tasks: license verification against state medical board records, specialty certification confirmation, current DEA registration status where applicable. For legal tasks: active bar admission verification, practice area confirmation. For financial tasks: CFA/CPA credential verification. Background checks are valid for 30 days maximum — if a check expires before deployment, renewal is a hard system gate that blocks task assignment until complete.
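A hard gate of this kind can be sketched in a few lines of Python. The function name `can_assign` and its parameters are hypothetical, and the sketch assumes a check is still valid on day 30 exactly, which the text does not specify.

```python
from datetime import date, timedelta

MAX_CHECK_AGE = timedelta(days=30)  # validity window stated in the text


def can_assign(credential_verified: bool, check_date: date, today: date) -> bool:
    """Return True only if both gates pass on the assignment date."""
    age = today - check_date
    # Assumption: a check dated in the future is treated as invalid.
    check_is_current = timedelta(0) <= age <= MAX_CHECK_AGE
    return credential_verified and check_is_current


print(can_assign(True, date(2026, 3, 1), date(2026, 3, 31)))  # day 30
print(can_assign(True, date(2026, 3, 1), date(2026, 4, 1)))   # day 31
```

Because the function returns a single boolean that the caller must check before assignment, an expired check blocks work rather than merely raising a warning.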
3
Calibration
Gold Standard Practice Before Live Production
No expert produces live data until they have demonstrated rubric comprehension above the minimum IRR threshold on a set of gold standard practice tasks with known correct answers. New experts complete a minimum of 50 gold standard tasks across a representative sample of the task distribution before live assignment. Experts who score below the minimum threshold receive additional calibration training and must re-qualify before live production. This eliminates the "learning curve" problem that produces the lowest-quality data in the first weeks of any engagement.
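The qualification rule can be expressed as a small decision function. This is a sketch under stated assumptions: the names are hypothetical, the agreement score is taken as already computed against the gold answers, and a score exactly at the threshold is assumed to pass.

```python
from enum import Enum

MIN_GOLD_TASKS = 50  # minimum number of gold standard tasks stated in the text


class CalibrationStatus(Enum):
    QUALIFIED = "qualified"
    INCOMPLETE = "incomplete"      # fewer than 50 gold tasks completed
    RECALIBRATE = "recalibrate"    # below threshold: retrain, then re-qualify


def calibration_status(gold_tasks_done: int, agreement: float,
                       min_threshold: float) -> CalibrationStatus:
    """Decide whether an expert may move to live production."""
    if gold_tasks_done < MIN_GOLD_TASKS:
        return CalibrationStatus.INCOMPLETE
    if agreement < min_threshold:
        return CalibrationStatus.RECALIBRATE
    return CalibrationStatus.QUALIFIED


print(calibration_status(50, 0.85, 0.80).value)
```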
4
Production
Continuous IRR Monitoring During Active Tasks
During live production, 5% of all tasks are gold standard injections — tasks with known correct answers used to monitor for expert quality drift continuously. If any expert's gold standard performance drops below threshold during an active engagement, their tasks are flagged for senior review before delivery. Every evaluation is scored for inter-rater reliability in real time. Batches below the contractual minimum IRR threshold are held and re-evaluated — they are never delivered to the lab.
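To make the 5% injection rate and the drift check concrete, here is a minimal Python sketch. The names `gold_count` and `DriftMonitor`, the rolling window of recent gold results, and the use of plain accuracy as the performance measure are all assumptions made for illustration; the text does not say how gold standard performance is scored.

```python
from collections import deque

GOLD_RATE = 0.05  # share of all delivered tasks that are gold injections


def gold_count(live_tasks: int, rate: float = GOLD_RATE) -> int:
    """Gold tasks needed so that gold / (live + gold) is about `rate`."""
    return round(live_tasks * rate / (1 - rate))


class DriftMonitor:
    """Hypothetical rolling check of one expert's recent gold results."""

    def __init__(self, threshold: float, window: int = 20):
        self.threshold = threshold
        self.results = deque(maxlen=window)

    def record(self, correct: bool) -> bool:
        """Record one gold result; return True if the expert should be flagged."""
        self.results.append(correct)
        if len(self.results) < self.results.maxlen:
            return False  # assumption: wait for a full window before judging
        accuracy = sum(self.results) / len(self.results)
        return accuracy < self.threshold


print(gold_count(950))  # gold tasks to mix into 950 live tasks
```

A flag from `record` would route that expert's pending tasks to senior review, matching the hold-before-delivery behavior described above.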
5
Delivery
Dataset + Full Provenance Report
Every delivered dataset includes: the annotated data in the agreed format, a full data card meeting Datasheets for Datasets standards, an expert credential manifest (name-anonymized but credential-level verified), IRR scores by task category, rubric version documentation, and a complete audit trail of every evaluation decision. The dataset is exactly reproducible from the audit trail — a core requirement for research-grade data.
Quality Standards
Inter-rater reliability thresholds by domain.
Inter-rater reliability (IRR) is measured using Cohen's kappa for categorical tasks and the intraclass correlation coefficient (ICC) for ordinal tasks. The following minimum thresholds are contractual obligations, not internal targets. Batches that do not meet them are re-evaluated at our cost before delivery.
| Domain | Task Type | Minimum Cohen's Kappa | Target Cohen's Kappa |
| --- | --- | --- | --- |
| Medical / Clinical | Diagnostic reasoning evaluation, clinical accuracy scoring | κ ≥ 0.80 | κ ≥ 0.88 |
| Legal | Legal reasoning accuracy, case outcome prediction, statute interpretation | κ ≥ 0.78 | κ ≥ 0.85 |
| Software Engineering | Code correctness, security vulnerability identification, code quality scoring | κ ≥ 0.82 | κ ≥ 0.90 |
| Financial | Financial analysis accuracy, regulatory compliance scoring | κ ≥ 0.76 | κ ≥ 0.84 |
| STEM Research | Scientific claim verification, methodology evaluation | κ ≥ 0.75 | κ ≥ 0.83 |
| General Reasoning | Instruction following, factual accuracy, helpfulness scoring | κ ≥ 0.74 | κ ≥ 0.82 |
| Creative / Writing | Quality assessment, style adherence, coherence scoring | κ ≥ 0.70 | κ ≥ 0.78 |
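For readers who want to see how a batch would be checked against these minimums, the sketch below computes Cohen's kappa for two raters from first principles and compares it with the table. The function names `cohens_kappa` and `batch_decision` are hypothetical; the threshold values are copied from the table above.

```python
from collections import Counter

MIN_KAPPA = {
    "Medical / Clinical": 0.80,
    "Legal": 0.78,
    "Software Engineering": 0.82,
    "Financial": 0.76,
    "STEM Research": 0.75,
    "General Reasoning": 0.74,
    "Creative / Writing": 0.70,
}


def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(rater_a)
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: from each rater's own label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def batch_decision(domain: str, kappa: float) -> str:
    """Hold any batch below the contractual minimum for its domain."""
    return "deliver" if kappa >= MIN_KAPPA[domain] else "hold"


a = ["pass"] * 5 + ["fail"] * 5
b = ["pass"] * 4 + ["fail"] * 6
print(round(cohens_kappa(a, b), 3), batch_decision("Legal", 0.77))
```

Cohen's kappa is defined for exactly two raters. How scores are aggregated when an item receives more than two evaluations is not specified here, so the sketch covers the two-rater case only.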
Expert Network
Credentialing standards by domain.
The following credential minimums apply to all active domain experts. Human-reviewed credentialing is the primary gate — AI resume screening is a first-pass efficiency tool only, and any automated flag is reviewed by a domain expert manager before any rejection decision is made.
Medical / Clinical
MD, DO, RN with specialty certification, or PhD in biomedical field with clinical research experience. Active license verification required.
Min. κ ≥ 0.80 · License verified monthly
Legal
JD with active bar admission in relevant jurisdiction. Minimum 3 years practice experience in relevant subject matter. Federal clerkship or BigLaw background preferred.
Min. κ ≥ 0.78 · Bar status verified quarterly
Financial
CFA, CPA, or minimum 5 years at a recognized financial institution in a relevant analytical role. Series licensing verified where applicable.
Min. κ ≥ 0.76 · Credentials verified at onboarding
Software Engineering
Senior or Staff engineer level with minimum 6 years experience. FAANG-equivalent or recognized open-source contribution history. GitHub profile review required.
Min. κ ≥ 0.82 · Portfolio reviewed at onboarding
STEM Research
PhD or active postdoctoral researcher in relevant field. Publication record reviewed. Subject matter expertise confirmed against task domain.
Min. κ ≥ 0.75 · Publication record verified
Multilingual
Native or C2-level proficiency in target language with professional translation or linguistic analysis experience. In-language calibration required.
Min. κ ≥ 0.74 · Proficiency tested at onboarding
Sample Data Card
What every delivery includes.
Every dataset delivered by Provenance AI includes a data card in the following format, meeting Datasheets for Datasets and Data Nutrition Label standards. The following is an anonymized representative example.
dataset_id: PAI-2026-MED-0042
task_type: Clinical reasoning evaluation (diagnostic accuracy scoring)
client_id: [REDACTED, client-isolated]
rubric_version: v2.3 (changelog attached)
total_evaluations: 5,000
evaluations_per_item: 3 (consensus scoring)
expert_count: 24 active evaluators
credential_tier: MD/DO, active license verified
background_check_date: All within 30-day validity window at deployment
irr_cohens_kappa: κ = 0.847 (exceeds contractual minimum κ ≥ 0.80)
gold_standard_accuracy: 94.2% (gold standard injection: 5% of tasks)
collection_period: 2026-03-01 to 2026-04-15
audit_log_id: Full immutable log, available on request
reproducibility: Dataset exactly reproducible from audit trail
known_limitations: Expert pool skews toward US-licensed practitioners. International clinical reasoning variations may be underrepresented.
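A receiving team could sanity-check a card like the one above before accepting a delivery. The following is a minimal sketch, not a published schema: the field names are copied from the sample card, while the function `card_problems` and the choice to hold kappa as a plain float are assumptions.

```python
# Field names copied from the sample card; the checks themselves are illustrative.
REQUIRED_FIELDS = (
    "dataset_id", "task_type", "client_id", "rubric_version",
    "total_evaluations", "evaluations_per_item", "expert_count",
    "credential_tier", "background_check_date", "irr_cohens_kappa",
    "gold_standard_accuracy", "collection_period", "audit_log_id",
    "reproducibility", "known_limitations",
)


def card_problems(card: dict, min_kappa: float) -> list:
    """Return human-readable problems; an empty list means the card passes."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS
                if name not in card]
    kappa = card.get("irr_cohens_kappa")
    if kappa is not None and kappa < min_kappa:
        problems.append(f"kappa {kappa} is below contractual minimum {min_kappa}")
    return problems


# Placeholder values stand in for the full card; only two fields are real here.
sample = dict.fromkeys(REQUIRED_FIELDS, "see sample card")
sample.update(dataset_id="PAI-2026-MED-0042", irr_cohens_kappa=0.847)
print(card_problems(sample, min_kappa=0.80))
```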