OPT–Code Evaluation Protocol

Inputs:
1) System description (code or prose)
2) Candidate OPT–Code line
3) Candidate rationale

Evaluation pass:
The prompt evaluator returns:
    - Verdict: PASS / WEAK_PASS / FAIL
    - Score: 0–100
    - Issue categories: Format, Mechanism, Parallelism/Pipelines, Composition, Attributes
    - Summary: short explanation

Acceptance threshold: PASS or WEAK_PASS with score >= 70.
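
The acceptance rule can be sketched as a small predicate. This is a minimal sketch, not the protocol's reference implementation: the EvaluatorReport class and the is_accepted function are hypothetical names, and the code assumes the score threshold applies to both PASS and WEAK_PASS verdicts (the rule as written could also be read as applying it to WEAK_PASS only).

```python
from dataclasses import dataclass, field

ACCEPTED_VERDICTS = {"PASS", "WEAK_PASS"}
MIN_SCORE = 70  # threshold taken from the acceptance rule


@dataclass
class EvaluatorReport:
    """Hypothetical container for the four fields the evaluator returns."""
    verdict: str                                  # "PASS", "WEAK_PASS" or "FAIL"
    score: int                                    # 0-100
    issues: list = field(default_factory=list)    # e.g. ["Format", "Composition"]
    summary: str = ""


def is_accepted(report: EvaluatorReport) -> bool:
    # Assumption: the score floor is enforced for PASS as well as WEAK_PASS.
    return report.verdict in ACCEPTED_VERDICTS and report.score >= MIN_SCORE
```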

Double annotation:
To improve reliability:
    - Run classification with Model A and Model B (or two runs of the same model)
    - Evaluate both independently
Agreement metrics between the two runs:
    - Exact-match OPT-Code (binary)
    - Jaccard similarity on root sets
    - Levenshtein distance between OPT-Code strings
    - Weighted mechanism agreement (semantic distances)
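
The first three metrics are standard and can be sketched as follows. The function names are illustrative, and the code assumes the root sets have already been extracted from each OPT-Code line (the extraction step is not specified here). The weighted mechanism agreement is left out because it depends on a semantic distance table that this section does not define.

```python
def exact_match(code_a: str, code_b: str) -> int:
    """Binary agreement: 1 if the two code strings are identical, else 0."""
    return int(code_a.strip() == code_b.strip())


def jaccard(roots_a: set, roots_b: set) -> float:
    """Jaccard similarity |A & B| / |A | B| on two root sets."""
    if not roots_a and not roots_b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(roots_a & roots_b) / len(roots_a | roots_b)


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]
```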

Adjudication:
If A and B differ substantially:
    - Provide both classifications and evaluator reports to an adjudicator model
    - Adjudicator chooses the better one OR synthesizes a new one
    - Re-run evaluator on the adjudicated OPT-Code
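
One possible shape for the adjudication step is sketched below. All names are hypothetical: differs, adjudicate and evaluate are caller-supplied callables standing in for the disagreement test, the adjudicator model and the evaluator. The branch taken when the two runs do not differ substantially (keep the higher-scoring classification) is an assumption, since that case is not specified above.

```python
def resolve(code_a, report_a, code_b, report_b, differs, adjudicate, evaluate):
    """Return (final_code, final_report) for one doubly annotated case.

    Reports are assumed to be dicts with at least a "score" key.
    """
    if not differs(code_a, code_b):
        # Assumption: with no substantial disagreement, keep the higher-scoring run.
        if report_a["score"] >= report_b["score"]:
            return code_a, report_a
        return code_b, report_b

    # The adjudicator sees both classifications and both evaluator reports,
    # and either picks one or synthesizes a new code.
    final_code = adjudicate(code_a, report_a, code_b, report_b)

    # The adjudicated code is always re-evaluated.
    return final_code, evaluate(final_code)
```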

Quality metrics:
    - Evaluator pass rate
    - Inter-model consensus rate
    - Root-level confusion matrix
    - Parallelism/pipeline misclassification rate
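
As a rough sketch, these four metrics could be computed over a batch of archived cases as shown below. The record layout (keys accepted, code_a, code_b, root_a, root_b, issues) is hypothetical. The code also makes three assumptions: consensus means exact string agreement between the two runs, each run has one primary root per case, and a Parallelism/pipeline misclassification is any case whose evaluator issue categories include "Parallelism/Pipelines".

```python
from collections import Counter


def quality_metrics(cases):
    """Aggregate quality metrics over a non-empty list of case dicts."""
    n = len(cases)
    if n == 0:
        raise ValueError("need at least one case")

    passed = sum(1 for c in cases if c["accepted"])
    agreed = sum(1 for c in cases if c["code_a"] == c["code_b"])
    pipeline = sum(1 for c in cases if "Parallelism/Pipelines" in c["issues"])

    # Confusion matrix stored sparsely: (root from run A, root from run B) -> count.
    confusion = Counter((c["root_a"], c["root_b"]) for c in cases)

    return {
        "pass_rate": passed / n,
        "consensus_rate": agreed / n,
        "pipeline_misclassification_rate": pipeline / n,
        "root_confusion": confusion,
    }
```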

Longitudinal tracking:
Archive all cases with metadata (system description, candidate codes,
verdicts, adjudications, timestamps, model versions) to track drift and
systematic biases.
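
One simple way to archive cases is an append-only JSON Lines log, one record per case. The sketch below is illustrative only: the field names mirror the metadata listed above, but the record layout and the make_record and append_record helpers are assumptions, not a prescribed format.

```python
import json
from datetime import datetime, timezone


def make_record(system_description, candidate_codes, verdicts,
                adjudication, model_versions):
    """Build one archive record; the timestamp is UTC in ISO 8601 form."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "system_description": system_description,
        "candidate_codes": candidate_codes,   # e.g. {"A": "...", "B": "..."}
        "verdicts": verdicts,                 # evaluator reports per candidate
        "adjudication": adjudication,         # None if no adjudication was needed
        "model_versions": model_versions,
    }


def append_record(stream, record):
    """Write one record as a single JSON line to an open text stream."""
    stream.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Opening the archive file in append mode (for example open(path, "a", encoding="utf-8")) keeps earlier records untouched, which suits drift analysis across model versions.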