OPT-Code Evaluation Protocol

Inputs:
1) System description (code or prose)
2) Candidate OPT-Code line
3) Candidate rationale

Evaluation pass:
The prompt evaluator returns:
- Verdict: PASS / WEAK_PASS / FAIL
- Score: 0–100
- Issue categories: Format, Mechanism, Parallelism/Pipelines, Composition, Attributes
- Summary: short explanation

Acceptance threshold:
PASS or WEAK_PASS with score >= 70.

Double annotation:
To improve reliability:
- Run classification with Model A and Model B (or two runs of the same model)
- Evaluate both independently

Metrics:
- Exact-match OPT (binary)
- Jaccard similarity on root sets
- Levenshtein distance between OPT-Code strings
- Weighted mechanism agreement (semantic distances)

Adjudication:
If A and B differ substantially:
- Provide both classifications and evaluator reports to an adjudicator model
- Adjudicator chooses the better one OR synthesizes a new one
- Re-run evaluator on the adjudicated OPT-Code

Quality metrics:
- Evaluator pass rate
- Inter-model consensus rate
- Root-level confusion matrix
- Parallelism/pipeline misclassification rate

Longitudinal tracking:
Archive all cases with metadata (system description, candidate codes, verdicts, adjudications, timestamps, model versions) to track drift and systematic biases.
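
Example (illustrative sketch):
The acceptance check and the first three double-annotation metrics could be computed roughly as below. This is a minimal sketch, not the protocol's reference implementation. The names `Evaluation`, `is_accepted`, `exact_match`, `jaccard`, and `levenshtein` are hypothetical. The sketch assumes the score threshold applies to both PASS and WEAK_PASS, and that root sets have already been extracted from each OPT-Code by the caller, since the OPT-Code syntax is not defined here. Weighted mechanism agreement is left out because it depends on a semantic distance table this protocol does not specify.

```python
from dataclasses import dataclass


@dataclass
class Evaluation:
    """Hypothetical container for the evaluator's verdict and score."""
    verdict: str  # "PASS", "WEAK_PASS", or "FAIL"
    score: int    # 0-100


def is_accepted(ev: Evaluation, threshold: int = 70) -> bool:
    # Assumption: the score threshold applies to PASS and WEAK_PASS alike.
    return ev.verdict in ("PASS", "WEAK_PASS") and ev.score >= threshold


def exact_match(code_a: str, code_b: str) -> bool:
    # Binary exact-match metric on the raw OPT-Code strings.
    return code_a == code_b


def jaccard(roots_a, roots_b) -> float:
    # Jaccard similarity |A ∩ B| / |A ∪ B| on caller-supplied root sets.
    a, b = set(roots_a), set(roots_b)
    if not a and not b:
        return 1.0  # Convention chosen here: two empty sets count as identical.
    return len(a & b) / len(a | b)


def levenshtein(s: str, t: str) -> int:
    # Classic dynamic-programming edit distance, keeping one previous row.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]
```

A disagreement rule for "differ substantially" (for example, a Jaccard cutoff on root sets) would be built on top of these functions; the cutoff value itself is a tuning choice the protocol leaves open.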