OPT-Code Evaluation Protocol

Inputs:
1) System description (code or prose)
2) Candidate OPT-Code line
3) Candidate rationale

Evaluation pass:
The prompt evaluator returns:
- Verdict: PASS / WEAK_PASS / FAIL
- Score: 0–100
- Issue categories: Format, Mechanism, Parallelism/Pipelines, Composition, Attributes
- Summary: short explanation

Acceptance threshold: PASS or WEAK_PASS with score >= 70.
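
A minimal sketch in Python of what the evaluator report and the acceptance check could look like; the class and field names are assumptions, and the sketch reads the score threshold as applying to both PASS and WEAK_PASS verdicts.

from dataclasses import dataclass, field
from typing import List

ISSUE_CATEGORIES = ("Format", "Mechanism", "Parallelism/Pipelines",
                    "Composition", "Attributes")

@dataclass
class EvaluatorReport:
    verdict: str                    # PASS / WEAK_PASS / FAIL
    score: int                      # 0-100
    issues: List[str] = field(default_factory=list)  # subset of ISSUE_CATEGORIES
    summary: str = ""               # short explanation

def accepted(report: EvaluatorReport, threshold: int = 70) -> bool:
    # Assumption: the score threshold applies to PASS as well as WEAK_PASS.
    return report.verdict in ("PASS", "WEAK_PASS") and report.score >= threshold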

Double annotation:
To improve reliability:
- Run classification with Model A and Model B (or two runs of the same model)
- Evaluate both independently

Metrics:
- Exact-match OPT (binary)
- Jaccard similarity on root sets
- Levenshtein distance between OPT-Code strings
- Weighted mechanism agreement (semantic distances)
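
The four agreement metrics could be computed roughly as below; how roots and mechanisms are extracted from an OPT-Code, and the semantic distance table, are assumptions rather than part of the protocol.

from typing import Dict, FrozenSet, Set

def exact_match(code_a: str, code_b: str) -> bool:
    # Binary exact-match agreement on the full OPT-Code strings.
    return code_a == code_b

def jaccard(roots_a: Set[str], roots_b: Set[str]) -> float:
    # Jaccard similarity of the two root sets.
    if not roots_a and not roots_b:
        return 1.0
    return len(roots_a & roots_b) / len(roots_a | roots_b)

def levenshtein(a: str, b: str) -> int:
    # Standard dynamic-programming edit distance between the code strings.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def weighted_mechanism_agreement(mech_a: str, mech_b: str,
                                 distances: Dict[FrozenSet[str], float]) -> float:
    # Assumed convention: distances maps unordered mechanism pairs to a
    # semantic distance in [0, 1]; agreement is 1 minus that distance.
    if mech_a == mech_b:
        return 1.0
    return 1.0 - distances.get(frozenset((mech_a, mech_b)), 1.0)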

Adjudication:
If A and B differ substantially:
- Provide both classifications and evaluator reports to an adjudicator model
- The adjudicator either chooses the better candidate or synthesizes a new one
- Re-run the evaluator on the adjudicated OPT-Code
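
One possible wiring of the adjudication step, assuming the adjudicator and evaluator are already wrapped as callables; the disagreement test (no exact match and low root-set overlap) and the 0.5 cut-off are illustrative assumptions, not part of the protocol.

from typing import Callable, NamedTuple, Set, Tuple

class Candidate(NamedTuple):
    code: str        # OPT-Code string
    roots: Set[str]  # root set extracted from the code

class Report(NamedTuple):
    verdict: str
    score: int

def resolve(cand_a: Candidate, report_a: Report,
            cand_b: Candidate, report_b: Report,
            adjudicator: Callable[..., Candidate],
            evaluator: Callable[[Candidate], Report],
            jaccard_floor: float = 0.5) -> Tuple[Candidate, Report]:
    union = cand_a.roots | cand_b.roots
    overlap = len(cand_a.roots & cand_b.roots) / len(union) if union else 1.0
    if cand_a.code == cand_b.code or overlap >= jaccard_floor:
        # No substantial disagreement: keep the better-scoring candidate.
        return max((cand_a, report_a), (cand_b, report_b), key=lambda p: p[1].score)
    # Substantial disagreement: the adjudicator sees both classifications and
    # both evaluator reports, then picks one or synthesizes a new OPT-Code;
    # the evaluator is re-run on the adjudicated result.
    final = adjudicator(cand_a, report_a, cand_b, report_b)
    return final, evaluator(final)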

Quality metrics:
- Evaluator pass rate
- Inter-model consensus rate
- Root-level confusion matrix
- Parallelism/pipeline misclassification rate
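
Aggregating these over archived cases might look as follows; the per-case field names (verdict, code_a/code_b, root_a/root_b, issues) are assumed, not prescribed by the protocol.

from collections import Counter
from typing import Iterable

def quality_metrics(cases: Iterable[dict]) -> dict:
    cases = list(cases)
    n = len(cases) or 1
    return {
        # Share of cases the evaluator accepted.
        "evaluator_pass_rate":
            sum(c["verdict"] in ("PASS", "WEAK_PASS") for c in cases) / n,
        # Share of cases where Model A and Model B produced identical OPT-Codes.
        "inter_model_consensus_rate":
            sum(c["code_a"] == c["code_b"] for c in cases) / n,
        # Confusion counts over (Model A root, Model B root) pairs,
        # assuming one primary root per candidate.
        "root_confusion":
            Counter((c["root_a"], c["root_b"]) for c in cases),
        # Share of cases where the evaluator flagged the Parallelism/Pipelines category.
        "parallelism_misclassification_rate":
            sum("Parallelism/Pipelines" in c["issues"] for c in cases) / n,
    }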

Longitudinal tracking:
Archive all cases with metadata (system description, candidate codes, verdicts, adjudications, timestamps, model versions) to track drift and systematic biases.
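
The archive could be an append-only JSON Lines file along these lines; the file name and record layout are assumptions that simply restate the metadata listed above.

import json
import time
from pathlib import Path
from typing import Optional

ARCHIVE = Path("opt_code_archive.jsonl")  # assumed location

def archive_case(system_description: str, candidate_codes: dict,
                 verdicts: dict, adjudication: Optional[dict],
                 model_versions: dict) -> None:
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "system_description": system_description,
        "candidate_codes": candidate_codes,  # per-model OPT-Code strings
        "verdicts": verdicts,                # evaluator reports per candidate
        "adjudication": adjudication,        # None if no adjudication was needed
        "model_versions": model_versions,    # classifier / evaluator / adjudicator versions
    }
    with ARCHIVE.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")  # one JSON record per case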