OPT-Code Evaluation Protocol
Inputs:
1) System description (code or prose)
2) Candidate OPT-Code line
3) Candidate rationale
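For concreteness, a minimal sketch (in Python) of how these three inputs might be bundled for one evaluation run; the class and field names are illustrative, not part of the protocol.

    from dataclasses import dataclass

    @dataclass
    class EvaluationInput:
        # The three protocol inputs; all names here are illustrative.
        system_description: str   # code or prose describing the system
        candidate_opt_code: str   # the candidate OPT-Code line
        candidate_rationale: str  # the rationale accompanying the candidate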
Evaluation pass:
The prompt-based evaluator returns:
- Verdict: PASS / WEAK_PASS / FAIL
- Score: 0–100
- Issue categories: Format, Mechanism, Parallelism/Pipelines, Composition, Attributes
- Summary: short explanation
Acceptance threshold: PASS or WEAK_PASS with score >= 70.
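One possible encoding of the evaluator report and the acceptance rule; it reads the threshold as "verdict is PASS or WEAK_PASS, and score >= 70", which is an assumption about the intended reading, and all names are illustrative.

    from dataclasses import dataclass, field
    from enum import Enum

    class Verdict(Enum):
        PASS = "PASS"
        WEAK_PASS = "WEAK_PASS"
        FAIL = "FAIL"

    ISSUE_CATEGORIES = ("Format", "Mechanism", "Parallelism/Pipelines",
                        "Composition", "Attributes")

    @dataclass
    class EvaluatorReport:
        verdict: Verdict
        score: int                                   # 0-100
        issues: list = field(default_factory=list)   # subset of ISSUE_CATEGORIES
        summary: str = ""

    def accepted(report: EvaluatorReport) -> bool:
        # Assumption: verdict must be PASS or WEAK_PASS *and* score >= 70.
        return report.verdict in (Verdict.PASS, Verdict.WEAK_PASS) and report.score >= 70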
Double annotation:
To improve reliability:
- Run classification with Model A and Model B (or two runs of the same model)
- Evaluate both independently
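A sketch of the double-annotation step, assuming classification and evaluation are exposed as callables; `classify` and `evaluate` are hypothetical wrappers around the two models and the evaluator.

    def double_annotate(system_description: str, classify, evaluate):
        """Run the classification twice and evaluate each result independently.

        `classify` and `evaluate` are hypothetical callables wrapping Model A
        and Model B (or two runs of the same model) and the prompt evaluator.
        """
        code_a = classify(system_description, model="A")
        code_b = classify(system_description, model="B")
        report_a = evaluate(system_description, code_a)
        report_b = evaluate(system_description, code_b)
        return (code_a, report_a), (code_b, report_b)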
Agreement metrics between the two annotations:
- Exact-match OPT (binary)
- Jaccard similarity on root sets
- Levenshtein distance between OPT-Code strings
- Weighted mechanism agreement (semantic distances)
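The first three metrics can be computed directly on the OPT-Code strings and root sets, as in the sketch below; the weighted mechanism agreement is omitted here because it needs the semantic-distance table.

    def exact_match(code_a: str, code_b: str) -> bool:
        # Binary exact match on the OPT-Code strings.
        return code_a.strip() == code_b.strip()

    def jaccard_roots(roots_a: set, roots_b: set) -> float:
        # Jaccard similarity on the two root sets.
        if not roots_a and not roots_b:
            return 1.0
        return len(roots_a & roots_b) / len(roots_a | roots_b)

    def levenshtein(a: str, b: str) -> int:
        # Standard dynamic-programming edit distance between OPT-Code strings.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]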
Adjudication:
If A and B differ substantially:
- Provide both classifications and evaluator reports to an adjudicator model
- Adjudicator chooses the better one OR synthesizes a new one
- Re-run evaluator on the adjudicated OPT-Code
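A sketch of this adjudication path; `differ_substantially`, `adjudicate`, and `evaluate` are hypothetical callables standing in for the disagreement test, the adjudicator model, and the prompt evaluator.

    def resolve(system_description, code_a, report_a, code_b, report_b,
                differ_substantially, adjudicate, evaluate):
        # If the two annotations agree closely enough, keep annotation A as-is.
        if not differ_substantially(code_a, code_b):
            return code_a, report_a
        # The adjudicator sees both classifications and both evaluator reports,
        # then either picks the better one or synthesizes a new OPT-Code.
        adjudicated = adjudicate(system_description,
                                 (code_a, report_a), (code_b, report_b))
        # Re-run the evaluator on the adjudicated OPT-Code.
        return adjudicated, evaluate(system_description, adjudicated)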
Quality metrics:
- Evaluator pass rate
- Inter-model consensus rate
- Root-level confusion matrix
- Parallelism/pipeline misclassification rate
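One way these quality metrics could be aggregated over archived cases; the per-case field names are illustrative assumptions about the archive schema, and a non-empty case list is assumed.

    from collections import Counter

    def quality_metrics(cases):
        # Fraction of cases whose final verdict is PASS or WEAK_PASS.
        pass_rate = sum(c["verdict"] in ("PASS", "WEAK_PASS") for c in cases) / len(cases)
        # Fraction of cases where the two annotations agreed without adjudication.
        consensus_rate = sum(c["consensus"] for c in cases) / len(cases)
        # Root-level confusion matrix: counts of (root from A, root from B) pairs.
        root_confusion = Counter((c["root_a"], c["root_b"]) for c in cases)
        # Fraction of cases flagged with a Parallelism/Pipelines issue.
        pp_error_rate = sum("Parallelism/Pipelines" in c["issues"] for c in cases) / len(cases)
        return pass_rate, consensus_rate, root_confusion, pp_error_rate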
Longitudinal tracking:
Archive all cases with metadata (system description, candidate codes,
verdicts, adjudications, timestamps, model versions) to track drift and
systematic biases.
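A minimal archival sketch along these lines, appending one JSON record per case; the field names and file layout are illustrative.

    import json
    import time

    def archive_case(path, system_description, candidate_codes, verdicts,
                     adjudication, model_versions):
        # Append one case record per line (JSON Lines); field names are illustrative.
        record = {
            "timestamp": time.time(),
            "system_description": system_description,
            "candidate_codes": candidate_codes,    # e.g. {"A": ..., "B": ...}
            "verdicts": verdicts,
            "adjudication": adjudication,          # None if no adjudication was needed
            "model_versions": model_versions,
        }
        with open(path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")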