\subsection{Evaluation Protocol for OPT--Code Classification}
\label{app:evaluation-protocol}
The following protocol defines a reproducible, auditable procedure for
evaluating OPT--Code classifications generated by large language models. It
follows established practices for reproducible computational research and is
designed to support reliable inter-model comparison, adjudication, and
longitudinal quality assurance.
\subsubsection{Inputs}
For each system under evaluation, the following inputs are provided:
\begin{enumerate}
\item \textbf{System description}: source code, algorithmic description, or
detailed project summary.
\item \textbf{Candidate OPT--Code}: produced by a model using the minimal or
maximal prompt (Section~\ref{sec:opt-prompts}).
\item \textbf{Candidate rationale}: a short explanation provided by the model
describing its classification.
\end{enumerate}
These inputs are then supplied to the OPT--Code Prompt Evaluator
(Appendix~\ref{app:prompt-evaluator}).
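
A minimal sketch of how these inputs might be bundled before being passed to
the evaluator, assuming a simple Python record type (the field names are
illustrative, not prescribed by the protocol):
\begin{verbatim}
from dataclasses import dataclass

@dataclass
class EvaluationInput:
    """Inputs supplied to the OPT-Code Prompt Evaluator for one system."""
    system_description: str   # source code, algorithmic description, or summary
    candidate_opt_code: str   # OPT-Code produced by the classifying model
    candidate_rationale: str  # short model-provided justification
\end{verbatim}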
\subsubsection{Evaluation Pass}
The evaluator produces:
\begin{itemize}
\item \textbf{Verdict}: \texttt{PASS}, \texttt{WEAK\_PASS}, or \texttt{FAIL}.
\item \textbf{Score}: an integer in the range 0--100.
\item \textbf{Issue categories}: format, mechanism, parallelism/pipelines,
composition, and attribute plausibility.
\item \textbf{Summary}: a short free-text evaluation.
\end{itemize}
A classification is considered \emph{acceptable} if it is rated
\texttt{PASS} or \texttt{WEAK\_PASS} with a score $\geq 70$.
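
A minimal sketch of the evaluator output and of the acceptance rule above,
assuming verdicts are strings and scores are integers (the record structure is
an illustrative assumption):
\begin{verbatim}
from dataclasses import dataclass, field

@dataclass
class EvaluatorReport:
    verdict: str       # "PASS", "WEAK_PASS", or "FAIL"
    score: int         # integer in the range 0-100
    issues: dict = field(default_factory=dict)  # per-category issue notes
    summary: str = ""  # short free-text evaluation

def is_acceptable(report: EvaluatorReport, threshold: int = 70) -> bool:
    """Acceptable if rated PASS or WEAK_PASS with score >= threshold."""
    return report.verdict in ("PASS", "WEAK_PASS") and report.score >= threshold
\end{verbatim}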
\subsubsection{Double-Annotation Procedure}
To reduce model-specific biases or hallucinations, each system description is
classified independently by two LLMs or by two runs of the same LLM with
different seeds:
\begin{enumerate}
\item Model A produces an OPT--Code and rationale.
\item Model B produces an OPT--Code and rationale.
\item Each is independently evaluated by the Prompt Evaluator.
\end{enumerate}
Inter-model agreement is quantified using one or more of the following metrics
(a computational sketch is given after the list):
\begin{itemize}
\item \textbf{Exact-match OPT} (binary): whether the root composition matches
identically.
\item \textbf{Partial-match similarity}: Jaccard similarity between root sets
(e.g., comparing \texttt{Evo+Lrn} with \texttt{Evo+Sch}).
\item \textbf{Levenshtein distance} (string distance over the structured
OPT--Code line).
\item \textbf{Weighted mechanism agreement}: weights reflecting the semantic
distances between roots (e.g., \Swm\ is closer to \Evo\ than to \Sym).
\end{itemize}
Discrepancies trigger a joint review, as described in the adjudication phase below.
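
A minimal computational sketch of these agreement metrics, assuming an
OPT--Code line of the form \texttt{Root1+Root2: attributes}; the parsing
convention and the weighting of the fourth metric are illustrative assumptions:
\begin{verbatim}
def root_set(opt_code: str) -> set:
    """Extract the root set, e.g. 'Evo+Lrn: ...' -> {'Evo', 'Lrn'}."""
    head = opt_code.split(":", 1)[0]
    return {r.strip() for r in head.split("+") if r.strip()}

def exact_match(code_a: str, code_b: str) -> bool:
    """Exact-match OPT: the root composition before ':' is identical."""
    head = lambda c: c.split(":", 1)[0].replace(" ", "")
    return head(code_a) == head(code_b)

def jaccard(code_a: str, code_b: str) -> float:
    """Partial-match similarity between the two root sets."""
    a, b = root_set(code_a), root_set(code_b)
    return len(a & b) / len(a | b) if a | b else 1.0

def levenshtein(s: str, t: str) -> int:
    """Edit distance over the structured OPT-Code lines (dynamic programming)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (cs != ct)))
        prev = curr
    return prev[-1]

# Weighted mechanism agreement would additionally use a semantic-distance table
# between roots (e.g. dist("Swm", "Evo") < dist("Swm", "Sym")); the table itself
# is a design choice and is not fixed by the protocol.
\end{verbatim}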
\subsubsection{Adjudication Phase}
If the two candidate classifications differ substantially (e.g., different root
sets or different compositions), an adjudication step is performed:
\begin{enumerate}
\item Provide the system description, both candidate OPT--Codes, both
rationales, and both evaluator reports to a third model (or human expert).
\item Use a specialized \emph{adjudicator prompt} that asks the model to
choose the better classification according to OPT rules.
\item Require the adjudicator to justify its decision and to propose a final,
consensus OPT--Code.
\end{enumerate}
A new evaluator pass is then run on the adjudicated OPT--Code to confirm
correctness.
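
A minimal sketch of this adjudication step, assuming a generic
\texttt{ask\_model} callable for the third model and an illustrative prompt
whose exact wording is not fixed by the protocol:
\begin{verbatim}
def adjudicate(description: str, cand_a: dict, cand_b: dict, ask_model) -> str:
    """Ask a third model (or human expert) to select a consensus OPT-Code.

    cand_a / cand_b are dicts with 'opt_code', 'rationale', and 'report';
    ask_model is any callable that sends a prompt and returns text.
    """
    prompt = (
        "You are an OPT adjudicator. Given a system description, two candidate "
        "OPT-Codes with rationales and evaluator reports, choose the better "
        "classification according to OPT rules, justify the choice, and output "
        "a final consensus OPT-Code on the last line.\n\n"
        f"SYSTEM DESCRIPTION:\n{description}\n\n"
        f"CANDIDATE A: {cand_a['opt_code']}\nRATIONALE A: {cand_a['rationale']}\n"
        f"REPORT A: {cand_a['report']}\n\n"
        f"CANDIDATE B: {cand_b['opt_code']}\nRATIONALE B: {cand_b['rationale']}\n"
        f"REPORT B: {cand_b['report']}"
    )
    answer = ask_model(prompt)
    # By the convention assumed here, the consensus OPT-Code is the last line;
    # it is then passed back through the evaluator for confirmation.
    return answer.strip().splitlines()[-1]
\end{verbatim}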
\subsubsection{Quality Metrics}
The following quality metrics may be computed over a batch of evaluations:
\begin{itemize}
\item \textbf{Evaluator pass rate}: proportion of \texttt{PASS} or
\texttt{WEAK\_PASS} verdicts.
\item \textbf{Inter-model consensus rate}: proportion of cases in which the two
models produce an exact OPT--Code match.
\item \textbf{Root-level confusion matrix}: which OPT roots are mistaken for
others, across models or datasets.
\item \textbf{Pipeline sensitivity}: how often parallelism or data pipelines
are misclassified as mechanisms.
\end{itemize}
These metrics allow the OPT framework to be applied consistently and help
identify systematic weaknesses in model-based classification pipelines.
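
A minimal sketch of these batch-level metrics, assuming each case is a dict
holding the two evaluator verdicts and the two root sets (field names are
illustrative):
\begin{verbatim}
from collections import Counter

def batch_metrics(cases: list) -> dict:
    """Compute batch-level quality metrics over double-annotated cases."""
    verdicts = [v for c in cases for v in (c["verdict_a"], c["verdict_b"])]
    pass_rate = sum(v in ("PASS", "WEAK_PASS") for v in verdicts) / len(verdicts)
    consensus_rate = sum(c["roots_a"] == c["roots_b"] for c in cases) / len(cases)
    # Root-level confusion: count root pairs on which the two models disagree.
    confusion = Counter()
    for c in cases:
        for a in c["roots_a"] - c["roots_b"]:
            for b in c["roots_b"] - c["roots_a"]:
                confusion[(a, b)] += 1
    # Pipeline sensitivity can be derived analogously from the evaluator's
    # "parallelism/pipelines" issue category, if those reports are stored.
    return {"pass_rate": pass_rate,
            "consensus_rate": consensus_rate,
            "root_confusion": dict(confusion)}
\end{verbatim}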
\subsubsection{Longitudinal Tracking}
For large-scale use (e.g., benchmarking industrial systems), it is recommended
to store, for each case:
\begin{itemize}
\item the system description,
\item both model classifications,
\item evaluator verdicts and scores,
\item adjudicated decisions,
\item timestamps and model versions.
\end{itemize}
Such archival enables longitudinal analysis of model performance, drift, and
taxonomy usage over time.
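
A minimal sketch of such an archival record, assuming a JSON-lines store on
disk (the schema and file format are illustrative choices, not requirements):
\begin{verbatim}
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
import json

@dataclass
class CaseRecord:
    """One archived evaluation case for longitudinal analysis."""
    system_description: str
    opt_code_a: str
    opt_code_b: str
    evaluator_reports: list      # verdicts and scores for both candidates
    adjudicated_opt_code: str
    model_versions: dict         # e.g. {"model_a": "...", "model_b": "..."}
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

def archive(record: CaseRecord, path: str) -> None:
    """Append the record as one JSON line to a local archive file."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(record)) + "\n")
\end{verbatim}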