\subsection{Evaluation Protocol for OPT--Code Classification}
\label{app:evaluation-protocol}
The following protocol defines a reproducible, auditable procedure for
evaluating OPT--Code classifications generated by large language models. It
follows established practices for reproducible computational research and is
designed to support reliable inter-model comparison, adjudication, and
longitudinal quality assurance.
\subsubsection{Inputs}
For each system under evaluation, the following inputs are provided:
\begin{enumerate}
\item \textbf{System description}: source code, algorithmic description, or
detailed project summary.
\item \textbf{Candidate OPT--Code}: produced by a model using the minimal or
maximal prompt (Section~\ref{sec:opt-prompts}).
\item \textbf{Candidate rationale}: a short explanation provided by the model
describing its classification.
\end{enumerate}
These inputs are then supplied to the OPT--Code Prompt Evaluator
(Appendix~\ref{app:prompt-evaluator}).
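
A minimal sketch of how these inputs might be bundled before being passed to
the evaluator, assuming a simple Python record type (the field names are
illustrative, not prescribed by the protocol):
\begin{verbatim}
from dataclasses import dataclass

@dataclass
class EvaluationInput:
    """Inputs supplied to the OPT-Code Prompt Evaluator for one system."""
    system_description: str   # source code, algorithmic description, or summary
    candidate_opt_code: str   # OPT-Code produced by the classifying model
    candidate_rationale: str  # short model-provided justification
\end{verbatim}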
\subsubsection{Evaluation Pass}
The evaluator produces:
\begin{itemize}
\item \textbf{Verdict}: \texttt{PASS}, \texttt{WEAK\_PASS}, or \texttt{FAIL}.
\item \textbf{Score}: an integer in the range 0--100.
\item \textbf{Issue categories}: format, mechanism, parallelism/pipelines,
composition, and attribute plausibility.
\item \textbf{Summary}: a short free-text evaluation.
\end{itemize}
A classification is considered \emph{acceptable} if it is rated
\texttt{PASS} or \texttt{WEAK\_PASS} with a score $\geq 70$.
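
A minimal sketch of the evaluator output and of the acceptance rule above,
assuming verdicts are strings and scores are integers (the record structure is
an illustrative assumption):
\begin{verbatim}
from dataclasses import dataclass, field

@dataclass
class EvaluatorReport:
    verdict: str       # "PASS", "WEAK_PASS", or "FAIL"
    score: int         # integer in the range 0-100
    issues: dict = field(default_factory=dict)  # per-category issue notes
    summary: str = ""  # short free-text evaluation

def is_acceptable(report: EvaluatorReport, threshold: int = 70) -> bool:
    """Acceptable if rated PASS or WEAK_PASS with score >= threshold."""
    return report.verdict in ("PASS", "WEAK_PASS") and report.score >= threshold
\end{verbatim}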
\subsubsection{Double-Annotation Procedure}
To reduce model-specific biases or hallucinations, each system description is
classified independently by two LLMs or by two runs of the same LLM with
different seeds:
\begin{enumerate}
\item Model A produces an OPT--Code and rationale.
\item Model B produces an OPT--Code and rationale.
\item Each is independently evaluated by the Prompt Evaluator.
\end{enumerate}
Inter-model agreement is quantified using one or more of the following metrics
(a computational sketch is given after the list):
\begin{itemize}
\item \textbf{Exact-match OPT} (binary): whether the root composition matches
identically.
\item \textbf{Partial-match similarity}: Jaccard similarity between root sets
(e.g., comparing \texttt{Evo+Lrn} with \texttt{Evo+Sch}).
\item \textbf{Levenshtein distance} (string distance over the structured
OPT--Code line).
\item \textbf{Weighted mechanism agreement}: weights reflecting the semantic
distances between roots (e.g., \Swm\ is closer to \Evo\ than to \Sym).
\end{itemize}
Discrepancies trigger a joint review, as described in the adjudication phase below.
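
A minimal computational sketch of these agreement metrics, assuming an
OPT--Code line of the form \texttt{Root1+Root2: attributes}; the parsing
convention and the weighting of the fourth metric are illustrative assumptions:
\begin{verbatim}
def root_set(opt_code: str) -> set:
    """Extract the root set, e.g. 'Evo+Lrn: ...' -> {'Evo', 'Lrn'}."""
    head = opt_code.split(":", 1)[0]
    return {r.strip() for r in head.split("+") if r.strip()}

def exact_match(code_a: str, code_b: str) -> bool:
    """Exact-match OPT: the root composition before ':' is identical."""
    head = lambda c: c.split(":", 1)[0].replace(" ", "")
    return head(code_a) == head(code_b)

def jaccard(code_a: str, code_b: str) -> float:
    """Partial-match similarity between the two root sets."""
    a, b = root_set(code_a), root_set(code_b)
    return len(a & b) / len(a | b) if a | b else 1.0

def levenshtein(s: str, t: str) -> int:
    """Edit distance over the structured OPT-Code lines (dynamic programming)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (cs != ct)))
        prev = curr
    return prev[-1]

# Weighted mechanism agreement would additionally use a semantic-distance table
# between roots (e.g. dist("Swm", "Evo") < dist("Swm", "Sym")); the table itself
# is a design choice and is not fixed by the protocol.
\end{verbatim}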
\subsubsection{Adjudication Phase}
If the two candidate classifications differ substantially (e.g., different root
sets or different compositions), an adjudication step is performed:
\begin{enumerate}
\item Provide the system description, both candidate OPT--Codes, both
rationales, and both evaluator reports to a third model (or human expert).
\item Use a specialized \emph{adjudicator prompt} that asks the model to
choose the better classification according to OPT rules.
\item Require the adjudicator to justify its decision and to propose a final,
consensus OPT--Code.
\end{enumerate}
A new evaluator pass is then run on the adjudicated OPT--Code to confirm
correctness.
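
A minimal sketch of this adjudication step, assuming a generic
\texttt{ask\_model} callable for the third model and an illustrative prompt
whose exact wording is not fixed by the protocol:
\begin{verbatim}
def adjudicate(description: str, cand_a: dict, cand_b: dict, ask_model) -> str:
    """Ask a third model (or human expert) to select a consensus OPT-Code.

    cand_a / cand_b are dicts with 'opt_code', 'rationale', and 'report';
    ask_model is any callable that sends a prompt and returns text.
    """
    prompt = (
        "You are an OPT adjudicator. Given a system description, two candidate "
        "OPT-Codes with rationales and evaluator reports, choose the better "
        "classification according to OPT rules, justify the choice, and output "
        "a final consensus OPT-Code on the last line.\n\n"
        f"SYSTEM DESCRIPTION:\n{description}\n\n"
        f"CANDIDATE A: {cand_a['opt_code']}\nRATIONALE A: {cand_a['rationale']}\n"
        f"REPORT A: {cand_a['report']}\n\n"
        f"CANDIDATE B: {cand_b['opt_code']}\nRATIONALE B: {cand_b['rationale']}\n"
        f"REPORT B: {cand_b['report']}"
    )
    answer = ask_model(prompt)
    # By the convention assumed here, the consensus OPT-Code is the last line;
    # it is then passed back through the evaluator for confirmation.
    return answer.strip().splitlines()[-1]
\end{verbatim}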
\subsubsection{Quality Metrics}
The following quality metrics may be computed over a batch of evaluations:
\begin{itemize}
\item \textbf{Evaluator pass rate}: proportion of \texttt{PASS} or
\texttt{WEAK\_PASS} verdicts.
\item \textbf{Inter-model consensus rate}: proportion of cases in which the two
models produce an exact OPT--Code match.
\item \textbf{Root-level confusion matrix}: which OPT roots are mistaken for
others, across models or datasets.
\item \textbf{Pipeline sensitivity}: how often parallelism or data pipelines
are misclassified as mechanisms.
\end{itemize}
These metrics allow the OPT framework to be applied consistently and help
identify systematic weaknesses in model-based classification pipelines.
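
A minimal sketch of these batch-level metrics, assuming each case is a dict
holding the two evaluator verdicts and the two root sets (field names are
illustrative):
\begin{verbatim}
from collections import Counter

def batch_metrics(cases: list) -> dict:
    """Compute batch-level quality metrics over double-annotated cases."""
    verdicts = [v for c in cases for v in (c["verdict_a"], c["verdict_b"])]
    pass_rate = sum(v in ("PASS", "WEAK_PASS") for v in verdicts) / len(verdicts)
    consensus_rate = sum(c["roots_a"] == c["roots_b"] for c in cases) / len(cases)
    # Root-level confusion: count root pairs on which the two models disagree.
    confusion = Counter()
    for c in cases:
        for a in c["roots_a"] - c["roots_b"]:
            for b in c["roots_b"] - c["roots_a"]:
                confusion[(a, b)] += 1
    # Pipeline sensitivity can be derived analogously from the evaluator's
    # "parallelism/pipelines" issue category, if those reports are stored.
    return {"pass_rate": pass_rate,
            "consensus_rate": consensus_rate,
            "root_confusion": dict(confusion)}
\end{verbatim}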
\subsubsection{Longitudinal Tracking}
For large-scale use (e.g., benchmarking industrial systems), it is recommended
to store, for each case:
\begin{itemize}
\item the system description,
\item both model classifications,
\item evaluator verdicts and scores,
\item adjudicated decisions,
\item timestamps and model versions.
\end{itemize}
Such archival enables longitudinal analysis of model performance, drift, and
taxonomy usage over time.
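
A minimal sketch of such an archival record, assuming a JSON-lines store on
disk (the schema and file format are illustrative choices, not requirements):
\begin{verbatim}
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
import json

@dataclass
class CaseRecord:
    """One archived evaluation case for longitudinal analysis."""
    system_description: str
    opt_code_a: str
    opt_code_b: str
    evaluator_reports: list      # verdicts and scores for both candidates
    adjudicated_opt_code: str
    model_versions: dict         # e.g. {"model_a": "...", "model_b": "..."}
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

def archive(record: CaseRecord, path: str) -> None:
    """Append the record as one JSON line to a local archive file."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(record)) + "\n")
\end{verbatim}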