\subsection{Evaluation Protocol for OPT--Code Classification}
\label{app:evaluation-protocol}

The following protocol provides a reproducible and auditable procedure for
evaluating OPT--Code classifications generated by large language models. The
protocol aligns with reproducible computational research practices and is
designed to support reliable inter-model comparison, adjudication, and
longitudinal quality assurance.

\subsubsection{Inputs}

For each system under evaluation, the following inputs are provided:

\begin{enumerate}
\item \textbf{System description}: source code, algorithmic description, or
detailed project summary.
\item \textbf{Candidate OPT--Code}: produced by a model using the minimal or
maximal prompt (Section~\ref{sec:opt-prompts}).
\item \textbf{Candidate rationale}: a short explanation provided by the model
describing its classification.
\end{enumerate}

These inputs are then supplied to the OPT--Code Prompt Evaluator
(Appendix~\ref{app:prompt-evaluator}).
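For concreteness, these three inputs may be bundled into a single per-case
record before being passed to the evaluator. The following sketch shows one
possible layout; the class name, field names, and example values are
illustrative rather than prescribed.

\begin{verbatim}
from dataclasses import dataclass

@dataclass
class OPTCase:
    """One evaluation case bundling the three inputs above (names illustrative)."""
    system_description: str   # source code, algorithmic description, or summary
    candidate_opt_code: str   # e.g. "Evo+Lrn", produced by the classifying model
    candidate_rationale: str  # short free-text explanation from the model

# Hypothetical example case:
case = OPTCase(
    system_description="<source code, algorithmic description, or summary>",
    candidate_opt_code="Evo+Lrn",
    candidate_rationale="<model-provided rationale>",
)
\end{verbatim}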
\subsubsection{Evaluation Pass}
The evaluator produces:

\begin{itemize}
\item \textbf{Verdict}: \texttt{PASS}, \texttt{WEAK\_PASS}, or \texttt{FAIL}.
\item \textbf{Score}: an integer from 0--100.
\item \textbf{Issue categories}: format, mechanism, parallelism/pipelines,
composition, and attribute plausibility.
\item \textbf{Summary}: a short free-text evaluation.
\end{itemize}

A classification is considered \emph{acceptable} if it is rated
\texttt{PASS} or \texttt{WEAK\_PASS} with a score $\geq 70$.
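Once the evaluator output has been parsed into the fields above, the acceptance
criterion can be checked mechanically. The following sketch is illustrative;
the type and function names are assumptions, not part of any existing tooling.

\begin{verbatim}
from dataclasses import dataclass, field
from typing import List

@dataclass
class EvaluatorReport:
    """Parsed evaluator output (field names illustrative)."""
    verdict: str                # "PASS", "WEAK_PASS", or "FAIL"
    score: int                  # integer in 0--100
    issues: List[str] = field(default_factory=list)  # e.g. ["format", "composition"]
    summary: str = ""           # short free-text evaluation

def is_acceptable(report: EvaluatorReport) -> bool:
    """Acceptance rule: PASS or WEAK_PASS with a score of at least 70."""
    return report.verdict in {"PASS", "WEAK_PASS"} and report.score >= 70
\end{verbatim}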
\subsubsection{Double-Annotation Procedure}
To reduce model-specific biases or hallucinations, each system description is
classified independently by two LLMs, or by two runs of the same LLM with
different seeds:

\begin{enumerate}
\item Model A produces an OPT--Code and rationale.
\item Model B produces an OPT--Code and rationale.
\item Each candidate is independently evaluated by the Prompt Evaluator.
\end{enumerate}

Inter-model agreement is quantified using one or more of the following metrics
(a sketch of these computations follows the list):

\begin{itemize}
\item \textbf{Exact-match OPT} (binary): whether the root composition matches
identically.
\item \textbf{Partial-match similarity}: Jaccard similarity
$J(A,B) = |A \cap B| \,/\, |A \cup B|$ between the two root sets
(e.g., comparing \texttt{Evo+Lrn} with \texttt{Evo+Sch}).
\item \textbf{Levenshtein distance}: edit distance computed over the structured
OPT--Code string.
\item \textbf{Weighted mechanism agreement}: weights reflecting the semantic
distances between roots (e.g., \Swm\ is closer to \Evo\ than to \Sym).
\end{itemize}
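A minimal sketch of the first three metrics is given below, assuming that the
root composition of an OPT--Code is written as roots joined by \texttt{+}
(e.g., \texttt{Evo+Lrn}); this parsing convention and the function names are
assumptions for illustration.

\begin{verbatim}
def roots(opt_code: str) -> set:
    """Split a root composition such as 'Evo+Lrn' into its set of roots."""
    return {r.strip() for r in opt_code.split("+") if r.strip()}

def exact_match(a: str, b: str) -> bool:
    """Binary exact-match: the two root compositions are written identically."""
    return a.strip() == b.strip()

def jaccard(a: str, b: str) -> float:
    """Partial-match similarity J(A, B) = |A & B| / |A | B| over root sets."""
    ra, rb = roots(a), roots(b)
    return len(ra & rb) / len(ra | rb) if ra | rb else 1.0

def levenshtein(a: str, b: str) -> int:
    """Edit distance over the structured OPT-Code strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[len(b)]

# Example: Evo+Lrn vs. Evo+Sch share one of three distinct roots.
assert not exact_match("Evo+Lrn", "Evo+Sch")
assert abs(jaccard("Evo+Lrn", "Evo+Sch") - 1/3) < 1e-9
assert levenshtein("Evo+Lrn", "Evo+Sch") == 3
\end{verbatim}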
Discrepancies between the two candidates trigger a joint review, described in
the adjudication phase below.
\subsubsection{Adjudication Phase}
If the two candidate classifications differ substantially (e.g., different root
sets or different compositions), an adjudication step is performed:

\begin{enumerate}
\item Provide the system description, both candidate OPT--Codes, both
rationales, and both evaluator reports to a third model (or human expert).
\item Use a specialized \emph{adjudicator prompt} that asks the model to
choose the better classification according to OPT rules.
\item Require the adjudicator to justify its decision and to propose a final,
consensus OPT--Code.
\end{enumerate}

A new evaluator pass is then run on the adjudicated OPT--Code to confirm
correctness.
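The full loop (double annotation, evaluation, adjudication, and the confirming
evaluator pass) can be driven by a small harness such as the sketch below. The
callables \texttt{classify}, \texttt{evaluate}, and \texttt{adjudicate} are
placeholders for whichever model-calling utilities a given deployment
provides; they are assumptions, not an existing API.

\begin{verbatim}
def run_protocol(system_description, classify, evaluate, adjudicate):
    """Schematic driver for the double-annotation and adjudication protocol.

    classify(description, model)            -> (opt_code, rationale)   [placeholder]
    evaluate(description, code, rationale)  -> evaluator report        [placeholder]
    adjudicate(description, cand_a, cand_b) -> (code, justification)   [placeholder]
    """
    # Double annotation: two independent classifications.
    code_a, rat_a = classify(system_description, model="A")
    code_b, rat_b = classify(system_description, model="B")
    report_a = evaluate(system_description, code_a, rat_a)
    report_b = evaluate(system_description, code_b, rat_b)

    # No substantial discrepancy: keep the agreed classification.
    if code_a.strip() == code_b.strip():
        return code_a, (report_a, report_b)

    # Adjudication: a third model (or human expert) proposes a consensus code.
    consensus, justification = adjudicate(system_description,
                                          (code_a, rat_a, report_a),
                                          (code_b, rat_b, report_b))
    # Confirming evaluator pass on the adjudicated OPT-Code.
    final_report = evaluate(system_description, consensus, justification)
    return consensus, (final_report,)
\end{verbatim}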
\subsubsection{Quality Metrics}
The following quality-reporting metrics may be computed at the level of a batch
of evaluations:

\begin{itemize}
\item \textbf{Evaluator pass rate}: proportion of \texttt{PASS} or
\texttt{WEAK\_PASS} verdicts.
\item \textbf{Inter-model consensus rate}: proportion of cases in which the two
models produce exactly matching OPT--Codes.
\item \textbf{Root-level confusion matrix}: which OPT roots are mistaken for
others, across models or datasets.
\item \textbf{Pipeline sensitivity}: how often parallelism or data pipelines
are misclassified as mechanisms.
\end{itemize}

These metrics allow the OPT framework to be applied consistently and help
identify systematic weaknesses in model-based classification pipelines.
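As an illustration, the pass rate, consensus rate, and a simple root-level
confusion count might be computed as in the following sketch. Each case is
assumed to be a dictionary with the indicated keys, which are illustrative
rather than prescribed; the confusion count is restricted, for simplicity, to
cases whose codes consist of a single root.

\begin{verbatim}
from collections import Counter

def batch_metrics(cases):
    """Batch-level quality metrics over a list of evaluated cases.

    Each case is a dict with (illustrative) keys:
      'verdict'          -- evaluator verdict for the primary candidate
      'code_a', 'code_b' -- the two candidate OPT-Codes
      'adjudicated'      -- final consensus OPT-Code
    """
    n = len(cases)
    pass_rate = sum(c["verdict"] in {"PASS", "WEAK_PASS"} for c in cases) / n
    consensus_rate = sum(c["code_a"].strip() == c["code_b"].strip()
                         for c in cases) / n

    # Root-level confusion counts (reference root, predicted root), restricted
    # for simplicity to single-root codes.
    confusion = Counter()
    for c in cases:
        ref, pred = c["adjudicated"].strip(), c["code_a"].strip()
        if "+" not in ref and "+" not in pred and ref != pred:
            confusion[(ref, pred)] += 1

    return {"pass_rate": pass_rate,
            "consensus_rate": consensus_rate,
            "root_confusion": confusion}
\end{verbatim}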
\subsubsection{Longitudinal Tracking}
For large-scale use (e.g., benchmarking industrial systems), it is recommended
to store, for each case:

\begin{itemize}
\item the system description,
\item both model classifications,
\item evaluator verdicts and scores,
\item adjudicated decisions,
\item timestamps and model versions.
\end{itemize}

Such an archive enables longitudinal analysis of model performance, drift, and
taxonomy usage over time.
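One possible shape for such an archival record is sketched below; the field
names are illustrative and would be adapted to whatever storage backend is in
use.

\begin{verbatim}
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class ArchivedCase:
    """Per-case archival record for longitudinal tracking (fields illustrative)."""
    system_description: str
    code_a: str                      # Model A's OPT-Code
    code_b: str                      # Model B's OPT-Code
    verdict_a: str                   # evaluator verdict for Model A
    verdict_b: str                   # evaluator verdict for Model B
    score_a: int
    score_b: int
    adjudicated_code: Optional[str]  # None if no adjudication was needed
    model_a_version: str
    model_b_version: str
    evaluated_at: datetime           # timestamp of the evaluation pass
\end{verbatim}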