\subsection{Evaluation Protocol for OPT--Code Classification}
\label{app:evaluation-protocol}

The following protocol provides a reproducible and auditable procedure for evaluating OPT--Code classifications generated by large language models. The protocol aligns with established practices for reproducible computational research and is designed to support reliable inter-model comparison, adjudication, and longitudinal quality assurance.

\subsubsection{Inputs}

For each system under evaluation, the following inputs are provided:
\begin{enumerate}
  \item \textbf{System description}: source code, algorithmic description, or detailed project summary.
  \item \textbf{Candidate OPT--Code}: produced by a model using the minimal or maximal prompt (Section~\ref{sec:opt-prompts}).
  \item \textbf{Candidate rationale}: a short explanation provided by the model describing its classification.
\end{enumerate}
These inputs are then supplied to the OPT--Code Prompt Evaluator (Appendix~\ref{app:prompt-evaluator}).

\subsubsection{Evaluation Pass}

The evaluator produces:
\begin{itemize}
  \item \textbf{Verdict}: \texttt{PASS}, \texttt{WEAK\_PASS}, or \texttt{FAIL}.
  \item \textbf{Score}: an integer in the range 0--100.
  \item \textbf{Issue categories}: format, mechanism, parallelism/pipelines, composition, and attribute plausibility.
  \item \textbf{Summary}: a short free-text evaluation.
\end{itemize}
A classification is considered \emph{acceptable} if the verdict is \texttt{PASS} or \texttt{WEAK\_PASS} and the score is $\geq 70$.

\subsubsection{Double-Annotation Procedure}

To mitigate model-specific biases and hallucinations, each system description is classified independently by two LLMs, or by two runs of the same LLM with different seeds:
\begin{enumerate}
  \item Model A produces an OPT--Code and rationale.
  \item Model B produces an OPT--Code and rationale.
  \item Each candidate is independently evaluated by the Prompt Evaluator.
\end{enumerate}
Inter-model agreement is quantified using one or more of the following metrics:
\begin{itemize}
  \item \textbf{Exact-match OPT} (binary): whether the root composition matches identically.
  \item \textbf{Partial-match similarity}: Jaccard similarity between root sets (e.g., comparing \texttt{Evo+Lrn} with \texttt{Evo+Sch} yields $1/3$).
  \item \textbf{Levenshtein distance}: string distance over the structured OPT--Code line.
  \item \textbf{Weighted mechanism agreement}: agreement weighted by the semantic distance between roots (e.g., \Swm\ is closer to \Evo\ than to \Sym).
\end{itemize}
Discrepancies trigger a joint review; substantial disagreements proceed to the adjudication phase below.
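For concreteness, a minimal Python sketch of the structural agreement metrics is given below. It assumes that an OPT--Code is represented as a plain \texttt{+}-joined string of roots (e.g., \texttt{Evo+Lrn}); the helper names (\texttt{roots}, \texttt{jaccard}, \texttt{levenshtein}) are illustrative and not part of the evaluator tooling.

\begin{verbatim}
# Minimal sketch of inter-model agreement metrics for two candidate
# OPT-Codes. Assumes an OPT-Code is a "+"-joined string of roots
# (e.g., "Evo+Lrn"); real tooling may use a richer structured format.

def roots(opt_code: str) -> set[str]:
    """Split an OPT-Code into its set of root mechanisms."""
    return {r.strip() for r in opt_code.split("+") if r.strip()}

def exact_match(a: str, b: str) -> bool:
    """Binary exact match on the root composition."""
    return roots(a) == roots(b)

def jaccard(a: str, b: str) -> float:
    """Partial-match similarity between the two root sets."""
    ra, rb = roots(a), roots(b)
    return len(ra & rb) / len(ra | rb) if (ra | rb) else 1.0

def levenshtein(a: str, b: str) -> int:
    """Edit distance over the raw OPT-Code strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

if __name__ == "__main__":
    a, b = "Evo+Lrn", "Evo+Sch"
    print(exact_match(a, b))   # False
    print(jaccard(a, b))       # 0.333...
    print(levenshtein(a, b))   # 3
\end{verbatim}

The weighted mechanism agreement additionally requires a table of semantic distances between roots, which is framework-specific and therefore omitted from this sketch.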
\subsubsection{Adjudication Phase}

If the two candidate classifications differ substantially (e.g., different root sets or different compositions), an adjudication step is performed:
\begin{enumerate}
  \item Provide the system description, both candidate OPT--Codes, both rationales, and both evaluator reports to a third model (or a human expert).
  \item Use a specialized \emph{adjudicator prompt} that asks the model to choose the better classification according to the OPT rules.
  \item Require the adjudicator to justify its decision and to propose a final, consensus OPT--Code.
\end{enumerate}
A new evaluator pass is then run on the adjudicated OPT--Code to confirm its correctness.

\subsubsection{Quality Metrics}

The following quality-reporting metrics may be computed over a batch of evaluations:
\begin{itemize}
  \item \textbf{Evaluator pass rate}: proportion of \texttt{PASS} or \texttt{WEAK\_PASS} verdicts.
  \item \textbf{Inter-model consensus rate}: proportion of cases in which the two models produce exactly matching OPT--Codes.
  \item \textbf{Root-level confusion matrix}: which OPT roots are mistaken for others, across models or datasets.
  \item \textbf{Pipeline sensitivity}: how often parallelism or data pipelines are misclassified as mechanisms.
\end{itemize}
These metrics allow the OPT framework to be applied consistently and help identify systematic weaknesses in model-based classification pipelines.

\subsubsection{Longitudinal Tracking}

For large-scale use (e.g., benchmarking industrial systems), it is recommended to store, for each case:
\begin{itemize}
  \item the system description,
  \item both model classifications,
  \item evaluator verdicts and scores,
  \item adjudicated decisions,
  \item timestamps and model versions.
\end{itemize}
Such an archive enables longitudinal analysis of model performance, drift, and taxonomy usage over time; a minimal record schema is sketched below.
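The following sketch shows one possible per-case archival record together with the batch-level pass-rate and consensus-rate computations from the quality metrics above. The names (\texttt{CaseRecord}, \texttt{pass\_rate}, \texttt{consensus\_rate}) and field layout are illustrative assumptions, not part of the OPT tooling.

\begin{verbatim}
# Illustrative per-case archival record and batch-level quality metrics.
# Names and fields are placeholders, not part of the OPT tooling.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class CaseRecord:
    system_description: str
    opt_code_a: str              # classification by model A
    opt_code_b: str              # classification by model B
    verdict_a: str               # PASS / WEAK_PASS / FAIL
    verdict_b: str
    score_a: int                 # 0-100
    score_b: int
    adjudicated_opt_code: str | None = None
    model_versions: dict[str, str] = field(default_factory=dict)
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

def pass_rate(records: list[CaseRecord]) -> float:
    """Proportion of PASS or WEAK_PASS verdicts across both annotators."""
    verdicts = [v for r in records for v in (r.verdict_a, r.verdict_b)]
    return sum(v in ("PASS", "WEAK_PASS") for v in verdicts) / len(verdicts)

def consensus_rate(records: list[CaseRecord]) -> float:
    """Proportion of cases where both models produced the same OPT-Code."""
    return sum(r.opt_code_a == r.opt_code_b for r in records) / len(records)
\end{verbatim}

Storing such records in an append-only format with explicit model-version and timestamp fields is one straightforward way to support the drift analysis described above.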