\begin{verbatim}
When performing evaluation with local LLMs, here is general guidance on
selection criteria and some concrete examples.

What you need from the model

For OPT classification, the model needs:
- Good code and algorithm understanding (to infer the mechanism).
- Decent instruction-following (to stick to the output format).
- Basic reasoning about parallelism vs. mechanism (with the explicit
  guidance you've added).

That generally points you to ~7B-14B "instruct" models with decent coding
ability, rather than tiny 1-3B models.

General advice

- Use instruct-tuned variants (e.g., Instruct / Chat / DPO) rather than
  base models.
- Prefer models with good coding benchmarks (HumanEval, MBPP, etc.),
  because they are better at recognizing algorithm patterns.
- For multi-step pipelines (Classifier, Evaluator, Adjudicator), you can
  either run all roles on the same model, or use a slightly larger /
  better model for the Evaluator and Adjudicator and a smaller one for
  the Classifier.

Concrete model families (local-friendly)

A few commonly used open models in the ~7-14B range that are good
candidates to try:

- LLaMA 3 8B Instruct: very strong instruction following and general
  reasoning for its size; good for code and system descriptions.
  Available through multiple runtimes (vLLM, Ollama, llamafile, etc.).
- Mistral 7B Instruct (or derivative fine-tunes like OpenHermes, Dolphin,
  etc.): good general-purpose and coding performance; widely used in
  local setups. A good choice if you are already on a Mistral-based stack.
- Qwen2 7B / 14B Instruct: strong multilingual and coding abilities; the
  14B variant is particularly capable if you have the VRAM. A nice
  balance of reasoning and strict output formatting.
- Phi-3-mini (3.8B) Instruct: much smaller, but surprisingly capable on
  reasoning tasks; it might be borderline for very subtle OPT
  distinctions, but it could work as a Classifier with careful prompting.
  The Evaluator and Adjudicator roles would likely benefit from a larger
  model than this.
- Code-oriented variants (if you are mostly classifying source code
  rather than prose): "Code LLaMA" derivatives and "DeepSeek-Coder"-style
  models. These can be quite good at recognizing patterns such as GA
  loops, RL training loops, etc., though you sometimes need to reinforce
  the formatting constraints.

In a local stack, a reasonable starting configuration would be:
- Classifier A: LLaMA 3 8B Instruct (maximal prompt)
- Classifier B: Mistral 7B Instruct (minimal or maximal prompt)
- Evaluator:    Qwen2 14B Instruct (if you have the VRAM), otherwise
                LLaMA 3 8B Instruct
- Adjudicator:  same as the Evaluator

If you want to conserve resources, you can use a single 7-8B model for
all roles and rely on the explicit prompts plus your evaluator rubric to
catch errors. A minimal sketch of wiring these roles to a local runtime
follows this block.
\end{verbatim}
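As a rough illustration of the starting configuration above, the following Python sketch routes the Classifier, Evaluator, and Adjudicator roles to different locally served models through Ollama's HTTP chat endpoint (\texttt{/api/chat}). The model tags, the role prompts, the placeholder input file, and the ``adjudicate only on disagreement'' flow are illustrative assumptions rather than part of the pipeline described above; a vLLM or llamafile server with an OpenAI-compatible API would work the same way with a different client call.

\begin{verbatim}
# Minimal sketch (assumptions): an Ollama server on localhost:11434,
# with role prompts, model tags, and file paths as placeholders.
import requests

OLLAMA_URL = "http://localhost:11434/api/chat"

# Role -> model mapping mirroring the suggested starting configuration.
# Adjust the tags to whatever builds you have actually pulled locally.
ROLE_MODELS = {
    "classifier_a": "llama3:8b-instruct-q4_K_M",  # maximal prompt
    "classifier_b": "mistral:7b-instruct",        # minimal or maximal prompt
    "evaluator":    "qwen2:14b-instruct",         # or an 8B model if VRAM is tight
    "adjudicator":  "qwen2:14b-instruct",         # same model as the Evaluator
}

def run_role(role: str, system_prompt: str, user_text: str) -> str:
    """Send one chat turn to the model assigned to `role`, return the reply text."""
    payload = {
        "model": ROLE_MODELS[role],
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_text},
        ],
        "stream": False,                    # single JSON response, no streaming
        "options": {"temperature": 0.0},    # keep labels as deterministic as possible
    }
    resp = requests.post(OLLAMA_URL, json=payload, timeout=300)
    resp.raise_for_status()
    return resp.json()["message"]["content"].strip()

if __name__ == "__main__":
    # Placeholder input: the source code to be classified.
    snippet = open("candidate_program.py").read()

    classify_prompt = "You are an OPT classifier. Answer with exactly one label."
    label_a = run_role("classifier_a", classify_prompt, snippet)
    label_b = run_role("classifier_b", classify_prompt, snippet)

    # One possible flow (assumption): only escalate to the Adjudicator
    # when the two classifiers disagree.
    if label_a != label_b:
        verdict = run_role(
            "adjudicator",
            "Two classifiers disagreed. Pick the better label and justify briefly.",
            f"Label A: {label_a}\nLabel B: {label_b}\nCode:\n{snippet}",
        )
        print(verdict)
    else:
        print(label_a)
\end{verbatim}

The Evaluator role is wired the same way as the Adjudicator here; if you consolidate onto a single 7-8B model to save resources, only the \texttt{ROLE\_MODELS} mapping needs to change.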