\begin{verbatim}
When performing evaluation with local LLMs, here is general guidance on
selection criteria and some concrete examples.

What you need from the model

For OPT classification, the model needs:
- Good code and algorithm understanding (to infer the mechanism).
- Decent instruction-following (to stick to the output format).
- Basic reasoning about parallelism vs. mechanism (with the explicit
  guidance you've added).

That generally points you to ~7B-14B "instruct" models with decent coding
ability, rather than tiny 1-3B models.

General advice

- Use instruct-tuned variants (e.g., Instruct / Chat / DPO) rather than
  base models.
- Prefer models with good coding benchmarks (HumanEval, MBPP, etc.),
  because they are better at recognizing algorithm patterns.
- For multi-step pipelines (Classifier, Evaluator, Adjudicator), you can
  either run all roles on the same model, or use a slightly larger /
  better model for the Evaluator and Adjudicator and a smaller one for
  the Classifier.

Concrete model families (local-friendly)

A few commonly used open models in the ~7-14B range that are good
candidates to try:

- LLaMA 3 8B Instruct: very strong instruction following and general
  reasoning for its size; good for code and system descriptions.
  Available through multiple runtimes (vLLM, Ollama, llamafile, etc.).
- Mistral 7B Instruct (or derivative fine-tunes like OpenHermes, Dolphin,
  etc.): good general-purpose and coding performance; widely used in
  local setups. A good choice if you are already on a Mistral-based stack.
- Qwen2 7B / 14B Instruct: strong multilingual and coding abilities; the
  14B variant is particularly capable if you have the VRAM. A nice
  balance of reasoning and strict output formatting.
- Phi-3-mini (3.8B) Instruct: much smaller, but surprisingly capable on
  reasoning tasks; it might be borderline for very subtle OPT
  distinctions, but it could work as a Classifier with careful prompting.
  The Evaluator and Adjudicator roles would likely benefit from a larger
  model than this.
- Code-oriented variants (if you are mostly classifying source code
  rather than prose): "Code LLaMA" derivatives and "DeepSeek-Coder"-style
  models. These can be quite good at recognizing patterns such as GA
  loops, RL training loops, etc., though you sometimes need to reinforce
  the formatting constraints.

In a local stack, a reasonable starting configuration would be:
- Classifier A: LLaMA 3 8B Instruct (maximal prompt)
- Classifier B: Mistral 7B Instruct (minimal or maximal prompt)
- Evaluator:    Qwen2 14B Instruct (if you have the VRAM), otherwise
                LLaMA 3 8B Instruct
- Adjudicator:  same as the Evaluator

If you want to conserve resources, you can use a single 7-8B model for
all roles and rely on the explicit prompts plus your evaluator rubric to
catch errors. A minimal sketch of wiring these roles to a local runtime
follows this block.
\end{verbatim}
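As a rough illustration of the starting configuration above, the following Python sketch routes the Classifier, Evaluator, and Adjudicator roles to different locally served models through Ollama's HTTP chat endpoint (\texttt{/api/chat}). The model tags, the role prompts, the placeholder input file, and the ``adjudicate only on disagreement'' flow are illustrative assumptions rather than part of the pipeline described above; a vLLM or llamafile server with an OpenAI-compatible API would work the same way with a different client call.

\begin{verbatim}
# Minimal sketch (assumptions): an Ollama server on localhost:11434,
# with role prompts, model tags, and file paths as placeholders.
import requests

OLLAMA_URL = "http://localhost:11434/api/chat"

# Role -> model mapping mirroring the suggested starting configuration.
# Adjust the tags to whatever builds you have actually pulled locally.
ROLE_MODELS = {
    "classifier_a": "llama3:8b-instruct-q4_K_M",  # maximal prompt
    "classifier_b": "mistral:7b-instruct",        # minimal or maximal prompt
    "evaluator":    "qwen2:14b-instruct",         # or an 8B model if VRAM is tight
    "adjudicator":  "qwen2:14b-instruct",         # same model as the Evaluator
}

def run_role(role: str, system_prompt: str, user_text: str) -> str:
    """Send one chat turn to the model assigned to `role`, return the reply text."""
    payload = {
        "model": ROLE_MODELS[role],
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_text},
        ],
        "stream": False,                    # single JSON response, no streaming
        "options": {"temperature": 0.0},    # keep labels as deterministic as possible
    }
    resp = requests.post(OLLAMA_URL, json=payload, timeout=300)
    resp.raise_for_status()
    return resp.json()["message"]["content"].strip()

if __name__ == "__main__":
    # Placeholder input: the source code to be classified.
    snippet = open("candidate_program.py").read()

    classify_prompt = "You are an OPT classifier. Answer with exactly one label."
    label_a = run_role("classifier_a", classify_prompt, snippet)
    label_b = run_role("classifier_b", classify_prompt, snippet)

    # One possible flow (assumption): only escalate to the Adjudicator
    # when the two classifiers disagree.
    if label_a != label_b:
        verdict = run_role(
            "adjudicator",
            "Two classifiers disagreed. Pick the better label and justify briefly.",
            f"Label A: {label_a}\nLabel B: {label_b}\nCode:\n{snippet}",
        )
        print(verdict)
    else:
        print(label_a)
\end{verbatim}

The Evaluator role is wired the same way as the Adjudicator here; if you consolidate onto a single 7-8B model to save resources, only the \texttt{ROLE\_MODELS} mapping needs to change.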