% Operational-Premise-Taxonomy/paper/pieces/local-llm-choice.tex


\begin{verbatim}
When running the evaluation with local LLMs, here is some general guidance on selection criteria, plus a few concrete examples.
What you need from the model:
For OPT classification, the model needs:
Good code and algorithm understanding (to infer mechanism).
Decent instruction-following (to stick to the output format).
Basic reasoning about parallelism vs mechanism (with the explicit guidance you've added).
That generally points you to ~7B-14B "instruct" models with decent coding chops, rather than tiny 1-3B models.
General advice
Use instruct-tuned variants (e.g., Instruct / Chat / DPO) rather than base models.
Prefer models with good coding benchmarks (HumanEval, MBPP, etc.) because they're better at recognizing algorithm patterns.
For multi-step pipelines (Classifier, Evaluator, Adjudicator), you can:
Run them all on the same model, or
Use a slightly larger / better model for Evaluator + Adjudicator, and a smaller one for the Classifier (a wiring sketch follows at the end of this note).
Concrete model families (local-friendly)
A few commonly used open models in the ~7-14B range that are good candidates to try:
LLaMA 3 8B Instruct:
Very strong instruction following and general reasoning for its size, good for code and system descriptions. Available through multiple runtimes (vLLM, Ollama, llamafile, etc.).
Mistral 7B Instruct (or derivative fine-tunes like OpenHermes, Dolphin, etc.):
Good general-purpose and coding performance; widely used in local setups. Good choice if you're already using Mistral-based stacks.
Qwen2 7B / 14B Instruct:
Strong multilingual and coding abilities; the 14B variant is particularly capable if you have the VRAM. Nice balance of reasoning and strict formatting.
Phi-3-mini (3.8B) instruct:
Much smaller, but surprisingly capable on reasoning tasks; might be borderline for very subtle OPT distinctions but could work as a classifier with careful prompting. Evaluator/Adjudicator roles might benefit from a larger model than this, though.
Code-oriented variants (if you're mostly classifying source code rather than prose):
“Code LLaMA” derivatives
“DeepSeek-Coder” style models
These can be quite good at recognizing patterns like GA loops, RL training loops, etc., though you sometimes need to reinforce the formatting constraints.
In a local stack, a reasonable starting configuration (also sketched in code below) would be:
Classifier A: LLaMA 3 8B Instruct (maximal prompt)
Classifier B: Mistral 7B Instruct (minimal or maximal prompt)
Evaluator: Qwen2 14B Instruct (if you've got the VRAM) or LLaMA 3 8B if not
Adjudicator: same as Evaluator
If you want to conserve resources, you can just use a single 7-8B model for all roles and rely on the explicit prompts plus your evaluator rubric to catch errors.
\end{verbatim}
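
As an illustration of the multi-role setup described in the note, the sketch below wires the Classifier, Evaluator, and Adjudicator roles to (possibly different) local models through an OpenAI-compatible endpoint, as exposed by runtimes such as Ollama or vLLM. The base URL, model tags, prompt strings, and the helper function are illustrative assumptions, not part of the pipeline definition.

\begin{verbatim}
# Sketch only: assumes a local server exposing an OpenAI-compatible API
# (e.g. Ollama on http://localhost:11434/v1 or vLLM on http://localhost:8000/v1)
# and the `openai` Python client. Model tags, base URL, and prompt strings
# are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

def run_role(model: str, system_prompt: str, payload: str) -> str:
    """Send one role's prompt to a local model and return the reply text."""
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": payload},
        ],
        temperature=0,  # low temperature helps the model stick to the output format
    )
    return resp.choices[0].message.content

# Placeholder prompts and input; the real prompts are defined elsewhere.
CLASSIFIER_PROMPT = "Classify the OPT category of the artifact. Reply with one label."
EVALUATOR_PROMPT = "Given the artifact and two candidate labels, judge which is correct."
artifact = "def train_population(): ..."  # source code or system description

# Each stage of the pipeline can be bound to a different local model.
label_a = run_role("llama3:8b-instruct", CLASSIFIER_PROMPT, artifact)
label_b = run_role("mistral:7b-instruct", CLASSIFIER_PROMPT, artifact)
verdict = run_role("qwen2:14b-instruct", EVALUATOR_PROMPT,
                   f"{artifact}\n\nA: {label_a}\nB: {label_b}")
\end{verbatim}

Swapping a role onto a different model is then just a change of model tag, which makes it easy to compare the single-model and mixed-model setups described above.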
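
The starting configuration at the end of the note, and the resource-conserving single-model variant, can likewise be captured as a plain role-to-model mapping. The tags below follow Ollama-style naming and are assumptions; substitute whatever identifiers your runtime uses.

\begin{verbatim}
# Sketch only: role-to-model assignment for the suggested starting
# configuration. Model tags are illustrative placeholders.
STARTING_CONFIG = {
    "classifier_a": "llama3:8b-instruct",   # maximal prompt
    "classifier_b": "mistral:7b-instruct",  # minimal or maximal prompt
    "evaluator":    "qwen2:14b-instruct",   # or llama3:8b-instruct if VRAM is tight
    "adjudicator":  "qwen2:14b-instruct",   # same model as the Evaluator
}

# Resource-conserving variant: one 7-8B instruct model for every role,
# relying on the explicit prompts and the evaluator rubric to catch errors.
SINGLE_MODEL_CONFIG = {role: "llama3:8b-instruct" for role in STARTING_CONFIG}
\end{verbatim}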