\begin{verbatim}
When performing evaluation with local LLMs, here is some general guidance on
selection criteria, along with a few concrete model suggestions.

What you need from the model:

For OPT classification, the model needs:

Good code and algorithm understanding (to infer mechanism).
Decent instruction-following (to stick to the output format).
Basic reasoning about parallelism vs. mechanism (with the explicit guidance
you’ve added).

That generally points you to ~7B–14B “instruct” models with decent coding
chops, rather than tiny 1–3B models.

General advice

Use instruct-tuned variants (e.g., Instruct / Chat / DPO) rather than base
models.

Prefer models with good coding benchmarks (HumanEval, MBPP, etc.), because
they’re better at recognizing algorithm patterns.

For multi-step pipelines (Classifier, Evaluator, Adjudicator), you can either:

Run them all on the same model, or
Use a slightly larger / better model for the Evaluator and Adjudicator, and a
smaller one for the Classifier (see the sketch just below).
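
To make the per-role split concrete, here is a minimal Python sketch that
routes each role to its own model through a local OpenAI-compatible chat
endpoint (the kind of server vLLM, or Ollama's compatibility layer, can
expose). The endpoint URL, model tags, and function names are illustrative
assumptions, not a fixed recipe.

    # Minimal sketch: per-role model routing against a local
    # OpenAI-compatible chat endpoint (e.g. vLLM, or Ollama's
    # compatibility layer). URL and model tags are assumptions.
    import requests

    ENDPOINT = "http://localhost:8000/v1/chat/completions"  # adjust to your server

    ROLE_TO_MODEL = {
        "classifier": "mistral-7b-instruct",   # a smaller model is usually enough here
        "evaluator": "qwen2-14b-instruct",     # larger model for the harder roles
        "adjudicator": "qwen2-14b-instruct",
    }

    def run_role(role, system_prompt, user_content):
        """Send one chat request for the given pipeline role; return the reply text."""
        payload = {
            "model": ROLE_TO_MODEL[role],
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_content},
            ],
            "temperature": 0.0,  # near-deterministic output helps format compliance
        }
        resp = requests.post(ENDPOINT, json=payload, timeout=120)
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]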

Concrete model families (local-friendly)

A few commonly used open models in the ~7–14B range that are good candidates
to try:

LLaMA 3 8B Instruct:
Very strong instruction following and general reasoning for its size; good
for code and system descriptions. Available through multiple runtimes (vLLM,
Ollama, llamafile, etc.).

Mistral 7B Instruct (or derivative fine-tunes like OpenHermes, Dolphin, etc.):
Good general-purpose and coding performance, and widely used in local setups.
A natural choice if you’re already on a Mistral-based stack.

Qwen2 7B / 14B Instruct:
Strong multilingual and coding abilities; the 14B variant is particularly
capable if you have the VRAM. A nice balance of reasoning and strict
formatting.

Phi-3-mini (3.8B) Instruct:
Much smaller, but surprisingly capable on reasoning tasks. It might be
borderline for very subtle OPT distinctions, but could work as a Classifier
with careful prompting; the Evaluator and Adjudicator roles would likely
benefit from a larger model.

Code-oriented variants (if you’re mostly classifying source code rather than
prose):

“Code LLaMA” derivatives
“DeepSeek-Coder”-style models

These can be quite good at recognizing patterns like GA loops, RL training
loops, etc., though you sometimes need to reinforce the formatting
constraints (see the validation sketch just below).
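
When a model drifts from the required output format, a light
validate-and-retry wrapper is usually enough. Here is a hedged sketch; the
label set, helper names, and retry wording are placeholders, since the real
OPT label set is not shown here.

    # Sketch: reinforce the output-format constraint by checking the reply
    # against an allowed label set and retrying once with a terse reminder.
    # ALLOWED_LABELS and the helper names are placeholders.
    import re

    ALLOWED_LABELS = {"OPT-A", "OPT-B", "OPT-C"}  # substitute your real labels

    def extract_label(reply):
        """Return the first allowed label found in the reply, or None."""
        for token in re.findall(r"[A-Za-z0-9_-]+", reply):
            if token in ALLOWED_LABELS:
                return token
        return None

    def classify_with_retry(ask_model, prompt):
        """ask_model is any callable(str) -> str, e.g. a thin wrapper over run_role()."""
        label = extract_label(ask_model(prompt))
        if label is None:
            reminder = (prompt + "\n\nAnswer with exactly one label from: "
                        + ", ".join(sorted(ALLOWED_LABELS)))
            label = extract_label(ask_model(reminder))
        return label or "UNPARSEABLE"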

In a local stack, a reasonable starting configuration would be:

Classifier A: LLaMA 3 8B Instruct (maximal prompt)
Classifier B: Mistral 7B Instruct (minimal or maximal prompt)
Evaluator: Qwen2 14B Instruct (if you have the VRAM), or LLaMA 3 8B Instruct if not
Adjudicator: same as the Evaluator

If you want to conserve resources, you can use a single 7–8B model for all
roles and rely on the explicit prompts plus your evaluator rubric to catch
errors.
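
For reference, here is that starting configuration written out as a plain
Python mapping that a helper like run_role() above could consume. The model
tags and prompt-variant names are illustrative assumptions; match them to
whatever your runtime actually serves.

    # The starting configuration above as a config mapping. Model tags and
    # prompt-variant names are illustrative; adjust to your runtime.
    STARTING_CONFIG = {
        "classifier_a": {"model": "llama3-8b-instruct", "prompt": "maximal"},
        "classifier_b": {"model": "mistral-7b-instruct", "prompt": "minimal"},
        "evaluator": {"model": "qwen2-14b-instruct", "prompt": "evaluator_rubric"},
        "adjudicator": {"model": "qwen2-14b-instruct", "prompt": "evaluator_rubric"},
    }

    # Resource-conserving variant: a single 7-8B model for every role.
    SINGLE_MODEL_CONFIG = {
        role: {"model": "llama3-8b-instruct", "prompt": cfg["prompt"]}
        for role, cfg in STARTING_CONFIG.items()
    }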
\end{verbatim}