31 lines
896 B
Markdown
31 lines
896 B
Markdown
# Doclift Claim Tournament
|
|
|
|
This benchmark is a small evaluation harness for comparing multiple
|
|
doclift prose-claim extraction strategies before changing the default
|
|
GroundRecall import behavior.
|
|
|
|
Current tracks:
|
|
|
|
- `conservative`: prefers higher precision and sentence-level claims.
|
|
- `broad`: allows paragraph-level claims and shorter sentence candidates to
|
|
improve recall.
|
|
|
|
Judge criteria:
|
|
|
|
- maximize F1 against the benchmark gold claims
|
|
- prefer higher recall when F1 ties
|
|
- penalize meta or identity-claim noise
|
|
- prefer predicted claim counts close to the gold-set size
|
|
|
|
Fixture location:
|
|
|
|
- `tests/fixtures/doclift_claim_eval/`
|
|
|
|
Primary entrypoint:
|
|
|
|
- `groundrecall.doclift_claim_tournament.evaluate_doclift_claim_tracks(...)`
|
|
|
|
This is intentionally small and deterministic. It is meant to support an
|
|
iterative tournament workflow, not to serve as a full evaluation platform by
|
|
itself.
|