Verification added to process.
parent fc7e1f1844
commit 35d8eb8386
.gitignore
@@ -3,4 +3,5 @@ __pycache__/
 .venv/
 .cache/
 *.pyc
+*.egg-info/
 library.sqlite3
README.md | 10
@@ -48,6 +48,7 @@ The initial repo includes:
 - a small CLI for ingest, search, inspection, and export;
 - review-state tracking on entries, per-field ingest provenance, and field-level conflict review;
 - plaintext reference extraction into draft BibTeX for numbered, APA-like, wrapped-line, and simple book-style references;
+- standalone verification and disambiguation of free-text references or partial BibTeX into auditable BibTeX/JSON results with `x_status`, `x_confidence`, `x_source`, `x_query`, and alternate-candidate traces;
 - identifier-first metadata resolution for DOI, OpenAlex, DBLP, arXiv, and DataCite-backed entries, with OpenAlex/DataCite title-search fallback;
 - local citation-graph traversal over stored `cites`, `cited_by`, and `crossref` edges;
 - Crossref- and OpenAlex-backed graph expansion that materializes draft related works and edge provenance;
@@ -132,6 +133,8 @@ PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 apply-conflict
 PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 bootstrap --seed-bib seed.bib --topic "bayesian nonparametrics"
 PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 bootstrap --topic "bayesian nonparametrics" --preview --topic-commit-limit 5
 PYTHONPATH=src .venv/bin/python -m citegeist extract references.txt --output draft.bib
+PYTHONPATH=src .venv/bin/python -m citegeist verify --string '"Graph-first bibliography augmentation" Smith 2024' --context "citation graphs" --format json
+PYTHONPATH=src .venv/bin/python -m citegeist verify --bib draft.bib --output verified.bib
 PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 resolve smith2024graphs
 PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 resolve-stubs --doi-only --preview --limit 25
 PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 resolve-stubs --doi-only --all-misc --limit 25
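The `verify --format json` invocation above emits a list of result objects. Based on the `to_dict()` serializer that this commit adds in the verify module, each object carries the audit fields alongside the resolved entry; the concrete values below are illustrative, taken from the commit's own test fixtures:

```python
# Sketch of one element of the JSON array emitted by `verify --format json`.
# Field names follow the verify module's to_dict(); values are example data.
result = {
    "query": '"Graph-first bibliography augmentation" Smith 2024',
    "context": "citation graphs",
    "input_type": "string",
    "input_key": None,
    "status": "high_confidence",
    "confidence": 0.82,
    "source_label": "crossref:search:Graph-first bibliography augmentation",
    "entry": {
        "citation_key": "smith2024graphs",
        "entry_type": "article",
        "fields": {"title": "Graph-first bibliography augmentation", "year": "2024"},
    },
    "alternates": [],
}
```

The `status`/`confidence`/`source_label` triple is what makes the output auditable: a downstream reviewer can filter on `status` before committing anything to the library.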
@@ -167,6 +170,13 @@ OpenAlex expansion is also conservative about noisy secondary records. Discoveri
 
 For live-source development, prefer fixture-backed or cache-backed source clients so resolver and expansion work can be exercised repeatedly without re-hitting upstream APIs on every run.
 
+## Adopted Ideas From Earlier Repos
+
+`citegeist` now absorbs two useful patterns from adjacent bibliography tools while keeping them inside the main Python 3 package boundary:
+
+- From `VeriBib`: a standalone `verify` workflow for ambiguous strings or rough BibTeX, with explicit confidence/status audit fields and alternate-candidate traces before you commit changes to the main library.
+- From `TOA-Bib-Updater`: resumable, artifact-oriented corpus processing remains the preferred process model for large imports. In practice this already appears in the TalkOrigins example pipeline through saved manifests, review exports, duplicate reports, and staged topic-phrase review flows.
+
 ## Example Application
 
 - Use `stage-topic-phrases` to load those suggestions into the database as review items. Staging stores the candidate in `suggested_phrase` and marks the topic `pending` without changing the active `expansion_phrase`.
ROADMAP.md | 19
@@ -25,8 +25,22 @@ Completed:
 - lightweight BibTeX parsing;
 - SQLite storage for entries, creators, identifiers, and relations;
 - local text search using SQLite FTS5 when available;
+- standalone verification/disambiguation output for free-text references and partial BibTeX with auditable match metadata;
 - tests for ingest, relation storage, and search.
 
+## Comparison Notes From Related Repos
+
+The adjacent `TOA-Bib-Updater` and `VeriBib` repositories are useful prior art, but they contribute different things:
+
+- `VeriBib` contributes a good pre-ingest verification pattern: inspect ambiguous strings or partial BibTeX, rank candidates from legal metadata sources, and emit explicit audit fields instead of silently trusting a single match.
+- `TOA-Bib-Updater` contributes process discipline more than core data modeling: resumable long-running jobs, preserved source artifacts, and generated review outputs for manual inspection.
+
+`citegeist` should absorb those ideas where they improve the main local research workflow:
+
+1. keep verification and auditability in the core package, not just entry resolution after ingest;
+2. keep resumable manifests and review exports for large acquisition workflows, especially example pipelines and batch imports;
+3. avoid coupling the core model to brittle source-specific scraping logic.
+
 ## Phase 1: Core Ingestion And Export
 
 Priority: P0
@@ -67,7 +81,8 @@ Tasks:
 - support ingestion of OCR- or PDF-derived plaintext bibliography sections;
 - add normalization for author names, years, title casing, and page ranges;
 - prefer sentence-boundary venue detection over naive keyword splits so title text containing words like `report` is not truncated;
-- repair partially extracted venue stubs such as `Occas.` or `Proc.` by reparsing the full raw reference line when the structured fields are obviously incomplete;
+- repair partially extracted venue stubs such as `Occas.` or `Proc.` by reparsing the full raw reference line when the structured fields are obviously
+  incomplete;
 - preserve improved local draft parses even when remote enrichment remains unresolved, so later parser fixes can refresh stored BibTeX without requiring a successful metadata match;
 - build gold-test fixtures from real, messy reference examples.
@@ -122,7 +137,6 @@ Tasks:
 - expose unresolved nodes so the user can decide what to enrich next.
 
 Why this matters:
 
 - this is central to literature discovery rather than mere bibliography cleanup;
 - it turns the database into a research navigation tool.
@@ -164,7 +178,6 @@ Goal:
 Broaden source acquisition without mixing that complexity into the core model.
 
 Tasks:
 
 - add source adapters for open-access theses and dissertation repositories;
 - add support for harvesting publisher citation pages and preprint metadata pages;
 - define per-source import provenance and rate-limit behavior;
src/citegeist/__init__.py
@@ -7,12 +7,14 @@ from .harvest import OaiMetadataFormat, OaiPmhHarvester, OaiSet
 from .resolve import MetadataResolver, merge_entries, merge_entries_with_conflicts
 from .sources import SourceClient
 from .storage import BibliographyStore
+from .verify import BibliographyVerifier, VerificationResult, VerificationMatch
 
 __all__ = [
     "BibEntry",
     "BatchBootstrapRunner",
     "BatchJobResult",
     "BibliographyStore",
+    "BibliographyVerifier",
     "BootstrapResult",
     "Bootstrapper",
     "CrossrefExpander",
@@ -22,6 +24,8 @@ __all__ = [
     "OaiMetadataFormat",
     "OaiSet",
     "SourceClient",
+    "VerificationMatch",
+    "VerificationResult",
     "extract_references",
     "load_batch_jobs",
     "merge_entries",
src/citegeist/cli.py
@@ -16,6 +16,7 @@ from .extract import extract_references
 from .harvest import OaiPmhHarvester
 from .resolve import MetadataResolver, merge_entries_with_conflicts
 from .storage import BibliographyStore
+from .verify import BibliographyVerifier, render_verification_results
 
 
 def build_parser() -> argparse.ArgumentParser:
@@ -69,6 +70,24 @@ def build_parser() -> argparse.ArgumentParser:
     extract_parser.add_argument("input", help="Plaintext file containing bibliography-style references")
     extract_parser.add_argument("--output", help="Write extracted BibTeX to a file instead of stdout")
 
+    verify_parser = subparsers.add_parser(
+        "verify",
+        help="Verify or disambiguate free-text references or BibTeX entries without modifying the database",
+    )
+    verify_group = verify_parser.add_mutually_exclusive_group(required=True)
+    verify_group.add_argument("--string", help="Single free-text reference query")
+    verify_group.add_argument("--list", dest="list_input", help="Path to a text file with one query per line")
+    verify_group.add_argument("--bib", help="Path to a BibTeX file whose entries should be verified")
+    verify_parser.add_argument("--context", default="", help="Optional topic context used for scoring")
+    verify_parser.add_argument("--limit", type=int, default=5, help="Maximum candidates to inspect per input")
+    verify_parser.add_argument(
+        "--format",
+        choices=["bibtex", "json"],
+        default="bibtex",
+        help="Output format for verification results",
+    )
+    verify_parser.add_argument("--output", help="Write verification results to a file instead of stdout")
+
     resolve_parser = subparsers.add_parser("resolve", help="Enrich stored entries from external metadata sources")
     resolve_parser.add_argument("citation_keys", nargs="+", help="Citation keys to enrich")
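The verify subparser above leans on argparse's `add_mutually_exclusive_group(required=True)` so that exactly one of `--string`, `--list`, or `--bib` must be supplied. A minimal standalone demonstration of that behavior (option names mirror the diff; everything else is simplified for illustration):

```python
import argparse

# Stand-in for the verify subparser's input group: exactly one source flag.
parser = argparse.ArgumentParser(prog="verify-demo")
group = parser.add_mutually_exclusive_group(required=True)
group.add_argument("--string", help="single free-text reference query")
group.add_argument("--list", dest="list_input", help="file with one query per line")
group.add_argument("--bib", help="BibTeX file to verify")

args = parser.parse_args(["--string", '"Graph-first bibliography augmentation" Smith 2024'])
print(args.string is not None, args.list_input, args.bib)  # prints: True None None
```

Passing two of the flags together, or none at all, makes argparse exit with a usage error before any verifier code runs, which keeps the `_run_verify` dispatch below simple.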
@@ -535,6 +554,8 @@ def main(argv: list[str] | None = None) -> int:
         return _run_apply_conflict(store, args.citation_key, args.field_name)
     if args.command == "extract":
         return _run_extract(Path(args.input), args.output)
+    if args.command == "verify":
+        return _run_verify(args.string, args.list_input, args.bib, args.context, args.limit, args.format, args.output)
     if args.command == "resolve":
         return _run_resolve(store, args.citation_keys)
     if args.command == "resolve-stubs":
@ -783,6 +804,36 @@ def _run_extract(input_path: Path, output: str | None) -> int:
|
||||||
return 0
|
return 0
|
||||||
|
|
||||||
|
|
||||||
|
def _run_verify(
|
||||||
|
string_input: str | None,
|
||||||
|
list_input: str | None,
|
||||||
|
bib_input: str | None,
|
||||||
|
context: str,
|
||||||
|
limit: int,
|
||||||
|
output_format: str,
|
||||||
|
output: str | None,
|
||||||
|
) -> int:
|
||||||
|
verifier = BibliographyVerifier()
|
||||||
|
if string_input is not None:
|
||||||
|
results = [verifier.verify_string(string_input, context=context, limit=limit)]
|
||||||
|
elif list_input is not None:
|
||||||
|
values = [line.strip() for line in Path(list_input).read_text(encoding="utf-8").splitlines() if line.strip()]
|
||||||
|
results = verifier.verify_strings(values, context=context, limit=limit)
|
||||||
|
elif bib_input is not None:
|
||||||
|
results = verifier.verify_bib_file(bib_input, context=context, limit=limit)
|
||||||
|
else:
|
||||||
|
print("verify requires one input source", file=sys.stderr)
|
||||||
|
return 1
|
||||||
|
|
||||||
|
rendered = render_verification_results(results, output_format)
|
||||||
|
if output:
|
||||||
|
Path(output).write_text(rendered + ("\n" if rendered and not rendered.endswith("\n") else ""), encoding="utf-8")
|
||||||
|
else:
|
||||||
|
if rendered:
|
||||||
|
print(rendered)
|
||||||
|
return 0
|
||||||
|
|
||||||
|
|
||||||
def _print_progress(label: str, index: int, total: int, detail: str | None = None) -> None:
|
def _print_progress(label: str, index: int, total: int, detail: str | None = None) -> None:
|
||||||
message = f"[{index}/{total}] {label}"
|
message = f"[{index}/{total}] {label}"
|
||||||
if detail:
|
if detail:
|
||||||
|
|
|
||||||
src/citegeist/verify.py (new file)
@@ -0,0 +1,358 @@
from __future__ import annotations

import json
import re
from dataclasses import dataclass
from pathlib import Path

from .bibtex import BibEntry, parse_bibtex, render_bibtex
from .resolve import MetadataResolver, Resolution


@dataclass(slots=True)
class VerificationMatch:
    entry: BibEntry
    score: float
    source_label: str


@dataclass(slots=True)
class VerificationResult:
    query: str
    context: str
    status: str
    confidence: float
    entry: BibEntry
    source_label: str
    alternates: list[VerificationMatch]
    input_type: str
    input_key: str | None = None

    def to_bib_entry(self) -> BibEntry:
        fields = dict(self.entry.fields)
        fields["x_status"] = self.status
        fields["x_confidence"] = f"{self.confidence:.2f}"
        fields["x_source"] = self.source_label
        fields["x_query"] = self.query
        fields["x_context"] = self.context
        if self.input_type == "bib" and self.input_key:
            fields["x_input_key"] = self.input_key
        if self.alternates:
            fields["x_alternates"] = " || ".join(
                _serialize_alternate(match) for match in self.alternates
            )
        return BibEntry(
            entry_type=self.entry.entry_type,
            citation_key=self.entry.citation_key,
            fields=fields,
        )

    def to_dict(self) -> dict[str, object]:
        return {
            "query": self.query,
            "context": self.context,
            "input_type": self.input_type,
            "input_key": self.input_key,
            "status": self.status,
            "confidence": round(self.confidence, 4),
            "source_label": self.source_label,
            "entry": {
                "citation_key": self.entry.citation_key,
                "entry_type": self.entry.entry_type,
                "fields": dict(self.entry.fields),
            },
            "alternates": [
                {
                    "citation_key": match.entry.citation_key,
                    "entry_type": match.entry.entry_type,
                    "score": round(match.score, 4),
                    "source_label": match.source_label,
                    "fields": dict(match.entry.fields),
                }
                for match in self.alternates
            ],
        }


class BibliographyVerifier:
    def __init__(self, resolver: MetadataResolver | None = None) -> None:
        self.resolver = resolver or MetadataResolver()

    def verify_string(self, value: str, context: str = "", limit: int = 5) -> VerificationResult:
        query_fields = _fields_from_string(value)
        return self._verify_query(
            query_fields,
            query=value,
            context=context,
            limit=limit,
            input_type="string",
        )

    def verify_bib_entry(self, entry: BibEntry, context: str = "", limit: int = 5) -> VerificationResult:
        query = " ".join(
            part
            for part in (
                entry.fields.get("doi", ""),
                entry.fields.get("title", ""),
                entry.fields.get("author", ""),
                entry.fields.get("year", ""),
            )
            if part
        ).strip()
        query_fields = {
            "title": entry.fields.get("title", ""),
            "authors": _split_authors(entry.fields.get("author", "")),
            "year": entry.fields.get("year", ""),
            "venue": entry.fields.get("journal", "") or entry.fields.get("booktitle", ""),
        }
        return self._verify_query(
            query_fields,
            query=query or entry.citation_key,
            context=context,
            limit=limit,
            input_type="bib",
            input_key=entry.citation_key,
            source_entry=entry,
        )

    def verify_strings(self, values: list[str], context: str = "", limit: int = 5) -> list[VerificationResult]:
        return [self.verify_string(value, context=context, limit=limit) for value in values if value.strip()]

    def verify_bib_file(self, path: str | Path, context: str = "", limit: int = 5) -> list[VerificationResult]:
        entries = parse_bibtex(Path(path).read_text(encoding="utf-8"))
        return [self.verify_bib_entry(entry, context=context, limit=limit) for entry in entries]

    def _verify_query(
        self,
        query_fields: dict[str, object],
        *,
        query: str,
        context: str,
        limit: int,
        input_type: str,
        input_key: str | None = None,
        source_entry: BibEntry | None = None,
    ) -> VerificationResult:
        if source_entry is not None and source_entry.fields.get("doi"):
            direct = self.resolver.resolve_doi(source_entry.fields["doi"]) or self.resolver.resolve_datacite_doi(
                source_entry.fields["doi"]
            )
            if direct is not None:
                return VerificationResult(
                    query=query,
                    context=context,
                    status="exact",
                    confidence=1.0,
                    entry=direct.entry,
                    source_label=direct.source_label,
                    alternates=[],
                    input_type=input_type,
                    input_key=input_key,
                )

        candidate_limit = max(1, limit)
        candidates = self._collect_candidates(
            title=str(query_fields.get("title", "")),
            query=query,
            limit=candidate_limit,
        )
        scored = [
            VerificationMatch(
                entry=entry,
                score=_score_candidate(query_fields, context, entry),
                source_label=source_label,
            )
            for entry, source_label in candidates
        ]
        scored.sort(
            key=lambda item: (
                -item.score,
                item.entry.fields.get("year", ""),
                item.entry.citation_key,
            )
        )

        best = scored[0] if scored else None
        if best is None:
            fallback_entry = source_entry or _placeholder_entry(query_fields, query, input_key)
            return VerificationResult(
                query=query,
                context=context,
                status="not_found",
                confidence=0.0,
                entry=fallback_entry,
                source_label="none",
                alternates=[],
                input_type=input_type,
                input_key=input_key,
            )

        status = _status_from_match(best)
        return VerificationResult(
            query=query,
            context=context,
            status=status,
            confidence=best.score,
            entry=best.entry,
            source_label=best.source_label,
            alternates=scored[1: min(len(scored), 4)],
            input_type=input_type,
            input_key=input_key,
        )

    def _collect_candidates(self, *, title: str, query: str, limit: int) -> list[tuple[BibEntry, str]]:
        candidates: list[tuple[BibEntry, str]] = []
        seen: set[str] = set()
        search_title = title or query

        for source_name, source_entries in (
            ("crossref", self.resolver.search_crossref(search_title, limit=limit)),
            ("openalex", self.resolver.search_openalex(search_title, limit=limit)),
            ("datacite", self.resolver.search_datacite(search_title, limit=limit)),
        ):
            for entry in source_entries:
                signature = _candidate_signature(entry)
                if signature in seen:
                    continue
                seen.add(signature)
                candidates.append((entry, f"{source_name}:search:{search_title}"))
        return candidates


def render_verification_results(results: list[VerificationResult], output_format: str) -> str:
    if output_format == "json":
        return json.dumps([result.to_dict() for result in results], indent=2)
    return render_bibtex([result.to_bib_entry() for result in results])


def _fields_from_string(value: str) -> dict[str, object]:
    year_match = re.search(r"\b(1[6-9]\d{2}|20\d{2}|21\d{2})\b", value)
    year = year_match.group(1) if year_match else ""
    quoted_title = re.search(r"[\"“”‘’'`](.+?)[\"“”‘’'`]", value)
    title = quoted_title.group(1).strip() if quoted_title else ""
    author_source = value
    if quoted_title:
        author_source = author_source.replace(quoted_title.group(0), " ")
    if year:
        author_source = author_source.replace(year, " ")
    author_tokens = [token.strip(",.;:") for token in author_source.split() if token.strip(",.;:")]
    authors: list[str] = [author_tokens[0]] if author_tokens else []
    return {"title": title, "authors": authors, "year": year, "venue": ""}


def _score_candidate(query_fields: dict[str, object], context: str, entry: BibEntry) -> float:
    score = 0.0
    query_title = _tokenize(str(query_fields.get("title", "")))
    candidate_title = _tokenize(entry.fields.get("title", ""))
    if query_title:
        overlap = len(query_title & candidate_title) / max(1, len(query_title))
        if overlap >= 0.9:
            score += 0.55
        elif overlap >= 0.7:
            score += 0.40
        elif overlap >= 0.5:
            score += 0.20

    query_authors = [author for author in query_fields.get("authors", []) if author]
    if query_authors:
        query_surname = _surname(query_authors[0])
        candidate_surname = _surname(_split_authors(entry.fields.get("author", ""))[0]) if entry.fields.get("author") else ""
        if query_surname and query_surname == candidate_surname:
            score += 0.25

    query_year = str(query_fields.get("year", "")).strip()
    candidate_year = entry.fields.get("year", "").strip()
    if query_year and candidate_year:
        if query_year == candidate_year:
            score += 0.15
        else:
            try:
                delta = abs(int(query_year) - int(candidate_year))
                if delta == 1:
                    score += 0.07
            except ValueError:
                pass

    query_venue = str(query_fields.get("venue", "")).strip()
    candidate_venue = entry.fields.get("journal", "").strip() or entry.fields.get("booktitle", "").strip()
    if query_venue and candidate_venue and _normalize(query_venue) == _normalize(candidate_venue):
        score += 0.05

    if context:
        context_tokens = _tokenize(context)
        abstract_tokens = _tokenize(entry.fields.get("abstract", ""))
        if context_tokens & abstract_tokens:
            score += 0.05

    return min(score, 1.0)


def _status_from_match(match: VerificationMatch) -> str:
    if match.entry.fields.get("doi") and match.score >= 0.95:
        return "exact"
    if match.score >= 0.75:
        return "high_confidence"
    return "ambiguous"


def _split_authors(value: str) -> list[str]:
    return [part.strip() for part in value.split(" and ") if part.strip()]


def _surname(value: str) -> str:
    text = value.strip()
    if not text:
        return ""
    if "," in text:
        return text.split(",", 1)[0].strip().lower()
    return text.split()[-1].strip().lower()


def _tokenize(value: str) -> set[str]:
    return {token for token in re.split(r"\W+", value.lower()) if token}


def _normalize(value: str) -> str:
    return " ".join(value.lower().split())


def _serialize_alternate(match: VerificationMatch) -> str:
    authors = _split_authors(match.entry.fields.get("author", ""))
    first_author = authors[0] if authors else ""
    return "|".join(
        (
            match.entry.fields.get("doi", ""),
            match.entry.fields.get("title", ""),
            first_author,
            match.entry.fields.get("year", ""),
            f"{match.score:.2f}",
        )
    )


def _candidate_signature(entry: BibEntry) -> str:
    return "|".join(
        (
            entry.fields.get("doi", "").lower(),
            _normalize(entry.fields.get("title", "")),
            entry.fields.get("year", ""),
        )
    )


def _placeholder_entry(query_fields: dict[str, object], query: str, input_key: str | None) -> BibEntry:
    title = str(query_fields.get("title", "")) or query
    authors = query_fields.get("authors", [])
    year = str(query_fields.get("year", ""))
    citation_key = input_key or _slugify_key(title or query)
    fields = {"title": title}
    if authors:
        fields["author"] = " and ".join(str(author) for author in authors)
    if year:
        fields["year"] = year
    return BibEntry(entry_type="misc", citation_key=citation_key, fields=fields)


def _slugify_key(value: str) -> str:
    slug = re.sub(r"[^a-z0-9]+", "", value.lower())
    return slug[:40] or "verification"
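The `_score_candidate` heuristic in the new module is driven mainly by token-set overlap between the query title and each candidate title. A self-contained restatement of just that banding logic (the function names below are local to this sketch, not imports from citegeist):

```python
import re

def tokenize(value: str) -> set[str]:
    # lowercase word tokens, mirroring the module's _tokenize helper
    return {token for token in re.split(r"\W+", value.lower()) if token}

def title_overlap_score(query_title: str, candidate_title: str) -> float:
    # fraction of query-title tokens found in the candidate title,
    # banded the same way as _score_candidate's title component
    query_tokens = tokenize(query_title)
    if not query_tokens:
        return 0.0
    overlap = len(query_tokens & tokenize(candidate_title)) / max(1, len(query_tokens))
    if overlap >= 0.9:
        return 0.55
    if overlap >= 0.7:
        return 0.40
    if overlap >= 0.5:
        return 0.20
    return 0.0
```

Because the title component caps at 0.55, a title match alone can never reach the 0.75 `high_confidence` threshold; surname (0.25) or year (0.15) agreement must contribute, which is what makes single-field coincidences land in `ambiguous`.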
@ -138,6 +138,115 @@ def test_cli_provenance_and_status_updates(tmp_path: Path):
|
||||||
assert "reviewed" in status.stdout
|
assert "reviewed" in status.stdout
|
||||||
|
|
||||||
|
|
||||||
|
def test_cli_verify_string_outputs_json_with_audit_fields(tmp_path: Path):
    from citegeist.bibtex import BibEntry

    database = tmp_path / "library.sqlite3"
    with patch("citegeist.cli.BibliographyVerifier.verify_string") as mocked_verify:
        from citegeist.verify import VerificationResult

        mocked_verify.return_value = VerificationResult(
            query='"Graph-first bibliography augmentation" Smith 2024',
            context="citation graphs",
            status="high_confidence",
            confidence=0.82,
            entry=BibEntry(
                entry_type="article",
                citation_key="smith2024graphs",
                fields={
                    "author": "Smith, Jane",
                    "title": "Graph-first bibliography augmentation",
                    "year": "2024",
                    "doi": "10.1000/example-doi",
                },
            ),
            source_label="crossref:search:Graph-first bibliography augmentation",
            alternates=[],
            input_type="string",
            input_key=None,
        )

        stdout_buffer = io.StringIO()
        with redirect_stdout(stdout_buffer):
            exit_code = main(
                [
                    "--db",
                    str(database),
                    "verify",
                    "--string",
                    '"Graph-first bibliography augmentation" Smith 2024',
                    "--context",
                    "citation graphs",
                    "--format",
                    "json",
                ]
            )

    assert exit_code == 0
    payload = json.loads(stdout_buffer.getvalue())
    assert payload[0]["status"] == "high_confidence"
    assert payload[0]["source_label"] == "crossref:search:Graph-first bibliography augmentation"
    assert payload[0]["entry"]["citation_key"] == "smith2024graphs"
def test_cli_verify_bib_outputs_json(tmp_path: Path):
    bib_path = tmp_path / "partial.bib"
    bib_path.write_text(
        """
@misc{roughentry,
  title = {Graph-first bibliography augmentation},
  year = {2024}
}
""",
        encoding="utf-8",
    )

    with patch("citegeist.cli.BibliographyVerifier.verify_bib_file") as mocked_verify:
        from citegeist.bibtex import BibEntry
        from citegeist.verify import VerificationResult

        mocked_verify.return_value = [
            VerificationResult(
                query="Graph-first bibliography augmentation 2024",
                context="",
                status="ambiguous",
                confidence=0.61,
                entry=BibEntry(
                    entry_type="article",
                    citation_key="candidate2024",
                    fields={
                        "title": "Graph-first bibliography augmentation",
                        "year": "2024",
                    },
                ),
                source_label="openalex:search:Graph-first bibliography augmentation",
                alternates=[],
                input_type="bib",
                input_key="roughentry",
            )
        ]

        stdout_buffer = io.StringIO()
        with redirect_stdout(stdout_buffer):
            exit_code = main(
                [
                    "--db",
                    str(tmp_path / "library.sqlite3"),
                    "verify",
                    "--bib",
                    str(bib_path),
                    "--format",
                    "json",
                ]
            )

    assert exit_code == 0
    payload = json.loads(stdout_buffer.getvalue())
    assert payload[0]["status"] == "ambiguous"
    assert payload[0]["input_key"] == "roughentry"
    assert payload[0]["entry"]["citation_key"] == "candidate2024"
def test_cli_resolve_updates_entry(tmp_path: Path):
    bib_path = tmp_path / "input.bib"
    bib_path.write_text(
@@ -0,0 +1,89 @@
from __future__ import annotations

from citegeist.bibtex import BibEntry
from citegeist.resolve import Resolution
from citegeist.verify import BibliographyVerifier


def test_verifier_uses_direct_doi_resolution_for_bib_entries():
    verifier = BibliographyVerifier()
    verifier.resolver.resolve_doi = lambda value: Resolution(  # type: ignore[method-assign]
        entry=BibEntry(
            entry_type="article",
            citation_key="doi101000example",
            fields={
                "author": "Smith, Jane",
                "title": "Resolved Work",
                "year": "2024",
                "doi": value,
            },
        ),
        source_type="resolver",
        source_label=f"crossref:doi:{value}",
    )

    result = verifier.verify_bib_entry(
        BibEntry(
            entry_type="misc",
            citation_key="seed2024",
            fields={"title": "Rough Work", "doi": "10.1000/example"},
        )
    )

    assert result.status == "exact"
    assert result.confidence == 1.0
    assert result.entry.fields["title"] == "Resolved Work"
    assert result.source_label == "crossref:doi:10.1000/example"


def test_verifier_scores_and_sorts_search_candidates():
    verifier = BibliographyVerifier()
    verifier.resolver.search_crossref = lambda title, limit=5: [  # type: ignore[method-assign]
        BibEntry(
            entry_type="article",
            citation_key="goodmatch",
            fields={
                "author": "Smith, Jane",
                "title": "Graph-first bibliography augmentation",
                "year": "2024",
                "doi": "10.1000/good",
            },
        ),
        BibEntry(
            entry_type="article",
            citation_key="weaker",
            fields={
                "author": "Doe, Alex",
                "title": "Graph search methods",
                "year": "2023",
            },
        ),
    ]
    verifier.resolver.search_openalex = lambda title, limit=5: []  # type: ignore[method-assign]
    verifier.resolver.search_datacite = lambda title, limit=5: []  # type: ignore[method-assign]

    result = verifier.verify_string('"Graph-first bibliography augmentation" Smith 2024')

    assert result.entry.citation_key == "goodmatch"
    assert result.status in {"high_confidence", "exact"}
    assert result.alternates[0].entry.citation_key == "weaker"


def test_verification_result_to_bib_entry_contains_audit_fields():
    verifier = BibliographyVerifier()
    verifier.resolver.search_crossref = lambda title, limit=5: []  # type: ignore[method-assign]
    verifier.resolver.search_openalex = lambda title, limit=5: []  # type: ignore[method-assign]
    verifier.resolver.search_datacite = lambda title, limit=5: []  # type: ignore[method-assign]

    result = verifier._verify_query(  # type: ignore[attr-defined]
        {"title": "Missing Work", "authors": [], "year": "", "venue": ""},
        query="Missing Work",
        context="",
        limit=1,
        input_type="string",
    )

    bib_entry = result.to_bib_entry()

    assert bib_entry.fields["x_status"] == "not_found"
    assert bib_entry.fields["x_query"] == "Missing Work"
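The candidate-sorting test above assumes `BibliographyVerifier` assigns each search hit a confidence score before choosing a best match. The project's actual scoring is not shown in this diff; a minimal sketch of one plausible scheme (fuzzy title similarity plus an exact-year bonus; every name here is hypothetical) is:

```python
from difflib import SequenceMatcher


def score_candidate(query_title: str, query_year: str,
                    cand_title: str, cand_year: str) -> float:
    # Fuzzy title similarity carries most of the weight; an exact
    # year match adds a fixed bonus, and the total is capped at 1.0.
    title_sim = SequenceMatcher(
        None, query_title.lower(), cand_title.lower()
    ).ratio()
    year_bonus = 0.2 if query_year and query_year == cand_year else 0.0
    return min(1.0, 0.8 * title_sim + year_bonus)


good = score_candidate("Graph-first bibliography augmentation", "2024",
                       "Graph-first bibliography augmentation", "2024")
weak = score_candidate("Graph-first bibliography augmentation", "2024",
                       "Graph search methods", "2023")
print(good > weak)  # True
```

Under a scheme like this, the `goodmatch` candidate outranks `weaker`, matching the ordering the test asserts; thresholds on the score would then map to the `exact` / `high_confidence` / `ambiguous` / `not_found` statuses the suite exercises.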