Verification added to process.
parent fc7e1f1844
commit 35d8eb8386
.gitignore
@@ -3,4 +3,5 @@ __pycache__/
 .venv/
 .cache/
 *.pyc
+*.egg-info/
 library.sqlite3
README.md | 10
@@ -48,6 +48,7 @@ The initial repo includes:
 - a small CLI for ingest, search, inspection, and export;
 - review-state tracking on entries, per-field ingest provenance, and field-level conflict review;
 - plaintext reference extraction into draft BibTeX for numbered, APA-like, wrapped-line, and simple book-style references;
+- standalone verification and disambiguation of free-text references or partial BibTeX into auditable BibTeX/JSON results with `x_status`, `x_confidence`, `x_source`, `x_query`, and alternate-candidate traces;
 - identifier-first metadata resolution for DOI, OpenAlex, DBLP, arXiv, and DataCite-backed entries, with OpenAlex/DataCite title-search fallback;
 - local citation-graph traversal over stored `cites`, `cited_by`, and `crossref` edges;
 - Crossref- and OpenAlex-backed graph expansion that materializes draft related works and edge provenance;
@@ -132,6 +133,8 @@ PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 apply-conflict
 PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 bootstrap --seed-bib seed.bib --topic "bayesian nonparametrics"
 PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 bootstrap --topic "bayesian nonparametrics" --preview --topic-commit-limit 5
 PYTHONPATH=src .venv/bin/python -m citegeist extract references.txt --output draft.bib
+PYTHONPATH=src .venv/bin/python -m citegeist verify --string '"Graph-first bibliography augmentation" Smith 2024' --context "citation graphs" --format json
+PYTHONPATH=src .venv/bin/python -m citegeist verify --bib draft.bib --output verified.bib
 PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 resolve smith2024graphs
 PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 resolve-stubs --doi-only --preview --limit 25
 PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 resolve-stubs --doi-only --all-misc --limit 25
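The `verify --format json` invocation above emits a list of result objects. Based on the `to_dict()` serializer that this commit adds in the verify module, each object carries the audit fields alongside the resolved entry; the concrete values below are illustrative, taken from the commit's own test fixtures:

```python
# Sketch of one element of the JSON array emitted by `verify --format json`.
# Field names follow the verify module's to_dict(); values are example data.
result = {
    "query": '"Graph-first bibliography augmentation" Smith 2024',
    "context": "citation graphs",
    "input_type": "string",
    "input_key": None,
    "status": "high_confidence",
    "confidence": 0.82,
    "source_label": "crossref:search:Graph-first bibliography augmentation",
    "entry": {
        "citation_key": "smith2024graphs",
        "entry_type": "article",
        "fields": {"title": "Graph-first bibliography augmentation", "year": "2024"},
    },
    "alternates": [],
}
```

The `status`/`confidence`/`source_label` triple is what makes the output auditable: a downstream reviewer can filter on `status` before committing anything to the library.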
@@ -167,6 +170,13 @@ OpenAlex expansion is also conservative about noisy secondary records. Discoveri
 
 For live-source development, prefer fixture-backed or cache-backed source clients so resolver and expansion work can be exercised repeatedly without re-hitting upstream APIs on every run.
 
+## Adopted Ideas From Earlier Repos
+
+`citegeist` now absorbs two useful patterns from adjacent bibliography tools while keeping them inside the main Python 3 package boundary:
+
+- From `VeriBib`: a standalone `verify` workflow for ambiguous strings or rough BibTeX, with explicit confidence/status audit fields and alternate-candidate traces before you commit changes to the main library.
+- From `TOA-Bib-Updater`: resumable, artifact-oriented corpus processing remains the preferred process model for large imports. In practice this already appears in the TalkOrigins example pipeline through saved manifests, review exports, duplicate reports, and staged topic-phrase review flows.
+
 ## Example Application
 
 - Use `stage-topic-phrases` to load those suggestions into the database as review items. Staging stores the candidate in `suggested_phrase` and marks the topic `pending` without changing the active `expansion_phrase`.
ROADMAP.md | 19
@@ -25,8 +25,22 @@ Completed:
 - lightweight BibTeX parsing;
 - SQLite storage for entries, creators, identifiers, and relations;
 - local text search using SQLite FTS5 when available;
+- standalone verification/disambiguation output for free-text references and partial BibTeX with auditable match metadata;
 - tests for ingest, relation storage, and search.
 
+## Comparison Notes From Related Repos
+
+The adjacent `TOA-Bib-Updater` and `VeriBib` repositories are useful prior art, but they contribute different things:
+
+- `VeriBib` contributes a good pre-ingest verification pattern: inspect ambiguous strings or partial BibTeX, rank candidates from legal metadata sources, and emit explicit audit fields instead of silently trusting a single match.
+- `TOA-Bib-Updater` contributes process discipline more than core data modeling: resumable long-running jobs, preserved source artifacts, and generated review outputs for manual inspection.
+
+`citegeist` should absorb those ideas where they improve the main local research workflow:
+
+1. keep verification and auditability in the core package, not just entry resolution after ingest;
+2. keep resumable manifests and review exports for large acquisition workflows, especially example pipelines and batch imports;
+3. avoid coupling the core model to brittle source-specific scraping logic.
+
 ## Phase 1: Core Ingestion And Export
 
 Priority: P0
@@ -67,7 +81,8 @@ Tasks:
 - support ingestion of OCR- or PDF-derived plaintext bibliography sections;
 - add normalization for author names, years, title casing, and page ranges;
 - prefer sentence-boundary venue detection over naive keyword splits so title text containing words like `report` is not truncated;
-- repair partially extracted venue stubs such as `Occas.` or `Proc.` by reparsing the full raw reference line when the structured fields are obviously incomplete;
+- repair partially extracted venue stubs such as `Occas.` or `Proc.` by reparsing the full raw reference line when the structured fields are obviously
+  incomplete;
 - preserve improved local draft parses even when remote enrichment remains unresolved, so later parser fixes can refresh stored BibTeX without requiring a successful metadata match;
 - build gold-test fixtures from real, messy reference examples.
@@ -122,7 +137,6 @@ Tasks:
 - expose unresolved nodes so the user can decide what to enrich next.
 
 Why this matters:
 
 - this is central to literature discovery rather than mere bibliography cleanup;
 - it turns the database into a research navigation tool.
@@ -164,7 +178,6 @@ Goal:
 Broaden source acquisition without mixing that complexity into the core model.
 
 Tasks:
 
 - add source adapters for open-access theses and dissertation repositories;
 - add support for harvesting publisher citation pages and preprint metadata pages;
 - define per-source import provenance and rate-limit behavior;
src/citegeist/__init__.py
@@ -7,12 +7,14 @@ from .harvest import OaiMetadataFormat, OaiPmhHarvester, OaiSet
 from .resolve import MetadataResolver, merge_entries, merge_entries_with_conflicts
 from .sources import SourceClient
 from .storage import BibliographyStore
+from .verify import BibliographyVerifier, VerificationResult, VerificationMatch
 
 __all__ = [
     "BibEntry",
     "BatchBootstrapRunner",
     "BatchJobResult",
     "BibliographyStore",
+    "BibliographyVerifier",
     "BootstrapResult",
     "Bootstrapper",
     "CrossrefExpander",
@@ -22,6 +24,8 @@ __all__ = [
     "OaiMetadataFormat",
     "OaiSet",
     "SourceClient",
+    "VerificationMatch",
+    "VerificationResult",
     "extract_references",
     "load_batch_jobs",
     "merge_entries",
src/citegeist/cli.py
@@ -16,6 +16,7 @@ from .extract import extract_references
 from .harvest import OaiPmhHarvester
 from .resolve import MetadataResolver, merge_entries_with_conflicts
 from .storage import BibliographyStore
+from .verify import BibliographyVerifier, render_verification_results
 
 
 def build_parser() -> argparse.ArgumentParser:
@@ -69,6 +70,24 @@ def build_parser() -> argparse.ArgumentParser:
     extract_parser.add_argument("input", help="Plaintext file containing bibliography-style references")
     extract_parser.add_argument("--output", help="Write extracted BibTeX to a file instead of stdout")
 
+    verify_parser = subparsers.add_parser(
+        "verify",
+        help="Verify or disambiguate free-text references or BibTeX entries without modifying the database",
+    )
+    verify_group = verify_parser.add_mutually_exclusive_group(required=True)
+    verify_group.add_argument("--string", help="Single free-text reference query")
+    verify_group.add_argument("--list", dest="list_input", help="Path to a text file with one query per line")
+    verify_group.add_argument("--bib", help="Path to a BibTeX file whose entries should be verified")
+    verify_parser.add_argument("--context", default="", help="Optional topic context used for scoring")
+    verify_parser.add_argument("--limit", type=int, default=5, help="Maximum candidates to inspect per input")
+    verify_parser.add_argument(
+        "--format",
+        choices=["bibtex", "json"],
+        default="bibtex",
+        help="Output format for verification results",
+    )
+    verify_parser.add_argument("--output", help="Write verification results to a file instead of stdout")
+
     resolve_parser = subparsers.add_parser("resolve", help="Enrich stored entries from external metadata sources")
     resolve_parser.add_argument("citation_keys", nargs="+", help="Citation keys to enrich")
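The verify subparser above leans on argparse's `add_mutually_exclusive_group(required=True)` so that exactly one of `--string`, `--list`, or `--bib` must be supplied. A minimal standalone demonstration of that behavior (option names mirror the diff; everything else is simplified for illustration):

```python
import argparse

# Stand-in for the verify subparser's input group: exactly one source flag.
parser = argparse.ArgumentParser(prog="verify-demo")
group = parser.add_mutually_exclusive_group(required=True)
group.add_argument("--string", help="single free-text reference query")
group.add_argument("--list", dest="list_input", help="file with one query per line")
group.add_argument("--bib", help="BibTeX file to verify")

args = parser.parse_args(["--string", '"Graph-first bibliography augmentation" Smith 2024'])
print(args.string is not None, args.list_input, args.bib)  # prints: True None None
```

Passing two of the flags together, or none at all, makes argparse exit with a usage error before any verifier code runs, which keeps the `_run_verify` dispatch below simple.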
@@ -535,6 +554,8 @@ def main(argv: list[str] | None = None) -> int:
         return _run_apply_conflict(store, args.citation_key, args.field_name)
     if args.command == "extract":
         return _run_extract(Path(args.input), args.output)
+    if args.command == "verify":
+        return _run_verify(args.string, args.list_input, args.bib, args.context, args.limit, args.format, args.output)
     if args.command == "resolve":
         return _run_resolve(store, args.citation_keys)
     if args.command == "resolve-stubs":
@ -783,6 +804,36 @@ def _run_extract(input_path: Path, output: str | None) -> int:
|
||||||
return 0
|
return 0
|
||||||
|
|
||||||
|
|
||||||
|
def _run_verify(
|
||||||
|
string_input: str | None,
|
||||||
|
list_input: str | None,
|
||||||
|
bib_input: str | None,
|
||||||
|
context: str,
|
||||||
|
limit: int,
|
||||||
|
output_format: str,
|
||||||
|
output: str | None,
|
||||||
|
) -> int:
|
||||||
|
verifier = BibliographyVerifier()
|
||||||
|
if string_input is not None:
|
||||||
|
results = [verifier.verify_string(string_input, context=context, limit=limit)]
|
||||||
|
elif list_input is not None:
|
||||||
|
values = [line.strip() for line in Path(list_input).read_text(encoding="utf-8").splitlines() if line.strip()]
|
||||||
|
results = verifier.verify_strings(values, context=context, limit=limit)
|
||||||
|
elif bib_input is not None:
|
||||||
|
results = verifier.verify_bib_file(bib_input, context=context, limit=limit)
|
||||||
|
else:
|
||||||
|
print("verify requires one input source", file=sys.stderr)
|
||||||
|
return 1
|
||||||
|
|
||||||
|
rendered = render_verification_results(results, output_format)
|
||||||
|
if output:
|
||||||
|
Path(output).write_text(rendered + ("\n" if rendered and not rendered.endswith("\n") else ""), encoding="utf-8")
|
||||||
|
else:
|
||||||
|
if rendered:
|
||||||
|
print(rendered)
|
||||||
|
return 0
|
||||||
|
|
||||||
|
|
||||||
def _print_progress(label: str, index: int, total: int, detail: str | None = None) -> None:
|
def _print_progress(label: str, index: int, total: int, detail: str | None = None) -> None:
|
||||||
message = f"[{index}/{total}] {label}"
|
message = f"[{index}/{total}] {label}"
|
||||||
if detail:
|
if detail:
|
||||||
|
|
|
||||||
src/citegeist/verify.py (new file)
@@ -0,0 +1,358 @@
from __future__ import annotations

import json
import re
from dataclasses import dataclass
from pathlib import Path

from .bibtex import BibEntry, parse_bibtex, render_bibtex
from .resolve import MetadataResolver, Resolution


@dataclass(slots=True)
class VerificationMatch:
    entry: BibEntry
    score: float
    source_label: str


@dataclass(slots=True)
class VerificationResult:
    query: str
    context: str
    status: str
    confidence: float
    entry: BibEntry
    source_label: str
    alternates: list[VerificationMatch]
    input_type: str
    input_key: str | None = None

    def to_bib_entry(self) -> BibEntry:
        fields = dict(self.entry.fields)
        fields["x_status"] = self.status
        fields["x_confidence"] = f"{self.confidence:.2f}"
        fields["x_source"] = self.source_label
        fields["x_query"] = self.query
        fields["x_context"] = self.context
        if self.input_type == "bib" and self.input_key:
            fields["x_input_key"] = self.input_key
        if self.alternates:
            fields["x_alternates"] = " || ".join(
                _serialize_alternate(match) for match in self.alternates
            )
        return BibEntry(
            entry_type=self.entry.entry_type,
            citation_key=self.entry.citation_key,
            fields=fields,
        )

    def to_dict(self) -> dict[str, object]:
        return {
            "query": self.query,
            "context": self.context,
            "input_type": self.input_type,
            "input_key": self.input_key,
            "status": self.status,
            "confidence": round(self.confidence, 4),
            "source_label": self.source_label,
            "entry": {
                "citation_key": self.entry.citation_key,
                "entry_type": self.entry.entry_type,
                "fields": dict(self.entry.fields),
            },
            "alternates": [
                {
                    "citation_key": match.entry.citation_key,
                    "entry_type": match.entry.entry_type,
                    "score": round(match.score, 4),
                    "source_label": match.source_label,
                    "fields": dict(match.entry.fields),
                }
                for match in self.alternates
            ],
        }


class BibliographyVerifier:
    def __init__(self, resolver: MetadataResolver | None = None) -> None:
        self.resolver = resolver or MetadataResolver()

    def verify_string(self, value: str, context: str = "", limit: int = 5) -> VerificationResult:
        query_fields = _fields_from_string(value)
        return self._verify_query(
            query_fields,
            query=value,
            context=context,
            limit=limit,
            input_type="string",
        )

    def verify_bib_entry(self, entry: BibEntry, context: str = "", limit: int = 5) -> VerificationResult:
        query = " ".join(
            part
            for part in (
                entry.fields.get("doi", ""),
                entry.fields.get("title", ""),
                entry.fields.get("author", ""),
                entry.fields.get("year", ""),
            )
            if part
        ).strip()
        query_fields = {
            "title": entry.fields.get("title", ""),
            "authors": _split_authors(entry.fields.get("author", "")),
            "year": entry.fields.get("year", ""),
            "venue": entry.fields.get("journal", "") or entry.fields.get("booktitle", ""),
        }
        return self._verify_query(
            query_fields,
            query=query or entry.citation_key,
            context=context,
            limit=limit,
            input_type="bib",
            input_key=entry.citation_key,
            source_entry=entry,
        )

    def verify_strings(self, values: list[str], context: str = "", limit: int = 5) -> list[VerificationResult]:
        return [self.verify_string(value, context=context, limit=limit) for value in values if value.strip()]

    def verify_bib_file(self, path: str | Path, context: str = "", limit: int = 5) -> list[VerificationResult]:
        entries = parse_bibtex(Path(path).read_text(encoding="utf-8"))
        return [self.verify_bib_entry(entry, context=context, limit=limit) for entry in entries]

    def _verify_query(
        self,
        query_fields: dict[str, object],
        *,
        query: str,
        context: str,
        limit: int,
        input_type: str,
        input_key: str | None = None,
        source_entry: BibEntry | None = None,
    ) -> VerificationResult:
        if source_entry is not None and source_entry.fields.get("doi"):
            direct = self.resolver.resolve_doi(source_entry.fields["doi"]) or self.resolver.resolve_datacite_doi(
                source_entry.fields["doi"]
            )
            if direct is not None:
                return VerificationResult(
                    query=query,
                    context=context,
                    status="exact",
                    confidence=1.0,
                    entry=direct.entry,
                    source_label=direct.source_label,
                    alternates=[],
                    input_type=input_type,
                    input_key=input_key,
                )

        candidate_limit = max(1, limit)
        candidates = self._collect_candidates(
            title=str(query_fields.get("title", "")),
            query=query,
            limit=candidate_limit,
        )
        scored = [
            VerificationMatch(
                entry=entry,
                score=_score_candidate(query_fields, context, entry),
                source_label=source_label,
            )
            for entry, source_label in candidates
        ]
        scored.sort(
            key=lambda item: (
                -item.score,
                item.entry.fields.get("year", ""),
                item.entry.citation_key,
            )
        )

        best = scored[0] if scored else None
        if best is None:
            fallback_entry = source_entry or _placeholder_entry(query_fields, query, input_key)
            return VerificationResult(
                query=query,
                context=context,
                status="not_found",
                confidence=0.0,
                entry=fallback_entry,
                source_label="none",
                alternates=[],
                input_type=input_type,
                input_key=input_key,
            )

        status = _status_from_match(best)
        return VerificationResult(
            query=query,
            context=context,
            status=status,
            confidence=best.score,
            entry=best.entry,
            source_label=best.source_label,
            alternates=scored[1: min(len(scored), 4)],
            input_type=input_type,
            input_key=input_key,
        )

    def _collect_candidates(self, *, title: str, query: str, limit: int) -> list[tuple[BibEntry, str]]:
        candidates: list[tuple[BibEntry, str]] = []
        seen: set[str] = set()
        search_title = title or query

        for source_name, source_entries in (
            ("crossref", self.resolver.search_crossref(search_title, limit=limit)),
            ("openalex", self.resolver.search_openalex(search_title, limit=limit)),
            ("datacite", self.resolver.search_datacite(search_title, limit=limit)),
        ):
            for entry in source_entries:
                signature = _candidate_signature(entry)
                if signature in seen:
                    continue
                seen.add(signature)
                candidates.append((entry, f"{source_name}:search:{search_title}"))
        return candidates


def render_verification_results(results: list[VerificationResult], output_format: str) -> str:
    if output_format == "json":
        return json.dumps([result.to_dict() for result in results], indent=2)
    return render_bibtex([result.to_bib_entry() for result in results])


def _fields_from_string(value: str) -> dict[str, object]:
    year_match = re.search(r"\b(1[6-9]\d{2}|20\d{2}|21\d{2})\b", value)
    year = year_match.group(1) if year_match else ""
    quoted_title = re.search(r"[\"“”‘’'`](.+?)[\"“”‘’'`]", value)
    title = quoted_title.group(1).strip() if quoted_title else ""
    author_source = value
    if quoted_title:
        author_source = author_source.replace(quoted_title.group(0), " ")
    if year:
        author_source = author_source.replace(year, " ")
    author_tokens = [token.strip(",.;:") for token in author_source.split() if token.strip(",.;:")]
    authors: list[str] = [author_tokens[0]] if author_tokens else []
    return {"title": title, "authors": authors, "year": year, "venue": ""}


def _score_candidate(query_fields: dict[str, object], context: str, entry: BibEntry) -> float:
    score = 0.0
    query_title = _tokenize(str(query_fields.get("title", "")))
    candidate_title = _tokenize(entry.fields.get("title", ""))
    if query_title:
        overlap = len(query_title & candidate_title) / max(1, len(query_title))
        if overlap >= 0.9:
            score += 0.55
        elif overlap >= 0.7:
            score += 0.40
        elif overlap >= 0.5:
            score += 0.20

    query_authors = [author for author in query_fields.get("authors", []) if author]
    if query_authors:
        query_surname = _surname(query_authors[0])
        candidate_surname = _surname(_split_authors(entry.fields.get("author", ""))[0]) if entry.fields.get("author") else ""
        if query_surname and query_surname == candidate_surname:
            score += 0.25

    query_year = str(query_fields.get("year", "")).strip()
    candidate_year = entry.fields.get("year", "").strip()
    if query_year and candidate_year:
        if query_year == candidate_year:
            score += 0.15
        else:
            try:
                delta = abs(int(query_year) - int(candidate_year))
                if delta == 1:
                    score += 0.07
            except ValueError:
                pass

    query_venue = str(query_fields.get("venue", "")).strip()
    candidate_venue = entry.fields.get("journal", "").strip() or entry.fields.get("booktitle", "").strip()
    if query_venue and candidate_venue and _normalize(query_venue) == _normalize(candidate_venue):
        score += 0.05

    if context:
        context_tokens = _tokenize(context)
        abstract_tokens = _tokenize(entry.fields.get("abstract", ""))
        if context_tokens & abstract_tokens:
            score += 0.05

    return min(score, 1.0)


def _status_from_match(match: VerificationMatch) -> str:
    if match.entry.fields.get("doi") and match.score >= 0.95:
        return "exact"
    if match.score >= 0.75:
        return "high_confidence"
    return "ambiguous"


def _split_authors(value: str) -> list[str]:
    return [part.strip() for part in value.split(" and ") if part.strip()]


def _surname(value: str) -> str:
    text = value.strip()
    if not text:
        return ""
    if "," in text:
        return text.split(",", 1)[0].strip().lower()
    return text.split()[-1].strip().lower()


def _tokenize(value: str) -> set[str]:
    return {token for token in re.split(r"\W+", value.lower()) if token}


def _normalize(value: str) -> str:
    return " ".join(value.lower().split())


def _serialize_alternate(match: VerificationMatch) -> str:
    authors = _split_authors(match.entry.fields.get("author", ""))
    first_author = authors[0] if authors else ""
    return "|".join(
        (
            match.entry.fields.get("doi", ""),
            match.entry.fields.get("title", ""),
            first_author,
            match.entry.fields.get("year", ""),
            f"{match.score:.2f}",
        )
    )


def _candidate_signature(entry: BibEntry) -> str:
    return "|".join(
        (
            entry.fields.get("doi", "").lower(),
            _normalize(entry.fields.get("title", "")),
            entry.fields.get("year", ""),
        )
    )


def _placeholder_entry(query_fields: dict[str, object], query: str, input_key: str | None) -> BibEntry:
    title = str(query_fields.get("title", "")) or query
    authors = query_fields.get("authors", [])
    year = str(query_fields.get("year", ""))
    citation_key = input_key or _slugify_key(title or query)
    fields = {"title": title}
    if authors:
        fields["author"] = " and ".join(str(author) for author in authors)
    if year:
        fields["year"] = year
    return BibEntry(entry_type="misc", citation_key=citation_key, fields=fields)


def _slugify_key(value: str) -> str:
    slug = re.sub(r"[^a-z0-9]+", "", value.lower())
    return slug[:40] or "verification"
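The `_score_candidate` heuristic in the new module is driven mainly by token-set overlap between the query title and each candidate title. A self-contained restatement of just that banding logic (the function names below are local to this sketch, not imports from citegeist):

```python
import re

def tokenize(value: str) -> set[str]:
    # lowercase word tokens, mirroring the module's _tokenize helper
    return {token for token in re.split(r"\W+", value.lower()) if token}

def title_overlap_score(query_title: str, candidate_title: str) -> float:
    # fraction of query-title tokens found in the candidate title,
    # banded the same way as _score_candidate's title component
    query_tokens = tokenize(query_title)
    if not query_tokens:
        return 0.0
    overlap = len(query_tokens & tokenize(candidate_title)) / max(1, len(query_tokens))
    if overlap >= 0.9:
        return 0.55
    if overlap >= 0.7:
        return 0.40
    if overlap >= 0.5:
        return 0.20
    return 0.0
```

Because the title component caps at 0.55, a title match alone can never reach the 0.75 `high_confidence` threshold; surname (0.25) or year (0.15) agreement must contribute, which is what makes single-field coincidences land in `ambiguous`.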
@ -138,6 +138,115 @@ def test_cli_provenance_and_status_updates(tmp_path: Path):
|
||||||
assert "reviewed" in status.stdout
|
assert "reviewed" in status.stdout
|
||||||
|
|
||||||
|
|
||||||
|
def test_cli_verify_string_outputs_json_with_audit_fields(tmp_path: Path):
    from citegeist.bibtex import BibEntry

    database = tmp_path / "library.sqlite3"
    with patch("citegeist.cli.BibliographyVerifier.verify_string") as mocked_verify:
        from citegeist.verify import VerificationResult

        mocked_verify.return_value = VerificationResult(
            query='"Graph-first bibliography augmentation" Smith 2024',
            context="citation graphs",
            status="high_confidence",
            confidence=0.82,
            entry=BibEntry(
                entry_type="article",
                citation_key="smith2024graphs",
                fields={
                    "author": "Smith, Jane",
                    "title": "Graph-first bibliography augmentation",
                    "year": "2024",
                    "doi": "10.1000/example-doi",
                },
            ),
            source_label="crossref:search:Graph-first bibliography augmentation",
            alternates=[],
            input_type="string",
            input_key=None,
        )

        stdout_buffer = io.StringIO()
        with redirect_stdout(stdout_buffer):
            exit_code = main(
                [
                    "--db",
                    str(database),
                    "verify",
                    "--string",
                    '"Graph-first bibliography augmentation" Smith 2024',
                    "--context",
                    "citation graphs",
                    "--format",
                    "json",
                ]
            )

    assert exit_code == 0
    payload = json.loads(stdout_buffer.getvalue())
    assert payload[0]["status"] == "high_confidence"
    assert payload[0]["source_label"] == "crossref:search:Graph-first bibliography augmentation"
    assert payload[0]["entry"]["citation_key"] == "smith2024graphs"
def test_cli_verify_bib_outputs_json(tmp_path: Path):
    bib_path = tmp_path / "partial.bib"
    bib_path.write_text(
        """
@misc{roughentry,
  title = {Graph-first bibliography augmentation},
  year = {2024}
}
""",
        encoding="utf-8",
    )

    with patch("citegeist.cli.BibliographyVerifier.verify_bib_file") as mocked_verify:
        from citegeist.bibtex import BibEntry
        from citegeist.verify import VerificationResult

        mocked_verify.return_value = [
            VerificationResult(
                query="Graph-first bibliography augmentation 2024",
                context="",
                status="ambiguous",
                confidence=0.61,
                entry=BibEntry(
                    entry_type="article",
                    citation_key="candidate2024",
                    fields={
                        "title": "Graph-first bibliography augmentation",
                        "year": "2024",
                    },
                ),
                source_label="openalex:search:Graph-first bibliography augmentation",
                alternates=[],
                input_type="bib",
                input_key="roughentry",
            )
        ]

        stdout_buffer = io.StringIO()
        with redirect_stdout(stdout_buffer):
            exit_code = main(
                [
                    "--db",
                    str(tmp_path / "library.sqlite3"),
                    "verify",
                    "--bib",
                    str(bib_path),
                    "--format",
                    "json",
                ]
            )

    assert exit_code == 0
    payload = json.loads(stdout_buffer.getvalue())
    assert payload[0]["status"] == "ambiguous"
    assert payload[0]["input_key"] == "roughentry"
    assert payload[0]["entry"]["citation_key"] == "candidate2024"
def test_cli_resolve_updates_entry(tmp_path: Path):
    bib_path = tmp_path / "input.bib"
    bib_path.write_text(
@@ -0,0 +1,89 @@
from __future__ import annotations

from citegeist.bibtex import BibEntry
from citegeist.resolve import Resolution
from citegeist.verify import BibliographyVerifier


def test_verifier_uses_direct_doi_resolution_for_bib_entries():
    verifier = BibliographyVerifier()
    verifier.resolver.resolve_doi = lambda value: Resolution(  # type: ignore[method-assign]
        entry=BibEntry(
            entry_type="article",
            citation_key="doi101000example",
            fields={
                "author": "Smith, Jane",
                "title": "Resolved Work",
                "year": "2024",
                "doi": value,
            },
        ),
        source_type="resolver",
        source_label=f"crossref:doi:{value}",
    )

    result = verifier.verify_bib_entry(
        BibEntry(
            entry_type="misc",
            citation_key="seed2024",
            fields={"title": "Rough Work", "doi": "10.1000/example"},
        )
    )

    assert result.status == "exact"
    assert result.confidence == 1.0
    assert result.entry.fields["title"] == "Resolved Work"
    assert result.source_label == "crossref:doi:10.1000/example"


def test_verifier_scores_and_sorts_search_candidates():
    verifier = BibliographyVerifier()
    verifier.resolver.search_crossref = lambda title, limit=5: [  # type: ignore[method-assign]
        BibEntry(
            entry_type="article",
            citation_key="goodmatch",
            fields={
                "author": "Smith, Jane",
                "title": "Graph-first bibliography augmentation",
                "year": "2024",
                "doi": "10.1000/good",
            },
        ),
        BibEntry(
            entry_type="article",
            citation_key="weaker",
            fields={
                "author": "Doe, Alex",
                "title": "Graph search methods",
                "year": "2023",
            },
        ),
    ]
    verifier.resolver.search_openalex = lambda title, limit=5: []  # type: ignore[method-assign]
    verifier.resolver.search_datacite = lambda title, limit=5: []  # type: ignore[method-assign]

    result = verifier.verify_string('"Graph-first bibliography augmentation" Smith 2024')

    assert result.entry.citation_key == "goodmatch"
    assert result.status in {"high_confidence", "exact"}
    assert result.alternates[0].entry.citation_key == "weaker"


def test_verification_result_to_bib_entry_contains_audit_fields():
    verifier = BibliographyVerifier()
    verifier.resolver.search_crossref = lambda title, limit=5: []  # type: ignore[method-assign]
    verifier.resolver.search_openalex = lambda title, limit=5: []  # type: ignore[method-assign]
    verifier.resolver.search_datacite = lambda title, limit=5: []  # type: ignore[method-assign]

    result = verifier._verify_query(  # type: ignore[attr-defined]
        {"title": "Missing Work", "authors": [], "year": "", "venue": ""},
        query="Missing Work",
        context="",
        limit=1,
        input_type="string",
    )

    bib_entry = result.to_bib_entry()

    assert bib_entry.fields["x_status"] == "not_found"
    assert bib_entry.fields["x_query"] == "Missing Work"
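The candidate-sorting test above assumes `BibliographyVerifier` assigns each search hit a confidence score before choosing a best match. The project's actual scoring is not shown in this diff; a minimal sketch of one plausible scheme (fuzzy title similarity plus an exact-year bonus; every name here is hypothetical) is:

```python
from difflib import SequenceMatcher


def score_candidate(query_title: str, query_year: str,
                    cand_title: str, cand_year: str) -> float:
    # Fuzzy title similarity carries most of the weight; an exact
    # year match adds a fixed bonus, and the total is capped at 1.0.
    title_sim = SequenceMatcher(
        None, query_title.lower(), cand_title.lower()
    ).ratio()
    year_bonus = 0.2 if query_year and query_year == cand_year else 0.0
    return min(1.0, 0.8 * title_sim + year_bonus)


good = score_candidate("Graph-first bibliography augmentation", "2024",
                       "Graph-first bibliography augmentation", "2024")
weak = score_candidate("Graph-first bibliography augmentation", "2024",
                       "Graph search methods", "2023")
print(good > weak)  # True
```

Under a scheme like this, the `goodmatch` candidate outranks `weaker`, matching the ordering the test asserts; thresholds on the score would then map to the `exact` / `high_confidence` / `ambiguous` / `not_found` statuses the suite exercises.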