Updates based on batch usage.

welsberr 2026-04-21 09:19:11 -04:00
parent 4894341ba8
commit 39fe5ea86c
8 changed files with 183 additions and 15 deletions

.gitignore vendored

@@ -8,3 +8,7 @@ library.sqlite3
 ops/
 .codex
 SESSION_*
+topic-*
+talkorigins-out/
+talkorigins.sqlite3
+to_batch.sh*


@@ -202,6 +202,12 @@ Broad BibTeX exports skip DOI-only placeholder records such as `Referenced work
 Long-running CLI commands report progress on `stderr` so `stdout` remains clean for JSON, BibTeX, or tabular output.
+For long-running commands that emit structured output on `stdout`, prefer `tee` with a descriptive filename so you keep a reviewable artifact without losing live terminal feedback. For example:
+```bash
+PYTHONPATH=src .venv/bin/python -m citegeist --db talkorigins.sqlite3 bootstrap-batch talkorigins-out/talkorigins_jobs.json | tee talkorigins-bootstrap-results.json
+```
 BibTeX parse/render round-trips normalize simple escaped special characters such as `\_`, `\&`, and `\%` back to plain field values internally, then re-escape them on export. This prevents repeated commands such as `resolve` from turning a valid field like `discovered\_from = {...}` into `discovered\\_from = {...}` after rewriting an entry.
 Crossref reference expansion is intentionally conservative about weak discoveries. If a cited reference has no DOI and Crossref only exposes it as an unstructured citation blob, `expand --source crossref`, `expand-topic --source crossref`, and bootstrap flows now skip materializing that record unless the fallback metadata looks like a cleaner non-`misc` work such as conference proceedings. When Crossref does expose thesis or dissertation references only as unstructured text, citegeist now also tries to extract the actual work title instead of keeping the entire ProQuest-style citation blob in the `title` field. This reduces junk `@misc` entries and avoids `@phdthesis` fallbacks whose `title` field is really a pasted citation string.
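The escape round-trip described above can be sketched in a few lines. This is an illustrative approximation only; `unescape_specials` and `escape_specials` are hypothetical names, not citegeist functions:

```python
import re

def unescape_specials(value: str) -> str:
    # Collapse one level of BibTeX escaping (\_ \& \%) to plain characters.
    return re.sub(r"\\([_&%])", r"\1", value)

def escape_specials(value: str) -> str:
    # Re-escape plain special characters on export.
    return re.sub(r"([_&%])", r"\\\1", value)

field = r"discovered\_from"
plain = unescape_specials(field)    # "discovered_from"
rendered = escape_specials(plain)   # "discovered\_from"
# A second parse/render cycle is stable: no drift toward "discovered\\_from".
assert escape_specials(unescape_specials(rendered)) == rendered
```

Because parsing always strips exactly the escaping that rendering adds, repeated rewrites are idempotent instead of doubling backslashes.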
@@ -357,8 +363,9 @@ and the example-scoped CLI commands:
 ```bash
 PYTHONPATH=src .venv/bin/python -m citegeist example-talkorigins-scrape talkorigins-out --limit-topics 5 --limit-entries-per-topic 20
+PYTHONPATH=src .venv/bin/python -m citegeist --db talkorigins.sqlite3 bootstrap-batch talkorigins-out/talkorigins_jobs.json | tee talkorigins-bootstrap-results.json
 PYTHONPATH=src .venv/bin/python -m citegeist example-talkorigins-validate talkorigins-out/talkorigins_manifest.json
-PYTHONPATH=src .venv/bin/python -m citegeist example-talkorigins-duplicates talkorigins-out/talkorigins_manifest.json --limit 20 --preview --weak-only
+PYTHONPATH=src .venv/bin/python -m citegeist example-talkorigins-duplicates talkorigins-out/talkorigins_manifest.json --limit 20 --preview --weak-only | tee talkorigins-duplicates-preview.json
 ```
 The older `scrape-talkorigins`-style command names remain available as compatibility aliases. The full example workflow and reconstruction notes live in [examples/talkorigins/README.md](./examples/talkorigins/README.md).


@@ -518,6 +518,12 @@ Run a JSON batch file:
 .venv/bin/python -m citegeist --db library.sqlite3 bootstrap-batch artificial-life.json
 ```
+Keep the JSON result payload while watching live progress:
+```bash
+.venv/bin/python -m citegeist --db library.sqlite3 bootstrap-batch artificial-life.json | tee artificial-life-bootstrap-results.json
+```
 ### Topic Phrase Review Workflow
 Apply topic phrases directly:
@@ -676,6 +682,12 @@ Control generated bootstrap defaults:
 .venv/bin/python -m citegeist example-talkorigins-scrape talkorigins-out --topic-limit 10 --topic-commit-limit 5 --status draft
 ```
+Run the generated bootstrap batch and save the per-job JSON results:
+```bash
+PYTHONPATH=src .venv/bin/python -m citegeist --db talkorigins.sqlite3 bootstrap-batch talkorigins-out/talkorigins_jobs.json | tee talkorigins-bootstrap-results.json
+```
 Validate the generated manifest:
 ```bash
@@ -694,6 +706,12 @@ Inspect duplicate clusters:
 .venv/bin/python -m citegeist example-talkorigins-duplicates talkorigins-out/talkorigins_manifest.json --limit 20 --min-count 2 --match origin --topic abiogenesis --preview --weak-only
 ```
+Keep the duplicate-cluster report in a named file while reviewing the terminal output:
+```bash
+.venv/bin/python -m citegeist example-talkorigins-duplicates talkorigins-out/talkorigins_manifest.json --limit 20 --min-count 2 --match origin --topic abiogenesis --preview --weak-only | tee talkorigins-duplicates-preview.json
+```
 Ingest the reconstructed corpus:
 ```bash


@@ -24,8 +24,9 @@ The preferred CLI commands are example-scoped:
 ```bash
 PYTHONPATH=src .venv/bin/python -m citegeist example-talkorigins-scrape talkorigins-out --limit-topics 5 --limit-entries-per-topic 20
+PYTHONPATH=src .venv/bin/python -m citegeist --db talkorigins.sqlite3 bootstrap-batch talkorigins-out/talkorigins_jobs.json | tee talkorigins-bootstrap-results.json
 PYTHONPATH=src .venv/bin/python -m citegeist example-talkorigins-validate talkorigins-out/talkorigins_manifest.json
-PYTHONPATH=src .venv/bin/python -m citegeist example-talkorigins-duplicates talkorigins-out/talkorigins_manifest.json --limit 20 --preview --weak-only
+PYTHONPATH=src .venv/bin/python -m citegeist example-talkorigins-duplicates talkorigins-out/talkorigins_manifest.json --limit 20 --preview --weak-only | tee talkorigins-duplicates-preview.json
 PYTHONPATH=src .venv/bin/python -m citegeist example-talkorigins-suggest-phrases talkorigins-out/talkorigins_manifest.json --output topic-phrases.json
 PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 stage-topic-phrases topic-phrases.json
 PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 export-topic-phrase-reviews --output topic-phrase-review.json
@@ -48,5 +49,6 @@ The example scrape writes:
 ## Notes
 
+- Long-running commands print progress to `stderr`, so JSON or tabular results on `stdout` can be safely captured with `tee` into named files such as `talkorigins-bootstrap-results.json`.
 - The example-specific CLI names have compatibility aliases matching the older `scrape-talkorigins` style commands.
 - Topic phrase staging, review, and export commands are generic `citegeist` functionality and are not specific to TalkOrigins.


@@ -1,6 +1,8 @@
 from __future__ import annotations
 
 import html
+import http.client
+import os
 import re
 import urllib.error
 import urllib.parse
@@ -23,9 +25,15 @@ class MetadataResolver:
         self,
         user_agent: str = "citegeist/0.1 (local research tool)",
         source_client: SourceClient | None = None,
+        ncbi_api_key: str | None = None,
+        ncbi_tool: str | None = None,
+        ncbi_email: str | None = None,
     ) -> None:
         self.user_agent = user_agent
         self.source_client = source_client or SourceClient(user_agent=user_agent)
+        self.ncbi_api_key = ncbi_api_key if ncbi_api_key is not None else os.environ.get("NCBI_API_KEY", "")
+        self.ncbi_tool = ncbi_tool if ncbi_tool is not None else os.environ.get("NCBI_TOOL", "citegeist")
+        self.ncbi_email = ncbi_email if ncbi_email is not None else os.environ.get("NCBI_EMAIL", "")
 
     def resolve_entry(self, entry: BibEntry) -> Resolution | None:
         if doi := entry.fields.get("doi"):
@@ -182,7 +190,9 @@ class MetadataResolver:
         normalized_pmid = _normalize_pmid(pmid)
         if not normalized_pmid:
             return None
-        query = urllib.parse.urlencode({"db": "pubmed", "id": normalized_pmid, "retmode": "xml"})
+        query = urllib.parse.urlencode(
+            self._ncbi_params({"db": "pubmed", "id": normalized_pmid, "retmode": "xml"})
+        )
         root = self._safe_get_xml(f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?{query}")
         if root is None:
             return None
@@ -261,12 +271,12 @@ class MetadataResolver:
         if not query_text:
             return []
         query = urllib.parse.urlencode(
-            {
+            self._ncbi_params({
                 "db": "pubmed",
                 "retmode": "json",
                 "retmax": max(1, limit),
                 "term": query_text,
-            }
+            })
         )
         payload = self._safe_get_json(f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?{query}")
         if payload is None:
@@ -283,19 +293,38 @@ class MetadataResolver:
     def _safe_get_json(self, url: str) -> dict | None:
         try:
             return self.source_client.get_json(url)
-        except (urllib.error.HTTPError, urllib.error.URLError, TimeoutError, ValueError):
+        except (
+            http.client.RemoteDisconnected,
+            urllib.error.HTTPError,
+            urllib.error.URLError,
+            TimeoutError,
+            ValueError,
+        ):
             return None
 
     def _safe_get_text(self, url: str) -> str | None:
         try:
             return self.source_client.get_text(url)
-        except (urllib.error.HTTPError, urllib.error.URLError, TimeoutError, ValueError):
+        except (
+            http.client.RemoteDisconnected,
+            urllib.error.HTTPError,
+            urllib.error.URLError,
+            TimeoutError,
+            ValueError,
+        ):
             return None
 
     def _safe_get_xml(self, url: str) -> ET.Element | None:
         try:
             return self.source_client.get_xml(url)
-        except (urllib.error.HTTPError, urllib.error.URLError, TimeoutError, ET.ParseError, ValueError):
+        except (
+            http.client.RemoteDisconnected,
+            urllib.error.HTTPError,
+            urllib.error.URLError,
+            TimeoutError,
+            ET.ParseError,
+            ValueError,
+        ):
             return None
 
     def search_openalex_best_match(
@@ -344,13 +373,13 @@ class MetadataResolver:
             return []
 
         id_param = ",".join(ordered_pmids)
-        summary_query = urllib.parse.urlencode({"db": "pubmed", "retmode": "json", "id": id_param})
+        summary_query = urllib.parse.urlencode(self._ncbi_params({"db": "pubmed", "retmode": "json", "id": id_param}))
         summaries_payload = self._safe_get_json(
             f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?{summary_query}"
         ) or {}
         summaries = summaries_payload.get("result", {})
-        fetch_query = urllib.parse.urlencode({"db": "pubmed", "id": id_param, "retmode": "xml"})
+        fetch_query = urllib.parse.urlencode(self._ncbi_params({"db": "pubmed", "id": id_param, "retmode": "xml"}))
         root = self._safe_get_xml(f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?{fetch_query}")
         articles = _pubmed_articles_by_pmid(root)
@@ -363,6 +392,16 @@ class MetadataResolver:
             entries.append(_pubmed_record_to_entry(summary or {}, article, fallback_pmid=pmid))
         return entries
 
+    def _ncbi_params(self, params: dict[str, object]) -> dict[str, object]:
+        enriched = dict(params)
+        if self.ncbi_api_key:
+            enriched["api_key"] = self.ncbi_api_key
+        if self.ncbi_tool:
+            enriched["tool"] = self.ncbi_tool
+        if self.ncbi_email:
+            enriched["email"] = self.ncbi_email
+        return enriched
+
 
 def merge_entries(base: BibEntry, resolved: BibEntry) -> BibEntry:
     merged, _ = merge_entries_with_conflicts(base, resolved)
     return merged
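The `_ncbi_params` enrichment can be exercised in isolation to see what a resulting E-utilities query looks like. In this sketch the free-standing `ncbi_params` function mirrors the method above, and the key and email values are placeholders:

```python
import urllib.parse

def ncbi_params(params: dict, api_key: str = "", tool: str = "citegeist", email: str = "") -> dict:
    # Mirror MetadataResolver._ncbi_params: only non-empty credentials are appended.
    enriched = dict(params)
    if api_key:
        enriched["api_key"] = api_key
    if tool:
        enriched["tool"] = tool
    if email:
        enriched["email"] = email
    return enriched

query = urllib.parse.urlencode(
    ncbi_params({"db": "pubmed", "id": "12345", "retmode": "xml"},
                api_key="key123", email="dev@example.com")
)
print(query)
# db=pubmed&id=12345&retmode=xml&api_key=key123&tool=citegeist&email=dev%40example.com
```

Because empty strings are skipped, callers without an `NCBI_API_KEY` get the same URLs as before, just with the default `tool` identifier added.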


@@ -1,7 +1,9 @@
 from __future__ import annotations
 
+import http.client
 import hashlib
 import json
+import time
 import urllib.error
 import urllib.request
 import xml.etree.ElementTree as ET
@@ -14,10 +16,14 @@ class SourceClient:
         user_agent: str = "citegeist/0.1 (local research tool)",
         cache_dir: str | Path | None = None,
         fixtures_dir: str | Path | None = None,
+        max_retries: int = 2,
+        retry_backoff_seconds: float = 1.0,
     ) -> None:
         self.user_agent = user_agent
         self.cache_dir = Path(cache_dir) if cache_dir else None
         self.fixtures_dir = Path(fixtures_dir) if fixtures_dir else None
+        self.max_retries = max(0, max_retries)
+        self.retry_backoff_seconds = max(0.0, retry_backoff_seconds)
 
     def get_json(self, url: str) -> dict:
         cached = self._read_cached(url, "json")
@@ -49,24 +55,58 @@ class SourceClient:
     def try_get_json(self, url: str) -> dict | None:
         try:
             return self.get_json(url)
-        except (urllib.error.HTTPError, urllib.error.URLError, TimeoutError, ValueError):
+        except (
+            http.client.RemoteDisconnected,
+            urllib.error.HTTPError,
+            urllib.error.URLError,
+            TimeoutError,
+            ValueError,
+        ):
             return None
 
     def try_get_text(self, url: str) -> str | None:
         try:
             return self.get_text(url)
-        except (urllib.error.HTTPError, urllib.error.URLError, TimeoutError, ValueError):
+        except (
+            http.client.RemoteDisconnected,
+            urllib.error.HTTPError,
+            urllib.error.URLError,
+            TimeoutError,
+            ValueError,
+        ):
             return None
 
     def try_get_xml(self, url: str) -> ET.Element | None:
         try:
             return self.get_xml(url)
-        except (urllib.error.HTTPError, urllib.error.URLError, TimeoutError, ET.ParseError, ValueError):
+        except (
+            http.client.RemoteDisconnected,
+            urllib.error.HTTPError,
+            urllib.error.URLError,
+            TimeoutError,
+            ET.ParseError,
+            ValueError,
+        ):
             return None
 
     def _fetch_bytes(self, url: str) -> bytes:
-        with urllib.request.urlopen(self._request(url)) as response:
-            return response.read()
+        for attempt in range(self.max_retries + 1):
+            try:
+                with urllib.request.urlopen(self._request(url)) as response:
+                    return response.read()
+            except http.client.RemoteDisconnected:
+                if attempt >= self.max_retries:
+                    raise
+                self._sleep_before_retry(attempt)
+            except urllib.error.HTTPError as exc:
+                if exc.code not in {429, 500, 502, 503, 504} or attempt >= self.max_retries:
+                    raise
+                self._sleep_before_retry(attempt)
+            except urllib.error.URLError:
+                if attempt >= self.max_retries:
+                    raise
+                self._sleep_before_retry(attempt)
+        raise RuntimeError("unreachable")
 
     def _request(self, url: str) -> urllib.request.Request:
         return urllib.request.Request(
@@ -96,6 +136,9 @@ class SourceClient:
         path = self.cache_dir / self._cache_key(url, suffix)
         path.write_bytes(payload)
 
+    def _sleep_before_retry(self, attempt: int) -> None:
+        time.sleep(self.retry_backoff_seconds * (2**attempt))
+
     def _decode_text(self, payload: bytes) -> str:
         for encoding in ("utf-8", "utf-8-sig", "iso-8859-1", "latin-1"):
             try:
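The retry loop sleeps with exponential backoff between attempts via `_sleep_before_retry`. A standalone sketch of the resulting delay schedule (`backoff_schedule` is an illustrative helper, not part of the codebase):

```python
def backoff_schedule(retry_backoff_seconds: float, max_retries: int) -> list[float]:
    # Delay before retry N is base * 2**N, matching _sleep_before_retry.
    return [retry_backoff_seconds * (2**attempt) for attempt in range(max_retries)]

# With the SourceClient defaults (retry_backoff_seconds=1.0, max_retries=2),
# a request that keeps failing sleeps 1.0s, then 2.0s, before the final attempt.
print(backoff_schedule(1.0, 2))  # [1.0, 2.0]
```

Doubling the wait each attempt keeps transient disconnects cheap to retry while backing off quickly if the remote service is genuinely struggling.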


@@ -285,6 +285,24 @@ def test_resolver_tries_pmid_before_dblp():
     ]
 
 
+def test_resolver_pubmed_requests_include_ncbi_params():
+    resolver = MetadataResolver(ncbi_api_key="key123", ncbi_tool="citegeist", ncbi_email="dev@example.com")
+    requested_urls: list[str] = []
+
+    def fake_get_json(url: str):
+        requested_urls.append(url)
+        return {"esearchresult": {"idlist": []}}
+
+    resolver.source_client.get_json = fake_get_json  # type: ignore[method-assign]
+
+    resolver.search_pubmed("abiogenesis", limit=2)
+
+    assert requested_urls
+    assert "api_key=key123" in requested_urls[0]
+    assert "tool=citegeist" in requested_urls[0]
+    assert "email=dev%40example.com" in requested_urls[0]
+
+
 def test_openalex_work_to_entry_maps_basic_fields():
     entry = _openalex_work_to_entry(
         {


@@ -1,3 +1,4 @@
+import http.client
 from pathlib import Path
 
 import urllib.error
@@ -51,3 +52,39 @@ def test_source_client_try_get_json_returns_none_on_http_error(tmp_path: Path):
     client._fetch_bytes = raise_404  # type: ignore[method-assign]
 
     assert client.try_get_json("https://example.org/missing") is None
+
+
+def test_source_client_retries_remote_disconnects(tmp_path: Path):
+    client = SourceClient(cache_dir=tmp_path / "cache", max_retries=2, retry_backoff_seconds=0.0)
+    attempts = {"count": 0}
+
+    def flaky_fetch(_url: str) -> bytes:
+        attempts["count"] += 1
+        if attempts["count"] < 3:
+            raise http.client.RemoteDisconnected("closed")
+        return b'{"ok": true}'
+
+    client._fetch_bytes = SourceClient._fetch_bytes.__get__(client, SourceClient)  # type: ignore[method-assign]
+    client._request = lambda url: url  # type: ignore[method-assign]
+
+    class FakeResponse:
+        def __enter__(self):
+            return self
+
+        def __exit__(self, exc_type, exc, tb):
+            return False
+
+        def read(self) -> bytes:
+            return flaky_fetch("https://example.org/test")
+
+    import urllib.request
+
+    original_urlopen = urllib.request.urlopen
+    urllib.request.urlopen = lambda _request: FakeResponse()  # type: ignore[assignment]
+    try:
+        payload = client.get_json("https://example.org/test")
+    finally:
+        urllib.request.urlopen = original_urlopen  # type: ignore[assignment]
+
+    assert payload["ok"] is True
+    assert attempts["count"] == 3