Updates based on batch usage.

welsberr 2026-04-21 09:19:11 -04:00
parent 4894341ba8
commit 39fe5ea86c
8 changed files with 183 additions and 15 deletions

.gitignore vendored

@@ -8,3 +8,7 @@ library.sqlite3
 ops/
 .codex
 SESSION_*
+topic-*
+talkorigins-out/
+talkorigins.sqlite3
+to_batch.sh*


@@ -202,6 +202,12 @@ Broad BibTeX exports skip DOI-only placeholder records such as `Referenced work
 Long-running CLI commands report progress on `stderr` so `stdout` remains clean for JSON, BibTeX, or tabular output.
+For long-running commands that emit structured output on `stdout`, prefer `tee` with a descriptive filename so you keep a reviewable artifact without losing live terminal feedback. For example:
+```bash
+PYTHONPATH=src .venv/bin/python -m citegeist --db talkorigins.sqlite3 bootstrap-batch talkorigins-out/talkorigins_jobs.json | tee talkorigins-bootstrap-results.json
+```
 BibTeX parse/render round-trips normalize simple escaped special characters such as `\_`, `\&`, and `\%` back to plain field values internally, then re-escape them on export. This prevents repeated commands such as `resolve` from turning a valid field like `discovered\_from = {...}` into `discovered\\_from = {...}` after rewriting an entry.
 Crossref reference expansion is intentionally conservative about weak discoveries. If a cited reference has no DOI and Crossref only exposes it as an unstructured citation blob, `expand --source crossref`, `expand-topic --source crossref`, and bootstrap flows now skip materializing that record unless the fallback metadata looks like a cleaner non-`misc` work such as conference proceedings. When Crossref does expose thesis or dissertation references only as unstructured text, citegeist now also tries to extract the actual work title instead of keeping the entire ProQuest-style citation blob in the `title` field. This reduces junk `@misc` entries and avoids `@phdthesis` fallbacks whose `title` field is really a pasted citation string.
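The escape round-trip described above can be sketched in a few lines. This is an illustrative approximation only; `unescape_specials` and `escape_specials` are hypothetical names, not citegeist functions:

```python
import re

def unescape_specials(value: str) -> str:
    # Collapse one level of BibTeX escaping (\_ \& \%) to plain characters.
    return re.sub(r"\\([_&%])", r"\1", value)

def escape_specials(value: str) -> str:
    # Re-escape plain special characters on export.
    return re.sub(r"([_&%])", r"\\\1", value)

field = r"discovered\_from"
plain = unescape_specials(field)    # "discovered_from"
rendered = escape_specials(plain)   # "discovered\_from"
# A second parse/render cycle is stable: no drift toward "discovered\\_from".
assert escape_specials(unescape_specials(rendered)) == rendered
```

Because parsing always strips exactly the escaping that rendering adds, repeated rewrites are idempotent instead of doubling backslashes.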
@@ -357,8 +363,9 @@ and the example-scoped CLI commands:
 ```bash
 PYTHONPATH=src .venv/bin/python -m citegeist example-talkorigins-scrape talkorigins-out --limit-topics 5 --limit-entries-per-topic 20
+PYTHONPATH=src .venv/bin/python -m citegeist --db talkorigins.sqlite3 bootstrap-batch talkorigins-out/talkorigins_jobs.json | tee talkorigins-bootstrap-results.json
 PYTHONPATH=src .venv/bin/python -m citegeist example-talkorigins-validate talkorigins-out/talkorigins_manifest.json
-PYTHONPATH=src .venv/bin/python -m citegeist example-talkorigins-duplicates talkorigins-out/talkorigins_manifest.json --limit 20 --preview --weak-only
+PYTHONPATH=src .venv/bin/python -m citegeist example-talkorigins-duplicates talkorigins-out/talkorigins_manifest.json --limit 20 --preview --weak-only | tee talkorigins-duplicates-preview.json
 ```
 The older `scrape-talkorigins`-style command names remain available as compatibility aliases. The full example workflow and reconstruction notes live in [examples/talkorigins/README.md](./examples/talkorigins/README.md).


@@ -518,6 +518,12 @@ Run a JSON batch file:
 .venv/bin/python -m citegeist --db library.sqlite3 bootstrap-batch artificial-life.json
 ```
+Keep the JSON result payload while watching live progress:
+```bash
+.venv/bin/python -m citegeist --db library.sqlite3 bootstrap-batch artificial-life.json | tee artificial-life-bootstrap-results.json
+```
 ### Topic Phrase Review Workflow
 Apply topic phrases directly:
@@ -676,6 +682,12 @@ Control generated bootstrap defaults:
 .venv/bin/python -m citegeist example-talkorigins-scrape talkorigins-out --topic-limit 10 --topic-commit-limit 5 --status draft
 ```
+Run the generated bootstrap batch and save the per-job JSON results:
+```bash
+PYTHONPATH=src .venv/bin/python -m citegeist --db talkorigins.sqlite3 bootstrap-batch talkorigins-out/talkorigins_jobs.json | tee talkorigins-bootstrap-results.json
+```
 Validate the generated manifest:
 ```bash
@@ -694,6 +706,12 @@ Inspect duplicate clusters:
 .venv/bin/python -m citegeist example-talkorigins-duplicates talkorigins-out/talkorigins_manifest.json --limit 20 --min-count 2 --match origin --topic abiogenesis --preview --weak-only
 ```
+Keep the duplicate-cluster report in a named file while reviewing the terminal output:
+```bash
+.venv/bin/python -m citegeist example-talkorigins-duplicates talkorigins-out/talkorigins_manifest.json --limit 20 --min-count 2 --match origin --topic abiogenesis --preview --weak-only | tee talkorigins-duplicates-preview.json
+```
 Ingest the reconstructed corpus:
 ```bash


@@ -24,8 +24,9 @@ The preferred CLI commands are example-scoped:
 ```bash
 PYTHONPATH=src .venv/bin/python -m citegeist example-talkorigins-scrape talkorigins-out --limit-topics 5 --limit-entries-per-topic 20
+PYTHONPATH=src .venv/bin/python -m citegeist --db talkorigins.sqlite3 bootstrap-batch talkorigins-out/talkorigins_jobs.json | tee talkorigins-bootstrap-results.json
 PYTHONPATH=src .venv/bin/python -m citegeist example-talkorigins-validate talkorigins-out/talkorigins_manifest.json
-PYTHONPATH=src .venv/bin/python -m citegeist example-talkorigins-duplicates talkorigins-out/talkorigins_manifest.json --limit 20 --preview --weak-only
+PYTHONPATH=src .venv/bin/python -m citegeist example-talkorigins-duplicates talkorigins-out/talkorigins_manifest.json --limit 20 --preview --weak-only | tee talkorigins-duplicates-preview.json
 PYTHONPATH=src .venv/bin/python -m citegeist example-talkorigins-suggest-phrases talkorigins-out/talkorigins_manifest.json --output topic-phrases.json
 PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 stage-topic-phrases topic-phrases.json
 PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 export-topic-phrase-reviews --output topic-phrase-review.json
@@ -48,5 +49,6 @@ The example scrape writes:
 ## Notes
 
+- Long-running commands print progress to `stderr`, so JSON or tabular results on `stdout` can be safely captured with `tee` into named files such as `talkorigins-bootstrap-results.json`.
 - The example-specific CLI names have compatibility aliases matching the older `scrape-talkorigins` style commands.
 - Topic phrase staging, review, and export commands are generic `citegeist` functionality and are not specific to TalkOrigins.


@@ -1,6 +1,8 @@
 from __future__ import annotations
 
 import html
+import http.client
+import os
 import re
 import urllib.error
 import urllib.parse
@@ -23,9 +25,15 @@ class MetadataResolver:
         self,
         user_agent: str = "citegeist/0.1 (local research tool)",
         source_client: SourceClient | None = None,
+        ncbi_api_key: str | None = None,
+        ncbi_tool: str | None = None,
+        ncbi_email: str | None = None,
     ) -> None:
         self.user_agent = user_agent
         self.source_client = source_client or SourceClient(user_agent=user_agent)
+        self.ncbi_api_key = ncbi_api_key if ncbi_api_key is not None else os.environ.get("NCBI_API_KEY", "")
+        self.ncbi_tool = ncbi_tool if ncbi_tool is not None else os.environ.get("NCBI_TOOL", "citegeist")
+        self.ncbi_email = ncbi_email if ncbi_email is not None else os.environ.get("NCBI_EMAIL", "")
 
     def resolve_entry(self, entry: BibEntry) -> Resolution | None:
         if doi := entry.fields.get("doi"):
@@ -182,7 +190,9 @@ class MetadataResolver:
         normalized_pmid = _normalize_pmid(pmid)
         if not normalized_pmid:
             return None
-        query = urllib.parse.urlencode({"db": "pubmed", "id": normalized_pmid, "retmode": "xml"})
+        query = urllib.parse.urlencode(
+            self._ncbi_params({"db": "pubmed", "id": normalized_pmid, "retmode": "xml"})
+        )
         root = self._safe_get_xml(f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?{query}")
         if root is None:
             return None
@@ -261,12 +271,12 @@ class MetadataResolver:
         if not query_text:
             return []
         query = urllib.parse.urlencode(
-            {
+            self._ncbi_params({
                 "db": "pubmed",
                 "retmode": "json",
                 "retmax": max(1, limit),
                 "term": query_text,
-            }
+            })
         )
         payload = self._safe_get_json(f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?{query}")
         if payload is None:
@@ -283,19 +293,38 @@ class MetadataResolver:
     def _safe_get_json(self, url: str) -> dict | None:
         try:
             return self.source_client.get_json(url)
-        except (urllib.error.HTTPError, urllib.error.URLError, TimeoutError, ValueError):
+        except (
+            http.client.RemoteDisconnected,
+            urllib.error.HTTPError,
+            urllib.error.URLError,
+            TimeoutError,
+            ValueError,
+        ):
             return None
 
     def _safe_get_text(self, url: str) -> str | None:
         try:
             return self.source_client.get_text(url)
-        except (urllib.error.HTTPError, urllib.error.URLError, TimeoutError, ValueError):
+        except (
+            http.client.RemoteDisconnected,
+            urllib.error.HTTPError,
+            urllib.error.URLError,
+            TimeoutError,
+            ValueError,
+        ):
             return None
 
     def _safe_get_xml(self, url: str) -> ET.Element | None:
         try:
             return self.source_client.get_xml(url)
-        except (urllib.error.HTTPError, urllib.error.URLError, TimeoutError, ET.ParseError, ValueError):
+        except (
+            http.client.RemoteDisconnected,
+            urllib.error.HTTPError,
+            urllib.error.URLError,
+            TimeoutError,
+            ET.ParseError,
+            ValueError,
+        ):
             return None
 
     def search_openalex_best_match(
@@ -344,13 +373,13 @@ class MetadataResolver:
             return []
 
         id_param = ",".join(ordered_pmids)
-        summary_query = urllib.parse.urlencode({"db": "pubmed", "retmode": "json", "id": id_param})
+        summary_query = urllib.parse.urlencode(self._ncbi_params({"db": "pubmed", "retmode": "json", "id": id_param}))
         summaries_payload = self._safe_get_json(
             f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?{summary_query}"
         ) or {}
         summaries = summaries_payload.get("result", {})
-        fetch_query = urllib.parse.urlencode({"db": "pubmed", "id": id_param, "retmode": "xml"})
+        fetch_query = urllib.parse.urlencode(self._ncbi_params({"db": "pubmed", "id": id_param, "retmode": "xml"}))
         root = self._safe_get_xml(f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?{fetch_query}")
         articles = _pubmed_articles_by_pmid(root)
@@ -363,6 +392,16 @@ class MetadataResolver:
             entries.append(_pubmed_record_to_entry(summary or {}, article, fallback_pmid=pmid))
         return entries
 
+    def _ncbi_params(self, params: dict[str, object]) -> dict[str, object]:
+        enriched = dict(params)
+        if self.ncbi_api_key:
+            enriched["api_key"] = self.ncbi_api_key
+        if self.ncbi_tool:
+            enriched["tool"] = self.ncbi_tool
+        if self.ncbi_email:
+            enriched["email"] = self.ncbi_email
+        return enriched
+
 
 def merge_entries(base: BibEntry, resolved: BibEntry) -> BibEntry:
     merged, _ = merge_entries_with_conflicts(base, resolved)
     return merged
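The `_ncbi_params` enrichment can be exercised in isolation to see what a resulting E-utilities query looks like. In this sketch the free-standing `ncbi_params` function mirrors the method above, and the key and email values are placeholders:

```python
import urllib.parse

def ncbi_params(params: dict, api_key: str = "", tool: str = "citegeist", email: str = "") -> dict:
    # Mirror MetadataResolver._ncbi_params: only non-empty credentials are appended.
    enriched = dict(params)
    if api_key:
        enriched["api_key"] = api_key
    if tool:
        enriched["tool"] = tool
    if email:
        enriched["email"] = email
    return enriched

query = urllib.parse.urlencode(
    ncbi_params({"db": "pubmed", "id": "12345", "retmode": "xml"},
                api_key="key123", email="dev@example.com")
)
print(query)
# db=pubmed&id=12345&retmode=xml&api_key=key123&tool=citegeist&email=dev%40example.com
```

Because empty strings are skipped, callers without an `NCBI_API_KEY` get the same URLs as before, just with the default `tool` identifier added.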


@@ -1,7 +1,9 @@
 from __future__ import annotations
 
+import http.client
 import hashlib
 import json
+import time
 import urllib.error
 import urllib.request
 import xml.etree.ElementTree as ET
@@ -14,10 +16,14 @@ class SourceClient:
         user_agent: str = "citegeist/0.1 (local research tool)",
         cache_dir: str | Path | None = None,
         fixtures_dir: str | Path | None = None,
+        max_retries: int = 2,
+        retry_backoff_seconds: float = 1.0,
     ) -> None:
         self.user_agent = user_agent
         self.cache_dir = Path(cache_dir) if cache_dir else None
         self.fixtures_dir = Path(fixtures_dir) if fixtures_dir else None
+        self.max_retries = max(0, max_retries)
+        self.retry_backoff_seconds = max(0.0, retry_backoff_seconds)
 
     def get_json(self, url: str) -> dict:
         cached = self._read_cached(url, "json")
@@ -49,24 +55,58 @@ class SourceClient:
     def try_get_json(self, url: str) -> dict | None:
         try:
             return self.get_json(url)
-        except (urllib.error.HTTPError, urllib.error.URLError, TimeoutError, ValueError):
+        except (
+            http.client.RemoteDisconnected,
+            urllib.error.HTTPError,
+            urllib.error.URLError,
+            TimeoutError,
+            ValueError,
+        ):
             return None
 
     def try_get_text(self, url: str) -> str | None:
         try:
             return self.get_text(url)
-        except (urllib.error.HTTPError, urllib.error.URLError, TimeoutError, ValueError):
+        except (
+            http.client.RemoteDisconnected,
+            urllib.error.HTTPError,
+            urllib.error.URLError,
+            TimeoutError,
+            ValueError,
+        ):
             return None
 
     def try_get_xml(self, url: str) -> ET.Element | None:
         try:
             return self.get_xml(url)
-        except (urllib.error.HTTPError, urllib.error.URLError, TimeoutError, ET.ParseError, ValueError):
+        except (
+            http.client.RemoteDisconnected,
+            urllib.error.HTTPError,
+            urllib.error.URLError,
+            TimeoutError,
+            ET.ParseError,
+            ValueError,
+        ):
             return None
 
     def _fetch_bytes(self, url: str) -> bytes:
-        with urllib.request.urlopen(self._request(url)) as response:
-            return response.read()
+        for attempt in range(self.max_retries + 1):
+            try:
+                with urllib.request.urlopen(self._request(url)) as response:
+                    return response.read()
+            except http.client.RemoteDisconnected:
+                if attempt >= self.max_retries:
+                    raise
+                self._sleep_before_retry(attempt)
+            except urllib.error.HTTPError as exc:
+                if exc.code not in {429, 500, 502, 503, 504} or attempt >= self.max_retries:
+                    raise
+                self._sleep_before_retry(attempt)
+            except urllib.error.URLError:
+                if attempt >= self.max_retries:
+                    raise
+                self._sleep_before_retry(attempt)
+        raise RuntimeError("unreachable")
 
     def _request(self, url: str) -> urllib.request.Request:
         return urllib.request.Request(
@@ -96,6 +136,9 @@ class SourceClient:
         path = self.cache_dir / self._cache_key(url, suffix)
         path.write_bytes(payload)
 
+    def _sleep_before_retry(self, attempt: int) -> None:
+        time.sleep(self.retry_backoff_seconds * (2**attempt))
+
     def _decode_text(self, payload: bytes) -> str:
         for encoding in ("utf-8", "utf-8-sig", "iso-8859-1", "latin-1"):
             try:
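The retry loop sleeps with exponential backoff between attempts via `_sleep_before_retry`. A standalone sketch of the resulting delay schedule (`backoff_schedule` is an illustrative helper, not part of the codebase):

```python
def backoff_schedule(retry_backoff_seconds: float, max_retries: int) -> list[float]:
    # Delay before retry N is base * 2**N, matching _sleep_before_retry.
    return [retry_backoff_seconds * (2**attempt) for attempt in range(max_retries)]

# With the SourceClient defaults (retry_backoff_seconds=1.0, max_retries=2),
# a request that keeps failing sleeps 1.0s, then 2.0s, before the final attempt.
print(backoff_schedule(1.0, 2))  # [1.0, 2.0]
```

Doubling the wait each attempt keeps transient disconnects cheap to retry while backing off quickly if the remote service is genuinely struggling.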


@@ -285,6 +285,24 @@ def test_resolver_tries_pmid_before_dblp():
     ]
 
 
+def test_resolver_pubmed_requests_include_ncbi_params():
+    resolver = MetadataResolver(ncbi_api_key="key123", ncbi_tool="citegeist", ncbi_email="dev@example.com")
+    requested_urls: list[str] = []
+
+    def fake_get_json(url: str):
+        requested_urls.append(url)
+        return {"esearchresult": {"idlist": []}}
+
+    resolver.source_client.get_json = fake_get_json  # type: ignore[method-assign]
+
+    resolver.search_pubmed("abiogenesis", limit=2)
+
+    assert requested_urls
+    assert "api_key=key123" in requested_urls[0]
+    assert "tool=citegeist" in requested_urls[0]
+    assert "email=dev%40example.com" in requested_urls[0]
+
+
 def test_openalex_work_to_entry_maps_basic_fields():
     entry = _openalex_work_to_entry(
         {


@@ -1,3 +1,4 @@
+import http.client
 from pathlib import Path
 
 import urllib.error
@@ -51,3 +52,39 @@ def test_source_client_try_get_json_returns_none_on_http_error(tmp_path: Path):
     client._fetch_bytes = raise_404  # type: ignore[method-assign]
 
     assert client.try_get_json("https://example.org/missing") is None
+
+
+def test_source_client_retries_remote_disconnects(tmp_path: Path):
+    client = SourceClient(cache_dir=tmp_path / "cache", max_retries=2, retry_backoff_seconds=0.0)
+    attempts = {"count": 0}
+
+    def flaky_fetch(_url: str) -> bytes:
+        attempts["count"] += 1
+        if attempts["count"] < 3:
+            raise http.client.RemoteDisconnected("closed")
+        return b'{"ok": true}'
+
+    client._fetch_bytes = SourceClient._fetch_bytes.__get__(client, SourceClient)  # type: ignore[method-assign]
+    client._request = lambda url: url  # type: ignore[method-assign]
+
+    class FakeResponse:
+        def __enter__(self):
+            return self
+
+        def __exit__(self, exc_type, exc, tb):
+            return False
+
+        def read(self) -> bytes:
+            return flaky_fetch("https://example.org/test")
+
+    import urllib.request
+
+    original_urlopen = urllib.request.urlopen
+    urllib.request.urlopen = lambda _request: FakeResponse()  # type: ignore[assignment]
+    try:
+        payload = client.get_json("https://example.org/test")
+    finally:
+        urllib.request.urlopen = original_urlopen  # type: ignore[assignment]
+
+    assert payload["ok"] is True
+    assert attempts["count"] == 3