Compare commits

...

3 Commits

12 changed files with 678 additions and 113 deletions

View File: README.md

@ -56,11 +56,15 @@ The initial repo includes:
- OAI-PMH repository discovery via `Identify`, `ListSets`, and `ListMetadataFormats` to target harvests more precisely;
- bibliography bootstrap workflows that can start from a seed `.bib`, a topic phrase, or both;
- batch bootstrap orchestration from JSON job files containing seed BibTeX paths, topic phrases, or both;
- a TalkOrigins scraper that fixes repeated-author plaintext references, emits per-topic seed BibTeX files, and writes a batch JSON specification;
- normalized tables for entries, creators, identifiers, and citation relations;
- full-text-search-ready indexing over title, abstract, and fulltext when SQLite FTS5 is available;
- tests covering parsing, ingestion, relation storage, and search.
Example applications live alongside the core package rather than defining it. Current examples include:
- a topic-only bootstrap workflow for `artificial life` in [examples/artificial-life/README.md](./examples/artificial-life/README.md);
- the TalkOrigins bibliography pipeline under [`citegeist.examples.talkorigins`](./src/citegeist/examples/talkorigins.py) with a usage guide in [examples/talkorigins/README.md](./examples/talkorigins/README.md).
The prioritized execution plan lives in [ROADMAP.md](./ROADMAP.md).
## Layout
@ -69,6 +73,7 @@ The prioritized execution plan lives in [ROADMAP.md](./ROADMAP.md).
citegeist/
  src/citegeist/
    bibtex.py
    examples/
    storage.py
  tests/
    test_storage.py
@ -125,7 +130,6 @@ PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 resolve-confli
PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 apply-conflict smith2024graphs title
PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 bootstrap --seed-bib seed.bib --topic "bayesian nonparametrics"
PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 bootstrap --topic "bayesian nonparametrics" --preview --topic-commit-limit 5
PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 scrape-talkorigins talkorigins-out --limit-topics 5 --limit-entries-per-topic 20
PYTHONPATH=src .venv/bin/python -m citegeist extract references.txt --output draft.bib
PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 resolve smith2024graphs
PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 topics
@ -143,44 +147,14 @@ PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 export --outpu
For live-source development, prefer fixture-backed or cache-backed source clients so resolver and expansion work can be exercised repeatedly without re-hitting upstream APIs on every run.
For large legacy plaintext corpora such as the TalkOrigins bibliography, prefer a two-step workflow:
1. `scrape-talkorigins` to generate cleaned per-topic `seed_bib` files plus a `talkorigins_jobs.json` batch spec.
2. `bootstrap-batch` on that JSON file when you want to ingest, resolve, and expand from the generated seeds.
The TalkOrigins scrape output now includes:
- `seeds/*.bib` per-topic seed BibTeX files for `bootstrap-batch`
- `plaintext/*.txt` per-topic cleaned GSA-style plaintext with repeated authors expanded
- `site/topics/*.html` reconstructed topic pages with hide/show BibTeX blocks
- `talkorigins_full.txt` and `talkorigins_full.bib` aggregate downloads
- `snapshots/*.json` cached topic payloads so reruns can resume without re-fetching already scraped topics
After a full scrape, run:
```bash
PYTHONPATH=src .venv/bin/python -m citegeist validate-talkorigins talkorigins-out/talkorigins_manifest.json
PYTHONPATH=src .venv/bin/python -m citegeist duplicates-talkorigins talkorigins-out/talkorigins_manifest.json --limit 20
PYTHONPATH=src .venv/bin/python -m citegeist duplicates-talkorigins talkorigins-out/talkorigins_manifest.json --limit 20 --preview --weak-only
PYTHONPATH=src .venv/bin/python -m citegeist suggest-talkorigins-phrases talkorigins-out/talkorigins_manifest.json --output topic-phrases.json
PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 stage-topic-phrases topic-phrases.json
PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 review-topic-phrase abiogenesis accepted --notes "curated from local corpus"
PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 apply-topic-phrases topic-phrases.json
PYTHONPATH=src .venv/bin/python -m citegeist --db talkorigins.sqlite3 enrich-talkorigins talkorigins-out/talkorigins_manifest.json --limit 20
PYTHONPATH=src .venv/bin/python -m citegeist --db talkorigins-copy.sqlite3 enrich-talkorigins talkorigins-out/talkorigins_manifest.json --limit 5 --apply --allow-unsafe-search-matches
PYTHONPATH=src .venv/bin/python -m citegeist --db talkorigins.sqlite3 review-talkorigins talkorigins-out/talkorigins_manifest.json --output talkorigins-review.json
PYTHONPATH=src .venv/bin/python -m citegeist --db talkorigins.sqlite3 apply-talkorigins-corrections talkorigins-out/talkorigins_manifest.json talkorigins-corrections.json
```
The `validate-talkorigins` report summarizes parse coverage and flags suspicious entry-type/venue combinations for manual cleanup.
It also reports duplicate clusters across topic seed files so you can gauge how much deduplication pressure to expect before ingestion.
Use `duplicates-talkorigins` when you want to inspect specific clusters, filter by text, restrict the audit to one topic slug, or preview only weak canonicalization outcomes before importing.
Use `suggest-talkorigins-phrases` to derive candidate stored expansion phrases from the TalkOrigins topic corpus itself. The output is deterministic JSON keyed by topic slug, carrying a suggested phrase plus the extracted keywords that drove it. This is a useful first pass before setting topic phrases in the database or editing generated batch jobs.
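As a quick sanity check before staging, that JSON can be inspected programmatically. A minimal sketch, assuming the export is either a list of suggestion objects or an object keyed by slug, and assuming `suggested_phrase` and `keywords` as field names (the exact keys are not pinned down here):
```python
import json
from pathlib import Path

# Field names below are assumptions based on the description above.
payload = json.loads(Path("topic-phrases.json").read_text(encoding="utf-8"))
items = payload.values() if isinstance(payload, dict) else payload
for item in items:
    slug = item.get("slug", "?")
    phrase = item.get("suggested_phrase") or item.get("phrase")
    keywords = item.get("keywords", [])
    print(f"{slug}: {phrase!r} ({len(keywords)} keywords)")
```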
## Example Application
Use `stage-topic-phrases` to load those suggestions into the database as review items. Staging stores the candidate in `suggested_phrase` and marks the topic `pending` without changing the active `expansion_phrase`.
Use `review-topic-phrase` to accept or reject one staged suggestion in place. Accepting a suggestion copies it into `expansion_phrase`; rejecting it preserves the review state without changing the live phrase.
Use `export-topic-phrase-reviews` to write an editable JSON template directly from the database for the currently staged suggestions. That gives you a round-trip path from DB review queue to file edits and back into `review-topic-phrases`.
Use `review-topic-phrase` to accept or reject one staged suggestion in place. Accepting a suggestion copies it into `expansion_phrase` and clears it from the staged review queue; rejecting it preserves the staged suggestion together with its review state.
Use `review-topic-phrases` when you want to apply many accept/reject decisions from one JSON file. Each item should carry `slug`, `status`, and optional `phrase` / `review_notes`.
Use `apply-topic-phrases` when you want a direct patch path instead of the staged review flow. It accepts either the raw suggestion list or an object with a `topics` list, and will apply `suggested_phrase` or `phrase` to matching topic slugs immediately.
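Both accepted shapes are easy to produce from Python. A minimal sketch of the two payload forms described above; everything except `slug`, `phrase`, and `suggested_phrase` is illustrative:
```python
import json
from pathlib import Path

# Form 1: the raw suggestion list.
raw_list = [{"slug": "abiogenesis", "suggested_phrase": "abiogenesis life origin"}]
# Form 2: an object with a "topics" list; "phrase" works in place of "suggested_phrase".
wrapped = {"topics": [{"slug": "abiogenesis", "phrase": "abiogenesis life origin"}]}
# Either form is accepted; write one of them out for apply-topic-phrases.
Path("topic-phrases-direct.json").write_text(json.dumps(raw_list, indent=2) + "\n", encoding="utf-8")
```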
Use `topic-phrase-reviews --phrase-review-status pending` when you want a compact audit view of unresolved staged suggestions, including both the current live phrase and the pending replacement.
Use `enrich-talkorigins` when you want to target those weak canonical entries for resolver-based metadata upgrades before retrying graph expansion on imported topic slices.
Use `review-talkorigins` when you want one JSON review artifact that combines weak canonical clusters with dry-run enrichment outcomes for manual cleanup.
Use `expand-topic` when you already have both a topic phrase and a curated topic seed set in the database: it expands outward from the topic's existing entries, then assigns discovered works back to that topic only if they clear a topic-relevance threshold. Write-enabled assignment is stricter than preview ranking: a candidate must clear the score threshold and show a non-generic title anchor to the topic phrase, so broad methods papers do not get attached just because their abstracts or related terms overlap. On large noisy topics, prefer `--seed-key` to restrict the run to the trusted seed entries you want to expand from, and use `--preview` first to inspect discovered candidates and relevance scores before writing anything.
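As an illustration of that two-part gate, here is a hypothetical sketch; the helper name, stoplist, and scoring are invented for explanation and are not the actual `expand-topic` implementation:
```python
# Hypothetical stoplist of generic title words that should not count as anchors.
GENERIC_TERMS = {"analysis", "methods", "model", "study", "approach", "survey"}

def should_assign(title: str, relevance_score: float, topic_phrase: str,
                  threshold: float = 0.5) -> bool:
    """Hypothetical gate: require a score above threshold AND a
    non-generic title token shared with the topic phrase."""
    anchors = (set(title.lower().split()) & set(topic_phrase.lower().split())) - GENERIC_TERMS
    return relevance_score >= threshold and bool(anchors)
```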
@ -189,6 +163,24 @@ Use `set-topic-phrase` to store a curated expansion phrase on the topic itself.
Use `topics --phrase-review-status pending` when you want to audit only topics whose staged phrase suggestions still need review.
`--allow-unsafe-search-matches` exists only for bounded experiments on copied databases when you explicitly want to relax trust to exercise downstream expansion behavior.
The TalkOrigins corpus pipeline remains in the repository as an example application rather than a core package surface. Use the example-scoped Python namespace:
```python
from citegeist.examples.talkorigins import TalkOriginsScraper
```
and the example-scoped CLI commands:
```bash
PYTHONPATH=src .venv/bin/python -m citegeist example-talkorigins-scrape talkorigins-out --limit-topics 5 --limit-entries-per-topic 20
PYTHONPATH=src .venv/bin/python -m citegeist example-talkorigins-validate talkorigins-out/talkorigins_manifest.json
PYTHONPATH=src .venv/bin/python -m citegeist example-talkorigins-duplicates talkorigins-out/talkorigins_manifest.json --limit 20 --preview --weak-only
```
The older `scrape-talkorigins`-style command names remain available as compatibility aliases. The full example workflow and reconstruction notes live in [examples/talkorigins/README.md](./examples/talkorigins/README.md).
For a smaller example that starts from a topic phrase alone, see [examples/artificial-life/README.md](./examples/artificial-life/README.md).
Correction files are simple JSON:
```json
@ -210,15 +202,6 @@ Correction files are simple JSON:
`fields` values overwrite the canonical entry for that duplicate-cluster key. Set a field to `null` to remove it.
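The overwrite-or-remove semantics amount to a dictionary patch. An illustrative helper, not the library's internal code:
```python
def patch_canonical(canonical: dict, fields: dict) -> dict:
    """Illustrative only: `fields` values overwrite the canonical entry,
    and a null (None) value removes that field."""
    patched = dict(canonical)
    for key, value in fields.items():
        if value is None:
            patched.pop(key, None)
        else:
            patched[key] = value
    return patched
```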
To import the reconstructed corpus into SQLite while collapsing duplicate works across topics into canonical entries:
```bash
PYTHONPATH=src .venv/bin/python -m citegeist --db talkorigins.sqlite3 ingest-talkorigins talkorigins-out/talkorigins_manifest.json
```
That import preserves many-to-many topic membership through the `topics` and `entry_topics` tables.
After import, use `topics`, `topic-entries`, `search --topic`, and `export-topic` to inspect or export topic slices from the consolidated database.
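That membership can also be checked with raw SQL. A minimal sketch against the consolidated database, using only the `topics` and `entry_topics` columns that appear in the storage layer:
```python
import sqlite3

# Count entries per topic through the entry_topics join table.
conn = sqlite3.connect("talkorigins.sqlite3")
rows = conn.execute(
    """
    SELECT t.slug, COUNT(et.entry_id) AS entry_count
    FROM topics t
    LEFT JOIN entry_topics et ON et.topic_id = t.id
    GROUP BY t.id
    ORDER BY entry_count DESC
    """
).fetchall()
for slug, count in rows:
    print(f"{slug}: {count}")
conn.close()
```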
Live-source workflow:
```bash

View File: examples/artificial-life/README.md

@ -0,0 +1,100 @@
# Artificial Life Topic-Seeding Example
This example shows the smallest useful `citegeist` workflow that starts from a topic phrase alone.
The seed phrase is:
```text
artificial life
```
## What It Demonstrates
- topic-only bootstrap without a seed `.bib`;
- previewing ranked candidate seed entries before writing anything;
- storing a curated topic slug, topic name, and expansion phrase in the database;
- running later topic-aware expansion from that stored phrase.
## Preview First
Use a preview run to inspect the best candidate seed entries without changing the database:
```bash
PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 \
bootstrap \
--topic "artificial life" \
--topic-slug artificial-life \
--topic-name "Artificial life" \
--store-topic-phrase "artificial life alife artificial organisms complex systems evolution simulation" \
--topic-limit 10 \
--topic-commit-limit 5 \
--preview
```
That returns ranked candidates gathered through the configured resolver/search stack.
## Commit The Topic Seeds
Once the preview looks reasonable, run the same bootstrap without `--preview`:
```bash
PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 \
bootstrap \
--topic "artificial life" \
--topic-slug artificial-life \
--topic-name "Artificial life" \
--store-topic-phrase "artificial life alife artificial organisms complex systems evolution simulation" \
--topic-limit 10 \
--topic-commit-limit 5
```
That does three things:
1. finds topic-relevant seed entries;
2. stores them in the bibliography database;
3. creates or updates the `artificial-life` topic row with the curated expansion phrase.
## Inspect The Result
```bash
PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 topics
PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 topic-entries artificial-life
```
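The same check works from Python. A minimal sketch, assuming a local checkout with `src` on the import path:
```python
from pathlib import Path
from citegeist.storage import BibliographyStore

# Open the database the CLI wrote and read back the topic row.
store = BibliographyStore(Path("library.sqlite3"))
try:
    topic = store.get_topic("artificial-life")
    if topic is not None:
        print(topic.get("name"), "->", topic.get("expansion_phrase"))
finally:
    store.close()
```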
If you want to adjust the stored phrase later:
```bash
PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 \
set-topic-phrase artificial-life "artificial life alife artificial organisms autonomous agents evolution simulation"
```
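Programmatically, that is one store call; per its signature, `set_topic_expansion_phrase` takes `str | None` and returns a bool (passing `None` presumably clears the stored phrase):
```python
from pathlib import Path
from citegeist.storage import BibliographyStore

store = BibliographyStore(Path("library.sqlite3"))
try:
    # Returns a bool; a False result suggests no topic row matched the slug.
    updated = store.set_topic_expansion_phrase(
        "artificial-life",
        "artificial life alife artificial organisms autonomous agents evolution simulation",
    )
    print("updated:", updated)
finally:
    store.close()
```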
## Optional Batch Form
The same topic-only seed can be expressed as a batch job:
```json
[
{
"name": "artificial-life-topic-seed",
"topic": "artificial life",
"topic_slug": "artificial-life",
"topic_name": "Artificial life",
"topic_phrase": "artificial life alife artificial organisms complex systems evolution simulation",
"topic_limit": 10,
"topic_commit_limit": 5,
"expand": false
}
]
```
Run it with:
```bash
PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 bootstrap-batch artificial-life.json
```
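The job file can also be loaded from Python first, e.g. to sanity-check it before a long run. A minimal sketch, assuming `load_batch_jobs` accepts a path to the JSON file (its exact signature is not shown here):
```python
from pathlib import Path
from citegeist.batch import load_batch_jobs

# Assumption: load_batch_jobs takes the JSON path and returns parsed job specs.
jobs = load_batch_jobs(Path("artificial-life.json"))
for job in jobs:
    print(job)
```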
## Notes
- This example is intentionally generic and corpus-independent.
- The exact candidate set depends on live source availability and resolver behavior.
- Prefer preview mode before committing topic-only seeds, because topic phrases are noisier than curated seed `.bib` inputs.

View File: examples/talkorigins/README.md

@ -0,0 +1,52 @@
# TalkOrigins Example
This example shows how to use `citegeist` on a large legacy plaintext bibliography corpus.
It is intentionally positioned as an application of the core library, not as the main product surface.
## What It Demonstrates
- scraping a legacy bibliography index;
- normalizing repeated-author plaintext references;
- converting topic pages into per-topic seed BibTeX;
- generating batch bootstrap specs for downstream ingest and expansion;
- reconstructing cleaned plaintext and BibTeX topic pages for review;
- validating parse quality, duplicate clusters, and weak canonical entries;
- curating topic phrases and correction files before broader enrichment.
The example implementation lives under the Python namespace:
```python
from citegeist.examples.talkorigins import TalkOriginsScraper
```
The preferred CLI commands are example-scoped:
```bash
PYTHONPATH=src .venv/bin/python -m citegeist example-talkorigins-scrape talkorigins-out --limit-topics 5 --limit-entries-per-topic 20
PYTHONPATH=src .venv/bin/python -m citegeist example-talkorigins-validate talkorigins-out/talkorigins_manifest.json
PYTHONPATH=src .venv/bin/python -m citegeist example-talkorigins-duplicates talkorigins-out/talkorigins_manifest.json --limit 20 --preview --weak-only
PYTHONPATH=src .venv/bin/python -m citegeist example-talkorigins-suggest-phrases talkorigins-out/talkorigins_manifest.json --output topic-phrases.json
PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 stage-topic-phrases topic-phrases.json
PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 export-topic-phrase-reviews --output topic-phrase-review.json
PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 review-topic-phrases topic-phrase-review.json
PYTHONPATH=src .venv/bin/python -m citegeist --db talkorigins.sqlite3 example-talkorigins-enrich talkorigins-out/talkorigins_manifest.json --limit 20
PYTHONPATH=src .venv/bin/python -m citegeist --db talkorigins.sqlite3 example-talkorigins-review talkorigins-out/talkorigins_manifest.json --output talkorigins-review.json
PYTHONPATH=src .venv/bin/python -m citegeist --db talkorigins.sqlite3 example-talkorigins-apply-corrections talkorigins-out/talkorigins_manifest.json talkorigins-corrections.json
PYTHONPATH=src .venv/bin/python -m citegeist --db talkorigins.sqlite3 example-talkorigins-ingest talkorigins-out/talkorigins_manifest.json
```
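Downstream tooling can also consume the example's report types directly. A minimal sketch that touches only fields visible in this change set; how the report object is produced is left out:
```python
from citegeist.examples.talkorigins import TalkOriginsValidationReport

def summarize(report: TalkOriginsValidationReport) -> str:
    # These fields appear in this change's test fixtures for the report type.
    return (
        f"{report.manifest_path}: {report.topic_count} topics, "
        f"{report.entry_count} entries, "
        f"{report.duplicate_entry_count} duplicate entries"
    )
```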
## Output Artifacts
The example scrape writes:
- `seeds/*.bib` per-topic seed BibTeX files;
- `plaintext/*.txt` cleaned GSA-style plaintext with repeated authors expanded;
- `site/topics/*.html` reconstructed topic pages with hide/show BibTeX blocks;
- `talkorigins_full.txt` and `talkorigins_full.bib` aggregate downloads;
- `snapshots/*.json` cached topic payloads so reruns can resume.
## Notes
- The example-specific CLI names have compatibility aliases matching the older `scrape-talkorigins`-style command names.
- Topic phrase staging, review, and export commands are generic `citegeist` functionality and are not specific to TalkOrigins.

View File: src/citegeist/__init__.py

@ -7,18 +7,6 @@ from .harvest import OaiMetadataFormat, OaiPmhHarvester, OaiSet
from .resolve import MetadataResolver, merge_entries, merge_entries_with_conflicts
from .sources import SourceClient
from .storage import BibliographyStore
from .talkorigins import (
TalkOriginsBatchExport,
TalkOriginsDuplicateCluster,
TalkOriginsEnrichmentResult,
TalkOriginsIngestReport,
TalkOriginsReviewExport,
TalkOriginsScraper,
TalkOriginsSeedSet,
TalkOriginsTopicPhraseSuggestion,
TalkOriginsTopic,
TalkOriginsValidationReport,
)
__all__ = [
"BibEntry",
@ -34,16 +22,6 @@ __all__ = [
"OaiMetadataFormat",
"OaiSet",
"SourceClient",
"TalkOriginsBatchExport",
"TalkOriginsDuplicateCluster",
"TalkOriginsEnrichmentResult",
"TalkOriginsIngestReport",
"TalkOriginsReviewExport",
"TalkOriginsScraper",
"TalkOriginsSeedSet",
"TalkOriginsTopicPhraseSuggestion",
"TalkOriginsTopic",
"TalkOriginsValidationReport",
"extract_references",
"load_batch_jobs",
"merge_entries",

View File: src/citegeist/cli.py

@ -9,12 +9,12 @@ from pathlib import Path
from .batch import BatchBootstrapRunner, load_batch_jobs
from .bibtex import parse_bibtex, render_bibtex
from .bootstrap import Bootstrapper
from .examples.talkorigins import TalkOriginsScraper
from .expand import CrossrefExpander, OpenAlexExpander, TopicExpander
from .extract import extract_references
from .harvest import OaiPmhHarvester
from .resolve import MetadataResolver, merge_entries_with_conflicts
from .storage import BibliographyStore
from .talkorigins import TalkOriginsScraper
def build_parser() -> argparse.ArgumentParser:
@ -205,8 +205,9 @@ def build_parser() -> argparse.ArgumentParser:
batch_parser.add_argument("input", help="Path to batch JSON file")
talkorigins_parser = subparsers.add_parser(
"scrape-talkorigins",
help="Scrape TalkOrigins into per-topic seed BibTeX files and a bootstrap-batch JSON file",
"example-talkorigins-scrape",
aliases=["scrape-talkorigins"],
help="Example workflow: scrape TalkOrigins into per-topic seed BibTeX files and a bootstrap-batch JSON file",
)
talkorigins_parser.add_argument(
"output_dir",
@ -257,14 +258,16 @@ def build_parser() -> argparse.ArgumentParser:
talkorigins_parser.add_argument("--status", default="draft", help="Review status for generated seed jobs")
validate_talkorigins_parser = subparsers.add_parser(
"validate-talkorigins",
help="Validate a generated TalkOrigins manifest and report parse coverage and suspicious entries",
"example-talkorigins-validate",
aliases=["validate-talkorigins"],
help="Example workflow: validate a generated TalkOrigins manifest and report parse coverage and suspicious entries",
)
validate_talkorigins_parser.add_argument("manifest", help="Path to talkorigins_manifest.json")
suggest_talkorigins_parser = subparsers.add_parser(
"suggest-talkorigins-phrases",
help="Suggest stored topic expansion phrases from a TalkOrigins manifest",
"example-talkorigins-suggest-phrases",
aliases=["suggest-talkorigins-phrases"],
help="Example workflow: suggest stored topic expansion phrases from a TalkOrigins manifest",
)
suggest_talkorigins_parser.add_argument("manifest", help="Path to talkorigins_manifest.json")
suggest_talkorigins_parser.add_argument("--topic", help="Optional topic slug to restrict suggestions")
@ -298,9 +301,16 @@ def build_parser() -> argparse.ArgumentParser:
help="Optional expansion phrase override to apply with the review decision",
)
review_topic_phrases_parser = subparsers.add_parser(
"review-topic-phrases",
help="Apply topic phrase review decisions in bulk from JSON",
)
review_topic_phrases_parser.add_argument("input", help="Path to JSON file containing topic phrase review records")
duplicates_talkorigins_parser = subparsers.add_parser(
"duplicates-talkorigins",
help="Inspect duplicate clusters in a generated TalkOrigins manifest",
"example-talkorigins-duplicates",
aliases=["duplicates-talkorigins"],
help="Example workflow: inspect duplicate clusters in a generated TalkOrigins manifest",
)
duplicates_talkorigins_parser.add_argument("manifest", help="Path to talkorigins_manifest.json")
duplicates_talkorigins_parser.add_argument("--limit", type=int, default=20, help="Maximum clusters to show")
@ -324,8 +334,9 @@ def build_parser() -> argparse.ArgumentParser:
)
ingest_talkorigins_parser = subparsers.add_parser(
"ingest-talkorigins",
help="Ingest a TalkOrigins manifest into the database with duplicate consolidation and topic membership",
"example-talkorigins-ingest",
aliases=["ingest-talkorigins"],
help="Example workflow: ingest a TalkOrigins manifest into the database with duplicate consolidation and topic membership",
)
ingest_talkorigins_parser.add_argument("manifest", help="Path to talkorigins_manifest.json")
ingest_talkorigins_parser.add_argument("--status", default="draft", help="Review status for imported entries")
@ -336,8 +347,9 @@ def build_parser() -> argparse.ArgumentParser:
)
enrich_talkorigins_parser = subparsers.add_parser(
"enrich-talkorigins",
help="Attempt metadata enrichment for weak TalkOrigins canonical entries",
"example-talkorigins-enrich",
aliases=["enrich-talkorigins"],
help="Example workflow: attempt metadata enrichment for weak TalkOrigins canonical entries",
)
enrich_talkorigins_parser.add_argument("manifest", help="Path to talkorigins_manifest.json")
enrich_talkorigins_parser.add_argument("--limit", type=int, default=20, help="Maximum weak clusters to inspect")
@ -366,8 +378,9 @@ def build_parser() -> argparse.ArgumentParser:
)
review_talkorigins_parser = subparsers.add_parser(
"review-talkorigins",
help="Export weak TalkOrigins clusters plus dry-run enrichment outcomes for manual review",
"example-talkorigins-review",
aliases=["review-talkorigins"],
help="Example workflow: export weak TalkOrigins clusters plus dry-run enrichment outcomes for manual review",
)
review_talkorigins_parser.add_argument("manifest", help="Path to talkorigins_manifest.json")
review_talkorigins_parser.add_argument("--limit", type=int, default=20, help="Maximum weak clusters to export")
@ -382,8 +395,9 @@ def build_parser() -> argparse.ArgumentParser:
review_talkorigins_parser.add_argument("--output", help="Write review export JSON to a file instead of stdout")
apply_review_talkorigins_parser = subparsers.add_parser(
"apply-talkorigins-corrections",
help="Apply curated TalkOrigins review corrections to the consolidated database",
"example-talkorigins-apply-corrections",
aliases=["apply-talkorigins-corrections"],
help="Example workflow: apply curated TalkOrigins review corrections to the consolidated database",
)
apply_review_talkorigins_parser.add_argument("manifest", help="Path to talkorigins_manifest.json")
apply_review_talkorigins_parser.add_argument("corrections", help="Path to corrections JSON")
@ -401,6 +415,33 @@ def build_parser() -> argparse.ArgumentParser:
help="Restrict topics to one stored phrase review state",
)
topic_phrase_reviews_parser = subparsers.add_parser(
"topic-phrase-reviews",
help="List staged topic phrase suggestions and their review state",
)
topic_phrase_reviews_parser.add_argument("--limit", type=int, default=100, help="Maximum reviews to list")
topic_phrase_reviews_parser.add_argument(
"--phrase-review-status",
choices=["unreviewed", "pending", "accepted", "rejected"],
help="Restrict results to one stored phrase review state",
)
export_topic_phrase_reviews_parser = subparsers.add_parser(
"export-topic-phrase-reviews",
help="Export an editable JSON review template for staged topic phrase suggestions",
)
export_topic_phrase_reviews_parser.add_argument("--limit", type=int, default=100, help="Maximum reviews to export")
export_topic_phrase_reviews_parser.add_argument(
"--phrase-review-status",
choices=["unreviewed", "pending", "accepted", "rejected"],
default="pending",
help="Restrict exported reviews to one stored phrase review state",
)
export_topic_phrase_reviews_parser.add_argument(
"--output",
help="Write the review template JSON to a file instead of stdout",
)
topic_entries_parser = subparsers.add_parser(
"topic-entries",
help="List entries assigned to one topic",
@ -497,7 +538,7 @@ def main(argv: list[str] | None = None) -> int:
)
if args.command == "bootstrap-batch":
return _run_bootstrap_batch(store, Path(args.input))
if args.command == "scrape-talkorigins":
if args.command in {"example-talkorigins-scrape", "scrape-talkorigins"}:
return _run_scrape_talkorigins(
store,
args.base_url,
@ -512,9 +553,9 @@ def main(argv: list[str] | None = None) -> int:
args.topic_commit_limit,
args.status,
)
if args.command == "validate-talkorigins":
if args.command in {"example-talkorigins-validate", "validate-talkorigins"}:
return _run_validate_talkorigins(Path(args.manifest))
if args.command == "suggest-talkorigins-phrases":
if args.command in {"example-talkorigins-suggest-phrases", "suggest-talkorigins-phrases"}:
return _run_suggest_talkorigins_phrases(Path(args.manifest), args.topic, args.limit, args.output)
if args.command == "apply-topic-phrases":
return _run_apply_topic_phrases(store, Path(args.input))
@ -522,7 +563,9 @@ def main(argv: list[str] | None = None) -> int:
return _run_stage_topic_phrases(store, Path(args.input))
if args.command == "review-topic-phrase":
return _run_review_topic_phrase(store, args.topic_slug, args.status, args.notes, args.phrase)
if args.command == "duplicates-talkorigins":
if args.command == "review-topic-phrases":
return _run_review_topic_phrases(store, Path(args.input))
if args.command in {"example-talkorigins-duplicates", "duplicates-talkorigins"}:
return _run_duplicates_talkorigins(
Path(args.manifest),
args.limit,
@ -532,9 +575,9 @@ def main(argv: list[str] | None = None) -> int:
args.preview,
args.weak_only,
)
if args.command == "ingest-talkorigins":
if args.command in {"example-talkorigins-ingest", "ingest-talkorigins"}:
return _run_ingest_talkorigins(store, Path(args.manifest), args.status, not args.no_dedupe)
if args.command == "enrich-talkorigins":
if args.command in {"example-talkorigins-enrich", "enrich-talkorigins"}:
return _run_enrich_talkorigins(
store,
Path(args.manifest),
@ -546,7 +589,7 @@ def main(argv: list[str] | None = None) -> int:
args.status,
args.allow_unsafe_search_matches,
)
if args.command == "review-talkorigins":
if args.command in {"example-talkorigins-review", "review-talkorigins"}:
return _run_review_talkorigins(
store,
Path(args.manifest),
@ -556,7 +599,7 @@ def main(argv: list[str] | None = None) -> int:
args.topic,
args.output,
)
if args.command == "apply-talkorigins-corrections":
if args.command in {"example-talkorigins-apply-corrections", "apply-talkorigins-corrections"}:
return _run_apply_talkorigins_corrections(
store,
Path(args.manifest),
@ -565,6 +608,10 @@ def main(argv: list[str] | None = None) -> int:
)
if args.command == "topics":
return _run_topics(store, args.limit, args.phrase_review_status)
if args.command == "topic-phrase-reviews":
return _run_topic_phrase_reviews(store, args.limit, args.phrase_review_status)
if args.command == "export-topic-phrase-reviews":
return _run_export_topic_phrase_reviews(store, args.limit, args.phrase_review_status, args.output)
if args.command == "topic-entries":
return _run_topic_entries(store, args.topic_slug, args.limit)
if args.command == "export-topic":
@ -1056,6 +1103,51 @@ def _run_review_topic_phrase(
return 0
def _run_review_topic_phrases(store: BibliographyStore, input_path: Path) -> int:
payload = json.loads(input_path.read_text(encoding="utf-8"))
if isinstance(payload, dict):
items = payload.get("topics", payload.get("items", []))
else:
items = payload
if not isinstance(items, list):
print("Topic phrase review JSON must be a list or an object with a 'topics' or 'items' list", file=sys.stderr)
return 1
results: list[dict[str, object]] = []
exit_code = 0
for item in items:
if not isinstance(item, dict):
continue
slug = str(item.get("slug") or "")
status = str(item.get("status") or item.get("phrase_review_status") or "")
notes = item.get("review_notes")
phrase = item.get("phrase", item.get("expansion_phrase"))
if not slug or status not in {"accepted", "rejected"}:
continue
if notes is not None:
notes = str(notes)
if phrase is not None:
phrase = str(phrase)
reviewed = store.review_topic_phrase_suggestion(
slug,
review_status=status,
review_notes=notes,
applied_phrase=phrase,
)
if not reviewed:
exit_code = 1
results.append(
{
"slug": slug,
"phrase_review_status": status,
"expansion_phrase": phrase,
"reviewed": reviewed,
}
)
print(json.dumps(results, indent=2))
return exit_code
def _run_duplicates_talkorigins(
manifest_path: Path,
limit: int,
@ -1171,6 +1263,39 @@ def _run_topics(store: BibliographyStore, limit: int, phrase_review_status: str
return 0
def _run_topic_phrase_reviews(store: BibliographyStore, limit: int, phrase_review_status: str | None) -> int:
print(json.dumps(store.list_topic_phrase_reviews(limit=limit, phrase_review_status=phrase_review_status), indent=2))
return 0
def _run_export_topic_phrase_reviews(
store: BibliographyStore,
limit: int,
phrase_review_status: str | None,
output: str | None,
) -> int:
items = store.list_topic_phrase_reviews(limit=limit, phrase_review_status=phrase_review_status)
payload = [
{
"slug": item["slug"],
"topic": item["name"],
"current_expansion_phrase": item.get("expansion_phrase"),
"suggested_phrase": item.get("suggested_phrase"),
"current_status": item.get("phrase_review_status"),
"review_notes": item.get("phrase_review_notes"),
"status": "",
"phrase": item.get("suggested_phrase"),
}
for item in items
]
rendered = json.dumps(payload, indent=2)
if output:
Path(output).write_text(rendered + "\n", encoding="utf-8")
else:
print(rendered)
return 0
def _run_topic_entries(store: BibliographyStore, topic_slug: str, limit: int) -> int:
topic = store.get_topic(topic_slug)
if topic is None:

View File

@ -0,0 +1,29 @@
from .talkorigins import (
TalkOriginsBatchExport,
TalkOriginsCorrectionResult,
TalkOriginsDuplicateCluster,
TalkOriginsEnrichmentResult,
TalkOriginsIngestReport,
TalkOriginsReviewExport,
TalkOriginsScraper,
TalkOriginsSeedSet,
TalkOriginsTopic,
TalkOriginsTopicPhraseSuggestion,
TalkOriginsValidationReport,
normalize_topic_entries,
)
__all__ = [
"TalkOriginsBatchExport",
"TalkOriginsCorrectionResult",
"TalkOriginsDuplicateCluster",
"TalkOriginsEnrichmentResult",
"TalkOriginsIngestReport",
"TalkOriginsReviewExport",
"TalkOriginsScraper",
"TalkOriginsSeedSet",
"TalkOriginsTopic",
"TalkOriginsTopicPhraseSuggestion",
"TalkOriginsValidationReport",
"normalize_topic_entries",
]

View File

@ -0,0 +1,29 @@
from ..talkorigins import (
TalkOriginsBatchExport,
TalkOriginsCorrectionResult,
TalkOriginsDuplicateCluster,
TalkOriginsEnrichmentResult,
TalkOriginsIngestReport,
TalkOriginsReviewExport,
TalkOriginsScraper,
TalkOriginsSeedSet,
TalkOriginsTopic,
TalkOriginsTopicPhraseSuggestion,
TalkOriginsValidationReport,
normalize_topic_entries,
)
__all__ = [
"TalkOriginsBatchExport",
"TalkOriginsCorrectionResult",
"TalkOriginsDuplicateCluster",
"TalkOriginsEnrichmentResult",
"TalkOriginsIngestReport",
"TalkOriginsReviewExport",
"TalkOriginsScraper",
"TalkOriginsSeedSet",
"TalkOriginsTopic",
"TalkOriginsTopicPhraseSuggestion",
"TalkOriginsValidationReport",
"normalize_topic_entries",
]

View File

@ -603,6 +603,43 @@ class BibliographyStore:
).fetchone()
return dict(row) if row else None
def list_topic_phrase_reviews(
self,
limit: int = 100,
phrase_review_status: str | None = None,
) -> list[dict[str, object]]:
where = "WHERE t.suggested_phrase IS NOT NULL"
params: list[object] = []
if phrase_review_status is not None:
where += " AND t.phrase_review_status = ?"
params.append(phrase_review_status)
params.append(limit)
rows = self.connection.execute(
f"""
SELECT t.slug, t.name, t.expansion_phrase, t.suggested_phrase,
t.phrase_review_status, t.phrase_review_notes,
COUNT(et.entry_id) AS entry_count
FROM topics t
LEFT JOIN entry_topics et ON et.topic_id = t.id
{where}
GROUP BY t.id, t.slug, t.name, t.expansion_phrase, t.suggested_phrase,
t.phrase_review_status, t.phrase_review_notes
ORDER BY
CASE t.phrase_review_status
WHEN 'pending' THEN 0
WHEN 'unreviewed' THEN 1
WHEN 'rejected' THEN 2
WHEN 'accepted' THEN 3
ELSE 4
END,
t.name,
t.slug
LIMIT ?
""",
params,
).fetchall()
return [dict(row) for row in rows]
def set_topic_expansion_phrase(self, slug: str, expansion_phrase: str | None) -> bool:
row = self.connection.execute(
"""
@ -651,8 +688,10 @@ class BibliographyStore:
suggested_phrase = topic.get("suggested_phrase")
expansion_phrase = topic.get("expansion_phrase")
stored_suggested_phrase = suggested_phrase
if review_status == "accepted":
expansion_phrase = applied_phrase if applied_phrase is not None else suggested_phrase
stored_suggested_phrase = None
elif applied_phrase is not None:
expansion_phrase = applied_phrase
@ -660,13 +699,14 @@ class BibliographyStore:
"""
UPDATE topics
SET expansion_phrase = ?,
suggested_phrase = ?,
phrase_review_status = ?,
phrase_review_notes = ?,
updated_at = CURRENT_TIMESTAMP
WHERE slug = ?
RETURNING id
""",
(expansion_phrase, review_status, review_notes, slug),
(expansion_phrase, stored_suggested_phrase, review_status, review_notes, slug),
).fetchone()
self.connection.commit()
return row is not None

View File

@ -1,3 +1,10 @@
"""TalkOrigins example implementation.
This module backs the example-facing namespace at ``citegeist.examples.talkorigins``.
New code should prefer importing from the examples namespace rather than treating
TalkOrigins support as part of the core top-level package surface.
"""
from __future__ import annotations
from collections import Counter

View File

@ -7,6 +7,16 @@ from pathlib import Path
from unittest.mock import patch
from citegeist.cli import main
from citegeist.examples.talkorigins import (
TalkOriginsBatchExport,
TalkOriginsCorrectionResult,
TalkOriginsDuplicateCluster,
TalkOriginsEnrichmentResult,
TalkOriginsIngestReport,
TalkOriginsReviewExport,
TalkOriginsTopicPhraseSuggestion,
TalkOriginsValidationReport,
)
SAMPLE_BIB = """
@ -313,7 +323,7 @@ def test_cli_scrape_talkorigins_accepts_output_dir(tmp_path):
database = tmp_path / "library.sqlite3"
with patch("citegeist.cli.TalkOriginsScraper.scrape_to_directory") as mocked_scrape:
mocked_scrape.return_value = __import__("citegeist").TalkOriginsBatchExport(
mocked_scrape.return_value = TalkOriginsBatchExport(
base_url="https://www.talkorigins.org/origins/biblio/",
output_dir=str(tmp_path),
topic_count=1,
@ -326,7 +336,7 @@ def test_cli_scrape_talkorigins_accepts_output_dir(tmp_path):
[
"--db",
str(database),
"scrape-talkorigins",
"example-talkorigins-scrape",
str(tmp_path / "talkorigins-out"),
"--limit-topics",
"3",
@ -346,7 +356,7 @@ def test_cli_validate_talkorigins_accepts_manifest(tmp_path):
manifest = tmp_path / "talkorigins_manifest.json"
manifest.write_text("{}", encoding="utf-8")
with patch("citegeist.cli.TalkOriginsScraper.validate_export") as mocked_validate:
mocked_validate.return_value = __import__("citegeist").TalkOriginsValidationReport(
mocked_validate.return_value = TalkOriginsValidationReport(
manifest_path=str(manifest),
topic_count=1,
entry_count=2,
@ -360,7 +370,7 @@ def test_cli_validate_talkorigins_accepts_manifest(tmp_path):
duplicate_entry_count=0,
duplicate_examples=[],
)
exit_code = main(["validate-talkorigins", str(manifest)])
exit_code = main(["example-talkorigins-validate", str(manifest)])
assert exit_code == 0
@ -373,7 +383,7 @@ def test_cli_suggest_talkorigins_phrases_writes_output(tmp_path):
output = tmp_path / "phrases.json"
with patch("citegeist.cli.TalkOriginsScraper.suggest_topic_phrases") as mocked_suggest:
mocked_suggest.return_value = [
__import__("citegeist", fromlist=["TalkOriginsTopicPhraseSuggestion"]).TalkOriginsTopicPhraseSuggestion(
TalkOriginsTopicPhraseSuggestion(
slug="abiogenesis",
topic="Abiogenesis",
entry_count=2,
@ -385,7 +395,7 @@ def test_cli_suggest_talkorigins_phrases_writes_output(tmp_path):
]
exit_code = main(
[
"suggest-talkorigins-phrases",
"example-talkorigins-suggest-phrases",
str(manifest),
"--topic",
"abiogenesis",
@ -406,7 +416,7 @@ def test_cli_duplicates_talkorigins_accepts_manifest(tmp_path):
manifest.write_text("{}", encoding="utf-8")
with patch("citegeist.cli.TalkOriginsScraper.inspect_duplicate_clusters") as mocked_duplicates:
mocked_duplicates.return_value = [
__import__("citegeist.talkorigins", fromlist=["TalkOriginsDuplicateCluster"]).TalkOriginsDuplicateCluster(
TalkOriginsDuplicateCluster(
key="smith|1999|duplicate paper",
count=2,
items=[
@ -431,7 +441,7 @@ def test_cli_duplicates_talkorigins_accepts_manifest(tmp_path):
]
exit_code = main(
[
"duplicates-talkorigins",
"example-talkorigins-duplicates",
str(manifest),
"--topic",
"abiogenesis",
@ -452,7 +462,7 @@ def test_cli_ingest_talkorigins_accepts_manifest(tmp_path):
manifest = tmp_path / "talkorigins_manifest.json"
manifest.write_text("{}", encoding="utf-8")
with patch("citegeist.cli.TalkOriginsScraper.ingest_export") as mocked_ingest:
mocked_ingest.return_value = __import__("citegeist").TalkOriginsIngestReport(
mocked_ingest.return_value = TalkOriginsIngestReport(
manifest_path=str(manifest),
topic_count=1,
raw_entry_count=2,
@ -461,7 +471,7 @@ def test_cli_ingest_talkorigins_accepts_manifest(tmp_path):
duplicate_entry_count=2,
canonicalized_count=1,
)
exit_code = main(["--db", str(database), "ingest-talkorigins", str(manifest)])
exit_code = main(["--db", str(database), "example-talkorigins-ingest", str(manifest)])
assert exit_code == 0
@ -474,7 +484,7 @@ def test_cli_enrich_talkorigins_accepts_manifest(tmp_path):
manifest.write_text("{}", encoding="utf-8")
with patch("citegeist.cli.TalkOriginsScraper.enrich_weak_canonicals") as mocked_enrich:
mocked_enrich.return_value = [
__import__("citegeist.talkorigins", fromlist=["TalkOriginsEnrichmentResult"]).TalkOriginsEnrichmentResult(
TalkOriginsEnrichmentResult(
key="smith|1999|duplicate paper",
citation_key="dup1",
weak_reasons_before=["missing:doi"],
@ -490,7 +500,7 @@ def test_cli_enrich_talkorigins_accepts_manifest(tmp_path):
[
"--db",
str(database),
"enrich-talkorigins",
"example-talkorigins-enrich",
str(manifest),
"--limit",
"5",
@ -510,7 +520,7 @@ def test_cli_review_talkorigins_writes_output(tmp_path):
manifest.write_text("{}", encoding="utf-8")
output = tmp_path / "review.json"
with patch("citegeist.cli.TalkOriginsScraper.build_review_export") as mocked_review:
mocked_review.return_value = __import__("citegeist.talkorigins", fromlist=["TalkOriginsReviewExport"]).TalkOriginsReviewExport(
mocked_review.return_value = TalkOriginsReviewExport(
manifest_path=str(manifest),
item_count=1,
items=[{"key": "smith|1999|duplicate paper", "canonical": {}, "enrichment": {}}],
@ -519,7 +529,7 @@ def test_cli_review_talkorigins_writes_output(tmp_path):
[
"--db",
str(database),
"review-talkorigins",
"example-talkorigins-review",
str(manifest),
"--output",
str(output),
@ -540,7 +550,7 @@ def test_cli_apply_talkorigins_corrections_accepts_files(tmp_path):
corrections.write_text('{"corrections": []}', encoding="utf-8")
with patch("citegeist.cli.TalkOriginsScraper.apply_review_corrections") as mocked_apply:
mocked_apply.return_value = [
__import__("citegeist.talkorigins", fromlist=["TalkOriginsCorrectionResult"]).TalkOriginsCorrectionResult(
TalkOriginsCorrectionResult(
key="smith|1999|duplicate paper",
citation_key="dup1",
applied=True,
@ -551,7 +561,7 @@ def test_cli_apply_talkorigins_corrections_accepts_files(tmp_path):
[
"--db",
str(database),
"apply-talkorigins-corrections",
"example-talkorigins-apply-corrections",
str(manifest),
str(corrections),
]
@ -797,7 +807,7 @@ def test_cli_can_review_topic_phrase(tmp_path: Path):
)
assert result.returncode == 0
payload = json.loads(result.stdout)
assert payload["suggested_phrase"] == "graph networks biology"
assert payload["suggested_phrase"] is None
assert payload["expansion_phrase"] == "graph networks biology"
assert payload["phrase_review_status"] == "accepted"
assert payload["phrase_review_notes"] == "curated and approved"
@ -844,6 +854,172 @@ def test_cli_topics_can_filter_by_phrase_review_status(tmp_path: Path):
assert [topic["slug"] for topic in payload] == ["graph-methods"]
def test_cli_can_list_topic_phrase_reviews(tmp_path: Path):
bib_path = tmp_path / "input.bib"
bib_path.write_text(
"""
@article{seed2024,
author = {Seed, Alice},
title = {Seed Paper},
year = {2024}
}
""",
encoding="utf-8",
)
ingest = run_cli(tmp_path, "ingest", str(bib_path))
assert ingest.returncode == 0
from citegeist.storage import BibliographyStore
database = tmp_path / "library.sqlite3"
store = BibliographyStore(database)
try:
store.add_entry_topic(
"seed2024",
topic_slug="graph-methods",
topic_name="Graph Methods",
source_type="talkorigins",
source_url="https://example.org/topics/graph-methods",
source_label="topic-seed",
)
store.ensure_topic("abiogenesis", "Abiogenesis")
store.stage_topic_phrase_suggestion("graph-methods", "graph networks biology")
store.stage_topic_phrase_suggestion("abiogenesis", "abiogenesis life origin")
store.review_topic_phrase_suggestion("abiogenesis", "accepted")
finally:
store.close()
result = run_cli(tmp_path, "topic-phrase-reviews", "--phrase-review-status", "pending")
assert result.returncode == 0
payload = json.loads(result.stdout)
assert [review["slug"] for review in payload] == ["graph-methods"]
assert payload[0]["suggested_phrase"] == "graph networks biology"
assert payload[0]["phrase_review_status"] == "pending"
def test_cli_can_review_topic_phrases_in_bulk(tmp_path: Path):
bib_path = tmp_path / "input.bib"
bib_path.write_text(
"""
@article{seed2024,
author = {Seed, Alice},
title = {Seed Paper},
year = {2024}
}
""",
encoding="utf-8",
)
ingest = run_cli(tmp_path, "ingest", str(bib_path))
assert ingest.returncode == 0
from citegeist.storage import BibliographyStore
database = tmp_path / "library.sqlite3"
store = BibliographyStore(database)
try:
store.add_entry_topic(
"seed2024",
topic_slug="graph-methods",
topic_name="Graph Methods",
source_type="talkorigins",
source_url="https://example.org/topics/graph-methods",
source_label="topic-seed",
)
store.ensure_topic("abiogenesis", "Abiogenesis")
store.stage_topic_phrase_suggestion("graph-methods", "graph networks biology")
store.stage_topic_phrase_suggestion("abiogenesis", "abiogenesis life origin")
finally:
store.close()
review_path = tmp_path / "phrase-review.json"
review_path.write_text(
json.dumps(
[
{
"slug": "graph-methods",
"status": "accepted",
"review_notes": "good phrase",
},
{
"slug": "abiogenesis",
"status": "rejected",
"review_notes": "too sparse",
},
]
),
encoding="utf-8",
)
result = run_cli(tmp_path, "review-topic-phrases", str(review_path))
assert result.returncode == 0
payload = json.loads(result.stdout)
assert payload[0]["reviewed"] is True
assert payload[1]["reviewed"] is True
pending_result = run_cli(tmp_path, "topic-phrase-reviews", "--phrase-review-status", "pending")
assert pending_result.returncode == 0
assert json.loads(pending_result.stdout) == []
rejected_result = run_cli(tmp_path, "topic-phrase-reviews", "--phrase-review-status", "rejected")
assert rejected_result.returncode == 0
rejected_payload = json.loads(rejected_result.stdout)
assert [review["slug"] for review in rejected_payload] == ["abiogenesis"]
topics_result = run_cli(tmp_path, "topics", "--phrase-review-status", "accepted")
assert topics_result.returncode == 0
topics_payload = json.loads(topics_result.stdout)
assert [topic["slug"] for topic in topics_payload] == ["graph-methods"]
def test_cli_can_export_topic_phrase_review_template(tmp_path: Path):
bib_path = tmp_path / "input.bib"
bib_path.write_text(
"""
@article{seed2024,
author = {Seed, Alice},
title = {Seed Paper},
year = {2024}
}
""",
encoding="utf-8",
)
ingest = run_cli(tmp_path, "ingest", str(bib_path))
assert ingest.returncode == 0
from citegeist.storage import BibliographyStore
database = tmp_path / "library.sqlite3"
store = BibliographyStore(database)
try:
store.add_entry_topic(
"seed2024",
topic_slug="graph-methods",
topic_name="Graph Methods",
source_type="talkorigins",
source_url="https://example.org/topics/graph-methods",
source_label="topic-seed",
)
store.stage_topic_phrase_suggestion("graph-methods", "graph networks biology")
finally:
store.close()
output_path = tmp_path / "topic-phrase-review.json"
result = run_cli(
tmp_path,
"export-topic-phrase-reviews",
"--output",
str(output_path),
)
assert result.returncode == 0
payload = json.loads(output_path.read_text(encoding="utf-8"))
assert [item["slug"] for item in payload] == ["graph-methods"]
assert payload[0]["current_expansion_phrase"] is None
assert payload[0]["suggested_phrase"] == "graph networks biology"
assert payload[0]["current_status"] == "pending"
assert payload[0]["status"] == ""
assert payload[0]["phrase"] == "graph networks biology"
def test_cli_export_topic(tmp_path: Path):
bib_path = tmp_path / "input.bib"
bib_path.write_text(

View File

@ -307,7 +307,7 @@ def test_store_can_stage_and_review_topic_phrase_suggestion():
reviewed = store.get_topic("graph-methods")
assert reviewed is not None
assert reviewed["suggested_phrase"] == "graph networks biology"
assert reviewed["suggested_phrase"] is None
assert reviewed["expansion_phrase"] == "graph networks biology"
assert reviewed["phrase_review_status"] == "accepted"
assert reviewed["phrase_review_notes"] == "looks good"
@ -333,6 +333,52 @@ def test_store_can_filter_topics_by_phrase_review_status():
store.close()
def test_store_can_list_topic_phrase_reviews():
store = BibliographyStore()
try:
store.ensure_topic("graph-methods", "Graph Methods")
store.ensure_topic("abiogenesis", "Abiogenesis")
store.ensure_topic("plain-topic", "Plain Topic")
store.stage_topic_phrase_suggestion("graph-methods", "graph networks biology")
store.stage_topic_phrase_suggestion("abiogenesis", "abiogenesis life origin")
store.review_topic_phrase_suggestion("abiogenesis", "accepted")
reviews = store.list_topic_phrase_reviews()
pending_reviews = store.list_topic_phrase_reviews(phrase_review_status="pending")
assert [review["slug"] for review in reviews] == ["graph-methods"]
assert reviews[0]["suggested_phrase"] == "graph networks biology"
assert reviews[0]["phrase_review_status"] == "pending"
assert [review["slug"] for review in pending_reviews] == ["graph-methods"]
finally:
store.close()
def test_store_rejected_topic_phrase_stays_in_review_queue():
store = BibliographyStore()
try:
store.ensure_topic("graph-methods", "Graph Methods")
store.stage_topic_phrase_suggestion("graph-methods", "graph networks biology")
assert store.review_topic_phrase_suggestion(
"graph-methods",
"rejected",
review_notes="too broad",
) is True
topic = store.get_topic("graph-methods")
assert topic is not None
assert topic["suggested_phrase"] == "graph networks biology"
assert topic["expansion_phrase"] is None
assert topic["phrase_review_status"] == "rejected"
reviews = store.list_topic_phrase_reviews()
assert [review["slug"] for review in reviews] == ["graph-methods"]
assert reviews[0]["phrase_review_status"] == "rejected"
finally:
store.close()
def test_store_search_text_can_filter_by_topic():
store = BibliographyStore()
try:

View File

@ -5,8 +5,8 @@ from pathlib import Path
from citegeist.batch import load_batch_jobs
from citegeist.bibtex import BibEntry
from citegeist.examples.talkorigins import TalkOriginsScraper, normalize_topic_entries
from citegeist.storage import BibliographyStore
from citegeist.talkorigins import TalkOriginsScraper, normalize_topic_entries
INDEX_HTML = """