Compare commits

No commits in common. "c1a977b5e215f512b685fa31c16d1a59a10d51e3" and "b74582b72f09f36b63e459c26e3cc7ea3d0696c2" have entirely different histories.

README.md | 75
@@ -56,15 +56,11 @@ The initial repo includes:

- OAI-PMH repository discovery via `Identify`, `ListSets`, and `ListMetadataFormats` to target harvests more precisely;
- bibliography bootstrap workflows that can start from a seed `.bib`, a topic phrase, or both;
- batch bootstrap orchestration from JSON job files containing seed BibTeX paths, topic phrases, or both;
- a TalkOrigins scraper that fixes repeated-author plaintext references, emits per-topic seed BibTeX files, and writes a batch JSON specification;
- normalized tables for entries, creators, identifiers, and citation relations;
- full-text-search-ready indexing over title, abstract, and fulltext when SQLite FTS5 is available;
- tests covering parsing, ingestion, relation storage, and search.

Example applications live alongside the core package rather than defining it. Current examples include:

- a topic-only bootstrap workflow for `artificial life` in [examples/artificial-life/README.md](./examples/artificial-life/README.md);
- the TalkOrigins bibliography pipeline under [`citegeist.examples.talkorigins`](./src/citegeist/examples/talkorigins.py) with a usage guide in [examples/talkorigins/README.md](./examples/talkorigins/README.md).

The prioritized execution plan lives in [ROADMAP.md](./ROADMAP.md).

## Layout
@@ -73,7 +69,6 @@ The prioritized execution plan lives in [ROADMAP.md](./ROADMAP.md).

citegeist/
  src/citegeist/
    bibtex.py
    examples/
    storage.py
  tests/
    test_storage.py
@@ -130,6 +125,7 @@ PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 resolve-confli

PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 apply-conflict smith2024graphs title
PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 bootstrap --seed-bib seed.bib --topic "bayesian nonparametrics"
PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 bootstrap --topic "bayesian nonparametrics" --preview --topic-commit-limit 5
PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 scrape-talkorigins talkorigins-out --limit-topics 5 --limit-entries-per-topic 20
PYTHONPATH=src .venv/bin/python -m citegeist extract references.txt --output draft.bib
PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 resolve smith2024graphs
PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 topics
@@ -147,14 +143,44 @@ PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 export --outpu

For live-source development, prefer fixture-backed or cache-backed source clients so resolver and expansion work can be exercised repeatedly without re-hitting upstream APIs on every run.

## Example Application

For large legacy plaintext corpora such as the TalkOrigins bibliography, prefer a two-step workflow:

1. `scrape-talkorigins` to generate cleaned per-topic `seed_bib` files plus a `talkorigins_jobs.json` batch spec.
2. `bootstrap-batch` on that JSON file when you want to ingest, resolve, and expand from the generated seeds.
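Reading such a batch spec can be sketched in a few lines. This is illustrative only: the `name` and `topic` keys mirror the batch job example later in this README, while treating `seed_bib` as the per-job seed path key is an assumption, not confirmed from the actual schema.

```python
import json
from pathlib import Path

def load_jobs(spec_path: str) -> list[dict]:
    """Load bootstrap-batch job dicts from a JSON spec file, keeping only runnable jobs."""
    jobs = json.loads(Path(spec_path).read_text(encoding="utf-8"))
    # A job is runnable if it carries a seed BibTeX path, a topic phrase, or both.
    return [job for job in jobs if job.get("seed_bib") or job.get("topic")]
```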
The TalkOrigins scrape output now includes:

- `seeds/*.bib` per-topic seed BibTeX files for `bootstrap-batch`
- `plaintext/*.txt` per-topic cleaned GSA-style plaintext with repeated authors expanded
- `site/topics/*.html` reconstructed topic pages with hide/show BibTeX blocks
- `talkorigins_full.txt` and `talkorigins_full.bib` aggregate downloads
- `snapshots/*.json` cached topic payloads so reruns can resume without re-fetching already scraped topics
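The resume behavior of the `snapshots/` cache can be sketched as follows. The `fetch_topic` callable and the snapshot file layout are illustrative assumptions, not the scraper's actual internals:

```python
import json
from pathlib import Path

def fetch_with_resume(slug: str, fetch_topic, out_dir: str) -> dict:
    """Return a cached topic payload if present; otherwise fetch it and cache it."""
    snapshot = Path(out_dir) / "snapshots" / f"{slug}.json"
    if snapshot.exists():
        # Rerun: reuse the cached payload instead of re-hitting the site.
        return json.loads(snapshot.read_text(encoding="utf-8"))
    payload = fetch_topic(slug)
    snapshot.parent.mkdir(parents=True, exist_ok=True)
    snapshot.write_text(json.dumps(payload), encoding="utf-8")
    return payload
```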
After a full scrape, run:

```bash
PYTHONPATH=src .venv/bin/python -m citegeist validate-talkorigins talkorigins-out/talkorigins_manifest.json
PYTHONPATH=src .venv/bin/python -m citegeist duplicates-talkorigins talkorigins-out/talkorigins_manifest.json --limit 20
PYTHONPATH=src .venv/bin/python -m citegeist duplicates-talkorigins talkorigins-out/talkorigins_manifest.json --limit 20 --preview --weak-only
PYTHONPATH=src .venv/bin/python -m citegeist suggest-talkorigins-phrases talkorigins-out/talkorigins_manifest.json --output topic-phrases.json
PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 stage-topic-phrases topic-phrases.json
PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 review-topic-phrase abiogenesis accepted --notes "curated from local corpus"
PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 apply-topic-phrases topic-phrases.json
PYTHONPATH=src .venv/bin/python -m citegeist --db talkorigins.sqlite3 enrich-talkorigins talkorigins-out/talkorigins_manifest.json --limit 20
PYTHONPATH=src .venv/bin/python -m citegeist --db talkorigins-copy.sqlite3 enrich-talkorigins talkorigins-out/talkorigins_manifest.json --limit 5 --apply --allow-unsafe-search-matches
PYTHONPATH=src .venv/bin/python -m citegeist --db talkorigins.sqlite3 review-talkorigins talkorigins-out/talkorigins_manifest.json --output talkorigins-review.json
PYTHONPATH=src .venv/bin/python -m citegeist --db talkorigins.sqlite3 apply-talkorigins-corrections talkorigins-out/talkorigins_manifest.json talkorigins-corrections.json
```

The validation report summarizes parse coverage and flags suspicious entry-type / venue combinations for manual cleanup.
It also reports duplicate clusters across topic seed files so you can gauge how much deduplication pressure to expect before ingestion.
Use `duplicates-talkorigins` when you want to inspect specific clusters, filter by text, restrict the audit to one topic slug, or preview only weak canonicalization outcomes before importing.
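The exact canonicalization key is not documented here, but a duplicate-cluster audit of the kind `duplicates-talkorigins` performs can be sketched with a normalized key over first-author surname, year, and title. Everything below is an illustrative assumption, not the tool's real keying:

```python
import re
from collections import defaultdict

def cluster_key(entry: dict) -> tuple:
    """Illustrative duplicate-cluster key: first-author surname, year, normalized title."""
    title = re.sub(r"[^a-z0-9 ]", "", entry.get("title", "").lower()).strip()
    author = entry.get("author", "").split(" and ")[0].split(",")[0].strip().lower()
    return (author, entry.get("year", ""), title)

def duplicate_clusters(entries: list[dict]) -> dict[tuple, list[dict]]:
    """Group entries by normalized key, keeping only clusters with more than one member."""
    clusters: dict[tuple, list[dict]] = defaultdict(list)
    for entry in entries:
        clusters[cluster_key(entry)].append(entry)
    return {key: members for key, members in clusters.items() if len(members) > 1}
```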
Use `suggest-talkorigins-phrases` to derive candidate stored expansion phrases from the existing TalkOrigins topic corpus itself. The output is deterministic JSON keyed by topic slug, with a suggested phrase plus the extracted keywords that drove it. This is a useful first pass before setting topic phrases in the database or editing generated batch jobs.

Use `stage-topic-phrases` to load those suggestions into the database as review items. Staging stores the candidate in `suggested_phrase` and marks the topic `pending` without changing the active `expansion_phrase`.
Use `export-topic-phrase-reviews` to write an editable JSON template directly from the database for the currently staged suggestions. That gives you a round-trip path from the DB review queue to file edits and back into `review-topic-phrases`.
Use `review-topic-phrase` to accept or reject one staged suggestion in place. Accepting a suggestion copies it into `expansion_phrase` and clears it from the staged review queue; rejecting it preserves the staged suggestion together with its review state, without changing the live phrase.
Use `review-topic-phrases` when you want to apply many accept/reject decisions from one JSON file. Each item should carry `slug`, `status`, and optional `phrase` / `review_notes`.
Use `apply-topic-phrases` when you want a direct patch path instead of the staged review flow. It accepts either the raw suggestion list or an object with a `topics` list, and will apply `suggested_phrase` or `phrase` to matching topic slugs immediately.
Use `topic-phrase-reviews --phrase-review-status pending` when you want a compact audit view of unresolved staged suggestions, including both the current live phrase and the pending replacement.
Use `enrich-talkorigins` when you want to target those weak canonical entries for resolver-based metadata upgrades before retrying graph expansion on imported topic slices.
Use `review-talkorigins` when you want one JSON review artifact that combines weak canonical clusters with dry-run enrichment outcomes for manual cleanup.
Use `expand-topic` when you already have both a topic phrase and a curated topic seed set in the database: it expands outward from the topic’s existing entries, then only assigns discovered works back to that topic if they clear a topic-relevance threshold. Write-enabled assignment is stricter than preview ranking: a candidate must clear the score threshold and show a non-generic title anchor to the topic phrase, so broad methods papers do not get attached just because their abstracts or related terms overlap. On large noisy topics, prefer `--seed-key` to restrict the run to just the trusted seed entries you want to expand from, and use `--preview` first to inspect discovered candidates and relevance scores before writing anything.
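The write-enabled gate described above can be sketched roughly as a two-part check. The threshold value, the stopword list, and taking the relevance score as a precomputed input are all illustrative assumptions:

```python
GENERIC = {"a", "an", "the", "of", "and", "for", "in", "on", "to", "with"}

def passes_write_gate(title: str, topic_phrase: str, score: float, threshold: float = 0.5) -> bool:
    """A candidate must clear the relevance score AND share a non-generic title token with the topic phrase."""
    if score < threshold:
        return False
    title_tokens = {t for t in title.lower().split() if t not in GENERIC}
    topic_tokens = {t for t in topic_phrase.lower().split() if t not in GENERIC}
    # A broad methods paper with no topical anchor in its title is rejected even if it scores well.
    return bool(title_tokens & topic_tokens)
```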
@@ -163,24 +189,6 @@ Use `set-topic-phrase` to store a curated expansion phrase on the topic itself.

Use `topics --phrase-review-status pending` when you want to audit only topics whose staged phrase suggestions still need review.
`--allow-unsafe-search-matches` exists only for bounded experiments on copied databases when you explicitly want to relax trust to exercise downstream expansion behavior.

The TalkOrigins corpus pipeline remains in the repository as an example application rather than a core package surface. Use the example-scoped Python namespace:

```python
from citegeist.examples.talkorigins import TalkOriginsScraper
```

and the example-scoped CLI commands:

```bash
PYTHONPATH=src .venv/bin/python -m citegeist example-talkorigins-scrape talkorigins-out --limit-topics 5 --limit-entries-per-topic 20
PYTHONPATH=src .venv/bin/python -m citegeist example-talkorigins-validate talkorigins-out/talkorigins_manifest.json
PYTHONPATH=src .venv/bin/python -m citegeist example-talkorigins-duplicates talkorigins-out/talkorigins_manifest.json --limit 20 --preview --weak-only
```

The older `scrape-talkorigins`-style command names remain available as compatibility aliases. The full example workflow and reconstruction notes live in [examples/talkorigins/README.md](./examples/talkorigins/README.md).

For a smaller example that starts from a topic phrase alone, see [examples/artificial-life/README.md](./examples/artificial-life/README.md).

Correction files are simple JSON:

```json

@@ -202,6 +210,15 @@ Correction files are simple JSON:

`fields` values overwrite the canonical entry for that duplicate-cluster key. Set a field to `null` to remove it.
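Those override semantics are small enough to sketch directly: each value in `fields` replaces the corresponding field on the canonical entry, and a JSON `null` (Python `None`) deletes it. A minimal sketch, not the actual implementation:

```python
def apply_field_overrides(entry: dict, fields: dict) -> dict:
    """Apply correction-file field overrides: values overwrite, null/None removes the field."""
    patched = dict(entry)
    for key, value in fields.items():
        if value is None:
            patched.pop(key, None)  # null in the JSON means "remove this field"
        else:
            patched[key] = value
    return patched
```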
To import the reconstructed corpus into SQLite while collapsing duplicate works across topics into canonical entries:

```bash
PYTHONPATH=src .venv/bin/python -m citegeist --db talkorigins.sqlite3 ingest-talkorigins talkorigins-out/talkorigins_manifest.json
```

That import preserves many-to-many topic membership through the `topics` and `entry_topics` tables.
After import, use `topics`, `topic-entries`, `search --topic`, and `export-topic` to inspect or export topic slices from the consolidated database.

Live-source workflow:

```bash
@@ -1,100 +0,0 @@

# Artificial Life Topic-Seeding Example

This example shows the smallest useful `citegeist` workflow that starts from a topic phrase alone.

The seed phrase is:

```text
artificial life
```

## What It Demonstrates

- topic-only bootstrap without a seed `.bib`;
- previewing ranked candidate seed entries before writing anything;
- storing a curated topic slug, topic name, and expansion phrase in the database;
- running later topic-aware expansion from that stored phrase.

## Preview First

Use a preview run to inspect the best candidate seed entries without changing the database:

```bash
PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 \
  bootstrap \
  --topic "artificial life" \
  --topic-slug artificial-life \
  --topic-name "Artificial life" \
  --store-topic-phrase "artificial life alife artificial organisms complex systems evolution simulation" \
  --topic-limit 10 \
  --topic-commit-limit 5 \
  --preview
```

That returns ranked candidates gathered through the configured resolver/search stack.

## Commit The Topic Seeds

Once the preview looks reasonable, run the same bootstrap without `--preview`:

```bash
PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 \
  bootstrap \
  --topic "artificial life" \
  --topic-slug artificial-life \
  --topic-name "Artificial life" \
  --store-topic-phrase "artificial life alife artificial organisms complex systems evolution simulation" \
  --topic-limit 10 \
  --topic-commit-limit 5
```

That does three things:

1. finds topic-relevant seed entries;
2. stores them in the bibliography database;
3. creates or updates the `artificial-life` topic row with the curated expansion phrase.

## Inspect The Result

```bash
PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 topics
PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 topic-entries artificial-life
```

If you want to adjust the stored phrase later:

```bash
PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 \
  set-topic-phrase artificial-life "artificial life alife artificial organisms autonomous agents evolution simulation"
```

## Optional Batch Form

The same topic-only seed can be expressed as a batch job:

```json
[
  {
    "name": "artificial-life-topic-seed",
    "topic": "artificial life",
    "topic_slug": "artificial-life",
    "topic_name": "Artificial life",
    "topic_phrase": "artificial life alife artificial organisms complex systems evolution simulation",
    "topic_limit": 10,
    "topic_commit_limit": 5,
    "expand": false
  }
]
```

Run it with:

```bash
PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 bootstrap-batch artificial-life.json
```

## Notes

- This example is intentionally generic and corpus-independent.
- The exact candidate set depends on live source availability and resolver behavior.
- Prefer preview mode before committing topic-only seeds, because topic phrases are noisier than curated seed `.bib` inputs.
@@ -1,52 +0,0 @@

# TalkOrigins Example

This example shows how to use `citegeist` on a large legacy plaintext bibliography corpus.

It is intentionally positioned as an application of the core library, not as the main product surface.

## What It Demonstrates

- scraping a legacy bibliography index;
- normalizing repeated-author plaintext references;
- converting topic pages into per-topic seed BibTeX;
- generating batch bootstrap specs for downstream ingest and expansion;
- reconstructing cleaned plaintext and BibTeX topic pages for review;
- validating parse quality, duplicate clusters, and weak canonical entries;
- curating topic phrases and correction files before broader enrichment.

The example implementation lives under the Python namespace:

```python
from citegeist.examples.talkorigins import TalkOriginsScraper
```

The preferred CLI commands are example-scoped:

```bash
PYTHONPATH=src .venv/bin/python -m citegeist example-talkorigins-scrape talkorigins-out --limit-topics 5 --limit-entries-per-topic 20
PYTHONPATH=src .venv/bin/python -m citegeist example-talkorigins-validate talkorigins-out/talkorigins_manifest.json
PYTHONPATH=src .venv/bin/python -m citegeist example-talkorigins-duplicates talkorigins-out/talkorigins_manifest.json --limit 20 --preview --weak-only
PYTHONPATH=src .venv/bin/python -m citegeist example-talkorigins-suggest-phrases talkorigins-out/talkorigins_manifest.json --output topic-phrases.json
PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 stage-topic-phrases topic-phrases.json
PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 export-topic-phrase-reviews --output topic-phrase-review.json
PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 review-topic-phrases topic-phrase-review.json
PYTHONPATH=src .venv/bin/python -m citegeist --db talkorigins.sqlite3 example-talkorigins-enrich talkorigins-out/talkorigins_manifest.json --limit 20
PYTHONPATH=src .venv/bin/python -m citegeist --db talkorigins.sqlite3 example-talkorigins-review talkorigins-out/talkorigins_manifest.json --output talkorigins-review.json
PYTHONPATH=src .venv/bin/python -m citegeist --db talkorigins.sqlite3 example-talkorigins-apply-corrections talkorigins-out/talkorigins_manifest.json talkorigins-corrections.json
PYTHONPATH=src .venv/bin/python -m citegeist --db talkorigins.sqlite3 example-talkorigins-ingest talkorigins-out/talkorigins_manifest.json
```

## Output Artifacts

The example scrape writes:

- `seeds/*.bib` per-topic seed BibTeX files;
- `plaintext/*.txt` cleaned GSA-style plaintext with repeated authors expanded;
- `site/topics/*.html` reconstructed topic pages with hide/show BibTeX blocks;
- `talkorigins_full.txt` and `talkorigins_full.bib` aggregate downloads;
- `snapshots/*.json` cached topic payloads so reruns can resume.

## Notes

- The example-specific CLI names have compatibility aliases matching the older `scrape-talkorigins` style commands.
- Topic phrase staging, review, and export commands are generic `citegeist` functionality and are not specific to TalkOrigins.
@@ -7,6 +7,18 @@ from .harvest import OaiMetadataFormat, OaiPmhHarvester, OaiSet

from .resolve import MetadataResolver, merge_entries, merge_entries_with_conflicts
from .sources import SourceClient
from .storage import BibliographyStore
from .talkorigins import (
    TalkOriginsBatchExport,
    TalkOriginsDuplicateCluster,
    TalkOriginsEnrichmentResult,
    TalkOriginsIngestReport,
    TalkOriginsReviewExport,
    TalkOriginsScraper,
    TalkOriginsSeedSet,
    TalkOriginsTopicPhraseSuggestion,
    TalkOriginsTopic,
    TalkOriginsValidationReport,
)

__all__ = [
    "BibEntry",

@@ -22,6 +34,16 @@ __all__ = [

    "OaiMetadataFormat",
    "OaiSet",
    "SourceClient",
    "TalkOriginsBatchExport",
    "TalkOriginsDuplicateCluster",
    "TalkOriginsEnrichmentResult",
    "TalkOriginsIngestReport",
    "TalkOriginsReviewExport",
    "TalkOriginsScraper",
    "TalkOriginsSeedSet",
    "TalkOriginsTopicPhraseSuggestion",
    "TalkOriginsTopic",
    "TalkOriginsValidationReport",
    "extract_references",
    "load_batch_jobs",
    "merge_entries",
@@ -9,12 +9,12 @@ from pathlib import Path

from .batch import BatchBootstrapRunner, load_batch_jobs
from .bibtex import parse_bibtex, render_bibtex
from .bootstrap import Bootstrapper
from .examples.talkorigins import TalkOriginsScraper
from .expand import CrossrefExpander, OpenAlexExpander, TopicExpander
from .extract import extract_references
from .harvest import OaiPmhHarvester
from .resolve import MetadataResolver, merge_entries_with_conflicts
from .storage import BibliographyStore
from .talkorigins import TalkOriginsScraper


def build_parser() -> argparse.ArgumentParser:

@@ -205,9 +205,8 @@ def build_parser() -> argparse.ArgumentParser:

    batch_parser.add_argument("input", help="Path to batch JSON file")

    talkorigins_parser = subparsers.add_parser(
        "example-talkorigins-scrape",
        aliases=["scrape-talkorigins"],
        help="Example workflow: scrape TalkOrigins into per-topic seed BibTeX files and a bootstrap-batch JSON file",
        "scrape-talkorigins",
        help="Scrape TalkOrigins into per-topic seed BibTeX files and a bootstrap-batch JSON file",
    )
    talkorigins_parser.add_argument(
        "output_dir",

@@ -258,16 +257,14 @@ def build_parser() -> argparse.ArgumentParser:

    talkorigins_parser.add_argument("--status", default="draft", help="Review status for generated seed jobs")

    validate_talkorigins_parser = subparsers.add_parser(
        "example-talkorigins-validate",
        aliases=["validate-talkorigins"],
        help="Example workflow: validate a generated TalkOrigins manifest and report parse coverage and suspicious entries",
        "validate-talkorigins",
        help="Validate a generated TalkOrigins manifest and report parse coverage and suspicious entries",
    )
    validate_talkorigins_parser.add_argument("manifest", help="Path to talkorigins_manifest.json")

    suggest_talkorigins_parser = subparsers.add_parser(
        "example-talkorigins-suggest-phrases",
        aliases=["suggest-talkorigins-phrases"],
        help="Example workflow: suggest stored topic expansion phrases from a TalkOrigins manifest",
        "suggest-talkorigins-phrases",
        help="Suggest stored topic expansion phrases from a TalkOrigins manifest",
    )
    suggest_talkorigins_parser.add_argument("manifest", help="Path to talkorigins_manifest.json")
    suggest_talkorigins_parser.add_argument("--topic", help="Optional topic slug to restrict suggestions")
@@ -301,16 +298,9 @@ def build_parser() -> argparse.ArgumentParser:

        help="Optional expansion phrase override to apply with the review decision",
    )

    review_topic_phrases_parser = subparsers.add_parser(
        "review-topic-phrases",
        help="Apply topic phrase review decisions in bulk from JSON",
    )
    review_topic_phrases_parser.add_argument("input", help="Path to JSON file containing topic phrase review records")

    duplicates_talkorigins_parser = subparsers.add_parser(
        "example-talkorigins-duplicates",
        aliases=["duplicates-talkorigins"],
        help="Example workflow: inspect duplicate clusters in a generated TalkOrigins manifest",
        "duplicates-talkorigins",
        help="Inspect duplicate clusters in a generated TalkOrigins manifest",
    )
    duplicates_talkorigins_parser.add_argument("manifest", help="Path to talkorigins_manifest.json")
    duplicates_talkorigins_parser.add_argument("--limit", type=int, default=20, help="Maximum clusters to show")

@@ -334,9 +324,8 @@ def build_parser() -> argparse.ArgumentParser:

    )

    ingest_talkorigins_parser = subparsers.add_parser(
        "example-talkorigins-ingest",
        aliases=["ingest-talkorigins"],
        help="Example workflow: ingest a TalkOrigins manifest into the database with duplicate consolidation and topic membership",
        "ingest-talkorigins",
        help="Ingest a TalkOrigins manifest into the database with duplicate consolidation and topic membership",
    )
    ingest_talkorigins_parser.add_argument("manifest", help="Path to talkorigins_manifest.json")
    ingest_talkorigins_parser.add_argument("--status", default="draft", help="Review status for imported entries")

@@ -347,9 +336,8 @@ def build_parser() -> argparse.ArgumentParser:

    )

    enrich_talkorigins_parser = subparsers.add_parser(
        "example-talkorigins-enrich",
        aliases=["enrich-talkorigins"],
        help="Example workflow: attempt metadata enrichment for weak TalkOrigins canonical entries",
        "enrich-talkorigins",
        help="Attempt metadata enrichment for weak TalkOrigins canonical entries",
    )
    enrich_talkorigins_parser.add_argument("manifest", help="Path to talkorigins_manifest.json")
    enrich_talkorigins_parser.add_argument("--limit", type=int, default=20, help="Maximum weak clusters to inspect")

@@ -378,9 +366,8 @@ def build_parser() -> argparse.ArgumentParser:

    )

    review_talkorigins_parser = subparsers.add_parser(
        "example-talkorigins-review",
        aliases=["review-talkorigins"],
        help="Example workflow: export weak TalkOrigins clusters plus dry-run enrichment outcomes for manual review",
        "review-talkorigins",
        help="Export weak TalkOrigins clusters plus dry-run enrichment outcomes for manual review",
    )
    review_talkorigins_parser.add_argument("manifest", help="Path to talkorigins_manifest.json")
    review_talkorigins_parser.add_argument("--limit", type=int, default=20, help="Maximum weak clusters to export")

@@ -395,9 +382,8 @@ def build_parser() -> argparse.ArgumentParser:

    review_talkorigins_parser.add_argument("--output", help="Write review export JSON to a file instead of stdout")

    apply_review_talkorigins_parser = subparsers.add_parser(
        "example-talkorigins-apply-corrections",
        aliases=["apply-talkorigins-corrections"],
        help="Example workflow: apply curated TalkOrigins review corrections to the consolidated database",
        "apply-talkorigins-corrections",
        help="Apply curated TalkOrigins review corrections to the consolidated database",
    )
    apply_review_talkorigins_parser.add_argument("manifest", help="Path to talkorigins_manifest.json")
    apply_review_talkorigins_parser.add_argument("corrections", help="Path to corrections JSON")
@@ -415,33 +401,6 @@ def build_parser() -> argparse.ArgumentParser:

        help="Restrict topics to one stored phrase review state",
    )

    topic_phrase_reviews_parser = subparsers.add_parser(
        "topic-phrase-reviews",
        help="List staged topic phrase suggestions and their review state",
    )
    topic_phrase_reviews_parser.add_argument("--limit", type=int, default=100, help="Maximum reviews to list")
    topic_phrase_reviews_parser.add_argument(
        "--phrase-review-status",
        choices=["unreviewed", "pending", "accepted", "rejected"],
        help="Restrict results to one stored phrase review state",
    )

    export_topic_phrase_reviews_parser = subparsers.add_parser(
        "export-topic-phrase-reviews",
        help="Export an editable JSON review template for staged topic phrase suggestions",
    )
    export_topic_phrase_reviews_parser.add_argument("--limit", type=int, default=100, help="Maximum reviews to export")
    export_topic_phrase_reviews_parser.add_argument(
        "--phrase-review-status",
        choices=["unreviewed", "pending", "accepted", "rejected"],
        default="pending",
        help="Restrict exported reviews to one stored phrase review state",
    )
    export_topic_phrase_reviews_parser.add_argument(
        "--output",
        help="Write the review template JSON to a file instead of stdout",
    )

    topic_entries_parser = subparsers.add_parser(
        "topic-entries",
        help="List entries assigned to one topic",
@@ -538,7 +497,7 @@ def main(argv: list[str] | None = None) -> int:

    )
    if args.command == "bootstrap-batch":
        return _run_bootstrap_batch(store, Path(args.input))
    if args.command in {"example-talkorigins-scrape", "scrape-talkorigins"}:
    if args.command == "scrape-talkorigins":
        return _run_scrape_talkorigins(
            store,
            args.base_url,

@@ -553,9 +512,9 @@ def main(argv: list[str] | None = None) -> int:

            args.topic_commit_limit,
            args.status,
        )
    if args.command in {"example-talkorigins-validate", "validate-talkorigins"}:
    if args.command == "validate-talkorigins":
        return _run_validate_talkorigins(Path(args.manifest))
    if args.command in {"example-talkorigins-suggest-phrases", "suggest-talkorigins-phrases"}:
    if args.command == "suggest-talkorigins-phrases":
        return _run_suggest_talkorigins_phrases(Path(args.manifest), args.topic, args.limit, args.output)
    if args.command == "apply-topic-phrases":
        return _run_apply_topic_phrases(store, Path(args.input))

@@ -563,9 +522,7 @@ def main(argv: list[str] | None = None) -> int:

        return _run_stage_topic_phrases(store, Path(args.input))
    if args.command == "review-topic-phrase":
        return _run_review_topic_phrase(store, args.topic_slug, args.status, args.notes, args.phrase)
    if args.command == "review-topic-phrases":
        return _run_review_topic_phrases(store, Path(args.input))
    if args.command in {"example-talkorigins-duplicates", "duplicates-talkorigins"}:
    if args.command == "duplicates-talkorigins":
        return _run_duplicates_talkorigins(
            Path(args.manifest),
            args.limit,

@@ -575,9 +532,9 @@ def main(argv: list[str] | None = None) -> int:

            args.preview,
            args.weak_only,
        )
    if args.command in {"example-talkorigins-ingest", "ingest-talkorigins"}:
    if args.command == "ingest-talkorigins":
        return _run_ingest_talkorigins(store, Path(args.manifest), args.status, not args.no_dedupe)
    if args.command in {"example-talkorigins-enrich", "enrich-talkorigins"}:
    if args.command == "enrich-talkorigins":
        return _run_enrich_talkorigins(
            store,
            Path(args.manifest),

@@ -589,7 +546,7 @@ def main(argv: list[str] | None = None) -> int:

            args.status,
            args.allow_unsafe_search_matches,
        )
    if args.command in {"example-talkorigins-review", "review-talkorigins"}:
    if args.command == "review-talkorigins":
        return _run_review_talkorigins(
            store,
            Path(args.manifest),

@@ -599,7 +556,7 @@ def main(argv: list[str] | None = None) -> int:

            args.topic,
            args.output,
        )
    if args.command in {"example-talkorigins-apply-corrections", "apply-talkorigins-corrections"}:
    if args.command == "apply-talkorigins-corrections":
        return _run_apply_talkorigins_corrections(
            store,
            Path(args.manifest),

@@ -608,10 +565,6 @@ def main(argv: list[str] | None = None) -> int:

        )
    if args.command == "topics":
        return _run_topics(store, args.limit, args.phrase_review_status)
    if args.command == "topic-phrase-reviews":
        return _run_topic_phrase_reviews(store, args.limit, args.phrase_review_status)
    if args.command == "export-topic-phrase-reviews":
        return _run_export_topic_phrase_reviews(store, args.limit, args.phrase_review_status, args.output)
    if args.command == "topic-entries":
        return _run_topic_entries(store, args.topic_slug, args.limit)
    if args.command == "export-topic":
@ -1103,51 +1056,6 @@ def _run_review_topic_phrase(
|
|||
return 0
|
-
-
-def _run_review_topic_phrases(store: BibliographyStore, input_path: Path) -> int:
-    payload = json.loads(input_path.read_text(encoding="utf-8"))
-    if isinstance(payload, dict):
-        items = payload.get("topics", payload.get("items", []))
-    else:
-        items = payload
-    if not isinstance(items, list):
-        print("Topic phrase review JSON must be a list or an object with a 'topics' or 'items' list", file=sys.stderr)
-        return 1
-
-    results: list[dict[str, object]] = []
-    exit_code = 0
-    for item in items:
-        if not isinstance(item, dict):
-            continue
-        slug = str(item.get("slug") or "")
-        status = str(item.get("status") or item.get("phrase_review_status") or "")
-        notes = item.get("review_notes")
-        phrase = item.get("phrase", item.get("expansion_phrase"))
-        if not slug or status not in {"accepted", "rejected"}:
-            continue
-        if notes is not None:
-            notes = str(notes)
-        if phrase is not None:
-            phrase = str(phrase)
-        reviewed = store.review_topic_phrase_suggestion(
-            slug,
-            review_status=status,
-            review_notes=notes,
-            applied_phrase=phrase,
-        )
-        if not reviewed:
-            exit_code = 1
-        results.append(
-            {
-                "slug": slug,
-                "phrase_review_status": status,
-                "expansion_phrase": phrase,
-                "reviewed": reviewed,
-            }
-        )
-    print(json.dumps(results, indent=2))
-    return exit_code
-
-
 def _run_duplicates_talkorigins(
     manifest_path: Path,
     limit: int,
@@ -1263,39 +1171,6 @@ def _run_topics(store: BibliographyStore, limit: int, phrase_review_status: str
     return 0
 
 
-def _run_topic_phrase_reviews(store: BibliographyStore, limit: int, phrase_review_status: str | None) -> int:
-    print(json.dumps(store.list_topic_phrase_reviews(limit=limit, phrase_review_status=phrase_review_status), indent=2))
-    return 0
-
-
-def _run_export_topic_phrase_reviews(
-    store: BibliographyStore,
-    limit: int,
-    phrase_review_status: str | None,
-    output: str | None,
-) -> int:
-    items = store.list_topic_phrase_reviews(limit=limit, phrase_review_status=phrase_review_status)
-    payload = [
-        {
-            "slug": item["slug"],
-            "topic": item["name"],
-            "current_expansion_phrase": item.get("expansion_phrase"),
-            "suggested_phrase": item.get("suggested_phrase"),
-            "current_status": item.get("phrase_review_status"),
-            "review_notes": item.get("phrase_review_notes"),
-            "status": "",
-            "phrase": item.get("suggested_phrase"),
-        }
-        for item in items
-    ]
-    rendered = json.dumps(payload, indent=2)
-    if output:
-        Path(output).write_text(rendered + "\n", encoding="utf-8")
-    else:
-        print(rendered)
-    return 0
-
-
 def _run_topic_entries(store: BibliographyStore, topic_slug: str, limit: int) -> int:
     topic = store.get_topic(topic_slug)
     if topic is None:
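For context on the deleted `_run_review_topic_phrases` helper above: it accepted either a bare JSON list or an object wrapping a `topics`/`items` list, and silently skipped entries without a slug or without an `accepted`/`rejected` status. A minimal, self-contained sketch of just that acceptance logic — the payload values here are illustrative, not taken from the repository:

```python
import json

# Illustrative payload; a bare list (no "topics" wrapper) would be accepted as well.
payload = json.loads(json.dumps({
    "topics": [
        {"slug": "graph-methods", "status": "accepted", "review_notes": "good phrase"},
        {"slug": "abiogenesis", "status": "rejected", "review_notes": "too sparse"},
        {"slug": "", "status": "accepted"},        # skipped: empty slug
        {"slug": "geology", "status": "pending"},  # skipped: not accepted/rejected
    ]
}))

# Mirror of the helper's normalization: dict payloads unwrap "topics" or "items".
items = payload.get("topics", payload.get("items", [])) if isinstance(payload, dict) else payload
reviewable = [
    item for item in items
    if isinstance(item, dict)
    and item.get("slug")
    and str(item.get("status") or item.get("phrase_review_status") or "") in {"accepted", "rejected"}
]
print([item["slug"] for item in reviewable])  # prints ['graph-methods', 'abiogenesis']
```

The same shape is exercised by the removed bulk-review CLI test further down in this diff.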
@@ -1,29 +0,0 @@
-from .talkorigins import (
-    TalkOriginsBatchExport,
-    TalkOriginsCorrectionResult,
-    TalkOriginsDuplicateCluster,
-    TalkOriginsEnrichmentResult,
-    TalkOriginsIngestReport,
-    TalkOriginsReviewExport,
-    TalkOriginsScraper,
-    TalkOriginsSeedSet,
-    TalkOriginsTopic,
-    TalkOriginsTopicPhraseSuggestion,
-    TalkOriginsValidationReport,
-    normalize_topic_entries,
-)
-
-__all__ = [
-    "TalkOriginsBatchExport",
-    "TalkOriginsCorrectionResult",
-    "TalkOriginsDuplicateCluster",
-    "TalkOriginsEnrichmentResult",
-    "TalkOriginsIngestReport",
-    "TalkOriginsReviewExport",
-    "TalkOriginsScraper",
-    "TalkOriginsSeedSet",
-    "TalkOriginsTopic",
-    "TalkOriginsTopicPhraseSuggestion",
-    "TalkOriginsValidationReport",
-    "normalize_topic_entries",
-]
@@ -1,29 +0,0 @@
-from ..talkorigins import (
-    TalkOriginsBatchExport,
-    TalkOriginsCorrectionResult,
-    TalkOriginsDuplicateCluster,
-    TalkOriginsEnrichmentResult,
-    TalkOriginsIngestReport,
-    TalkOriginsReviewExport,
-    TalkOriginsScraper,
-    TalkOriginsSeedSet,
-    TalkOriginsTopic,
-    TalkOriginsTopicPhraseSuggestion,
-    TalkOriginsValidationReport,
-    normalize_topic_entries,
-)
-
-__all__ = [
-    "TalkOriginsBatchExport",
-    "TalkOriginsCorrectionResult",
-    "TalkOriginsDuplicateCluster",
-    "TalkOriginsEnrichmentResult",
-    "TalkOriginsIngestReport",
-    "TalkOriginsReviewExport",
-    "TalkOriginsScraper",
-    "TalkOriginsSeedSet",
-    "TalkOriginsTopic",
-    "TalkOriginsTopicPhraseSuggestion",
-    "TalkOriginsValidationReport",
-    "normalize_topic_entries",
-]
@@ -603,43 +603,6 @@ class BibliographyStore:
         ).fetchone()
         return dict(row) if row else None
 
-    def list_topic_phrase_reviews(
-        self,
-        limit: int = 100,
-        phrase_review_status: str | None = None,
-    ) -> list[dict[str, object]]:
-        where = "WHERE t.suggested_phrase IS NOT NULL"
-        params: list[object] = []
-        if phrase_review_status is not None:
-            where += " AND t.phrase_review_status = ?"
-            params.append(phrase_review_status)
-        params.append(limit)
-        rows = self.connection.execute(
-            f"""
-            SELECT t.slug, t.name, t.expansion_phrase, t.suggested_phrase,
-                   t.phrase_review_status, t.phrase_review_notes,
-                   COUNT(et.entry_id) AS entry_count
-            FROM topics t
-            LEFT JOIN entry_topics et ON et.topic_id = t.id
-            {where}
-            GROUP BY t.id, t.slug, t.name, t.expansion_phrase, t.suggested_phrase,
-                     t.phrase_review_status, t.phrase_review_notes
-            ORDER BY
-                CASE t.phrase_review_status
-                    WHEN 'pending' THEN 0
-                    WHEN 'unreviewed' THEN 1
-                    WHEN 'rejected' THEN 2
-                    WHEN 'accepted' THEN 3
-                    ELSE 4
-                END,
-                t.name,
-                t.slug
-            LIMIT ?
-            """,
-            params,
-        ).fetchall()
-        return [dict(row) for row in rows]
-
     def set_topic_expansion_phrase(self, slug: str, expansion_phrase: str | None) -> bool:
         row = self.connection.execute(
             """
@@ -688,10 +651,8 @@ class BibliographyStore:
 
         suggested_phrase = topic.get("suggested_phrase")
         expansion_phrase = topic.get("expansion_phrase")
-        stored_suggested_phrase = suggested_phrase
         if review_status == "accepted":
             expansion_phrase = applied_phrase if applied_phrase is not None else suggested_phrase
-            stored_suggested_phrase = None
         elif applied_phrase is not None:
             expansion_phrase = applied_phrase
 
@@ -699,14 +660,13 @@ class BibliographyStore:
             """
             UPDATE topics
             SET expansion_phrase = ?,
-                suggested_phrase = ?,
                 phrase_review_status = ?,
                 phrase_review_notes = ?,
                 updated_at = CURRENT_TIMESTAMP
             WHERE slug = ?
             RETURNING id
             """,
-            (expansion_phrase, stored_suggested_phrase, review_status, review_notes, slug),
+            (expansion_phrase, review_status, review_notes, slug),
         ).fetchone()
         self.connection.commit()
         return row is not None
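For context on the removed `list_topic_phrase_reviews` method above: its ORDER BY ranks statuses pending → unreviewed → rejected → accepted (then name, then slug), and topics without a `suggested_phrase` are filtered out. A self-contained sketch of just that ordering against a simplified one-table schema — the table contents are illustrative, and the real schema has more columns:

```python
import sqlite3

# Simplified topics table; only the columns the ordering logic touches.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE topics (slug TEXT, name TEXT, suggested_phrase TEXT, phrase_review_status TEXT)"
)
conn.executemany(
    "INSERT INTO topics VALUES (?, ?, ?, ?)",
    [
        ("abiogenesis", "Abiogenesis", "abiogenesis life origin", "accepted"),
        ("graph-methods", "Graph Methods", "graph networks biology", "pending"),
        ("plain-topic", "Plain Topic", None, "pending"),  # filtered out: no suggestion
        ("geology", "Geology", "historical geology", "rejected"),
    ],
)
rows = conn.execute(
    """
    SELECT slug FROM topics
    WHERE suggested_phrase IS NOT NULL
    ORDER BY
        CASE phrase_review_status
            WHEN 'pending' THEN 0
            WHEN 'unreviewed' THEN 1
            WHEN 'rejected' THEN 2
            WHEN 'accepted' THEN 3
            ELSE 4
        END,
        name, slug
    """
).fetchall()
print([slug for (slug,) in rows])  # prints ['graph-methods', 'geology', 'abiogenesis']
```

This is why the review queue surfaces pending suggestions first, as the removed store tests later in this diff assert.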
@@ -1,10 +1,3 @@
-"""TalkOrigins example implementation.
-
-This module backs the example-facing namespace at ``citegeist.examples.talkorigins``.
-New code should prefer importing from the examples namespace rather than treating
-TalkOrigins support as part of the core top-level package surface.
-"""
-
 from __future__ import annotations
 
 from collections import Counter
@@ -7,16 +7,6 @@ from pathlib import Path
 from unittest.mock import patch
 
 from citegeist.cli import main
-from citegeist.examples.talkorigins import (
-    TalkOriginsBatchExport,
-    TalkOriginsCorrectionResult,
-    TalkOriginsDuplicateCluster,
-    TalkOriginsEnrichmentResult,
-    TalkOriginsIngestReport,
-    TalkOriginsReviewExport,
-    TalkOriginsTopicPhraseSuggestion,
-    TalkOriginsValidationReport,
-)
 
 
 SAMPLE_BIB = """
@@ -323,7 +313,7 @@ def test_cli_scrape_talkorigins_accepts_output_dir(tmp_path):
 
     database = tmp_path / "library.sqlite3"
     with patch("citegeist.cli.TalkOriginsScraper.scrape_to_directory") as mocked_scrape:
-        mocked_scrape.return_value = TalkOriginsBatchExport(
+        mocked_scrape.return_value = __import__("citegeist").TalkOriginsBatchExport(
             base_url="https://www.talkorigins.org/origins/biblio/",
             output_dir=str(tmp_path),
             topic_count=1,
@@ -336,7 +326,7 @@ def test_cli_scrape_talkorigins_accepts_output_dir(tmp_path):
             [
                 "--db",
                 str(database),
-                "example-talkorigins-scrape",
+                "scrape-talkorigins",
                 str(tmp_path / "talkorigins-out"),
                 "--limit-topics",
                 "3",
@@ -356,7 +346,7 @@ def test_cli_validate_talkorigins_accepts_manifest(tmp_path):
     manifest = tmp_path / "talkorigins_manifest.json"
     manifest.write_text("{}", encoding="utf-8")
     with patch("citegeist.cli.TalkOriginsScraper.validate_export") as mocked_validate:
-        mocked_validate.return_value = TalkOriginsValidationReport(
+        mocked_validate.return_value = __import__("citegeist").TalkOriginsValidationReport(
             manifest_path=str(manifest),
             topic_count=1,
             entry_count=2,
@@ -370,7 +360,7 @@ def test_cli_validate_talkorigins_accepts_manifest(tmp_path):
             duplicate_entry_count=0,
             duplicate_examples=[],
         )
-        exit_code = main(["example-talkorigins-validate", str(manifest)])
+        exit_code = main(["validate-talkorigins", str(manifest)])
 
     assert exit_code == 0
 
@@ -383,7 +373,7 @@ def test_cli_suggest_talkorigins_phrases_writes_output(tmp_path):
     output = tmp_path / "phrases.json"
     with patch("citegeist.cli.TalkOriginsScraper.suggest_topic_phrases") as mocked_suggest:
         mocked_suggest.return_value = [
-            TalkOriginsTopicPhraseSuggestion(
+            __import__("citegeist", fromlist=["TalkOriginsTopicPhraseSuggestion"]).TalkOriginsTopicPhraseSuggestion(
                 slug="abiogenesis",
                 topic="Abiogenesis",
                 entry_count=2,
@@ -395,7 +385,7 @@ def test_cli_suggest_talkorigins_phrases_writes_output(tmp_path):
         ]
         exit_code = main(
             [
-                "example-talkorigins-suggest-phrases",
+                "suggest-talkorigins-phrases",
                 str(manifest),
                 "--topic",
                 "abiogenesis",
@@ -416,7 +406,7 @@ def test_cli_duplicates_talkorigins_accepts_manifest(tmp_path):
     manifest.write_text("{}", encoding="utf-8")
     with patch("citegeist.cli.TalkOriginsScraper.inspect_duplicate_clusters") as mocked_duplicates:
         mocked_duplicates.return_value = [
-            TalkOriginsDuplicateCluster(
+            __import__("citegeist.talkorigins", fromlist=["TalkOriginsDuplicateCluster"]).TalkOriginsDuplicateCluster(
                 key="smith|1999|duplicate paper",
                 count=2,
                 items=[
@@ -441,7 +431,7 @@ def test_cli_duplicates_talkorigins_accepts_manifest(tmp_path):
         ]
         exit_code = main(
             [
-                "example-talkorigins-duplicates",
+                "duplicates-talkorigins",
                 str(manifest),
                 "--topic",
                 "abiogenesis",
@@ -462,7 +452,7 @@ def test_cli_ingest_talkorigins_accepts_manifest(tmp_path):
     manifest = tmp_path / "talkorigins_manifest.json"
     manifest.write_text("{}", encoding="utf-8")
     with patch("citegeist.cli.TalkOriginsScraper.ingest_export") as mocked_ingest:
-        mocked_ingest.return_value = TalkOriginsIngestReport(
+        mocked_ingest.return_value = __import__("citegeist").TalkOriginsIngestReport(
             manifest_path=str(manifest),
             topic_count=1,
             raw_entry_count=2,
@@ -471,7 +461,7 @@ def test_cli_ingest_talkorigins_accepts_manifest(tmp_path):
             duplicate_entry_count=2,
             canonicalized_count=1,
         )
-        exit_code = main(["--db", str(database), "example-talkorigins-ingest", str(manifest)])
+        exit_code = main(["--db", str(database), "ingest-talkorigins", str(manifest)])
 
     assert exit_code == 0
 
@@ -484,7 +474,7 @@ def test_cli_enrich_talkorigins_accepts_manifest(tmp_path):
     manifest.write_text("{}", encoding="utf-8")
     with patch("citegeist.cli.TalkOriginsScraper.enrich_weak_canonicals") as mocked_enrich:
         mocked_enrich.return_value = [
-            TalkOriginsEnrichmentResult(
+            __import__("citegeist.talkorigins", fromlist=["TalkOriginsEnrichmentResult"]).TalkOriginsEnrichmentResult(
                 key="smith|1999|duplicate paper",
                 citation_key="dup1",
                 weak_reasons_before=["missing:doi"],
@@ -500,7 +490,7 @@ def test_cli_enrich_talkorigins_accepts_manifest(tmp_path):
             [
                 "--db",
                 str(database),
-                "example-talkorigins-enrich",
+                "enrich-talkorigins",
                 str(manifest),
                 "--limit",
                 "5",
@@ -520,7 +510,7 @@ def test_cli_review_talkorigins_writes_output(tmp_path):
     manifest.write_text("{}", encoding="utf-8")
     output = tmp_path / "review.json"
     with patch("citegeist.cli.TalkOriginsScraper.build_review_export") as mocked_review:
-        mocked_review.return_value = TalkOriginsReviewExport(
+        mocked_review.return_value = __import__("citegeist.talkorigins", fromlist=["TalkOriginsReviewExport"]).TalkOriginsReviewExport(
             manifest_path=str(manifest),
             item_count=1,
             items=[{"key": "smith|1999|duplicate paper", "canonical": {}, "enrichment": {}}],
@@ -529,7 +519,7 @@ def test_cli_review_talkorigins_writes_output(tmp_path):
             [
                 "--db",
                 str(database),
-                "example-talkorigins-review",
+                "review-talkorigins",
                 str(manifest),
                 "--output",
                 str(output),
@@ -550,7 +540,7 @@ def test_cli_apply_talkorigins_corrections_accepts_files(tmp_path):
     corrections.write_text('{"corrections": []}', encoding="utf-8")
     with patch("citegeist.cli.TalkOriginsScraper.apply_review_corrections") as mocked_apply:
         mocked_apply.return_value = [
-            TalkOriginsCorrectionResult(
+            __import__("citegeist.talkorigins", fromlist=["TalkOriginsCorrectionResult"]).TalkOriginsCorrectionResult(
                 key="smith|1999|duplicate paper",
                 citation_key="dup1",
                 applied=True,
@@ -561,7 +551,7 @@ def test_cli_apply_talkorigins_corrections_accepts_files(tmp_path):
             [
                 "--db",
                 str(database),
-                "example-talkorigins-apply-corrections",
+                "apply-talkorigins-corrections",
                 str(manifest),
                 str(corrections),
             ]
@@ -807,7 +797,7 @@ def test_cli_can_review_topic_phrase(tmp_path: Path):
     )
     assert result.returncode == 0
     payload = json.loads(result.stdout)
-    assert payload["suggested_phrase"] is None
+    assert payload["suggested_phrase"] == "graph networks biology"
     assert payload["expansion_phrase"] == "graph networks biology"
     assert payload["phrase_review_status"] == "accepted"
     assert payload["phrase_review_notes"] == "curated and approved"
@@ -854,172 +844,6 @@ def test_cli_topics_can_filter_by_phrase_review_status(tmp_path: Path):
     assert [topic["slug"] for topic in payload] == ["graph-methods"]
 
 
-def test_cli_can_list_topic_phrase_reviews(tmp_path: Path):
-    bib_path = tmp_path / "input.bib"
-    bib_path.write_text(
-        """
-@article{seed2024,
-    author = {Seed, Alice},
-    title = {Seed Paper},
-    year = {2024}
-}
-""",
-        encoding="utf-8",
-    )
-    ingest = run_cli(tmp_path, "ingest", str(bib_path))
-    assert ingest.returncode == 0
-
-    from citegeist.storage import BibliographyStore
-
-    database = tmp_path / "library.sqlite3"
-    store = BibliographyStore(database)
-    try:
-        store.add_entry_topic(
-            "seed2024",
-            topic_slug="graph-methods",
-            topic_name="Graph Methods",
-            source_type="talkorigins",
-            source_url="https://example.org/topics/graph-methods",
-            source_label="topic-seed",
-        )
-        store.ensure_topic("abiogenesis", "Abiogenesis")
-        store.stage_topic_phrase_suggestion("graph-methods", "graph networks biology")
-        store.stage_topic_phrase_suggestion("abiogenesis", "abiogenesis life origin")
-        store.review_topic_phrase_suggestion("abiogenesis", "accepted")
-    finally:
-        store.close()
-
-    result = run_cli(tmp_path, "topic-phrase-reviews", "--phrase-review-status", "pending")
-    assert result.returncode == 0
-    payload = json.loads(result.stdout)
-    assert [review["slug"] for review in payload] == ["graph-methods"]
-    assert payload[0]["suggested_phrase"] == "graph networks biology"
-    assert payload[0]["phrase_review_status"] == "pending"
-
-
-def test_cli_can_review_topic_phrases_in_bulk(tmp_path: Path):
-    bib_path = tmp_path / "input.bib"
-    bib_path.write_text(
-        """
-@article{seed2024,
-    author = {Seed, Alice},
-    title = {Seed Paper},
-    year = {2024}
-}
-""",
-        encoding="utf-8",
-    )
-    ingest = run_cli(tmp_path, "ingest", str(bib_path))
-    assert ingest.returncode == 0
-
-    from citegeist.storage import BibliographyStore
-
-    database = tmp_path / "library.sqlite3"
-    store = BibliographyStore(database)
-    try:
-        store.add_entry_topic(
-            "seed2024",
-            topic_slug="graph-methods",
-            topic_name="Graph Methods",
-            source_type="talkorigins",
-            source_url="https://example.org/topics/graph-methods",
-            source_label="topic-seed",
-        )
-        store.ensure_topic("abiogenesis", "Abiogenesis")
-        store.stage_topic_phrase_suggestion("graph-methods", "graph networks biology")
-        store.stage_topic_phrase_suggestion("abiogenesis", "abiogenesis life origin")
-    finally:
-        store.close()
-
-    review_path = tmp_path / "phrase-review.json"
-    review_path.write_text(
-        json.dumps(
-            [
-                {
-                    "slug": "graph-methods",
-                    "status": "accepted",
-                    "review_notes": "good phrase",
-                },
-                {
-                    "slug": "abiogenesis",
-                    "status": "rejected",
-                    "review_notes": "too sparse",
-                },
-            ]
-        ),
-        encoding="utf-8",
-    )
-
-    result = run_cli(tmp_path, "review-topic-phrases", str(review_path))
-    assert result.returncode == 0
-    payload = json.loads(result.stdout)
-    assert payload[0]["reviewed"] is True
-    assert payload[1]["reviewed"] is True
-
-    pending_result = run_cli(tmp_path, "topic-phrase-reviews", "--phrase-review-status", "pending")
-    assert pending_result.returncode == 0
-    assert json.loads(pending_result.stdout) == []
-
-    rejected_result = run_cli(tmp_path, "topic-phrase-reviews", "--phrase-review-status", "rejected")
-    assert rejected_result.returncode == 0
-    rejected_payload = json.loads(rejected_result.stdout)
-    assert [review["slug"] for review in rejected_payload] == ["abiogenesis"]
-
-    topics_result = run_cli(tmp_path, "topics", "--phrase-review-status", "accepted")
-    assert topics_result.returncode == 0
-    topics_payload = json.loads(topics_result.stdout)
-    assert [topic["slug"] for topic in topics_payload] == ["graph-methods"]
-
-
-def test_cli_can_export_topic_phrase_review_template(tmp_path: Path):
-    bib_path = tmp_path / "input.bib"
-    bib_path.write_text(
-        """
-@article{seed2024,
-    author = {Seed, Alice},
-    title = {Seed Paper},
-    year = {2024}
-}
-""",
-        encoding="utf-8",
-    )
-    ingest = run_cli(tmp_path, "ingest", str(bib_path))
-    assert ingest.returncode == 0
-
-    from citegeist.storage import BibliographyStore
-
-    database = tmp_path / "library.sqlite3"
-    store = BibliographyStore(database)
-    try:
-        store.add_entry_topic(
-            "seed2024",
-            topic_slug="graph-methods",
-            topic_name="Graph Methods",
-            source_type="talkorigins",
-            source_url="https://example.org/topics/graph-methods",
-            source_label="topic-seed",
-        )
-        store.stage_topic_phrase_suggestion("graph-methods", "graph networks biology")
-    finally:
-        store.close()
-
-    output_path = tmp_path / "topic-phrase-review.json"
-    result = run_cli(
-        tmp_path,
-        "export-topic-phrase-reviews",
-        "--output",
-        str(output_path),
-    )
-    assert result.returncode == 0
-    payload = json.loads(output_path.read_text(encoding="utf-8"))
-    assert [item["slug"] for item in payload] == ["graph-methods"]
-    assert payload[0]["current_expansion_phrase"] is None
-    assert payload[0]["suggested_phrase"] == "graph networks biology"
-    assert payload[0]["current_status"] == "pending"
-    assert payload[0]["status"] == ""
-    assert payload[0]["phrase"] == "graph networks biology"
-
-
 def test_cli_export_topic(tmp_path: Path):
     bib_path = tmp_path / "input.bib"
     bib_path.write_text(
@@ -307,7 +307,7 @@ def test_store_can_stage_and_review_topic_phrase_suggestion():
 
     reviewed = store.get_topic("graph-methods")
     assert reviewed is not None
-    assert reviewed["suggested_phrase"] is None
+    assert reviewed["suggested_phrase"] == "graph networks biology"
     assert reviewed["expansion_phrase"] == "graph networks biology"
     assert reviewed["phrase_review_status"] == "accepted"
     assert reviewed["phrase_review_notes"] == "looks good"
@@ -333,52 +333,6 @@ def test_store_can_filter_topics_by_phrase_review_status():
         store.close()
 
 
-def test_store_can_list_topic_phrase_reviews():
-    store = BibliographyStore()
-    try:
-        store.ensure_topic("graph-methods", "Graph Methods")
-        store.ensure_topic("abiogenesis", "Abiogenesis")
-        store.ensure_topic("plain-topic", "Plain Topic")
-        store.stage_topic_phrase_suggestion("graph-methods", "graph networks biology")
-        store.stage_topic_phrase_suggestion("abiogenesis", "abiogenesis life origin")
-        store.review_topic_phrase_suggestion("abiogenesis", "accepted")
-
-        reviews = store.list_topic_phrase_reviews()
-        pending_reviews = store.list_topic_phrase_reviews(phrase_review_status="pending")
-
-        assert [review["slug"] for review in reviews] == ["graph-methods"]
-        assert reviews[0]["suggested_phrase"] == "graph networks biology"
-        assert reviews[0]["phrase_review_status"] == "pending"
-        assert [review["slug"] for review in pending_reviews] == ["graph-methods"]
-    finally:
-        store.close()
-
-
-def test_store_rejected_topic_phrase_stays_in_review_queue():
-    store = BibliographyStore()
-    try:
-        store.ensure_topic("graph-methods", "Graph Methods")
-        store.stage_topic_phrase_suggestion("graph-methods", "graph networks biology")
-
-        assert store.review_topic_phrase_suggestion(
-            "graph-methods",
-            "rejected",
-            review_notes="too broad",
-        ) is True
-
-        topic = store.get_topic("graph-methods")
-        assert topic is not None
-        assert topic["suggested_phrase"] == "graph networks biology"
-        assert topic["expansion_phrase"] is None
-        assert topic["phrase_review_status"] == "rejected"
-
-        reviews = store.list_topic_phrase_reviews()
-        assert [review["slug"] for review in reviews] == ["graph-methods"]
-        assert reviews[0]["phrase_review_status"] == "rejected"
-    finally:
-        store.close()
-
-
 def test_store_search_text_can_filter_by_topic():
     store = BibliographyStore()
     try:
@@ -5,8 +5,8 @@ from pathlib import Path
 from citegeist.batch import load_batch_jobs
 from citegeist.bibtex import BibEntry
-from citegeist.examples.talkorigins import TalkOriginsScraper, normalize_topic_entries
 from citegeist.storage import BibliographyStore
+from citegeist.talkorigins import TalkOriginsScraper, normalize_topic_entries
 
 
 INDEX_HTML = """