TalkOrigins Example
This example shows how to use citegeist on a large legacy plaintext bibliography corpus.
It is intentionally positioned as an application of the core library, not as the main product surface.
What It Demonstrates
- scraping a legacy bibliography index;
- normalizing repeated-author plaintext references;
- converting topic pages into per-topic seed BibTeX;
- generating batch bootstrap specs for downstream ingest and expansion;
- reconstructing cleaned plaintext and BibTeX topic pages for review;
- validating parse quality, duplicate clusters, and weak canonical entries;
- curating topic phrases and correction files before broader enrichment.
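
To illustrate the repeated-author normalization step listed above, here is a minimal sketch that expands entries whose author field has been replaced by a run of dashes or underscores, a common convention in plaintext bibliographies. The function name `expand_repeated_authors`, the marker pattern, and the rule that the author field ends at the first four-digit year are assumptions made for this example. They are not taken from citegeist, whose parser may work differently.

```python
import re

# Assumption: a repeat marker is a leading run of 3+ hyphens, underscores,
# en dashes, or em dashes standing in for the previous entry's authors.
REPEAT_MARKER = re.compile(r"^\s*[-_\u2013\u2014]{3,}")
# Assumption: the author field is everything before the first 4-digit year.
YEAR = re.compile(r"\b(?:1[5-9]\d{2}|20\d{2})\b")


def expand_repeated_authors(lines):
    """Return a copy of `lines` with repeat markers replaced by real authors."""
    last_authors = None
    expanded = []
    for line in lines:
        marker = REPEAT_MARKER.match(line)
        if marker:
            # Substitute the remembered authors; leave the line alone if
            # no earlier entry supplied any.
            if last_authors is not None:
                line = last_authors + line[marker.end():]
        else:
            year = YEAR.search(line)
            if year:
                # Remember the text before the year, minus trailing separators.
                last_authors = line[:year.start()].rstrip(" ,(")
        expanded.append(line)
    return expanded


# Placeholder entries, not real references.
sample = [
    "Doe, J., 1985, First title.",
    "\u2014\u2014\u2014, 1990, Second title.",
]
for entry in expand_repeated_authors(sample):
    print(entry)
```

A real corpus will have entries without a year or with a different marker convention, so a production parser would need more cases than this sketch covers.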
The example implementation lives under the Python namespace:
from citegeist.examples.talkorigins import TalkOriginsScraper
The preferred CLI commands are example-scoped:
PYTHONPATH=src .venv/bin/python -m citegeist example-talkorigins-scrape talkorigins-out --limit-topics 5 --limit-entries-per-topic 20
PYTHONPATH=src .venv/bin/python -m citegeist example-talkorigins-validate talkorigins-out/talkorigins_manifest.json
PYTHONPATH=src .venv/bin/python -m citegeist example-talkorigins-duplicates talkorigins-out/talkorigins_manifest.json --limit 20 --preview --weak-only
PYTHONPATH=src .venv/bin/python -m citegeist example-talkorigins-suggest-phrases talkorigins-out/talkorigins_manifest.json --output topic-phrases.json
PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 stage-topic-phrases topic-phrases.json
PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 export-topic-phrase-reviews --output topic-phrase-review.json
PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 review-topic-phrases topic-phrase-review.json
PYTHONPATH=src .venv/bin/python -m citegeist --db talkorigins.sqlite3 example-talkorigins-enrich talkorigins-out/talkorigins_manifest.json --limit 20
PYTHONPATH=src .venv/bin/python -m citegeist --db talkorigins.sqlite3 example-talkorigins-review talkorigins-out/talkorigins_manifest.json --output talkorigins-review.json
PYTHONPATH=src .venv/bin/python -m citegeist --db talkorigins.sqlite3 example-talkorigins-apply-corrections talkorigins-out/talkorigins_manifest.json talkorigins-corrections.json
PYTHONPATH=src .venv/bin/python -m citegeist --db talkorigins.sqlite3 example-talkorigins-ingest talkorigins-out/talkorigins_manifest.json
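
When several of these commands are run in sequence, a small driver script can keep the argument order consistent. The sketch below only assembles argument lists from the subcommands and flags shown above. The helper names `citegeist_cmd` and `run_all` are hypothetical, and it defaults to the current interpreter rather than `.venv/bin/python`.

```python
import os
import subprocess
import sys


def citegeist_cmd(subcommand, *args, db=None, python=sys.executable):
    """Build an argv list for: python -m citegeist [--db DB] SUBCOMMAND ARGS..."""
    argv = [python, "-m", "citegeist"]
    if db is not None:
        # --db goes before the subcommand, matching the commands above.
        argv += ["--db", db]
    return argv + [subcommand, *map(str, args)]


def run_all(steps, pythonpath="src"):
    """Run each step in order, stopping at the first non-zero exit code."""
    env = {**os.environ, "PYTHONPATH": pythonpath}
    for argv in steps:
        subprocess.run(argv, check=True, env=env)


manifest = "talkorigins-out/talkorigins_manifest.json"
steps = [
    citegeist_cmd("example-talkorigins-scrape", "talkorigins-out",
                  "--limit-topics", 5, "--limit-entries-per-topic", 20),
    citegeist_cmd("example-talkorigins-validate", manifest),
    citegeist_cmd("example-talkorigins-ingest", manifest,
                  db="talkorigins.sqlite3"),
]
```

Calling `run_all(steps)` from the repository root would then execute the scrape, validate, and ingest steps one after another.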
Output Artifacts
The example scrape writes:
- `seeds/*.bib`: per-topic seed BibTeX files;
- `plaintext/*.txt`: cleaned GSA-style plaintext with repeated authors expanded;
- `site/topics/*.html`: reconstructed topic pages with hide/show BibTeX blocks;
- `talkorigins_full.txt` and `talkorigins_full.bib`: aggregate downloads;
- `snapshots/*.json`: cached topic payloads so reruns can resume.
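
A quick way to sanity-check a scrape is to count the files matching each pattern in the artifact list above. The following sketch uses only the standard library. The function name `summarize_output` and the category labels are hypothetical, and it assumes the two aggregate files sit at the top level of the output directory.

```python
from pathlib import Path

# Glob patterns copied from the artifact list; the labels are illustrative.
ARTIFACT_PATTERNS = {
    "seed BibTeX": "seeds/*.bib",
    "cleaned plaintext": "plaintext/*.txt",
    "topic pages": "site/topics/*.html",
    "snapshots": "snapshots/*.json",
}
# Assumption: aggregate downloads are written directly under the output directory.
AGGREGATES = ("talkorigins_full.txt", "talkorigins_full.bib")


def summarize_output(out_dir):
    """Return (file counts per artifact kind, presence of each aggregate file)."""
    root = Path(out_dir)
    counts = {
        label: len(list(root.glob(pattern)))
        for label, pattern in ARTIFACT_PATTERNS.items()
    }
    aggregates = {name: (root / name).is_file() for name in AGGREGATES}
    return counts, aggregates
```

For example, `summarize_output("talkorigins-out")` returns the per-kind counts together with a mapping showing whether each aggregate file exists.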
Notes
- The example-specific CLI names have compatibility aliases matching the older `scrape-talkorigins`-style commands.
- Topic phrase staging, review, and export commands are generic `citegeist` functionality and are not specific to TalkOrigins.