# TalkOrigins Example

This example shows how to use `citegeist` on a large legacy plaintext bibliography corpus.

It is intentionally positioned as an application of the core library, not as the main product surface.

## What It Demonstrates

- scraping a legacy bibliography index;
- normalizing repeated-author plaintext references;
- converting topic pages into per-topic seed BibTeX;
- generating batch bootstrap specs for downstream ingest and expansion;
- reconstructing cleaned plaintext and BibTeX topic pages for review;
- validating parse quality, duplicate clusters, and weak canonical entries;
- curating topic phrases and correction files before broader enrichment.

The example implementation lives under the Python namespace:

```python
from citegeist.examples.talkorigins import TalkOriginsScraper
```

The preferred CLI commands are example-scoped:

```bash
PYTHONPATH=src .venv/bin/python -m citegeist example-talkorigins-scrape talkorigins-out --limit-topics 5 --limit-entries-per-topic 20
PYTHONPATH=src .venv/bin/python -m citegeist example-talkorigins-validate talkorigins-out/talkorigins_manifest.json
PYTHONPATH=src .venv/bin/python -m citegeist example-talkorigins-duplicates talkorigins-out/talkorigins_manifest.json --limit 20 --preview --weak-only
PYTHONPATH=src .venv/bin/python -m citegeist example-talkorigins-suggest-phrases talkorigins-out/talkorigins_manifest.json --output topic-phrases.json
PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 stage-topic-phrases topic-phrases.json
PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 export-topic-phrase-reviews --output topic-phrase-review.json
PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 review-topic-phrases topic-phrase-review.json
PYTHONPATH=src .venv/bin/python -m citegeist --db talkorigins.sqlite3 example-talkorigins-enrich talkorigins-out/talkorigins_manifest.json --limit 20
PYTHONPATH=src .venv/bin/python -m citegeist --db talkorigins.sqlite3 example-talkorigins-review talkorigins-out/talkorigins_manifest.json --output talkorigins-review.json
PYTHONPATH=src .venv/bin/python -m citegeist --db talkorigins.sqlite3 example-talkorigins-apply-corrections talkorigins-out/talkorigins_manifest.json talkorigins-corrections.json
PYTHONPATH=src .venv/bin/python -m citegeist --db talkorigins.sqlite3 example-talkorigins-ingest talkorigins-out/talkorigins_manifest.json
```

## Output Artifacts

The example scrape writes:

- `seeds/*.bib` per-topic seed BibTeX files;
- `plaintext/*.txt` cleaned GSA-style plaintext with repeated authors expanded;
- `site/topics/*.html` reconstructed topic pages with hide/show BibTeX blocks;
- `talkorigins_full.txt` and `talkorigins_full.bib` aggregate downloads;
- `snapshots/*.json` cached topic payloads so reruns can resume.

## Notes

- The example-specific CLI names have compatibility aliases matching the older `scrape-talkorigins` style commands.
- Topic phrase staging, review, and export commands are generic `citegeist` functionality and are not specific to TalkOrigins.
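
For readers unfamiliar with repeated-author normalization, the following is a minimal standalone sketch of the idea, not the implementation used by `citegeist`. It assumes that a repeated author is marked by a leading run of three or more dashes or underscores, and that the author block ends just before the first four-digit year. The function name `expand_repeated_authors` and both regular expressions are hypothetical and chosen only for illustration.

```python
import re

# Assumption: repeated authors are marked by a leading run of 3+ hyphens,
# underscores, or em-dashes (U+2014), optionally followed by a comma.
REPEAT_MARKER = re.compile(r"^\s*[-_\u2014]{3,}\s*,?\s*")

# Assumption: the author block ends right before the first 4-digit year.
YEAR = re.compile(r",?\s*\(?\b(1[5-9]|20)\d{2}[a-z]?\b")


def expand_repeated_authors(lines):
    """Replace repeated-author markers with the most recent explicit author block."""
    expanded = []
    last_authors = None
    for line in lines:
        marker = REPEAT_MARKER.match(line)
        if marker:
            # Substitute the remembered authors; leave the line alone if none seen yet.
            if last_authors is not None:
                line = f"{last_authors}, {line[marker.end():]}"
            expanded.append(line)
            continue
        year = YEAR.search(line)
        if year:
            # Remember everything before the year as the current author block.
            last_authors = line[:year.start()].strip()
        expanded.append(line)
    return expanded


if __name__ == "__main__":
    sample = [
        "Gould, S.J., 1977, Ontogeny and Phylogeny: Harvard University Press.",
        "\u2014\u2014\u2014, 1989, Wonderful Life: W.W. Norton.",
    ]
    for entry in expand_repeated_authors(sample):
        print(entry)
```

A real corpus needs more care than this sketch provides, for example entries without a year or markers that stand in for only the first author of a multi-author reference.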