Reframe TalkOrigins as an example workflow
parent dc53d16af5
commit c76707e45e

README.md: 67
@@ -56,11 +56,12 @@ The initial repo includes:
 - OAI-PMH repository discovery via `Identify`, `ListSets`, and `ListMetadataFormats` to target harvests more precisely;
 - bibliography bootstrap workflows that can start from a seed `.bib`, a topic phrase, or both;
 - batch bootstrap orchestration from JSON job files containing seed BibTeX paths, topic phrases, or both;
-- a TalkOrigins scraper that fixes repeated-author plaintext references, emits per-topic seed BibTeX files, and writes a batch JSON specification;
 - normalized tables for entries, creators, identifiers, and citation relations;
 - full-text-search-ready indexing over title, abstract, and fulltext when SQLite FTS5 is available;
 - tests covering parsing, ingestion, relation storage, and search.

+Example applications live alongside the core package rather than defining it. The current example corpus pipeline is the TalkOrigins bibliography workflow under [`citegeist.examples.talkorigins`](./src/citegeist/examples/talkorigins.py) with a usage guide in [examples/talkorigins/README.md](./examples/talkorigins/README.md).
+
 The prioritized execution plan lives in [ROADMAP.md](./ROADMAP.md).

 ## Layout
@@ -69,6 +70,7 @@ The prioritized execution plan lives in [ROADMAP.md](./ROADMAP.md).
 citegeist/
 src/citegeist/
 bibtex.py
+examples/
 storage.py
 tests/
 test_storage.py
@@ -125,7 +127,6 @@ PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 resolve-confli
 PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 apply-conflict smith2024graphs title
 PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 bootstrap --seed-bib seed.bib --topic "bayesian nonparametrics"
 PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 bootstrap --topic "bayesian nonparametrics" --preview --topic-commit-limit 5
-PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 scrape-talkorigins talkorigins-out --limit-topics 5 --limit-entries-per-topic 20
 PYTHONPATH=src .venv/bin/python -m citegeist extract references.txt --output draft.bib
 PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 resolve smith2024graphs
 PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 topics
@@ -143,42 +144,7 @@ PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 export --outpu

 For live-source development, prefer fixture-backed or cache-backed source clients so resolver and expansion work can be exercised repeatedly without re-hitting upstream APIs on every run.

-For large legacy plaintext corpora such as the TalkOrigins bibliography, prefer a two-step workflow:
-
-1. `scrape-talkorigins` to generate cleaned per-topic `seed_bib` files plus a `talkorigins_jobs.json` batch spec.
-2. `bootstrap-batch` on that JSON file when you want to ingest, resolve, and expand from the generated seeds.
-
-The TalkOrigins scrape output now includes:
-
-- `seeds/*.bib` per-topic seed BibTeX files for `bootstrap-batch`
-- `plaintext/*.txt` per-topic cleaned GSA-style plaintext with repeated authors expanded
-- `site/topics/*.html` reconstructed topic pages with hide/show BibTeX blocks
-- `talkorigins_full.txt` and `talkorigins_full.bib` aggregate downloads
-- `snapshots/*.json` cached topic payloads so reruns can resume without re-fetching already scraped topics
-
-After a full scrape, run:
-
-```bash
-PYTHONPATH=src .venv/bin/python -m citegeist validate-talkorigins talkorigins-out/talkorigins_manifest.json
-PYTHONPATH=src .venv/bin/python -m citegeist duplicates-talkorigins talkorigins-out/talkorigins_manifest.json --limit 20
-PYTHONPATH=src .venv/bin/python -m citegeist duplicates-talkorigins talkorigins-out/talkorigins_manifest.json --limit 20 --preview --weak-only
-PYTHONPATH=src .venv/bin/python -m citegeist suggest-talkorigins-phrases talkorigins-out/talkorigins_manifest.json --output topic-phrases.json
-PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 stage-topic-phrases topic-phrases.json
-PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 export-topic-phrase-reviews --output topic-phrase-review.json
-PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 review-topic-phrase abiogenesis accepted --notes "curated from local corpus"
-PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 review-topic-phrases topic-phrase-review.json
-PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 apply-topic-phrases topic-phrases.json
-PYTHONPATH=src .venv/bin/python -m citegeist --db talkorigins.sqlite3 enrich-talkorigins talkorigins-out/talkorigins_manifest.json --limit 20
-PYTHONPATH=src .venv/bin/python -m citegeist --db talkorigins-copy.sqlite3 enrich-talkorigins talkorigins-out/talkorigins_manifest.json --limit 5 --apply --allow-unsafe-search-matches
-PYTHONPATH=src .venv/bin/python -m citegeist --db talkorigins.sqlite3 review-talkorigins talkorigins-out/talkorigins_manifest.json --output talkorigins-review.json
-PYTHONPATH=src .venv/bin/python -m citegeist --db talkorigins.sqlite3 apply-talkorigins-corrections talkorigins-out/talkorigins_manifest.json talkorigins-corrections.json
-```
-
-That report summarizes parse coverage and flags suspicious entry-type / venue combinations for manual cleanup.
-It also reports duplicate clusters across topic seed files so you can gauge how much deduplication pressure to expect before ingestion.
-Use `duplicates-talkorigins` when you want to inspect specific clusters, filter by text, restrict the audit to one topic slug, or preview only weak canonicalization outcomes before importing.
-
-Use `suggest-talkorigins-phrases` to derive candidate stored expansion phrases from the existing TalkOrigins topic corpus itself. The output is deterministic JSON keyed by topic slug, with a suggested phrase plus the extracted keywords that drove it. This is a useful first pass before setting topic phrases in the database or editing generated batch jobs.
-
+## Example Application
+
 Use `stage-topic-phrases` to load those suggestions into the database as review items. Staging stores the candidate in `suggested_phrase` and marks the topic `pending` without changing the active `expansion_phrase`.
 Use `export-topic-phrase-reviews` to write an editable JSON template directly from the database for the currently staged suggestions. That gives you a round-trip path from DB review queue to file edits and back into `review-topic-phrases`.
@@ -194,6 +160,22 @@ Use `set-topic-phrase` to store a curated expansion phrase on the topic itself.
 Use `topics --phrase-review-status pending` when you want to audit only topics whose staged phrase suggestions still need review.
 `--allow-unsafe-search-matches` exists only for bounded experiments on copied databases when you explicitly want to relax trust to exercise downstream expansion behavior.

+The TalkOrigins corpus pipeline remains in the repository as an example application rather than a core package surface. Use the example-scoped Python namespace:
+
+```python
+from citegeist.examples.talkorigins import TalkOriginsScraper
+```
+
+and the example-scoped CLI commands:
+
+```bash
+PYTHONPATH=src .venv/bin/python -m citegeist example-talkorigins-scrape talkorigins-out --limit-topics 5 --limit-entries-per-topic 20
+PYTHONPATH=src .venv/bin/python -m citegeist example-talkorigins-validate talkorigins-out/talkorigins_manifest.json
+PYTHONPATH=src .venv/bin/python -m citegeist example-talkorigins-duplicates talkorigins-out/talkorigins_manifest.json --limit 20 --preview --weak-only
+```
+
+The older `scrape-talkorigins`-style command names remain available as compatibility aliases. The full example workflow and reconstruction notes live in [examples/talkorigins/README.md](./examples/talkorigins/README.md).
+
 Correction files are simple JSON:

 ```json
@@ -215,15 +197,6 @@ Correction files are simple JSON:

 `fields` values overwrite the canonical entry for that duplicate-cluster key. Set a field to `null` to remove it.

-To import the reconstructed corpus into SQLite while collapsing duplicate works across topics into canonical entries:
-
-```bash
-PYTHONPATH=src .venv/bin/python -m citegeist --db talkorigins.sqlite3 ingest-talkorigins talkorigins-out/talkorigins_manifest.json
-```
-
-That import preserves many-to-many topic membership through the `topics` and `entry_topics` tables.
-After import, use `topics`, `topic-entries`, `search --topic`, and `export-topic` to inspect or export topic slices from the consolidated database.
-
 Live-source workflow:

 ```bash
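For orientation between the two README hunks above: the corrections file itself is only partially visible in this diff. A minimal sketch of writing one from Python follows; the top-level "corrections" list matches the fixture used by the CLI tests later in this commit, while the per-item "key"/"fields" layout and the field names are assumptions inferred from the `fields` description, not a confirmed schema.

```python
import json

# Sketch of a corrections file for example-talkorigins-apply-corrections.
# Assumed shape: each item names a duplicate-cluster "key" and a "fields" mapping;
# per the README text above, field values overwrite the canonical entry and
# null (None here) removes a field. Field names below are illustrative only.
corrections = {
    "corrections": [
        {
            "key": "smith|1999|duplicate paper",
            "fields": {"title": "A corrected title", "note": None},
        }
    ]
}

with open("talkorigins-corrections.json", "w", encoding="utf-8") as handle:
    json.dump(corrections, handle, indent=2)
```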
@@ -0,0 +1,52 @@
+# TalkOrigins Example
+
+This example shows how to use `citegeist` on a large legacy plaintext bibliography corpus.
+
+It is intentionally positioned as an application of the core library, not as the main product surface.
+
+## What It Demonstrates
+
+- scraping a legacy bibliography index;
+- normalizing repeated-author plaintext references;
+- converting topic pages into per-topic seed BibTeX;
+- generating batch bootstrap specs for downstream ingest and expansion;
+- reconstructing cleaned plaintext and BibTeX topic pages for review;
+- validating parse quality, duplicate clusters, and weak canonical entries;
+- curating topic phrases and correction files before broader enrichment.
+
+The example implementation lives under the Python namespace:
+
+```python
+from citegeist.examples.talkorigins import TalkOriginsScraper
+```
+
+The preferred CLI commands are example-scoped:
+
+```bash
+PYTHONPATH=src .venv/bin/python -m citegeist example-talkorigins-scrape talkorigins-out --limit-topics 5 --limit-entries-per-topic 20
+PYTHONPATH=src .venv/bin/python -m citegeist example-talkorigins-validate talkorigins-out/talkorigins_manifest.json
+PYTHONPATH=src .venv/bin/python -m citegeist example-talkorigins-duplicates talkorigins-out/talkorigins_manifest.json --limit 20 --preview --weak-only
+PYTHONPATH=src .venv/bin/python -m citegeist example-talkorigins-suggest-phrases talkorigins-out/talkorigins_manifest.json --output topic-phrases.json
+PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 stage-topic-phrases topic-phrases.json
+PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 export-topic-phrase-reviews --output topic-phrase-review.json
+PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 review-topic-phrases topic-phrase-review.json
+PYTHONPATH=src .venv/bin/python -m citegeist --db talkorigins.sqlite3 example-talkorigins-enrich talkorigins-out/talkorigins_manifest.json --limit 20
+PYTHONPATH=src .venv/bin/python -m citegeist --db talkorigins.sqlite3 example-talkorigins-review talkorigins-out/talkorigins_manifest.json --output talkorigins-review.json
+PYTHONPATH=src .venv/bin/python -m citegeist --db talkorigins.sqlite3 example-talkorigins-apply-corrections talkorigins-out/talkorigins_manifest.json talkorigins-corrections.json
+PYTHONPATH=src .venv/bin/python -m citegeist --db talkorigins.sqlite3 example-talkorigins-ingest talkorigins-out/talkorigins_manifest.json
+```
+
+## Output Artifacts
+
+The example scrape writes:
+
+- `seeds/*.bib` per-topic seed BibTeX files;
+- `plaintext/*.txt` cleaned GSA-style plaintext with repeated authors expanded;
+- `site/topics/*.html` reconstructed topic pages with hide/show BibTeX blocks;
+- `talkorigins_full.txt` and `talkorigins_full.bib` aggregate downloads;
+- `snapshots/*.json` cached topic payloads so reruns can resume.
+
+## Notes
+
+- The example-specific CLI names have compatibility aliases matching the older `scrape-talkorigins` style commands.
+- Topic phrase staging, review, and export commands are generic `citegeist` functionality and are not specific to TalkOrigins.
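To script the example workflow instead of typing the commands above by hand, one option is to shell out to the same CLI. A minimal sketch, assuming the repository layout the README invocations use (sources importable via `PYTHONPATH=src`, scrape output under `talkorigins-out/`):

```python
import os
import subprocess
import sys

# Run the example-scoped validation command shown above; setting PYTHONPATH=src
# mirrors the README invocations and is an assumption about the working directory.
env = {**os.environ, "PYTHONPATH": "src"}
subprocess.run(
    [
        sys.executable,
        "-m",
        "citegeist",
        "example-talkorigins-validate",
        "talkorigins-out/talkorigins_manifest.json",
    ],
    check=True,
    env=env,
)
```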
@@ -7,18 +7,6 @@ from .harvest import OaiMetadataFormat, OaiPmhHarvester, OaiSet
 from .resolve import MetadataResolver, merge_entries, merge_entries_with_conflicts
 from .sources import SourceClient
 from .storage import BibliographyStore
-from .talkorigins import (
-    TalkOriginsBatchExport,
-    TalkOriginsDuplicateCluster,
-    TalkOriginsEnrichmentResult,
-    TalkOriginsIngestReport,
-    TalkOriginsReviewExport,
-    TalkOriginsScraper,
-    TalkOriginsSeedSet,
-    TalkOriginsTopicPhraseSuggestion,
-    TalkOriginsTopic,
-    TalkOriginsValidationReport,
-)

 __all__ = [
     "BibEntry",
@@ -34,16 +22,6 @@ __all__ = [
     "OaiMetadataFormat",
     "OaiSet",
     "SourceClient",
-    "TalkOriginsBatchExport",
-    "TalkOriginsDuplicateCluster",
-    "TalkOriginsEnrichmentResult",
-    "TalkOriginsIngestReport",
-    "TalkOriginsReviewExport",
-    "TalkOriginsScraper",
-    "TalkOriginsSeedSet",
-    "TalkOriginsTopicPhraseSuggestion",
-    "TalkOriginsTopic",
-    "TalkOriginsValidationReport",
     "extract_references",
     "load_batch_jobs",
     "merge_entries",
@@ -9,12 +9,12 @@ from pathlib import Path
 from .batch import BatchBootstrapRunner, load_batch_jobs
 from .bibtex import parse_bibtex, render_bibtex
 from .bootstrap import Bootstrapper
+from .examples.talkorigins import TalkOriginsScraper
 from .expand import CrossrefExpander, OpenAlexExpander, TopicExpander
 from .extract import extract_references
 from .harvest import OaiPmhHarvester
 from .resolve import MetadataResolver, merge_entries_with_conflicts
 from .storage import BibliographyStore
-from .talkorigins import TalkOriginsScraper


 def build_parser() -> argparse.ArgumentParser:
@@ -205,8 +205,9 @@ def build_parser() -> argparse.ArgumentParser:
     batch_parser.add_argument("input", help="Path to batch JSON file")

     talkorigins_parser = subparsers.add_parser(
-        "scrape-talkorigins",
-        help="Scrape TalkOrigins into per-topic seed BibTeX files and a bootstrap-batch JSON file",
+        "example-talkorigins-scrape",
+        aliases=["scrape-talkorigins"],
+        help="Example workflow: scrape TalkOrigins into per-topic seed BibTeX files and a bootstrap-batch JSON file",
     )
     talkorigins_parser.add_argument(
         "output_dir",
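Every parser hunk in this file applies the same renaming pattern: the example-scoped command becomes the primary subparser name, the old spelling is kept through argparse's `aliases`, and the dispatch in `main` (further down in this diff) accepts either name. A self-contained sketch of that pattern, with illustrative names rather than the project's full parser:

```python
import argparse


def build_sketch_parser() -> argparse.ArgumentParser:
    # Minimal stand-in for the real build_parser(); only one subcommand is shown.
    parser = argparse.ArgumentParser(prog="citegeist-sketch")
    subparsers = parser.add_subparsers(dest="command")
    scrape = subparsers.add_parser(
        "example-talkorigins-scrape",
        aliases=["scrape-talkorigins"],  # old spelling kept as a compatibility alias
        help="Example workflow: scrape TalkOrigins into per-topic seed BibTeX files",
    )
    scrape.add_argument("output_dir")
    return parser


def dispatch(argv: list[str]) -> int:
    args = build_sketch_parser().parse_args(argv)
    # argparse stores whichever spelling was typed, so dispatch checks both names.
    if args.command in {"example-talkorigins-scrape", "scrape-talkorigins"}:
        print(f"would scrape into {args.output_dir}")
        return 0
    return 1


if __name__ == "__main__":
    raise SystemExit(dispatch(["scrape-talkorigins", "talkorigins-out"]))
```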
@@ -257,14 +258,16 @@ def build_parser() -> argparse.ArgumentParser:
     talkorigins_parser.add_argument("--status", default="draft", help="Review status for generated seed jobs")

     validate_talkorigins_parser = subparsers.add_parser(
-        "validate-talkorigins",
-        help="Validate a generated TalkOrigins manifest and report parse coverage and suspicious entries",
+        "example-talkorigins-validate",
+        aliases=["validate-talkorigins"],
+        help="Example workflow: validate a generated TalkOrigins manifest and report parse coverage and suspicious entries",
     )
     validate_talkorigins_parser.add_argument("manifest", help="Path to talkorigins_manifest.json")

     suggest_talkorigins_parser = subparsers.add_parser(
-        "suggest-talkorigins-phrases",
-        help="Suggest stored topic expansion phrases from a TalkOrigins manifest",
+        "example-talkorigins-suggest-phrases",
+        aliases=["suggest-talkorigins-phrases"],
+        help="Example workflow: suggest stored topic expansion phrases from a TalkOrigins manifest",
     )
     suggest_talkorigins_parser.add_argument("manifest", help="Path to talkorigins_manifest.json")
     suggest_talkorigins_parser.add_argument("--topic", help="Optional topic slug to restrict suggestions")
@@ -305,8 +308,9 @@ def build_parser() -> argparse.ArgumentParser:
     review_topic_phrases_parser.add_argument("input", help="Path to JSON file containing topic phrase review records")

     duplicates_talkorigins_parser = subparsers.add_parser(
-        "duplicates-talkorigins",
-        help="Inspect duplicate clusters in a generated TalkOrigins manifest",
+        "example-talkorigins-duplicates",
+        aliases=["duplicates-talkorigins"],
+        help="Example workflow: inspect duplicate clusters in a generated TalkOrigins manifest",
     )
     duplicates_talkorigins_parser.add_argument("manifest", help="Path to talkorigins_manifest.json")
     duplicates_talkorigins_parser.add_argument("--limit", type=int, default=20, help="Maximum clusters to show")
@@ -330,8 +334,9 @@ def build_parser() -> argparse.ArgumentParser:
     )

     ingest_talkorigins_parser = subparsers.add_parser(
-        "ingest-talkorigins",
-        help="Ingest a TalkOrigins manifest into the database with duplicate consolidation and topic membership",
+        "example-talkorigins-ingest",
+        aliases=["ingest-talkorigins"],
+        help="Example workflow: ingest a TalkOrigins manifest into the database with duplicate consolidation and topic membership",
     )
     ingest_talkorigins_parser.add_argument("manifest", help="Path to talkorigins_manifest.json")
     ingest_talkorigins_parser.add_argument("--status", default="draft", help="Review status for imported entries")
@@ -342,8 +347,9 @@ def build_parser() -> argparse.ArgumentParser:
     )

     enrich_talkorigins_parser = subparsers.add_parser(
-        "enrich-talkorigins",
-        help="Attempt metadata enrichment for weak TalkOrigins canonical entries",
+        "example-talkorigins-enrich",
+        aliases=["enrich-talkorigins"],
+        help="Example workflow: attempt metadata enrichment for weak TalkOrigins canonical entries",
     )
     enrich_talkorigins_parser.add_argument("manifest", help="Path to talkorigins_manifest.json")
     enrich_talkorigins_parser.add_argument("--limit", type=int, default=20, help="Maximum weak clusters to inspect")
@@ -372,8 +378,9 @@ def build_parser() -> argparse.ArgumentParser:
     )

     review_talkorigins_parser = subparsers.add_parser(
-        "review-talkorigins",
-        help="Export weak TalkOrigins clusters plus dry-run enrichment outcomes for manual review",
+        "example-talkorigins-review",
+        aliases=["review-talkorigins"],
+        help="Example workflow: export weak TalkOrigins clusters plus dry-run enrichment outcomes for manual review",
     )
     review_talkorigins_parser.add_argument("manifest", help="Path to talkorigins_manifest.json")
     review_talkorigins_parser.add_argument("--limit", type=int, default=20, help="Maximum weak clusters to export")
@@ -388,8 +395,9 @@ def build_parser() -> argparse.ArgumentParser:
     review_talkorigins_parser.add_argument("--output", help="Write review export JSON to a file instead of stdout")

     apply_review_talkorigins_parser = subparsers.add_parser(
-        "apply-talkorigins-corrections",
-        help="Apply curated TalkOrigins review corrections to the consolidated database",
+        "example-talkorigins-apply-corrections",
+        aliases=["apply-talkorigins-corrections"],
+        help="Example workflow: apply curated TalkOrigins review corrections to the consolidated database",
     )
     apply_review_talkorigins_parser.add_argument("manifest", help="Path to talkorigins_manifest.json")
     apply_review_talkorigins_parser.add_argument("corrections", help="Path to corrections JSON")
@@ -530,7 +538,7 @@ def main(argv: list[str] | None = None) -> int:
     )
     if args.command == "bootstrap-batch":
         return _run_bootstrap_batch(store, Path(args.input))
-    if args.command == "scrape-talkorigins":
+    if args.command in {"example-talkorigins-scrape", "scrape-talkorigins"}:
         return _run_scrape_talkorigins(
             store,
             args.base_url,
@@ -545,9 +553,9 @@ def main(argv: list[str] | None = None) -> int:
             args.topic_commit_limit,
             args.status,
         )
-    if args.command == "validate-talkorigins":
+    if args.command in {"example-talkorigins-validate", "validate-talkorigins"}:
         return _run_validate_talkorigins(Path(args.manifest))
-    if args.command == "suggest-talkorigins-phrases":
+    if args.command in {"example-talkorigins-suggest-phrases", "suggest-talkorigins-phrases"}:
         return _run_suggest_talkorigins_phrases(Path(args.manifest), args.topic, args.limit, args.output)
     if args.command == "apply-topic-phrases":
         return _run_apply_topic_phrases(store, Path(args.input))
@@ -557,7 +565,7 @@ def main(argv: list[str] | None = None) -> int:
         return _run_review_topic_phrase(store, args.topic_slug, args.status, args.notes, args.phrase)
     if args.command == "review-topic-phrases":
         return _run_review_topic_phrases(store, Path(args.input))
-    if args.command == "duplicates-talkorigins":
+    if args.command in {"example-talkorigins-duplicates", "duplicates-talkorigins"}:
         return _run_duplicates_talkorigins(
             Path(args.manifest),
             args.limit,
@@ -567,9 +575,9 @@ def main(argv: list[str] | None = None) -> int:
             args.preview,
             args.weak_only,
         )
-    if args.command == "ingest-talkorigins":
+    if args.command in {"example-talkorigins-ingest", "ingest-talkorigins"}:
         return _run_ingest_talkorigins(store, Path(args.manifest), args.status, not args.no_dedupe)
-    if args.command == "enrich-talkorigins":
+    if args.command in {"example-talkorigins-enrich", "enrich-talkorigins"}:
         return _run_enrich_talkorigins(
             store,
             Path(args.manifest),
@@ -581,7 +589,7 @@ def main(argv: list[str] | None = None) -> int:
             args.status,
             args.allow_unsafe_search_matches,
         )
-    if args.command == "review-talkorigins":
+    if args.command in {"example-talkorigins-review", "review-talkorigins"}:
         return _run_review_talkorigins(
             store,
             Path(args.manifest),
@@ -591,7 +599,7 @@ def main(argv: list[str] | None = None) -> int:
             args.topic,
             args.output,
         )
-    if args.command == "apply-talkorigins-corrections":
+    if args.command in {"example-talkorigins-apply-corrections", "apply-talkorigins-corrections"}:
         return _run_apply_talkorigins_corrections(
             store,
             Path(args.manifest),
@@ -0,0 +1,29 @@
+from .talkorigins import (
+    TalkOriginsBatchExport,
+    TalkOriginsCorrectionResult,
+    TalkOriginsDuplicateCluster,
+    TalkOriginsEnrichmentResult,
+    TalkOriginsIngestReport,
+    TalkOriginsReviewExport,
+    TalkOriginsScraper,
+    TalkOriginsSeedSet,
+    TalkOriginsTopic,
+    TalkOriginsTopicPhraseSuggestion,
+    TalkOriginsValidationReport,
+    normalize_topic_entries,
+)
+
+__all__ = [
+    "TalkOriginsBatchExport",
+    "TalkOriginsCorrectionResult",
+    "TalkOriginsDuplicateCluster",
+    "TalkOriginsEnrichmentResult",
+    "TalkOriginsIngestReport",
+    "TalkOriginsReviewExport",
+    "TalkOriginsScraper",
+    "TalkOriginsSeedSet",
+    "TalkOriginsTopic",
+    "TalkOriginsTopicPhraseSuggestion",
+    "TalkOriginsValidationReport",
+    "normalize_topic_entries",
+]
@@ -0,0 +1,29 @@
+from ..talkorigins import (
+    TalkOriginsBatchExport,
+    TalkOriginsCorrectionResult,
+    TalkOriginsDuplicateCluster,
+    TalkOriginsEnrichmentResult,
+    TalkOriginsIngestReport,
+    TalkOriginsReviewExport,
+    TalkOriginsScraper,
+    TalkOriginsSeedSet,
+    TalkOriginsTopic,
+    TalkOriginsTopicPhraseSuggestion,
+    TalkOriginsValidationReport,
+    normalize_topic_entries,
+)
+
+__all__ = [
+    "TalkOriginsBatchExport",
+    "TalkOriginsCorrectionResult",
+    "TalkOriginsDuplicateCluster",
+    "TalkOriginsEnrichmentResult",
+    "TalkOriginsIngestReport",
+    "TalkOriginsReviewExport",
+    "TalkOriginsScraper",
+    "TalkOriginsSeedSet",
+    "TalkOriginsTopic",
+    "TalkOriginsTopicPhraseSuggestion",
+    "TalkOriginsValidationReport",
+    "normalize_topic_entries",
+]
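The two new modules above are thin re-export shims, so the example-scoped and the original import paths resolve to the same objects. A quick sanity-check sketch, assuming the package is importable (e.g. with `PYTHONPATH=src`):

```python
# Both namespaces should expose identical objects after this commit.
from citegeist import talkorigins as core_talkorigins
from citegeist.examples import talkorigins as example_talkorigins

assert example_talkorigins.TalkOriginsScraper is core_talkorigins.TalkOriginsScraper
assert example_talkorigins.normalize_topic_entries is core_talkorigins.normalize_topic_entries
```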
@@ -1,3 +1,10 @@
+"""TalkOrigins example implementation.
+
+This module backs the example-facing namespace at ``citegeist.examples.talkorigins``.
+New code should prefer importing from the examples namespace rather than treating
+TalkOrigins support as part of the core top-level package surface.
+"""
+
 from __future__ import annotations

 from collections import Counter
@@ -7,6 +7,16 @@ from pathlib import Path
 from unittest.mock import patch

 from citegeist.cli import main
+from citegeist.examples.talkorigins import (
+    TalkOriginsBatchExport,
+    TalkOriginsCorrectionResult,
+    TalkOriginsDuplicateCluster,
+    TalkOriginsEnrichmentResult,
+    TalkOriginsIngestReport,
+    TalkOriginsReviewExport,
+    TalkOriginsTopicPhraseSuggestion,
+    TalkOriginsValidationReport,
+)


 SAMPLE_BIB = """
@@ -313,7 +323,7 @@ def test_cli_scrape_talkorigins_accepts_output_dir(tmp_path):

     database = tmp_path / "library.sqlite3"
     with patch("citegeist.cli.TalkOriginsScraper.scrape_to_directory") as mocked_scrape:
-        mocked_scrape.return_value = __import__("citegeist").TalkOriginsBatchExport(
+        mocked_scrape.return_value = TalkOriginsBatchExport(
             base_url="https://www.talkorigins.org/origins/biblio/",
             output_dir=str(tmp_path),
             topic_count=1,
@@ -326,7 +336,7 @@ def test_cli_scrape_talkorigins_accepts_output_dir(tmp_path):
             [
                 "--db",
                 str(database),
-                "scrape-talkorigins",
+                "example-talkorigins-scrape",
                 str(tmp_path / "talkorigins-out"),
                 "--limit-topics",
                 "3",
@@ -346,7 +356,7 @@ def test_cli_validate_talkorigins_accepts_manifest(tmp_path):
     manifest = tmp_path / "talkorigins_manifest.json"
     manifest.write_text("{}", encoding="utf-8")
     with patch("citegeist.cli.TalkOriginsScraper.validate_export") as mocked_validate:
-        mocked_validate.return_value = __import__("citegeist").TalkOriginsValidationReport(
+        mocked_validate.return_value = TalkOriginsValidationReport(
             manifest_path=str(manifest),
             topic_count=1,
             entry_count=2,
@@ -360,7 +370,7 @@ def test_cli_validate_talkorigins_accepts_manifest(tmp_path):
             duplicate_entry_count=0,
             duplicate_examples=[],
         )
-        exit_code = main(["validate-talkorigins", str(manifest)])
+        exit_code = main(["example-talkorigins-validate", str(manifest)])

     assert exit_code == 0

@@ -373,7 +383,7 @@ def test_cli_suggest_talkorigins_phrases_writes_output(tmp_path):
     output = tmp_path / "phrases.json"
     with patch("citegeist.cli.TalkOriginsScraper.suggest_topic_phrases") as mocked_suggest:
         mocked_suggest.return_value = [
-            __import__("citegeist", fromlist=["TalkOriginsTopicPhraseSuggestion"]).TalkOriginsTopicPhraseSuggestion(
+            TalkOriginsTopicPhraseSuggestion(
                 slug="abiogenesis",
                 topic="Abiogenesis",
                 entry_count=2,
@@ -385,7 +395,7 @@ def test_cli_suggest_talkorigins_phrases_writes_output(tmp_path):
         ]
         exit_code = main(
             [
-                "suggest-talkorigins-phrases",
+                "example-talkorigins-suggest-phrases",
                 str(manifest),
                 "--topic",
                 "abiogenesis",
@@ -406,7 +416,7 @@ def test_cli_duplicates_talkorigins_accepts_manifest(tmp_path):
     manifest.write_text("{}", encoding="utf-8")
     with patch("citegeist.cli.TalkOriginsScraper.inspect_duplicate_clusters") as mocked_duplicates:
         mocked_duplicates.return_value = [
-            __import__("citegeist.talkorigins", fromlist=["TalkOriginsDuplicateCluster"]).TalkOriginsDuplicateCluster(
+            TalkOriginsDuplicateCluster(
                 key="smith|1999|duplicate paper",
                 count=2,
                 items=[
@@ -431,7 +441,7 @@ def test_cli_duplicates_talkorigins_accepts_manifest(tmp_path):
         ]
         exit_code = main(
             [
-                "duplicates-talkorigins",
+                "example-talkorigins-duplicates",
                 str(manifest),
                 "--topic",
                 "abiogenesis",
@@ -452,7 +462,7 @@ def test_cli_ingest_talkorigins_accepts_manifest(tmp_path):
     manifest = tmp_path / "talkorigins_manifest.json"
     manifest.write_text("{}", encoding="utf-8")
     with patch("citegeist.cli.TalkOriginsScraper.ingest_export") as mocked_ingest:
-        mocked_ingest.return_value = __import__("citegeist").TalkOriginsIngestReport(
+        mocked_ingest.return_value = TalkOriginsIngestReport(
             manifest_path=str(manifest),
             topic_count=1,
             raw_entry_count=2,
@@ -461,7 +471,7 @@ def test_cli_ingest_talkorigins_accepts_manifest(tmp_path):
             duplicate_entry_count=2,
             canonicalized_count=1,
         )
-        exit_code = main(["--db", str(database), "ingest-talkorigins", str(manifest)])
+        exit_code = main(["--db", str(database), "example-talkorigins-ingest", str(manifest)])

     assert exit_code == 0

@@ -474,7 +484,7 @@ def test_cli_enrich_talkorigins_accepts_manifest(tmp_path):
     manifest.write_text("{}", encoding="utf-8")
     with patch("citegeist.cli.TalkOriginsScraper.enrich_weak_canonicals") as mocked_enrich:
         mocked_enrich.return_value = [
-            __import__("citegeist.talkorigins", fromlist=["TalkOriginsEnrichmentResult"]).TalkOriginsEnrichmentResult(
+            TalkOriginsEnrichmentResult(
                 key="smith|1999|duplicate paper",
                 citation_key="dup1",
                 weak_reasons_before=["missing:doi"],
@@ -490,7 +500,7 @@ def test_cli_enrich_talkorigins_accepts_manifest(tmp_path):
             [
                 "--db",
                 str(database),
-                "enrich-talkorigins",
+                "example-talkorigins-enrich",
                 str(manifest),
                 "--limit",
                 "5",
@@ -510,7 +520,7 @@ def test_cli_review_talkorigins_writes_output(tmp_path):
     manifest.write_text("{}", encoding="utf-8")
     output = tmp_path / "review.json"
     with patch("citegeist.cli.TalkOriginsScraper.build_review_export") as mocked_review:
-        mocked_review.return_value = __import__("citegeist.talkorigins", fromlist=["TalkOriginsReviewExport"]).TalkOriginsReviewExport(
+        mocked_review.return_value = TalkOriginsReviewExport(
             manifest_path=str(manifest),
             item_count=1,
             items=[{"key": "smith|1999|duplicate paper", "canonical": {}, "enrichment": {}}],
@@ -519,7 +529,7 @@ def test_cli_review_talkorigins_writes_output(tmp_path):
             [
                 "--db",
                 str(database),
-                "review-talkorigins",
+                "example-talkorigins-review",
                 str(manifest),
                 "--output",
                 str(output),
@@ -540,7 +550,7 @@ def test_cli_apply_talkorigins_corrections_accepts_files(tmp_path):
     corrections.write_text('{"corrections": []}', encoding="utf-8")
     with patch("citegeist.cli.TalkOriginsScraper.apply_review_corrections") as mocked_apply:
         mocked_apply.return_value = [
-            __import__("citegeist.talkorigins", fromlist=["TalkOriginsCorrectionResult"]).TalkOriginsCorrectionResult(
+            TalkOriginsCorrectionResult(
                 key="smith|1999|duplicate paper",
                 citation_key="dup1",
                 applied=True,
@@ -551,7 +561,7 @@ def test_cli_apply_talkorigins_corrections_accepts_files(tmp_path):
             [
                 "--db",
                 str(database),
-                "apply-talkorigins-corrections",
+                "example-talkorigins-apply-corrections",
                 str(manifest),
                 str(corrections),
             ]
@@ -5,8 +5,8 @@ from pathlib import Path

 from citegeist.batch import load_batch_jobs
 from citegeist.bibtex import BibEntry
+from citegeist.examples.talkorigins import TalkOriginsScraper, normalize_topic_entries
 from citegeist.storage import BibliographyStore
-from citegeist.talkorigins import TalkOriginsScraper, normalize_topic_entries


 INDEX_HTML = """