Add CLI, roadmap, and pybtex integration

This commit is contained in:
welsberr 2026-03-19 20:51:54 -04:00
parent 4f3ac4decb
commit ac405943fb
10 changed files with 532 additions and 149 deletions

.gitignore vendored (1 line changed)

@@ -1,4 +1,5 @@
__pycache__/
.pytest_cache/
.venv/
*.pyc
library.sqlite3


@@ -43,12 +43,15 @@ But it is not the right long-term base:
The initial repo includes:
- a lightweight BibTeX parser for structured ingestion;
- `pybtex`-backed BibTeX parsing and export in a repo-local virtual environment;
- a SQLite-backed bibliography store;
- a small CLI for ingest, search, inspection, and export;
- normalized tables for entries, creators, identifiers, and citation relations;
- full-text-search-ready indexing over title, abstract, and fulltext when SQLite FTS5 is available;
- tests covering parsing, ingestion, relation storage, and search.
The prioritized execution plan lives in [ROADMAP.md](./ROADMAP.md).
## Layout
```text
@@ -65,7 +68,10 @@ citegeist/
```bash
cd citegeist
PYTHONPATH=src python3 - <<'PY'
python3 -m virtualenv --always-copy .venv
.venv/bin/pip install -e .
.venv/bin/pip install pytest
PYTHONPATH=src .venv/bin/python - <<'PY'
from citegeist import BibliographyStore
bib = """
@@ -91,17 +97,26 @@ print(store.get_relations("smith2024graphs"))
print(store.search_text("semantic"))
store.close()
PY
pytest -q
.venv/bin/python -m pytest -q
```
## Planned Work
Or use the CLI directly:
- parse references from raw prose, OCR, PDF text, and bibliography sections into draft BibTeX;
- add modern metadata resolvers for DOI, Crossref, DBLP, arXiv, and similar sources;
- track provenance and confidence for enriched fields;
- add graph expansion workflows over `cites` and `cited_by` edges;
- support acquisition pipelines for open-access theses, dissertations, preprints, and publisher metadata pages;
- add embeddings or pluggable semantic indexing beyond SQLite FTS.
```bash
cd citegeist
PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 ingest references.bib
PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 search "semantic search"
PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 show smith2024graphs
PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 export --output reviewed.bib
```
## Near-Term Priorities
- provenance tracking and entry review states;
- plaintext reference extraction into draft BibTeX;
- metadata resolvers for DOI, Crossref, DBLP, and arXiv.
See [ROADMAP.md](./ROADMAP.md) for the prioritized phase plan and rationale.
## Naming

ROADMAP.md (new file, 187 lines)

@@ -0,0 +1,187 @@
# Roadmap
This roadmap prioritizes a usable local research workflow over breadth of integrations.
The first objective is not to support every metadata source. The first objective is to make one end-to-end path work reliably:
1. ingest draft references,
2. normalize and store them,
3. enrich them,
4. traverse citation links,
5. export reviewed BibTeX.
## Prioritization Principles
- prioritize steps that make the system usable by a single researcher on a local machine;
- prioritize deterministic infrastructure before network integrations;
- keep every stage inspectable and auditable;
- treat verification and provenance as core features, not cleanup work;
- defer heavy semantic infrastructure until the local corpus model is stable.
## Current Baseline
Completed:
- lightweight BibTeX parsing;
- SQLite storage for entries, creators, identifiers, and relations;
- local text search using SQLite FTS5 when available;
- tests for ingest, relation storage, and search.
## Phase 1: Core Ingestion And Export
Priority: P0
Goal:
Make `citegeist` useful as a local BibTeX workbench even before online enrichment is added.
Tasks:
- add BibTeX export from the normalized database back into stable, readable BibTeX;
- add a small CLI for `ingest`, `show`, `search`, and `export`;
- store field provenance metadata alongside imported and edited fields;
- add schema support for entry status such as `draft`, `enriched`, `reviewed`, and `exported`;
- add fixture-driven tests for round-tripping BibTeX through ingest and export.
Why this comes first:
- without export, the project is not yet useful in a LaTeX workflow;
- without a CLI, the package is a library demo rather than a tool;
- without provenance and state, later enrichment work becomes hard to audit.
Exit criteria:
- a user can ingest a `.bib` file, inspect entries, search locally, and export a reviewed `.bib`;
- round-trip tests show no unexpected field loss for supported entry types.
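A minimal sketch of how the entry review states could be modeled (names are illustrative; the actual schema in this repo may differ):

```python
from enum import Enum


class EntryStatus(str, Enum):
    DRAFT = "draft"
    ENRICHED = "enriched"
    REVIEWED = "reviewed"
    EXPORTED = "exported"


# Allowed transitions; review can also send an entry back to draft.
TRANSITIONS = {
    EntryStatus.DRAFT: {EntryStatus.ENRICHED, EntryStatus.REVIEWED},
    EntryStatus.ENRICHED: {EntryStatus.REVIEWED, EntryStatus.DRAFT},
    EntryStatus.REVIEWED: {EntryStatus.EXPORTED, EntryStatus.DRAFT},
    EntryStatus.EXPORTED: set(),
}


def advance(current: EntryStatus, target: EntryStatus) -> EntryStatus:
    """Move an entry to a new status, rejecting illegal jumps."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

Storing the status as its string value keeps the SQLite column human-readable, while the transition table keeps illegal jumps, such as exporting an unreviewed draft, out of the workflow.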
## Phase 2: Reference Extraction
Priority: P0
Goal:
Turn raw reference text into draft entries that can enter the main pipeline.
Tasks:
- add parsers for bibliography-section lines and plain-text reference lists;
- define a draft-entry schema for incomplete references with confidence markers;
- support ingestion of OCR- or PDF-derived plaintext bibliography sections;
- add normalization for author names, years, title casing, and page ranges;
- build gold-test fixtures from real, messy reference examples.
Why this is next:
- this addresses the project's first unique bottleneck: getting rough references into structured form;
- enrichment is much more effective once draft references are normalized.
Exit criteria:
- a user can pass a plaintext bibliography section and receive draft BibTeX entries with unresolved fields clearly marked;
- tests cover common article, book, chapter, and proceedings references.
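One possible shape for the draft-entry schema with confidence markers (all names here are hypothetical, not the repo's actual types):

```python
from dataclasses import dataclass, field


@dataclass
class DraftField:
    """One extracted field plus how sure the parser is about it."""

    value: str
    confidence: float  # 0.0 = pure guess, 1.0 = certain
    source_span: str = ""  # raw text the value was pulled from


@dataclass
class DraftEntry:
    citation_key: str
    fields: dict[str, DraftField] = field(default_factory=dict)

    def unresolved(self, threshold: float = 0.8) -> list[str]:
        # Field names that still need human review or enrichment.
        return [name for name, f in self.fields.items() if f.confidence < threshold]


draft = DraftEntry("smith2024graphs")
draft.fields["title"] = DraftField("Graph-first bibliography augmentation", 0.95)
draft.fields["year"] = DraftField("2024", 0.4, source_span="(2024?)")
```

The `unresolved()` list is exactly what the exit criterion asks to surface: fields the pipeline could not commit to.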
## Phase 3: Metadata Enrichment
Priority: P1
Goal:
Resolve draft or partial entries against external scholarly sources and merge improved metadata safely.
Tasks:
- define a resolver interface with deterministic merge rules;
- implement first-party resolvers for DOI/Crossref, DBLP, and arXiv;
- add identifier-first resolution, then title/author/year fallback search;
- store merge provenance per field and resolution attempt logs;
- flag conflicts rather than silently overwriting disputed values.
Why this is P1 rather than the first phase:
- enrichment quality depends on the ingestion and provenance model being correct first;
- it is easier to test deterministic merge behavior once local workflows already exist.
Exit criteria:
- an incomplete entry can be enriched from at least one authoritative source;
- conflicting fields remain visible for review instead of being lost.
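The resolver interface and deterministic merge rules could look roughly like this (a sketch, not the final API):

```python
from typing import Protocol


class Resolver(Protocol):
    """One external source, e.g. a Crossref or DBLP client."""

    name: str

    def resolve(self, fields: dict[str, str]) -> dict[str, str]:
        """Return improved fields for a partial entry, or {} on no match."""
        ...


def merge(local: dict[str, str], remote: dict[str, str], source: str):
    """Deterministic merge: fill blanks, record provenance, flag conflicts.

    Existing non-empty values are never overwritten; disagreements are
    returned for review instead of being resolved silently.
    """
    merged = dict(local)
    provenance: dict[str, str] = {}
    conflicts: dict[str, tuple[str, str]] = {}
    for key, value in sorted(remote.items()):  # sorted => order-independent result
        if not merged.get(key):
            merged[key] = value
            provenance[key] = source
        elif merged[key] != value:
            conflicts[key] = (merged[key], value)
    return merged, provenance, conflicts
```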
## Phase 4: Citation Graph Expansion
Priority: P1
Goal:
Use citation edges as a discovery engine rather than just metadata storage.
Tasks:
- support explicit `cites` and `cited_by` edge ingestion with source provenance;
- add graph expansion commands starting from one or more seed entries;
- track edge discovery source, timestamp, and confidence;
- add filters for depth, source type, year range, and reviewed status;
- expose unresolved nodes so the user can decide what to enrich next.
Why this matters:
- this is central to literature discovery rather than mere bibliography cleanup;
- it turns the database into a research navigation tool.
Exit criteria:
- starting from one or more seed entries, a user can expand outward through citation edges and persist newly discovered nodes;
- graph traversal results can be exported as BibTeX candidates for review.
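Graph expansion from seed entries is essentially a bounded breadth-first search over `cites` edges; a sketch (the real command would read edges from the store rather than a dict):

```python
from collections import deque
from typing import Callable, Iterable


def expand(seeds: Iterable[str], get_cites: Callable[[str], list[str]], max_depth: int = 2) -> dict[str, int]:
    """Return {citation_key: depth} for every node within max_depth hops."""
    depths = {seed: 0 for seed in seeds}
    queue = deque(depths)
    while queue:
        key = queue.popleft()
        if depths[key] >= max_depth:
            continue  # depth filter: do not expand past the horizon
        for cited in get_cites(key):
            if cited not in depths:  # first discovery records the shortest depth
                depths[cited] = depths[key] + 1
                queue.append(cited)
    return depths


# Toy edge set standing in for stored `cites` relations.
graph = {"a": ["b", "c"], "b": ["d"], "c": [], "d": ["e"]}
reached = expand(["a"], lambda k: graph.get(k, []))
```

Unresolved nodes are then the reached keys with no local entry yet, which is exactly what the user would choose to enrich next.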
## Phase 5: Search And Ranking
Priority: P2
Goal:
Improve discovery quality inside the local corpus.
Tasks:
- refine FTS ranking across title, abstract, keywords, and fulltext;
- add saved search queries and result filters;
- add optional embedding-backed semantic search behind a pluggable interface;
- support hybrid ranking that combines lexical matching, identifiers, and citation proximity;
- add benchmarking fixtures for retrieval quality on a few research topics.
Why this is later:
- FTS is already enough to support early workflows;
- embedding infrastructure is expensive and should wait until the corpus schema stabilizes.
Exit criteria:
- local search is useful on realistic corpora without requiring external services;
- semantic indexing is optional and does not displace the simpler local search path.
## Phase 6: Corpus Acquisition Pipelines
Priority: P2
Goal:
Broaden source acquisition without mixing that complexity into the core model.
Tasks:
- add source adapters for open-access theses and dissertation repositories;
- add support for harvesting publisher citation pages and preprint metadata pages;
- define per-source import provenance and rate-limit behavior;
- separate source-specific scraping logic from normalized entry storage;
- add regression fixtures for representative public sources.
Why this is later:
- acquisition breadth is useful, but only after the core ingest/enrich/review loop is solid;
- source adapters are brittle and should sit on top of a stable model.
Exit criteria:
- new public corpora can be imported through adapters without changing the storage core;
- imported entries retain their source provenance and can be reviewed like any other entry.
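Keeping adapters out of the storage core could reduce to a small protocol plus a provenance-stamping step between harvest and normalization (all names hypothetical, including the `"oatd"` adapter):

```python
import time
from typing import Iterator, Protocol


class SourceAdapter(Protocol):
    """Per-source harvesting logic, isolated from normalized storage."""

    name: str
    min_interval_seconds: float  # rate-limit spacing between requests

    def harvest(self, query: str) -> Iterator[dict[str, str]]: ...


def tag_provenance(records: Iterator[dict[str, str]], adapter_name: str) -> Iterator[dict[str, str]]:
    """Stamp each harvested record with its source before it enters the store."""
    for record in records:
        yield {**record, "x-source": adapter_name, "x-imported-at": str(int(time.time()))}


rows = list(tag_provenance(iter([{"title": "A thesis"}]), "oatd"))
```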
## Suggested Next Three Tasks
1. Add a CLI module with `ingest`, `search`, `show`, and `export`.
2. Implement BibTeX export from the normalized store.
3. Add provenance tables and entry review status fields.
These three tasks complete the first usable local workflow and should be treated as the immediate sprint.


@@ -7,6 +7,10 @@ name = "citegeist"
version = "0.1.0"
description = "BibTeX-native tooling for bibliography augmentation, citation graphs, and search"
requires-python = ">=3.10"
dependencies = ["pybtex==0.25.1"]
[project.scripts]
citegeist = "citegeist.cli:main"
[tool.pytest.ini_options]
pythonpath = ["src"]


@@ -0,0 +1,4 @@
from .cli import main
raise SystemExit(main())


@@ -1,6 +1,14 @@
from __future__ import annotations

from dataclasses import dataclass
from io import StringIO

try:
    from pybtex.database import BibliographyData, Entry, Person, parse_string
    from pybtex.database.output.bibtex import Writer
except ImportError:  # pragma: no cover - exercised only outside the configured venv
    BibliographyData = Entry = Person = Writer = None
    parse_string = None


@dataclass(slots=True)
@@ -11,148 +19,42 @@ class BibEntry:
def parse_bibtex(text: str) -> list[BibEntry]:
_require_pybtex()
bibliography = parse_string(text, bib_format="bibtex")
entries: list[BibEntry] = []
index = 0
size = len(text)
while index < size:
at = text.find("@", index)
if at == -1:
break
entry_type_start = at + 1
brace = text.find("{", entry_type_start)
if brace == -1:
raise ValueError("Malformed BibTeX: missing opening brace")
entry_type = text[entry_type_start:brace].strip().lower()
body, index = _read_balanced_block(text, brace)
citation_key, fields_blob = _split_key_and_fields(body)
for citation_key, entry in bibliography.entries.items():
fields = dict(entry.fields.items())
for role, persons in entry.persons.items():
fields[role] = " and ".join(str(person) for person in persons)
entries.append(
BibEntry(
entry_type=entry_type,
entry_type=entry.type,
citation_key=citation_key,
fields=_parse_fields(fields_blob),
fields=fields,
)
)
return entries
def _read_balanced_block(text: str, brace_index: int) -> tuple[str, int]:
depth = 0
in_quotes = False
escaped = False
def render_bibtex(entries: list[BibEntry]) -> str:
_require_pybtex()
bibliography_entries = {}
for entry in entries:
fields = {key: value for key, value in entry.fields.items() if key not in {"author", "editor"}}
persons = {}
for role in ("author", "editor"):
raw_names = entry.fields.get(role)
if raw_names:
persons[role] = [Person(name.strip()) for name in raw_names.split(" and ") if name.strip()]
bibliography_entries[entry.citation_key] = Entry(entry.entry_type, fields=fields, persons=persons)
for index in range(brace_index, len(text)):
char = text[index]
if in_quotes:
if escaped:
escaped = False
elif char == "\\":
escaped = True
elif char == '"':
in_quotes = False
continue
if char == '"':
in_quotes = True
elif char == "{":
depth += 1
elif char == "}":
depth -= 1
if depth == 0:
return text[brace_index + 1:index], index + 1
raise ValueError("Malformed BibTeX: unbalanced braces")
buffer = StringIO()
Writer().write_stream(BibliographyData(entries=bibliography_entries), buffer)
return buffer.getvalue().strip()
def _split_key_and_fields(body: str) -> tuple[str, str]:
depth = 0
in_quotes = False
escaped = False
for index, char in enumerate(body):
if in_quotes:
if escaped:
escaped = False
elif char == "\\":
escaped = True
elif char == '"':
in_quotes = False
continue
if char == '"':
in_quotes = True
elif char == "{":
depth += 1
elif char == "}":
depth -= 1
elif char == "," and depth == 0:
return body[:index].strip(), body[index + 1 :]
return body.strip(), ""
def _parse_fields(blob: str) -> dict[str, str]:
fields: dict[str, str] = {}
index = 0
size = len(blob)
while index < size:
while index < size and blob[index] in " \t\r\n,":
index += 1
if index >= size:
break
name_start = index
while index < size and (blob[index].isalnum() or blob[index] in "-_"):
index += 1
name = blob[name_start:index].strip().lower()
while index < size and blob[index].isspace():
index += 1
if index >= size or blob[index] != "=":
raise ValueError(f"Malformed BibTeX field near: {blob[name_start:]!r}")
index += 1
while index < size and blob[index].isspace():
index += 1
value, index = _parse_value(blob, index)
fields[name] = " ".join(value.split())
while index < size and blob[index] in " \t\r\n,":
index += 1
return fields
def _parse_value(blob: str, index: int) -> tuple[str, int]:
if index >= len(blob):
return "", index
if blob[index] == "{":
value, next_index = _read_balanced_block(blob, index)
return value.strip(), next_index
if blob[index] == '"':
index += 1
chars: list[str] = []
escaped = False
while index < len(blob):
char = blob[index]
if escaped:
chars.append(char)
escaped = False
elif char == "\\":
chars.append(char)
escaped = True
elif char == '"':
return "".join(chars).strip(), index + 1
else:
chars.append(char)
index += 1
raise ValueError("Malformed BibTeX: unterminated quoted string")
end = index
while end < len(blob) and blob[end] not in ",\r\n":
end += 1
return blob[index:end].strip(), end
def _require_pybtex() -> None:
    if parse_string is None or Writer is None:
        raise RuntimeError(
            "pybtex is required. Use the repo-local virtual environment under .venv/ for citegeist commands."
        )

src/citegeist/cli.py (new file, 91 lines)

@@ -0,0 +1,91 @@
from __future__ import annotations

import argparse
import json
import sys
from pathlib import Path

from .storage import BibliographyStore


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="citegeist")
    parser.add_argument("--db", default="library.sqlite3", help="Path to the SQLite database")
    subparsers = parser.add_subparsers(dest="command", required=True)
    ingest_parser = subparsers.add_parser("ingest", help="Ingest BibTeX into the database")
    ingest_parser.add_argument("input", help="BibTeX file to ingest")
    search_parser = subparsers.add_parser("search", help="Search titles, abstracts, and fulltext")
    search_parser.add_argument("query", help="Search query")
    search_parser.add_argument("--limit", type=int, default=10, help="Maximum number of results")
    show_parser = subparsers.add_parser("show", help="Show one entry or list entries")
    show_parser.add_argument("citation_key", nargs="?", help="Citation key to show")
    show_parser.add_argument("--limit", type=int, default=20, help="Maximum entries when listing")
    export_parser = subparsers.add_parser("export", help="Export entries as BibTeX")
    export_parser.add_argument("citation_keys", nargs="*", help="Optional citation keys to export")
    export_parser.add_argument("--output", help="Write BibTeX to a file instead of stdout")
    return parser


def main(argv: list[str] | None = None) -> int:
    parser = build_parser()
    args = parser.parse_args(argv)
    store = BibliographyStore(args.db)
    try:
        if args.command == "ingest":
            return _run_ingest(store, Path(args.input))
        if args.command == "search":
            return _run_search(store, args.query, args.limit)
        if args.command == "show":
            return _run_show(store, args.citation_key, args.limit)
        if args.command == "export":
            return _run_export(store, args.citation_keys, args.output)
    finally:
        store.close()
    parser.error(f"Unknown command: {args.command}")
    return 2


def _run_ingest(store: BibliographyStore, input_path: Path) -> int:
    text = input_path.read_text(encoding="utf-8")
    keys = store.ingest_bibtex(text)
    for key in keys:
        print(key)
    return 0


def _run_search(store: BibliographyStore, query: str, limit: int) -> int:
    for row in store.search_text(query, limit=limit):
        score = row.get("score", 0.0)
        print(f"{row['citation_key']}\t{row.get('year') or ''}\t{score:.3f}\t{row.get('title') or ''}")
    return 0


def _run_show(store: BibliographyStore, citation_key: str | None, limit: int) -> int:
    if citation_key:
        entry = store.get_entry(citation_key)
        if entry is None:
            print(f"Entry not found: {citation_key}", file=sys.stderr)
            return 1
        print(json.dumps(entry, indent=2, sort_keys=True))
        return 0
    print(json.dumps(store.list_entries(limit=limit), indent=2))
    return 0


def _run_export(store: BibliographyStore, citation_keys: list[str], output: str | None) -> int:
    rendered = store.export_bibtex(citation_keys or None)
    if output:
        Path(output).write_text(rendered + ("\n" if rendered else ""), encoding="utf-8")
    elif rendered:
        print(rendered)
    return 0


@@ -2,9 +2,10 @@ from __future__ import annotations
import json
import sqlite3
from collections import OrderedDict
from pathlib import Path
from .bibtex import BibEntry, parse_bibtex
from .bibtex import BibEntry, parse_bibtex, render_bibtex
IDENTIFIER_FIELDS = ("doi", "isbn", "issn", "pmid", "arxiv", "dblp", "oai", "url")
RELATION_FIELDS = {
@@ -268,6 +269,41 @@ class BibliographyStore:
        ).fetchone()
        return dict(row) if row else None

    def list_entries(self, limit: int = 50) -> list[dict[str, object]]:
        rows = self.connection.execute(
            """
            SELECT citation_key, entry_type, title, year
            FROM entries
            ORDER BY COALESCE(year, ''), citation_key
            LIMIT ?
            """,
            (limit,),
        ).fetchall()
        return [dict(row) for row in rows]

    def get_entry_bibtex(self, citation_key: str) -> str | None:
        entry = self._load_bib_entry(citation_key)
        if entry is None:
            return None
        return render_bibtex([entry])

    def export_bibtex(self, citation_keys: list[str] | None = None) -> str:
        if citation_keys is None:
            rows = self.connection.execute(
                "SELECT citation_key FROM entries ORDER BY COALESCE(year, ''), citation_key"
            ).fetchall()
            citation_keys = [str(row["citation_key"]) for row in rows]
        entries: list[BibEntry] = []
        for citation_key in citation_keys:
            entry = self._load_bib_entry(citation_key)
            if entry is not None:
                entries.append(entry)
        if not entries:
            return ""
        return render_bibtex(entries)
    def _detect_fts5(self) -> bool:
        try:
            self.connection.execute("CREATE VIRTUAL TABLE temp.fts_probe USING fts5(content)")
@@ -276,6 +312,76 @@ class BibliographyStore:
        except sqlite3.OperationalError:
            return False
    def _load_bib_entry(self, citation_key: str) -> BibEntry | None:
        row = self.connection.execute(
            """
            SELECT citation_key, entry_type, title, year, journal, booktitle, publisher,
                   abstract, keywords, url, doi, isbn, extra_fields_json
            FROM entries
            WHERE citation_key = ?
            """,
            (citation_key,),
        ).fetchone()
        if row is None:
            return None
        fields: OrderedDict[str, str] = OrderedDict()
        for role in ("author", "editor"):
            names = self._load_creator_names(citation_key, role)
            if names:
                fields[role] = " and ".join(names)
        for field_name in (
            "title",
            "year",
            "journal",
            "booktitle",
            "publisher",
            "abstract",
            "keywords",
            "url",
            "doi",
            "isbn",
        ):
            value = row[field_name]
            if value:
                fields[field_name] = str(value)
        extra_fields = json.loads(row["extra_fields_json"])
        for field_name in sorted(extra_fields):
            value = extra_fields[field_name]
            if value:
                fields[field_name] = str(value)
        for relation_type, field_name in (
            ("cites", "references"),
            ("cited_by", "cited_by"),
            ("crossref", "crossref"),
        ):
            values = self.get_relations(citation_key, relation_type)
            if values:
                fields[field_name] = ", ".join(values)
        return BibEntry(
            entry_type=str(row["entry_type"]),
            citation_key=str(row["citation_key"]),
            fields=dict(fields),
        )

    def _load_creator_names(self, citation_key: str, role: str) -> list[str]:
        rows = self.connection.execute(
            """
            SELECT c.full_name
            FROM entry_creators ec
            JOIN entries e ON e.id = ec.entry_id
            JOIN creators c ON c.id = ec.creator_id
            WHERE e.citation_key = ? AND ec.role = ?
            ORDER BY ec.ordinal
            """,
            (citation_key, role),
        ).fetchall()
        return [str(row["full_name"]) for row in rows]
def _split_names(value: str) -> list[str]:
    if not value:
@@ -305,8 +411,4 @@ def _split_relation_values(value: str) -> list[str]:
def _entry_to_bibtex(entry: BibEntry) -> str:
lines = [f"@{entry.entry_type}{{{entry.citation_key},"]
for key, value in entry.fields.items():
lines.append(f" {key} = {{{value}}},")
lines.append("}")
return "\n".join(lines)
return render_bibtex([entry])

tests/test_cli.py (new file, 61 lines)

@@ -0,0 +1,61 @@
from __future__ import annotations

import json
import os
import subprocess
import sys
from pathlib import Path

SAMPLE_BIB = """
@article{smith2024graphs,
  author = {Smith, Jane and Doe, Alex},
  title = {Graph-first bibliography augmentation},
  year = {2024},
  abstract = {We study citation graphs for literature discovery.},
  references = {miller2023search}
}

@inproceedings{miller2023search,
  author = {Miller, Sam},
  title = {Semantic search for research corpora},
  year = {2023},
  abstract = {Dense retrieval improves recall for academic search.}
}
"""


def run_cli(tmp_path: Path, *args: str) -> subprocess.CompletedProcess[str]:
    database = tmp_path / "library.sqlite3"
    # Inherit the parent environment so PATH and friends survive; only PYTHONPATH is added.
    env = {**os.environ, "PYTHONPATH": "src"}
    return subprocess.run(
        [sys.executable, "-m", "citegeist", "--db", str(database), *args],
        cwd=Path(__file__).resolve().parents[1],
        env=env,
        capture_output=True,
        text=True,
        check=False,
    )


def test_cli_ingest_show_search_and_export(tmp_path: Path):
    bib_path = tmp_path / "input.bib"
    bib_path.write_text(SAMPLE_BIB, encoding="utf-8")

    ingest = run_cli(tmp_path, "ingest", str(bib_path))
    assert ingest.returncode == 0
    assert "smith2024graphs" in ingest.stdout

    show = run_cli(tmp_path, "show", "smith2024graphs")
    assert show.returncode == 0
    payload = json.loads(show.stdout)
    assert payload["citation_key"] == "smith2024graphs"

    search = run_cli(tmp_path, "search", "semantic")
    assert search.returncode == 0
    assert "miller2023search" in search.stdout

    export_path = tmp_path / "exported.bib"
    export_result = run_cli(tmp_path, "export", "--output", str(export_path))
    assert export_result.returncode == 0
    exported = export_path.read_text(encoding="utf-8")
    assert "@article{smith2024graphs," in exported


@@ -51,3 +51,19 @@ def test_store_ingests_entries_relations_and_search_text():
        ]
    finally:
        store.close()


def test_store_exports_bibtex_from_normalized_rows():
    store = BibliographyStore()
    try:
        store.ingest_bibtex(SAMPLE_BIB)
        exported = store.export_bibtex()
        parsed = {entry.citation_key: entry for entry in parse_bibtex(exported)}
        assert "@article{smith2024graphs," in exported
        assert "@inproceedings{miller2023search," in exported
        assert parsed["smith2024graphs"].fields["author"] == "Smith, Jane and Doe, Alex"
        assert parsed["smith2024graphs"].fields["references"] == "miller2023search"
    finally:
        store.close()