Add CLI, roadmap, and pybtex integration
parent 4f3ac4decb
commit ac405943fb
.gitignore
@@ -1,4 +1,5 @@
 __pycache__/
 .pytest_cache/
+.venv/
 *.pyc
 library.sqlite3
README.md
@@ -43,12 +43,15 @@ But it is not the right long-term base:

 The initial repo includes:

-- a lightweight BibTeX parser for structured ingestion;
+- `pybtex`-backed BibTeX parsing and export in a repo-local virtual environment;
 - a SQLite-backed bibliography store;
+- a small CLI for ingest, search, inspection, and export;
 - normalized tables for entries, creators, identifiers, and citation relations;
 - full-text-search-ready indexing over title, abstract, and fulltext when SQLite FTS5 is available;
 - tests covering parsing, ingestion, relation storage, and search.

+The prioritized execution plan lives in [ROADMAP.md](./ROADMAP.md).
+
 ## Layout

 ```text
@@ -65,7 +68,10 @@ citegeist/

 ```bash
 cd citegeist
-PYTHONPATH=src python3 - <<'PY'
+python3 -m virtualenv --always-copy .venv
+.venv/bin/pip install -e .
+.venv/bin/pip install pytest
+PYTHONPATH=src .venv/bin/python - <<'PY'
 from citegeist import BibliographyStore

 bib = """
@@ -91,17 +97,26 @@ print(store.get_relations("smith2024graphs"))
 print(store.search_text("semantic"))
 store.close()
 PY
-pytest -q
+.venv/bin/python -m pytest -q
 ```

-## Planned Work
+Or use the CLI directly:

-- parse references from raw prose, OCR, PDF text, and bibliography sections into draft BibTeX;
-- add modern metadata resolvers for DOI, Crossref, DBLP, arXiv, and similar sources;
-- track provenance and confidence for enriched fields;
-- add graph expansion workflows over `cites` and `cited_by` edges;
-- support acquisition pipelines for open-access theses, dissertations, preprints, and publisher metadata pages;
-- add embeddings or pluggable semantic indexing beyond SQLite FTS.
+```bash
+cd citegeist
+PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 ingest references.bib
+PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 search "semantic search"
+PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 show smith2024graphs
+PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 export --output reviewed.bib
+```
+
+## Near-Term Priorities
+
+- provenance tracking and entry review states;
+- plaintext reference extraction into draft BibTeX;
+- metadata resolvers for DOI, Crossref, DBLP, and arXiv.
+
+See [ROADMAP.md](./ROADMAP.md) for the prioritized phase plan and rationale.

 ## Naming

ROADMAP.md
@@ -0,0 +1,187 @@
+# Roadmap
+
+This roadmap prioritizes a usable local research workflow over breadth of integrations.
+
+The first objective is not to support every metadata source. The first objective is to make one end-to-end path work reliably:
+
+1. ingest draft references,
+2. normalize and store them,
+3. enrich them,
+4. traverse citation links,
+5. export reviewed BibTeX.
+
+## Prioritization Principles
+
+- prioritize steps that make the system usable by a single researcher on a local machine;
+- prioritize deterministic infrastructure before network integrations;
+- keep every stage inspectable and auditable;
+- treat verification and provenance as core features, not cleanup work;
+- defer heavy semantic infrastructure until the local corpus model is stable.
+
+## Current Baseline
+
+Completed:
+
+- lightweight BibTeX parsing;
+- SQLite storage for entries, creators, identifiers, and relations;
+- local text search using SQLite FTS5 when available;
+- tests for ingest, relation storage, and search.
+
+## Phase 1: Core Ingestion And Export
+
+Priority: P0
+
+Goal:
+Make `citegeist` useful as a local BibTeX workbench even before online enrichment is added.
+
+Tasks:
+
+- add BibTeX export from the normalized database back into stable, readable BibTeX;
+- add a small CLI for `ingest`, `show`, `search`, and `export`;
+- store field provenance metadata alongside imported and edited fields;
+- add schema support for entry status such as `draft`, `enriched`, `reviewed`, and `exported`;
+- add fixture-driven tests for round-tripping BibTeX through ingest and export.
+
+Why this comes first:
+
+- without export, the project is not yet useful in a LaTeX workflow;
+- without a CLI, the package is a library demo rather than a tool;
+- without provenance and state, later enrichment work becomes hard to audit.
+
+Exit criteria:
+
+- a user can ingest a `.bib` file, inspect entries, search locally, and export a reviewed `.bib`;
+- round-trip tests show no unexpected field loss for supported entry types.
+
+## Phase 2: Reference Extraction
+
+Priority: P0
+
+Goal:
+Turn raw reference text into draft entries that can enter the main pipeline.
+
+Tasks:
+
+- add parsers for bibliography-section lines and plain-text reference lists;
+- define a draft-entry schema for incomplete references with confidence markers;
+- support ingestion of OCR- or PDF-derived plaintext bibliography sections;
+- add normalization for author names, years, title casing, and page ranges;
+- build gold-test fixtures from real, messy reference examples.
+
+Why this is next:
+
+- this addresses the project’s first unique bottleneck: getting rough references into structured form;
+- enrichment is much more effective once draft references are normalized.
+
+Exit criteria:
+
+- a user can pass a plaintext bibliography section and receive draft BibTeX entries with unresolved fields clearly marked;
+- tests cover common article, book, chapter, and proceedings references.
+
+## Phase 3: Metadata Enrichment
+
+Priority: P1
+
+Goal:
+Resolve draft or partial entries against external scholarly sources and merge improved metadata safely.
+
+Tasks:
+
+- define a resolver interface with deterministic merge rules;
+- implement first-party resolvers for DOI/Crossref, DBLP, and arXiv;
+- add identifier-first resolution, then title/author/year fallback search;
+- store merge provenance per field and resolution attempt logs;
+- flag conflicts rather than silently overwriting disputed values.
+
+Why this is P1 rather than the first phase:
+
+- enrichment quality depends on the ingestion and provenance model being correct first;
+- it is easier to test deterministic merge behavior once local workflows already exist.
+
+Exit criteria:
+
+- an incomplete entry can be enriched from at least one authoritative source;
+- conflicting fields remain visible for review instead of being lost.
+
+## Phase 4: Citation Graph Expansion
+
+Priority: P1
+
+Goal:
+Use citation edges as a discovery engine rather than just metadata storage.
+
+Tasks:
+
+- support explicit `cites` and `cited_by` edge ingestion with source provenance;
+- add graph expansion commands starting from one or more seed entries;
+- track edge discovery source, timestamp, and confidence;
+- add filters for depth, source type, year range, and reviewed status;
+- expose unresolved nodes so the user can decide what to enrich next.
+
+Why this matters:
+
+- this is central to literature discovery rather than mere bibliography cleanup;
+- it turns the database into a research navigation tool.
+
+Exit criteria:
+
+- starting from one or more seed entries, a user can expand outward through citation edges and persist newly discovered nodes;
+- graph traversal results can be exported as BibTeX candidates for review.
+
+## Phase 5: Search And Ranking
+
+Priority: P2
+
+Goal:
+Improve discovery quality inside the local corpus.
+
+Tasks:
+
+- refine FTS ranking across title, abstract, keywords, and fulltext;
+- add saved search queries and result filters;
+- add optional embedding-backed semantic search behind a pluggable interface;
+- support hybrid ranking that combines lexical matching, identifiers, and citation proximity;
+- add benchmarking fixtures for retrieval quality on a few research topics.
+
+Why this is later:
+
+- FTS is already enough to support early workflows;
+- embedding infrastructure is expensive and should wait until the corpus schema stabilizes.
+
+Exit criteria:
+
+- local search is useful on realistic corpora without requiring external services;
+- semantic indexing is optional and does not displace the simpler local search path.
+
+## Phase 6: Corpus Acquisition Pipelines
+
+Priority: P2
+
+Goal:
+Broaden source acquisition without mixing that complexity into the core model.
+
+Tasks:
+
+- add source adapters for open-access theses and dissertation repositories;
+- add support for harvesting publisher citation pages and preprint metadata pages;
+- define per-source import provenance and rate-limit behavior;
+- separate source-specific scraping logic from normalized entry storage;
+- add regression fixtures for representative public sources.
+
+Why this is later:
+
+- acquisition breadth is useful, but only after the core ingest/enrich/review loop is solid;
+- source adapters are brittle and should sit on top of a stable model.
+
+Exit criteria:
+
+- new public corpora can be imported through adapters without changing the storage core;
+- imported entries retain their source provenance and can be reviewed like any other entry.
+
+## Suggested Next Three Tasks
+
+1. Add a CLI module with `ingest`, `search`, `show`, and `export`.
+2. Implement BibTeX export from the normalized store.
+3. Add provenance tables and entry review status fields.
+
+These three tasks complete the first usable local workflow and should be treated as the immediate sprint.
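As one concrete reading of the Phase 1 provenance and status tasks, a minimal schema sketch is shown below. Every table and column name in it is a hypothetical illustration; this commit defines no such tables.

```python
# Hypothetical Phase 1 schema sketch: per-field provenance plus an entry
# review state. Names are illustrative only, not code from this commit.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entries (id INTEGER PRIMARY KEY, citation_key TEXT UNIQUE NOT NULL);

-- one row per (entry, field, source), so later enrichment stays auditable
CREATE TABLE field_provenance (
    entry_id    INTEGER NOT NULL REFERENCES entries(id),
    field_name  TEXT    NOT NULL,
    source      TEXT    NOT NULL,  -- e.g. 'bibtex-import', 'manual-edit', 'crossref'
    recorded_at TEXT    NOT NULL DEFAULT (datetime('now'))
);

-- review state follows the lifecycle named in Phase 1
CREATE TABLE entry_status (
    entry_id INTEGER PRIMARY KEY REFERENCES entries(id),
    status   TEXT NOT NULL DEFAULT 'draft'
        CHECK (status IN ('draft', 'enriched', 'reviewed', 'exported'))
);
""")
conn.close()
```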
pyproject.toml
@@ -7,6 +7,10 @@ name = "citegeist"
 version = "0.1.0"
 description = "BibTeX-native tooling for bibliography augmentation, citation graphs, and search"
 requires-python = ">=3.10"
+dependencies = ["pybtex==0.25.1"]
+
+[project.scripts]
+citegeist = "citegeist.cli:main"

 [tool.pytest.ini_options]
 pythonpath = ["src"]
src/citegeist/__main__.py
@@ -0,0 +1,4 @@
+from .cli import main
+
+
+raise SystemExit(main())
src/citegeist/bibtex.py
@@ -1,6 +1,14 @@
 from __future__ import annotations

 from dataclasses import dataclass
+from io import StringIO
+
+try:
+    from pybtex.database import BibliographyData, Entry, Person, parse_string
+    from pybtex.database.output.bibtex import Writer
+except ImportError:  # pragma: no cover - exercised only outside the configured venv
+    BibliographyData = Entry = Person = Writer = None
+    parse_string = None


 @dataclass(slots=True)
@@ -11,148 +19,42 @@ class BibEntry:


 def parse_bibtex(text: str) -> list[BibEntry]:
+    _require_pybtex()
+    bibliography = parse_string(text, bib_format="bibtex")
     entries: list[BibEntry] = []
-    index = 0
-    size = len(text)
-
-    while index < size:
-        at = text.find("@", index)
-        if at == -1:
-            break
-        entry_type_start = at + 1
-        brace = text.find("{", entry_type_start)
-        if brace == -1:
-            raise ValueError("Malformed BibTeX: missing opening brace")
-        entry_type = text[entry_type_start:brace].strip().lower()
-        body, index = _read_balanced_block(text, brace)
-        citation_key, fields_blob = _split_key_and_fields(body)
+    for citation_key, entry in bibliography.entries.items():
+        fields = dict(entry.fields.items())
+        for role, persons in entry.persons.items():
+            fields[role] = " and ".join(str(person) for person in persons)
         entries.append(
             BibEntry(
-                entry_type=entry_type,
+                entry_type=entry.type,
                 citation_key=citation_key,
-                fields=_parse_fields(fields_blob),
+                fields=fields,
             )
         )
     return entries


-def _read_balanced_block(text: str, brace_index: int) -> tuple[str, int]:
-    depth = 0
-    in_quotes = False
-    escaped = False
-
-    for index in range(brace_index, len(text)):
-        char = text[index]
-        if in_quotes:
-            if escaped:
-                escaped = False
-            elif char == "\\":
-                escaped = True
-            elif char == '"':
-                in_quotes = False
-            continue
-
-        if char == '"':
-            in_quotes = True
-        elif char == "{":
-            depth += 1
-        elif char == "}":
-            depth -= 1
-            if depth == 0:
-                return text[brace_index + 1:index], index + 1
-
-    raise ValueError("Malformed BibTeX: unbalanced braces")
-
-
-def _split_key_and_fields(body: str) -> tuple[str, str]:
-    depth = 0
-    in_quotes = False
-    escaped = False
-
-    for index, char in enumerate(body):
-        if in_quotes:
-            if escaped:
-                escaped = False
-            elif char == "\\":
-                escaped = True
-            elif char == '"':
-                in_quotes = False
-            continue
-
-        if char == '"':
-            in_quotes = True
-        elif char == "{":
-            depth += 1
-        elif char == "}":
-            depth -= 1
-        elif char == "," and depth == 0:
-            return body[:index].strip(), body[index + 1 :]
-
-    return body.strip(), ""
-
-
-def _parse_fields(blob: str) -> dict[str, str]:
-    fields: dict[str, str] = {}
-    index = 0
-    size = len(blob)
-
-    while index < size:
-        while index < size and blob[index] in " \t\r\n,":
-            index += 1
-        if index >= size:
-            break
-
-        name_start = index
-        while index < size and (blob[index].isalnum() or blob[index] in "-_"):
-            index += 1
-        name = blob[name_start:index].strip().lower()
-
-        while index < size and blob[index].isspace():
-            index += 1
-        if index >= size or blob[index] != "=":
-            raise ValueError(f"Malformed BibTeX field near: {blob[name_start:]!r}")
-        index += 1
-
-        while index < size and blob[index].isspace():
-            index += 1
-        value, index = _parse_value(blob, index)
-        fields[name] = " ".join(value.split())
-
-        while index < size and blob[index] in " \t\r\n,":
-            index += 1
-
-    return fields
-
-
-def _parse_value(blob: str, index: int) -> tuple[str, int]:
-    if index >= len(blob):
-        return "", index
-
-    if blob[index] == "{":
-        value, next_index = _read_balanced_block(blob, index)
-        return value.strip(), next_index
-
-    if blob[index] == '"':
-        index += 1
-        chars: list[str] = []
-        escaped = False
-        while index < len(blob):
-            char = blob[index]
-            if escaped:
-                chars.append(char)
-                escaped = False
-            elif char == "\\":
-                chars.append(char)
-                escaped = True
-            elif char == '"':
-                return "".join(chars).strip(), index + 1
-            else:
-                chars.append(char)
-            index += 1
-        raise ValueError("Malformed BibTeX: unterminated quoted string")
-
-    end = index
-    while end < len(blob) and blob[end] not in ",\r\n":
-        end += 1
-    return blob[index:end].strip(), end
+def render_bibtex(entries: list[BibEntry]) -> str:
+    _require_pybtex()
+    bibliography_entries = {}
+    for entry in entries:
+        fields = {key: value for key, value in entry.fields.items() if key not in {"author", "editor"}}
+        persons = {}
+        for role in ("author", "editor"):
+            raw_names = entry.fields.get(role)
+            if raw_names:
+                persons[role] = [Person(name.strip()) for name in raw_names.split(" and ") if name.strip()]
+        bibliography_entries[entry.citation_key] = Entry(entry.entry_type, fields=fields, persons=persons)
+
+    buffer = StringIO()
+    Writer().write_stream(BibliographyData(entries=bibliography_entries), buffer)
+    return buffer.getvalue().strip()
+
+
+def _require_pybtex() -> None:
+    if parse_string is None or Writer is None:
+        raise RuntimeError(
+            "pybtex is required. Use the repo-local virtual environment under .venv/ for citegeist commands."
+        )
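For orientation, a minimal round trip through the two public functions added here; run it inside the repo-local `.venv`, since both call `_require_pybtex()`. The sample entry is illustrative:

```python
from citegeist.bibtex import BibEntry, parse_bibtex, render_bibtex

entry = BibEntry(
    entry_type="article",
    citation_key="smith2024graphs",
    fields={"author": "Smith, Jane and Doe, Alex", "year": "2024",
            "title": "Graph-first bibliography augmentation"},
)
rendered = render_bibtex([entry])   # pybtex-formatted BibTeX text
parsed = parse_bibtex(rendered)[0]  # back through pybtex's parser
assert parsed.fields["author"] == "Smith, Jane and Doe, Alex"
```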
src/citegeist/cli.py
@@ -0,0 +1,91 @@
+from __future__ import annotations
+
+import argparse
+import json
+import sys
+from pathlib import Path
+
+from .storage import BibliographyStore
+
+
+def build_parser() -> argparse.ArgumentParser:
+    parser = argparse.ArgumentParser(prog="citegeist")
+    parser.add_argument("--db", default="library.sqlite3", help="Path to the SQLite database")
+
+    subparsers = parser.add_subparsers(dest="command", required=True)
+
+    ingest_parser = subparsers.add_parser("ingest", help="Ingest BibTeX into the database")
+    ingest_parser.add_argument("input", help="BibTeX file to ingest")
+
+    search_parser = subparsers.add_parser("search", help="Search titles, abstracts, and fulltext")
+    search_parser.add_argument("query", help="Search query")
+    search_parser.add_argument("--limit", type=int, default=10, help="Maximum number of results")
+
+    show_parser = subparsers.add_parser("show", help="Show one entry or list entries")
+    show_parser.add_argument("citation_key", nargs="?", help="Citation key to show")
+    show_parser.add_argument("--limit", type=int, default=20, help="Maximum entries when listing")
+
+    export_parser = subparsers.add_parser("export", help="Export entries as BibTeX")
+    export_parser.add_argument("citation_keys", nargs="*", help="Optional citation keys to export")
+    export_parser.add_argument("--output", help="Write BibTeX to a file instead of stdout")
+
+    return parser
+
+
+def main(argv: list[str] | None = None) -> int:
+    parser = build_parser()
+    args = parser.parse_args(argv)
+
+    store = BibliographyStore(args.db)
+    try:
+        if args.command == "ingest":
+            return _run_ingest(store, Path(args.input))
+        if args.command == "search":
+            return _run_search(store, args.query, args.limit)
+        if args.command == "show":
+            return _run_show(store, args.citation_key, args.limit)
+        if args.command == "export":
+            return _run_export(store, args.citation_keys, args.output)
+    finally:
+        store.close()
+
+    parser.error(f"Unknown command: {args.command}")
+    return 2
+
+
+def _run_ingest(store: BibliographyStore, input_path: Path) -> int:
+    text = input_path.read_text(encoding="utf-8")
+    keys = store.ingest_bibtex(text)
+    for key in keys:
+        print(key)
+    return 0
+
+
+def _run_search(store: BibliographyStore, query: str, limit: int) -> int:
+    for row in store.search_text(query, limit=limit):
+        score = row.get("score", 0.0)
+        print(f"{row['citation_key']}\t{row.get('year') or ''}\t{score:.3f}\t{row.get('title') or ''}")
+    return 0
+
+
+def _run_show(store: BibliographyStore, citation_key: str | None, limit: int) -> int:
+    if citation_key:
+        entry = store.get_entry(citation_key)
+        if entry is None:
+            print(f"Entry not found: {citation_key}", file=sys.stderr)
+            return 1
+        print(json.dumps(entry, indent=2, sort_keys=True))
+        return 0
+
+    print(json.dumps(store.list_entries(limit=limit), indent=2))
+    return 0
+
+
+def _run_export(store: BibliographyStore, citation_keys: list[str], output: str | None) -> int:
+    rendered = store.export_bibtex(citation_keys or None)
+    if output:
+        Path(output).write_text(rendered + ("\n" if rendered else ""), encoding="utf-8")
+    else:
+        if rendered:
+            print(rendered)
+    return 0
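A quick sketch of how the argparse wiring above behaves; `parse_args` only builds the namespace here and touches no database:

```python
from citegeist.cli import build_parser

args = build_parser().parse_args(
    ["--db", "library.sqlite3", "search", "semantic", "--limit", "5"]
)
assert (args.command, args.query, args.limit) == ("search", "semantic", 5)
```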
src/citegeist/storage.py
@@ -2,9 +2,10 @@ from __future__ import annotations

 import json
 import sqlite3
+from collections import OrderedDict
 from pathlib import Path

-from .bibtex import BibEntry, parse_bibtex
+from .bibtex import BibEntry, parse_bibtex, render_bibtex

 IDENTIFIER_FIELDS = ("doi", "isbn", "issn", "pmid", "arxiv", "dblp", "oai", "url")
 RELATION_FIELDS = {
@@ -268,6 +269,41 @@ class BibliographyStore:
         ).fetchone()
         return dict(row) if row else None

+    def list_entries(self, limit: int = 50) -> list[dict[str, object]]:
+        rows = self.connection.execute(
+            """
+            SELECT citation_key, entry_type, title, year
+            FROM entries
+            ORDER BY COALESCE(year, ''), citation_key
+            LIMIT ?
+            """,
+            (limit,),
+        ).fetchall()
+        return [dict(row) for row in rows]
+
+    def get_entry_bibtex(self, citation_key: str) -> str | None:
+        entry = self._load_bib_entry(citation_key)
+        if entry is None:
+            return None
+        return render_bibtex([entry])
+
+    def export_bibtex(self, citation_keys: list[str] | None = None) -> str:
+        if citation_keys is None:
+            rows = self.connection.execute(
+                "SELECT citation_key FROM entries ORDER BY COALESCE(year, ''), citation_key"
+            ).fetchall()
+            citation_keys = [str(row["citation_key"]) for row in rows]
+
+        chunks: list[str] = []
+        entries: list[BibEntry] = []
+        for citation_key in citation_keys:
+            entry = self._load_bib_entry(citation_key)
+            if entry is not None:
+                entries.append(entry)
+        if not entries:
+            return ""
+        return render_bibtex(entries)
+
     def _detect_fts5(self) -> bool:
         try:
             self.connection.execute("CREATE VIRTUAL TABLE temp.fts_probe USING fts5(content)")
@@ -276,6 +312,76 @@ class BibliographyStore:
         except sqlite3.OperationalError:
             return False

+    def _load_bib_entry(self, citation_key: str) -> BibEntry | None:
+        row = self.connection.execute(
+            """
+            SELECT citation_key, entry_type, title, year, journal, booktitle, publisher,
+                   abstract, keywords, url, doi, isbn, extra_fields_json
+            FROM entries
+            WHERE citation_key = ?
+            """,
+            (citation_key,),
+        ).fetchone()
+        if row is None:
+            return None
+
+        fields: OrderedDict[str, str] = OrderedDict()
+        for role in ("author", "editor"):
+            names = self._load_creator_names(citation_key, role)
+            if names:
+                fields[role] = " and ".join(names)
+
+        for field_name in (
+            "title",
+            "year",
+            "journal",
+            "booktitle",
+            "publisher",
+            "abstract",
+            "keywords",
+            "url",
+            "doi",
+            "isbn",
+        ):
+            value = row[field_name]
+            if value:
+                fields[field_name] = str(value)
+
+        extra_fields = json.loads(row["extra_fields_json"])
+        for field_name in sorted(extra_fields):
+            value = extra_fields[field_name]
+            if value:
+                fields[field_name] = str(value)
+
+        for relation_type, field_name in (
+            ("cites", "references"),
+            ("cited_by", "cited_by"),
+            ("crossref", "crossref"),
+        ):
+            values = self.get_relations(citation_key, relation_type)
+            if values:
+                fields[field_name] = ", ".join(values)
+
+        return BibEntry(
+            entry_type=str(row["entry_type"]),
+            citation_key=str(row["citation_key"]),
+            fields=dict(fields),
+        )
+
+    def _load_creator_names(self, citation_key: str, role: str) -> list[str]:
+        rows = self.connection.execute(
+            """
+            SELECT c.full_name
+            FROM entry_creators ec
+            JOIN entries e ON e.id = ec.entry_id
+            JOIN creators c ON c.id = ec.creator_id
+            WHERE e.citation_key = ? AND ec.role = ?
+            ORDER BY ec.ordinal
+            """,
+            (citation_key, role),
+        ).fetchall()
+        return [str(row["full_name"]) for row in rows]
+

 def _split_names(value: str) -> list[str]:
     if not value:
@@ -305,8 +411,4 @@ def _split_relation_values(value: str) -> list[str]:


 def _entry_to_bibtex(entry: BibEntry) -> str:
-    lines = [f"@{entry.entry_type}{{{entry.citation_key},"]
-    for key, value in entry.fields.items():
-        lines.append(f"  {key} = {{{value}}},")
-    lines.append("}")
-    return "\n".join(lines)
+    return render_bibtex([entry])
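Putting the new storage pieces together, a minimal sketch of the export path; this assumes `BibliographyStore()` with no arguments opens an in-memory database, as the new test below relies on:

```python
from citegeist.storage import BibliographyStore

store = BibliographyStore()
try:
    store.ingest_bibtex(
        "@article{smith2024graphs, title = {Graph-first bibliography augmentation}, year = {2024}}"
    )
    print(store.get_entry_bibtex("smith2024graphs"))  # one entry rendered via pybtex
    print(store.export_bibtex())                      # whole library, ordered by year then key
finally:
    store.close()
```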
tests/test_cli.py
@@ -0,0 +1,61 @@
+from __future__ import annotations
+
+import json
+import subprocess
+import sys
+from pathlib import Path
+
+
+SAMPLE_BIB = """
+@article{smith2024graphs,
+  author = {Smith, Jane and Doe, Alex},
+  title = {Graph-first bibliography augmentation},
+  year = {2024},
+  abstract = {We study citation graphs for literature discovery.},
+  references = {miller2023search}
+}
+
+@inproceedings{miller2023search,
+  author = {Miller, Sam},
+  title = {Semantic search for research corpora},
+  year = {2023},
+  abstract = {Dense retrieval improves recall for academic search.}
+}
+"""
+
+
+def run_cli(tmp_path: Path, *args: str) -> subprocess.CompletedProcess[str]:
+    database = tmp_path / "library.sqlite3"
+    env = {"PYTHONPATH": "src"}
+    return subprocess.run(
+        [sys.executable, "-m", "citegeist", "--db", str(database), *args],
+        cwd=Path(__file__).resolve().parents[1],
+        env=env,
+        capture_output=True,
+        text=True,
+        check=False,
+    )
+
+
+def test_cli_ingest_show_search_and_export(tmp_path: Path):
+    bib_path = tmp_path / "input.bib"
+    bib_path.write_text(SAMPLE_BIB, encoding="utf-8")
+
+    ingest = run_cli(tmp_path, "ingest", str(bib_path))
+    assert ingest.returncode == 0
+    assert "smith2024graphs" in ingest.stdout
+
+    show = run_cli(tmp_path, "show", "smith2024graphs")
+    assert show.returncode == 0
+    payload = json.loads(show.stdout)
+    assert payload["citation_key"] == "smith2024graphs"
+
+    search = run_cli(tmp_path, "search", "semantic")
+    assert search.returncode == 0
+    assert "miller2023search" in search.stdout
+
+    export_path = tmp_path / "exported.bib"
+    export_result = run_cli(tmp_path, "export", "--output", str(export_path))
+    assert export_result.returncode == 0
+    exported = export_path.read_text(encoding="utf-8")
+    assert "@article{smith2024graphs," in exported
tests/test_storage.py
@@ -51,3 +51,19 @@ def test_store_ingests_entries_relations_and_search_text():
         ]
     finally:
         store.close()
+
+
+def test_store_exports_bibtex_from_normalized_rows():
+    store = BibliographyStore()
+    try:
+        store.ingest_bibtex(SAMPLE_BIB)
+
+        exported = store.export_bibtex()
+        parsed = {entry.citation_key: entry for entry in parse_bibtex(exported)}
+
+        assert "@article{smith2024graphs," in exported
+        assert "@inproceedings{miller2023search," in exported
+        assert parsed["smith2024graphs"].fields["author"] == "Smith, Jane and Doe, Alex"
+        assert parsed["smith2024graphs"].fields["references"] == "miller2023search"
+    finally:
+        store.close()