citegeist
citegeist is a research-oriented bibliography workbench for building, expanding, and auditing BibTeX libraries.
The aim is not just to store citations. The aim is to help with the harder problem: finding, improving, connecting, and checking the literature around a topic while keeping BibTeX as a first-class output format.
Repo Description
citegeist is a BibTeX-native research tool for citation extraction, metadata enrichment, citation-graph expansion, and semantic search over scholarly sources.
Scope
The project is intended to support a workflow like this:
- Start from rough references extracted from papers, notes, syllabi, or dissertations.
- Convert them into draft BibTeX entries.
- Enrich and correct those entries using external scholarly metadata sources.
- Persist entries, identifiers, abstracts, and citation edges in a local database.
- Traverse the citation graph outward to discover additional relevant works.
- Search the local corpus semantically using abstracts and extracted full text.
- Export verified results back into BibTeX for LaTeX use.
Why A New Codebase
This repository starts cleanly rather than extending the older bib/ toolkit directly.
The older toolkit is useful as prior art:
- it demonstrates identifier-driven metadata augmentation;
- it caches PDFs and extracted plaintext;
- it shows one workable model for bibliography growth.
But it is not the right long-term base:
- it is Python 2-era code;
- it is shell-script centric;
- it does not provide a normalized database for graph workflows;
- it is not structured as a reusable Python 3 library.
citegeist keeps the useful ideas and rebuilds the foundation around a cleaner Python 3 package boundary.
Current Status
The initial repo includes:
- pybtex-backed BibTeX parsing and export in a repo-local virtual environment;
- a SQLite-backed bibliography store;
- a small CLI for ingest, search, inspection, and export;
- review-state tracking on entries and per-field ingest provenance;
- plaintext reference extraction into draft BibTeX for numbered, APA-like, wrapped-line, and simple book-style references;
- identifier-first metadata resolution for DOI, OpenAlex, DBLP, and arXiv-backed entries, with OpenAlex title-search fallback;
- local citation-graph traversal over stored cites, cited_by, and crossref edges;
- Crossref- and OpenAlex-backed graph expansion that materializes draft related works and edge provenance;
- a dedicated source-client layer with fixture/cache support for live-source development;
- normalized tables for entries, creators, identifiers, and citation relations;
- full-text-search-ready indexing over title, abstract, and fulltext when SQLite FTS5 is available;
- tests covering parsing, ingestion, relation storage, and search.
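As an illustration of the "when SQLite FTS5 is available" condition above, here is a minimal sketch of probing for FTS5 and falling back to plain LIKE matching. It is not citegeist's actual storage code: the table name entry_text, the column layout, and the helper names are assumptions made for this example.

```python
import sqlite3


def fts5_available(conn):
    """Probe whether this SQLite build can create FTS5 virtual tables."""
    try:
        conn.execute("CREATE VIRTUAL TABLE fts5_probe USING fts5(x)")
    except sqlite3.OperationalError:
        return False
    conn.execute("DROP TABLE fts5_probe")
    return True


def build_index(conn, rows):
    """Index (key, title, abstract, fulltext) rows; return True if FTS5 is used."""
    use_fts = fts5_available(conn)
    if use_fts:
        # Hypothetical schema: the key is stored but not tokenized.
        conn.execute(
            "CREATE VIRTUAL TABLE entry_text "
            "USING fts5(citation_key UNINDEXED, title, abstract, fulltext)"
        )
    else:
        conn.execute(
            "CREATE TABLE entry_text "
            "(citation_key TEXT, title TEXT, abstract TEXT, fulltext TEXT)"
        )
    conn.executemany("INSERT INTO entry_text VALUES (?, ?, ?, ?)", rows)
    return use_fts


def search(conn, query, use_fts):
    """Return matching citation keys; query is a single plain word here."""
    if use_fts:
        sql = "SELECT citation_key FROM entry_text WHERE entry_text MATCH ? ORDER BY rank"
        params = (query,)
    else:
        like = f"%{query}%"
        sql = (
            "SELECT citation_key FROM entry_text "
            "WHERE title LIKE ? OR abstract LIKE ? OR fulltext LIKE ?"
        )
        params = (like, like, like)
    return [row[0] for row in conn.execute(sql, params)]


conn = sqlite3.connect(":memory:")
use_fts = build_index(
    conn,
    [
        ("smith2024graphs", "Graph-first bibliography augmentation",
         "We study citation graphs for literature discovery.", ""),
        ("miller2023search", "Semantic search for research corpora",
         "Dense retrieval improves recall for academic search.", ""),
    ],
)
hits = search(conn, "semantic", use_fts)
print(hits)
conn.close()
```

The fallback keeps search usable on SQLite builds without FTS5, at the cost of ranking and tokenization. Queries containing FTS5 operators or punctuation would need quoting before being passed to MATCH.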
The prioritized execution plan lives in ROADMAP.md.
Layout
citegeist/
  src/citegeist/
    bibtex.py
    storage.py
  tests/
    test_storage.py
  pyproject.toml
Quick Start
cd citegeist
python3 -m virtualenv --always-copy .venv
.venv/bin/pip install -e .
.venv/bin/pip install pytest
mkdir -p .cache/citegeist
PYTHONPATH=src .venv/bin/python - <<'PY'
from citegeist import BibliographyStore
bib = """
@article{smith2024graphs,
author = {Smith, Jane and Doe, Alex},
title = {Graph-first bibliography augmentation},
year = {2024},
abstract = {We study citation graphs for literature discovery.},
references = {miller2023search}
}
@inproceedings{miller2023search,
author = {Miller, Sam},
title = {Semantic search for research corpora},
year = {2023},
abstract = {Dense retrieval improves recall for academic search.}
}
"""
store = BibliographyStore("library.sqlite3")
store.ingest_bibtex(bib)
print(store.get_relations("smith2024graphs"))
print(store.search_text("semantic"))
store.close()
PY
.venv/bin/python -m pytest -q
Or use the CLI directly:
cd citegeist
PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 ingest references.bib
PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 search "semantic search"
PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 show --provenance smith2024graphs
PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 set-status smith2024graphs reviewed
PYTHONPATH=src .venv/bin/python -m citegeist extract references.txt --output draft.bib
PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 resolve smith2024graphs
PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 graph smith2024graphs --relation cites --depth 2 --missing-only
PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 expand smith2024graphs --source crossref
PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 expand smith2024graphs --source openalex --relation cited_by --limit 10
PYTHONPATH=src .venv/bin/python -m citegeist --db library.sqlite3 export --output reviewed.bib
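The graph command above takes a relation and a depth. A depth-limited, breadth-first walk over stored edges could look roughly like the sketch below. The relations table, its columns, and the key lee2021index are hypothetical and chosen for illustration; the real schema and traversal code may differ.

```python
import sqlite3
from collections import deque


def traverse(conn, start, relation="cites", depth=2):
    """Breadth-first walk over stored edges; returns {key: distance from start}."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        key = queue.popleft()
        if seen[key] >= depth:
            continue  # do not expand nodes already at the depth limit
        rows = conn.execute(
            "SELECT target FROM relations "
            "WHERE source = ? AND relation = ? ORDER BY target",
            (key, relation),
        ).fetchall()
        for (target,) in rows:
            if target not in seen:
                seen[target] = seen[key] + 1
                queue.append(target)
    del seen[start]
    return seen


conn = sqlite3.connect(":memory:")
# Hypothetical edge table: one row per (source, relation, target).
conn.execute("CREATE TABLE relations (source TEXT, relation TEXT, target TEXT)")
conn.executemany(
    "INSERT INTO relations VALUES (?, ?, ?)",
    [
        ("smith2024graphs", "cites", "miller2023search"),
        ("miller2023search", "cites", "lee2021index"),  # hypothetical key
        ("lee2021index", "cites", "smith2024graphs"),   # cycle back to the start
    ],
)
one_hop = traverse(conn, "smith2024graphs", depth=1)
two_hops = traverse(conn, "smith2024graphs", depth=2)
print(one_hop, two_hops)
conn.close()
```

Tracking visited keys keeps the walk finite when the citation graph contains cycles.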
For live-source development, prefer fixture-backed or cache-backed source clients so resolver and expansion work can be exercised repeatedly without re-hitting upstream APIs on every run.
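A cache-backed client of the kind described above can be sketched as a thin wrapper that stores each response on disk, keyed by a hash of the request URL. The class name CachedSource, the fetch callable, and the example DOI are assumptions for this sketch, not citegeist's actual source-client API.

```python
import hashlib
import json
import tempfile
from pathlib import Path


class CachedSource:
    """Wrap a fetch function so each URL is requested upstream at most once."""

    def __init__(self, cache_dir, fetch):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.fetch = fetch  # callable: url -> JSON-serializable dict

    def _path(self, url):
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return self.cache_dir / f"{digest}.json"

    def get(self, url):
        path = self._path(url)
        if path.exists():
            # Cache hit: no upstream request is made.
            return json.loads(path.read_text(encoding="utf-8"))
        payload = self.fetch(url)
        path.write_text(json.dumps(payload), encoding="utf-8")
        return payload


calls = []


def fake_fetch(url):
    """Stand-in for a real HTTP request; records each upstream call."""
    calls.append(url)
    return {"url": url, "title": "Graph-first bibliography augmentation"}


with tempfile.TemporaryDirectory() as tmp:
    source = CachedSource(tmp, fake_fetch)
    url = "https://api.crossref.org/works/10.1000/example"  # placeholder DOI
    first = source.get(url)
    second = source.get(url)  # served from disk

print(len(calls), first == second)
```

Passing the fetch function in makes it easy to swap a fixture in for a real HTTP call during tests.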
Live-source workflow:
cd citegeist
export CITEGEIST_SOURCE_CACHE=.cache/citegeist
export CITEGEIST_LIVE_TESTS=1
PYTHONPATH=src .venv/bin/python -m pytest -m live -q
PYTHONPATH=src .venv/bin/python scripts/live_smoke.py
By default, live tests are skipped. They only run when CITEGEIST_LIVE_TESTS=1 is set.
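One common way to get this opt-in behavior with pytest is a collection hook in conftest.py. The sketch below uses the environment variable and the live marker from the commands above; the helper name and the placement in conftest.py are assumptions, and the repository's own mechanism may differ.

```python
import os


def live_tests_enabled(environ=os.environ):
    """Return True only when CITEGEIST_LIVE_TESTS is set to exactly "1"."""
    return environ.get("CITEGEIST_LIVE_TESTS") == "1"


def pytest_collection_modifyitems(config, items):
    """Skip tests marked `live` unless live tests are explicitly enabled."""
    # Imported here only so this sketch can be run without pytest installed;
    # a real conftest.py would import pytest at module level.
    import pytest

    if live_tests_enabled():
        return
    skip_live = pytest.mark.skip(reason="set CITEGEIST_LIVE_TESTS=1 to run live tests")
    for item in items:
        if "live" in item.keywords:
            item.add_marker(skip_live)


enabled = live_tests_enabled({"CITEGEIST_LIVE_TESTS": "1"})
disabled = live_tests_enabled({})
print(enabled, disabled)
```

With this approach the live marker would also be registered under the pytest markers setting in pyproject.toml, so that `-m live` does not trigger unknown-marker warnings.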
Near-Term Priorities
- additional resolvers and expansion paths for non-DOI scholarly ecosystems.
See ROADMAP.md for the prioritized phase plan and rationale.
Naming
The name is intended to be short, distinct, and memorable:
- cite for citation work;
- geist for the organizing intelligence around the literature.