# Open Bibliographic Source Landscape

This document answers the question that prompted the recent planning work: which additional open bibliographic sources are worth incorporating into CiteGeist, given the sources it already uses?

## Current Baseline

CiteGeist already has useful source coverage for a local BibTeX-first workflow:

- `Crossref`: DOI lookup, title search, and reference-list expansion.
- `OpenAlex`: work lookup, title/discovery search, and citation-graph expansion.
- `PubMed`: authoritative biomedical metadata lookup by PMID, with title search as a fallback.
- `Europe PMC`: biomedical metadata and full-text complement to PubMed.
- `Semantic Scholar`: broad cross-domain metadata with strong biological and physical sciences coverage.
- `DataCite`: DOI-backed dataset/report/non-article metadata.
- `DBLP`: strong computer-science metadata.
- `arXiv`: preprint metadata.
- `OAI-PMH`: repository harvesting for theses, dissertations, and institutional collections.

That means the immediate gap is no longer “get any scholarly metadata at all.” The immediate gap is to add the next highest-value open sources without destabilizing the existing ingest, review, and export pipeline.

## Recommended Priorities

### OpenCitations

Why:

- It directly improves open citation-edge coverage.
- It fits CiteGeist's graph-discovery workflow better than another generic metadata source.
- It complements OpenAlex rather than replacing it.

Expected role:

- DOI-to-citations lookup
- DOI-to-references lookup
- provenance for citation edges

Status:

- now integrated as a DOI-based citation/reference source in the source layer and graph expansion flow

Main risk:

- coverage is narrower than OpenAlex, so merge rules need to treat it as an additional edge source rather than a primary metadata authority.
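
One way to read the "additional edge source" rule is as a union of citation edges keyed on normalized DOI pairs, with each edge remembering which sources asserted it. The sketch below is illustrative only: `CitationEdge`, `merge_edges`, `normalize_doi`, and the `(citing, cited)` pair format are assumptions made for this example, not CiteGeist's actual data model or the response shape of either API.

```python
from dataclasses import dataclass, field


def normalize_doi(doi: str) -> str:
    """Lowercase a DOI and strip common prefixes so equal DOIs compare equal."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


@dataclass
class CitationEdge:
    # Hypothetical edge record; only identifiers and provenance, no metadata.
    citing: str
    cited: str
    sources: set[str] = field(default_factory=set)


def merge_edges(
    *batches: tuple[str, list[tuple[str, str]]],
) -> dict[tuple[str, str], CitationEdge]:
    """Union (citing, cited) DOI pairs from several sources.

    Each batch is (source_name, pairs). An edge seen by several sources is
    stored once, with every asserting source recorded as provenance.
    """
    merged: dict[tuple[str, str], CitationEdge] = {}
    for source_name, pairs in batches:
        for citing, cited in pairs:
            key = (normalize_doi(citing), normalize_doi(cited))
            edge = merged.setdefault(key, CitationEdge(*key))
            edge.sources.add(source_name)
    return merged


# Usage with made-up DOIs: one shared edge, one edge only OpenCitations has.
edges = merge_edges(
    ("openalex", [("10.1000/a", "10.1000/b")]),
    ("opencitations", [("https://doi.org/10.1000/A", "10.1000/b"),
                       ("10.1000/c", "10.1000/b")]),
)
for edge in edges.values():
    print(edge.citing, "->", edge.cited, sorted(edge.sources))
```

Because the merge only ever adds edges and provenance labels, a narrower source cannot overwrite titles, authors, or other metadata supplied by a primary authority.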

### Unpaywall

Why:

- It solves a different problem from Crossref/OpenAlex: full-text access and OA status.
- It improves the “can I get the paper?” part of the workflow without forcing a storage redesign.

Expected role:

- DOI-to-best-open-access-link lookup
- OA status enrichment

Status:

- now integrated as an OA-link enrichment source with a dedicated `enrich-oa` CLI flow

Main risk:

- It should remain an access-link enrichment layer, not become entangled with identity resolution logic.
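
The separation described above can be sketched as an enrichment step that only adds OA fields and never rewrites identity fields such as the DOI or title. The record keys `is_oa`, `oa_status`, `best_oa_location`, `url`, and `url_for_pdf` follow the Unpaywall v2 response format; the function names and the `oa_status` / `oa_url` entry fields are hypothetical, and this is not a description of how the `enrich-oa` flow is actually implemented.

```python
from typing import Any, Optional, TypedDict


class OAInfo(TypedDict):
    is_oa: bool
    oa_status: Optional[str]
    best_url: Optional[str]


def extract_oa_info(record: dict[str, Any]) -> OAInfo:
    """Pull OA status and the best open link out of an Unpaywall-style record."""
    # best_oa_location is null for closed works, so fall back to an empty dict.
    best = record.get("best_oa_location") or {}
    return {
        "is_oa": bool(record.get("is_oa")),
        "oa_status": record.get("oa_status"),
        # Prefer a direct PDF link when present, else the location URL.
        "best_url": best.get("url_for_pdf") or best.get("url"),
    }


def apply_oa_enrichment(entry: dict[str, Any], record: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `entry` with OA fields added.

    Only the (assumed) `oa_status` and `oa_url` keys are written; the DOI,
    title, and other identity fields of the entry are left as they were.
    """
    info = extract_oa_info(record)
    enriched = dict(entry)
    enriched["oa_status"] = info["oa_status"]
    enriched["oa_url"] = info["best_url"]
    return enriched


# Usage with a hand-written sample record (not real API output).
sample_record = {
    "doi": "10.1000/x",
    "title": "A Title As The OA Service Spells It",
    "is_oa": True,
    "oa_status": "green",
    "best_oa_location": {"url": "https://repo.example/item/1", "url_for_pdf": None},
}
sample_entry = {"doi": "10.1000/x", "title": "Original Title"}
enriched_entry = apply_oa_enrichment(sample_entry, sample_record)
print(enriched_entry)
```

Returning a copy rather than mutating the entry keeps the enrichment step easy to rerun and easy to reason about alongside the resolver code.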

### Europe PMC

Why:

- It is valuable for biomedical and life-sciences use cases.
- It complements PubMed with richer open-access and citation-related information.

Expected role:

- domain-specific metadata enrichment
- biomedical search
- OA/full-text linkage

Status:

- now integrated as a biomedical resolver/search complement to `PubMed`

Main risk:

- this should remain a domain-specific source, not be treated as a universal resolver.

### Semantic Scholar

Pros:

- good graph and relevance signals
- useful for discovery quality

Status:

- now integrated as a broad resolver/search complement with good biological and physical sciences coverage

Main risk:

- rate limits and product-policy changes still matter more here than for the more explicitly open bibliographic sources

## Evaluate But Do Not Make Core Yet

### OpenAIRE

Pros:

- strong repository and OA/project linkage
- good for European repository acquisition

Cons:

- better suited to corpus acquisition than first-line metadata resolution

Recommendation:

- treat as an acquisition adapter, not an immediate resolver target

## What Not To Prioritize Right Now

### Database Redesign

The repository already has a working SQLite storage model and FTS-backed local workflow. A second schema track should not lead the next phase of work unless a concrete source integration is blocked on it.

### Vector Search

Optional semantic ranking may become useful later, but it was not the motivating question and does not need to be a prerequisite for source incorporation.

## Suggested Execution Order

1. Keep the source abstraction aligned with sources already in use.
2. Revisit `OpenAIRE` after the current source additions settle.
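
As a rough illustration of the first step, the roles listed in this document could be written down as a small capability registry, so that each source is only consulted for the jobs it is meant to do. Everything here is hypothetical: `SourceSpec`, `REGISTRY`, `sources_for`, and the capability names are invented for the sketch, and the capability sets are one reading of the roles described above rather than CiteGeist's actual source abstraction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceSpec:
    # Hypothetical descriptor: a source name plus the roles it may play.
    name: str
    capabilities: frozenset[str]


# Capability sets paraphrase the roles listed in this document; the
# capability names themselves are assumptions made for the example.
REGISTRY: list[SourceSpec] = [
    SourceSpec("crossref", frozenset({"resolve", "search", "references"})),
    SourceSpec("openalex", frozenset({"resolve", "search", "references", "citations"})),
    SourceSpec("pubmed", frozenset({"resolve", "search"})),
    SourceSpec("europe_pmc", frozenset({"resolve", "search", "oa_link"})),
    SourceSpec("opencitations", frozenset({"references", "citations"})),
    SourceSpec("unpaywall", frozenset({"oa_link"})),
]


def sources_for(capability: str) -> list[str]:
    """Return the names of registered sources that declare a capability."""
    return [spec.name for spec in REGISTRY if capability in spec.capabilities]


# Unpaywall never appears as a resolver, and OpenCitations only supplies edges.
print(sources_for("resolve"))
print(sources_for("citations"))
print(sources_for("oa_link"))
```

Under this layout, a later `OpenAIRE` adapter would be added as one more registry entry with its own capability set, without touching the resolver entries.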