12 KiB
AI Knowledge Graph Adoption Plan
This document translates the feature set of
robert-mcdermott/ai-knowledge-graph
into concrete implementation tickets for the current local repositories:
GroundRecallDidactopusdoclift
The goal is not to copy that repository's data model directly.
The useful import is:
- chunk-aware extraction
- entity standardization
- relation suggestion
- graph inspection and review affordances
The main thing to avoid is treating raw extracted SPO triples as canonical truth.
Design Rules
- Keep canonical storage typed and provenance-first.
- Treat extracted triples as candidate claims/relations, not promoted facts.
- Keep LLM extraction optional and reviewable.
- Keep
docliftdeterministic by default. - Put graph extraction in
GroundRecallfirst, then expose downstream affordances inDidactopus.
Repo Roles
GroundRecall
Primary fit for:
- candidate claim extraction
- concept alias normalization
- candidate relation inference
- graph diagnostics
- review queue generation
Key current modules:
- src/groundrecall/ingest.py
- src/groundrecall/models.py
- src/groundrecall/source_adapters
- src/groundrecall/groundrecall_source_adapters/doclift_bundle.py
- src/groundrecall/review_export.py
Didactopus
Primary fit for:
- graph workbench visualization
- concept merge/split suggestions
- graph-aware review overlays
- learner-facing graph inspection built on grounded artifacts
Key current modules:
- src/didactopus/knowledge_graph.py
- src/didactopus/graph_builder.py
- src/didactopus/graph_retrieval.py
- src/didactopus/learner_workbench.py
- src/didactopus/review_export.py
- src/didactopus/main.py
doclift
Primary fit for:
- deterministic chunk metadata
- optional extraction-friendly sidecars
- optional graph preview artifacts
Key current modules:
Phase 1: GroundRecall Candidate Graph Import
Ticket GR-1: Add chunk-aware candidate extraction layer
Outcome:
- ingest text artifacts into stable chunks
- extract candidate observations/claims/concepts/relations per chunk
- write reviewable import artifacts
Suggested implementation:
- add
src/groundrecall/candidate_graph.py - add
src/groundrecall/extraction_chunks.py
Responsibilities:
- split long text into bounded chunks with overlap
- assign stable
chunk_id - keep chunk-to-artifact provenance
- emit candidate records with
support_kind="derived_from_page"orsupport_kind="inferred"
CLI:
- extend
groundrecall importwith:--extract-graph--chunk-size--chunk-overlap--extractor none|heuristic|llm
Acceptance criteria:
- import still works without graph extraction
- import artifacts include chunk-backed candidate claims and relations when enabled
- all extracted candidates preserve artifact and chunk provenance
Ticket GR-2: Add deterministic entity/concept standardization
Outcome:
- alias clusters for near-duplicate concepts before review
Suggested implementation:
- add
src/groundrecall/entity_standardization.py
Responsibilities:
- normalize punctuation/case
- trim stopwords conservatively
- group obvious aliases
- emit alias-cluster review candidates when confidence is not high enough for direct merge
Data shape:
- enrich
ConceptRecord.aliases - optionally emit a new review payload section such as
alias_clusters
Acceptance criteria:
- obvious duplicates like minor punctuation/case variants collapse deterministically
- ambiguous clusters remain reviewable rather than auto-merged
Ticket GR-3: Add inferred relation candidates
Outcome:
- lexical and structural hints become review queue items
Suggested implementation:
- add
src/groundrecall/relation_inference.py
Inference types:
- lexical co-occurrence hints
- transitive prerequisite/support hints
- repeated same-source concept pair hints
Important restriction:
- inferred relations stay
draftortriaged - they are never silently promoted to canonical relations
Acceptance criteria:
- inferred relations appear in import artifacts with explicit provenance
- review queue distinguishes grounded vs inferred edges
Ticket GR-4: Add graph diagnostics and inspector output
Outcome:
- maintainers can inspect graph shape before promotion
Suggested implementation:
- add
src/groundrecall/graph_diagnostics.py - extend inspect.py
Diagnostics:
- disconnected components
- orphan concepts
- claims with no strong support
- bridge concepts
- dense noisy clusters
CLI:
groundrecall inspect ... --graphgroundrecall export ... --include-graph-diagnostics
Acceptance criteria:
- graph diagnostics appear in machine-readable JSON
- review operators can identify noisy imports quickly
Ticket GR-5: Add review export support for candidate graph artifacts
Outcome:
- current review flows can consume extracted graph candidates
Suggested implementation:
- extend review_export.py
- extend review app payloads under review_app
UI payload features:
- candidate relation cards
- alias-cluster cards
- chunk evidence preview
- inferred/grounded badges
Acceptance criteria:
- review bundle includes graph-candidate triage data
- no assistant-specific assumptions leak into canonical records
Phase 2: Didactopus Graph Review And Workbench Improvements
Ticket DT-1: Add review-oriented graph overlays
Outcome:
- graph visualizations expose quality problems, not just structure
Suggested implementation:
- extend knowledge_graph.py
- extend graph_retrieval.py
Overlay ideas:
- edge grounding status
- concept confidence/review status
- weakly grounded concept markers
- disconnected concept islands
Acceptance criteria:
- exported graph JSON can distinguish grounded, heuristic, and inferred links
- downstream visual layers can highlight fragile concepts
Ticket DT-2: Add concept consolidation suggestions
Outcome:
- reviewers get merge/split suggestions based on graph and text structure
Suggested implementation:
- extend graph_builder.py
- extend review_export.py
Input signals:
- title similarity
- shared source lessons
- overlapping prerequisite neighborhoods
- overlapping mastery signals
Acceptance criteria:
- review exports include merge suggestions
- suggested merges remain proposals, not automatic edits
Ticket DT-3: Add learner-workbench graph inspection modes
Outcome:
- learner and reviewer can inspect why concepts exist and how they connect
Suggested implementation:
- extend learner_workbench.py
- extend backend route api.py
Views:
- concept neighborhood
- source-fragment grounding trail
- alternate supporting lessons
- fragile or noisy concept warnings
Acceptance criteria:
- workbench can show source-grounded concept neighborhoods
- concept provenance is inspectable without raw JSON digging
Ticket DT-4: Add graph diagnostics to doclift-bundle pack generation
Outcome:
doclift -> Didactopusimports surface noisy graph structure early
Suggested implementation:
- extend doclift_bundle_demo.py
- extend main.py
doclift-bundle
Artifacts:
graph_diagnostics.jsonconcept_merge_suggestions.json
Acceptance criteria:
- importing a
docliftbundle produces diagnostics alongsideknowledge_graph.json - review workflow can consume those diagnostics
Phase 3: doclift Optional Extraction-Friendly Sidecars
Ticket DL-1: Emit stable chunk metadata
Outcome:
- downstream systems can import
docliftbundles without re-segmenting blindly
Suggested implementation:
- extend schemas.py
- extend convert.py
Artifacts:
document.chunks.json
Fields:
chunk_idline_startline_endsection_labelstext
Acceptance criteria:
- bundle remains valid without downstream AI extraction
- chunk metadata is deterministic across repeat runs
Ticket DL-2: Add optional graph-preview sidecars
Outcome:
- operators can inspect likely extracted structure at the bundle stage
Suggested implementation:
- add optional post-processing module such as
src/doclift/graph_preview.py
Artifacts:
document.entities.jsondocument.relations.json- optional
bundle_graph_preview.json
CLI:
- extend
doclift convert - extend
doclift convert-dir - flags:
--graph-preview--graph-preview-mode heuristic|llm
Important restriction:
- these are preview/debug artifacts only
- they are not the bundle's canonical semantics
Acceptance criteria:
- graph preview can be disabled entirely
- default conversion remains deterministic and lightweight
Ticket DL-3: Add HTML inspection output for graph previews
Outcome:
- maintainers can inspect extracted structure before import
Suggested implementation:
- add
doclift preview-graph /path/to/bundle
Acceptance criteria:
- preview HTML references chunk ids and source lines
- graph preview is visibly separate from conversion success reporting
Cross-Repo Integration Tickets
Ticket X-1: doclift -> GroundRecall candidate-graph import path
Outcome:
GroundRecallcan consumedocliftchunk metadata directly
Modules:
docliftemitsdocument.chunks.jsonGroundRecalldoclift_bundleadapter imports it
Acceptance criteria:
groundrecall import /path/to/doclift-bundle --extract-graph- uses
docliftchunk ids instead of re-splitting markdown where available
Ticket X-2: Shared graph diagnostics vocabulary
Outcome:
- the three repos use compatible terminology for quality signals
Suggested shared diagnostic keys:
orphan_conceptweak_groundinginferred_relationalias_clusterdisconnected_componentbridge_concepthigh_fanout_noisy_concept
Acceptance criteria:
- review and export layers can exchange diagnostics without brittle custom mapping
Recommended Build Order
GR-1GR-2GR-3GR-4X-1DT-1DT-2DL-1DL-2DT-4
Non-Goals
- replacing GroundRecall canonical models with freeform triples
- forcing LLM extraction into
docliftcore conversion - auto-promoting inferred relations
- making Didactopus depend on a graph preview layer to ingest ordinary packs
Immediate Next Step
If only one milestone is funded first, build:
GR-1GR-2X-1
That gives the highest leverage path:
docliftstays deterministicGroundRecallgains useful graph-candidate importDidactopuscan later consume cleaner grounded artifacts without architectural churn