AI Knowledge Graph Adoption Plan

This document translates the feature set of robert-mcdermott/ai-knowledge-graph into concrete implementation tickets for the current local repositories:

  • GroundRecall
  • Didactopus
  • doclift

The goal is not to copy that repository's data model directly.

The ideas worth importing are:

  • chunk-aware extraction
  • entity standardization
  • relation suggestion
  • graph inspection and review affordances

The main thing to avoid is treating raw extracted SPO triples as canonical truth.

Design Rules

  1. Keep canonical storage typed and provenance-first.
  2. Treat extracted triples as candidate claims/relations, not promoted facts.
  3. Keep LLM extraction optional and reviewable.
  4. Keep doclift deterministic by default.
  5. Put graph extraction in GroundRecall first, then expose downstream affordances in Didactopus.

Repo Roles

GroundRecall

Primary fit for:

  • candidate claim extraction
  • concept alias normalization
  • candidate relation inference
  • graph diagnostics
  • review queue generation

Key current modules:

Didactopus

Primary fit for:

  • graph workbench visualization
  • concept merge/split suggestions
  • graph-aware review overlays
  • learner-facing graph inspection built on grounded artifacts

Key current modules:

doclift

Primary fit for:

  • deterministic chunk metadata
  • optional extraction-friendly sidecars
  • optional graph preview artifacts

Key current modules:

Phase 1: GroundRecall Candidate Graph Import

Ticket GR-1: Add chunk-aware candidate extraction layer

Outcome:

  • ingest text artifacts into stable chunks
  • extract candidate observations/claims/concepts/relations per chunk
  • write reviewable import artifacts

Suggested implementation:

  • add src/groundrecall/candidate_graph.py
  • add src/groundrecall/extraction_chunks.py

Responsibilities:

  • split long text into bounded chunks with overlap
  • assign stable chunk_id
  • keep chunk-to-artifact provenance
  • emit candidate records with support_kind="derived_from_page" or support_kind="inferred"
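
The responsibilities above can be sketched roughly as follows. This is a minimal illustration, not the GroundRecall implementation: the Chunk dataclass, split_into_chunks, candidate_record, the character-offset chunking, and the default sizes are all assumptions made for the example. Only the two support_kind values come from this plan.

```python
import hashlib
from dataclasses import dataclass

# Support kinds named in this plan; everything else here is illustrative.
SUPPORT_KINDS = {"derived_from_page", "inferred"}


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    artifact_id: str
    start: int  # inclusive character offset into the artifact text
    end: int    # exclusive character offset
    text: str


def split_into_chunks(artifact_id, text, chunk_size=800, overlap=100):
    """Split text into bounded, overlapping chunks with content-derived ids."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        body = text[start:end]
        # Hash artifact id, offsets, and content so repeat runs give the same id.
        key = f"{artifact_id}:{start}:{end}:{body}".encode("utf-8")
        chunk_id = "chunk-" + hashlib.sha256(key).hexdigest()[:16]
        chunks.append(Chunk(chunk_id, artifact_id, start, end, body))
        if end == len(text):
            break
        start += step
    return chunks


def candidate_record(chunk, statement, support_kind):
    """Wrap an extracted statement with artifact and chunk provenance."""
    if support_kind not in SUPPORT_KINDS:
        raise ValueError(f"unknown support_kind: {support_kind!r}")
    return {
        "statement": statement,
        "support_kind": support_kind,
        "artifact_id": chunk.artifact_id,
        "chunk_id": chunk.chunk_id,
        "span": [chunk.start, chunk.end],
    }
```

Deriving chunk_id from the artifact id, offsets, and content keeps ids stable across repeat imports of unchanged text, at the cost of changing whenever the text or chunk parameters change.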

CLI:

  • extend groundrecall import with:
    • --extract-graph
    • --chunk-size
    • --chunk-overlap
    • --extractor none|heuristic|llm
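
As a rough sketch of how these flags could be wired up, the example below uses argparse from the standard library. It is an assumption that the existing groundrecall CLI is argparse-based; the default values, the source positional argument, and the parse_import_args helper are hypothetical and only illustrate the flag shapes listed above.

```python
import argparse


def build_parser():
    """Build a stand-in parser for `groundrecall import` (illustrative only)."""
    parser = argparse.ArgumentParser(prog="groundrecall")
    commands = parser.add_subparsers(dest="command", required=True)
    imp = commands.add_parser("import", help="import artifacts")
    imp.add_argument("source", help="path to the artifact or bundle to import")
    # Graph extraction is opt-in so the plain import path is unchanged.
    imp.add_argument("--extract-graph", action="store_true")
    imp.add_argument("--chunk-size", type=int, default=800)      # assumed default
    imp.add_argument("--chunk-overlap", type=int, default=100)   # assumed default
    imp.add_argument("--extractor", choices=["none", "heuristic", "llm"],
                     default="none")
    return parser


def parse_import_args(argv):
    """Parse argv and reject overlap values that would stall chunking."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if not 0 <= args.chunk_overlap < args.chunk_size:
        parser.error("--chunk-overlap must be >= 0 and smaller than --chunk-size")
    return args
```

Keeping --extract-graph off and --extractor at none by default matches the acceptance criterion that import still works without graph extraction.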

Acceptance criteria:

  • import still works without graph extraction
  • import artifacts include chunk-backed candidate claims and relations when enabled
  • all extracted candidates preserve artifact and chunk provenance

Ticket GR-2: Add deterministic entity/concept standardization

Outcome:

  • alias clusters for near-duplicate concepts before review

Suggested implementation:

  • add src/groundrecall/entity_standardization.py

Responsibilities:

  • normalize punctuation/case
  • trim stopwords conservatively
  • group obvious aliases
  • emit alias-cluster review candidates when confidence is not high enough for direct merge
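
One way to read these responsibilities is as a two-tier pass: a deterministic normalization that collapses obvious variants, followed by a fuzzy comparison that only proposes clusters for review. The sketch below assumes that reading. The function names, the leading-article stopword list, and the 0.85 threshold are illustrative choices, not part of the plan.

```python
import re
import unicodedata
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Conservative stopword trimming: only a leading article is removed.
LEADING_STOPWORDS = ("the ", "a ", "an ")


def normalize_label(label):
    """Fold case, replace punctuation with spaces, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", label).casefold()
    text = re.sub(r"[^\w\s]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    for prefix in LEADING_STOPWORDS:
        if text.startswith(prefix) and len(text) > len(prefix):
            text = text[len(prefix):]
            break
    return text


def cluster_aliases(labels):
    """Group labels whose normalized forms are identical (safe to collapse)."""
    groups = defaultdict(set)
    for label in labels:
        groups[normalize_label(label)].add(label)
    return {key: sorted(variants) for key, variants in sorted(groups.items()) if key}


def review_candidates(cluster_keys, threshold=0.85):
    """Pair up similar-but-not-identical keys for human review, never auto-merge."""
    pairs = []
    for left, right in combinations(sorted(cluster_keys), 2):
        score = SequenceMatcher(None, left, right).ratio()
        if score >= threshold:
            pairs.append((left, right, round(score, 3)))
    return pairs
```

With this split, "Knowledge Graph" and "knowledge-graph" collapse into one cluster deterministically, while "knowledge graph" and "knowledge graphs" only surface as a review pair.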

Data shape:

  • enrich ConceptRecord.aliases
  • optionally emit a new review payload section such as alias_clusters

Acceptance criteria:

  • obvious duplicates, such as minor punctuation or case variants, collapse deterministically
  • ambiguous clusters remain reviewable rather than auto-merged

Ticket GR-3: Add inferred relation candidates

Outcome:

  • lexical and structural hints become review queue items

Suggested implementation:

  • add src/groundrecall/relation_inference.py

Inference types:

  • lexical co-occurrence hints
  • transitive prerequisite/support hints
  • repeated same-source concept pair hints

Important restriction:

  • inferred relations stay draft or triaged
  • they are never silently promoted to canonical relations
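
To make the restriction concrete, here is a small sketch of one inference type, transitive prerequisite hints, where every output is created in draft status and carries the intermediate concepts as provenance. The RelationCandidate shape, the prerequisite_of relation name, and the via field are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RelationCandidate:
    source: str
    target: str
    relation: str       # hypothetical relation name
    status: str         # "draft" or "triaged"; promotion happens elsewhere
    support_kind: str   # always "inferred" for this module
    via: tuple          # intermediate concepts that justify the hint


def transitive_prerequisite_hints(edges):
    """Suggest A -> C when A -> B and B -> C exist but A -> C does not."""
    edges = set(edges)
    successors = {}
    for source, target in edges:
        successors.setdefault(source, set()).add(target)
    via = {}
    for source, middle in edges:
        for target in successors.get(middle, ()):
            if target != source and (source, target) not in edges:
                via.setdefault((source, target), set()).add(middle)
    return [
        RelationCandidate(
            source=source,
            target=target,
            relation="prerequisite_of",
            status="draft",
            support_kind="inferred",
            via=tuple(sorted(middles)),
        )
        for (source, target), middles in sorted(via.items())
    ]
```

Because the function can only construct candidates with status="draft", promotion to a canonical relation has to go through a separate, explicit review step.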

Acceptance criteria:

  • inferred relations appear in import artifacts with explicit provenance
  • review queue distinguishes grounded vs inferred edges

Ticket GR-4: Add graph diagnostics and inspector output

Outcome:

  • maintainers can inspect graph shape before promotion

Suggested implementation:

  • add src/groundrecall/graph_diagnostics.py
  • extend inspect.py

Diagnostics:

  • disconnected components
  • orphan concepts
  • claims with no strong support
  • bridge concepts
  • dense noisy clusters
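
Two of these diagnostics, disconnected components and orphan concepts, can be computed with a plain breadth-first search. The sketch below assumes concepts are string ids and relations are undirected pairs for diagnostic purposes. The output shape is hypothetical, and the key names reuse the shared diagnostic keys proposed in Ticket X-2.

```python
import json
from collections import deque


def connected_components(nodes, edges):
    """Return sorted components of the undirected concept graph."""
    adjacency = {node: set() for node in nodes}
    for left, right in edges:
        adjacency.setdefault(left, set()).add(right)
        adjacency.setdefault(right, set()).add(left)
    seen = set()
    components = []
    for start in sorted(adjacency):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbor in sorted(adjacency[node] - seen):
                seen.add(neighbor)
                queue.append(neighbor)
        components.append(sorted(component))
    return components


def graph_diagnostics(nodes, edges):
    """Summarize graph shape as a JSON-serializable dict."""
    components = connected_components(nodes, edges)
    linked = {node for edge in edges for node in edge}
    return {
        # Only report components when the graph is actually split.
        "disconnected_component": components if len(components) > 1 else [],
        "orphan_concept": sorted(set(nodes) - linked),
    }


def diagnostics_json(nodes, edges):
    """Serialize diagnostics deterministically for machine consumption."""
    return json.dumps(graph_diagnostics(nodes, edges), indent=2, sort_keys=True)
```

Bridge concepts and dense noisy clusters need more machinery (articulation points, degree or density thresholds) and are left out of this sketch.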

CLI:

  • groundrecall inspect ... --graph
  • groundrecall export ... --include-graph-diagnostics

Acceptance criteria:

  • graph diagnostics appear in machine-readable JSON
  • review operators can identify noisy imports quickly

Ticket GR-5: Add review export support for candidate graph artifacts

Outcome:

  • current review flows can consume extracted graph candidates

Suggested implementation:

UI payload features:

  • candidate relation cards
  • alias-cluster cards
  • chunk evidence preview
  • inferred/grounded badges

Acceptance criteria:

  • review bundle includes graph-candidate triage data
  • no assistant-specific assumptions leak into canonical records

Phase 2: Didactopus Graph Review And Workbench Improvements

Ticket DT-1: Add review-oriented graph overlays

Outcome:

  • graph visualizations expose quality problems, not just structure

Suggested implementation:

Overlay ideas:

  • edge grounding status
  • concept confidence/review status
  • weakly grounded concept markers
  • disconnected concept islands

Acceptance criteria:

  • exported graph JSON can distinguish grounded, heuristic, and inferred links
  • downstream visual layers can highlight fragile concepts

Ticket DT-2: Add concept consolidation suggestions

Outcome:

  • reviewers get merge/split suggestions based on graph and text structure

Suggested implementation:

Input signals:

  • title similarity
  • shared source lessons
  • overlapping prerequisite neighborhoods
  • overlapping mastery signals
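
A simple way to combine these signals is a weighted score over title similarity and set overlap. The sketch below is illustrative only: the Concept fields, the 0.5/0.25/0.25 weights, and the 0.7 threshold are assumptions, and mastery signals are omitted because this plan does not specify their shape.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from itertools import combinations


@dataclass(frozen=True)
class Concept:
    concept_id: str
    title: str
    source_lessons: frozenset = frozenset()
    prerequisites: frozenset = frozenset()


def jaccard(left, right):
    """Overlap of two sets as |intersection| / |union| (0.0 when both empty)."""
    union = left | right
    return len(left & right) / len(union) if union else 0.0


def merge_score(a, b):
    """Blend title similarity with shared-lesson and prerequisite overlap."""
    title = SequenceMatcher(None, a.title.casefold(), b.title.casefold()).ratio()
    lessons = jaccard(a.source_lessons, b.source_lessons)
    neighborhood = jaccard(a.prerequisites, b.prerequisites)
    return 0.5 * title + 0.25 * lessons + 0.25 * neighborhood


def merge_suggestions(concepts, threshold=0.7):
    """Emit proposals only; nothing here edits or merges concepts."""
    suggestions = []
    ordered = sorted(concepts, key=lambda concept: concept.concept_id)
    for a, b in combinations(ordered, 2):
        score = merge_score(a, b)
        if score >= threshold:
            suggestions.append({
                "concepts": [a.concept_id, b.concept_id],
                "score": round(score, 3),
                "status": "proposed",
            })
    return suggestions
```

Every suggestion carries status "proposed", which keeps the output aligned with the requirement that merges remain reviewer decisions.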

Acceptance criteria:

  • review exports include merge suggestions
  • suggested merges remain proposals, not automatic edits

Ticket DT-3: Add learner-workbench graph inspection modes

Outcome:

  • learner and reviewer can inspect why concepts exist and how they connect

Suggested implementation:

Views:

  • concept neighborhood
  • source-fragment grounding trail
  • alternate supporting lessons
  • fragile or noisy concept warnings

Acceptance criteria:

  • workbench can show source-grounded concept neighborhoods
  • concept provenance is inspectable without raw JSON digging

Ticket DT-4: Add graph diagnostics to doclift-bundle pack generation

Outcome:

  • doclift -> Didactopus imports surface noisy graph structure early

Suggested implementation:

Artifacts:

  • graph_diagnostics.json
  • concept_merge_suggestions.json

Acceptance criteria:

  • importing a doclift bundle produces diagnostics alongside knowledge_graph.json
  • review workflow can consume those diagnostics

Phase 3: doclift Optional Extraction-Friendly Sidecars

Ticket DL-1: Emit stable chunk metadata

Outcome:

  • downstream systems can import doclift bundles without re-segmenting blindly

Suggested implementation:

Artifacts:

  • document.chunks.json

Fields:

  • chunk_id
  • line_start
  • line_end
  • section_labels
  • text

Acceptance criteria:

  • bundle remains valid without downstream AI extraction
  • chunk metadata is deterministic across repeat runs
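
For illustration, a deterministic emitter for these fields might look like the sketch below. It assumes fixed-size line windows, 1-based inclusive line numbers, Markdown-style "#" headings as section labels, and content-hashed chunk ids. None of these choices is specified by this plan, and the function names are hypothetical.

```python
import hashlib
import json


def build_chunk_metadata(text, lines_per_chunk=40):
    """Split text into line windows and attach the fields listed above."""
    if lines_per_chunk <= 0:
        raise ValueError("lines_per_chunk must be positive")
    lines = text.splitlines()
    chunks = []
    active_heading = None  # most recent heading seen before or in this window
    for first in range(0, len(lines), lines_per_chunk):
        window = lines[first:first + lines_per_chunk]
        labels = [active_heading] if active_heading else []
        for line in window:
            if line.startswith("#"):
                title = line.lstrip("#").strip()
                if title:
                    active_heading = title
                    if title not in labels:
                        labels.append(title)
        body = "\n".join(window)
        line_start, line_end = first + 1, first + len(window)
        key = f"{line_start}:{line_end}:{body}".encode("utf-8")
        chunks.append({
            "chunk_id": "chunk-" + hashlib.sha256(key).hexdigest()[:16],
            "line_start": line_start,
            "line_end": line_end,
            "section_labels": labels,
            "text": body,
        })
    return chunks


def serialize_chunks(chunks):
    """Stable serialization: sorted keys and fixed indentation."""
    return json.dumps(chunks, indent=2, sort_keys=True, ensure_ascii=False)
```

No timestamps, random ids, or environment-dependent values enter the output, so two runs over the same input produce byte-identical JSON.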

Ticket DL-2: Add optional graph-preview sidecars

Outcome:

  • operators can inspect likely extracted structure at the bundle stage

Suggested implementation:

  • add optional post-processing module such as src/doclift/graph_preview.py

Artifacts:

  • document.entities.json
  • document.relations.json
  • optional bundle_graph_preview.json

CLI:

  • extend doclift convert
  • extend doclift convert-dir
  • flags:
    • --graph-preview
    • --graph-preview-mode heuristic|llm

Important restriction:

  • these are preview/debug artifacts only
  • they are not the bundle's canonical semantics

Acceptance criteria:

  • graph preview can be disabled entirely
  • default conversion remains deterministic and lightweight

Ticket DL-3: Add HTML inspection output for graph previews

Outcome:

  • maintainers can inspect extracted structure before import

Suggested implementation:

  • add doclift preview-graph /path/to/bundle

Acceptance criteria:

  • preview HTML references chunk ids and source lines
  • graph preview is visibly separate from conversion success reporting

Cross-Repo Integration Tickets

Ticket X-1: doclift -> GroundRecall candidate-graph import path

Outcome:

  • GroundRecall can consume doclift chunk metadata directly

Modules:

  • doclift emits document.chunks.json
  • GroundRecall doclift_bundle adapter imports it

Acceptance criteria:

  • groundrecall import /path/to/doclift-bundle --extract-graph uses doclift chunk ids, where available, instead of re-splitting markdown
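
The "where available" behavior could be sketched as a sidecar lookup with a fallback. This example assumes document.chunks.json sits at the bundle root and contains a JSON list of chunk objects. The load_bundle_chunks helper and the resplit callback are hypothetical and stand in for whatever the doclift_bundle adapter actually does.

```python
import json
from pathlib import Path


def load_bundle_chunks(bundle_dir, resplit):
    """Prefer doclift's chunk sidecar; fall back to local re-splitting.

    Returns (chunks, origin) so provenance records can note which path ran.
    `resplit` is a caller-supplied function taking the bundle Path.
    """
    bundle = Path(bundle_dir)
    sidecar = bundle / "document.chunks.json"  # assumed location in the bundle
    if sidecar.is_file():
        chunks = json.loads(sidecar.read_text(encoding="utf-8"))
        return chunks, "doclift"
    return resplit(bundle), "resplit"
```

Returning the origin alongside the chunks lets the importer record whether chunk ids came from doclift or were generated locally.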

Ticket X-2: Shared graph diagnostics vocabulary

Outcome:

  • the three repos use compatible terminology for quality signals

Suggested shared diagnostic keys:

  • orphan_concept
  • weak_grounding
  • inferred_relation
  • alias_cluster
  • disconnected_component
  • bridge_concept
  • high_fanout_noisy_concept
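
One lightweight way to share this vocabulary is a single enumeration that each repo validates against. The sketch below is an assumption about packaging, not a decided design. The DiagnosticKey and make_diagnostic names and the record fields are hypothetical; only the key strings come from the list above.

```python
from enum import Enum


class DiagnosticKey(str, Enum):
    """Shared diagnostic keys; values match the strings listed above."""
    ORPHAN_CONCEPT = "orphan_concept"
    WEAK_GROUNDING = "weak_grounding"
    INFERRED_RELATION = "inferred_relation"
    ALIAS_CLUSTER = "alias_cluster"
    DISCONNECTED_COMPONENT = "disconnected_component"
    BRIDGE_CONCEPT = "bridge_concept"
    HIGH_FANOUT_NOISY_CONCEPT = "high_fanout_noisy_concept"


def make_diagnostic(key, subject, detail=""):
    """Build a diagnostic record, rejecting keys outside the shared vocabulary."""
    validated = DiagnosticKey(key)  # raises ValueError for unknown keys
    return {"key": validated.value, "subject": subject, "detail": detail}
```

Validating at record-creation time means an unknown key fails early in the producing repo rather than being silently dropped by a consumer.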

Acceptance criteria:

  • review and export layers can exchange diagnostics without brittle custom mapping

Suggested Order

  1. GR-1
  2. GR-2
  3. GR-3
  4. GR-4
  5. X-1
  6. DT-1
  7. DT-2
  8. DL-1
  9. DL-2
  10. DT-4

Non-Goals

  • replacing GroundRecall canonical models with freeform triples
  • forcing LLM extraction into doclift core conversion
  • auto-promoting inferred relations
  • making Didactopus depend on a graph preview layer to ingest ordinary packs

Immediate Next Step

If only one milestone is funded first, build:

  • GR-1
  • GR-2
  • X-1

That gives the highest-leverage path:

  • doclift stays deterministic
  • GroundRecall gains useful graph-candidate import
  • Didactopus can later consume cleaner grounded artifacts without architectural churn