GroundRecall/docs/ai-knowledge-graph-adoption...

# AI Knowledge Graph Adoption Plan

This document translates the feature set of
[`robert-mcdermott/ai-knowledge-graph`](https://github.com/robert-mcdermott/ai-knowledge-graph)
into concrete implementation tickets for the current local repositories:

- `GroundRecall`
- `Didactopus`
- `doclift`

The goal is not to copy that repository's data model directly.

The useful import is:

- chunk-aware extraction
- entity standardization
- relation suggestion
- graph inspection and review affordances

The main thing to avoid is treating raw extracted SPO triples as canonical truth.

## Design Rules

1. Keep canonical storage typed and provenance-first.
2. Treat extracted triples as candidate claims/relations, not promoted facts.
3. Keep LLM extraction optional and reviewable.
4. Keep `doclift` deterministic by default.
5. Put graph extraction in `GroundRecall` first, then expose downstream affordances in `Didactopus`.

## Repo Roles

### GroundRecall

Primary fit for:

- candidate claim extraction
- concept alias normalization
- candidate relation inference
- graph diagnostics
- review queue generation

Key current modules:

- [src/groundrecall/ingest.py](/home/netuser/bin/GroundRecall/src/groundrecall/ingest.py)
- [src/groundrecall/models.py](/home/netuser/bin/GroundRecall/src/groundrecall/models.py)
- [src/groundrecall/source_adapters](/home/netuser/bin/GroundRecall/src/groundrecall/source_adapters)
- [src/groundrecall/groundrecall_source_adapters/doclift_bundle.py](/home/netuser/bin/GroundRecall/src/groundrecall/groundrecall_source_adapters/doclift_bundle.py)
- [src/groundrecall/review_export.py](/home/netuser/bin/GroundRecall/src/groundrecall/review_export.py)

### Didactopus

Primary fit for:

- graph workbench visualization
- concept merge/split suggestions
- graph-aware review overlays
- learner-facing graph inspection built on grounded artifacts

Key current modules:

- [src/didactopus/knowledge_graph.py](/home/netuser/bin/Didactopus/src/didactopus/knowledge_graph.py)
- [src/didactopus/graph_builder.py](/home/netuser/bin/Didactopus/src/didactopus/graph_builder.py)
- [src/didactopus/graph_retrieval.py](/home/netuser/bin/Didactopus/src/didactopus/graph_retrieval.py)
- [src/didactopus/learner_workbench.py](/home/netuser/bin/Didactopus/src/didactopus/learner_workbench.py)
- [src/didactopus/review_export.py](/home/netuser/bin/Didactopus/src/didactopus/review_export.py)
- [src/didactopus/main.py](/home/netuser/bin/Didactopus/src/didactopus/main.py)

### doclift

Primary fit for:

- deterministic chunk metadata
- optional extraction-friendly sidecars
- optional graph preview artifacts

Key current modules:

- [src/doclift/convert.py](/home/netuser/bin/doclift/src/doclift/convert.py)
- [src/doclift/schemas.py](/home/netuser/bin/doclift/src/doclift/schemas.py)
- [src/doclift/cli.py](/home/netuser/bin/doclift/src/doclift/cli.py)

## Phase 1: GroundRecall Candidate Graph Import

### Ticket GR-1: Add chunk-aware candidate extraction layer

Outcome:

- ingest text artifacts into stable chunks
- extract candidate observations/claims/concepts/relations per chunk
- write reviewable import artifacts

Suggested implementation:

- add `src/groundrecall/candidate_graph.py`
- add `src/groundrecall/extraction_chunks.py`

Responsibilities:

- split long text into bounded chunks with overlap
- assign stable `chunk_id`
- keep chunk-to-artifact provenance
- emit candidate records with `support_kind="derived_from_page"` or `support_kind="inferred"`

CLI:

- extend `groundrecall import` with:
  - `--extract-graph`
  - `--chunk-size`
  - `--chunk-overlap`
  - `--extractor none|heuristic|llm`

Acceptance criteria:

- import still works without graph extraction
- import artifacts include chunk-backed candidate claims and relations when enabled
- all extracted candidates preserve artifact and chunk provenance

### Ticket GR-2: Add deterministic entity/concept standardization

Outcome:

- alias clusters for near-duplicate concepts before review

Suggested implementation:

- add `src/groundrecall/entity_standardization.py`

Responsibilities:

- normalize punctuation/case
- trim stopwords conservatively
- group obvious aliases
- emit alias-cluster review candidates when confidence is not high enough for direct merge

Data shape:

- enrich `ConceptRecord.aliases`
- optionally emit a new review payload section such as `alias_clusters`

Acceptance criteria:

- obvious duplicates like minor punctuation/case variants collapse deterministically
- ambiguous clusters remain reviewable rather than auto-merged

### Ticket GR-3: Add inferred relation candidates

Outcome:

- lexical and structural hints become review queue items

Suggested implementation:

- add `src/groundrecall/relation_inference.py`

Inference types:

- lexical co-occurrence hints
- transitive prerequisite/support hints
- repeated same-source concept pair hints

Important restriction:

- inferred relations stay `draft` or `triaged`
- they are never silently promoted to canonical relations

Acceptance criteria:

- inferred relations appear in import artifacts with explicit provenance
- review queue distinguishes grounded vs inferred edges

### Ticket GR-4: Add graph diagnostics and inspector output

Outcome:

- maintainers can inspect graph shape before promotion

Suggested implementation:

- add `src/groundrecall/graph_diagnostics.py`
- extend [inspect.py](/home/netuser/bin/GroundRecall/src/groundrecall/inspect.py)

Diagnostics:

- disconnected components
- orphan concepts
- claims with no strong support
- bridge concepts
- dense noisy clusters

CLI:

- `groundrecall inspect ... --graph`
- `groundrecall export ... --include-graph-diagnostics`

Acceptance criteria:

- graph diagnostics appear in machine-readable JSON
- review operators can identify noisy imports quickly

### Ticket GR-5: Add review export support for candidate graph artifacts

Outcome:

- current review flows can consume extracted graph candidates

Suggested implementation:

- extend [review_export.py](/home/netuser/bin/GroundRecall/src/groundrecall/review_export.py)
- extend review app payloads under [review_app](/home/netuser/bin/GroundRecall/src/groundrecall/review_app)

UI payload features:

- candidate relation cards
- alias-cluster cards
- chunk evidence preview
- inferred/grounded badges

Acceptance criteria:

- review bundle includes graph-candidate triage data
- no assistant-specific assumptions leak into canonical records

## Phase 2: Didactopus Graph Review And Workbench Improvements

### Ticket DT-1: Add review-oriented graph overlays

Outcome:

- graph visualizations expose quality problems, not just structure

Suggested implementation:

- extend [knowledge_graph.py](/home/netuser/bin/Didactopus/src/didactopus/knowledge_graph.py)
- extend [graph_retrieval.py](/home/netuser/bin/Didactopus/src/didactopus/graph_retrieval.py)

Overlay ideas:

- edge grounding status
- concept confidence/review status
- weakly grounded concept markers
- disconnected concept islands

Acceptance criteria:

- exported graph JSON can distinguish grounded, heuristic, and inferred links
- downstream visual layers can highlight fragile concepts

### Ticket DT-2: Add concept consolidation suggestions

Outcome:

- reviewers get merge/split suggestions based on graph and text structure

Suggested implementation:

- extend [graph_builder.py](/home/netuser/bin/Didactopus/src/didactopus/graph_builder.py)
- extend [review_export.py](/home/netuser/bin/Didactopus/src/didactopus/review_export.py)

Input signals:

- title similarity
- shared source lessons
- overlapping prerequisite neighborhoods
- overlapping mastery signals

Acceptance criteria:

- review exports include merge suggestions
- suggested merges remain proposals, not automatic edits

### Ticket DT-3: Add learner-workbench graph inspection modes

Outcome:

- learner and reviewer can inspect why concepts exist and how they connect

Suggested implementation:

- extend [learner_workbench.py](/home/netuser/bin/Didactopus/src/didactopus/learner_workbench.py)
- extend backend route [api.py](/home/netuser/bin/Didactopus/src/didactopus/api.py)

Views:

- concept neighborhood
- source-fragment grounding trail
- alternate supporting lessons
- fragile or noisy concept warnings

Acceptance criteria:

- workbench can show source-grounded concept neighborhoods
- concept provenance is inspectable without raw JSON digging

### Ticket DT-4: Add graph diagnostics to `doclift-bundle` pack generation

Outcome:

- `doclift -> Didactopus` imports surface noisy graph structure early

Suggested implementation:

- extend [doclift_bundle_demo.py](/home/netuser/bin/Didactopus/src/didactopus/doclift_bundle_demo.py)
- extend [main.py](/home/netuser/bin/Didactopus/src/didactopus/main.py) `doclift-bundle`

Artifacts:

- `graph_diagnostics.json`
- `concept_merge_suggestions.json`

Acceptance criteria:

- importing a `doclift` bundle produces diagnostics alongside `knowledge_graph.json`
- review workflow can consume those diagnostics

## Phase 3: doclift Optional Extraction-Friendly Sidecars

### Ticket DL-1: Emit stable chunk metadata

Outcome:

- downstream systems can import `doclift` bundles without re-segmenting blindly

Suggested implementation:

- extend [schemas.py](/home/netuser/bin/doclift/src/doclift/schemas.py)
- extend [convert.py](/home/netuser/bin/doclift/src/doclift/convert.py)

Artifacts:

- `document.chunks.json`

Fields:

- `chunk_id`
- `line_start`
- `line_end`
- `section_labels`
- `text`

Acceptance criteria:

- bundle remains valid without downstream AI extraction
- chunk metadata is deterministic across repeat runs

### Ticket DL-2: Add optional graph-preview sidecars

Outcome:

- operators can inspect likely extracted structure at the bundle stage

Suggested implementation:

- add optional post-processing module such as `src/doclift/graph_preview.py`

Artifacts:

- `document.entities.json`
- `document.relations.json`
- optional `bundle_graph_preview.json`

CLI:

- extend `doclift convert`
- extend `doclift convert-dir`
- flags:
  - `--graph-preview`
  - `--graph-preview-mode heuristic|llm`

Important restriction:

- these are preview/debug artifacts only
- they are not the bundle's canonical semantics

Acceptance criteria:

- graph preview can be disabled entirely
- default conversion remains deterministic and lightweight

### Ticket DL-3: Add HTML inspection output for graph previews

Outcome:

- maintainers can inspect extracted structure before import

Suggested implementation:

- add `doclift preview-graph /path/to/bundle`

Acceptance criteria:

- preview HTML references chunk ids and source lines
- graph preview is visibly separate from conversion success reporting

## Cross-Repo Integration Tickets

### Ticket X-1: `doclift -> GroundRecall` candidate-graph import path

Outcome:

- `GroundRecall` can consume `doclift` chunk metadata directly

Modules:

- `doclift` emits `document.chunks.json`
- `GroundRecall` `doclift_bundle` adapter imports it

Acceptance criteria:

- `groundrecall import /path/to/doclift-bundle --extract-graph`
- uses `doclift` chunk ids instead of re-splitting markdown where available

### Ticket X-2: Shared graph diagnostics vocabulary

Outcome:

- the three repos use compatible terminology for quality signals

Suggested shared diagnostic keys:

- `orphan_concept`
- `weak_grounding`
- `inferred_relation`
- `alias_cluster`
- `disconnected_component`
- `bridge_concept`
- `high_fanout_noisy_concept`

Acceptance criteria:

- review and export layers can exchange diagnostics without brittle custom mapping

## Recommended Build Order

1. `GR-1`
2. `GR-2`
3. `GR-3`
4. `GR-4`
5. `X-1`
6. `DT-1`
7. `DT-2`
8. `DL-1`
9. `DL-2`
10. `DT-4`

## Non-Goals

- replacing GroundRecall canonical models with freeform triples
- forcing LLM extraction into `doclift` core conversion
- auto-promoting inferred relations
- making Didactopus depend on a graph preview layer to ingest ordinary packs

## Immediate Next Step

If only one milestone is funded first, build:

- `GR-1`
- `GR-2`
- `X-1`

That gives the highest leverage path:

- `doclift` stays deterministic
- `GroundRecall` gains useful graph-candidate import
- `Didactopus` can later consume cleaner grounded artifacts without architectural churn