# AI Knowledge Graph Adoption Plan

This document translates the feature set of
[`robert-mcdermott/ai-knowledge-graph`](https://github.com/robert-mcdermott/ai-knowledge-graph)
into concrete implementation tickets for the current local repositories:

- `GroundRecall`
- `Didactopus`
- `doclift`

The goal is not to copy that repository's data model directly.

The useful parts to import are:

- chunk-aware extraction
- entity standardization
- relation suggestion
- graph inspection and review affordances

The main thing to avoid is treating raw extracted subject-predicate-object (SPO) triples as canonical truth.

## Design Rules

1. Keep canonical storage typed and provenance-first.
2. Treat extracted triples as candidate claims/relations, not promoted facts.
3. Keep LLM extraction optional and reviewable.
4. Keep `doclift` deterministic by default.
5. Put graph extraction in `GroundRecall` first, then expose downstream affordances in `Didactopus`.

## Repo Roles

### GroundRecall

Primary fit for:

- candidate claim extraction
- concept alias normalization
- candidate relation inference
- graph diagnostics
- review queue generation

Key current modules:

- [src/groundrecall/ingest.py](/home/netuser/bin/GroundRecall/src/groundrecall/ingest.py)
- [src/groundrecall/models.py](/home/netuser/bin/GroundRecall/src/groundrecall/models.py)
- [src/groundrecall/source_adapters](/home/netuser/bin/GroundRecall/src/groundrecall/source_adapters)
- [src/groundrecall/groundrecall_source_adapters/doclift_bundle.py](/home/netuser/bin/GroundRecall/src/groundrecall/groundrecall_source_adapters/doclift_bundle.py)
- [src/groundrecall/review_export.py](/home/netuser/bin/GroundRecall/src/groundrecall/review_export.py)

### Didactopus

Primary fit for:

- graph workbench visualization
- concept merge/split suggestions
- graph-aware review overlays
- learner-facing graph inspection built on grounded artifacts

Key current modules:

- [src/didactopus/knowledge_graph.py](/home/netuser/bin/Didactopus/src/didactopus/knowledge_graph.py)
- [src/didactopus/graph_builder.py](/home/netuser/bin/Didactopus/src/didactopus/graph_builder.py)
- [src/didactopus/graph_retrieval.py](/home/netuser/bin/Didactopus/src/didactopus/graph_retrieval.py)
- [src/didactopus/learner_workbench.py](/home/netuser/bin/Didactopus/src/didactopus/learner_workbench.py)
- [src/didactopus/review_export.py](/home/netuser/bin/Didactopus/src/didactopus/review_export.py)
- [src/didactopus/main.py](/home/netuser/bin/Didactopus/src/didactopus/main.py)

### doclift

Primary fit for:

- deterministic chunk metadata
- optional extraction-friendly sidecars
- optional graph preview artifacts

Key current modules:

- [src/doclift/convert.py](/home/netuser/bin/doclift/src/doclift/convert.py)
- [src/doclift/schemas.py](/home/netuser/bin/doclift/src/doclift/schemas.py)
- [src/doclift/cli.py](/home/netuser/bin/doclift/src/doclift/cli.py)

## Phase 1: GroundRecall Candidate Graph Import

### Ticket GR-1: Add chunk-aware candidate extraction layer

Outcome:

- ingest text artifacts into stable chunks
- extract candidate observations/claims/concepts/relations per chunk
- write reviewable import artifacts

Suggested implementation:

- add `src/groundrecall/candidate_graph.py`
- add `src/groundrecall/extraction_chunks.py`

Responsibilities:

- split long text into bounded chunks with overlap
- assign stable `chunk_id`
- keep chunk-to-artifact provenance
- emit candidate records with `support_kind="derived_from_page"` or `support_kind="inferred"`

CLI:

- extend `groundrecall import` with:
  - `--extract-graph`
  - `--chunk-size`
  - `--chunk-overlap`
  - `--extractor none|heuristic|llm`

Acceptance criteria:

- import still works without graph extraction
- import artifacts include chunk-backed candidate claims and relations when enabled
- all extracted candidates preserve artifact and chunk provenance

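The chunking responsibilities above can be sketched as follows. This is a minimal illustration under stated assumptions, not the planned `extraction_chunks.py`: the `Chunk` dataclass, the `split_into_chunks` function, character-based sizing, and the hash-derived `chunk_id` format are all hypothetical names and choices made for the example.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """Hypothetical chunk record; field names are illustrative only."""
    chunk_id: str
    artifact_id: str  # provenance: which artifact the chunk came from
    start: int        # character offset, inclusive
    end: int          # character offset, exclusive
    text: str


def split_into_chunks(artifact_id: str, text: str,
                      chunk_size: int = 800, chunk_overlap: int = 100) -> list[Chunk]:
    """Split text into bounded, overlapping chunks with reproducible ids."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks: list[Chunk] = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        body = text[start:end]
        # The id depends only on the artifact, the offsets, and the content,
        # so repeated imports of the same text yield the same ids.
        key = f"{artifact_id}:{start}:{end}:{body}".encode("utf-8")
        digest = hashlib.sha256(key).hexdigest()[:16]
        chunks.append(Chunk(f"chunk-{digest}", artifact_id, start, end, body))
        if end == len(text):
            break
        start += step
    return chunks
```

Deriving the id from content and offsets rather than from a counter is one way to keep ids stable across runs; a real implementation might prefer token-based or sentence-aware boundaries.
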
### Ticket GR-2: Add deterministic entity/concept standardization

Outcome:

- alias clusters for near-duplicate concepts before review

Suggested implementation:

- add `src/groundrecall/entity_standardization.py`

Responsibilities:

- normalize punctuation/case
- trim stopwords conservatively
- group obvious aliases
- emit alias-cluster review candidates when confidence is not high enough for direct merge

Data shape:

- enrich `ConceptRecord.aliases`
- optionally emit a new review payload section such as `alias_clusters`

Acceptance criteria:

- obvious duplicates like minor punctuation/case variants collapse deterministically
- ambiguous clusters remain reviewable rather than auto-merged

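A minimal sketch of the deterministic part of this ticket is shown below. The function names `normalize_label` and `alias_clusters`, the tiny leading-stopword set, and the returned dictionary shape are assumptions for illustration; they are not taken from the existing `GroundRecall` code.

```python
from __future__ import annotations

import re
import unicodedata
from collections import defaultdict

# Deliberately small: only leading articles are trimmed, to stay conservative.
_LEADING_STOPWORDS = {"the", "a", "an"}


def normalize_label(label: str) -> str:
    """Return a case- and punctuation-insensitive key for a concept label."""
    text = unicodedata.normalize("NFKC", label).casefold()
    text = re.sub(r"[^\w\s]", " ", text)  # punctuation becomes whitespace
    tokens = text.split()
    if len(tokens) > 1 and tokens[0] in _LEADING_STOPWORDS:
        tokens = tokens[1:]
    return " ".join(tokens)


def alias_clusters(labels: list[str]) -> dict[str, list[str]]:
    """Group labels sharing a normalized key; singletons are omitted."""
    groups: dict[str, set[str]] = defaultdict(set)
    for label in labels:
        groups[normalize_label(label)].add(label)
    return {
        key: sorted(members)
        for key, members in sorted(groups.items())
        if len(members) > 1
    }
```

Because the key is a pure function of the label, the same input always produces the same clusters. Anything fuzzier than this (for example, edit-distance matches) would fall under the "ambiguous clusters remain reviewable" criterion rather than being merged here.
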
### Ticket GR-3: Add inferred relation candidates

Outcome:

- lexical and structural hints become review queue items

Suggested implementation:

- add `src/groundrecall/relation_inference.py`

Inference types:

- lexical co-occurrence hints
- transitive prerequisite/support hints
- repeated same-source concept pair hints

Important restriction:

- inferred relations stay `draft` or `triaged`
- they are never silently promoted to canonical relations

Acceptance criteria:

- inferred relations appear in import artifacts with explicit provenance
- review queue distinguishes grounded vs inferred edges

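As an illustration of the first inference type only (lexical co-occurrence), the sketch below turns per-chunk concept sets into draft candidates. The `RelationCandidate` dataclass, the `cooccurs_with` relation kind, and the `min_chunks` threshold are hypothetical; the `draft` status and `support_kind="inferred"` values follow the wording of this plan.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class RelationCandidate:
    """Hypothetical candidate record; never a canonical relation."""
    source: str
    target: str
    kind: str
    status: str              # stays "draft" until a reviewer acts
    support_kind: str        # "inferred" marks the edge as non-grounded
    chunk_ids: tuple[str, ...]  # provenance: chunks where the pair co-occurs


def cooccurrence_candidates(chunk_concepts: dict[str, set[str]],
                            min_chunks: int = 2) -> list[RelationCandidate]:
    """Propose a draft relation for each concept pair seen together often enough."""
    counts: Counter = Counter()
    evidence: dict[tuple[str, str], list[str]] = {}
    for chunk_id in sorted(chunk_concepts):
        for pair in combinations(sorted(chunk_concepts[chunk_id]), 2):
            counts[pair] += 1
            evidence.setdefault(pair, []).append(chunk_id)
    return [
        RelationCandidate(a, b, "cooccurs_with", "draft", "inferred",
                          tuple(evidence[(a, b)]))
        for (a, b), seen in sorted(counts.items())
        if seen >= min_chunks
    ]
```

Nothing in this function can produce a status other than `draft`, which is one simple way to make the "never silently promoted" restriction hold by construction.
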
### Ticket GR-4: Add graph diagnostics and inspector output

Outcome:

- maintainers can inspect graph shape before promotion

Suggested implementation:

- add `src/groundrecall/graph_diagnostics.py`
- extend [inspect.py](/home/netuser/bin/GroundRecall/src/groundrecall/inspect.py)

Diagnostics:

- disconnected components
- orphan concepts
- claims with no strong support
- bridge concepts
- dense noisy clusters

CLI:

- `groundrecall inspect ... --graph`
- `groundrecall export ... --include-graph-diagnostics`

Acceptance criteria:

- graph diagnostics appear in machine-readable JSON
- review operators can identify noisy imports quickly

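Two of the listed diagnostics, disconnected components and orphan concepts, can be computed with a plain graph traversal. The sketch below assumes an undirected view of the candidate graph; the function name `graph_diagnostics` and the output keys are illustrative and not an existing `GroundRecall` schema.

```python
from __future__ import annotations

import json
from collections import defaultdict


def graph_diagnostics(concepts: list[str], edges: list[tuple[str, str]]) -> dict:
    """Report connected components and orphan concepts as JSON-friendly data."""
    neighbors: dict[str, set[str]] = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    nodes = set(concepts) | set(neighbors)
    # An orphan is a concept with no edges at all.
    orphans = sorted(n for n in nodes if not neighbors.get(n))

    seen: set[str] = set()
    components: list[list[str]] = []
    for node in sorted(nodes):
        if node in seen:
            continue
        seen.add(node)
        stack, members = [node], []
        while stack:  # iterative depth-first traversal
            current = stack.pop()
            members.append(current)
            for nxt in neighbors.get(current, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        components.append(sorted(members))
    components.sort(key=lambda c: (-len(c), c))  # largest first, then by name
    return {
        "component_count": len(components),
        "components": components,
        "orphan_concepts": orphans,
    }


if __name__ == "__main__":
    report = graph_diagnostics(["a", "b", "c"], [("a", "b")])
    print(json.dumps(report, indent=2, sort_keys=True))
```

Sorting nodes, members, and components keeps the JSON output identical across runs, which makes diagnostics easier to diff between imports.
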
### Ticket GR-5: Add review export support for candidate graph artifacts

Outcome:

- current review flows can consume extracted graph candidates

Suggested implementation:

- extend [review_export.py](/home/netuser/bin/GroundRecall/src/groundrecall/review_export.py)
- extend review app payloads under [review_app](/home/netuser/bin/GroundRecall/src/groundrecall/review_app)

UI payload features:

- candidate relation cards
- alias-cluster cards
- chunk evidence preview
- inferred/grounded badges

Acceptance criteria:

- review bundle includes graph-candidate triage data
- no assistant-specific assumptions leak into canonical records

## Phase 2: Didactopus Graph Review And Workbench Improvements

### Ticket DT-1: Add review-oriented graph overlays

Outcome:

- graph visualizations expose quality problems, not just structure

Suggested implementation:

- extend [knowledge_graph.py](/home/netuser/bin/Didactopus/src/didactopus/knowledge_graph.py)
- extend [graph_retrieval.py](/home/netuser/bin/Didactopus/src/didactopus/graph_retrieval.py)

Overlay ideas:

- edge grounding status
- concept confidence/review status
- weakly grounded concept markers
- disconnected concept islands

Acceptance criteria:

- exported graph JSON can distinguish grounded, heuristic, and inferred links
- downstream visual layers can highlight fragile concepts

### Ticket DT-2: Add concept consolidation suggestions

Outcome:

- reviewers get merge/split suggestions based on graph and text structure

Suggested implementation:

- extend [graph_builder.py](/home/netuser/bin/Didactopus/src/didactopus/graph_builder.py)
- extend [review_export.py](/home/netuser/bin/Didactopus/src/didactopus/review_export.py)

Input signals:

- title similarity
- shared source lessons
- overlapping prerequisite neighborhoods
- overlapping mastery signals

Acceptance criteria:

- review exports include merge suggestions
- suggested merges remain proposals, not automatic edits

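The sketch below covers only the first input signal, title similarity, using the standard library's `difflib.SequenceMatcher`. The function name `merge_suggestions`, the 0.85 threshold, and the suggestion record shape are assumptions for illustration; the other signals (shared lessons, prerequisite neighborhoods, mastery overlap) would need access to `Didactopus` graph data not shown here.

```python
from __future__ import annotations

from difflib import SequenceMatcher
from itertools import combinations


def merge_suggestions(titles: dict[str, str], threshold: float = 0.85) -> list[dict]:
    """Propose concept merges for pairs whose titles are nearly identical.

    `titles` maps a concept id to its title. The result is a list of
    proposals only; nothing here edits the graph.
    """
    suggestions = []
    for (id_a, title_a), (id_b, title_b) in combinations(sorted(titles.items()), 2):
        score = SequenceMatcher(None, title_a.casefold(), title_b.casefold()).ratio()
        if score >= threshold:
            suggestions.append({
                "concept_ids": [id_a, id_b],
                "signal": "title_similarity",
                "score": round(score, 3),
                "status": "proposed",  # a reviewer decides whether to merge
            })
    return suggestions
```

Pairwise comparison is quadratic in the number of concepts, which is likely acceptable for a single pack but may need blocking or indexing for larger graphs.
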
### Ticket DT-3: Add learner-workbench graph inspection modes

Outcome:

- learners and reviewers can inspect why concepts exist and how they connect

Suggested implementation:

- extend [learner_workbench.py](/home/netuser/bin/Didactopus/src/didactopus/learner_workbench.py)
- extend backend route [api.py](/home/netuser/bin/Didactopus/src/didactopus/api.py)

Views:

- concept neighborhood
- source-fragment grounding trail
- alternate supporting lessons
- fragile or noisy concept warnings

Acceptance criteria:

- workbench can show source-grounded concept neighborhoods
- concept provenance is inspectable without digging through raw JSON

### Ticket DT-4: Add graph diagnostics to `doclift-bundle` pack generation

Outcome:

- `doclift -> Didactopus` imports surface noisy graph structure early

Suggested implementation:

- extend [doclift_bundle_demo.py](/home/netuser/bin/Didactopus/src/didactopus/doclift_bundle_demo.py)
- extend [main.py](/home/netuser/bin/Didactopus/src/didactopus/main.py) `doclift-bundle`

Artifacts:

- `graph_diagnostics.json`
- `concept_merge_suggestions.json`

Acceptance criteria:

- importing a `doclift` bundle produces diagnostics alongside `knowledge_graph.json`
- review workflow can consume those diagnostics

## Phase 3: doclift Optional Extraction-Friendly Sidecars

### Ticket DL-1: Emit stable chunk metadata

Outcome:

- downstream systems can import `doclift` bundles without re-segmenting blindly

Suggested implementation:

- extend [schemas.py](/home/netuser/bin/doclift/src/doclift/schemas.py)
- extend [convert.py](/home/netuser/bin/doclift/src/doclift/convert.py)

Artifacts:

- `document.chunks.json`

Fields:

- `chunk_id`
- `line_start`
- `line_end`
- `section_labels`
- `text`

Acceptance criteria:

- bundle remains valid without downstream AI extraction
- chunk metadata is deterministic across repeat runs

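One possible shape for `document.chunks.json` is sketched below, using the five fields listed above. The heading-based splitting rule, the function names `chunk_markdown` and `render_chunks_json`, and the hash-derived id format are assumptions for illustration, not the existing `doclift` behavior. The heading check is deliberately naive and would misread `#` lines inside fenced code blocks.

```python
from __future__ import annotations

import hashlib
import json
import re

_HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def chunk_markdown(markdown: str) -> list[dict]:
    """Split Markdown at headings into chunk records with 1-based line ranges."""
    lines = markdown.splitlines()
    chunks: list[dict] = []
    labels: list[str] = []  # current heading trail, outermost first
    buffer: list[str] = []
    start = 1

    def flush(end_line: int) -> None:
        text = "\n".join(buffer).strip()
        if text:
            key = f"{start}:{end_line}:{text}".encode("utf-8")
            chunks.append({
                "chunk_id": "chunk-" + hashlib.sha256(key).hexdigest()[:12],
                "line_start": start,
                "line_end": end_line,  # may include trailing blank lines
                "section_labels": list(labels),
                "text": text,
            })
        buffer.clear()

    for number, line in enumerate(lines, start=1):
        match = _HEADING.match(line)
        if match:
            flush(number - 1)  # close the previous chunk before the heading
            start = number
            level = len(match.group(1))
            labels[:] = labels[: level - 1] + [match.group(2)]
        buffer.append(line)
    flush(len(lines))
    return chunks


def render_chunks_json(markdown: str) -> str:
    """Serialize with sorted keys so repeat runs are byte-identical."""
    return json.dumps(chunk_markdown(markdown), indent=2, sort_keys=True)
```

No timestamps, random ids, or environment-dependent values enter the output, which is what makes the "deterministic across repeat runs" criterion straightforward to test.
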
### Ticket DL-2: Add optional graph-preview sidecars

Outcome:

- operators can inspect likely extracted structure at the bundle stage

Suggested implementation:

- add optional post-processing module such as `src/doclift/graph_preview.py`

Artifacts:

- `document.entities.json`
- `document.relations.json`
- optional `bundle_graph_preview.json`

CLI:

- extend `doclift convert`
- extend `doclift convert-dir`
- flags:
  - `--graph-preview`
  - `--graph-preview-mode heuristic|llm`

Important restriction:

- these are preview/debug artifacts only
- they are not the bundle's canonical semantics

Acceptance criteria:

- graph preview can be disabled entirely
- default conversion remains deterministic and lightweight

### Ticket DL-3: Add HTML inspection output for graph previews

Outcome:

- maintainers can inspect extracted structure before import

Suggested implementation:

- add `doclift preview-graph /path/to/bundle`

Acceptance criteria:

- preview HTML references chunk ids and source lines
- graph preview is visibly separate from conversion success reporting

## Cross-Repo Integration Tickets

### Ticket X-1: `doclift -> GroundRecall` candidate-graph import path

Outcome:

- `GroundRecall` can consume `doclift` chunk metadata directly

Modules:

- `doclift` emits `document.chunks.json`
- `GroundRecall` `doclift_bundle` adapter imports it

Acceptance criteria:

- `groundrecall import /path/to/doclift-bundle --extract-graph` uses `doclift` chunk ids, where available, instead of re-splitting the Markdown

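The "use `doclift` chunk ids where available" rule amounts to a simple preference order, sketched below. The bundle layout is an assumption: the example supposes `document.chunks.json` is a JSON list at the bundle root and that the converted text lives in a hypothetical `document.md`. The names `load_chunks`, `naive_paragraph_splitter`, and the `chunk_source` field are likewise illustrative.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Callable


def naive_paragraph_splitter(markdown: str) -> list[dict]:
    """Fallback used only when the bundle carries no chunk metadata."""
    paragraphs = [p.strip() for p in markdown.split("\n\n") if p.strip()]
    return [{"chunk_id": f"local-{i}", "text": p}
            for i, p in enumerate(paragraphs, start=1)]


def load_chunks(bundle_dir: str | Path,
                fallback_splitter: Callable[[str], list[dict]]) -> list[dict]:
    """Prefer chunks emitted by doclift; re-split only if they are missing."""
    bundle = Path(bundle_dir)
    chunks_path = bundle / "document.chunks.json"
    if chunks_path.is_file():
        chunks = json.loads(chunks_path.read_text(encoding="utf-8"))
        # Record where the segmentation came from so provenance stays explicit.
        return [dict(chunk, chunk_source="doclift") for chunk in chunks]
    markdown = (bundle / "document.md").read_text(encoding="utf-8")
    return [dict(chunk, chunk_source="groundrecall")
            for chunk in fallback_splitter(markdown)]
```

Tagging each chunk with its origin lets later review steps tell imported ids apart from locally generated ones.
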
### Ticket X-2: Shared graph diagnostics vocabulary

Outcome:

- the three repos use compatible terminology for quality signals

Suggested shared diagnostic keys:

- `orphan_concept`
- `weak_grounding`
- `inferred_relation`
- `alias_cluster`
- `disconnected_component`
- `bridge_concept`
- `high_fanout_noisy_concept`

Acceptance criteria:

- review and export layers can exchange diagnostics without brittle custom mapping

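One lightweight way to share this vocabulary is a single enumeration that each repo imports or vendors. The sketch below uses exactly the seven keys listed above; the `DiagnosticKey` class name and the `unknown_diagnostic_keys` helper are hypothetical.

```python
from __future__ import annotations

from enum import Enum


class DiagnosticKey(str, Enum):
    """Shared diagnostic keys; values match the list in this ticket."""
    ORPHAN_CONCEPT = "orphan_concept"
    WEAK_GROUNDING = "weak_grounding"
    INFERRED_RELATION = "inferred_relation"
    ALIAS_CLUSTER = "alias_cluster"
    DISCONNECTED_COMPONENT = "disconnected_component"
    BRIDGE_CONCEPT = "bridge_concept"
    HIGH_FANOUT_NOISY_CONCEPT = "high_fanout_noisy_concept"


def unknown_diagnostic_keys(payload: dict) -> list[str]:
    """Return payload keys that are not part of the shared vocabulary."""
    known = {key.value for key in DiagnosticKey}
    return sorted(set(payload) - known)
```

A check like `unknown_diagnostic_keys` could run in each repo's tests so that a renamed or misspelled key fails early instead of requiring a custom mapping downstream.
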
## Recommended Build Order

1. `GR-1`
2. `GR-2`
3. `GR-3`
4. `GR-4`
5. `X-1`
6. `DT-1`
7. `DT-2`
8. `DL-1`
9. `DL-2`
10. `DT-4`

## Non-Goals

- replacing GroundRecall canonical models with freeform triples
- forcing LLM extraction into `doclift` core conversion
- auto-promoting inferred relations
- making Didactopus depend on a graph preview layer to ingest ordinary packs

## Immediate Next Step

If only one milestone is funded first, build:

- `GR-1`
- `GR-2`
- `X-1`

That gives the highest-leverage path:

- `doclift` stays deterministic
- `GroundRecall` gains useful graph-candidate import
- `Didactopus` can later consume cleaner grounded artifacts without architectural churn