# GroundRecall Ingestion Refactor Plan

GroundRecall should treat `llmwiki` as one upstream source shape, not as the
defining architecture for grounded knowledge import.

Didactopus already has broader ambitions around ingestion of weakly structured
materials such as:

- markdown notes
- transcripts
- HTML/text course materials
- generated draft packs
- review sessions
- learner artifacts

The GroundRecall import pipeline should therefore be generalized around a shared
normalization and promotion substrate with pluggable source adapters.

## Design rule

Source-specific logic should live at the ingestion edge.

These stages should be generic:

- segmentation
- extraction
- normalization
- lint
- review queue generation
- review bridge
- promotion
- canonical store
- query
- canonical export

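
One way to picture the generic stages is as plain functions that share a record shape and are composed in order, so none of them needs to know which adapter produced its input. The sketch below is illustrative only: the `Record` dict, the two stage bodies, and `run_pipeline` are assumptions made for this example, not existing Didactopus code, and only two of the listed stages are shown.

```python
from typing import Callable

# Hypothetical record shape: a plain dict stands in for whatever the shared
# pipeline would actually pass between stages.
Record = dict
Stage = Callable[[list[Record]], list[Record]]


def segmentation(records: list[Record]) -> list[Record]:
    """Split each record's text into paragraph-sized segments."""
    out = []
    for rec in records:
        for i, chunk in enumerate(rec["text"].split("\n\n")):
            if chunk.strip():
                out.append({**rec, "text": chunk.strip(), "segment": i})
    return out


def normalization(records: list[Record]) -> list[Record]:
    """Collapse runs of whitespace so later stages see uniform text."""
    return [{**rec, "text": " ".join(rec["text"].split())} for rec in records]


def run_pipeline(records: list[Record], stages: list[Stage]) -> list[Record]:
    """Apply each generic stage in order; no stage inspects the source kind."""
    for stage in stages:
        records = stage(records)
    return records


# The adapter name travels along as data, but no stage branches on it.
raw = [{"source_kind": "markdown_notes", "text": "First  note.\n\nSecond\nnote."}]
result = run_pipeline(raw, [segmentation, normalization])
```

Under this shape, supporting a new source means writing a new adapter that emits records, while the stage list stays unchanged.
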
## Recommended module split

Recommended package layout:

- `didactopus.groundrecall_ingest`
- `didactopus.groundrecall_source_adapters.base`
- `didactopus.groundrecall_source_adapters.llmwiki`
- `didactopus.groundrecall_source_adapters.markdown_notes`
- `didactopus.groundrecall_source_adapters.transcript`
- `didactopus.groundrecall_source_adapters.didactopus_pack`
- `didactopus.groundrecall_source_adapters.didactopus_review`

## Shared intermediate envelope

Adapters should emit shared discovery records rather than jumping straight into
canonical GroundRecall objects.

Recommended intermediate types:

- `DiscoveredImportSource`
- `SegmentCandidate`
- `ImportProfile`

This keeps adapter-specific parsing separate from the shared import pipeline.

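
As a rough illustration of the envelope, the three types could be small frozen dataclasses. Only the type names come from this plan; every field below (`adapter`, `path`, `locator`, and so on) and the `segment_paragraphs` helper are assumptions made for the sketch.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ImportProfile:
    """Hypothetical fields: a profile name plus the declared output intent."""
    name: str
    output_intent: str = "grounded_knowledge"


@dataclass(frozen=True)
class DiscoveredImportSource:
    """One file or artifact an adapter found (fields are assumptions)."""
    adapter: str
    path: Path
    profile: ImportProfile


@dataclass(frozen=True)
class SegmentCandidate:
    """A piece of source text offered to the shared pipeline."""
    source: DiscoveredImportSource
    text: str
    locator: str  # hypothetical pointer back into the source, e.g. "paragraph:1"


def segment_paragraphs(source: DiscoveredImportSource, text: str) -> list[SegmentCandidate]:
    """Adapter-side helper: emit one candidate per non-empty paragraph."""
    return [
        SegmentCandidate(source=source, text=para.strip(), locator=f"paragraph:{i}")
        for i, para in enumerate(text.split("\n\n"))
        if para.strip()
    ]


profile = ImportProfile(name="default")
source = DiscoveredImportSource("markdown_notes", Path("notes/intro.md"), profile)
candidates = segment_paragraphs(source, "Alpha.\n\nBeta.")
```

Because the candidates carry a reference to their source and profile, the shared pipeline can trace provenance without importing any adapter module.
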
## Output intent

Not every imported source should be treated the same way.

Adapters should declare an output intent:

- `grounded_knowledge`
- `curriculum`
- `both`

Examples:

- `llmwiki` usually targets `grounded_knowledge`
- loose transcripts may target `grounded_knowledge`
- syllabus/course folders often target `curriculum`
- Didactopus packs or review sessions may target `both`

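
The three intents map naturally onto an enumeration. The following is a minimal sketch, assuming an `OutputIntent` enum, a `feeds` helper, and a table of per-adapter defaults; all three names are hypothetical, and the defaults simply mirror the examples above, which are themselves tendencies rather than rules.

```python
from enum import Enum


class OutputIntent(Enum):
    GROUNDED_KNOWLEDGE = "grounded_knowledge"
    CURRICULUM = "curriculum"
    BOTH = "both"

    def feeds(self, target: "OutputIntent") -> bool:
        """Return True if records with this intent should flow to `target`."""
        return self is OutputIntent.BOTH or self is target


# Hypothetical defaults keyed by adapter name; "course_folder" is an invented
# key for the syllabus/course-folder case. An adapter or import profile could
# override any of these.
DEFAULT_INTENT = {
    "llmwiki": OutputIntent.GROUNDED_KNOWLEDGE,
    "transcript": OutputIntent.GROUNDED_KNOWLEDGE,
    "course_folder": OutputIntent.CURRICULUM,
    "didactopus_pack": OutputIntent.BOTH,
    "didactopus_review": OutputIntent.BOTH,
}

# A pack feeds both downstream paths; an llmwiki import feeds only one.
pack_to_curriculum = DEFAULT_INTENT["didactopus_pack"].feeds(OutputIntent.CURRICULUM)
wiki_to_curriculum = DEFAULT_INTENT["llmwiki"].feeds(OutputIntent.CURRICULUM)
```
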
## First refactor milestones

### Milestone 1

- introduce adapter registry and adapter protocol
- move current `llmwiki` discovery/classification behind an adapter
- preserve the current import CLI behavior

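
A possible shape for the adapter protocol and registry is sketched below. The method names (`can_handle`, `discover`), the module-level registry, and in particular the `wiki/` directory heuristic are assumptions for illustration; the real `llmwiki` layout and the existing discovery logic are not described in this plan.

```python
from pathlib import Path
from typing import Iterator, Protocol


class SourceAdapter(Protocol):
    """Hypothetical adapter protocol; method names are assumptions."""
    name: str

    def can_handle(self, root: Path) -> bool: ...
    def discover(self, root: Path) -> Iterator[Path]: ...


_REGISTRY: dict[str, SourceAdapter] = {}


def register(adapter: SourceAdapter) -> SourceAdapter:
    """Add an adapter to the registry, rejecting duplicate names."""
    if adapter.name in _REGISTRY:
        raise ValueError(f"adapter already registered: {adapter.name}")
    _REGISTRY[adapter.name] = adapter
    return adapter


def resolve(root: Path) -> SourceAdapter:
    """Return the first registered adapter that claims the given root."""
    for adapter in _REGISTRY.values():
        if adapter.can_handle(root):
            return adapter
    raise LookupError(f"no adapter can handle {root}")


class LlmWikiAdapter:
    name = "llmwiki"

    def can_handle(self, root: Path) -> bool:
        # Placeholder heuristic, NOT the real llmwiki layout: claim any root
        # that contains a "wiki" subdirectory.
        return (root / "wiki").is_dir()

    def discover(self, root: Path) -> Iterator[Path]:
        # Yield markdown pages in a stable order.
        yield from sorted((root / "wiki").rglob("*.md"))


register(LlmWikiAdapter())
```

With a registry like this, the existing import CLI could keep its current arguments and call `resolve(root)` internally, which is one way to preserve behavior while moving the `llmwiki` logic behind an adapter.
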
### Milestone 2

- add a `markdown_notes` adapter
- add a `transcript` adapter
- add import profiles that tune extraction strictness

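
To make "extraction strictness" concrete, a profile might carry thresholds that decide which extracted candidates survive. This sketch shows only strictness-related fields; `min_confidence`, `require_citation`, the two named profiles, and the candidate dict shape are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImportProfile:
    """Strictness knobs only; both fields are assumptions for illustration."""
    name: str
    min_confidence: float
    require_citation: bool


STRICT = ImportProfile("strict", min_confidence=0.8, require_citation=True)
LENIENT = ImportProfile("lenient", min_confidence=0.4, require_citation=False)


def accept(candidate: dict, profile: ImportProfile) -> bool:
    """Keep a candidate only if it clears the profile's thresholds."""
    if candidate["confidence"] < profile.min_confidence:
        return False
    if profile.require_citation and not candidate.get("citation"):
        return False
    return True


candidates = [
    {"text": "Claim with a source.", "confidence": 0.9, "citation": "notes.md"},
    {"text": "Loose remark.", "confidence": 0.6, "citation": None},
]
strict_kept = [c for c in candidates if accept(c, STRICT)]
lenient_kept = [c for c in candidates if accept(c, LENIENT)]
```

A transcript adapter might default to the lenient profile and rely on the review queue to filter noise, while a curated source could use the strict one.
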
### Milestone 3

- add a `didactopus_pack` adapter for pack and review artifacts
- allow current Didactopus outputs to feed into GroundRecall directly

## Why this matters

This avoids building two parallel ingestion stacks inside Didactopus:

- one for packs and educational structures
- another for grounded knowledge capture

Instead, the system gets one generic ingestion substrate with multiple source
adapters and multiple downstream promotion/export paths.