# GroundRecall Ingestion Refactor Plan

GroundRecall should treat `llmwiki` as one upstream source shape, not as the
defining architecture for grounded knowledge import.

Didactopus already aims to ingest a broader range of weakly structured
materials, such as:
- markdown notes
- transcripts
- HTML/text course materials
- generated draft packs
- review sessions
- learner artifacts

The GroundRecall import pipeline should therefore be generalized around a shared
normalization and promotion substrate with pluggable source adapters.

## Design rule

Source-specific logic should live at the ingestion edge.

These stages should be generic:
- segmentation
- extraction
- normalization
- lint
- review queue generation
- review bridge
- promotion
- canonical store
- query
- canonical export
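
A minimal sketch of what "generic" could mean in practice: each stage is a plain function over shared records, and none of them branches on where a record came from. The dict-based record shape, the `source_kind` key, the `too_short` warning, and all function names are assumptions made for illustration, not existing GroundRecall code.

```python
from typing import Callable

# Hypothetical record shape: a plain dict stands in for whatever the shared
# pipeline actually passes between stages.
Record = dict
Stage = Callable[[list], list]


def normalize(records: list) -> list:
    """Collapse runs of whitespace so later stages see uniform text."""
    return [{**rec, "text": " ".join(rec["text"].split())} for rec in records]


def lint(records: list) -> list:
    """Attach warnings instead of dropping records; review decides later."""
    linted = []
    for rec in records:
        warnings = ["too_short"] if len(rec["text"]) < 10 else []
        linted.append({**rec, "warnings": warnings})
    return linted


# Only two of the listed stages are sketched; the others would slot in the same way.
GENERIC_STAGES = [normalize, lint]


def run_pipeline(records: list, stages=GENERIC_STAGES) -> list:
    """Run every stage in order. No stage inspects `source_kind`."""
    for stage in stages:
        records = stage(records)
    return records
```

Because the stages only see the shared record shape, adding a new source means writing an adapter, not touching this list.
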
## Recommended module split
Recommended package layout:
- `didactopus.groundrecall_ingest`
- `didactopus.groundrecall_source_adapters.base`
- `didactopus.groundrecall_source_adapters.llmwiki`
- `didactopus.groundrecall_source_adapters.markdown_notes`
- `didactopus.groundrecall_source_adapters.transcript`
- `didactopus.groundrecall_source_adapters.didactopus_pack`
- `didactopus.groundrecall_source_adapters.didactopus_review`
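
As a rough illustration of what the `base` module might hold, the sketch below defines an adapter contract with `typing.Protocol` and one toy adapter that satisfies it. The method names `can_handle` and `discover`, and the behavior of the example adapter, are assumptions; the plan itself only names the modules.

```python
from pathlib import Path
from typing import Iterator, Protocol, runtime_checkable


@runtime_checkable
class SourceAdapter(Protocol):
    """Hypothetical contract for `groundrecall_source_adapters.base`."""

    name: str

    def can_handle(self, root: Path) -> bool:
        """Return True if this adapter recognizes the layout under `root`."""
        ...

    def discover(self, root: Path) -> Iterator[Path]:
        """Yield the files this adapter wants the shared pipeline to import."""
        ...


class MarkdownNotesAdapter:
    """Toy stand-in for the `markdown_notes` adapter: picks up every .md file."""

    name = "markdown_notes"

    def can_handle(self, root: Path) -> bool:
        return any(root.rglob("*.md"))

    def discover(self, root: Path) -> Iterator[Path]:
        # Sorted so discovery order is stable across runs and platforms.
        yield from sorted(root.rglob("*.md"))
```

Using a structural protocol rather than a base class keeps adapters free of inheritance from the core package, which fits the rule that source-specific logic stays at the edge.
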
## Shared intermediate envelope

Adapters should emit shared discovery records rather than jumping straight into
canonical GroundRecall objects.

Recommended intermediate types:

- `DiscoveredImportSource`
- `SegmentCandidate`
- `ImportProfile`

This keeps adapter-specific parsing separate from the shared import pipeline.
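
One way the three types could look is sketched below as frozen dataclasses. Only the type names come from this plan; every field (`adapter_name`, `locator`, `metadata`, `order`, `min_confidence`, `require_citation`) and the `to_candidates` helper are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ImportProfile:
    """Tuning knobs for one import run (field names are assumptions)."""
    name: str
    min_confidence: float = 0.5
    require_citation: bool = False


@dataclass(frozen=True)
class DiscoveredImportSource:
    """One unit an adapter found, before any content is interpreted."""
    adapter_name: str
    locator: str  # for example a file path or URL
    metadata: dict = field(default_factory=dict)


@dataclass(frozen=True)
class SegmentCandidate:
    """A chunk of source text proposed for extraction."""
    source: DiscoveredImportSource
    order: int
    text: str


def to_candidates(source: DiscoveredImportSource, chunks: list) -> list:
    """Wrap adapter-produced chunks in the shared envelope, skipping blanks."""
    kept = [c.strip() for c in chunks if c.strip()]
    return [SegmentCandidate(source, i, c) for i, c in enumerate(kept)]
```

An adapter would do its own parsing and hand over `SegmentCandidate` objects; nothing downstream needs to know whether the chunks came from a transcript or a wiki page.
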
## Output intent

Not every imported source should be treated the same way.

Adapters should declare an output intent:

- `grounded_knowledge`
- `curriculum`
- `both`

Examples:
- `llmwiki` usually targets `grounded_knowledge`
- loose transcripts may target `grounded_knowledge`
- syllabus/course folders often target `curriculum`
- Didactopus packs or review sessions may target `both`
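
The intent values could be modeled as an enum, with `both` expanding to the two concrete promotion paths. This is a sketch under assumptions: the `OutputIntent` class, the `promotion_targets` function, and the `DEFAULT_INTENTS` table are hypothetical names, and the defaults simply mirror the "usually" and "may" wording of the examples, so an adapter or caller could still override them.

```python
from enum import Enum


class OutputIntent(str, Enum):
    GROUNDED_KNOWLEDGE = "grounded_knowledge"
    CURRICULUM = "curriculum"
    BOTH = "both"


# Hypothetical per-adapter defaults, taken from the examples in this section.
DEFAULT_INTENTS = {
    "llmwiki": OutputIntent.GROUNDED_KNOWLEDGE,
    "transcript": OutputIntent.GROUNDED_KNOWLEDGE,
    "didactopus_pack": OutputIntent.BOTH,
}


def promotion_targets(intent: OutputIntent) -> set:
    """Expand an intent into the concrete downstream paths it feeds."""
    if intent is OutputIntent.BOTH:
        return {
            OutputIntent.GROUNDED_KNOWLEDGE.value,
            OutputIntent.CURRICULUM.value,
        }
    return {intent.value}
```

Keeping `both` as an explicit value rather than a pair of flags means the declared intent can round-trip through configuration as a single string.
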
## First refactor milestones
### Milestone 1
- introduce adapter registry and adapter protocol
- move current `llmwiki` discovery/classification behind an adapter
- preserve the current import CLI behavior
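
A small registry is one way to meet all three points at once. In the sketch below, adapters register under a name and lookup defaults to `llmwiki`, so a caller that names no adapter gets today's behavior. The `register_adapter` decorator, `get_adapter` function, and the empty `LlmwikiAdapter` class are hypothetical; the real discovery and classification code would move into that class.

```python
from typing import Callable

# Maps adapter name to a zero-argument factory (hypothetical structure).
_REGISTRY: dict = {}


def register_adapter(name: str) -> Callable:
    """Class decorator that records an adapter factory under `name`."""
    def decorator(factory):
        if name in _REGISTRY:
            raise ValueError(f"adapter {name!r} is already registered")
        _REGISTRY[name] = factory
        return factory
    return decorator


@register_adapter("llmwiki")
class LlmwikiAdapter:
    """Placeholder: the existing llmwiki discovery logic would live here."""
    name = "llmwiki"


def get_adapter(name: str = "llmwiki"):
    """Instantiate an adapter; the default preserves the current behavior."""
    try:
        factory = _REGISTRY[name]
    except KeyError:
        known = ", ".join(sorted(_REGISTRY))
        raise ValueError(f"unknown source adapter {name!r}; known: {known}") from None
    return factory()
```
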
### Milestone 2
- add a `markdown_notes` adapter
- add a `transcript` adapter
- add import profiles that tune extraction strictness
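
To make "extraction strictness" concrete, the sketch below shows a profile acting as an acceptance filter over extracted claims. Everything here is an assumption: the plan does not define what a profile contains, so the `min_confidence` and `require_citation` fields, the `strict` and `lenient` profile names, and the dict-shaped claim are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImportProfile:
    """Hypothetical profile shape; the real fields are not specified yet."""
    name: str
    min_confidence: float = 0.5
    require_citation: bool = False


PROFILES = {
    "strict": ImportProfile("strict", min_confidence=0.8, require_citation=True),
    "lenient": ImportProfile("lenient", min_confidence=0.4, require_citation=False),
}


def accept(claim: dict, profile: ImportProfile) -> bool:
    """Decide whether an extracted claim passes the profile's thresholds."""
    if claim["confidence"] < profile.min_confidence:
        return False
    if profile.require_citation and not claim.get("citation"):
        return False
    return True
```

Under these assumptions, a loose transcript might be imported with the lenient profile and rely on the review queue, while curated notes could use the strict one.
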
### Milestone 3
- add a `didactopus_pack` adapter for pack and review artifacts
- allow current Didactopus outputs to feed into GroundRecall directly
## Why this matters
This avoids building two parallel ingestion stacks inside Didactopus:
- one for packs and educational structures
- another for grounded knowledge capture

Instead, the system gets one generic ingestion substrate with multiple source
adapters and multiple downstream promotion/export paths.