2.7 KiB
GroundRecall Ingestion Refactor Plan
GroundRecall should treat llmwiki as one upstream source shape, not as the
defining architecture for grounded knowledge import.
Didactopus already has broader ambitions around ingestion of weakly structured materials such as:
- markdown notes
- transcripts
- HTML/text course materials
- generated draft packs
- review sessions
- learner artifacts
The GroundRecall import pipeline should therefore be generalized around a shared normalization and promotion substrate with pluggable source adapters.
Design rule
Source-specific logic should live at the ingestion edge.
These stages should be generic:
- segmentation
- extraction
- normalization
- lint
- review queue generation
- review bridge
- promotion
- canonical store
- query
- canonical export
Recommended module split
Recommended package layout:
didactopus.groundrecall_ingestdidactopus.groundrecall_source_adapters.basedidactopus.groundrecall_source_adapters.llmwikididactopus.groundrecall_source_adapters.markdown_notesdidactopus.groundrecall_source_adapters.transcriptdidactopus.groundrecall_source_adapters.didactopus_packdidactopus.groundrecall_source_adapters.didactopus_review
Shared intermediate envelope
Adapters should emit shared discovery records rather than jumping straight into canonical GroundRecall objects.
Recommended intermediate types:
DiscoveredImportSourceSegmentCandidateImportProfile
This keeps adapter-specific parsing separate from the shared import pipeline.
Output intent
Not every imported source should be treated the same way.
Adapters should declare an output intent:
grounded_knowledgecurriculumboth
Examples:
llmwikiusually targetsgrounded_knowledge- loose transcripts may target
grounded_knowledge - syllabus/course folders often target
curriculum - Didactopus packs or review sessions may target
both
First refactor milestones
Milestone 1
- introduce adapter registry and adapter protocol
- move current
llmwikidiscovery/classification behind an adapter - preserve the current import CLI behavior
Milestone 2
- add a
markdown_notesadapter - add a
transcriptadapter - add import profiles that tune extraction strictness
Milestone 3
- add a
didactopus_packadapter for pack and review artifacts - allow current Didactopus outputs to feed into GroundRecall directly
Why this matters
This avoids building two parallel ingestion stacks inside Didactopus:
- one for packs and educational structures
- another for grounded knowledge capture
Instead, the system gets one generic ingestion substrate with multiple source adapters and multiple downstream promotion/export paths.