GroundRecall llmwiki Import Specification
This document defines the first-pass import path for users who already have some
form of llmwiki-style repository and want to migrate it into the broader
GroundRecall substrate while staying compatible with Didactopus review and
promotion flows.
Goal
The import path should let an existing llmwiki corpus become:
- searchable without immediate manual cleanup
- reviewable rather than blindly trusted
- grounded in explicit provenance
- promotable into durable structured knowledge objects
- exportable back into compiled wiki pages, assistant adapter bundles, and queryable graph artifacts
The key rule is:
Imported wiki pages are derived artifacts, not automatic source truth.
Import philosophy
Users coming from llmwiki often have a mixture of:
- raw notes
- compiled markdown pages
- local source files
- generated summaries
- ad hoc link graphs
- session transcripts
- speculative or weakly-supported synthesis
GroundRecall should preserve that work without pretending all of it is already promoted knowledge.
The import pipeline therefore has two responsibilities:
- Preserve the original material with minimal loss.
- Reify explicit structured objects that can later be reviewed and promoted.
Scope of the first implementation
The first implementation should support common llmwiki layouts such as:
- raw/
- wiki/
- schema.*
- logs/
- sources/
- top-level markdown pages
The importer should not require a canonical upstream schema. It should operate from directory conventions plus simple heuristics.
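As a minimal sketch of what "directory conventions plus simple heuristics" could look like, the function below guesses an artifact kind from a repository-relative path. The function name `classify_path`, the `DIR_KINDS` table, and the kind labels `session_log`, `schema`, and `unknown` are illustrative assumptions, not part of the specification; `raw_note`, `compiled_page`, and `source_document` reuse names that appear elsewhere in this document.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from top-level directory to artifact kind.
DIR_KINDS = {
    "raw": "raw_note",
    "wiki": "compiled_page",
    "logs": "session_log",        # assumed label
    "sources": "source_document",
}


def classify_path(rel_path: str) -> str:
    """Guess an artifact kind from a repository-relative POSIX path."""
    path = PurePosixPath(rel_path)
    parts = path.parts
    if len(parts) == 1:
        # Top-level file: either schema.* or a loose markdown page.
        if path.name.startswith("schema."):
            return "schema"
        if path.suffix.lower() == ".md":
            return "compiled_page"
        return "unknown"
    # Anything under an unrecognized directory stays unclassified.
    return DIR_KINDS.get(parts[0], "unknown")
```

Returning `unknown` rather than raising keeps discovery tolerant of custom layouts, which fits the requirement that no canonical upstream schema is needed.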
Import modes
1. archive
Purpose:
- preserve an existing llmwiki tree as read-only imported artifacts
- index it for search and later review
Behavior:
- no claim promotion
- minimal extraction
- all compiled pages remain draft
Use when:
- the user wants backward compatibility first
- the corpus quality is unknown
2. quick
Purpose:
- bootstrap usable structured objects fast
Behavior:
- import pages and raw sources
- extract candidate claims and concepts heuristically
- attach lightweight provenance
- queue uncertain items for review
Use when:
- the user wants early utility and accepts heuristic noise
3. grounded
Purpose:
- perform a migration suitable for long-lived shared knowledge
Behavior:
- require provenance for promoted claims
- mark unsupported statements explicitly
- produce review records and lint findings
- populate promotion queues rather than auto-promoting
Use when:
- the imported corpus will be shared across machines or agents
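One way to keep the three modes from drifting apart in code is to express each as a small policy record. The sketch below is an interpretation of the behaviors listed above, not a defined API: `ImportMode`, `ModePolicy`, `MODE_POLICIES`, `policy_for`, and the three flag names are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class ImportMode(str, Enum):
    ARCHIVE = "archive"
    QUICK = "quick"
    GROUNDED = "grounded"


@dataclass(frozen=True)
class ModePolicy:
    extract_candidates: bool   # heuristically extract claims and concepts
    require_provenance: bool   # promoted claims must carry provenance
    queue_for_review: bool     # populate review and promotion queues


# Assumed reading of the mode descriptions; adjust as the spec firms up.
MODE_POLICIES = {
    ImportMode.ARCHIVE: ModePolicy(False, False, False),
    ImportMode.QUICK: ModePolicy(True, False, True),
    ImportMode.GROUNDED: ModePolicy(True, True, True),
}


def policy_for(mode: str) -> ModePolicy:
    """Look up the policy for a mode name; unknown names raise ValueError."""
    return MODE_POLICIES[ImportMode(mode)]
```

No mode carries an auto-promotion flag, since every mode routes through review rather than promoting directly.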
Pipeline stages
1. Capture
The importer records the source repository as an import artifact.
Required metadata:
- import_id
- import_mode
- source_root
- imported_at
- machine_id
- agent_id
- source_repo_kind=llmwiki
Outputs:
- import manifest
- artifact records for all discovered files
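The capture stage could be sketched roughly as follows. The function name `capture`, the UUID-based `import_id`, and the sequential `ia_NNN` artifact IDs are assumptions made for illustration; the manifest keys are the required metadata fields listed above.

```python
import hashlib
import uuid
from datetime import datetime, timezone
from pathlib import Path


def capture(source_root: str, import_mode: str, machine_id: str, agent_id: str):
    """Return (manifest, artifact_records) for every file under source_root."""
    root = Path(source_root)
    manifest = {
        "import_id": f"imp_{uuid.uuid4().hex[:12]}",  # ID scheme is an assumption
        "import_mode": import_mode,
        "source_root": str(root),
        "imported_at": datetime.now(timezone.utc).isoformat(),
        "machine_id": machine_id,
        "agent_id": agent_id,
        "source_repo_kind": "llmwiki",
    }
    artifacts = []
    # Sort so artifact IDs are stable across runs on the same tree.
    files = sorted(p for p in root.rglob("*") if p.is_file())
    for index, path in enumerate(files, start=1):
        artifacts.append({
            "artifact_id": f"ia_{index:03d}",
            "import_id": manifest["import_id"],
            "path": path.relative_to(root).as_posix(),
            "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
            "current_status": "draft",
        })
    return manifest, artifacts
```

Hashing file contents at capture time gives later stages a cheap way to detect that a source file changed between imports.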
2. Segment
Imported content is split into stable units.
Primary segment types:
- source_document
- source_fragment
- compiled_page
- section_summary
- candidate_claim
- candidate_concept
- candidate_relation
- session_observation
Segmentation should preserve:
- original path
- section heading
- line or byte offsets when possible
- page title
- frontmatter fields
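A minimal sketch of heading-based segmentation that preserves the original path, section heading, and line offsets might look like this. The names `segment_by_headings` and `_close` are hypothetical, and the sketch only recognizes ATX headings (`#` through `######`); it does not skip headings inside fenced code blocks or parse frontmatter.

```python
import re

# ATX heading: one to six '#', whitespace, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def segment_by_headings(text: str, path: str):
    """Split markdown into sections with 1-based, inclusive line offsets."""
    segments = []
    current = {"origin_path": path, "origin_section": None,
               "line_start": 1, "lines": []}
    for number, line in enumerate(text.splitlines(), start=1):
        match = HEADING.match(line)
        if match:
            # A new heading closes the section collected so far.
            if current["lines"]:
                segments.append(_close(current, number - 1))
            current = {"origin_path": path, "origin_section": match.group(2),
                       "line_start": number, "lines": []}
        current["lines"].append(line)
    if current["lines"]:
        last = current["line_start"] + len(current["lines"]) - 1
        segments.append(_close(current, last))
    return segments


def _close(segment, line_end):
    """Finalize a section: record its end line and join its text."""
    lines = segment.pop("lines")
    return {**segment, "line_end": line_end, "text": "\n".join(lines)}
```

Content that precedes the first heading is kept as a segment whose `origin_section` is `None`, so no text is dropped during import.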
3. Classify
Each segment gets a semantic role.
Recommended roles:
- source
- derivation
- claim
- summary
- question
- todo
- speculation
- obsolete
- transcript
This prevents unsupported prose from being confused with grounded knowledge.
4. Ground
Each imported segment gets provenance and support metadata.
Required grounding fields:
- origin_artifact_id
- origin_path
- origin_section
- source_url when known
- retrieval_date when known
- machine_id
- session_id when known
- support_kind
- grounding_status
Suggested values:
- support_kind: direct_source, derived_from_page, derived_from_session, inferred, unknown
- grounding_status: grounded, partially_grounded, ungrounded
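The grounding fields and their suggested values can be captured in a small validated record. This is a sketch under assumptions: the class name `Grounding`, the choice of defaults (`unknown` and `ungrounded`), and the use of `None` for the "when known" fields are illustrative rather than specified.

```python
from dataclasses import dataclass
from typing import Optional

SUPPORT_KINDS = {"direct_source", "derived_from_page",
                 "derived_from_session", "inferred", "unknown"}
GROUNDING_STATUSES = {"grounded", "partially_grounded", "ungrounded"}


@dataclass
class Grounding:
    origin_artifact_id: str
    origin_path: str
    origin_section: Optional[str]
    machine_id: str
    # Conservative defaults (an assumption): nothing is grounded until shown.
    support_kind: str = "unknown"
    grounding_status: str = "ungrounded"
    source_url: Optional[str] = None      # when known
    retrieval_date: Optional[str] = None  # when known
    session_id: Optional[str] = None      # when known

    def __post_init__(self):
        # Reject values outside the suggested vocabularies.
        if self.support_kind not in SUPPORT_KINDS:
            raise ValueError(f"unknown support_kind: {self.support_kind!r}")
        if self.grounding_status not in GROUNDING_STATUSES:
            raise ValueError(f"unknown grounding_status: {self.grounding_status!r}")
```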
5. Normalize
The importer emits explicit GroundRecall objects.
Minimum object set:
- Source
- Fragment
- Artifact
- Observation
- Claim
- Concept
- Relation
6. Lint
The importer produces machine-readable findings before promotion.
Required lint checks:
- claim has no supporting fragment
- multiple claims appear text-identical
- concept is orphaned
- relation points to missing concept
- page summary has no cited support
- imported item marked obsolete still linked as current
- same claim imported with conflicting confidence or polarity
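Four of these checks can be sketched directly against the object contracts defined later in this document. The function name `lint`, the finding codes, and the finding shape are hypothetical. Two interpretations are also assumed: "text-identical" is tested after lowercasing and collapsing whitespace, and a concept counts as orphaned when no claim or relation references it.

```python
from collections import defaultdict


def lint(claims, concepts, relations):
    """Return finding dicts for a subset of the required lint checks."""
    findings = []
    concept_ids = {c["concept_id"] for c in concepts}

    # Check: claim has no supporting fragment.
    for claim in claims:
        if not claim.get("supporting_fragment_ids"):
            findings.append({"code": "claim_without_fragment",
                             "object_id": claim["claim_id"]})

    # Check: multiple claims appear text-identical (normalized comparison).
    by_text = defaultdict(list)
    for claim in claims:
        key = " ".join(claim["claim_text"].lower().split())
        by_text[key].append(claim["claim_id"])
    for ids in by_text.values():
        if len(ids) > 1:
            findings.append({"code": "duplicate_claim_text",
                             "object_id": ids[0], "related_ids": ids[1:]})

    # Check: relation points to missing concept.
    linked = set()
    for rel in relations:
        for end in (rel["source_id"], rel["target_id"]):
            if end in concept_ids:
                linked.add(end)
            else:
                findings.append({"code": "relation_missing_concept",
                                 "object_id": rel["relation_id"],
                                 "missing_id": end})

    # Check: concept is orphaned (referenced by no claim and no relation).
    for claim in claims:
        linked.update(claim.get("concept_ids", []))
    for concept_id in sorted(concept_ids - linked):
        findings.append({"code": "orphaned_concept", "object_id": concept_id})
    return findings
```

Findings are plain dictionaries so they can be serialized as-is into a machine-readable report.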
7. Promote
Imported objects enter existing Didactopus review/promotion lanes rather than becoming trusted immediately.
Recommended states:
- draft
- triaged
- reviewed
- promoted
- superseded
- archived
8. Export
Promoted objects can then be rendered back out as:
- compiled wiki pages
- graph snapshots
- assistant adapter bundles
- review reports
- query bundles for assistant-facing use
Object contracts
ImportedArtifact
{
  "artifact_id": "ia_001",
  "import_id": "imp_2026_04_16_a",
  "artifact_kind": "compiled_page",
  "path": "wiki/channel-capacity.md",
  "title": "Channel Capacity",
  "sha256": "abc123",
  "created_at": "2026-04-16T14:00:00Z",
  "metadata": {
    "frontmatter": {},
    "headings": ["Definition", "Examples"]
  },
  "current_status": "draft"
}
ImportedObservation
{
  "observation_id": "obs_001",
  "import_id": "imp_2026_04_16_a",
  "artifact_id": "ia_001",
  "role": "summary",
  "text": "Capacity bounds reliable communication over a noisy channel.",
  "origin_path": "wiki/channel-capacity.md",
  "origin_section": "Definition",
  "line_start": 12,
  "line_end": 14,
  "grounding_status": "partially_grounded",
  "support_kind": "derived_from_page",
  "confidence_hint": 0.63,
  "current_status": "draft"
}
ImportedClaim
{
  "claim_id": "clm_001",
  "import_id": "imp_2026_04_16_a",
  "claim_text": "Channel capacity is the maximum reliable communication rate for a channel model.",
  "claim_kind": "definition",
  "source_observation_ids": ["obs_001"],
  "supporting_fragment_ids": ["frag_014"],
  "concept_ids": ["concept::channel-capacity"],
  "confidence_hint": 0.74,
  "grounding_status": "grounded",
  "current_status": "triaged"
}
ImportedConcept
{
  "concept_id": "concept::channel-capacity",
  "import_id": "imp_2026_04_16_a",
  "title": "Channel Capacity",
  "aliases": [],
  "description": "Imported concept from llmwiki corpus.",
  "source_artifact_ids": ["ia_001"],
  "current_status": "triaged"
}
ImportedRelation
{
  "relation_id": "rel_001",
  "import_id": "imp_2026_04_16_a",
  "source_id": "concept::shannon-entropy",
  "target_id": "concept::channel-capacity",
  "relation_type": "supports_understanding_of",
  "evidence_ids": ["obs_015"],
  "current_status": "draft"
}
Mapping from llmwiki into GroundRecall
Recommended first-pass mapping:
- raw/* -> Source or Artifact(kind=raw_note)
- wiki/*.md -> Artifact(kind=compiled_page)
- frontmatter -> artifact metadata
- headings -> section boundaries
- linked page names -> candidate Concept and Relation
- bullet or sentence extraction -> candidate Observation and Claim
- chat or session logs -> Observation(kind=session_note)
- schema files -> import metadata only unless a future adapter exists
Confidence and trust policy
Imported confidence must remain clearly separate from reviewed confidence.
Recommended fields:
- confidence_hint
- review_confidence
- grounding_status
- review_verdict
Policy:
- confidence_hint comes from heuristic import scoring
- review_confidence exists only after review
- promotion requires at least partially_grounded
- fully ungrounded claims can be stored, but only as draft or archived
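The policy above reduces to two small predicates. This is a sketch, not the Didactopus review API: the names `can_promote` and `allowed_state` are hypothetical, and treating a present `review_confidence` as evidence that review has happened is an assumption drawn from the rule that the field exists only after review.

```python
PROMOTABLE_GROUNDING = {"grounded", "partially_grounded"}
UNGROUNDED_ALLOWED_STATES = {"draft", "archived"}


def can_promote(claim: dict) -> bool:
    """Promotion needs at least partial grounding and a completed review."""
    return (
        claim.get("grounding_status") in PROMOTABLE_GROUNDING
        # Assumption: review_confidence is set only once review is done.
        and claim.get("review_confidence") is not None
    )


def allowed_state(claim: dict, requested_state: str) -> bool:
    """Fully ungrounded claims may only be stored as draft or archived."""
    if claim.get("grounding_status", "ungrounded") == "ungrounded":
        return requested_state in UNGROUNDED_ALLOWED_STATES
    return True
```

Note that `confidence_hint` plays no part in either check, which keeps heuristic import scores from leaking into trust decisions.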
Provenance policy
The importer should follow the existing Didactopus provenance direction:
- preserve source identity
- preserve retrieval date when available
- preserve adaptation status
- keep both human-readable and machine-readable provenance
When only a compiled wiki page exists and the original source is missing:
- the compiled page becomes the immediate origin artifact
- all extracted claims must be marked derived_from_page
- such claims should not auto-promote in grounded mode
Review and promotion integration
Imported Claim and Concept objects should feed into the same general review
machinery already used for pack-oriented promotion:
- create candidate records
- attach lint findings
- route to a triage lane
- collect review verdicts
- emit promotion records
Suggested triage lanes:
- knowledge_capture
- pack_improvement
- skill_export
- source_cleanup
- conflict_resolution
Module layout
First-pass module layout:
- didactopus.groundrecall_import: Entry points and top-level orchestration.
- didactopus.groundrecall_discovery: Finds llmwiki-style files and classifies paths.
- didactopus.groundrecall_segmenter: Splits pages and logs into stable observations and candidate claims.
- didactopus.groundrecall_normalizer: Emits normalized import objects.
- didactopus.groundrecall_lint: Import-time lint checks.
- didactopus.groundrecall_review_bridge: Converts imported objects into review candidates and promotion records.
- didactopus.groundrecall_export: Renders promoted objects back to wiki, graph, and skill artifacts.
CLI shape
Suggested CLI:
python -m didactopus.groundrecall.cli import /path/to/llmwiki --mode archive
python -m didactopus.groundrecall.cli import /path/to/llmwiki --mode quick
python -m didactopus.groundrecall.cli import /path/to/llmwiki --mode grounded
python -m didactopus.groundrecall.cli lint imports/<import-id>
python -m didactopus.groundrecall.cli promote imports/<import-id> /path/to/store
python -m didactopus.groundrecall.cli export /path/to/store exports/groundrecall --concept channel-capacity
Compatibility wrappers still exist during migration:
python -m didactopus.groundrecall_import /path/to/llmwiki --mode grounded
python -m didactopus.groundrecall_lint imports/<import-id>
python -m didactopus.groundrecall_export /path/to/store exports/groundrecall --concept channel-capacity
Filesystem layout
Suggested repository-local layout:
- imports/<import-id>/manifest.json
- imports/<import-id>/artifacts.jsonl
- imports/<import-id>/observations.jsonl
- imports/<import-id>/claims.jsonl
- imports/<import-id>/concepts.jsonl
- imports/<import-id>/relations.jsonl
- imports/<import-id>/lint_findings.json
- imports/<import-id>/review_queue.json
This keeps imported state auditable and easy to sync across machines.
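Writing that layout could be sketched as below. The function name `write_import` and its argument shapes (a manifest dictionary plus a dictionary of record lists keyed by stream name) are assumptions for illustration; the file names are the ones listed above.

```python
import json
from pathlib import Path

STREAMS = ("artifacts", "observations", "claims", "concepts", "relations")


def write_import(base_dir, manifest, objects, lint_findings, review_queue):
    """Write one import under <base_dir>/imports/<import-id>/ and return that path."""
    out = Path(base_dir) / "imports" / manifest["import_id"]
    out.mkdir(parents=True, exist_ok=True)
    (out / "manifest.json").write_text(
        json.dumps(manifest, indent=2), encoding="utf-8")
    for name in STREAMS:
        # One JSON object per line; sorted keys keep diffs stable.
        with open(out / f"{name}.jsonl", "w", encoding="utf-8") as handle:
            for record in objects.get(name, []):
                handle.write(json.dumps(record, sort_keys=True) + "\n")
    (out / "lint_findings.json").write_text(
        json.dumps(lint_findings, indent=2), encoding="utf-8")
    (out / "review_queue.json").write_text(
        json.dumps(review_queue, indent=2), encoding="utf-8")
    return out
```

Every stream file is created even when it is empty, so downstream tools can rely on the full set of files being present.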
Multi-machine sync implication
For distributed assistant use, imported state should be append-oriented and rebuildable.
Recommended sync primitives:
- import manifests
- normalized jsonl object streams
- review records
- promotion records
Non-authoritative derived artifacts:
- rendered wiki pages
- local indexes
- embeddings
- cache files
This allows multiple machines to contribute import events without making the compiled page tree the merge primitive.
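To illustrate why JSONL object streams make a workable merge primitive, the sketch below unions streams from several machines by object ID. The function name `merge_streams` and the first-record-wins rule are assumptions; the result is independent of stream order only when records sharing an ID are identical, so a real implementation would likely report conflicting duplicates as lint findings instead of silently keeping one.

```python
import json


def merge_streams(streams, id_field):
    """Union JSONL lines from several machines, keeping the first record per ID."""
    merged = {}
    for lines in streams:
        for line in lines:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between appended batches
            record = json.loads(line)
            # First-seen wins (an assumption); later duplicates are ignored.
            merged.setdefault(record[id_field], record)
    # Sort by ID so the merged output is deterministic and rebuildable.
    return [merged[key] for key in sorted(merged)]
```

For example, `merge_streams([open_a, open_b], "claim_id")` would accept two open `claims.jsonl` file handles, since iterating a text file yields its lines.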
First implementation milestones
Milestone 1
- discover raw/ and wiki/
- import artifacts
- segment markdown by headings
- emit observations and candidate claims
- write import manifest and jsonl outputs
Milestone 2
- add grounding metadata
- add lint checks
- add triage lanes and review queue output
Milestone 3
- map promoted claims into assistant-neutral exports plus assistant adapter bundles
- render compiled wiki views from promoted objects
- support multi-machine import manifests and merge-safe event storage
Non-goals for the first pass
- perfect semantic claim extraction
- automatic trust assignment
- full upstream llmwiki schema compatibility
- lossless import of every custom plugin or script
- embeddings-first retrieval
The first pass should be conservative, inspectable, and easy to improve.