497 lines
12 KiB
Markdown
497 lines
12 KiB
Markdown
# GroundRecall `llmwiki` Import Specification
|
|
|
|
This document defines the first-pass import path for users who already have some
|
|
form of `llmwiki`-style repository and want to migrate it into the broader
|
|
GroundRecall substrate while staying compatible with Didactopus review and
|
|
promotion flows.
|
|
|
|
## Goal
|
|
|
|
The import path should let an existing `llmwiki` corpus become:
|
|
|
|
- searchable without immediate manual cleanup
|
|
- reviewable rather than blindly trusted
|
|
- grounded in explicit provenance
|
|
- promotable into durable structured knowledge objects
|
|
- exportable back into compiled wiki pages, assistant adapter bundles, and
|
|
queryable graph artifacts
|
|
|
|
The key rule is:
|
|
|
|
Imported wiki pages are **derived artifacts**, not automatic source truth.
|
|
|
|
## Import philosophy
|
|
|
|
Users coming from `llmwiki` often have a mixture of:
|
|
|
|
- raw notes
|
|
- compiled markdown pages
|
|
- local source files
|
|
- generated summaries
|
|
- ad hoc link graphs
|
|
- session transcripts
|
|
- speculative or weakly-supported synthesis
|
|
|
|
GroundRecall should preserve that work without pretending all of it is already
|
|
promoted knowledge.
|
|
|
|
The import pipeline therefore has two responsibilities:
|
|
|
|
1. Preserve the original material with minimal loss.
|
|
2. Reify explicit structured objects that can later be reviewed and promoted.
|
|
|
|
## Scope of the first implementation
|
|
|
|
The first implementation should support common `llmwiki` layouts such as:
|
|
|
|
- `raw/`
|
|
- `wiki/`
|
|
- `schema.*`
|
|
- `logs/`
|
|
- `sources/`
|
|
- top-level markdown pages
|
|
|
|
The importer should not require a canonical upstream schema. It should operate
|
|
from directory conventions plus simple heuristics.
|
|
|
|
## Import modes
|
|
|
|
### 1. `archive`
|
|
|
|
Purpose:
|
|
- preserve an existing `llmwiki` tree as read-only imported artifacts
|
|
- index it for search and later review
|
|
|
|
Behavior:
|
|
- no claim promotion
|
|
- minimal extraction
|
|
- all compiled pages remain `draft`
|
|
|
|
Use when:
|
|
- the user wants backward compatibility first
|
|
- the corpus quality is unknown
|
|
|
|
### 2. `quick`
|
|
|
|
Purpose:
|
|
- bootstrap usable structured objects fast
|
|
|
|
Behavior:
|
|
- import pages and raw sources
|
|
- extract candidate claims and concepts heuristically
|
|
- attach lightweight provenance
|
|
- queue uncertain items for review
|
|
|
|
Use when:
|
|
- the user wants early utility and accepts heuristic noise
|
|
|
|
### 3. `grounded`
|
|
|
|
Purpose:
|
|
- perform a migration suitable for long-lived shared knowledge
|
|
|
|
Behavior:
|
|
- require provenance for promoted claims
|
|
- mark unsupported statements explicitly
|
|
- produce review records and lint findings
|
|
- populate promotion queues rather than auto-promoting
|
|
|
|
Use when:
|
|
- the imported corpus will be shared across machines or agents
|
|
|
|
## Pipeline stages
|
|
|
|
### 1. Capture
|
|
|
|
The importer records the source repository as an import artifact.
|
|
|
|
Required metadata:
|
|
|
|
- `import_id`
|
|
- `import_mode`
|
|
- `source_root`
|
|
- `imported_at`
|
|
- `machine_id`
|
|
- `agent_id`
|
|
- `source_repo_kind=llmwiki`
|
|
|
|
Outputs:
|
|
|
|
- import manifest
|
|
- artifact records for all discovered files
|
|
|
|
### 2. Segment
|
|
|
|
Imported content is split into stable units.
|
|
|
|
Primary segment types:
|
|
|
|
- `source_document`
|
|
- `source_fragment`
|
|
- `compiled_page`
|
|
- `section_summary`
|
|
- `candidate_claim`
|
|
- `candidate_concept`
|
|
- `candidate_relation`
|
|
- `session_observation`
|
|
|
|
Segmentation should preserve:
|
|
|
|
- original path
|
|
- section heading
|
|
- line or byte offsets when possible
|
|
- page title
|
|
- frontmatter fields
|
|
|
|
### 3. Classify
|
|
|
|
Each segment gets a semantic role.
|
|
|
|
Recommended roles:
|
|
|
|
- `source`
|
|
- `derivation`
|
|
- `claim`
|
|
- `summary`
|
|
- `question`
|
|
- `todo`
|
|
- `speculation`
|
|
- `obsolete`
|
|
- `transcript`
|
|
|
|
This prevents unsupported prose from being confused with grounded knowledge.
|
|
|
|
### 4. Ground
|
|
|
|
Each imported segment gets provenance and support metadata.
|
|
|
|
Required grounding fields:
|
|
|
|
- `origin_artifact_id`
|
|
- `origin_path`
|
|
- `origin_section`
|
|
- `source_url` when known
|
|
- `retrieval_date` when known
|
|
- `machine_id`
|
|
- `session_id` when known
|
|
- `support_kind`
|
|
- `grounding_status`
|
|
|
|
Suggested values:
|
|
|
|
- `support_kind`: `direct_source`, `derived_from_page`, `derived_from_session`,
|
|
`inferred`, `unknown`
|
|
- `grounding_status`: `grounded`, `partially_grounded`, `ungrounded`
|
|
|
|
### 5. Normalize
|
|
|
|
The importer emits explicit GroundRecall objects.
|
|
|
|
Minimum object set:
|
|
|
|
- `Source`
|
|
- `Fragment`
|
|
- `Artifact`
|
|
- `Observation`
|
|
- `Claim`
|
|
- `Concept`
|
|
- `Relation`
|
|
|
|
### 6. Lint
|
|
|
|
The importer produces machine-readable findings before promotion.
|
|
|
|
Required lint checks:
|
|
|
|
- claim has no supporting fragment
|
|
- multiple claims appear text-identical
|
|
- concept is orphaned
|
|
- relation points to missing concept
|
|
- page summary has no cited support
|
|
- imported item marked `obsolete` still linked as current
|
|
- same claim imported with conflicting confidence or polarity
|
|
|
|
### 7. Promote
|
|
|
|
Imported objects enter existing Didactopus review/promotion lanes rather than
|
|
becoming trusted immediately.
|
|
|
|
Recommended states:
|
|
|
|
- `draft`
|
|
- `triaged`
|
|
- `reviewed`
|
|
- `promoted`
|
|
- `superseded`
|
|
- `archived`
|
|
|
|
### 8. Export
|
|
|
|
Promoted objects can then be rendered back out as:
|
|
|
|
- compiled wiki pages
|
|
- graph snapshots
|
|
- assistant adapter bundles
|
|
- review reports
|
|
- query bundles for assistant-facing use
|
|
|
|
## Object contracts
|
|
|
|
### `ImportedArtifact`
|
|
|
|
```json
|
|
{
|
|
"artifact_id": "ia_001",
|
|
"import_id": "imp_2026_04_16_a",
|
|
"artifact_kind": "compiled_page",
|
|
"path": "wiki/channel-capacity.md",
|
|
"title": "Channel Capacity",
|
|
"sha256": "abc123",
|
|
"created_at": "2026-04-16T14:00:00Z",
|
|
"metadata": {
|
|
"frontmatter": {},
|
|
"headings": ["Definition", "Examples"]
|
|
},
|
|
"current_status": "draft"
|
|
}
|
|
```
|
|
|
|
### `ImportedObservation`
|
|
|
|
```json
|
|
{
|
|
"observation_id": "obs_001",
|
|
"import_id": "imp_2026_04_16_a",
|
|
"artifact_id": "ia_001",
|
|
"role": "summary",
|
|
"text": "Capacity bounds reliable communication over a noisy channel.",
|
|
"origin_path": "wiki/channel-capacity.md",
|
|
"origin_section": "Definition",
|
|
"line_start": 12,
|
|
"line_end": 14,
|
|
"grounding_status": "partially_grounded",
|
|
"support_kind": "derived_from_page",
|
|
"confidence_hint": 0.63,
|
|
"current_status": "draft"
|
|
}
|
|
```
|
|
|
|
### `ImportedClaim`
|
|
|
|
```json
|
|
{
|
|
"claim_id": "clm_001",
|
|
"import_id": "imp_2026_04_16_a",
|
|
"claim_text": "Channel capacity is the maximum reliable communication rate for a channel model.",
|
|
"claim_kind": "definition",
|
|
"source_observation_ids": ["obs_001"],
|
|
"supporting_fragment_ids": ["frag_014"],
|
|
"concept_ids": ["concept::channel-capacity"],
|
|
"confidence_hint": 0.74,
|
|
"grounding_status": "grounded",
|
|
"current_status": "triaged"
|
|
}
|
|
```
|
|
|
|
### `ImportedConcept`
|
|
|
|
```json
|
|
{
|
|
"concept_id": "concept::channel-capacity",
|
|
"import_id": "imp_2026_04_16_a",
|
|
"title": "Channel Capacity",
|
|
"aliases": [],
|
|
"description": "Imported concept from llmwiki corpus.",
|
|
"source_artifact_ids": ["ia_001"],
|
|
"current_status": "triaged"
|
|
}
|
|
```
|
|
|
|
### `ImportedRelation`
|
|
|
|
```json
|
|
{
|
|
"relation_id": "rel_001",
|
|
"import_id": "imp_2026_04_16_a",
|
|
"source_id": "concept::shannon-entropy",
|
|
"target_id": "concept::channel-capacity",
|
|
"relation_type": "supports_understanding_of",
|
|
"evidence_ids": ["obs_015"],
|
|
"current_status": "draft"
|
|
}
|
|
```
|
|
|
|
## Mapping from `llmwiki` into GroundRecall
|
|
|
|
Recommended first-pass mapping:
|
|
|
|
- `raw/*` -> `Source` or `Artifact(kind=raw_note)`
|
|
- `wiki/*.md` -> `Artifact(kind=compiled_page)`
|
|
- frontmatter -> artifact metadata
|
|
- headings -> section boundaries
|
|
- linked page names -> candidate `Concept` and `Relation`
|
|
- bullet or sentence extraction -> candidate `Observation` and `Claim`
|
|
- chat or session logs -> `Observation(kind=session_note)`
|
|
- schema files -> import metadata only unless a future adapter exists
|
|
|
|
## Confidence and trust policy
|
|
|
|
Imported confidence must remain clearly separate from reviewed confidence.
|
|
|
|
Recommended fields:
|
|
|
|
- `confidence_hint`
|
|
- `review_confidence`
|
|
- `grounding_status`
|
|
- `review_verdict`
|
|
|
|
Policy:
|
|
|
|
- `confidence_hint` comes from heuristic import scoring
|
|
- `review_confidence` exists only after review
|
|
- promotion requires at least `partially_grounded`
|
|
- fully ungrounded claims can be stored, but only as `draft` or `archived`
|
|
|
|
## Provenance policy
|
|
|
|
The importer should follow the existing Didactopus provenance direction:
|
|
|
|
- preserve source identity
|
|
- preserve retrieval date when available
|
|
- preserve adaptation status
|
|
- keep both human-readable and machine-readable provenance
|
|
|
|
When only a compiled wiki page exists and the original source is missing:
|
|
|
|
- the compiled page becomes the immediate origin artifact
|
|
- all extracted claims must be marked `derived_from_page`
|
|
- such claims should not auto-promote in `grounded` mode
|
|
|
|
## Review and promotion integration
|
|
|
|
Imported `Claim` and `Concept` objects should feed into the same general review
|
|
machinery already used for pack-oriented promotion:
|
|
|
|
- create candidate records
|
|
- attach lint findings
|
|
- route to a triage lane
|
|
- collect review verdicts
|
|
- emit promotion records
|
|
|
|
Suggested triage lanes:
|
|
|
|
- `knowledge_capture`
|
|
- `pack_improvement`
|
|
- `skill_export`
|
|
- `source_cleanup`
|
|
- `conflict_resolution`
|
|
|
|
## Module layout
|
|
|
|
First-pass module layout:
|
|
|
|
- `didactopus.groundrecall_import`
|
|
Entry points and top-level orchestration.
|
|
- `didactopus.groundrecall_discovery`
|
|
Finds `llmwiki`-style files and classifies paths.
|
|
- `didactopus.groundrecall_segmenter`
|
|
Splits pages and logs into stable observations and candidate claims.
|
|
- `didactopus.groundrecall_normalizer`
|
|
Emits normalized import objects.
|
|
- `didactopus.groundrecall_lint`
|
|
Import-time lint checks.
|
|
- `didactopus.groundrecall_review_bridge`
|
|
Converts imported objects into review candidates and promotion records.
|
|
- `didactopus.groundrecall_export`
|
|
Renders promoted objects back to wiki, graph, and skill artifacts.
|
|
|
|
## CLI shape
|
|
|
|
Suggested CLI:
|
|
|
|
```bash
|
|
python -m didactopus.groundrecall.cli import /path/to/llmwiki --mode archive
|
|
python -m didactopus.groundrecall.cli import /path/to/llmwiki --mode quick
|
|
python -m didactopus.groundrecall.cli import /path/to/llmwiki --mode grounded
|
|
python -m didactopus.groundrecall.cli lint imports/<import-id>
|
|
python -m didactopus.groundrecall.cli promote imports/<import-id> /path/to/store
|
|
python -m didactopus.groundrecall.cli export /path/to/store exports/groundrecall --concept channel-capacity
|
|
```
|
|
|
|
Compatibility wrappers still exist during migration:
|
|
|
|
```bash
|
|
python -m didactopus.groundrecall_import /path/to/llmwiki --mode grounded
|
|
python -m didactopus.groundrecall_lint imports/<import-id>
|
|
python -m didactopus.groundrecall_export /path/to/store exports/groundrecall --concept channel-capacity
|
|
```
|
|
|
|
## Filesystem layout
|
|
|
|
Suggested repository-local layout:
|
|
|
|
- `imports/<import-id>/manifest.json`
|
|
- `imports/<import-id>/artifacts.jsonl`
|
|
- `imports/<import-id>/observations.jsonl`
|
|
- `imports/<import-id>/claims.jsonl`
|
|
- `imports/<import-id>/concepts.jsonl`
|
|
- `imports/<import-id>/relations.jsonl`
|
|
- `imports/<import-id>/lint_findings.json`
|
|
- `imports/<import-id>/review_queue.json`
|
|
|
|
This keeps imported state auditable and easy to sync across machines.
|
|
|
|
## Multi-machine sync implication
|
|
|
|
For distributed assistant use, imported state should be append-oriented and
|
|
rebuildable.
|
|
|
|
Recommended sync primitives:
|
|
|
|
- import manifests
|
|
- normalized jsonl object streams
|
|
- review records
|
|
- promotion records
|
|
|
|
Non-authoritative derived artifacts:
|
|
|
|
- rendered wiki pages
|
|
- local indexes
|
|
- embeddings
|
|
- cache files
|
|
|
|
This allows multiple machines to contribute import events without making the
|
|
compiled page tree the merge primitive.
|
|
|
|
## First implementation milestones
|
|
|
|
### Milestone 1
|
|
|
|
- discover `raw/` and `wiki/`
|
|
- import artifacts
|
|
- segment markdown by headings
|
|
- emit observations and candidate claims
|
|
- write import manifest and jsonl outputs
|
|
|
|
### Milestone 2
|
|
|
|
- add grounding metadata
|
|
- add lint checks
|
|
- add triage lanes and review queue output
|
|
|
|
### Milestone 3
|
|
|
|
- map promoted claims into assistant-neutral exports plus assistant adapter bundles
|
|
- render compiled wiki views from promoted objects
|
|
- support multi-machine import manifests and merge-safe event storage
|
|
|
|
## Non-goals for the first pass
|
|
|
|
- perfect semantic claim extraction
|
|
- automatic trust assignment
|
|
- full upstream `llmwiki` schema compatibility
|
|
- lossless import of every custom plugin or script
|
|
- embeddings-first retrieval
|
|
|
|
The first pass should be conservative, inspectable, and easy to improve.
|