
# GroundRecall `llmwiki` Import Specification

This document defines the first-pass import path for users who already have an
`llmwiki`-style repository and want to migrate it into the broader GroundRecall
substrate while staying compatible with Didactopus review and promotion flows.

## Goal

The import path should let an existing `llmwiki` corpus become:

- searchable without immediate manual cleanup
- reviewable rather than blindly trusted
- grounded in explicit provenance
- promotable into durable structured knowledge objects
- exportable back into compiled wiki pages, assistant adapter bundles, and
  queryable graph artifacts

The key rule is:

Imported wiki pages are **derived artifacts**, not automatically a source of truth.

## Import philosophy

Users coming from `llmwiki` often have a mixture of:

- raw notes
- compiled markdown pages
- local source files
- generated summaries
- ad hoc link graphs
- session transcripts
- speculative or weakly-supported synthesis

GroundRecall should preserve that work without pretending all of it is already
promoted knowledge.

The import pipeline therefore has two responsibilities:

1. Preserve the original material with minimal loss.
2. Reify explicit structured objects that can later be reviewed and promoted.

## Scope of the first implementation

The first implementation should support common `llmwiki` layouts such as:

- `raw/`
- `wiki/`
- `schema.*`
- `logs/`
- `sources/`
- top-level markdown pages

The importer should not require a canonical upstream schema. It should operate
from directory conventions plus simple heuristics.
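
As a minimal sketch of what "directory conventions plus simple heuristics" could
look like, the function below guesses an artifact kind from a path relative to
the `llmwiki` root. The function name `classify_path` and the kind labels
`session_log`, `schema`, and `unknown` are assumptions made for illustration;
only `raw_note` and `compiled_page` appear elsewhere in this document.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from top-level directory to artifact kind.
DIR_KINDS = {
    "raw": "raw_note",
    "wiki": "compiled_page",
    "logs": "session_log",
    "sources": "source_document",
}


def classify_path(rel_path: str) -> str:
    """Guess an artifact kind from a path relative to the llmwiki root."""
    path = PurePosixPath(rel_path)
    if len(path.parts) == 1:
        # Top-level files: schema.* or loose markdown pages.
        if path.name.startswith("schema."):
            return "schema"
        if path.suffix.lower() == ".md":
            return "compiled_page"
        return "unknown"
    # Anything inside a directory is classified by its top-level directory.
    return DIR_KINDS.get(path.parts[0], "unknown")
```

Unrecognized paths fall through to `unknown` rather than being dropped, so the
capture stage can still record them as artifacts.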

## Import modes

### 1. `archive`

Purpose:
- preserve an existing `llmwiki` tree as read-only imported artifacts
- index it for search and later review

Behavior:
- no claim promotion
- minimal extraction
- all compiled pages remain `draft`

Use when:
- the user wants backward compatibility first
- the corpus quality is unknown

### 2. `quick`

Purpose:
- bootstrap usable structured objects fast

Behavior:
- import pages and raw sources
- extract candidate claims and concepts heuristically
- attach lightweight provenance
- queue uncertain items for review

Use when:
- the user wants early utility and accepts heuristic noise

### 3. `grounded`

Purpose:
- perform a migration suitable for long-lived shared knowledge

Behavior:
- require provenance for promoted claims
- mark unsupported statements explicitly
- produce review records and lint findings
- populate promotion queues rather than auto-promoting

Use when:
- the imported corpus will be shared across machines or agents

## Pipeline stages

### 1. Capture

The importer records the source repository as an import artifact.

Required metadata:

- `import_id`
- `import_mode`
- `source_root`
- `imported_at`
- `machine_id`
- `agent_id`
- `source_repo_kind` (set to `llmwiki`)

Outputs:

- import manifest
- artifact records for all discovered files
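
The required metadata above could be assembled roughly as follows. This is a
sketch, not the importer's actual code: `build_manifest` is a hypothetical
helper, the `import_id` format (date plus a random suffix) is an assumption, and
falling back to the hostname for `machine_id` is only one possible choice.

```python
import socket
import uuid
from datetime import datetime, timezone
from typing import Optional

IMPORT_MODES = {"archive", "quick", "grounded"}


def build_manifest(source_root: str, import_mode: str, agent_id: str,
                   machine_id: Optional[str] = None) -> dict:
    """Build the capture-stage manifest with the required metadata fields."""
    if import_mode not in IMPORT_MODES:
        raise ValueError(f"unknown import mode: {import_mode!r}")
    now = datetime.now(timezone.utc)
    return {
        # Assumed id format: date plus a short random suffix.
        "import_id": f"imp_{now:%Y_%m_%d}_{uuid.uuid4().hex[:8]}",
        "import_mode": import_mode,
        "source_root": source_root,
        "imported_at": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        # Assumption: use the hostname when no machine id is supplied.
        "machine_id": machine_id or socket.gethostname(),
        "agent_id": agent_id,
        "source_repo_kind": "llmwiki",
    }
```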

### 2. Segment

Imported content is split into stable units.

Primary segment types:

- `source_document`
- `source_fragment`
- `compiled_page`
- `section_summary`
- `candidate_claim`
- `candidate_concept`
- `candidate_relation`
- `session_observation`

Segmentation should preserve:

- original path
- section heading
- line or byte offsets when possible
- page title
- frontmatter fields
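
Heading-based segmentation that keeps path, section heading, and line offsets
might look like the sketch below. It is deliberately naive: it recognizes only
ATX-style (`#`) headings, does not parse frontmatter, and would mis-split on `#`
lines inside fenced code blocks. The name `segment_by_headings` and the output
dictionary shape are illustrative assumptions.

```python
import re
from typing import List, Optional

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*?)\s*$")


def segment_by_headings(text: str, origin_path: str) -> List[dict]:
    """Split markdown into one segment per heading, keeping 1-based line offsets."""
    segments: List[dict] = []

    def close(section: Optional[str], line_start: int, line_end: int,
              body: List[str]) -> None:
        # Skip segments whose body is empty or whitespace only.
        if any(line.strip() for line in body):
            segments.append({
                "origin_path": origin_path,
                "origin_section": section,
                "line_start": line_start,
                "line_end": line_end,
                "text": "\n".join(body).strip(),
            })

    lines = text.splitlines()
    section, start, body = None, 1, []
    for lineno, line in enumerate(lines, start=1):
        match = HEADING_RE.match(line)
        if match:
            close(section, start, lineno - 1, body)
            # A new segment starts at the heading line itself.
            section, start, body = match.group(2), lineno, []
        else:
            body.append(line)
    close(section, start, len(lines), body)
    return segments
```

Text that appears before the first heading is kept as a segment with
`origin_section` set to `None`, so no content is silently lost.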

### 3. Classify

Each segment gets a semantic role.

Recommended roles:

- `source`
- `derivation`
- `claim`
- `summary`
- `question`
- `todo`
- `speculation`
- `obsolete`
- `transcript`

This prevents unsupported prose from being confused with grounded knowledge.
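
A first-pass classifier could be a small ordered rule list. The patterns below
are purely illustrative guesses, not rules defined by this specification, and
the artifact kinds passed in (`session_log`, `raw_note`, `source_document`) are
assumed labels. The conservative default of `summary` for compiled-page prose is
also an assumption, chosen so that nothing is labelled `claim` by accident.

```python
import re

# Ordered (role, pattern) pairs; the first match wins. Patterns are illustrative.
ROLE_RULES = [
    ("todo", re.compile(r"^\s*(?:[-*]\s*)?(?:TODO\b|\[ \])", re.IGNORECASE)),
    ("question", re.compile(r"\?\s*$")),
    ("obsolete", re.compile(r"\b(?:deprecated|obsolete|superseded)\b", re.IGNORECASE)),
    ("speculation", re.compile(r"\b(?:maybe|perhaps|might|possibly)\b", re.IGNORECASE)),
]


def classify_role(text: str, artifact_kind: str) -> str:
    """Assign one of the recommended roles to a segment using simple heuristics."""
    if artifact_kind == "session_log":
        return "transcript"
    if artifact_kind in ("raw_note", "source_document"):
        return "source"
    for role, pattern in ROLE_RULES:
        if pattern.search(text):
            return role
    # Conservative default: treat unmatched page prose as a summary, not a claim.
    return "summary"
```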

### 4. Ground

Each imported segment gets provenance and support metadata.

Required grounding fields:

- `origin_artifact_id`
- `origin_path`
- `origin_section`
- `source_url` when known
- `retrieval_date` when known
- `machine_id`
- `session_id` when known
- `support_kind`
- `grounding_status`

Suggested values:

- `support_kind`: `direct_source`, `derived_from_page`, `derived_from_session`,
  `inferred`, `unknown`
- `grounding_status`: `grounded`, `partially_grounded`, `ungrounded`
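
One way to hold these fields is a small dataclass that rejects values outside
the suggested vocabularies. The class name `Grounding` and the defaults of
`unknown` and `ungrounded` are assumptions for this sketch; the field names and
allowed values come from the lists above.

```python
from dataclasses import dataclass
from typing import Optional

SUPPORT_KINDS = {
    "direct_source", "derived_from_page", "derived_from_session",
    "inferred", "unknown",
}
GROUNDING_STATUSES = {"grounded", "partially_grounded", "ungrounded"}


@dataclass
class Grounding:
    origin_artifact_id: str
    origin_path: str
    origin_section: Optional[str]
    machine_id: str
    # Assumed defaults: the most conservative values.
    support_kind: str = "unknown"
    grounding_status: str = "ungrounded"
    # "When known" fields stay None if the importer cannot determine them.
    source_url: Optional[str] = None
    retrieval_date: Optional[str] = None
    session_id: Optional[str] = None

    def __post_init__(self) -> None:
        if self.support_kind not in SUPPORT_KINDS:
            raise ValueError(f"invalid support_kind: {self.support_kind!r}")
        if self.grounding_status not in GROUNDING_STATUSES:
            raise ValueError(f"invalid grounding_status: {self.grounding_status!r}")
```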

### 5. Normalize

The importer emits explicit GroundRecall objects.

Minimum object set:

- `Source`
- `Fragment`
- `Artifact`
- `Observation`
- `Claim`
- `Concept`
- `Relation`

### 6. Lint

The importer produces machine-readable findings before promotion.

Required lint checks:

- claim has no supporting fragment
- multiple claims have identical text
- concept is orphaned
- relation points to missing concept
- page summary has no cited support
- imported item marked `obsolete` is still linked as current
- same claim is imported with conflicting confidence or polarity
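
Four of these checks can be sketched over plain dictionaries shaped like the
object contracts in this document. The finding codes (`claim_without_support`
and so on) and the function name `lint_import` are hypothetical, and treating
claims as identical after lowercasing and whitespace normalization is an
assumption about what "identical text" should mean.

```python
from collections import defaultdict
from typing import List


def lint_import(claims: List[dict], concepts: List[dict],
                relations: List[dict]) -> List[dict]:
    """Return machine-readable findings for a subset of the required checks."""
    findings: List[dict] = []
    concept_ids = {c["concept_id"] for c in concepts}

    # Check: claim has no supporting fragment.
    for claim in claims:
        if not claim.get("supporting_fragment_ids"):
            findings.append({"code": "claim_without_support",
                             "object_id": claim["claim_id"]})

    # Check: multiple claims share the same normalized text.
    by_text = defaultdict(list)
    for claim in claims:
        by_text[" ".join(claim["claim_text"].lower().split())].append(claim["claim_id"])
    for ids in by_text.values():
        if len(ids) > 1:
            findings.append({"code": "duplicate_claim_text",
                             "object_id": ids[0], "related_ids": ids[1:]})

    # Check: relation points to a missing concept.
    referenced = set()
    for claim in claims:
        referenced.update(claim.get("concept_ids", []))
    for rel in relations:
        for end in ("source_id", "target_id"):
            referenced.add(rel[end])
            if rel[end] not in concept_ids:
                findings.append({"code": "relation_missing_concept",
                                 "object_id": rel["relation_id"],
                                 "missing_id": rel[end]})

    # Check: concept is orphaned (no claim or relation refers to it).
    for concept_id in sorted(concept_ids - referenced):
        findings.append({"code": "orphaned_concept", "object_id": concept_id})
    return findings
```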

### 7. Promote

Imported objects enter existing Didactopus review/promotion lanes rather than
becoming trusted immediately.

Recommended states:

- `draft`
- `triaged`
- `reviewed`
- `promoted`
- `superseded`
- `archived`
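
This specification lists the states but not the legal moves between them. The
sketch below assumes a simple forward-only lifecycle in which any state can also
be archived; the transition table and the `advance` helper are assumptions, not
part of the Didactopus review machinery.

```python
# Assumed transition table: forward-only, with archiving allowed from any state.
ALLOWED_TRANSITIONS = {
    "draft": {"triaged", "archived"},
    "triaged": {"reviewed", "archived"},
    "reviewed": {"promoted", "archived"},
    "promoted": {"superseded", "archived"},
    "superseded": {"archived"},
    "archived": set(),
}


def advance(obj: dict, new_status: str) -> dict:
    """Return a copy of obj in new_status, or raise if the move is not allowed."""
    current = obj["current_status"]
    if new_status not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"cannot move from {current!r} to {new_status!r}")
    updated = dict(obj)
    updated["current_status"] = new_status
    return updated
```

Returning a copy rather than mutating in place fits the append-oriented storage
model described later: each state change can be written as a new record.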

### 8. Export

Promoted objects can then be rendered back out as:

- compiled wiki pages
- graph snapshots
- assistant adapter bundles
- review reports
- query bundles for assistant-facing use

## Object contracts

### `ImportedArtifact`

```json
{
  "artifact_id": "ia_001",
  "import_id": "imp_2026_04_16_a",
  "artifact_kind": "compiled_page",
  "path": "wiki/channel-capacity.md",
  "title": "Channel Capacity",
  "sha256": "abc123",
  "created_at": "2026-04-16T14:00:00Z",
  "metadata": {
    "frontmatter": {},
    "headings": ["Definition", "Examples"]
  },
  "current_status": "draft"
}
```

### `ImportedObservation`

```json
{
  "observation_id": "obs_001",
  "import_id": "imp_2026_04_16_a",
  "artifact_id": "ia_001",
  "role": "summary",
  "text": "Capacity bounds reliable communication over a noisy channel.",
  "origin_path": "wiki/channel-capacity.md",
  "origin_section": "Definition",
  "line_start": 12,
  "line_end": 14,
  "grounding_status": "partially_grounded",
  "support_kind": "derived_from_page",
  "confidence_hint": 0.63,
  "current_status": "draft"
}
```

### `ImportedClaim`

```json
{
  "claim_id": "clm_001",
  "import_id": "imp_2026_04_16_a",
  "claim_text": "Channel capacity is the maximum reliable communication rate for a channel model.",
  "claim_kind": "definition",
  "source_observation_ids": ["obs_001"],
  "supporting_fragment_ids": ["frag_014"],
  "concept_ids": ["concept::channel-capacity"],
  "confidence_hint": 0.74,
  "grounding_status": "grounded",
  "current_status": "triaged"
}
```

### `ImportedConcept`

```json
{
  "concept_id": "concept::channel-capacity",
  "import_id": "imp_2026_04_16_a",
  "title": "Channel Capacity",
  "aliases": [],
  "description": "Imported concept from llmwiki corpus.",
  "source_artifact_ids": ["ia_001"],
  "current_status": "triaged"
}
```

### `ImportedRelation`

```json
{
  "relation_id": "rel_001",
  "import_id": "imp_2026_04_16_a",
  "source_id": "concept::shannon-entropy",
  "target_id": "concept::channel-capacity",
  "relation_type": "supports_understanding_of",
  "evidence_ids": ["obs_015"],
  "current_status": "draft"
}
```

## Mapping from `llmwiki` into GroundRecall

Recommended first-pass mapping:

- `raw/*` -> `Source` or `Artifact(kind=raw_note)`
- `wiki/*.md` -> `Artifact(kind=compiled_page)`
- frontmatter -> artifact metadata
- headings -> section boundaries
- linked page names -> candidate `Concept` and `Relation`
- bullet or sentence extraction -> candidate `Observation` and `Claim`
- chat or session logs -> `Observation(kind=session_note)`
- schema files -> import metadata only unless a future adapter exists
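
The "linked page names" rule could be sketched as below, assuming pages use
`[[Page Name]]` wikilinks (an assumption; some `llmwiki` corpora may use plain
markdown links instead). The `concept::` id prefix follows the object contracts
above, while the `links_to` relation type, the local `rel_NNN` numbering, and
the helper names are hypothetical.

```python
import re
from typing import List, Tuple

# Matches [[Title]], [[Title|label]], and [[Title#section]]; captures Title.
WIKILINK_RE = re.compile(r"\[\[([^\]|#]+)(?:[|#][^\]]*)?\]\]")


def slugify(title: str) -> str:
    """Lowercase a title and join its alphanumeric runs with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def extract_link_candidates(page_title: str, text: str,
                            import_id: str) -> Tuple[List[dict], List[dict]]:
    """Turn wikilinks on a page into candidate concepts and relations."""
    source_id = f"concept::{slugify(page_title)}"
    concepts: List[dict] = []
    relations: List[dict] = []
    seen = set()
    for match in WIKILINK_RE.finditer(text):
        title = match.group(1).strip()
        target_id = f"concept::{slugify(title)}"
        # Ignore self-links and repeated links to the same page.
        if target_id == source_id or target_id in seen:
            continue
        seen.add(target_id)
        concepts.append({"concept_id": target_id, "import_id": import_id,
                         "title": title, "current_status": "draft"})
        relations.append({"relation_id": f"rel_{len(relations) + 1:03d}",
                          "import_id": import_id,
                          "source_id": source_id, "target_id": target_id,
                          "relation_type": "links_to",  # hypothetical type
                          "current_status": "draft"})
    return concepts, relations
```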

## Confidence and trust policy

Imported confidence must remain clearly separate from reviewed confidence.

Recommended fields:

- `confidence_hint`
- `review_confidence`
- `grounding_status`
- `review_verdict`

Policy:

- `confidence_hint` comes from heuristic import scoring
- `review_confidence` exists only after review
- promotion requires a `grounding_status` of at least `partially_grounded`
- claims with `grounding_status` `ungrounded` can be stored, but only in the
  `draft` or `archived` state
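
The policy rules above can be expressed as a few small predicates. This is a
sketch under stated assumptions: the helper names are hypothetical, and ranking
the three `grounding_status` values in the order shown is how this example
interprets "at least".

```python
from typing import Optional, Tuple

# Assumed ordering of grounding_status values, weakest first.
GROUNDING_RANK = {"ungrounded": 0, "partially_grounded": 1, "grounded": 2}


def meets_promotion_floor(obj: dict) -> bool:
    """True if grounding_status is at least partially_grounded."""
    floor = GROUNDING_RANK["partially_grounded"]
    return GROUNDING_RANK[obj["grounding_status"]] >= floor


def status_allowed(obj: dict, status: str) -> bool:
    """Ungrounded objects may only be stored as draft or archived."""
    if obj["grounding_status"] == "ungrounded":
        return status in {"draft", "archived"}
    return True


def effective_confidence(obj: dict) -> Tuple[Optional[float], str]:
    """Return (value, origin) without ever mixing hint and review scores."""
    if obj.get("review_confidence") is not None:
        return obj["review_confidence"], "review"
    return obj.get("confidence_hint"), "hint"
```

Keeping the origin label next to the value makes it harder for downstream code
to present a heuristic `confidence_hint` as if it were a reviewed score.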

## Provenance policy

The importer should follow the existing Didactopus provenance direction:

- preserve source identity
- preserve retrieval date when available
- preserve adaptation status
- keep both human-readable and machine-readable provenance

When only a compiled wiki page exists and the original source is missing:

- the compiled page becomes the immediate origin artifact
- all extracted claims must be marked with `support_kind` set to `derived_from_page`
- such claims should not auto-promote in `grounded` mode

## Review and promotion integration

Imported `Claim` and `Concept` objects should feed into the same general review
machinery already used for pack-oriented promotion:

- create candidate records
- attach lint findings
- route to a triage lane
- collect review verdicts
- emit promotion records

Suggested triage lanes:

- `knowledge_capture`
- `pack_improvement`
- `skill_export`
- `source_cleanup`
- `conflict_resolution`

## Module layout

First-pass module layout:

- `didactopus.groundrecall_import`
  Entry points and top-level orchestration.
- `didactopus.groundrecall_discovery`
  Finds `llmwiki`-style files and classifies paths.
- `didactopus.groundrecall_segmenter`
  Splits pages and logs into stable observations and candidate claims.
- `didactopus.groundrecall_normalizer`
  Emits normalized import objects.
- `didactopus.groundrecall_lint`
  Import-time lint checks.
- `didactopus.groundrecall_review_bridge`
  Converts imported objects into review candidates and promotion records.
- `didactopus.groundrecall_export`
  Renders promoted objects back to wiki, graph, and skill artifacts.

## CLI shape

Suggested CLI:

```bash
python -m didactopus.groundrecall.cli import /path/to/llmwiki --mode archive
python -m didactopus.groundrecall.cli import /path/to/llmwiki --mode quick
python -m didactopus.groundrecall.cli import /path/to/llmwiki --mode grounded
python -m didactopus.groundrecall.cli lint imports/<import-id>
python -m didactopus.groundrecall.cli promote imports/<import-id> /path/to/store
python -m didactopus.groundrecall.cli export /path/to/store exports/groundrecall --concept channel-capacity
```

Compatibility wrappers remain available during migration:

```bash
python -m didactopus.groundrecall_import /path/to/llmwiki --mode grounded
python -m didactopus.groundrecall_lint imports/<import-id>
python -m didactopus.groundrecall_export /path/to/store exports/groundrecall --concept channel-capacity
```

## Filesystem layout

Suggested repository-local layout:

- `imports/<import-id>/manifest.json`
- `imports/<import-id>/artifacts.jsonl`
- `imports/<import-id>/observations.jsonl`
- `imports/<import-id>/claims.jsonl`
- `imports/<import-id>/concepts.jsonl`
- `imports/<import-id>/relations.jsonl`
- `imports/<import-id>/lint_findings.json`
- `imports/<import-id>/review_queue.json`

This keeps imported state auditable and easy to sync across machines.
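
Writing the manifest and the JSONL object streams in this layout could look like
the sketch below. The helper names are hypothetical, and `lint_findings.json`
and `review_queue.json` are left out for brevity. Sorting keys on output is an
assumption made so that identical records serialize identically on every
machine.

```python
import json
from pathlib import Path
from typing import Dict, Iterable, List

STREAM_NAMES = ("artifacts", "observations", "claims", "concepts", "relations")


def write_jsonl(path: Path, records: Iterable[dict]) -> None:
    """Write one JSON object per line, with sorted keys for stable diffs."""
    with path.open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, sort_keys=True) + "\n")


def write_import(root: str, manifest: dict,
                 streams: Dict[str, List[dict]]) -> Path:
    """Write manifest.json and the jsonl streams under imports/<import-id>/."""
    out_dir = Path(root) / "imports" / manifest["import_id"]
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "manifest.json").write_text(
        json.dumps(manifest, indent=2, sort_keys=True), encoding="utf-8")
    for name in STREAM_NAMES:
        # Missing streams are written as empty files to keep the layout uniform.
        write_jsonl(out_dir / f"{name}.jsonl", streams.get(name, []))
    return out_dir
```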

## Multi-machine sync implication

For distributed assistant use, imported state should be append-oriented and
rebuildable.

Recommended sync primitives:

- import manifests
- normalized jsonl object streams
- review records
- promotion records

Non-authoritative derived artifacts:

- rendered wiki pages
- local indexes
- embeddings
- cache files

This allows multiple machines to contribute import events without making the
compiled page tree the merge primitive.
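
One way to make JSONL streams merge-safe is to treat each record as an immutable
event keyed by a hash of its canonical JSON, so that merging is a set union and
does not depend on arrival order. The sketch below illustrates that idea; the
content-hash keying and the helper names are assumptions, not a sync protocol
defined by this document.

```python
import hashlib
import json
from typing import Iterable, List


def record_key(record: dict) -> str:
    """Hash the canonical JSON form so equal records get equal keys."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def merge_streams(*streams: Iterable[dict]) -> List[dict]:
    """Union records from several machines, dropping exact duplicates."""
    merged = {}
    for stream in streams:
        for record in stream:
            merged.setdefault(record_key(record), record)
    # Sort by key so every machine produces the same merged order.
    return [merged[key] for key in sorted(merged)]
```

Because the result depends only on the set of records, two machines that
exchange streams in either order end up with the same merged file, and derived
artifacts such as rendered pages or indexes can be rebuilt from it.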

## First implementation milestones

### Milestone 1

- discover `raw/` and `wiki/`
- import artifacts
- segment markdown by headings
- emit observations and candidate claims
- write import manifest and jsonl outputs

### Milestone 2

- add grounding metadata
- add lint checks
- add triage lanes and review queue output

### Milestone 3

- map promoted claims into assistant-neutral exports plus assistant adapter bundles
- render compiled wiki views from promoted objects
- support multi-machine import manifests and merge-safe event storage

## Non-goals for the first pass

- perfect semantic claim extraction
- automatic trust assignment
- full upstream `llmwiki` schema compatibility
- lossless import of every custom plugin or script
- embeddings-first retrieval

The first pass should be conservative, inspectable, and easy to improve.