Add chunk-backed GroundRecall import artifacts

parent 1668a2b3a8
commit a5efe0cccb

@@ -0,0 +1,464 @@
# AI Knowledge Graph Adoption Plan

This document translates the feature set of
[`robert-mcdermott/ai-knowledge-graph`](https://github.com/robert-mcdermott/ai-knowledge-graph)
into concrete implementation tickets for the current local repositories:

- `GroundRecall`
- `Didactopus`
- `doclift`

The goal is not to copy that repository's data model directly. The useful imports are:

- chunk-aware extraction
- entity standardization
- relation suggestion
- graph inspection and review affordances

The main thing to avoid is treating raw extracted SPO (subject-predicate-object) triples as canonical truth.
## Design Rules

1. Keep canonical storage typed and provenance-first.
2. Treat extracted triples as candidate claims/relations, not promoted facts.
3. Keep LLM extraction optional and reviewable.
4. Keep `doclift` deterministic by default.
5. Put graph extraction in `GroundRecall` first, then expose downstream affordances in `Didactopus`.
## Repo Roles

### GroundRecall

Primary fit for:

- candidate claim extraction
- concept alias normalization
- candidate relation inference
- graph diagnostics
- review queue generation

Key current modules:

- [src/groundrecall/ingest.py](/home/netuser/bin/GroundRecall/src/groundrecall/ingest.py)
- [src/groundrecall/models.py](/home/netuser/bin/GroundRecall/src/groundrecall/models.py)
- [src/groundrecall/source_adapters](/home/netuser/bin/GroundRecall/src/groundrecall/source_adapters)
- [src/groundrecall/groundrecall_source_adapters/doclift_bundle.py](/home/netuser/bin/GroundRecall/src/groundrecall/groundrecall_source_adapters/doclift_bundle.py)
- [src/groundrecall/review_export.py](/home/netuser/bin/GroundRecall/src/groundrecall/review_export.py)
### Didactopus

Primary fit for:

- graph workbench visualization
- concept merge/split suggestions
- graph-aware review overlays
- learner-facing graph inspection built on grounded artifacts

Key current modules:

- [src/didactopus/knowledge_graph.py](/home/netuser/bin/Didactopus/src/didactopus/knowledge_graph.py)
- [src/didactopus/graph_builder.py](/home/netuser/bin/Didactopus/src/didactopus/graph_builder.py)
- [src/didactopus/graph_retrieval.py](/home/netuser/bin/Didactopus/src/didactopus/graph_retrieval.py)
- [src/didactopus/learner_workbench.py](/home/netuser/bin/Didactopus/src/didactopus/learner_workbench.py)
- [src/didactopus/review_export.py](/home/netuser/bin/Didactopus/src/didactopus/review_export.py)
- [src/didactopus/main.py](/home/netuser/bin/Didactopus/src/didactopus/main.py)
### doclift

Primary fit for:

- deterministic chunk metadata
- optional extraction-friendly sidecars
- optional graph preview artifacts

Key current modules:

- [src/doclift/convert.py](/home/netuser/bin/doclift/src/doclift/convert.py)
- [src/doclift/schemas.py](/home/netuser/bin/doclift/src/doclift/schemas.py)
- [src/doclift/cli.py](/home/netuser/bin/doclift/src/doclift/cli.py)
## Phase 1: GroundRecall Candidate Graph Import

### Ticket GR-1: Add chunk-aware candidate extraction layer

Outcome:

- ingest text artifacts into stable chunks
- extract candidate observations/claims/concepts/relations per chunk
- write reviewable import artifacts

Suggested implementation:

- add `src/groundrecall/candidate_graph.py`
- add `src/groundrecall/extraction_chunks.py`

Responsibilities:

- split long text into bounded chunks with overlap
- assign stable `chunk_id`
- keep chunk-to-artifact provenance
- emit candidate records with `support_kind="derived_from_page"` or `support_kind="inferred"`

CLI:

- extend `groundrecall import` with:
  - `--extract-graph`
  - `--chunk-size`
  - `--chunk-overlap`
  - `--extractor none|heuristic|llm`

Acceptance criteria:

- import still works without graph extraction
- import artifacts include chunk-backed candidate claims and relations when enabled
- all extracted candidates preserve artifact and chunk provenance
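The chunking responsibilities above can be sketched as a small deterministic splitter. This is a minimal sketch, not the final `extraction_chunks.py` API: the function name `split_into_chunks` and the exact record fields are assumptions, though the fields mirror the `chunk_id`/`line_start`/`line_end`/`text` shape the plan proposes.

```python
import hashlib


def split_into_chunks(text: str, chunk_size: int = 800, overlap: int = 100) -> list[dict]:
    # Hypothetical GR-1 sketch: bounded overlapping windows with stable ids.
    if not 0 <= overlap < chunk_size:
        raise ValueError("require 0 <= overlap < chunk_size")
    chunks: list[dict] = []
    step = chunk_size - overlap
    for start in range(0, max(len(text), 1), step):
        body = text[start:start + chunk_size]
        if not body.strip():
            continue
        # Stable chunk_id: offset plus content hash, so re-running over
        # identical input yields byte-identical provenance records.
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()[:12]
        chunks.append(
            {
                "chunk_id": f"chunk_{start}_{digest}",
                "line_start": text.count("\n", 0, start) + 1,
                "line_end": text.count("\n", 0, start + len(body)) + 1,
                "text": body,
            }
        )
    return chunks
```

Because the id is derived only from offset and content, repeat runs satisfy the determinism acceptance criterion without storing any run-time state.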
### Ticket GR-2: Add deterministic entity/concept standardization

Outcome:

- alias clusters for near-duplicate concepts before review

Suggested implementation:

- add `src/groundrecall/entity_standardization.py`

Responsibilities:

- normalize punctuation/case
- trim stopwords conservatively
- group obvious aliases
- emit alias-cluster review candidates when confidence is too low for a direct merge

Data shape:

- enrich `ConceptRecord.aliases`
- optionally emit a new review payload section such as `alias_clusters`

Acceptance criteria:

- obvious duplicates such as minor punctuation/case variants collapse deterministically
- ambiguous clusters remain reviewable rather than auto-merged
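The normalization step can be sketched as follows, in the same spirit as the `_normalize_concept_title` helper this commit adds to the normalizer. The function names `normalize_title` and `group_aliases` are hypothetical; the stopword set and first-seen-wins canonical choice are illustrative assumptions.

```python
def normalize_title(value: str) -> str:
    # Lowercase, replace punctuation with spaces, drop a conservative stopword set.
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in value)
    tokens = [t for t in cleaned.split() if t not in {"a", "an", "the"}]
    return " ".join(tokens)


def group_aliases(titles: list[str]) -> dict[str, list[str]]:
    # Group titles whose normalized forms collide; the first title seen for a
    # normalized key is treated as canonical, later collisions become aliases.
    clusters: dict[str, list[str]] = {}
    canonical_by_key: dict[str, str] = {}
    for title in titles:
        key = normalize_title(title)
        canonical = canonical_by_key.setdefault(key, title)
        clusters.setdefault(canonical, [])
        if title != canonical:
            clusters[canonical].append(title)
    return clusters
```

Everything here is pure string manipulation, which is what makes the "collapse deterministically" acceptance criterion cheap to test.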
### Ticket GR-3: Add inferred relation candidates

Outcome:

- lexical and structural hints become review queue items

Suggested implementation:

- add `src/groundrecall/relation_inference.py`

Inference types:

- lexical co-occurrence hints
- transitive prerequisite/support hints
- repeated same-source concept pair hints

Important restriction:

- inferred relations stay `draft` or `triaged`
- they are never silently promoted to canonical relations

Acceptance criteria:

- inferred relations appear in import artifacts with explicit provenance
- the review queue distinguishes grounded vs inferred edges
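The co-occurrence hint type can be sketched like this, under the assumption that per-chunk concept lists are already available. The function and field names (`relation_kind`, `evidence`) are illustrative, but the `support_kind="inferred"` / `current_status="draft"` values follow the restriction stated above.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_relation_candidates(chunk_concepts: list[list[str]], min_count: int = 2) -> list[dict]:
    # Count how often each unordered concept pair appears in the same chunk.
    pair_counts = Counter()
    for concepts in chunk_concepts:
        for a, b in combinations(sorted(set(concepts)), 2):
            pair_counts[(a, b)] += 1
    # Emit only draft candidates with explicit inference provenance; nothing
    # here is ever a promoted canonical relation.
    return [
        {
            "source_id": a,
            "target_id": b,
            "relation_kind": "related_to",
            "support_kind": "inferred",
            "current_status": "draft",
            "evidence": {"cooccurrence_count": count},
        }
        for (a, b), count in sorted(pair_counts.items())
        if count >= min_count
    ]
```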
### Ticket GR-4: Add graph diagnostics and inspector output

Outcome:

- maintainers can inspect graph shape before promotion

Suggested implementation:

- add `src/groundrecall/graph_diagnostics.py`
- extend [inspect.py](/home/netuser/bin/GroundRecall/src/groundrecall/inspect.py)

Diagnostics:

- disconnected components
- orphan concepts
- claims with no strong support
- bridge concepts
- dense noisy clusters

CLI:

- `groundrecall inspect ... --graph`
- `groundrecall export ... --include-graph-diagnostics`

Acceptance criteria:

- graph diagnostics appear in machine-readable JSON
- review operators can identify noisy imports quickly
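The first two diagnostics (disconnected components, orphan concepts) can be computed with a small union-find over undirected edges, no graph library needed. A minimal sketch, assuming edge endpoints all appear in `concept_ids`; the output keys are illustrative, not the final JSON schema.

```python
def graph_diagnostics(concept_ids: list[str], edges: list[tuple[str, str]]) -> dict:
    # Union-find: every concept starts as its own component root.
    parent = {cid: cid for cid in concept_ids}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    degree = {cid: 0 for cid in concept_ids}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
        parent[find(a)] = find(b)  # union the two components

    components: dict[str, list[str]] = {}
    for cid in concept_ids:
        components.setdefault(find(cid), []).append(cid)

    return {
        "component_count": len(components),
        "orphan_concepts": sorted(cid for cid, d in degree.items() if d == 0),
        "largest_component_size": max(map(len, components.values()), default=0),
    }
```

The return value is already plain JSON-serializable data, matching the machine-readable acceptance criterion.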
### Ticket GR-5: Add review export support for candidate graph artifacts

Outcome:

- current review flows can consume extracted graph candidates

Suggested implementation:

- extend [review_export.py](/home/netuser/bin/GroundRecall/src/groundrecall/review_export.py)
- extend review app payloads under [review_app](/home/netuser/bin/GroundRecall/src/groundrecall/review_app)

UI payload features:

- candidate relation cards
- alias-cluster cards
- chunk evidence preview
- inferred/grounded badges

Acceptance criteria:

- review bundle includes graph-candidate triage data
- no assistant-specific assumptions leak into canonical records
## Phase 2: Didactopus Graph Review And Workbench Improvements

### Ticket DT-1: Add review-oriented graph overlays

Outcome:

- graph visualizations expose quality problems, not just structure

Suggested implementation:

- extend [knowledge_graph.py](/home/netuser/bin/Didactopus/src/didactopus/knowledge_graph.py)
- extend [graph_retrieval.py](/home/netuser/bin/Didactopus/src/didactopus/graph_retrieval.py)

Overlay ideas:

- edge grounding status
- concept confidence/review status
- weakly grounded concept markers
- disconnected concept islands

Acceptance criteria:

- exported graph JSON can distinguish grounded, heuristic, and inferred links
- downstream visual layers can highlight fragile concepts
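The edge-grounding overlay can be sketched as a pure tagging pass over exported edges. The mapping below reuses the `support_kind` values that appear elsewhere in this plan (`direct_source`, `derived_from_page`, `inferred`); the function name and `overlay_status` field are hypothetical.

```python
def annotate_edges_for_review(edges: list[dict]) -> list[dict]:
    # Attach a coarse overlay status so visual layers can color edges
    # without re-deriving grounding from raw provenance records.
    annotated = []
    for edge in edges:
        support = edge.get("support_kind", "inferred")
        if support == "direct_source":
            overlay = "grounded"
        elif support == "derived_from_page":
            overlay = "heuristic"
        else:
            overlay = "inferred"
        annotated.append({**edge, "overlay_status": overlay})
    return annotated
```

Missing or unknown `support_kind` values degrade to `"inferred"`, the most conservative status for review purposes.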
### Ticket DT-2: Add concept consolidation suggestions

Outcome:

- reviewers get merge/split suggestions based on graph and text structure

Suggested implementation:

- extend [graph_builder.py](/home/netuser/bin/Didactopus/src/didactopus/graph_builder.py)
- extend [review_export.py](/home/netuser/bin/Didactopus/src/didactopus/review_export.py)

Input signals:

- title similarity
- shared source lessons
- overlapping prerequisite neighborhoods
- overlapping mastery signals

Acceptance criteria:

- review exports include merge suggestions
- suggested merges remain proposals, not automatic edits
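The title-similarity signal can be sketched with the standard library's `difflib.SequenceMatcher`. The function name, the 0.85 threshold, and the output fields are assumptions; the important invariant from the acceptance criteria is that the output is a `"proposal"`, never an applied edit.

```python
from difflib import SequenceMatcher


def merge_suggestions(concepts: list[dict], threshold: float = 0.85) -> list[dict]:
    # Propose (never apply) merges for concept pairs with similar titles.
    suggestions = []
    for i, left in enumerate(concepts):
        for right in concepts[i + 1:]:
            score = SequenceMatcher(None, left["title"].lower(), right["title"].lower()).ratio()
            if score >= threshold:
                suggestions.append(
                    {
                        "keep_id": left["concept_id"],
                        "merge_id": right["concept_id"],
                        "signal": "title_similarity",
                        "score": round(score, 3),
                        "status": "proposal",
                    }
                )
    return suggestions
```

The pairwise loop is quadratic; for real catalogs a blocking key (for example the GR-2 normalized title) would prune the candidate pairs first.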
### Ticket DT-3: Add learner-workbench graph inspection modes

Outcome:

- learners and reviewers can inspect why concepts exist and how they connect

Suggested implementation:

- extend [learner_workbench.py](/home/netuser/bin/Didactopus/src/didactopus/learner_workbench.py)
- extend the backend route in [api.py](/home/netuser/bin/Didactopus/src/didactopus/api.py)

Views:

- concept neighborhood
- source-fragment grounding trail
- alternate supporting lessons
- fragile or noisy concept warnings

Acceptance criteria:

- workbench can show source-grounded concept neighborhoods
- concept provenance is inspectable without raw JSON digging
### Ticket DT-4: Add graph diagnostics to `doclift-bundle` pack generation

Outcome:

- `doclift -> Didactopus` imports surface noisy graph structure early

Suggested implementation:

- extend [doclift_bundle_demo.py](/home/netuser/bin/Didactopus/src/didactopus/doclift_bundle_demo.py)
- extend the [main.py](/home/netuser/bin/Didactopus/src/didactopus/main.py) `doclift-bundle` command

Artifacts:

- `graph_diagnostics.json`
- `concept_merge_suggestions.json`

Acceptance criteria:

- importing a `doclift` bundle produces diagnostics alongside `knowledge_graph.json`
- the review workflow can consume those diagnostics
## Phase 3: doclift Optional Extraction-Friendly Sidecars

### Ticket DL-1: Emit stable chunk metadata

Outcome:

- downstream systems can import `doclift` bundles without re-segmenting blindly

Suggested implementation:

- extend [schemas.py](/home/netuser/bin/doclift/src/doclift/schemas.py)
- extend [convert.py](/home/netuser/bin/doclift/src/doclift/convert.py)

Artifacts:

- `document.chunks.json`

Fields:

- `chunk_id`
- `line_start`
- `line_end`
- `section_labels`
- `text`

Acceptance criteria:

- bundle remains valid without downstream AI extraction
- chunk metadata is deterministic across repeat runs
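Determinism across repeat runs mostly comes down to how the sidecar is serialized. A minimal sketch of a writer (the function name is hypothetical; the `document.chunks.json` filename and `chunks` top-level key match the fixture this commit adds): stable key order, no timestamps, explicit encoding.

```python
import json
from pathlib import Path


def write_chunks_sidecar(output_dir: Path, chunks: list[dict]) -> Path:
    # Deterministic serialization: sorted keys, fixed indent, no run-time
    # metadata, so converting the same input twice is byte-identical.
    payload = {"chunks": chunks}
    path = output_dir / "document.chunks.json"
    path.write_text(json.dumps(payload, indent=2, sort_keys=True) + "\n", encoding="utf-8")
    return path
```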
### Ticket DL-2: Add optional graph-preview sidecars

Outcome:

- operators can inspect likely extracted structure at the bundle stage

Suggested implementation:

- add an optional post-processing module such as `src/doclift/graph_preview.py`

Artifacts:

- `document.entities.json`
- `document.relations.json`
- optional `bundle_graph_preview.json`

CLI:

- extend `doclift convert`
- extend `doclift convert-dir`
- flags:
  - `--graph-preview`
  - `--graph-preview-mode heuristic|llm`

Important restriction:

- these are preview/debug artifacts only
- they are not the bundle's canonical semantics

Acceptance criteria:

- graph preview can be disabled entirely
- default conversion remains deterministic and lightweight
### Ticket DL-3: Add HTML inspection output for graph previews

Outcome:

- maintainers can inspect extracted structure before import

Suggested implementation:

- add `doclift preview-graph /path/to/bundle`

Acceptance criteria:

- preview HTML references chunk ids and source lines
- graph preview is visibly separate from conversion success reporting
## Cross-Repo Integration Tickets

### Ticket X-1: `doclift -> GroundRecall` candidate-graph import path

Outcome:

- `GroundRecall` can consume `doclift` chunk metadata directly

Modules:

- `doclift` emits `document.chunks.json`
- the `GroundRecall` `doclift_bundle` adapter imports it

Acceptance criteria:

- `groundrecall import /path/to/doclift-bundle --extract-graph` succeeds
- it uses `doclift` chunk ids instead of re-splitting markdown where available
### Ticket X-2: Shared graph diagnostics vocabulary

Outcome:

- the three repos use compatible terminology for quality signals

Suggested shared diagnostic keys:

- `orphan_concept`
- `weak_grounding`
- `inferred_relation`
- `alias_cluster`
- `disconnected_component`
- `bridge_concept`
- `high_fanout_noisy_concept`

Acceptance criteria:

- review and export layers can exchange diagnostics without brittle custom mapping
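One way to keep the vocabulary compatible across repos is a tiny shared enum whose values are exactly the key strings above, so JSON payloads stay identical everywhere. The class and helper names are a sketch; only the string values come from the plan.

```python
from enum import Enum


class DiagnosticKey(str, Enum):
    # Values match the shared key list so serialized payloads are portable.
    ORPHAN_CONCEPT = "orphan_concept"
    WEAK_GROUNDING = "weak_grounding"
    INFERRED_RELATION = "inferred_relation"
    ALIAS_CLUSTER = "alias_cluster"
    DISCONNECTED_COMPONENT = "disconnected_component"
    BRIDGE_CONCEPT = "bridge_concept"
    HIGH_FANOUT_NOISY_CONCEPT = "high_fanout_noisy_concept"


def is_known_diagnostic(value: str) -> bool:
    # Validation hook for import/export layers exchanging diagnostics.
    return value in {key.value for key in DiagnosticKey}
```

Inheriting from `str` means the enum members serialize as plain strings in JSON, which avoids the brittle custom mapping the acceptance criterion warns about.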
## Recommended Build Order

1. `GR-1`
2. `GR-2`
3. `GR-3`
4. `GR-4`
5. `X-1`
6. `DT-1`
7. `DT-2`
8. `DL-1`
9. `DL-2`
10. `DT-4`

## Non-Goals

- replacing `GroundRecall` canonical models with freeform triples
- forcing LLM extraction into `doclift` core conversion
- auto-promoting inferred relations
- making `Didactopus` depend on a graph preview layer to ingest ordinary packs
## Immediate Next Step

If only one milestone is funded first, build:

- `GR-1`
- `GR-2`
- `X-1`

That gives the highest-leverage path:

- `doclift` stays deterministic
- `GroundRecall` gains useful graph-candidate import
- `Didactopus` can later consume cleaner grounded artifacts without architectural churn
@@ -71,12 +71,35 @@ def build_observation_record(
     }


+def build_fragment_record(
+    context: ImportContext,
+    artifact_record: dict[str, Any],
+    observation: SegmentedObservation,
+    index: int,
+) -> dict[str, Any]:
+    return {
+        "fragment_id": f"frag_{artifact_record['artifact_id']}_{index}",
+        "import_id": context.import_id,
+        "source_id": artifact_record["artifact_id"],
+        "text": observation.text,
+        "section": observation.section,
+        "line_start": observation.line_start,
+        "line_end": observation.line_end,
+        "metadata": {
+            "artifact_path": observation.artifact_relative_path,
+            "role": observation.role,
+        },
+        "current_status": "draft",
+    }
+
+
 def build_claim_record(
     context: ImportContext,
     observation_record: dict[str, Any],
     observation: SegmentedObservation,
     concept_ids: list[str],
     index: int,
+    fragment_ids: list[str] | None = None,
 ) -> dict[str, Any]:
     return {
         "claim_id": _claim_id_for_observation(observation_record, observation, index),
@@ -84,7 +107,7 @@ def build_claim_record(
         "claim_text": observation_record["text"],
         "claim_kind": "statement" if observation_record["role"] == "claim" else "summary",
         "source_observation_ids": [observation_record["observation_id"]],
-        "supporting_fragment_ids": [],
+        "supporting_fragment_ids": list(fragment_ids or []),
         "concept_ids": [f"concept::{concept_id}" for concept_id in concept_ids],
         "contradicts_claim_ids": [f"clm_{_sanitize_claim_key(value)}" for value in observation.contradict_keys],
         "supersedes_claim_ids": [f"clm_{_sanitize_claim_key(value)}" for value in observation.supersede_keys],
@@ -134,3 +157,50 @@ def build_relation_records(context: ImportContext, artifact_record: dict[str, Any]

 def manifest_record(context: ImportContext) -> dict[str, Any]:
     return asdict(context) | {"source_repo_kind": "llmwiki"}
+
+
+def standardize_concept_rows(
+    concept_rows: list[dict[str, Any]],
+    claim_rows: list[dict[str, Any]],
+    relation_rows: list[dict[str, Any]],
+) -> tuple[list[dict[str, Any]], list[dict[str, Any]], list[dict[str, Any]]]:
+    alias_map: dict[str, str] = {}
+    normalized_index: dict[str, dict[str, Any]] = {}
+    standardized_rows: list[dict[str, Any]] = []
+
+    for row in concept_rows:
+        normalized_title = _normalize_concept_title(str(row.get("title", "")))
+        if not normalized_title:
+            standardized_rows.append(row)
+            continue
+
+        canonical = normalized_index.get(normalized_title)
+        if canonical is None:
+            normalized_index[normalized_title] = row
+            standardized_rows.append(row)
+            continue
+
+        canonical["source_artifact_ids"] = sorted(
+            set(canonical.get("source_artifact_ids", [])) | set(row.get("source_artifact_ids", []))
+        )
+        aliases = set(canonical.get("aliases", []))
+        aliases.add(str(row.get("title", "")))
+        aliases.update(str(alias) for alias in row.get("aliases", []))
+        aliases.discard(str(canonical.get("title", "")))
+        canonical["aliases"] = sorted(alias for alias in aliases if alias)
+        alias_map[str(row["concept_id"])] = str(canonical["concept_id"])
+
+    if alias_map:
+        for row in claim_rows:
+            row["concept_ids"] = [alias_map.get(concept_id, concept_id) for concept_id in row.get("concept_ids", [])]
+        for row in relation_rows:
+            row["source_id"] = alias_map.get(str(row.get("source_id", "")), str(row.get("source_id", "")))
+            row["target_id"] = alias_map.get(str(row.get("target_id", "")), str(row.get("target_id", "")))
+
+    return standardized_rows, claim_rows, relation_rows
+
+
+def _normalize_concept_title(value: str) -> str:
+    normalized = "".join(ch.lower() if ch.isalnum() else " " for ch in value)
+    tokens = [token for token in normalized.split() if token not in {"a", "an", "the"}]
+    return " ".join(tokens)
@@ -28,6 +28,7 @@ class DiscoveredImportSource:
 @dataclass
 class StructuredImportRows:
     artifact_rows: list[dict]
+    fragment_rows: list[dict]
     observation_rows: list[dict]
     claim_rows: list[dict]
     concept_rows: list[dict]
@@ -46,7 +47,7 @@ class GroundRecallSourceAdapter(Protocol):
     def import_intent(self) -> ImportIntent:
         ...

-    def build_rows(self, context, sources: list[DiscoveredImportSource]) -> StructuredImportRows | None:
+    def build_rows(self, context, sources: list[DiscoveredImportSource], root: Path | None = None) -> StructuredImportRows | None:
         ...
@@ -38,7 +38,7 @@ class DidactopusPackSourceAdapter:
     def import_intent(self) -> str:
         return "both"

-    def build_rows(self, context, sources: list[DiscoveredImportSource]) -> StructuredImportRows | None:
+    def build_rows(self, context, sources: list[DiscoveredImportSource], root: Path | None = None) -> StructuredImportRows | None:
         by_name = {Path(item.relative_path).name: item for item in sources}
         concepts_src = by_name.get("concepts.yaml")
         if concepts_src is None:
@@ -224,6 +224,7 @@ class DidactopusPackSourceAdapter:

         return StructuredImportRows(
             artifact_rows=artifact_rows,
+            fragment_rows=[],
             observation_rows=observation_rows,
             claim_rows=claim_rows,
             concept_rows=concept_rows,
@@ -22,6 +22,23 @@ class DocliftBundleSourceAdapter:
         base = Path(root)
         return (base / "manifest.json").exists() and (base / "documents").exists()

+    def _load_chunks(self, base: Path, document: dict) -> list[dict]:
+        explicit_path = document.get("chunks_path")
+        if explicit_path:
+            chunk_path = self._resolve_bundle_path(base, explicit_path)
+        else:
+            output_dir = self._resolve_bundle_path(base, document.get("output_dir"))
+            chunk_path = output_dir / "document.chunks.json"
+        if not chunk_path.exists():
+            return []
+        payload = json.loads(chunk_path.read_text(encoding="utf-8"))
+        if isinstance(payload, dict):
+            chunks = payload.get("chunks", [])
+            return [chunk for chunk in chunks if isinstance(chunk, dict)]
+        if isinstance(payload, list):
+            return [chunk for chunk in payload if isinstance(chunk, dict)]
+        return []
+
     def discover(self, root: str | Path) -> list[DiscoveredImportSource]:
         base = Path(root)
         rows: list[DiscoveredImportSource] = []
@@ -41,8 +58,8 @@ class DocliftBundleSourceAdapter:
     def import_intent(self) -> str:
         return "both"

-    def build_rows(self, context, sources: list[DiscoveredImportSource]) -> StructuredImportRows | None:
-        base = Path(context.source_root)
+    def build_rows(self, context, sources: list[DiscoveredImportSource], root: Path | None = None) -> StructuredImportRows | None:
+        base = Path(root) if root is not None else Path(context.source_root)
         if not self.detect(base) and sources:
             for candidate in [sources[0].path.parent, *sources[0].path.parents]:
                 if self.detect(candidate):
@@ -54,6 +71,7 @@ class DocliftBundleSourceAdapter:
         manifest = json.loads(manifest_path.read_text(encoding="utf-8"))

         artifact_rows: list[dict] = []
+        fragment_rows: list[dict] = []
         observation_rows: list[dict] = []
         claim_rows: list[dict] = []
         concept_rows: list[dict] = []
@@ -142,6 +160,71 @@ class DocliftBundleSourceAdapter:
                     "current_status": "triaged",
                 }
             )
+            for chunk_index, chunk in enumerate(self._load_chunks(base, document), start=1):
+                chunk_text = str(chunk.get("text") or "").strip()
+                if not chunk_text:
+                    continue
+                chunk_role = str(chunk.get("role") or "summary")
+                chunk_section = str(chunk.get("section") or title)
+                line_start = int(chunk.get("line_start") or 0)
+                line_end = int(chunk.get("line_end") or line_start)
+                fragment_id = f"frag_doclift_{index}_{chunk_index}"
+                observation_id = f"obs_doclift_{index}_{chunk_index}"
+                fragment_rows.append(
+                    {
+                        "fragment_id": fragment_id,
+                        "import_id": context.import_id,
+                        "source_id": artifact_id,
+                        "text": chunk_text,
+                        "section": chunk_section,
+                        "line_start": line_start,
+                        "line_end": line_end,
+                        "metadata": {
+                            "chunk_id": chunk.get("chunk_id", f"{document.get('document_id', index)}-{chunk_index}"),
+                            "source_kind": "doclift_chunk",
+                        },
+                        "current_status": "draft",
+                    }
+                )
+                observation_rows.append(
+                    {
+                        "observation_id": observation_id,
+                        "import_id": context.import_id,
+                        "artifact_id": artifact_id,
+                        "role": chunk_role,
+                        "text": chunk_text,
+                        "origin_path": relative_markdown,
+                        "origin_section": chunk_section,
+                        "line_start": line_start,
+                        "line_end": line_end,
+                        "source_url": source_path,
+                        "metadata": {
+                            "source_path_kind": source_path_kind,
+                            "chunk_id": chunk.get("chunk_id", f"{document.get('document_id', index)}-{chunk_index}"),
+                        },
+                        "grounding_status": "grounded",
+                        "support_kind": "direct_source",
+                        "confidence_hint": float(chunk.get("confidence_hint") or 0.75),
+                        "current_status": "draft",
+                    }
+                )
+                if chunk_role in {"claim", "summary"}:
+                    claim_rows.append(
+                        {
+                            "claim_id": f"clm_doclift_{index}_{chunk_index}",
+                            "import_id": context.import_id,
+                            "claim_text": chunk_text,
+                            "claim_kind": "statement" if chunk_role == "claim" else "summary",
+                            "source_observation_ids": [observation_id],
+                            "supporting_fragment_ids": [fragment_id],
+                            "concept_ids": [concept_id],
+                            "contradicts_claim_ids": [],
+                            "supersedes_claim_ids": [],
+                            "confidence_hint": float(chunk.get("confidence_hint") or 0.75),
+                            "grounding_status": "grounded",
+                            "current_status": "triaged",
+                        }
+                    )
             if previous_concept_id is not None:
                 relation_rows.append(
                     {
@@ -158,6 +241,7 @@ class DocliftBundleSourceAdapter:

         return StructuredImportRows(
             artifact_rows=artifact_rows,
+            fragment_rows=fragment_rows,
             observation_rows=observation_rows,
             claim_rows=claim_rows,
             concept_rows=concept_rows,
@@ -1,6 +1,7 @@
 from __future__ import annotations

 import argparse
+import inspect
 import json
 import shutil
 import socket
@@ -18,9 +19,11 @@ from .groundrecall_normalizer import (
     build_artifact_record,
     build_claim_record,
     build_concept_records,
+    build_fragment_record,
     build_observation_record,
     build_relation_records,
     manifest_record,
+    standardize_concept_rows,
 )
 from .groundrecall_review_bridge import export_review_bundle_from_import
 from .groundrecall_review_queue import build_review_queue
@@ -36,6 +39,7 @@ VALID_MODES = {"archive", "quick", "grounded"}
 class ImportResult:
     manifest: dict[str, Any]
     artifacts: list[dict[str, Any]]
+    fragments: list[dict[str, Any]]
     observations: list[dict[str, Any]]
     claims: list[dict[str, Any]]
     concepts: list[dict[str, Any]]
@@ -56,9 +60,10 @@ def _default_import_id(source_root: Path) -> str:
 def _portable_source_root_ref(source_path: Path, output_root: Path) -> tuple[str, str]:
     anchor = output_root.resolve().parent
     if source_path.is_relative_to(anchor):
-        relative = source_path.relative_to(anchor).as_posix()
-        if relative != ".":
-            return relative, "output_root_parent_relative"
+        relative = source_path.relative_to(anchor)
+        if relative == Path("."):
+            return source_path.name, "source_label"
+        return relative.as_posix(), "output_root_parent_relative"
     return source_path.name, "source_label"
@@ -147,13 +152,19 @@ def run_groundrecall_import(
     )

     artifact_rows: list[dict[str, Any]] = []
+    fragment_rows: list[dict[str, Any]] = []
     observation_rows: list[dict[str, Any]] = []
     claim_rows: list[dict[str, Any]] = []
     concept_rows: list[dict[str, Any]] = []
     relation_rows: list[dict[str, Any]] = []
-    structured_rows = adapter.build_rows(context, discovered)
+    build_rows_params = inspect.signature(adapter.build_rows).parameters
+    if "root" in build_rows_params:
+        structured_rows = adapter.build_rows(context, discovered, root=source_path)
+    else:
+        structured_rows = adapter.build_rows(context, discovered)
     if structured_rows is not None:
         artifact_rows.extend(structured_rows.artifact_rows)
+        fragment_rows.extend(structured_rows.fragment_rows)
         observation_rows.extend(structured_rows.observation_rows)
         claim_rows.extend(structured_rows.claim_rows)
         concept_rows.extend(structured_rows.concept_rows)
@@ -170,14 +181,27 @@ def run_groundrecall_import(
             relation_rows.extend(build_relation_records(context, artifact_row, page.concepts, page.links))

             for index, observation in enumerate(page.observations, start=1):
+                fragment_row = build_fragment_record(context, artifact_row, observation, index)
+                fragment_rows.append(fragment_row)
                 observation_row = build_observation_record(context, artifact_row, observation, index)
                 observation_rows.append(observation_row)
                 if mode == "archive":
                     continue
                 if observation.role not in {"claim", "summary"}:
                     continue
-                claim_rows.append(build_claim_record(context, observation_row, observation, page.concepts[:3], index))
+                claim_rows.append(
+                    build_claim_record(
+                        context,
+                        observation_row,
+                        observation,
+                        page.concepts[:3],
+                        index,
+                        fragment_ids=[fragment_row["fragment_id"]],
+                    )
+                )

+    fragment_rows = _dedupe_by_key(fragment_rows, "fragment_id")
+    concept_rows, claim_rows, relation_rows = standardize_concept_rows(concept_rows, claim_rows, relation_rows)
     concept_rows = _dedupe_by_key(concept_rows, "concept_id")
     relation_rows = _dedupe_by_key(relation_rows, "relation_id")
     artifact_rows = _dedupe_by_key(artifact_rows, "artifact_id")
@@ -189,6 +213,7 @@ def run_groundrecall_import(
         "import_intent": adapter.import_intent(),
         "source_root_kind": source_root_kind,
         "artifact_count": len(artifact_rows),
+        "fragment_count": len(fragment_rows),
         "observation_count": len(observation_rows),
         "claim_count": len(claim_rows),
         "concept_count": len(concept_rows),
@@ -197,6 +222,7 @@ def run_groundrecall_import(

     _write_json(output_dir / "manifest.json", manifest)
     _write_jsonl(output_dir / "artifacts.jsonl", artifact_rows)
+    _write_jsonl(output_dir / "fragments.jsonl", fragment_rows)
     _write_jsonl(output_dir / "observations.jsonl", observation_rows)
     _write_jsonl(output_dir / "claims.jsonl", claim_rows)
     _write_jsonl(output_dir / "concepts.jsonl", concept_rows)
@@ -210,6 +236,7 @@ def run_groundrecall_import(
     return ImportResult(
         manifest=manifest,
         artifacts=artifact_rows,
+        fragments=fragment_rows,
         observations=observation_rows,
         claims=claim_rows,
         concepts=concept_rows,
@@ -24,6 +24,7 @@ def lint_import_directory(import_dir: str | Path) -> dict[str, Any]:
     base = Path(import_dir)
     manifest = _read_json(base / "manifest.json")
     artifacts = _read_jsonl(base / "artifacts.jsonl")
+    fragments = _read_jsonl(base / "fragments.jsonl")
     observations = _read_jsonl(base / "observations.jsonl")
     claims = _read_jsonl(base / "claims.jsonl")
     concepts = _read_jsonl(base / "concepts.jsonl")
@@ -166,6 +167,7 @@ def lint_import_directory(import_dir: str | Path) -> dict[str, Any]:

     summary = {
         "artifact_count": len(artifacts),
+        "fragment_count": len(fragments),
         "observation_count": len(observations),
         "claim_count": len(claims),
         "concept_count": len(concepts),
New fixture `tests/fixtures/doclift_bundle_minimal/documents/lecture-1/document.chunks.json` (20 lines, vendored):

```json
{
  "chunks": [
    {
      "chunk_id": "lecture-1-c1",
      "role": "summary",
      "section": "Module A",
      "line_start": 1,
      "line_end": 4,
      "text": "Lecture 1 introduces Module A and frames the example lesson."
    },
    {
      "chunk_id": "lecture-1-c2",
      "role": "claim",
      "section": "Lesson A",
      "line_start": 5,
      "line_end": 7,
      "text": "Objective: Explain lesson A."
    }
  ]
}
```
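The fixture above shows the chunk schema the doclift bundle adapter consumes. As a rough sketch of how such a file could be turned into GroundRecall fragment rows — the function name, the `doc_index` parameter, and any row fields beyond those the tests assert on (`fragment_id`, `text`, `metadata.source_kind`) are illustrative assumptions, not the adapter's actual API:

```python
import json
from pathlib import Path


def chunks_to_fragment_rows(chunks_path: Path, doc_index: int) -> list[dict]:
    """Sketch: load a doclift `document.chunks.json` file and emit one
    fragment row per chunk, carrying the chunk's provenance fields
    (role, section, line span) in the fragment metadata."""
    data = json.loads(chunks_path.read_text(encoding="utf-8"))
    rows = []
    for i, chunk in enumerate(data["chunks"], start=1):
        rows.append(
            {
                # IDs like "frag_doclift_1_1" follow the pattern the tests check.
                "fragment_id": f"frag_doclift_{doc_index}_{i}",
                "text": chunk["text"],
                "metadata": {
                    "source_kind": "doclift_chunk",
                    "chunk_id": chunk["chunk_id"],
                    "role": chunk["role"],
                    "section": chunk["section"],
                    "line_start": chunk["line_start"],
                    "line_end": chunk["line_end"],
                },
            }
        )
    return rows
```

Keeping the line span and section in metadata is what makes the fragments provenance-first: a downstream claim can always be traced back to the exact lines of the source document.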
```diff
@@ -3,6 +3,7 @@ from __future__ import annotations
import json
from pathlib import Path

from groundrecall.groundrecall_normalizer import standardize_concept_rows
from groundrecall.ingest import run_groundrecall_import
from groundrecall.lint import lint_import_directory
@@ -46,8 +47,13 @@ def test_groundrecall_import_emits_normalized_artifacts(tmp_path: Path) -> None:
    artifacts = _read_jsonl(result.out_dir / "artifacts.jsonl")
    assert {item["artifact_kind"] for item in artifacts} == {"compiled_page", "raw_note", "session_log"}

    fragments = _read_jsonl(result.out_dir / "fragments.jsonl")
    assert len(fragments) >= 3
    assert all(item["source_id"].startswith("ia_") for item in fragments)

    claims = _read_jsonl(result.out_dir / "claims.jsonl")
    assert any("Reliable rate upper bound" in item["claim_text"] for item in claims)
    assert any(item["supporting_fragment_ids"] for item in claims)

    concepts = _read_jsonl(result.out_dir / "concepts.jsonl")
    concept_ids = {item["concept_id"] for item in concepts}
@@ -78,6 +84,49 @@ def test_groundrecall_import_emits_normalized_artifacts(tmp_path: Path) -> None:
    assert "citation_reviews" in review_data


def test_concept_standardization_merges_duplicate_titles_into_aliases() -> None:
    concept_rows = [
        {
            "concept_id": "concept::signal-processing",
            "title": "Signal Processing",
            "aliases": [],
            "description": "",
            "source_artifact_ids": ["ia_one"],
            "current_status": "triaged",
        },
        {
            "concept_id": "concept::signal-processing-variant",
            "title": "The Signal Processing",
            "aliases": ["DSP"],
            "description": "",
            "source_artifact_ids": ["ia_two"],
            "current_status": "triaged",
        },
    ]
    claim_rows = [
        {
            "claim_id": "clm_1",
            "concept_ids": ["concept::signal-processing-variant"],
        }
    ]
    relation_rows = [
        {
            "relation_id": "rel_1",
            "source_id": "concept::signal-processing-variant",
            "target_id": "concept::signal-processing",
        }
    ]

    concepts, claims, relations = standardize_concept_rows(concept_rows, claim_rows, relation_rows)

    assert len(concepts) == 1
    assert concepts[0]["concept_id"] == "concept::signal-processing"
    assert concepts[0]["aliases"] == ["DSP", "The Signal Processing"]
    assert concepts[0]["source_artifact_ids"] == ["ia_one", "ia_two"]
    assert claims[0]["concept_ids"] == ["concept::signal-processing"]
    assert relations[0]["source_id"] == "concept::signal-processing"


def test_groundrecall_import_parses_explicit_claim_relations(tmp_path: Path) -> None:
    root = tmp_path / "llmwiki"
    (root / "wiki").mkdir(parents=True)
```
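The merge behavior this test pins down can be sketched as a minimal `standardize_concept_rows`: normalize titles (lowercase, strip a leading article), fold duplicate concepts into the first-seen row's aliases and source artifacts, then remap claim and relation IDs onto the surviving concept. This is an illustrative reimplementation of the contract the test asserts, not the code in `groundrecall_normalizer`:

```python
from __future__ import annotations

import re


def _normalize_title(title: str) -> str:
    """Lowercase, drop a leading article, and collapse whitespace so
    'Signal Processing' and 'The Signal Processing' share one key."""
    t = title.strip().lower()
    t = re.sub(r"^(the|a|an)\s+", "", t)
    return re.sub(r"\s+", " ", t)


def standardize_concept_rows(concept_rows, claim_rows, relation_rows):
    merged: dict[str, dict] = {}  # normalized title -> canonical concept row
    id_map: dict[str, str] = {}   # original concept_id -> canonical concept_id
    for row in concept_rows:
        key = _normalize_title(row["title"])
        if key not in merged:
            merged[key] = dict(row)
            id_map[row["concept_id"]] = row["concept_id"]
        else:
            canon = merged[key]
            id_map[row["concept_id"]] = canon["concept_id"]
            # Fold the duplicate's own title and aliases into the canonical aliases.
            extra = set(row["aliases"]) | {row["title"]}
            canon["aliases"] = sorted((set(canon["aliases"]) | extra) - {canon["title"]})
            canon["source_artifact_ids"] = sorted(
                set(canon["source_artifact_ids"]) | set(row["source_artifact_ids"])
            )
    concepts = list(merged.values())
    # Rewrite references so no claim or relation points at a merged-away ID.
    claims = [
        {**c, "concept_ids": [id_map.get(cid, cid) for cid in c["concept_ids"]]}
        for c in claim_rows
    ]
    relations = [
        {
            **r,
            "source_id": id_map.get(r["source_id"], r["source_id"]),
            "target_id": id_map.get(r["target_id"], r["target_id"]),
        }
        for r in relation_rows
    ]
    return concepts, claims, relations
```

Remapping the claim and relation rows in the same pass is what keeps the graph referentially intact after a merge, which is exactly what the design rule "treat extracted triples as candidate claims, not promoted facts" depends on during review.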
```diff
@@ -216,8 +216,13 @@ def test_doclift_bundle_import_generates_structured_concepts(tmp_path: Path) -> None:
    assert result.manifest["import_intent"] == "both"
    assert result.manifest["source_root"] == "doclift_bundle_minimal"
    assert result.manifest["source_root_kind"] == "source_label"
    assert result.manifest["fragment_count"] == 2
    concept_ids = {item["concept_id"] for item in result.concepts}
    assert "concept::lecture-1" in concept_ids
    claim_ids = {item["claim_id"] for item in result.claims}
    assert "clm_doclift_1" in claim_ids
    assert "clm_doclift_1_1" in claim_ids
    assert result.observations[0]["source_url"] == "legacy/lecture-1.doc"
    assert len(result.fragments) == 2
    assert result.fragments[0]["metadata"]["source_kind"] == "doclift_chunk"
    assert result.claims[1]["supporting_fragment_ids"] == ["frag_doclift_1_1"]
```