Initial commit

welsberr 2026-04-23 07:06:50 -04:00
parent bf5c1a0e4a
commit e819f17607
79 changed files with 8094 additions and 0 deletions

docs/README.md Normal file

@@ -0,0 +1,20 @@
# Docs
The top-level documentation in this repository is intended to describe `GroundRecall` as a standalone project.
Primary docs:
- [quickstart.md](quickstart.md)
- [architecture.md](architecture.md)
- [llmwiki-import.md](llmwiki-import.md)
- [sync-roadmap.md](sync-roadmap.md)
Legacy extraction notes:
- [legacy/groundrecall-assistant-architecture.md](legacy/groundrecall-assistant-architecture.md)
- [legacy/groundrecall-ingestion-refactor.md](legacy/groundrecall-ingestion-refactor.md)
- [legacy/groundrecall-llmwiki-import.md](legacy/groundrecall-llmwiki-import.md)
- [legacy/groundrecall-migration-plan.md](legacy/groundrecall-migration-plan.md)
- [legacy/groundrecall-repo-bootstrap.md](legacy/groundrecall-repo-bootstrap.md)
Those legacy documents were carried over from the earlier `Didactopus`-embedded phase. They remain useful as design history, but they are not the preferred starting point for current standalone `GroundRecall` work.

docs/architecture.md Normal file

@@ -0,0 +1,93 @@
# Architecture
`GroundRecall` is the grounded knowledge substrate in a larger stack:
- `GroundRecall`: canonical knowledge ingestion, promotion, query, export, and future sync
- `Didactopus`: learner-facing workflows and educational tooling
- `GenieHive`: model and routing layer where runtime assistant/service resolution is needed
## Core Design
The system is built around one canonical flow (sketched after the list):
1. ingest weakly structured sources
2. normalize them into stable knowledge objects
3. lint and queue them for review
4. promote reviewed objects into a canonical store
5. query and export promoted state
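A minimal end-to-end sketch of that flow, driven through the documented CLI. Paths and the import id are placeholders:
```python
# Sketch only: runs the documented CLI commands in order.
import subprocess

def run(*args: str) -> None:
    subprocess.run(["groundrecall", *args], check=True)

run("import", "/path/to/llmwiki", "--mode", "quick")  # 1-2: ingest and normalize
run("lint", "imports/<import-id>")                    # 3: lint and queue for review
run("promote", "imports/<import-id>", "store/")       # 4: promote reviewed objects
run("query", "store/", "channel-capacity")            # 5: query promoted state
run("export", "store/", "exports/groundrecall", "--concept", "channel-capacity")
```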
## Core Objects
The canonical store is built from these object families:
- `Source`
- `Fragment`
- `Artifact`
- `Observation`
- `Claim`
- `Concept`
- `Relation`
- `ReviewCandidate`
- `PromotionRecord`
- `GroundRecallSnapshot`
These objects are assistant-neutral. Assistant-specific formatting belongs at the adapter layer.
## Package Surface
The main standalone package surface is:
- `groundrecall.ingest`
- `groundrecall.lint`
- `groundrecall.models`
- `groundrecall.store`
- `groundrecall.promotion`
- `groundrecall.query`
- `groundrecall.export`
- `groundrecall.assistant_export`
- `groundrecall.inspect`
- `groundrecall.source_adapters.*`
- `groundrecall.assistants.*`
There are also compatibility-style helper modules prefixed with `groundrecall_` inside the package. Those exist because the standalone repo was extracted from an earlier monorepo layout.
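For orientation, a couple of imports under the namespaced surface; the flat-prefix path in the comment is transitional:
```python
# Namespaced package surface:
from groundrecall.store import GroundRecallStore
from groundrecall.query import query_concept

# Transitional flat-prefix helper modules also live inside the package,
# e.g. groundrecall.groundrecall_assistants (compatibility path).
```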
## Source Adapters
Adapters handle source-shape-specific discovery and mapping while the downstream pipeline stays generic.
Current adapter families include:
- `llmwiki`
- `markdown_notes`
- `transcript`
- `didactopus_pack`
## Assistant Boundary
Assistant integration is intentionally outside the core store and query semantics.
The rule is:
- core `GroundRecall` owns truth, provenance, lifecycle, and retrieval semantics
- assistant adapters own presentation, bundle shaping, and tool-specific exports
Current adapters include:
- `codex`
- `claude_code`
## Alpha Boundary
The current alpha is strong enough for:
- local import and promotion
- canonical query and export
- assistant-neutral bundles
- assistant-targeted bundle generation
It is not yet complete for:
- multi-node sync and merge
- re-import/update semantics
- richer review adjudication
- large-scale distributed corpus integration

docs/legacy/groundrecall-assistant-architecture.md Normal file

@@ -0,0 +1,174 @@
# GroundRecall Assistant Integration Architecture
This document defines how GroundRecall should support Codex, Claude Code, and
future assistant environments without treating any single assistant as the
authoritative integration target.
## Design rule
GroundRecall core must be assistant-agnostic.
Assistant-specific formats are derived views over promoted GroundRecall objects,
not the canonical representation of knowledge.
## Why this boundary matters
If assistant-specific prompt packaging leaks into the core model too early,
GroundRecall becomes:
- harder to evolve
- harder to validate
- harder to sync across machines
- harder to support across multiple assistant environments
The stable boundary should instead be:
- canonical grounded knowledge objects in core
- assistant adapters at the edge
## Core vs adapter split
### Core GroundRecall responsibilities
These should remain assistant-neutral:
- schemas for `Source`, `Fragment`, `Artifact`, `Observation`, `Claim`,
`Concept`, `Relation`, `ReviewCandidate`, and `PromotionRecord`
- provenance and confidence modeling
- contradiction and supersession handling
- linting and review queue generation
- review and promotion workflows
- persistent storage for promoted objects
- query and retrieval semantics
- sync and multi-machine consolidation
- canonical export formats
### Assistant adapter responsibilities
These should be adapter-specific:
- prompt/context packaging
- assistant-specific bundle layout
- memory-file rendering
- skill-file rendering
- assistant capability declarations
- token-budget shaping and truncation policy
- tool-specific metadata
## Canonical export contract
GroundRecall should export assistant-neutral artifacts first.
Recommended canonical exports:
- `groundrecall_snapshot.json`
- `claims.jsonl`
- `concepts.jsonl`
- `relations.jsonl`
- `provenance_manifest.json`
- `query_bundle.json`
Assistant adapters then derive secondary outputs from those canonical exports.
## Assistant adapter interface
GroundRecall should expose a small adapter protocol.
Example shape:
```python
from pathlib import Path
from typing import Protocol

class AssistantAdapter(Protocol):
    name: str

    def export_bundle(self, snapshot: dict, out_dir: Path) -> list[Path]:
        ...

    def build_context(self, query_result: dict) -> dict:
        ...

    def supported_capabilities(self) -> dict[str, bool]:
        ...
```
This is a strategy/plugin boundary. A small registry or factory is acceptable,
but the important architectural decision is the separation of concerns, not the
factory itself.
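A minimal registry sketch along those lines, continuing the protocol above; the names here are illustrative, not a shipped API:
```python
_ADAPTERS: dict[str, AssistantAdapter] = {}

def register_adapter(adapter: AssistantAdapter) -> None:
    # Later lookups key off the adapter's declared name.
    _ADAPTERS[adapter.name] = adapter

def get_adapter(name: str) -> AssistantAdapter:
    try:
        return _ADAPTERS[name]
    except KeyError:
        raise KeyError(f"unknown assistant adapter: {name!r}") from None
```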
## Recommended package layout
Recommended modules:
- `didactopus.groundrecall.models`
- `didactopus.groundrecall.store`
- `didactopus.groundrecall.promotion`
- `didactopus.groundrecall.query`
- `didactopus.groundrecall.export`
- `didactopus.groundrecall.assistants.base`
- `didactopus.groundrecall.assistants.codex`
- `didactopus.groundrecall.assistants.claude_code`
## Export layering
Recommended filesystem layout:
- `exports/canonical/`
- `exports/assistants/codex/`
- `exports/assistants/claude-code/`
Canonical exports remain the durable interchange format.
Assistant exports remain reproducible derived artifacts.
## Query layering
The query layer should return assistant-neutral structures such as the following (a payload sketch appears after these lists):
- relevant claims
- supporting fragments
- provenance
- contradictions
- supersessions
- confidence and recency
- suggested next actions
Adapters may then convert this payload into:
- Codex skill/context bundles
- Claude Code project memory/context bundles
- future assistant context packages
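An illustrative payload combining those fields; the key names are assumptions, not a frozen schema:
```python
query_result = {
    "claims": [],                  # relevant claims
    "fragments": [],               # supporting fragments
    "provenance": [],
    "contradictions": [],
    "supersessions": [],
    "confidence": {},              # confidence and recency signals
    "suggested_next_actions": [],
}
```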
## Stability policy
GroundRecall should adopt these rules early:
1. No assistant-specific fields in canonical `Claim` or `Concept` objects.
2. No assistant-specific persistence formats as authoritative storage.
3. No review or promotion decisions based on assistant-specific packaging.
4. Assistant adapters may be added or removed without changing canonical objects.
## Migration implication
Current and future GroundRecall work should replace language like:
- "Codex-facing export"
- "Codex skill bundle"
with:
- "assistant adapter bundle"
- "assistant-facing export"
- "assistant-specific derived bundle"
Codex can still be one adapter and may remain the first implemented adapter, but
it should not define the system boundary.
## Immediate implementation impact
The next GroundRecall milestones should be interpreted as:
1. build assistant-neutral canonical models and storage
2. build review and promotion over canonical objects
3. build canonical query and export layers
4. add assistant adapters as thin renderers over those canonical outputs
This is the lowest-risk path for long-term stability.

docs/legacy/groundrecall-ingestion-refactor.md Normal file

@@ -0,0 +1,105 @@
# GroundRecall Ingestion Refactor Plan
GroundRecall should treat `llmwiki` as one upstream source shape, not as the
defining architecture for grounded knowledge import.
Didactopus already has broader ambitions around ingestion of weakly structured
materials such as:
- markdown notes
- transcripts
- HTML/text course materials
- generated draft packs
- review sessions
- learner artifacts
The GroundRecall import pipeline should therefore be generalized around a shared
normalization and promotion substrate with pluggable source adapters.
## Design rule
Source-specific logic should live at the ingestion edge.
These stages should be generic:
- segmentation
- extraction
- normalization
- lint
- review queue generation
- review bridge
- promotion
- canonical store
- query
- canonical export
## Recommended module split
Recommended package layout:
- `didactopus.groundrecall_ingest`
- `didactopus.groundrecall_source_adapters.base`
- `didactopus.groundrecall_source_adapters.llmwiki`
- `didactopus.groundrecall_source_adapters.markdown_notes`
- `didactopus.groundrecall_source_adapters.transcript`
- `didactopus.groundrecall_source_adapters.didactopus_pack`
- `didactopus.groundrecall_source_adapters.didactopus_review`
## Shared intermediate envelope
Adapters should emit shared discovery records rather than jumping straight into
canonical GroundRecall objects.
Recommended intermediate types:
- `DiscoveredImportSource`
- `SegmentCandidate`
- `ImportProfile`
This keeps adapter-specific parsing separate from the shared import pipeline.
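Illustrative shapes for those intermediate types; only the three class names come from this plan, and every field below is an assumption:
```python
from dataclasses import dataclass

@dataclass
class DiscoveredImportSource:
    source_id: str
    root: str                # filesystem root the adapter claimed
    adapter_name: str

@dataclass
class SegmentCandidate:
    source_id: str
    path: str
    heading: str | None = None
    text: str = ""

@dataclass
class ImportProfile:
    extraction_strictness: str = "default"
    output_intent: str = "grounded_knowledge"   # or "curriculum" / "both"
```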
## Output intent
Not every imported source should be treated the same way.
Adapters should declare an output intent:
- `grounded_knowledge`
- `curriculum`
- `both`
Examples:
- `llmwiki` usually targets `grounded_knowledge`
- loose transcripts may target `grounded_knowledge`
- syllabus/course folders often target `curriculum`
- Didactopus packs or review sessions may target `both`
## First refactor milestones
### Milestone 1
- introduce adapter registry and adapter protocol
- move current `llmwiki` discovery/classification behind an adapter
- preserve the current import CLI behavior
### Milestone 2
- add a `markdown_notes` adapter
- add a `transcript` adapter
- add import profiles that tune extraction strictness
### Milestone 3
- add a `didactopus_pack` adapter for pack and review artifacts
- allow current Didactopus outputs to feed into GroundRecall directly
## Why this matters
This avoids building two parallel ingestion stacks inside Didactopus:
- one for packs and educational structures
- another for grounded knowledge capture
Instead, the system gets one generic ingestion substrate with multiple source
adapters and multiple downstream promotion/export paths.

docs/legacy/groundrecall-llmwiki-import.md Normal file

@@ -0,0 +1,496 @@
# GroundRecall `llmwiki` Import Specification
This document defines the first-pass import path for users who already have some
form of `llmwiki`-style repository and want to migrate it into the broader
GroundRecall substrate while staying compatible with Didactopus review and
promotion flows.
## Goal
The import path should let an existing `llmwiki` corpus become:
- searchable without immediate manual cleanup
- reviewable rather than blindly trusted
- grounded in explicit provenance
- promotable into durable structured knowledge objects
- exportable back into compiled wiki pages, assistant adapter bundles, and
queryable graph artifacts
The key rule is:
Imported wiki pages are **derived artifacts**, not automatic source truth.
## Import philosophy
Users coming from `llmwiki` often have a mixture of:
- raw notes
- compiled markdown pages
- local source files
- generated summaries
- ad hoc link graphs
- session transcripts
- speculative or weakly-supported synthesis
GroundRecall should preserve that work without pretending all of it is already
promoted knowledge.
The import pipeline therefore has two responsibilities:
1. Preserve the original material with minimal loss.
2. Reify explicit structured objects that can later be reviewed and promoted.
## Scope of the first implementation
The first implementation should support common `llmwiki` layouts such as:
- `raw/`
- `wiki/`
- `schema.*`
- `logs/`
- `sources/`
- top-level markdown pages
The importer should not require a canonical upstream schema. It should operate
from directory conventions plus simple heuristics.
## Import modes
### 1. `archive`
Purpose:
- preserve an existing `llmwiki` tree as read-only imported artifacts
- index it for search and later review
Behavior:
- no claim promotion
- minimal extraction
- all compiled pages remain `draft`
Use when:
- the user wants backward compatibility first
- the corpus quality is unknown
### 2. `quick`
Purpose:
- bootstrap usable structured objects fast
Behavior:
- import pages and raw sources
- extract candidate claims and concepts heuristically
- attach lightweight provenance
- queue uncertain items for review
Use when:
- the user wants early utility and accepts heuristic noise
### 3. `grounded`
Purpose:
- perform a migration suitable for long-lived shared knowledge
Behavior:
- require provenance for promoted claims
- mark unsupported statements explicitly
- produce review records and lint findings
- populate promotion queues rather than auto-promoting
Use when:
- the imported corpus will be shared across machines or agents
## Pipeline stages
### 1. Capture
The importer records the source repository as an import artifact.
Required metadata:
- `import_id`
- `import_mode`
- `source_root`
- `imported_at`
- `machine_id`
- `agent_id`
- `source_repo_kind=llmwiki`
Outputs:
- import manifest
- artifact records for all discovered files
### 2. Segment
Imported content is split into stable units.
Primary segment types:
- `source_document`
- `source_fragment`
- `compiled_page`
- `section_summary`
- `candidate_claim`
- `candidate_concept`
- `candidate_relation`
- `session_observation`
Segmentation should preserve:
- original path
- section heading
- line or byte offsets when possible
- page title
- frontmatter fields
### 3. Classify
Each segment gets a semantic role.
Recommended roles:
- `source`
- `derivation`
- `claim`
- `summary`
- `question`
- `todo`
- `speculation`
- `obsolete`
- `transcript`
This prevents unsupported prose from being confused with grounded knowledge.
### 4. Ground
Each imported segment gets provenance and support metadata.
Required grounding fields:
- `origin_artifact_id`
- `origin_path`
- `origin_section`
- `source_url` when known
- `retrieval_date` when known
- `machine_id`
- `session_id` when known
- `support_kind`
- `grounding_status`
Suggested values:
- `support_kind`: `direct_source`, `derived_from_page`, `derived_from_session`,
`inferred`, `unknown`
- `grounding_status`: `grounded`, `partially_grounded`, `ungrounded`
### 5. Normalize
The importer emits explicit GroundRecall objects.
Minimum object set:
- `Source`
- `Fragment`
- `Artifact`
- `Observation`
- `Claim`
- `Concept`
- `Relation`
### 6. Lint
The importer produces machine-readable findings before promotion; one check is sketched after the list.
Required lint checks:
- claim has no supporting fragment
- multiple claims appear text-identical
- concept is orphaned
- relation points to missing concept
- page summary has no cited support
- imported item marked `obsolete` still linked as current
- same claim imported with conflicting confidence or polarity
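A sketch of the first check; the claim fields follow the `ImportedClaim` contract later in this document, while the finding shape is an assumption:
```python
from typing import Any

def lint_unsupported_claims(claims: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Flag claims that cite no supporting fragment."""
    findings: list[dict[str, Any]] = []
    for claim in claims:
        if not claim.get("supporting_fragment_ids"):
            findings.append({
                "check": "claim_without_supporting_fragment",
                "claim_id": claim.get("claim_id"),
                "severity": "warning",
            })
    return findings
```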
### 7. Promote
Imported objects enter existing Didactopus review/promotion lanes rather than
becoming trusted immediately.
Recommended states:
- `draft`
- `triaged`
- `reviewed`
- `promoted`
- `superseded`
- `archived`
### 8. Export
Promoted objects can then be rendered back out as:
- compiled wiki pages
- graph snapshots
- assistant adapter bundles
- review reports
- query bundles for assistant-facing use
## Object contracts
### `ImportedArtifact`
```json
{
"artifact_id": "ia_001",
"import_id": "imp_2026_04_16_a",
"artifact_kind": "compiled_page",
"path": "wiki/channel-capacity.md",
"title": "Channel Capacity",
"sha256": "abc123",
"created_at": "2026-04-16T14:00:00Z",
"metadata": {
"frontmatter": {},
"headings": ["Definition", "Examples"]
},
"current_status": "draft"
}
```
### `ImportedObservation`
```json
{
"observation_id": "obs_001",
"import_id": "imp_2026_04_16_a",
"artifact_id": "ia_001",
"role": "summary",
"text": "Capacity bounds reliable communication over a noisy channel.",
"origin_path": "wiki/channel-capacity.md",
"origin_section": "Definition",
"line_start": 12,
"line_end": 14,
"grounding_status": "partially_grounded",
"support_kind": "derived_from_page",
"confidence_hint": 0.63,
"current_status": "draft"
}
```
### `ImportedClaim`
```json
{
"claim_id": "clm_001",
"import_id": "imp_2026_04_16_a",
"claim_text": "Channel capacity is the maximum reliable communication rate for a channel model.",
"claim_kind": "definition",
"source_observation_ids": ["obs_001"],
"supporting_fragment_ids": ["frag_014"],
"concept_ids": ["concept::channel-capacity"],
"confidence_hint": 0.74,
"grounding_status": "grounded",
"current_status": "triaged"
}
```
### `ImportedConcept`
```json
{
"concept_id": "concept::channel-capacity",
"import_id": "imp_2026_04_16_a",
"title": "Channel Capacity",
"aliases": [],
"description": "Imported concept from llmwiki corpus.",
"source_artifact_ids": ["ia_001"],
"current_status": "triaged"
}
```
### `ImportedRelation`
```json
{
"relation_id": "rel_001",
"import_id": "imp_2026_04_16_a",
"source_id": "concept::shannon-entropy",
"target_id": "concept::channel-capacity",
"relation_type": "supports_understanding_of",
"evidence_ids": ["obs_015"],
"current_status": "draft"
}
```
## Mapping from `llmwiki` into GroundRecall
Recommended first-pass mapping (a classification sketch follows the list):
- `raw/*` -> `Source` or `Artifact(kind=raw_note)`
- `wiki/*.md` -> `Artifact(kind=compiled_page)`
- frontmatter -> artifact metadata
- headings -> section boundaries
- linked page names -> candidate `Concept` and `Relation`
- bullet or sentence extraction -> candidate `Observation` and `Claim`
- chat or session logs -> `Observation(kind=session_note)`
- schema files -> import metadata only unless a future adapter exists
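A heuristic sketch of the path-based part of that mapping; the artifact-kind strings come from the table above, while the function itself is illustrative:
```python
from pathlib import Path

def classify_llmwiki_path(root: Path, path: Path) -> str:
    """Map a discovered file onto a first-pass artifact kind."""
    rel = path.relative_to(root)
    top = rel.parts[0] if len(rel.parts) > 1 else ""
    if top == "raw":
        return "raw_note"
    if top == "wiki" and path.suffix == ".md":
        return "compiled_page"
    if top == "logs":
        return "session_note"
    return "unclassified"
```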
## Confidence and trust policy
Imported confidence must remain clearly separate from reviewed confidence.
Recommended fields:
- `confidence_hint`
- `review_confidence`
- `grounding_status`
- `review_verdict`
Policy (restated as a predicate after this list):
- `confidence_hint` comes from heuristic import scoring
- `review_confidence` exists only after review
- promotion requires at least `partially_grounded`
- fully ungrounded claims can be stored, but only as `draft` or `archived`
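A minimal sketch of that policy as a promotion gate; the field names come from the list above, while the `accept` verdict vocabulary is an assumption:
```python
from typing import Any

def may_promote(claim: dict[str, Any]) -> bool:
    grounded_enough = claim.get("grounding_status") in {"grounded", "partially_grounded"}
    reviewed = claim.get("review_verdict") == "accept" and "review_confidence" in claim
    return grounded_enough and reviewed
```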
## Provenance policy
The importer should follow the existing Didactopus provenance direction:
- preserve source identity
- preserve retrieval date when available
- preserve adaptation status
- keep both human-readable and machine-readable provenance
When only a compiled wiki page exists and the original source is missing:
- the compiled page becomes the immediate origin artifact
- all extracted claims must be marked `derived_from_page`
- such claims should not auto-promote in `grounded` mode
## Review and promotion integration
Imported `Claim` and `Concept` objects should feed into the same general review
machinery already used for pack-oriented promotion:
- create candidate records
- attach lint findings
- route to a triage lane
- collect review verdicts
- emit promotion records
Suggested triage lanes:
- `knowledge_capture`
- `pack_improvement`
- `skill_export`
- `source_cleanup`
- `conflict_resolution`
## Module layout
First-pass module layout:
- `didactopus.groundrecall_import`
Entry points and top-level orchestration.
- `didactopus.groundrecall_discovery`
Finds `llmwiki`-style files and classifies paths.
- `didactopus.groundrecall_segmenter`
Splits pages and logs into stable observations and candidate claims.
- `didactopus.groundrecall_normalizer`
Emits normalized import objects.
- `didactopus.groundrecall_lint`
Import-time lint checks.
- `didactopus.groundrecall_review_bridge`
Converts imported objects into review candidates and promotion records.
- `didactopus.groundrecall_export`
Renders promoted objects back to wiki, graph, and skill artifacts.
## CLI shape
Suggested CLI:
```bash
python -m didactopus.groundrecall.cli import /path/to/llmwiki --mode archive
python -m didactopus.groundrecall.cli import /path/to/llmwiki --mode quick
python -m didactopus.groundrecall.cli import /path/to/llmwiki --mode grounded
python -m didactopus.groundrecall.cli lint imports/<import-id>
python -m didactopus.groundrecall.cli promote imports/<import-id> /path/to/store
python -m didactopus.groundrecall.cli export /path/to/store exports/groundrecall --concept channel-capacity
```
Compatibility wrappers still exist during migration:
```bash
python -m didactopus.groundrecall_import /path/to/llmwiki --mode grounded
python -m didactopus.groundrecall_lint imports/<import-id>
python -m didactopus.groundrecall_export /path/to/store exports/groundrecall --concept channel-capacity
```
## Filesystem layout
Suggested repository-local layout:
- `imports/<import-id>/manifest.json`
- `imports/<import-id>/artifacts.jsonl`
- `imports/<import-id>/observations.jsonl`
- `imports/<import-id>/claims.jsonl`
- `imports/<import-id>/concepts.jsonl`
- `imports/<import-id>/relations.jsonl`
- `imports/<import-id>/lint_findings.json`
- `imports/<import-id>/review_queue.json`
This keeps imported state auditable and easy to sync across machines.
## Multi-machine sync implication
For distributed assistant use, imported state should be append-oriented and
rebuildable.
Recommended sync primitives:
- import manifests
- normalized jsonl object streams
- review records
- promotion records
Non-authoritative derived artifacts:
- rendered wiki pages
- local indexes
- embeddings
- cache files
This allows multiple machines to contribute import events without making the
compiled page tree the merge primitive.
## First implementation milestones
### Milestone 1
- discover `raw/` and `wiki/`
- import artifacts
- segment markdown by headings
- emit observations and candidate claims
- write import manifest and jsonl outputs
### Milestone 2
- add grounding metadata
- add lint checks
- add triage lanes and review queue output
### Milestone 3
- map promoted claims into assistant-neutral exports plus assistant adapter bundles
- render compiled wiki views from promoted objects
- support multi-machine import manifests and merge-safe event storage
## Non-goals for the first pass
- perfect semantic claim extraction
- automatic trust assignment
- full upstream `llmwiki` schema compatibility
- lossless import of every custom plugin or script
- embeddings-first retrieval
The first pass should be conservative, inspectable, and easy to improve.

docs/legacy/groundrecall-migration-plan.md Normal file

@@ -0,0 +1,281 @@
# GroundRecall Migration Plan
This document turns the boundary decisions in [deployment-modes.md](deployment-modes.md) into an implementation plan.
The goal is not an immediate repo split. The goal is to let `GroundRecall` become independently deployable and operable without destabilizing ongoing `Didactopus` learner work.
## Current State
Today, GroundRecall exists as a set of modules under `src/didactopus/`:
- `groundrecall_import`
- `groundrecall_source_adapters/*`
- `groundrecall_lint`
- `groundrecall_review_queue`
- `groundrecall_review_bridge`
- `groundrecall_models`
- `groundrecall_store`
- `groundrecall_promotion`
- `groundrecall_query`
- `groundrecall_export`
- `groundrecall_assistant_export`
- `groundrecall_assistants/*`
This is acceptable as an implementation phase, but it creates two risks:
1. generic knowledge-substrate functionality may continue to accrete under `didactopus.main`
2. feature work may silently assume the presence of learner-facing Didactopus components
## Migration Goal
Target state:
- `Didactopus` remains the learner-facing application
- `GroundRecall` becomes the standalone grounded knowledge substrate
- `GenieHive` remains the model and routing control plane
The package, CLI, and deployment boundaries should eventually reflect that.
## Target Ownership
### GroundRecall should own
- source ingestion and normalization
- claim/concept/relation/artifact/provenance schemas
- canonical store and snapshots
- lint and review queue generation
- promotion and merge semantics
- assistant-neutral query and export
- assistant adapter export
- sync, merge, and team/shared knowledge operations
### Didactopus should own
- learner session flows
- mentor/practice/evaluator/project-advisor workflows
- pack and curriculum-specific review UX
- mastery-ledger and learner evidence experiences
- educational packaging over grounded knowledge
### Shared boundary helpers should stay narrow
- provider policy that depends on GenieHive route resolution but serves learner workflows
- review bridges where GroundRecall needs to feed an existing Didactopus review process during the transition
## Packaging Direction
### Phase 0: Present layout, stricter discipline
Keep the code in `src/didactopus/`, but use naming and imports that preserve the eventual split.
Rules:
- new generic knowledge features go into `groundrecall_*` modules
- new learner-facing features go into `didactopus` learner modules
- do not add generic knowledge operations to `didactopus.main`
- treat review bridges as bridges, not permanent core ownership
### Phase 1: Explicit namespace inside the repo
Preferred direction:
- move GroundRecall modules under `src/didactopus/groundrecall/`
Target structure:
- `src/didactopus/groundrecall/ingest.py`
- `src/didactopus/groundrecall/source_adapters/`
- `src/didactopus/groundrecall/models.py`
- `src/didactopus/groundrecall/store.py`
- `src/didactopus/groundrecall/promotion.py`
- `src/didactopus/groundrecall/query.py`
- `src/didactopus/groundrecall/export.py`
- `src/didactopus/groundrecall/assistants/`
- `src/didactopus/groundrecall/sync.py`
- `src/didactopus/groundrecall/merge.py`
- `src/didactopus/groundrecall/cli.py`
Benefits:
- cleaner conceptual grouping
- easier extraction later
- clearer import discipline
Compatibility path:
- keep thin wrapper modules at old import paths during transition
- deprecate wrappers only after tests and docs have moved
### Phase 2: Dual CLI identity
Before any repo split, expose GroundRecall as a first-class CLI namespace.
Desired commands:
- `python -m didactopus.groundrecall.cli import ...`
- `python -m didactopus.groundrecall.cli lint ...`
- `python -m didactopus.groundrecall.cli promote ...`
- `python -m didactopus.groundrecall.cli query ...`
- `python -m didactopus.groundrecall.cli export ...`
- `python -m didactopus.groundrecall.cli inspect ...`
At that point, `didactopus.main` should only surface:
- learner-facing commands
- review-workflow commands with educational intent
- possibly a pointer to GroundRecall commands, but not ownership of them
### Phase 3: Optional package extraction
Only after sync/merge and standalone use are mature:
- move GroundRecall to its own package or repo if that becomes operationally useful
- keep Didactopus consuming it as a dependency
This step is optional. A clean package boundary inside one repo may be sufficient for a long time.
## CLI Migration Plan
### Keep under `didactopus.main`
- `review`
- future learner-facing workbench commands
### Move toward GroundRecall CLI
- import
- lint
- review queue
- promotion
- canonical query
- canonical export
- assistant export
- sync and merge
### Transitional exception
`provider-inspect` can remain on the Didactopus umbrella CLI for now because:
- it is already useful operationally
- it supports learner-node deployments
- it is not a GroundRecall-specific operation
Longer term, it may also belong on a separate operator surface depending on whether Didactopus becomes the standard local application shell.
## Module Mapping
### Move first
Current -> target
- `didactopus.groundrecall_import` -> `didactopus.groundrecall.ingest`
- `didactopus.groundrecall_source_adapters.*` -> `didactopus.groundrecall.source_adapters.*`
- `didactopus.groundrecall_models` -> `didactopus.groundrecall.models`
- `didactopus.groundrecall_store` -> `didactopus.groundrecall.store`
- `didactopus.groundrecall_promotion` -> `didactopus.groundrecall.promotion`
- `didactopus.groundrecall_query` -> `didactopus.groundrecall.query`
- `didactopus.groundrecall_export` -> `didactopus.groundrecall.export`
- `didactopus.groundrecall_assistants.*` -> `didactopus.groundrecall.assistants.*`
### Keep as transitional bridges
- `didactopus.groundrecall_review_bridge`
- source adapters that ingest Didactopus-native artifacts
These are legitimate but should be documented as cross-boundary adapters rather than intrinsic ownership proof.
### Stay in Didactopus
- `learner_session`
- `learner_session_demo`
- `mentor`
- `practice`
- `project_advisor`
- educational review UX modules
- pack and graph-planning modules
## Service Boundary Direction
### GroundRecall service candidates
Once needed, a GroundRecall service should focus on:
- canonical knowledge query
- import status and queue inspection
- promotion status
- sync/merge status
- assistant-neutral bundle retrieval
### Didactopus service candidates
- learner session orchestration
- learner progress and evaluation
- pack/workbench interactions
### GenieHive service candidates
- model and service inspection
- route resolution
- cluster health
## Milestones
### Milestone 1: Namespace discipline
Done when:
- new generic knowledge work lands only in GroundRecall-oriented modules
- `didactopus.main` stops growing generic knowledge commands
- docs consistently describe GroundRecall as a substrate, not a learner feature
### Milestone 2: Internal package reorganization
Done when:
- GroundRecall modules live under an explicit package path
- old flat import paths are wrappers only
- tests target the new package paths
### Milestone 3: First-class GroundRecall CLI
Done when:
- import/lint/promote/query/export/inspect are available under one GroundRecall CLI surface
- operator docs no longer require `Didactopus` framing for generic knowledge tasks
### Milestone 4: Sync and merge maturity
Done when:
- append-only event ingestion exists
- promoted-state merge semantics exist
- team/shared knowledge workflows are practical without learner workflows
### Milestone 5: Extraction decision
Done when:
- the project can make an informed choice between:
- one repo, multiple packages
- separate GroundRecall package/repo
## Immediate Next Work
Recommended next implementation steps:
1. Introduce `didactopus.groundrecall` as an internal package namespace.
2. Add a single GroundRecall umbrella CLI module.
3. Keep thin wrapper modules for compatibility.
4. Start moving docs and tests to the new namespace.
5. Begin implementing sync/merge primitives under GroundRecall rather than under Didactopus learner flows.
## Decision Rule For New Work
Before adding a new command, module, or service, ask:
1. Would this still be needed if there were no learner session?
2. Would a team using only shared knowledge still need it?
3. Is the canonical artifact knowledge state or educational interaction?
4. Would it still matter if Didactopus UI vanished?
If yes, default toward GroundRecall.

docs/legacy/groundrecall-repo-bootstrap.md Normal file

@@ -0,0 +1,286 @@
# GroundRecall Repo Bootstrap Checklist
This document turns the broader [groundrecall-migration-plan.md](groundrecall-migration-plan.md) into a practical checklist for creating a standalone `GroundRecall` repository.
The goal here is narrower than full feature completion. The goal is to get to a standalone repository that can be installed, run locally, and used for real `llmwiki++`-style work without requiring `Didactopus` as the primary shell.
## Bootstrap Goal
Minimum viable standalone `GroundRecall` repo:
- installable as its own Python package
- exposes a first-class `groundrecall` CLI
- imports and normalizes knowledge sources
- promotes reviewed knowledge into a canonical store
- supports query and export over promoted state
- supports assistant-neutral exports plus adapter exports
- remains consumable by `Didactopus` as a dependency or sibling package
This is enough for a local standalone alpha. It is not yet the full distributed team and corpus-scale vision.
## What Already Exists
The current `Didactopus` codebase already contains most of the implementation spine:
- `didactopus.groundrecall.ingest`
- `didactopus.groundrecall.source_adapters.*`
- `didactopus.groundrecall.models`
- `didactopus.groundrecall.store`
- `didactopus.groundrecall.promotion`
- `didactopus.groundrecall.query`
- `didactopus.groundrecall.export`
- `didactopus.groundrecall.assistant_export`
- `didactopus.groundrecall.assistants.*`
- `didactopus.groundrecall.inspect`
- `didactopus.groundrecall.cli`
This means the repo bootstrap is primarily a packaging and boundary exercise, not a greenfield implementation.
## Target Repo Shape
Suggested standalone layout:
```text
groundrecall/
pyproject.toml
README.md
LICENSE
src/
groundrecall/
__init__.py
cli.py
ingest.py
inspect.py
lint.py
models.py
store.py
promotion.py
query.py
export.py
assistant_export.py
review_queue.py
review_bridge.py
source_adapters/
assistants/
tests/
docs/
quickstart.md
llmwiki-import.md
deployment-modes.md
assistant-architecture.md
sync-roadmap.md
```
Notes:
- `review_bridge.py` may remain optional if the standalone repo only needs generic review artifacts.
- `review_queue.py` belongs in `GroundRecall`; it is not a learner-only concern.
- `review_bridge.py` is the most likely file to stay transitional if it depends too directly on Didactopus review objects.
## Move / Keep / Bridge
### Move into standalone `GroundRecall`
Move first:
- `didactopus.groundrecall.ingest`
- `didactopus.groundrecall.inspect`
- `didactopus.groundrecall.lint`
- `didactopus.groundrecall.models`
- `didactopus.groundrecall.store`
- `didactopus.groundrecall.promotion`
- `didactopus.groundrecall.query`
- `didactopus.groundrecall.export`
- `didactopus.groundrecall.assistant_export`
- `didactopus.groundrecall.review_queue`
- `didactopus.groundrecall.source_adapters.*`
- `didactopus.groundrecall.assistants.*`
- `didactopus.groundrecall.cli`
### Keep in `Didactopus`
These should not move:
- learner session and mentor/practice flows
- educational pack authoring and pack-specific UX
- mastery/evidence learner experiences
- provider demos that exist to support Didactopus learner workflows
### Keep as temporary bridges
These may need a staged treatment:
- `groundrecall_review_bridge`
- `didactopus_pack` source adapter
Those are useful during transition, but they are cross-boundary integrations, not proof that `GroundRecall` must remain inside `Didactopus`.
## Bootstrap Checklist
### 1. Create the new repo skeleton
Required:
- create a new repo root
- add `pyproject.toml`
- add `src/groundrecall/`
- add `tests/`
- add `docs/`
- add a minimal `README.md`
- add `LICENSE`
Definition of done:
- `pip install -e .` works
- `python -m groundrecall.cli --help` works
### 2. Move the package code
Required:
- copy the current `didactopus.groundrecall.*` package into `src/groundrecall/`
- update relative imports as needed
- remove `didactopus`-prefixed assumptions in docstrings and parser help text
Definition of done:
- module imports succeed under `groundrecall.*`
- no package file requires `didactopus` imports except explicit transition bridges
### 3. Extract the tests
Required:
- move GroundRecall-focused tests into the new repo
- keep Didactopus integration tests in Didactopus
- add an end-to-end CLI smoke test that runs:
- `import`
- `promote`
- `query`
- `export`
- `inspect`
Definition of done:
- the new repo has its own passing test suite
- Didactopus retains only integration tests that prove interoperability
### 4. Harden the standalone CLI
Required commands:
- `groundrecall import`
- `groundrecall lint`
- `groundrecall promote`
- `groundrecall query`
- `groundrecall export`
- `groundrecall inspect`
Recommended additions:
- `groundrecall assistant-export`
- `groundrecall review-queue`
Definition of done:
- the CLI help text is standalone and does not refer users back to `Didactopus`
### 5. Publish a repo-local data layout
Pick and document a stable layout such as:
```text
.groundrecall/
imports/
store/
exports/
events/
```
Required:
- make these paths configurable
- define sane defaults
- remove assumptions that the caller already knows the Didactopus workspace layout
Definition of done:
- a new user can run GroundRecall in an empty directory and get predictable local state
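One way to satisfy the configurability requirement above, as a sketch; the environment variable name is an assumption, not an existing setting:
```python
import os
from pathlib import Path

def groundrecall_home(explicit: str | None = None) -> Path:
    """Resolve the repo-local layout root, defaulting to .groundrecall/."""
    root = Path(explicit or os.environ.get("GROUNDRECALL_HOME", ".groundrecall"))
    for sub in ("imports", "store", "exports", "events"):
        (root / sub).mkdir(parents=True, exist_ok=True)
    return root
```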
### 6. Document the standalone workflows
At minimum:
- quickstart
- migrate from `llmwiki`
- query and export patterns
- assistant adapter exports
- relationship to `Didactopus`
- relationship to `GenieHive`
Definition of done:
- the README can orient a new user without requiring Didactopus-specific context
### 7. Leave compatibility shims in `Didactopus`
Required:
- keep thin wrappers at `didactopus.groundrecall_*` or `didactopus.groundrecall.*` integration paths as needed
- make `Didactopus` import the extracted package where possible
- clearly mark the wrappers as compatibility paths
Definition of done:
- existing Didactopus workflows do not break during the split
## Alpha Completion Criteria
The standalone repo is alpha-ready when:
- `llmwiki` import works
- `markdown_notes` import works
- at least one Didactopus-native adapter still works as an integration adapter
- canonical store creation and snapshot export work
- query works over promoted objects
- assistant-neutral export works
- at least two assistant adapters export usable bundles
This is the right threshold for “functional GroundRecall repo.”
## Still Missing After Alpha
A standalone alpha is not yet the full target system. These remain post-bootstrap priorities:
- re-import and update semantics
- append-only event logs for multi-node merge
- shared/private scope support
- merge and sync conflict handling
- stronger claim extraction
- richer claim-level review and adjudication
- corpus-scale distributed coordination
Those features should be built in `GroundRecall`, but they do not need to block repo extraction.
## Recommended Execution Order
Use this order:
1. create the repo and package skeleton
2. copy the current `groundrecall` package and make imports pass
3. move tests and get the standalone suite green
4. finalize CLI and README
5. switch Didactopus integration points to consume the extracted package
6. only then continue with sync/merge and corpus-scale features
This keeps the boundary clean without stalling feature progress.
## First PR-Sized Steps
If this were executed as concrete work, the first three small changes should be:
1. create the new repo with package skeleton and copy `src/didactopus/groundrecall/`
2. move the existing namespace-focused tests and make them pass under `groundrecall.*`
3. add a standalone README quickstart and one end-to-end CLI smoke test
After that, the repo is real enough to iterate in place rather than continuing to plan around it.

docs/llmwiki-import.md Normal file

@@ -0,0 +1,85 @@
# llmwiki Import
`GroundRecall` treats `llmwiki` as one important source shape, not as the defining architecture.
An imported `llmwiki` tree is treated as:
- raw source material
- prior synthesized artifacts
- candidate claims and concepts
- provenance that needs to be normalized and reviewed
Compiled wiki pages are useful artifacts, but they are not automatically promoted as canonical truth.
## Import Modes
### `archive`
- preserve source material with minimal interpretation
- index and normalize without assuming promotion readiness
- useful for long-tail historical corpora
### `quick`
- fast bootstrap mode
- extracts candidate concepts, claims, and relations heuristically
- useful when getting an old corpus into GroundRecall quickly matters more than perfect grounding
### `grounded`
- stricter mode
- expects better provenance and cleaner support signals
- better fit for shared or promoted knowledge
## Import Flow
The normalized import flow is:
1. capture source files
2. discover and classify artifacts
3. segment content into observations
4. normalize claims, concepts, and relations
5. lint the import
6. emit a review queue and review bundle
7. promote reviewed artifacts into the canonical store
## Commands
```bash
groundrecall import /path/to/llmwiki --mode archive
groundrecall import /path/to/llmwiki --mode quick
groundrecall import /path/to/llmwiki --mode grounded
groundrecall lint imports/<import-id>
groundrecall promote imports/<import-id> store/
groundrecall export store/ exports/groundrecall --concept channel-capacity
```
## Current Heuristics
Today's importer already supports:
- `raw/` and `wiki/` discovery
- markdown and log segmentation
- claim extraction with inline contradiction and supersession markers
- review queue generation
- review bundle export
Areas still planned:
- stronger re-import/update semantics
- more robust transcript and semi-structured document handling
- stronger large-corpus extraction and consolidation
## Recommended Promotion Rule
Treat imported wiki pages as derived artifacts.
That means:
- preserve them
- mine them for claims and concepts
- review what matters
- promote canonical claims and concepts into the store
This is the main difference between `GroundRecall` and a plain markdown wiki.

docs/quickstart.md Normal file

@@ -0,0 +1,97 @@
# Quickstart
`GroundRecall` is a local-first grounded knowledge substrate for `llmwiki++`-style workflows.
This quickstart assumes a fresh checkout of the standalone repository.
## Install
```bash
pip install -e .
groundrecall --help
```
You can also use the module entry point:
```bash
PYTHONPATH=src python -m groundrecall --help
```
## Import A Knowledge Source
Fast import from an `llmwiki`-style tree:
```bash
groundrecall import /path/to/llmwiki --mode quick
```
More conservative import with stronger grounding expectations:
```bash
groundrecall import /path/to/llmwiki --mode grounded
```
The importer writes normalized artifacts under `imports/<import-id>/`.
## Review And Promote
Inspect the import outputs:
```bash
groundrecall lint imports/<import-id>
```
Promote the imported review artifacts into a canonical store:
```bash
groundrecall promote imports/<import-id> store/
```
## Query The Canonical Store
Query a concept:
```bash
groundrecall query store/ channel-capacity
```
Inspect the overall store:
```bash
groundrecall inspect store/
```
## Export
Export assistant-neutral artifacts:
```bash
groundrecall export store/ exports/groundrecall --concept channel-capacity
```
Export assistant-targeted bundles:
```bash
groundrecall assistant-export store/ codex exports/codex --concept channel-capacity
groundrecall assistant-export store/ claude_code exports/claude --concept channel-capacity
```
## Default Working Layout
A simple local layout is:
```text
.groundrecall/
imports/
store/
exports/
events/
```
The current alpha does not require this exact layout, but it is a sensible starting point.
## Next Reading
- [architecture.md](architecture.md)
- [llmwiki-import.md](llmwiki-import.md)
- [sync-roadmap.md](sync-roadmap.md)

docs/sync-roadmap.md Normal file

@@ -0,0 +1,73 @@
# Sync Roadmap
The current standalone alpha is local-first. Sync and merge are planned next-stage features.
## Goal
Support these use cases cleanly:
- one user across multiple machines
- teams with shared and individual knowledge
- parallel corpus transformation and consolidation
## Planned Model
The intended model is:
- append-only event capture at the edge
- canonical promoted store as the durable reviewed state
- generated exports and assistant bundles as derived artifacts
This avoids treating compiled wiki pages or generated bundles as merge primitives.
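A hypothetical shape for one appended event, to make the model concrete; none of these keys are an implemented format yet:
```python
import json
import time
import uuid

event = {
    "event_id": uuid.uuid4().hex,
    "kind": "import_completed",          # assumed event vocabulary
    "machine_id": "laptop-a",            # provenance, per the planned model
    "recorded_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    "payload": {"import_id": "imp_2026_04_16_a"},
}
print(json.dumps(event))                 # one line in an append-only log
```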
## Likely Local Layout
```text
.groundrecall/
events/
imports/
store/
exports/
```
## Planned Phases
### Phase 1: Re-import And Update Semantics
- import the same source tree repeatedly without duplicating everything
- support import lineage and supersession
- track object continuity across imports
### Phase 2: Event Log Capture
- record machine-local observations and import events
- distinguish machine-local state from promoted shared state
- preserve provenance and timestamps explicitly
### Phase 3: Merge And Consolidation
- merge append-only events from multiple machines
- consolidate draft claims and review candidates
- preserve contradiction and supersession history
### Phase 4: Shared And Private Scopes
- private notes and private candidate knowledge
- shared promoted knowledge
- controlled promotion from private to shared
### Phase 5: Team And Corpus Workflows
- parallel ingestion over large corpora
- coordinated claim review and adjudication
- export of consolidated assistant-neutral snapshots
## Non-Goals For The Current Alpha
The current repo does not yet provide:
- real-time networked sync
- conflict-free replicated data types
- hosted review services
The next useful milestone is a practical local event-log and re-import model, not a full distributed platform in one step.

pyproject.toml Normal file

@@ -0,0 +1,31 @@
[build-system]
requires = ["setuptools>=68", "wheel"]
build-backend = "setuptools.build_meta"
[project]
name = "groundrecall"
version = "0.1.0a0"
description = "Grounded knowledge substrate for llmwiki++ style workflows."
readme = "README.md"
requires-python = ">=3.10"
license = { text = "MIT" }
authors = [
{ name = "GroundRecall contributors" }
]
dependencies = [
"pydantic>=2,<3",
"PyYAML>=6,<7",
]
[project.scripts]
groundrecall = "groundrecall.cli:main"
[tool.setuptools]
package-dir = { "" = "src" }
[tool.setuptools.packages.find]
where = ["src"]
[tool.pytest.ini_options]
pythonpath = ["src"]
testpaths = ["tests"]

src/groundrecall/__init__.py Normal file

@@ -0,0 +1,34 @@
from __future__ import annotations
from .inspect import inspect_store, summarize_store
from .ingest import ImportResult, build_parser as build_import_parser, main as import_main, run_groundrecall_import
from .models import * # noqa: F403
from .promotion import build_parser as build_promotion_parser, main as promotion_main, promote_import_to_store
from .query import (
build_parser as build_query_parser,
build_query_bundle_for_concept,
main as query_main,
query_concept,
query_provenance,
search_claims,
)
from .store import GroundRecallStore
__all__ = [
"GroundRecallStore",
"ImportResult",
"run_groundrecall_import",
"build_import_parser",
"import_main",
"promote_import_to_store",
"build_promotion_parser",
"promotion_main",
"query_concept",
"query_provenance",
"search_claims",
"build_query_bundle_for_concept",
"build_query_parser",
"query_main",
"summarize_store",
"inspect_store",
]

src/groundrecall/__main__.py Normal file

@@ -0,0 +1,5 @@
from .cli import main
if __name__ == "__main__":
main()

@@ -0,0 +1,81 @@
from typing import Any
from pydantic import BaseModel, Field
class DependencySpec(BaseModel):
name: str
min_version: str = "0.0.0"
max_version: str = "9999.9999.9999"
class MasteryProfileSpec(BaseModel):
template: str | None = None
required_dimensions: list[str] = Field(default_factory=list)
dimension_threshold_overrides: dict[str, float] = Field(default_factory=dict)
class CrossPackLinkSpec(BaseModel):
source_concept: str
target_concept: str
relation: str
class ProfileTemplateSpec(BaseModel):
required_dimensions: list[str] = Field(default_factory=list)
dimension_threshold_overrides: dict[str, float] = Field(default_factory=dict)
class PackManifest(BaseModel):
name: str
display_name: str
version: str
schema_version: str
didactopus_min_version: str
didactopus_max_version: str
description: str = ""
author: str = ""
license: str = "unspecified"
dependencies: list[DependencySpec] = Field(default_factory=list)
overrides: list[str] = Field(default_factory=list)
profile_templates: dict[str, ProfileTemplateSpec] = Field(default_factory=dict)
cross_pack_links: list[CrossPackLinkSpec] = Field(default_factory=list)
class ConceptEntry(BaseModel):
id: str
title: str
description: str = ""
prerequisites: list[str] = Field(default_factory=list)
mastery_signals: list[str] = Field(default_factory=list)
mastery_profile: MasteryProfileSpec = Field(default_factory=MasteryProfileSpec)
class ConceptsFile(BaseModel):
concepts: list[ConceptEntry]
class RoadmapStageEntry(BaseModel):
id: str
title: str
concepts: list[str] = Field(default_factory=list)
checkpoint: list[str] = Field(default_factory=list)
class RoadmapFile(BaseModel):
stages: list[RoadmapStageEntry]
class ProjectEntry(BaseModel):
id: str
title: str
difficulty: str = ""
prerequisites: list[str] = Field(default_factory=list)
deliverables: list[str] = Field(default_factory=list)
class ProjectsFile(BaseModel):
projects: list[ProjectEntry]
class RubricsFile(BaseModel):
rubrics: list[dict[str, Any]]

src/groundrecall/assistant_export.py Normal file

@@ -0,0 +1,59 @@
from __future__ import annotations
import argparse
import json
from pathlib import Path
from typing import Any
from .assistants.base import get_assistant_adapter
from .query import build_query_bundle_for_concept
from .store import GroundRecallStore
def export_assistant_bundle(
store_dir: str | Path,
assistant: str,
out_dir: str | Path,
concept_refs: list[str] | None = None,
) -> dict[str, Any]:
store = GroundRecallStore(store_dir)
snapshot = store.build_snapshot(
snapshot_id="assistant-export",
created_at="",
metadata={"export_kind": "assistant_adapter", "assistant": assistant},
).model_dump()
query_bundles = []
for concept_ref in concept_refs or []:
payload = build_query_bundle_for_concept(store_dir, concept_ref)
if payload is not None:
query_bundles.append(payload)
adapter = get_assistant_adapter(assistant)
paths = adapter.export_bundle(snapshot, query_bundles, out_dir)
manifest = {
"assistant": assistant,
"output_paths": [str(path) for path in paths],
"query_bundle_count": len(query_bundles),
}
Path(out_dir).mkdir(parents=True, exist_ok=True)
(Path(out_dir) / "assistant_export_manifest.json").write_text(json.dumps(manifest, indent=2), encoding="utf-8")
return manifest
def build_parser() -> argparse.ArgumentParser:
parser = argparse.ArgumentParser(description="Export assistant-specific GroundRecall bundles from canonical store data.")
parser.add_argument("store_dir")
parser.add_argument("assistant")
parser.add_argument("out_dir")
parser.add_argument("--concept", action="append", default=[])
return parser
def main() -> None:
args = build_parser().parse_args()
payload = export_assistant_bundle(
store_dir=args.store_dir,
assistant=args.assistant,
out_dir=args.out_dir,
concept_refs=list(args.concept or []),
)
print(json.dumps(payload, indent=2))

src/groundrecall/assistants/__init__.py Normal file

@@ -0,0 +1,2 @@
from __future__ import annotations

src/groundrecall/assistants/base.py Normal file

@@ -0,0 +1,3 @@
from __future__ import annotations
from ..groundrecall_assistants.base import * # noqa: F403

src/groundrecall/assistants/claude_code.py Normal file

@@ -0,0 +1,3 @@
from __future__ import annotations
from ..groundrecall_assistants.claude_code import * # noqa: F403

src/groundrecall/assistants/codex.py Normal file

@@ -0,0 +1,3 @@
from __future__ import annotations
from ..groundrecall_assistants.codex import * # noqa: F403

@@ -0,0 +1,239 @@
from __future__ import annotations
import json
from pathlib import Path
import re
from typing import Any
def _load_citegeist_symbols() -> dict[str, Any] | None:
import sys
citegeist_src = Path("/home/netuser/bin/CiteGeist/src")
if citegeist_src.exists():
sys.path.insert(0, str(citegeist_src))
try:
from citegeist.app_api import LiteratureExplorerApi # type: ignore
from citegeist.bibtex import BibEntry, parse_bibtex, render_bibtex # type: ignore
from citegeist.storage import BibliographyStore # type: ignore
except Exception:
return None
return {
"LiteratureExplorerApi": LiteratureExplorerApi,
"BibEntry": BibEntry,
"parse_bibtex": parse_bibtex,
"render_bibtex": render_bibtex,
"BibliographyStore": BibliographyStore,
}
def discover_bib_files(source_root: str | Path) -> list[Path]:
root = Path(source_root)
if not root.exists():
return []
candidates = [
path
for path in root.rglob("*.bib")
if path.is_file() and not path.name.endswith("-bak.bib") and not path.name.startswith(".")
]
def rank(path: Path) -> tuple[int, int, str]:
rel = path.relative_to(root)
name = path.name
if rel == Path("refs.bib"):
return (0, len(rel.parts), str(rel))
if rel == Path("biblio.bib"):
return (1, len(rel.parts), str(rel))
if name == "refs.bib":
return (2, len(rel.parts), str(rel))
if name == "biblio.bib":
return (3, len(rel.parts), str(rel))
return (4, len(rel.parts), str(rel))
return sorted(candidates, key=rank)
def load_bibliography_index(source_root: str | Path) -> dict[str, dict[str, Any]]:
symbols = _load_citegeist_symbols()
root = Path(source_root)
index: dict[str, dict[str, Any]] = {}
for bib_path in discover_bib_files(root):
try:
entries = _parse_bib_entries(bib_path.read_text(encoding="utf-8"), symbols=symbols)
except Exception:
continue
for entry in entries:
raw_bibtex = _render_entry_bibtex(entry, symbols=symbols)
payload = {
"citation_key": entry.citation_key,
"entry_type": entry.entry_type,
"fields": dict(entry.fields),
"source_bib_path": str(bib_path.relative_to(root)),
"raw_bibtex": raw_bibtex,
"duplicate_source_bib_paths": [],
}
existing = index.get(entry.citation_key)
if existing is None:
index[entry.citation_key] = payload
else:
existing.setdefault("duplicate_source_bib_paths", []).append(str(bib_path.relative_to(root)))
return index
def materialize_citegeist_store(import_dir: str | Path, source_root: str | Path) -> dict[str, Any]:
symbols = _load_citegeist_symbols()
if symbols is None:
return {"available": False}
BibliographyStore = symbols["BibliographyStore"]
LiteratureExplorerApi = symbols["LiteratureExplorerApi"]
import_root = Path(import_dir)
db_path = import_root / "citegeist.sqlite3"
if db_path.exists():
db_path.unlink()
store = BibliographyStore(db_path)
ingested_files: list[str] = []
for bib_path in discover_bib_files(source_root):
try:
text = bib_path.read_text(encoding="utf-8")
entries = _parse_bib_entries(text, symbols=symbols)
for entry in entries:
store.upsert_entry(
entry,
raw_bibtex=_render_entry_bibtex(entry, symbols=symbols),
source_type="bibtex",
source_label=str(bib_path.relative_to(Path(source_root))),
review_status="draft",
)
store.connection.commit()
ingested_files.append(str(bib_path.relative_to(Path(source_root))))
except Exception:
continue
api = LiteratureExplorerApi(store)
return {
"available": True,
"db_path": str(db_path),
"ingested_files": ingested_files,
"api": api,
"store": store,
}
def bibliography_summary_payload(source_root: str | Path) -> dict[str, Any]:
index = load_bibliography_index(source_root)
source_files = discover_bib_files(source_root)
return {
"enabled": bool(index),
"entry_count": len(index),
"source_files": [str(path.relative_to(Path(source_root))) for path in source_files],
}
def serialize_bib_entry(entry: dict[str, Any] | None) -> dict[str, Any] | None:
if entry is None:
return None
return {
"citation_key": entry.get("citation_key", ""),
"entry_type": entry.get("entry_type", ""),
"fields": dict(entry.get("fields", {})),
"source_bib_path": entry.get("source_bib_path", ""),
"raw_bibtex": entry.get("raw_bibtex", ""),
"duplicate_source_bib_paths": list(entry.get("duplicate_source_bib_paths", [])),
}
def serialize_citegeist_entry_payload(payload: dict[str, Any] | None) -> dict[str, Any] | None:
if payload is None:
return None
result = dict(payload)
if "raw_bibtex" in result and isinstance(result["raw_bibtex"], str):
return result
return json.loads(json.dumps(result))
def _parse_bib_entries(text: str, *, symbols: dict[str, Any] | None) -> list[Any]:
if symbols is not None:
try:
return symbols["parse_bibtex"](text)
except Exception:
pass
return _fallback_parse_bibtex(text, symbols=symbols)
def _render_entry_bibtex(entry: Any, *, symbols: dict[str, Any] | None) -> str:
if symbols is not None:
try:
return symbols["render_bibtex"]([entry])
except Exception:
pass
fields = []
for key, value in entry.fields.items():
fields.append(f" {key} = {{{value}}}")
body = ",\n".join(fields)
return f"@{entry.entry_type}{{{entry.citation_key},\n{body}\n}}"
def _fallback_parse_bibtex(text: str, *, symbols: dict[str, Any] | None) -> list[Any]:
BibEntry = symbols["BibEntry"] if symbols is not None else None
entries: list[Any] = []
pattern = re.compile(r"@(?P<entry_type>[A-Za-z]+)\s*\{\s*(?P<citation_key>[^,\s]+)\s*,", re.MULTILINE)
matches = list(pattern.finditer(text))
for index, match in enumerate(matches):
start = match.end()
end = matches[index + 1].start() if index + 1 < len(matches) else len(text)
body = text[start:end]
fields = _fallback_parse_fields(body)
if BibEntry is not None:
entries.append(BibEntry(entry_type=match.group("entry_type").lower(), citation_key=match.group("citation_key").strip(), fields=fields))
else:
entries.append(type("BibEntryFallback", (), {"entry_type": match.group("entry_type").lower(), "citation_key": match.group("citation_key").strip(), "fields": fields})())
return entries
def _fallback_parse_fields(body: str) -> dict[str, str]:
fields: dict[str, str] = {}
index = 0
length = len(body)
while index < length:
while index < length and body[index] in " \t\r\n,":
index += 1
if index >= length or body[index] == "}":
break
key_start = index
while index < length and re.match(r"[A-Za-z0-9_:-]", body[index]):
index += 1
key = body[key_start:index].strip().lower()
while index < length and body[index] in " \t\r\n=":
index += 1
value = ""
if index < length and body[index] == "{":
depth = 1
index += 1
value_start = index
while index < length and depth > 0:
if body[index] == "{":
depth += 1
elif body[index] == "}":
depth -= 1
if depth == 0:
break
index += 1
value = body[value_start:index].strip()
index += 1
elif index < length and body[index] == '"':
index += 1
value_start = index
while index < length and body[index] != '"':
if body[index] == "\\":
index += 1
index += 1
value = body[value_start:index].strip()
index += 1
else:
value_start = index
while index < length and body[index] not in ",\n":
index += 1
value = body[value_start:index].strip()
if key:
fields[key] = value.rstrip(",")
return fields
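As a concrete check on the fallback parser above, a minimal sketch (the `sample` entry is illustrative, not from any tracked corpus) exercising brace nesting, bare values, and quoted values:

sample = """@article{smith2020,
  title = {A Study of {Nested} Braces},
  year = 2020,
  journal = "Journal of Examples"
}"""
entries = _fallback_parse_bibtex(sample, symbols=None)
entry = entries[0]
assert entry.entry_type == "article"
assert entry.citation_key == "smith2020"
assert entry.fields == {
    "title": "A Study of {Nested} Braces",
    "year": "2020",
    "journal": "Journal of Examples",
}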

40
src/groundrecall/cli.py Normal file
View File

@ -0,0 +1,40 @@
from __future__ import annotations
import argparse
import sys
from . import assistant_export, export, ingest, inspect, lint, promotion, query, review_server
COMMANDS = {
"import": ingest.main,
"lint": lint.main,
"promote": promotion.main,
"query": query.main,
"export": export.main,
"assistant-export": assistant_export.main,
"inspect": inspect.main,
"review-server": review_server.main,
}
def build_parser() -> argparse.ArgumentParser:
parser = argparse.ArgumentParser(description="GroundRecall command-line tools")
parser.add_argument("command", nargs="?", choices=sorted(COMMANDS))
return parser
def main() -> None:
argv = sys.argv[1:]
parser = build_parser()
args, remainder = parser.parse_known_args(argv)
if not args.command:
parser.print_help()
return
handler = COMMANDS[args.command]
original_argv = sys.argv
try:
sys.argv = [f"groundrecall.cli {args.command}", *remainder]
handler()
finally:
sys.argv = original_argv
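For illustration, a minimal sketch of the dispatch behavior (paths are hypothetical; assumes a store and an export target make sense at those locations):

import sys
from groundrecall import cli

# Equivalent to running: groundrecall.cli export ./store ./out --concept concept::memory
sys.argv = ["groundrecall.cli", "export", "./store", "./out", "--concept", "concept::memory"]
cli.main()  # forwards the remainder to export.main behind a rewritten argv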

136
src/groundrecall/export.py Normal file
View File

@ -0,0 +1,136 @@
from __future__ import annotations
import argparse
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Any
from .query import build_query_bundle_for_concept
from .store import GroundRecallStore
def _now() -> str:
return datetime.now(timezone.utc).replace(microsecond=0).isoformat().replace("+00:00", "Z")
def _write_json(path: Path, payload: dict[str, Any]) -> None:
path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
def _write_jsonl(path: Path, rows: list[dict[str, Any]]) -> None:
text = "\n".join(json.dumps(row, sort_keys=True) for row in rows)
if text:
text += "\n"
path.write_text(text, encoding="utf-8")
def export_canonical_snapshot(
store_dir: str | Path,
out_dir: str | Path,
snapshot_id: str | None = None,
metadata: dict[str, Any] | None = None,
) -> dict[str, str]:
store = GroundRecallStore(store_dir)
target = Path(out_dir)
target.mkdir(parents=True, exist_ok=True)
actual_snapshot_id = snapshot_id or f"snapshot-export-{datetime.now(timezone.utc).strftime('%Y%m%dT%H%M%SZ')}"
snapshot = store.build_snapshot(
snapshot_id=actual_snapshot_id,
created_at=_now(),
metadata={"export_kind": "canonical", **(metadata or {})},
)
store.save_snapshot(snapshot)
snapshot_path = target / "groundrecall_snapshot.json"
_write_json(snapshot_path, snapshot.model_dump())
_write_jsonl(target / "claims.jsonl", [item.model_dump() for item in snapshot.claims])
_write_jsonl(target / "concepts.jsonl", [item.model_dump() for item in snapshot.concepts])
_write_jsonl(target / "relations.jsonl", [item.model_dump() for item in snapshot.relations])
provenance_manifest = {
"snapshot_id": snapshot.snapshot_id,
"created_at": snapshot.created_at,
"source_count": len(snapshot.sources),
"artifact_count": len(snapshot.artifacts),
"observation_count": len(snapshot.observations),
}
_write_json(target / "provenance_manifest.json", provenance_manifest)
manifest = {
"export_kind": "canonical",
"snapshot_id": snapshot.snapshot_id,
"files": [
"groundrecall_snapshot.json",
"claims.jsonl",
"concepts.jsonl",
"relations.jsonl",
"provenance_manifest.json",
],
}
_write_json(target / "export_manifest.json", manifest)
return {
"snapshot_json": str(snapshot_path),
"claims_jsonl": str(target / "claims.jsonl"),
"concepts_jsonl": str(target / "concepts.jsonl"),
"relations_jsonl": str(target / "relations.jsonl"),
"provenance_manifest_json": str(target / "provenance_manifest.json"),
"export_manifest_json": str(target / "export_manifest.json"),
}
def export_query_bundle(
store_dir: str | Path,
concept_ref: str,
out_path: str | Path,
) -> dict[str, Any]:
payload = build_query_bundle_for_concept(store_dir, concept_ref)
if payload is None:
raise KeyError(f"Unknown concept reference: {concept_ref}")
path = Path(out_path)
path.parent.mkdir(parents=True, exist_ok=True)
_write_json(path, payload)
return payload
def export_canonical_bundle(
store_dir: str | Path,
out_dir: str | Path,
concept_refs: list[str] | None = None,
snapshot_id: str | None = None,
) -> dict[str, Any]:
target = Path(out_dir)
target.mkdir(parents=True, exist_ok=True)
outputs = export_canonical_snapshot(store_dir, target, snapshot_id=snapshot_id)
query_bundle_paths: list[str] = []
for concept_ref in concept_refs or []:
safe_name = concept_ref.lower().replace(" ", "-").replace("::", "-")
bundle_path = target / f"query_bundle__{safe_name}.json"
export_query_bundle(store_dir, concept_ref, bundle_path)
query_bundle_paths.append(str(bundle_path))
manifest = json.loads((target / "export_manifest.json").read_text(encoding="utf-8"))
manifest["query_bundles"] = query_bundle_paths
_write_json(target / "export_manifest.json", manifest)
return {
"canonical_outputs": outputs,
"query_bundles": query_bundle_paths,
}
def build_parser() -> argparse.ArgumentParser:
parser = argparse.ArgumentParser(description="Export canonical GroundRecall artifacts.")
parser.add_argument("store_dir")
parser.add_argument("out_dir")
parser.add_argument("--snapshot-id", default=None)
parser.add_argument("--concept", action="append", default=[])
return parser
def main() -> None:
args = build_parser().parse_args()
payload = export_canonical_bundle(
store_dir=args.store_dir,
out_dir=args.out_dir,
concept_refs=list(args.concept or []),
snapshot_id=args.snapshot_id,
)
print(json.dumps(payload, indent=2))
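Called directly rather than through the CLI, the export entry point can be sketched like this (store path and concept reference are hypothetical; export_query_bundle raises KeyError for an unknown concept):

result = export_canonical_bundle(
    store_dir="./store",
    out_dir="./out/canonical",
    concept_refs=["concept::memory"],
)
# result["canonical_outputs"] maps artifact names to written file paths;
# result["query_bundles"] lists the per-concept bundle files.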

View File

@ -0,0 +1,12 @@
"""Legacy flat GroundRecall assistant export module.
Compatibility path retained during the internal namespace migration.
Prefer imports under ``didactopus.groundrecall.assistant_export`` or CLI usage
via ``didactopus.groundrecall.cli`` for new code.
"""
from __future__ import annotations
from .groundrecall.assistant_export import build_parser, export_assistant_bundle, main
__all__ = ["export_assistant_bundle", "build_parser", "main"]

View File

@ -0,0 +1,9 @@
"""Legacy flat GroundRecall assistants package.
Compatibility path retained during the internal namespace migration.
Prefer imports under ``didactopus.groundrecall.assistants`` for new code.
"""
from .base import get_assistant_adapter, list_assistant_adapters
__all__ = ["get_assistant_adapter", "list_assistant_adapters"]

View File

@ -0,0 +1,43 @@
"""Legacy flat GroundRecall assistant adapter base module.
Compatibility path retained during the internal namespace migration.
Prefer imports under ``didactopus.groundrecall.assistants.base`` for new code.
"""
from __future__ import annotations
from dataclasses import dataclass
from pathlib import Path
from typing import Protocol
class AssistantAdapter(Protocol):
name: str
def export_bundle(self, snapshot: dict, query_bundles: list[dict], out_dir: str | Path) -> list[Path]:
...
def build_context(self, query_result: dict) -> dict:
...
def supported_capabilities(self) -> dict[str, bool]:
...
_REGISTRY: dict[str, AssistantAdapter] = {}
def register_assistant_adapter(adapter: AssistantAdapter) -> AssistantAdapter:
_REGISTRY[adapter.name] = adapter
return adapter
def get_assistant_adapter(name: str) -> AssistantAdapter:
try:
return _REGISTRY[name]
except KeyError as exc:
raise KeyError(f"Unknown assistant adapter: {name}") from exc
def list_assistant_adapters() -> list[str]:
return sorted(_REGISTRY)
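A minimal sketch of a custom adapter against this registry (the NullAdapter name and behavior are illustrative, not part of the shipped adapters):

class NullAdapter:
    name = "null"
    def export_bundle(self, snapshot: dict, query_bundles: list[dict], out_dir: str | Path) -> list[Path]:
        return []  # placeholder adapter: nothing written
    def build_context(self, query_result: dict) -> dict:
        return {"assistant": "null"}
    def supported_capabilities(self) -> dict[str, bool]:
        return {"skill_markdown": False, "json_bundle": False, "project_memory": False}
register_assistant_adapter(NullAdapter())
assert "null" in list_assistant_adapters()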

View File

@ -0,0 +1,69 @@
from __future__ import annotations
import json
from pathlib import Path
from .base import register_assistant_adapter
class ClaudeCodeAdapter:
name = "claude_code"
def export_bundle(self, snapshot: dict, query_bundles: list[dict], out_dir: str | Path) -> list[Path]:
target = Path(out_dir)
target.mkdir(parents=True, exist_ok=True)
paths: list[Path] = []
memory_md = "\n".join(
[
"# GroundRecall Memory",
"",
f"- Snapshot: `{snapshot.get('snapshot_id', '')}`",
f"- Concepts: {len(snapshot.get('concepts', []))}",
f"- Claims: {len(snapshot.get('claims', []))}",
"",
"Prefer the canonical GroundRecall snapshot and query bundles over free-form recollection.",
"",
"## Query Bundles",
]
+ [f"- `{bundle.get('concept', {}).get('concept_id', 'unknown')}`" for bundle in query_bundles]
)
memory_path = target / "CLAUDE.md"
memory_path.write_text(memory_md, encoding="utf-8")
paths.append(memory_path)
bundle_path = target / "claude_code_bundle.json"
bundle_path.write_text(
json.dumps(
{
"assistant": "claude_code",
"snapshot_id": snapshot.get("snapshot_id", ""),
"query_bundle_count": len(query_bundles),
"query_bundles": query_bundles,
},
indent=2,
),
encoding="utf-8",
)
paths.append(bundle_path)
return paths
def build_context(self, query_result: dict) -> dict:
return {
"assistant": "claude_code",
"memory_kind": "groundrecall_query_bundle",
"concept": query_result.get("concept", {}),
"claims": query_result.get("relevant_claims", []),
"support": query_result.get("supporting_observations", []),
"next_actions": query_result.get("suggested_next_actions", []),
}
def supported_capabilities(self) -> dict[str, bool]:
return {
"skill_markdown": False,
"json_bundle": True,
"project_memory": True,
}
register_assistant_adapter(ClaudeCodeAdapter())

View File

@ -0,0 +1,78 @@
from __future__ import annotations
import json
from pathlib import Path
from .base import register_assistant_adapter
class CodexAdapter:
name = "codex"
def export_bundle(self, snapshot: dict, query_bundles: list[dict], out_dir: str | Path) -> list[Path]:
target = Path(out_dir)
target.mkdir(parents=True, exist_ok=True)
paths: list[Path] = []
skill_payload = {
"name": f"groundrecall-{snapshot.get('snapshot_id', 'snapshot')}",
"description": "GroundRecall assistant adapter bundle for Codex.",
"snapshot_id": snapshot.get("snapshot_id", ""),
"concept_count": len(snapshot.get("concepts", [])),
"claim_count": len(snapshot.get("claims", [])),
}
skill_md = "\n".join(
[
"---",
f"name: {skill_payload['name']}",
f"description: {skill_payload['description']}",
"---",
"",
"# GroundRecall Codex Bundle",
"",
f"- Snapshot: `{skill_payload['snapshot_id']}`",
f"- Concepts: {skill_payload['concept_count']}",
f"- Claims: {skill_payload['claim_count']}",
"",
"Use the accompanying canonical JSON and query bundles as the primary source of grounded context.",
]
)
skill_path = target / "SKILL.md"
skill_path.write_text(skill_md, encoding="utf-8")
paths.append(skill_path)
bundle_path = target / "codex_bundle.json"
bundle_path.write_text(
json.dumps(
{
"assistant": "codex",
"snapshot_id": snapshot.get("snapshot_id", ""),
"query_bundle_count": len(query_bundles),
"query_bundles": query_bundles,
},
indent=2,
),
encoding="utf-8",
)
paths.append(bundle_path)
return paths
def build_context(self, query_result: dict) -> dict:
return {
"assistant": "codex",
"context_kind": "groundrecall_query_bundle",
"concept": query_result.get("concept", {}),
"relevant_claims": query_result.get("relevant_claims", []),
"supporting_observations": query_result.get("supporting_observations", []),
"suggested_next_actions": query_result.get("suggested_next_actions", []),
}
def supported_capabilities(self) -> dict[str, bool]:
return {
"skill_markdown": True,
"json_bundle": True,
"project_memory": False,
}
register_assistant_adapter(CodexAdapter())
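Usage sketch (assumes the adapter module has been imported so registration has run, and that the registry helpers live under groundrecall.assistants.base as the package surface suggests; the query_result payload is illustrative):

from groundrecall.assistants.base import get_assistant_adapter
adapter = get_assistant_adapter("codex")
context = adapter.build_context({"concept": {"concept_id": "concept::memory"}})
assert context["assistant"] == "codex"
assert context["relevant_claims"] == []  # defaults applied for missing keys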

View File

@ -0,0 +1,54 @@
from __future__ import annotations
from dataclasses import dataclass
from pathlib import Path
TEXT_EXTENSIONS = {
".md",
".markdown",
".txt",
".tex",
".json",
".yaml",
".yml",
".csv",
".log",
}
@dataclass
class DiscoveredArtifact:
path: Path
relative_path: str
artifact_kind: str
is_text: bool
def classify_artifact(root: Path, path: Path) -> DiscoveredArtifact:
rel = path.relative_to(root).as_posix()
top = rel.split("/", 1)[0]
suffix = path.suffix.lower()
is_text = suffix in TEXT_EXTENSIONS or path.name in {"README", "LICENSE"}
artifact_kind = "generic_artifact"
if top == "wiki":
artifact_kind = "compiled_page"
elif top in {"raw", "sources"}:
artifact_kind = "raw_note"
elif top == "logs":
artifact_kind = "session_log"
elif path.name.startswith("schema."):
artifact_kind = "schema_file"
elif suffix in {".md", ".markdown"}:
artifact_kind = "markdown_note"
return DiscoveredArtifact(path=path, relative_path=rel, artifact_kind=artifact_kind, is_text=is_text)
def discover_llmwiki_artifacts(root: str | Path) -> list[DiscoveredArtifact]:
base = Path(root)
artifacts: list[DiscoveredArtifact] = []
for path in sorted(p for p in base.rglob("*") if p.is_file()):
if any(part in {".git", "__pycache__", ".pytest_cache"} for part in path.parts):
continue
artifacts.append(classify_artifact(base, path))
return artifacts
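classify_artifact works on pure paths, so its routing rules can be checked without touching the filesystem; a small sketch with illustrative paths:

art = classify_artifact(Path("corpus"), Path("corpus/wiki/memory.md"))
assert art.artifact_kind == "compiled_page"  # top-level "wiki" directory
assert art.is_text  # ".md" is in TEXT_EXTENSIONS
log = classify_artifact(Path("corpus"), Path("corpus/logs/2024-01-01.log"))
assert log.artifact_kind == "session_log"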

View File

@ -0,0 +1,24 @@
"""Legacy flat GroundRecall export module.
Compatibility path retained during the internal namespace migration.
Prefer imports under ``didactopus.groundrecall.export`` or CLI usage via
``didactopus.groundrecall.cli`` for new code.
"""
from __future__ import annotations
from .groundrecall.export import (
build_parser,
export_canonical_bundle,
export_canonical_snapshot,
export_query_bundle,
main,
)
__all__ = [
"export_canonical_snapshot",
"export_query_bundle",
"export_canonical_bundle",
"build_parser",
"main",
]

View File

@ -0,0 +1,9 @@
"""Legacy extracted GroundRecall import module.
Compatibility path retained while the standalone repo converges on the
top-level ``groundrecall.ingest`` module as the primary implementation.
"""
from __future__ import annotations
from .ingest import ImportResult, build_parser, main, run_groundrecall_import

View File

@ -0,0 +1,9 @@
"""Legacy extracted GroundRecall lint module.
Compatibility path retained while the standalone repo converges on the
top-level ``groundrecall.lint`` module as the primary implementation.
"""
from __future__ import annotations
from .lint import build_parser, lint_import_directory, main

View File

@ -0,0 +1,9 @@
"""Legacy extracted GroundRecall models module.
Compatibility path retained while the standalone repo converges on the
top-level ``groundrecall.models`` module as the primary implementation.
"""
from __future__ import annotations
from .models import * # noqa: F403

View File

@ -0,0 +1,136 @@
from __future__ import annotations
from dataclasses import asdict, dataclass
from hashlib import sha256
from pathlib import Path
from typing import Any
from .groundrecall_discovery import DiscoveredArtifact
from .groundrecall_segmenter import SegmentedPage, SegmentedObservation
@dataclass
class ImportContext:
import_id: str
import_mode: str
machine_id: str
agent_id: str
source_root: str
imported_at: str
def _sanitize_claim_key(value: str) -> str:
text = "".join(ch.lower() if ch.isalnum() else "-" for ch in value).strip("-")
return text or "claim"
def _claim_id_for_observation(observation_record: dict[str, Any], observation: SegmentedObservation, index: int) -> str:
if observation.explicit_claim_key:
return f"clm_{_sanitize_claim_key(observation.explicit_claim_key)}"
return f"clm_{observation_record['observation_id']}_{index}"
def build_artifact_record(context: ImportContext, artifact: DiscoveredArtifact, page: SegmentedPage | None) -> dict[str, Any]:
record = {
"artifact_id": f"ia_{sha256(artifact.relative_path.encode('utf-8')).hexdigest()[:12]}",
"import_id": context.import_id,
"artifact_kind": artifact.artifact_kind,
"path": artifact.relative_path,
"title": page.title if page else Path(artifact.relative_path).stem,
"sha256": sha256(artifact.path.read_bytes()).hexdigest(),
"created_at": context.imported_at,
"metadata": {
"frontmatter": page.frontmatter if page else {},
"headings": page.headings if page else [],
},
"current_status": "draft",
}
return record
def build_observation_record(
context: ImportContext,
artifact_record: dict[str, Any],
observation: SegmentedObservation,
index: int,
) -> dict[str, Any]:
return {
"observation_id": f"obs_{artifact_record['artifact_id']}_{index}",
"import_id": context.import_id,
"artifact_id": artifact_record["artifact_id"],
"role": observation.role,
"text": observation.text,
"origin_path": observation.artifact_relative_path,
"origin_section": observation.section,
"line_start": observation.line_start,
"line_end": observation.line_end,
"grounding_status": observation.grounding_status,
"support_kind": observation.support_kind,
"confidence_hint": observation.confidence_hint,
"current_status": "draft",
}
def build_claim_record(
context: ImportContext,
observation_record: dict[str, Any],
observation: SegmentedObservation,
concept_ids: list[str],
index: int,
) -> dict[str, Any]:
return {
"claim_id": _claim_id_for_observation(observation_record, observation, index),
"import_id": context.import_id,
"claim_text": observation_record["text"],
"claim_kind": "statement" if observation_record["role"] == "claim" else "summary",
"source_observation_ids": [observation_record["observation_id"]],
"supporting_fragment_ids": [],
"concept_ids": [f"concept::{concept_id}" for concept_id in concept_ids],
"contradicts_claim_ids": [f"clm_{_sanitize_claim_key(value)}" for value in observation.contradict_keys],
"supersedes_claim_ids": [f"clm_{_sanitize_claim_key(value)}" for value in observation.supersede_keys],
"confidence_hint": observation_record["confidence_hint"],
"grounding_status": observation_record["grounding_status"],
"current_status": "triaged" if observation_record["grounding_status"] != "ungrounded" else "draft",
}
def build_concept_records(context: ImportContext, artifact_record: dict[str, Any], concept_ids: list[str]) -> list[dict[str, Any]]:
records = []
for concept_id in concept_ids:
records.append(
{
"concept_id": f"concept::{concept_id}",
"import_id": context.import_id,
"title": concept_id.replace("-", " ").title(),
"aliases": [],
"description": "Imported concept from llmwiki corpus.",
"source_artifact_ids": [artifact_record["artifact_id"]],
"current_status": "triaged",
}
)
return records
def build_relation_records(context: ImportContext, artifact_record: dict[str, Any], concept_ids: list[str], links: list[str]) -> list[dict[str, Any]]:
if not concept_ids:
return []
primary = f"concept::{concept_ids[0]}"
records = []
for idx, link in enumerate(links, start=1):
target = f"concept::{link.lower().replace(' ', '-')}"
records.append(
{
"relation_id": f"rel_{artifact_record['artifact_id']}_{idx}",
"import_id": context.import_id,
"source_id": primary,
"target_id": target,
"relation_type": "references",
"evidence_ids": [],
"current_status": "draft",
}
)
return records
def manifest_record(context: ImportContext) -> dict[str, Any]:
return asdict(context) | {"source_repo_kind": "llmwiki"}
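A sketch of the record builders with illustrative values (the import id, artifact id, and timestamps are made up; only the shapes matter):

ctx = ImportContext(
    import_id="demo-20240101T000000Z",
    import_mode="quick",
    machine_id="local-host",
    agent_id="groundrecall.ingest",
    source_root="/tmp/corpus",
    imported_at="2024-01-01T00:00:00Z",
)
rows = build_concept_records(ctx, {"artifact_id": "ia_abc123def456"}, ["spaced-repetition"])
assert rows[0]["concept_id"] == "concept::spaced-repetition"
assert rows[0]["title"] == "Spaced Repetition"
assert rows[0]["source_artifact_ids"] == ["ia_abc123def456"]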

View File

@ -0,0 +1,9 @@
"""Legacy extracted GroundRecall promotion module.
Compatibility path retained while the standalone repo converges on the
top-level ``groundrecall.promotion`` module as the primary implementation.
"""
from __future__ import annotations
from .promotion import build_parser, main, promote_import_to_store

View File

@ -0,0 +1,26 @@
"""Legacy flat GroundRecall query module.
Compatibility path retained during the internal namespace migration.
Prefer imports under ``didactopus.groundrecall.query`` or CLI usage via
``didactopus.groundrecall.cli`` for new code.
"""
from __future__ import annotations
from .groundrecall.query import (
build_parser,
build_query_bundle_for_concept,
main,
query_concept,
query_provenance,
search_claims,
)
__all__ = [
"query_concept",
"search_claims",
"query_provenance",
"build_query_bundle_for_concept",
"build_parser",
"main",
]

View File

@ -0,0 +1,138 @@
from __future__ import annotations
import argparse
import json
from collections import defaultdict
from pathlib import Path
from typing import Any
from .review_export import build_citation_review_entries_from_import, export_review_state_json, export_review_ui_data
from .review_schema import ConceptReviewEntry, DraftPackData, ReviewSession
def _read_json(path: Path) -> dict[str, Any]:
return json.loads(path.read_text(encoding="utf-8"))
def _read_jsonl(path: Path) -> list[dict[str, Any]]:
if not path.exists():
return []
text = path.read_text(encoding="utf-8").strip()
if not text:
return []
return [json.loads(line) for line in text.splitlines()]
def _claim_summary(claims: list[dict[str, Any]]) -> list[str]:
lines: list[str] = []
for claim in claims[:3]:
grounding = claim.get("grounding_status", "unknown")
lines.append(f"Claim: {claim.get('claim_text', '')} [{grounding}]")
if len(claims) > 3:
lines.append(f"{len(claims) - 3} additional claims omitted from notes summary.")
return lines
def build_review_session_from_import(import_dir: str | Path, reviewer: str = "GroundRecall Import") -> ReviewSession:
base = Path(import_dir)
manifest = _read_json(base / "manifest.json")
lint_payload = _read_json(base / "lint_findings.json")
claims = _read_jsonl(base / "claims.jsonl")
concepts = _read_jsonl(base / "concepts.jsonl")
claims_by_concept: defaultdict[str, list[dict[str, Any]]] = defaultdict(list)
for claim in claims:
for concept_id in claim.get("concept_ids", []):
claims_by_concept[concept_id].append(claim)
findings_by_target: defaultdict[str, list[dict[str, Any]]] = defaultdict(list)
concept_findings: defaultdict[str, list[dict[str, Any]]] = defaultdict(list)
for finding in lint_payload.get("findings", []):
findings_by_target[finding["target_id"]].append(finding)
for claim in claims:
for concept_id in claim.get("concept_ids", []):
concept_findings[concept_id].extend(findings_by_target.get(claim["claim_id"], []))
for concept in concepts:
concept_findings[concept["concept_id"]].extend(findings_by_target.get(concept["concept_id"], []))
entries: list[ConceptReviewEntry] = []
for concept in concepts:
concept_id = concept["concept_id"]
related_claims = claims_by_concept.get(concept_id, [])
related_findings = concept_findings.get(concept_id, [])
has_errors = any(item["severity"] == "error" for item in related_findings)
all_grounded = bool(related_claims) and all(item.get("grounding_status") == "grounded" for item in related_claims)
status = "needs_review"
if not has_errors and all_grounded:
status = "provisional"
notes = _claim_summary(related_claims)
notes.extend(item["message"] for item in related_findings[:5])
entries.append(
ConceptReviewEntry(
concept_id=concept_id.replace("concept::", "", 1),
title=concept.get("title", concept_id),
description=concept.get("description", ""),
prerequisites=[],
mastery_signals=[],
status=status,
notes=notes,
)
)
conflicts = [item["message"] for item in lint_payload.get("findings", []) if item["severity"] == "error"]
review_flags = [item["message"] for item in lint_payload.get("findings", []) if item["severity"] == "warning"]
pack = {
"name": f"groundrecall-import-{manifest['import_id']}",
"display_name": f"GroundRecall Import {manifest['import_id']}",
"version": "0.1.0-draft",
"source_import_id": manifest["import_id"],
"source_root": manifest.get("source_root", ""),
}
attribution = {
"source_repo_kind": manifest.get("source_repo_kind", "llmwiki"),
"source_root": manifest.get("source_root", ""),
"imported_at": manifest.get("imported_at", ""),
"machine_id": manifest.get("machine_id", ""),
"rights_note": "Imported llmwiki-style corpus requires review before promotion.",
}
return ReviewSession(
reviewer=reviewer,
draft_pack=DraftPackData(
pack=pack,
concepts=entries,
conflicts=conflicts,
review_flags=review_flags,
attribution=attribution,
),
citation_reviews=build_citation_review_entries_from_import(base),
)
def export_review_bundle_from_import(import_dir: str | Path, out_dir: str | Path | None = None, reviewer: str = "GroundRecall Import") -> dict[str, str]:
base = Path(import_dir)
target = Path(out_dir) if out_dir is not None else base
target.mkdir(parents=True, exist_ok=True)
session = build_review_session_from_import(base, reviewer=reviewer)
review_state_path = target / "review_session.json"
export_review_state_json(session, review_state_path)
export_review_ui_data(session, target, import_dir=base)
return {
"review_session_json": str(review_state_path),
"review_data_json": str(target / "review_data.json"),
}
def build_parser() -> argparse.ArgumentParser:
parser = argparse.ArgumentParser(description="Build Didactopus review artifacts from a GroundRecall import.")
parser.add_argument("import_dir")
parser.add_argument("--out-dir", default=None)
parser.add_argument("--reviewer", default="GroundRecall Import")
return parser
def main() -> None:
args = build_parser().parse_args()
outputs = export_review_bundle_from_import(args.import_dir, out_dir=args.out_dir, reviewer=args.reviewer)
print(json.dumps(outputs, indent=2))

View File

@ -0,0 +1,114 @@
from __future__ import annotations
import argparse
import json
from collections import defaultdict
from pathlib import Path
from typing import Any
def _read_json(path: Path) -> dict[str, Any]:
return json.loads(path.read_text(encoding="utf-8"))
def _read_jsonl(path: Path) -> list[dict[str, Any]]:
if not path.exists():
return []
text = path.read_text(encoding="utf-8").strip()
if not text:
return []
return [json.loads(line) for line in text.splitlines()]
def _triage_lane(item: dict[str, Any], finding_codes: set[str]) -> str:
if {"claim_ungrounded", "ungrounded_summary"} & finding_codes:
return "source_cleanup"
if {"relation_missing_source", "relation_missing_target", "orphan_concept"} & finding_codes:
return "conflict_resolution"
return "knowledge_capture"
def _priority(item: dict[str, Any], finding_codes: set[str]) -> int:
priority = 50
if item.get("grounding_status") == "grounded":
priority -= 10
if item.get("current_status") == "triaged":
priority -= 5
if any(code.startswith("claim_") or code.startswith("relation_") for code in finding_codes):
priority += 20
priority -= min(len(finding_codes) * 2, 10)
return max(priority, 1)
def build_review_queue(import_dir: str | Path) -> dict[str, Any]:
base = Path(import_dir)
manifest = _read_json(base / "manifest.json")
lint_payload = _read_json(base / "lint_findings.json")
claims = _read_jsonl(base / "claims.jsonl")
concepts = _read_jsonl(base / "concepts.jsonl")
findings_by_target: defaultdict[str, list[dict[str, Any]]] = defaultdict(list)
for finding in lint_payload.get("findings", []):
findings_by_target[finding["target_id"]].append(finding)
queue: list[dict[str, Any]] = []
for claim in claims:
related = findings_by_target.get(claim["claim_id"], [])
finding_codes = {item["code"] for item in related}
queue.append(
{
"queue_id": f"rq_{claim['claim_id']}",
"candidate_type": "claim",
"candidate_id": claim["claim_id"],
"title": claim["claim_text"][:100],
"triage_lane": _triage_lane(claim, finding_codes),
"priority": _priority(claim, finding_codes),
"grounding_status": claim.get("grounding_status"),
"status": "needs_review",
"finding_codes": sorted(finding_codes),
"concept_ids": list(claim.get("concept_ids", [])),
}
)
for concept in concepts:
related = findings_by_target.get(concept["concept_id"], [])
finding_codes = {item["code"] for item in related}
if not finding_codes:
continue
queue.append(
{
"queue_id": f"rq_{concept['concept_id'].replace('::', '_')}",
"candidate_type": "concept",
"candidate_id": concept["concept_id"],
"title": concept["title"],
"triage_lane": _triage_lane(concept, finding_codes),
"priority": _priority(concept, finding_codes),
"grounding_status": concept.get("grounding_status", "triaged"),
"status": "needs_review",
"finding_codes": sorted(finding_codes),
"concept_ids": [concept["concept_id"]],
}
)
queue.sort(key=lambda item: (item["priority"], item["candidate_type"], item["candidate_id"]))
return {
"import_id": manifest["import_id"],
"queue_length": len(queue),
"items": queue,
}
def build_parser() -> argparse.ArgumentParser:
parser = argparse.ArgumentParser(description="Build a GroundRecall review queue from import artifacts.")
parser.add_argument("import_dir")
parser.add_argument("--out", default=None)
return parser
def main() -> None:
args = build_parser().parse_args()
payload = build_review_queue(args.import_dir)
out_path = Path(args.out) if args.out else Path(args.import_dir) / "review_queue.json"
out_path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
print(f"Wrote {out_path}")

View File

@ -0,0 +1,180 @@
from __future__ import annotations
from dataclasses import dataclass, field
from pathlib import Path
import re
from .groundrecall_discovery import DiscoveredArtifact
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")
FRONTMATTER_DELIM = "---"
ANNOTATION_RE = re.compile(r"\[(claim_id|contradicts|supersedes):([^\]]+)\]", re.IGNORECASE)
TABLE_SEPARATOR_RE = re.compile(r"^\|(?:\s*:?-{3,}:?\s*\|)+\s*$")
LATEX_STRUCTURAL_RE = re.compile(r"^\\(begin|end|centering|caption|label|tikzset|node|draw|path|matrix|includegraphics)\b")
LATEX_MATH_ONLY_RE = re.compile(r"^[\\{}[\]()$&_^%.,;:=+\-*/|<>~0-9A-Za-z ]+$")
@dataclass
class SegmentedObservation:
artifact_relative_path: str
role: str
text: str
section: str
line_start: int
line_end: int
grounding_status: str
support_kind: str
confidence_hint: float
explicit_claim_key: str = ""
contradict_keys: list[str] = field(default_factory=list)
supersede_keys: list[str] = field(default_factory=list)
@dataclass
class SegmentedPage:
title: str
headings: list[str] = field(default_factory=list)
frontmatter: dict[str, str] = field(default_factory=dict)
observations: list[SegmentedObservation] = field(default_factory=list)
concepts: list[str] = field(default_factory=list)
links: list[str] = field(default_factory=list)
def _parse_frontmatter(lines: list[str]) -> tuple[dict[str, str], int]:
if not lines or lines[0].strip() != FRONTMATTER_DELIM:
return {}, 0
data: dict[str, str] = {}
idx = 1
while idx < len(lines):
stripped = lines[idx].strip()
if stripped == FRONTMATTER_DELIM:
return data, idx + 1
if ":" in stripped:
key, value = stripped.split(":", 1)
data[key.strip()] = value.strip()
idx += 1
return data, 0
def _extract_links(text: str) -> list[str]:
return re.findall(r"\[\[([^\]]+)\]\]", text)
def _to_concept_id(text: str) -> str:
text = re.sub(r"[^a-zA-Z0-9]+", "-", text.strip().lower()).strip("-")
return text or "untitled"
def _parse_annotations(text: str) -> tuple[str, str, list[str], list[str]]:
claim_key = ""
contradict_keys: list[str] = []
supersede_keys: list[str] = []
for kind, raw_value in ANNOTATION_RE.findall(text):
values = [value.strip() for value in raw_value.split(",") if value.strip()]
kind_lower = kind.lower()
if kind_lower == "claim_id" and values:
claim_key = values[0]
elif kind_lower == "contradicts":
contradict_keys.extend(values)
elif kind_lower == "supersedes":
supersede_keys.extend(values)
cleaned = ANNOTATION_RE.sub("", text)
cleaned = re.sub(r"\s{2,}", " ", cleaned).strip()
return cleaned, claim_key, contradict_keys, supersede_keys
def _should_skip_line(text: str) -> bool:
stripped = text.strip()
if not stripped:
return True
if stripped.startswith("!["):
return True
if stripped in {"---", "```", "};", "{", "}", "</div>", "<div>", ":::"}:
return True
if stripped.startswith(":::"):
return True
if stripped.startswith("|") and stripped.endswith("|"):
return True
if TABLE_SEPARATOR_RE.match(stripped):
return True
if LATEX_STRUCTURAL_RE.match(stripped):
return True
if stripped.startswith("%"):
return True
if stripped.startswith("\\") and LATEX_MATH_ONLY_RE.match(stripped):
return True
return False
def segment_markdown_artifact(artifact: DiscoveredArtifact, text: str | None = None) -> SegmentedPage:
text = artifact.path.read_text(encoding="utf-8") if text is None else text
lines = text.splitlines()
frontmatter, start_idx = _parse_frontmatter(lines)
current_section = frontmatter.get("title", Path(artifact.relative_path).stem.replace("-", " ").title())
title = current_section
headings: list[str] = []
observations: list[SegmentedObservation] = []
concepts: list[str] = []
links: list[str] = []
for idx in range(start_idx, len(lines)):
raw_line = lines[idx]
stripped = raw_line.strip()
if _should_skip_line(stripped):
continue
heading_match = HEADING_RE.match(raw_line)
if heading_match:
current_section = heading_match.group(2).strip()
headings.append(current_section)
if not title and heading_match.group(1) == "#":
title = current_section
concepts.append(_to_concept_id(current_section))
continue
role = "summary"
obs_text = stripped
if stripped.startswith(("- ", "* ")):
role = "claim"
obs_text = stripped[2:].strip()
elif stripped.lower().startswith(("todo:", "question:", "q:")):
role = "question"
elif stripped.lower().startswith(("speculation:", "hypothesis:")):
role = "speculation"
elif artifact.artifact_kind == "session_log":
role = "transcript"
obs_text, claim_key, contradict_keys, supersede_keys = _parse_annotations(obs_text)
links.extend(_extract_links(obs_text))
if role in {"summary", "claim"}:
concepts.extend(_to_concept_id(link) for link in _extract_links(obs_text))
observations.append(
SegmentedObservation(
artifact_relative_path=artifact.relative_path,
role=role,
text=obs_text,
section=current_section,
line_start=idx + 1,
line_end=idx + 1,
grounding_status="partially_grounded" if artifact.artifact_kind == "compiled_page" else "grounded",
support_kind="derived_from_page" if artifact.artifact_kind == "compiled_page" else "direct_source",
confidence_hint=0.55 if role == "speculation" else 0.7 if role == "claim" else 0.6,
explicit_claim_key=claim_key,
contradict_keys=contradict_keys,
supersede_keys=supersede_keys,
)
)
if not headings and title:
headings.append(title)
if not concepts and title:
concepts.append(_to_concept_id(title))
return SegmentedPage(
title=title,
headings=headings,
frontmatter=frontmatter,
observations=observations,
concepts=sorted({c for c in concepts if c}),
links=sorted({link for link in links if link}),
)
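End-to-end sketch of the segmenter on in-memory text (the note content is illustrative); passing text= avoids any file I/O:

art = DiscoveredArtifact(
    path=Path("notes/memory.md"),
    relative_path="notes/memory.md",
    artifact_kind="markdown_note",
    is_text=True,
)
page = segment_markdown_artifact(
    art,
    text="# Memory\n- Spacing improves retention. [claim_id:spacing-effect]\nSee [[retrieval practice]].",
)
assert page.observations[0].role == "claim"
assert page.observations[0].explicit_claim_key == "spacing-effect"
assert page.concepts == ["memory", "retrieval-practice"]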

View File

@ -0,0 +1,15 @@
"""Legacy flat GroundRecall source adapter package.
Compatibility path retained during the internal namespace migration.
Prefer imports under ``didactopus.groundrecall.source_adapters`` for new code.
"""
from .base import get_source_adapter, list_source_adapters
from . import llmwiki # noqa: F401
from . import polypaper # noqa: F401
from . import doclift_bundle # noqa: F401
from . import markdown_notes # noqa: F401
from . import transcript # noqa: F401
from . import didactopus_pack # noqa: F401
__all__ = ["get_source_adapter", "list_source_adapters"]

View File

@ -0,0 +1,76 @@
"""Legacy flat GroundRecall source adapter base module.
Compatibility path retained during the internal namespace migration.
Prefer imports under ``didactopus.groundrecall.source_adapters.base`` for new
code.
"""
from __future__ import annotations
from dataclasses import dataclass
from pathlib import Path
from typing import Literal, Protocol
ImportIntent = Literal["grounded_knowledge", "curriculum", "both"]
@dataclass
class DiscoveredImportSource:
path: Path
relative_path: str
source_kind: str
artifact_kind: str
is_text: bool
metadata: dict
@dataclass
class StructuredImportRows:
artifact_rows: list[dict]
observation_rows: list[dict]
claim_rows: list[dict]
concept_rows: list[dict]
relation_rows: list[dict]
class GroundRecallSourceAdapter(Protocol):
name: str
def detect(self, root: str | Path) -> bool:
...
def discover(self, root: str | Path) -> list[DiscoveredImportSource]:
...
def import_intent(self) -> ImportIntent:
...
def build_rows(self, context, sources: list[DiscoveredImportSource]) -> StructuredImportRows | None:
...
_REGISTRY: dict[str, GroundRecallSourceAdapter] = {}
def register_source_adapter(adapter: GroundRecallSourceAdapter) -> GroundRecallSourceAdapter:
_REGISTRY[adapter.name] = adapter
return adapter
def get_source_adapter(name: str) -> GroundRecallSourceAdapter:
try:
return _REGISTRY[name]
except KeyError as exc:
raise KeyError(f"Unknown GroundRecall source adapter: {name}") from exc
def list_source_adapters() -> list[str]:
return sorted(_REGISTRY)
def detect_source_adapter(root: str | Path) -> GroundRecallSourceAdapter:
for adapter in _REGISTRY.values():
if adapter.detect(root):
return adapter
raise ValueError(f"No GroundRecall source adapter detected for {root}")

View File

@ -0,0 +1,234 @@
from __future__ import annotations
from hashlib import sha256
import yaml
from pathlib import Path
from ..artifact_schemas import ConceptsFile, RoadmapFile
from .base import DiscoveredImportSource, StructuredImportRows, register_source_adapter
class DidactopusPackSourceAdapter:
name = "didactopus_pack"
def detect(self, root: str | Path) -> bool:
base = Path(root)
required = {"pack.yaml", "concepts.yaml"}
return required.issubset({path.name for path in base.iterdir() if path.exists()})
def discover(self, root: str | Path) -> list[DiscoveredImportSource]:
base = Path(root)
rows: list[DiscoveredImportSource] = []
for filename in ["pack.yaml", "concepts.yaml", "roadmap.yaml", "projects.yaml", "rubrics.yaml", "review_ledger.json"]:
path = base / filename
if not path.exists():
continue
rows.append(
DiscoveredImportSource(
path=path,
relative_path=path.relative_to(base).as_posix(),
source_kind="didactopus_pack",
artifact_kind="didactopus_pack_artifact",
is_text=True,
metadata={},
)
)
return rows
def import_intent(self) -> str:
return "both"
def build_rows(self, context, sources: list[DiscoveredImportSource]) -> StructuredImportRows | None:
by_name = {Path(item.relative_path).name: item for item in sources}
concepts_src = by_name.get("concepts.yaml")
if concepts_src is None:
return None
pack_src = by_name.get("pack.yaml")
pack_payload = {}
if pack_src is not None:
pack_payload = yaml.safe_load(pack_src.path.read_text(encoding="utf-8")) or {}
concepts_payload = ConceptsFile.model_validate(
yaml.safe_load(concepts_src.path.read_text(encoding="utf-8")) or {"concepts": []}
)
roadmap_payload = None
roadmap_src = by_name.get("roadmap.yaml")
if roadmap_src is not None:
roadmap_payload = RoadmapFile.model_validate(
yaml.safe_load(roadmap_src.path.read_text(encoding="utf-8")) or {"stages": []}
)
artifact_rows: list[dict] = []
observation_rows: list[dict] = []
claim_rows: list[dict] = []
concept_rows: list[dict] = []
relation_rows: list[dict] = []
for source in sources:
artifact_rows.append(
{
"artifact_id": f"ia_{sha256(source.relative_path.encode('utf-8')).hexdigest()[:12]}",
"import_id": context.import_id,
"artifact_kind": source.artifact_kind,
"path": source.relative_path,
"title": source.path.stem,
"sha256": sha256(source.path.read_bytes()).hexdigest(),
"created_at": context.imported_at,
"metadata": {"source_kind": source.source_kind},
"current_status": "draft",
}
)
pack_name = pack_payload.get("name", Path(context.source_root).name)
concepts_artifact_id = next(
(row["artifact_id"] for row in artifact_rows if row["path"] == concepts_src.relative_path),
"",
)
for index, concept in enumerate(concepts_payload.concepts, start=1):
concept_key = f"concept::{concept.id}"
concept_rows.append(
{
"concept_id": concept_key,
"import_id": context.import_id,
"title": concept.title,
"aliases": [],
"description": concept.description or f"Imported concept from Didactopus pack {pack_name}.",
"source_artifact_ids": [concepts_artifact_id] if concepts_artifact_id else [],
"current_status": "triaged",
}
)
observation_id = f"obs_pack_{concept.id}_{index}"
observation_rows.append(
{
"observation_id": observation_id,
"import_id": context.import_id,
"artifact_id": concepts_artifact_id,
"role": "summary",
"text": concept.description or concept.title,
"origin_path": concepts_src.relative_path,
"origin_section": concept.title,
"line_start": 0,
"line_end": 0,
"grounding_status": "grounded",
"support_kind": "direct_source",
"confidence_hint": 0.85,
"current_status": "draft",
}
)
claim_rows.append(
{
"claim_id": f"clm_pack_{concept.id}",
"import_id": context.import_id,
"claim_text": concept.description or f"{concept.title} is a concept in pack {pack_name}.",
"claim_kind": "summary",
"source_observation_ids": [observation_id],
"supporting_fragment_ids": [],
"concept_ids": [concept_key],
"contradicts_claim_ids": [],
"supersedes_claim_ids": [],
"confidence_hint": 0.85,
"grounding_status": "grounded",
"current_status": "triaged",
}
)
for prereq in concept.prerequisites:
relation_rows.append(
{
"relation_id": f"rel_prereq_{concept.id}_{prereq}",
"import_id": context.import_id,
"source_id": f"concept::{prereq}",
"target_id": concept_key,
"relation_type": "prerequisite",
"evidence_ids": [f"clm_pack_{concept.id}"],
"current_status": "draft",
}
)
for signal_idx, signal in enumerate(concept.mastery_signals, start=1):
signal_obs_id = f"obs_signal_{concept.id}_{signal_idx}"
observation_rows.append(
{
"observation_id": signal_obs_id,
"import_id": context.import_id,
"artifact_id": concepts_artifact_id,
"role": "summary",
"text": signal,
"origin_path": concepts_src.relative_path,
"origin_section": f"{concept.title} mastery signal",
"line_start": 0,
"line_end": 0,
"grounding_status": "grounded",
"support_kind": "direct_source",
"confidence_hint": 0.8,
"current_status": "draft",
}
)
claim_rows.append(
{
"claim_id": f"clm_signal_{concept.id}_{signal_idx}",
"import_id": context.import_id,
"claim_text": signal,
"claim_kind": "mastery_signal",
"source_observation_ids": [signal_obs_id],
"supporting_fragment_ids": [],
"concept_ids": [concept_key],
"contradicts_claim_ids": [],
"supersedes_claim_ids": [],
"confidence_hint": 0.8,
"grounding_status": "grounded",
"current_status": "triaged",
}
)
if roadmap_payload is not None and roadmap_src is not None:
roadmap_artifact_id = next(
(row["artifact_id"] for row in artifact_rows if row["path"] == roadmap_src.relative_path),
"",
)
for stage in roadmap_payload.stages:
for concept_id in stage.concepts:
observation_id = f"obs_stage_{stage.id}_{concept_id}"
observation_rows.append(
{
"observation_id": observation_id,
"import_id": context.import_id,
"artifact_id": roadmap_artifact_id,
"role": "summary",
"text": f"{concept_id} appears in roadmap stage {stage.title}.",
"origin_path": roadmap_src.relative_path,
"origin_section": stage.title,
"line_start": 0,
"line_end": 0,
"grounding_status": "grounded",
"support_kind": "direct_source",
"confidence_hint": 0.75,
"current_status": "draft",
}
)
claim_rows.append(
{
"claim_id": f"clm_stage_{stage.id}_{concept_id}",
"import_id": context.import_id,
"claim_text": f"{concept_id} belongs to roadmap stage {stage.title}.",
"claim_kind": "roadmap_stage",
"source_observation_ids": [observation_id],
"supporting_fragment_ids": [],
"concept_ids": [f"concept::{concept_id}"],
"contradicts_claim_ids": [],
"supersedes_claim_ids": [],
"confidence_hint": 0.75,
"grounding_status": "grounded",
"current_status": "triaged",
}
)
return StructuredImportRows(
artifact_rows=artifact_rows,
observation_rows=observation_rows,
claim_rows=claim_rows,
concept_rows=concept_rows,
relation_rows=relation_rows,
)
register_source_adapter(DidactopusPackSourceAdapter())
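Detection requires both pack.yaml and concepts.yaml at the root; a self-contained sketch (the demo pack contents are illustrative):

import tempfile
base = Path(tempfile.mkdtemp())
(base / "pack.yaml").write_text("name: demo-pack\n", encoding="utf-8")
(base / "concepts.yaml").write_text("concepts: []\n", encoding="utf-8")
assert DidactopusPackSourceAdapter().detect(base)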

View File

@ -0,0 +1,150 @@
from __future__ import annotations
import json
from hashlib import sha256
from pathlib import Path
from .base import DiscoveredImportSource, StructuredImportRows, register_source_adapter
class DocliftBundleSourceAdapter:
name = "doclift_bundle"
def detect(self, root: str | Path) -> bool:
base = Path(root)
return (base / "manifest.json").exists() and (base / "documents").exists()
def discover(self, root: str | Path) -> list[DiscoveredImportSource]:
base = Path(root)
rows: list[DiscoveredImportSource] = []
for path in sorted(p for p in base.rglob("*") if p.is_file() and p.suffix.lower() in {".json", ".md"}):
rows.append(
DiscoveredImportSource(
path=path,
relative_path=path.relative_to(base).as_posix(),
source_kind="doclift_bundle",
artifact_kind="doclift_bundle_artifact",
is_text=True,
metadata={},
)
)
return rows
def import_intent(self) -> str:
return "both"
def build_rows(self, context, sources: list[DiscoveredImportSource]) -> StructuredImportRows | None:
base = Path(context.source_root)
manifest_path = base / "manifest.json"
if not manifest_path.exists():
return None
manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
artifact_rows: list[dict] = []
observation_rows: list[dict] = []
claim_rows: list[dict] = []
concept_rows: list[dict] = []
relation_rows: list[dict] = []
artifact_by_path: dict[str, str] = {}
for source in sources:
artifact_id = f"ia_{sha256(source.relative_path.encode('utf-8')).hexdigest()[:12]}"
artifact_rows.append(
{
"artifact_id": artifact_id,
"import_id": context.import_id,
"artifact_kind": source.artifact_kind,
"path": source.relative_path,
"title": source.path.stem,
"sha256": sha256(source.path.read_bytes()).hexdigest(),
"created_at": context.imported_at,
"metadata": {"source_kind": source.source_kind},
"current_status": "draft",
}
)
artifact_by_path[source.relative_path] = artifact_id
documents = [item for item in manifest.get("documents", []) if isinstance(item, dict)]
previous_concept_id: str | None = None
for index, document in enumerate(documents, start=1):
title = str(document.get("title") or f"Document {index}")
concept_id = f"concept::{document.get('document_id') or title.lower().replace(' ', '-')}"
            markdown_path = Path(document.get("markdown_path", ""))
            if markdown_path.is_absolute() and markdown_path.exists() and markdown_path.is_relative_to(base):
                relative_markdown = markdown_path.relative_to(base).as_posix()
            else:
                relative_markdown = document.get("markdown_path", "")
artifact_id = artifact_by_path.get(str(relative_markdown), "")
figures_path = Path(document.get("figures_path", ""))
figure_payload = {}
if figures_path.exists():
figure_payload = json.loads(figures_path.read_text(encoding="utf-8"))
source_path = str(figure_payload.get("source_path") or document.get("source_path") or relative_markdown)
concept_rows.append(
{
"concept_id": concept_id,
"import_id": context.import_id,
"title": title,
"aliases": [],
"description": f"Imported from doclift bundle document kind '{document.get('document_kind', 'document')}'.",
"source_artifact_ids": [artifact_id] if artifact_id else [],
"current_status": "triaged",
}
)
observation_id = f"obs_doclift_{index}"
observation_rows.append(
{
"observation_id": observation_id,
"import_id": context.import_id,
"artifact_id": artifact_id,
"role": "summary",
"text": title,
"origin_path": relative_markdown,
"origin_section": title,
"line_start": 0,
"line_end": 0,
"source_url": source_path,
"grounding_status": "grounded",
"support_kind": "direct_source",
"confidence_hint": 0.85,
"current_status": "draft",
}
)
claim_rows.append(
{
"claim_id": f"clm_doclift_{index}",
"import_id": context.import_id,
"claim_text": f"{title} is a {document.get('document_kind', 'document')} in the imported doclift bundle.",
"claim_kind": "summary",
"source_observation_ids": [observation_id],
"supporting_fragment_ids": [],
"concept_ids": [concept_id],
"contradicts_claim_ids": [],
"supersedes_claim_ids": [],
"confidence_hint": 0.85,
"grounding_status": "grounded",
"current_status": "triaged",
}
)
if previous_concept_id is not None:
relation_rows.append(
{
"relation_id": f"rel_doclift_seq_{index}",
"import_id": context.import_id,
"source_id": previous_concept_id,
"target_id": concept_id,
"relation_type": "references",
"evidence_ids": [f"clm_doclift_{index}"],
"current_status": "draft",
}
)
previous_concept_id = concept_id
return StructuredImportRows(
artifact_rows=artifact_rows,
observation_rows=observation_rows,
claim_rows=claim_rows,
concept_rows=concept_rows,
relation_rows=relation_rows,
)
register_source_adapter(DocliftBundleSourceAdapter())

View File

@ -0,0 +1,36 @@
from __future__ import annotations
from pathlib import Path
from ..groundrecall_discovery import discover_llmwiki_artifacts
from .base import DiscoveredImportSource, StructuredImportRows, register_source_adapter
class LLMWikiSourceAdapter:
name = "llmwiki"
def detect(self, root: str | Path) -> bool:
base = Path(root)
return (base / "wiki").exists() or (base / "raw").exists() or any(path.name.startswith("schema.") for path in base.iterdir() if path.exists())
def discover(self, root: str | Path) -> list[DiscoveredImportSource]:
return [
DiscoveredImportSource(
path=item.path,
relative_path=item.relative_path,
source_kind="llmwiki",
artifact_kind=item.artifact_kind,
is_text=item.is_text,
metadata={},
)
for item in discover_llmwiki_artifacts(root)
]
def import_intent(self) -> str:
return "grounded_knowledge"
def build_rows(self, context, sources: list[DiscoveredImportSource]) -> StructuredImportRows | None:
return None
register_source_adapter(LLMWikiSourceAdapter())

View File

@ -0,0 +1,41 @@
from __future__ import annotations
from pathlib import Path
from .base import DiscoveredImportSource, StructuredImportRows, register_source_adapter
TEXT_SUFFIXES = {".md", ".markdown", ".txt", ".tex"}
class MarkdownNotesSourceAdapter:
name = "markdown_notes"
def detect(self, root: str | Path) -> bool:
base = Path(root)
return any(path.suffix.lower() in TEXT_SUFFIXES for path in base.rglob("*") if path.is_file())
def discover(self, root: str | Path) -> list[DiscoveredImportSource]:
base = Path(root)
rows: list[DiscoveredImportSource] = []
for path in sorted(p for p in base.rglob("*") if p.is_file() and p.suffix.lower() in TEXT_SUFFIXES):
rows.append(
DiscoveredImportSource(
path=path,
relative_path=path.relative_to(base).as_posix(),
source_kind="markdown_notes",
artifact_kind="markdown_note",
is_text=True,
metadata={},
)
)
return rows
def import_intent(self) -> str:
return "grounded_knowledge"
def build_rows(self, context, sources: list[DiscoveredImportSource]) -> StructuredImportRows | None:
return None
register_source_adapter(MarkdownNotesSourceAdapter())

View File

@ -0,0 +1,106 @@
from __future__ import annotations
from pathlib import Path
import re
from .base import DiscoveredImportSource, StructuredImportRows, register_source_adapter
TEXT_SUFFIXES = {".tex"}
EXCLUDED_NAMES = {
".pp-export-tmp.tex",
"paper.woven.arxiv.tex",
"paper.woven.test.tex",
"paper.woven.org",
"paper.org",
"paper_b.org",
"paper_c.org",
"paper_c.bak.org",
"paper-demo.org",
"paper-orig.org",
"test.output.org",
"tex-blocks.org",
}
EXCLUDED_DIRS = {".git", "__pycache__", ".pytest_cache", "setup"}
EXCLUDED_PREFIXES = ("table-", "figure-", "fig-")
INCLUDE_RE = re.compile(r"\\(?:include|input)\{([^}]+)\}")
class PolyPaperSourceAdapter:
name = "polypaper"
def detect(self, root: str | Path) -> bool:
base = Path(root)
return (
(base / "main.tex").exists()
and (base / "pieces").is_dir()
and ((base / "paper.org").exists() or (base / "README.md").exists())
)
def discover(self, root: str | Path) -> list[DiscoveredImportSource]:
base = Path(root)
allowed_paths = self._collect_reachable_tex(base)
rows: list[DiscoveredImportSource] = []
for path in sorted(allowed_paths):
rows.append(
DiscoveredImportSource(
path=path,
relative_path=path.relative_to(base).as_posix(),
source_kind="polypaper",
artifact_kind="markdown_note",
is_text=True,
metadata={},
)
)
return rows
def import_intent(self) -> str:
return "grounded_knowledge"
def build_rows(self, context, sources: list[DiscoveredImportSource]) -> StructuredImportRows | None:
return None
    def _collect_reachable_tex(self, base: Path) -> set[Path]:
        entrypoint = base / "main.tex"
        reachable: set[Path] = set()
        # Track every file already expanded, including the entrypoint, so that
        # cyclic \include/\input graphs cannot re-queue main.tex forever.
        visited: set[Path] = set()
        pending: list[Path] = [entrypoint]
        while pending:
            current = pending.pop()
            if not current.exists():
                continue
            if current in visited:
                continue
            visited.add(current)
            if any(part in EXCLUDED_DIRS for part in current.relative_to(base).parts):
                continue
            if current.name in EXCLUDED_NAMES or current.suffix.lower() not in TEXT_SUFFIXES:
                continue
            if current.parent.name == "figs":
                continue
            if current.name.startswith(EXCLUDED_PREFIXES) or current.name == "tables.tex":
                continue
            text = current.read_text(encoding="utf-8")
            for raw_ref in INCLUDE_RE.findall(text):
                candidate = self._resolve_include(base, current.parent, raw_ref.strip())
                if candidate is not None and candidate not in visited:
                    pending.append(candidate)
            if current != entrypoint:
                reachable.add(current)
        return reachable
def _resolve_include(self, base: Path, current_dir: Path, raw_ref: str) -> Path | None:
candidates = [current_dir / raw_ref, base / raw_ref]
resolved: list[Path] = []
for candidate in candidates:
if candidate.suffix:
resolved.append(candidate)
else:
resolved.append(candidate.with_suffix(".tex"))
for candidate in resolved:
if candidate.exists():
return candidate
return None
register_source_adapter(PolyPaperSourceAdapter())

View File

@ -0,0 +1,38 @@
from __future__ import annotations
from pathlib import Path
from .base import DiscoveredImportSource, StructuredImportRows, register_source_adapter
class TranscriptSourceAdapter:
name = "transcript"
def detect(self, root: str | Path) -> bool:
base = Path(root)
return any("transcript" in path.name.lower() for path in base.rglob("*") if path.is_file())
def discover(self, root: str | Path) -> list[DiscoveredImportSource]:
base = Path(root)
rows: list[DiscoveredImportSource] = []
for path in sorted(p for p in base.rglob("*") if p.is_file() and "transcript" in p.name.lower()):
rows.append(
DiscoveredImportSource(
path=path,
relative_path=path.relative_to(base).as_posix(),
source_kind="transcript",
artifact_kind="session_log",
is_text=True,
metadata={},
)
)
return rows
def import_intent(self) -> str:
return "grounded_knowledge"
def build_rows(self, context, sources: list[DiscoveredImportSource]) -> StructuredImportRows | None:
return None
register_source_adapter(TranscriptSourceAdapter())

View File

@ -0,0 +1,9 @@
"""Legacy extracted GroundRecall store module.
Compatibility path retained while the standalone repo converges on the
top-level ``groundrecall.store`` module as the primary implementation.
"""
from __future__ import annotations
from .store import GroundRecallStore

231
src/groundrecall/ingest.py Normal file
View File

@ -0,0 +1,231 @@
from __future__ import annotations
import argparse
import json
import shutil
import socket
import subprocess
from collections import OrderedDict
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path
from typing import Any
from .groundrecall_discovery import DiscoveredArtifact
from .groundrecall_lint import lint_import_directory
from .groundrecall_normalizer import (
ImportContext,
build_artifact_record,
build_claim_record,
build_concept_records,
build_observation_record,
build_relation_records,
manifest_record,
)
from .groundrecall_review_bridge import export_review_bundle_from_import
from .groundrecall_review_queue import build_review_queue
from .groundrecall_segmenter import SegmentedPage, segment_markdown_artifact
from .groundrecall_source_adapters.base import detect_source_adapter
import groundrecall.groundrecall_source_adapters # noqa: F401
VALID_MODES = {"archive", "quick", "grounded"}
@dataclass
class ImportResult:
manifest: dict[str, Any]
artifacts: list[dict[str, Any]]
observations: list[dict[str, Any]]
claims: list[dict[str, Any]]
concepts: list[dict[str, Any]]
relations: list[dict[str, Any]]
out_dir: Path
def _timestamp() -> str:
return datetime.now(timezone.utc).replace(microsecond=0).isoformat().replace("+00:00", "Z")
def _default_import_id(source_root: Path) -> str:
stem = source_root.name.lower().replace("_", "-")
stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
return f"{stem}-{stamp}"
def _write_json(path: Path, payload: dict[str, Any]) -> None:
path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
def _write_jsonl(path: Path, rows: list[dict[str, Any]]) -> None:
text = "\n".join(json.dumps(row, sort_keys=True) for row in rows)
if text:
text += "\n"
path.write_text(text, encoding="utf-8")
def _dedupe_by_key(rows: list[dict[str, Any]], key: str) -> list[dict[str, Any]]:
unique: OrderedDict[str, dict[str, Any]] = OrderedDict()
for row in rows:
unique.setdefault(str(row[key]), row)
return list(unique.values())
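# Illustrative behavior (hypothetical rows): the first row seen for each key
# wins and input order is preserved, e.g.
#   _dedupe_by_key([{"claim_id": "a", "v": 1}, {"claim_id": "a", "v": 2}], "claim_id")
#   -> [{"claim_id": "a", "v": 1}]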
def _convert_tex_to_markdown(path: Path) -> str | None:
pandoc = shutil.which("pandoc")
if pandoc is None:
return None
result = subprocess.run(
[pandoc, "-f", "latex", "-t", "gfm", str(path)],
capture_output=True,
text=True,
check=False,
)
if result.returncode != 0:
return None
markdown = result.stdout.strip()
return markdown or None
def _segment_artifact(artifact: DiscoveredArtifact) -> SegmentedPage | None:
if not artifact.is_text:
return None
suffix = artifact.path.suffix.lower()
if suffix not in {".md", ".markdown", ".txt", ".tex", ".log"}:
return None
if suffix == ".tex":
converted = _convert_tex_to_markdown(artifact.path)
if converted is not None:
return segment_markdown_artifact(artifact, text=converted)
return segment_markdown_artifact(artifact)
def run_groundrecall_import(
source_root: str | Path,
out_root: str | Path | None = None,
mode: str = "quick",
import_id: str | None = None,
machine_id: str | None = None,
agent_id: str = "groundrecall.ingest",
) -> ImportResult:
source_path = Path(source_root).resolve()
if mode not in VALID_MODES:
raise ValueError(f"Unsupported import mode: {mode}")
adapter = detect_source_adapter(source_path)
discovered = adapter.discover(source_path)
artifacts = [
DiscoveredArtifact(
path=item.path,
relative_path=item.relative_path,
artifact_kind=item.artifact_kind,
is_text=item.is_text,
)
for item in discovered
]
actual_import_id = import_id or _default_import_id(source_path)
output_root = Path(out_root) if out_root else source_path / "imports"
output_dir = output_root / actual_import_id
output_dir.mkdir(parents=True, exist_ok=True)
context = ImportContext(
import_id=actual_import_id,
import_mode=mode,
machine_id=machine_id or socket.gethostname(),
agent_id=agent_id,
source_root=str(source_path),
imported_at=_timestamp(),
)
artifact_rows: list[dict[str, Any]] = []
observation_rows: list[dict[str, Any]] = []
claim_rows: list[dict[str, Any]] = []
concept_rows: list[dict[str, Any]] = []
relation_rows: list[dict[str, Any]] = []
structured_rows = adapter.build_rows(context, discovered)
if structured_rows is not None:
artifact_rows.extend(structured_rows.artifact_rows)
observation_rows.extend(structured_rows.observation_rows)
claim_rows.extend(structured_rows.claim_rows)
concept_rows.extend(structured_rows.concept_rows)
relation_rows.extend(structured_rows.relation_rows)
else:
for artifact in artifacts:
page = _segment_artifact(artifact)
artifact_row = build_artifact_record(context, artifact, page)
artifact_rows.append(artifact_row)
if page is None:
continue
concept_rows.extend(build_concept_records(context, artifact_row, page.concepts))
relation_rows.extend(build_relation_records(context, artifact_row, page.concepts, page.links))
for index, observation in enumerate(page.observations, start=1):
observation_row = build_observation_record(context, artifact_row, observation, index)
observation_rows.append(observation_row)
if mode == "archive":
continue
if observation.role not in {"claim", "summary"}:
continue
claim_rows.append(build_claim_record(context, observation_row, observation, page.concepts[:3], index))
concept_rows = _dedupe_by_key(concept_rows, "concept_id")
relation_rows = _dedupe_by_key(relation_rows, "relation_id")
artifact_rows = _dedupe_by_key(artifact_rows, "artifact_id")
observation_rows = _dedupe_by_key(observation_rows, "observation_id")
claim_rows = _dedupe_by_key(claim_rows, "claim_id")
manifest = manifest_record(context) | {
"source_adapter": adapter.name,
"import_intent": adapter.import_intent(),
"artifact_count": len(artifact_rows),
"observation_count": len(observation_rows),
"claim_count": len(claim_rows),
"concept_count": len(concept_rows),
"relation_count": len(relation_rows),
}
_write_json(output_dir / "manifest.json", manifest)
_write_jsonl(output_dir / "artifacts.jsonl", artifact_rows)
_write_jsonl(output_dir / "observations.jsonl", observation_rows)
_write_jsonl(output_dir / "claims.jsonl", claim_rows)
_write_jsonl(output_dir / "concepts.jsonl", concept_rows)
_write_jsonl(output_dir / "relations.jsonl", relation_rows)
lint_payload = lint_import_directory(output_dir)
_write_json(output_dir / "lint_findings.json", lint_payload)
review_queue = build_review_queue(output_dir)
_write_json(output_dir / "review_queue.json", review_queue)
export_review_bundle_from_import(output_dir)
return ImportResult(
manifest=manifest,
artifacts=artifact_rows,
observations=observation_rows,
claims=claim_rows,
concepts=concept_rows,
relations=relation_rows,
out_dir=output_dir,
)
def build_parser() -> argparse.ArgumentParser:
parser = argparse.ArgumentParser(description="Import a supported source tree into GroundRecall import artifacts; the source adapter is auto-detected.")
parser.add_argument("source_root")
parser.add_argument("--out-root", default=None)
parser.add_argument("--mode", choices=sorted(VALID_MODES), default="quick")
parser.add_argument("--import-id", default=None)
parser.add_argument("--machine-id", default=None)
parser.add_argument("--agent-id", default="groundrecall.ingest")
return parser
def main() -> None:
args = build_parser().parse_args()
result = run_groundrecall_import(
source_root=args.source_root,
out_root=args.out_root,
mode=args.mode,
import_id=args.import_id,
machine_id=args.machine_id,
agent_id=args.agent_id,
)
print(f"Wrote import artifacts to {result.out_dir}")

View File

@ -0,0 +1,47 @@
from __future__ import annotations
import argparse
import json
from pathlib import Path
from typing import Any
from .store import GroundRecallStore
def summarize_store(store_dir: str | Path) -> dict[str, Any]:
store = GroundRecallStore(store_dir)
snapshots = store.list_snapshots()
latest_snapshot = max(snapshots, key=lambda item: item.created_at, default=None)
return {
"store_dir": str(Path(store_dir)),
"source_count": len(store.list_sources()),
"artifact_count": len(store.list_artifacts()),
"observation_count": len(store.list_observations()),
"claim_count": len(store.list_claims()),
"concept_count": len(store.list_concepts()),
"relation_count": len(store.list_relations()),
"review_candidate_count": len(store.list_review_candidates()),
"promotion_count": len(store.list_promotions()),
"snapshot_count": len(snapshots),
"latest_snapshot_id": latest_snapshot.snapshot_id if latest_snapshot is not None else "",
}
def inspect_store(store_dir: str | Path, out_path: str | Path | None = None) -> dict[str, Any]:
payload = summarize_store(store_dir)
if out_path is not None:
Path(out_path).write_text(json.dumps(payload, indent=2), encoding="utf-8")
return payload
def build_parser() -> argparse.ArgumentParser:
parser = argparse.ArgumentParser(description="Inspect canonical GroundRecall store contents.")
parser.add_argument("store_dir")
parser.add_argument("--out", default=None)
return parser
def main() -> None:
args = build_parser().parse_args()
payload = inspect_store(args.store_dir, out_path=args.out)
print(json.dumps(payload, indent=2))
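A short sketch of programmatic inspection, assuming `store/` already holds a canonical store:

```python
from groundrecall.inspect import inspect_store

# Summarize object counts and optionally persist the summary as JSON.
summary = inspect_store("store/", out_path="store_summary.json")
print(summary["claim_count"], summary["latest_snapshot_id"])
```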

196
src/groundrecall/lint.py Normal file
View File

@ -0,0 +1,196 @@
from __future__ import annotations
import argparse
import json
from collections import Counter, defaultdict
from pathlib import Path
from typing import Any
def _read_json(path: Path) -> dict[str, Any]:
return json.loads(path.read_text(encoding="utf-8"))
def _read_jsonl(path: Path) -> list[dict[str, Any]]:
if not path.exists():
return []
text = path.read_text(encoding="utf-8").strip()
if not text:
return []
return [json.loads(line) for line in text.splitlines()]
def lint_import_directory(import_dir: str | Path) -> dict[str, Any]:
base = Path(import_dir)
manifest = _read_json(base / "manifest.json")
artifacts = _read_jsonl(base / "artifacts.jsonl")
observations = _read_jsonl(base / "observations.jsonl")
claims = _read_jsonl(base / "claims.jsonl")
concepts = _read_jsonl(base / "concepts.jsonl")
relations = _read_jsonl(base / "relations.jsonl")
findings: list[dict[str, Any]] = []
observation_by_id = {row["observation_id"]: row for row in observations}
concept_ids = {row["concept_id"] for row in concepts}
text_counter = Counter(row["claim_text"].strip().lower() for row in claims if row.get("claim_text", "").strip())
claim_ids = {row["claim_id"] for row in claims}
for claim in claims:
claim_text = claim.get("claim_text", "").strip()
if not claim.get("source_observation_ids"):
findings.append(
{
"severity": "error",
"code": "claim_missing_observation",
"target_id": claim["claim_id"],
"message": "Claim has no source observation ids.",
}
)
if not claim.get("concept_ids"):
findings.append(
{
"severity": "warning",
"code": "claim_missing_concept",
"target_id": claim["claim_id"],
"message": "Claim is not associated with any concepts.",
}
)
if claim.get("grounding_status") == "ungrounded":
findings.append(
{
"severity": "warning",
"code": "claim_ungrounded",
"target_id": claim["claim_id"],
"message": "Claim is ungrounded and should not be promoted directly.",
}
)
if claim_text and text_counter[claim_text.lower()] > 1:
findings.append(
{
"severity": "warning",
"code": "duplicate_claim_text",
"target_id": claim["claim_id"],
"message": "Claim text duplicates another imported claim.",
}
)
for obs_id in claim.get("source_observation_ids", []):
if obs_id not in observation_by_id:
findings.append(
{
"severity": "error",
"code": "claim_observation_missing",
"target_id": claim["claim_id"],
"message": f"Claim references missing observation {obs_id}.",
}
)
for target_claim_id in claim.get("contradicts_claim_ids", []):
if target_claim_id not in claim_ids:
findings.append(
{
"severity": "warning",
"code": "unresolved_contradiction_ref",
"target_id": claim["claim_id"],
"message": f"Claim references missing contradiction target {target_claim_id}.",
}
)
for target_claim_id in claim.get("supersedes_claim_ids", []):
if target_claim_id not in claim_ids:
findings.append(
{
"severity": "warning",
"code": "unresolved_supersession_ref",
"target_id": claim["claim_id"],
"message": f"Claim references missing supersession target {target_claim_id}.",
}
)
if claim.get("contradicts_claim_ids") and claim.get("supersedes_claim_ids"):
findings.append(
{
"severity": "warning",
"code": "claim_mixed_conflict_and_supersession",
"target_id": claim["claim_id"],
"message": "Claim marks both contradiction and supersession targets; review the intended relation.",
}
)
concept_sources: defaultdict[str, set[str]] = defaultdict(set)
for claim in claims:
for concept_id in claim.get("concept_ids", []):
concept_sources[concept_id].add(claim["claim_id"])
for relation in relations:
concept_sources[relation.get("source_id", "")].add(relation["relation_id"])
concept_sources[relation.get("target_id", "")].add(relation["relation_id"])
for concept in concepts:
if not concept_sources.get(concept["concept_id"]):
findings.append(
{
"severity": "warning",
"code": "orphan_concept",
"target_id": concept["concept_id"],
"message": "Concept has no connected claims or relations.",
}
)
for relation in relations:
if relation.get("source_id") not in concept_ids:
findings.append(
{
"severity": "error",
"code": "relation_missing_source",
"target_id": relation["relation_id"],
"message": f"Relation source {relation.get('source_id')} is missing.",
}
)
if relation.get("target_id") not in concept_ids:
findings.append(
{
"severity": "error",
"code": "relation_missing_target",
"target_id": relation["relation_id"],
"message": f"Relation target {relation.get('target_id')} is missing.",
}
)
for observation in observations:
role = observation.get("role")
if role == "summary" and observation.get("grounding_status") == "ungrounded":
findings.append(
{
"severity": "warning",
"code": "ungrounded_summary",
"target_id": observation["observation_id"],
"message": "Summary observation is ungrounded.",
}
)
summary = {
"artifact_count": len(artifacts),
"observation_count": len(observations),
"claim_count": len(claims),
"concept_count": len(concepts),
"relation_count": len(relations),
"error_count": sum(1 for item in findings if item["severity"] == "error"),
"warning_count": sum(1 for item in findings if item["severity"] == "warning"),
}
return {
"import_id": manifest["import_id"],
"import_mode": manifest["import_mode"],
"summary": summary,
"findings": findings,
}
def build_parser() -> argparse.ArgumentParser:
parser = argparse.ArgumentParser(description="Lint GroundRecall import artifacts.")
parser.add_argument("import_dir")
parser.add_argument("--out", default=None)
return parser
def main() -> None:
args = build_parser().parse_args()
payload = lint_import_directory(args.import_dir)
out_path = Path(args.out) if args.out else Path(args.import_dir) / "lint_findings.json"
out_path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
print(f"Wrote {out_path}")

137
src/groundrecall/models.py Normal file
View File

@ -0,0 +1,137 @@
from __future__ import annotations
from typing import Literal
from pydantic import BaseModel, Field
LifecycleStatus = Literal["draft", "triaged", "reviewed", "promoted", "superseded", "archived", "rejected"]
GroundingStatus = Literal["grounded", "partially_grounded", "ungrounded"]
SupportKind = Literal["direct_source", "derived_from_page", "derived_from_session", "inferred", "unknown"]
class ProvenanceRecord(BaseModel):
origin_artifact_id: str = ""
origin_path: str = ""
origin_section: str = ""
source_url: str = ""
retrieval_date: str = ""
machine_id: str = ""
session_id: str = ""
support_kind: SupportKind = "unknown"
grounding_status: GroundingStatus = "ungrounded"
class SourceRecord(BaseModel):
source_id: str
title: str = ""
source_type: str = "document"
path: str = ""
url: str = ""
retrieved_at: str = ""
metadata: dict = Field(default_factory=dict)
current_status: LifecycleStatus = "draft"
class FragmentRecord(BaseModel):
fragment_id: str
source_id: str
text: str
section: str = ""
line_start: int = 0
line_end: int = 0
metadata: dict = Field(default_factory=dict)
current_status: LifecycleStatus = "draft"
class ArtifactRecord(BaseModel):
artifact_id: str
artifact_kind: str
title: str = ""
path: str = ""
sha256: str = ""
created_at: str = ""
metadata: dict = Field(default_factory=dict)
current_status: LifecycleStatus = "draft"
class ObservationRecord(BaseModel):
observation_id: str
artifact_id: str = ""
role: str
text: str
provenance: ProvenanceRecord = Field(default_factory=ProvenanceRecord)
confidence_hint: float = 0.0
current_status: LifecycleStatus = "draft"
class ClaimRecord(BaseModel):
claim_id: str
claim_text: str
claim_kind: str = "statement"
source_observation_ids: list[str] = Field(default_factory=list)
supporting_fragment_ids: list[str] = Field(default_factory=list)
concept_ids: list[str] = Field(default_factory=list)
contradicts_claim_ids: list[str] = Field(default_factory=list)
supersedes_claim_ids: list[str] = Field(default_factory=list)
confidence_hint: float = 0.0
review_confidence: float = 0.0
last_confirmed_at: str = ""
provenance: ProvenanceRecord = Field(default_factory=ProvenanceRecord)
current_status: LifecycleStatus = "draft"
class ConceptRecord(BaseModel):
concept_id: str
title: str
aliases: list[str] = Field(default_factory=list)
description: str = ""
source_artifact_ids: list[str] = Field(default_factory=list)
current_status: LifecycleStatus = "draft"
class RelationRecord(BaseModel):
relation_id: str
source_id: str
target_id: str
relation_type: str
evidence_ids: list[str] = Field(default_factory=list)
provenance: ProvenanceRecord = Field(default_factory=ProvenanceRecord)
current_status: LifecycleStatus = "draft"
class ReviewCandidateRecord(BaseModel):
review_candidate_id: str
candidate_type: Literal["claim", "concept", "relation"]
candidate_id: str
triage_lane: str = "knowledge_capture"
priority: int = 50
finding_codes: list[str] = Field(default_factory=list)
rationale: str = ""
current_status: LifecycleStatus = "draft"
class PromotionRecord(BaseModel):
promotion_id: str
candidate_type: Literal["claim", "concept", "relation"]
candidate_id: str
promotion_target: str = "groundrecall_store"
verdict: Literal["approved", "rejected", "superseded"] = "approved"
reviewer: str = ""
promoted_object_ids: list[str] = Field(default_factory=list)
notes: str = ""
promoted_at: str = ""
class GroundRecallSnapshot(BaseModel):
snapshot_id: str
created_at: str
sources: list[SourceRecord] = Field(default_factory=list)
fragments: list[FragmentRecord] = Field(default_factory=list)
artifacts: list[ArtifactRecord] = Field(default_factory=list)
observations: list[ObservationRecord] = Field(default_factory=list)
claims: list[ClaimRecord] = Field(default_factory=list)
concepts: list[ConceptRecord] = Field(default_factory=list)
relations: list[RelationRecord] = Field(default_factory=list)
promotions: list[PromotionRecord] = Field(default_factory=list)
metadata: dict = Field(default_factory=dict)
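As a small sketch of the record types (all identifiers are illustrative; the id scheme just mirrors the `concept::` prefix used elsewhere in this listing):

```python
from groundrecall.models import ClaimRecord, ProvenanceRecord

claim = ClaimRecord(
    claim_id="claim::example-0001",  # hypothetical id for illustration
    claim_text="Promotion copies reviewed objects into the canonical store.",
    source_observation_ids=["obs::example-0001"],
    concept_ids=["concept::promotion"],
    provenance=ProvenanceRecord(
        origin_path="notes/overview.md",
        support_kind="direct_source",
        grounding_status="grounded",
    ),
)
print(claim.current_status)  # records start life as "draft"
```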

View File

@ -0,0 +1,250 @@
from __future__ import annotations
import argparse
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Any
from .models import (
ArtifactRecord,
ClaimRecord,
ConceptRecord,
ObservationRecord,
PromotionRecord,
ProvenanceRecord,
RelationRecord,
ReviewCandidateRecord,
)
from .review_schema import ReviewSession
from .store import GroundRecallStore
def _read_json(path: Path) -> dict[str, Any]:
return json.loads(path.read_text(encoding="utf-8"))
def _read_jsonl(path: Path) -> list[dict[str, Any]]:
if not path.exists():
return []
text = path.read_text(encoding="utf-8").strip()
if not text:
return []
return [json.loads(line) for line in text.splitlines()]
def _now() -> str:
return datetime.now(timezone.utc).replace(microsecond=0).isoformat().replace("+00:00", "Z")
def _review_status_map(status: str) -> str:
return {
"trusted": "promoted",
"provisional": "reviewed",
"rejected": "rejected",
"needs_review": "triaged",
}.get(status, "triaged")
def _provenance_from_payload(payload: dict[str, Any]) -> ProvenanceRecord:
return ProvenanceRecord(
origin_artifact_id=payload.get("origin_artifact_id", ""),
origin_path=payload.get("origin_path", ""),
origin_section=payload.get("origin_section", ""),
source_url=payload.get("source_url", ""),
retrieval_date=payload.get("retrieval_date", ""),
machine_id=payload.get("machine_id", ""),
session_id=payload.get("session_id", ""),
support_kind=payload.get("support_kind", "unknown"),
grounding_status=payload.get("grounding_status", "ungrounded"),
)
def promote_import_to_store(
import_dir: str | Path,
store_dir: str | Path,
reviewer: str | None = None,
snapshot_id: str | None = None,
) -> dict[str, Any]:
base = Path(import_dir)
manifest = _read_json(base / "manifest.json")
review_session = ReviewSession.model_validate_json((base / "review_session.json").read_text(encoding="utf-8"))
queue_payload = _read_json(base / "review_queue.json")
artifacts = _read_jsonl(base / "artifacts.jsonl")
observations = _read_jsonl(base / "observations.jsonl")
claims = _read_jsonl(base / "claims.jsonl")
concepts = _read_jsonl(base / "concepts.jsonl")
relations = _read_jsonl(base / "relations.jsonl")
store = GroundRecallStore(store_dir)
reviewed_by_concept = {entry.concept_id: entry for entry in review_session.draft_pack.concepts}
promoted_claim_ids: list[str] = []
promoted_concept_ids: list[str] = []
promoted_relation_ids: list[str] = []
for artifact in artifacts:
store.save_artifact(
ArtifactRecord(
artifact_id=artifact["artifact_id"],
artifact_kind=artifact["artifact_kind"],
title=artifact.get("title", ""),
path=artifact.get("path", ""),
sha256=artifact.get("sha256", ""),
created_at=artifact.get("created_at", ""),
metadata=dict(artifact.get("metadata", {})),
current_status="reviewed",
)
)
for observation in observations:
store.save_observation(
ObservationRecord(
observation_id=observation["observation_id"],
artifact_id=observation.get("artifact_id", ""),
role=observation.get("role", "summary"),
text=observation.get("text", ""),
provenance=_provenance_from_payload(observation),
confidence_hint=float(observation.get("confidence_hint", 0.0)),
current_status="reviewed",
)
)
for concept in concepts:
short_id = concept["concept_id"].replace("concept::", "", 1)
review_entry = reviewed_by_concept.get(short_id)
# Only reviewer verdicts pass through the status map; unreviewed concepts keep their imported status.
current_status = _review_status_map(review_entry.status) if review_entry else concept.get("current_status", "triaged")
record = store.save_concept(
ConceptRecord(
concept_id=concept["concept_id"],
title=review_entry.title if review_entry else concept.get("title", concept["concept_id"]),
aliases=list(concept.get("aliases", [])),
description=review_entry.description if review_entry else concept.get("description", ""),
source_artifact_ids=list(concept.get("source_artifact_ids", [])),
current_status=current_status, # type: ignore[arg-type]
)
)
if record.current_status in {"promoted", "reviewed"}:
promoted_concept_ids.append(record.concept_id)
reviewed_concept_ids = set(promoted_concept_ids)
for claim in claims:
concept_ids = list(claim.get("concept_ids", []))
statuses = []
for concept_id in concept_ids:
short_id = concept_id.replace("concept::", "", 1)
review_entry = reviewed_by_concept.get(short_id)
statuses.append(_review_status_map(review_entry.status) if review_entry else "triaged")
if statuses and all(status == "rejected" for status in statuses):
current_status = "rejected"
elif statuses and any(status == "promoted" for status in statuses):
current_status = "promoted"
elif statuses and any(status == "reviewed" for status in statuses):
current_status = "reviewed"
else:
current_status = "triaged"
record = store.save_claim(
ClaimRecord(
claim_id=claim["claim_id"],
claim_text=claim.get("claim_text", ""),
claim_kind=claim.get("claim_kind", "statement"),
source_observation_ids=list(claim.get("source_observation_ids", [])),
supporting_fragment_ids=list(claim.get("supporting_fragment_ids", [])),
concept_ids=concept_ids,
contradicts_claim_ids=list(claim.get("contradicts_claim_ids", [])),
supersedes_claim_ids=list(claim.get("supersedes_claim_ids", [])),
confidence_hint=float(claim.get("confidence_hint", 0.0)),
review_confidence=float(claim.get("review_confidence", 0.0)),
last_confirmed_at=claim.get("last_confirmed_at", ""),
provenance=_provenance_from_payload(claim),
current_status=current_status, # type: ignore[arg-type]
)
)
if record.current_status in {"promoted", "reviewed"}:
promoted_claim_ids.append(record.claim_id)
for relation in relations:
src_ok = relation.get("source_id") in reviewed_concept_ids
tgt_ok = relation.get("target_id") in reviewed_concept_ids
current_status = "promoted" if src_ok and tgt_ok else "triaged"
record = store.save_relation(
RelationRecord(
relation_id=relation["relation_id"],
source_id=relation.get("source_id", ""),
target_id=relation.get("target_id", ""),
relation_type=relation.get("relation_type", "references"),
evidence_ids=list(relation.get("evidence_ids", [])),
provenance=_provenance_from_payload(relation),
current_status=current_status, # type: ignore[arg-type]
)
)
if record.current_status in {"promoted", "reviewed"}:
promoted_relation_ids.append(record.relation_id)
# Build the promoted-id set once rather than per queue item.
promoted_ids = set(promoted_concept_ids + promoted_claim_ids + promoted_relation_ids)
for item in queue_payload.get("items", []):
store.save_review_candidate(
ReviewCandidateRecord(
review_candidate_id=item["queue_id"],
candidate_type=item["candidate_type"],
candidate_id=item["candidate_id"],
triage_lane=item.get("triage_lane", "knowledge_capture"),
priority=int(item.get("priority", 50)),
finding_codes=list(item.get("finding_codes", [])),
rationale=item.get("title", ""),
current_status="reviewed" if item["candidate_id"] in promoted_ids else "triaged",
)
)
promotion = store.save_promotion(
PromotionRecord(
promotion_id=f"promotion-{manifest['import_id']}",
candidate_type="concept",
candidate_id=manifest["import_id"],
promotion_target="groundrecall_store",
verdict="approved",
reviewer=reviewer or review_session.reviewer,
promoted_object_ids=promoted_concept_ids + promoted_claim_ids + promoted_relation_ids,
notes=f"Promoted import {manifest['import_id']} into GroundRecallStore.",
promoted_at=_now(),
)
)
built_snapshot = store.build_snapshot(
snapshot_id=snapshot_id or f"snapshot-{manifest['import_id']}",
created_at=_now(),
metadata={
"source_import_id": manifest["import_id"],
"reviewer": reviewer or review_session.reviewer,
"export_kind": "canonical",
},
)
store.save_snapshot(built_snapshot)
return {
"import_id": manifest["import_id"],
"store_dir": str(Path(store_dir)),
"promotion_id": promotion.promotion_id,
"promoted_concept_count": len(promoted_concept_ids),
"promoted_claim_count": len(promoted_claim_ids),
"promoted_relation_count": len(promoted_relation_ids),
"snapshot_id": built_snapshot.snapshot_id,
}
def build_parser() -> argparse.ArgumentParser:
parser = argparse.ArgumentParser(description="Promote a GroundRecall import into canonical store objects.")
parser.add_argument("import_dir")
parser.add_argument("store_dir")
parser.add_argument("--reviewer", default=None)
parser.add_argument("--snapshot-id", default=None)
return parser
def main() -> None:
args = build_parser().parse_args()
payload = promote_import_to_store(
import_dir=args.import_dir,
store_dir=args.store_dir,
reviewer=args.reviewer,
snapshot_id=args.snapshot_id,
)
print(json.dumps(payload, indent=2))
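A promotion sketch, assuming this module is `groundrecall.promotion` (per the package surface) and that the review workbench has already written `review_session.json` into the import directory:

```python
from groundrecall.promotion import promote_import_to_store

summary = promote_import_to_store(
    import_dir="imports/my-notes-20250101T000000Z",  # hypothetical import id
    store_dir="store/",
    reviewer="reviewer@example",                     # overrides the session reviewer
)
print(summary["snapshot_id"], summary["promoted_claim_count"])
```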

188
src/groundrecall/query.py Normal file
View File

@ -0,0 +1,188 @@
from __future__ import annotations
import argparse
import json
from pathlib import Path
from typing import Any
from .store import GroundRecallStore
def _normalize(text: str) -> str:
return " ".join(text.lower().split())
def _matches(query: str, *values: str) -> bool:
needle = _normalize(query)
return any(needle in _normalize(value) for value in values if value)
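# Example: _matches("Canonical Store", "the canonical store layout") is True;
# matching is case-insensitive, whitespace-normalized substring containment.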
def query_concept(store_dir: str | Path, concept_ref: str) -> dict[str, Any] | None:
store = GroundRecallStore(store_dir)
concepts = store.list_concepts()
concept = next(
(
item
for item in concepts
if concept_ref == item.concept_id
or concept_ref == item.concept_id.replace("concept::", "", 1)
or _matches(concept_ref, item.title, item.description, *item.aliases)
),
None,
)
if concept is None:
return None
claims = [item for item in store.list_claims() if concept.concept_id in item.concept_ids and item.current_status != "rejected"]
relations = [
item
for item in store.list_relations()
if (item.source_id == concept.concept_id or item.target_id == concept.concept_id) and item.current_status != "rejected"
]
artifacts = {item.artifact_id: item for item in store.list_artifacts()}
observations = {item.observation_id: item for item in store.list_observations()}
supporting_observations = []
for claim in claims:
for observation_id in claim.source_observation_ids:
observation = observations.get(observation_id)
if observation is not None:
supporting_observations.append(
{
"observation_id": observation.observation_id,
"text": observation.text,
"role": observation.role,
"origin_path": observation.provenance.origin_path,
"grounding_status": observation.provenance.grounding_status,
}
)
related_concept_ids = sorted(
{
relation.target_id if relation.source_id == concept.concept_id else relation.source_id
for relation in relations
if relation.source_id != relation.target_id
}
)
related_concepts = [item.model_dump() for item in concepts if item.concept_id in related_concept_ids]
source_artifacts = [
artifact.model_dump()
for artifact in artifacts.values()
if artifact.artifact_id in set(concept.source_artifact_ids)
]
return {
"query_type": "concept",
"concept": concept.model_dump(),
"claims": [item.model_dump() for item in claims],
"relations": [item.model_dump() for item in relations],
"related_concepts": related_concepts,
"supporting_observations": supporting_observations,
"source_artifacts": source_artifacts,
}
def search_claims(
store_dir: str | Path,
text: str,
include_rejected: bool = False,
limit: int = 20,
) -> dict[str, Any]:
store = GroundRecallStore(store_dir)
concepts = {item.concept_id: item for item in store.list_concepts()}
matches = []
for claim in store.list_claims():
if not include_rejected and claim.current_status == "rejected":
continue
concept_titles = [concepts[concept_id].title for concept_id in claim.concept_ids if concept_id in concepts]
if _matches(text, claim.claim_text, *concept_titles):
matches.append(
{
"claim": claim.model_dump(),
"concept_titles": concept_titles,
"provenance": claim.provenance.model_dump(),
}
)
if len(matches) >= limit:
break
return {
"query_type": "claim_search",
"query": text,
"matches": matches,
}
def query_provenance(
store_dir: str | Path,
origin_path: str | None = None,
source_url: str | None = None,
) -> dict[str, Any]:
store = GroundRecallStore(store_dir)
claims = []
observations = []
for claim in store.list_claims():
if origin_path and claim.provenance.origin_path == origin_path:
claims.append(claim.model_dump())
continue
if source_url and claim.provenance.source_url == source_url:
claims.append(claim.model_dump())
for observation in store.list_observations():
if origin_path and observation.provenance.origin_path == origin_path:
observations.append(observation.model_dump())
continue
if source_url and observation.provenance.source_url == source_url:
observations.append(observation.model_dump())
return {
"query_type": "provenance",
"origin_path": origin_path or "",
"source_url": source_url or "",
"claims": claims,
"observations": observations,
}
def build_query_bundle_for_concept(store_dir: str | Path, concept_ref: str) -> dict[str, Any] | None:
payload = query_concept(store_dir, concept_ref)
if payload is None:
return None
claims = payload["claims"]
contradictions = [item for item in claims if item.get("contradicts_claim_ids")]
supersessions = [item for item in claims if item.get("supersedes_claim_ids")]
return {
"bundle_kind": "groundrecall_query_bundle",
"query_type": "concept",
"concept": payload["concept"],
"relevant_claims": claims,
"supporting_observations": payload["supporting_observations"],
"related_concepts": payload["related_concepts"],
"contradictions": contradictions,
"supersessions": supersessions,
"suggested_next_actions": [
"Review promoted claims with low review confidence.",
"Inspect supporting observations before exporting assistant context.",
"Check related concepts for hidden prerequisite or contradiction edges.",
],
}
def build_parser() -> argparse.ArgumentParser:
parser = argparse.ArgumentParser(description="Query canonical GroundRecall objects.")
parser.add_argument("store_dir")
parser.add_argument("query")
parser.add_argument("--kind", choices=["concept", "claim", "provenance", "bundle"], default="concept")
parser.add_argument("--source-url", default=None)
return parser
def main() -> None:
args = build_parser().parse_args()
if args.kind == "concept":
payload = query_concept(args.store_dir, args.query)
elif args.kind == "claim":
payload = search_claims(args.store_dir, args.query)
elif args.kind == "provenance":
payload = query_provenance(args.store_dir, origin_path=args.query, source_url=args.source_url)
else:
payload = build_query_bundle_for_concept(args.store_dir, args.query)
print(json.dumps(payload, indent=2))
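A query sketch against an existing store (the concept reference and search text are illustrative):

```python
from groundrecall.query import build_query_bundle_for_concept, search_claims

bundle = build_query_bundle_for_concept("store/", "promotion")  # hypothetical concept ref
if bundle is not None:
    print(len(bundle["relevant_claims"]), "claims;", len(bundle["contradictions"]), "with contradiction refs")

hits = search_claims("store/", "canonical store", limit=5)
for match in hits["matches"]:
    print(match["claim"]["claim_id"], match["concept_titles"])
```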

View File

@ -0,0 +1,366 @@
const state = {
reviewData: null,
selectedConceptId: null,
selectedCitationId: null,
conceptSearch: "",
citationFilter: "all",
message: "",
verificationResult: null,
};
function escapeHtml(value) {
return String(value ?? "")
.replaceAll("&", "&amp;")
.replaceAll("<", "&lt;")
.replaceAll(">", "&gt;")
.replaceAll('"', "&quot;");
}
function splitLines(value) {
return String(value || "")
.split("\n")
.map((line) => line.trim())
.filter(Boolean);
}
function conceptRows() {
const data = state.reviewData;
if (!data) return [];
const reviewById = new Map((data.concept_reviews || []).map((item) => [item.concept_id, item]));
return (data.draft_pack?.concepts || []).map((concept) => ({
...concept,
review: reviewById.get(concept.concept_id) || null,
}));
}
function citationRows() {
return state.reviewData?.citation_reviews || [];
}
function selectedConcept() {
return conceptRows().find((item) => item.concept_id === state.selectedConceptId) || conceptRows()[0] || null;
}
function selectedCitation() {
return citationRows().find((item) => item.citation_review_id === state.selectedCitationId) || citationRows()[0] || null;
}
async function loadReviewData() {
const response = await fetch("/api/load");
const payload = await response.json();
state.reviewData = payload.review_data;
if (!state.selectedConceptId && conceptRows()[0]) {
state.selectedConceptId = conceptRows()[0].concept_id;
}
if (!state.selectedCitationId && citationRows()[0]) {
state.selectedCitationId = citationRows()[0].citation_review_id;
}
render();
}
async function saveConcept(form) {
const payload = {
concept_updates: [
{
concept_id: form.get("concept_id"),
status: form.get("status"),
description: form.get("description"),
prerequisites: splitLines(form.get("prerequisites")),
notes: splitLines(form.get("notes")),
},
],
};
const response = await fetch("/api/save", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify(payload),
});
const result = await response.json();
state.reviewData = result.review_data;
state.message = `Saved concept ${payload.concept_updates[0].concept_id}.`;
render();
}
async function saveCitation(form) {
const payload = {
citation_updates: [
{
citation_review_id: form.get("citation_review_id"),
status: form.get("status"),
notes: splitLines(form.get("notes")),
},
],
};
const response = await fetch("/api/save", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify(payload),
});
const result = await response.json();
state.reviewData = result.review_data;
state.message = `Saved citation review ${payload.citation_updates[0].citation_review_id}.`;
render();
}
async function verifyCitation(citationReviewId) {
const response = await fetch("/api/citations/verify", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({ citation_review_id: citationReviewId }),
});
state.verificationResult = await response.json();
state.message = `Verification run for ${citationReviewId}.`;
render();
}
function statusOptions(specs, selectedValue) {
return (specs?.options || [])
.map((option) => `<option value="${escapeHtml(option.value)}"${option.value === selectedValue ? " selected" : ""}>${escapeHtml(option.label)}</option>`)
.join("");
}
function renderConceptPanel(concept) {
if (!concept) {
return `<section class="panel"><h2>No concept selected</h2></section>`;
}
const review = concept.review || {};
const statusSpec = (state.reviewData.field_specs || []).find((item) => item.field === "status");
const guidance = (state.reviewData.review_guidance?.priorities || []).map((item) => `<li>${escapeHtml(item)}</li>`).join("");
const claims = (review.top_claims || []).map((claim) => `
<article class="claim-card">
<div class="claim-head">
<strong>${escapeHtml(claim.claim_kind || "claim")}</strong>
<span class="chip">${escapeHtml(claim.grounding_status || "unknown")}</span>
</div>
<p>${escapeHtml(claim.claim_text || "")}</p>
<div class="tiny">Artifacts: ${escapeHtml((claim.artifact_paths || []).join(", ") || "none")}</div>
${(claim.supporting_observations || []).slice(0, 2).map((obs) => `
<div class="support-block">
<div class="tiny">${escapeHtml(obs.origin_path || "")}${obs.line_start ? `:${obs.line_start}` : ""}</div>
<div>${escapeHtml(obs.text || "")}</div>
</div>
`).join("")}
</article>
`).join("");
return `
<section class="panel detail">
<div class="panel-head">
<div>
<h2>${escapeHtml(concept.title)}</h2>
<div class="muted">${escapeHtml(concept.concept_id)} · claims ${escapeHtml(review.claim_count || 0)} · grounded ${escapeHtml(review.grounded_claim_count || 0)} · warnings ${escapeHtml(review.warning_count || 0)}</div>
</div>
<div class="pill ${review.has_citation_support ? "pill-good" : "pill-warn"}">${review.has_citation_support ? "citation-bearing" : "no citation support"}</div>
</div>
<p class="help">${escapeHtml(review.review_help || "")}</p>
<form id="concept-form">
<input type="hidden" name="concept_id" value="${escapeHtml(concept.concept_id)}" />
<label>
<span>Review status</span>
<select name="status">${statusOptions(statusSpec, concept.status)}</select>
</label>
<label>
<span>Description</span>
<textarea name="description" rows="3">${escapeHtml(concept.description || "")}</textarea>
</label>
<label>
<span>Prerequisites</span>
<textarea name="prerequisites" rows="3">${escapeHtml((concept.prerequisites || []).join("\n"))}</textarea>
</label>
<label>
<span>Reviewer notes</span>
<textarea name="notes" rows="5">${escapeHtml((concept.notes || []).join("\n"))}</textarea>
</label>
<div class="actions">
<button type="submit" class="primary">Save Concept Review</button>
</div>
</form>
<section class="subpanel">
<h3>Reviewer guidance</h3>
<ul>${guidance}</ul>
</section>
<section class="subpanel">
<h3>Representative claims</h3>
<div class="stack">${claims || "<div class=\"muted\">No representative claims available.</div>"}</div>
</section>
</section>
`;
}
function renderCitationPanel(citation) {
const statusSpec = (state.reviewData.citation_field_specs || []).find((item) => item.field === "status");
const nextActions = (state.reviewData.citations?.next_actions || []).map((item) => `<li>${escapeHtml(item)}</li>`).join("");
if (!citation) {
return `<section class="panel"><h2>No citation selected</h2></section>`;
}
return `
<section class="panel detail">
<div class="panel-head">
<div>
<h2>Citation lane</h2>
<div class="muted">${escapeHtml(citation.source_kind)} · ${escapeHtml(citation.artifact_path || citation.locator || "")}</div>
</div>
<div class="pill">${escapeHtml(citation.status)}</div>
</div>
<form id="citation-form">
<input type="hidden" name="citation_review_id" value="${escapeHtml(citation.citation_review_id)}" />
<label>
<span>Status</span>
<select name="status">${statusOptions(statusSpec, citation.status)}</select>
</label>
<label>
<span>Citation key</span>
<input value="${escapeHtml(citation.citation_key || "")}" disabled />
</label>
<label>
<span>Reference title</span>
<input value="${escapeHtml(citation.title || "")}" disabled />
</label>
<label>
<span>Bibliography source</span>
<input value="${escapeHtml(citation.source_bib_path || "")}" disabled />
</label>
<label>
<span>Reviewer notes</span>
<textarea name="notes" rows="5">${escapeHtml((citation.notes || []).join("\n"))}</textarea>
</label>
<div class="tiny">Related concepts: ${escapeHtml((citation.related_concept_ids || []).join(", ") || "none")}</div>
<div class="tiny">Related claims: ${escapeHtml((citation.related_claim_ids || []).join(", ") || "none")}</div>
<div class="actions">
<button type="button" id="verify-citation" class="secondary">Verify With CiteGeist</button>
<button type="submit" class="primary">Save Citation Review</button>
</div>
</form>
<section class="subpanel">
<h3>Citation guidance</h3>
<ul>${(state.reviewData.review_guidance?.citation_guidance || []).map((item) => `<li>${escapeHtml(item)}</li>`).join("")}</ul>
</section>
<section class="subpanel">
<h3>Next actions</h3>
<ul>${nextActions}</ul>
</section>
<section class="subpanel">
<h3>Verification</h3>
${
state.verificationResult && state.verificationResult.citation_review_id === citation.citation_review_id
? `<pre class="json-block">${escapeHtml(JSON.stringify(state.verificationResult, null, 2))}</pre>`
: `<div class="muted">Run CiteGeist verification to inspect the stored entry and candidate matches.</div>`
}
</section>
</section>
`;
}
function render() {
const app = document.getElementById("app");
if (!state.reviewData) {
app.innerHTML = `<main class="shell"><section class="panel"><h1>Loading review data…</h1></section></main>`;
return;
}
const summary = state.reviewData.import_context?.manifest || {};
const conceptList = conceptRows().filter((item) => {
const needle = state.conceptSearch.trim().toLowerCase();
return !needle || item.title.toLowerCase().includes(needle) || item.concept_id.toLowerCase().includes(needle);
});
const citationList = citationRows().filter((item) => {
if (state.citationFilter === "all") return true;
return item.status === state.citationFilter;
});
const concept = selectedConcept();
const citation = selectedCitation();
app.innerHTML = `
<main class="shell">
<header class="hero">
<div>
<h1>GroundRecall Review Workbench</h1>
<p>Concept-first review with a dedicated citation lane for academic imports.</p>
<div class="muted">${escapeHtml(summary.import_id || "")} · ${escapeHtml(summary.source_root || "")}</div>
${state.message ? `<div class="message">${escapeHtml(state.message)}</div>` : ""}
</div>
<div class="hero-stats">
<div class="stat"><strong>${escapeHtml(summary.artifact_count || 0)}</strong><span>artifacts</span></div>
<div class="stat"><strong>${escapeHtml(summary.claim_count || 0)}</strong><span>claims</span></div>
<div class="stat"><strong>${escapeHtml(summary.concept_count || 0)}</strong><span>concepts</span></div>
<div class="stat"><strong>${escapeHtml(state.reviewData.citations?.summary?.citation_key_total || 0)}</strong><span>citation keys</span></div>
</div>
</header>
<section class="workspace-grid">
<aside class="panel list-panel">
<div class="panel-head"><h2>Concepts</h2></div>
<label class="search">
<span>Search</span>
<input id="concept-search" value="${escapeHtml(state.conceptSearch)}" />
</label>
<div class="stack">
${conceptList.map((item) => `
<button class="list-item ${item.concept_id === concept?.concept_id ? "active" : ""}" data-concept-id="${escapeHtml(item.concept_id)}">
<strong>${escapeHtml(item.title)}</strong>
<span>${escapeHtml(item.status)}</span>
</button>
`).join("")}
</div>
</aside>
${renderConceptPanel(concept)}
</section>
<section class="workspace-grid">
<aside class="panel list-panel">
<div class="panel-head"><h2>Citation lane</h2></div>
<label class="search">
<span>Filter</span>
<select id="citation-filter">
${["all", "unreviewed", "verified", "needs_source_check", "misleading", "irrelevant", "fabricated"].map((value) => `<option value="${value}"${value === state.citationFilter ? " selected" : ""}>${value}</option>`).join("")}
</select>
</label>
<div class="stack">
${citationList.map((item) => `
<button class="list-item ${item.citation_review_id === citation?.citation_review_id ? "active" : ""}" data-citation-id="${escapeHtml(item.citation_review_id)}">
<strong>${escapeHtml(item.citation_key || item.title || item.citation_review_id)}</strong>
<span>${escapeHtml(item.status)}</span>
</button>
`).join("")}
</div>
</aside>
${renderCitationPanel(citation)}
</section>
</main>
`;
document.querySelectorAll("[data-concept-id]").forEach((node) => {
node.addEventListener("click", () => {
state.selectedConceptId = node.getAttribute("data-concept-id");
render();
});
});
document.querySelectorAll("[data-citation-id]").forEach((node) => {
node.addEventListener("click", () => {
state.selectedCitationId = node.getAttribute("data-citation-id");
render();
});
});
document.getElementById("concept-search")?.addEventListener("input", (event) => {
state.conceptSearch = event.target.value;
render();
});
document.getElementById("citation-filter")?.addEventListener("change", (event) => {
state.citationFilter = event.target.value;
render();
});
document.getElementById("concept-form")?.addEventListener("submit", async (event) => {
event.preventDefault();
await saveConcept(new FormData(event.target));
});
document.getElementById("citation-form")?.addEventListener("submit", async (event) => {
event.preventDefault();
await saveCitation(new FormData(event.target));
});
document.getElementById("verify-citation")?.addEventListener("click", async () => {
if (state.selectedCitationId) {
await verifyCitation(state.selectedCitationId);
}
});
}
loadReviewData();
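For reference, the JSON payloads this client posts to `/api/save` have the shapes below. This is a sketch derived from `saveConcept()` and `saveCitation()` above; the field names come from the client code, the values are illustrative:

```python
# Payload shapes inferred from the workbench client; values are examples only.
concept_save_payload = {
    "concept_updates": [
        {
            "concept_id": "example-concept",  # hypothetical id
            "status": "trusted",
            "description": "Reviewed description text.",
            "prerequisites": ["another-concept"],
            "notes": ["Reviewer note."],
        }
    ]
}

citation_save_payload = {
    "citation_updates": [
        {
            "citation_review_id": "citrev-0123456789ab",  # hypothetical id
            "status": "verified",
            "notes": ["Checked against the cited source."],
        }
    ]
}
```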

View File

@ -0,0 +1,13 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>GroundRecall Review Workbench</title>
<link rel="stylesheet" href="/styles.css" />
</head>
<body>
<div id="app"></div>
<script src="/app.js"></script>
</body>
</html>

View File

@ -0,0 +1,248 @@
:root {
--bg: #f4f0e8;
--panel: rgba(255, 251, 245, 0.92);
--panel-strong: #fffdfa;
--ink: #1f2933;
--muted: #5f6c76;
--accent: #0f766e;
--accent-strong: #134e4a;
--warn: #9a3412;
--line: rgba(31, 41, 51, 0.12);
--shadow: 0 20px 60px rgba(31, 41, 51, 0.08);
}
* {
box-sizing: border-box;
}
body {
margin: 0;
color: var(--ink);
background:
radial-gradient(circle at top right, rgba(15, 118, 110, 0.14), transparent 28%),
radial-gradient(circle at top left, rgba(154, 52, 18, 0.08), transparent 24%),
linear-gradient(180deg, #fbf8f1 0%, var(--bg) 100%);
font-family: "Iowan Old Style", "Palatino Linotype", "Book Antiqua", serif;
}
button,
input,
select,
textarea {
font: inherit;
}
.shell {
width: min(1500px, calc(100vw - 32px));
margin: 24px auto 48px;
}
.hero,
.panel {
background: var(--panel);
border: 1px solid var(--line);
border-radius: 20px;
box-shadow: var(--shadow);
}
.hero {
display: grid;
grid-template-columns: 1.6fr 1fr;
gap: 20px;
padding: 24px;
margin-bottom: 18px;
}
.hero h1,
.panel h2,
.panel h3 {
margin: 0 0 8px;
font-family: Georgia, "Times New Roman", serif;
}
.hero-stats {
display: grid;
grid-template-columns: repeat(2, minmax(0, 1fr));
gap: 12px;
}
.stat {
padding: 14px;
border-radius: 16px;
background: var(--panel-strong);
border: 1px solid var(--line);
}
.stat strong {
display: block;
font-size: 1.8rem;
}
.stat span,
.muted,
.tiny,
.help {
color: var(--muted);
}
.message {
margin-top: 10px;
color: var(--accent-strong);
}
.workspace-grid {
display: grid;
grid-template-columns: minmax(280px, 360px) 1fr;
gap: 18px;
margin-bottom: 18px;
}
.panel {
padding: 18px;
}
.panel-head {
display: flex;
justify-content: space-between;
align-items: flex-start;
gap: 12px;
margin-bottom: 12px;
}
.list-panel {
max-height: 78vh;
overflow: auto;
}
.search,
label {
display: grid;
gap: 6px;
margin-bottom: 12px;
}
.stack {
display: grid;
gap: 10px;
}
.list-item,
.primary,
.secondary {
border: 1px solid var(--line);
border-radius: 14px;
background: var(--panel-strong);
}
.list-item {
display: grid;
gap: 4px;
width: 100%;
padding: 12px;
text-align: left;
cursor: pointer;
}
.list-item.active {
border-color: rgba(15, 118, 110, 0.45);
background: rgba(15, 118, 110, 0.08);
}
input,
select,
textarea {
width: 100%;
padding: 10px 12px;
border: 1px solid var(--line);
border-radius: 12px;
background: #fff;
}
textarea {
resize: vertical;
}
.actions {
display: flex;
justify-content: flex-end;
}
.primary {
padding: 10px 16px;
color: white;
background: linear-gradient(135deg, var(--accent) 0%, var(--accent-strong) 100%);
cursor: pointer;
}
.secondary {
padding: 10px 16px;
color: var(--accent-strong);
cursor: pointer;
}
.subpanel {
margin-top: 16px;
padding-top: 14px;
border-top: 1px solid var(--line);
}
.claim-card,
.support-block {
padding: 12px;
border-radius: 14px;
background: var(--panel-strong);
border: 1px solid var(--line);
}
.claim-head {
display: flex;
justify-content: space-between;
gap: 12px;
margin-bottom: 8px;
}
.chip,
.pill {
display: inline-flex;
align-items: center;
padding: 4px 10px;
border-radius: 999px;
border: 1px solid var(--line);
background: #fff;
font-size: 0.85rem;
}
.pill-good {
color: var(--accent-strong);
}
.pill-warn {
color: var(--warn);
}
ul {
margin: 0;
padding-left: 20px;
}
.json-block {
padding: 12px;
overflow: auto;
border-radius: 14px;
background: #f8f7f3;
border: 1px solid var(--line);
font-family: "SFMono-Regular", Consolas, "Liberation Mono", monospace;
font-size: 0.9rem;
white-space: pre-wrap;
}
@media (max-width: 980px) {
.hero,
.workspace-grid {
grid-template-columns: 1fr;
}
.list-panel {
max-height: none;
}
}

View File

@ -0,0 +1,439 @@
from __future__ import annotations
import hashlib
import json
import re
import sys
from collections import defaultdict
from pathlib import Path
from typing import Any, Callable
import yaml
from .citation_support import bibliography_summary_payload, load_bibliography_index, serialize_bib_entry
from .review_schema import CitationReviewEntry, ReviewSession
def export_review_state_json(session: ReviewSession, path: str | Path) -> None:
Path(path).write_text(session.model_dump_json(indent=2), encoding="utf-8")
def export_promoted_pack(session: ReviewSession, outdir: str | Path) -> None:
outdir = Path(outdir)
outdir.mkdir(parents=True, exist_ok=True)
promoted_pack = dict(session.draft_pack.pack)
promoted_pack["version"] = str(promoted_pack.get("version", "0.1.0-draft")).replace("-draft", "-reviewed")
promoted_pack["curation"] = {"reviewer": session.reviewer, "ledger_entries": len(session.ledger)}
concepts = []
for concept in session.draft_pack.concepts:
if concept.status == "rejected":
continue
concepts.append({
"id": concept.concept_id,
"title": concept.title,
"description": concept.description,
"prerequisites": concept.prerequisites,
"mastery_signals": concept.mastery_signals,
"status": concept.status,
"notes": concept.notes,
"mastery_profile": {},
})
(outdir / "pack.yaml").write_text(yaml.safe_dump(promoted_pack, sort_keys=False), encoding="utf-8")
(outdir / "concepts.yaml").write_text(yaml.safe_dump({"concepts": concepts}, sort_keys=False), encoding="utf-8")
(outdir / "review_ledger.json").write_text(json.dumps(session.model_dump(), indent=2), encoding="utf-8")
(outdir / "license_attribution.json").write_text(json.dumps(session.draft_pack.attribution, indent=2), encoding="utf-8")
def export_promoted_pack_to_course_repo(session: ReviewSession, course_repo: str | Path, outdir: str | Path | None = None) -> Path:
from .course_repo import resolve_course_repo
resolved = resolve_course_repo(course_repo)
target = Path(outdir) if outdir is not None else Path(resolved.generated_pack_dir or (Path(resolved.repo_root) / "generated" / "pack"))
export_promoted_pack(session, target)
return target
LATEX_CITE_RE = re.compile(r"\\cite[a-zA-Z*]*(?:\[[^\]]*\])?(?:\[[^\]]*\])?\{([^}]+)\}")
def _read_json(path: Path) -> dict[str, Any]:
return json.loads(path.read_text(encoding="utf-8"))
def _read_jsonl(path: Path) -> list[dict[str, Any]]:
text = path.read_text(encoding="utf-8").strip()
if not text:
return []
return [json.loads(line) for line in text.splitlines()]
def _status_field_spec() -> dict[str, Any]:
return {
"field": "status",
"label": "Review status",
"input": "select",
"required": True,
"options": [
{
"value": "trusted",
"label": "Trusted",
"help": "Promote this concept and its supported claims when the evidence and wording are ready.",
},
{
"value": "provisional",
"label": "Provisional",
"help": "Keep this concept in reviewed state when it is promising but still needs citation or wording cleanup.",
},
{
"value": "needs_review",
"label": "Needs Review",
"help": "Leave undecided when support, scope, or concept boundaries are still unclear.",
},
{
"value": "rejected",
"label": "Rejected",
"help": "Exclude this concept when it is noise, unsupported, duplicated, or misleading.",
},
],
}
def _text_field_spec(field: str, label: str, help_text: str, *, multiline: bool = False) -> dict[str, Any]:
return {
"field": field,
"label": label,
"input": "textarea" if multiline else "text",
"required": False,
"help": help_text,
}
def _citation_status_field_spec() -> dict[str, Any]:
return {
"field": "status",
"label": "Citation review status",
"input": "select",
"required": True,
"options": [
{
"value": "unreviewed",
"label": "Unreviewed",
"help": "Keep this citation candidate in triage until fit and existence are checked.",
},
{
"value": "verified",
"label": "Verified",
"help": "The cited work exists and materially supports the associated manuscript claim.",
},
{
"value": "needs_source_check",
"label": "Needs Source Check",
"help": "The citation may be useful but still needs direct source inspection or metadata cleanup.",
},
{
"value": "misleading",
"label": "Misleading",
"help": "The citation exists but overstates, contradicts, or poorly fits the claim.",
},
{
"value": "irrelevant",
"label": "Irrelevant",
"help": "The citation does not materially support the concept or claim under review.",
},
{
"value": "fabricated",
"label": "Fabricated",
"help": "The citation appears invented, malformed, or otherwise not real.",
},
],
}
def _load_citegeist_extract() -> tuple[Callable[[str], list[Any]] | None, list[str]]:
# Optional integration: import CiteGeist's extraction API from a local checkout when present.
# Guard the sys.path insert so repeated calls do not stack duplicate entries.
citegeist_src = Path("/home/netuser/bin/CiteGeist/src")
if citegeist_src.exists() and str(citegeist_src) not in sys.path:
sys.path.insert(0, str(citegeist_src))
try:
from citegeist import available_extraction_backends, extract_references # type: ignore
except Exception:
return None, []
return extract_references, list(available_extraction_backends())
def _extract_citation_keys(text: str) -> list[str]:
keys: list[str] = []
for raw_group in LATEX_CITE_RE.findall(text):
keys.extend(part.strip() for part in raw_group.split(",") if part.strip())
return sorted(set(keys))
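# Example: _extract_citation_keys(r"\citep[see][p. 3]{smith2020,jones2019}")
# returns ["jones2019", "smith2020"]: keys are split on commas, stripped,
# deduplicated, and sorted.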
def _artifact_citation_payloads(
artifacts: list[dict[str, Any]],
*,
source_root: str,
) -> tuple[list[dict[str, Any]], dict[str, dict[str, Any]]]:
extract_references, backends = _load_citegeist_extract()
artifact_payloads: list[dict[str, Any]] = []
summaries: dict[str, dict[str, Any]] = {}
root = Path(source_root) if source_root else None
bibliography_index = load_bibliography_index(source_root) if source_root else {}
for artifact in artifacts:
path = root / artifact["path"] if root is not None else None
raw_text = ""
if path is not None and path.exists():
try:
raw_text = path.read_text(encoding="utf-8")
except UnicodeDecodeError:
raw_text = ""
citation_keys = _extract_citation_keys(raw_text) if raw_text else []
extracted_refs: list[dict[str, Any]] = []
if extract_references is not None and raw_text:
try:
for entry in extract_references(raw_text):
extracted_refs.append(
{
"citation_key": "",
"entry_type": entry.entry_type,
"title": entry.fields.get("title", ""),
"author": entry.fields.get("author", ""),
"year": entry.fields.get("year", ""),
"venue": entry.fields.get("journal", "") or entry.fields.get("booktitle", ""),
}
)
except Exception:
extracted_refs = []
payload = {
"artifact_id": artifact["artifact_id"],
"path": artifact["path"],
"title": artifact.get("title", ""),
"citation_keys": citation_keys,
"resolved_entries": [serialize_bib_entry(bibliography_index.get(key)) for key in citation_keys if bibliography_index.get(key)],
"citation_key_count": len(citation_keys),
"extracted_references": extracted_refs[:12],
"extracted_reference_count": len(extracted_refs),
"citegeist_backends": backends,
}
artifact_payloads.append(payload)
summaries[artifact["artifact_id"]] = {
"citation_key_count": len(citation_keys),
"extracted_reference_count": len(extracted_refs),
"has_citation_support": bool(citation_keys or extracted_refs),
}
return artifact_payloads, summaries
def build_citation_review_entries_from_import(import_dir: str | Path) -> list[CitationReviewEntry]:
base = Path(import_dir)
manifest = _read_json(base / "manifest.json")
artifacts = _read_jsonl(base / "artifacts.jsonl")
observations = _read_jsonl(base / "observations.jsonl")
claims = _read_jsonl(base / "claims.jsonl")
bibliography_index = load_bibliography_index(manifest.get("source_root", ""))
artifact_payloads, _ = _artifact_citation_payloads(
artifacts,
source_root=manifest.get("source_root", ""),
)
observations_by_id = {item["observation_id"]: item for item in observations}
artifact_claim_links: dict[str, dict[str, set[str]]] = defaultdict(lambda: {"claim_ids": set(), "concept_ids": set()})
for claim in claims:
artifact_ids = {
observations_by_id[item]["artifact_id"]
for item in claim.get("source_observation_ids", [])
if item in observations_by_id and observations_by_id[item].get("artifact_id")
}
for artifact_id in artifact_ids:
artifact_claim_links[artifact_id]["claim_ids"].add(claim["claim_id"])
artifact_claim_links[artifact_id]["concept_ids"].update(
concept_id.replace("concept::", "", 1) for concept_id in claim.get("concept_ids", [])
)
entries: list[CitationReviewEntry] = []
for artifact in artifact_payloads:
link_payload = artifact_claim_links.get(artifact["artifact_id"], {"claim_ids": set(), "concept_ids": set()})
for citation_key in artifact.get("citation_keys", []):
digest = hashlib.sha1(f"{artifact['artifact_id']}|key|{citation_key}".encode("utf-8")).hexdigest()[:12]
bib_entry = bibliography_index.get(citation_key, {})
fields = bib_entry.get("fields", {})
entries.append(
CitationReviewEntry(
citation_review_id=f"citrev-{digest}",
artifact_id=artifact["artifact_id"],
artifact_path=artifact.get("path", ""),
artifact_title=artifact.get("title", ""),
source_kind="citation_key",
locator=artifact.get("path", ""),
citation_key=citation_key,
title=str(fields.get("title", "")),
author=str(fields.get("author", "")),
year=str(fields.get("year", "")),
venue=str(fields.get("journal", "") or fields.get("booktitle", "") or fields.get("publisher", "")),
source_bib_path=str(bib_entry.get("source_bib_path", "")),
raw_bibtex=str(bib_entry.get("raw_bibtex", "")),
related_concept_ids=sorted(link_payload["concept_ids"]),
related_claim_ids=sorted(link_payload["claim_ids"]),
)
)
for index, reference in enumerate(artifact.get("extracted_references", []), start=1):
digest = hashlib.sha1(
f"{artifact['artifact_id']}|ref|{reference.get('title', '')}|{reference.get('author', '')}|{index}".encode("utf-8")
).hexdigest()[:12]
entries.append(
CitationReviewEntry(
citation_review_id=f"citrev-{digest}",
artifact_id=artifact["artifact_id"],
artifact_path=artifact.get("path", ""),
artifact_title=artifact.get("title", ""),
source_kind="extracted_reference",
locator=f"{artifact.get('path', '')}#ref-{index}",
citation_key="",
title=reference.get("title", ""),
author=reference.get("author", ""),
year=reference.get("year", ""),
venue=reference.get("venue", ""),
related_concept_ids=sorted(link_payload["concept_ids"]),
related_claim_ids=sorted(link_payload["claim_ids"]),
)
)
return entries
def _build_import_review_payload(session: ReviewSession, import_dir: Path) -> dict[str, Any]:
manifest = _read_json(import_dir / "manifest.json")
lint_payload = _read_json(import_dir / "lint_findings.json")
queue_payload = _read_json(import_dir / "review_queue.json")
artifacts = _read_jsonl(import_dir / "artifacts.jsonl")
observations = _read_jsonl(import_dir / "observations.jsonl")
claims = _read_jsonl(import_dir / "claims.jsonl")
observations_by_id = {item["observation_id"]: item for item in observations}
claims_by_concept: dict[str, list[dict[str, Any]]] = defaultdict(list)
findings_by_target: dict[str, list[dict[str, Any]]] = defaultdict(list)
for finding in lint_payload.get("findings", []):
findings_by_target[finding["target_id"]].append(finding)
for claim in claims:
for concept_id in claim.get("concept_ids", []):
claims_by_concept[concept_id].append(claim)
artifact_citations, artifact_citation_summary = _artifact_citation_payloads(
artifacts,
source_root=manifest.get("source_root", ""),
)
artifact_by_id = {item["artifact_id"]: item for item in artifacts}
concept_reviews: list[dict[str, Any]] = []
for concept in session.draft_pack.concepts:
full_concept_id = f"concept::{concept.concept_id}" if not concept.concept_id.startswith("concept::") else concept.concept_id
concept_claims = claims_by_concept.get(full_concept_id, [])
claim_payloads: list[dict[str, Any]] = []
has_citation_support = False
for claim in concept_claims[:25]:
supporting_observations = [observations_by_id[item] for item in claim.get("source_observation_ids", []) if item in observations_by_id]
artifact_ids = {item["artifact_id"] for item in supporting_observations}
citation_support = [artifact_citation_summary.get(artifact_id, {}) for artifact_id in artifact_ids]
has_citation_support = has_citation_support or any(item.get("has_citation_support") for item in citation_support)
claim_payloads.append(
{
"claim_id": claim["claim_id"],
"claim_text": claim.get("claim_text", ""),
"claim_kind": claim.get("claim_kind", ""),
"grounding_status": claim.get("grounding_status", "unknown"),
"supporting_observations": [
{
"observation_id": obs["observation_id"],
"origin_path": obs.get("origin_path", ""),
"origin_section": obs.get("origin_section", ""),
"text": obs.get("text", ""),
"line_start": obs.get("line_start", 0),
"line_end": obs.get("line_end", 0),
}
for obs in supporting_observations
],
"citation_support": citation_support,
"artifact_paths": [artifact_by_id[item]["path"] for item in artifact_ids if item in artifact_by_id],
"finding_messages": [item["message"] for item in findings_by_target.get(claim["claim_id"], [])],
}
)
concept_reviews.append(
{
"concept_id": concept.concept_id,
"title": concept.title,
"status": concept.status,
"description": concept.description,
"review_help": (
"Prefer `trusted` when claims are coherent and citation-bearing support is appropriate; "
"prefer `provisional` when the concept is plausible but still needs citation or wording cleanup."
),
"claim_count": len(concept_claims),
"grounded_claim_count": sum(1 for item in concept_claims if item.get("grounding_status") == "grounded"),
"warning_count": len(findings_by_target.get(full_concept_id, [])),
"has_citation_support": has_citation_support,
"top_claims": claim_payloads,
"notes": list(concept.notes),
}
)
return {
"import_context": {
"manifest": manifest,
"lint_summary": lint_payload.get("summary", {}),
"queue_length": queue_payload.get("queue_length", 0),
"source_adapter": manifest.get("source_adapter", ""),
},
"review_guidance": {
"overview": (
"Review concepts first, then inspect representative claims and their source observations before promotion."
),
"priorities": [
"Focus reviewer effort on concepts with strong grounded claims and explicit citations first.",
"Downgrade or reject concepts whose claims are fragmented, duplicated, or missing meaningful support.",
"For academic material, citation-bearing claims deserve special scrutiny for fit, contradiction, and fabrication risk.",
],
"citation_guidance": [
"A citation key or extracted reference is evidence of traceability, not correctness.",
"Check whether the cited work actually supports the claim and whether the claim overstates it.",
"Use the citation track to prioritize claims that can move into a separate citation-ingestion workflow.",
],
},
"field_specs": [
_status_field_spec(),
_text_field_spec("description", "Concept description", "Refine the concept summary to match the strongest supported interpretation."),
_text_field_spec("notes", "Reviewer notes", "Record why this concept is trusted, provisional, rejected, or still unclear.", multiline=True),
_text_field_spec("prerequisites", "Prerequisites", "List prerequisite concepts only when the manuscript support is explicit or defensible.", multiline=True),
],
"citation_field_specs": [
_citation_status_field_spec(),
_text_field_spec("notes", "Citation notes", "Record whether the cited work exists, fits the claim, or should move into a dedicated citation-ingestion lane.", multiline=True),
],
"concept_reviews": concept_reviews,
"citation_reviews": [entry.model_dump() for entry in session.citation_reviews],
"bibliography": bibliography_summary_payload(manifest.get("source_root", "")),
"citations": {
"enabled": True,
"provider": "citegeist" if artifact_citations and artifact_citations[0].get("citegeist_backends") else "none",
"artifacts": artifact_citations,
"summary": {
"artifact_count_with_citations": sum(1 for item in artifact_citations if item["citation_key_count"] or item["extracted_reference_count"]),
"citation_key_total": sum(item["citation_key_count"] for item in artifact_citations),
"extracted_reference_total": sum(item["extracted_reference_count"] for item in artifact_citations),
},
"next_actions": [
"Promote citation-bearing claims into a dedicated citation review lane.",
"Use CiteGeist extraction as a first pass, then verify support and metadata before trusting the citation.",
],
},
}
def export_review_ui_data(session: ReviewSession, outdir: str | Path, import_dir: str | Path | None = None) -> None:
outdir = Path(outdir)
outdir.mkdir(parents=True, exist_ok=True)
payload = {
"reviewer": session.reviewer,
"draft_pack": session.draft_pack.model_dump(),
"citation_reviews": [entry.model_dump() for entry in session.citation_reviews],
"ledger": [entry.model_dump() for entry in session.ledger],
}
if import_dir is not None:
payload.update(_build_import_review_payload(session, Path(import_dir)))
(outdir / "review_data.json").write_text(json.dumps(payload, indent=2), encoding="utf-8")

80
src/groundrecall/review_schema.py Normal file
View File

@ -0,0 +1,80 @@
from __future__ import annotations
from pydantic import BaseModel, Field
from typing import Literal
TrustStatus = Literal["trusted", "provisional", "rejected", "needs_review"]
CitationStatus = Literal["unreviewed", "verified", "needs_source_check", "misleading", "irrelevant", "fabricated"]
class ConceptReviewEntry(BaseModel):
concept_id: str
title: str
description: str = ""
prerequisites: list[str] = Field(default_factory=list)
mastery_signals: list[str] = Field(default_factory=list)
status: TrustStatus = "needs_review"
notes: list[str] = Field(default_factory=list)
class CitationReviewEntry(BaseModel):
citation_review_id: str
artifact_id: str
artifact_path: str = ""
artifact_title: str = ""
source_kind: Literal["citation_key", "extracted_reference"] = "citation_key"
locator: str = ""
citation_key: str = ""
title: str = ""
author: str = ""
year: str = ""
venue: str = ""
source_bib_path: str = ""
raw_bibtex: str = ""
status: CitationStatus = "unreviewed"
notes: list[str] = Field(default_factory=list)
related_concept_ids: list[str] = Field(default_factory=list)
related_claim_ids: list[str] = Field(default_factory=list)
class DraftPackData(BaseModel):
pack: dict = Field(default_factory=dict)
concepts: list[ConceptReviewEntry] = Field(default_factory=list)
conflicts: list[str] = Field(default_factory=list)
review_flags: list[str] = Field(default_factory=list)
attribution: dict = Field(default_factory=dict)
class ReviewAction(BaseModel):
action_type: str
target: str = ""
payload: dict = Field(default_factory=dict)
rationale: str = ""
class ReviewLedgerEntry(BaseModel):
reviewer: str
action: ReviewAction
class ReviewSession(BaseModel):
reviewer: str
draft_pack: DraftPackData
citation_reviews: list[CitationReviewEntry] = Field(default_factory=list)
ledger: list[ReviewLedgerEntry] = Field(default_factory=list)
class WorkspaceMeta(BaseModel):
workspace_id: str
title: str
path: str
created_at: str
last_opened_at: str
notes: str = ""
class WorkspaceRegistry(BaseModel):
workspaces: list[WorkspaceMeta] = Field(default_factory=list)
recent_workspace_ids: list[str] = Field(default_factory=list)
class ImportPreview(BaseModel):
ok: bool = False
source_dir: str
workspace_id: str
overwrite_required: bool = False
errors: list[str] = Field(default_factory=list)
warnings: list[str] = Field(default_factory=list)
summary: dict = Field(default_factory=dict)
semantic_warnings: list[str] = Field(default_factory=list)
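
A small sketch of how these models nest; every field value below is illustrative, and only the schema itself comes from the definitions above:

```python
# Construct a review session in memory and round-trip it through JSON,
# which is how review_session.json is persisted.
from groundrecall.review_schema import (
    CitationReviewEntry,
    ConceptReviewEntry,
    DraftPackData,
    ReviewSession,
)

session = ReviewSession(
    reviewer="Example Reviewer",
    draft_pack=DraftPackData(
        concepts=[
            ConceptReviewEntry(concept_id="channel-capacity", title="Channel Capacity")
        ]
    ),
    citation_reviews=[
        CitationReviewEntry(
            citation_review_id="citrev-000000000000",
            artifact_id="ia_001",
            citation_key="herrnstein1961matching",
        )
    ],
)

restored = ReviewSession.model_validate_json(session.model_dump_json())
assert restored.draft_pack.concepts[0].status == "needs_review"  # TrustStatus default
assert restored.citation_reviews[0].status == "unreviewed"  # CitationStatus default
```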

246
src/groundrecall/review_server.py Normal file
View File

@ -0,0 +1,246 @@
from __future__ import annotations
import argparse
import json
import mimetypes
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path
from urllib.parse import parse_qs, urlparse
from .citation_support import materialize_citegeist_store
from .promotion import promote_import_to_store
from .review_workspace import GroundRecallReviewWorkspace
def _json_response(handler: BaseHTTPRequestHandler, status: int, payload: dict) -> None:
body = json.dumps(payload, indent=2).encode("utf-8")
handler.send_response(status)
handler.send_header("Content-Type", "application/json")
handler.send_header("Content-Length", str(len(body)))
handler.send_header("Access-Control-Allow-Origin", "*")
handler.send_header("Access-Control-Allow-Methods", "GET,POST,OPTIONS")
handler.send_header("Access-Control-Allow-Headers", "Content-Type")
handler.end_headers()
handler.wfile.write(body)
def _serve_static(handler: BaseHTTPRequestHandler, asset_path: Path) -> None:
if not asset_path.exists():
_json_response(handler, 404, {"error": "asset not found"})
return
body = asset_path.read_bytes()
handler.send_response(200)
handler.send_header("Content-Type", mimetypes.guess_type(str(asset_path))[0] or "application/octet-stream")
handler.send_header("Content-Length", str(len(body)))
handler.end_headers()
handler.wfile.write(body)
def _safe_show_entry(api: object, citation_key: str) -> dict | None:
if not citation_key:
return None
try:
return api.show_entry( # type: ignore[attr-defined]
citation_key,
include_provenance=True,
include_conflicts=True,
include_bibtex=True,
)
except AttributeError:
pass
store = getattr(api, "store", None)
if store is None:
return None
entry = store.get_entry(citation_key)
if entry is None:
return None
payload = dict(entry)
if hasattr(store, "get_field_provenance"):
try:
payload["provenance"] = store.get_field_provenance(citation_key)
except Exception:
payload["provenance"] = []
if hasattr(store, "get_conflicts"):
try:
payload["conflicts"] = store.get_conflicts(citation_key)
except Exception:
payload["conflicts"] = []
else:
payload["conflicts"] = []
if hasattr(store, "get_entry_bibtex"):
try:
payload["bibtex"] = store.get_entry_bibtex(citation_key)
except Exception:
payload["bibtex"] = None
return payload
def _safe_verify_entry(api: object, entry: object, *, context: str, limit: int) -> dict:
if getattr(entry, "raw_bibtex", ""):
try:
return api.verify_bibtex(entry.raw_bibtex, context=context, limit=limit) # type: ignore[attr-defined]
except Exception:
pass
values = [item for item in [getattr(entry, "citation_key", ""), getattr(entry, "title", ""), getattr(entry, "author", ""), getattr(entry, "year", "")] if item]
try:
return api.verify_strings(values, context=context, limit=limit) # type: ignore[attr-defined]
except Exception as exc:
return {
"context": context,
"results": [],
"error": str(exc),
}
class GroundRecallReviewHandler(BaseHTTPRequestHandler):
workspace: GroundRecallReviewWorkspace
default_store_dir: str | None = None
citegeist_bundle: dict | None = None
def do_OPTIONS(self) -> None:
_json_response(self, 200, {"ok": True})
def do_GET(self) -> None:
parsed = urlparse(self.path)
if parsed.path == "/api/healthz":
_json_response(self, 200, {"ok": True})
return
if parsed.path == "/api/load":
review_data = self.workspace.load_review_data()
review_data["citegeist"] = {
"enabled": bool(self.citegeist_bundle and self.citegeist_bundle.get("available")),
"db_path": self.citegeist_bundle.get("db_path") if self.citegeist_bundle else "",
"ingested_files": self.citegeist_bundle.get("ingested_files", []) if self.citegeist_bundle else [],
"show_entry_endpoint": "/api/citations/show-entry",
"verify_endpoint": "/api/citations/verify",
}
_json_response(
self,
200,
{
"ok": True,
"import_dir": str(self.workspace.import_dir),
"review_data": review_data,
},
)
return
if parsed.path == "/api/citations/show-entry":
if not self.citegeist_bundle or not self.citegeist_bundle.get("available"):
_json_response(self, 404, {"ok": False, "error": "citegeist unavailable"})
return
citation_key = parse_qs(parsed.query).get("citation_key", [""])[0]
if not citation_key:
_json_response(self, 400, {"ok": False, "error": "citation_key is required"})
return
payload = _safe_show_entry(self.citegeist_bundle["api"], citation_key)
_json_response(self, 200, {"ok": payload is not None, "entry": payload})
return
asset_root = Path(__file__).with_name("review_app")
if parsed.path in {"/", "/index.html"}:
_serve_static(self, asset_root / "index.html")
return
if parsed.path == "/app.js":
_serve_static(self, asset_root / "app.js")
return
if parsed.path == "/styles.css":
_serve_static(self, asset_root / "styles.css")
return
_json_response(self, 404, {"error": "not found"})
def do_POST(self) -> None:
parsed = urlparse(self.path)
length = int(self.headers.get("Content-Length", "0"))
raw = self.rfile.read(length) if length else b"{}"
payload = json.loads(raw.decode("utf-8") or "{}")
if parsed.path == "/api/save":
self.workspace.apply_updates(
concept_updates=payload.get("concept_updates"),
citation_updates=payload.get("citation_updates"),
reviewer=payload.get("reviewer"),
)
_json_response(
self,
200,
{
"ok": True,
"import_dir": str(self.workspace.import_dir),
"review_data": self.workspace.load_review_data(),
},
)
return
if parsed.path == "/api/promote":
store_dir = payload.get("store_dir") or self.default_store_dir
if not store_dir:
_json_response(self, 400, {"ok": False, "error": "store_dir is required"})
return
result = promote_import_to_store(
import_dir=self.workspace.import_dir,
store_dir=store_dir,
reviewer=payload.get("reviewer"),
snapshot_id=payload.get("snapshot_id"),
)
_json_response(self, 200, {"ok": True, "promotion": result})
return
if parsed.path == "/api/citations/verify":
if not self.citegeist_bundle or not self.citegeist_bundle.get("available"):
_json_response(self, 404, {"ok": False, "error": "citegeist unavailable"})
return
citation_review_id = str(payload.get("citation_review_id") or "").strip()
if not citation_review_id:
_json_response(self, 400, {"ok": False, "error": "citation_review_id is required"})
return
session = self.workspace.load_session()
entry = next((item for item in session.citation_reviews if item.citation_review_id == citation_review_id), None)
if entry is None:
_json_response(self, 404, {"ok": False, "error": "citation review entry not found"})
return
api = self.citegeist_bundle["api"]
show_entry_payload = _safe_show_entry(api, entry.citation_key) if entry.citation_key else None
context = f"{entry.artifact_path} {entry.artifact_title}".strip()
verification = _safe_verify_entry(api, entry, context=context, limit=int(payload.get("limit", 5)))
_json_response(
self,
200,
{
"ok": True,
"citation_review_id": citation_review_id,
"entry": show_entry_payload,
"verification": verification,
},
)
return
_json_response(self, 404, {"error": "not found"})
def build_parser() -> argparse.ArgumentParser:
parser = argparse.ArgumentParser(description="GroundRecall local review server")
parser.add_argument("import_dir")
parser.add_argument("--host", default="127.0.0.1")
parser.add_argument("--port", type=int, default=8766)
parser.add_argument("--reviewer", default="GroundRecall Import")
parser.add_argument("--store-dir", default=None)
return parser
def main() -> None:
args = build_parser().parse_args()
GroundRecallReviewHandler.workspace = GroundRecallReviewWorkspace(args.import_dir, reviewer=args.reviewer)
GroundRecallReviewHandler.default_store_dir = args.store_dir
GroundRecallReviewHandler.workspace.ensure_review_bundle()
session = GroundRecallReviewHandler.workspace.load_session()
GroundRecallReviewHandler.citegeist_bundle = materialize_citegeist_store(
args.import_dir,
session.draft_pack.pack.get("source_root", ""),
)
server = HTTPServer((args.host, args.port), GroundRecallReviewHandler)
print(f"GroundRecall review server listening on http://{args.host}:{args.port}")
server.serve_forever()
if __name__ == "__main__":
main()
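
A minimal client sketch against a locally running instance, assuming the server was started with `python -m groundrecall.review_server <import_dir>` on the default host and port; the concept id is illustrative:

```python
# Exercise the /api/load and /api/save endpoints defined above using only
# the standard library.
import json
from urllib.request import Request, urlopen

BASE = "http://127.0.0.1:8766"

with urlopen(f"{BASE}/api/load") as response:
    review_data = json.loads(response.read())["review_data"]
print(review_data["import_context"]["queue_length"])

# Persist one concept decision through the same endpoint the review UI uses.
body = json.dumps(
    {
        "reviewer": "Example Reviewer",
        "concept_updates": [{"concept_id": "channel-capacity", "status": "trusted"}],
    }
).encode("utf-8")
request = Request(
    f"{BASE}/api/save",
    data=body,
    headers={"Content-Type": "application/json"},
)
with urlopen(request) as response:
    assert json.loads(response.read())["ok"] is True
```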

126
src/groundrecall/review_workspace.py Normal file
View File

@ -0,0 +1,126 @@
from __future__ import annotations
import json
from pathlib import Path
from typing import Any
from .groundrecall_review_bridge import export_review_bundle_from_import
from .review_export import build_citation_review_entries_from_import, export_review_state_json, export_review_ui_data
from .review_schema import ReviewAction, ReviewLedgerEntry, ReviewSession
def _normalize_lines(value: Any) -> list[str]:
if isinstance(value, list):
return [str(item).strip() for item in value if str(item).strip()]
if isinstance(value, str):
return [line.strip() for line in value.splitlines() if line.strip()]
return []
class GroundRecallReviewWorkspace:
def __init__(self, import_dir: str | Path, reviewer: str = "GroundRecall Import") -> None:
self.import_dir = Path(import_dir)
self.reviewer = reviewer
@property
def review_session_path(self) -> Path:
return self.import_dir / "review_session.json"
@property
def review_data_path(self) -> Path:
return self.import_dir / "review_data.json"
def ensure_review_bundle(self) -> None:
if not self.review_session_path.exists():
export_review_bundle_from_import(self.import_dir, reviewer=self.reviewer)
return
session = ReviewSession.model_validate_json(self.review_session_path.read_text(encoding="utf-8"))
updated = False
if (
not session.citation_reviews
or any(entry.source_kind == "citation_key" and not entry.title for entry in session.citation_reviews)
or any(entry.source_kind == "citation_key" and not entry.source_bib_path for entry in session.citation_reviews)
):
session.citation_reviews = build_citation_review_entries_from_import(self.import_dir)
updated = True
if updated or not self.review_data_path.exists():
self.save_session(session)
def load_session(self) -> ReviewSession:
self.ensure_review_bundle()
return ReviewSession.model_validate_json(self.review_session_path.read_text(encoding="utf-8"))
def save_session(self, session: ReviewSession) -> None:
export_review_state_json(session, self.review_session_path)
export_review_ui_data(session, self.import_dir, import_dir=self.import_dir)
def load_review_data(self) -> dict[str, Any]:
self.ensure_review_bundle()
return json.loads(self.review_data_path.read_text(encoding="utf-8"))
def apply_updates(
self,
*,
concept_updates: list[dict[str, Any]] | None = None,
citation_updates: list[dict[str, Any]] | None = None,
reviewer: str | None = None,
) -> ReviewSession:
session = self.load_session()
if reviewer:
session.reviewer = reviewer
concept_by_id = {concept.concept_id: concept for concept in session.draft_pack.concepts}
citation_by_id = {entry.citation_review_id: entry for entry in session.citation_reviews}
for payload in concept_updates or []:
concept_id = str(payload.get("concept_id", "")).strip()
if not concept_id or concept_id not in concept_by_id:
continue
concept = concept_by_id[concept_id]
if "status" in payload:
concept.status = payload["status"]
if "description" in payload:
concept.description = str(payload.get("description", "")).strip()
if "notes" in payload:
concept.notes = _normalize_lines(payload.get("notes"))
if "prerequisites" in payload:
concept.prerequisites = _normalize_lines(payload.get("prerequisites"))
session.ledger.append(
ReviewLedgerEntry(
reviewer=session.reviewer,
action=ReviewAction(
action_type="edit_concept",
target=concept_id,
payload={
"status": concept.status,
"description": concept.description,
"notes": concept.notes,
"prerequisites": concept.prerequisites,
},
rationale=str(payload.get("rationale", "")).strip(),
),
)
)
for payload in citation_updates or []:
citation_review_id = str(payload.get("citation_review_id", "")).strip()
if not citation_review_id or citation_review_id not in citation_by_id:
continue
entry = citation_by_id[citation_review_id]
if "status" in payload:
entry.status = payload["status"]
if "notes" in payload:
entry.notes = _normalize_lines(payload.get("notes"))
session.ledger.append(
ReviewLedgerEntry(
reviewer=session.reviewer,
action=ReviewAction(
action_type="edit_citation",
target=citation_review_id,
payload={"status": entry.status, "notes": entry.notes},
rationale=str(payload.get("rationale", "")).strip(),
),
)
)
self.save_session(session)
return session
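
The same flow without the HTTP server, sketched over an existing import directory (the directory and concept id are illustrative):

```python
# Apply a reviewer decision directly through the workspace; each accepted
# update is appended to the session ledger for auditability.
from groundrecall.review_workspace import GroundRecallReviewWorkspace

workspace = GroundRecallReviewWorkspace(
    "llmwiki/imports/import-test",
    reviewer="Example Reviewer",
)
session = workspace.apply_updates(
    concept_updates=[
        {
            "concept_id": "channel-capacity",
            "status": "trusted",
            "notes": "Coherent claims with explicit citation support.",
            "rationale": "Spot-checked top claims against the source page.",
        }
    ],
)
print(len(session.ledger))
```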

3
src/groundrecall/source_adapters/__init__.py Normal file
View File

@ -0,0 +1,3 @@
from __future__ import annotations
from .. import groundrecall_source_adapters as _legacy_source_adapters # noqa: F401

3
src/groundrecall/source_adapters/base.py Normal file
View File

@ -0,0 +1,3 @@
from __future__ import annotations
from ..groundrecall_source_adapters.base import * # noqa: F403

3
src/groundrecall/source_adapters/didactopus_pack.py Normal file
View File

@ -0,0 +1,3 @@
from __future__ import annotations
from ..groundrecall_source_adapters.didactopus_pack import * # noqa: F403

3
src/groundrecall/source_adapters/llmwiki.py Normal file
View File

@ -0,0 +1,3 @@
from __future__ import annotations
from ..groundrecall_source_adapters.llmwiki import * # noqa: F403

3
src/groundrecall/source_adapters/markdown_notes.py Normal file
View File

@ -0,0 +1,3 @@
from __future__ import annotations
from ..groundrecall_source_adapters.markdown_notes import * # noqa: F403

3
src/groundrecall/source_adapters/polypaper.py Normal file
View File

@ -0,0 +1,3 @@
from __future__ import annotations
from ..groundrecall_source_adapters.polypaper import * # noqa: F403

3
src/groundrecall/source_adapters/transcript.py Normal file
View File

@ -0,0 +1,3 @@
from __future__ import annotations
from ..groundrecall_source_adapters.transcript import * # noqa: F403
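
These one-line shim modules keep both import paths alive after the monorepo extraction. A quick aliasing check, assuming the legacy modules do not define a restrictive `__all__`:

```python
# The shim star-imports the legacy module, so public symbols are the same
# objects under either name.
import groundrecall.groundrecall_source_adapters.markdown_notes as legacy
import groundrecall.source_adapters.markdown_notes as shim

public = [name for name in dir(legacy) if not name.startswith("_")]
assert all(getattr(shim, name) is getattr(legacy, name) for name in public)
```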

203
src/groundrecall/store.py Normal file
View File

@ -0,0 +1,203 @@
from __future__ import annotations
import os
import tempfile
from pathlib import Path
from typing import TypeVar
from pydantic import BaseModel
from .models import (
ArtifactRecord,
ClaimRecord,
ConceptRecord,
FragmentRecord,
GroundRecallSnapshot,
ObservationRecord,
PromotionRecord,
RelationRecord,
ReviewCandidateRecord,
SourceRecord,
)
ModelT = TypeVar("ModelT", bound=BaseModel)
class GroundRecallStore:
def __init__(self, base_dir: str | Path):
self.base_dir = Path(base_dir)
self.sources_dir = self.base_dir / "sources"
self.fragments_dir = self.base_dir / "fragments"
self.artifacts_dir = self.base_dir / "artifacts"
self.observations_dir = self.base_dir / "observations"
self.claims_dir = self.base_dir / "claims"
self.concepts_dir = self.base_dir / "concepts"
self.relations_dir = self.base_dir / "relations"
self.review_candidates_dir = self.base_dir / "review_candidates"
self.promotions_dir = self.base_dir / "promotions"
self.snapshots_dir = self.base_dir / "snapshots"
for path in [
self.sources_dir,
self.fragments_dir,
self.artifacts_dir,
self.observations_dir,
self.claims_dir,
self.concepts_dir,
self.relations_dir,
self.review_candidates_dir,
self.promotions_dir,
self.snapshots_dir,
]:
path.mkdir(parents=True, exist_ok=True)
def _save(self, directory: Path, key: str, model: BaseModel) -> None:
target = directory / f"{key}.json"
payload = model.model_dump_json(indent=2)
self._write_text_atomic(target, payload)
def _write_text_atomic(self, path: Path, text: str) -> None:
path.parent.mkdir(parents=True, exist_ok=True)
fd, tmp_name = tempfile.mkstemp(
prefix=f".{path.name}.",
suffix=".tmp",
dir=path.parent,
text=True,
)
tmp_path = Path(tmp_name)
try:
with os.fdopen(fd, "w", encoding="utf-8") as handle:
handle.write(text)
handle.flush()
os.fsync(handle.fileno())
os.replace(tmp_path, path)
finally:
tmp_path.unlink(missing_ok=True)
def _load(self, directory: Path, key: str, model_type: type[ModelT]) -> ModelT | None:
path = directory / f"{key}.json"
if not path.exists():
return None
return model_type.model_validate_json(path.read_text(encoding="utf-8"))
def _list(self, directory: Path, model_type: type[ModelT]) -> list[ModelT]:
items: list[ModelT] = []
for path in sorted(directory.glob("*.json")):
items.append(model_type.model_validate_json(path.read_text(encoding="utf-8")))
return items
def save_source(self, record: SourceRecord) -> SourceRecord:
self._save(self.sources_dir, record.source_id, record)
return record
def get_source(self, source_id: str) -> SourceRecord | None:
return self._load(self.sources_dir, source_id, SourceRecord)
def list_sources(self) -> list[SourceRecord]:
return self._list(self.sources_dir, SourceRecord)
def save_fragment(self, record: FragmentRecord) -> FragmentRecord:
self._save(self.fragments_dir, record.fragment_id, record)
return record
def get_fragment(self, fragment_id: str) -> FragmentRecord | None:
return self._load(self.fragments_dir, fragment_id, FragmentRecord)
def list_fragments(self) -> list[FragmentRecord]:
return self._list(self.fragments_dir, FragmentRecord)
def save_artifact(self, record: ArtifactRecord) -> ArtifactRecord:
self._save(self.artifacts_dir, record.artifact_id, record)
return record
def get_artifact(self, artifact_id: str) -> ArtifactRecord | None:
return self._load(self.artifacts_dir, artifact_id, ArtifactRecord)
def list_artifacts(self) -> list[ArtifactRecord]:
return self._list(self.artifacts_dir, ArtifactRecord)
def save_observation(self, record: ObservationRecord) -> ObservationRecord:
self._save(self.observations_dir, record.observation_id, record)
return record
def get_observation(self, observation_id: str) -> ObservationRecord | None:
return self._load(self.observations_dir, observation_id, ObservationRecord)
def list_observations(self) -> list[ObservationRecord]:
return self._list(self.observations_dir, ObservationRecord)
def save_claim(self, record: ClaimRecord) -> ClaimRecord:
self._save(self.claims_dir, record.claim_id, record)
return record
def get_claim(self, claim_id: str) -> ClaimRecord | None:
return self._load(self.claims_dir, claim_id, ClaimRecord)
def list_claims(self) -> list[ClaimRecord]:
return self._list(self.claims_dir, ClaimRecord)
def save_concept(self, record: ConceptRecord) -> ConceptRecord:
self._save(self.concepts_dir, record.concept_id.replace("::", "__"), record)
return record
def get_concept(self, concept_id: str) -> ConceptRecord | None:
return self._load(self.concepts_dir, concept_id.replace("::", "__"), ConceptRecord)
def list_concepts(self) -> list[ConceptRecord]:
return self._list(self.concepts_dir, ConceptRecord)
def save_relation(self, record: RelationRecord) -> RelationRecord:
self._save(self.relations_dir, record.relation_id, record)
return record
def get_relation(self, relation_id: str) -> RelationRecord | None:
return self._load(self.relations_dir, relation_id, RelationRecord)
def list_relations(self) -> list[RelationRecord]:
return self._list(self.relations_dir, RelationRecord)
def save_review_candidate(self, record: ReviewCandidateRecord) -> ReviewCandidateRecord:
self._save(self.review_candidates_dir, record.review_candidate_id, record)
return record
def get_review_candidate(self, review_candidate_id: str) -> ReviewCandidateRecord | None:
return self._load(self.review_candidates_dir, review_candidate_id, ReviewCandidateRecord)
def list_review_candidates(self) -> list[ReviewCandidateRecord]:
return self._list(self.review_candidates_dir, ReviewCandidateRecord)
def save_promotion(self, record: PromotionRecord) -> PromotionRecord:
self._save(self.promotions_dir, record.promotion_id, record)
return record
def get_promotion(self, promotion_id: str) -> PromotionRecord | None:
return self._load(self.promotions_dir, promotion_id, PromotionRecord)
def list_promotions(self) -> list[PromotionRecord]:
return self._list(self.promotions_dir, PromotionRecord)
def save_snapshot(self, snapshot: GroundRecallSnapshot) -> GroundRecallSnapshot:
self._save(self.snapshots_dir, snapshot.snapshot_id, snapshot)
return snapshot
def get_snapshot(self, snapshot_id: str) -> GroundRecallSnapshot | None:
return self._load(self.snapshots_dir, snapshot_id, GroundRecallSnapshot)
def list_snapshots(self) -> list[GroundRecallSnapshot]:
return self._list(self.snapshots_dir, GroundRecallSnapshot)
def build_snapshot(self, snapshot_id: str, created_at: str, metadata: dict | None = None) -> GroundRecallSnapshot:
return GroundRecallSnapshot(
snapshot_id=snapshot_id,
created_at=created_at,
sources=self.list_sources(),
fragments=self.list_fragments(),
artifacts=self.list_artifacts(),
observations=self.list_observations(),
claims=self.list_claims(),
concepts=self.list_concepts(),
relations=self.list_relations(),
promotions=self.list_promotions(),
metadata=metadata or {},
)
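
A short round-trip sketch; the store directory is illustrative, and the record fields mirror the ones used by the tests below:

```python
# Save, reload, and snapshot one concept. Every write goes through
# _write_text_atomic, so partially written JSON files are never visible.
from groundrecall.models import ConceptRecord
from groundrecall.store import GroundRecallStore

store = GroundRecallStore("groundrecall-store")
store.save_concept(
    ConceptRecord(
        concept_id="concept::channel-capacity",
        title="Channel Capacity",
        description="Reliable communication limit.",
        current_status="promoted",
    )
)

# "::" in concept ids maps to "__" on disk, so each record is a flat
# <key>.json file under concepts/.
assert store.get_concept("concept::channel-capacity") is not None

snapshot = store.build_snapshot("snap_001", created_at="2026-04-23T00:00:00Z")
store.save_snapshot(snapshot)
assert store.get_snapshot("snap_001") is not None
```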

View File

@ -0,0 +1,17 @@
import shutil
import subprocess
def test_groundrecall_console_script_help() -> None:
executable = shutil.which("groundrecall")
assert executable is not None
result = subprocess.run(
[executable, "--help"],
capture_output=True,
text=True,
check=False,
)
assert result.returncode == 0
assert "GroundRecall command-line tools" in result.stdout

View File

@ -0,0 +1,134 @@
from __future__ import annotations
import json
from pathlib import Path
from groundrecall.assistant_export import export_assistant_bundle
from groundrecall.assistants.base import get_assistant_adapter, list_assistant_adapters
import groundrecall.assistants.codex # noqa: F401
import groundrecall.assistants.claude_code # noqa: F401
from groundrecall.models import (
ArtifactRecord,
ClaimRecord,
ConceptRecord,
ObservationRecord,
ProvenanceRecord,
RelationRecord,
)
from groundrecall.query import build_query_bundle_for_concept
from groundrecall.store import GroundRecallStore
def _seed_store(store: GroundRecallStore) -> None:
store.save_artifact(
ArtifactRecord(
artifact_id="ia_001",
artifact_kind="compiled_page",
title="Channel Capacity",
path="wiki/channel-capacity.md",
current_status="reviewed",
)
)
store.save_observation(
ObservationRecord(
observation_id="obs_001",
artifact_id="ia_001",
role="claim",
text="Reliable communication rate is bounded by channel capacity.",
provenance=ProvenanceRecord(
origin_artifact_id="ia_001",
origin_path="wiki/channel-capacity.md",
support_kind="derived_from_page",
grounding_status="grounded",
),
current_status="reviewed",
)
)
store.save_concept(
ConceptRecord(
concept_id="concept::channel-capacity",
title="Channel Capacity",
description="Reliable communication limit.",
source_artifact_ids=["ia_001"],
current_status="promoted",
)
)
store.save_concept(
ConceptRecord(
concept_id="concept::shannon-entropy",
title="Shannon Entropy",
description="Average uncertainty.",
current_status="promoted",
)
)
store.save_claim(
ClaimRecord(
claim_id="clm_001",
claim_text="Channel capacity bounds reliable communication rate.",
concept_ids=["concept::channel-capacity"],
source_observation_ids=["obs_001"],
confidence_hint=0.8,
review_confidence=0.9,
provenance=ProvenanceRecord(
origin_artifact_id="ia_001",
origin_path="wiki/channel-capacity.md",
support_kind="derived_from_page",
grounding_status="grounded",
),
current_status="promoted",
)
)
store.save_relation(
RelationRecord(
relation_id="rel_001",
source_id="concept::channel-capacity",
target_id="concept::shannon-entropy",
relation_type="references",
current_status="promoted",
)
)
def test_assistant_adapter_registry_lists_known_adapters() -> None:
assert "codex" in list_assistant_adapters()
assert "claude_code" in list_assistant_adapters()
def test_codex_adapter_exports_skill_and_json_bundle(tmp_path: Path) -> None:
store = GroundRecallStore(tmp_path / "groundrecall")
_seed_store(store)
manifest = export_assistant_bundle(store.base_dir, "codex", tmp_path / "codex", concept_refs=["channel-capacity"])
assert (tmp_path / "codex" / "SKILL.md").exists()
assert (tmp_path / "codex" / "codex_bundle.json").exists()
assert (tmp_path / "codex" / "assistant_export_manifest.json").exists()
assert manifest["assistant"] == "codex"
def test_claude_code_adapter_exports_memory_and_json_bundle(tmp_path: Path) -> None:
store = GroundRecallStore(tmp_path / "groundrecall")
_seed_store(store)
manifest = export_assistant_bundle(store.base_dir, "claude_code", tmp_path / "claude", concept_refs=["channel-capacity"])
assert (tmp_path / "claude" / "CLAUDE.md").exists()
assert (tmp_path / "claude" / "claude_code_bundle.json").exists()
assert manifest["assistant"] == "claude_code"
def test_adapter_contexts_are_derived_from_assistant_neutral_query_bundles(tmp_path: Path) -> None:
store = GroundRecallStore(tmp_path / "groundrecall")
_seed_store(store)
query_bundle = build_query_bundle_for_concept(store.base_dir, "channel-capacity")
assert query_bundle is not None
codex = get_assistant_adapter("codex")
claude = get_assistant_adapter("claude_code")
codex_context = codex.build_context(query_bundle)
claude_context = claude.build_context(query_bundle)
assert codex_context["concept"]["concept_id"] == "concept::channel-capacity"
assert claude_context["concept"]["concept_id"] == "concept::channel-capacity"
assert codex_context["assistant"] == "codex"
assert claude_context["assistant"] == "claude_code"
assert "relevant_claims" in codex_context
assert "claims" in claude_context

View File

@ -0,0 +1,136 @@
from __future__ import annotations
import json
from pathlib import Path
from groundrecall.export import export_canonical_bundle, export_query_bundle
from groundrecall.models import (
ArtifactRecord,
ClaimRecord,
ConceptRecord,
ObservationRecord,
ProvenanceRecord,
RelationRecord,
SourceRecord,
)
from groundrecall.store import GroundRecallStore
def _read_jsonl(path: Path) -> list[dict]:
text = path.read_text(encoding="utf-8").strip()
if not text:
return []
return [json.loads(line) for line in text.splitlines()]
def _seed_store(store: GroundRecallStore) -> None:
store.save_source(SourceRecord(source_id="src_001", title="Source", current_status="promoted"))
store.save_artifact(
ArtifactRecord(
artifact_id="ia_001",
artifact_kind="compiled_page",
title="Channel Capacity",
path="wiki/channel-capacity.md",
current_status="reviewed",
)
)
store.save_observation(
ObservationRecord(
observation_id="obs_001",
artifact_id="ia_001",
role="claim",
text="Reliable communication rate is bounded by channel capacity.",
provenance=ProvenanceRecord(
origin_artifact_id="ia_001",
origin_path="wiki/channel-capacity.md",
support_kind="derived_from_page",
grounding_status="grounded",
),
current_status="reviewed",
)
)
store.save_concept(
ConceptRecord(
concept_id="concept::channel-capacity",
title="Channel Capacity",
description="Reliable communication limit.",
source_artifact_ids=["ia_001"],
current_status="promoted",
)
)
store.save_concept(
ConceptRecord(
concept_id="concept::shannon-entropy",
title="Shannon Entropy",
description="Average uncertainty.",
current_status="promoted",
)
)
store.save_claim(
ClaimRecord(
claim_id="clm_001",
claim_text="Channel capacity bounds reliable communication rate.",
concept_ids=["concept::channel-capacity"],
source_observation_ids=["obs_001"],
confidence_hint=0.8,
review_confidence=0.9,
provenance=ProvenanceRecord(
origin_artifact_id="ia_001",
origin_path="wiki/channel-capacity.md",
support_kind="derived_from_page",
grounding_status="grounded",
),
current_status="promoted",
)
)
store.save_relation(
RelationRecord(
relation_id="rel_001",
source_id="concept::channel-capacity",
target_id="concept::shannon-entropy",
relation_type="references",
current_status="promoted",
)
)
def test_export_canonical_bundle_writes_expected_files(tmp_path: Path) -> None:
store = GroundRecallStore(tmp_path / "groundrecall")
_seed_store(store)
out_dir = tmp_path / "exports"
payload = export_canonical_bundle(
store_dir=store.base_dir,
out_dir=out_dir,
concept_refs=["channel-capacity"],
snapshot_id="snap_export_001",
)
assert (out_dir / "groundrecall_snapshot.json").exists()
assert (out_dir / "claims.jsonl").exists()
assert (out_dir / "concepts.jsonl").exists()
assert (out_dir / "relations.jsonl").exists()
assert (out_dir / "provenance_manifest.json").exists()
assert (out_dir / "export_manifest.json").exists()
assert (out_dir / "query_bundle__channel-capacity.json").exists()
snapshot = json.loads((out_dir / "groundrecall_snapshot.json").read_text(encoding="utf-8"))
manifest = json.loads((out_dir / "export_manifest.json").read_text(encoding="utf-8"))
claims = _read_jsonl(out_dir / "claims.jsonl")
assert snapshot["snapshot_id"] == "snap_export_001"
assert manifest["export_kind"] == "canonical"
assert len(manifest["query_bundles"]) == 1
assert claims[0]["claim_id"] == "clm_001"
assert payload["query_bundles"]
def test_export_query_bundle_is_assistant_neutral(tmp_path: Path) -> None:
store = GroundRecallStore(tmp_path / "groundrecall")
_seed_store(store)
out_path = tmp_path / "bundle.json"
payload = export_query_bundle(store.base_dir, "channel capacity", out_path)
assert out_path.exists()
assert payload["bundle_kind"] == "groundrecall_query_bundle"
forbidden = {"assistant", "codex", "claude", "prompt_text"}
assert set(payload).isdisjoint(forbidden)

View File

@ -0,0 +1,161 @@
from __future__ import annotations
import json
from pathlib import Path
from groundrecall.ingest import run_groundrecall_import
from groundrecall.lint import lint_import_directory
def _read_jsonl(path: Path) -> list[dict]:
text = path.read_text(encoding="utf-8").strip()
if not text:
return []
return [json.loads(line) for line in text.splitlines()]
def test_groundrecall_import_emits_normalized_artifacts(tmp_path: Path) -> None:
root = tmp_path / "llmwiki"
(root / "wiki").mkdir(parents=True)
(root / "raw").mkdir()
(root / "logs").mkdir()
(root / "wiki" / "channel-capacity.md").write_text(
"# Channel Capacity\n\n"
"- Reliable rate upper bound for a noisy channel.\n\n"
"See also [[Shannon Entropy]].\n",
encoding="utf-8",
)
(root / "raw" / "notes.md").write_text(
"Speculation: Capacity may depend on constraints.\n",
encoding="utf-8",
)
(root / "logs" / "session.log").write_text(
"Learner asked about entropy and communication limits.\n",
encoding="utf-8",
)
result = run_groundrecall_import(root, mode="quick", import_id="import-test")
assert result.out_dir == root / "imports" / "import-test"
manifest = json.loads((result.out_dir / "manifest.json").read_text(encoding="utf-8"))
assert manifest["source_repo_kind"] == "llmwiki"
assert manifest["artifact_count"] == 3
assert manifest["claim_count"] >= 1
artifacts = _read_jsonl(result.out_dir / "artifacts.jsonl")
assert {item["artifact_kind"] for item in artifacts} == {"compiled_page", "raw_note", "session_log"}
claims = _read_jsonl(result.out_dir / "claims.jsonl")
assert any("Reliable rate upper bound" in item["claim_text"] for item in claims)
concepts = _read_jsonl(result.out_dir / "concepts.jsonl")
concept_ids = {item["concept_id"] for item in concepts}
assert "concept::channel-capacity" in concept_ids
assert "concept::shannon-entropy" in concept_ids
relations = _read_jsonl(result.out_dir / "relations.jsonl")
assert any(item["target_id"] == "concept::shannon-entropy" for item in relations)
lint_payload = json.loads((result.out_dir / "lint_findings.json").read_text(encoding="utf-8"))
assert "summary" in lint_payload
assert lint_payload["summary"]["warning_count"] >= 0
review_queue = json.loads((result.out_dir / "review_queue.json").read_text(encoding="utf-8"))
assert review_queue["queue_length"] >= 1
assert any(item["candidate_type"] == "claim" for item in review_queue["items"])
review_session = json.loads((result.out_dir / "review_session.json").read_text(encoding="utf-8"))
assert review_session["reviewer"] == "GroundRecall Import"
assert review_session["draft_pack"]["pack"]["source_import_id"] == "import-test"
assert any(item["concept_id"] == "channel-capacity" for item in review_session["draft_pack"]["concepts"])
review_data = json.loads((result.out_dir / "review_data.json").read_text(encoding="utf-8"))
assert review_data["reviewer"] == "GroundRecall Import"
assert "field_specs" in review_data
assert any(item["field"] == "status" for item in review_data["field_specs"])
assert "review_guidance" in review_data
assert "concept_reviews" in review_data
assert "citations" in review_data
assert "citation_reviews" in review_data
def test_groundrecall_import_parses_explicit_claim_relations(tmp_path: Path) -> None:
root = tmp_path / "llmwiki"
(root / "wiki").mkdir(parents=True)
(root / "wiki" / "notes.md").write_text(
"# Notes\n\n"
"- [claim_id: base] Channel capacity bounds reliable communication rate.\n"
"- [claim_id: revised] [supersedes: base] Channel capacity bounds reliable communication rate for a specified channel model.\n"
"- [claim_id: dissent] [contradicts: revised] Channel capacity has no stable interpretation.\n",
encoding="utf-8",
)
result = run_groundrecall_import(root, mode="quick", import_id="relations-test")
claims = _read_jsonl(result.out_dir / "claims.jsonl")
by_id = {item["claim_id"]: item for item in claims}
assert "clm_base" in by_id
assert by_id["clm_revised"]["supersedes_claim_ids"] == ["clm_base"]
assert by_id["clm_dissent"]["contradicts_claim_ids"] == ["clm_revised"]
lint_payload = json.loads((result.out_dir / "lint_findings.json").read_text(encoding="utf-8"))
codes = {item["code"] for item in lint_payload["findings"]}
assert "unresolved_supersession_ref" not in codes
assert "unresolved_contradiction_ref" not in codes
def test_groundrecall_lint_flags_orphan_concepts_and_missing_targets(tmp_path: Path) -> None:
root = tmp_path / "llmwiki"
(root / "wiki").mkdir(parents=True)
(root / "wiki" / "solo.md").write_text(
"# Solo Concept\n",
encoding="utf-8",
)
(root / "wiki" / "broken.md").write_text(
"# Broken\n\nSee also [[Missing Concept]].\n",
encoding="utf-8",
)
result = run_groundrecall_import(root, mode="quick", import_id="lint-test")
lint_payload = json.loads((result.out_dir / "lint_findings.json").read_text(encoding="utf-8"))
codes = {item["code"] for item in lint_payload["findings"]}
assert "orphan_concept" in codes
def test_groundrecall_lint_detects_relation_missing_target(tmp_path: Path) -> None:
import_dir = tmp_path / "imports" / "broken-import"
import_dir.mkdir(parents=True)
(import_dir / "manifest.json").write_text(
json.dumps({"import_id": "broken-import", "import_mode": "quick"}),
encoding="utf-8",
)
(import_dir / "artifacts.jsonl").write_text("", encoding="utf-8")
(import_dir / "observations.jsonl").write_text("", encoding="utf-8")
(import_dir / "claims.jsonl").write_text("", encoding="utf-8")
(import_dir / "concepts.jsonl").write_text(
json.dumps(
{
"concept_id": "concept::existing",
"title": "Existing",
"current_status": "triaged",
}
)
+ "\n",
encoding="utf-8",
)
(import_dir / "relations.jsonl").write_text(
json.dumps(
{
"relation_id": "rel_1",
"source_id": "concept::existing",
"target_id": "concept::missing",
"relation_type": "references",
"current_status": "draft",
}
)
+ "\n",
encoding="utf-8",
)
payload = lint_import_directory(import_dir)
codes = {item["code"] for item in payload["findings"]}
assert "relation_missing_target" in codes

View File

@ -0,0 +1,70 @@
import sys
from pathlib import Path
from groundrecall.cli import main as groundrecall_cli_main
from groundrecall.export import export_canonical_bundle
from groundrecall.ingest import run_groundrecall_import
from groundrecall.inspect import inspect_store
from groundrecall.models import ClaimRecord
from groundrecall.query import query_concept
from groundrecall.store import GroundRecallStore
from groundrecall.lint import lint_import_directory
from groundrecall.promotion import promote_import_to_store
def _build_llmwiki_fixture(root: Path) -> Path:
(root / "wiki").mkdir(parents=True)
(root / "raw").mkdir()
(root / "wiki" / "channel-capacity.md").write_text(
"# Channel Capacity\n\n"
"- Reliable rate upper bound for a noisy channel.\n\n"
"See also [[Shannon Entropy]].\n",
encoding="utf-8",
)
(root / "raw" / "notes.md").write_text(
"Speculation: Capacity may depend on constraints.\n",
encoding="utf-8",
)
return root
def test_groundrecall_namespace_reexports_core_functions() -> None:
assert run_groundrecall_import.__module__ == "groundrecall.ingest"
assert query_concept.__module__ == "groundrecall.query"
assert export_canonical_bundle.__module__ == "groundrecall.export"
assert lint_import_directory.__module__ == "groundrecall.lint"
assert promote_import_to_store.__module__ == "groundrecall.promotion"
assert GroundRecallStore.__module__ == "groundrecall.store"
assert ClaimRecord.__module__ == "groundrecall.models"
def test_groundrecall_inspect_summarizes_store(tmp_path: Path) -> None:
source_root = _build_llmwiki_fixture(tmp_path / "llmwiki")
import_result = run_groundrecall_import(source_root, out_root=tmp_path / "imports", mode="quick", import_id="fixture-import")
store_dir = tmp_path / "store"
promote_import_to_store(import_result.out_dir, store_dir)
payload = inspect_store(store_dir, out_path=tmp_path / "inspect.json")
assert (tmp_path / "inspect.json").exists()
assert payload["claim_count"] >= 1
assert payload["concept_count"] >= 1
assert payload["snapshot_count"] >= 1
def test_groundrecall_cli_inspect_dispatches(tmp_path: Path, capsys) -> None:
source_root = _build_llmwiki_fixture(tmp_path / "llmwiki")
import_result = run_groundrecall_import(source_root, out_root=tmp_path / "imports", mode="quick", import_id="fixture-import")
store_dir = tmp_path / "store"
promote_import_to_store(import_result.out_dir, store_dir)
original_argv = sys.argv
try:
sys.argv = ["groundrecall.cli", "inspect", str(store_dir)]
groundrecall_cli_main()
finally:
sys.argv = original_argv
output = capsys.readouterr().out
assert '"claim_count"' in output
assert '"concept_count"' in output

View File

@ -0,0 +1,96 @@
from __future__ import annotations
import json
from pathlib import Path
from groundrecall.ingest import run_groundrecall_import
from groundrecall.promotion import promote_import_to_store
from groundrecall.store import GroundRecallStore
def test_groundrecall_promotion_writes_canonical_objects(tmp_path: Path) -> None:
root = tmp_path / "llmwiki"
(root / "wiki").mkdir(parents=True)
(root / "wiki" / "channel-capacity.md").write_text(
"# Channel Capacity\n\n"
"- Reliable rate upper bound for a noisy channel.\n\n"
"See also [[Shannon Entropy]].\n",
encoding="utf-8",
)
result = run_groundrecall_import(root, mode="quick", import_id="promote-test")
review_path = result.out_dir / "review_session.json"
review_payload = json.loads(review_path.read_text(encoding="utf-8"))
for concept in review_payload["draft_pack"]["concepts"]:
concept["status"] = "trusted"
review_path.write_text(json.dumps(review_payload, indent=2), encoding="utf-8")
store_dir = tmp_path / "groundrecall-store"
payload = promote_import_to_store(result.out_dir, store_dir, reviewer="R")
store = GroundRecallStore(store_dir)
concepts = store.list_concepts()
claims = store.list_claims()
relations = store.list_relations()
promotions = store.list_promotions()
snapshots = store.list_snapshots()
assert payload["promoted_concept_count"] >= 1
assert payload["promoted_claim_count"] >= 1
assert len(concepts) >= 2
assert any(item.current_status == "promoted" for item in concepts)
assert any(item.current_status == "promoted" for item in claims)
assert len(relations) >= 1
assert len(promotions) == 1
assert promotions[0].reviewer == "R"
assert len(snapshots) == 1
assert snapshots[0].metadata["source_import_id"] == "promote-test"
def test_groundrecall_promotion_respects_rejected_review_status(tmp_path: Path) -> None:
root = tmp_path / "llmwiki"
(root / "wiki").mkdir(parents=True)
(root / "wiki" / "solo.md").write_text(
"# Solo Concept\n\n- A solitary claim.\n",
encoding="utf-8",
)
result = run_groundrecall_import(root, mode="quick", import_id="reject-test")
review_path = result.out_dir / "review_session.json"
review_payload = json.loads(review_path.read_text(encoding="utf-8"))
review_payload["draft_pack"]["concepts"][0]["status"] = "rejected"
review_path.write_text(json.dumps(review_payload, indent=2), encoding="utf-8")
store_dir = tmp_path / "groundrecall-store"
promote_import_to_store(result.out_dir, store_dir, reviewer="R")
store = GroundRecallStore(store_dir)
assert store.list_concepts()[0].current_status == "rejected"
assert store.list_claims()[0].current_status == "rejected"
def test_groundrecall_promotion_preserves_contradiction_and_supersession_links(tmp_path: Path) -> None:
root = tmp_path / "llmwiki"
(root / "wiki").mkdir(parents=True)
(root / "wiki" / "notes.md").write_text(
"# Notes\n\n"
"- [claim_id: base] Channel capacity bounds reliable communication rate.\n"
"- [claim_id: revised] [supersedes: base] Channel capacity bounds reliable communication rate for a specified channel model.\n"
"- [claim_id: dissent] [contradicts: revised] Channel capacity has no stable interpretation.\n",
encoding="utf-8",
)
result = run_groundrecall_import(root, mode="quick", import_id="graph-test")
review_path = result.out_dir / "review_session.json"
review_payload = json.loads(review_path.read_text(encoding="utf-8"))
for concept in review_payload["draft_pack"]["concepts"]:
concept["status"] = "trusted"
review_path.write_text(json.dumps(review_payload, indent=2), encoding="utf-8")
store_dir = tmp_path / "groundrecall-store"
promote_import_to_store(result.out_dir, store_dir, reviewer="R")
store = GroundRecallStore(store_dir)
claims = {item.claim_id: item for item in store.list_claims()}
assert claims["clm_revised"].supersedes_claim_ids == ["clm_base"]
assert claims["clm_dissent"].contradicts_claim_ids == ["clm_revised"]

View File

@ -0,0 +1,190 @@
from __future__ import annotations
from pathlib import Path
from groundrecall.models import (
ArtifactRecord,
ClaimRecord,
ConceptRecord,
ObservationRecord,
ProvenanceRecord,
RelationRecord,
)
from groundrecall.query import (
build_query_bundle_for_concept,
query_concept,
query_provenance,
search_claims,
)
from groundrecall.store import GroundRecallStore
def _seed_store(store: GroundRecallStore) -> None:
store.save_artifact(
ArtifactRecord(
artifact_id="ia_001",
artifact_kind="compiled_page",
title="Channel Capacity",
path="wiki/channel-capacity.md",
current_status="reviewed",
)
)
store.save_observation(
ObservationRecord(
observation_id="obs_001",
artifact_id="ia_001",
role="claim",
text="Reliable communication rate is bounded by channel capacity.",
provenance=ProvenanceRecord(
origin_artifact_id="ia_001",
origin_path="wiki/channel-capacity.md",
support_kind="derived_from_page",
grounding_status="grounded",
),
current_status="reviewed",
)
)
store.save_concept(
ConceptRecord(
concept_id="concept::channel-capacity",
title="Channel Capacity",
description="Reliable communication limit.",
source_artifact_ids=["ia_001"],
current_status="promoted",
)
)
store.save_concept(
ConceptRecord(
concept_id="concept::shannon-entropy",
title="Shannon Entropy",
description="Average uncertainty.",
current_status="promoted",
)
)
store.save_claim(
ClaimRecord(
claim_id="clm_001",
claim_text="Channel capacity bounds reliable communication rate.",
concept_ids=["concept::channel-capacity"],
source_observation_ids=["obs_001"],
confidence_hint=0.8,
review_confidence=0.9,
provenance=ProvenanceRecord(
origin_artifact_id="ia_001",
origin_path="wiki/channel-capacity.md",
support_kind="derived_from_page",
grounding_status="grounded",
),
current_status="promoted",
)
)
store.save_claim(
ClaimRecord(
claim_id="clm_002",
claim_text="Shannon entropy can inform channel coding intuition.",
concept_ids=["concept::shannon-entropy"],
contradicts_claim_ids=["clm_999"],
provenance=ProvenanceRecord(
origin_artifact_id="ia_001",
origin_path="wiki/channel-capacity.md",
support_kind="derived_from_page",
grounding_status="partially_grounded",
),
current_status="reviewed",
)
)
store.save_relation(
RelationRecord(
relation_id="rel_001",
source_id="concept::channel-capacity",
target_id="concept::shannon-entropy",
relation_type="references",
current_status="promoted",
)
)
def test_query_concept_returns_neighborhood_and_support(tmp_path: Path) -> None:
store = GroundRecallStore(tmp_path / "groundrecall")
_seed_store(store)
payload = query_concept(store.base_dir, "channel-capacity")
assert payload is not None
assert payload["concept"]["concept_id"] == "concept::channel-capacity"
assert len(payload["claims"]) == 1
assert len(payload["relations"]) == 1
assert any(item["concept_id"] == "concept::shannon-entropy" for item in payload["related_concepts"])
assert payload["supporting_observations"][0]["origin_path"] == "wiki/channel-capacity.md"
def test_search_claims_matches_text_and_concept_titles(tmp_path: Path) -> None:
store = GroundRecallStore(tmp_path / "groundrecall")
_seed_store(store)
payload = search_claims(store.base_dir, "entropy")
assert payload["query_type"] == "claim_search"
assert any(match["claim"]["claim_id"] == "clm_002" for match in payload["matches"])
def test_query_provenance_filters_by_origin_path(tmp_path: Path) -> None:
store = GroundRecallStore(tmp_path / "groundrecall")
_seed_store(store)
payload = query_provenance(store.base_dir, origin_path="wiki/channel-capacity.md")
assert len(payload["claims"]) == 2
assert len(payload["observations"]) == 1
def test_build_query_bundle_for_concept_is_assistant_neutral(tmp_path: Path) -> None:
store = GroundRecallStore(tmp_path / "groundrecall")
_seed_store(store)
payload = build_query_bundle_for_concept(store.base_dir, "channel capacity")
assert payload is not None
assert payload["bundle_kind"] == "groundrecall_query_bundle"
assert payload["concept"]["concept_id"] == "concept::channel-capacity"
assert isinstance(payload["suggested_next_actions"], list)
forbidden = {"assistant", "codex", "claude", "prompt_text"}
assert set(payload).isdisjoint(forbidden)
def test_query_bundle_surfaces_contradictions_and_supersessions(tmp_path: Path) -> None:
store = GroundRecallStore(tmp_path / "groundrecall")
_seed_store(store)
store.save_claim(
ClaimRecord(
claim_id="clm_003",
claim_text="Channel capacity is undefined in practice.",
concept_ids=["concept::channel-capacity"],
contradicts_claim_ids=["clm_001"],
provenance=ProvenanceRecord(
origin_artifact_id="ia_001",
origin_path="wiki/channel-capacity.md",
support_kind="derived_from_page",
grounding_status="partially_grounded",
),
current_status="reviewed",
)
)
store.save_claim(
ClaimRecord(
claim_id="clm_004",
claim_text="Channel capacity should be interpreted relative to a specific channel model.",
concept_ids=["concept::channel-capacity"],
supersedes_claim_ids=["clm_001"],
provenance=ProvenanceRecord(
origin_artifact_id="ia_001",
origin_path="wiki/channel-capacity.md",
support_kind="derived_from_page",
grounding_status="grounded",
),
current_status="reviewed",
)
)
payload = build_query_bundle_for_concept(store.base_dir, "channel-capacity")
assert payload is not None
contradiction_ids = {item["claim_id"] for item in payload["contradictions"]}
supersession_ids = {item["claim_id"] for item in payload["supersessions"]}
assert "clm_003" in contradiction_ids
assert "clm_004" in supersession_ids

View File

@ -0,0 +1,60 @@
from __future__ import annotations
from types import SimpleNamespace
from groundrecall.review_server import _safe_show_entry, _safe_verify_entry
class _StoreWithoutConflicts:
def get_entry(self, citation_key: str):
if citation_key != "baum1974generalized":
return None
return {"citation_key": citation_key, "title": "On two types of deviation"}
def get_field_provenance(self, citation_key: str):
return [{"field_name": "title", "source_label": "refs.bib"}]
def get_entry_bibtex(self, citation_key: str):
return "@article{baum1974generalized, title={On two types of deviation}}"
class _ApiWithPartialSupport:
def __init__(self):
self.store = _StoreWithoutConflicts()
def show_entry(self, citation_key: str, **kwargs):
raise AttributeError("get_conflicts missing in underlying store")
def verify_bibtex(self, bibtex_text: str, *, context: str = "", limit: int = 5):
raise RuntimeError("pybtex unavailable")
def verify_strings(self, values: list[str], *, context: str = "", limit: int = 5):
return {"context": context, "results": [{"values": values, "limit": limit}]}
def test_safe_show_entry_falls_back_when_citegeist_show_entry_is_incompatible() -> None:
api = _ApiWithPartialSupport()
payload = _safe_show_entry(api, "baum1974generalized")
assert payload is not None
assert payload["citation_key"] == "baum1974generalized"
assert payload["conflicts"] == []
assert payload["provenance"][0]["source_label"] == "refs.bib"
assert "bibtex" in payload
def test_safe_verify_entry_falls_back_to_verify_strings() -> None:
api = _ApiWithPartialSupport()
entry = SimpleNamespace(
citation_key="baum1974generalized",
title="On two types of deviation",
author="W. M. Baum",
year="1974",
raw_bibtex="@article{baum1974generalized, title={On two types of deviation}}",
)
payload = _safe_verify_entry(api, entry, context="pieces/intro.tex Intro", limit=5)
assert payload["results"]
assert payload["results"][0]["values"][0] == "baum1974generalized"

View File

@ -0,0 +1,86 @@
from __future__ import annotations
import json
from pathlib import Path
from groundrecall.ingest import run_groundrecall_import
from groundrecall.review_workspace import GroundRecallReviewWorkspace
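
# One wiki page with a citation command is enough for the quick import to emit
# citation reviews for the workspace to manage.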
def _build_citation_fixture(root: Path) -> Path:
(root / "wiki").mkdir(parents=True)
(root / "wiki" / "learning-theory.md").write_text(
"# Learning Theory\n\n"
"Matching-law style regularities can be compared with machine learning optimization.\n\n"
"See \\\\cite{herrnstein1961matching} for the classic framing.\n",
encoding="utf-8",
)
return root
def test_review_workspace_populates_and_persists_citation_reviews(tmp_path: Path) -> None:
source_root = _build_citation_fixture(tmp_path / "llmwiki")
import_result = run_groundrecall_import(source_root, out_root=tmp_path / "imports", mode="quick", import_id="review-fixture")
workspace = GroundRecallReviewWorkspace(import_result.out_dir)
payload = workspace.load_review_data()
assert payload["citation_reviews"]
citation_review_id = payload["citation_reviews"][0]["citation_review_id"]
workspace.apply_updates(
concept_updates=[
{
"concept_id": "learning-theory",
"status": "trusted",
"notes": ["Strong framing concept.", "Citation support looks plausible."],
}
],
citation_updates=[
{
"citation_review_id": citation_review_id,
"status": "verified",
"notes": ["Classic matching-law citation."],
}
],
reviewer="Unit Test Reviewer",
)
session = json.loads((import_result.out_dir / "review_session.json").read_text(encoding="utf-8"))
concept = next(item for item in session["draft_pack"]["concepts"] if item["concept_id"] == "learning-theory")
citation = next(item for item in session["citation_reviews"] if item["citation_review_id"] == citation_review_id)
assert session["reviewer"] == "Unit Test Reviewer"
assert concept["status"] == "trusted"
assert citation["status"] == "verified"
review_data = json.loads((import_result.out_dir / "review_data.json").read_text(encoding="utf-8"))
assert any(item["citation_review_id"] == citation_review_id for item in review_data["citation_reviews"])
def test_review_workspace_resolves_citation_metadata_from_bibtex(tmp_path: Path) -> None:
root = tmp_path / "llmwiki"
(root / "wiki").mkdir(parents=True)
(root / "wiki" / "matching.md").write_text(
"# Matching\n\n"
"The manuscript cites \\\\cite{baum1974generalized} here.\n",
encoding="utf-8",
)
(root / "refs.bib").write_text(
"@article{baum1974generalized,\n"
" author = {W. M. Baum},\n"
" title = {On two types of deviation from the matching law: Bias and undermatching},\n"
" journal = {Journal of the Experimental Analysis of Behavior},\n"
" year = {1974}\n"
"}\n",
encoding="utf-8",
)
import_result = run_groundrecall_import(root, out_root=tmp_path / "imports", mode="quick", import_id="bib-fixture")
workspace = GroundRecallReviewWorkspace(import_result.out_dir)
payload = workspace.load_review_data()
entry = next(item for item in payload["citation_reviews"] if item["citation_key"] == "baum1974generalized")
assert entry["title"] == "On two types of deviation from the matching law: Bias and undermatching"
assert entry["source_bib_path"] == "refs.bib"
assert entry["raw_bibtex"]
assert payload["bibliography"]["entry_count"] >= 1

View File

@ -0,0 +1,235 @@
from __future__ import annotations
import json
from pathlib import Path
import groundrecall.ingest as ingest_module
import groundrecall.source_adapters # noqa: F401
from groundrecall.source_adapters.base import detect_source_adapter, list_source_adapters
from groundrecall.ingest import run_groundrecall_import
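
# The registry should expose the built-in adapters by name.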
def test_groundrecall_source_adapter_registry_lists_expected_adapters() -> None:
names = set(list_source_adapters())
assert "llmwiki" in names
assert "polypaper" in names
assert "markdown_notes" in names
assert "transcript" in names
assert "didactopus_pack" in names
assert "doclift_bundle" in names
def test_detect_llmwiki_adapter(tmp_path: Path) -> None:
(tmp_path / "wiki").mkdir()
adapter = detect_source_adapter(tmp_path)
assert adapter.name == "llmwiki"
assert adapter.import_intent() == "grounded_knowledge"
def test_detect_didactopus_pack_adapter(tmp_path: Path) -> None:
(tmp_path / "pack.yaml").write_text("name: p\n", encoding="utf-8")
(tmp_path / "concepts.yaml").write_text("concepts: []\n", encoding="utf-8")
adapter = detect_source_adapter(tmp_path)
assert adapter.name == "didactopus_pack"
assert adapter.import_intent() == "both"
def test_detect_doclift_bundle_adapter(tmp_path: Path) -> None:
(tmp_path / "documents").mkdir()
(tmp_path / "manifest.json").write_text('{"documents": []}\n', encoding="utf-8")
adapter = detect_source_adapter(tmp_path)
assert adapter.name == "doclift_bundle"
assert adapter.import_intent() == "both"
def test_groundrecall_import_records_adapter_and_intent(tmp_path: Path) -> None:
(tmp_path / "wiki").mkdir()
(tmp_path / "wiki" / "note.md").write_text("# Title\n\n- A note.\n", encoding="utf-8")
result = run_groundrecall_import(tmp_path, mode="quick", import_id="adapter-test")
assert result.manifest["source_adapter"] == "llmwiki"
assert result.manifest["import_intent"] == "grounded_knowledge"
def test_markdown_notes_adapter_ingests_tex_files(tmp_path: Path) -> None:
(tmp_path / "draft.tex").write_text(
"\\section{Related Work}\n\n"
"We connect behaviorism and language models.\n",
encoding="utf-8",
)
adapter = detect_source_adapter(tmp_path)
assert adapter.name == "markdown_notes"
result = run_groundrecall_import(tmp_path, mode="quick", import_id="tex-test")
assert result.manifest["source_adapter"] == "markdown_notes"
assert result.manifest["artifact_count"] == 1
assert result.artifacts[0]["path"] == "draft.tex"
assert result.claims
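
# _convert_tex_to_markdown is monkeypatched so this exercises the pandoc-backed
# path without requiring pandoc to be installed.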
def test_tex_import_uses_pandoc_markdown_when_available(tmp_path: Path, monkeypatch) -> None:
(tmp_path / "draft.tex").write_text(
"\\section{Ignored by fallback}\n"
"\\usepackage{amsmath}\n",
encoding="utf-8",
)
monkeypatch.setattr(
ingest_module,
"_convert_tex_to_markdown",
lambda path: "# Converted Draft\n\n- Converted claim from pandoc.\n",
)
result = run_groundrecall_import(tmp_path, mode="quick", import_id="tex-pandoc-test")
claim_texts = [item["claim_text"] for item in result.claims]
concept_ids = {item["concept_id"] for item in result.concepts}
assert "Converted claim from pandoc." in claim_texts
assert "concept::converted-draft" in concept_ids
def test_detect_polypaper_adapter_and_exclude_support_files(tmp_path: Path) -> None:
(tmp_path / "pieces").mkdir()
(tmp_path / "figs").mkdir()
(tmp_path / "setup").mkdir()
(tmp_path / "main.tex").write_text(
"\\include{pieces/discussion}\n"
"\\include{pieces/table-results}\n"
"\\input{figs/figure-system}\n",
encoding="utf-8",
)
(tmp_path / "paper.org").write_text("* draft\n", encoding="utf-8")
(tmp_path / "pieces" / "discussion.tex").write_text("\\section{Discussion}\n\nMore text.\n", encoding="utf-8")
(tmp_path / "pieces" / "table-results.tex").write_text("\\begin{tabular}x\\end{tabular}\n", encoding="utf-8")
(tmp_path / "pieces" / "unused.tex").write_text("\\section{Unused}\n\nIgnore me.\n", encoding="utf-8")
(tmp_path / "figs" / "figure-system.tex").write_text("\\begin{figure}x\\end{figure}\n", encoding="utf-8")
(tmp_path / "setup" / "venue-arxiv.tex").write_text("\\section{Setup}\n", encoding="utf-8")
(tmp_path / ".pp-export-tmp.tex").write_text("\\section{Tmp}\n", encoding="utf-8")
adapter = detect_source_adapter(tmp_path)
assert adapter.name == "polypaper"
result = run_groundrecall_import(tmp_path, mode="quick", import_id="polypaper-test")
paths = {item["path"] for item in result.artifacts}
assert "main.tex" not in paths
assert "pieces/discussion.tex" in paths
assert "pieces/table-results.tex" not in paths
assert "figs/figure-system.tex" not in paths
assert "pieces/unused.tex" not in paths
assert "setup/venue-arxiv.tex" not in paths
assert ".pp-export-tmp.tex" not in paths
def test_tex_import_skips_table_and_figure_markup_from_pandoc(tmp_path: Path, monkeypatch) -> None:
(tmp_path / "draft.tex").write_text("\\section{Draft}\n", encoding="utf-8")
monkeypatch.setattr(
ingest_module,
"_convert_tex_to_markdown",
lambda path: "\n".join(
[
"# Draft",
"",
"![image](figure.png)",
"| Col A | Col B |",
"| --- | --- |",
"| 1 | 2 |",
"</div>",
"\\begin{tabular}{ll}",
"- Real manuscript claim.",
]
),
)
result = run_groundrecall_import(tmp_path, mode="quick", import_id="tex-cleanup-test")
claim_texts = [item["claim_text"] for item in result.claims]
assert claim_texts == ["Real manuscript claim."]
def test_didactopus_pack_import_generates_structured_concepts_and_relations(tmp_path: Path) -> None:
(tmp_path / "pack.yaml").write_text(
"\n".join(
[
"name: sample-pack",
"display_name: Sample Pack",
"version: 0.1.0",
"schema_version: 0.1.0",
"didactopus_min_version: 0.1.0",
"didactopus_max_version: 9.9.9",
]
),
encoding="utf-8",
)
(tmp_path / "concepts.yaml").write_text(
"\n".join(
[
"concepts:",
" - id: basics",
" title: Basics",
" description: Foundational concept.",
" mastery_signals: [Explain the foundation.]",
" - id: advanced",
" title: Advanced",
" description: Builds on basics.",
" prerequisites: [basics]",
]
),
encoding="utf-8",
)
(tmp_path / "roadmap.yaml").write_text(
"\n".join(
[
"stages:",
" - id: stage1",
" title: Stage One",
" concepts: [basics, advanced]",
]
),
encoding="utf-8",
)
result = run_groundrecall_import(tmp_path, mode="quick", import_id="pack-test")
assert result.manifest["source_adapter"] == "didactopus_pack"
assert result.manifest["import_intent"] == "both"
concept_ids = {item["concept_id"] for item in result.concepts}
assert "concept::basics" in concept_ids
assert "concept::advanced" in concept_ids
relation_targets = {(item["source_id"], item["target_id"], item["relation_type"]) for item in result.relations}
assert ("concept::basics", "concept::advanced", "prerequisite") in relation_targets
claim_ids = {item["claim_id"] for item in result.claims}
assert "clm_pack_basics" in claim_ids
assert "clm_stage_stage1_basics" in claim_ids
def test_doclift_bundle_import_generates_structured_concepts(tmp_path: Path) -> None:
doc_dir = tmp_path / "documents" / "lesson-a"
doc_dir.mkdir(parents=True)
(tmp_path / "manifest.json").write_text(
'\n'.join(
[
"{",
' "documents": [',
" {",
' "document_id": "lesson-a",',
' "title": "Lecture 1. Example",',
' "document_kind": "lecture",',
f' "output_dir": "{doc_dir}",',
f' "markdown_path": "{doc_dir / "document.md"}",',
f' "figures_path": "{doc_dir / "document.figures.json"}"',
" }",
" ]",
"}",
]
),
encoding="utf-8",
)
(doc_dir / "document.md").write_text("# Lecture 1. Example\n\nBody.\n", encoding="utf-8")
(doc_dir / "document.figures.json").write_text('{"source_path": "/tmp/source.doc"}\n', encoding="utf-8")
result = run_groundrecall_import(tmp_path, mode="quick", import_id="doclift-test")
assert result.manifest["source_adapter"] == "doclift_bundle"
assert result.manifest["import_intent"] == "both"
concept_ids = {item["concept_id"] for item in result.concepts}
assert "concept::lesson-a" in concept_ids
claim_ids = {item["claim_id"] for item in result.claims}
assert "clm_doclift_1" in claim_ids

View File

@ -0,0 +1,148 @@
from __future__ import annotations
import json
from pathlib import Path
from groundrecall.models import (
ClaimRecord,
ConceptRecord,
GroundRecallSnapshot,
PromotionRecord,
ProvenanceRecord,
RelationRecord,
ReviewCandidateRecord,
SourceRecord,
)
from groundrecall.store import GroundRecallStore
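
# Round-trip the core object families through the store's save/get pairs.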
def test_groundrecall_store_round_trips_canonical_objects(tmp_path: Path) -> None:
store = GroundRecallStore(tmp_path / "groundrecall")
source = store.save_source(
SourceRecord(
source_id="src_001",
title="Channel Notes",
source_type="markdown",
path="wiki/channel-capacity.md",
current_status="promoted",
)
)
claim = store.save_claim(
ClaimRecord(
claim_id="clm_001",
claim_text="Channel capacity bounds reliable communication rate.",
claim_kind="definition",
concept_ids=["concept::channel-capacity"],
confidence_hint=0.72,
provenance=ProvenanceRecord(
origin_artifact_id="ia_001",
origin_path="wiki/channel-capacity.md",
support_kind="derived_from_page",
grounding_status="partially_grounded",
),
current_status="reviewed",
)
)
concept = store.save_concept(
ConceptRecord(
concept_id="concept::channel-capacity",
title="Channel Capacity",
description="Imported concept.",
current_status="promoted",
)
)
relation = store.save_relation(
RelationRecord(
relation_id="rel_001",
source_id="concept::channel-capacity",
target_id="concept::shannon-entropy",
relation_type="references",
current_status="draft",
)
)
review_candidate = store.save_review_candidate(
ReviewCandidateRecord(
review_candidate_id="rc_001",
candidate_type="claim",
candidate_id="clm_001",
triage_lane="knowledge_capture",
priority=10,
current_status="triaged",
)
)
promotion = store.save_promotion(
PromotionRecord(
promotion_id="pr_001",
candidate_type="claim",
candidate_id="clm_001",
reviewer="R",
promoted_object_ids=["clm_001"],
promoted_at="2026-04-17T12:00:00Z",
)
)
assert store.get_source(source.source_id) is not None
assert store.get_claim(claim.claim_id) is not None
assert store.get_concept(concept.concept_id) is not None
assert store.get_relation(relation.relation_id) is not None
assert store.get_review_candidate(review_candidate.review_candidate_id) is not None
assert store.get_promotion(promotion.promotion_id) is not None
def test_groundrecall_store_builds_and_persists_snapshot(tmp_path: Path) -> None:
store = GroundRecallStore(tmp_path / "groundrecall")
store.save_source(SourceRecord(source_id="src_001", title="T", current_status="promoted"))
store.save_claim(
ClaimRecord(
claim_id="clm_001",
claim_text="A grounded claim.",
concept_ids=["concept::c1"],
current_status="promoted",
)
)
store.save_concept(ConceptRecord(concept_id="concept::c1", title="C1", current_status="promoted"))
snapshot = store.build_snapshot(
snapshot_id="snap_001",
created_at="2026-04-17T12:00:00Z",
metadata={"export_kind": "canonical"},
)
saved = store.save_snapshot(snapshot)
loaded = store.get_snapshot(saved.snapshot_id)
assert loaded is not None
assert isinstance(loaded, GroundRecallSnapshot)
assert loaded.metadata["export_kind"] == "canonical"
assert len(loaded.sources) == 1
assert len(loaded.claims) == 1
assert len(loaded.concepts) == 1
def test_groundrecall_models_remain_assistant_neutral() -> None:
claim_fields = set(ClaimRecord.model_fields)
concept_fields = set(ConceptRecord.model_fields)
snapshot_fields = set(GroundRecallSnapshot.model_fields)
forbidden = {"assistant", "assistant_name", "codex", "claude", "skill_bundle", "prompt_text"}
assert claim_fields.isdisjoint(forbidden)
assert concept_fields.isdisjoint(forbidden)
assert snapshot_fields.isdisjoint(forbidden)
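
# Saves should land as complete JSON documents with no stray *.tmp files,
# consistent with a write-then-rename pattern (sketch, names hypothetical):
#     tmp = final_path.with_name(final_path.name + ".tmp")
#     tmp.write_text(record.model_dump_json(), encoding="utf-8")
#     tmp.replace(final_path)  # os.replace is atomic on POSIX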
def test_groundrecall_store_writes_json_atomically_without_tmp_artifacts(tmp_path: Path) -> None:
store = GroundRecallStore(tmp_path / "groundrecall")
claim = store.save_claim(
ClaimRecord(
claim_id="clm_atomic",
claim_text="Atomic writes should leave valid JSON on disk.",
concept_ids=["concept::atomicity"],
current_status="reviewed",
)
)
claim_path = store.claims_dir / f"{claim.claim_id}.json"
payload = json.loads(claim_path.read_text(encoding="utf-8"))
assert payload["claim_id"] == "clm_atomic"
assert list(store.claims_dir.glob("*.tmp")) == []