Initial commit
@@ -0,0 +1,20 @@

# Docs

The top-level documentation in this repository is intended to describe `GroundRecall` as a standalone project.

Primary docs:

- [quickstart.md](quickstart.md)
- [architecture.md](architecture.md)
- [llmwiki-import.md](llmwiki-import.md)
- [sync-roadmap.md](sync-roadmap.md)

Legacy extraction notes:

- [legacy/groundrecall-assistant-architecture.md](legacy/groundrecall-assistant-architecture.md)
- [legacy/groundrecall-ingestion-refactor.md](legacy/groundrecall-ingestion-refactor.md)
- [legacy/groundrecall-llmwiki-import.md](legacy/groundrecall-llmwiki-import.md)
- [legacy/groundrecall-migration-plan.md](legacy/groundrecall-migration-plan.md)
- [legacy/groundrecall-repo-bootstrap.md](legacy/groundrecall-repo-bootstrap.md)

Those legacy documents were carried over from the earlier `Didactopus`-embedded phase. They remain useful as design history, but they are not the preferred starting point for current standalone `GroundRecall` work.
@@ -0,0 +1,93 @@

# Architecture

`GroundRecall` is the grounded knowledge substrate in a larger stack:

- `GroundRecall`: canonical knowledge ingestion, promotion, query, export, and future sync
- `Didactopus`: learner-facing workflows and educational tooling
- `GenieHive`: model and routing layer where runtime assistant/service resolution is needed

## Core Design

The system is built around one canonical flow (a CLI walkthrough follows the list):

1. ingest weakly structured sources
2. normalize them into stable knowledge objects
3. lint and queue them for review
4. promote reviewed objects into a canonical store
5. query and export promoted state
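For orientation, the same flow expressed through the standalone CLI documented in [quickstart.md](quickstart.md); the paths and concept name here are illustrative:

```bash
# 1-2: ingest and normalize a weakly structured source tree
groundrecall import /path/to/llmwiki --mode quick

# 3: lint the normalized import and build its review queue
groundrecall lint imports/<import-id>

# 4: promote reviewed objects into the canonical store
groundrecall promote imports/<import-id> store/

# 5: query and export promoted state
groundrecall query store/ channel-capacity
groundrecall export store/ exports/groundrecall --concept channel-capacity
```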
## Core Objects

The canonical store is built from these object families:

- `Source`
- `Fragment`
- `Artifact`
- `Observation`
- `Claim`
- `Concept`
- `Relation`
- `ReviewCandidate`
- `PromotionRecord`
- `GroundRecallSnapshot`

These objects are assistant-neutral. Assistant-specific formatting belongs at the adapter layer. A minimal model sketch follows.
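As a sketch of what one of these families could look like, here is a hypothetical `Claim` using `pydantic` (a declared dependency), with field names borrowed from the import contracts in [llmwiki-import.md](llmwiki-import.md); the real definitions live in `groundrecall.models` and may differ:

```python
from pydantic import BaseModel, Field


class Claim(BaseModel):
    # Sketch only: mirrors the ImportedClaim contract, not the shipped model.
    claim_id: str
    claim_text: str
    claim_kind: str = "statement"
    supporting_fragment_ids: list[str] = Field(default_factory=list)
    concept_ids: list[str] = Field(default_factory=list)
    confidence_hint: float = 0.0
    grounding_status: str = "ungrounded"  # grounded | partially_grounded | ungrounded
    current_status: str = "draft"         # draft | triaged | reviewed | promoted | ...
```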
## Package Surface

The main standalone package surface is:

- `groundrecall.ingest`
- `groundrecall.lint`
- `groundrecall.models`
- `groundrecall.store`
- `groundrecall.promotion`
- `groundrecall.query`
- `groundrecall.export`
- `groundrecall.assistant_export`
- `groundrecall.inspect`
- `groundrecall.source_adapters.*`
- `groundrecall.assistants.*`

There are also compatibility-style helper modules prefixed with `groundrecall_` inside the package. Those exist because the standalone repo was extracted from an earlier monorepo layout.

## Source Adapters

Adapters handle source-shape-specific discovery and mapping while the downstream pipeline stays generic.

Current adapter families include:

- `llmwiki`
- `markdown_notes`
- `transcript`
- `didactopus_pack`

## Assistant Boundary

Assistant integration is intentionally outside the core store and query semantics.

The rule is:

- core `GroundRecall` owns truth, provenance, lifecycle, and retrieval semantics
- assistant adapters own presentation, bundle shaping, and tool-specific exports

Current adapters include:

- `codex`
- `claude_code`

## Alpha Boundary

The current alpha is strong enough for:

- local import and promotion
- canonical query and export
- assistant-neutral bundles
- assistant-targeted bundle generation

It is not yet complete for:

- multi-node sync and merge
- re-import/update semantics
- richer review adjudication
- large-scale distributed corpus integration
@@ -0,0 +1,174 @@

# GroundRecall Assistant Integration Architecture

This document defines how GroundRecall should support Codex, Claude Code, and future assistant environments without treating any single assistant as the authoritative integration target.

## Design rule

GroundRecall core must be assistant-agnostic.

Assistant-specific formats are derived views over promoted GroundRecall objects, not the canonical representation of knowledge.

## Why this boundary matters

If assistant-specific prompt packaging leaks into the core model too early, GroundRecall becomes:

- harder to evolve
- harder to validate
- harder to sync across machines
- harder to support across multiple assistant environments

The stable boundary should instead be:

- canonical grounded knowledge objects in core
- assistant adapters at the edge

## Core vs adapter split

### Core GroundRecall responsibilities

These should remain assistant-neutral:

- schemas for `Source`, `Fragment`, `Artifact`, `Observation`, `Claim`, `Concept`, `Relation`, `ReviewCandidate`, and `PromotionRecord`
- provenance and confidence modeling
- contradiction and supersession handling
- linting and review queue generation
- review and promotion workflows
- persistent storage for promoted objects
- query and retrieval semantics
- sync and multi-machine consolidation
- canonical export formats

### Assistant adapter responsibilities

These should be adapter-specific:

- prompt/context packaging
- assistant-specific bundle layout
- memory-file rendering
- skill-file rendering
- assistant capability declarations
- token-budget shaping and truncation policy
- tool-specific metadata

## Canonical export contract

GroundRecall should export assistant-neutral artifacts first.

Recommended canonical exports:

- `groundrecall_snapshot.json`
- `claims.jsonl`
- `concepts.jsonl`
- `relations.jsonl`
- `provenance_manifest.json`
- `query_bundle.json`

Assistant adapters then derive secondary outputs from those canonical exports.

## Assistant adapter interface

GroundRecall should expose a small adapter protocol.

Example shape:
```python
from pathlib import Path
from typing import Protocol


class AssistantAdapter(Protocol):
    name: str

    def export_bundle(self, snapshot: dict, out_dir: Path) -> list[Path]:
        ...

    def build_context(self, query_result: dict) -> dict:
        ...

    def supported_capabilities(self) -> dict[str, bool]:
        ...
```

This is a strategy/plugin boundary. A small registry or factory is acceptable (a minimal sketch follows), but the important architectural decision is the separation of concerns, not the factory itself.
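Continuing from the protocol above, a minimal registry sketch; the names `_ADAPTERS`, `register_adapter`, and `get_adapter` are illustrative rather than a committed API:

```python
_ADAPTERS: dict[str, AssistantAdapter] = {}


def register_adapter(adapter: AssistantAdapter) -> None:
    # Adapters self-register by name, e.g. "codex" or "claude_code".
    _ADAPTERS[adapter.name] = adapter


def get_adapter(name: str) -> AssistantAdapter:
    try:
        return _ADAPTERS[name]
    except KeyError:
        raise ValueError(f"unknown assistant adapter: {name!r}") from None
```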
## Recommended package layout

Recommended modules:

- `didactopus.groundrecall.models`
- `didactopus.groundrecall.store`
- `didactopus.groundrecall.promotion`
- `didactopus.groundrecall.query`
- `didactopus.groundrecall.export`
- `didactopus.groundrecall.assistants.base`
- `didactopus.groundrecall.assistants.codex`
- `didactopus.groundrecall.assistants.claude_code`

## Export layering

Recommended filesystem layout:

- `exports/canonical/`
- `exports/assistants/codex/`
- `exports/assistants/claude-code/`

Canonical exports remain the durable interchange format.

Assistant exports remain reproducible derived artifacts.

## Query layering

The query layer should return assistant-neutral structures such as (a payload sketch follows the list):

- relevant claims
- supporting fragments
- provenance
- contradictions
- supersessions
- confidence and recency
- suggested next actions
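For concreteness, one shape such a payload could take; the field names are illustrative rather than a committed schema, and the ids are reused from the import examples elsewhere in these docs:

```json
{
  "concept": "channel-capacity",
  "claims": [
    {"claim_id": "clm_001", "claim_text": "...", "confidence_hint": 0.74}
  ],
  "fragments": ["frag_014"],
  "provenance": [{"origin_path": "wiki/channel-capacity.md"}],
  "contradictions": [],
  "supersessions": [],
  "suggested_next_actions": ["review ungrounded claims"]
}
```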
Adapters may then convert this payload into:

- Codex skill/context bundles
- Claude Code project memory/context bundles
- future assistant context packages

## Stability policy

GroundRecall should adopt these rules early:

1. No assistant-specific fields in canonical `Claim` or `Concept` objects.
2. No assistant-specific persistence formats as authoritative storage.
3. No review or promotion decisions based on assistant-specific packaging.
4. Assistant adapters may be added or removed without changing canonical objects.

## Migration implication

Current and future GroundRecall work should replace language like:

- "Codex-facing export"
- "Codex skill bundle"

with:

- "assistant adapter bundle"
- "assistant-facing export"
- "assistant-specific derived bundle"

Codex can still be one adapter and may remain the first implemented adapter, but it should not define the system boundary.

## Immediate implementation impact

The next GroundRecall milestones should be interpreted as:

1. build assistant-neutral canonical models and storage
2. build review and promotion over canonical objects
3. build canonical query and export layers
4. add assistant adapters as thin renderers over those canonical outputs

This is the lowest-risk path for long-term stability.
@@ -0,0 +1,105 @@

# GroundRecall Ingestion Refactor Plan

GroundRecall should treat `llmwiki` as one upstream source shape, not as the defining architecture for grounded knowledge import.

Didactopus already has broader ambitions around ingestion of weakly structured materials such as:

- markdown notes
- transcripts
- HTML/text course materials
- generated draft packs
- review sessions
- learner artifacts

The GroundRecall import pipeline should therefore be generalized around a shared normalization and promotion substrate with pluggable source adapters.

## Design rule

Source-specific logic should live at the ingestion edge.

These stages should be generic:

- segmentation
- extraction
- normalization
- lint
- review queue generation
- review bridge
- promotion
- canonical store
- query
- canonical export

## Recommended module split

Recommended package layout:

- `didactopus.groundrecall_ingest`
- `didactopus.groundrecall_source_adapters.base`
- `didactopus.groundrecall_source_adapters.llmwiki`
- `didactopus.groundrecall_source_adapters.markdown_notes`
- `didactopus.groundrecall_source_adapters.transcript`
- `didactopus.groundrecall_source_adapters.didactopus_pack`
- `didactopus.groundrecall_source_adapters.didactopus_review`

## Shared intermediate envelope

Adapters should emit shared discovery records rather than jumping straight into canonical GroundRecall objects.

Recommended intermediate types (sketched below):

- `DiscoveredImportSource`
- `SegmentCandidate`
- `ImportProfile`

This keeps adapter-specific parsing separate from the shared import pipeline.
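A minimal dataclass sketch of these envelope types; only the type names come from this plan, so every field here is an assumption about what the shared pipeline would need:

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class DiscoveredImportSource:
    # One discovered file or tree, before any parsing.
    path: Path
    source_kind: str  # e.g. "llmwiki", "markdown_notes", "transcript"
    output_intent: str = "grounded_knowledge"  # or "curriculum" / "both"


@dataclass
class SegmentCandidate:
    # One stable unit of content cut from a discovered source.
    source: DiscoveredImportSource
    text: str
    heading: str | None = None
    line_start: int | None = None
    line_end: int | None = None


@dataclass
class ImportProfile:
    # Tunes how aggressively the shared pipeline extracts candidates.
    name: str = "default"
    extraction_strictness: float = 0.5
    options: dict[str, object] = field(default_factory=dict)
```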
## Output intent

Not every imported source should be treated the same way.

Adapters should declare an output intent:

- `grounded_knowledge`
- `curriculum`
- `both`

Examples:

- `llmwiki` usually targets `grounded_knowledge`
- loose transcripts may target `grounded_knowledge`
- syllabus/course folders often target `curriculum`
- Didactopus packs or review sessions may target `both`

## First refactor milestones

### Milestone 1

- introduce adapter registry and adapter protocol
- move current `llmwiki` discovery/classification behind an adapter
- preserve the current import CLI behavior

### Milestone 2

- add a `markdown_notes` adapter
- add a `transcript` adapter
- add import profiles that tune extraction strictness

### Milestone 3

- add a `didactopus_pack` adapter for pack and review artifacts
- allow current Didactopus outputs to feed into GroundRecall directly

## Why this matters

This avoids building two parallel ingestion stacks inside Didactopus:

- one for packs and educational structures
- another for grounded knowledge capture

Instead, the system gets one generic ingestion substrate with multiple source adapters and multiple downstream promotion/export paths.
@@ -0,0 +1,496 @@

# GroundRecall `llmwiki` Import Specification

This document defines the first-pass import path for users who already have some form of `llmwiki`-style repository and want to migrate it into the broader GroundRecall substrate while staying compatible with Didactopus review and promotion flows.

## Goal

The import path should let an existing `llmwiki` corpus become:

- searchable without immediate manual cleanup
- reviewable rather than blindly trusted
- grounded in explicit provenance
- promotable into durable structured knowledge objects
- exportable back into compiled wiki pages, assistant adapter bundles, and queryable graph artifacts

The key rule is:

Imported wiki pages are **derived artifacts**, not automatic source truth.

## Import philosophy

Users coming from `llmwiki` often have a mixture of:

- raw notes
- compiled markdown pages
- local source files
- generated summaries
- ad hoc link graphs
- session transcripts
- speculative or weakly supported synthesis

GroundRecall should preserve that work without pretending all of it is already promoted knowledge.

The import pipeline therefore has two responsibilities:

1. Preserve the original material with minimal loss.
2. Reify explicit structured objects that can later be reviewed and promoted.

## Scope of the first implementation

The first implementation should support common `llmwiki` layouts such as:

- `raw/`
- `wiki/`
- `schema.*`
- `logs/`
- `sources/`
- top-level markdown pages

The importer should not require a canonical upstream schema. It should operate from directory conventions plus simple heuristics.

## Import modes

### 1. `archive`

Purpose:
- preserve an existing `llmwiki` tree as read-only imported artifacts
- index it for search and later review

Behavior:
- no claim promotion
- minimal extraction
- all compiled pages remain `draft`

Use when:
- the user wants backward compatibility first
- the corpus quality is unknown

### 2. `quick`

Purpose:
- bootstrap usable structured objects fast

Behavior:
- import pages and raw sources
- extract candidate claims and concepts heuristically
- attach lightweight provenance
- queue uncertain items for review

Use when:
- the user wants early utility and accepts heuristic noise

### 3. `grounded`

Purpose:
- perform a migration suitable for long-lived shared knowledge

Behavior:
- require provenance for promoted claims
- mark unsupported statements explicitly
- produce review records and lint findings
- populate promotion queues rather than auto-promoting

Use when:
- the imported corpus will be shared across machines or agents

## Pipeline stages

### 1. Capture

The importer records the source repository as an import artifact.

Required metadata:

- `import_id`
- `import_mode`
- `source_root`
- `imported_at`
- `machine_id`
- `agent_id`
- `source_repo_kind=llmwiki`

Outputs:

- import manifest
- artifact records for all discovered files

### 2. Segment

Imported content is split into stable units.

Primary segment types:

- `source_document`
- `source_fragment`
- `compiled_page`
- `section_summary`
- `candidate_claim`
- `candidate_concept`
- `candidate_relation`
- `session_observation`

Segmentation should preserve (a heading-based sketch follows this list):

- original path
- section heading
- line or byte offsets when possible
- page title
- frontmatter fields
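As a concrete reference for "segment markdown by headings" (Milestone 1 below), a minimal sketch of heading-based segmentation that preserves path, heading, and line offsets; this is illustrative, not the shipped segmenter:

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Segment:
    origin_path: str
    heading: str | None
    line_start: int  # 1-indexed, inclusive
    line_end: int
    text: str


def segment_markdown(path: Path) -> list[Segment]:
    """Split a markdown page into heading-delimited segments. Heuristic sketch only."""
    lines = path.read_text(encoding="utf-8").splitlines()
    segments: list[Segment] = []
    heading: str | None = None
    start = 1
    buf: list[str] = []
    for i, line in enumerate(lines, start=1):
        if line.startswith("#"):
            if buf:
                segments.append(Segment(str(path), heading, start, i - 1, "\n".join(buf)))
            heading = line.lstrip("#").strip()
            start = i
            buf = [line]
        else:
            buf.append(line)
    if buf:
        segments.append(Segment(str(path), heading, start, len(lines), "\n".join(buf)))
    return segments
```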
### 3. Classify

Each segment gets a semantic role.

Recommended roles:

- `source`
- `derivation`
- `claim`
- `summary`
- `question`
- `todo`
- `speculation`
- `obsolete`
- `transcript`

This prevents unsupported prose from being confused with grounded knowledge.

### 4. Ground

Each imported segment gets provenance and support metadata.

Required grounding fields:

- `origin_artifact_id`
- `origin_path`
- `origin_section`
- `source_url` when known
- `retrieval_date` when known
- `machine_id`
- `session_id` when known
- `support_kind`
- `grounding_status`

Suggested values:

- `support_kind`: `direct_source`, `derived_from_page`, `derived_from_session`, `inferred`, `unknown`
- `grounding_status`: `grounded`, `partially_grounded`, `ungrounded`

### 5. Normalize

The importer emits explicit GroundRecall objects.

Minimum object set:

- `Source`
- `Fragment`
- `Artifact`
- `Observation`
- `Claim`
- `Concept`
- `Relation`

### 6. Lint

The importer produces machine-readable findings before promotion.

Required lint checks (a minimal check sketch follows the list):

- claim has no supporting fragment
- multiple claims appear text-identical
- concept is orphaned
- relation points to missing concept
- page summary has no cited support
- imported item marked `obsolete` still linked as current
- same claim imported with conflicting confidence or polarity
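A sketch of the first required check, operating over claim dicts shaped like the `ImportedClaim` contract below; the finding shape is illustrative:

```python
def lint_unsupported_claims(claims: list[dict]) -> list[dict]:
    # Flag claims with no supporting fragments as machine-readable findings.
    findings = []
    for claim in claims:
        if not claim.get("supporting_fragment_ids"):
            findings.append({
                "check": "claim_has_no_supporting_fragment",
                "claim_id": claim["claim_id"],
                "severity": "warning",
                "suggestion": "mark as ungrounded or attach a supporting fragment",
            })
    return findings
```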
### 7. Promote

Imported objects enter existing Didactopus review/promotion lanes rather than becoming trusted immediately.

Recommended states:

- `draft`
- `triaged`
- `reviewed`
- `promoted`
- `superseded`
- `archived`

### 8. Export

Promoted objects can then be rendered back out as:

- compiled wiki pages
- graph snapshots
- assistant adapter bundles
- review reports
- query bundles for assistant-facing use

## Object contracts

### `ImportedArtifact`

```json
{
  "artifact_id": "ia_001",
  "import_id": "imp_2026_04_16_a",
  "artifact_kind": "compiled_page",
  "path": "wiki/channel-capacity.md",
  "title": "Channel Capacity",
  "sha256": "abc123",
  "created_at": "2026-04-16T14:00:00Z",
  "metadata": {
    "frontmatter": {},
    "headings": ["Definition", "Examples"]
  },
  "current_status": "draft"
}
```

### `ImportedObservation`

```json
{
  "observation_id": "obs_001",
  "import_id": "imp_2026_04_16_a",
  "artifact_id": "ia_001",
  "role": "summary",
  "text": "Capacity bounds reliable communication over a noisy channel.",
  "origin_path": "wiki/channel-capacity.md",
  "origin_section": "Definition",
  "line_start": 12,
  "line_end": 14,
  "grounding_status": "partially_grounded",
  "support_kind": "derived_from_page",
  "confidence_hint": 0.63,
  "current_status": "draft"
}
```

### `ImportedClaim`

```json
{
  "claim_id": "clm_001",
  "import_id": "imp_2026_04_16_a",
  "claim_text": "Channel capacity is the maximum reliable communication rate for a channel model.",
  "claim_kind": "definition",
  "source_observation_ids": ["obs_001"],
  "supporting_fragment_ids": ["frag_014"],
  "concept_ids": ["concept::channel-capacity"],
  "confidence_hint": 0.74,
  "grounding_status": "grounded",
  "current_status": "triaged"
}
```

### `ImportedConcept`

```json
{
  "concept_id": "concept::channel-capacity",
  "import_id": "imp_2026_04_16_a",
  "title": "Channel Capacity",
  "aliases": [],
  "description": "Imported concept from llmwiki corpus.",
  "source_artifact_ids": ["ia_001"],
  "current_status": "triaged"
}
```

### `ImportedRelation`

```json
{
  "relation_id": "rel_001",
  "import_id": "imp_2026_04_16_a",
  "source_id": "concept::shannon-entropy",
  "target_id": "concept::channel-capacity",
  "relation_type": "supports_understanding_of",
  "evidence_ids": ["obs_015"],
  "current_status": "draft"
}
```

## Mapping from `llmwiki` into GroundRecall

Recommended first-pass mapping:

- `raw/*` -> `Source` or `Artifact(kind=raw_note)`
- `wiki/*.md` -> `Artifact(kind=compiled_page)`
- frontmatter -> artifact metadata
- headings -> section boundaries
- linked page names -> candidate `Concept` and `Relation`
- bullet or sentence extraction -> candidate `Observation` and `Claim`
- chat or session logs -> `Observation(kind=session_note)`
- schema files -> import metadata only unless a future adapter exists

## Confidence and trust policy

Imported confidence must remain clearly separate from reviewed confidence.

Recommended fields:

- `confidence_hint`
- `review_confidence`
- `grounding_status`
- `review_verdict`

Policy (a promotion-gate sketch follows the list):

- `confidence_hint` comes from heuristic import scoring
- `review_confidence` exists only after review
- promotion requires at least `partially_grounded`
- fully ungrounded claims can be stored, but only as `draft` or `archived`
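A minimal sketch of that gate over claim dicts; the function names are illustrative, not a committed API:

```python
ALLOWED_UNGROUNDED_STATES = {"draft", "archived"}


def may_promote(claim: dict) -> bool:
    # Promotion requires at least partial grounding.
    return claim.get("grounding_status") in {"grounded", "partially_grounded"}


def clamp_status(claim: dict) -> str:
    # Fully ungrounded claims may be stored, but only as draft or archived.
    status = claim.get("current_status", "draft")
    if claim.get("grounding_status") == "ungrounded" and status not in ALLOWED_UNGROUNDED_STATES:
        return "draft"
    return status
```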
## Provenance policy

The importer should follow the existing Didactopus provenance direction:

- preserve source identity
- preserve retrieval date when available
- preserve adaptation status
- keep both human-readable and machine-readable provenance

When only a compiled wiki page exists and the original source is missing:

- the compiled page becomes the immediate origin artifact
- all extracted claims must be marked `derived_from_page`
- such claims should not auto-promote in `grounded` mode

## Review and promotion integration

Imported `Claim` and `Concept` objects should feed into the same general review machinery already used for pack-oriented promotion:

- create candidate records
- attach lint findings
- route to a triage lane
- collect review verdicts
- emit promotion records

Suggested triage lanes:

- `knowledge_capture`
- `pack_improvement`
- `skill_export`
- `source_cleanup`
- `conflict_resolution`

## Module layout

First-pass module layout:

- `didactopus.groundrecall_import`
  Entry points and top-level orchestration.
- `didactopus.groundrecall_discovery`
  Finds `llmwiki`-style files and classifies paths.
- `didactopus.groundrecall_segmenter`
  Splits pages and logs into stable observations and candidate claims.
- `didactopus.groundrecall_normalizer`
  Emits normalized import objects.
- `didactopus.groundrecall_lint`
  Import-time lint checks.
- `didactopus.groundrecall_review_bridge`
  Converts imported objects into review candidates and promotion records.
- `didactopus.groundrecall_export`
  Renders promoted objects back to wiki, graph, and skill artifacts.

## CLI shape

Suggested CLI:

```bash
python -m didactopus.groundrecall.cli import /path/to/llmwiki --mode archive
python -m didactopus.groundrecall.cli import /path/to/llmwiki --mode quick
python -m didactopus.groundrecall.cli import /path/to/llmwiki --mode grounded
python -m didactopus.groundrecall.cli lint imports/<import-id>
python -m didactopus.groundrecall.cli promote imports/<import-id> /path/to/store
python -m didactopus.groundrecall.cli export /path/to/store exports/groundrecall --concept channel-capacity
```

Compatibility wrappers still exist during migration:

```bash
python -m didactopus.groundrecall_import /path/to/llmwiki --mode grounded
python -m didactopus.groundrecall_lint imports/<import-id>
python -m didactopus.groundrecall_export /path/to/store exports/groundrecall --concept channel-capacity
```

## Filesystem layout

Suggested repository-local layout:

- `imports/<import-id>/manifest.json`
- `imports/<import-id>/artifacts.jsonl`
- `imports/<import-id>/observations.jsonl`
- `imports/<import-id>/claims.jsonl`
- `imports/<import-id>/concepts.jsonl`
- `imports/<import-id>/relations.jsonl`
- `imports/<import-id>/lint_findings.json`
- `imports/<import-id>/review_queue.json`

This keeps imported state auditable and easy to sync across machines.

## Multi-machine sync implication

For distributed assistant use, imported state should be append-oriented and rebuildable.

Recommended sync primitives:

- import manifests
- normalized jsonl object streams
- review records
- promotion records

Non-authoritative derived artifacts:

- rendered wiki pages
- local indexes
- embeddings
- cache files

This allows multiple machines to contribute import events without making the compiled page tree the merge primitive.

## First implementation milestones

### Milestone 1

- discover `raw/` and `wiki/`
- import artifacts
- segment markdown by headings
- emit observations and candidate claims
- write import manifest and jsonl outputs

### Milestone 2

- add grounding metadata
- add lint checks
- add triage lanes and review queue output

### Milestone 3

- map promoted claims into assistant-neutral exports plus assistant adapter bundles
- render compiled wiki views from promoted objects
- support multi-machine import manifests and merge-safe event storage

## Non-goals for the first pass

- perfect semantic claim extraction
- automatic trust assignment
- full upstream `llmwiki` schema compatibility
- lossless import of every custom plugin or script
- embeddings-first retrieval

The first pass should be conservative, inspectable, and easy to improve.
@@ -0,0 +1,281 @@

# GroundRecall Migration Plan

This document turns the boundary decisions in [deployment-modes.md](deployment-modes.md) into an implementation plan.

The goal is not an immediate repo split. The goal is to let `GroundRecall` become independently deployable and operable without destabilizing ongoing `Didactopus` learner work.

## Current State

Today, GroundRecall exists as a set of modules under `src/didactopus/`:

- `groundrecall_import`
- `groundrecall_source_adapters/*`
- `groundrecall_lint`
- `groundrecall_review_queue`
- `groundrecall_review_bridge`
- `groundrecall_models`
- `groundrecall_store`
- `groundrecall_promotion`
- `groundrecall_query`
- `groundrecall_export`
- `groundrecall_assistant_export`
- `groundrecall_assistants/*`

This is acceptable as an implementation phase, but it creates two risks:

1. generic knowledge-substrate functionality may continue to accrete under `didactopus.main`
2. feature work may silently assume the presence of learner-facing Didactopus components

## Migration Goal

Target state:

- `Didactopus` remains the learner-facing application
- `GroundRecall` becomes the standalone grounded knowledge substrate
- `GenieHive` remains the model and routing control plane

The package, CLI, and deployment boundaries should eventually reflect that.

## Target Ownership

### GroundRecall should own

- source ingestion and normalization
- claim/concept/relation/artifact/provenance schemas
- canonical store and snapshots
- lint and review queue generation
- promotion and merge semantics
- assistant-neutral query and export
- assistant adapter export
- sync, merge, and team/shared knowledge operations

### Didactopus should own

- learner session flows
- mentor/practice/evaluator/project-advisor workflows
- pack and curriculum-specific review UX
- mastery-ledger and learner evidence experiences
- educational packaging over grounded knowledge

### Shared boundary helpers should stay narrow

- provider policy that depends on GenieHive route resolution but serves learner workflows
- review bridges where GroundRecall needs to feed an existing Didactopus review process during the transition

## Packaging Direction

### Phase 0: Present layout, stricter discipline

Keep the code in `src/didactopus/`, but use naming and imports that preserve the eventual split.

Rules:

- new generic knowledge features go into `groundrecall_*` modules
- new learner-facing features go into `didactopus` learner modules
- do not add generic knowledge operations to `didactopus.main`
- treat review bridges as bridges, not permanent core ownership

### Phase 1: Explicit namespace inside the repo

Preferred direction:

- move GroundRecall modules under `src/didactopus/groundrecall/`

Target structure:

- `src/didactopus/groundrecall/ingest.py`
- `src/didactopus/groundrecall/source_adapters/`
- `src/didactopus/groundrecall/models.py`
- `src/didactopus/groundrecall/store.py`
- `src/didactopus/groundrecall/promotion.py`
- `src/didactopus/groundrecall/query.py`
- `src/didactopus/groundrecall/export.py`
- `src/didactopus/groundrecall/assistants/`
- `src/didactopus/groundrecall/sync.py`
- `src/didactopus/groundrecall/merge.py`
- `src/didactopus/groundrecall/cli.py`

Benefits:

- cleaner conceptual grouping
- easier extraction later
- clearer import discipline

Compatibility path (a wrapper sketch follows the list):

- keep thin wrapper modules at old import paths during transition
- deprecate wrappers only after tests and docs have moved
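A minimal sketch of one such wrapper; the file name is illustrative, and the pattern is a plain re-export plus a deprecation warning:

```python
# src/didactopus/groundrecall_models.py -- transitional wrapper (sketch).
# Keeps the old flat import path alive while code moves to the new namespace.
import warnings

from didactopus.groundrecall.models import *  # noqa: F401,F403

warnings.warn(
    "didactopus.groundrecall_models is deprecated; "
    "import didactopus.groundrecall.models instead",
    DeprecationWarning,
    stacklevel=2,
)
```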
### Phase 2: Dual CLI identity

Before any repo split, expose GroundRecall as a first-class CLI namespace.

Desired commands:

- `python -m didactopus.groundrecall.cli import ...`
- `python -m didactopus.groundrecall.cli lint ...`
- `python -m didactopus.groundrecall.cli promote ...`
- `python -m didactopus.groundrecall.cli query ...`
- `python -m didactopus.groundrecall.cli export ...`
- `python -m didactopus.groundrecall.cli inspect ...`

At that point, `didactopus.main` should only surface:

- learner-facing commands
- review-workflow commands with educational intent
- possibly a pointer to GroundRecall commands, but not ownership of them

### Phase 3: Optional package extraction

Only after sync/merge and standalone use are mature:

- move GroundRecall to its own package or repo if that becomes operationally useful
- keep Didactopus consuming it as a dependency

This step is optional. A clean package boundary inside one repo may be sufficient for a long time.

## CLI Migration Plan

### Keep under `didactopus.main`

- `review`
- future learner-facing workbench commands

### Move toward GroundRecall CLI

- import
- lint
- review queue
- promotion
- canonical query
- canonical export
- assistant export
- sync and merge

### Transitional exception

`provider-inspect` can remain on the Didactopus umbrella CLI for now because:

- it is already useful operationally
- it supports learner-node deployments
- it is not a GroundRecall-specific operation

Longer term, it may also belong on a separate operator surface, depending on whether Didactopus becomes the standard local application shell.

## Module Mapping

### Move first

Current -> target:

- `didactopus.groundrecall_import` -> `didactopus.groundrecall.ingest`
- `didactopus.groundrecall_source_adapters.*` -> `didactopus.groundrecall.source_adapters.*`
- `didactopus.groundrecall_models` -> `didactopus.groundrecall.models`
- `didactopus.groundrecall_store` -> `didactopus.groundrecall.store`
- `didactopus.groundrecall_promotion` -> `didactopus.groundrecall.promotion`
- `didactopus.groundrecall_query` -> `didactopus.groundrecall.query`
- `didactopus.groundrecall_export` -> `didactopus.groundrecall.export`
- `didactopus.groundrecall_assistants.*` -> `didactopus.groundrecall.assistants.*`

### Keep as transitional bridges

- `didactopus.groundrecall_review_bridge`
- source adapters that ingest Didactopus-native artifacts

These are legitimate, but they should be documented as cross-boundary adapters rather than as proof of intrinsic ownership.

### Stay in Didactopus

- `learner_session`
- `learner_session_demo`
- `mentor`
- `practice`
- `project_advisor`
- educational review UX modules
- pack and graph-planning modules

## Service Boundary Direction

### GroundRecall service candidates

Once needed, a GroundRecall service should focus on:

- canonical knowledge query
- import status and queue inspection
- promotion status
- sync/merge status
- assistant-neutral bundle retrieval

### Didactopus service candidates

- learner session orchestration
- learner progress and evaluation
- pack/workbench interactions

### GenieHive service candidates

- model and service inspection
- route resolution
- cluster health

## Milestones

### Milestone 1: Namespace discipline

Done when:

- new generic knowledge work lands only in GroundRecall-oriented modules
- `didactopus.main` stops growing generic knowledge commands
- docs consistently describe GroundRecall as a substrate, not a learner feature

### Milestone 2: Internal package reorganization

Done when:

- GroundRecall modules live under an explicit package path
- old flat import paths are wrappers only
- tests target the new package paths

### Milestone 3: First-class GroundRecall CLI

Done when:

- import/lint/promote/query/export/inspect are available under one GroundRecall CLI surface
- operator docs no longer require `Didactopus` framing for generic knowledge tasks

### Milestone 4: Sync and merge maturity

Done when:

- append-only event ingestion exists
- promoted-state merge semantics exist
- team/shared knowledge workflows are practical without learner workflows

### Milestone 5: Extraction decision

Done when:

- the project can make an informed choice between:
  - one repo, multiple packages
  - separate GroundRecall package/repo

## Immediate Next Work

Recommended next implementation steps:

1. Introduce `didactopus.groundrecall` as an internal package namespace.
2. Add a single GroundRecall umbrella CLI module.
3. Keep thin wrapper modules for compatibility.
4. Start moving docs and tests to the new namespace.
5. Begin implementing sync/merge primitives under GroundRecall rather than under Didactopus learner flows.

## Decision Rule For New Work

Before adding a new command, module, or service, ask:

1. Would this still be needed if there were no learner session?
2. Would a team using only shared knowledge still need it?
3. Is the canonical artifact knowledge state or educational interaction?
4. Would it still matter if the Didactopus UI vanished?

If yes, default toward GroundRecall.
@@ -0,0 +1,286 @@

# GroundRecall Repo Bootstrap Checklist

This document turns the broader [groundrecall-migration-plan.md](groundrecall-migration-plan.md) into a practical checklist for creating a standalone `GroundRecall` repository.

The goal here is narrower than full feature completion. The goal is to get to a standalone repository that can be installed, run locally, and used for real `llmwiki++`-style work without requiring `Didactopus` as the primary shell.

## Bootstrap Goal

Minimum viable standalone `GroundRecall` repo:

- installable as its own Python package
- exposes a first-class `groundrecall` CLI
- imports and normalizes knowledge sources
- promotes reviewed knowledge into a canonical store
- supports query and export over promoted state
- supports assistant-neutral exports plus adapter exports
- remains consumable by `Didactopus` as a dependency or sibling package

This is enough for a local standalone alpha. It is not yet the full distributed team and corpus-scale vision.

## What Already Exists

The current `Didactopus` codebase already contains most of the implementation spine:

- `didactopus.groundrecall.ingest`
- `didactopus.groundrecall.source_adapters.*`
- `didactopus.groundrecall.models`
- `didactopus.groundrecall.store`
- `didactopus.groundrecall.promotion`
- `didactopus.groundrecall.query`
- `didactopus.groundrecall.export`
- `didactopus.groundrecall.assistant_export`
- `didactopus.groundrecall.assistants.*`
- `didactopus.groundrecall.inspect`
- `didactopus.groundrecall.cli`

This means the repo bootstrap is primarily a packaging and boundary exercise, not a greenfield implementation.

## Target Repo Shape

Suggested standalone layout:

```text
groundrecall/
  pyproject.toml
  README.md
  LICENSE
  src/
    groundrecall/
      __init__.py
      cli.py
      ingest.py
      inspect.py
      lint.py
      models.py
      store.py
      promotion.py
      query.py
      export.py
      assistant_export.py
      review_queue.py
      review_bridge.py
      source_adapters/
      assistants/
  tests/
  docs/
    quickstart.md
    llmwiki-import.md
    deployment-modes.md
    assistant-architecture.md
    sync-roadmap.md
```

Notes:

- `review_bridge.py` may remain optional if the standalone repo only needs generic review artifacts.
- `review_queue.py` belongs in `GroundRecall`; it is not a learner-only concern.
- `review_bridge.py` is the most likely file to stay transitional if it depends too directly on Didactopus review objects.

## Move / Keep / Bridge

### Move into standalone `GroundRecall`

Move first:

- `didactopus.groundrecall.ingest`
- `didactopus.groundrecall.inspect`
- `didactopus.groundrecall.lint`
- `didactopus.groundrecall.models`
- `didactopus.groundrecall.store`
- `didactopus.groundrecall.promotion`
- `didactopus.groundrecall.query`
- `didactopus.groundrecall.export`
- `didactopus.groundrecall.assistant_export`
- `didactopus.groundrecall.review_queue`
- `didactopus.groundrecall.source_adapters.*`
- `didactopus.groundrecall.assistants.*`
- `didactopus.groundrecall.cli`

### Keep in `Didactopus`

These should not move:

- learner session and mentor/practice flows
- educational pack authoring and pack-specific UX
- mastery/evidence learner experiences
- provider demos that exist to support Didactopus learner workflows

### Keep as temporary bridges

These may need a staged treatment:

- `groundrecall_review_bridge`
- the `didactopus_pack` source adapter

Those are useful during transition, but they are cross-boundary integrations, not proof that `GroundRecall` must remain inside `Didactopus`.

## Bootstrap Checklist

### 1. Create the new repo skeleton

Required:

- create a new repo root
- add `pyproject.toml`
- add `src/groundrecall/`
- add `tests/`
- add `docs/`
- add a minimal `README.md`
- add `LICENSE`

Definition of done:

- `pip install -e .` works
- `python -m groundrecall.cli --help` works

### 2. Move the package code

Required:

- copy the current `didactopus.groundrecall.*` package into `src/groundrecall/`
- update relative imports as needed
- remove `didactopus`-prefixed assumptions in docstrings and parser help text

Definition of done:

- module imports succeed under `groundrecall.*`
- no package file requires `didactopus` imports except explicit transition bridges

### 3. Extract the tests

Required:

- move GroundRecall-focused tests into the new repo
- keep Didactopus integration tests in Didactopus
- add an end-to-end CLI smoke test (sketched below) that runs:
  - `import`
  - `promote`
  - `query`
  - `export`
  - `inspect`

Definition of done:

- the new repo has its own passing test suite
- Didactopus retains only integration tests that prove interoperability
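A minimal pytest sketch of that smoke test; the `llmwiki_fixture` fixture and the assumption that imports land under a cwd-relative `imports/` directory are both illustrative, and the argument shapes follow the documented CLI:

```python
import subprocess
import sys
from pathlib import Path


def run_cli(*args: str) -> None:
    # Invoke the CLI the way a user would; check=True fails the test on nonzero exit.
    subprocess.run([sys.executable, "-m", "groundrecall.cli", *args], check=True)


def test_end_to_end_smoke(tmp_path: Path, llmwiki_fixture: Path) -> None:
    # llmwiki_fixture is an assumed fixture providing a tiny llmwiki-style tree.
    store = tmp_path / "store"
    exports = tmp_path / "exports"

    run_cli("import", str(llmwiki_fixture), "--mode", "quick")
    import_dir = next(Path("imports").iterdir())  # assumes one fresh import

    run_cli("lint", str(import_dir))
    run_cli("promote", str(import_dir), str(store))
    run_cli("query", str(store), "channel-capacity")
    run_cli("export", str(store), str(exports), "--concept", "channel-capacity")
    run_cli("inspect", str(store))
```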
### 4. Harden the standalone CLI

Required commands:

- `groundrecall import`
- `groundrecall lint`
- `groundrecall promote`
- `groundrecall query`
- `groundrecall export`
- `groundrecall inspect`

Recommended additions:

- `groundrecall assistant-export`
- `groundrecall review-queue`

Definition of done:

- the CLI help text is standalone and does not refer users back to `Didactopus`

### 5. Publish a repo-local data layout

Pick and document a stable layout such as:

```text
.groundrecall/
  imports/
  store/
  exports/
  events/
```

Required:

- make these paths configurable
- define sane defaults
- remove assumptions that the caller already knows the Didactopus workspace layout

Definition of done:

- a new user can run GroundRecall in an empty directory and get predictable local state

### 6. Document the standalone workflows

At minimum:

- quickstart
- migrating from `llmwiki`
- query and export patterns
- assistant adapter exports
- relationship to `Didactopus`
- relationship to `GenieHive`

Definition of done:

- the README can orient a new user without requiring Didactopus-specific context

### 7. Leave compatibility shims in `Didactopus`

Required:

- keep thin wrappers at `didactopus.groundrecall_*` or `didactopus.groundrecall.*` integration paths as needed
- make `Didactopus` import the extracted package where possible
- clearly mark the wrappers as compatibility paths

Definition of done:

- existing Didactopus workflows do not break during the split

## Alpha Completion Criteria

The standalone repo is alpha-ready when:

- `llmwiki` import works
- `markdown_notes` import works
- at least one Didactopus-native adapter still works as an integration adapter
- canonical store creation and snapshot export work
- query works over promoted objects
- assistant-neutral export works
- at least two assistant adapters export usable bundles

This is the right threshold for a "functional GroundRecall repo."

## Still Missing After Alpha

A standalone alpha is not yet the full target system. These remain post-bootstrap priorities:

- re-import and update semantics
- append-only event logs for multi-node merge
- shared/private scope support
- merge and sync conflict handling
- stronger claim extraction
- richer claim-level review and adjudication
- corpus-scale distributed coordination

Those features should be built in `GroundRecall`, but they do not need to block repo extraction.

## Recommended Execution Order

Use this order:

1. create the repo and package skeleton
2. copy the current `groundrecall` package and make imports pass
3. move tests and get the standalone suite green
4. finalize the CLI and README
5. switch Didactopus integration points to consume the extracted package
6. only then continue with sync/merge and corpus-scale features

This keeps the boundary clean without stalling feature progress.

## First PR-Sized Steps

If this were executed as concrete work, the first three small changes should be:

1. create the new repo with the package skeleton and copy `src/didactopus/groundrecall/`
2. move the existing namespace-focused tests and make them pass under `groundrecall.*`
3. add a standalone README quickstart and one end-to-end CLI smoke test

After that, the repo is real enough to iterate in place rather than continuing to plan around it.
@@ -0,0 +1,85 @@

# llmwiki Import

`GroundRecall` treats `llmwiki` as one important source shape, not as the defining architecture.

An imported `llmwiki` tree is treated as:

- raw source material
- prior synthesized artifacts
- candidate claims and concepts
- provenance that needs to be normalized and reviewed

Compiled wiki pages are useful artifacts, but they are not automatically promoted as canonical truth.

## Import Modes

### `archive`

- preserve source material with minimal interpretation
- index and normalize without assuming promotion readiness
- useful for long-tail historical corpora

### `quick`

- fast bootstrap mode
- extracts candidate concepts, claims, and relations heuristically
- useful when getting an old corpus into GroundRecall quickly matters more than perfect grounding

### `grounded`

- stricter mode
- expects better provenance and cleaner support signals
- better fit for shared or promoted knowledge

## Import Flow

The normalized import flow is:

1. capture source files
2. discover and classify artifacts
3. segment content into observations
4. normalize claims, concepts, and relations
5. lint the import
6. emit a review queue and review bundle
7. promote reviewed artifacts into the canonical store

## Commands

```bash
groundrecall import /path/to/llmwiki --mode archive
groundrecall import /path/to/llmwiki --mode quick
groundrecall import /path/to/llmwiki --mode grounded

groundrecall lint imports/<import-id>
groundrecall promote imports/<import-id> store/
groundrecall export store/ exports/groundrecall --concept channel-capacity
```

## Current Heuristics

Today's importer already supports:

- `raw/` and `wiki/` discovery
- markdown and log segmentation
- claim extraction with inline contradiction and supersession markers
- review queue generation
- review bundle export

Areas still planned:

- stronger re-import/update semantics
- more robust transcript and semi-structured document handling
- stronger large-corpus extraction and consolidation

## Recommended Promotion Rule

Treat imported wiki pages as derived artifacts.

That means:

- preserve them
- mine them for claims and concepts
- review what matters
- promote canonical claims and concepts into the store

This is the main difference between `GroundRecall` and a plain markdown wiki.
@@ -0,0 +1,97 @@

# Quickstart

`GroundRecall` is a local-first grounded knowledge substrate for `llmwiki++`-style workflows.

This quickstart assumes a fresh checkout of the standalone repository.

## Install

```bash
pip install -e .
groundrecall --help
```

You can also use the module entry point:

```bash
PYTHONPATH=src python -m groundrecall --help
```

## Import A Knowledge Source

Fast import from an `llmwiki`-style tree:

```bash
groundrecall import /path/to/llmwiki --mode quick
```

More conservative import with stronger grounding expectations:

```bash
groundrecall import /path/to/llmwiki --mode grounded
```

The importer writes normalized artifacts under `imports/<import-id>/`.

## Review And Promote

Inspect the import outputs:

```bash
groundrecall lint imports/<import-id>
```

Promote the imported review artifacts into a canonical store:

```bash
groundrecall promote imports/<import-id> store/
```

## Query The Canonical Store

Query a concept:

```bash
groundrecall query store/ channel-capacity
```

Inspect the overall store:

```bash
groundrecall inspect store/
```

## Export

Export assistant-neutral artifacts:

```bash
groundrecall export store/ exports/groundrecall --concept channel-capacity
```

Export assistant-targeted bundles:

```bash
groundrecall assistant-export store/ codex exports/codex --concept channel-capacity
groundrecall assistant-export store/ claude_code exports/claude --concept channel-capacity
```

## Default Working Layout

A simple local layout is:

```text
.groundrecall/
  imports/
  store/
  exports/
  events/
```

The current alpha does not require this exact layout, but it is a sensible starting point.

## Next Reading

- [architecture.md](architecture.md)
- [llmwiki-import.md](llmwiki-import.md)
- [sync-roadmap.md](sync-roadmap.md)
@@ -0,0 +1,73 @@

# Sync Roadmap

The current standalone alpha is local-first. Sync and merge are planned next-stage features.

## Goal

Support these use cases cleanly:

- one user across multiple machines
- teams with shared and individual knowledge
- parallel corpus transformation and consolidation

## Planned Model

The intended model is (an event-record sketch follows the list):

- append-only event capture at the edge
- canonical promoted store as the durable reviewed state
- generated exports and assistant bundles as derived artifacts

This avoids treating compiled wiki pages or generated bundles as merge primitives.
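For orientation, one shape an appended event record could take; every field name here is illustrative, reusing capture metadata from the legacy import spec rather than describing an implemented format:

```json
{
  "event_id": "evt_000042",
  "event_kind": "import_completed",
  "machine_id": "laptop-a",
  "occurred_at": "2026-04-16T14:00:00Z",
  "payload": {
    "import_id": "imp_2026_04_16_a",
    "source_repo_kind": "llmwiki"
  }
}
```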
## Likely Local Layout

```text
.groundrecall/
  events/
  imports/
  store/
  exports/
```

## Planned Phases

### Phase 1: Re-import And Update Semantics

- import the same source tree repeatedly without duplicating everything
- support import lineage and supersession
- track object continuity across imports

### Phase 2: Event Log Capture

- record machine-local observations and import events
- distinguish machine-local state from promoted shared state
- preserve provenance and timestamps explicitly

### Phase 3: Merge And Consolidation

- merge append-only events from multiple machines
- consolidate draft claims and review candidates
- preserve contradiction and supersession history

### Phase 4: Shared And Private Scopes

- private notes and private candidate knowledge
- shared promoted knowledge
- controlled promotion from private to shared

### Phase 5: Team And Corpus Workflows

- parallel ingestion over large corpora
- coordinated claim review and adjudication
- export of consolidated assistant-neutral snapshots

## Non-Goals For The Current Alpha

The current repo does not yet provide:

- real-time networked sync
- conflict-free replicated data types
- hosted review services

The next useful milestone is a practical local event-log and re-import model, not a full distributed platform in one step.
@@ -0,0 +1,31 @@

[build-system]
requires = ["setuptools>=68", "wheel"]
build-backend = "setuptools.build_meta"

[project]
name = "groundrecall"
version = "0.1.0a0"
description = "Grounded knowledge substrate for llmwiki++ style workflows."
readme = "README.md"
requires-python = ">=3.10"
license = { text = "MIT" }
authors = [
  { name = "GroundRecall contributors" }
]
dependencies = [
  "pydantic>=2,<3",
  "PyYAML>=6,<7",
]

[project.scripts]
groundrecall = "groundrecall.cli:main"

[tool.setuptools]
package-dir = { "" = "src" }

[tool.setuptools.packages.find]
where = ["src"]

[tool.pytest.ini_options]
pythonpath = ["src"]
testpaths = ["tests"]
@ -0,0 +1,34 @@
|
|||
from __future__ import annotations
|
||||
|
||||
from .inspect import inspect_store, summarize_store
|
||||
from .ingest import ImportResult, build_parser as build_import_parser, main as import_main, run_groundrecall_import
|
||||
from .models import * # noqa: F403
|
||||
from .promotion import build_parser as build_promotion_parser, main as promotion_main, promote_import_to_store
|
||||
from .query import (
|
||||
build_parser as build_query_parser,
|
||||
build_query_bundle_for_concept,
|
||||
main as query_main,
|
||||
query_concept,
|
||||
query_provenance,
|
||||
search_claims,
|
||||
)
|
||||
from .store import GroundRecallStore
|
||||
|
||||
__all__ = [
|
||||
"GroundRecallStore",
|
||||
"ImportResult",
|
||||
"run_groundrecall_import",
|
||||
"build_import_parser",
|
||||
"import_main",
|
||||
"promote_import_to_store",
|
||||
"build_promotion_parser",
|
||||
"promotion_main",
|
||||
"query_concept",
|
||||
"query_provenance",
|
||||
"search_claims",
|
||||
"build_query_bundle_for_concept",
|
||||
"build_query_parser",
|
||||
"query_main",
|
||||
"summarize_store",
|
||||
"inspect_store",
|
||||
]
|
||||
|
|
@ -0,0 +1,5 @@
|
|||
from .cli import main
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
|
|
@ -0,0 +1,81 @@
|
|||
from typing import Any
|
||||
from pydantic import BaseModel, Field
|
||||
|
||||
|
||||
class DependencySpec(BaseModel):
|
||||
name: str
|
||||
min_version: str = "0.0.0"
|
||||
max_version: str = "9999.9999.9999"
|
||||
|
||||
|
||||
class MasteryProfileSpec(BaseModel):
|
||||
template: str | None = None
|
||||
required_dimensions: list[str] = Field(default_factory=list)
|
||||
dimension_threshold_overrides: dict[str, float] = Field(default_factory=dict)
|
||||
|
||||
|
||||
class CrossPackLinkSpec(BaseModel):
|
||||
source_concept: str
|
||||
target_concept: str
|
||||
relation: str
|
||||
|
||||
|
||||
class ProfileTemplateSpec(BaseModel):
|
||||
required_dimensions: list[str] = Field(default_factory=list)
|
||||
dimension_threshold_overrides: dict[str, float] = Field(default_factory=dict)
|
||||
|
||||
|
||||
class PackManifest(BaseModel):
|
||||
name: str
|
||||
display_name: str
|
||||
version: str
|
||||
schema_version: str
|
||||
didactopus_min_version: str
|
||||
didactopus_max_version: str
|
||||
description: str = ""
|
||||
author: str = ""
|
||||
license: str = "unspecified"
|
||||
dependencies: list[DependencySpec] = Field(default_factory=list)
|
||||
overrides: list[str] = Field(default_factory=list)
|
||||
profile_templates: dict[str, ProfileTemplateSpec] = Field(default_factory=dict)
|
||||
cross_pack_links: list[CrossPackLinkSpec] = Field(default_factory=list)
|
||||
|
||||
|
||||
class ConceptEntry(BaseModel):
|
||||
id: str
|
||||
title: str
|
||||
description: str = ""
|
||||
prerequisites: list[str] = Field(default_factory=list)
|
||||
mastery_signals: list[str] = Field(default_factory=list)
|
||||
mastery_profile: MasteryProfileSpec = Field(default_factory=MasteryProfileSpec)
|
||||
|
||||
|
||||
class ConceptsFile(BaseModel):
|
||||
concepts: list[ConceptEntry]
|
||||
|
||||
|
||||
class RoadmapStageEntry(BaseModel):
|
||||
id: str
|
||||
title: str
|
||||
concepts: list[str] = Field(default_factory=list)
|
||||
checkpoint: list[str] = Field(default_factory=list)
|
||||
|
||||
|
||||
class RoadmapFile(BaseModel):
|
||||
stages: list[RoadmapStageEntry]
|
||||
|
||||
|
||||
class ProjectEntry(BaseModel):
|
||||
id: str
|
||||
title: str
|
||||
difficulty: str = ""
|
||||
prerequisites: list[str] = Field(default_factory=list)
|
||||
deliverables: list[str] = Field(default_factory=list)
|
||||
|
||||
|
||||
class ProjectsFile(BaseModel):
|
||||
projects: list[ProjectEntry]
|
||||
|
||||
|
||||
class RubricsFile(BaseModel):
|
||||
rubrics: list[dict[str, Any]]
|
||||
|
|
@ -0,0 +1,59 @@
|
|||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
from pathlib import Path
|
||||
from typing import Any
|
||||
|
||||
from .assistants.base import get_assistant_adapter
|
||||
from .query import build_query_bundle_for_concept
|
||||
from .store import GroundRecallStore
|
||||
|
||||
|
||||
def export_assistant_bundle(
|
||||
store_dir: str | Path,
|
||||
assistant: str,
|
||||
out_dir: str | Path,
|
||||
concept_refs: list[str] | None = None,
|
||||
) -> dict[str, Any]:
|
||||
store = GroundRecallStore(store_dir)
|
||||
snapshot = store.build_snapshot(
|
||||
snapshot_id="assistant-export",
|
||||
created_at="",
|
||||
metadata={"export_kind": "assistant_adapter", "assistant": assistant},
|
||||
).model_dump()
|
||||
query_bundles = []
|
||||
for concept_ref in concept_refs or []:
|
||||
payload = build_query_bundle_for_concept(store_dir, concept_ref)
|
||||
if payload is not None:
|
||||
query_bundles.append(payload)
|
||||
adapter = get_assistant_adapter(assistant)
|
||||
paths = adapter.export_bundle(snapshot, query_bundles, out_dir)
|
||||
manifest = {
|
||||
"assistant": assistant,
|
||||
"output_paths": [str(path) for path in paths],
|
||||
"query_bundle_count": len(query_bundles),
|
||||
}
|
||||
Path(out_dir).mkdir(parents=True, exist_ok=True)
|
||||
(Path(out_dir) / "assistant_export_manifest.json").write_text(json.dumps(manifest, indent=2), encoding="utf-8")
|
||||
return manifest
|
||||
|
||||
|
||||
def build_parser() -> argparse.ArgumentParser:
|
||||
parser = argparse.ArgumentParser(description="Export assistant-specific GroundRecall bundles from canonical store data.")
|
||||
parser.add_argument("store_dir")
|
||||
parser.add_argument("assistant")
|
||||
parser.add_argument("out_dir")
|
||||
parser.add_argument("--concept", action="append", default=[])
|
||||
return parser
|
||||
|
||||
|
||||
def main() -> None:
|
||||
args = build_parser().parse_args()
|
||||
payload = export_assistant_bundle(
|
||||
store_dir=args.store_dir,
|
||||
assistant=args.assistant,
|
||||
out_dir=args.out_dir,
|
||||
concept_refs=list(args.concept or []),
|
||||
)
|
||||
print(json.dumps(payload, indent=2))
|
||||
|
|
@ -0,0 +1,2 @@
|
|||
from __future__ import annotations
|
||||
|
||||
|
|
@ -0,0 +1,3 @@
|
|||
from __future__ import annotations
|
||||
|
||||
from ..groundrecall_assistants.base import * # noqa: F403
|
||||
|
|
@ -0,0 +1,3 @@
|
|||
from __future__ import annotations
|
||||
|
||||
from ..groundrecall_assistants.claude_code import * # noqa: F403
|
||||
|
|
@ -0,0 +1,3 @@
|
|||
from __future__ import annotations
|
||||
|
||||
from ..groundrecall_assistants.codex import * # noqa: F403
|
||||
|
|
@ -0,0 +1,239 @@
|
|||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
import re
|
||||
from pathlib import Path
|
||||
from typing import Any
|
||||
|
||||
|
||||
def _load_citegeist_symbols() -> dict[str, Any] | None:
|
||||
"""Import CiteGeist helpers from a known local checkout, or return None if unavailable."""
|
||||
import sys
|
||||
|
||||
citegeist_src = Path("/home/netuser/bin/CiteGeist/src")
|
||||
if citegeist_src.exists():
|
||||
sys.path.insert(0, str(citegeist_src))
|
||||
try:
|
||||
from citegeist.app_api import LiteratureExplorerApi # type: ignore
|
||||
from citegeist.bibtex import BibEntry, parse_bibtex, render_bibtex # type: ignore
|
||||
from citegeist.storage import BibliographyStore # type: ignore
|
||||
except Exception:
|
||||
return None
|
||||
return {
|
||||
"LiteratureExplorerApi": LiteratureExplorerApi,
|
||||
"BibEntry": BibEntry,
|
||||
"parse_bibtex": parse_bibtex,
|
||||
"render_bibtex": render_bibtex,
|
||||
"BibliographyStore": BibliographyStore,
|
||||
}
|
||||
|
||||
|
||||
def discover_bib_files(source_root: str | Path) -> list[Path]:
|
||||
root = Path(source_root)
|
||||
if not root.exists():
|
||||
return []
|
||||
candidates = [
|
||||
path
|
||||
for path in root.rglob("*.bib")
|
||||
if path.is_file() and not path.name.endswith("-bak.bib") and not path.name.startswith(".")
|
||||
]
|
||||
|
||||
def rank(path: Path) -> tuple[int, int, str]:
|
||||
rel = path.relative_to(root)
|
||||
name = path.name
|
||||
if rel == Path("refs.bib"):
|
||||
return (0, len(rel.parts), str(rel))
|
||||
if rel == Path("biblio.bib"):
|
||||
return (1, len(rel.parts), str(rel))
|
||||
if name == "refs.bib":
|
||||
return (2, len(rel.parts), str(rel))
|
||||
if name == "biblio.bib":
|
||||
return (3, len(rel.parts), str(rel))
|
||||
return (4, len(rel.parts), str(rel))
|
||||
|
||||
return sorted(candidates, key=rank)
|
||||
|
||||
|
||||
def load_bibliography_index(source_root: str | Path) -> dict[str, dict[str, Any]]:
|
||||
symbols = _load_citegeist_symbols()
|
||||
root = Path(source_root)
|
||||
index: dict[str, dict[str, Any]] = {}
|
||||
for bib_path in discover_bib_files(root):
|
||||
try:
|
||||
entries = _parse_bib_entries(bib_path.read_text(encoding="utf-8"), symbols=symbols)
|
||||
except Exception:
|
||||
continue
|
||||
for entry in entries:
|
||||
raw_bibtex = _render_entry_bibtex(entry, symbols=symbols)
|
||||
payload = {
|
||||
"citation_key": entry.citation_key,
|
||||
"entry_type": entry.entry_type,
|
||||
"fields": dict(entry.fields),
|
||||
"source_bib_path": str(bib_path.relative_to(root)),
|
||||
"raw_bibtex": raw_bibtex,
|
||||
"duplicate_source_bib_paths": [],
|
||||
}
|
||||
existing = index.get(entry.citation_key)
|
||||
if existing is None:
|
||||
index[entry.citation_key] = payload
|
||||
else:
|
||||
existing.setdefault("duplicate_source_bib_paths", []).append(str(bib_path.relative_to(root)))
|
||||
return index
|
||||
|
||||
|
||||
def materialize_citegeist_store(import_dir: str | Path, source_root: str | Path) -> dict[str, Any]:
|
||||
symbols = _load_citegeist_symbols()
|
||||
if symbols is None:
|
||||
return {"available": False}
|
||||
BibliographyStore = symbols["BibliographyStore"]
|
||||
LiteratureExplorerApi = symbols["LiteratureExplorerApi"]
|
||||
|
||||
import_root = Path(import_dir)
|
||||
db_path = import_root / "citegeist.sqlite3"
|
||||
if db_path.exists():
|
||||
db_path.unlink()
|
||||
store = BibliographyStore(db_path)
|
||||
ingested_files: list[str] = []
|
||||
for bib_path in discover_bib_files(source_root):
|
||||
try:
|
||||
text = bib_path.read_text(encoding="utf-8")
|
||||
entries = _parse_bib_entries(text, symbols=symbols)
|
||||
for entry in entries:
|
||||
store.upsert_entry(
|
||||
entry,
|
||||
raw_bibtex=_render_entry_bibtex(entry, symbols=symbols),
|
||||
source_type="bibtex",
|
||||
source_label=str(bib_path.relative_to(Path(source_root))),
|
||||
review_status="draft",
|
||||
)
|
||||
store.connection.commit()
|
||||
ingested_files.append(str(bib_path.relative_to(Path(source_root))))
|
||||
except Exception:
|
||||
continue
|
||||
api = LiteratureExplorerApi(store)
|
||||
return {
|
||||
"available": True,
|
||||
"db_path": str(db_path),
|
||||
"ingested_files": ingested_files,
|
||||
"api": api,
|
||||
"store": store,
|
||||
}
|
||||
|
||||
|
||||
def bibliography_summary_payload(source_root: str | Path) -> dict[str, Any]:
|
||||
index = load_bibliography_index(source_root)
|
||||
source_files = discover_bib_files(source_root)
|
||||
return {
|
||||
"enabled": bool(index),
|
||||
"entry_count": len(index),
|
||||
"source_files": [str(path.relative_to(Path(source_root))) for path in source_files],
|
||||
}
|
||||
|
||||
|
||||
def serialize_bib_entry(entry: dict[str, Any] | None) -> dict[str, Any] | None:
|
||||
if entry is None:
|
||||
return None
|
||||
return {
|
||||
"citation_key": entry.get("citation_key", ""),
|
||||
"entry_type": entry.get("entry_type", ""),
|
||||
"fields": dict(entry.get("fields", {})),
|
||||
"source_bib_path": entry.get("source_bib_path", ""),
|
||||
"raw_bibtex": entry.get("raw_bibtex", ""),
|
||||
"duplicate_source_bib_paths": list(entry.get("duplicate_source_bib_paths", [])),
|
||||
}
|
||||
|
||||
|
||||
def serialize_citegeist_entry_payload(payload: dict[str, Any] | None) -> dict[str, Any] | None:
|
||||
if payload is None:
|
||||
return None
|
||||
result = dict(payload)
|
||||
# Round-trip through json unless raw_bibtex is already a plain string payload.
|
||||
if "raw_bibtex" in result and isinstance(result["raw_bibtex"], str):
|
||||
return result
|
||||
return json.loads(json.dumps(result))
|
||||
|
||||
|
||||
def _parse_bib_entries(text: str, *, symbols: dict[str, Any] | None) -> list[Any]:
|
||||
if symbols is not None:
|
||||
try:
|
||||
return symbols["parse_bibtex"](text)
|
||||
except Exception:
|
||||
pass
|
||||
return _fallback_parse_bibtex(text, symbols=symbols)
|
||||
|
||||
|
||||
def _render_entry_bibtex(entry: Any, *, symbols: dict[str, Any] | None) -> str:
|
||||
if symbols is not None:
|
||||
try:
|
||||
return symbols["render_bibtex"]([entry])
|
||||
except Exception:
|
||||
pass
|
||||
fields = []
|
||||
for key, value in entry.fields.items():
|
||||
fields.append(f" {key} = {{{value}}}")
|
||||
body = ",\n".join(fields)
|
||||
return f"@{entry.entry_type}{{{entry.citation_key},\n{body}\n}}"
|
||||
|
||||
|
||||
def _fallback_parse_bibtex(text: str, *, symbols: dict[str, Any] | None) -> list[Any]:
|
||||
BibEntry = symbols["BibEntry"] if symbols is not None else None
|
||||
entries: list[Any] = []
|
||||
pattern = re.compile(r"@(?P<entry_type>[A-Za-z]+)\s*\{\s*(?P<citation_key>[^,\s]+)\s*,", re.MULTILINE)
|
||||
matches = list(pattern.finditer(text))
|
||||
for index, match in enumerate(matches):
|
||||
start = match.end()
|
||||
end = matches[index + 1].start() if index + 1 < len(matches) else len(text)
|
||||
body = text[start:end]
|
||||
fields = _fallback_parse_fields(body)
|
||||
if BibEntry is not None:
|
||||
entries.append(BibEntry(entry_type=match.group("entry_type").lower(), citation_key=match.group("citation_key").strip(), fields=fields))
|
||||
else:
|
||||
entries.append(type("BibEntryFallback", (), {"entry_type": match.group("entry_type").lower(), "citation_key": match.group("citation_key").strip(), "fields": fields})())
|
||||
return entries
|
||||
|
||||
|
||||
def _fallback_parse_fields(body: str) -> dict[str, str]:
|
||||
"""Hand-rolled scanner for BibTeX fields, used when CiteGeist's parser is unavailable."""
|
||||
fields: dict[str, str] = {}
|
||||
index = 0
|
||||
length = len(body)
|
||||
while index < length:
|
||||
while index < length and body[index] in " \t\r\n,":
|
||||
index += 1
|
||||
if index >= length or body[index] == "}":
|
||||
break
|
||||
key_start = index
|
||||
while index < length and re.match(r"[A-Za-z0-9_:-]", body[index]):
|
||||
index += 1
|
||||
key = body[key_start:index].strip().lower()
|
||||
while index < length and body[index] in " \t\r\n=":
|
||||
index += 1
|
||||
value = ""
|
||||
if index < length and body[index] == "{":
|
||||
depth = 1
|
||||
index += 1
|
||||
value_start = index
|
||||
while index < length and depth > 0:
|
||||
if body[index] == "{":
|
||||
depth += 1
|
||||
elif body[index] == "}":
|
||||
depth -= 1
|
||||
if depth == 0:
|
||||
break
|
||||
index += 1
|
||||
value = body[value_start:index].strip()
|
||||
index += 1
|
||||
elif index < length and body[index] == '"':
|
||||
index += 1
|
||||
value_start = index
|
||||
while index < length and body[index] != '"':
|
||||
if body[index] == "\\":
|
||||
index += 1
|
||||
index += 1
|
||||
value = body[value_start:index].strip()
|
||||
index += 1
|
||||
else:
|
||||
value_start = index
|
||||
while index < length and body[index] not in ",\n":
|
||||
index += 1
|
||||
value = body[value_start:index].strip()
|
||||
if key:
|
||||
fields[key] = value.rstrip(",")
|
||||
return fields
|
||||
|
|
@ -0,0 +1,40 @@
|
|||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import sys
|
||||
|
||||
from . import assistant_export, export, ingest, inspect, lint, promotion, query, review_server
|
||||
|
||||
|
||||
COMMANDS = {
|
||||
"import": ingest.main,
|
||||
"lint": lint.main,
|
||||
"promote": promotion.main,
|
||||
"query": query.main,
|
||||
"export": export.main,
|
||||
"assistant-export": assistant_export.main,
|
||||
"inspect": inspect.main,
|
||||
"review-server": review_server.main,
|
||||
}
|
||||
|
||||
|
||||
def build_parser() -> argparse.ArgumentParser:
|
||||
parser = argparse.ArgumentParser(description="GroundRecall command-line tools")
|
||||
parser.add_argument("command", nargs="?", choices=sorted(COMMANDS))
|
||||
return parser
|
||||
|
||||
|
||||
def main() -> None:
|
||||
argv = sys.argv[1:]
|
||||
parser = build_parser()
|
||||
args, remainder = parser.parse_known_args(argv)
|
||||
if not args.command:
|
||||
parser.print_help()
|
||||
return
|
||||
handler = COMMANDS[args.command]
|
||||
original_argv = sys.argv
|
||||
try:
|
||||
sys.argv = [f"groundrecall.cli {args.command}", *remainder]
|
||||
handler()
|
||||
finally:
|
||||
sys.argv = original_argv
|
||||
|
|
@ -0,0 +1,136 @@
|
|||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
from datetime import datetime, timezone
|
||||
from pathlib import Path
|
||||
from typing import Any
|
||||
|
||||
from .query import build_query_bundle_for_concept
|
||||
from .store import GroundRecallStore
|
||||
|
||||
|
||||
def _now() -> str:
|
||||
return datetime.now(timezone.utc).replace(microsecond=0).isoformat().replace("+00:00", "Z")
|
||||
|
||||
|
||||
def _write_json(path: Path, payload: dict[str, Any]) -> None:
|
||||
path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
|
||||
|
||||
|
||||
def _write_jsonl(path: Path, rows: list[dict[str, Any]]) -> None:
|
||||
text = "\n".join(json.dumps(row, sort_keys=True) for row in rows)
|
||||
if text:
|
||||
text += "\n"
|
||||
path.write_text(text, encoding="utf-8")
|
||||
|
||||
|
||||
def export_canonical_snapshot(
|
||||
store_dir: str | Path,
|
||||
out_dir: str | Path,
|
||||
snapshot_id: str | None = None,
|
||||
metadata: dict[str, Any] | None = None,
|
||||
) -> dict[str, str]:
|
||||
store = GroundRecallStore(store_dir)
|
||||
target = Path(out_dir)
|
||||
target.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
actual_snapshot_id = snapshot_id or f"snapshot-export-{datetime.now(timezone.utc).strftime('%Y%m%dT%H%M%SZ')}"
|
||||
snapshot = store.build_snapshot(
|
||||
snapshot_id=actual_snapshot_id,
|
||||
created_at=_now(),
|
||||
metadata={"export_kind": "canonical", **(metadata or {})},
|
||||
)
|
||||
store.save_snapshot(snapshot)
|
||||
|
||||
snapshot_path = target / "groundrecall_snapshot.json"
|
||||
_write_json(snapshot_path, snapshot.model_dump())
|
||||
_write_jsonl(target / "claims.jsonl", [item.model_dump() for item in snapshot.claims])
|
||||
_write_jsonl(target / "concepts.jsonl", [item.model_dump() for item in snapshot.concepts])
|
||||
_write_jsonl(target / "relations.jsonl", [item.model_dump() for item in snapshot.relations])
|
||||
provenance_manifest = {
|
||||
"snapshot_id": snapshot.snapshot_id,
|
||||
"created_at": snapshot.created_at,
|
||||
"source_count": len(snapshot.sources),
|
||||
"artifact_count": len(snapshot.artifacts),
|
||||
"observation_count": len(snapshot.observations),
|
||||
}
|
||||
_write_json(target / "provenance_manifest.json", provenance_manifest)
|
||||
manifest = {
|
||||
"export_kind": "canonical",
|
||||
"snapshot_id": snapshot.snapshot_id,
|
||||
"files": [
|
||||
"groundrecall_snapshot.json",
|
||||
"claims.jsonl",
|
||||
"concepts.jsonl",
|
||||
"relations.jsonl",
|
||||
"provenance_manifest.json",
|
||||
],
|
||||
}
|
||||
_write_json(target / "export_manifest.json", manifest)
|
||||
return {
|
||||
"snapshot_json": str(snapshot_path),
|
||||
"claims_jsonl": str(target / "claims.jsonl"),
|
||||
"concepts_jsonl": str(target / "concepts.jsonl"),
|
||||
"relations_jsonl": str(target / "relations.jsonl"),
|
||||
"provenance_manifest_json": str(target / "provenance_manifest.json"),
|
||||
"export_manifest_json": str(target / "export_manifest.json"),
|
||||
}
|
||||
|
||||
|
||||
def export_query_bundle(
|
||||
store_dir: str | Path,
|
||||
concept_ref: str,
|
||||
out_path: str | Path,
|
||||
) -> dict[str, Any]:
|
||||
payload = build_query_bundle_for_concept(store_dir, concept_ref)
|
||||
if payload is None:
|
||||
raise KeyError(f"Unknown concept reference: {concept_ref}")
|
||||
path = Path(out_path)
|
||||
path.parent.mkdir(parents=True, exist_ok=True)
|
||||
_write_json(path, payload)
|
||||
return payload
|
||||
|
||||
|
||||
def export_canonical_bundle(
|
||||
store_dir: str | Path,
|
||||
out_dir: str | Path,
|
||||
concept_refs: list[str] | None = None,
|
||||
snapshot_id: str | None = None,
|
||||
) -> dict[str, Any]:
|
||||
target = Path(out_dir)
|
||||
target.mkdir(parents=True, exist_ok=True)
|
||||
outputs = export_canonical_snapshot(store_dir, target, snapshot_id=snapshot_id)
|
||||
query_bundle_paths: list[str] = []
|
||||
for concept_ref in concept_refs or []:
|
||||
safe_name = concept_ref.lower().replace(" ", "-").replace("::", "-")
|
||||
bundle_path = target / f"query_bundle__{safe_name}.json"
|
||||
export_query_bundle(store_dir, concept_ref, bundle_path)
|
||||
query_bundle_paths.append(str(bundle_path))
|
||||
manifest = json.loads((target / "export_manifest.json").read_text(encoding="utf-8"))
|
||||
manifest["query_bundles"] = query_bundle_paths
|
||||
_write_json(target / "export_manifest.json", manifest)
|
||||
return {
|
||||
"canonical_outputs": outputs,
|
||||
"query_bundles": query_bundle_paths,
|
||||
}
|
||||
|
||||
|
||||
def build_parser() -> argparse.ArgumentParser:
|
||||
parser = argparse.ArgumentParser(description="Export canonical GroundRecall artifacts.")
|
||||
parser.add_argument("store_dir")
|
||||
parser.add_argument("out_dir")
|
||||
parser.add_argument("--snapshot-id", default=None)
|
||||
parser.add_argument("--concept", action="append", default=[])
|
||||
return parser
|
||||
|
||||
|
||||
def main() -> None:
|
||||
args = build_parser().parse_args()
|
||||
payload = export_canonical_bundle(
|
||||
store_dir=args.store_dir,
|
||||
out_dir=args.out_dir,
|
||||
concept_refs=list(args.concept or []),
|
||||
snapshot_id=args.snapshot_id,
|
||||
)
|
||||
print(json.dumps(payload, indent=2))
|
||||
|
|
@ -0,0 +1,12 @@
|
|||
from __future__ import annotations
|
||||
|
||||
"""Legacy flat GroundRecall assistant export module.
|
||||
|
||||
Compatibility path retained during the internal namespace migration.
|
||||
Prefer imports under ``didactopus.groundrecall.assistant_export`` or CLI usage
|
||||
via ``didactopus.groundrecall.cli`` for new code.
|
||||
"""
|
||||
|
||||
from .groundrecall.assistant_export import build_parser, export_assistant_bundle, main
|
||||
|
||||
__all__ = ["export_assistant_bundle", "build_parser", "main"]
|
||||
|
|
@ -0,0 +1,9 @@
|
|||
"""Legacy flat GroundRecall assistants package.
|
||||
|
||||
Compatibility path retained during the internal namespace migration.
|
||||
Prefer imports under ``didactopus.groundrecall.assistants`` for new code.
|
||||
"""
|
||||
|
||||
from .base import get_assistant_adapter, list_assistant_adapters
|
||||
|
||||
__all__ = ["get_assistant_adapter", "list_assistant_adapters"]
|
||||
|
|
@ -0,0 +1,43 @@
|
|||
from __future__ import annotations
|
||||
|
||||
"""Legacy flat GroundRecall assistant adapter base module.
|
||||
|
||||
Compatibility path retained during the internal namespace migration.
|
||||
Prefer imports under ``didactopus.groundrecall.assistants.base`` for new code.
|
||||
"""
|
||||
|
||||
from pathlib import Path
|
||||
from typing import Protocol
|
||||
|
||||
|
||||
class AssistantAdapter(Protocol):
|
||||
name: str
|
||||
|
||||
def export_bundle(self, snapshot: dict, query_bundles: list[dict], out_dir: str | Path) -> list[Path]:
|
||||
...
|
||||
|
||||
def build_context(self, query_result: dict) -> dict:
|
||||
...
|
||||
|
||||
def supported_capabilities(self) -> dict[str, bool]:
|
||||
...
|
||||
|
||||
|
||||
_REGISTRY: dict[str, AssistantAdapter] = {}
|
||||
|
||||
|
||||
def register_assistant_adapter(adapter: AssistantAdapter) -> AssistantAdapter:
|
||||
_REGISTRY[adapter.name] = adapter
|
||||
return adapter
|
||||
|
||||
|
||||
def get_assistant_adapter(name: str) -> AssistantAdapter:
|
||||
try:
|
||||
return _REGISTRY[name]
|
||||
except KeyError as exc:
|
||||
raise KeyError(f"Unknown assistant adapter: {name}") from exc
|
||||
|
||||
|
||||
def list_assistant_adapters() -> list[str]:
|
||||
return sorted(_REGISTRY)
|
||||
|
|
@ -0,0 +1,69 @@
|
|||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
from pathlib import Path
|
||||
|
||||
from .base import register_assistant_adapter
|
||||
|
||||
|
||||
class ClaudeCodeAdapter:
|
||||
name = "claude_code"
|
||||
|
||||
def export_bundle(self, snapshot: dict, query_bundles: list[dict], out_dir: str | Path) -> list[Path]:
|
||||
target = Path(out_dir)
|
||||
target.mkdir(parents=True, exist_ok=True)
|
||||
paths: list[Path] = []
|
||||
|
||||
memory_md = "\n".join(
|
||||
[
|
||||
"# GroundRecall Memory",
|
||||
"",
|
||||
f"- Snapshot: `{snapshot.get('snapshot_id', '')}`",
|
||||
f"- Concepts: {len(snapshot.get('concepts', []))}",
|
||||
f"- Claims: {len(snapshot.get('claims', []))}",
|
||||
"",
|
||||
"Prefer the canonical GroundRecall snapshot and query bundles over free-form recollection.",
|
||||
"",
|
||||
"## Query Bundles",
|
||||
]
|
||||
+ [f"- `{bundle.get('concept', {}).get('concept_id', 'unknown')}`" for bundle in query_bundles]
|
||||
)
|
||||
memory_path = target / "CLAUDE.md"
|
||||
memory_path.write_text(memory_md, encoding="utf-8")
|
||||
paths.append(memory_path)
|
||||
|
||||
bundle_path = target / "claude_code_bundle.json"
|
||||
bundle_path.write_text(
|
||||
json.dumps(
|
||||
{
|
||||
"assistant": "claude_code",
|
||||
"snapshot_id": snapshot.get("snapshot_id", ""),
|
||||
"query_bundle_count": len(query_bundles),
|
||||
"query_bundles": query_bundles,
|
||||
},
|
||||
indent=2,
|
||||
),
|
||||
encoding="utf-8",
|
||||
)
|
||||
paths.append(bundle_path)
|
||||
return paths
|
||||
|
||||
def build_context(self, query_result: dict) -> dict:
|
||||
return {
|
||||
"assistant": "claude_code",
|
||||
"memory_kind": "groundrecall_query_bundle",
|
||||
"concept": query_result.get("concept", {}),
|
||||
"claims": query_result.get("relevant_claims", []),
|
||||
"support": query_result.get("supporting_observations", []),
|
||||
"next_actions": query_result.get("suggested_next_actions", []),
|
||||
}
|
||||
|
||||
def supported_capabilities(self) -> dict[str, bool]:
|
||||
return {
|
||||
"skill_markdown": False,
|
||||
"json_bundle": True,
|
||||
"project_memory": True,
|
||||
}
|
||||
|
||||
|
||||
register_assistant_adapter(ClaudeCodeAdapter())
|
||||
|
|
@ -0,0 +1,78 @@
|
|||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
from pathlib import Path
|
||||
|
||||
from .base import register_assistant_adapter
|
||||
|
||||
|
||||
class CodexAdapter:
|
||||
name = "codex"
|
||||
|
||||
def export_bundle(self, snapshot: dict, query_bundles: list[dict], out_dir: str | Path) -> list[Path]:
|
||||
target = Path(out_dir)
|
||||
target.mkdir(parents=True, exist_ok=True)
|
||||
paths: list[Path] = []
|
||||
|
||||
skill_payload = {
|
||||
"name": f"groundrecall-{snapshot.get('snapshot_id', 'snapshot')}",
|
||||
"description": "GroundRecall assistant adapter bundle for Codex.",
|
||||
"snapshot_id": snapshot.get("snapshot_id", ""),
|
||||
"concept_count": len(snapshot.get("concepts", [])),
|
||||
"claim_count": len(snapshot.get("claims", [])),
|
||||
}
|
||||
skill_md = "\n".join(
|
||||
[
|
||||
"---",
|
||||
f"name: {skill_payload['name']}",
|
||||
f"description: {skill_payload['description']}",
|
||||
"---",
|
||||
"",
|
||||
"# GroundRecall Codex Bundle",
|
||||
"",
|
||||
f"- Snapshot: `{skill_payload['snapshot_id']}`",
|
||||
f"- Concepts: {skill_payload['concept_count']}",
|
||||
f"- Claims: {skill_payload['claim_count']}",
|
||||
"",
|
||||
"Use the accompanying canonical JSON and query bundles as the primary source of grounded context.",
|
||||
]
|
||||
)
|
||||
skill_path = target / "SKILL.md"
|
||||
skill_path.write_text(skill_md, encoding="utf-8")
|
||||
paths.append(skill_path)
|
||||
|
||||
bundle_path = target / "codex_bundle.json"
|
||||
bundle_path.write_text(
|
||||
json.dumps(
|
||||
{
|
||||
"assistant": "codex",
|
||||
"snapshot_id": snapshot.get("snapshot_id", ""),
|
||||
"query_bundle_count": len(query_bundles),
|
||||
"query_bundles": query_bundles,
|
||||
},
|
||||
indent=2,
|
||||
),
|
||||
encoding="utf-8",
|
||||
)
|
||||
paths.append(bundle_path)
|
||||
return paths
|
||||
|
||||
def build_context(self, query_result: dict) -> dict:
|
||||
return {
|
||||
"assistant": "codex",
|
||||
"context_kind": "groundrecall_query_bundle",
|
||||
"concept": query_result.get("concept", {}),
|
||||
"relevant_claims": query_result.get("relevant_claims", []),
|
||||
"supporting_observations": query_result.get("supporting_observations", []),
|
||||
"suggested_next_actions": query_result.get("suggested_next_actions", []),
|
||||
}
|
||||
|
||||
def supported_capabilities(self) -> dict[str, bool]:
|
||||
return {
|
||||
"skill_markdown": True,
|
||||
"json_bundle": True,
|
||||
"project_memory": False,
|
||||
}
|
||||
|
||||
|
||||
register_assistant_adapter(CodexAdapter())
|
||||
|
|
@ -0,0 +1,54 @@
|
|||
from __future__ import annotations
|
||||
|
||||
from dataclasses import dataclass
|
||||
from pathlib import Path
|
||||
|
||||
|
||||
TEXT_EXTENSIONS = {
|
||||
".md",
|
||||
".markdown",
|
||||
".txt",
|
||||
".tex",
|
||||
".json",
|
||||
".yaml",
|
||||
".yml",
|
||||
".csv",
|
||||
".log",
|
||||
}
|
||||
|
||||
|
||||
@dataclass
|
||||
class DiscoveredArtifact:
|
||||
path: Path
|
||||
relative_path: str
|
||||
artifact_kind: str
|
||||
is_text: bool
|
||||
|
||||
|
||||
def classify_artifact(root: Path, path: Path) -> DiscoveredArtifact:
|
||||
rel = path.relative_to(root).as_posix()
|
||||
top = rel.split("/", 1)[0]
|
||||
suffix = path.suffix.lower()
|
||||
is_text = suffix in TEXT_EXTENSIONS or path.name in {"README", "LICENSE"}
|
||||
artifact_kind = "generic_artifact"
|
||||
if top == "wiki":
|
||||
artifact_kind = "compiled_page"
|
||||
elif top in {"raw", "sources"}:
|
||||
artifact_kind = "raw_note"
|
||||
elif top == "logs":
|
||||
artifact_kind = "session_log"
|
||||
elif path.name.startswith("schema."):
|
||||
artifact_kind = "schema_file"
|
||||
elif suffix in {".md", ".markdown"}:
|
||||
artifact_kind = "markdown_note"
|
||||
return DiscoveredArtifact(path=path, relative_path=rel, artifact_kind=artifact_kind, is_text=is_text)
|
||||
|
||||
|
||||
def discover_llmwiki_artifacts(root: str | Path) -> list[DiscoveredArtifact]:
|
||||
base = Path(root)
|
||||
artifacts: list[DiscoveredArtifact] = []
|
||||
for path in sorted(p for p in base.rglob("*") if p.is_file()):
|
||||
if any(part in {".git", "__pycache__", ".pytest_cache"} for part in path.parts):
|
||||
continue
|
||||
artifacts.append(classify_artifact(base, path))
|
||||
return artifacts
|
||||
|
|
@ -0,0 +1,24 @@
|
|||
from __future__ import annotations
|
||||
|
||||
"""Legacy flat GroundRecall export module.
|
||||
|
||||
Compatibility path retained during the internal namespace migration.
|
||||
Prefer imports under ``didactopus.groundrecall.export`` or CLI usage via
|
||||
``didactopus.groundrecall.cli`` for new code.
|
||||
"""
|
||||
|
||||
from .groundrecall.export import (
|
||||
build_parser,
|
||||
export_canonical_bundle,
|
||||
export_canonical_snapshot,
|
||||
export_query_bundle,
|
||||
main,
|
||||
)
|
||||
|
||||
__all__ = [
|
||||
"export_canonical_snapshot",
|
||||
"export_query_bundle",
|
||||
"export_canonical_bundle",
|
||||
"build_parser",
|
||||
"main",
|
||||
]
|
||||
|
|
@ -0,0 +1,9 @@
|
|||
from __future__ import annotations
|
||||
|
||||
"""Legacy extracted GroundRecall import module.
|
||||
|
||||
Compatibility path retained while the standalone repo converges on the
|
||||
top-level ``groundrecall.ingest`` module as the primary implementation.
|
||||
"""
|
||||
|
||||
from .ingest import ImportResult, build_parser, main, run_groundrecall_import
|
||||
|
|
@ -0,0 +1,9 @@
|
|||
from __future__ import annotations
|
||||
|
||||
"""Legacy extracted GroundRecall lint module.
|
||||
|
||||
Compatibility path retained while the standalone repo converges on the
|
||||
top-level ``groundrecall.lint`` module as the primary implementation.
|
||||
"""
|
||||
|
||||
from .lint import build_parser, lint_import_directory, main
|
||||
|
|
@ -0,0 +1,9 @@
|
|||
from __future__ import annotations
|
||||
|
||||
"""Legacy extracted GroundRecall models module.
|
||||
|
||||
Compatibility path retained while the standalone repo converges on the
|
||||
top-level ``groundrecall.models`` module as the primary implementation.
|
||||
"""
|
||||
|
||||
from .models import * # noqa: F403
|
||||
|
|
@ -0,0 +1,136 @@
|
|||
from __future__ import annotations
|
||||
|
||||
from dataclasses import asdict, dataclass
|
||||
from hashlib import sha256
|
||||
from pathlib import Path
|
||||
from typing import Any
|
||||
|
||||
from .groundrecall_discovery import DiscoveredArtifact
|
||||
from .groundrecall_segmenter import SegmentedPage, SegmentedObservation
|
||||
|
||||
|
||||
@dataclass
|
||||
class ImportContext:
|
||||
import_id: str
|
||||
import_mode: str
|
||||
machine_id: str
|
||||
agent_id: str
|
||||
source_root: str
|
||||
imported_at: str
|
||||
|
||||
|
||||
def _sanitize_claim_key(value: str) -> str:
|
||||
text = "".join(ch.lower() if ch.isalnum() else "-" for ch in value).strip("-")
|
||||
return text or "claim"
|
||||
|
||||
|
||||
def _claim_id_for_observation(observation_record: dict[str, Any], observation: SegmentedObservation, index: int) -> str:
|
||||
if observation.explicit_claim_key:
|
||||
return f"clm_{_sanitize_claim_key(observation.explicit_claim_key)}"
|
||||
return f"clm_{observation_record['observation_id']}_{index}"
|
||||
|
||||
|
||||
def build_artifact_record(context: ImportContext, artifact: DiscoveredArtifact, page: SegmentedPage | None) -> dict[str, Any]:
|
||||
record = {
|
||||
"artifact_id": f"ia_{sha256(artifact.relative_path.encode('utf-8')).hexdigest()[:12]}",
|
||||
"import_id": context.import_id,
|
||||
"artifact_kind": artifact.artifact_kind,
|
||||
"path": artifact.relative_path,
|
||||
"title": page.title if page else Path(artifact.relative_path).stem,
|
||||
"sha256": sha256(artifact.path.read_bytes()).hexdigest(),
|
||||
"created_at": context.imported_at,
|
||||
"metadata": {
|
||||
"frontmatter": page.frontmatter if page else {},
|
||||
"headings": page.headings if page else [],
|
||||
},
|
||||
"current_status": "draft",
|
||||
}
|
||||
return record
|
||||
|
||||
|
||||
def build_observation_record(
|
||||
context: ImportContext,
|
||||
artifact_record: dict[str, Any],
|
||||
observation: SegmentedObservation,
|
||||
index: int,
|
||||
) -> dict[str, Any]:
|
||||
return {
|
||||
"observation_id": f"obs_{artifact_record['artifact_id']}_{index}",
|
||||
"import_id": context.import_id,
|
||||
"artifact_id": artifact_record["artifact_id"],
|
||||
"role": observation.role,
|
||||
"text": observation.text,
|
||||
"origin_path": observation.artifact_relative_path,
|
||||
"origin_section": observation.section,
|
||||
"line_start": observation.line_start,
|
||||
"line_end": observation.line_end,
|
||||
"grounding_status": observation.grounding_status,
|
||||
"support_kind": observation.support_kind,
|
||||
"confidence_hint": observation.confidence_hint,
|
||||
"current_status": "draft",
|
||||
}
|
||||
|
||||
|
||||
def build_claim_record(
|
||||
context: ImportContext,
|
||||
observation_record: dict[str, Any],
|
||||
observation: SegmentedObservation,
|
||||
concept_ids: list[str],
|
||||
index: int,
|
||||
) -> dict[str, Any]:
|
||||
return {
|
||||
"claim_id": _claim_id_for_observation(observation_record, observation, index),
|
||||
"import_id": context.import_id,
|
||||
"claim_text": observation_record["text"],
|
||||
"claim_kind": "statement" if observation_record["role"] == "claim" else "summary",
|
||||
"source_observation_ids": [observation_record["observation_id"]],
|
||||
"supporting_fragment_ids": [],
|
||||
"concept_ids": [f"concept::{concept_id}" for concept_id in concept_ids],
|
||||
"contradicts_claim_ids": [f"clm_{_sanitize_claim_key(value)}" for value in observation.contradict_keys],
|
||||
"supersedes_claim_ids": [f"clm_{_sanitize_claim_key(value)}" for value in observation.supersede_keys],
|
||||
"confidence_hint": observation_record["confidence_hint"],
|
||||
"grounding_status": observation_record["grounding_status"],
|
||||
"current_status": "triaged" if observation_record["grounding_status"] != "ungrounded" else "draft",
|
||||
}
|
||||
|
||||
|
||||
def build_concept_records(context: ImportContext, artifact_record: dict[str, Any], concept_ids: list[str]) -> list[dict[str, Any]]:
|
||||
records = []
|
||||
for concept_id in concept_ids:
|
||||
records.append(
|
||||
{
|
||||
"concept_id": f"concept::{concept_id}",
|
||||
"import_id": context.import_id,
|
||||
"title": concept_id.replace("-", " ").title(),
|
||||
"aliases": [],
|
||||
"description": "Imported concept from llmwiki corpus.",
|
||||
"source_artifact_ids": [artifact_record["artifact_id"]],
|
||||
"current_status": "triaged",
|
||||
}
|
||||
)
|
||||
return records
|
||||
|
||||
|
||||
def build_relation_records(context: ImportContext, artifact_record: dict[str, Any], concept_ids: list[str], links: list[str]) -> list[dict[str, Any]]:
|
||||
if not concept_ids:
|
||||
return []
|
||||
primary = f"concept::{concept_ids[0]}"
|
||||
records = []
|
||||
for idx, link in enumerate(links, start=1):
|
||||
target = f"concept::{link.lower().replace(' ', '-')}"
|
||||
records.append(
|
||||
{
|
||||
"relation_id": f"rel_{artifact_record['artifact_id']}_{idx}",
|
||||
"import_id": context.import_id,
|
||||
"source_id": primary,
|
||||
"target_id": target,
|
||||
"relation_type": "references",
|
||||
"evidence_ids": [],
|
||||
"current_status": "draft",
|
||||
}
|
||||
)
|
||||
return records
|
||||
|
||||
|
||||
def manifest_record(context: ImportContext) -> dict[str, Any]:
|
||||
return asdict(context) | {"source_repo_kind": "llmwiki"}
|
||||
|
|
@ -0,0 +1,9 @@
|
|||
from __future__ import annotations
|
||||
|
||||
"""Legacy extracted GroundRecall promotion module.
|
||||
|
||||
Compatibility path retained while the standalone repo converges on the
|
||||
top-level ``groundrecall.promotion`` module as the primary implementation.
|
||||
"""
|
||||
|
||||
from .promotion import build_parser, main, promote_import_to_store
|
||||
|
|
@ -0,0 +1,26 @@
|
|||
from __future__ import annotations
|
||||
|
||||
"""Legacy flat GroundRecall query module.
|
||||
|
||||
Compatibility path retained during the internal namespace migration.
|
||||
Prefer imports under ``didactopus.groundrecall.query`` or CLI usage via
|
||||
``didactopus.groundrecall.cli`` for new code.
|
||||
"""
|
||||
|
||||
from .groundrecall.query import (
|
||||
build_parser,
|
||||
build_query_bundle_for_concept,
|
||||
main,
|
||||
query_concept,
|
||||
query_provenance,
|
||||
search_claims,
|
||||
)
|
||||
|
||||
__all__ = [
|
||||
"query_concept",
|
||||
"search_claims",
|
||||
"query_provenance",
|
||||
"build_query_bundle_for_concept",
|
||||
"build_parser",
|
||||
"main",
|
||||
]
|
||||
|
|
@ -0,0 +1,138 @@
|
|||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
from collections import defaultdict
|
||||
from pathlib import Path
|
||||
from typing import Any
|
||||
|
||||
from .review_export import build_citation_review_entries_from_import, export_review_state_json, export_review_ui_data
|
||||
from .review_schema import ConceptReviewEntry, DraftPackData, ReviewSession
|
||||
|
||||
|
||||
def _read_json(path: Path) -> dict[str, Any]:
|
||||
return json.loads(path.read_text(encoding="utf-8"))
|
||||
|
||||
|
||||
def _read_jsonl(path: Path) -> list[dict[str, Any]]:
|
||||
if not path.exists():
|
||||
return []
|
||||
text = path.read_text(encoding="utf-8").strip()
|
||||
if not text:
|
||||
return []
|
||||
return [json.loads(line) for line in text.splitlines()]
|
||||
|
||||
|
||||
def _claim_summary(claims: list[dict[str, Any]]) -> list[str]:
|
||||
lines: list[str] = []
|
||||
for claim in claims[:3]:
|
||||
grounding = claim.get("grounding_status", "unknown")
|
||||
lines.append(f"Claim: {claim.get('claim_text', '')} [{grounding}]")
|
||||
if len(claims) > 3:
|
||||
lines.append(f"{len(claims) - 3} additional claims omitted from notes summary.")
|
||||
return lines
|
||||
|
||||
|
||||
def build_review_session_from_import(import_dir: str | Path, reviewer: str = "GroundRecall Import") -> ReviewSession:
|
||||
base = Path(import_dir)
|
||||
manifest = _read_json(base / "manifest.json")
|
||||
lint_payload = _read_json(base / "lint_findings.json")
|
||||
claims = _read_jsonl(base / "claims.jsonl")
|
||||
concepts = _read_jsonl(base / "concepts.jsonl")
|
||||
|
||||
claims_by_concept: defaultdict[str, list[dict[str, Any]]] = defaultdict(list)
|
||||
for claim in claims:
|
||||
for concept_id in claim.get("concept_ids", []):
|
||||
claims_by_concept[concept_id].append(claim)
|
||||
|
||||
findings_by_target: defaultdict[str, list[dict[str, Any]]] = defaultdict(list)
|
||||
concept_findings: defaultdict[str, list[dict[str, Any]]] = defaultdict(list)
|
||||
for finding in lint_payload.get("findings", []):
|
||||
findings_by_target[finding["target_id"]].append(finding)
|
||||
for claim in claims:
|
||||
for concept_id in claim.get("concept_ids", []):
|
||||
concept_findings[concept_id].extend(findings_by_target.get(claim["claim_id"], []))
|
||||
for concept in concepts:
|
||||
concept_findings[concept["concept_id"]].extend(findings_by_target.get(concept["concept_id"], []))
|
||||
|
||||
entries: list[ConceptReviewEntry] = []
|
||||
for concept in concepts:
|
||||
concept_id = concept["concept_id"]
|
||||
related_claims = claims_by_concept.get(concept_id, [])
|
||||
related_findings = concept_findings.get(concept_id, [])
|
||||
has_errors = any(item["severity"] == "error" for item in related_findings)
|
||||
all_grounded = bool(related_claims) and all(item.get("grounding_status") == "grounded" for item in related_claims)
|
||||
status = "needs_review"
|
||||
if not has_errors and all_grounded:
|
||||
status = "provisional"
|
||||
|
||||
notes = _claim_summary(related_claims)
|
||||
notes.extend(item["message"] for item in related_findings[:5])
|
||||
|
||||
entries.append(
|
||||
ConceptReviewEntry(
|
||||
concept_id=concept_id.replace("concept::", "", 1),
|
||||
title=concept.get("title", concept_id),
|
||||
description=concept.get("description", ""),
|
||||
prerequisites=[],
|
||||
mastery_signals=[],
|
||||
status=status,
|
||||
notes=notes,
|
||||
)
|
||||
)
|
||||
|
||||
conflicts = [item["message"] for item in lint_payload.get("findings", []) if item["severity"] == "error"]
|
||||
review_flags = [item["message"] for item in lint_payload.get("findings", []) if item["severity"] == "warning"]
|
||||
pack = {
|
||||
"name": f"groundrecall-import-{manifest['import_id']}",
|
||||
"display_name": f"GroundRecall Import {manifest['import_id']}",
|
||||
"version": "0.1.0-draft",
|
||||
"source_import_id": manifest["import_id"],
|
||||
"source_root": manifest.get("source_root", ""),
|
||||
}
|
||||
attribution = {
|
||||
"source_repo_kind": manifest.get("source_repo_kind", "llmwiki"),
|
||||
"source_root": manifest.get("source_root", ""),
|
||||
"imported_at": manifest.get("imported_at", ""),
|
||||
"machine_id": manifest.get("machine_id", ""),
|
||||
"rights_note": "Imported llmwiki-style corpus requires review before promotion.",
|
||||
}
|
||||
return ReviewSession(
|
||||
reviewer=reviewer,
|
||||
draft_pack=DraftPackData(
|
||||
pack=pack,
|
||||
concepts=entries,
|
||||
conflicts=conflicts,
|
||||
review_flags=review_flags,
|
||||
attribution=attribution,
|
||||
),
|
||||
citation_reviews=build_citation_review_entries_from_import(base),
|
||||
)
|
||||
|
||||
|
||||
def export_review_bundle_from_import(import_dir: str | Path, out_dir: str | Path | None = None, reviewer: str = "GroundRecall Import") -> dict[str, str]:
|
||||
base = Path(import_dir)
|
||||
target = Path(out_dir) if out_dir is not None else base
|
||||
target.mkdir(parents=True, exist_ok=True)
|
||||
session = build_review_session_from_import(base, reviewer=reviewer)
|
||||
review_state_path = target / "review_session.json"
|
||||
export_review_state_json(session, review_state_path)
|
||||
export_review_ui_data(session, target, import_dir=base)
|
||||
return {
|
||||
"review_session_json": str(review_state_path),
|
||||
"review_data_json": str(target / "review_data.json"),
|
||||
}
|
||||
|
||||
|
||||
def build_parser() -> argparse.ArgumentParser:
|
||||
parser = argparse.ArgumentParser(description="Build Didactopus review artifacts from a GroundRecall import.")
|
||||
parser.add_argument("import_dir")
|
||||
parser.add_argument("--out-dir", default=None)
|
||||
parser.add_argument("--reviewer", default="GroundRecall Import")
|
||||
return parser
|
||||
|
||||
|
||||
def main() -> None:
|
||||
args = build_parser().parse_args()
|
||||
outputs = export_review_bundle_from_import(args.import_dir, out_dir=args.out_dir, reviewer=args.reviewer)
|
||||
print(json.dumps(outputs, indent=2))
|
||||
|
|
@ -0,0 +1,114 @@
|
|||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
from collections import defaultdict
|
||||
from pathlib import Path
|
||||
from typing import Any
|
||||
|
||||
|
||||
def _read_json(path: Path) -> dict[str, Any]:
|
||||
return json.loads(path.read_text(encoding="utf-8"))
|
||||
|
||||
|
||||
def _read_jsonl(path: Path) -> list[dict[str, Any]]:
|
||||
if not path.exists():
|
||||
return []
|
||||
text = path.read_text(encoding="utf-8").strip()
|
||||
if not text:
|
||||
return []
|
||||
return [json.loads(line) for line in text.splitlines()]
|
||||
|
||||
|
||||
def _triage_lane(item: dict[str, Any], finding_codes: set[str]) -> str:
|
||||
if {"claim_ungrounded", "ungrounded_summary"} & finding_codes:
|
||||
return "source_cleanup"
|
||||
if {"relation_missing_source", "relation_missing_target", "orphan_concept"} & finding_codes:
|
||||
return "conflict_resolution"
|
||||
return "knowledge_capture"
|
||||
|
||||
|
||||
def _priority(item: dict[str, Any], finding_codes: set[str]) -> int:
|
||||
priority = 50
|
||||
if item.get("grounding_status") == "grounded":
|
||||
priority -= 10
|
||||
if item.get("current_status") == "triaged":
|
||||
priority -= 5
|
||||
if any(code.startswith("claim_") or code.startswith("relation_") for code in finding_codes):
|
||||
priority += 20
|
||||
priority -= min(len(finding_codes) * 2, 10)
|
||||
return max(priority, 1)
|
||||
|
||||
|
||||
def build_review_queue(import_dir: str | Path) -> dict[str, Any]:
|
||||
base = Path(import_dir)
|
||||
manifest = _read_json(base / "manifest.json")
|
||||
lint_payload = _read_json(base / "lint_findings.json")
|
||||
claims = _read_jsonl(base / "claims.jsonl")
|
||||
concepts = _read_jsonl(base / "concepts.jsonl")
|
||||
|
||||
findings_by_target: defaultdict[str, list[dict[str, Any]]] = defaultdict(list)
|
||||
for finding in lint_payload.get("findings", []):
|
||||
findings_by_target[finding["target_id"]].append(finding)
|
||||
|
||||
queue: list[dict[str, Any]] = []
|
||||
|
||||
for claim in claims:
|
||||
related = findings_by_target.get(claim["claim_id"], [])
|
||||
finding_codes = {item["code"] for item in related}
|
||||
queue.append(
|
||||
{
|
||||
"queue_id": f"rq_{claim['claim_id']}",
|
||||
"candidate_type": "claim",
|
||||
"candidate_id": claim["claim_id"],
|
||||
"title": claim["claim_text"][:100],
|
||||
"triage_lane": _triage_lane(claim, finding_codes),
|
||||
"priority": _priority(claim, finding_codes),
|
||||
"grounding_status": claim.get("grounding_status"),
|
||||
"status": "needs_review",
|
||||
"finding_codes": sorted(finding_codes),
|
||||
"concept_ids": list(claim.get("concept_ids", [])),
|
||||
}
|
||||
)
|
||||
|
||||
for concept in concepts:
|
||||
related = findings_by_target.get(concept["concept_id"], [])
|
||||
finding_codes = {item["code"] for item in related}
|
||||
if not finding_codes:
|
||||
continue
|
||||
queue.append(
|
||||
{
|
||||
"queue_id": f"rq_{concept['concept_id'].replace('::', '_')}",
|
||||
"candidate_type": "concept",
|
||||
"candidate_id": concept["concept_id"],
|
||||
"title": concept["title"],
|
||||
"triage_lane": _triage_lane(concept, finding_codes),
|
||||
"priority": _priority(concept, finding_codes),
|
||||
"grounding_status": concept.get("grounding_status", "triaged"),
|
||||
"status": "needs_review",
|
||||
"finding_codes": sorted(finding_codes),
|
||||
"concept_ids": [concept["concept_id"]],
|
||||
}
|
||||
)
|
||||
|
||||
queue.sort(key=lambda item: (item["priority"], item["candidate_type"], item["candidate_id"]))
|
||||
return {
|
||||
"import_id": manifest["import_id"],
|
||||
"queue_length": len(queue),
|
||||
"items": queue,
|
||||
}
|
||||
|
||||
|
||||
def build_parser() -> argparse.ArgumentParser:
|
||||
parser = argparse.ArgumentParser(description="Build a GroundRecall review queue from import artifacts.")
|
||||
parser.add_argument("import_dir")
|
||||
parser.add_argument("--out", default=None)
|
||||
return parser
|
||||
|
||||
|
||||
def main() -> None:
|
||||
args = build_parser().parse_args()
|
||||
payload = build_review_queue(args.import_dir)
|
||||
out_path = Path(args.out) if args.out else Path(args.import_dir) / "review_queue.json"
|
||||
out_path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
|
||||
print(f"Wrote {out_path}")
|
||||
|
|
@ -0,0 +1,180 @@
|
|||
from __future__ import annotations
|
||||
|
||||
import re
|
||||
from dataclasses import dataclass, field
|
||||
from pathlib import Path
|
||||
|
||||
from .groundrecall_discovery import DiscoveredArtifact
|
||||
|
||||
|
||||
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")
|
||||
FRONTMATTER_DELIM = "---"
|
||||
ANNOTATION_RE = re.compile(r"\[(claim_id|contradicts|supersedes):([^\]]+)\]", re.IGNORECASE)
|
||||
TABLE_SEPARATOR_RE = re.compile(r"^\|(?:\s*:?-{3,}:?\s*\|)+\s*$")
|
||||
LATEX_STRUCTURAL_RE = re.compile(r"^\\(begin|end|centering|caption|label|tikzset|node|draw|path|matrix|includegraphics)\b")
|
||||
LATEX_MATH_ONLY_RE = re.compile(r"^[\\{}[\]()$&_^%.,;:=+\-*/|<>~0-9A-Za-z ]+$")
|
||||
|
||||
|
||||
@dataclass
|
||||
class SegmentedObservation:
|
||||
artifact_relative_path: str
|
||||
role: str
|
||||
text: str
|
||||
section: str
|
||||
line_start: int
|
||||
line_end: int
|
||||
grounding_status: str
|
||||
support_kind: str
|
||||
confidence_hint: float
|
||||
explicit_claim_key: str = ""
|
||||
contradict_keys: list[str] = field(default_factory=list)
|
||||
supersede_keys: list[str] = field(default_factory=list)
|
||||
|
||||
|
||||
@dataclass
|
||||
class SegmentedPage:
|
||||
title: str
|
||||
headings: list[str] = field(default_factory=list)
|
||||
frontmatter: dict[str, str] = field(default_factory=dict)
|
||||
observations: list[SegmentedObservation] = field(default_factory=list)
|
||||
concepts: list[str] = field(default_factory=list)
|
||||
links: list[str] = field(default_factory=list)
|
||||
|
||||
|
||||
def _parse_frontmatter(lines: list[str]) -> tuple[dict[str, str], int]:
|
||||
if not lines or lines[0].strip() != FRONTMATTER_DELIM:
|
||||
return {}, 0
|
||||
data: dict[str, str] = {}
|
||||
idx = 1
|
||||
while idx < len(lines):
|
||||
stripped = lines[idx].strip()
|
||||
if stripped == FRONTMATTER_DELIM:
|
||||
return data, idx + 1
|
||||
if ":" in stripped:
|
||||
key, value = stripped.split(":", 1)
|
||||
data[key.strip()] = value.strip()
|
||||
idx += 1
|
||||
return data, 0
|
||||
|
||||
|
||||
def _extract_links(text: str) -> list[str]:
|
||||
return re.findall(r"\[\[([^\]]+)\]\]", text)
|
||||
|
||||
|
||||
def _to_concept_id(text: str) -> str:
|
||||
text = re.sub(r"[^a-zA-Z0-9]+", "-", text.strip().lower()).strip("-")
|
||||
return text or "untitled"
|
||||
|
||||
|
||||
def _parse_annotations(text: str) -> tuple[str, str, list[str], list[str]]:
|
||||
claim_key = ""
|
||||
contradict_keys: list[str] = []
|
||||
supersede_keys: list[str] = []
|
||||
for kind, raw_value in ANNOTATION_RE.findall(text):
|
||||
values = [value.strip() for value in raw_value.split(",") if value.strip()]
|
||||
kind_lower = kind.lower()
|
||||
if kind_lower == "claim_id" and values:
|
||||
claim_key = values[0]
|
||||
elif kind_lower == "contradicts":
|
||||
contradict_keys.extend(values)
|
||||
elif kind_lower == "supersedes":
|
||||
supersede_keys.extend(values)
|
||||
cleaned = ANNOTATION_RE.sub("", text)
|
||||
cleaned = re.sub(r"\s{2,}", " ", cleaned).strip()
|
||||
return cleaned, claim_key, contradict_keys, supersede_keys
|
||||
|
||||
|
||||
def _should_skip_line(text: str) -> bool:
|
||||
"""Return True for markup-only lines (fences, tables, LaTeX scaffolding) that carry no claims."""
|
||||
stripped = text.strip()
|
||||
if not stripped:
|
||||
return True
|
||||
if stripped.startswith("!["):
|
||||
return True
|
||||
if stripped in {"---", "```", "};", "{", "}", "</div>", "<div>", ":::"}:
|
||||
return True
|
||||
if stripped.startswith(":::"):
|
||||
return True
|
||||
if stripped.startswith("|") and stripped.endswith("|"):
|
||||
return True
|
||||
if TABLE_SEPARATOR_RE.match(stripped):
|
||||
return True
|
||||
if LATEX_STRUCTURAL_RE.match(stripped):
|
||||
return True
|
||||
if stripped.startswith("%"):
|
||||
return True
|
||||
if stripped.startswith("\\") and LATEX_MATH_ONLY_RE.match(stripped):
|
||||
return True
|
||||
return False
|
||||
|
||||
|
||||
def segment_markdown_artifact(artifact: DiscoveredArtifact, text: str | None = None) -> SegmentedPage:
|
||||
text = artifact.path.read_text(encoding="utf-8") if text is None else text
|
||||
lines = text.splitlines()
|
||||
frontmatter, start_idx = _parse_frontmatter(lines)
|
||||
current_section = frontmatter.get("title", Path(artifact.relative_path).stem.replace("-", " ").title())
|
||||
title = current_section
|
||||
headings: list[str] = []
|
||||
observations: list[SegmentedObservation] = []
|
||||
concepts: list[str] = []
|
||||
links: list[str] = []
|
||||
|
||||
for idx in range(start_idx, len(lines)):
|
||||
raw_line = lines[idx]
|
||||
stripped = raw_line.strip()
|
||||
if _should_skip_line(stripped):
|
||||
continue
|
||||
heading_match = HEADING_RE.match(raw_line)
|
||||
if heading_match:
|
||||
current_section = heading_match.group(2).strip()
|
||||
headings.append(current_section)
|
||||
if not title and heading_match.group(1) == "#":
|
||||
title = current_section
|
||||
concepts.append(_to_concept_id(current_section))
|
||||
continue
|
||||
|
||||
role = "summary"
|
||||
obs_text = stripped
|
||||
if stripped.startswith(("- ", "* ")):
|
||||
role = "claim"
|
||||
obs_text = stripped[2:].strip()
|
||||
elif stripped.lower().startswith(("todo:", "question:", "q:")):
|
||||
role = "question"
|
||||
elif stripped.lower().startswith(("speculation:", "hypothesis:")):
|
||||
role = "speculation"
|
||||
elif artifact.artifact_kind == "session_log":
|
||||
role = "transcript"
|
||||
|
||||
obs_text, claim_key, contradict_keys, supersede_keys = _parse_annotations(obs_text)
|
||||
|
||||
links.extend(_extract_links(obs_text))
|
||||
if role in {"summary", "claim"}:
|
||||
concepts.extend(_to_concept_id(link) for link in _extract_links(obs_text))
|
||||
observations.append(
|
||||
SegmentedObservation(
|
||||
artifact_relative_path=artifact.relative_path,
|
||||
role=role,
|
||||
text=obs_text,
|
||||
section=current_section,
|
||||
line_start=idx + 1,
|
||||
line_end=idx + 1,
|
||||
grounding_status="partially_grounded" if artifact.artifact_kind == "compiled_page" else "grounded",
|
||||
support_kind="derived_from_page" if artifact.artifact_kind == "compiled_page" else "direct_source",
|
||||
confidence_hint=0.55 if role == "speculation" else 0.7 if role == "claim" else 0.6,
|
||||
explicit_claim_key=claim_key,
|
||||
contradict_keys=contradict_keys,
|
||||
supersede_keys=supersede_keys,
|
||||
)
|
||||
)
|
||||
|
||||
if not headings and title:
|
||||
headings.append(title)
|
||||
if not concepts and title:
|
||||
concepts.append(_to_concept_id(title))
|
||||
return SegmentedPage(
|
||||
title=title,
|
||||
headings=headings,
|
||||
frontmatter=frontmatter,
|
||||
observations=observations,
|
||||
concepts=sorted({c for c in concepts if c}),
|
||||
links=sorted({link for link in links if link}),
|
||||
)
|
||||
|
|
@@ -0,0 +1,15 @@
"""Legacy flat GroundRecall source adapter package.

Compatibility path retained during the internal namespace migration.
Prefer imports under ``groundrecall.source_adapters`` for new code.
"""

from .base import get_source_adapter, list_source_adapters
from . import llmwiki  # noqa: F401
from . import polypaper  # noqa: F401
from . import doclift_bundle  # noqa: F401
from . import markdown_notes  # noqa: F401
from . import transcript  # noqa: F401
from . import didactopus_pack  # noqa: F401

__all__ = ["get_source_adapter", "list_source_adapters"]
@@ -0,0 +1,76 @@
"""Legacy flat GroundRecall source adapter base module.

Compatibility path retained during the internal namespace migration.
Prefer imports under ``groundrecall.source_adapters.base`` for new code.
"""

from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path
from typing import Literal, Protocol


ImportIntent = Literal["grounded_knowledge", "curriculum", "both"]


@dataclass
class DiscoveredImportSource:
    path: Path
    relative_path: str
    source_kind: str
    artifact_kind: str
    is_text: bool
    metadata: dict


@dataclass
class StructuredImportRows:
    artifact_rows: list[dict]
    observation_rows: list[dict]
    claim_rows: list[dict]
    concept_rows: list[dict]
    relation_rows: list[dict]


class GroundRecallSourceAdapter(Protocol):
    name: str

    def detect(self, root: str | Path) -> bool:
        ...

    def discover(self, root: str | Path) -> list[DiscoveredImportSource]:
        ...

    def import_intent(self) -> ImportIntent:
        ...

    def build_rows(self, context, sources: list[DiscoveredImportSource]) -> StructuredImportRows | None:
        ...


_REGISTRY: dict[str, GroundRecallSourceAdapter] = {}


def register_source_adapter(adapter: GroundRecallSourceAdapter) -> GroundRecallSourceAdapter:
    _REGISTRY[adapter.name] = adapter
    return adapter


def get_source_adapter(name: str) -> GroundRecallSourceAdapter:
    try:
        return _REGISTRY[name]
    except KeyError as exc:
        raise KeyError(f"Unknown GroundRecall source adapter: {name}") from exc


def list_source_adapters() -> list[str]:
    return sorted(_REGISTRY)


def detect_source_adapter(root: str | Path) -> GroundRecallSourceAdapter:
    for adapter in _REGISTRY.values():
        if adapter.detect(root):
            return adapter
    raise ValueError(f"No GroundRecall source adapter detected for {root}")
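
# Registry usage sketch (the root path is illustrative):
#
#     adapter = get_source_adapter("llmwiki")               # explicit lookup by name
#     adapter = detect_source_adapter("/data/wiki-export")  # or first adapter whose detect() matches
#     sources = adapter.discover("/data/wiki-export")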
@@ -0,0 +1,234 @@
from __future__ import annotations

from hashlib import sha256
from pathlib import Path

import yaml

from ..artifact_schemas import ConceptsFile, RoadmapFile
from .base import DiscoveredImportSource, StructuredImportRows, register_source_adapter


class DidactopusPackSourceAdapter:
    name = "didactopus_pack"

    def detect(self, root: str | Path) -> bool:
        base = Path(root)
        required = {"pack.yaml", "concepts.yaml"}
        return required.issubset({path.name for path in base.iterdir() if path.exists()})

    def discover(self, root: str | Path) -> list[DiscoveredImportSource]:
        base = Path(root)
        rows: list[DiscoveredImportSource] = []
        for filename in [
            "pack.yaml",
            "concepts.yaml",
            "roadmap.yaml",
            "projects.yaml",
            "rubrics.yaml",
            "review_ledger.json",
        ]:
            path = base / filename
            if not path.exists():
                continue
            rows.append(
                DiscoveredImportSource(
                    path=path,
                    relative_path=path.relative_to(base).as_posix(),
                    source_kind="didactopus_pack",
                    artifact_kind="didactopus_pack_artifact",
                    is_text=True,
                    metadata={},
                )
            )
        return rows

    def import_intent(self) -> str:
        return "both"

    def build_rows(self, context, sources: list[DiscoveredImportSource]) -> StructuredImportRows | None:
        by_name = {Path(item.relative_path).name: item for item in sources}
        concepts_src = by_name.get("concepts.yaml")
        if concepts_src is None:
            return None

        pack_src = by_name.get("pack.yaml")
        pack_payload = {}
        if pack_src is not None:
            pack_payload = yaml.safe_load(pack_src.path.read_text(encoding="utf-8")) or {}
        concepts_payload = ConceptsFile.model_validate(
            yaml.safe_load(concepts_src.path.read_text(encoding="utf-8")) or {"concepts": []}
        )
        roadmap_payload = None
        roadmap_src = by_name.get("roadmap.yaml")
        if roadmap_src is not None:
            roadmap_payload = RoadmapFile.model_validate(
                yaml.safe_load(roadmap_src.path.read_text(encoding="utf-8")) or {"stages": []}
            )

        artifact_rows: list[dict] = []
        observation_rows: list[dict] = []
        claim_rows: list[dict] = []
        concept_rows: list[dict] = []
        relation_rows: list[dict] = []

        for source in sources:
            artifact_rows.append(
                {
                    "artifact_id": f"ia_{sha256(source.relative_path.encode('utf-8')).hexdigest()[:12]}",
                    "import_id": context.import_id,
                    "artifact_kind": source.artifact_kind,
                    "path": source.relative_path,
                    "title": source.path.stem,
                    "sha256": sha256(source.path.read_bytes()).hexdigest(),
                    "created_at": context.imported_at,
                    "metadata": {"source_kind": source.source_kind},
                    "current_status": "draft",
                }
            )

        pack_name = pack_payload.get("name", Path(context.source_root).name)
        concepts_artifact_id = next(
            (row["artifact_id"] for row in artifact_rows if row["path"] == concepts_src.relative_path),
            "",
        )

        for index, concept in enumerate(concepts_payload.concepts, start=1):
            concept_key = f"concept::{concept.id}"
            concept_rows.append(
                {
                    "concept_id": concept_key,
                    "import_id": context.import_id,
                    "title": concept.title,
                    "aliases": [],
                    "description": concept.description or f"Imported concept from Didactopus pack {pack_name}.",
                    "source_artifact_ids": [concepts_artifact_id] if concepts_artifact_id else [],
                    "current_status": "triaged",
                }
            )
            observation_id = f"obs_pack_{concept.id}_{index}"
            observation_rows.append(
                {
                    "observation_id": observation_id,
                    "import_id": context.import_id,
                    "artifact_id": concepts_artifact_id,
                    "role": "summary",
                    "text": concept.description or concept.title,
                    "origin_path": concepts_src.relative_path,
                    "origin_section": concept.title,
                    "line_start": 0,
                    "line_end": 0,
                    "grounding_status": "grounded",
                    "support_kind": "direct_source",
                    "confidence_hint": 0.85,
                    "current_status": "draft",
                }
            )
            claim_rows.append(
                {
                    "claim_id": f"clm_pack_{concept.id}",
                    "import_id": context.import_id,
                    "claim_text": concept.description or f"{concept.title} is a concept in pack {pack_name}.",
                    "claim_kind": "summary",
                    "source_observation_ids": [observation_id],
                    "supporting_fragment_ids": [],
                    "concept_ids": [concept_key],
                    "contradicts_claim_ids": [],
                    "supersedes_claim_ids": [],
                    "confidence_hint": 0.85,
                    "grounding_status": "grounded",
                    "current_status": "triaged",
                }
            )
            for prereq in concept.prerequisites:
                relation_rows.append(
                    {
                        "relation_id": f"rel_prereq_{concept.id}_{prereq}",
                        "import_id": context.import_id,
                        "source_id": f"concept::{prereq}",
                        "target_id": concept_key,
                        "relation_type": "prerequisite",
                        "evidence_ids": [f"clm_pack_{concept.id}"],
                        "current_status": "draft",
                    }
                )
            for signal_idx, signal in enumerate(concept.mastery_signals, start=1):
                signal_obs_id = f"obs_signal_{concept.id}_{signal_idx}"
                observation_rows.append(
                    {
                        "observation_id": signal_obs_id,
                        "import_id": context.import_id,
                        "artifact_id": concepts_artifact_id,
                        "role": "summary",
                        "text": signal,
                        "origin_path": concepts_src.relative_path,
                        "origin_section": f"{concept.title} mastery signal",
                        "line_start": 0,
                        "line_end": 0,
                        "grounding_status": "grounded",
                        "support_kind": "direct_source",
                        "confidence_hint": 0.8,
                        "current_status": "draft",
                    }
                )
                claim_rows.append(
                    {
                        "claim_id": f"clm_signal_{concept.id}_{signal_idx}",
                        "import_id": context.import_id,
                        "claim_text": signal,
                        "claim_kind": "mastery_signal",
                        "source_observation_ids": [signal_obs_id],
                        "supporting_fragment_ids": [],
                        "concept_ids": [concept_key],
                        "contradicts_claim_ids": [],
                        "supersedes_claim_ids": [],
                        "confidence_hint": 0.8,
                        "grounding_status": "grounded",
                        "current_status": "triaged",
                    }
                )

        if roadmap_payload is not None and roadmap_src is not None:
            roadmap_artifact_id = next(
                (row["artifact_id"] for row in artifact_rows if row["path"] == roadmap_src.relative_path),
                "",
            )
            for stage in roadmap_payload.stages:
                for concept_id in stage.concepts:
                    observation_id = f"obs_stage_{stage.id}_{concept_id}"
                    observation_rows.append(
                        {
                            "observation_id": observation_id,
                            "import_id": context.import_id,
                            "artifact_id": roadmap_artifact_id,
                            "role": "summary",
                            "text": f"{concept_id} appears in roadmap stage {stage.title}.",
                            "origin_path": roadmap_src.relative_path,
                            "origin_section": stage.title,
                            "line_start": 0,
                            "line_end": 0,
                            "grounding_status": "grounded",
                            "support_kind": "direct_source",
                            "confidence_hint": 0.75,
                            "current_status": "draft",
                        }
                    )
                    claim_rows.append(
                        {
                            "claim_id": f"clm_stage_{stage.id}_{concept_id}",
                            "import_id": context.import_id,
                            "claim_text": f"{concept_id} belongs to roadmap stage {stage.title}.",
                            "claim_kind": "roadmap_stage",
                            "source_observation_ids": [observation_id],
                            "supporting_fragment_ids": [],
                            "concept_ids": [f"concept::{concept_id}"],
                            "contradicts_claim_ids": [],
                            "supersedes_claim_ids": [],
                            "confidence_hint": 0.75,
                            "grounding_status": "grounded",
                            "current_status": "triaged",
                        }
                    )

        return StructuredImportRows(
            artifact_rows=artifact_rows,
            observation_rows=observation_rows,
            claim_rows=claim_rows,
            concept_rows=concept_rows,
            relation_rows=relation_rows,
        )


register_source_adapter(DidactopusPackSourceAdapter())
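
# Sketch of the concepts.yaml shape this adapter consumes, inferred from the
# fields read above (id, title, description, prerequisites, mastery_signals);
# the example values are illustrative:
#
#     concepts:
#       - id: limits
#         title: Limits
#         description: Epsilon-delta definition of a limit.
#         prerequisites: [functions]
#         mastery_signals:
#           - Can evaluate one-sided limits from a graph.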
@@ -0,0 +1,150 @@
from __future__ import annotations

import json
from hashlib import sha256
from pathlib import Path

from .base import DiscoveredImportSource, StructuredImportRows, register_source_adapter


class DocliftBundleSourceAdapter:
    name = "doclift_bundle"

    def detect(self, root: str | Path) -> bool:
        base = Path(root)
        return (base / "manifest.json").exists() and (base / "documents").exists()

    def discover(self, root: str | Path) -> list[DiscoveredImportSource]:
        base = Path(root)
        rows: list[DiscoveredImportSource] = []
        for path in sorted(p for p in base.rglob("*") if p.is_file() and p.suffix.lower() in {".json", ".md"}):
            rows.append(
                DiscoveredImportSource(
                    path=path,
                    relative_path=path.relative_to(base).as_posix(),
                    source_kind="doclift_bundle",
                    artifact_kind="doclift_bundle_artifact",
                    is_text=True,
                    metadata={},
                )
            )
        return rows

    def import_intent(self) -> str:
        return "both"

    def build_rows(self, context, sources: list[DiscoveredImportSource]) -> StructuredImportRows | None:
        base = Path(context.source_root)
        manifest_path = base / "manifest.json"
        if not manifest_path.exists():
            return None
        manifest = json.loads(manifest_path.read_text(encoding="utf-8"))

        artifact_rows: list[dict] = []
        observation_rows: list[dict] = []
        claim_rows: list[dict] = []
        concept_rows: list[dict] = []
        relation_rows: list[dict] = []

        artifact_by_path: dict[str, str] = {}
        for source in sources:
            artifact_id = f"ia_{sha256(source.relative_path.encode('utf-8')).hexdigest()[:12]}"
            artifact_rows.append(
                {
                    "artifact_id": artifact_id,
                    "import_id": context.import_id,
                    "artifact_kind": source.artifact_kind,
                    "path": source.relative_path,
                    "title": source.path.stem,
                    "sha256": sha256(source.path.read_bytes()).hexdigest(),
                    "created_at": context.imported_at,
                    "metadata": {"source_kind": source.source_kind},
                    "current_status": "draft",
                }
            )
            artifact_by_path[source.relative_path] = artifact_id

        documents = [item for item in manifest.get("documents", []) if isinstance(item, dict)]
        previous_concept_id: str | None = None
        for index, document in enumerate(documents, start=1):
            title = str(document.get("title") or f"Document {index}")
            concept_id = f"concept::{document.get('document_id') or title.lower().replace(' ', '-')}"
            markdown_path = Path(document.get("markdown_path", ""))
            if markdown_path.is_absolute() and markdown_path.exists() and markdown_path.is_relative_to(base):
                relative_markdown = markdown_path.relative_to(base).as_posix()
            else:
                relative_markdown = document.get("markdown_path", "")
            artifact_id = artifact_by_path.get(str(relative_markdown), "")
            figures_path = Path(document.get("figures_path", ""))
            figure_payload = {}
            if figures_path.exists():
                figure_payload = json.loads(figures_path.read_text(encoding="utf-8"))
            source_path = str(figure_payload.get("source_path") or document.get("source_path") or relative_markdown)

            concept_rows.append(
                {
                    "concept_id": concept_id,
                    "import_id": context.import_id,
                    "title": title,
                    "aliases": [],
                    "description": f"Imported from doclift bundle document kind '{document.get('document_kind', 'document')}'.",
                    "source_artifact_ids": [artifact_id] if artifact_id else [],
                    "current_status": "triaged",
                }
            )
            observation_id = f"obs_doclift_{index}"
            observation_rows.append(
                {
                    "observation_id": observation_id,
                    "import_id": context.import_id,
                    "artifact_id": artifact_id,
                    "role": "summary",
                    "text": title,
                    "origin_path": relative_markdown,
                    "origin_section": title,
                    "line_start": 0,
                    "line_end": 0,
                    "source_url": source_path,
                    "grounding_status": "grounded",
                    "support_kind": "direct_source",
                    "confidence_hint": 0.85,
                    "current_status": "draft",
                }
            )
            claim_rows.append(
                {
                    "claim_id": f"clm_doclift_{index}",
                    "import_id": context.import_id,
                    "claim_text": f"{title} is a {document.get('document_kind', 'document')} in the imported doclift bundle.",
                    "claim_kind": "summary",
                    "source_observation_ids": [observation_id],
                    "supporting_fragment_ids": [],
                    "concept_ids": [concept_id],
                    "contradicts_claim_ids": [],
                    "supersedes_claim_ids": [],
                    "confidence_hint": 0.85,
                    "grounding_status": "grounded",
                    "current_status": "triaged",
                }
            )
            if previous_concept_id is not None:
                relation_rows.append(
                    {
                        "relation_id": f"rel_doclift_seq_{index}",
                        "import_id": context.import_id,
                        "source_id": previous_concept_id,
                        "target_id": concept_id,
                        "relation_type": "references",
                        "evidence_ids": [f"clm_doclift_{index}"],
                        "current_status": "draft",
                    }
                )
            previous_concept_id = concept_id

        return StructuredImportRows(
            artifact_rows=artifact_rows,
            observation_rows=observation_rows,
            claim_rows=claim_rows,
            concept_rows=concept_rows,
            relation_rows=relation_rows,
        )


register_source_adapter(DocliftBundleSourceAdapter())
@@ -0,0 +1,36 @@
from __future__ import annotations

from pathlib import Path

from ..groundrecall_discovery import discover_llmwiki_artifacts
from .base import DiscoveredImportSource, StructuredImportRows, register_source_adapter


class LLMWikiSourceAdapter:
    name = "llmwiki"

    def detect(self, root: str | Path) -> bool:
        base = Path(root)
        return (
            (base / "wiki").exists()
            or (base / "raw").exists()
            or any(path.name.startswith("schema.") for path in base.iterdir() if path.exists())
        )

    def discover(self, root: str | Path) -> list[DiscoveredImportSource]:
        return [
            DiscoveredImportSource(
                path=item.path,
                relative_path=item.relative_path,
                source_kind="llmwiki",
                artifact_kind=item.artifact_kind,
                is_text=item.is_text,
                metadata={},
            )
            for item in discover_llmwiki_artifacts(root)
        ]

    def import_intent(self) -> str:
        return "grounded_knowledge"

    def build_rows(self, context, sources: list[DiscoveredImportSource]) -> StructuredImportRows | None:
        return None


register_source_adapter(LLMWikiSourceAdapter())
@@ -0,0 +1,41 @@
from __future__ import annotations

from pathlib import Path

from .base import DiscoveredImportSource, StructuredImportRows, register_source_adapter


TEXT_SUFFIXES = {".md", ".markdown", ".txt", ".tex"}


class MarkdownNotesSourceAdapter:
    name = "markdown_notes"

    def detect(self, root: str | Path) -> bool:
        base = Path(root)
        return any(path.suffix.lower() in TEXT_SUFFIXES for path in base.rglob("*") if path.is_file())

    def discover(self, root: str | Path) -> list[DiscoveredImportSource]:
        base = Path(root)
        rows: list[DiscoveredImportSource] = []
        for path in sorted(p for p in base.rglob("*") if p.is_file() and p.suffix.lower() in TEXT_SUFFIXES):
            rows.append(
                DiscoveredImportSource(
                    path=path,
                    relative_path=path.relative_to(base).as_posix(),
                    source_kind="markdown_notes",
                    artifact_kind="markdown_note",
                    is_text=True,
                    metadata={},
                )
            )
        return rows

    def import_intent(self) -> str:
        return "grounded_knowledge"

    def build_rows(self, context, sources: list[DiscoveredImportSource]) -> StructuredImportRows | None:
        return None


register_source_adapter(MarkdownNotesSourceAdapter())
@@ -0,0 +1,106 @@
from __future__ import annotations

import re
from pathlib import Path

from .base import DiscoveredImportSource, StructuredImportRows, register_source_adapter


TEXT_SUFFIXES = {".tex"}
EXCLUDED_NAMES = {
    ".pp-export-tmp.tex",
    "paper.woven.arxiv.tex",
    "paper.woven.test.tex",
    "paper.woven.org",
    "paper.org",
    "paper_b.org",
    "paper_c.org",
    "paper_c.bak.org",
    "paper-demo.org",
    "paper-orig.org",
    "test.output.org",
    "tex-blocks.org",
}
EXCLUDED_DIRS = {".git", "__pycache__", ".pytest_cache", "setup"}
EXCLUDED_PREFIXES = ("table-", "figure-", "fig-")
INCLUDE_RE = re.compile(r"\\(?:include|input)\{([^}]+)\}")


class PolyPaperSourceAdapter:
    name = "polypaper"

    def detect(self, root: str | Path) -> bool:
        base = Path(root)
        return (
            (base / "main.tex").exists()
            and (base / "pieces").is_dir()
            and ((base / "paper.org").exists() or (base / "README.md").exists())
        )

    def discover(self, root: str | Path) -> list[DiscoveredImportSource]:
        base = Path(root)
        allowed_paths = self._collect_reachable_tex(base)
        rows: list[DiscoveredImportSource] = []
        for path in sorted(allowed_paths):
            rows.append(
                DiscoveredImportSource(
                    path=path,
                    relative_path=path.relative_to(base).as_posix(),
                    source_kind="polypaper",
                    artifact_kind="markdown_note",
                    is_text=True,
                    metadata={},
                )
            )
        return rows

    def import_intent(self) -> str:
        return "grounded_knowledge"

    def build_rows(self, context, sources: list[DiscoveredImportSource]) -> StructuredImportRows | None:
        return None

    def _collect_reachable_tex(self, base: Path) -> set[Path]:
        entrypoint = base / "main.tex"
        reachable: set[Path] = set()
        pending: list[Path] = [entrypoint]

        while pending:
            current = pending.pop()
            if not current.exists():
                continue
            if current in reachable:
                continue
            if any(part in EXCLUDED_DIRS for part in current.relative_to(base).parts):
                continue
            if current.name in EXCLUDED_NAMES or current.suffix.lower() not in TEXT_SUFFIXES:
                continue
            if current.parent.name == "figs":
                continue
            if current.name.startswith(EXCLUDED_PREFIXES) or current.name == "tables.tex":
                continue
            text = current.read_text(encoding="utf-8")
            for raw_ref in INCLUDE_RE.findall(text):
                candidate = self._resolve_include(base, current.parent, raw_ref.strip())
                if candidate is not None and candidate not in reachable:
                    pending.append(candidate)
            if current != entrypoint:
                reachable.add(current)

        return reachable

    def _resolve_include(self, base: Path, current_dir: Path, raw_ref: str) -> Path | None:
        candidates = [current_dir / raw_ref, base / raw_ref]
        resolved: list[Path] = []
        for candidate in candidates:
            if candidate.suffix:
                resolved.append(candidate)
            else:
                resolved.append(candidate.with_suffix(".tex"))
        for candidate in resolved:
            if candidate.exists():
                return candidate
        return None


register_source_adapter(PolyPaperSourceAdapter())
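
# Traversal note: _collect_reachable_tex walks \include{...} / \input{...}
# references outward from main.tex, so only .tex files actually wired into the
# paper are imported. For example, \input{pieces/intro} resolves to
# pieces/intro.tex via _resolve_include; the entrypoint itself is excluded
# from the returned set.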
@@ -0,0 +1,38 @@
from __future__ import annotations

from pathlib import Path

from .base import DiscoveredImportSource, StructuredImportRows, register_source_adapter


class TranscriptSourceAdapter:
    name = "transcript"

    def detect(self, root: str | Path) -> bool:
        base = Path(root)
        return any("transcript" in path.name.lower() for path in base.rglob("*") if path.is_file())

    def discover(self, root: str | Path) -> list[DiscoveredImportSource]:
        base = Path(root)
        rows: list[DiscoveredImportSource] = []
        for path in sorted(p for p in base.rglob("*") if p.is_file() and "transcript" in p.name.lower()):
            rows.append(
                DiscoveredImportSource(
                    path=path,
                    relative_path=path.relative_to(base).as_posix(),
                    source_kind="transcript",
                    artifact_kind="session_log",
                    is_text=True,
                    metadata={},
                )
            )
        return rows

    def import_intent(self) -> str:
        return "grounded_knowledge"

    def build_rows(self, context, sources: list[DiscoveredImportSource]) -> StructuredImportRows | None:
        return None


register_source_adapter(TranscriptSourceAdapter())
@@ -0,0 +1,9 @@
"""Legacy extracted GroundRecall store module.

Compatibility path retained while the standalone repo converges on the
top-level ``groundrecall.store`` module as the primary implementation.
"""

from __future__ import annotations

from .store import GroundRecallStore
@@ -0,0 +1,231 @@
from __future__ import annotations

import argparse
import json
import shutil
import socket
import subprocess
from collections import OrderedDict
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path
from typing import Any

from .groundrecall_discovery import DiscoveredArtifact
from .groundrecall_lint import lint_import_directory
from .groundrecall_normalizer import (
    ImportContext,
    build_artifact_record,
    build_claim_record,
    build_concept_records,
    build_observation_record,
    build_relation_records,
    manifest_record,
)
from .groundrecall_review_bridge import export_review_bundle_from_import
from .groundrecall_review_queue import build_review_queue
from .groundrecall_segmenter import SegmentedPage, segment_markdown_artifact
from .groundrecall_source_adapters.base import detect_source_adapter

# Imported for its side effect of registering the bundled source adapters.
import groundrecall.groundrecall_source_adapters  # noqa: F401


VALID_MODES = {"archive", "quick", "grounded"}


@dataclass
class ImportResult:
    manifest: dict[str, Any]
    artifacts: list[dict[str, Any]]
    observations: list[dict[str, Any]]
    claims: list[dict[str, Any]]
    concepts: list[dict[str, Any]]
    relations: list[dict[str, Any]]
    out_dir: Path


def _timestamp() -> str:
    return datetime.now(timezone.utc).replace(microsecond=0).isoformat().replace("+00:00", "Z")


def _default_import_id(source_root: Path) -> str:
    stem = source_root.name.lower().replace("_", "-")
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{stem}-{stamp}"


def _write_json(path: Path, payload: dict[str, Any]) -> None:
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")


def _write_jsonl(path: Path, rows: list[dict[str, Any]]) -> None:
    text = "\n".join(json.dumps(row, sort_keys=True) for row in rows)
    if text:
        text += "\n"
    path.write_text(text, encoding="utf-8")


def _dedupe_by_key(rows: list[dict[str, Any]], key: str) -> list[dict[str, Any]]:
    unique: OrderedDict[str, dict[str, Any]] = OrderedDict()
    for row in rows:
        unique.setdefault(str(row[key]), row)
    return list(unique.values())


def _convert_tex_to_markdown(path: Path) -> str | None:
    pandoc = shutil.which("pandoc")
    if pandoc is None:
        return None
    result = subprocess.run(
        [pandoc, "-f", "latex", "-t", "gfm", str(path)],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return None
    markdown = result.stdout.strip()
    return markdown or None


def _segment_artifact(artifact: DiscoveredArtifact) -> SegmentedPage | None:
    if not artifact.is_text:
        return None
    suffix = artifact.path.suffix.lower()
    if suffix not in {".md", ".markdown", ".txt", ".tex", ".log"}:
        return None
    if suffix == ".tex":
        converted = _convert_tex_to_markdown(artifact.path)
        if converted is not None:
            return segment_markdown_artifact(artifact, text=converted)
    return segment_markdown_artifact(artifact)


def run_groundrecall_import(
    source_root: str | Path,
    out_root: str | Path | None = None,
    mode: str = "quick",
    import_id: str | None = None,
    machine_id: str | None = None,
    agent_id: str = "groundrecall.ingest",
) -> ImportResult:
    source_path = Path(source_root).resolve()
    if mode not in VALID_MODES:
        raise ValueError(f"Unsupported import mode: {mode}")
    adapter = detect_source_adapter(source_path)
    discovered = adapter.discover(source_path)
    artifacts = [
        DiscoveredArtifact(
            path=item.path,
            relative_path=item.relative_path,
            artifact_kind=item.artifact_kind,
            is_text=item.is_text,
        )
        for item in discovered
    ]
    actual_import_id = import_id or _default_import_id(source_path)
    output_root = Path(out_root) if out_root else source_path / "imports"
    output_dir = output_root / actual_import_id
    output_dir.mkdir(parents=True, exist_ok=True)

    context = ImportContext(
        import_id=actual_import_id,
        import_mode=mode,
        machine_id=machine_id or socket.gethostname(),
        agent_id=agent_id,
        source_root=str(source_path),
        imported_at=_timestamp(),
    )

    artifact_rows: list[dict[str, Any]] = []
    observation_rows: list[dict[str, Any]] = []
    claim_rows: list[dict[str, Any]] = []
    concept_rows: list[dict[str, Any]] = []
    relation_rows: list[dict[str, Any]] = []
    structured_rows = adapter.build_rows(context, discovered)
    if structured_rows is not None:
        artifact_rows.extend(structured_rows.artifact_rows)
        observation_rows.extend(structured_rows.observation_rows)
        claim_rows.extend(structured_rows.claim_rows)
        concept_rows.extend(structured_rows.concept_rows)
        relation_rows.extend(structured_rows.relation_rows)
    else:
        for artifact in artifacts:
            page = _segment_artifact(artifact)
            artifact_row = build_artifact_record(context, artifact, page)
            artifact_rows.append(artifact_row)
            if page is None:
                continue

            concept_rows.extend(build_concept_records(context, artifact_row, page.concepts))
            relation_rows.extend(build_relation_records(context, artifact_row, page.concepts, page.links))

            for index, observation in enumerate(page.observations, start=1):
                observation_row = build_observation_record(context, artifact_row, observation, index)
                observation_rows.append(observation_row)
                if mode == "archive":
                    continue
                if observation.role not in {"claim", "summary"}:
                    continue
                claim_rows.append(build_claim_record(context, observation_row, observation, page.concepts[:3], index))

    concept_rows = _dedupe_by_key(concept_rows, "concept_id")
    relation_rows = _dedupe_by_key(relation_rows, "relation_id")
    artifact_rows = _dedupe_by_key(artifact_rows, "artifact_id")
    observation_rows = _dedupe_by_key(observation_rows, "observation_id")
    claim_rows = _dedupe_by_key(claim_rows, "claim_id")

    manifest = manifest_record(context) | {
        "source_adapter": adapter.name,
        "import_intent": adapter.import_intent(),
        "artifact_count": len(artifact_rows),
        "observation_count": len(observation_rows),
        "claim_count": len(claim_rows),
        "concept_count": len(concept_rows),
        "relation_count": len(relation_rows),
    }

    _write_json(output_dir / "manifest.json", manifest)
    _write_jsonl(output_dir / "artifacts.jsonl", artifact_rows)
    _write_jsonl(output_dir / "observations.jsonl", observation_rows)
    _write_jsonl(output_dir / "claims.jsonl", claim_rows)
    _write_jsonl(output_dir / "concepts.jsonl", concept_rows)
    _write_jsonl(output_dir / "relations.jsonl", relation_rows)
    lint_payload = lint_import_directory(output_dir)
    _write_json(output_dir / "lint_findings.json", lint_payload)
    review_queue = build_review_queue(output_dir)
    _write_json(output_dir / "review_queue.json", review_queue)
    export_review_bundle_from_import(output_dir)

    return ImportResult(
        manifest=manifest,
        artifacts=artifact_rows,
        observations=observation_rows,
        claims=claim_rows,
        concepts=concept_rows,
        relations=relation_rows,
        out_dir=output_dir,
    )


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Import a source repository into GroundRecall import artifacts using the detected source adapter."
    )
    parser.add_argument("source_root")
    parser.add_argument("--out-root", default=None)
    parser.add_argument("--mode", choices=sorted(VALID_MODES), default="quick")
    parser.add_argument("--import-id", default=None)
    parser.add_argument("--machine-id", default=None)
    parser.add_argument("--agent-id", default="groundrecall.ingest")
    return parser


def main() -> None:
    args = build_parser().parse_args()
    result = run_groundrecall_import(
        source_root=args.source_root,
        out_root=args.out_root,
        mode=args.mode,
        import_id=args.import_id,
        machine_id=args.machine_id,
        agent_id=args.agent_id,
    )
    print(f"Wrote import artifacts to {result.out_dir}")
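
# Example invocation; the exact module path is an assumption, since the console
# entry point is not pinned down in this file:
#
#     python -m groundrecall.ingest /data/wiki-export --mode quick
#
# One run writes manifest.json, the *.jsonl row files, lint_findings.json, and
# review_queue.json under <source>/imports/<import-id>/ (or under --out-root).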
@@ -0,0 +1,47 @@
from __future__ import annotations

import argparse
import json
from pathlib import Path
from typing import Any

from .store import GroundRecallStore


def summarize_store(store_dir: str | Path) -> dict[str, Any]:
    store = GroundRecallStore(store_dir)
    snapshots = store.list_snapshots()
    latest_snapshot = max(snapshots, key=lambda item: item.created_at, default=None)
    return {
        "store_dir": str(Path(store_dir)),
        "source_count": len(store.list_sources()),
        "artifact_count": len(store.list_artifacts()),
        "observation_count": len(store.list_observations()),
        "claim_count": len(store.list_claims()),
        "concept_count": len(store.list_concepts()),
        "relation_count": len(store.list_relations()),
        "review_candidate_count": len(store.list_review_candidates()),
        "promotion_count": len(store.list_promotions()),
        "snapshot_count": len(snapshots),
        "latest_snapshot_id": latest_snapshot.snapshot_id if latest_snapshot is not None else "",
    }


def inspect_store(store_dir: str | Path, out_path: str | Path | None = None) -> dict[str, Any]:
    payload = summarize_store(store_dir)
    if out_path is not None:
        Path(out_path).write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return payload


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Inspect canonical GroundRecall store contents.")
    parser.add_argument("store_dir")
    parser.add_argument("--out", default=None)
    return parser


def main() -> None:
    args = build_parser().parse_args()
    payload = inspect_store(args.store_dir, out_path=args.out)
    print(json.dumps(payload, indent=2))
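
# Example invocation (the module path is an assumption, not pinned down here):
#
#     python -m groundrecall.inspect ./store --out store_summary.json
#
# This prints the same JSON summary that --out persists.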
@@ -0,0 +1,196 @@
from __future__ import annotations

import argparse
import json
from collections import Counter, defaultdict
from pathlib import Path
from typing import Any


def _read_json(path: Path) -> dict[str, Any]:
    return json.loads(path.read_text(encoding="utf-8"))


def _read_jsonl(path: Path) -> list[dict[str, Any]]:
    if not path.exists():
        return []
    text = path.read_text(encoding="utf-8").strip()
    if not text:
        return []
    return [json.loads(line) for line in text.splitlines()]


def lint_import_directory(import_dir: str | Path) -> dict[str, Any]:
    base = Path(import_dir)
    manifest = _read_json(base / "manifest.json")
    artifacts = _read_jsonl(base / "artifacts.jsonl")
    observations = _read_jsonl(base / "observations.jsonl")
    claims = _read_jsonl(base / "claims.jsonl")
    concepts = _read_jsonl(base / "concepts.jsonl")
    relations = _read_jsonl(base / "relations.jsonl")

    findings: list[dict[str, Any]] = []
    observation_by_id = {row["observation_id"]: row for row in observations}
    concept_ids = {row["concept_id"] for row in concepts}

    text_counter = Counter(row["claim_text"].strip().lower() for row in claims if row.get("claim_text", "").strip())
    claim_ids = {row["claim_id"] for row in claims}
    for claim in claims:
        claim_text = claim.get("claim_text", "").strip()
        if not claim.get("source_observation_ids"):
            findings.append(
                {
                    "severity": "error",
                    "code": "claim_missing_observation",
                    "target_id": claim["claim_id"],
                    "message": "Claim has no source observation ids.",
                }
            )
        if not claim.get("concept_ids"):
            findings.append(
                {
                    "severity": "warning",
                    "code": "claim_missing_concept",
                    "target_id": claim["claim_id"],
                    "message": "Claim is not associated with any concepts.",
                }
            )
        if claim.get("grounding_status") == "ungrounded":
            findings.append(
                {
                    "severity": "warning",
                    "code": "claim_ungrounded",
                    "target_id": claim["claim_id"],
                    "message": "Claim is ungrounded and should not be promoted directly.",
                }
            )
        if claim_text and text_counter[claim_text.lower()] > 1:
            findings.append(
                {
                    "severity": "warning",
                    "code": "duplicate_claim_text",
                    "target_id": claim["claim_id"],
                    "message": "Claim text duplicates another imported claim.",
                }
            )
        for obs_id in claim.get("source_observation_ids", []):
            if obs_id not in observation_by_id:
                findings.append(
                    {
                        "severity": "error",
                        "code": "claim_observation_missing",
                        "target_id": claim["claim_id"],
                        "message": f"Claim references missing observation {obs_id}.",
                    }
                )
        for target_claim_id in claim.get("contradicts_claim_ids", []):
            if target_claim_id not in claim_ids:
                findings.append(
                    {
                        "severity": "warning",
                        "code": "unresolved_contradiction_ref",
                        "target_id": claim["claim_id"],
                        "message": f"Claim references missing contradiction target {target_claim_id}.",
                    }
                )
        for target_claim_id in claim.get("supersedes_claim_ids", []):
            if target_claim_id not in claim_ids:
                findings.append(
                    {
                        "severity": "warning",
                        "code": "unresolved_supersession_ref",
                        "target_id": claim["claim_id"],
                        "message": f"Claim references missing supersession target {target_claim_id}.",
                    }
                )
        if claim.get("contradicts_claim_ids") and claim.get("supersedes_claim_ids"):
            findings.append(
                {
                    "severity": "warning",
                    "code": "claim_mixed_conflict_and_supersession",
                    "target_id": claim["claim_id"],
                    "message": "Claim marks both contradiction and supersession targets; review the intended relation.",
                }
            )

    concept_sources: defaultdict[str, set[str]] = defaultdict(set)
    for claim in claims:
        for concept_id in claim.get("concept_ids", []):
            concept_sources[concept_id].add(claim["claim_id"])
    for relation in relations:
        concept_sources[relation.get("source_id", "")].add(relation["relation_id"])
        concept_sources[relation.get("target_id", "")].add(relation["relation_id"])

    for concept in concepts:
        if not concept_sources.get(concept["concept_id"]):
            findings.append(
                {
                    "severity": "warning",
                    "code": "orphan_concept",
                    "target_id": concept["concept_id"],
                    "message": "Concept has no connected claims or relations.",
                }
            )

    for relation in relations:
        if relation.get("source_id") not in concept_ids:
            findings.append(
                {
                    "severity": "error",
                    "code": "relation_missing_source",
                    "target_id": relation["relation_id"],
                    "message": f"Relation source {relation.get('source_id')} is missing.",
                }
            )
        if relation.get("target_id") not in concept_ids:
            findings.append(
                {
                    "severity": "error",
                    "code": "relation_missing_target",
                    "target_id": relation["relation_id"],
                    "message": f"Relation target {relation.get('target_id')} is missing.",
                }
            )

    for observation in observations:
        role = observation.get("role")
        if role == "summary" and observation.get("grounding_status") == "ungrounded":
            findings.append(
                {
                    "severity": "warning",
                    "code": "ungrounded_summary",
                    "target_id": observation["observation_id"],
                    "message": "Summary observation is ungrounded.",
                }
            )

    summary = {
        "artifact_count": len(artifacts),
        "observation_count": len(observations),
        "claim_count": len(claims),
        "concept_count": len(concepts),
        "relation_count": len(relations),
        "error_count": sum(1 for item in findings if item["severity"] == "error"),
        "warning_count": sum(1 for item in findings if item["severity"] == "warning"),
    }
    return {
        "import_id": manifest["import_id"],
        "import_mode": manifest["import_mode"],
        "summary": summary,
        "findings": findings,
    }


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Lint GroundRecall import artifacts.")
    parser.add_argument("import_dir")
    parser.add_argument("--out", default=None)
    return parser


def main() -> None:
    args = build_parser().parse_args()
    payload = lint_import_directory(args.import_dir)
    out_path = Path(args.out) if args.out else Path(args.import_dir) / "lint_findings.json"
    out_path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    print(f"Wrote {out_path}")
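
# Shape of a single finding as emitted above (the target id is illustrative):
#
#     {
#         "severity": "error",
#         "code": "claim_missing_observation",
#         "target_id": "clm_pack_limits",
#         "message": "Claim has no source observation ids.",
#     }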
@@ -0,0 +1,137 @@
from __future__ import annotations

from typing import Literal

from pydantic import BaseModel, Field


LifecycleStatus = Literal["draft", "triaged", "reviewed", "promoted", "superseded", "archived", "rejected"]
GroundingStatus = Literal["grounded", "partially_grounded", "ungrounded"]
SupportKind = Literal["direct_source", "derived_from_page", "derived_from_session", "inferred", "unknown"]


class ProvenanceRecord(BaseModel):
    origin_artifact_id: str = ""
    origin_path: str = ""
    origin_section: str = ""
    source_url: str = ""
    retrieval_date: str = ""
    machine_id: str = ""
    session_id: str = ""
    support_kind: SupportKind = "unknown"
    grounding_status: GroundingStatus = "ungrounded"


class SourceRecord(BaseModel):
    source_id: str
    title: str = ""
    source_type: str = "document"
    path: str = ""
    url: str = ""
    retrieved_at: str = ""
    metadata: dict = Field(default_factory=dict)
    current_status: LifecycleStatus = "draft"


class FragmentRecord(BaseModel):
    fragment_id: str
    source_id: str
    text: str
    section: str = ""
    line_start: int = 0
    line_end: int = 0
    metadata: dict = Field(default_factory=dict)
    current_status: LifecycleStatus = "draft"


class ArtifactRecord(BaseModel):
    artifact_id: str
    artifact_kind: str
    title: str = ""
    path: str = ""
    sha256: str = ""
    created_at: str = ""
    metadata: dict = Field(default_factory=dict)
    current_status: LifecycleStatus = "draft"


class ObservationRecord(BaseModel):
    observation_id: str
    artifact_id: str = ""
    role: str
    text: str
    provenance: ProvenanceRecord = Field(default_factory=ProvenanceRecord)
    confidence_hint: float = 0.0
    current_status: LifecycleStatus = "draft"


class ClaimRecord(BaseModel):
    claim_id: str
    claim_text: str
    claim_kind: str = "statement"
    source_observation_ids: list[str] = Field(default_factory=list)
    supporting_fragment_ids: list[str] = Field(default_factory=list)
    concept_ids: list[str] = Field(default_factory=list)
    contradicts_claim_ids: list[str] = Field(default_factory=list)
    supersedes_claim_ids: list[str] = Field(default_factory=list)
    confidence_hint: float = 0.0
    review_confidence: float = 0.0
    last_confirmed_at: str = ""
    provenance: ProvenanceRecord = Field(default_factory=ProvenanceRecord)
    current_status: LifecycleStatus = "draft"


class ConceptRecord(BaseModel):
    concept_id: str
    title: str
    aliases: list[str] = Field(default_factory=list)
    description: str = ""
    source_artifact_ids: list[str] = Field(default_factory=list)
    current_status: LifecycleStatus = "draft"


class RelationRecord(BaseModel):
    relation_id: str
    source_id: str
    target_id: str
    relation_type: str
    evidence_ids: list[str] = Field(default_factory=list)
    provenance: ProvenanceRecord = Field(default_factory=ProvenanceRecord)
    current_status: LifecycleStatus = "draft"


class ReviewCandidateRecord(BaseModel):
    review_candidate_id: str
    candidate_type: Literal["claim", "concept", "relation"]
    candidate_id: str
    triage_lane: str = "knowledge_capture"
    priority: int = 50
    finding_codes: list[str] = Field(default_factory=list)
    rationale: str = ""
    current_status: LifecycleStatus = "draft"


class PromotionRecord(BaseModel):
    promotion_id: str
    candidate_type: Literal["claim", "concept", "relation"]
    candidate_id: str
    promotion_target: str = "groundrecall_store"
    verdict: Literal["approved", "rejected", "superseded"] = "approved"
    reviewer: str = ""
    promoted_object_ids: list[str] = Field(default_factory=list)
    notes: str = ""
    promoted_at: str = ""


class GroundRecallSnapshot(BaseModel):
    snapshot_id: str
    created_at: str
    sources: list[SourceRecord] = Field(default_factory=list)
    fragments: list[FragmentRecord] = Field(default_factory=list)
    artifacts: list[ArtifactRecord] = Field(default_factory=list)
    observations: list[ObservationRecord] = Field(default_factory=list)
    claims: list[ClaimRecord] = Field(default_factory=list)
    concepts: list[ConceptRecord] = Field(default_factory=list)
    relations: list[RelationRecord] = Field(default_factory=list)
    promotions: list[PromotionRecord] = Field(default_factory=list)
    metadata: dict = Field(default_factory=dict)
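
# Constructing a grounded claim by hand, using only fields defined above
# (the ids are illustrative):
#
#     claim = ClaimRecord(
#         claim_id="clm_demo_1",
#         claim_text="Water boils at 100 C at sea level.",
#         source_observation_ids=["obs_demo_1"],
#         provenance=ProvenanceRecord(
#             support_kind="direct_source",
#             grounding_status="grounded",
#         ),
#         current_status="triaged",
#     )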
@ -0,0 +1,250 @@
|
|||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
from datetime import datetime, timezone
|
||||
from pathlib import Path
|
||||
from typing import Any
|
||||
|
||||
from .models import (
|
||||
ArtifactRecord,
|
||||
ClaimRecord,
|
||||
ConceptRecord,
|
||||
ObservationRecord,
|
||||
PromotionRecord,
|
||||
ProvenanceRecord,
|
||||
RelationRecord,
|
||||
ReviewCandidateRecord,
|
||||
)
|
||||
from .review_schema import ReviewSession
|
||||
from .store import GroundRecallStore
|
||||
|
||||
|
||||
def _read_json(path: Path) -> dict[str, Any]:
|
||||
return json.loads(path.read_text(encoding="utf-8"))
|
||||
|
||||
|
||||
def _read_jsonl(path: Path) -> list[dict[str, Any]]:
|
||||
if not path.exists():
|
||||
return []
|
||||
text = path.read_text(encoding="utf-8").strip()
|
||||
if not text:
|
||||
return []
|
||||
return [json.loads(line) for line in text.splitlines()]
|
||||
|
||||
|
||||
def _now() -> str:
|
||||
return datetime.now(timezone.utc).replace(microsecond=0).isoformat().replace("+00:00", "Z")
|
||||
|
||||
|
||||
def _review_status_map(status: str) -> str:
|
||||
return {
|
||||
"trusted": "promoted",
|
||||
"provisional": "reviewed",
|
||||
"rejected": "rejected",
|
||||
"needs_review": "triaged",
|
||||
}.get(status, "triaged")
|
||||
|
||||
|
||||
def _provenance_from_payload(payload: dict[str, Any]) -> ProvenanceRecord:
|
||||
return ProvenanceRecord(
|
||||
origin_artifact_id=payload.get("origin_artifact_id", ""),
|
||||
origin_path=payload.get("origin_path", ""),
|
||||
origin_section=payload.get("origin_section", ""),
|
||||
source_url=payload.get("source_url", ""),
|
||||
retrieval_date=payload.get("retrieval_date", ""),
|
||||
machine_id=payload.get("machine_id", ""),
|
||||
session_id=payload.get("session_id", ""),
|
||||
support_kind=payload.get("support_kind", "unknown"),
|
||||
grounding_status=payload.get("grounding_status", "ungrounded"),
|
||||
)
|
||||
|
||||
|
||||
def promote_import_to_store(
|
||||
import_dir: str | Path,
|
||||
store_dir: str | Path,
|
||||
reviewer: str | None = None,
|
||||
snapshot_id: str | None = None,
|
||||
) -> dict[str, Any]:
|
||||
base = Path(import_dir)
|
||||
manifest = _read_json(base / "manifest.json")
|
||||
review_session = ReviewSession.model_validate_json((base / "review_session.json").read_text(encoding="utf-8"))
|
||||
queue_payload = _read_json(base / "review_queue.json")
|
||||
artifacts = _read_jsonl(base / "artifacts.jsonl")
|
||||
observations = _read_jsonl(base / "observations.jsonl")
|
||||
claims = _read_jsonl(base / "claims.jsonl")
|
||||
concepts = _read_jsonl(base / "concepts.jsonl")
|
||||
relations = _read_jsonl(base / "relations.jsonl")
|
||||
|
||||
store = GroundRecallStore(store_dir)
|
||||
reviewed_by_concept = {entry.concept_id: entry for entry in review_session.draft_pack.concepts}
|
||||
promoted_claim_ids: list[str] = []
|
||||
promoted_concept_ids: list[str] = []
|
||||
promoted_relation_ids: list[str] = []
|
||||
|
||||
for artifact in artifacts:
|
||||
store.save_artifact(
|
||||
ArtifactRecord(
|
||||
artifact_id=artifact["artifact_id"],
|
||||
artifact_kind=artifact["artifact_kind"],
|
||||
title=artifact.get("title", ""),
|
||||
path=artifact.get("path", ""),
|
||||
sha256=artifact.get("sha256", ""),
|
||||
created_at=artifact.get("created_at", ""),
|
||||
metadata=dict(artifact.get("metadata", {})),
|
||||
current_status="reviewed",
|
||||
)
|
||||
)
|
||||
|
||||
for observation in observations:
|
||||
store.save_observation(
|
||||
ObservationRecord(
|
||||
observation_id=observation["observation_id"],
|
||||
artifact_id=observation.get("artifact_id", ""),
|
||||
role=observation.get("role", "summary"),
|
||||
text=observation.get("text", ""),
|
||||
provenance=_provenance_from_payload(observation),
|
||||
confidence_hint=float(observation.get("confidence_hint", 0.0)),
|
||||
current_status="reviewed",
|
||||
)
|
||||
)
|
||||
|
||||
for concept in concepts:
|
||||
short_id = concept["concept_id"].replace("concept::", "", 1)
|
||||
review_entry = reviewed_by_concept.get(short_id)
|
||||
current_status = _review_status_map(review_entry.status if review_entry else concept.get("current_status", "triaged"))
|
||||
record = store.save_concept(
|
||||
ConceptRecord(
|
||||
concept_id=concept["concept_id"],
|
||||
title=review_entry.title if review_entry else concept.get("title", concept["concept_id"]),
|
||||
aliases=list(concept.get("aliases", [])),
|
||||
description=review_entry.description if review_entry else concept.get("description", ""),
|
||||
source_artifact_ids=list(concept.get("source_artifact_ids", [])),
|
||||
current_status=current_status, # type: ignore[arg-type]
|
||||
)
|
||||
)
|
||||
if record.current_status in {"promoted", "reviewed"}:
|
||||
promoted_concept_ids.append(record.concept_id)
|
||||
|
||||
reviewed_concept_ids = set(promoted_concept_ids)
|
||||
for claim in claims:
|
||||
concept_ids = list(claim.get("concept_ids", []))
|
||||
statuses = []
|
||||
for concept_id in concept_ids:
|
||||
short_id = concept_id.replace("concept::", "", 1)
|
||||
review_entry = reviewed_by_concept.get(short_id)
|
||||
statuses.append(_review_status_map(review_entry.status) if review_entry else "triaged")
|
||||
if statuses and all(status == "rejected" for status in statuses):
|
||||
current_status = "rejected"
|
||||
elif statuses and any(status == "promoted" for status in statuses):
|
||||
current_status = "promoted"
|
||||
elif statuses and any(status == "reviewed" for status in statuses):
|
||||
current_status = "reviewed"
|
||||
else:
|
||||
current_status = "triaged"
|
||||
record = store.save_claim(
|
||||
ClaimRecord(
|
||||
claim_id=claim["claim_id"],
|
||||
claim_text=claim.get("claim_text", ""),
|
||||
claim_kind=claim.get("claim_kind", "statement"),
|
||||
source_observation_ids=list(claim.get("source_observation_ids", [])),
|
||||
supporting_fragment_ids=list(claim.get("supporting_fragment_ids", [])),
|
||||
concept_ids=concept_ids,
|
||||
contradicts_claim_ids=list(claim.get("contradicts_claim_ids", [])),
|
||||
supersedes_claim_ids=list(claim.get("supersedes_claim_ids", [])),
|
||||
confidence_hint=float(claim.get("confidence_hint", 0.0)),
|
||||
review_confidence=float(claim.get("review_confidence", 0.0)),
|
||||
last_confirmed_at=claim.get("last_confirmed_at", ""),
|
||||
provenance=_provenance_from_payload(claim),
|
||||
current_status=current_status, # type: ignore[arg-type]
|
||||
)
|
||||
)
|
||||
if record.current_status in {"promoted", "reviewed"}:
|
||||
promoted_claim_ids.append(record.claim_id)
|
||||
|
||||
for relation in relations:
|
||||
src_ok = relation.get("source_id") in reviewed_concept_ids
|
||||
tgt_ok = relation.get("target_id") in reviewed_concept_ids
|
||||
current_status = "promoted" if src_ok and tgt_ok else "triaged"
|
||||
record = store.save_relation(
|
||||
RelationRecord(
|
||||
relation_id=relation["relation_id"],
|
||||
source_id=relation.get("source_id", ""),
|
||||
target_id=relation.get("target_id", ""),
|
||||
relation_type=relation.get("relation_type", "references"),
|
||||
evidence_ids=list(relation.get("evidence_ids", [])),
|
||||
provenance=_provenance_from_payload(relation),
|
||||
current_status=current_status, # type: ignore[arg-type]
|
||||
)
|
||||
)
|
||||
if record.current_status in {"promoted", "reviewed"}:
|
||||
promoted_relation_ids.append(record.relation_id)
|
||||
|
||||
for item in queue_payload.get("items", []):
|
||||
store.save_review_candidate(
|
||||
ReviewCandidateRecord(
|
||||
review_candidate_id=item["queue_id"],
|
||||
candidate_type=item["candidate_type"],
|
||||
candidate_id=item["candidate_id"],
|
||||
triage_lane=item.get("triage_lane", "knowledge_capture"),
|
||||
priority=int(item.get("priority", 50)),
|
||||
finding_codes=list(item.get("finding_codes", [])),
|
||||
rationale=item.get("title", ""),
|
||||
current_status="reviewed" if item["candidate_id"] in set(promoted_claim_ids + promoted_concept_ids + promoted_relation_ids) else "triaged",
|
||||
)
|
||||
)
|
||||
|
||||
promotion = store.save_promotion(
|
||||
PromotionRecord(
|
||||
promotion_id=f"promotion-{manifest['import_id']}",
|
||||
candidate_type="concept",
|
||||
candidate_id=manifest["import_id"],
|
||||
promotion_target="groundrecall_store",
|
||||
verdict="approved",
|
||||
reviewer=reviewer or review_session.reviewer,
|
||||
promoted_object_ids=promoted_concept_ids + promoted_claim_ids + promoted_relation_ids,
|
||||
notes=f"Promoted import {manifest['import_id']} into GroundRecallStore.",
|
||||
promoted_at=_now(),
|
||||
)
|
||||
)
|
||||
|
||||
built_snapshot = store.build_snapshot(
|
||||
snapshot_id=snapshot_id or f"snapshot-{manifest['import_id']}",
|
||||
created_at=_now(),
|
||||
metadata={
|
||||
"source_import_id": manifest["import_id"],
|
||||
"reviewer": reviewer or review_session.reviewer,
|
||||
"export_kind": "canonical",
|
||||
},
|
||||
)
|
||||
store.save_snapshot(built_snapshot)
|
||||
|
||||
return {
|
||||
"import_id": manifest["import_id"],
|
||||
"store_dir": str(Path(store_dir)),
|
||||
"promotion_id": promotion.promotion_id,
|
||||
"promoted_concept_count": len(promoted_concept_ids),
|
||||
"promoted_claim_count": len(promoted_claim_ids),
|
||||
"promoted_relation_count": len(promoted_relation_ids),
|
||||
"snapshot_id": built_snapshot.snapshot_id,
|
||||
}
|
||||
|
||||
|
||||
def build_parser() -> argparse.ArgumentParser:
|
||||
parser = argparse.ArgumentParser(description="Promote a GroundRecall import into canonical store objects.")
|
||||
parser.add_argument("import_dir")
|
||||
parser.add_argument("store_dir")
|
||||
parser.add_argument("--reviewer", default=None)
|
||||
parser.add_argument("--snapshot-id", default=None)
|
||||
return parser
|
||||
|
||||
|
||||
def main() -> None:
|
||||
args = build_parser().parse_args()
|
||||
payload = promote_import_to_store(
|
||||
import_dir=args.import_dir,
|
||||
store_dir=args.store_dir,
|
||||
reviewer=args.reviewer,
|
||||
snapshot_id=args.snapshot_id,
|
||||
)
|
||||
print(json.dumps(payload, indent=2))
|
||||
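For orientation, a minimal sketch of driving this promotion step programmatically rather than through the CLI; the directory names and reviewer id below are assumptions, not paths from this repo:

# Minimal usage sketch, assuming an import directory produced by the ingest
# pipeline and a writable store directory; all names here are illustrative.
from groundrecall.promotion import promote_import_to_store

payload = promote_import_to_store(
    import_dir="imports/llmwiki-2024-06",  # assumed path
    store_dir="stores/main",               # assumed path
    reviewer="reviewer@example.org",       # assumed reviewer id
    snapshot_id=None,                      # defaults to snapshot-<import_id>
)
print(payload["promotion_id"], payload["snapshot_id"])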
|
|
@@ -0,0 +1,188 @@
|
|||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
from pathlib import Path
|
||||
from typing import Any
|
||||
|
||||
from .store import GroundRecallStore
|
||||
|
||||
|
||||
def _normalize(text: str) -> str:
|
||||
return " ".join(text.lower().split())
|
||||
|
||||
|
||||
def _matches(query: str, *values: str) -> bool:
|
||||
needle = _normalize(query)
|
||||
return any(needle in _normalize(value) for value in values if value)
|
||||
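A behavior sketch for the matcher above: matching is case- and whitespace-insensitive substring containment against any non-empty candidate value.

# Both assertions follow directly from _normalize() collapsing case
# and internal whitespace before the substring test.
assert _matches("Spaced  Repetition", "notes on spaced repetition decks")
assert not _matches("anki", "SuperMemo history", "")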
|
||||
|
||||
def query_concept(store_dir: str | Path, concept_ref: str) -> dict[str, Any] | None:
|
||||
store = GroundRecallStore(store_dir)
|
||||
concepts = store.list_concepts()
|
||||
concept = next(
|
||||
(
|
||||
item
|
||||
for item in concepts
|
||||
if concept_ref == item.concept_id
|
||||
or concept_ref == item.concept_id.replace("concept::", "", 1)
|
||||
or _matches(concept_ref, item.title, item.description, *item.aliases)
|
||||
),
|
||||
None,
|
||||
)
|
||||
if concept is None:
|
||||
return None
|
||||
|
||||
claims = [item for item in store.list_claims() if concept.concept_id in item.concept_ids and item.current_status != "rejected"]
|
||||
relations = [
|
||||
item
|
||||
for item in store.list_relations()
|
||||
if (item.source_id == concept.concept_id or item.target_id == concept.concept_id) and item.current_status != "rejected"
|
||||
]
|
||||
artifacts = {item.artifact_id: item for item in store.list_artifacts()}
|
||||
observations = {item.observation_id: item for item in store.list_observations()}
|
||||
|
||||
supporting_observations = []
|
||||
for claim in claims:
|
||||
for observation_id in claim.source_observation_ids:
|
||||
observation = observations.get(observation_id)
|
||||
if observation is not None:
|
||||
supporting_observations.append(
|
||||
{
|
||||
"observation_id": observation.observation_id,
|
||||
"text": observation.text,
|
||||
"role": observation.role,
|
||||
"origin_path": observation.provenance.origin_path,
|
||||
"grounding_status": observation.provenance.grounding_status,
|
||||
}
|
||||
)
|
||||
|
||||
related_concept_ids = sorted(
|
||||
{
|
||||
relation.target_id if relation.source_id == concept.concept_id else relation.source_id
|
||||
for relation in relations
|
||||
if relation.source_id != relation.target_id
|
||||
}
|
||||
)
|
||||
related_concepts = [item.model_dump() for item in concepts if item.concept_id in related_concept_ids]
|
||||
|
||||
source_artifacts = [
|
||||
artifact.model_dump()
|
||||
for artifact in artifacts.values()
|
||||
if artifact.artifact_id in set(concept.source_artifact_ids)
|
||||
]
|
||||
|
||||
return {
|
||||
"query_type": "concept",
|
||||
"concept": concept.model_dump(),
|
||||
"claims": [item.model_dump() for item in claims],
|
||||
"relations": [item.model_dump() for item in relations],
|
||||
"related_concepts": related_concepts,
|
||||
"supporting_observations": supporting_observations,
|
||||
"source_artifacts": source_artifacts,
|
||||
}
|
||||
|
||||
|
||||
def search_claims(
|
||||
store_dir: str | Path,
|
||||
text: str,
|
||||
include_rejected: bool = False,
|
||||
limit: int = 20,
|
||||
) -> dict[str, Any]:
|
||||
store = GroundRecallStore(store_dir)
|
||||
concepts = {item.concept_id: item for item in store.list_concepts()}
|
||||
matches = []
|
||||
for claim in store.list_claims():
|
||||
if not include_rejected and claim.current_status == "rejected":
|
||||
continue
|
||||
concept_titles = [concepts[concept_id].title for concept_id in claim.concept_ids if concept_id in concepts]
|
||||
if _matches(text, claim.claim_text, *concept_titles):
|
||||
matches.append(
|
||||
{
|
||||
"claim": claim.model_dump(),
|
||||
"concept_titles": concept_titles,
|
||||
"provenance": claim.provenance.model_dump(),
|
||||
}
|
||||
)
|
||||
if len(matches) >= limit:
|
||||
break
|
||||
return {
|
||||
"query_type": "claim_search",
|
||||
"query": text,
|
||||
"matches": matches,
|
||||
}
|
||||
|
||||
|
||||
def query_provenance(
|
||||
store_dir: str | Path,
|
||||
origin_path: str | None = None,
|
||||
source_url: str | None = None,
|
||||
) -> dict[str, Any]:
|
||||
store = GroundRecallStore(store_dir)
|
||||
claims = []
|
||||
observations = []
|
||||
for claim in store.list_claims():
|
||||
if origin_path and claim.provenance.origin_path == origin_path:
|
||||
claims.append(claim.model_dump())
|
||||
continue
|
||||
if source_url and claim.provenance.source_url == source_url:
|
||||
claims.append(claim.model_dump())
|
||||
for observation in store.list_observations():
|
||||
if origin_path and observation.provenance.origin_path == origin_path:
|
||||
observations.append(observation.model_dump())
|
||||
continue
|
||||
if source_url and observation.provenance.source_url == source_url:
|
||||
observations.append(observation.model_dump())
|
||||
return {
|
||||
"query_type": "provenance",
|
||||
"origin_path": origin_path or "",
|
||||
"source_url": source_url or "",
|
||||
"claims": claims,
|
||||
"observations": observations,
|
||||
}
|
||||
|
||||
|
||||
def build_query_bundle_for_concept(store_dir: str | Path, concept_ref: str) -> dict[str, Any] | None:
|
||||
payload = query_concept(store_dir, concept_ref)
|
||||
if payload is None:
|
||||
return None
|
||||
claims = payload["claims"]
|
||||
contradictions = [item for item in claims if item.get("contradicts_claim_ids")]
|
||||
supersessions = [item for item in claims if item.get("supersedes_claim_ids")]
|
||||
return {
|
||||
"bundle_kind": "groundrecall_query_bundle",
|
||||
"query_type": "concept",
|
||||
"concept": payload["concept"],
|
||||
"relevant_claims": claims,
|
||||
"supporting_observations": payload["supporting_observations"],
|
||||
"related_concepts": payload["related_concepts"],
|
||||
"contradictions": contradictions,
|
||||
"supersessions": supersessions,
|
||||
"suggested_next_actions": [
|
||||
"Review promoted claims with low review confidence.",
|
||||
"Inspect supporting observations before exporting assistant context.",
|
||||
"Check related concepts for hidden prerequisite or contradiction edges.",
|
||||
],
|
||||
}
|
||||
|
||||
|
||||
def build_parser() -> argparse.ArgumentParser:
|
||||
parser = argparse.ArgumentParser(description="Query canonical GroundRecall objects.")
|
||||
parser.add_argument("store_dir")
|
||||
parser.add_argument("query")
|
||||
parser.add_argument("--kind", choices=["concept", "claim", "provenance", "bundle"], default="concept")
|
||||
parser.add_argument("--source-url", default=None)
|
||||
return parser
|
||||
|
||||
|
||||
def main() -> None:
|
||||
args = build_parser().parse_args()
|
||||
if args.kind == "concept":
|
||||
payload = query_concept(args.store_dir, args.query)
|
||||
elif args.kind == "claim":
|
||||
payload = search_claims(args.store_dir, args.query)
|
||||
elif args.kind == "provenance":
|
||||
payload = query_provenance(args.store_dir, origin_path=args.query, source_url=args.source_url)
|
||||
else:
|
||||
payload = build_query_bundle_for_concept(args.store_dir, args.query)
|
||||
print(json.dumps(payload, indent=2))
|
||||
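A minimal sketch of the two most common query paths, assuming a store directory that already holds promoted objects; the store path and query text are illustrative:

# Usage sketch; query_concept/build_query_bundle_for_concept accept a full
# id, a bare id (without the concept:: prefix), or fuzzy title/alias text,
# per the resolution logic above.
from groundrecall.query import build_query_bundle_for_concept, search_claims

bundle = build_query_bundle_for_concept("stores/main", "spaced repetition")  # assumed args
if bundle is not None:
    print(len(bundle["relevant_claims"]), "claims in bundle")
print(search_claims("stores/main", "retrieval practice", limit=5)["matches"])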
|
|
@@ -0,0 +1,366 @@
|
|||
const state = {
|
||||
reviewData: null,
|
||||
selectedConceptId: null,
|
||||
selectedCitationId: null,
|
||||
conceptSearch: "",
|
||||
citationFilter: "all",
|
||||
message: "",
|
||||
verificationResult: null,
|
||||
};
|
||||
|
||||
function escapeHtml(value) {
|
||||
return String(value ?? "")
|
||||
.replaceAll("&", "&")
|
||||
.replaceAll("<", "<")
|
||||
.replaceAll(">", ">")
|
||||
.replaceAll('"', """);
|
||||
}
|
||||
|
||||
function splitLines(value) {
|
||||
return String(value || "")
|
||||
.split("\n")
|
||||
.map((line) => line.trim())
|
||||
.filter(Boolean);
|
||||
}
|
||||
|
||||
function conceptRows() {
|
||||
const data = state.reviewData;
|
||||
if (!data) return [];
|
||||
const reviewById = new Map((data.concept_reviews || []).map((item) => [item.concept_id, item]));
|
||||
return (data.draft_pack?.concepts || []).map((concept) => ({
|
||||
...concept,
|
||||
review: reviewById.get(concept.concept_id) || null,
|
||||
}));
|
||||
}
|
||||
|
||||
function citationRows() {
|
||||
return state.reviewData?.citation_reviews || [];
|
||||
}
|
||||
|
||||
function selectedConcept() {
|
||||
return conceptRows().find((item) => item.concept_id === state.selectedConceptId) || conceptRows()[0] || null;
|
||||
}
|
||||
|
||||
function selectedCitation() {
|
||||
return citationRows().find((item) => item.citation_review_id === state.selectedCitationId) || citationRows()[0] || null;
|
||||
}
|
||||
|
||||
async function loadReviewData() {
|
||||
const response = await fetch("/api/load");
|
||||
const payload = await response.json();
|
||||
state.reviewData = payload.review_data;
|
||||
if (!state.selectedConceptId && conceptRows()[0]) {
|
||||
state.selectedConceptId = conceptRows()[0].concept_id;
|
||||
}
|
||||
if (!state.selectedCitationId && citationRows()[0]) {
|
||||
state.selectedCitationId = citationRows()[0].citation_review_id;
|
||||
}
|
||||
render();
|
||||
}
|
||||
|
||||
async function saveConcept(form) {
|
||||
const payload = {
|
||||
concept_updates: [
|
||||
{
|
||||
concept_id: form.get("concept_id"),
|
||||
status: form.get("status"),
|
||||
description: form.get("description"),
|
||||
prerequisites: splitLines(form.get("prerequisites")),
|
||||
notes: splitLines(form.get("notes")),
|
||||
},
|
||||
],
|
||||
};
|
||||
const response = await fetch("/api/save", {
|
||||
method: "POST",
|
||||
headers: { "Content-Type": "application/json" },
|
||||
body: JSON.stringify(payload),
|
||||
});
|
||||
const result = await response.json();
|
||||
state.reviewData = result.review_data;
|
||||
state.message = `Saved concept ${payload.concept_updates[0].concept_id}.`;
|
||||
render();
|
||||
}
|
||||
|
||||
async function saveCitation(form) {
|
||||
const payload = {
|
||||
citation_updates: [
|
||||
{
|
||||
citation_review_id: form.get("citation_review_id"),
|
||||
status: form.get("status"),
|
||||
notes: splitLines(form.get("notes")),
|
||||
},
|
||||
],
|
||||
};
|
||||
const response = await fetch("/api/save", {
|
||||
method: "POST",
|
||||
headers: { "Content-Type": "application/json" },
|
||||
body: JSON.stringify(payload),
|
||||
});
|
||||
const result = await response.json();
|
||||
state.reviewData = result.review_data;
|
||||
state.message = `Saved citation review ${payload.citation_updates[0].citation_review_id}.`;
|
||||
render();
|
||||
}
|
||||
|
||||
async function verifyCitation(citationReviewId) {
|
||||
const response = await fetch("/api/citations/verify", {
|
||||
method: "POST",
|
||||
headers: { "Content-Type": "application/json" },
|
||||
body: JSON.stringify({ citation_review_id: citationReviewId }),
|
||||
});
|
||||
state.verificationResult = await response.json();
|
||||
state.message = `Verification run for ${citationReviewId}.`;
|
||||
render();
|
||||
}
|
||||
|
||||
function statusOptions(spec, selectedValue) {
|
||||
return (spec?.options || [])
|
||||
.map((option) => `<option value="${escapeHtml(option.value)}"${option.value === selectedValue ? " selected" : ""}>${escapeHtml(option.label)}</option>`)
|
||||
.join("");
|
||||
}
|
||||
|
||||
function renderConceptPanel(concept) {
|
||||
if (!concept) {
|
||||
return `<section class="panel"><h2>No concept selected</h2></section>`;
|
||||
}
|
||||
const review = concept.review || {};
|
||||
const statusSpec = (state.reviewData.field_specs || []).find((item) => item.field === "status");
|
||||
const guidance = (state.reviewData.review_guidance?.priorities || []).map((item) => `<li>${escapeHtml(item)}</li>`).join("");
|
||||
const claims = (review.top_claims || []).map((claim) => `
|
||||
<article class="claim-card">
|
||||
<div class="claim-head">
|
||||
<strong>${escapeHtml(claim.claim_kind || "claim")}</strong>
|
||||
<span class="chip">${escapeHtml(claim.grounding_status || "unknown")}</span>
|
||||
</div>
|
||||
<p>${escapeHtml(claim.claim_text || "")}</p>
|
||||
<div class="tiny">Artifacts: ${escapeHtml((claim.artifact_paths || []).join(", ") || "none")}</div>
|
||||
${(claim.supporting_observations || []).slice(0, 2).map((obs) => `
|
||||
<div class="support-block">
|
||||
<div class="tiny">${escapeHtml(obs.origin_path || "")}${obs.line_start ? `:${obs.line_start}` : ""}</div>
|
||||
<div>${escapeHtml(obs.text || "")}</div>
|
||||
</div>
|
||||
`).join("")}
|
||||
</article>
|
||||
`).join("");
|
||||
|
||||
return `
|
||||
<section class="panel detail">
|
||||
<div class="panel-head">
|
||||
<div>
|
||||
<h2>${escapeHtml(concept.title)}</h2>
|
||||
<div class="muted">${escapeHtml(concept.concept_id)} · claims ${escapeHtml(review.claim_count || 0)} · grounded ${escapeHtml(review.grounded_claim_count || 0)} · warnings ${escapeHtml(review.warning_count || 0)}</div>
|
||||
</div>
|
||||
<div class="pill ${review.has_citation_support ? "pill-good" : "pill-warn"}">${review.has_citation_support ? "citation-bearing" : "no citation support"}</div>
|
||||
</div>
|
||||
<p class="help">${escapeHtml(review.review_help || "")}</p>
|
||||
<form id="concept-form">
|
||||
<input type="hidden" name="concept_id" value="${escapeHtml(concept.concept_id)}" />
|
||||
<label>
|
||||
<span>Review status</span>
|
||||
<select name="status">${statusOptions(statusSpec, concept.status)}</select>
|
||||
</label>
|
||||
<label>
|
||||
<span>Description</span>
|
||||
<textarea name="description" rows="3">${escapeHtml(concept.description || "")}</textarea>
|
||||
</label>
|
||||
<label>
|
||||
<span>Prerequisites</span>
|
||||
<textarea name="prerequisites" rows="3">${escapeHtml((concept.prerequisites || []).join("\n"))}</textarea>
|
||||
</label>
|
||||
<label>
|
||||
<span>Reviewer notes</span>
|
||||
<textarea name="notes" rows="5">${escapeHtml((concept.notes || []).join("\n"))}</textarea>
|
||||
</label>
|
||||
<div class="actions">
|
||||
<button type="submit" class="primary">Save Concept Review</button>
|
||||
</div>
|
||||
</form>
|
||||
<section class="subpanel">
|
||||
<h3>Reviewer guidance</h3>
|
||||
<ul>${guidance}</ul>
|
||||
</section>
|
||||
<section class="subpanel">
|
||||
<h3>Representative claims</h3>
|
||||
<div class="stack">${claims || "<div class=\"muted\">No representative claims available.</div>"}</div>
|
||||
</section>
|
||||
</section>
|
||||
`;
|
||||
}
|
||||
|
||||
function renderCitationPanel(citation) {
|
||||
const statusSpec = (state.reviewData.citation_field_specs || []).find((item) => item.field === "status");
|
||||
const nextActions = (state.reviewData.citations?.next_actions || []).map((item) => `<li>${escapeHtml(item)}</li>`).join("");
|
||||
if (!citation) {
|
||||
return `<section class="panel"><h2>No citation selected</h2></section>`;
|
||||
}
|
||||
return `
|
||||
<section class="panel detail">
|
||||
<div class="panel-head">
|
||||
<div>
|
||||
<h2>Citation lane</h2>
|
||||
<div class="muted">${escapeHtml(citation.source_kind)} · ${escapeHtml(citation.artifact_path || citation.locator || "")}</div>
|
||||
</div>
|
||||
<div class="pill">${escapeHtml(citation.status)}</div>
|
||||
</div>
|
||||
<form id="citation-form">
|
||||
<input type="hidden" name="citation_review_id" value="${escapeHtml(citation.citation_review_id)}" />
|
||||
<label>
|
||||
<span>Status</span>
|
||||
<select name="status">${statusOptions(statusSpec, citation.status)}</select>
|
||||
</label>
|
||||
<label>
|
||||
<span>Citation key</span>
|
||||
<input value="${escapeHtml(citation.citation_key || "")}" disabled />
|
||||
</label>
|
||||
<label>
|
||||
<span>Reference title</span>
|
||||
<input value="${escapeHtml(citation.title || "")}" disabled />
|
||||
</label>
|
||||
<label>
|
||||
<span>Bibliography source</span>
|
||||
<input value="${escapeHtml(citation.source_bib_path || "")}" disabled />
|
||||
</label>
|
||||
<label>
|
||||
<span>Reviewer notes</span>
|
||||
<textarea name="notes" rows="5">${escapeHtml((citation.notes || []).join("\n"))}</textarea>
|
||||
</label>
|
||||
<div class="tiny">Related concepts: ${escapeHtml((citation.related_concept_ids || []).join(", ") || "none")}</div>
|
||||
<div class="tiny">Related claims: ${escapeHtml((citation.related_claim_ids || []).join(", ") || "none")}</div>
|
||||
<div class="actions">
|
||||
<button type="button" id="verify-citation" class="secondary">Verify With CiteGeist</button>
|
||||
<button type="submit" class="primary">Save Citation Review</button>
|
||||
</div>
|
||||
</form>
|
||||
<section class="subpanel">
|
||||
<h3>Citation guidance</h3>
|
||||
<ul>${(state.reviewData.review_guidance?.citation_guidance || []).map((item) => `<li>${escapeHtml(item)}</li>`).join("")}</ul>
|
||||
</section>
|
||||
<section class="subpanel">
|
||||
<h3>Next actions</h3>
|
||||
<ul>${nextActions}</ul>
|
||||
</section>
|
||||
<section class="subpanel">
|
||||
<h3>Verification</h3>
|
||||
${
|
||||
state.verificationResult && state.verificationResult.citation_review_id === citation.citation_review_id
|
||||
? `<pre class="json-block">${escapeHtml(JSON.stringify(state.verificationResult, null, 2))}</pre>`
|
||||
: `<div class="muted">Run CiteGeist verification to inspect the stored entry and candidate matches.</div>`
|
||||
}
|
||||
</section>
|
||||
</section>
|
||||
`;
|
||||
}
|
||||
|
||||
function render() {
|
||||
const app = document.getElementById("app");
|
||||
if (!state.reviewData) {
|
||||
app.innerHTML = `<main class="shell"><section class="panel"><h1>Loading review data…</h1></section></main>`;
|
||||
return;
|
||||
}
|
||||
const summary = state.reviewData.import_context?.manifest || {};
|
||||
const conceptList = conceptRows().filter((item) => {
|
||||
const needle = state.conceptSearch.trim().toLowerCase();
|
||||
return !needle || item.title.toLowerCase().includes(needle) || item.concept_id.toLowerCase().includes(needle);
|
||||
});
|
||||
const citationList = citationRows().filter((item) => {
|
||||
if (state.citationFilter === "all") return true;
|
||||
return item.status === state.citationFilter;
|
||||
});
|
||||
const concept = selectedConcept();
|
||||
const citation = selectedCitation();
|
||||
|
||||
app.innerHTML = `
|
||||
<main class="shell">
|
||||
<header class="hero">
|
||||
<div>
|
||||
<h1>GroundRecall Review Workbench</h1>
|
||||
<p>Concept-first review with a dedicated citation lane for academic imports.</p>
|
||||
<div class="muted">${escapeHtml(summary.import_id || "")} · ${escapeHtml(summary.source_root || "")}</div>
|
||||
${state.message ? `<div class="message">${escapeHtml(state.message)}</div>` : ""}
|
||||
</div>
|
||||
<div class="hero-stats">
|
||||
<div class="stat"><strong>${escapeHtml(summary.artifact_count || 0)}</strong><span>artifacts</span></div>
|
||||
<div class="stat"><strong>${escapeHtml(summary.claim_count || 0)}</strong><span>claims</span></div>
|
||||
<div class="stat"><strong>${escapeHtml(summary.concept_count || 0)}</strong><span>concepts</span></div>
|
||||
<div class="stat"><strong>${escapeHtml(state.reviewData.citations?.summary?.citation_key_total || 0)}</strong><span>citation keys</span></div>
|
||||
</div>
|
||||
</header>
|
||||
|
||||
<section class="workspace-grid">
|
||||
<aside class="panel list-panel">
|
||||
<div class="panel-head"><h2>Concepts</h2></div>
|
||||
<label class="search">
|
||||
<span>Search</span>
|
||||
<input id="concept-search" value="${escapeHtml(state.conceptSearch)}" />
|
||||
</label>
|
||||
<div class="stack">
|
||||
${conceptList.map((item) => `
|
||||
<button class="list-item ${item.concept_id === concept?.concept_id ? "active" : ""}" data-concept-id="${escapeHtml(item.concept_id)}">
|
||||
<strong>${escapeHtml(item.title)}</strong>
|
||||
<span>${escapeHtml(item.status)}</span>
|
||||
</button>
|
||||
`).join("")}
|
||||
</div>
|
||||
</aside>
|
||||
${renderConceptPanel(concept)}
|
||||
</section>
|
||||
|
||||
<section class="workspace-grid">
|
||||
<aside class="panel list-panel">
|
||||
<div class="panel-head"><h2>Citation lane</h2></div>
|
||||
<label class="search">
|
||||
<span>Filter</span>
|
||||
<select id="citation-filter">
|
||||
${["all", "unreviewed", "verified", "needs_source_check", "misleading", "irrelevant", "fabricated"].map((value) => `<option value="${value}"${value === state.citationFilter ? " selected" : ""}>${value}</option>`).join("")}
|
||||
</select>
|
||||
</label>
|
||||
<div class="stack">
|
||||
${citationList.map((item) => `
|
||||
<button class="list-item ${item.citation_review_id === citation?.citation_review_id ? "active" : ""}" data-citation-id="${escapeHtml(item.citation_review_id)}">
|
||||
<strong>${escapeHtml(item.citation_key || item.title || item.citation_review_id)}</strong>
|
||||
<span>${escapeHtml(item.status)}</span>
|
||||
</button>
|
||||
`).join("")}
|
||||
</div>
|
||||
</aside>
|
||||
${renderCitationPanel(citation)}
|
||||
</section>
|
||||
</main>
|
||||
`;
|
||||
|
||||
document.querySelectorAll("[data-concept-id]").forEach((node) => {
|
||||
node.addEventListener("click", () => {
|
||||
state.selectedConceptId = node.getAttribute("data-concept-id");
|
||||
render();
|
||||
});
|
||||
});
|
||||
document.querySelectorAll("[data-citation-id]").forEach((node) => {
|
||||
node.addEventListener("click", () => {
|
||||
state.selectedCitationId = node.getAttribute("data-citation-id");
|
||||
render();
|
||||
});
|
||||
});
|
||||
document.getElementById("concept-search")?.addEventListener("input", (event) => {
|
||||
state.conceptSearch = event.target.value;
|
||||
render();
|
||||
});
|
||||
document.getElementById("citation-filter")?.addEventListener("change", (event) => {
|
||||
state.citationFilter = event.target.value;
|
||||
render();
|
||||
});
|
||||
document.getElementById("concept-form")?.addEventListener("submit", async (event) => {
|
||||
event.preventDefault();
|
||||
await saveConcept(new FormData(event.target));
|
||||
});
|
||||
document.getElementById("citation-form")?.addEventListener("submit", async (event) => {
|
||||
event.preventDefault();
|
||||
await saveCitation(new FormData(event.target));
|
||||
});
|
||||
document.getElementById("verify-citation")?.addEventListener("click", async () => {
|
||||
if (state.selectedCitationId) {
|
||||
await verifyCitation(state.selectedCitationId);
|
||||
}
|
||||
});
|
||||
}
|
||||
|
||||
loadReviewData();
|
||||
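The save handlers above define the `/api/save` contract (batched `concept_updates` / `citation_updates`). A minimal sketch of exercising the same endpoint outside the browser, assuming the review server's default host and port; the concept id and field values are placeholders:

# stdlib-only sketch against the local review server; mirrors saveConcept().
import json
from urllib.request import Request, urlopen

payload = {
    "concept_updates": [
        {
            "concept_id": "concept-demo",         # assumed id
            "status": "provisional",
            "description": "Tightened summary.",
            "prerequisites": ["concept-prereq"],  # assumed id
            "notes": ["needs citation cleanup"],
        }
    ]
}
req = Request(
    "http://127.0.0.1:8766/api/save",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urlopen(req) as resp:
    print(json.load(resp)["ok"])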
|
|
@@ -0,0 +1,13 @@
|
|||
<!DOCTYPE html>
|
||||
<html lang="en">
|
||||
<head>
|
||||
<meta charset="utf-8" />
|
||||
<meta name="viewport" content="width=device-width, initial-scale=1" />
|
||||
<title>GroundRecall Review Workbench</title>
|
||||
<link rel="stylesheet" href="/styles.css" />
|
||||
</head>
|
||||
<body>
|
||||
<div id="app"></div>
|
||||
<script src="/app.js"></script>
|
||||
</body>
|
||||
</html>
|
||||
|
|
@@ -0,0 +1,248 @@
|
|||
:root {
|
||||
--bg: #f4f0e8;
|
||||
--panel: rgba(255, 251, 245, 0.92);
|
||||
--panel-strong: #fffdfa;
|
||||
--ink: #1f2933;
|
||||
--muted: #5f6c76;
|
||||
--accent: #0f766e;
|
||||
--accent-strong: #134e4a;
|
||||
--warn: #9a3412;
|
||||
--line: rgba(31, 41, 51, 0.12);
|
||||
--shadow: 0 20px 60px rgba(31, 41, 51, 0.08);
|
||||
}
|
||||
|
||||
* {
|
||||
box-sizing: border-box;
|
||||
}
|
||||
|
||||
body {
|
||||
margin: 0;
|
||||
color: var(--ink);
|
||||
background:
|
||||
radial-gradient(circle at top right, rgba(15, 118, 110, 0.14), transparent 28%),
|
||||
radial-gradient(circle at top left, rgba(154, 52, 18, 0.08), transparent 24%),
|
||||
linear-gradient(180deg, #fbf8f1 0%, var(--bg) 100%);
|
||||
font-family: "Iowan Old Style", "Palatino Linotype", "Book Antiqua", serif;
|
||||
}
|
||||
|
||||
button,
|
||||
input,
|
||||
select,
|
||||
textarea {
|
||||
font: inherit;
|
||||
}
|
||||
|
||||
.shell {
|
||||
width: min(1500px, calc(100vw - 32px));
|
||||
margin: 24px auto 48px;
|
||||
}
|
||||
|
||||
.hero,
|
||||
.panel {
|
||||
background: var(--panel);
|
||||
border: 1px solid var(--line);
|
||||
border-radius: 20px;
|
||||
box-shadow: var(--shadow);
|
||||
}
|
||||
|
||||
.hero {
|
||||
display: grid;
|
||||
grid-template-columns: 1.6fr 1fr;
|
||||
gap: 20px;
|
||||
padding: 24px;
|
||||
margin-bottom: 18px;
|
||||
}
|
||||
|
||||
.hero h1,
|
||||
.panel h2,
|
||||
.panel h3 {
|
||||
margin: 0 0 8px;
|
||||
font-family: Georgia, "Times New Roman", serif;
|
||||
}
|
||||
|
||||
.hero-stats {
|
||||
display: grid;
|
||||
grid-template-columns: repeat(2, minmax(0, 1fr));
|
||||
gap: 12px;
|
||||
}
|
||||
|
||||
.stat {
|
||||
padding: 14px;
|
||||
border-radius: 16px;
|
||||
background: var(--panel-strong);
|
||||
border: 1px solid var(--line);
|
||||
}
|
||||
|
||||
.stat strong {
|
||||
display: block;
|
||||
font-size: 1.8rem;
|
||||
}
|
||||
|
||||
.stat span,
|
||||
.muted,
|
||||
.tiny,
|
||||
.help {
|
||||
color: var(--muted);
|
||||
}
|
||||
|
||||
.message {
|
||||
margin-top: 10px;
|
||||
color: var(--accent-strong);
|
||||
}
|
||||
|
||||
.workspace-grid {
|
||||
display: grid;
|
||||
grid-template-columns: minmax(280px, 360px) 1fr;
|
||||
gap: 18px;
|
||||
margin-bottom: 18px;
|
||||
}
|
||||
|
||||
.panel {
|
||||
padding: 18px;
|
||||
}
|
||||
|
||||
.panel-head {
|
||||
display: flex;
|
||||
justify-content: space-between;
|
||||
align-items: flex-start;
|
||||
gap: 12px;
|
||||
margin-bottom: 12px;
|
||||
}
|
||||
|
||||
.list-panel {
|
||||
max-height: 78vh;
|
||||
overflow: auto;
|
||||
}
|
||||
|
||||
.search,
|
||||
label {
|
||||
display: grid;
|
||||
gap: 6px;
|
||||
margin-bottom: 12px;
|
||||
}
|
||||
|
||||
.stack {
|
||||
display: grid;
|
||||
gap: 10px;
|
||||
}
|
||||
|
||||
.list-item,
|
||||
.primary,
|
||||
.secondary {
|
||||
border: 1px solid var(--line);
|
||||
border-radius: 14px;
|
||||
background: var(--panel-strong);
|
||||
}
|
||||
|
||||
.list-item {
|
||||
display: grid;
|
||||
gap: 4px;
|
||||
width: 100%;
|
||||
padding: 12px;
|
||||
text-align: left;
|
||||
cursor: pointer;
|
||||
}
|
||||
|
||||
.list-item.active {
|
||||
border-color: rgba(15, 118, 110, 0.45);
|
||||
background: rgba(15, 118, 110, 0.08);
|
||||
}
|
||||
|
||||
input,
|
||||
select,
|
||||
textarea {
|
||||
width: 100%;
|
||||
padding: 10px 12px;
|
||||
border: 1px solid var(--line);
|
||||
border-radius: 12px;
|
||||
background: #fff;
|
||||
}
|
||||
|
||||
textarea {
|
||||
resize: vertical;
|
||||
}
|
||||
|
||||
.actions {
|
||||
display: flex;
|
||||
justify-content: flex-end;
|
||||
}
|
||||
|
||||
.primary {
|
||||
padding: 10px 16px;
|
||||
color: white;
|
||||
background: linear-gradient(135deg, var(--accent) 0%, var(--accent-strong) 100%);
|
||||
cursor: pointer;
|
||||
}
|
||||
|
||||
.secondary {
|
||||
padding: 10px 16px;
|
||||
color: var(--accent-strong);
|
||||
cursor: pointer;
|
||||
}
|
||||
|
||||
.subpanel {
|
||||
margin-top: 16px;
|
||||
padding-top: 14px;
|
||||
border-top: 1px solid var(--line);
|
||||
}
|
||||
|
||||
.claim-card,
|
||||
.support-block {
|
||||
padding: 12px;
|
||||
border-radius: 14px;
|
||||
background: var(--panel-strong);
|
||||
border: 1px solid var(--line);
|
||||
}
|
||||
|
||||
.claim-head {
|
||||
display: flex;
|
||||
justify-content: space-between;
|
||||
gap: 12px;
|
||||
margin-bottom: 8px;
|
||||
}
|
||||
|
||||
.chip,
|
||||
.pill {
|
||||
display: inline-flex;
|
||||
align-items: center;
|
||||
padding: 4px 10px;
|
||||
border-radius: 999px;
|
||||
border: 1px solid var(--line);
|
||||
background: #fff;
|
||||
font-size: 0.85rem;
|
||||
}
|
||||
|
||||
.pill-good {
|
||||
color: var(--accent-strong);
|
||||
}
|
||||
|
||||
.pill-warn {
|
||||
color: var(--warn);
|
||||
}
|
||||
|
||||
ul {
|
||||
margin: 0;
|
||||
padding-left: 20px;
|
||||
}
|
||||
|
||||
.json-block {
|
||||
padding: 12px;
|
||||
overflow: auto;
|
||||
border-radius: 14px;
|
||||
background: #f8f7f3;
|
||||
border: 1px solid var(--line);
|
||||
font-family: "SFMono-Regular", Consolas, "Liberation Mono", monospace;
|
||||
font-size: 0.9rem;
|
||||
white-space: pre-wrap;
|
||||
}
|
||||
|
||||
@media (max-width: 980px) {
|
||||
.hero,
|
||||
.workspace-grid {
|
||||
grid-template-columns: 1fr;
|
||||
}
|
||||
|
||||
.list-panel {
|
||||
max-height: none;
|
||||
}
|
||||
}
|
||||
|
|
@@ -0,0 +1,439 @@
|
|||
from __future__ import annotations
|
||||
from pathlib import Path
|
||||
import hashlib
|
||||
import json
import yaml
|
||||
import re
|
||||
import sys
|
||||
from collections import defaultdict
|
||||
from typing import Any, Callable
|
||||
from .citation_support import bibliography_summary_payload, load_bibliography_index, serialize_bib_entry
|
||||
from .review_schema import CitationReviewEntry, ReviewSession
|
||||
|
||||
def export_review_state_json(session: ReviewSession, path: str | Path) -> None:
|
||||
Path(path).write_text(session.model_dump_json(indent=2), encoding="utf-8")
|
||||
|
||||
def export_promoted_pack(session: ReviewSession, outdir: str | Path) -> None:
|
||||
outdir = Path(outdir)
|
||||
outdir.mkdir(parents=True, exist_ok=True)
|
||||
promoted_pack = dict(session.draft_pack.pack)
|
||||
promoted_pack["version"] = str(promoted_pack.get("version", "0.1.0-draft")).replace("-draft", "-reviewed")
|
||||
promoted_pack["curation"] = {"reviewer": session.reviewer, "ledger_entries": len(session.ledger)}
|
||||
|
||||
concepts = []
|
||||
for concept in session.draft_pack.concepts:
|
||||
if concept.status == "rejected":
|
||||
continue
|
||||
concepts.append({
|
||||
"id": concept.concept_id,
|
||||
"title": concept.title,
|
||||
"description": concept.description,
|
||||
"prerequisites": concept.prerequisites,
|
||||
"mastery_signals": concept.mastery_signals,
|
||||
"status": concept.status,
|
||||
"notes": concept.notes,
|
||||
"mastery_profile": {},
|
||||
})
|
||||
|
||||
(outdir / "pack.yaml").write_text(yaml.safe_dump(promoted_pack, sort_keys=False), encoding="utf-8")
|
||||
(outdir / "concepts.yaml").write_text(yaml.safe_dump({"concepts": concepts}, sort_keys=False), encoding="utf-8")
|
||||
(outdir / "review_ledger.json").write_text(json.dumps(session.model_dump(), indent=2), encoding="utf-8")
|
||||
(outdir / "license_attribution.json").write_text(json.dumps(session.draft_pack.attribution, indent=2), encoding="utf-8")
|
||||
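For reference, the on-disk layout this export produces, reconstructed from the writes above:

# export_promoted_pack(session, "reviewed-pack") yields:
# reviewed-pack/
#   pack.yaml                  # pack header, version re-suffixed -reviewed, curation metadata
#   concepts.yaml              # all non-rejected concepts
#   review_ledger.json         # full ReviewSession dump
#   license_attribution.json   # attribution carried over from the draft pack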
|
||||
|
||||
def export_promoted_pack_to_course_repo(session: ReviewSession, course_repo: str | Path, outdir: str | Path | None = None) -> Path:
|
||||
from .course_repo import resolve_course_repo
|
||||
|
||||
resolved = resolve_course_repo(course_repo)
|
||||
target = Path(outdir) if outdir is not None else Path(resolved.generated_pack_dir or (Path(resolved.repo_root) / "generated" / "pack"))
|
||||
export_promoted_pack(session, target)
|
||||
return target
|
||||
|
||||
|
||||
LATEX_CITE_RE = re.compile(r"\\cite[a-zA-Z*]*(?:\[[^\]]*\])?(?:\[[^\]]*\])?\{([^}]+)\}")
|
||||
|
||||
|
||||
def _read_json(path: Path) -> dict[str, Any]:
|
||||
return json.loads(path.read_text(encoding="utf-8"))
|
||||
|
||||
|
||||
def _read_jsonl(path: Path) -> list[dict[str, Any]]:
|
||||
text = path.read_text(encoding="utf-8").strip()
|
||||
if not text:
|
||||
return []
|
||||
return [json.loads(line) for line in text.splitlines()]
|
||||
|
||||
|
||||
def _status_field_spec() -> dict[str, Any]:
|
||||
return {
|
||||
"field": "status",
|
||||
"label": "Review status",
|
||||
"input": "select",
|
||||
"required": True,
|
||||
"options": [
|
||||
{
|
||||
"value": "trusted",
|
||||
"label": "Trusted",
|
||||
"help": "Promote this concept and its supported claims when the evidence and wording are ready.",
|
||||
},
|
||||
{
|
||||
"value": "provisional",
|
||||
"label": "Provisional",
|
||||
"help": "Keep this concept in reviewed state when it is promising but still needs citation or wording cleanup.",
|
||||
},
|
||||
{
|
||||
"value": "needs_review",
|
||||
"label": "Needs Review",
|
||||
"help": "Leave undecided when support, scope, or concept boundaries are still unclear.",
|
||||
},
|
||||
{
|
||||
"value": "rejected",
|
||||
"label": "Rejected",
|
||||
"help": "Exclude this concept when it is noise, unsupported, duplicated, or misleading.",
|
||||
},
|
||||
],
|
||||
}
|
||||
|
||||
|
||||
def _text_field_spec(field: str, label: str, help_text: str, *, multiline: bool = False) -> dict[str, Any]:
|
||||
return {
|
||||
"field": field,
|
||||
"label": label,
|
||||
"input": "textarea" if multiline else "text",
|
||||
"required": False,
|
||||
"help": help_text,
|
||||
}
|
||||
|
||||
|
||||
def _citation_status_field_spec() -> dict[str, Any]:
|
||||
return {
|
||||
"field": "status",
|
||||
"label": "Citation review status",
|
||||
"input": "select",
|
||||
"required": True,
|
||||
"options": [
|
||||
{
|
||||
"value": "unreviewed",
|
||||
"label": "Unreviewed",
|
||||
"help": "Keep this citation candidate in triage until fit and existence are checked.",
|
||||
},
|
||||
{
|
||||
"value": "verified",
|
||||
"label": "Verified",
|
||||
"help": "The cited work exists and materially supports the associated manuscript claim.",
|
||||
},
|
||||
{
|
||||
"value": "needs_source_check",
|
||||
"label": "Needs Source Check",
|
||||
"help": "The citation may be useful but still needs direct source inspection or metadata cleanup.",
|
||||
},
|
||||
{
|
||||
"value": "misleading",
|
||||
"label": "Misleading",
|
||||
"help": "The citation exists but overstates, contradicts, or poorly fits the claim.",
|
||||
},
|
||||
{
|
||||
"value": "irrelevant",
|
||||
"label": "Irrelevant",
|
||||
"help": "The citation does not materially support the concept or claim under review.",
|
||||
},
|
||||
{
|
||||
"value": "fabricated",
|
||||
"label": "Fabricated",
|
||||
"help": "The citation appears invented, malformed, or otherwise not real.",
|
||||
},
|
||||
],
|
||||
}
|
||||
|
||||
|
||||
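# Best-effort import of CiteGeist's reference extractor: prefer a known local
# checkout on sys.path, fall back to an installed package, and degrade to
# (None, []) when neither is available.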
def _load_citegeist_extract() -> tuple[Callable[[str], list[Any]] | None, list[str]]:
|
||||
citegeist_src = Path("/home/netuser/bin/CiteGeist/src")
|
||||
if citegeist_src.exists():
|
||||
sys.path.insert(0, str(citegeist_src))
|
||||
try:
|
||||
from citegeist import available_extraction_backends, extract_references # type: ignore
|
||||
except Exception:
|
||||
return None, []
|
||||
return extract_references, list(available_extraction_backends())
|
||||
|
||||
|
||||
def _extract_citation_keys(text: str) -> list[str]:
|
||||
keys: list[str] = []
|
||||
for raw_group in LATEX_CITE_RE.findall(text):
|
||||
keys.extend(part.strip() for part in raw_group.split(",") if part.strip())
|
||||
return sorted(set(keys))
|
||||
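A behavior sketch for the citation-key extraction above: multi-key cite commands are split on commas, trimmed, de-duplicated, and returned sorted.

# Both forms below are matched by LATEX_CITE_RE, including starred variants
# and up to two optional bracket arguments.
sample = r"\citep[see][p.~3]{knuth84, lamport94} and \cite*{knuth84}"
assert _extract_citation_keys(sample) == ["knuth84", "lamport94"]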
|
||||
|
||||
def _artifact_citation_payloads(
|
||||
artifacts: list[dict[str, Any]],
|
||||
*,
|
||||
source_root: str,
|
||||
) -> tuple[list[dict[str, Any]], dict[str, dict[str, Any]]]:
|
||||
extract_references, backends = _load_citegeist_extract()
|
||||
artifact_payloads: list[dict[str, Any]] = []
|
||||
summaries: dict[str, dict[str, Any]] = {}
|
||||
root = Path(source_root) if source_root else None
|
||||
bibliography_index = load_bibliography_index(source_root) if source_root else {}
|
||||
|
||||
for artifact in artifacts:
|
||||
path = root / artifact["path"] if root is not None else None
|
||||
raw_text = ""
|
||||
if path is not None and path.exists():
|
||||
try:
|
||||
raw_text = path.read_text(encoding="utf-8")
|
||||
except UnicodeDecodeError:
|
||||
raw_text = ""
|
||||
citation_keys = _extract_citation_keys(raw_text) if raw_text else []
|
||||
extracted_refs: list[dict[str, Any]] = []
|
||||
if extract_references is not None and raw_text:
|
||||
try:
|
||||
for entry in extract_references(raw_text):
|
||||
extracted_refs.append(
|
||||
{
|
||||
"citation_key": "",
|
||||
"entry_type": entry.entry_type,
|
||||
"title": entry.fields.get("title", ""),
|
||||
"author": entry.fields.get("author", ""),
|
||||
"year": entry.fields.get("year", ""),
|
||||
"venue": entry.fields.get("journal", "") or entry.fields.get("booktitle", ""),
|
||||
}
|
||||
)
|
||||
except Exception:
|
||||
extracted_refs = []
|
||||
|
||||
payload = {
|
||||
"artifact_id": artifact["artifact_id"],
|
||||
"path": artifact["path"],
|
||||
"title": artifact.get("title", ""),
|
||||
"citation_keys": citation_keys,
|
||||
"resolved_entries": [serialize_bib_entry(bibliography_index.get(key)) for key in citation_keys if bibliography_index.get(key)],
|
||||
"citation_key_count": len(citation_keys),
|
||||
"extracted_references": extracted_refs[:12],
|
||||
"extracted_reference_count": len(extracted_refs),
|
||||
"citegeist_backends": backends,
|
||||
}
|
||||
artifact_payloads.append(payload)
|
||||
summaries[artifact["artifact_id"]] = {
|
||||
"citation_key_count": len(citation_keys),
|
||||
"extracted_reference_count": len(extracted_refs),
|
||||
"has_citation_support": bool(citation_keys or extracted_refs),
|
||||
}
|
||||
return artifact_payloads, summaries
|
||||
|
||||
|
||||
def build_citation_review_entries_from_import(import_dir: str | Path) -> list[CitationReviewEntry]:
|
||||
base = Path(import_dir)
|
||||
manifest = _read_json(base / "manifest.json")
|
||||
artifacts = _read_jsonl(base / "artifacts.jsonl")
|
||||
observations = _read_jsonl(base / "observations.jsonl")
|
||||
claims = _read_jsonl(base / "claims.jsonl")
|
||||
bibliography_index = load_bibliography_index(manifest.get("source_root", ""))
|
||||
|
||||
artifact_payloads, _ = _artifact_citation_payloads(
|
||||
artifacts,
|
||||
source_root=manifest.get("source_root", ""),
|
||||
)
|
||||
observations_by_id = {item["observation_id"]: item for item in observations}
|
||||
artifact_claim_links: dict[str, dict[str, set[str]]] = defaultdict(lambda: {"claim_ids": set(), "concept_ids": set()})
|
||||
|
||||
for claim in claims:
|
||||
artifact_ids = {
|
||||
observations_by_id[item]["artifact_id"]
|
||||
for item in claim.get("source_observation_ids", [])
|
||||
if item in observations_by_id and observations_by_id[item].get("artifact_id")
|
||||
}
|
||||
for artifact_id in artifact_ids:
|
||||
artifact_claim_links[artifact_id]["claim_ids"].add(claim["claim_id"])
|
||||
artifact_claim_links[artifact_id]["concept_ids"].update(
|
||||
concept_id.replace("concept::", "", 1) for concept_id in claim.get("concept_ids", [])
|
||||
)
|
||||
|
||||
entries: list[CitationReviewEntry] = []
|
||||
for artifact in artifact_payloads:
|
||||
link_payload = artifact_claim_links.get(artifact["artifact_id"], {"claim_ids": set(), "concept_ids": set()})
|
||||
for citation_key in artifact.get("citation_keys", []):
|
||||
digest = hashlib.sha1(f"{artifact['artifact_id']}|key|{citation_key}".encode("utf-8")).hexdigest()[:12]
|
||||
bib_entry = bibliography_index.get(citation_key, {})
|
||||
fields = bib_entry.get("fields", {})
|
||||
entries.append(
|
||||
CitationReviewEntry(
|
||||
citation_review_id=f"citrev-{digest}",
|
||||
artifact_id=artifact["artifact_id"],
|
||||
artifact_path=artifact.get("path", ""),
|
||||
artifact_title=artifact.get("title", ""),
|
||||
source_kind="citation_key",
|
||||
locator=artifact.get("path", ""),
|
||||
citation_key=citation_key,
|
||||
title=str(fields.get("title", "")),
|
||||
author=str(fields.get("author", "")),
|
||||
year=str(fields.get("year", "")),
|
||||
venue=str(fields.get("journal", "") or fields.get("booktitle", "") or fields.get("publisher", "")),
|
||||
source_bib_path=str(bib_entry.get("source_bib_path", "")),
|
||||
raw_bibtex=str(bib_entry.get("raw_bibtex", "")),
|
||||
related_concept_ids=sorted(link_payload["concept_ids"]),
|
||||
related_claim_ids=sorted(link_payload["claim_ids"]),
|
||||
)
|
||||
)
|
||||
for index, reference in enumerate(artifact.get("extracted_references", []), start=1):
|
||||
digest = hashlib.sha1(
|
||||
f"{artifact['artifact_id']}|ref|{reference.get('title', '')}|{reference.get('author', '')}|{index}".encode("utf-8")
|
||||
).hexdigest()[:12]
|
||||
entries.append(
|
||||
CitationReviewEntry(
|
||||
citation_review_id=f"citrev-{digest}",
|
||||
artifact_id=artifact["artifact_id"],
|
||||
artifact_path=artifact.get("path", ""),
|
||||
artifact_title=artifact.get("title", ""),
|
||||
source_kind="extracted_reference",
|
||||
locator=f"{artifact.get('path', '')}#ref-{index}",
|
||||
citation_key="",
|
||||
title=reference.get("title", ""),
|
||||
author=reference.get("author", ""),
|
||||
year=reference.get("year", ""),
|
||||
venue=reference.get("venue", ""),
|
||||
related_concept_ids=sorted(link_payload["concept_ids"]),
|
||||
related_claim_ids=sorted(link_payload["claim_ids"]),
|
||||
)
|
||||
)
|
||||
return entries
|
||||
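A usage sketch for the builder above, assuming an import directory containing the manifest.json, artifacts.jsonl, observations.jsonl, and claims.jsonl files it reads; the path is illustrative:

# Counts citation-review entries per source kind (citation_key vs
# extracted_reference).
from collections import Counter

entries = build_citation_review_entries_from_import("imports/llmwiki-2024-06")
print(Counter(entry.source_kind for entry in entries))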
|
||||
|
||||
def _build_import_review_payload(session: ReviewSession, import_dir: Path) -> dict[str, Any]:
|
||||
manifest = _read_json(import_dir / "manifest.json")
|
||||
lint_payload = _read_json(import_dir / "lint_findings.json")
|
||||
queue_payload = _read_json(import_dir / "review_queue.json")
|
||||
artifacts = _read_jsonl(import_dir / "artifacts.jsonl")
|
||||
observations = _read_jsonl(import_dir / "observations.jsonl")
|
||||
claims = _read_jsonl(import_dir / "claims.jsonl")
|
||||
|
||||
observations_by_id = {item["observation_id"]: item for item in observations}
|
||||
claims_by_concept: dict[str, list[dict[str, Any]]] = defaultdict(list)
|
||||
findings_by_target: dict[str, list[dict[str, Any]]] = defaultdict(list)
|
||||
for finding in lint_payload.get("findings", []):
|
||||
findings_by_target[finding["target_id"]].append(finding)
|
||||
for claim in claims:
|
||||
for concept_id in claim.get("concept_ids", []):
|
||||
claims_by_concept[concept_id].append(claim)
|
||||
|
||||
artifact_citations, artifact_citation_summary = _artifact_citation_payloads(
|
||||
artifacts,
|
||||
source_root=manifest.get("source_root", ""),
|
||||
)
|
||||
artifact_by_id = {item["artifact_id"]: item for item in artifacts}
|
||||
|
||||
concept_reviews: list[dict[str, Any]] = []
|
||||
for concept in session.draft_pack.concepts:
|
||||
full_concept_id = f"concept::{concept.concept_id}" if not concept.concept_id.startswith("concept::") else concept.concept_id
|
||||
concept_claims = claims_by_concept.get(full_concept_id, [])
|
||||
claim_payloads: list[dict[str, Any]] = []
|
||||
has_citation_support = False
|
||||
for claim in concept_claims[:25]:
|
||||
supporting_observations = [observations_by_id[item] for item in claim.get("source_observation_ids", []) if item in observations_by_id]
|
||||
artifact_ids = {item["artifact_id"] for item in supporting_observations}
|
||||
citation_support = [artifact_citation_summary.get(artifact_id, {}) for artifact_id in artifact_ids]
|
||||
has_citation_support = has_citation_support or any(item.get("has_citation_support") for item in citation_support)
|
||||
claim_payloads.append(
|
||||
{
|
||||
"claim_id": claim["claim_id"],
|
||||
"claim_text": claim.get("claim_text", ""),
|
||||
"claim_kind": claim.get("claim_kind", ""),
|
||||
"grounding_status": claim.get("grounding_status", "unknown"),
|
||||
"supporting_observations": [
|
||||
{
|
||||
"observation_id": obs["observation_id"],
|
||||
"origin_path": obs.get("origin_path", ""),
|
||||
"origin_section": obs.get("origin_section", ""),
|
||||
"text": obs.get("text", ""),
|
||||
"line_start": obs.get("line_start", 0),
|
||||
"line_end": obs.get("line_end", 0),
|
||||
}
|
||||
for obs in supporting_observations
|
||||
],
|
||||
"citation_support": citation_support,
|
||||
"artifact_paths": [artifact_by_id[item]["path"] for item in artifact_ids if item in artifact_by_id],
|
||||
"finding_messages": [item["message"] for item in findings_by_target.get(claim["claim_id"], [])],
|
||||
}
|
||||
)
|
||||
|
||||
concept_reviews.append(
|
||||
{
|
||||
"concept_id": concept.concept_id,
|
||||
"title": concept.title,
|
||||
"status": concept.status,
|
||||
"description": concept.description,
|
||||
"review_help": (
|
||||
"Prefer `trusted` when claims are coherent and citation-bearing support is appropriate; "
|
||||
"prefer `provisional` when the concept is plausible but still needs citation or wording cleanup."
|
||||
),
|
||||
"claim_count": len(concept_claims),
|
||||
"grounded_claim_count": sum(1 for item in concept_claims if item.get("grounding_status") == "grounded"),
|
||||
"warning_count": len(findings_by_target.get(full_concept_id, [])),
|
||||
"has_citation_support": has_citation_support,
|
||||
"top_claims": claim_payloads,
|
||||
"notes": list(concept.notes),
|
||||
}
|
||||
)
|
||||
|
||||
return {
|
||||
"import_context": {
|
||||
"manifest": manifest,
|
||||
"lint_summary": lint_payload.get("summary", {}),
|
||||
"queue_length": queue_payload.get("queue_length", 0),
|
||||
"source_adapter": manifest.get("source_adapter", ""),
|
||||
},
|
||||
"review_guidance": {
|
||||
"overview": (
|
||||
"Review concepts first, then inspect representative claims and their source observations before promotion."
|
||||
),
|
||||
"priorities": [
|
||||
"Focus reviewer effort on concepts with strong grounded claims and explicit citations first.",
|
||||
"Downgrade or reject concepts whose claims are fragmented, duplicated, or missing meaningful support.",
|
||||
"For academic material, citation-bearing claims deserve special scrutiny for fit, contradiction, and fabrication risk.",
|
||||
],
|
||||
"citation_guidance": [
|
||||
"A citation key or extracted reference is evidence of traceability, not correctness.",
|
||||
"Check whether the cited work actually supports the claim and whether the claim overstates it.",
|
||||
"Use the citation track to prioritize claims that can move into a separate citation-ingestion workflow.",
|
||||
],
|
||||
},
|
||||
"field_specs": [
|
||||
_status_field_spec(),
|
||||
_text_field_spec("description", "Concept description", "Refine the concept summary to match the strongest supported interpretation."),
|
||||
_text_field_spec("notes", "Reviewer notes", "Record why this concept is trusted, provisional, rejected, or still unclear.", multiline=True),
|
||||
_text_field_spec("prerequisites", "Prerequisites", "List prerequisite concepts only when the manuscript support is explicit or defensible.", multiline=True),
|
||||
],
|
||||
"citation_field_specs": [
|
||||
_citation_status_field_spec(),
|
||||
_text_field_spec("notes", "Citation notes", "Record whether the cited work exists, fits the claim, or should move into a dedicated citation-ingestion lane.", multiline=True),
|
||||
],
|
||||
"concept_reviews": concept_reviews,
|
||||
"citation_reviews": [entry.model_dump() for entry in session.citation_reviews],
|
||||
"bibliography": bibliography_summary_payload(manifest.get("source_root", "")),
|
||||
"citations": {
|
||||
"enabled": True,
|
||||
"provider": "citegeist" if artifact_citations and artifact_citations[0].get("citegeist_backends") else "none",
|
||||
"artifacts": artifact_citations,
|
||||
"summary": {
|
||||
"artifact_count_with_citations": sum(1 for item in artifact_citations if item["citation_key_count"] or item["extracted_reference_count"]),
|
||||
"citation_key_total": sum(item["citation_key_count"] for item in artifact_citations),
|
||||
"extracted_reference_total": sum(item["extracted_reference_count"] for item in artifact_citations),
|
||||
},
|
||||
"next_actions": [
|
||||
"Promote citation-bearing claims into a dedicated citation review lane.",
|
||||
"Use CiteGeist extraction as a first pass, then verify support and metadata before trusting the citation.",
|
||||
],
|
||||
},
|
||||
}
|
||||
|
||||
|
||||
def export_review_ui_data(session: ReviewSession, outdir: str | Path, import_dir: str | Path | None = None) -> None:
|
||||
outdir = Path(outdir)
|
||||
outdir.mkdir(parents=True, exist_ok=True)
|
||||
payload = {
|
||||
"reviewer": session.reviewer,
|
||||
"draft_pack": session.draft_pack.model_dump(),
|
||||
"citation_reviews": [entry.model_dump() for entry in session.citation_reviews],
|
||||
"ledger": [entry.model_dump() for entry in session.ledger],
|
||||
}
|
||||
if import_dir is not None:
|
||||
payload.update(_build_import_review_payload(session, Path(import_dir)))
|
||||
(outdir / "review_data.json").write_text(json.dumps(payload, indent=2), encoding="utf-8")
|
||||
|
|
@@ -0,0 +1,80 @@
|
|||
from __future__ import annotations
|
||||
from pydantic import BaseModel, Field
|
||||
from typing import Literal
|
||||
|
||||
TrustStatus = Literal["trusted", "provisional", "rejected", "needs_review"]
|
||||
CitationStatus = Literal["unreviewed", "verified", "needs_source_check", "misleading", "irrelevant", "fabricated"]
|
||||
|
||||
class ConceptReviewEntry(BaseModel):
|
||||
concept_id: str
|
||||
title: str
|
||||
description: str = ""
|
||||
prerequisites: list[str] = Field(default_factory=list)
|
||||
mastery_signals: list[str] = Field(default_factory=list)
|
||||
status: TrustStatus = "needs_review"
|
||||
notes: list[str] = Field(default_factory=list)
|
||||
|
||||
|
||||
class CitationReviewEntry(BaseModel):
|
||||
citation_review_id: str
|
||||
artifact_id: str
|
||||
artifact_path: str = ""
|
||||
artifact_title: str = ""
|
||||
source_kind: Literal["citation_key", "extracted_reference"] = "citation_key"
|
||||
locator: str = ""
|
||||
citation_key: str = ""
|
||||
title: str = ""
|
||||
author: str = ""
|
||||
year: str = ""
|
||||
venue: str = ""
|
||||
source_bib_path: str = ""
|
||||
raw_bibtex: str = ""
|
||||
status: CitationStatus = "unreviewed"
|
||||
notes: list[str] = Field(default_factory=list)
|
||||
related_concept_ids: list[str] = Field(default_factory=list)
|
||||
related_claim_ids: list[str] = Field(default_factory=list)
|
||||
|
||||
class DraftPackData(BaseModel):
|
||||
pack: dict = Field(default_factory=dict)
|
||||
concepts: list[ConceptReviewEntry] = Field(default_factory=list)
|
||||
conflicts: list[str] = Field(default_factory=list)
|
||||
review_flags: list[str] = Field(default_factory=list)
|
||||
attribution: dict = Field(default_factory=dict)
|
||||
|
||||
class ReviewAction(BaseModel):
|
||||
action_type: str
|
||||
target: str = ""
|
||||
payload: dict = Field(default_factory=dict)
|
||||
rationale: str = ""
|
||||
|
||||
class ReviewLedgerEntry(BaseModel):
|
||||
reviewer: str
|
||||
action: ReviewAction
|
||||
|
||||
class ReviewSession(BaseModel):
|
||||
reviewer: str
|
||||
draft_pack: DraftPackData
|
||||
citation_reviews: list[CitationReviewEntry] = Field(default_factory=list)
|
||||
ledger: list[ReviewLedgerEntry] = Field(default_factory=list)
|
||||
|
||||
class WorkspaceMeta(BaseModel):
|
||||
workspace_id: str
|
||||
title: str
|
||||
path: str
|
||||
created_at: str
|
||||
last_opened_at: str
|
||||
notes: str = ""
|
||||
|
||||
class WorkspaceRegistry(BaseModel):
|
||||
workspaces: list[WorkspaceMeta] = Field(default_factory=list)
|
||||
recent_workspace_ids: list[str] = Field(default_factory=list)
|
||||
|
||||
class ImportPreview(BaseModel):
|
||||
ok: bool = False
|
||||
source_dir: str
|
||||
workspace_id: str
|
||||
overwrite_required: bool = False
|
||||
errors: list[str] = Field(default_factory=list)
|
||||
warnings: list[str] = Field(default_factory=list)
|
||||
summary: dict = Field(default_factory=dict)
|
||||
semantic_warnings: list[str] = Field(default_factory=list)
|
||||
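A small construction sketch for the session models above; the ids, titles, and reviewer are placeholders:

# ConceptReviewEntry defaults to needs_review; CitationReviewEntry to unreviewed.
session = ReviewSession(
    reviewer="reviewer@example.org",  # assumed reviewer id
    draft_pack=DraftPackData(
        pack={"name": "demo-pack", "version": "0.1.0-draft"},
        concepts=[ConceptReviewEntry(concept_id="c-demo", title="Demo concept")],
    ),
)
assert session.draft_pack.concepts[0].status == "needs_review"
assert session.citation_reviews == [] and session.ledger == []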
|
|
@@ -0,0 +1,246 @@
|
|||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import mimetypes
|
||||
from http.server import BaseHTTPRequestHandler, HTTPServer
|
||||
from pathlib import Path
|
||||
from urllib.parse import parse_qs, urlparse
|
||||
|
||||
from .citation_support import materialize_citegeist_store
|
||||
from .promotion import promote_import_to_store
|
||||
from .review_workspace import GroundRecallReviewWorkspace
|
||||
|
||||
|
||||
def _json_response(handler: BaseHTTPRequestHandler, status: int, payload: dict) -> None:
|
||||
body = json.dumps(payload, indent=2).encode("utf-8")
|
||||
handler.send_response(status)
|
||||
handler.send_header("Content-Type", "application/json")
|
||||
handler.send_header("Content-Length", str(len(body)))
|
||||
handler.send_header("Access-Control-Allow-Origin", "*")
|
||||
handler.send_header("Access-Control-Allow-Methods", "GET,POST,OPTIONS")
|
||||
handler.send_header("Access-Control-Allow-Headers", "Content-Type")
|
||||
handler.end_headers()
|
||||
handler.wfile.write(body)
|
||||
|
||||
|
||||
def _serve_static(handler: BaseHTTPRequestHandler, asset_path: Path) -> None:
|
||||
if not asset_path.exists():
|
||||
_json_response(handler, 404, {"error": "asset not found"})
|
||||
return
|
||||
body = asset_path.read_bytes()
|
||||
handler.send_response(200)
|
||||
handler.send_header("Content-Type", mimetypes.guess_type(str(asset_path))[0] or "application/octet-stream")
|
||||
handler.send_header("Content-Length", str(len(body)))
|
||||
handler.end_headers()
|
||||
handler.wfile.write(body)
|
||||
|
||||
|
||||
def _safe_show_entry(api: object, citation_key: str) -> dict | None:
|
||||
if not citation_key:
|
||||
return None
|
||||
try:
|
||||
return api.show_entry( # type: ignore[attr-defined]
|
||||
citation_key,
|
||||
include_provenance=True,
|
||||
include_conflicts=True,
|
||||
include_bibtex=True,
|
||||
)
|
||||
except AttributeError:
|
||||
pass
|
||||
|
||||
store = getattr(api, "store", None)
|
||||
if store is None:
|
||||
return None
|
||||
entry = store.get_entry(citation_key)
|
||||
if entry is None:
|
||||
return None
|
||||
payload = dict(entry)
|
||||
if hasattr(store, "get_field_provenance"):
|
||||
try:
|
||||
payload["provenance"] = store.get_field_provenance(citation_key)
|
||||
except Exception:
|
||||
payload["provenance"] = []
|
||||
if hasattr(store, "get_conflicts"):
|
||||
try:
|
||||
payload["conflicts"] = store.get_conflicts(citation_key)
|
||||
except Exception:
|
||||
payload["conflicts"] = []
|
||||
else:
|
||||
payload["conflicts"] = []
|
||||
if hasattr(store, "get_entry_bibtex"):
|
||||
try:
|
||||
payload["bibtex"] = store.get_entry_bibtex(citation_key)
|
||||
except Exception:
|
||||
payload["bibtex"] = None
|
||||
return payload
|
||||
|
||||
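# Verification prefers the entry's raw BibTeX when present, then falls back
# to fuzzy string matching over citation key, title, author, and year.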
|
||||
def _safe_verify_entry(api: object, entry: object, *, context: str, limit: int) -> dict:
|
||||
if getattr(entry, "raw_bibtex", ""):
|
||||
try:
|
||||
return api.verify_bibtex(entry.raw_bibtex, context=context, limit=limit) # type: ignore[attr-defined]
|
||||
except Exception:
|
||||
pass
|
||||
values = [item for item in [getattr(entry, "citation_key", ""), getattr(entry, "title", ""), getattr(entry, "author", ""), getattr(entry, "year", "")] if item]
|
||||
try:
|
||||
return api.verify_strings(values, context=context, limit=limit) # type: ignore[attr-defined]
|
||||
except Exception as exc:
|
||||
return {
|
||||
"context": context,
|
||||
"results": [],
|
||||
"error": str(exc),
|
||||
}
|
||||
|
||||
|
||||
class GroundRecallReviewHandler(BaseHTTPRequestHandler):
    workspace: GroundRecallReviewWorkspace
    default_store_dir: str | None = None
    citegeist_bundle: dict | None = None

    def do_OPTIONS(self) -> None:
        _json_response(self, 200, {"ok": True})

    def do_GET(self) -> None:
        parsed = urlparse(self.path)
        if parsed.path == "/api/healthz":
            _json_response(self, 200, {"ok": True})
            return
        if parsed.path == "/api/load":
            review_data = self.workspace.load_review_data()
            review_data["citegeist"] = {
                "enabled": bool(self.citegeist_bundle and self.citegeist_bundle.get("available")),
                "db_path": self.citegeist_bundle.get("db_path") if self.citegeist_bundle else "",
                "ingested_files": self.citegeist_bundle.get("ingested_files", []) if self.citegeist_bundle else [],
                "show_entry_endpoint": "/api/citations/show-entry",
                "verify_endpoint": "/api/citations/verify",
            }
            _json_response(
                self,
                200,
                {
                    "ok": True,
                    "import_dir": str(self.workspace.import_dir),
                    "review_data": review_data,
                },
            )
            return
        if parsed.path == "/api/citations/show-entry":
            if not self.citegeist_bundle or not self.citegeist_bundle.get("available"):
                _json_response(self, 404, {"ok": False, "error": "citegeist unavailable"})
                return
            citation_key = parse_qs(parsed.query).get("citation_key", [""])[0]
            if not citation_key:
                _json_response(self, 400, {"ok": False, "error": "citation_key is required"})
                return
            payload = _safe_show_entry(self.citegeist_bundle["api"], citation_key)
            _json_response(self, 200, {"ok": payload is not None, "entry": payload})
            return

        asset_root = Path(__file__).with_name("review_app")
        if parsed.path in {"/", "/index.html"}:
            _serve_static(self, asset_root / "index.html")
            return
        if parsed.path == "/app.js":
            _serve_static(self, asset_root / "app.js")
            return
        if parsed.path == "/styles.css":
            _serve_static(self, asset_root / "styles.css")
            return
        _json_response(self, 404, {"error": "not found"})

    def do_POST(self) -> None:
        parsed = urlparse(self.path)
        length = int(self.headers.get("Content-Length", "0"))
        raw = self.rfile.read(length) if length else b"{}"
        payload = json.loads(raw.decode("utf-8") or "{}")

        if parsed.path == "/api/save":
            self.workspace.apply_updates(
                concept_updates=payload.get("concept_updates"),
                citation_updates=payload.get("citation_updates"),
                reviewer=payload.get("reviewer"),
            )
            _json_response(
                self,
                200,
                {
                    "ok": True,
                    "import_dir": str(self.workspace.import_dir),
                    "review_data": self.workspace.load_review_data(),
                },
            )
            return

        if parsed.path == "/api/promote":
            store_dir = payload.get("store_dir") or self.default_store_dir
            if not store_dir:
                _json_response(self, 400, {"ok": False, "error": "store_dir is required"})
                return
            result = promote_import_to_store(
                import_dir=self.workspace.import_dir,
                store_dir=store_dir,
                reviewer=payload.get("reviewer"),
                snapshot_id=payload.get("snapshot_id"),
            )
            _json_response(self, 200, {"ok": True, "promotion": result})
            return
        if parsed.path == "/api/citations/verify":
            if not self.citegeist_bundle or not self.citegeist_bundle.get("available"):
                _json_response(self, 404, {"ok": False, "error": "citegeist unavailable"})
                return
            citation_review_id = str(payload.get("citation_review_id") or "").strip()
            if not citation_review_id:
                _json_response(self, 400, {"ok": False, "error": "citation_review_id is required"})
                return
            session = self.workspace.load_session()
            entry = next((item for item in session.citation_reviews if item.citation_review_id == citation_review_id), None)
            if entry is None:
                _json_response(self, 404, {"ok": False, "error": "citation review entry not found"})
                return
            api = self.citegeist_bundle["api"]
            show_entry_payload = _safe_show_entry(api, entry.citation_key) if entry.citation_key else None
            context = f"{entry.artifact_path} {entry.artifact_title}".strip()
            verification = _safe_verify_entry(api, entry, context=context, limit=int(payload.get("limit", 5)))
            _json_response(
                self,
                200,
                {
                    "ok": True,
                    "citation_review_id": citation_review_id,
                    "entry": show_entry_payload,
                    "verification": verification,
                },
            )
            return

        _json_response(self, 404, {"error": "not found"})


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="GroundRecall local review server")
    parser.add_argument("import_dir")
    parser.add_argument("--host", default="127.0.0.1")
    parser.add_argument("--port", type=int, default=8766)
    parser.add_argument("--reviewer", default="GroundRecall Import")
    parser.add_argument("--store-dir", default=None)
    return parser


def main() -> None:
    args = build_parser().parse_args()
    GroundRecallReviewHandler.workspace = GroundRecallReviewWorkspace(args.import_dir, reviewer=args.reviewer)
    GroundRecallReviewHandler.default_store_dir = args.store_dir
    GroundRecallReviewHandler.workspace.ensure_review_bundle()
    session = GroundRecallReviewHandler.workspace.load_session()
    GroundRecallReviewHandler.citegeist_bundle = materialize_citegeist_store(
        args.import_dir,
        session.draft_pack.pack.get("source_root", ""),
    )
    server = HTTPServer((args.host, args.port), GroundRecallReviewHandler)
    print(f"GroundRecall review server listening on http://{args.host}:{args.port}")
    server.serve_forever()


if __name__ == "__main__":
    main()
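# Example invocation (an assumption; the flags match build_parser above):
#   python -m groundrecall.review_server ./imports/import-test --port 8766 --store-dir ./store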
@@ -0,0 +1,126 @@
from __future__ import annotations

import json
from pathlib import Path
from typing import Any

from .groundrecall_review_bridge import export_review_bundle_from_import
from .review_export import build_citation_review_entries_from_import, export_review_state_json, export_review_ui_data
from .review_schema import ReviewAction, ReviewLedgerEntry, ReviewSession


def _normalize_lines(value: Any) -> list[str]:
    if isinstance(value, list):
        return [str(item).strip() for item in value if str(item).strip()]
    if isinstance(value, str):
        return [line.strip() for line in value.splitlines() if line.strip()]
    return []


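# A thin state machine over two JSON files in the import directory:
# review_session.json (the authoritative session) and review_data.json (the UI
# view of it). ensure_review_bundle regenerates both when citation review
# entries are missing or stale.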
class GroundRecallReviewWorkspace:
    def __init__(self, import_dir: str | Path, reviewer: str = "GroundRecall Import") -> None:
        self.import_dir = Path(import_dir)
        self.reviewer = reviewer

    @property
    def review_session_path(self) -> Path:
        return self.import_dir / "review_session.json"

    @property
    def review_data_path(self) -> Path:
        return self.import_dir / "review_data.json"

    def ensure_review_bundle(self) -> None:
        if not self.review_session_path.exists():
            export_review_bundle_from_import(self.import_dir, reviewer=self.reviewer)
            return
        session = ReviewSession.model_validate_json(self.review_session_path.read_text(encoding="utf-8"))
        updated = False
        if (
            not session.citation_reviews
            or any(entry.source_kind == "citation_key" and not entry.title for entry in session.citation_reviews)
            or any(entry.source_kind == "citation_key" and not entry.source_bib_path for entry in session.citation_reviews)
        ):
            session.citation_reviews = build_citation_review_entries_from_import(self.import_dir)
            updated = True
        if updated or not self.review_data_path.exists():
            self.save_session(session)

    def load_session(self) -> ReviewSession:
        self.ensure_review_bundle()
        return ReviewSession.model_validate_json(self.review_session_path.read_text(encoding="utf-8"))

    def save_session(self, session: ReviewSession) -> None:
        export_review_state_json(session, self.review_session_path)
        export_review_ui_data(session, self.import_dir, import_dir=self.import_dir)

    def load_review_data(self) -> dict[str, Any]:
        self.ensure_review_bundle()
        return json.loads(self.review_data_path.read_text(encoding="utf-8"))

    def apply_updates(
        self,
        *,
        concept_updates: list[dict[str, Any]] | None = None,
        citation_updates: list[dict[str, Any]] | None = None,
        reviewer: str | None = None,
    ) -> ReviewSession:
        session = self.load_session()
        if reviewer:
            session.reviewer = reviewer
        concept_by_id = {concept.concept_id: concept for concept in session.draft_pack.concepts}
        citation_by_id = {entry.citation_review_id: entry for entry in session.citation_reviews}

        for payload in concept_updates or []:
            concept_id = str(payload.get("concept_id", "")).strip()
            if not concept_id or concept_id not in concept_by_id:
                continue
            concept = concept_by_id[concept_id]
            if "status" in payload:
                concept.status = payload["status"]
            if "description" in payload:
                concept.description = str(payload.get("description", "")).strip()
            if "notes" in payload:
                concept.notes = _normalize_lines(payload.get("notes"))
            if "prerequisites" in payload:
                concept.prerequisites = _normalize_lines(payload.get("prerequisites"))
            session.ledger.append(
                ReviewLedgerEntry(
                    reviewer=session.reviewer,
                    action=ReviewAction(
                        action_type="edit_concept",
                        target=concept_id,
                        payload={
                            "status": concept.status,
                            "description": concept.description,
                            "notes": concept.notes,
                            "prerequisites": concept.prerequisites,
                        },
                        rationale=str(payload.get("rationale", "")).strip(),
                    ),
                )
            )

        for payload in citation_updates or []:
            citation_review_id = str(payload.get("citation_review_id", "")).strip()
            if not citation_review_id or citation_review_id not in citation_by_id:
                continue
            entry = citation_by_id[citation_review_id]
            if "status" in payload:
                entry.status = payload["status"]
            if "notes" in payload:
                entry.notes = _normalize_lines(payload.get("notes"))
            session.ledger.append(
                ReviewLedgerEntry(
                    reviewer=session.reviewer,
                    action=ReviewAction(
                        action_type="edit_citation",
                        target=citation_review_id,
                        payload={"status": entry.status, "notes": entry.notes},
                        rationale=str(payload.get("rationale", "")).strip(),
                    ),
                )
            )

        self.save_session(session)
        return session
@@ -0,0 +1,3 @@
from __future__ import annotations

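# Compatibility shim: re-export the legacy flat module under the package-qualified path.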
from .. import groundrecall_source_adapters as _legacy_source_adapters  # noqa: F401
@@ -0,0 +1,3 @@
from __future__ import annotations

from ..groundrecall_source_adapters.base import *  # noqa: F403
@@ -0,0 +1,3 @@
from __future__ import annotations

from ..groundrecall_source_adapters.didactopus_pack import *  # noqa: F403
@@ -0,0 +1,3 @@
from __future__ import annotations

from ..groundrecall_source_adapters.llmwiki import *  # noqa: F403
@@ -0,0 +1,3 @@
from __future__ import annotations

from ..groundrecall_source_adapters.markdown_notes import *  # noqa: F403
@@ -0,0 +1,3 @@
from __future__ import annotations

from ..groundrecall_source_adapters.polypaper import *  # noqa: F403
@@ -0,0 +1,3 @@
from __future__ import annotations

from ..groundrecall_source_adapters.transcript import *  # noqa: F403
@@ -0,0 +1,203 @@
from __future__ import annotations

import os
import tempfile
from pathlib import Path
from typing import TypeVar

from pydantic import BaseModel

from .models import (
    ArtifactRecord,
    ClaimRecord,
    ConceptRecord,
    FragmentRecord,
    GroundRecallSnapshot,
    ObservationRecord,
    PromotionRecord,
    RelationRecord,
    ReviewCandidateRecord,
    SourceRecord,
)


ModelT = TypeVar("ModelT", bound=BaseModel)


class GroundRecallStore:
    def __init__(self, base_dir: str | Path):
        self.base_dir = Path(base_dir)
        self.sources_dir = self.base_dir / "sources"
        self.fragments_dir = self.base_dir / "fragments"
        self.artifacts_dir = self.base_dir / "artifacts"
        self.observations_dir = self.base_dir / "observations"
        self.claims_dir = self.base_dir / "claims"
        self.concepts_dir = self.base_dir / "concepts"
        self.relations_dir = self.base_dir / "relations"
        self.review_candidates_dir = self.base_dir / "review_candidates"
        self.promotions_dir = self.base_dir / "promotions"
        self.snapshots_dir = self.base_dir / "snapshots"
        for path in [
            self.sources_dir,
            self.fragments_dir,
            self.artifacts_dir,
            self.observations_dir,
            self.claims_dir,
            self.concepts_dir,
            self.relations_dir,
            self.review_candidates_dir,
            self.promotions_dir,
            self.snapshots_dir,
        ]:
            path.mkdir(parents=True, exist_ok=True)

    def _save(self, directory: Path, key: str, model: BaseModel) -> None:
        target = directory / f"{key}.json"
        payload = model.model_dump_json(indent=2)
        self._write_text_atomic(target, payload)

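    # Durable write: stage the JSON in a same-directory temp file, fsync it,
    # then os.replace onto the target so readers always see either the old
    # record or the complete new one. Keeping the temp file in the same
    # directory keeps the rename atomic on a single filesystem.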
    def _write_text_atomic(self, path: Path, text: str) -> None:
        path.parent.mkdir(parents=True, exist_ok=True)
        fd, tmp_name = tempfile.mkstemp(
            prefix=f".{path.name}.",
            suffix=".tmp",
            dir=path.parent,
            text=True,
        )
        tmp_path = Path(tmp_name)
        try:
            with os.fdopen(fd, "w", encoding="utf-8") as handle:
                handle.write(text)
                handle.flush()
                os.fsync(handle.fileno())
            os.replace(tmp_path, path)
        finally:
            if tmp_path.exists():
                tmp_path.unlink()

    def _load(self, directory: Path, key: str, model_type: type[ModelT]) -> ModelT | None:
        path = directory / f"{key}.json"
        if not path.exists():
            return None
        return model_type.model_validate_json(path.read_text(encoding="utf-8"))

    def _list(self, directory: Path, model_type: type[ModelT]) -> list[ModelT]:
        items: list[ModelT] = []
        for path in sorted(directory.glob("*.json")):
            items.append(model_type.model_validate_json(path.read_text(encoding="utf-8")))
        return items

    def save_source(self, record: SourceRecord) -> SourceRecord:
        self._save(self.sources_dir, record.source_id, record)
        return record

    def get_source(self, source_id: str) -> SourceRecord | None:
        return self._load(self.sources_dir, source_id, SourceRecord)

    def list_sources(self) -> list[SourceRecord]:
        return self._list(self.sources_dir, SourceRecord)

    def save_fragment(self, record: FragmentRecord) -> FragmentRecord:
        self._save(self.fragments_dir, record.fragment_id, record)
        return record

    def get_fragment(self, fragment_id: str) -> FragmentRecord | None:
        return self._load(self.fragments_dir, fragment_id, FragmentRecord)

    def list_fragments(self) -> list[FragmentRecord]:
        return self._list(self.fragments_dir, FragmentRecord)

    def save_artifact(self, record: ArtifactRecord) -> ArtifactRecord:
        self._save(self.artifacts_dir, record.artifact_id, record)
        return record

    def get_artifact(self, artifact_id: str) -> ArtifactRecord | None:
        return self._load(self.artifacts_dir, artifact_id, ArtifactRecord)

    def list_artifacts(self) -> list[ArtifactRecord]:
        return self._list(self.artifacts_dir, ArtifactRecord)

    def save_observation(self, record: ObservationRecord) -> ObservationRecord:
        self._save(self.observations_dir, record.observation_id, record)
        return record

    def get_observation(self, observation_id: str) -> ObservationRecord | None:
        return self._load(self.observations_dir, observation_id, ObservationRecord)

    def list_observations(self) -> list[ObservationRecord]:
        return self._list(self.observations_dir, ObservationRecord)

    def save_claim(self, record: ClaimRecord) -> ClaimRecord:
        self._save(self.claims_dir, record.claim_id, record)
        return record

    def get_claim(self, claim_id: str) -> ClaimRecord | None:
        return self._load(self.claims_dir, claim_id, ClaimRecord)

    def list_claims(self) -> list[ClaimRecord]:
        return self._list(self.claims_dir, ClaimRecord)

    def save_concept(self, record: ConceptRecord) -> ConceptRecord:
        self._save(self.concepts_dir, record.concept_id.replace("::", "__"), record)
        return record

    def get_concept(self, concept_id: str) -> ConceptRecord | None:
        return self._load(self.concepts_dir, concept_id.replace("::", "__"), ConceptRecord)

    def list_concepts(self) -> list[ConceptRecord]:
        return self._list(self.concepts_dir, ConceptRecord)

    def save_relation(self, record: RelationRecord) -> RelationRecord:
        self._save(self.relations_dir, record.relation_id, record)
        return record

    def get_relation(self, relation_id: str) -> RelationRecord | None:
        return self._load(self.relations_dir, relation_id, RelationRecord)

    def list_relations(self) -> list[RelationRecord]:
        return self._list(self.relations_dir, RelationRecord)

    def save_review_candidate(self, record: ReviewCandidateRecord) -> ReviewCandidateRecord:
        self._save(self.review_candidates_dir, record.review_candidate_id, record)
        return record

    def get_review_candidate(self, review_candidate_id: str) -> ReviewCandidateRecord | None:
        return self._load(self.review_candidates_dir, review_candidate_id, ReviewCandidateRecord)

    def list_review_candidates(self) -> list[ReviewCandidateRecord]:
        return self._list(self.review_candidates_dir, ReviewCandidateRecord)

    def save_promotion(self, record: PromotionRecord) -> PromotionRecord:
        self._save(self.promotions_dir, record.promotion_id, record)
        return record

    def get_promotion(self, promotion_id: str) -> PromotionRecord | None:
        return self._load(self.promotions_dir, promotion_id, PromotionRecord)

    def list_promotions(self) -> list[PromotionRecord]:
        return self._list(self.promotions_dir, PromotionRecord)

    def save_snapshot(self, snapshot: GroundRecallSnapshot) -> GroundRecallSnapshot:
        self._save(self.snapshots_dir, snapshot.snapshot_id, snapshot)
        return snapshot

    def get_snapshot(self, snapshot_id: str) -> GroundRecallSnapshot | None:
        return self._load(self.snapshots_dir, snapshot_id, GroundRecallSnapshot)

    def list_snapshots(self) -> list[GroundRecallSnapshot]:
        return self._list(self.snapshots_dir, GroundRecallSnapshot)

    def build_snapshot(self, snapshot_id: str, created_at: str, metadata: dict | None = None) -> GroundRecallSnapshot:
        return GroundRecallSnapshot(
            snapshot_id=snapshot_id,
            created_at=created_at,
            sources=self.list_sources(),
            fragments=self.list_fragments(),
            artifacts=self.list_artifacts(),
            observations=self.list_observations(),
            claims=self.list_claims(),
            concepts=self.list_concepts(),
            relations=self.list_relations(),
            promotions=self.list_promotions(),
            metadata=metadata or {},
        )
@@ -0,0 +1,17 @@
import shutil
import subprocess


def test_groundrecall_console_script_help() -> None:
    executable = shutil.which("groundrecall")
    assert executable is not None

    result = subprocess.run(
        [executable, "--help"],
        capture_output=True,
        text=True,
        check=False,
    )

    assert result.returncode == 0
    assert "GroundRecall command-line tools" in result.stdout
@@ -0,0 +1,134 @@
from __future__ import annotations

import json
from pathlib import Path

from groundrecall.assistant_export import export_assistant_bundle
from groundrecall.assistants.base import get_assistant_adapter, list_assistant_adapters
import groundrecall.assistants.codex  # noqa: F401
import groundrecall.assistants.claude_code  # noqa: F401
from groundrecall.models import (
    ArtifactRecord,
    ClaimRecord,
    ConceptRecord,
    ObservationRecord,
    ProvenanceRecord,
    RelationRecord,
)
from groundrecall.query import build_query_bundle_for_concept
from groundrecall.store import GroundRecallStore


def _seed_store(store: GroundRecallStore) -> None:
    store.save_artifact(
        ArtifactRecord(
            artifact_id="ia_001",
            artifact_kind="compiled_page",
            title="Channel Capacity",
            path="wiki/channel-capacity.md",
            current_status="reviewed",
        )
    )
    store.save_observation(
        ObservationRecord(
            observation_id="obs_001",
            artifact_id="ia_001",
            role="claim",
            text="Reliable communication rate is bounded by channel capacity.",
            provenance=ProvenanceRecord(
                origin_artifact_id="ia_001",
                origin_path="wiki/channel-capacity.md",
                support_kind="derived_from_page",
                grounding_status="grounded",
            ),
            current_status="reviewed",
        )
    )
    store.save_concept(
        ConceptRecord(
            concept_id="concept::channel-capacity",
            title="Channel Capacity",
            description="Reliable communication limit.",
            source_artifact_ids=["ia_001"],
            current_status="promoted",
        )
    )
    store.save_concept(
        ConceptRecord(
            concept_id="concept::shannon-entropy",
            title="Shannon Entropy",
            description="Average uncertainty.",
            current_status="promoted",
        )
    )
    store.save_claim(
        ClaimRecord(
            claim_id="clm_001",
            claim_text="Channel capacity bounds reliable communication rate.",
            concept_ids=["concept::channel-capacity"],
            source_observation_ids=["obs_001"],
            confidence_hint=0.8,
            review_confidence=0.9,
            provenance=ProvenanceRecord(
                origin_artifact_id="ia_001",
                origin_path="wiki/channel-capacity.md",
                support_kind="derived_from_page",
                grounding_status="grounded",
            ),
            current_status="promoted",
        )
    )
    store.save_relation(
        RelationRecord(
            relation_id="rel_001",
            source_id="concept::channel-capacity",
            target_id="concept::shannon-entropy",
            relation_type="references",
            current_status="promoted",
        )
    )


def test_assistant_adapter_registry_lists_known_adapters() -> None:
    assert "codex" in list_assistant_adapters()
    assert "claude_code" in list_assistant_adapters()


def test_codex_adapter_exports_skill_and_json_bundle(tmp_path: Path) -> None:
    store = GroundRecallStore(tmp_path / "groundrecall")
    _seed_store(store)
    manifest = export_assistant_bundle(store.base_dir, "codex", tmp_path / "codex", concept_refs=["channel-capacity"])

    assert (tmp_path / "codex" / "SKILL.md").exists()
    assert (tmp_path / "codex" / "codex_bundle.json").exists()
    assert (tmp_path / "codex" / "assistant_export_manifest.json").exists()
    assert manifest["assistant"] == "codex"


def test_claude_code_adapter_exports_memory_and_json_bundle(tmp_path: Path) -> None:
    store = GroundRecallStore(tmp_path / "groundrecall")
    _seed_store(store)
    manifest = export_assistant_bundle(store.base_dir, "claude_code", tmp_path / "claude", concept_refs=["channel-capacity"])

    assert (tmp_path / "claude" / "CLAUDE.md").exists()
    assert (tmp_path / "claude" / "claude_code_bundle.json").exists()
    assert manifest["assistant"] == "claude_code"


def test_adapter_contexts_are_derived_from_assistant_neutral_query_bundles(tmp_path: Path) -> None:
    store = GroundRecallStore(tmp_path / "groundrecall")
    _seed_store(store)
    query_bundle = build_query_bundle_for_concept(store.base_dir, "channel-capacity")
    assert query_bundle is not None

    codex = get_assistant_adapter("codex")
    claude = get_assistant_adapter("claude_code")
    codex_context = codex.build_context(query_bundle)
    claude_context = claude.build_context(query_bundle)

    assert codex_context["concept"]["concept_id"] == "concept::channel-capacity"
    assert claude_context["concept"]["concept_id"] == "concept::channel-capacity"
    assert codex_context["assistant"] == "codex"
    assert claude_context["assistant"] == "claude_code"
    assert "relevant_claims" in codex_context
    assert "claims" in claude_context
@@ -0,0 +1,136 @@
from __future__ import annotations

import json
from pathlib import Path

from groundrecall.export import export_canonical_bundle, export_query_bundle
from groundrecall.models import (
    ArtifactRecord,
    ClaimRecord,
    ConceptRecord,
    ObservationRecord,
    ProvenanceRecord,
    RelationRecord,
    SourceRecord,
)
from groundrecall.store import GroundRecallStore


def _read_jsonl(path: Path) -> list[dict]:
    text = path.read_text(encoding="utf-8").strip()
    if not text:
        return []
    return [json.loads(line) for line in text.splitlines()]


def _seed_store(store: GroundRecallStore) -> None:
    store.save_source(SourceRecord(source_id="src_001", title="Source", current_status="promoted"))
    store.save_artifact(
        ArtifactRecord(
            artifact_id="ia_001",
            artifact_kind="compiled_page",
            title="Channel Capacity",
            path="wiki/channel-capacity.md",
            current_status="reviewed",
        )
    )
    store.save_observation(
        ObservationRecord(
            observation_id="obs_001",
            artifact_id="ia_001",
            role="claim",
            text="Reliable communication rate is bounded by channel capacity.",
            provenance=ProvenanceRecord(
                origin_artifact_id="ia_001",
                origin_path="wiki/channel-capacity.md",
                support_kind="derived_from_page",
                grounding_status="grounded",
            ),
            current_status="reviewed",
        )
    )
    store.save_concept(
        ConceptRecord(
            concept_id="concept::channel-capacity",
            title="Channel Capacity",
            description="Reliable communication limit.",
            source_artifact_ids=["ia_001"],
            current_status="promoted",
        )
    )
    store.save_concept(
        ConceptRecord(
            concept_id="concept::shannon-entropy",
            title="Shannon Entropy",
            description="Average uncertainty.",
            current_status="promoted",
        )
    )
    store.save_claim(
        ClaimRecord(
            claim_id="clm_001",
            claim_text="Channel capacity bounds reliable communication rate.",
            concept_ids=["concept::channel-capacity"],
            source_observation_ids=["obs_001"],
            confidence_hint=0.8,
            review_confidence=0.9,
            provenance=ProvenanceRecord(
                origin_artifact_id="ia_001",
                origin_path="wiki/channel-capacity.md",
                support_kind="derived_from_page",
                grounding_status="grounded",
            ),
            current_status="promoted",
        )
    )
    store.save_relation(
        RelationRecord(
            relation_id="rel_001",
            source_id="concept::channel-capacity",
            target_id="concept::shannon-entropy",
            relation_type="references",
            current_status="promoted",
        )
    )


def test_export_canonical_bundle_writes_expected_files(tmp_path: Path) -> None:
    store = GroundRecallStore(tmp_path / "groundrecall")
    _seed_store(store)

    out_dir = tmp_path / "exports"
    payload = export_canonical_bundle(
        store_dir=store.base_dir,
        out_dir=out_dir,
        concept_refs=["channel-capacity"],
        snapshot_id="snap_export_001",
    )

    assert (out_dir / "groundrecall_snapshot.json").exists()
    assert (out_dir / "claims.jsonl").exists()
    assert (out_dir / "concepts.jsonl").exists()
    assert (out_dir / "relations.jsonl").exists()
    assert (out_dir / "provenance_manifest.json").exists()
    assert (out_dir / "export_manifest.json").exists()
    assert (out_dir / "query_bundle__channel-capacity.json").exists()

    snapshot = json.loads((out_dir / "groundrecall_snapshot.json").read_text(encoding="utf-8"))
    manifest = json.loads((out_dir / "export_manifest.json").read_text(encoding="utf-8"))
    claims = _read_jsonl(out_dir / "claims.jsonl")
    assert snapshot["snapshot_id"] == "snap_export_001"
    assert manifest["export_kind"] == "canonical"
    assert len(manifest["query_bundles"]) == 1
    assert claims[0]["claim_id"] == "clm_001"
    assert payload["query_bundles"]


def test_export_query_bundle_is_assistant_neutral(tmp_path: Path) -> None:
    store = GroundRecallStore(tmp_path / "groundrecall")
    _seed_store(store)

    out_path = tmp_path / "bundle.json"
    payload = export_query_bundle(store.base_dir, "channel capacity", out_path)
    assert out_path.exists()
    assert payload["bundle_kind"] == "groundrecall_query_bundle"
    forbidden = {"assistant", "codex", "claude", "prompt_text"}
    assert set(payload).isdisjoint(forbidden)
@@ -0,0 +1,161 @@
from __future__ import annotations

import json
from pathlib import Path

from groundrecall.ingest import run_groundrecall_import
from groundrecall.lint import lint_import_directory


def _read_jsonl(path: Path) -> list[dict]:
    text = path.read_text(encoding="utf-8").strip()
    if not text:
        return []
    return [json.loads(line) for line in text.splitlines()]


def test_groundrecall_import_emits_normalized_artifacts(tmp_path: Path) -> None:
    root = tmp_path / "llmwiki"
    (root / "wiki").mkdir(parents=True)
    (root / "raw").mkdir()
    (root / "logs").mkdir()

    (root / "wiki" / "channel-capacity.md").write_text(
        "# Channel Capacity\n\n"
        "- Reliable rate upper bound for a noisy channel.\n\n"
        "See also [[Shannon Entropy]].\n",
        encoding="utf-8",
    )
    (root / "raw" / "notes.md").write_text(
        "Speculation: Capacity may depend on constraints.\n",
        encoding="utf-8",
    )
    (root / "logs" / "session.log").write_text(
        "Learner asked about entropy and communication limits.\n",
        encoding="utf-8",
    )

    result = run_groundrecall_import(root, mode="quick", import_id="import-test")

    assert result.out_dir == root / "imports" / "import-test"
    manifest = json.loads((result.out_dir / "manifest.json").read_text(encoding="utf-8"))
    assert manifest["source_repo_kind"] == "llmwiki"
    assert manifest["artifact_count"] == 3
    assert manifest["claim_count"] >= 1

    artifacts = _read_jsonl(result.out_dir / "artifacts.jsonl")
    assert {item["artifact_kind"] for item in artifacts} == {"compiled_page", "raw_note", "session_log"}

    claims = _read_jsonl(result.out_dir / "claims.jsonl")
    assert any("Reliable rate upper bound" in item["claim_text"] for item in claims)

    concepts = _read_jsonl(result.out_dir / "concepts.jsonl")
    concept_ids = {item["concept_id"] for item in concepts}
    assert "concept::channel-capacity" in concept_ids
    assert "concept::shannon-entropy" in concept_ids

    relations = _read_jsonl(result.out_dir / "relations.jsonl")
    assert any(item["target_id"] == "concept::shannon-entropy" for item in relations)

    lint_payload = json.loads((result.out_dir / "lint_findings.json").read_text(encoding="utf-8"))
    assert "summary" in lint_payload
    assert lint_payload["summary"]["warning_count"] >= 0

    review_queue = json.loads((result.out_dir / "review_queue.json").read_text(encoding="utf-8"))
    assert review_queue["queue_length"] >= 1
    assert any(item["candidate_type"] == "claim" for item in review_queue["items"])
    review_session = json.loads((result.out_dir / "review_session.json").read_text(encoding="utf-8"))
    assert review_session["reviewer"] == "GroundRecall Import"
    assert review_session["draft_pack"]["pack"]["source_import_id"] == "import-test"
    assert any(item["concept_id"] == "channel-capacity" for item in review_session["draft_pack"]["concepts"])
    review_data = json.loads((result.out_dir / "review_data.json").read_text(encoding="utf-8"))
    assert review_data["reviewer"] == "GroundRecall Import"
    assert "field_specs" in review_data
    assert any(item["field"] == "status" for item in review_data["field_specs"])
    assert "review_guidance" in review_data
    assert "concept_reviews" in review_data
    assert "citations" in review_data
    assert "citation_reviews" in review_data


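# The importer recognizes inline claim markup of the form
#   [claim_id: X] [supersedes: Y] [contradicts: Z]
# and resolves the references to clm_-prefixed ids, as exercised below.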
def test_groundrecall_import_parses_explicit_claim_relations(tmp_path: Path) -> None:
    root = tmp_path / "llmwiki"
    (root / "wiki").mkdir(parents=True)
    (root / "wiki" / "notes.md").write_text(
        "# Notes\n\n"
        "- [claim_id: base] Channel capacity bounds reliable communication rate.\n"
        "- [claim_id: revised] [supersedes: base] Channel capacity bounds reliable communication rate for a specified channel model.\n"
        "- [claim_id: dissent] [contradicts: revised] Channel capacity has no stable interpretation.\n",
        encoding="utf-8",
    )

    result = run_groundrecall_import(root, mode="quick", import_id="relations-test")
    claims = _read_jsonl(result.out_dir / "claims.jsonl")
    by_id = {item["claim_id"]: item for item in claims}

    assert "clm_base" in by_id
    assert by_id["clm_revised"]["supersedes_claim_ids"] == ["clm_base"]
    assert by_id["clm_dissent"]["contradicts_claim_ids"] == ["clm_revised"]

    lint_payload = json.loads((result.out_dir / "lint_findings.json").read_text(encoding="utf-8"))
    codes = {item["code"] for item in lint_payload["findings"]}
    assert "unresolved_supersession_ref" not in codes
    assert "unresolved_contradiction_ref" not in codes


def test_groundrecall_lint_flags_orphan_concepts_and_missing_targets(tmp_path: Path) -> None:
    root = tmp_path / "llmwiki"
    (root / "wiki").mkdir(parents=True)
    (root / "wiki" / "solo.md").write_text(
        "# Solo Concept\n",
        encoding="utf-8",
    )
    (root / "wiki" / "broken.md").write_text(
        "# Broken\n\nSee also [[Missing Concept]].\n",
        encoding="utf-8",
    )

    result = run_groundrecall_import(root, mode="quick", import_id="lint-test")
    lint_payload = json.loads((result.out_dir / "lint_findings.json").read_text(encoding="utf-8"))
    codes = {item["code"] for item in lint_payload["findings"]}
    assert "orphan_concept" in codes


def test_groundrecall_lint_detects_relation_missing_target(tmp_path: Path) -> None:
    import_dir = tmp_path / "imports" / "broken-import"
    import_dir.mkdir(parents=True)
    (import_dir / "manifest.json").write_text(
        json.dumps({"import_id": "broken-import", "import_mode": "quick"}),
        encoding="utf-8",
    )
    (import_dir / "artifacts.jsonl").write_text("", encoding="utf-8")
    (import_dir / "observations.jsonl").write_text("", encoding="utf-8")
    (import_dir / "claims.jsonl").write_text("", encoding="utf-8")
    (import_dir / "concepts.jsonl").write_text(
        json.dumps(
            {
                "concept_id": "concept::existing",
                "title": "Existing",
                "current_status": "triaged",
            }
        )
        + "\n",
        encoding="utf-8",
    )
    (import_dir / "relations.jsonl").write_text(
        json.dumps(
            {
                "relation_id": "rel_1",
                "source_id": "concept::existing",
                "target_id": "concept::missing",
                "relation_type": "references",
                "current_status": "draft",
            }
        )
        + "\n",
        encoding="utf-8",
    )

    payload = lint_import_directory(import_dir)
    codes = {item["code"] for item in payload["findings"]}
    assert "relation_missing_target" in codes
@@ -0,0 +1,70 @@
import sys
from pathlib import Path

from groundrecall.cli import main as groundrecall_cli_main
from groundrecall.export import export_canonical_bundle
from groundrecall.ingest import run_groundrecall_import
from groundrecall.inspect import inspect_store
from groundrecall.models import ClaimRecord
from groundrecall.query import query_concept
from groundrecall.store import GroundRecallStore
from groundrecall.lint import lint_import_directory
from groundrecall.promotion import promote_import_to_store


def _build_llmwiki_fixture(root: Path) -> Path:
    (root / "wiki").mkdir(parents=True)
    (root / "raw").mkdir()
    (root / "wiki" / "channel-capacity.md").write_text(
        "# Channel Capacity\n\n"
        "- Reliable rate upper bound for a noisy channel.\n\n"
        "See also [[Shannon Entropy]].\n",
        encoding="utf-8",
    )
    (root / "raw" / "notes.md").write_text(
        "Speculation: Capacity may depend on constraints.\n",
        encoding="utf-8",
    )
    return root


def test_groundrecall_namespace_reexports_core_functions() -> None:
    assert run_groundrecall_import.__module__ == "groundrecall.ingest"
    assert query_concept.__module__ == "groundrecall.query"
    assert export_canonical_bundle.__module__ == "groundrecall.export"
    assert lint_import_directory.__module__ == "groundrecall.lint"
    assert promote_import_to_store.__module__ == "groundrecall.promotion"
    assert GroundRecallStore.__module__ == "groundrecall.store"
    assert ClaimRecord.__module__ == "groundrecall.models"


def test_groundrecall_inspect_summarizes_store(tmp_path: Path) -> None:
    source_root = _build_llmwiki_fixture(tmp_path / "llmwiki")
    import_result = run_groundrecall_import(source_root, out_root=tmp_path / "imports", mode="quick", import_id="fixture-import")
    store_dir = tmp_path / "store"
    promote_import_to_store(import_result.out_dir, store_dir)

    payload = inspect_store(store_dir, out_path=tmp_path / "inspect.json")

    assert (tmp_path / "inspect.json").exists()
    assert payload["claim_count"] >= 1
    assert payload["concept_count"] >= 1
    assert payload["snapshot_count"] >= 1


def test_groundrecall_cli_inspect_dispatches(tmp_path: Path, capsys) -> None:
    source_root = _build_llmwiki_fixture(tmp_path / "llmwiki")
    import_result = run_groundrecall_import(source_root, out_root=tmp_path / "imports", mode="quick", import_id="fixture-import")
    store_dir = tmp_path / "store"
    promote_import_to_store(import_result.out_dir, store_dir)

    original_argv = sys.argv
    try:
        sys.argv = ["groundrecall.cli", "inspect", str(store_dir)]
        groundrecall_cli_main()
    finally:
        sys.argv = original_argv

    output = capsys.readouterr().out
    assert '"claim_count"' in output
    assert '"concept_count"' in output
@@ -0,0 +1,96 @@
from __future__ import annotations

import json
from pathlib import Path

from groundrecall.ingest import run_groundrecall_import
from groundrecall.promotion import promote_import_to_store
from groundrecall.store import GroundRecallStore


def test_groundrecall_promotion_writes_canonical_objects(tmp_path: Path) -> None:
    root = tmp_path / "llmwiki"
    (root / "wiki").mkdir(parents=True)
    (root / "wiki" / "channel-capacity.md").write_text(
        "# Channel Capacity\n\n"
        "- Reliable rate upper bound for a noisy channel.\n\n"
        "See also [[Shannon Entropy]].\n",
        encoding="utf-8",
    )

    result = run_groundrecall_import(root, mode="quick", import_id="promote-test")
    review_path = result.out_dir / "review_session.json"
    review_payload = json.loads(review_path.read_text(encoding="utf-8"))
    for concept in review_payload["draft_pack"]["concepts"]:
        concept["status"] = "trusted"
    review_path.write_text(json.dumps(review_payload, indent=2), encoding="utf-8")

    store_dir = tmp_path / "groundrecall-store"
    payload = promote_import_to_store(result.out_dir, store_dir, reviewer="R")

    store = GroundRecallStore(store_dir)
    concepts = store.list_concepts()
    claims = store.list_claims()
    relations = store.list_relations()
    promotions = store.list_promotions()
    snapshots = store.list_snapshots()

    assert payload["promoted_concept_count"] >= 1
    assert payload["promoted_claim_count"] >= 1
    assert len(concepts) >= 2
    assert any(item.current_status == "promoted" for item in concepts)
    assert any(item.current_status == "promoted" for item in claims)
    assert len(relations) >= 1
    assert len(promotions) == 1
    assert promotions[0].reviewer == "R"
    assert len(snapshots) == 1
    assert snapshots[0].metadata["source_import_id"] == "promote-test"


def test_groundrecall_promotion_respects_rejected_review_status(tmp_path: Path) -> None:
    root = tmp_path / "llmwiki"
    (root / "wiki").mkdir(parents=True)
    (root / "wiki" / "solo.md").write_text(
        "# Solo Concept\n\n- A solitary claim.\n",
        encoding="utf-8",
    )

    result = run_groundrecall_import(root, mode="quick", import_id="reject-test")
    review_path = result.out_dir / "review_session.json"
    review_payload = json.loads(review_path.read_text(encoding="utf-8"))
    review_payload["draft_pack"]["concepts"][0]["status"] = "rejected"
    review_path.write_text(json.dumps(review_payload, indent=2), encoding="utf-8")

    store_dir = tmp_path / "groundrecall-store"
    promote_import_to_store(result.out_dir, store_dir, reviewer="R")

    store = GroundRecallStore(store_dir)
    assert store.list_concepts()[0].current_status == "rejected"
    assert store.list_claims()[0].current_status == "rejected"


def test_groundrecall_promotion_preserves_contradiction_and_supersession_links(tmp_path: Path) -> None:
    root = tmp_path / "llmwiki"
    (root / "wiki").mkdir(parents=True)
    (root / "wiki" / "notes.md").write_text(
        "# Notes\n\n"
        "- [claim_id: base] Channel capacity bounds reliable communication rate.\n"
        "- [claim_id: revised] [supersedes: base] Channel capacity bounds reliable communication rate for a specified channel model.\n"
        "- [claim_id: dissent] [contradicts: revised] Channel capacity has no stable interpretation.\n",
        encoding="utf-8",
    )

    result = run_groundrecall_import(root, mode="quick", import_id="graph-test")
    review_path = result.out_dir / "review_session.json"
    review_payload = json.loads(review_path.read_text(encoding="utf-8"))
    for concept in review_payload["draft_pack"]["concepts"]:
        concept["status"] = "trusted"
    review_path.write_text(json.dumps(review_payload, indent=2), encoding="utf-8")

    store_dir = tmp_path / "groundrecall-store"
    promote_import_to_store(result.out_dir, store_dir, reviewer="R")

    store = GroundRecallStore(store_dir)
    claims = {item.claim_id: item for item in store.list_claims()}
    assert claims["clm_revised"].supersedes_claim_ids == ["clm_base"]
    assert claims["clm_dissent"].contradicts_claim_ids == ["clm_revised"]
@@ -0,0 +1,190 @@
from __future__ import annotations

from pathlib import Path

from groundrecall.models import (
    ArtifactRecord,
    ClaimRecord,
    ConceptRecord,
    ObservationRecord,
    ProvenanceRecord,
    RelationRecord,
)
from groundrecall.query import (
    build_query_bundle_for_concept,
    query_concept,
    query_provenance,
    search_claims,
)
from groundrecall.store import GroundRecallStore


def _seed_store(store: GroundRecallStore) -> None:
    store.save_artifact(
        ArtifactRecord(
            artifact_id="ia_001",
            artifact_kind="compiled_page",
            title="Channel Capacity",
            path="wiki/channel-capacity.md",
            current_status="reviewed",
        )
    )
    store.save_observation(
        ObservationRecord(
            observation_id="obs_001",
            artifact_id="ia_001",
            role="claim",
            text="Reliable communication rate is bounded by channel capacity.",
            provenance=ProvenanceRecord(
                origin_artifact_id="ia_001",
                origin_path="wiki/channel-capacity.md",
                support_kind="derived_from_page",
                grounding_status="grounded",
            ),
            current_status="reviewed",
        )
    )
    store.save_concept(
        ConceptRecord(
            concept_id="concept::channel-capacity",
            title="Channel Capacity",
            description="Reliable communication limit.",
            source_artifact_ids=["ia_001"],
            current_status="promoted",
        )
    )
    store.save_concept(
        ConceptRecord(
            concept_id="concept::shannon-entropy",
            title="Shannon Entropy",
            description="Average uncertainty.",
            current_status="promoted",
        )
    )
    store.save_claim(
        ClaimRecord(
            claim_id="clm_001",
            claim_text="Channel capacity bounds reliable communication rate.",
            concept_ids=["concept::channel-capacity"],
            source_observation_ids=["obs_001"],
            confidence_hint=0.8,
            review_confidence=0.9,
            provenance=ProvenanceRecord(
                origin_artifact_id="ia_001",
                origin_path="wiki/channel-capacity.md",
                support_kind="derived_from_page",
                grounding_status="grounded",
            ),
            current_status="promoted",
        )
    )
    store.save_claim(
        ClaimRecord(
            claim_id="clm_002",
            claim_text="Shannon entropy can inform channel coding intuition.",
            concept_ids=["concept::shannon-entropy"],
            contradicts_claim_ids=["clm_999"],
            provenance=ProvenanceRecord(
                origin_artifact_id="ia_001",
                origin_path="wiki/channel-capacity.md",
                support_kind="derived_from_page",
                grounding_status="partially_grounded",
            ),
            current_status="reviewed",
        )
    )
    store.save_relation(
        RelationRecord(
            relation_id="rel_001",
            source_id="concept::channel-capacity",
            target_id="concept::shannon-entropy",
            relation_type="references",
            current_status="promoted",
        )
    )


def test_query_concept_returns_neighborhood_and_support(tmp_path: Path) -> None:
    store = GroundRecallStore(tmp_path / "groundrecall")
    _seed_store(store)

    payload = query_concept(store.base_dir, "channel-capacity")
    assert payload is not None
    assert payload["concept"]["concept_id"] == "concept::channel-capacity"
    assert len(payload["claims"]) == 1
    assert len(payload["relations"]) == 1
    assert any(item["concept_id"] == "concept::shannon-entropy" for item in payload["related_concepts"])
    assert payload["supporting_observations"][0]["origin_path"] == "wiki/channel-capacity.md"


def test_search_claims_matches_text_and_concept_titles(tmp_path: Path) -> None:
    store = GroundRecallStore(tmp_path / "groundrecall")
    _seed_store(store)

    payload = search_claims(store.base_dir, "entropy")
    assert payload["query_type"] == "claim_search"
    assert any(match["claim"]["claim_id"] == "clm_002" for match in payload["matches"])


def test_query_provenance_filters_by_origin_path(tmp_path: Path) -> None:
    store = GroundRecallStore(tmp_path / "groundrecall")
    _seed_store(store)

    payload = query_provenance(store.base_dir, origin_path="wiki/channel-capacity.md")
    assert len(payload["claims"]) == 2
    assert len(payload["observations"]) == 1


def test_build_query_bundle_for_concept_is_assistant_neutral(tmp_path: Path) -> None:
    store = GroundRecallStore(tmp_path / "groundrecall")
    _seed_store(store)

    payload = build_query_bundle_for_concept(store.base_dir, "channel capacity")
    assert payload is not None
    assert payload["bundle_kind"] == "groundrecall_query_bundle"
    assert payload["concept"]["concept_id"] == "concept::channel-capacity"
    assert isinstance(payload["suggested_next_actions"], list)
    forbidden = {"assistant", "codex", "claude", "prompt_text"}
    assert set(payload).isdisjoint(forbidden)


def test_query_bundle_surfaces_contradictions_and_supersessions(tmp_path: Path) -> None:
    store = GroundRecallStore(tmp_path / "groundrecall")
    _seed_store(store)
    store.save_claim(
        ClaimRecord(
            claim_id="clm_003",
            claim_text="Channel capacity is undefined in practice.",
            concept_ids=["concept::channel-capacity"],
            contradicts_claim_ids=["clm_001"],
            provenance=ProvenanceRecord(
                origin_artifact_id="ia_001",
                origin_path="wiki/channel-capacity.md",
                support_kind="derived_from_page",
                grounding_status="partially_grounded",
            ),
            current_status="reviewed",
        )
    )
    store.save_claim(
        ClaimRecord(
            claim_id="clm_004",
            claim_text="Channel capacity should be interpreted relative to a specific channel model.",
            concept_ids=["concept::channel-capacity"],
            supersedes_claim_ids=["clm_001"],
            provenance=ProvenanceRecord(
                origin_artifact_id="ia_001",
                origin_path="wiki/channel-capacity.md",
                support_kind="derived_from_page",
                grounding_status="grounded",
            ),
            current_status="reviewed",
        )
    )

    payload = build_query_bundle_for_concept(store.base_dir, "channel-capacity")
    assert payload is not None
    contradiction_ids = {item["claim_id"] for item in payload["contradictions"]}
    supersession_ids = {item["claim_id"] for item in payload["supersessions"]}
    assert "clm_003" in contradiction_ids
    assert "clm_004" in supersession_ids
@@ -0,0 +1,60 @@
from __future__ import annotations

from types import SimpleNamespace

from groundrecall.review_server import _safe_show_entry, _safe_verify_entry


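# Test doubles: this API raises from show_entry and verify_bibtex on purpose so
# the tests below exercise the _safe_* fallbacks against the store-level methods.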
class _StoreWithoutConflicts:
    def get_entry(self, citation_key: str):
        if citation_key != "baum1974generalized":
            return None
        return {"citation_key": citation_key, "title": "On two types of deviation"}

    def get_field_provenance(self, citation_key: str):
        return [{"field_name": "title", "source_label": "refs.bib"}]

    def get_entry_bibtex(self, citation_key: str):
        return "@article{baum1974generalized, title={On two types of deviation}}"


class _ApiWithPartialSupport:
    def __init__(self):
        self.store = _StoreWithoutConflicts()

    def show_entry(self, citation_key: str, **kwargs):
        raise AttributeError("get_conflicts missing in underlying store")

    def verify_bibtex(self, bibtex_text: str, *, context: str = "", limit: int = 5):
        raise RuntimeError("pybtex unavailable")

    def verify_strings(self, values: list[str], *, context: str = "", limit: int = 5):
        return {"context": context, "results": [{"values": values, "limit": limit}]}


def test_safe_show_entry_falls_back_when_citegeist_show_entry_is_incompatible() -> None:
    api = _ApiWithPartialSupport()

    payload = _safe_show_entry(api, "baum1974generalized")

    assert payload is not None
    assert payload["citation_key"] == "baum1974generalized"
    assert payload["conflicts"] == []
    assert payload["provenance"][0]["source_label"] == "refs.bib"
    assert "bibtex" in payload


def test_safe_verify_entry_falls_back_to_verify_strings() -> None:
    api = _ApiWithPartialSupport()
    entry = SimpleNamespace(
        citation_key="baum1974generalized",
        title="On two types of deviation",
        author="W. M. Baum",
        year="1974",
        raw_bibtex="@article{baum1974generalized, title={On two types of deviation}}",
    )

    payload = _safe_verify_entry(api, entry, context="pieces/intro.tex Intro", limit=5)

    assert payload["results"]
    assert payload["results"][0]["values"][0] == "baum1974generalized"
@@ -0,0 +1,86 @@
from __future__ import annotations

import json
from pathlib import Path

from groundrecall.ingest import run_groundrecall_import
from groundrecall.review_workspace import GroundRecallReviewWorkspace


def _build_citation_fixture(root: Path) -> Path:
    (root / "wiki").mkdir(parents=True)
    (root / "wiki" / "learning-theory.md").write_text(
        "# Learning Theory\n\n"
        "Matching-law style regularities can be compared with machine learning optimization.\n\n"
        "See \\\\cite{herrnstein1961matching} for the classic framing.\n",
        encoding="utf-8",
    )
    return root


def test_review_workspace_populates_and_persists_citation_reviews(tmp_path: Path) -> None:
    source_root = _build_citation_fixture(tmp_path / "llmwiki")
    import_result = run_groundrecall_import(source_root, out_root=tmp_path / "imports", mode="quick", import_id="review-fixture")

    workspace = GroundRecallReviewWorkspace(import_result.out_dir)
    payload = workspace.load_review_data()
    assert payload["citation_reviews"]
    citation_review_id = payload["citation_reviews"][0]["citation_review_id"]

    workspace.apply_updates(
        concept_updates=[
            {
                "concept_id": "learning-theory",
                "status": "trusted",
                "notes": ["Strong framing concept.", "Citation support looks plausible."],
            }
        ],
        citation_updates=[
            {
                "citation_review_id": citation_review_id,
                "status": "verified",
                "notes": ["Classic matching-law citation."],
            }
        ],
        reviewer="Unit Test Reviewer",
    )

    session = json.loads((import_result.out_dir / "review_session.json").read_text(encoding="utf-8"))
    concept = next(item for item in session["draft_pack"]["concepts"] if item["concept_id"] == "learning-theory")
    citation = next(item for item in session["citation_reviews"] if item["citation_review_id"] == citation_review_id)

    assert session["reviewer"] == "Unit Test Reviewer"
    assert concept["status"] == "trusted"
    assert citation["status"] == "verified"

    review_data = json.loads((import_result.out_dir / "review_data.json").read_text(encoding="utf-8"))
    assert any(item["citation_review_id"] == citation_review_id for item in review_data["citation_reviews"])


def test_review_workspace_resolves_citation_metadata_from_bibtex(tmp_path: Path) -> None:
    root = tmp_path / "llmwiki"
    (root / "wiki").mkdir(parents=True)
    (root / "wiki" / "matching.md").write_text(
        "# Matching\n\n"
        "The manuscript cites \\\\cite{baum1974generalized} here.\n",
        encoding="utf-8",
    )
    (root / "refs.bib").write_text(
        "@article{baum1974generalized,\n"
        " author = {W. M. Baum},\n"
        " title = {On two types of deviation from the matching law: Bias and undermatching},\n"
        " journal = {Journal of the Experimental Analysis of Behavior},\n"
        " year = {1974}\n"
        "}\n",
        encoding="utf-8",
    )

    import_result = run_groundrecall_import(root, out_root=tmp_path / "imports", mode="quick", import_id="bib-fixture")
    workspace = GroundRecallReviewWorkspace(import_result.out_dir)
    payload = workspace.load_review_data()

    entry = next(item for item in payload["citation_reviews"] if item["citation_key"] == "baum1974generalized")
    assert entry["title"] == "On two types of deviation from the matching law: Bias and undermatching"
    assert entry["source_bib_path"] == "refs.bib"
    assert entry["raw_bibtex"]
    assert payload["bibliography"]["entry_count"] >= 1
@ -0,0 +1,235 @@
|
|||
from __future__ import annotations

from pathlib import Path

import groundrecall.ingest as ingest_module
import groundrecall.source_adapters  # noqa: F401
from groundrecall.source_adapters.base import detect_source_adapter, list_source_adapters
from groundrecall.ingest import run_groundrecall_import


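# Every built-in adapter family should be discoverable through the registry.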
def test_groundrecall_source_adapter_registry_lists_expected_adapters() -> None:
    names = set(list_source_adapters())
    assert "llmwiki" in names
    assert "polypaper" in names
    assert "markdown_notes" in names
    assert "transcript" in names
    assert "didactopus_pack" in names
    assert "doclift_bundle" in names


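# Detection tests: each adapter is recognized from the marker files or
# directories that characterize its source layout.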
def test_detect_llmwiki_adapter(tmp_path: Path) -> None:
    (tmp_path / "wiki").mkdir()
    adapter = detect_source_adapter(tmp_path)
    assert adapter.name == "llmwiki"
    assert adapter.import_intent() == "grounded_knowledge"


def test_detect_didactopus_pack_adapter(tmp_path: Path) -> None:
    (tmp_path / "pack.yaml").write_text("name: p\n", encoding="utf-8")
    (tmp_path / "concepts.yaml").write_text("concepts: []\n", encoding="utf-8")
    adapter = detect_source_adapter(tmp_path)
    assert adapter.name == "didactopus_pack"
    assert adapter.import_intent() == "both"


def test_detect_doclift_bundle_adapter(tmp_path: Path) -> None:
    (tmp_path / "documents").mkdir()
    (tmp_path / "manifest.json").write_text('{"documents": []}\n', encoding="utf-8")
    adapter = detect_source_adapter(tmp_path)
    assert adapter.name == "doclift_bundle"
    assert adapter.import_intent() == "both"


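# The import manifest should record which adapter handled the source and the
# import intent that adapter declares.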
def test_groundrecall_import_records_adapter_and_intent(tmp_path: Path) -> None:
    (tmp_path / "wiki").mkdir()
    (tmp_path / "wiki" / "note.md").write_text("# Title\n\n- A note.\n", encoding="utf-8")
    result = run_groundrecall_import(tmp_path, mode="quick", import_id="adapter-test")
    assert result.manifest["source_adapter"] == "llmwiki"
    assert result.manifest["import_intent"] == "grounded_knowledge"


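# The markdown_notes adapter also picks up loose .tex files and still yields
# artifacts and claims from them.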
def test_markdown_notes_adapter_ingests_tex_files(tmp_path: Path) -> None:
    (tmp_path / "draft.tex").write_text(
        "\\section{Related Work}\n\n"
        "We connect behaviorism and language models.\n",
        encoding="utf-8",
    )

    adapter = detect_source_adapter(tmp_path)
    assert adapter.name == "markdown_notes"

    result = run_groundrecall_import(tmp_path, mode="quick", import_id="tex-test")
    assert result.manifest["source_adapter"] == "markdown_notes"
    assert result.manifest["artifact_count"] == 1
    assert result.artifacts[0]["path"] == "draft.tex"
    assert result.claims


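# When a TeX-to-Markdown converter is available, the import should consume its
# Markdown output instead of the raw-TeX fallback. The conversion hook is
# stubbed via monkeypatch, so no real pandoc install is needed.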
def test_tex_import_uses_pandoc_markdown_when_available(tmp_path: Path, monkeypatch) -> None:
    (tmp_path / "draft.tex").write_text(
        "\\section{Ignored by fallback}\n"
        "\\usepackage{amsmath}\n",
        encoding="utf-8",
    )

    monkeypatch.setattr(
        ingest_module,
        "_convert_tex_to_markdown",
        lambda path: "# Converted Draft\n\n- Converted claim from pandoc.\n",
    )

    result = run_groundrecall_import(tmp_path, mode="quick", import_id="tex-pandoc-test")
    claim_texts = [item["claim_text"] for item in result.claims]
    concept_ids = {item["concept_id"] for item in result.concepts}

    assert "Converted claim from pandoc." in claim_texts
    assert "concept::converted-draft" in concept_ids


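# The polypaper adapter should ingest only the prose pieces referenced from
# main.tex, excluding the root file itself, table/figure fragments, venue
# setup files, unreferenced pieces, and temporary export files.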
def test_detect_polypaper_adapter_and_exclude_support_files(tmp_path: Path) -> None:
    (tmp_path / "pieces").mkdir()
    (tmp_path / "figs").mkdir()
    (tmp_path / "setup").mkdir()
    (tmp_path / "main.tex").write_text(
        "\\include{pieces/discussion}\n"
        "\\include{pieces/table-results}\n"
        "\\input{figs/figure-system}\n",
        encoding="utf-8",
    )
    (tmp_path / "paper.org").write_text("* draft\n", encoding="utf-8")
    (tmp_path / "pieces" / "discussion.tex").write_text("\\section{Discussion}\n\nMore text.\n", encoding="utf-8")
    (tmp_path / "pieces" / "table-results.tex").write_text("\\begin{tabular}x\\end{tabular}\n", encoding="utf-8")
    (tmp_path / "pieces" / "unused.tex").write_text("\\section{Unused}\n\nIgnore me.\n", encoding="utf-8")
    (tmp_path / "figs" / "figure-system.tex").write_text("\\begin{figure}x\\end{figure}\n", encoding="utf-8")
    (tmp_path / "setup" / "venue-arxiv.tex").write_text("\\section{Setup}\n", encoding="utf-8")
    (tmp_path / ".pp-export-tmp.tex").write_text("\\section{Tmp}\n", encoding="utf-8")

    adapter = detect_source_adapter(tmp_path)
    assert adapter.name == "polypaper"

    result = run_groundrecall_import(tmp_path, mode="quick", import_id="polypaper-test")
    paths = {item["path"] for item in result.artifacts}
    assert "main.tex" not in paths
    assert "pieces/discussion.tex" in paths
    assert "pieces/table-results.tex" not in paths
    assert "figs/figure-system.tex" not in paths
    assert "pieces/unused.tex" not in paths
    assert "setup/venue-arxiv.tex" not in paths
    assert ".pp-export-tmp.tex" not in paths


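# Table rows, HTML remnants, and raw TeX environments in converted Markdown
# should be filtered out, leaving only genuine manuscript claims.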
def test_tex_import_skips_table_and_figure_markup_from_pandoc(tmp_path: Path, monkeypatch) -> None:
    (tmp_path / "draft.tex").write_text("\\section{Draft}\n", encoding="utf-8")

    monkeypatch.setattr(
        ingest_module,
        "_convert_tex_to_markdown",
        lambda path: "\n".join(
            [
                "# Draft",
                "",
                "",
                "| Col A | Col B |",
                "| --- | --- |",
                "| 1 | 2 |",
                "</div>",
                "\\begin{tabular}{ll}",
                "- Real manuscript claim.",
            ]
        ),
    )

    result = run_groundrecall_import(tmp_path, mode="quick", import_id="tex-cleanup-test")
    claim_texts = [item["claim_text"] for item in result.claims]

    assert claim_texts == ["Real manuscript claim."]


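# A didactopus pack should map concepts to concept records, prerequisites to
# relations, and pack/stage entries to deterministic claim ids.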
def test_didactopus_pack_import_generates_structured_concepts_and_relations(tmp_path: Path) -> None:
    (tmp_path / "pack.yaml").write_text(
        "\n".join(
            [
                "name: sample-pack",
                "display_name: Sample Pack",
                "version: 0.1.0",
                "schema_version: 0.1.0",
                "didactopus_min_version: 0.1.0",
                "didactopus_max_version: 9.9.9",
            ]
        ),
        encoding="utf-8",
    )
    (tmp_path / "concepts.yaml").write_text(
        "\n".join(
            [
                "concepts:",
                "  - id: basics",
                "    title: Basics",
                "    description: Foundational concept.",
                "    mastery_signals: [Explain the foundation.]",
                "  - id: advanced",
                "    title: Advanced",
                "    description: Builds on basics.",
                "    prerequisites: [basics]",
            ]
        ),
        encoding="utf-8",
    )
    (tmp_path / "roadmap.yaml").write_text(
        "\n".join(
            [
                "stages:",
                "  - id: stage1",
                "    title: Stage One",
                "    concepts: [basics, advanced]",
            ]
        ),
        encoding="utf-8",
    )

    result = run_groundrecall_import(tmp_path, mode="quick", import_id="pack-test")
    assert result.manifest["source_adapter"] == "didactopus_pack"
    assert result.manifest["import_intent"] == "both"
    concept_ids = {item["concept_id"] for item in result.concepts}
    assert "concept::basics" in concept_ids
    assert "concept::advanced" in concept_ids
    relation_targets = {(item["source_id"], item["target_id"], item["relation_type"]) for item in result.relations}
    assert ("concept::basics", "concept::advanced", "prerequisite") in relation_targets
    claim_ids = {item["claim_id"] for item in result.claims}
    assert "clm_pack_basics" in claim_ids
    assert "clm_stage_stage1_basics" in claim_ids


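# A doclift bundle manifest should yield one concept per document plus claims
# derived from the document body.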
def test_doclift_bundle_import_generates_structured_concepts(tmp_path: Path) -> None:
    doc_dir = tmp_path / "documents" / "lesson-a"
    doc_dir.mkdir(parents=True)
    (tmp_path / "manifest.json").write_text(
        "\n".join(
            [
                "{",
                '  "documents": [',
                "    {",
                '      "document_id": "lesson-a",',
                '      "title": "Lecture 1. Example",',
                '      "document_kind": "lecture",',
                f'      "output_dir": "{doc_dir}",',
                f'      "markdown_path": "{doc_dir / "document.md"}",',
                f'      "figures_path": "{doc_dir / "document.figures.json"}"',
                "    }",
                "  ]",
                "}",
            ]
        ),
        encoding="utf-8",
    )
    (doc_dir / "document.md").write_text("# Lecture 1. Example\n\nBody.\n", encoding="utf-8")
    (doc_dir / "document.figures.json").write_text('{"source_path": "/tmp/source.doc"}\n', encoding="utf-8")

    result = run_groundrecall_import(tmp_path, mode="quick", import_id="doclift-test")
    assert result.manifest["source_adapter"] == "doclift_bundle"
    assert result.manifest["import_intent"] == "both"
    concept_ids = {item["concept_id"] for item in result.concepts}
    assert "concept::lesson-a" in concept_ids
    claim_ids = {item["claim_id"] for item in result.claims}
    assert "clm_doclift_1" in claim_ids

@@ -0,0 +1,148 @@
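# Tests for the canonical GroundRecallStore: round-tripping core objects,
# snapshot building, assistant neutrality, and atomic JSON writes.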
from __future__ import annotations

import json
from pathlib import Path

from groundrecall.models import (
    ClaimRecord,
    ConceptRecord,
    GroundRecallSnapshot,
    PromotionRecord,
    ProvenanceRecord,
    RelationRecord,
    ReviewCandidateRecord,
    SourceRecord,
)
from groundrecall.store import GroundRecallStore


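# Every canonical object family should survive a save/get round trip.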
def test_groundrecall_store_round_trips_canonical_objects(tmp_path: Path) -> None:
    store = GroundRecallStore(tmp_path / "groundrecall")

    source = store.save_source(
        SourceRecord(
            source_id="src_001",
            title="Channel Notes",
            source_type="markdown",
            path="wiki/channel-capacity.md",
            current_status="promoted",
        )
    )
    claim = store.save_claim(
        ClaimRecord(
            claim_id="clm_001",
            claim_text="Channel capacity bounds reliable communication rate.",
            claim_kind="definition",
            concept_ids=["concept::channel-capacity"],
            confidence_hint=0.72,
            provenance=ProvenanceRecord(
                origin_artifact_id="ia_001",
                origin_path="wiki/channel-capacity.md",
                support_kind="derived_from_page",
                grounding_status="partially_grounded",
            ),
            current_status="reviewed",
        )
    )
    concept = store.save_concept(
        ConceptRecord(
            concept_id="concept::channel-capacity",
            title="Channel Capacity",
            description="Imported concept.",
            current_status="promoted",
        )
    )
    relation = store.save_relation(
        RelationRecord(
            relation_id="rel_001",
            source_id="concept::channel-capacity",
            target_id="concept::shannon-entropy",
            relation_type="references",
            current_status="draft",
        )
    )
    review_candidate = store.save_review_candidate(
        ReviewCandidateRecord(
            review_candidate_id="rc_001",
            candidate_type="claim",
            candidate_id="clm_001",
            triage_lane="knowledge_capture",
            priority=10,
            current_status="triaged",
        )
    )
    promotion = store.save_promotion(
        PromotionRecord(
            promotion_id="pr_001",
            candidate_type="claim",
            candidate_id="clm_001",
            reviewer="R",
            promoted_object_ids=["clm_001"],
            promoted_at="2026-04-17T12:00:00Z",
        )
    )

    assert store.get_source(source.source_id) is not None
    assert store.get_claim(claim.claim_id) is not None
    assert store.get_concept(concept.concept_id) is not None
    assert store.get_relation(relation.relation_id) is not None
    assert store.get_review_candidate(review_candidate.review_candidate_id) is not None
    assert store.get_promotion(promotion.promotion_id) is not None


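# Snapshots should capture promoted state and reload with metadata intact.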
def test_groundrecall_store_builds_and_persists_snapshot(tmp_path: Path) -> None:
    store = GroundRecallStore(tmp_path / "groundrecall")
    store.save_source(SourceRecord(source_id="src_001", title="T", current_status="promoted"))
    store.save_claim(
        ClaimRecord(
            claim_id="clm_001",
            claim_text="A grounded claim.",
            concept_ids=["concept::c1"],
            current_status="promoted",
        )
    )
    store.save_concept(ConceptRecord(concept_id="concept::c1", title="C1", current_status="promoted"))

    snapshot = store.build_snapshot(
        snapshot_id="snap_001",
        created_at="2026-04-17T12:00:00Z",
        metadata={"export_kind": "canonical"},
    )
    saved = store.save_snapshot(snapshot)

    loaded = store.get_snapshot(saved.snapshot_id)
    assert loaded is not None
    assert isinstance(loaded, GroundRecallSnapshot)
    assert loaded.metadata["export_kind"] == "canonical"
    assert len(loaded.sources) == 1
    assert len(loaded.claims) == 1
    assert len(loaded.concepts) == 1


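# Guardrail: core models must stay assistant-neutral, so no assistant-specific
# field names may appear on the record schemas.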
def test_groundrecall_models_remain_assistant_neutral() -> None:
    claim_fields = set(ClaimRecord.model_fields)
    concept_fields = set(ConceptRecord.model_fields)
    snapshot_fields = set(GroundRecallSnapshot.model_fields)
    forbidden = {"assistant", "assistant_name", "codex", "claude", "skill_bundle", "prompt_text"}

    assert claim_fields.isdisjoint(forbidden)
    assert concept_fields.isdisjoint(forbidden)
    assert snapshot_fields.isdisjoint(forbidden)


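# Atomic persistence: the written JSON must be valid and no *.tmp files may be
# left behind in the claims directory.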
def test_groundrecall_store_writes_json_atomically_without_tmp_artifacts(tmp_path: Path) -> None:
    store = GroundRecallStore(tmp_path / "groundrecall")

    claim = store.save_claim(
        ClaimRecord(
            claim_id="clm_atomic",
            claim_text="Atomic writes should leave valid JSON on disk.",
            concept_ids=["concept::atomicity"],
            current_status="reviewed",
        )
    )

    claim_path = store.claims_dir / f"{claim.claim_id}.json"
    payload = json.loads(claim_path.read_text(encoding="utf-8"))
    assert payload["claim_id"] == "clm_atomic"
    assert list(store.claims_dir.glob("*.tmp")) == []