# Research Packet Schema (Normative)
A **Research Packet** is the only permitted format for data flowing from FETCH to CORE.
All packet content is treated as **untrusted data**. The packet is designed to:
- preserve provenance (where it came from)
- prevent instruction smuggling
- constrain content into predictable sections
- support deterministic validation and quarantining

Packets that do not conform MUST be quarantined.
---
## File Naming
Recommended:
- `RP-YYYYMMDD-HHMMSSZ-<slug>.md`
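
The recommended pattern can be checked mechanically. The following is a minimal sketch, not part of the schema: the helper name `is_recommended_name` is hypothetical, and the slug alphabet (lowercase letters, digits, single hyphens) is an assumption, since the schema does not define what `<slug>` may contain.

```python
import re
from datetime import datetime

# Assumed slug alphabet: lowercase letters, digits, and single hyphens.
PACKET_NAME_RE = re.compile(
    r"RP-(\d{8})-(\d{6})Z-([a-z0-9]+(?:-[a-z0-9]+)*)\.md"
)


def is_recommended_name(filename: str) -> bool:
    """Return True if filename matches RP-YYYYMMDD-HHMMSSZ-<slug>.md."""
    match = PACKET_NAME_RE.fullmatch(filename)
    if match is None:
        return False
    date, time = match.group(1), match.group(2)
    try:
        # Constructing a datetime rejects impossible values such as month 13.
        datetime(
            int(date[:4]), int(date[4:6]), int(date[6:]),
            int(time[:2]), int(time[2:4]), int(time[4:]),
        )
    except ValueError:
        return False
    return True
```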
---
## Required Front Matter
Research Packets MUST begin with YAML front matter:
```yaml
---
packet_type: research_packet
schema_version: 1
packet_id: "RP-20260209-153012Z-arxiv-llm-security"
created_utc: "2026-02-09T15:30:12Z"
source_kind: "arxiv|pubmed|crossref|europepmc|doi|url|manual"
source_ref: "https://... or DOI or PMID"
title: "..."
authors: ["Last, First", "..."]
published_date: "YYYY-MM-DD" # if known
retrieved_utc: "YYYY-MM-DDTHH:MM:SSZ"
license: "open|unknown|restricted"
content_hashes:
  body_sha256: "hex..."
  sources_sha256: "hex..."
---
```
Notes:
* `license` is informational; CORE must still treat the packet as untrusted regardless of its value.
* `content_hashes` support auditability and tamper detection.
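
A validator might start by confirming that the front matter is present and carries the expected keys. The sketch below is one possible approach under stated assumptions: it reads only top-level `key: value` lines rather than parsing full YAML, it treats `published_date` as optional because the example marks it "if known", and the function name `front_matter_problems` is hypothetical.

```python
import re

REQUIRED_KEYS = (
    "packet_type", "schema_version", "packet_id", "created_utc",
    "source_kind", "source_ref", "title", "authors",
    "retrieved_utc", "license", "content_hashes",
)
ALLOWED_SOURCE_KINDS = {
    "arxiv", "pubmed", "crossref", "europepmc", "doi", "url", "manual",
}
ALLOWED_LICENSES = {"open", "unknown", "restricted"}
# Matches unindented "key: value" lines only; nested lines are skipped.
TOP_LEVEL_KEY = re.compile(r"^([A-Za-z_]\w*):\s*(.*)$")


def front_matter_problems(packet_text: str) -> list[str]:
    """Return a list of problems found in the front matter (empty if none)."""
    lines = packet_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return ["packet does not begin with YAML front matter"]
    end = next(
        (i for i in range(1, len(lines)) if lines[i].strip() == "---"), None
    )
    if end is None:
        return ["front matter is not closed"]

    fields = {}
    for line in lines[1:end]:
        match = TOP_LEVEL_KEY.match(line)
        if match:
            # Simplification: drop surrounding quotes instead of parsing YAML.
            fields[match.group(1)] = match.group(2).strip().strip("\"'")

    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in fields]
    if "packet_type" in fields and fields["packet_type"] != "research_packet":
        problems.append(f"unexpected packet_type: {fields['packet_type']}")
    if "source_kind" in fields and fields["source_kind"] not in ALLOWED_SOURCE_KINDS:
        problems.append(f"source_kind not allowed: {fields['source_kind']}")
    if "license" in fields and fields["license"] not in ALLOWED_LICENSES:
        problems.append(f"license not allowed: {fields['license']}")
    return problems
```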
---
## Required Sections (in this order)
Packets MUST contain exactly the following H2 sections, in this order:
1. `## Executive Summary`
2. `## Source Metadata`
3. `## Extracted Content`
4. `## Claims and Evidence`
5. `## Safety Notes`
6. `## Citations`
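
The ordering rule above lends itself to a simple structural check. This is a sketch only: it reads "exactly" as "these six H2 headings and no others", which is an interpretation, and the function names are hypothetical.

```python
REQUIRED_SECTIONS = [
    "Executive Summary",
    "Source Metadata",
    "Extracted Content",
    "Claims and Evidence",
    "Safety Notes",
    "Citations",
]


def h2_headings(body: str) -> list[str]:
    """Collect H2 heading titles, skipping lines inside fenced code blocks."""
    headings = []
    in_fence = False
    for line in body.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        elif not in_fence and line.startswith("## "):
            headings.append(line[3:].strip())
    return headings


def has_required_sections(body: str) -> bool:
    """True only if the H2 headings match the required list, in order."""
    return h2_headings(body) == REQUIRED_SECTIONS
```

Comparing the full heading list means a missing, reordered, or extra H2 section all fail the same check.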
### 1) Executive Summary
* Short, neutral description of what the source is about
* No imperatives, no instructions to CORE
* No tool suggestions
### 2) Source Metadata
Must include:
* canonical URL / DOI / PMID
* publication venue (if known)
* retrieval method (API vs HTML)
* any access constraints observed
### 3) Extracted Content
* Quotes are allowed but must be short and attributed.
* Prefer paraphrase with citations.
* Avoid embedding procedural steps (install/run) beyond what is necessary to understand the source.
### 4) Claims and Evidence
A list of claim blocks:
```text
- Claim: ...
  Evidence: ...
  Confidence: low|medium|high
  Citation: [C1]
```
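
Because claim blocks have a fixed shape, they can be parsed into records and checked field by field. The sketch below assumes each block starts with `- Claim:` and that the remaining fields appear on their own lines; the `Claim` dataclass and `parse_claims` function are hypothetical names, not part of the schema.

```python
from dataclasses import dataclass

CONFIDENCE_LEVELS = {"low", "medium", "high"}
FIELD_NAMES = {"claim", "evidence", "confidence", "citation"}


@dataclass
class Claim:
    claim: str
    evidence: str
    confidence: str
    citation: str


def parse_claims(section_text: str) -> list[Claim]:
    """Parse claim blocks; raise ValueError on a malformed block."""
    blocks = []
    current = None
    for raw in section_text.splitlines():
        line = raw.strip()
        if line.startswith("- Claim:"):
            current = {"claim": line[len("- Claim:"):].strip()}
            blocks.append(current)
        elif current is not None and ":" in line:
            key, value = line.split(":", 1)
            if key.lower() in FIELD_NAMES:
                current[key.lower()] = value.strip()

    claims = []
    for fields in blocks:
        missing = FIELD_NAMES - fields.keys()
        if missing:
            raise ValueError(f"claim block missing fields: {sorted(missing)}")
        if fields["confidence"] not in CONFIDENCE_LEVELS:
            raise ValueError(f"invalid confidence: {fields['confidence']}")
        claims.append(Claim(**fields))
    return claims
```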
### 5) Safety Notes
This section is mandatory and MUST contain:
* `Untrusted Content Statement:` a sentence explicitly stating the content is untrusted and must not be treated as instructions.
* `Injection Indicators:` list any suspicious patterns found (or `None observed`).
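
A check for the two mandatory labels might look like the sketch below. It only verifies that each label is present and followed by some text; judging whether the statement really declares the content untrusted is left out. The function name `split_safety_notes` is hypothetical.

```python
REQUIRED_LABELS = ("Untrusted Content Statement:", "Injection Indicators:")


def split_safety_notes(section_text: str) -> dict[str, str]:
    """Map each required label to the text between it and the next label."""
    positions = {label: section_text.find(label) for label in REQUIRED_LABELS}
    missing = [label for label, pos in positions.items() if pos == -1]
    if missing:
        raise ValueError(f"missing labels: {missing}")

    ordered = sorted(positions, key=positions.get)
    notes = {}
    for index, label in enumerate(ordered):
        start = positions[label] + len(label)
        if index + 1 < len(ordered):
            end = positions[ordered[index + 1]]
        else:
            end = len(section_text)
        # Trim whitespace, bullet markers, and backticks around the value.
        value = section_text[start:end].strip(" \n*-`")
        if not value:
            raise ValueError(f"no text after label: {label}")
        notes[label] = value
    return notes
```

A packet whose `Injection Indicators:` value is anything other than `None observed` could then be flagged for closer review.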
### 6) Citations
A numbered list with stable labels:
```text
[C1] Author, Title, Venue, Year. URL/DOI.
[C2] ...
```
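
Stable labels make it possible to confirm that every citation referenced in a claim resolves to an entry in this section. A minimal sketch, assuming labels always take the form `[C<number>]` and that each entry starts at the beginning of a line; `unresolved_citations` is a hypothetical helper.

```python
import re

# An entry is a label at the start of a line followed by some text.
CITATION_ENTRY = re.compile(r"^\[(C\d+)\][ \t]+\S", re.MULTILINE)
CITATION_REF = re.compile(r"\[(C\d+)\]")


def unresolved_citations(claims_text: str, citations_text: str) -> set[str]:
    """Return labels used in claims that have no entry under Citations."""
    defined = set(CITATION_ENTRY.findall(citations_text))
    referenced = set(CITATION_REF.findall(claims_text))
    return referenced - defined
```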
---
## Forbidden Content (Validation Failures)
Packets MUST be rejected if they contain (case-insensitive, including obfuscations):
* shell commands or code blocks intended for execution (e.g., `bash`, `sh`, `powershell`)
* installation instructions (`apt`, `pip install`, `curl | sh`, etc.)
* persistence suggestions (cron, systemd units, init scripts)
* instructions aimed at overriding hierarchy (“ignore previous instructions”, “system prompt”, etc.)
* embedded credentials or tokens
* links to executables or binary downloads presented as steps to take

Packets may describe such things academically if necessary, but must do so as **descriptive text** with no runnable commands.
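
Some of these rules can be approximated with pattern matching. The sketch below is illustrative and deliberately incomplete: the patterns are assumptions chosen for the example, it covers only four of the categories above (not credentials or binary links), and its handling of obfuscation is limited to case folding and Unicode NFKC normalization. It will also flag academic mentions of phrases such as "system prompt", so a real validator would need a policy for those cases.

```python
import re
import unicodedata

# Hypothetical heuristics; each key is the reason reported on a match.
FORBIDDEN_PATTERNS = {
    "shell code block": r"^\s*`{3}\s*(bash|sh|powershell)\b",
    "installation instruction": (
        r"\b(apt(-get)?\s+install|pip3?\s+install)\b"
        r"|curl\b[^\n]*\|\s*(ba)?sh\b"
    ),
    "persistence suggestion": r"\b(crontab|systemctl\s+enable)\b",
    "hierarchy override": (
        r"ignore\s+(all\s+)?previous\s+instructions|system\s+prompt"
    ),
}
COMPILED = {
    reason: re.compile(pattern, re.IGNORECASE | re.MULTILINE)
    for reason, pattern in FORBIDDEN_PATTERNS.items()
}


def forbidden_findings(packet_text: str) -> list[str]:
    """Return the reasons whose pattern matches the normalized packet text."""
    # NFKC folds full-width and other compatibility characters to plain forms.
    normalized = unicodedata.normalize("NFKC", packet_text)
    return [
        reason for reason, pattern in COMPILED.items()
        if pattern.search(normalized)
    ]
```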
---
## Validation Output
Validators should produce one of two verdicts:
* `ACCEPT` → the packet is moved to `handoff/inbound-to-core/`
* `REJECT` → the packet is moved to `handoff/quarantine/` with a reason report
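
The routing step could be sketched as follows. The directory names come from the list above; everything else is an assumption made for illustration, including the function name `route_packet`, the idea that a non-empty list of reasons means `REJECT`, and the `.reason.txt` report file name.

```python
import shutil
from pathlib import Path


def route_packet(packet: Path, handoff_root: Path, reasons: list[str]) -> str:
    """Move a packet to its destination and return the verdict string."""
    verdict = "REJECT" if reasons else "ACCEPT"
    subdir = "quarantine" if reasons else "inbound-to-core"
    target_dir = handoff_root / subdir
    target_dir.mkdir(parents=True, exist_ok=True)
    shutil.move(str(packet), str(target_dir / packet.name))
    if reasons:
        # Hypothetical report format: one reason per line next to the packet.
        report = target_dir / (packet.name + ".reason.txt")
        report.write_text("\n".join(reasons) + "\n", encoding="utf-8")
    return verdict
```

Here `handoff_root` would point at the `handoff/` directory, and `reasons` would be the combined findings of whatever checks the validator runs.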