diff --git a/docs/evo-edu-notebook-pipeline.md b/docs/evo-edu-notebook-pipeline.md
new file mode 100644
index 0000000..8b89517
--- /dev/null
+++ b/docs/evo-edu-notebook-pipeline.md
@@ -0,0 +1,332 @@
+# evo-edu Notebook Pipeline
+
+This note turns the current `Notebook` idea into a concrete cross-repo
+workflow for `doclift`, `GroundRecall`, `Didactopus`, and `CiteGeist`.
+
+The target is the conceptual resource at:
+
+-
+
+The important shift is that the Notebook should not be treated as "just another
+wiki". The strongest differentiator available in the current stack is
+graph-first navigation over reviewed concepts, claims, citations, and learner
+next-step suggestions.
+
+## Why this fits the current stack
+
+The stack already divides responsibility in a useful way:
+
+- `doclift`: normalize messy source material into deterministic bundles
+- `GroundRecall`: canonical reviewed claims, concept graph, provenance, and
+  query/export surfaces
+- `Didactopus`: learner-facing packs, sequencing, workbench flows, and concept
+  navigation
+- `CiteGeist`: bibliography extraction, enrichment, review, and expansion
+
+The Notebook use case needs all four:
+
+- explanation text
+- accessible concept sequencing
+- explicit source grounding
+- bibliography compilation and enrichment
+- illustration planning
+- visible graph structure for "what to learn next"
+
+## Source classes
+
+The Notebook will likely need at least four source classes.
+
+### 1. Web corpora
+
+Examples:
+
+- TalkOrigins Archive FAQs and articles
+- TalkDesign posts
+- Panda's Thumb posts
+
+Operational note:
+
+- these corpora should be provisioned locally before ingestion
+- do not rely on live scraping as the primary production path
+- keep source snapshots versioned or at least manifest-tracked
+
+### 2. Scanned textbooks and monographs
+
+Examples already named:
+
+- Futuyma, `Evolutionary Biology`
+- Pianka, `Evolutionary Ecology`
+- Bowler, `Evolution: The History of an Idea`
+
+The current local library root is:
+
+- `/mnt/CIFS/pengolodh/Docs/Library`
+
+This should be treated as the upstream source corpus, not as the final working
+directory for Notebook artifacts.
+
+### 3. Bibliographic seed corpora
+
+Examples:
+
+- TalkOrigins bibliographies
+- textbook reference sections
+- existing `.bib` files in the library
+
+These are where `CiteGeist` becomes especially important.
+
+### 4. Planned illustration sources
+
+These are not just assets. They should be reviewable planning objects:
+
+- target concept
+- illustration intent
+- source basis
+- rights/compliance note
+- status: planned / needed / drafted / reviewed / published
+
+## Recommended working position for the Notebook
+
+The Notebook should be positioned as:
+
+- a graph-guided conceptual atlas
+- a source-grounded explanation layer
+- a learner-facing bridge between articles, textbooks, and bibliographies
+
+It should not try to compete by being the flattest or largest encyclopedia.
+
+The distinguishing feature should be that the learner can see:
+
+- antecedent concepts
+- nearby or "closer" concepts
+- derivative or downstream concepts
+- representative supporting sources
+- bibliography growth points
+- illustration opportunities
+
+That is much more consistent with the current stack than a generic article CMS.
+
+## Proposed pipeline
+
+### Phase 0. Provision the corpora locally
+
+Create a local Notebook source workspace containing:
+
+- provisioned web corpora snapshots
+- selected textbook scan directories
+- bibliography seeds
+- source manifests
+
+Expected result:
+
+- stable local inputs for repeatable ingestion
+
+### Phase 1. Normalize source material with `doclift`
+
+Use `doclift` for:
+
+- OCR-derived text normalization where practical
+- sidecar generation
+- `document.chunks.json` emission
+- bundle manifests for scanned or converted materials
+
+For web corpora, either:
+
+- convert into bundle-like normalized document trees, or
+- ingest through direct text/markdown adapters where that is simpler
+
+Expected result:
+
+- deterministic source bundles for longer-form documents
+
+### Phase 2. Build bibliographic substrate with `CiteGeist`
+
+Use `CiteGeist` to:
+
+- scrape or ingest TalkOrigins bibliography materials
+- expand weak references
+- enrich textbook references
+- cluster duplicates
+- build review exports for uncertain entries
+- maintain one or more Notebook `.bib` outputs
+
+Expected result:
+
+- a reviewed bibliography layer rather than ad hoc citation lists
+
+### Phase 3. Import canonical knowledge into `GroundRecall`
+
+Use `GroundRecall` to import:
+
+- `doclift` bundles for textbooks and scans
+- provisioned article/essay corpora
+- optional Didactopus-native artifacts where useful
+
+Then use its review flow to:
+
+- standardize concepts
+- preserve fragments and provenance
+- compute graph diagnostics
+- queue bridge/isolated/small-component concepts for review
+- retain review rationale in promoted candidates
+
+Expected result:
+
+- canonical Notebook concept/claim substrate with provenance and graph signals
+
+### Phase 4. Export pack-ready concept bundles from `GroundRecall`
+
+For important notebook concepts, export:
+
+- `groundrecall_query_bundle.json`
+
+This becomes the handoff object for learner-facing or page-facing pack flows.
+
+Expected result:
+
+- reviewed concept payloads that can feed Didactopus and page generation
+
+### Phase 5. Build `Didactopus` packs and learner navigation
+
+Use `Didactopus` to:
+
+- create draft packs around concept neighborhoods or topical modules
+- carry `groundrecall_query_bundle.json` as a declared supporting artifact
+- expose learner-workbench context that includes review and graph signals
+- sequence "what next" items from prerequisites and nearby graph structure
+
+Expected result:
+
+- learner-facing concept packs grounded in reviewed Notebook knowledge
+
+### Phase 6. Publish the Notebook
+
+Publication outputs should probably include:
+
+- accessible concept pages
+- graph-first navigation controls
+- bibliography sections or per-page reading lists
+- illustration status or image slots
+- links into interactive apps and learner-workbench flows
+
+Expected result:
+
+- a Notebook that is not just readable, but navigable through conceptual
+  structure
+
+## Knowledge-graph-first navigation
+
+This is the main product differentiator.
+
+For each concept page, the learner should be able to see a small graph-guided
+navigation panel with categories such as:
+
+- `Antecedent concepts`
+  Concepts that must usually be understood first
+
+- `Closer concepts`
+  Nearby concepts in the same explanatory neighborhood
+
+- `Derivative concepts`
+  Concepts that extend or depend on the current concept
+
+- `Supporting sources`
+  Canonical bibliography or source entries that materially support the concept
+
+- `Illustration opportunities`
+  Candidate figures or planned visual explanations
+
+The labels can be refined later, but the structure should come from typed graph
+relations rather than from arbitrary page links alone.
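As a rough illustration of deriving the panel from typed relations, the sketch below groups a concept's adjacent edges into the panel categories named above. Everything in it is an assumption made for illustration: the `Edge` tuple, the `PANEL_RULES` mapping from relation and direction to a label, the `build_nav_panel` helper, and the placeholder concept and source ids are hypothetical, and none of them reflects an existing `GroundRecall` or `Didactopus` API.

```python
from collections import defaultdict
from typing import NamedTuple


class Edge(NamedTuple):
    """One typed relation, read as: source --relation--> target."""
    source: str
    relation: str
    target: str


# Hypothetical mapping from (relation, direction) to a panel label.
# "in" means the current concept is the edge target; "out" means it is the source.
# Example reading: Edge("variation", "prerequisite", "natural_selection") says
# that "variation" is a prerequisite of "natural_selection".
PANEL_RULES = {
    ("prerequisite", "in"): "Antecedent concepts",
    ("prerequisite", "out"): "Derivative concepts",
    ("contrasts_with", "in"): "Closer concepts",
    ("contrasts_with", "out"): "Closer concepts",
    ("supports", "in"): "Supporting sources",
    ("illustrated_by", "out"): "Illustration opportunities",
}


def build_nav_panel(concept: str, edges: list[Edge]) -> dict[str, list[str]]:
    """Group the neighbors of `concept` under panel labels, using typed edges only."""
    panel: dict[str, list[str]] = defaultdict(list)
    for edge in edges:
        if edge.source == concept:
            direction, other = "out", edge.target
        elif edge.target == concept:
            direction, other = "in", edge.source
        else:
            continue  # edge does not touch the current concept
        label = PANEL_RULES.get((edge.relation, direction))
        if label and other not in panel[label]:
            panel[label].append(other)
    return dict(panel)


# Placeholder ids for illustration only; not real Notebook data.
edges = [
    Edge("variation", "prerequisite", "natural_selection"),
    Edge("natural_selection", "prerequisite", "adaptation"),
    Edge("natural_selection", "contrasts_with", "genetic_drift"),
    Edge("source:textbook-a", "supports", "natural_selection"),
    Edge("natural_selection", "illustrated_by", "figure:planned-01"),
    Edge("variation", "prerequisite", "genetic_drift"),  # not adjacent, ignored
]

panel = build_nav_panel("natural_selection", edges)
for label, items in panel.items():
    print(f"{label}: {', '.join(items)}")
```

Keeping the relation-to-label mapping in one table means the panel labels can be renamed later without touching the graph data, which matches the note's point that labels are provisional while the typed relations carry the structure.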
+
+## Suggested relation types for Notebook navigation
+
+The current stack does not need all of these on day one, but they are useful as
+target categories:
+
+- `prerequisite`
+- `supports`
+- `contrasts_with`
+- `historical_predecessor`
+- `historical_successor`
+- `applies_to`
+- `example_of`
+- `misconception_about`
+- `illustrated_by`
+
+Some can live in `GroundRecall` first and only later appear in learner-facing
+Didactopus packs.
+
+## Illustration planning
+
+Illustrations should be tracked as structured planning artifacts, not buried in
+page notes.
+
+At minimum, each planned illustration should record:
+
+- target concept id
+- working caption or purpose
+- source grounding
+- rights/compliance note
+- priority
+- status
+
+This can begin as JSON or markdown sidecars before becoming a richer model.
+
+## Bibliography strategy
+
+The Notebook may want both:
+
+- per-concept reading lists
+- larger topical bibliographies
+
+Recommended split:
+
+- `CiteGeist` maintains the main bibliography workbench and review discipline
+- `GroundRecall` stores links between concepts/claims and source artifacts
+- published Notebook pages surface only the citations relevant to the current
+  concept and nearby graph region
+
+That avoids turning the Notebook itself into the bibliography editor.
+
+## Concrete first pilot
+
+A good first Notebook pilot would be one narrow concept region rather than the
+whole corpus.
+
+For example:
+
+- historical development of evolutionary thought
+- evidence for common descent
+- natural selection and adaptation
+
+Choose one region with:
+
+- 1 to 3 textbooks
+- a small local article/blog corpus
+- one reviewed bibliography export
+- one explicit graph-navigation experiment
+
+## Recommended next implementation tasks
+
+1. Provision one local Notebook corpus workspace outside the library root.
+2. Choose one pilot concept region and one target concept.
+3. Normalize one textbook source with `doclift`.
+4. Provision one local TalkOrigins or Panda's Thumb snapshot.
+5. Run `CiteGeist` on the pilot bibliography inputs.
+6. Import the pilot sources into `GroundRecall`.
+7. Export one `groundrecall_query_bundle.json`.
+8. Feed that into a `Didactopus` pack flow.
+9. Prototype one Notebook page that exposes graph-guided next-to-learn links.
+
+## Bottom line
+
+The Notebook is a strong fit for the current stack if it is treated as:
+
+- concept-first
+- graph-guided
+- provenance-aware
+- bibliography-backed
+- learner-navigable
+
+It is a weaker fit if treated as only a flat wiki rewrite of source material.