Allowlisted URL Fetcher (Size-Capped)

This fetcher retrieves small text-like content from allowlisted domains and emits a Research Packet.

Security Constraints

  • HTTPS only
  • In-code domain allowlist (defense in depth)
  • Size cap (default 250 KB)
  • Content-Type allowlist (text/*, application/json, application/xml, *+xml)
  • Honors proxy environment variables via urllib ProxyHandler
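
The constraints above can be sketched roughly as follows. This is a minimal illustration, not the code in fetch_text_allowlisted.py: the function names, the ALLOWED_DOMAINS contents, the subdomain-matching rule, and the reading of "250 KB" as 250 * 1024 bytes are all assumptions.

```python
import urllib.request
from urllib.parse import urlsplit

# Hypothetical allowlist for illustration; the real one lives in the script.
ALLOWED_DOMAINS = {"arxiv.org"}
# Assumption: "250 KB" is interpreted as 250 * 1024 bytes.
MAX_BYTES = 250 * 1024


def url_is_allowed(url: str) -> bool:
    """Accept only https URLs whose host is an allowlisted domain or a subdomain of one."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        return False
    host = (parts.hostname or "").lower()
    # Assumption: subdomains of an allowlisted domain are accepted too.
    return any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS)


def content_type_is_allowed(header_value: str) -> bool:
    """Match text/*, application/json, application/xml, and *+xml."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = header_value.split(";", 1)[0].strip().lower()
    return (
        media_type.startswith("text/")
        or media_type in {"application/json", "application/xml"}
        or media_type.endswith("+xml")
    )


def read_capped(stream, limit: int = MAX_BYTES) -> bytes:
    """Read at most limit + 1 bytes; more than limit means the body is too large."""
    data = stream.read(limit + 1)
    if len(data) > limit:
        raise ValueError("response exceeds size cap")
    return data


def make_opener() -> urllib.request.OpenerDirector:
    # ProxyHandler() with no arguments picks up proxy settings from the
    # environment (for example HTTPS_PROXY).
    return urllib.request.build_opener(urllib.request.ProxyHandler())
```

Reading one byte past the cap lets the fetcher reject an oversized body without downloading all of it, and without trusting a Content-Length header.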

Usage

From repo root:

chmod +x fetch/url/fetch_text_allowlisted.py
export PYTHONPATH="$(pwd)"
export CONTACT_EMAIL="you@example.org"   # recommended etiquette

python3 fetch/url/fetch_text_allowlisted.py \
  --url "https://arxiv.org/abs/2401.00001" \
  --out infra/volumes/handoff/inbound-to-core/RP-url-arxiv-abs.md
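
This README does not say how CONTACT_EMAIL is consumed. One common convention, assumed here purely for illustration, is to include it in the User-Agent header so site operators can reach the person running the fetcher. The function name and the agent string below are hypothetical.

```python
import os
import urllib.request


def build_request(url: str) -> urllib.request.Request:
    """Build a request whose User-Agent carries CONTACT_EMAIL when it is set."""
    # Assumption: the contact address travels in User-Agent; the actual
    # script may use the variable differently.
    email = os.environ.get("CONTACT_EMAIL", "").strip()
    agent = "example-url-fetcher"  # hypothetical product token
    if email:
        agent += f" (mailto:{email})"
    return urllib.request.Request(url, headers={"User-Agent": agent})
```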