Allowlisted URL Fetcher (Size-Capped)
This fetcher retrieves small text-like content from allowlisted domains and emits a Research Packet.
Security Constraints
- HTTPS only
- In-code domain allowlist (defense in depth)
- Size cap (default 250 KB)
- Content-Type allowlist (text/*, application/json, application/xml, *+xml)
- Honors proxy environment variables via urllib ProxyHandler
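The URL and Content-Type constraints above could be checked along the following lines. This is a minimal sketch, not the script's actual code: the names `ALLOWED_DOMAINS`, `is_allowed_url`, and `is_allowed_content_type` are hypothetical, the allowlist contents are illustrative, and accepting subdomains of an allowlisted domain is an assumption.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist for illustration; the script's real list may differ.
ALLOWED_DOMAINS = {"arxiv.org"}


def is_allowed_url(url: str) -> bool:
    """Accept only https:// URLs whose host is allowlisted.

    Assumption: subdomains of an allowlisted domain are also accepted.
    """
    parts = urlsplit(url)
    if parts.scheme != "https":
        return False
    host = (parts.hostname or "").lower()
    # Match on a dot boundary so "evilarxiv.org" does not pass as "arxiv.org".
    return any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS)


def is_allowed_content_type(header_value: str) -> bool:
    """Match text/*, application/json, application/xml, and any *+xml type."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = header_value.split(";", 1)[0].strip().lower()
    return (
        media_type.startswith("text/")
        or media_type in {"application/json", "application/xml"}
        or media_type.endswith("+xml")
    )
```

Matching the host on a dot boundary, rather than with a plain suffix or substring test, is what keeps look-alike hosts such as `arxiv.org.evil.com` out.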
Usage
From repo root:
chmod +x fetch/url/fetch_text_allowlisted.py
export PYTHONPATH="$(pwd)"
export CONTACT_EMAIL="you@example.org" # recommended etiquette
python3 fetch/url/fetch_text_allowlisted.py \
  --url "https://arxiv.org/abs/2401.00001" \
  --out infra/volumes/handoff/inbound-to-core/RP-url-arxiv-abs.md
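For orientation, the request itself might combine the proxy handling, the size cap, and `CONTACT_EMAIL` roughly as sketched below. Several things here are assumptions rather than facts about the script: that `CONTACT_EMAIL` ends up in the `User-Agent` header, the User-Agent string itself, that 250 KB means 250 × 1024 bytes, and the helper names `make_opener` and `read_capped`.

```python
import io
import urllib.request

MAX_BYTES = 250 * 1024  # README default cap, assuming 1 KB = 1024 bytes


def make_opener(contact_email=None):
    """Build an opener that honors proxy environment variables."""
    # ProxyHandler() with no arguments reads proxy settings from the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler())
    user_agent = "allowlisted-fetcher/0.1"  # hypothetical User-Agent string
    if contact_email:
        # Assumption: the contact address is advertised in the User-Agent.
        user_agent += f" (mailto:{contact_email})"
    opener.addheaders = [("User-Agent", user_agent)]
    return opener


def read_capped(stream, max_bytes=MAX_BYTES):
    """Read a body in chunks, raising ValueError once it exceeds max_bytes."""
    chunks, total = [], 0
    while True:
        # Never request more than one byte past the cap.
        chunk = stream.read(min(65536, max_bytes + 1 - total))
        if not chunk:
            break
        total += len(chunk)
        if total > max_bytes:
            raise ValueError(f"response exceeds {max_bytes} bytes")
        chunks.append(chunk)
    return b"".join(chunks)


# Offline demonstration: an in-memory stream stands in for an HTTP response.
print(len(read_capped(io.BytesIO(b"x" * 10), max_bytes=10)))  # → 10
```

With a real response, the same helper would be used as `with make_opener(os.environ.get("CONTACT_EMAIL")).open(url, timeout=20) as resp: body = read_capped(resp)` (after `import os`). Reading one byte past the cap lets the code tell a body of exactly the cap size from one that is too large, without trusting a `Content-Length` header.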