# Allowlisted URL Fetcher (Size-Capped)

This fetcher retrieves small text-like content from allowlisted domains and emits a Research Packet.

## Security Constraints

- HTTPS only
- In-code domain allowlist (defense in depth)
- Size cap (default 250 KB)
- Content-Type allowlist (`text/*`, `application/json`, `application/xml`, `*+xml`)
- Honors proxy environment variables via urllib `ProxyHandler`
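
The constraints above could be enforced roughly as in the following sketch. It is not the code in `fetch_text_allowlisted.py`; the names `ALLOWED_DOMAINS`, `MAX_BYTES`, `is_allowed_url`, `is_allowed_content_type`, `read_capped`, and `fetch_text` are hypothetical, the allowlist contents are made up for illustration, and reading "250 KB" as 250 * 1024 bytes, accepting subdomains, and re-checking the URL after redirects are all assumptions.

```python
import io
import urllib.request
from urllib.parse import urlsplit

# Hypothetical allowlist for illustration; the real script defines its own.
ALLOWED_DOMAINS = {"arxiv.org"}
# Assumption: "250 KB" is interpreted as 250 * 1024 bytes.
MAX_BYTES = 250 * 1024


def is_allowed_url(url: str) -> bool:
    """Accept only https URLs whose host is an allowlisted domain.

    Assumption: subdomains of an allowlisted domain are also accepted.
    """
    try:
        parts = urlsplit(url)
    except ValueError:
        return False
    host = (parts.hostname or "").lower()
    if parts.scheme != "https" or not host:
        return False
    return any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS)


def is_allowed_content_type(header_value: str) -> bool:
    """Match text/*, application/json, application/xml, and *+xml."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = header_value.split(";", 1)[0].strip().lower()
    return (
        media_type.startswith("text/")
        or media_type in {"application/json", "application/xml"}
        or media_type.endswith("+xml")
    )


def read_capped(stream: io.BufferedIOBase, limit: int = MAX_BYTES) -> bytes:
    """Read the whole body, raising ValueError once it exceeds `limit` bytes."""
    chunks, total = [], 0
    while total <= limit:
        # Never request more than one byte past the cap.
        chunk = stream.read(min(65536, limit + 1 - total))
        if not chunk:
            break
        chunks.append(chunk)
        total += len(chunk)
    if total > limit:
        raise ValueError(f"response exceeds {limit} bytes")
    return b"".join(chunks)


def fetch_text(url: str, timeout: float = 20.0) -> bytes:
    """Fetch `url` with every check applied (performs a network request)."""
    if not is_allowed_url(url):
        raise ValueError(f"URL not allowed: {url}")
    # ProxyHandler() with no arguments uses the proxies found in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler())
    with opener.open(url, timeout=timeout) as resp:
        # Extra precaution (an assumption, not stated above): re-check after redirects.
        if not is_allowed_url(resp.geturl()):
            raise ValueError(f"redirected to a disallowed URL: {resp.geturl()}")
        content_type = resp.headers.get("Content-Type", "")
        if not is_allowed_content_type(content_type):
            raise ValueError(f"Content-Type not allowed: {content_type!r}")
        return read_capped(resp)
```

Keeping the three checks as small pure functions means they can be exercised without network access; only `fetch_text` touches the network.
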
## Usage

From repo root:

```sh
chmod +x fetch/url/fetch_text_allowlisted.py
export PYTHONPATH="$(pwd)"
export CONTACT_EMAIL="you@example.org" # recommended etiquette

python3 fetch/url/fetch_text_allowlisted.py \
  --url "https://arxiv.org/abs/2401.00001" \
  --out infra/volumes/handoff/inbound-to-core/RP-url-arxiv-abs.md