Crawl System: Flow, Reuse, and Enterprise

This document describes the shared company crawl pipeline: flow, reuse and freshness, User-Agent and politeness, and an enterprise one-pager.

1. Crawl flow

Discovery: One discovery run per URL yields a list of pages (homepage, pricing, about, etc.) with types and priorities.
Product + UI capture (parallel): The product extractor and UI capture (screenshots/HTML) run in parallel over the discovered pages.
Single-fetch dual use: HTML from UI capture is reused for the product step where useful (e.g. pricing, metadata).
Pricing pass: If pricing is empty or low-confidence, a dedicated pricing pass runs and its results are merged in.
Competitor enrichment: Optional; runs with a configurable tier (minimal / cmo / full) and an optional cost cap.
Persist: The result is validated (Zod), then saved to company_crawls with execution_log, expires_at, and pipeline/intern versions.
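
The ordering of these stages can be sketched roughly as follows. This is a minimal illustration, not the code in lib/pipeline/run-shared-crawl.ts: the Stages interface, every type, runCrawlSketch, and the 0.5 confidence threshold are hypothetical names and values chosen for the example, and competitor enrichment and Zod validation are left out for brevity.

```typescript
// Hypothetical shapes for illustration only; the real types are not shown in this document.
type Page = { url: string; type: string; priority: number };
type Pricing = { plans: string[]; confidence: number };
type UiCapture = { screenshots: string[]; htmlByUrl: Record<string, string> };
type CrawlResult = {
  url: string;
  pages: Page[];
  pricing: Pricing;
  screenshots: string[];
  stages: string[];
};

// Each stage is injected so the sketch does not depend on the real extractors.
interface Stages {
  discover(url: string): Promise<Page[]>;
  captureUi(pages: Page[]): Promise<UiCapture>;
  extractProduct(pages: Page[]): Promise<{ pricing: Pricing }>;
  pricingPass(htmlByUrl: Record<string, string>): Promise<Pricing>;
  persist(result: CrawlResult): Promise<void>;
}

// Assumed cut-off for "low-confidence" pricing; the real threshold is not documented here.
const MIN_PRICING_CONFIDENCE = 0.5;

async function runCrawlSketch(url: string, stages: Stages): Promise<CrawlResult> {
  const done: string[] = [];

  // 1. One discovery run per URL.
  const pages = await stages.discover(url);
  done.push("discovery");

  // 2. Product extraction and UI capture run in parallel over the same pages.
  const [ui, product] = await Promise.all([
    stages.captureUi(pages),
    stages.extractProduct(pages),
  ]);
  done.push("product+ui");

  // 3. Dedicated pricing pass only when pricing is empty or low-confidence.
  //    It reuses the HTML already fetched by UI capture instead of refetching.
  let pricing = product.pricing;
  if (pricing.plans.length === 0 || pricing.confidence < MIN_PRICING_CONFIDENCE) {
    const extra = await stages.pricingPass(ui.htmlByUrl);
    const newPlans = extra.plans.filter((p) => pricing.plans.indexOf(p) === -1);
    pricing = {
      plans: pricing.plans.concat(newPlans),
      confidence: Math.max(pricing.confidence, extra.confidence),
    };
    done.push("pricing-pass");
  }

  // 4. Persist the assembled result.
  const result: CrawlResult = { url, pages, pricing, screenshots: ui.screenshots, stages: done };
  await stages.persist(result);
  return result;
}
```

Injecting the stages keeps the control flow visible and makes the sketch easy to exercise with stubs; the merge rule shown (union of plans, higher confidence wins) is only one plausible choice.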

APIs:

SSE (UI): POST /api/roles/crawl streams progress events, then returns { crawlId, url }.
JSON (programmatic): POST /api/v1/crawl returns crawlId, pipeline_version, missing_report, cost_summary, and qualitySignals.
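
A programmatic caller of the JSON endpoint might look like the sketch below. The request body shape ({ url }) is an assumption, since this document only lists the response fields; requestCrawl, FetchLike, and CrawlResponse are hypothetical names, and the response fields other than crawlId are typed as unknown because their shapes are not described here.

```typescript
// Response fields as listed in this document; only crawlId is assumed to be a string.
type CrawlResponse = {
  crawlId: string;
  pipeline_version: unknown;
  missing_report: unknown;
  cost_summary: unknown;
  qualitySignals: unknown;
};

// Minimal fetch-like signature so the caller can pass the global fetch or a test stub.
type FetchLike = (
  input: string,
  init: { method: string; headers: Record<string, string>; body: string }
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

async function requestCrawl(
  baseUrl: string,
  targetUrl: string,
  fetchImpl: FetchLike
): Promise<CrawlResponse> {
  const res = await fetchImpl(baseUrl + "/api/v1/crawl", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    // Assumed request shape: the endpoint's actual body schema is not documented here.
    body: JSON.stringify({ url: targetUrl }),
  });
  if (!res.ok) {
    throw new Error("crawl request failed with status " + res.status);
  }
  return (await res.json()) as CrawlResponse;
}
```

In a runtime that provides a global fetch, it can be passed as fetchImpl; taking it as a parameter keeps the function testable without network access.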

2. Reuse and freshness

Reuse window: A crawl is considered usable (and is reused) if it is no older than CRAWL_REUSE_MAX_AGE_DAYS (default 7). The value is configured in lib/persistence/company-crawls.ts and used by isCrawlUsable(..., { maxAgeDays }).
Expiry: New crawls set expires_at = now + CRAWL_EXPIRES_DAYS when saved. Expired rows are not reused.
Optional Redis cache: When REDIS_URL is set, discovery and enrichment results can be cached by normalized URL (plus the tier, for enrichment). See lib/cache/crawl-cache.ts. The cache TTL is 1 hour for both discovery and enrichment entries.
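
The reuse and expiry rules above amount to two time comparisons. The sketch below is an illustration under assumptions, not the actual isCrawlUsable: the StoredCrawl shape, the createdAt and expiresAt field names, and the function names are hypothetical, and a crawl exactly maxAgeDays old is treated as still usable ("no older than").

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// Hypothetical row shape; the real company_crawls columns may be named differently.
type StoredCrawl = { createdAt: Date; expiresAt: Date | null };

// Mirrors "expires_at = now + CRAWL_EXPIRES_DAYS" at save time.
function expiresAtFor(savedAt: Date, expiresDays: number): Date {
  return new Date(savedAt.getTime() + expiresDays * DAY_MS);
}

// A crawl is reusable when it has not expired and is within the reuse window.
function isCrawlUsableSketch(
  crawl: StoredCrawl,
  opts: { maxAgeDays?: number; now?: Date } = {}
): boolean {
  const maxAgeDays = opts.maxAgeDays ?? 7; // default reuse window from this document
  const now = opts.now ?? new Date();

  // Expired rows are never reused, regardless of age.
  if (crawl.expiresAt !== null && crawl.expiresAt.getTime() <= now.getTime()) {
    return false;
  }

  const ageMs = now.getTime() - crawl.createdAt.getTime();
  return ageMs <= maxAgeDays * DAY_MS;
}
```

Passing now explicitly makes the check deterministic in tests; in production code it would default to the current time.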

3. User-Agent and politeness

Identifiable agents: Requests use identifiable User-Agent strings (e.g. Mozilla/5.0 (compatible; PMF-Intern/1.0), PMF-Analyzer/1.0) so operators can recognize and allowlist traffic.
No aggressive concurrency: Discovery runs once per URL; product extraction and UI capture run in a single parallel batch over a bounded number of pages (e.g. up to 8 for UI capture). There are no unbounded parallel requests to the same host.
Optional queue: With CRAWL_USE_QUEUE=true, crawls are enqueued and processed in the background; clients poll GET /api/roles/crawl/job?jobId=... for status.
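
One way to keep a parallel batch bounded is to cap the page list before fanning out, as in this sketch. It is an assumption-laden illustration rather than the pipeline's code: captureBoundedBatch, DiscoveredPage, and MAX_UI_PAGES are hypothetical names, and a higher priority number is assumed to mean a more important page.

```typescript
// Hypothetical page shape; only the fields needed for selection are included.
type DiscoveredPage = { url: string; priority: number };

// Cap taken from the "up to 8 for UI capture" example in this document.
const MAX_UI_PAGES = 8;

async function captureBoundedBatch<T>(
  pages: DiscoveredPage[],
  captureOne: (page: DiscoveredPage) => Promise<T>,
  maxPages: number = MAX_UI_PAGES
): Promise<T[]> {
  // Copy before sorting so the caller's array is not mutated.
  const selected = pages
    .slice()
    .sort((a, b) => b.priority - a.priority) // assumed: higher number = more important
    .slice(0, maxPages);

  // A single parallel batch: at most maxPages requests are in flight at once.
  return Promise.all(selected.map((page) => captureOne(page)));
}
```

Because the cap is applied before any request starts, the number of concurrent requests to a host can never exceed maxPages for this batch.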

4. Enterprise one-pager

Area	Capability
Verifiability	Zod validation before persist; pipeline and intern versions and full execution_log (extractor, durationMs, fromCache) stored with each crawl.
Scalability	Optional in-process crawl queue (CRAWL_USE_QUEUE); optional Redis for discovery/enrichment cache; versioned API POST /api/v1/crawl for integration.
Cost controls	Tiered enrichment (minimal / cmo / full); optional costCap; cost recorded in execution_log and returned in cost_summary; observability emits cost by service.
Quality and observability	Per-crawl quality signals (hasPricing, hasTeam, productDetailsConfidence) and missing-data reports; structured JSON log line per crawl (duration by stage, cost, extractor success) for aggregators.

All of the above are implemented in the codebase; see lib/pipeline/run-shared-crawl.ts, lib/persistence/company-crawls.ts, lib/cache/crawl-cache.ts, lib/crawl-queue.ts, lib/observability/crawl-metrics.ts, and app/api/v1/crawl/route.ts.
