Crawl System: Flow, Reuse, and Enterprise
This document describes the shared company crawl pipeline: flow, reuse and freshness, User-Agent and politeness, and an enterprise one-pager.
1. Crawl flow
- Discovery — One discovery run per URL yields a list of pages (homepage, pricing, about, etc.) with types and priorities.
- Product + UI capture (parallel) — Product extractor and UI capture (screenshots/HTML) run in parallel over the discovered pages.
- Single-fetch dual use — HTML from UI capture is reused for the product step where useful (e.g. pricing, metadata).
- Pricing pass — If pricing is empty or low-confidence, a dedicated pricing pass runs and merges results.
- Competitor enrichment — Optional; runs with configurable tier (minimal / cmo / full) and optional cost cap.
- Persist — Result is validated (Zod), then saved to company_crawls with execution_log, expires_at, and pipeline/intern versions.
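The pricing-pass step above can be sketched as a small decision plus a merge. This is a minimal illustration, not the pipeline's actual code: the PricingResult shape, the 0.5 threshold, and the merge policy (union of plan names, higher confidence wins) are all assumptions made for the example.

```typescript
// Hypothetical shape of a pricing extraction result; field names are illustrative.
interface PricingResult {
  plans: string[];
  confidence: number; // assumed to lie in [0, 1]
}

// Assumed cutoff below which a result counts as "low-confidence".
const LOW_CONFIDENCE_THRESHOLD = 0.5;

// True when the first-pass pricing is missing, empty, or low-confidence,
// i.e. when the dedicated pricing pass should run.
function needsPricingPass(
  pricing: PricingResult | undefined,
  threshold: number = LOW_CONFIDENCE_THRESHOLD
): boolean {
  if (pricing === undefined || pricing.plans.length === 0) return true;
  return pricing.confidence < threshold;
}

// Assumed merge policy: union of plan names (first-pass order first),
// keeping the higher of the two confidence scores.
function mergePricing(
  first: PricingResult | undefined,
  dedicated: PricingResult
): PricingResult {
  if (first === undefined) return dedicated;
  const plans = first.plans
    .concat(dedicated.plans)
    .filter((plan, index, all) => all.indexOf(plan) === index);
  return { plans, confidence: Math.max(first.confidence, dedicated.confidence) };
}
```

A caller would run the dedicated pass only when needsPricingPass returns true, then store the output of mergePricing in place of the first-pass result.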
APIs:
- SSE (UI): POST /api/roles/crawl streams progress, then sends { crawlId, url }.
- JSON (programmatic): POST /api/v1/crawl returns crawlId, pipeline_version, missing_report, cost_summary, qualitySignals.
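For programmatic callers, a light check that a parsed response body carries the five fields listed for POST /api/v1/crawl might look like the sketch below. The key names come from the list above; the value types are not specified in this document, so they are left as unknown, and the helper names are hypothetical.

```typescript
// Keys listed for the POST /api/v1/crawl response. Value types are not
// specified in this document, so they are typed as unknown here.
const V1_CRAWL_RESPONSE_KEYS = [
  "crawlId",
  "pipeline_version",
  "missing_report",
  "cost_summary",
  "qualitySignals",
] as const;

type V1CrawlResponseKey = (typeof V1_CRAWL_RESPONSE_KEYS)[number];
type V1CrawlResponse = Record<V1CrawlResponseKey, unknown>;

// Returns the documented keys that are absent from a parsed response body.
function missingResponseKeys(body: unknown): V1CrawlResponseKey[] {
  if (typeof body !== "object" || body === null) {
    return V1_CRAWL_RESPONSE_KEYS.slice();
  }
  return V1_CRAWL_RESPONSE_KEYS.filter((key) => !(key in body));
}

// Type guard: narrows an unknown body to the documented set of keys.
function isV1CrawlResponse(body: unknown): body is V1CrawlResponse {
  return missingResponseKeys(body).length === 0;
}
```

This only checks that the keys are present. A stricter client would validate each field's shape once the response schema is known.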
2. Reuse and freshness
- Reuse window: A crawl is considered usable (and reused) if it is no older than CRAWL_REUSE_MAX_AGE_DAYS (default 7). Configured in lib/persistence/company-crawls.ts and used by isCrawlUsable(..., { maxAgeDays }).
- Expiry: New crawls set expires_at = now + CRAWL_EXPIRES_DAYS when saved. Expired rows are not reused.
- Optional Redis cache: When REDIS_URL is set, discovery and enrichment results can be cached by normalized URL (and enrichment tier). See lib/cache/crawl-cache.ts. The cache TTL is 1 hour for both discovery and enrichment.
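The reuse and expiry rules above reduce to two date comparisons. The sketch below is illustrative only: isWithinReuseWindow is a hypothetical stand-in, not the real isCrawlUsable (whose full signature is not shown here), the createdAt field name is assumed, and treating a row as expired once expires_at is at or before the current time is also an assumption.

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Minimal stand-in for a stored crawl row; only the two timestamps matter here.
interface StoredCrawl {
  createdAt: Date; // hypothetical column name for when the crawl was saved
  expiresAt: Date; // corresponds to expires_at
}

// expires_at = now + CRAWL_EXPIRES_DAYS, with the day count passed in explicitly.
function computeExpiresAt(now: Date, expiresDays: number): Date {
  return new Date(now.getTime() + expiresDays * MS_PER_DAY);
}

// A crawl is reusable when it has not expired and is no older than maxAgeDays.
// The default of 7 mirrors the documented CRAWL_REUSE_MAX_AGE_DAYS default.
function isWithinReuseWindow(
  crawl: StoredCrawl,
  now: Date,
  maxAgeDays: number = 7
): boolean {
  // Assumption: a row whose expiry equals "now" already counts as expired.
  if (crawl.expiresAt.getTime() <= now.getTime()) return false;
  const ageMs = now.getTime() - crawl.createdAt.getTime();
  return ageMs <= maxAgeDays * MS_PER_DAY;
}
```

Passing now as an argument rather than reading the clock inside the function keeps both helpers deterministic, which makes the boundary cases easy to test.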
3. User-Agent and politeness
- Identifiable agents: Requests use identifiable User-Agent strings (e.g. Mozilla/5.0 (compatible; PMF-Intern/1.0), PMF-Analyzer/1.0) so operators can recognize and allowlist traffic.
- No aggressive concurrency: Discovery runs once per URL; product and UI run in a single parallel batch with a bounded number of pages (e.g. up to 8 for UI capture). No unbounded parallel requests to the same host.
- Optional queue: With CRAWL_USE_QUEUE=true, crawls can be enqueued and processed in the background; clients poll GET /api/roles/crawl/job?jobId=... for status.
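One simple way to keep the parallel batch bounded is to cap the page list before any requests are issued. The sketch below assumes a DiscoveredPage shape and assumes that a lower priority number means a more important page; neither is confirmed by this document, and the limit of 8 is taken from the example figure above.

```typescript
// Hypothetical shape of a discovered page; discovery yields a type and a priority.
interface DiscoveredPage {
  url: string;
  type: string; // e.g. "homepage", "pricing", "about"
  priority: number; // assumption: lower number = more important
}

// Example bound taken from the text ("up to 8 for UI capture").
const MAX_UI_CAPTURE_PAGES = 8;

// Picks at most `limit` pages, most important first.
function selectPagesForCapture(
  pages: DiscoveredPage[],
  limit: number = MAX_UI_CAPTURE_PAGES
): DiscoveredPage[] {
  // Copy before sorting so the caller's array is left untouched.
  return pages
    .slice()
    .sort((a, b) => a.priority - b.priority)
    .slice(0, Math.max(0, limit));
}
```

Because the cap is applied to the input list, the number of requests sent to a host in one batch cannot exceed the limit, however many pages discovery returns.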
4. Enterprise one-pager
| Area | Capability |
|---|---|
| Verifiability | Zod validation before persist; pipeline and intern versions and full execution_log (extractor, durationMs, fromCache) stored with each crawl. |
| Scalability | Optional in-process crawl queue (CRAWL_USE_QUEUE); optional Redis for discovery/enrichment cache; versioned API POST /api/v1/crawl for integration. |
| Cost controls | Tiered enrichment (minimal / cmo / full); optional costCap; cost recorded in execution_log and returned in cost_summary; observability emits cost by service. |
| Quality and observability | Per-crawl quality signals (hasPricing, hasTeam, productDetailsConfidence) and missing-data reports; structured JSON log line per crawl (duration by stage, cost, extractor success) for aggregators. |
All of the above are implemented in the codebase; see lib/pipeline/run-shared-crawl.ts, lib/persistence/company-crawls.ts, lib/cache/crawl-cache.ts, lib/crawl-queue.ts, lib/observability/crawl-metrics.ts, and app/api/v1/crawl/route.ts.
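To illustrate the structured JSON log line mentioned in the table, the sketch below folds execution_log entries into one line per crawl. The extractor, durationMs, and fromCache fields are named in the table; costUsd, the output field names, and the function name are hypothetical, and the real format emitted by lib/observability/crawl-metrics.ts may differ.

```typescript
// One execution_log entry. extractor, durationMs and fromCache are named in
// the table; costUsd is a hypothetical field added for illustration.
interface ExecutionLogEntry {
  extractor: string;
  durationMs: number;
  fromCache: boolean;
  costUsd?: number;
}

// Hypothetical layout of the per-crawl log line.
interface CrawlLogLine {
  crawlId: string;
  durationMsByStage: Record<string, number>;
  totalCostUsd: number;
  cacheHits: number;
}

// Aggregates entries into a single JSON string suitable for a log aggregator.
function buildCrawlLogLine(crawlId: string, entries: ExecutionLogEntry[]): string {
  const durationMsByStage: Record<string, number> = {};
  let totalCostUsd = 0;
  let cacheHits = 0;
  for (const entry of entries) {
    // Sum durations per extractor so repeated stages are reported once.
    const previous = durationMsByStage[entry.extractor] ?? 0;
    durationMsByStage[entry.extractor] = previous + entry.durationMs;
    totalCostUsd += entry.costUsd ?? 0;
    if (entry.fromCache) cacheHits += 1;
  }
  const line: CrawlLogLine = { crawlId, durationMsByStage, totalCostUsd, cacheHits };
  return JSON.stringify(line);
}
```

Emitting one self-contained JSON object per crawl lets an aggregator parse each line independently without joining it to other log records.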