For engineers and contributors. User-facing documentation lives at /docs.

Crawl System: Flow, Reuse, and Enterprise

This document describes the shared company crawl pipeline: flow, reuse and freshness, User-Agent and politeness, and an enterprise one-pager.


1. Crawl flow

  1. Discovery — One discovery run per URL yields a list of pages (homepage, pricing, about, etc.) with types and priorities.
  2. Product + UI capture (parallel) — Product extractor and UI capture (screenshots/HTML) run in parallel over the discovered pages.
  3. Single-fetch dual use — HTML from UI capture is reused for the product step where useful (e.g. pricing, metadata).
  4. Pricing pass — If pricing is empty or low-confidence, a dedicated pricing pass runs and merges results.
  5. Competitor enrichment — Optional; runs with configurable tier (minimal / cmo / full) and optional cost cap.
  6. Persist — Result is validated (Zod), then saved to company_crawls with execution_log, expires_at, and pipeline/intern versions.
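
The flow above can be sketched roughly as follows. This is an illustrative outline, not the code in lib/pipeline/run-shared-crawl.ts: the type names, the CrawlSteps interface, and the 0.5 confidence threshold are all assumptions made for the example, and competitor enrichment and Zod validation are left out for brevity.

```typescript
// Hypothetical types and step functions; none of these names come from the codebase.
type DiscoveredPage = { url: string; type: string; priority: number };
type ProductData = { pricing: string[]; pricingConfidence: number };
type UiCapture = { url: string; html: string };
type CrawlResultSketch = { url: string; product: ProductData; captures: UiCapture[] };

interface CrawlSteps {
  discover(url: string): Promise<DiscoveredPage[]>;
  extractProduct(pages: DiscoveredPage[]): Promise<ProductData>;
  captureUi(pages: DiscoveredPage[]): Promise<UiCapture[]>;
  // Receives the captured HTML so it can be reused instead of fetched again.
  pricingPass(pages: DiscoveredPage[], captures: UiCapture[]): Promise<string[]>;
  persist(result: CrawlResultSketch): Promise<string>;
}

// Assumed threshold below which pricing counts as low-confidence.
const MIN_PRICING_CONFIDENCE = 0.5;

async function runCrawlSketch(url: string, steps: CrawlSteps): Promise<string> {
  // Step 1: one discovery run per URL.
  const pages = await steps.discover(url);

  // Step 2: product extraction and UI capture run in parallel.
  const [product, captures] = await Promise.all([
    steps.extractProduct(pages),
    steps.captureUi(pages),
  ]);

  // Step 4: dedicated pricing pass only when pricing is empty or low-confidence.
  let pricing = product.pricing;
  if (pricing.length === 0 || product.pricingConfidence < MIN_PRICING_CONFIDENCE) {
    const extra = await steps.pricingPass(pages, captures);
    // Merge, skipping entries already present.
    pricing = pricing.concat(extra.filter((p) => pricing.indexOf(p) === -1));
  }

  // Step 6: persist and return the crawl id.
  return steps.persist({ url, product: { ...product, pricing }, captures });
}
```

Passing the steps in as an interface keeps the ordering logic testable with stubs, which is why the sketch takes that shape.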

APIs:

  • SSE (UI): POST /api/roles/crawl — streamed progress, then { crawlId, url }.
  • JSON (programmatic): POST /api/v1/crawl — returns crawlId, pipeline_version, missing_report, cost_summary, qualitySignals.
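
A minimal sketch of calling the JSON endpoint from a script. The response field names are the ones listed above; their types, the `{ url }` request body, and the absence of authentication headers are assumptions, so check app/api/v1/crawl/route.ts for the actual contract.

```typescript
// Field names come from this document; the types are assumptions.
interface CrawlApiResponse {
  crawlId: string;
  pipeline_version: string;
  missing_report: unknown;
  cost_summary: unknown;
  qualitySignals: unknown;
}

// Builds the request. The `{ url }` body shape is an assumption.
function buildCrawlRequest(baseUrl: string, targetUrl: string) {
  return {
    // Strip trailing slashes so the path is not doubled.
    endpoint: `${baseUrl.replace(/\/+$/, "")}/api/v1/crawl`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ url: targetUrl }),
    },
  };
}

// Sends the request and returns the parsed JSON body.
async function requestCrawl(baseUrl: string, targetUrl: string): Promise<CrawlApiResponse> {
  const { endpoint, init } = buildCrawlRequest(baseUrl, targetUrl);
  const res = await fetch(endpoint, init);
  if (!res.ok) {
    throw new Error(`Crawl request failed: HTTP ${res.status}`);
  }
  return (await res.json()) as CrawlApiResponse;
}
```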

2. Reuse and freshness

  • Reuse window: A crawl is considered usable (and reused) if it is no older than CRAWL_REUSE_MAX_AGE_DAYS (default 7). Configured in lib/persistence/company-crawls.ts and used by isCrawlUsable(..., { maxAgeDays }).
  • Expiry: New crawls set expires_at = now + CRAWL_EXPIRES_DAYS when saved. Expired rows are not reused.
  • Optional Redis cache: When REDIS_URL is set, discovery and enrichment results can be cached, keyed by normalized URL (plus the enrichment tier for enrichment results). See lib/cache/crawl-cache.ts. The cache TTL is 1 hour for both discovery and enrichment.
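
The reuse and expiry rules above amount to two date comparisons. The sketch below shows one way to express them; it is not the isCrawlUsable implementation in lib/persistence/company-crawls.ts. The row shape, the created_at field, and the function names are assumptions for illustration.

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Hypothetical row shape; only expires_at is named in this document.
interface CrawlRowSketch {
  created_at: Date;
  expires_at: Date;
}

// Expiry stamped at save time: now + expiresDays (CRAWL_EXPIRES_DAYS in the text).
function computeExpiresAt(now: Date, expiresDays: number): Date {
  return new Date(now.getTime() + expiresDays * MS_PER_DAY);
}

// A row is reusable when it is inside the reuse window and not yet expired.
// The default of 7 days mirrors the CRAWL_REUSE_MAX_AGE_DAYS default above.
function isCrawlUsableSketch(
  row: CrawlRowSketch,
  now: Date,
  opts: { maxAgeDays: number } = { maxAgeDays: 7 },
): boolean {
  const ageMs = now.getTime() - row.created_at.getTime();
  const withinWindow = ageMs <= opts.maxAgeDays * MS_PER_DAY;
  const notExpired = row.expires_at.getTime() > now.getTime();
  return withinWindow && notExpired;
}
```

Taking `now` as a parameter rather than calling `new Date()` inside keeps the check deterministic in tests.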

3. User-Agent and politeness

  • Identifiable agents: Requests use identifiable User-Agent strings (e.g. Mozilla/5.0 (compatible; PMF-Intern/1.0), PMF-Analyzer/1.0) so operators can recognize and allowlist traffic.
  • No aggressive concurrency: Discovery runs once per URL; product extraction and UI capture run in a single parallel batch over a bounded number of pages (e.g. up to 8 for UI capture). The pipeline never issues unbounded parallel requests to the same host.
  • Optional queue: With CRAWL_USE_QUEUE=true, crawls can be enqueued and processed in the background; clients poll for job status via GET /api/roles/crawl/job?jobId=....
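
As an illustration of the bounded batch described above, the sketch below caps the page list before fetching and attaches an identifiable User-Agent to every request. The helper names, the PageRef type, and the "larger number means higher priority" ordering are assumptions; the cap of 8 and the User-Agent string are taken from the text.

```typescript
type PageRef = { url: string; priority: number };

// "Up to 8 for UI capture", per the text above.
const UI_CAPTURE_MAX_PAGES = 8;
// One of the identifiable User-Agent strings quoted above.
const CRAWLER_USER_AGENT = "Mozilla/5.0 (compatible; PMF-Intern/1.0)";

// Keep the highest-priority pages, capped at `max`.
// Assumes a larger number means higher priority.
function selectBatch(pages: PageRef[], max: number = UI_CAPTURE_MAX_PAGES): PageRef[] {
  return pages
    .slice() // copy so the caller's array is not reordered
    .sort((a, b) => b.priority - a.priority)
    .slice(0, max);
}

// Fetch one bounded batch in parallel. `fetchHtml` is injected so the
// sketch does not depend on any particular HTTP client.
async function fetchBatch(
  pages: PageRef[],
  fetchHtml: (url: string, headers: Record<string, string>) => Promise<string>,
): Promise<string[]> {
  const batch = selectBatch(pages);
  return Promise.all(
    batch.map((p) => fetchHtml(p.url, { "User-Agent": CRAWLER_USER_AGENT })),
  );
}
```

Because the cap is applied before any request is issued, the number of in-flight requests can never exceed the batch size, however many pages discovery returns.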

4. Enterprise one-pager

Area | Capability
Verifiability | Zod validation before persist; pipeline and intern versions and full execution_log (extractor, durationMs, fromCache) stored with each crawl.
Scalability | Optional in-process crawl queue (CRAWL_USE_QUEUE); optional Redis for discovery/enrichment cache; versioned API POST /api/v1/crawl for integration.
Cost controls | Tiered enrichment (minimal / cmo / full); optional costCap; cost recorded in execution_log and returned in cost_summary; observability emits cost by service.
Quality and observability | Per-crawl quality signals (hasPricing, hasTeam, productDetailsConfidence) and missing-data reports; structured JSON log line per crawl (duration by stage, cost, extractor success) for aggregators.
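
To make the structured log line concrete, here is a hedged sketch of how execution_log entries could be folded into one JSON line per crawl. The extractor, durationMs, and fromCache fields are named above; the stage, ok, and costUsd fields, the event name, and the output key names are assumptions, and the real format lives in lib/observability/crawl-metrics.ts.

```typescript
// Hypothetical entry shape. extractor, durationMs, and fromCache are named
// in this document; stage, ok, and costUsd are assumed for the example.
interface ExecutionLogEntrySketch {
  extractor: string;
  stage: string;
  durationMs: number;
  fromCache: boolean;
  ok: boolean;
  costUsd: number;
}

// Folds the entries into a single JSON line: duration by stage,
// per-extractor success, total cost, and a cache-hit count.
function buildCrawlLogLine(crawlId: string, entries: ExecutionLogEntrySketch[]): string {
  const durationByStage: Record<string, number> = {};
  const extractorSuccess: Record<string, boolean> = {};
  let totalCostUsd = 0;
  let cacheHits = 0;

  for (const e of entries) {
    durationByStage[e.stage] = (durationByStage[e.stage] ?? 0) + e.durationMs;
    extractorSuccess[e.extractor] = e.ok;
    totalCostUsd += e.costUsd;
    if (e.fromCache) cacheHits += 1;
  }

  return JSON.stringify({
    event: "crawl_complete", // assumed event name
    crawlId,
    durationByStage,
    extractorSuccess,
    totalCostUsd,
    cacheHits,
  });
}
```

Emitting one flat JSON object per crawl means a log aggregator can chart stage durations and cost without parsing free-form text.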

All of the above are implemented in the codebase; see lib/pipeline/run-shared-crawl.ts, lib/persistence/company-crawls.ts, lib/cache/crawl-cache.ts, lib/crawl-queue.ts, lib/observability/crawl-metrics.ts, and app/api/v1/crawl/route.ts.