For engineers and contributors. User-facing documentation lives at /docs.

Interns Reference

Field | Value
Last updated | 2026-02

Purpose

Technical reference for each intern agent: what it does, inputs, outputs, and key code paths.

  • Intern agents produce raw data only — they do not produce reports or verdicts.
  • For pipeline order, reuse, and freshness, see CRAWL_SYSTEM and Background flow.

Discovery crawler

Finds and classifies links from a product URL so the pipeline knows which pages to crawl.

Input

Type | Description
URL | Single product URL (string).

Output

Type: DiscoveredPage[]

Each item has:

Field | Description
url | Discovered page URL.
type | One of: homepage, about, pricing, product, docs, blog, careers, case-studies, contact, other.
priority | 1–5 (higher = more important).
reasoning | Optional reasoning for classification.
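
The field table above could be expressed as a type roughly like the following. This is a minimal sketch inferred from the table, not the definition in lib/discovery.ts: the names PAGE_TYPES, PageType, and isDiscoveredPage are hypothetical, and treating priority as an integer is an assumption.

```typescript
// Hypothetical shape inferred from the field table; the real type may differ.
const PAGE_TYPES = [
  "homepage", "about", "pricing", "product", "docs",
  "blog", "careers", "case-studies", "contact", "other",
] as const;

type PageType = (typeof PAGE_TYPES)[number];

interface DiscoveredPage {
  url: string;
  type: PageType;
  priority: number; // 1-5, higher = more important (integer is an assumption)
  reasoning?: string;
}

// Runtime guard, e.g. for classification results that arrive as untyped JSON.
function isDiscoveredPage(value: unknown): value is DiscoveredPage {
  if (typeof value !== "object" || value === null) return false;
  const { url, type: pageType, priority, reasoning } = value as Record<string, unknown>;
  return (
    typeof url === "string" &&
    typeof pageType === "string" &&
    (PAGE_TYPES as readonly string[]).indexOf(pageType) !== -1 &&
    typeof priority === "number" &&
    Number.isInteger(priority) &&
    priority >= 1 &&
    priority <= 5 &&
    (reasoning === undefined || typeof reasoning === "string")
  );
}
```

A guard like this is one way to reject malformed items before they reach the Product intern.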

Behavior

  1. Extract internal links from the homepage HTML.
  2. Optionally supplement with URLs from the sitemap (/sitemap.xml, including sitemap index files).
  3. Probe conventional paths (e.g. /pricing, /about, /features, /product, /solutions, /docs, /integrations).
  4. Classify and prioritize links with AI (GPT); the page count is dynamic (e.g. 3–8 pages, configurable via MAX_PAGES).
  5. Always include the homepage in the list.
  6. Guarantee probed pricing paths are in the list (so SPAs that don’t expose /pricing in HTML are still crawled).
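
Steps 5 and 6 can be sketched as a post-processing pass over the classified list. This is an illustration under stated assumptions, not the logic in discoverAndClassifyLinks: finalizePages and Candidate are hypothetical names, forced pages are assumed to default to priority 5, and forced pages are assumed to be exempt from the page cap.

```typescript
interface Candidate {
  url: string;
  type: string;
  priority: number;
}

// Hypothetical helper: applies the homepage and pricing guarantees after AI classification.
function finalizePages(
  classified: Candidate[],
  homepageUrl: string,
  probedPricingUrls: string[],
  maxPages: number,
): Candidate[] {
  const byUrl = new Map<string, Candidate>();
  for (const page of classified) byUrl.set(page.url, page);

  // Prefer the classified entry when one exists; otherwise use an assumed default.
  const forced: Candidate[] = [
    byUrl.get(homepageUrl) ?? { url: homepageUrl, type: "homepage", priority: 5 },
    ...probedPricingUrls.map(
      (url) => byUrl.get(url) ?? { url, type: "pricing", priority: 5 },
    ),
  ];

  const seen = new Set<string>();
  const result: Candidate[] = [];
  // Forced pages go first so the cap below can never drop them.
  for (const page of forced) {
    if (!seen.has(page.url)) {
      seen.add(page.url);
      result.push(page);
    }
  }
  // Fill the remaining slots by descending priority.
  const ranked = [...classified].sort((a, b) => b.priority - a.priority);
  for (const page of ranked) {
    if (result.length >= maxPages) break;
    if (!seen.has(page.url)) {
      seen.add(page.url);
      result.push(page);
    }
  }
  return result;
}
```

Placing the forced pages ahead of the ranked ones means a low AI priority, or a pricing path that never appeared in the HTML, cannot push them out of the list.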

Code

Location | Symbol
lib/discovery.ts | discoverAndClassifyLinks
core/crawlers/discovery.ts | runDiscoveryCrawler

Caching

Optional Redis cache by normalized URL. See CRAWL_SYSTEM.
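
As an illustration of what "normalized URL" might mean for a cache key, the sketch below lowercases the host, strips a leading www., drops the fragment and trailing slashes, and sorts query parameters. These rules, the discovery: key prefix, and both function names are assumptions; the actual normalization is defined in the codebase (see CRAWL_SYSTEM).

```typescript
// Hypothetical normalization rules; the real ones may differ.
function normalizeUrlForCache(rawUrl: string): string {
  const url = new URL(rawUrl);
  url.hash = "";
  url.hostname = url.hostname.toLowerCase().replace(/^www\./, "");
  // Sort query parameters so ?a=1&b=2 and ?b=2&a=1 share a key.
  url.searchParams.sort();
  // Drop trailing slashes, but keep the root path as "/".
  const path = url.pathname.replace(/\/+$/, "") || "/";
  const query = url.searchParams.toString();
  return `${url.protocol}//${url.host}${path}${query ? `?${query}` : ""}`;
}

// Hypothetical key format for the optional Redis cache.
function discoveryCacheKey(rawUrl: string): string {
  return `discovery:${normalizeUrlForCache(rawUrl)}`;
}
```

With rules like these, https://example.com and https://www.example.com/ map to the same cache entry.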


Product intern

Fetches HTML for discovered pages and extracts structured evidence (headlines, pricing, metadata, product details, etc.).

Input

Type | Description
DiscoveredPage[] | Pages from the Discovery crawler.
preFetchedHtml | Optional; e.g. from UI capture to avoid duplicate fetch.

Output

Type: PageEvidence[]

Per page: url, pageType, and content with:

  • Content fields: headlines, keyPoints, textContent
  • Pricing: pricing (plans, pricingModel, rawText)
  • Metadata: metadata (title, description, ogImage, ogSiteName)
  • Team: teamInfo (teamMembers, companyStory, mission, foundingYear)
  • Product: productDetails (features, useCases, benefits, technicalDetails)
  • Social proof: customerLogos, integrationMentions
  • Structured data: faqSchema, organizationSchema, productSchema (JSON-LD)

Behavior

  1. Resolve HTML: Use preFetchedHtml when provided; otherwise captureHtmlForUrls (smartFetch or Playwright).
  2. Process in batches: e.g. 3 at a time; for each page call crawlPageForEvidence.
  3. Cheerio-based extraction: Headlines from h1/h2, key points, pricing tables, meta tags, JSON-LD (Organization, Product, FAQ), team info, product details, customer logos, integration/partner mentions.
  4. Politeness: 500 ms delay between batches.
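
The batching and politeness delay in steps 2 and 4 can be sketched as follows. This is a generic pattern, not the code in crawlStep: chunk, sleep, and processInBatches are hypothetical names, the worker parameter stands in for crawlPageForEvidence (whose real signature is not shown here), and running the pages of one batch concurrently is an assumption.

```typescript
// Split a list into consecutive batches of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (size < 1) throw new RangeError("size must be at least 1");
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// `worker` is a stand-in for the per-page extraction call.
async function processInBatches<T, R>(
  items: T[],
  worker: (item: T) => Promise<R>,
  batchSize = 3,
  delayMs = 500,
): Promise<R[]> {
  const results: R[] = [];
  let first = true;
  for (const batch of chunk(items, batchSize)) {
    // Pause between batches, but not before the first one.
    if (!first) await sleep(delayMs);
    first = false;
    // Pages within one batch run concurrently (assumption).
    const batchResults = await Promise.all(batch.map((item) => worker(item)));
    results.push(...batchResults);
  }
  return results;
}
```

Note that Promise.all rejects as soon as any page in a batch fails; a crawl that should tolerate individual page failures would need Promise.allSettled or per-page error handling instead.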

Code

Location | Symbol
lib/pipeline/steps/crawl.ts | crawlStep
lib/multi-crawler.ts | crawlPageForEvidence, crawlMultiplePages, PageEvidence
core/crawlers/product.ts | runProductCrawler

UI capture intern

Captures screenshots and full HTML for a subset of pages (for the CDO design audit and for reuse in product extraction).

Input

Type | Description
URLs | List of URLs to capture.
Metadata | Discovered-page metadata (for logging).
Log callback | For progress/logging.

Output

Type: CaptureResult[]

Each item: url, html, screenshotBase64.

Behavior

  1. Single Playwright browser instance.
  2. For each URL: navigate, wait (e.g. 3 s after load), take screenshot, capture HTML.
  3. On browser init failure: fall back to fetch + optional screenshot service per URL.
  4. Viewport: 1920×1080. Navigation timeout: 20 s.
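
The control flow above, including the fallback in step 3, might look roughly like this. To keep the sketch runnable without Playwright, the browser and network calls are injected: BrowserSession, CaptureDeps, captureAll, and CAPTURE_SETTINGS are hypothetical names, and returning an empty screenshotBase64 when no screenshot service is configured is an assumption.

```typescript
interface CaptureResult {
  url: string;
  html: string;
  screenshotBase64: string;
}

// Values taken from the behavior list above.
const CAPTURE_SETTINGS = {
  viewport: { width: 1920, height: 1080 },
  navigationTimeoutMs: 20000,
  settleDelayMs: 3000,
};

// Hypothetical seams so the flow can be shown without a real browser.
interface BrowserSession {
  capture(url: string): Promise<CaptureResult>;
  close(): Promise<void>;
}

interface CaptureDeps {
  launchBrowser(settings: typeof CAPTURE_SETTINGS): Promise<BrowserSession>;
  fetchHtml(url: string): Promise<string>;
  screenshotService?: (url: string) => Promise<string>;
}

async function captureAll(urls: string[], deps: CaptureDeps): Promise<CaptureResult[]> {
  // A failed launch resolves to null instead of throwing.
  const session = await deps.launchBrowser(CAPTURE_SETTINGS).catch(() => null);
  const results: CaptureResult[] = [];

  if (session === null) {
    // Browser init failed: plain fetch plus optional screenshot service, per URL.
    for (const url of urls) {
      const html = await deps.fetchHtml(url);
      const screenshotBase64 = deps.screenshotService
        ? await deps.screenshotService(url)
        : "";
      results.push({ url, html, screenshotBase64 });
    }
    return results;
  }

  try {
    // One browser instance is reused for every URL.
    for (const url of urls) results.push(await session.capture(url));
    return results;
  } finally {
    await session.close();
  }
}
```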

Code

Location | Symbol
lib/cdo/steps/capture.ts | captureStep
core/crawlers/ui-capture.ts | runUICaptureCrawler

The maximum number of pages to capture (e.g. 8) is set by MAX_PAGES_UI_CAPTURE in lib/pipeline/run-shared-crawl.ts.


Competitor / enrichment intern

Enriches page evidence with external data: brand validation, traffic estimate, competitors, distribution channels, social metrics, value-prop clarity, etc. Competitor data comes from search APIs and an optional crawl of competitor domains; no other third-party pages are scraped.

Input

Type | Description
PageEvidence[] | From the Product intern.
PipelineContext | body.url, getOpenAI, costTracker, addLog. Market category is computed inside this step via categoryStep.

Output

Type: EnrichedData

Includes: brandValidation, trafficData, competitorProfiles, distributionChannels, socialMetrics, valuePropClarity, tractionSignals, and optionally more (reviews, technical, content/SEO, hiring) depending on profile.

Behavior

  1. categoryStep: AI extracts market category from page evidence.
  2. enrichmentStep: Derive brand name and domain from evidence; then call dataEnricher.enrichPageEvidence, which:
    • Extracts traction signals from pages
    • Detects distribution channels from pages
    • Optionally scrapes social metrics (if not minimal tier)
    • Brand validation (search API + same-brand domain filtering)
    • Traffic (API + Google Trends + AI fallback)
    • Competitor analysis (search + optional competitor-domain crawl)
    • Value-prop clarity
    • Optionally: review aggregation, technical analyzer, content/SEO, hiring, feature comparison (by profile)

Tiers

Tier | Scope
minimal | Brand + traffic + competitors.
cmo | Skip heavy technical/reviews.
full | Full enrichment. Cost cap can force downgrade to minimal.
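
One way to picture the tiers and the cost-cap downgrade is a lookup table plus a guard. This is a sketch under assumptions: the TierPlan flags, the TIER_PLANS mapping, and resolveTier are hypothetical, and comparing spend against a single USD cap is a simplification of whatever costTracker actually does.

```typescript
type Tier = "minimal" | "cmo" | "full";

// Which optional enrichers run per tier (illustrative reading of the table above).
interface TierPlan {
  socialMetrics: boolean;
  reviews: boolean;
  technical: boolean;
}

const TIER_PLANS: Record<Tier, TierPlan> = {
  minimal: { socialMetrics: false, reviews: false, technical: false },
  cmo: { socialMetrics: true, reviews: false, technical: false },
  full: { socialMetrics: true, reviews: true, technical: true },
};

// Hypothetical cost-cap check: once spend reaches the cap, drop to minimal.
function resolveTier(requested: Tier, spentUsd: number, costCapUsd: number): Tier {
  return spentUsd >= costCapUsd ? "minimal" : requested;
}
```

Brand validation, traffic, and competitor analysis are omitted from the flags because every tier includes them.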

Code

Location | Symbol
core/crawlers/competitor.ts | runCompetitorCrawler
lib/pipeline/steps/enrichment.ts | enrichmentStep
lib/enrichment/data-enricher.ts | DataEnricher.enrichPageEvidence

For cost and cache behavior see CRAWL_SYSTEM.


Pricing pass (supplement)

Not a standalone intern. When the product crawl has empty or low-confidence pricing, a dedicated pricing pass runs: resolve a pricing page URL (from evidence or conventional paths), fetch it, extract pricing, and merge into page evidence.
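
The trigger and URL-resolution parts of that pass can be sketched as two pure functions. This is not the logic in lib/crawler/pricing-pass.ts: the confidence field, the MIN_CONFIDENCE threshold, the PRICING_PATHS list, and both function names are assumptions made for illustration.

```typescript
interface PricingEvidence {
  plans: unknown[];
  confidence?: number; // hypothetical 0..1 score; the source only says "low-confidence"
}

interface EvidencePage {
  url: string;
  pageType: string;
  pricing?: PricingEvidence;
}

const PRICING_PATHS = ["/pricing", "/plans"]; // illustrative list of conventional paths
const MIN_CONFIDENCE = 0.5; // hypothetical threshold

// True when no page carries non-empty pricing with sufficient confidence.
function needsPricingPass(pages: EvidencePage[]): boolean {
  return !pages.some(
    (p) =>
      p.pricing !== undefined &&
      p.pricing.plans.length > 0 &&
      (p.pricing.confidence ?? 1) >= MIN_CONFIDENCE,
  );
}

// Candidate pricing URLs: pages already typed as pricing first, then conventional paths.
function candidatePricingUrls(pages: EvidencePage[], siteUrl: string): string[] {
  const fromEvidence = pages.filter((p) => p.pageType === "pricing").map((p) => p.url);
  const conventional = PRICING_PATHS.map((path) => new URL(path, siteUrl).toString());
  return Array.from(new Set([...fromEvidence, ...conventional]));
}
```

Fetching the chosen URL, extracting pricing, and merging the result back into page evidence would follow these two checks.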

Code

lib/crawler/pricing-pass.ts


See also