Interns Reference

Field	Value
Last updated	2026-02

Purpose

Technical reference for each intern agent: what it does, inputs, outputs, and key code paths.

Intern agents produce raw data only; they do not produce reports or verdicts.
For pipeline order, reuse, and freshness, see CRAWL_SYSTEM and Background flow.

Discovery crawler

Finds and classifies links from a product URL so the pipeline knows which pages to crawl.

Input

Type	Description
URL	Single product URL (string).

Output

Type: DiscoveredPage[]

Each item has:

Field	Description
`url`	Discovered page URL.
`type`	One of: `homepage`, `about`, `pricing`, `product`, `docs`, `blog`, `careers`, `case-studies`, `contact`, `other`.
`priority`	1–5 (higher = more important).
`reasoning`	Optional reasoning for classification.
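
As a rough illustration, the item shape described in the table above could be written as the TypeScript type below. The `PageType` alias, the `byPriorityDesc` helper, and the sample URLs are assumptions made for this sketch; the actual declaration in the codebase may differ.

```typescript
// Sketch of the DiscoveredPage shape, derived only from the field table above.
type PageType =
  | "homepage" | "about" | "pricing" | "product" | "docs"
  | "blog" | "careers" | "case-studies" | "contact" | "other";

interface DiscoveredPage {
  url: string;
  type: PageType;
  priority: 1 | 2 | 3 | 4 | 5; // higher = more important
  reasoning?: string;          // optional classification reasoning
}

// Hypothetical helper: order pages so the most important come first.
// Returns a new array and leaves the input untouched.
function byPriorityDesc(pages: DiscoveredPage[]): DiscoveredPage[] {
  return [...pages].sort((a, b) => b.priority - a.priority);
}

// Sample data (made up for illustration).
const example: DiscoveredPage[] = [
  { url: "https://example.com/blog", type: "blog", priority: 2 },
  { url: "https://example.com/", type: "homepage", priority: 5 },
  { url: "https://example.com/pricing", type: "pricing", priority: 4, reasoning: "Linked from nav" },
];
```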

Behavior

Extract internal links from the homepage HTML.
Optionally supplement them with sitemap entries (/sitemap.xml, including sitemap index files).
Probe conventional paths (e.g. /pricing, /about, /features, /product, /solutions, /docs, /integrations).
Classify and prioritize the candidate links with AI (GPT); the number of pages kept is dynamic (e.g. 3–8 pages, configurable via MAX_PAGES).
Always include the homepage in the list.
Guarantee that probed pricing paths are in the list, so that SPAs that don’t expose /pricing in their HTML are still crawled.
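
The last two rules (always keep the homepage, always keep probed pricing paths) can be sketched as a post-processing step over the classified list. This is a minimal sketch, not the actual `discoverAndClassifyLinks` logic: `finalizePages`, its parameters, and the priorities given to forced entries are hypothetical, and it assumes that forced entries may push the list past the page limit.

```typescript
interface Page {
  url: string;
  type: string;
  priority: number;
}

// Hypothetical helper: trim to the page limit, then force-include
// the homepage and any pricing URLs found by probing.
function finalizePages(
  classified: Page[],
  homepageUrl: string,
  probedPricingUrls: string[], // assumption: pricing paths that responded to a probe
  maxPages: number,
): Page[] {
  const result = [...classified]
    .sort((a, b) => b.priority - a.priority)
    .slice(0, maxPages);
  const seen = new Set(result.map((p) => p.url));

  if (!seen.has(homepageUrl)) {
    // Priority 5 for the homepage is an assumption of this sketch.
    result.unshift({ url: homepageUrl, type: "homepage", priority: 5 });
    seen.add(homepageUrl);
  }
  for (const url of probedPricingUrls) {
    if (!seen.has(url)) {
      // Priority 4 for pricing is likewise an assumption.
      result.push({ url, type: "pricing", priority: 4 });
      seen.add(url);
    }
  }
  return result;
}
```

Whether forced entries count against MAX_PAGES is not stated here; the sketch lets them exceed it so that neither guarantee can be dropped by the limit.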

Code

Location	Symbol
`lib/discovery.ts`	`discoverAndClassifyLinks`
`core/crawlers/discovery.ts`	`runDiscoveryCrawler`

Caching

Optional Redis cache by normalized URL. See CRAWL_SYSTEM.
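
The normalization rules are not spelled out in this reference. The sketch below shows one plausible way to derive a cache key from a URL, assuming that normalization means lowercasing the host, dropping the fragment, and trimming trailing slashes. The function names and the `discovery:` key prefix are hypothetical.

```typescript
// Hypothetical normalization; the real rules may differ (see CRAWL_SYSTEM).
function normalizeUrl(raw: string): string {
  const u = new URL(raw); // the URL parser already lowercases the host
  const path = u.pathname.replace(/\/+$/, "") || "/"; // trim trailing slashes, keep root
  // The fragment (u.hash) is intentionally left out; query order is not normalized.
  return `${u.protocol}//${u.host}${path}${u.search}`;
}

// Hypothetical key format for the Redis cache.
function discoveryCacheKey(raw: string): string {
  return `discovery:${normalizeUrl(raw)}`;
}
```

With these assumptions, `https://Example.com/Pricing/#top` and `https://example.com/Pricing` map to the same key.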

Product intern

Fetches HTML for discovered pages and extracts structured evidence (headlines, pricing, metadata, product details, etc.).

Input

Type	Description
DiscoveredPage[]	Pages from the Discovery crawler.
preFetchedHtml	Optional; e.g. from UI capture to avoid duplicate fetch.

Output

Type: PageEvidence[]

Per page: url, pageType, and content with:

Content fields: headlines, keyPoints, textContent
Pricing: pricing (plans, pricingModel, rawText)
Metadata: metadata (title, description, ogImage, ogSiteName)
Team: teamInfo (teamMembers, companyStory, mission, foundingYear)
Product: productDetails (features, useCases, benefits, technicalDetails)
Social proof: customerLogos, integrationMentions
Structured data: faqSchema, organizationSchema, productSchema (JSON-LD)

Behavior

Resolve HTML: Use preFetchedHtml when provided; otherwise captureHtmlForUrls (smartFetch or Playwright).
Process in batches: e.g. 3 at a time; for each page call crawlPageForEvidence.
Cheerio-based extraction: Headlines from h1/h2, key points, pricing tables, meta tags, JSON-LD (Organization, Product, FAQ), team info, product details, customer logos, integration/partner mentions.
Politeness: 500 ms delay between batches.
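
The batching and politeness delay described above can be sketched as follows. This is an illustration under stated assumptions rather than the code in `lib/multi-crawler.ts`: `chunk`, `sleep`, and `processInBatches` are hypothetical names, and `worker` stands in for a per-page call such as `crawlPageForEvidence`.

```typescript
// Split a list into consecutive batches of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Run `worker` on each batch concurrently, pausing between batches.
async function processInBatches<T, R>(
  items: T[],
  batchSize: number, // e.g. 3
  delayMs: number,   // e.g. 500
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = [];
  const batches = chunk(items, batchSize);
  for (let i = 0; i < batches.length; i++) {
    const batch = batches[i] ?? [];
    const batchResults = await Promise.all(batch.map((item) => worker(item)));
    results.push(...batchResults);
    // Delay only between batches, not after the last one.
    if (i < batches.length - 1) await sleep(delayMs);
  }
  return results;
}
```

Results come back in input order because `Promise.all` preserves the order of the promises it is given.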

Code

Location	Symbol
`lib/pipeline/steps/crawl.ts`	`crawlStep`
`lib/multi-crawler.ts`	`crawlPageForEvidence`, `crawlMultiplePages`, `PageEvidence`
`core/crawlers/product.ts`	`runProductCrawler`

UI capture intern

Captures screenshots and full HTML for a subset of pages (for the CDO design audit and for re-use in product extraction).

Input

Type	Description
URLs	List of URLs to capture.
Metadata	Discovered-page metadata (for logging).
Log callback	For progress/logging.

Output

Type: CaptureResult[]

Each item: url, html, screenshotBase64.

Behavior

Single Playwright browser instance.
For each URL: navigate, wait (e.g. 3 s after load), take screenshot, capture HTML.
On browser init failure: fall back to fetch plus an optional screenshot service, per URL.
Viewport: 1920×1080. Navigation timeout: 20 s.
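
The control flow above (one shared browser, with a per-URL fallback when the browser cannot start) can be sketched with injected dependencies so it runs without Playwright installed. `CaptureSession`, `captureAll`, `launch`, and `fallback` are hypothetical names for this sketch, and treating `screenshotBase64` as nullable is an assumption based on the screenshot service being optional.

```typescript
interface CaptureResult {
  url: string;
  html: string;
  screenshotBase64: string | null; // assumption: null when no screenshot is available
}

// Stand-in for a single browser instance. A real implementation would
// navigate (20 s timeout), wait about 3 s after load, then take the
// screenshot and read the HTML at a 1920x1080 viewport.
interface CaptureSession {
  capture(url: string): Promise<CaptureResult>;
  close(): Promise<void>;
}

async function captureAll(
  urls: string[],
  launch: () => Promise<CaptureSession>,
  fallback: (url: string) => Promise<CaptureResult>,
): Promise<CaptureResult[]> {
  const results: CaptureResult[] = [];

  // Browser init failure switches every URL to the fallback path.
  const session = await launch().catch(() => null);
  if (session === null) {
    for (const url of urls) results.push(await fallback(url));
    return results;
  }

  try {
    // Pages are captured one at a time on the shared session.
    for (const url of urls) results.push(await session.capture(url));
    return results;
  } finally {
    await session.close(); // always release the browser
  }
}
```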

Code

Location	Symbol
`lib/cdo/steps/capture.ts`	`captureStep`
`core/crawlers/ui-capture.ts`	`runUICaptureCrawler`

Max pages (e.g. 8) is set in lib/pipeline/run-shared-crawl.ts (MAX_PAGES_UI_CAPTURE).

Competitor / enrichment intern

Enriches page evidence with external data: brand validation, traffic estimate, competitors, distribution channels, social metrics, value-prop clarity, etc. Competitor data comes from search APIs and optional crawl of competitor domains; no arbitrary third-party page scraping beyond that.

Input

Type	Description
PageEvidence[]	From the Product intern.
PipelineContext	`body.url`, `getOpenAI`, `costTracker`, `addLog`. Market category is computed inside this step via `categoryStep`.

Output

Type: EnrichedData

Includes: brandValidation, trafficData, competitorProfiles, distributionChannels, socialMetrics, valuePropClarity, tractionSignals, and optionally more (reviews, technical, content/SEO, hiring) depending on profile.

Behavior

categoryStep: AI extracts market category from page evidence.
enrichmentStep: Derive brand name and domain from evidence; then call dataEnricher.enrichPageEvidence, which:
- Extracts traction signals from pages
- Detects distribution channels from pages
- Optionally scrapes social metrics (if not minimal tier)
- Validates the brand (search API + same-brand domain filtering)
- Estimates traffic (API + Google Trends + AI fallback)
- Analyzes competitors (search + optional competitor-domain crawl)
- Assesses value-prop clarity
- Optionally runs review aggregation, the technical analyzer, content/SEO, hiring, and feature comparison (by profile)

Tiers

Tier	Scope
`minimal`	Brand + traffic + competitors.
`cmo`	Skips the heavy technical and review analyses.
`full`	Full enrichment. A cost cap can force a downgrade to `minimal`.
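
One way to read the tier table is as a mapping from tier to the optional steps that run, with the cost cap applied first. The sketch below is an interpretation, not the `DataEnricher` implementation: `effectiveTier`, `planFor`, `EnrichmentPlan`, and the dollar-based cap parameters are hypothetical.

```typescript
type Tier = "minimal" | "cmo" | "full";

// Only the optional steps are modeled; brand, traffic, and competitors always run.
interface EnrichmentPlan {
  socialMetrics: boolean;
  reviews: boolean;
  technical: boolean;
}

// Assumption: the cap is compared against cost already spent in this run.
function effectiveTier(requested: Tier, spentUsd: number, capUsd: number): Tier {
  return spentUsd >= capUsd ? "minimal" : requested;
}

function planFor(tier: Tier): EnrichmentPlan {
  switch (tier) {
    case "minimal":
      return { socialMetrics: false, reviews: false, technical: false };
    case "cmo":
      // Social metrics run on any non-minimal tier; heavy steps are skipped.
      return { socialMetrics: true, reviews: false, technical: false };
    case "full":
      return { socialMetrics: true, reviews: true, technical: true };
  }
}
```

For example, `planFor(effectiveTier("full", 2.5, 2.0))` resolves to the `minimal` plan because the hypothetical cap has been exceeded.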

Code

Location	Symbol
`core/crawlers/competitor.ts`	`runCompetitorCrawler`
`lib/pipeline/steps/enrichment.ts`	`enrichmentStep`
`lib/enrichment/data-enricher.ts`	`DataEnricher.enrichPageEvidence`

For cost and cache behavior see CRAWL_SYSTEM.

Pricing pass (supplement)

Not a standalone intern. When the product crawl returns empty or low-confidence pricing, a dedicated pricing pass runs. It resolves a pricing page URL (from evidence or conventional paths), fetches the page, extracts pricing, and merges the result into the page evidence.
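
The decision and merge steps can be sketched as pure functions, leaving fetching and extraction out. Everything here is an assumption made for illustration: the reduced `Evidence` and `PricingInfo` shapes, the numeric `confidence` field and its 0.5 threshold, the `/pricing` default path, and all function names. The real logic lives in the file listed under Code below.

```typescript
interface PricingInfo {
  plans: string[];
  rawText: string;
  confidence: number; // hypothetical 0..1 score
}

interface Evidence {
  url: string;
  pageType: string;
  pricing?: PricingInfo;
}

// Run the pass when no page carries non-empty, sufficiently confident pricing.
function needsPricingPass(pages: Evidence[], minConfidence = 0.5): boolean {
  return !pages.some(
    (p) =>
      p.pricing !== undefined &&
      p.pricing.plans.length > 0 &&
      p.pricing.confidence >= minConfidence,
  );
}

// Prefer a pricing page already present in the evidence; otherwise
// fall back to a conventional path on the same origin.
function resolvePricingUrl(pages: Evidence[], baseUrl: string): string {
  const known = pages.find((p) => p.pageType === "pricing");
  return known ? known.url : new URL("/pricing", baseUrl).toString();
}

// Merge extracted pricing into the evidence without mutating the input.
function mergePricing(pages: Evidence[], pricingUrl: string, pricing: PricingInfo): Evidence[] {
  if (pages.some((p) => p.url === pricingUrl)) {
    return pages.map((p) => (p.url === pricingUrl ? { ...p, pricing } : p));
  }
  return [...pages, { url: pricingUrl, pageType: "pricing", pricing }];
}
```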

Code

lib/crawler/pricing-pass.ts
