Interns Reference
| Field | Value |
|---|---|
| Last updated | 2026-02 |
Purpose
Technical reference for each intern agent: what it does, inputs, outputs, and key code paths.
- Intern agents produce raw data only — they do not produce reports or verdicts.
- For pipeline order, reuse, and freshness, see CRAWL_SYSTEM and Background flow.
Discovery crawler
Finds and classifies links from a product URL so the pipeline knows which pages to crawl.
Input
| Type | Description |
|---|---|
| URL | Single product URL (string). |
Output
Type: DiscoveredPage[]
Each item has:
| Field | Description |
|---|---|
url | Discovered page URL. |
type | One of: homepage, about, pricing, product, docs, blog, careers, case-studies, contact, other. |
priority | 1–5 (higher = more important). |
reasoning | Optional reasoning for classification. |
Behavior
- Extract internal links from the homepage HTML.
- Optionally supplement with sitemap (
/sitemap.xml, sitemap index). - Probe conventional paths (e.g.
/pricing,/about,/features,/product,/solutions,/docs,/integrations). - AI classification and prioritization (GPT); dynamic page count (e.g. 3–8 pages, configurable via
MAX_PAGES). - Always include the homepage in the list.
- Guarantee probed pricing paths are in the list (so SPAs that don’t expose
/pricingin HTML are still crawled).
Code
| Location | Symbol |
|---|---|
lib/discovery.ts | discoverAndClassifyLinks |
core/crawlers/discovery.ts | runDiscoveryCrawler |
Caching
Optional Redis cache by normalized URL. See CRAWL_SYSTEM.
Product intern
Fetches HTML for discovered pages and extracts structured evidence (headlines, pricing, metadata, product details, etc.).
Input
| Type | Description |
|---|---|
| DiscoveredPage[] | Pages from the Discovery crawler. |
| preFetchedHtml | Optional; e.g. from UI capture to avoid duplicate fetch. |
Output
Type: PageEvidence[]
Per page: url, pageType, and content with:
- Content fields:
headlines,keyPoints,textContent - Pricing:
pricing(plans, pricingModel, rawText) - Metadata:
metadata(title, description, ogImage, ogSiteName) - Team:
teamInfo(teamMembers, companyStory, mission, foundingYear) - Product:
productDetails(features, useCases, benefits, technicalDetails) - Social proof:
customerLogos,integrationMentions - Structured data:
faqSchema,organizationSchema,productSchema(JSON-LD)
Behavior
- Resolve HTML: Use
preFetchedHtmlwhen provided; otherwisecaptureHtmlForUrls(smartFetch or Playwright). - Process in batches: e.g. 3 at a time; for each page call
crawlPageForEvidence. - Cheerio-based extraction: Headlines from h1/h2, key points, pricing tables, meta tags, JSON-LD (Organization, Product, FAQ), team info, product details, customer logos, integration/partner mentions.
- Politeness: 500 ms delay between batches.
Code
| Location | Symbol |
|---|---|
lib/pipeline/steps/crawl.ts | crawlStep |
lib/multi-crawler.ts | crawlPageForEvidence, crawlMultiplePages, PageEvidence |
core/crawlers/product.ts | runProductCrawler |
UI capture intern
Captures screenshots and full HTML for a subset of pages (for the CDO design audit and for re-use in product extraction).
Input
| Type | Description |
|---|---|
| URLs | List of URLs to capture. |
| Metadata | Discovered-page metadata (for logging). |
| Log callback | For progress/logging. |
Output
Type: CaptureResult[]
Each item: url, html, screenshotBase64.
Behavior
- Single Playwright browser instance.
- For each URL: navigate, wait (e.g. 3 s after load), take screenshot, capture HTML.
- On browser init failure: fallback to fetch + optional screenshot service per URL.
- Viewport: 1920×1080. Navigation timeout: 20 s.
Code
| Location | Symbol |
|---|---|
lib/cdo/steps/capture.ts | captureStep |
core/crawlers/ui-capture.ts | runUICaptureCrawler |
Max pages (e.g. 8) is set in lib/pipeline/run-shared-crawl.ts (MAX_PAGES_UI_CAPTURE).
Competitor / enrichment intern
Enriches page evidence with external data: brand validation, traffic estimate, competitors, distribution channels, social metrics, value-prop clarity, etc. Competitor data comes from search APIs and optional crawl of competitor domains; no arbitrary third-party page scraping beyond that.
Input
| Type | Description |
|---|---|
| PageEvidence[] | From the Product intern. |
| PipelineContext | body.url, getOpenAI, costTracker, addLog. Market category is computed inside this step via categoryStep. |
Output
Type: EnrichedData
Includes: brandValidation, trafficData, competitorProfiles, distributionChannels, socialMetrics, valuePropClarity, tractionSignals, and optionally more (reviews, technical, content/SEO, hiring) depending on profile.
Behavior
- categoryStep: AI extracts market category from page evidence.
- enrichmentStep: Derive brand name and domain from evidence; then call
dataEnricher.enrichPageEvidence, which:- Extracts traction signals from pages
- Detects distribution channels from pages
- Optionally scrapes social metrics (if not minimal tier)
- Brand validation (search API + same-brand domain filtering)
- Traffic (API + Google Trends + AI fallback)
- Competitor analysis (search + optional competitor-domain crawl)
- Value-prop clarity
- Optionally: review aggregation, technical analyzer, content/SEO, hiring, feature comparison (by profile)
Tiers
| Tier | Scope |
|---|---|
minimal | Brand + traffic + competitors. |
cmo | Skip heavy technical/reviews. |
full | Full enrichment. Cost cap can force downgrade to minimal. |
Code
| Location | Symbol |
|---|---|
core/crawlers/competitor.ts | runCompetitorCrawler |
lib/pipeline/steps/enrichment.ts | enrichmentStep |
lib/enrichment/data-enricher.ts | DataEnricher.enrichPageEvidence |
For cost and cache behavior see CRAWL_SYSTEM.
Pricing pass (supplement)
Not a standalone intern. When the product crawl has empty or low-confidence pricing, a dedicated pricing pass runs: resolve a pricing page URL (from evidence or conventional paths), fetch it, extract pricing, and merge into page evidence.
Code
lib/crawler/pricing-pass.ts
See also
- CRAWL_SYSTEM — Reuse, freshness, User-Agent/politeness, enterprise
- Background flow — Order of interns in the pipeline and entry points
- Agents overview — Intern vs report agents