How Aigentably Generates WebMCP Tools from a URL
Paste a URL, get a ranked list of WebMCP tools with input schemas and runnable executeJs. Here's what the pipeline actually does between those two events.
The naive approach (and why it fails)
The obvious way to generate WebMCP tools from a URL is to fetch the page, hand the HTML to an LLM, and ask "what tools should this site expose?" We tried it. It produces underwhelming results for three reasons.
One page is not enough. Homepages are marketing surfaces. The interesting actions (addToCart, applyCoupon, filterByCategory, submitOrder) live on product, cart, and search pages. Generating from the homepage alone gives you generic navigation tools and not much else.
Raw HTML is wasteful. Modern e-commerce HTML is mostly markup overhead: tracking pixels, framework hydration data, CSS classes. Feeding 200 KB of HTML per page to a model burns tokens on noise.
Models drift toward made-up tools. Given prose-like input, LLMs invent plausible tools that don't map to anything real on the page. The result looks reasonable, but the executeJs code targets selectors that don't exist.
The pipeline
URL
└─ planCrawl (robots.txt + sitemap.xml + link fallback)
└─ crawlSite (fetch up to 5 pages, parallel)
└─ extractSignals (cheerio → structured SiteSignals)
└─ buildSignalsPrompt (per page)
└─ callLlm (max 3 concurrent, JSON mode, 1 retry)
└─ parseLlmResponse
└─ dedupTools (keep highest-importance per name)
└─ applyFreeTierFilter
└─ persist → response
Five clear stages: discover pages, fetch them, turn each into a structured signal object, ask an LLM what tools fit, merge results. Everything is cacheable, retryable, and inspectable.
Stage 1: Page discovery
planCrawl picks the 5 pages most likely to expose interactive affordances. It tries three sources in order:
- /sitemap.xml and /sitemap_index.xml, recursively walked to flatten sitemap indexes.
- Homepage <a href> extraction as a fallback when no sitemap exists.
- Robots.txt filtering on top of both: Disallow rules for * or Aigentably-Bot remove candidates before they're fetched.
From the candidate set we pick representatives by URL hint matching: paths containing product, cart, checkout, search, account, login, category, and similar. The homepage is always included as page one.
The output of this stage is a list of URLs, each with a reason tag ("homepage", "product-page", "cart", etc.) that flows through to the LLM prompt so the model knows what kind of page it's reasoning about.
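To make the hint matching concrete, here is a minimal sketch of how representatives could be picked from a candidate list. The names `PlannedPage`, `PATH_HINTS`, `pathOf`, and `pickRepresentatives` are hypothetical, the hint-to-reason mapping is an assumption, and taking at most one page per hint is a simplification rather than a description of the real planCrawl.

```typescript
// Hypothetical output shape: a URL plus the reason tag passed to the prompt.
interface PlannedPage {
  url: string;
  reason: string;
}

// Assumed hint list; the article names these path hints "and similar".
const PATH_HINTS: ReadonlyArray<{ hint: string; reason: string }> = [
  { hint: "product", reason: "product-page" },
  { hint: "cart", reason: "cart" },
  { hint: "checkout", reason: "checkout" },
  { hint: "search", reason: "search" },
  { hint: "account", reason: "account" },
  { hint: "login", reason: "login" },
  { hint: "category", reason: "category" },
];

// Strip scheme and host so a hint cannot match the domain name itself.
function pathOf(url: string): string {
  return url.replace(/^https?:\/\/[^/]+/i, "").toLowerCase();
}

// Homepage first, then at most one candidate per hint, capped at maxPages.
function pickRepresentatives(
  homepage: string,
  candidates: string[],
  maxPages: number = 5,
): PlannedPage[] {
  const picked: PlannedPage[] = [{ url: homepage, reason: "homepage" }];
  const seen = new Set<string>([homepage]);
  for (const { hint, reason } of PATH_HINTS) {
    if (picked.length >= maxPages) break;
    const match = candidates.find(
      (u) => !seen.has(u) && pathOf(u).includes(hint),
    );
    if (match !== undefined) {
      picked.push({ url: match, reason });
      seen.add(match);
    }
  }
  return picked;
}
```

Because the hints are checked in a fixed order, earlier hints win when the page budget runs out; a real planner might weight them differently.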
Stage 2: Signal extraction
Each page goes through extractSignals, a cheerio-based extractor that produces a compact JSON object. Concrete shape:
| Field | Why it matters |
|---|---|
| forms | action + method + required field names. LLM proposes submit-style tools. |
| buttons | Text + id + data-* attributes. Drives click-style tools. |
| jsonLd | Product, Offer, BreadcrumbList. Names entities the agent should target. |
| framework | Shopify / Next / WordPress detection. Steers executeJs toward the right APIs. |
| globalApis | Exposed window.* objects (Shopify, __NEXT_DATA__). Lets LLM prefer JS calls over DOM scraping. |
| internalLinks | Hints at site IA for navigation tools. |
The whole signal object for a typical page is 2-4 KB. Compare with 200 KB of raw HTML. Lower cost, less noise, and the model is forced to ground its suggestions in identified DOM elements instead of free-associating from markup.
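As a rough illustration of the table above, the signal object might be typed like this. Only the six top-level field names come from the table; the nested field names (`requiredFields`, `dataAttrs`, and so on) and the sample values are assumptions made for the sketch.

```typescript
// Assumed nested shapes; only the top-level keys are taken from the table.
interface FormSignal {
  action: string;
  method: string;
  requiredFields: string[];
}

interface ButtonSignal {
  text: string;
  id?: string;
  dataAttrs: Record<string, string>;
}

interface SiteSignals {
  forms: FormSignal[];
  buttons: ButtonSignal[];
  jsonLd: string[]; // schema.org types found on the page, e.g. "Product"
  framework: string | null; // e.g. "shopify", or null when undetected
  globalApis: string[]; // names of exposed window.* objects
  internalLinks: string[];
}

// Invented sample for a product page; not output from the real extractor.
const exampleSignals: SiteSignals = {
  forms: [{ action: "/cart/add", method: "post", requiredFields: ["id", "quantity"] }],
  buttons: [{ text: "Add to cart", id: "add-to-cart", dataAttrs: { "data-variant": "123" } }],
  jsonLd: ["Product", "Offer", "BreadcrumbList"],
  framework: "shopify",
  globalApis: ["Shopify"],
  internalLinks: ["/cart", "/collections/all", "/search"],
};

// Character count of the serialized object, a rough proxy for prompt size.
const exampleSize = JSON.stringify(exampleSignals).length;
```

Even with a few dozen buttons and links added, an object of this shape stays in the low-kilobyte range, which is the point of extracting signals instead of sending markup.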
Stage 3: Per-page LLM call
Each page's signals get their own LLM call. Up to 3 calls run in parallel via Promise.allSettled. The prompt explicitly instructs the model to:
- Prefer exposed globals (Shopify.*, window.__NEXT_DATA__) over DOM scraping.
- Click existing buttons via querySelector rather than synthesizing new HTML.
- Never navigate away via window.location unless the tool is explicitly a navigation tool.
- Return JSON only, no markdown fences, enforced with response_format: json_object.
If the first response fails strict JSON validation we retry once with a hardened reminder prompt. After that, the page is allowed to contribute zero tools without failing the whole generation.
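The concurrency cap and single retry described above can be sketched roughly as follows. `LlmCall`, `createLimiter`, `callWithOneRetry`, and `generateForPages` are hypothetical names, the `hardened` flag stands in for the reminder prompt, and "strict JSON validation" is reduced here to "parses as a JSON object".

```typescript
// Hypothetical LLM client: `hardened` switches to the stricter reminder prompt.
type LlmCall = (prompt: string, hardened: boolean) => Promise<string>;

// Small semaphore: at most `max` tasks run at once, the rest wait in a queue.
function createLimiter(max: number) {
  let active = 0;
  const waiters: Array<() => void> = [];
  return async function run<R>(task: () => Promise<R>): Promise<R> {
    if (active >= max) {
      // Wait until a finishing task hands its slot over.
      await new Promise<void>((resolve) => waiters.push(resolve));
    } else {
      active++;
    }
    try {
      return await task();
    } finally {
      const next = waiters.shift();
      if (next) next(); // pass the slot directly to the next waiter
      else active--;
    }
  };
}

// One normal attempt, then one hardened attempt; throw if both are invalid.
async function callWithOneRetry(call: LlmCall, prompt: string): Promise<unknown> {
  for (const hardened of [false, true]) {
    const raw = await call(prompt, hardened);
    try {
      const parsed: unknown = JSON.parse(raw);
      if (typeof parsed === "object" && parsed !== null) return parsed;
    } catch {
      // Invalid JSON: fall through to the hardened retry.
    }
  }
  throw new Error("no valid JSON after one retry");
}

// A rejected entry means that page contributes zero tools; the rest survive.
function generateForPages(prompts: string[], call: LlmCall, maxConcurrent: number = 3) {
  const limit = createLimiter(maxConcurrent);
  return Promise.allSettled(
    prompts.map((p) => limit(() => callWithOneRetry(call, p))),
  );
}
```

Using `Promise.allSettled` rather than `Promise.all` is what lets one failing page drop out without rejecting the whole generation.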
Free users hit Gemini 2.5 Flash. Pro users hit Gemini 2.5 Pro. The pipeline is otherwise identical — Pro pays for higher-quality reasoning over the same signals.
Stage 4: Dedup and ranking
Five pages can produce a lot of overlapping tool names. addToCart on the product detail page and addToCart on a quick-add widget collapse into one. The dedup pass keeps the highest-importance variant per name and caps the final set at 12 tools sorted by importance.
Every tool keeps a sourceUrl stamp. The dashboard shows it as from /products/widget next to each suggestion so you can audit which page produced which tool.
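A minimal sketch of the dedup pass, assuming each tool carries a numeric `importance` score (the scale is an assumption) alongside its `name` and `sourceUrl`. The `Tool` interface and this `dedupTools` body are illustrative, not the production code.

```typescript
// Assumed tool shape; the real objects also carry schemas and executeJs.
interface Tool {
  name: string;
  importance: number;
  sourceUrl: string;
}

// Keep the highest-importance variant per name, then rank and cap the set.
function dedupTools(tools: Tool[], maxTools: number = 12): Tool[] {
  const best = new Map<string, Tool>();
  for (const tool of tools) {
    const current = best.get(tool.name);
    if (current === undefined || tool.importance > current.importance) {
      best.set(tool.name, tool);
    }
  }
  return Array.from(best.values())
    .sort((a, b) => b.importance - a.importance)
    .slice(0, maxTools);
}
```

The surviving variant keeps its own `sourceUrl`, so the audit trail points at the page that produced the winning version of each tool.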
Stage 5: The 24h crawl cache
Crawling 5 pages on every regenerate would be wasteful and slow. We cache the crawl result in a SiteProfile table keyed by site, with a 24-hour TTL.
Inside the window: regenerate hits the LLM fresh on the cached signals. You can iterate on tool variants without re-fetching the site. Outside the window: a fresh crawl happens automatically. A Force refresh button bypasses the cache when you've actually changed the site.
The dashboard shows a cached badge on the crawl report when results came from cache, so there's no confusion about whether the LLM saw your latest changes.
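The cache decision itself is small. The sketch below assumes a `CachedSiteProfile` record with a `crawledAt` timestamp in milliseconds; the field names and the `shouldRecrawl` helper are hypothetical and only illustrate the 24-hour rule and the force-refresh bypass.

```typescript
const CRAWL_TTL_MS = 24 * 60 * 60 * 1000; // the 24-hour window

// Assumed shape of a cached row; the real SiteProfile table may differ.
interface CachedSiteProfile {
  siteId: string;
  crawledAt: number; // epoch milliseconds of the last crawl
  signals: unknown[];
}

// True when a fresh crawl is needed: no cache, expired cache, or forced.
function shouldRecrawl(
  profile: CachedSiteProfile | undefined,
  now: number,
  forceRefresh: boolean = false,
): boolean {
  if (forceRefresh || profile === undefined) return true;
  return now - profile.crawledAt >= CRAWL_TTL_MS;
}
```

When `shouldRecrawl` returns false, the generation would reuse `profile.signals` and go straight to the LLM stage, which is also when the cached badge would be shown.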
What we explicitly avoid
- Headless browsers. Static HTML is enough for 90% of e-commerce. SPAs that render entirely client-side are a known limitation and planned for a future Pro tier.
- Auth-protected pages. The crawler never logs in. Public-only surface.
- Full-text page content. Body text is summarized via headings and link anchors. We aren't doing semantic search; we're proposing actions.
- Long-lived background jobs. One generation = one HTTP request. Quota is decremented before the LLM call and refunded on total failure.
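The quota rule in the last bullet can be sketched as a wrapper around the generation call. `UserQuota` and `generateWithQuota` are hypothetical, and "total failure" is modeled here as the generation function throwing, which is an assumption about how failure is signalled.

```typescript
// Assumed in-memory quota; a real system would update a database row.
interface UserQuota {
  remaining: number;
}

// Decrement first, run the generation, refund only if it fails outright.
async function generateWithQuota<T>(
  quota: UserQuota,
  generate: () => Promise<T>,
): Promise<T> {
  if (quota.remaining <= 0) throw new Error("quota exhausted");
  quota.remaining -= 1; // charged before any LLM call is made
  try {
    return await generate();
  } catch (err) {
    quota.remaining += 1; // total failure: give the generation back
    throw err;
  }
}
```

Charging up front keeps concurrent requests from overspending the quota, and the refund keeps users from paying for a generation that produced nothing.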
Costs and limits
Rough per-generation cost: 15-25 K input tokens across 5 pages, 2-6 K output tokens. On Gemini 2.5 Pro that's roughly $0.05-0.10 per generation. A Pro user at the 20 generations / 30 days quota costs us about $1-2 per month in raw LLM spend.
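The arithmetic behind those figures can be checked with a few lines. The per-million-token rates below are assumed placeholder values, not quoted from any price list, so substitute the current rates before relying on the result; `TokenUsage`, `Pricing`, and `generationCostUsd` are hypothetical names.

```typescript
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
}

interface Pricing {
  inputPerMillionUsd: number;
  outputPerMillionUsd: number;
}

// Cost of one generation: tokens times rate, summed over input and output.
function generationCostUsd(usage: TokenUsage, pricing: Pricing): number {
  return (
    (usage.inputTokens * pricing.inputPerMillionUsd +
      usage.outputTokens * pricing.outputPerMillionUsd) /
    1000000
  );
}

// ASSUMED rates for illustration only; check the provider's current pricing.
const assumedProPricing: Pricing = { inputPerMillionUsd: 1.25, outputPerMillionUsd: 10 };

// Low and high ends of the token ranges quoted in the text.
const lowCost = generationCostUsd({ inputTokens: 15000, outputTokens: 2000 }, assumedProPricing);
const highCost = generationCostUsd({ inputTokens: 25000, outputTokens: 6000 }, assumedProPricing);

// 20 generations per 30 days, the Pro quota.
const monthlyLow = lowCost * 20;
const monthlyHigh = highCost * 20;
```

With these assumed rates the per-generation cost comes out at roughly $0.04 to $0.09, and the monthly spend at roughly $0.8 to $1.8, in line with the rounded figures above. Output tokens dominate the bill even though there are far fewer of them.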
Free users get one lifetime generation on Gemini 2.5 Flash for under a cent. Locked tools beyond the top 2 are persisted, not discarded — they unlock automatically on upgrade without re-running the pipeline.
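One way to get "persisted, not discarded" is to store every tool and mark the ones a free user cannot see yet. The sketch below assumes the input is already sorted by importance; the `locked` flag, the `RankedTool` and `StoredTool` shapes, and this `applyFreeTierFilter` body are illustrative assumptions.

```typescript
interface RankedTool {
  name: string;
  importance: number;
}

interface StoredTool extends RankedTool {
  locked: boolean;
}

// Keep every tool; for free users, lock everything past the first freeLimit.
function applyFreeTierFilter(
  tools: RankedTool[],
  isPro: boolean,
  freeLimit: number = 2,
): StoredTool[] {
  return tools.map((tool, index) => ({
    ...tool,
    locked: !isPro && index >= freeLimit,
  }));
}
```

Under this model, upgrading amounts to recomputing the flag with `isPro` set to true, with no new crawl or LLM call.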
Try it on your site
Paste a URL. Get a ranked list of WebMCP tools, with input schemas and runnable executeJs, in under a minute.
Generate tools free