Deep Dive · June 24, 2026 · 8 min read

How Aigentably Generates WebMCP Tools from a URL

Paste a URL, get a ranked list of WebMCP tools with input schemas and runnable executeJs. Here's what the pipeline actually does between those two events.

The naive approach (and why it fails)

The obvious way to generate WebMCP tools from a URL is to fetch the page, hand the HTML to an LLM, and ask "what tools should this site expose?" We tried it. It produces underwhelming results for three reasons.

One page is not enough. Homepages are marketing surfaces. The interesting actions (addToCart, applyCoupon, filterByCategory, submitOrder) live on product, cart, and search pages. Generating from the homepage alone gives you generic navigation tools and not much else.

Raw HTML is wasteful. Modern e-commerce HTML is mostly markup overhead: tracking pixels, framework hydration data, CSS classes. Feeding 200 KB of HTML per page to a model burns tokens on noise.

Models drift toward made-up tools. Given prose-like input, LLMs invent plausible tools that don't map to anything real on the page. The result looks reasonable, but the executeJs code targets selectors that don't exist.

The pipeline

URL
 └─ planCrawl       (robots.txt + sitemap.xml + link fallback)
     └─ crawlSite   (fetch up to 5 pages, parallel)
         └─ extractSignals (cheerio → structured SiteSignals)
             └─ buildSignalsPrompt (per page)
                 └─ callLlm        (max 3 concurrent, JSON mode, 1 retry)
                     └─ parseLlmResponse
                         └─ dedupTools   (keep highest-importance per name)
                             └─ applyFreeTierFilter
                                 └─ persist → response

Five clear stages: discover pages, fetch them, turn each into a structured signal object, ask an LLM what tools fit, merge results. Everything is cacheable, retryable, and inspectable.

Stage 1: Page discovery

planCrawl picks up to 5 pages most likely to expose interactive affordances. It builds the candidate list in three steps:

  1. /sitemap.xml and /sitemap_index.xml — recursively walked to flatten sitemap indexes.
  2. Homepage <a href> extraction as fallback when no sitemap exists.
  3. Robots.txt filtering on top of both — Disallow rules for * or Aigentably-Bot remove candidates before they're fetched.
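The robots.txt filtering in step 3 can be sketched as follows. This is a minimal illustration, not the actual planCrawl code: the helper names disallowedPrefixes and filterAllowed are hypothetical, and the parser is deliberately simplified (plain prefix matching, no Allow rules, no wildcard patterns).

```typescript
// Hypothetical helper: collect Disallow prefixes from groups that apply to
// "*" or "aigentably-bot". Simplified: ignores Allow rules and wildcards.
function disallowedPrefixes(
  robotsTxt: string,
  agents: string[] = ["*", "aigentably-bot"],
): string[] {
  const prefixes: string[] = [];
  let groupAgents: string[] = [];
  let sawRule = false; // true once the current group has at least one rule line

  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, "").trim(); // strip comments
    const sep = line.indexOf(":");
    if (sep === -1) continue;
    const key = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();

    if (key === "user-agent") {
      // A User-agent line after a rule line starts a new group.
      if (sawRule) {
        groupAgents = [];
        sawRule = false;
      }
      groupAgents.push(value.toLowerCase());
    } else if (key === "disallow" || key === "allow") {
      sawRule = true;
      const applies = groupAgents.some((a) => agents.includes(a));
      // An empty Disallow value means "nothing is disallowed".
      if (key === "disallow" && applies && value !== "") prefixes.push(value);
    }
  }
  return prefixes;
}

// Hypothetical helper: drop candidate URLs whose path starts with a disallowed prefix.
function filterAllowed(candidates: string[], robotsTxt: string): string[] {
  const prefixes = disallowedPrefixes(robotsTxt);
  return candidates.filter((url) => {
    const path = new URL(url).pathname;
    return !prefixes.some((p) => path.startsWith(p));
  });
}
```

Rules addressed to other crawlers are ignored, so a Disallow that targets only another bot does not remove a candidate.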

From the candidate set we pick representatives by URL hint matching: paths containing product, cart, checkout, search, account, login, category, and similar. The homepage is always included as page one.

The output of this stage is a list of URLs with a reason tag ("homepage", "product-page", "cart", etc.) that flows through to the LLM prompt so the model knows what kind of page it's reasoning about.
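A rough sketch of hint-based selection, under stated assumptions: the function name pickPages, the HINTS table, and the rule of keeping one representative per reason tag are illustrative guesses. Only the "homepage", "product-page", and "cart" tags come from the text above.

```typescript
interface PlannedPage {
  url: string;
  reason: string;
}

// Hypothetical hint table: path substring -> reason tag.
const HINTS: Array<[string, string]> = [
  ["product", "product-page"],
  ["cart", "cart"],
  ["checkout", "checkout"],
  ["search", "search"],
  ["account", "account"],
  ["login", "login"],
  ["category", "category"],
];

function pickPages(homepage: string, candidates: string[], maxPages = 5): PlannedPage[] {
  // The homepage is always page one.
  const pages: PlannedPage[] = [{ url: homepage, reason: "homepage" }];
  const usedReasons = new Set<string>(["homepage"]);

  for (const url of candidates) {
    if (pages.length >= maxPages) break;
    if (url === homepage) continue;
    const path = new URL(url).pathname.toLowerCase();
    const hint = HINTS.find(([needle]) => path.includes(needle));
    if (!hint) continue; // no hint matched: not a likely action page
    const reason = hint[1];
    if (usedReasons.has(reason)) continue; // assumption: one page per reason tag
    usedReasons.add(reason);
    pages.push({ url, reason });
  }
  return pages;
}
```

In this sketch candidate order decides priority, so a real planner would likely sort candidates before selecting.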

Stage 2: Signal extraction

Each page goes through extractSignals, a cheerio-based extractor that produces a compact JSON object. Concrete shape:

Field          Why it matters
forms          action + method + required field names. LLM proposes submit-style tools.
buttons        Text + id + data-* attributes. Drives click-style tools.
jsonLd         Product, Offer, BreadcrumbList. Names entities the agent should target.
framework      Shopify / Next / WordPress detection. Steers executeJs toward the right APIs.
globalApis     Exposed window.* objects (Shopify, __NEXT_DATA__). Lets LLM prefer JS calls over DOM scraping.
internalLinks  Hints at site IA for navigation tools.

The whole signal object for a typical page is 2-4 KB. Compare with 200 KB of raw HTML. Lower cost, less noise, and the model is forced to ground its suggestions in identified DOM elements instead of free-associating from markup.
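To make the shape concrete, here is one possible typing of the signal object with a small example. The top-level field names come from the table above; the nested property names (requiredFields, data, type) and the sample values are assumptions made for illustration.

```typescript
interface FormSignal {
  action: string;
  method: string;
  requiredFields: string[];
}

interface ButtonSignal {
  text: string;
  id?: string;
  data: Record<string, string>; // data-* attributes, without the "data-" prefix
}

interface JsonLdEntity {
  type: string; // e.g. "Product", "Offer", "BreadcrumbList"
  name?: string;
}

interface SiteSignals {
  forms: FormSignal[];
  buttons: ButtonSignal[];
  jsonLd: JsonLdEntity[];
  framework: string | null;
  globalApis: string[];
  internalLinks: string[];
}

// Hypothetical signals for a product page.
const exampleSignals: SiteSignals = {
  forms: [{ action: "/cart/add", method: "post", requiredFields: ["id", "quantity"] }],
  buttons: [{ text: "Add to cart", id: "add-to-cart", data: { "product-id": "123" } }],
  jsonLd: [{ type: "Product", name: "Widget" }, { type: "Offer" }],
  framework: "shopify",
  globalApis: ["Shopify"],
  internalLinks: ["/cart", "/collections/all", "/search"],
};

// Serialized size in characters (equal to bytes for ASCII-only content).
const approxSize = JSON.stringify(exampleSignals).length;
console.log(`signals size: ${approxSize} characters`);
```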

Stage 3: Per-page LLM call

Each page's signals get their own LLM call. Up to 3 calls run in parallel via Promise.allSettled. The prompt explicitly instructs the model to:

  • Prefer exposed globals (Shopify.*, window.__NEXT_DATA__) over DOM scraping.
  • Click existing buttons via querySelector rather than synthesizing new HTML.
  • Never navigate away via window.location unless the tool is explicitly a navigation tool.
  • Return JSON only, no markdown fences — enforced with response_format: json_object.
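A prompt builder that carries these instructions might look like the sketch below. The wording of the rules is paraphrased from the list above; the layout of the prompt and the signature of buildSignalsPrompt are assumptions, not the production prompt.

```typescript
interface PromptPage {
  url: string;
  reason: string; // reason tag from page discovery, e.g. "cart"
}

const PROMPT_RULES: string[] = [
  "Prefer exposed globals (Shopify.*, window.__NEXT_DATA__) over DOM scraping.",
  "Click existing buttons via querySelector; do not synthesize new HTML.",
  "Never navigate via window.location unless the tool is explicitly a navigation tool.",
  "Return JSON only, with no markdown fences.",
];

// Hypothetical layout: page header, numbered rules, then the compact signals JSON.
function buildSignalsPrompt(page: PromptPage, signals: unknown): string {
  return [
    `Page: ${page.url} (${page.reason})`,
    "Rules:",
    ...PROMPT_RULES.map((rule, i) => `${i + 1}. ${rule}`),
    "Signals:",
    JSON.stringify(signals),
  ].join("\n");
}
```

Passing the reason tag in the first line is what lets the model know which kind of page it is reasoning about.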

If the first response fails strict JSON validation we retry once with a hardened reminder prompt. After that, the page is allowed to contribute zero tools without failing the whole generation.
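The concurrency limit and the single retry can be sketched like this. It is an illustration under assumptions: the response shape ({ tools: [...] }), the reminder text, and the names toolsForPage and generateAll are hypothetical, and the LLM client is injected as a plain function so the sketch runs without any SDK. Batching in groups of three is one simple way to honor the limit.

```typescript
interface Tool {
  name: string;
  importance: number;
}

type CallLlm = (prompt: string) => Promise<string>;

// Strict validation: a JSON object with a `tools` array of well-formed entries.
function parseLlmResponse(raw: string): Tool[] | null {
  try {
    const parsed: unknown = JSON.parse(raw);
    if (typeof parsed !== "object" || parsed === null) return null;
    const tools = (parsed as { tools?: unknown }).tools;
    if (!Array.isArray(tools)) return null;
    const ok = tools.every(
      (t) =>
        typeof t === "object" &&
        t !== null &&
        typeof (t as Tool).name === "string" &&
        typeof (t as Tool).importance === "number",
    );
    return ok ? (tools as Tool[]) : null;
  } catch {
    return null; // not JSON at all (e.g. wrapped in markdown fences)
  }
}

// One page: first attempt, then exactly one retry with a stricter reminder.
async function toolsForPage(prompt: string, callLlm: CallLlm): Promise<Tool[]> {
  const first = parseLlmResponse(await callLlm(prompt));
  if (first) return first;
  const retry = parseLlmResponse(
    await callLlm(prompt + "\nReminder: respond with strict JSON only."),
  );
  if (retry) return retry;
  throw new Error("invalid JSON after one retry");
}

// All pages: batches of up to `limit` calls; a failed page contributes zero tools.
async function generateAll(prompts: string[], callLlm: CallLlm, limit = 3): Promise<Tool[][]> {
  const results: Tool[][] = [];
  for (let i = 0; i < prompts.length; i += limit) {
    const batch = prompts.slice(i, i + limit);
    const settled = await Promise.allSettled(batch.map((p) => toolsForPage(p, callLlm)));
    for (const s of settled) results.push(s.status === "fulfilled" ? s.value : []);
  }
  return results;
}
```

Because Promise.allSettled never rejects, one page that fails both attempts cannot abort the other pages in its batch.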

Free users hit Gemini 2.5 Flash. Pro users hit Gemini 2.5 Pro. The pipeline is otherwise identical — Pro pays for higher-quality reasoning over the same signals.

Stage 4: Dedup and ranking

Five pages can produce a lot of overlapping tool names. addToCart on the product detail page and addToCart on a quick-add widget collapse into one. The dedup pass keeps the highest-importance variant per name and caps the final set at 12 tools sorted by importance.

Every tool keeps a sourceUrl stamp. The dashboard shows it as from /products/widget next to each suggestion so you can audit which page produced which tool.
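The dedup rule is small enough to sketch in full. The Tool fields below (name, importance, sourceUrl) follow the description above; the exact types and the default cap parameter are assumptions.

```typescript
interface Tool {
  name: string;
  importance: number;
  sourceUrl: string; // page that produced this suggestion
}

function dedupTools(tools: Tool[], cap = 12): Tool[] {
  const best = new Map<string, Tool>();
  for (const tool of tools) {
    const current = best.get(tool.name);
    // Keep the highest-importance variant per name; the first one wins ties.
    if (!current || tool.importance > current.importance) best.set(tool.name, tool);
  }
  return Array.from(best.values())
    .sort((a, b) => b.importance - a.importance)
    .slice(0, cap);
}
```

The surviving variant keeps its own sourceUrl, so the audit trail still points at the page that produced the winning tool.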

Stage 5: The 24h crawl cache

Crawling 5 pages on every regenerate would be wasteful and slow. We cache the crawl result in a SiteProfile table keyed by site, with a 24-hour TTL.

Inside the window: regenerate hits the LLM fresh on the cached signals. You can iterate on tool variants without re-fetching the site. Outside the window: a fresh crawl happens automatically. A Force refresh button bypasses the cache when you've actually changed the site.

The dashboard shows a cached badge on the crawl report when results came from cache, so there's no confusion about whether the LLM saw your latest changes.
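The cache decision reduces to one check. In this sketch the SiteProfile fields (site, crawledAt, signals) and the function name shouldRecrawl are assumptions; only the 24-hour TTL and the force-refresh bypass come from the text.

```typescript
interface SiteProfile {
  site: string;
  crawledAt: number; // epoch milliseconds of the last crawl
  signals: unknown;  // cached per-page signals
}

const CRAWL_TTL_MS = 24 * 60 * 60 * 1000; // 24 hours

// Returns true when the site must be crawled again before calling the LLM.
function shouldRecrawl(
  profile: SiteProfile | undefined,
  now: number,
  forceRefresh = false,
): boolean {
  if (forceRefresh || !profile) return true;
  return now - profile.crawledAt >= CRAWL_TTL_MS;
}
```

When shouldRecrawl returns false, the crawl report can be flagged as cached and the LLM runs on the stored signals.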

What we explicitly avoid

  • Headless browsers. Static HTML is enough for 90% of e-commerce. SPAs that render entirely client-side are a known limitation and planned for a future Pro tier.
  • Auth-protected pages. The crawler never logs in. Public-only surface.
  • Full-text page content. Body text is summarized via headings and link anchors. We aren't doing semantic search; we're proposing actions.
  • Long-lived background jobs. One generation = one HTTP request. Quota is decremented before the LLM call and refunded on total failure.
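The decrement-then-refund pattern from the last bullet can be sketched as below. The Quota shape and runWithQuota are hypothetical, and "total failure" is interpreted here as either a thrown error or zero tools across all pages, which is an assumption about the real rule.

```typescript
interface Quota {
  remaining: number;
}

async function runWithQuota<T>(quota: Quota, generate: () => Promise<T[]>): Promise<T[]> {
  if (quota.remaining <= 0) throw new Error("quota exhausted");
  quota.remaining -= 1; // decrement before any LLM call is made
  try {
    const tools = await generate();
    // Assumed rule: no tools from any page counts as total failure.
    if (tools.length === 0) quota.remaining += 1;
    return tools;
  } catch (err) {
    quota.remaining += 1; // refund on total failure
    throw err;
  }
}
```

Decrementing first means concurrent requests cannot both spend the last generation, at the cost of needing an explicit refund path.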

Costs and limits

Rough per-generation cost: 15-25 K input tokens across 5 pages, 2-6 K output tokens. On Gemini 2.5 Pro that's roughly $0.05-0.10 per generation. A Pro user at the 20 generations / 30 days quota costs us about $1-2 per month in raw LLM spend.
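The arithmetic behind these figures, as a small worked example. The per-token prices are assumptions (USD 1.25 per million input tokens and USD 10 per million output tokens for Gemini 2.5 Pro); check current pricing before relying on them.

```typescript
// Assumed list prices in USD per 1M tokens (not taken from the article).
const INPUT_USD_PER_M = 1.25;
const OUTPUT_USD_PER_M = 10;

function generationCost(inputTokens: number, outputTokens: number): number {
  return (inputTokens * INPUT_USD_PER_M + outputTokens * OUTPUT_USD_PER_M) / 1000000;
}

const lowCost = generationCost(15000, 2000);  // about 0.039 USD
const highCost = generationCost(25000, 6000); // about 0.091 USD

const GENERATIONS_PER_PERIOD = 20; // Pro quota per 30 days
const monthlyLow = lowCost * GENERATIONS_PER_PERIOD;   // about 0.78 USD
const monthlyHigh = highCost * GENERATIONS_PER_PERIOD; // about 1.83 USD

console.log(`per generation: $${lowCost.toFixed(3)} to $${highCost.toFixed(3)}`);
console.log(`per 30 days:    $${monthlyLow.toFixed(2)} to $${monthlyHigh.toFixed(2)}`);
```

Under these assumed prices, output tokens account for half or more of the cost despite being a small share of the token count.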

Free users get one lifetime generation on Gemini 2.5 Flash for under a cent. Locked tools beyond the top 2 are persisted, not discarded — they unlock automatically on upgrade without re-running the pipeline.
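One way to persist locked tools instead of discarding them is to flag them. This is a sketch only: the locked field, the plan values, and the signature of applyFreeTierFilter are assumptions, and the input is expected to be already ranked by importance.

```typescript
interface RankedTool {
  name: string;
  importance: number;
}

interface StoredTool extends RankedTool {
  locked: boolean;
}

// Every tool is kept; on the free plan only the top `freeLimit` are unlocked.
function applyFreeTierFilter(
  ranked: RankedTool[],
  plan: "free" | "pro",
  freeLimit = 2,
): StoredTool[] {
  return ranked.map((tool, index) => ({
    ...tool,
    locked: plan === "free" && index >= freeLimit,
  }));
}
```

Because nothing is dropped, unlocking on upgrade only requires clearing the flag, with no new crawl or LLM call.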

Try it on your site

Paste a URL. Get a ranked list of WebMCP tools, with input schemas and runnable executeJs, in under a minute.
