How does Aigentably's AI tool generation work?

Aigentably crawls up to 5 representative pages of your site (homepage plus product, cart, checkout, search, account pages), extracts structured signals (forms, buttons, frameworks, exposed JavaScript globals) with cheerio, and calls a language model (Gemini 2.5 Pro for Pro users, 2.5 Flash for Free) per page. Tools are deduplicated by name, ranked by importance, and stamped with the source URL so you can see which page produced which tool.

Why does Aigentably crawl multiple pages instead of just the homepage?

Different pages reveal different actions. A homepage rarely contains the addToCart button or the search filters; those live on product pages and category pages. Crawling 5 representative pages picks up the actual interactive surface of the site, producing higher-quality tool suggestions than analyzing the homepage alone.

Does the AI see my full page HTML?

No. Aigentably extracts a compact signal object per page: title, headings, forms with field names and types, buttons with data-* attributes, internal links, JSON-LD, detected framework (Shopify, Next.js, WordPress, etc.), and exposed window globals. The raw HTML is discarded. This keeps token costs low and forces the LLM to reason about structured affordances rather than copy markup.

How does Aigentably respect robots.txt?

The crawler reads /robots.txt and honors Disallow rules for both User-agent: * and User-agent: Aigentably-Bot. Disallowed paths are filtered out of the candidate page list. Sites returning 401 or 403 are reported as blocking automated access and skipped.

Deep Dive · June 24, 2026 · 8 min read

How Aigentably Generates WebMCP Tools from a URL

Paste a URL, get a ranked list of WebMCP tools with input schemas and runnable executeJs. Here's what the pipeline actually does between those two events.

The naive approach (and why it fails)

The obvious way to generate WebMCP tools from a URL is to fetch the page, hand the HTML to an LLM, and ask "what tools should this site expose?" We tried it. It produces underwhelming results for three reasons.

One page is not enough. Homepages are marketing surfaces. The interesting actions (addToCart, applyCoupon, filterByCategory, submitOrder) live on product, cart, and search pages. Generating from the homepage alone gives you generic navigation tools and not much else.

Raw HTML is wasteful. Modern e-commerce HTML is mostly markup overhead: tracking pixels, framework hydration data, CSS classes. Feeding 200 KB of HTML per page to a model burns tokens on noise.

Models drift toward made-up tools. Given prose-like input, LLMs invent plausible tools that don't map to anything real on the page. The result looks reasonable, but the executeJs code targets selectors that don't exist.

The pipeline

URL
 └─ planCrawl       (robots.txt + sitemap.xml + link fallback)
     └─ crawlSite   (fetch up to 5 pages, parallel)
         └─ extractSignals (cheerio → structured SiteSignals)
             └─ buildSignalsPrompt (per page)
                 └─ callLlm        (max 3 concurrent, JSON mode, 1 retry)
                     └─ parseLlmResponse
                         └─ dedupTools   (keep highest-importance per name)
                             └─ applyFreeTierFilter
                                 └─ persist → response

Five clear stages: discover pages, fetch them, turn each into a structured signal object, ask an LLM what tools fit, merge results. Everything is cacheable, retryable, and inspectable.

Stage 1: Page discovery

planCrawl picks the 5 pages most likely to expose interactive affordances. Candidates come from two sources, tried in order, and both are filtered through robots.txt:

/sitemap.xml and /sitemap_index.xml — recursively walked to flatten sitemap indexes.
Homepage <a href> extraction as fallback when no sitemap exists.
Robots.txt filtering on top of both — Disallow rules for * or Aigentably-Bot remove candidates before they're fetched.
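
The robots.txt step can be sketched roughly as follows. This is a minimal illustration, not the actual crawler code: the helper names disallowedPrefixes and filterCandidates are hypothetical, only plain path-prefix Disallow rules are handled (no wildcards, no Allow overrides), and rules for * and for the bot are combined, as the description above suggests.

```typescript
// Hypothetical helper: collect Disallow prefixes from every robots.txt group
// that names "*" or the given bot. Wildcards and Allow overrides are ignored.
function disallowedPrefixes(robotsTxt: string, botName: string): string[] {
  const prefixes: string[] = [];
  let groupAgents: string[] = [];
  let inRules = false; // true once the current group has started listing rules

  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, "").trim(); // strip comments
    const colon = line.indexOf(":");
    if (colon === -1) continue;
    const key = line.slice(0, colon).trim().toLowerCase();
    const value = line.slice(colon + 1).trim();

    if (key === "user-agent") {
      // A User-agent line after rules starts a new group.
      if (inRules) {
        groupAgents = [];
        inRules = false;
      }
      groupAgents.push(value.toLowerCase());
    } else if (key === "disallow" || key === "allow") {
      inRules = true;
      const applies =
        groupAgents.includes("*") || groupAgents.includes(botName.toLowerCase());
      if (key === "disallow" && applies && value !== "") prefixes.push(value);
    }
  }
  return prefixes;
}

// Hypothetical helper: drop candidate URLs whose path starts with a disallowed prefix.
function filterCandidates(urls: string[], prefixes: string[]): string[] {
  return urls.filter((u) => {
    const path = new URL(u).pathname;
    return !prefixes.some((p) => path.startsWith(p));
  });
}
```

Filtering before the fetch means a disallowed page never receives a request at all.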

From the candidate set we pick representatives by URL hint matching: paths containing product, cart, checkout, search, account, login, category, and similar. The homepage is always included as page one.

The output of this stage is a list of URLs with a reason tag ("homepage", "product-page", "cart", etc.) that flows through to the LLM prompt so the model knows what kind of page it's reasoning about.
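
Here is one way the representative selection could look. Treat it as a sketch under assumptions: the hint table, the pickRepresentatives name, and the "one page per type" rule are illustrative guesses, and only the path fragments and the example reason tags come from the description above.

```typescript
interface PlannedPage {
  url: string;
  reason: string; // tag passed through to the LLM prompt
}

// Assumed mapping from URL path fragments to reason tags.
const URL_HINTS: Array<[RegExp, string]> = [
  [/product/i, "product-page"],
  [/cart/i, "cart"],
  [/checkout/i, "checkout"],
  [/search/i, "search"],
  [/account|login/i, "account"],
  [/categor/i, "category"],
];

function pickRepresentatives(
  homepage: string,
  candidates: string[],
  maxPages = 5
): PlannedPage[] {
  // The homepage is always page one.
  const planned: PlannedPage[] = [{ url: homepage, reason: "homepage" }];
  const usedReasons = new Set<string>(["homepage"]);

  for (const url of candidates) {
    if (planned.length >= maxPages) break;
    if (url === homepage) continue;
    const path = new URL(url).pathname;
    const hit = URL_HINTS.find(([pattern]) => pattern.test(path));
    if (!hit) continue; // no hint matched: not an interesting page type
    const reason = hit[1];
    if (usedReasons.has(reason)) continue; // assumption: one page per type
    usedReasons.add(reason);
    planned.push({ url, reason });
  }
  return planned;
}
```

With this rule, a second product URL is skipped in favor of a cart or search page, which keeps the five slots spread across different kinds of interaction.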

Stage 2: Signal extraction

Each page goes through extractSignals, a cheerio-based extractor that produces a compact JSON object. Concrete shape:

Field	Why it matters
forms	action + method + required field names. LLM proposes submit-style tools.
buttons	Text + id + data-* attributes. Drives click-style tools.
jsonLd	Product, Offer, BreadcrumbList. Names entities the agent should target.
framework	Shopify / Next / WordPress detection. Steers executeJs toward the right APIs.
globalApis	Exposed window.* objects (Shopify, __NEXT_DATA__). Lets LLM prefer JS calls over DOM scraping.
internalLinks	Hints at site IA for navigation tools.

The whole signal object for a typical page is 2-4 KB. Compare with 200 KB of raw HTML. Lower cost, less noise, and the model is forced to ground its suggestions in identified DOM elements instead of free-associating from markup.
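
To make the shape concrete, here is an assumed TypeScript rendering of the signal object. The field names follow the table above, but the nested types and the sample values are illustrative; the real SiteSignals type may differ.

```typescript
// Assumed shapes, inferred from the field table; not the actual type definitions.
interface FormField {
  name: string;
  type: string;
  required: boolean;
}

interface FormSignal {
  action: string;
  method: string;
  fields: FormField[];
}

interface ButtonSignal {
  text: string;
  id?: string;
  data: Record<string, string>; // data-* attributes, without the "data-" prefix
}

interface SiteSignals {
  url: string;
  title: string;
  headings: string[];
  forms: FormSignal[];
  buttons: ButtonSignal[];
  jsonLd: Array<Record<string, unknown>>;
  framework: string | null;
  globalApis: string[];
  internalLinks: string[];
}

// Made-up sample for a product page.
const example: SiteSignals = {
  url: "https://shop.example/products/widget",
  title: "Widget",
  headings: ["Widget", "Customer reviews"],
  forms: [
    {
      action: "/cart/add",
      method: "post",
      fields: [{ name: "quantity", type: "number", required: true }],
    },
  ],
  buttons: [{ text: "Add to cart", id: "add-to-cart", data: { "product-id": "42" } }],
  jsonLd: [{ "@type": "Product", name: "Widget" }],
  framework: "Shopify",
  globalApis: ["Shopify"],
  internalLinks: ["/cart", "/collections/all"],
};

const serialized = JSON.stringify(example);
console.log(`${serialized.length} characters of signals`);
```

Everything in the object is something a tool can act on: a form to submit, a button to click, or a global to call.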

Stage 3: Per-page LLM call

Each page's signals get their own LLM call. Up to 3 calls run in parallel via Promise.allSettled. The prompt explicitly instructs the model to:

Prefer exposed globals (Shopify.*, window.__NEXT_DATA__) over DOM scraping.
Click existing buttons via querySelector rather than synthesizing new HTML.
Never navigate away via window.location unless the tool is explicitly a navigation tool.
Return JSON only, no markdown fences — enforced with response_format: json_object.

If the first response fails strict JSON validation we retry once with a hardened reminder prompt. After that, the page is allowed to contribute zero tools without failing the whole generation.
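
The concurrency cap, the single retry, and the "zero tools is fine" rule can be sketched together. This is an approximation under stated assumptions: toolsForPage, generateTools, and parseTools are hypothetical names, the LLM is passed in as a plain function, batching stands in for whatever limiter the real pipeline uses, and the retry here fires on any failure rather than only on JSON validation.

```typescript
interface PageSignals {
  url: string;
  reason: string;
}

interface Tool {
  name: string;
  importance: number;
  sourceUrl: string;
}

type LlmCall = (prompt: string) => Promise<string>;

// Strict parse: throws unless the response is a JSON object with a "tools" array.
function parseTools(raw: string, sourceUrl: string): Tool[] {
  const parsed: unknown = JSON.parse(raw);
  if (typeof parsed !== "object" || parsed === null) throw new Error("not an object");
  const tools = (parsed as { tools?: unknown }).tools;
  if (!Array.isArray(tools)) throw new Error("tools array missing");
  return tools.map((t) => ({
    name: String(t.name),
    importance: Number(t.importance),
    sourceUrl,
  }));
}

async function toolsForPage(page: PageSignals, callLlm: LlmCall): Promise<Tool[]> {
  const prompt = `Page type: ${page.reason}\nSignals: ${JSON.stringify(page)}`;
  try {
    return parseTools(await callLlm(prompt), page.url);
  } catch {
    // One retry with a stricter reminder; a second failure propagates to the caller.
    const retry = `${prompt}\nReturn a single JSON object only, no markdown fences.`;
    return parseTools(await callLlm(retry), page.url);
  }
}

async function generateTools(
  pages: PageSignals[],
  callLlm: LlmCall,
  maxConcurrent = 3
): Promise<Tool[]> {
  const tools: Tool[] = [];
  // Process pages in batches so at most maxConcurrent calls are in flight.
  for (let i = 0; i < pages.length; i += maxConcurrent) {
    const batch = pages.slice(i, i + maxConcurrent);
    const settled = await Promise.allSettled(batch.map((p) => toolsForPage(p, callLlm)));
    for (const result of settled) {
      // A rejected page contributes zero tools; the generation continues.
      if (result.status === "fulfilled") tools.push(...result.value);
    }
  }
  return tools;
}
```

Promise.allSettled matters here because Promise.all would reject the whole generation as soon as one page failed twice.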

Free users hit Gemini 2.5 Flash. Pro users hit Gemini 2.5 Pro. The pipeline is otherwise identical — Pro pays for higher-quality reasoning over the same signals.

Stage 4: Dedup and ranking

Five pages can produce a lot of overlapping tool names. addToCart on the product detail page and addToCart on a quick-add widget collapse into one. The dedup pass keeps the highest-importance variant per name and caps the final set at 12 tools sorted by importance.

Every tool keeps a sourceUrl stamp. The dashboard shows it as from /products/widget next to each suggestion so you can audit which page produced which tool.
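
A minimal sketch of the dedup pass, assuming each tool carries a numeric importance score (the Tool shape here is illustrative). When two variants share a name and a score, this version keeps the first one seen, which is an arbitrary choice.

```typescript
interface Tool {
  name: string;
  importance: number;
  sourceUrl: string;
}

function dedupTools(tools: Tool[], maxTools = 12): Tool[] {
  const bestByName = new Map<string, Tool>();
  for (const tool of tools) {
    const current = bestByName.get(tool.name);
    // Keep only the highest-importance variant for each tool name.
    if (!current || tool.importance > current.importance) {
      bestByName.set(tool.name, tool);
    }
  }
  // Highest importance first, capped at maxTools.
  return Array.from(bestByName.values())
    .sort((a, b) => b.importance - a.importance)
    .slice(0, maxTools);
}
```

Because the surviving variant keeps its own sourceUrl, the dashboard stamp always points at the page that produced the version you actually see.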

Stage 5: The 24h crawl cache

Crawling 5 pages on every regenerate would be wasteful and slow. We cache the crawl result in a SiteProfile table keyed by site, with a 24-hour TTL.

Inside the window: regenerate hits the LLM fresh on the cached signals. You can iterate on tool variants without re-fetching the site. Outside the window: a fresh crawl happens automatically. A Force refresh button bypasses the cache when you've actually changed the site.

The dashboard shows a cached badge on the crawl report when results came from cache, so there's no confusion about whether the LLM saw your latest changes.
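
The cache decision itself is small. The sketch below assumes a SiteProfile row with a crawledAt timestamp in milliseconds; the field names and the shouldRecrawl helper are hypothetical.

```typescript
// Assumed row shape for the SiteProfile table.
interface SiteProfile {
  siteId: string;
  crawledAt: number; // epoch milliseconds of the last crawl
  signals: unknown[]; // cached per-page signal objects
}

const CRAWL_TTL_MS = 24 * 60 * 60 * 1000; // 24 hours

function shouldRecrawl(
  profile: SiteProfile | undefined,
  now: number,
  forceRefresh = false
): boolean {
  // No cached profile, or the user asked for a fresh crawl.
  if (forceRefresh || !profile) return true;
  // Otherwise recrawl only once the 24-hour window has elapsed.
  return now - profile.crawledAt >= CRAWL_TTL_MS;
}
```

When shouldRecrawl returns false, generation can skip straight to the LLM stage with the stored signals, and the crawl report can be flagged as cached.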

What we explicitly avoid

Headless browsers. Static HTML is enough for 90% of e-commerce. SPAs that render entirely client-side are a known limitation and planned for a future Pro tier.
Auth-protected pages. The crawler never logs in. Public-only surface.
Full-text page content. Body text is summarized via headings and link anchors. We aren't doing semantic search; we're proposing actions.
Long-lived background jobs. One generation = one HTTP request. Quota is decremented before the LLM call and refunded on total failure.
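
The charge-then-refund rule from the last item might look like this. It is a sketch only: the Quota shape and runGeneration are hypothetical, and "total failure" is assumed to mean either a thrown error or zero tools across all pages.

```typescript
interface Quota {
  remaining: number;
}

async function runGeneration<T>(
  quota: Quota,
  generate: () => Promise<T[]>
): Promise<T[]> {
  if (quota.remaining <= 0) throw new Error("quota exhausted");
  quota.remaining -= 1; // charge before any LLM call is made
  try {
    const tools = await generate();
    // Assumption: zero tools from every page counts as total failure.
    if (tools.length === 0) quota.remaining += 1;
    return tools;
  } catch (err) {
    quota.remaining += 1; // refund when the generation throws
    throw err;
  }
}
```

Charging first prevents concurrent requests from overspending the quota, and the refund keeps a failed run from costing the user anything.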

Costs and limits

Rough per-generation cost: 15-25 K input tokens across 5 pages, 2-6 K output tokens. On Gemini 2.5 Pro that's roughly $0.05-0.10 per generation. A Pro user at the 20 generations / 30 days quota costs us about $1-2 per month in raw LLM spend.
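
The arithmetic behind that estimate can be reproduced in a few lines. The per-token prices below are assumptions used for illustration (USD per million tokens) and should be checked against current pricing; only the token ranges and the 20-generation quota come from the figures above.

```typescript
// ASSUMED prices in USD per million tokens; verify against current pricing.
const PRICE_PER_MILLION = { input: 1.25, output: 10 };

function generationCostUsd(inputTokens: number, outputTokens: number): number {
  return (
    (inputTokens * PRICE_PER_MILLION.input +
      outputTokens * PRICE_PER_MILLION.output) /
    1_000_000
  );
}

// Token ranges quoted above: 15-25 K input, 2-6 K output.
const lowCost = generationCostUsd(15_000, 2_000);
const highCost = generationCostUsd(25_000, 6_000);

// Monthly spend for a user who uses all 20 generations.
const monthlyLow = lowCost * 20;
const monthlyHigh = highCost * 20;

console.log({ lowCost, highCost, monthlyLow, monthlyHigh });
```

Under these assumed prices the per-generation range works out to roughly four to nine cents, in line with the estimate above, and output tokens account for most of the cost at the high end.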

Free users get one lifetime generation on Gemini 2.5 Flash for under a cent. Locked tools beyond the top 2 are persisted, not discarded — they unlock automatically on upgrade without re-running the pipeline.
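
One way to get "persisted, not discarded" is to mark tools as locked instead of dropping them. The sketch below assumes the input is already sorted by importance; the locked flag and the function signature are illustrative, and only the applyFreeTierFilter name and the top-2 limit come from the article.

```typescript
interface Tool {
  name: string;
  importance: number;
}

interface StoredTool extends Tool {
  locked: boolean; // locked tools are saved but hidden until upgrade
}

type Plan = "free" | "pro";

// Expects tools sorted by importance, highest first.
function applyFreeTierFilter(
  tools: Tool[],
  plan: Plan,
  freeLimit = 2
): StoredTool[] {
  return tools.map((tool, index) => ({
    ...tool,
    locked: plan === "free" && index >= freeLimit,
  }));
}
```

Since every tool is stored either way, an upgrade only needs to flip the locked flags; no new crawl or LLM call is required.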

Try it on your site

Paste a URL. Get a ranked list of WebMCP tools, with input schemas and runnable executeJs, in under a minute.
