GEO benchmark: who AI engines actually cite

Everyone is talking about ranking in AI answers. Almost nobody is measuring it. So we built a tool that asks the major LLMs the same questions, thousands of times, and records exactly which brands they name back.

The premise was simple. If a customer types "best WordPress SEO agency" into ChatGPT instead of Google, the only thing that matters is whether your name comes out the other side. That's not a keyword you can track in Search Console; it's a probability distribution living inside a model. To manage it, we first had to measure it.

Polling LLMs like you're running a survey

An LLM answer isn't a fact lookup; it's a sample. Ask the same question twice and you can get two different lists of brands. Treat one response as truth and you're reading tea leaves. So we stopped thinking like SEOs and started thinking like pollsters.

For every query we care about, we fire the same prompt 30 to 50 times at each model, then aggregate the answers. A brand mentioned in 4 out of 50 runs has an 8% "share of voice" for that question.
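
The aggregation itself is small enough to sketch. What follows is a minimal illustration under stated assumptions, not the benchmark's production code: `shareOfVoice` is a hypothetical helper, the brand names are made up, and each run is assumed to be already reduced to an array of brand names.

```javascript
// Hypothetical helper: share of voice = runs that mention a brand / total runs.
// Assumes each run is an array of brand names extracted from one answer.
function shareOfVoice(runs) {
  const counts = new Map();
  for (const brands of runs) {
    // Count each brand once per run, even if the answer repeats it.
    const seen = new Set(brands.map((b) => b.trim().toLowerCase()));
    for (const brand of seen) {
      if (brand === "") continue;
      counts.set(brand, (counts.get(brand) ?? 0) + 1);
    }
  }
  const shares = {};
  for (const [brand, count] of counts) {
    shares[brand] = count / runs.length;
  }
  return shares;
}

// Example: 50 runs, and the made-up brand "Acme SEO" appears in 4 of them.
const runs = Array.from({ length: 50 }, (_, i) =>
  i < 4 ? ["Acme SEO", "Other Agency"] : ["Other Agency"]
);
const shares = shareOfVoice(runs);
console.log(shares["acme seo"]); // 0.08
```

Keep the sample size in mind when reading the result. With 50 runs, an 8% share has a standard error of roughly 3.8 percentage points (the square root of 0.08 × 0.92 / 50), so small gaps between brands are mostly noise.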

The stack that made it work

We wanted one interface to call every provider. A thin SDK layer let us swap models with a single string and keep the polling logic identical across providers.

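// Three models shown here; the cost figures later in the article assume four.
const models = ["gpt-4o", "claude-sonnet", "gemini-pro"];

for (const q of queries) {
  for (const model of models) {
    // Same prompt and options for every provider.
    const runs = await pollMany(model, q.prompt, {
      n: 50, // samples per query
      schema: BrandList, // force structured output
    });
    // Store per-brand mention counts for this query and model.
    record(q.id, model, tallyBrands(runs));
  }
}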
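
The loop leans on helpers the article does not show. As a rough sketch of what `pollMany` and `tallyBrands` could look like, the example below uses a stubbed `callModel` (a hypothetical stand-in for a real SDK call) and a `concurrency` option that is an assumption of this sketch. It leaves out the `schema` option entirely.

```javascript
// Hypothetical stand-in for a real provider call; swap in the SDK of your choice.
// It returns a canned brand list so the sketch runs without network access.
async function callModel(model, prompt) {
  return { brands: ["Acme SEO", "Other Agency"] };
}

// Sketch of pollMany: send the same prompt n times, a few requests at a time.
async function pollMany(model, prompt, { n = 50, concurrency = 5, call = callModel } = {}) {
  const results = [];
  let next = 0;
  async function worker() {
    while (next < n) {
      next += 1; // claim a slot before awaiting so workers never exceed n calls
      try {
        results.push(await call(model, prompt));
      } catch (err) {
        // Keep failed runs visible instead of silently dropping them.
        results.push({ brands: [], error: String(err) });
      }
    }
  }
  await Promise.all(Array.from({ length: Math.min(concurrency, n) }, () => worker()));
  return results;
}

// Sketch of tallyBrands: the number of runs in which each brand appears.
function tallyBrands(runs) {
  const tally = {};
  for (const run of runs) {
    for (const brand of new Set(run.brands)) {
      tally[brand] = (tally[brand] ?? 0) + 1;
    }
  }
  return tally;
}

// Usage: with the stub above, every run names both made-up brands.
pollMany("gpt-4o", "best WordPress SEO agency", { n: 10 }).then((runs) => {
  console.log(tallyBrands(runs)); // each brand is counted 10 times
});
```

Capping concurrency is a design choice rather than a requirement. It keeps a 50-sample burst from tripping provider rate limits, at the cost of a longer wall-clock time per query.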

The accuracy problem of structured output

We needed every answer as a clean list of brand names, not prose. So we asked each model to return JSON. That solved parsing, and quietly introduced a new bias.

Structure your prompts for the answer you want to measure, not the answer that's easiest to parse.
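
One way to act on that advice, offered as an assumption rather than a description of the benchmark's actual pipeline, is to let the model answer in free-form prose and pull brand names out afterwards. The sketch below is a deliberately cheap parser: `extractBrands` and the `knownBrands` list are hypothetical, the brands are made up, and it can only find brands it already knows about.

```javascript
// Escape characters that have a special meaning inside a regular expression.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical cheap parser: find known brand names in a free-form answer.
// knownBrands is an assumed, hand-maintained list; unknown brands are missed.
function extractBrands(answer, knownBrands) {
  const found = [];
  for (const brand of knownBrands) {
    // Match the brand as a whole phrase, case-insensitively, so that
    // "Globex" does not match inside "Globexify".
    const pattern = new RegExp(
      `(^|[^\\p{L}\\p{N}])${escapeRegExp(brand)}(?![\\p{L}\\p{N}])`,
      "iu"
    );
    if (pattern.test(answer)) found.push(brand);
  }
  return found;
}

const answer =
  "For WordPress, many people recommend Acme SEO and Northwind Digital. Globexify is unrelated.";
const knownBrands = ["Acme SEO", "Northwind Digital", "Globex"];
console.log(extractBrands(answer, knownBrands)); // [ 'Acme SEO', 'Northwind Digital' ]
```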

The economics: tokens aren't free

Fifty samples across 2,000 queries and four models is 400,000 calls per full run.

2,000 commercial queries tracked
400K model calls per full run
~9h wall-clock per benchmark
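
The arithmetic behind those figures is worth spelling out. The snippet below uses only the numbers quoted in this article. The throughput figure is derived from them, not measured, and because the wall-clock time is given as "~9h" it should be read as approximate.

```javascript
// Figures quoted in the article.
const samplesPerQuery = 50;
const queryCount = 2000;
const modelCount = 4;
const wallClockHours = 9; // quoted as "~9h", so approximate

// Total calls in one full benchmark run.
const callsPerRun = samplesPerQuery * queryCount * modelCount;

// Average sustained request rate implied by the wall-clock time.
const callsPerSecond = callsPerRun / (wallClockHours * 3600);

console.log(callsPerRun); // 400000
console.log(callsPerSecond.toFixed(1)); // 12.3
```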

We cut the bill three ways: caching identical prompts within a run, dropping sample counts on stable queries, and reserving the expensive extraction pass for answers where the cheap parser disagreed with itself.
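
The second of those savings, dropping sample counts on stable queries, can be sketched as follows. The stability rule, the 0.05 tolerance, and the function names are all assumptions made for illustration. The article only says that samples per query range from 30 to 50.

```javascript
// Hypothetical rule: a query is "stable" when no brand's share of voice moved
// by more than `tolerance` between the two most recent runs.
function isStable(previousShares, latestShares, tolerance = 0.05) {
  const brands = new Set([
    ...Object.keys(previousShares),
    ...Object.keys(latestShares),
  ]);
  for (const brand of brands) {
    // A brand missing from one run counts as a 0% share in that run.
    const delta = Math.abs((latestShares[brand] ?? 0) - (previousShares[brand] ?? 0));
    if (delta > tolerance) return false;
  }
  return true;
}

// Stable queries get the low end of the 30-to-50 range, the rest the high end.
function pickSampleCount(previousShares, latestShares) {
  return isStable(previousShares, latestShares) ? 30 : 50;
}

console.log(pickSampleCount({ acme: 0.4 }, { acme: 0.42 })); // 30
console.log(pickSampleCount({ acme: 0.4 }, { acme: 0.1, globex: 0.3 })); // 50
```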

What this means for your site

Be quotable. Write pages a model can lift a sentence from without hedging: concrete claims, clear definitions, real numbers.
Earn third-party mentions. Models trust what other sources already say about you. Off-site citations are the new backlinks.
Measure share of voice, not rank. Track how often you're named for your money queries, and watch the trend.

AI search isn't a different game so much as the same game with the scoreboard hidden. Build the scoreboard, and the strategy gets obvious fast.

Lucie Marin
