16 Brands, 4 AI Engines, 3,837 Queries: A Study

AI Visibility Is Not One Problem. It's Four.

If you are optimizing your brand for AI visibility, you are probably treating "AI" as a single channel. That is a mistake.

We ran a large-scale study across ChatGPT (OpenAI), Claude (Anthropic), Gemini (Google), and Perplexity to measure how consistently each AI engine mentions the same brands when asked similar questions. The short answer: they don't agree. Not even close.

53% of all pairwise provider comparisons showed statistically significant differences in brand visibility. The average gap between a brand's best-performing provider and its worst was 22 percentage points.

This article walks through the full dataset: what we tested, what we found, and what it means for your generative engine optimization strategy.

Study Design: 3,837 API Calls, 16 Brands, 4 Providers

We tested 16 brands across four SaaS categories: CRM, Project Management, Email Marketing, and Design Tools. Each brand was tested with 12 distinct prompt types and 20 samples per provider, totaling 3,837 validated API calls. Every data point met our threshold for robust statistical validity (n>=100 per cell).
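As a quick sanity check on those counts, here is a minimal sketch of the arithmetic. It assumes that the 12 prompt types were issued once per category (not once per brand) and that each response was then scored for every brand in that category. That reading is an inference from the stated totals, not something the study design spells out.

```python
# Counts stated in the study design.
CATEGORIES = 4
PROMPT_TYPES = 12
SAMPLES_PER_PROVIDER = 20
PROVIDERS = 4
VALIDATED_CALLS = 3_837

# Assumption: prompts are issued per category, and every response is
# scored for each brand in that category.
planned_calls = CATEGORIES * PROMPT_TYPES * SAMPLES_PER_PROVIDER * PROVIDERS

# Responses available per brand-provider cell under the same assumption.
responses_per_cell = PROMPT_TYPES * SAMPLES_PER_PROVIDER

print(planned_calls)                    # 3840
print(planned_calls - VALIDATED_CALLS)  # 3
print(responses_per_cell)               # 240
```

Under this reading, 3 of 3,840 planned calls were dropped during validation, and each brand-provider cell holds roughly 240 responses, comfortably above the n>=100 threshold.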

The providers tested were OpenAI GPT-4.1-mini, Anthropic Claude Haiku, Google Gemini Flash Lite, and Perplexity Sonar. The brands covered include HubSpot, Salesforce, Zoho CRM, Pipedrive, Asana, Monday.com, ClickUp, Trello, Mailchimp, ConvertKit, Klaviyo, ActiveCampaign, Canva, Figma, Adobe Creative Cloud, and Sketch.

Info

Every model had live web search enabled. All four providers were configured with their respective web search or grounding tools turned on, so each model could access live web results when generating its answer. The differences you see in this study are not caused by some models having internet access and others not. They all had it. They still disagreed.

The Heatmap: A Patchwork, Not a Pattern

The first thing that jumps out when you look at the data is how uneven the landscape is. There is no uniform "AI visibility."

Figure: Heatmap of brand presence rates for 16 brands across ChatGPT, Claude, Gemini, and Perplexity. Darker cells indicate higher mention rates. The patchwork pattern tells the story: each provider has its own view of which brands matter.

At the aggregate level, the four providers cluster closer than you might expect:

Provider	Avg. Presence Rate
Google (Gemini)	71%
OpenAI (ChatGPT)	70%
Perplexity	67%
Anthropic (Claude)	63%

But those averages mask enormous brand-level variance. The story is not that one provider is "better." It is that each provider has its own pattern of which brands it surfaces and which it buries.

The Most Fragmented Brands

Some brands have wildly inconsistent visibility depending on which AI engine a user happens to query.

Figure: Bar chart of the top 10 most fragmented brands, ranked by the gap between their best and worst AI provider, with provider labels.

Here are the standout cases:

Adobe Creative Cloud: Google mentions it 44% of the time. Perplexity? 2%. That is a 42-percentage-point gap. Adobe is functionally invisible on one of the fastest-growing AI search platforms.
ConvertKit: OpenAI gives it a 54% presence rate. On Perplexity, Claude, and Gemini, it ranges from 12% to 16%. If your email marketing strategy relies on ConvertKit's brand showing up in AI answers, you are only reaching a fraction of users.
Klaviyo: The opposite pattern. Perplexity surfaces Klaviyo 67% of the time, while OpenAI only mentions it 32%. Same category, same prompts, completely different winners.
Canva: Strong everywhere, but not equally strong. Perplexity gives it 85% visibility versus Anthropic's 57%. A 28-point spread on a brand most people would assume "just shows up."
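The gap used to rank these brands is simply the best provider's presence rate minus the worst. A minimal sketch, using only the percentages quoted in the list above (the function and variable names are illustrative, not part of the study's tooling):

```python
# Best and worst presence rates (percent) as quoted in the list above.
# For ConvertKit, the worst value is the low end of the quoted 12-16% range.
quoted_rates = {
    "Adobe Creative Cloud": {"best": 44, "worst": 2},
    "ConvertKit": {"best": 54, "worst": 12},
    "Klaviyo": {"best": 67, "worst": 32},
    "Canva": {"best": 85, "worst": 57},
}


def provider_gap(best: int, worst: int) -> int:
    """Gap in percentage points between the best and worst provider."""
    return best - worst


gaps = {brand: provider_gap(**r) for brand, r in quoted_rates.items()}

# Rank brands from most to least fragmented.
for brand, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{brand}: {gap}pp")
```

Note that a gap in percentage points says nothing about relative reach: Adobe's 42-point gap runs from 44% to 2%, while Canva's 28-point gap still leaves it above 50% everywhere.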

Do Providers Even Agree With Each Other?

We measured how often any two providers agree on a brand's visibility within 10 percentage points. The results show just how fragmented the landscape is.

Figure: Provider agreement matrix (heatmap). What percentage of brands do any two providers agree on (within 10pp)? Lower numbers mean more fragmentation between those two engines.

This is why tracking your brand on a single AI provider gives you a false sense of confidence. Two providers looking at the same brand, asked the same questions, can give you completely different pictures.
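To make the agreement metric concrete, here is a rough sketch of how a pairwise agreement matrix could be computed. The presence rates below are placeholder values for illustration, not figures from the study, and treating a difference of exactly 10 points as agreement is an assumption.

```python
from itertools import combinations

# Illustrative presence rates (percent) per brand and provider.
# Placeholder values only, NOT data from the study.
rates = {
    "brand_a": {"openai": 80, "anthropic": 75, "google": 60, "perplexity": 90},
    "brand_b": {"openai": 30, "anthropic": 35, "google": 38, "perplexity": 10},
    "brand_c": {"openai": 55, "anthropic": 20, "google": 50, "perplexity": 52},
}


def agreement_matrix(rates, tolerance_pp=10):
    """Share of brands on which each provider pair agrees within tolerance_pp."""
    providers = sorted(next(iter(rates.values())))
    result = {}
    for a, b in combinations(providers, 2):
        # Assumption: a difference of exactly tolerance_pp counts as agreement.
        agree = sum(abs(r[a] - r[b]) <= tolerance_pp for r in rates.values())
        result[(a, b)] = agree / len(rates)
    return result


matrix = agreement_matrix(rates)
for pair, share in matrix.items():
    print(pair, f"{share:.0%}")
```

With four providers there are six pairs to compare, which is why a single-provider dashboard hides most of the picture.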

The HubSpot Exception (and Why It's Rare)

HubSpot was the closest thing to a consensus pick in our dataset. Perplexity mentions it 99% of the time, and even its lowest score (86% on OpenAI) is dominant.

But HubSpot is the exception, not the rule.

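Figure: Per-brand dot plots with 95% Wilson Score confidence intervals. Non-overlapping intervals indicate a statistically meaningful difference between providers. Overlapping intervals are inconclusive on their own; the pairwise Newcombe intervals described in the Methodology are the formal check.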
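For readers who want to reproduce intervals like these, the following is a minimal sketch of the standard Wilson score interval for a proportion. The function name and the example counts are illustrative assumptions, not the study's actual code or data.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964):
    """Wilson score interval for a binomial proportion (default: 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point error at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical cell: a brand mentioned in 50 of 100 responses.
low, high = wilson_interval(50, 100)
print(f"{low:.3f} to {high:.3f}")
```

Unlike the simple normal approximation, this interval never extends below 0 or above 1, which matters for brands that sit near 2% or 99% presence.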

The brands where providers agreed the most tended to be clear category leaders with massive web footprints. The brands where providers disagreed the most tended to be mid-market or niche players, exactly the brands that would benefit most from understanding these dynamics.

Prompt Type Matters More Than You Think

We tested 12 prompt types ranging from direct recommendation requests ("What's the best CRM?") to comparison prompts ("Compare X and Y") to switching prompts ("I'm switching from X, what should I try?").

Switching prompts produced the most fragmentation across providers. When a user asks an AI about switching away from a product, the variation in which alternatives get surfaced is extreme.

This matters because switching intent is high-value intent. These are users actively looking for a new solution. If your brand shows up on one AI engine but not another for these queries, you are leaving money on the table.

Fragmentation Varies by Category

Figure: Box plot of fragmentation distribution by SaaS category. Higher values mean more disagreement between providers on which brands to surface.

The more competitive the category and the less dominant any single player, the more the AI engines diverge in their recommendations. Categories with clear market leaders (like CRM with HubSpot and Salesforce) showed less fragmentation than categories where multiple tools compete on more equal footing.

Five Takeaways for Your GEO Strategy

This data points to a set of concrete actions for anyone working on AI brand visibility:

1. Monitor per-provider, not in aggregate. A brand visibility score averaged across AI engines is misleading. You need to know your presence rate on each provider individually. A brand that looks healthy at 50% average might be at 80% on one provider and 20% on another.

2. Identify your weak providers. If you are invisible on a specific AI engine, that is a specific problem with a specific cause, likely related to how that provider retrieves and weighs source material. Fixing it requires provider-specific tactics, not generic "AI SEO."

3. Prioritize switching and comparison prompts. These are the highest-intent queries and also the most fragmented. If you can win these prompts across providers, you are capturing the most valuable AI-referred traffic.

4. Mid-market brands have the most to gain. Category leaders tend to show up everywhere regardless. If you are a ConvertKit- or Klaviyo-sized brand, the upside of provider-specific optimization is enormous. You could double or triple your presence on your weakest providers.

5. Revisit your assumptions regularly. AI models get updated. Retrieval pipelines change. What is true today may shift in 90 days. This is not a set-and-forget channel.
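The first two takeaways can be sketched as a simple check: compute the average, then flag any provider that sits well below it. This is an illustrative sketch only; the 15-point threshold and the provider names are assumptions, and the 80/20 split is the hypothetical example from takeaway 1.

```python
from statistics import mean


def weak_providers(rates, gap_pp=15):
    """Return the average rate and the providers at least gap_pp below it."""
    avg = mean(rates.values())
    weak = [name for name, rate in sorted(rates.items()) if avg - rate >= gap_pp]
    return avg, weak


# Hypothetical brand from takeaway 1: healthy on average, invisible on one engine.
rates = {"provider_a": 80, "provider_b": 20}
avg, weak = weak_providers(rates)
print(avg, weak)  # 50 ['provider_b']
```

Re-running a check like this on a schedule is one way to act on takeaway 5, since the list of weak providers can change when models or retrieval pipelines are updated.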

Tip

The first step to fixing a fragmentation problem is measuring it. You cannot optimize what you cannot see, and you cannot see provider-level differences without per-provider tracking.

Methodology

We queried 4 LLM providers (OpenAI GPT-4.1-mini, Anthropic Claude Haiku, Google Gemini Flash Lite, Perplexity Sonar) across 16 brands in 4 SaaS categories. Each brand was tested with 12 prompt templates × 20 samples per provider, yielding 3,837 validated API calls. All providers were queried with web search enabled.

Statistical rigor was maintained throughout:

Confidence intervals: Wilson Score intervals at 95% confidence, chosen for their superior performance with proportion data near 0 or 1.
Pairwise comparisons: Newcombe intervals for the difference between two proportions, used to determine whether provider differences are statistically meaningful.
Effect sizes: Cohen's h to quantify the practical magnitude of differences beyond simple statistical significance.
Sample validity: All cells achieved "robust" validity status (n>=100), meaning margins of error are tight enough for reliable inference.
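The pairwise comparison and effect-size steps listed above can be sketched as follows, using the standard Newcombe hybrid score interval (built from two Wilson intervals) and Cohen's h. This is a rough illustration under assumed inputs: the function names and the example counts are hypothetical, not the study's code or raw data.

```python
import math

Z_95 = 1.959964  # two-sided 95% normal quantile


def wilson(successes, n, z=Z_95):
    """Wilson score interval for one proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def newcombe_diff_interval(x1, n1, x2, n2, z=Z_95):
    """Newcombe hybrid score interval for the difference p1 - p2."""
    p1, p2 = x1 / n1, x2 / n2
    l1, u1 = wilson(x1, n1, z)
    l2, u2 = wilson(x2, n2, z)
    diff = p1 - p2
    lower = diff - math.sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2)
    upper = diff + math.sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2)
    return lower, upper


def cohens_h(p1, p2):
    """Cohen's h effect size for two proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


# Hypothetical counts: 106 of 240 mentions on one provider, 5 of 240 on another.
lower, upper = newcombe_diff_interval(106, 240, 5, 240)
significant = lower > 0 or upper < 0  # interval excludes zero
print(significant, round(cohens_h(106 / 240, 5 / 240), 2))
```

A difference is treated as statistically meaningful when the Newcombe interval excludes zero, while Cohen's h indicates whether that difference is large enough to matter in practice.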

Popsight tracks your brand's visibility across ChatGPT, Claude, Gemini, and Perplexity with per-provider breakdowns and statistical confidence intervals. See where you stand.


Data collected Q1 2026.