Where Digital Ad Agencies Get A/B Testing Wrong: Performance vs. Insight

Executive Summary

Marketers often lean on Meta and Google’s built-in A/B testing tools, but new research (Boegershausen, Cornil, Yi et al., IJRM, 2025) confirms these are not true randomized controlled trials. Algorithms interfere with delivery, meaning the people who see version A are systematically different from those who see version B.

That doesn’t mean platform testing is useless. The key is to separate pragmatic applications—where platform tests help you tune performance—from misuses that waste budget or mislead strategy.

  • Pragmatic A/B tests: Campaign types, objectives, settings, and structures where the goal is simply to give the algorithm clearer instructions.

  • Audience testing: Directional, not definitive. Useful for comparing responsiveness inside a platform and for shaping cross-channel targeting hypotheses.

  • Creative testing: Should not be used for micro-tweaks. Best applied to uncover submarket opportunities or validate whether a new message resonates at all.

For marketers who need rigorous, generalizable insights, stronger designs are required: placebo ads, holdout groups, or geo-split tests. These methods isolate true incremental lift and help you distinguish message effects from platform bias.

The bottom line: platform A/B tests aren’t “real” A/B tests, but they can still be valuable if you know what they’re good for—and when you need to reach for something stronger.

Academics Flag the Issue with Platform A/B Tests

On January 2, 2025, a landmark paper by Boegershausen, Cornil, Yi, and colleagues showed that what Meta and Google label as “A/B tests” are not truly randomized experiments. Published in the International Journal of Research in Marketing, it is the leading authority on the issue.

But here’s the problem: in the real world, most marketers don’t have the luxury of running pristine lab experiments. Stakeholders have demands, budgets are finite, timelines are tight, and the platforms control how impressions are delivered. That leaves us with a tension: how do you respect academic rigor while still extracting useful insights in the messy business reality of digital advertising?

Why Platform “A/B Tests” Aren’t Real A/B Tests

On the surface, platforms split an eligible audience into A and B. But then randomization breaks down:

  • Algorithm interference – Each cell’s algorithm selects a different subset of the audience based on predicted responsiveness. In other words, the people seeing ad A are not the same type of people as those seeing ad B.

  • Engagement bias – Ads accumulate different visible engagement counts (likes, shares, reactions). This can shape how future users perceive and respond to the ad.

That means platform “A/B tests” don’t hold up as randomized controlled trials.

Pragmatic A/B Tests: How Digital Agencies Should Use Them to Maximize Performance

Still, platform tests aren’t useless. There’s a category of what I’ll call Pragmatic A/B tests: tests that aren’t academically rigorous but don’t need to be, because they’re in pursuit of the best possible performance rather than insight for broader application.

Examples include:

  • Campaign type (Advantage+ vs. traditional)

  • Platform objective (Awareness, Traffic, Conversions)

  • Campaign settings (inventory sources, locations, optimization signals)

  • Campaign structure (consolidated vs. fragmented ad groups)

These aren’t about making broad causal claims. They’re about giving the algorithm the clearest instructions to improve campaign performance.

Audience Testing: An Opportunity for Insight

Audience testing isn’t a true A/B test: the premise of an A/B test is to randomly split a single audience and give each half a different treatment, whereas here you’re giving the same treatment to different audiences. But it can still add directional value. What you get is a read on relative responsiveness across audience definitions, which is useful both inside the channel you’re testing and for informing decisions in other channels.

The A/B functionality is still worth using because people can qualify for both audiences you’re testing. The platform creates a firewall, ensuring each person only receives ads from one audience, which prevents contamination and enables cleaner directional comparisons, even if it’s not a randomized experiment.

The result can tell you which audiences respond better, and therefore inform whether to add new audiences to your platform campaign or adjust targeting across other channels. Just be careful not to treat the insights as absolutes.

Creative Testing: Where Digital and Paid Social Agencies Unknowingly Mislead Marketers

This is where academics are most critical, and they’re right to be. If you hand two different creatives to the platform, the algorithm will intentionally serve them to different subsets of people. That means differences in performance can’t be attributed solely to the creative; they could be driven entirely by the audience subset. Taking the learnings from a creative test and applying them broadly is risky, because there’s no guarantee that the performance difference is a function of the creative. If an agency is using these insights to give direction on harder-to-measure channels like TV, proceed with caution!

Still, there are two cases where creative testing can add real value.

  1. Submarket message resonance – Platforms may suppress a niche but promising message in favor of a broadly stronger one. By structuring delivery so each message gets airtime, you can uncover whether a submarket exists that resonates with a different proposition.

  2. Validating message resonance – The question here is: will this different message work in the market? An A/B structure guards against existing dominant messages sucking up all the dollars. Even though the test isn’t perfectly randomized, a strong positive response tells you the message is worth further investment.

Some agencies oversell the value of micro-tests on headlines, CTAs, button colors, or near-identical images. Don’t buy it. Platforms like Meta and Google Ads handle this rotation automatically. Manual testing of such tweaks rarely moves the needle and carries a significant opportunity cost: time and budget that could have gone to larger, more meaningful tests.

When evaluating creative tests, read the analytics thoughtfully: response rate matters, but so does the volume of responders. Because each variant is delivered to a different subset of people, you need to ask whether there’s enough scale behind the response to make the idea worth pursuing, as the sketch below illustrates.
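
To make that concrete, here is a minimal Python sketch with made-up numbers (the function name and figures are purely illustrative, not a platform API): it reports each variant’s response rate alongside its responder volume, plus a rough pooled z-score, so a high rate on a tiny delivery subset isn’t mistaken for a scalable winner.

```python
import math

def compare_variants(conv_a, impressions_a, conv_b, impressions_b):
    """Compare two creative variants on response rate AND responder volume.

    Inputs are hypothetical platform-reported conversions and impressions.
    Because delivery is algorithmic, treat the output as directional,
    not as proof of a causal creative effect.
    """
    rate_a = conv_a / impressions_a
    rate_b = conv_b / impressions_b

    # Pooled two-proportion z-score: a rough check that the rate gap
    # isn't just noise at this volume of responders.
    pooled = (conv_a + conv_b) / (impressions_a + impressions_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / impressions_a + 1 / impressions_b))
    z = (rate_a - rate_b) / se if se > 0 else 0.0

    return {
        "rate_a": rate_a,
        "rate_b": rate_b,
        "responders_a": conv_a,   # scale matters, not just rate
        "responders_b": conv_b,
        "z_score": z,             # |z| above ~1.96 suggests the gap isn't noise
    }

# Hypothetical numbers: variant B wins on rate but has far fewer responders.
print(compare_variants(conv_a=420, impressions_a=60_000,
                       conv_b=95, impressions_b=9_000))
```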

How Not to Use Creative A/B Tests

  • Don’t treat them as proof of causality.

  • Don’t waste cycles on micro-tweaks.

  • Don’t assume learnings translate broadly without further validation.

That’s why, if you want message insights you can confidently act on, you’ll need more rigorous structures.

Structuring Tests That Provide Stronger Signal

When you want insights that go beyond “pragmatic” optimization, you need designs that hold up more rigorously:

  • Placebo (PSA) ads – Both groups see ads, but one sees your campaign and the other sees a neutral public service ad. This isolates the effect of your message.

  • Holdout groups – A randomly assigned slice sees no ads at all, letting you measure the true incremental lift of advertising.

  • Geo-split tests – Randomize at the market level. One region gets the campaign, another is excluded (holdout) or served neutral ads (placebo). Geo is simply a framework for assignment; it can implement either placebo or holdout logic (a minimal sketch follows this list).
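
To illustrate that geo is just an assignment framework, here is a minimal Python sketch with hypothetical market names and numbers: markets are randomized into treatment and holdout groups, and incremental lift is then read as the per-capita conversion gap over the holdout baseline. The same assignment logic works for a placebo design; only what the holdout markets are served changes.

```python
import random

def design_geo_split(markets, seed=42):
    """Randomly assign markets to a treatment group (gets the campaign)
    and a holdout group (gets nothing, or a neutral placebo/PSA ad)."""
    rng = random.Random(seed)
    shuffled = markets[:]
    rng.shuffle(shuffled)
    midpoint = len(shuffled) // 2
    return shuffled[:midpoint], shuffled[midpoint:]   # (treatment, holdout)

def incremental_lift(treatment_conv_per_capita, holdout_conv_per_capita):
    """Incremental lift: extra conversions in exposed markets, relative
    to the unexposed (or placebo-exposed) baseline."""
    return (treatment_conv_per_capita - holdout_conv_per_capita) / holdout_conv_per_capita

# Hypothetical markets and per-capita conversion rates.
markets = ["Denver", "Austin", "Portland", "Columbus", "Raleigh", "Tampa"]
treatment, holdout = design_geo_split(markets)
print("Treatment:", treatment, "| Holdout:", holdout)
print(f"Incremental lift: {incremental_lift(0.023, 0.019):.1%}")  # ~21.1%
```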

The Bottom Line

Platform A/B tests aren’t “real” A/B tests, as the January 2025 research makes clear. But dismissing them outright ignores the reality marketers face every day.

The path forward is threading the needle:

  • Use pragmatic A/B tests to optimize platform mechanics.

  • Treat audience tests as directional, not definitive.

  • Avoid wasting cycles on tiny creative tweaks.

  • Run submarket discovery and resonance validation when creative insights matter.

  • For true causal answers, rely on placebo ads, holdouts, or geo-splits.

Academic rigor matters. So does business reality. The real skill is knowing which to lean on, and when.

The Anti-Agency Advantage

Most digital and paid social agencies either live in academia or worship at the altar of the algorithm. Few can balance academic rigor with the messy realities of business.

That’s where we live. At Perform Marketing Partners, we know the research, but we also know how to navigate the ecosystem amidst stakeholder demands, budgets, and timelines.

If you want a partner who can walk the line between insight and execution, let’s talk.
