Every fashion brand asking whether AI-generated product imagery actually converts better, the same, or worse than traditional photography eventually arrives at the same answer: run the test on your own store. Vendor case studies are cherry-picked, industry reports average across brands whose circumstances you cannot match, and the honest answer depends on your category, your audience, and your existing image quality. This is a practical guide to running an AI-versus-human photography A/B test on a Shopify fashion store: the test design, the metrics that matter, and the discipline that keeps the conclusion honest.
Why most AI-versus-human tests produce useless data
Most casual AI-versus-human tests fail because they confound too many variables: the test runs on a single product, the sample size is too small, the window spans a sale weekend when conversion is artificially high, the variant images differ in more than one dimension, or the carousel order differs between variants. Any one of these invalidates the result. The brand looks at the data, sees a five percent difference, decides AI “works” or “does not work,” and ships a strategy built on noise.
The discipline of a clean A/B test is not difficult, but it does require commitment. Run on at least five products simultaneously to control for product-specific variance. Run for at least two full weeks to span weekday and weekend traffic. Run with all carousel positions identical except the single image being tested. Run on traffic segments large enough to power the comparison. A clean test gives you a defensible answer; a sloppy test gives you a number you do not trust.
Three test designs that actually work
The first design is the main image swap: same product, identical PDP except the lead image. Variant A is the brand's existing main image (typically a traditional shoot or a stock supplier image). Variant B is the AI-generated main image. Measure add-to-cart and conversion on the same traffic segment. This isolates the impact of the image type cleanly.
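Most brands run the split through a Shopify A/B testing app, but the assignment logic underneath is worth understanding, because it is where sloppy tests leak bias. A minimal sketch in Python, assuming a hypothetical `assign_variant` helper keyed on a stable visitor ID rather than the session ID, so a returning shopper sees the same variant for the whole test window; the same logic ports to whatever layer actually serves the image:

```python
import hashlib

def assign_variant(visitor_id: str, test_name: str = "main-image-swap") -> str:
    """Deterministically bucket a visitor into variant A or B.

    Keying on a stable visitor ID keeps the same shopper in the same
    variant across return visits during the two-week window.
    """
    digest = hashlib.sha256(f"{test_name}:{visitor_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 100 < 50 else "B"

# The assignment is stable: the same visitor always lands in the same arm,
# and changing test_name reshuffles the buckets for the next test.
print(assign_variant("visitor-12345"))
```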
The second design is the carousel-completeness test: same product, same main image, but variant A ships with two carousel images and variant B ships with five AI-generated carousel images. This isolates the impact of catalog completeness, which is usually the larger lever for most brands: the question is not whether AI is “better” but whether the brand can finally afford to ship a complete carousel.
The third design is the ad-creative test: identical Meta or TikTok ad placements, identical copy, different creative format. Variant A is studio creative, variant B is AI lifestyle creative from a creator photo set. Measure CPM, CTR, and post-click conversion. This isolates the impact at the channel level rather than on the PDP, and the answer often differs between PDP and ad-creative contexts.
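For the channel-level read, it helps to be precise about the metric definitions. A minimal sketch with a hypothetical `ad_metrics` helper; the numbers in the usage example are illustrative, not benchmark data:

```python
def ad_metrics(spend: float, impressions: int, clicks: int, orders: int) -> dict:
    """Channel-level metrics for one ad-creative variant."""
    return {
        "cpm": spend / impressions * 1000,   # cost per thousand impressions
        "ctr": clicks / impressions,         # click-through rate
        "post_click_cvr": orders / clicks,   # post-click conversion
        "cpa": spend / orders,               # cost per acquisition
    }

# Illustrative numbers only; compare the two variants side by side.
studio = ad_metrics(spend=500.0, impressions=120_000, clicks=1_800, orders=36)
ai_lifestyle = ad_metrics(spend=500.0, impressions=120_000, clicks=2_100, orders=40)
print(studio, ai_lifestyle, sep="\n")
```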
How Apiway fits into the test design
For brands running the test specifically against Apiway, the framing matters because Apiway is a real-anchor workflow rather than a pure-AI workflow. Variant B in your test should be the actual content type Apiway produces: on-model imagery from the creator marketplace with the brand's garment fitted onto the model. This is a different comparison from “is fully synthetic AI better than my traditional photoshoot,” and it carries different expected outcomes.
The categories where Apiway has historically shown the largest A/B lift versus baseline are the ones where the baseline is supplier-stock or single-shot catalog imagery rather than full real-photoshoot creative. For brands already running cinematic editorial photography, the lift is smaller and the value of AI is more about volume and cadence than about per-image conversion. Both are valid commercial outcomes; the test reveals which one matters most for your specific stage and category.
The metrics that actually matter
Most A/B-test dashboards highlight conversion rate as the single number, which is too narrow for fashion content evaluation. The full set of metrics that matter: session duration on the PDP (longer sessions usually correlate with lift), add-to-cart rate (the upstream signal), checkout completion (the downstream signal), revenue per visitor (the bottom-line signal), and return rate post-purchase (the long-tail signal that AI imagery can sometimes worsen if the image misrepresents fit).
For ad-creative tests, add cost-per-acquisition on the channel layer and creative fatigue rate over the test window. AI lifestyle imagery sourced from creator sets often fatigues slower than studio creative because the visual variety is higher per dollar of production cost.
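A minimal sketch of the per-variant rollup, assuming a hypothetical session-level export where each row is one PDP session tagged with the variant it saw; the column names are placeholders for whatever your analytics export actually produces:

```python
import pandas as pd

# Hypothetical session-level export: one row per PDP session, tagged with
# the variant it saw. Column names are placeholders.
sessions = pd.DataFrame({
    "variant": ["A", "A", "B", "B"],
    "duration_s": [42, 18, 65, 30],
    "added_to_cart": [0, 1, 1, 0],
    "completed_checkout": [0, 1, 1, 0],
    "revenue": [0.0, 58.0, 74.0, 0.0],
    "returned": [0, 0, 1, 0],
})

per_variant = sessions.groupby("variant").agg(
    sessions=("variant", "size"),
    avg_duration_s=("duration_s", "mean"),         # PDP session duration
    atc_rate=("added_to_cart", "mean"),            # upstream signal
    checkout_rate=("completed_checkout", "mean"),  # downstream signal
    revenue_per_visitor=("revenue", "mean"),       # bottom-line signal
    return_rate=("returned", "mean"),              # long-tail signal (best read against orders)
)
print(per_variant)
```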
Sample size and traffic discipline
Most fashion stores running this test underpower it badly. The required sample depends heavily on the baseline rate of the metric under test: roughly 5,000 sessions per variant is enough to detect a 10 percent relative lift on a high-baseline metric such as add-to-cart, while a low-baseline metric such as purchase conversion needs an order of magnitude more. Most stores have only one or two products that hit even the lower threshold inside a two-week window. Run the test on those products specifically; do not spread it across low-traffic SKUs and then read the results as if they were powered. The sketch below shows the arithmetic.
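A minimal sketch of the power calculation, using the standard normal-approximation formula for comparing two proportions; `sessions_per_variant` is a hypothetical helper, and the baseline rates in the example are illustrative:

```python
from statistics import NormalDist

def sessions_per_variant(baseline: float, relative_lift: float,
                         alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-arm sample size for a two-proportion test, normal approximation."""
    p1, p2 = baseline, baseline * (1 + relative_lift)
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return int(z ** 2 * variance / (p2 - p1) ** 2) + 1

# A 10 percent relative lift is far cheaper to detect on a high-baseline metric:
print(sessions_per_variant(0.25, 0.10))   # add-to-cart at 25% baseline: ~4,900
print(sessions_per_variant(0.025, 0.10))  # purchase conversion at 2.5%: ~64,000
```

The point of running the numbers before launch is to know whether two weeks of your real traffic can answer the question at all.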
For ad-creative tests, the sample accrues faster — a single ad set can power the comparison in days at modest spend — but the same discipline applies to test duration: span weekday and weekend impressions and control for creative fatigue.
Reading results without fooling yourself
Three honesty rules for reading the data. First, if the difference is under five percent and the test is underpowered, the result is noise; do not ship a strategy on it (the significance check below makes this concrete). Second, if the variant wins on conversion but loses on return rate, the AI imagery is misrepresenting fit; investigate before scaling. Third, if the variant wins on a single product but loses on others in the same test, the result is product-specific rather than format-specific; treat it as category-level evidence rather than a verdict on AI photography overall.
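The first rule is easy to operationalize with a standard two-proportion z-test; a minimal sketch, with illustrative counts rather than real data:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_pvalue(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value for a difference in rates (pooled z-test)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# A five percent relative difference on 1,000 sessions per arm is nowhere
# near significance: this result is noise, not a verdict.
print(two_proportion_pvalue(conv_a=100, n_a=1000, conv_b=105, n_b=1000))  # ~0.71
```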
The brands that run this test cleanly typically end up with a nuanced answer: AI imagery wins clearly in some categories (lingerie, swimwear, niche apparel where their existing imagery was undersized) and breaks even in others (heritage categories where the existing photoshoot quality was already high). Both are useful answers and both inform catalog-production strategy.
What other published tests have found
Botika has published a useful guide to ecommerce image A/B testing worth reading alongside this one for an additional perspective on AI-versus-human tests at scale. Veeton has published case-study data on conversion lifts from AI imagery on EILEEN FISHER and other brand sites. Each external dataset is useful as triangulation but cannot replace the test on your own store, your own products, your own audience.
Run the test
The fastest way to run a clean test is to ship an AI variant on a single product against your existing baseline and let two weeks of real traffic answer the question. Sign up for a free Apiway account: 100 one-time credits, enough to ship a full PDP-and-carousel pack on one product. Run the variant on Shopify, set the traffic split fairly, and let the conversion data make the decision. Vendor opinions are noise; the data on your own store is the only signal that matters.
