gpt-image-2 — multi-scene & n-coherence test

Tests three independent capabilities of Azure gpt-image-2 via /v1/images/edits: (1) the n parameter for variants in one call, (2) multi-panel composite rendering with cross-panel identity hold, (3) the 8-frame coherence ceiling. Every call attaches the same single reference photo of one person.

Reference photo

female_asian headshot from the eval cast pool, attached as image[] on every call below.

Model: Azure gpt-image-2 /v1/images/edits
Quality: low
Input fidelity: high
Style: dreamscape (painted anime)
Identity lock: full match_exactly + do_not_match block (production)
Total cost: $0.536 (8 images @ low quality)
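
As a rough illustration, the settings above could be collected into the non-file form fields of one /edits request. This is a minimal sketch, not the actual test harness: the helper name build_edit_fields is hypothetical, and the field names (model, prompt, n, size, quality, input_fidelity) are assumed to follow the OpenAI-style image edit endpoint. The reference photo would be attached separately as the image[] file part, and the style and identity-lock blocks would live inside the prompt text.

```python
from typing import List, Tuple


def build_edit_fields(prompt: str, n: int = 1, size: str = "1024x1024") -> List[Tuple[str, str]]:
    """Return the non-file form fields for one /v1/images/edits request.

    Field names are assumptions modeled on the OpenAI-style edit endpoint;
    the reference photo (image[]) is a file part and is not included here.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return [
        ("model", "gpt-image-2"),
        ("prompt", prompt),
        ("n", str(n)),               # number of variants in one call
        ("size", size),
        ("quality", "low"),          # setting used throughout this test
        ("input_fidelity", "high"),  # setting used throughout this test
    ]


# Hypothetical prompt text; the real prompts are not shown in this report.
fields = build_edit_fields("morning coffee scene, dreamscape style", n=4)
```

Keeping the fields in one helper means the only thing that changes between the modes below is the prompt, n, and size.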

The three findings

Azure /edits accepts n=4. Four variants returned in 41.3 s total, only 2.3 s slower than the n=1 call. That is about 74% less wall-clock than four sequential n=1 calls (4 × 39.0 s = 156 s).
2×2 composite preserves identity across all 4 in-image panels. Face, hair, age stable; outfits varied appropriately by time-of-day cue.
2×4 storyboard (8 panels) also preserves identity at both 1024² and 1536×1024. The "8-frame coherence" ceiling holds: identity stays consistent across all eight frames. The landscape (1536×1024) canvas gives visibly more detail per panel.

Mode A — baseline

n=1 · single scene · $0.067 · 39.0 s wall

One scene, one image. Baseline cost and latency. Identity preserved from the single attached reference photo. Style register lands cleanly in dreamscape.

Mode B — n=4 variants in one call

n=4 · single scene · $0.268 · 41.3 s wall

Azure /edits supports n=4. Four independent samples of the same prompt come back from one HTTP call. Wall-clock went from 39.0 s (n=1) to 41.3 s (n=4), only +2.3 s. Cost still scales per image ($0.268 = 4 × $0.067), so batching saves latency, not dollars; the nearly flat latency suggests the samples are generated in parallel on the server side. Identity (face, hair, age) is stable across all 4. The outfit drifts between variants (striped vs floral pajamas, plain top) because do_not_match.clothing does not pin the outfit to one specific design. Useful as a low-cost re-roll without paying 4× the wall-clock.
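
The latency and cost figures above can be checked with a few lines of arithmetic. The sketch assumes that four separate n=1 calls would run back to back at the 39.0 s baseline each; the function name batch_saving is illustrative only.

```python
def batch_saving(single_call_s: float, batch_call_s: float, n: int) -> float:
    """Fraction of wall-clock saved by one n-image call vs n sequential single calls."""
    sequential_s = single_call_s * n  # assumes calls are issued one after another
    return 1.0 - batch_call_s / sequential_s


saving = batch_saving(single_call_s=39.0, batch_call_s=41.3, n=4)
per_image_cost = 0.268 / 4  # Mode B total divided by its four images

print(f"wall-clock saving: {saving:.1%}")        # 73.5%
print(f"cost per image: ${per_image_cost:.3f}")  # $0.067
```

If the separate calls were instead issued concurrently from the client, the latency gap would shrink, so the saving applies to a sequential re-roll flow.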

Mode C — 2×2 composite, one image

composite · 4 panels · n=1 · $0.067 · 48.3 s wall

A single 1024×1024 image with four panels of the same person: TL = morning coffee, TR = midday park walk, BL = afternoon reading, BR = evening at a balcony. Identity locks across all four panels: face shape, dark hair, skin tone, and apparent age are all consistent. Outfits vary appropriately by time-of-day cue (pajamas → jacket → cardigan). The layout block in the canonical spec (panel_top_left, panel_top_right, etc.) carries enough structure for the model to keep the four panels distinct while treating them as one coordinated image.

[Image: Mode C, 2×2 composite of a day in the life]
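
To make the layout idea concrete, here is a hypothetical sketch of a 2×2 layout block and one way it might be flattened into prompt text. Only panel_top_left and panel_top_right are named in this write-up; the bottom-row key names, the describe_layout helper, and the output wording are all assumptions for illustration, not the canonical spec.

```python
from typing import Dict

# Scene descriptions come from the Mode C run; the two bottom keys are
# assumed by analogy with the top-row names.
layout: Dict[str, str] = {
    "panel_top_left": "morning coffee",
    "panel_top_right": "midday park walk",
    "panel_bottom_left": "afternoon reading",      # key name assumed
    "panel_bottom_right": "evening at a balcony",  # key name assumed
}


def describe_layout(layout: Dict[str, str]) -> str:
    """Render one 'position panel: scene' line per layout entry, in insertion order."""
    lines = []
    for key, scene in layout.items():
        position = key[len("panel_"):] if key.startswith("panel_") else key
        lines.append(f"{position.replace('_', ' ')} panel: {scene}")
    return "\n".join(lines)


prompt_block = describe_layout(layout)
print(prompt_block)
```

Naming each panel by position, rather than by index, keeps the scene-to-quadrant mapping explicit in the prompt.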

Mode D — 2×4 storyboard, 8 panels in one image, square

composite · 8 panels · 1024² · n=1 · $0.067 · 47.0 s wall

This mode tests the 8-frame coherence ceiling. Eight chronological beats in a single image: early-morning bedroom, coffee, train commute, cafe lunch, afternoon reading, midday park, dinner cooking, dusk balcony. The model holds the same person's identity across all eight panels. At 1024² each sub-panel is small (~256×512 in a 2-row, 4-column grid), so fine facial detail is compressed, but the silhouette, hair, and age are unmistakably one person across all panels.

Mode E — 2×4 storyboard at 1536×1024 landscape

composite · 8 panels · 1536×1024 · n=1 · $0.067 · 45.2 s wall

Same 8-panel storyboard, larger landscape canvas. Each sub-panel is now ~384×512 instead of ~256×512, so facial features are visibly sharper and the model gets more pixels to express clothing and environment detail per panel. This is the recommended mode for an 8-frame day-in-the-life: same per-call cost as the 1024² variant ($0.067), comparable latency (45.2 s vs 47.0 s), and substantially more usable detail per sub-panel.

[Image: Mode E, 8-panel storyboard at 1536×1024]
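
The per-panel pixel budget follows directly from the canvas size and grid shape. The sketch below assumes the panels tile the canvas evenly with no gutters or borders, in 2 rows by 4 columns for the storyboards; real renders will deviate somewhat because the model draws its own panel dividers.

```python
from typing import Tuple


def sub_panel_size(canvas_w: int, canvas_h: int, rows: int, cols: int) -> Tuple[int, int]:
    """Approximate (width, height) of one panel in an evenly tiled grid."""
    return canvas_w // cols, canvas_h // rows


print(sub_panel_size(1024, 1024, rows=2, cols=2))  # (512, 512)  2×2 composite
print(sub_panel_size(1024, 1024, rows=2, cols=4))  # (256, 512)  2×4 storyboard, square
print(sub_panel_size(1536, 1024, rows=2, cols=4))  # (384, 512)  2×4 storyboard, landscape
```

Moving from the square to the landscape canvas widens each storyboard panel by 50% while leaving its height unchanged.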

What this means for Pikumo

Today the worker pipeline issues one /edits call per panel (2–4 panels per story → 2–4 calls). The findings above suggest two levers:

Use n for variants on a single panel. When the user explicitly re-rolls a panel, send one call with n=2 or n=4 instead of that many separate calls. Latency per re-roll drops sharply; dollar cost is identical.
Consider the composite mode for the "story preview" surface. A 2×2 (4-panel) or 2×4 (8-panel) composite of the user's full story in one image, at quality=low, costs the same as one single-panel low-quality gen (~$0.067) and renders in roughly 45–48 s. Per-panel face fidelity is lower than with separate gens (each sub-panel is roughly 512×512 in a 2×2, or 256×512 to 384×512 in a 2×4, instead of 1024×1024), but identity hold across the composite is robust. Useful as a thumbnail-tier "scrollable story" surface.

The composite mode does not replace per-panel high-quality generation for the user-facing final album — sub-panel resolution is too low. But it is a viable cheap-preview tier.
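
The dollar side of the preview tier can be sketched with the per-call figure measured here. This assumes every low-quality call costs the observed $0.067 regardless of panel count or canvas size, as in Modes A and C through E; the story_cost helper is illustrative only.

```python
LOW_QUALITY_CALL_USD = 0.067  # per-image cost observed in this test


def story_cost(panels: int, composite: bool) -> float:
    """Estimated cost of rendering a story: one call per panel, or one composite call."""
    calls = 1 if composite else panels
    return round(calls * LOW_QUALITY_CALL_USD, 3)


for panels in (2, 4):
    separate = story_cost(panels, composite=False)
    preview = story_cost(panels, composite=True)
    print(f"{panels} panels: separate ${separate:.3f} vs composite ${preview:.3f}")
```

The composite figure stays flat as the story grows, which is what makes it attractive for previews even though the final album still needs per-panel generation.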

Generated 2026-05-26. Subject: female_asian headshot from the eval cast pool. Style: dreamscape. Model: Azure gpt-image-2.