gpt-image-2 — multi-scene & n-coherence test
Tests three orthogonal capabilities of Azure gpt-image-2
via /v1/images/edits:
(1) the n parameter for variants in one call,
(2) multi-panel composite rendering with cross-panel identity hold,
(3) the 8-frame coherence ceiling. All calls use the same single
reference photograph of one person.
Reference photo
female_asian headshot from the eval cast pool, attached as image[] on every call below.
The three findings
- Azure /edits accepts n=4. Four variants returned in 41.3 s total, only 2.3 s slower than the n=1 call. ~75% wall-clock saving vs four separate gens.
- 2×2 composite preserves identity across all 4 in-image panels. Face, hair, age stable; outfits varied appropriately by time-of-day cue.
- 2×4 storyboard (8 panels) also preserves identity at both 1024² and 1536×1024. The "8-frame coherence" ceiling holds. Landscape (1536×1024) panels are visibly more detailed per panel.
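The wall-clock saving in the first finding can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the four separate gens would run one after another (not concurrently from the client) and that each would take the n=1 latency; the function name is illustrative only. With the reported timings it comes out at roughly 73.5%, consistent with the rounded ~75% figure.

```python
def wall_clock_saving(single_call_s: float, batched_call_s: float, n: int) -> float:
    """Fraction of wall-clock time saved by one n-variant call
    compared with n sequential single-variant calls."""
    sequential_s = single_call_s * n
    return 1.0 - batched_call_s / sequential_s


batched_s = 41.3            # reported latency of the n=4 call
single_s = batched_s - 2.3  # n=1 latency implied by the "+2.3 s" figure
saving = wall_clock_saving(single_s, batched_s, n=4)
print(f"{saving:.1%}")
```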
Mode A — baseline
One scene, one image. Baseline cost and latency. Identity preserved
from the single attached reference photo. Style register lands cleanly
in dreamscape.
Mode B — n=4 variants in one call
Azure /edits supports n=4. Four
independent samples of the same prompt in one HTTP call. Wall-clock
went from 39.0 s (n=1) to 41.3 s (n=4) — only +2.3 s. Cost still scales
per-image, but parallel sampling on the server side is essentially free
on latency. Identity (face, hair, age) stable across all 4; outfit
drifts between variants (striped vs floral pajamas, plain top) since
do_not_match.clothing doesn't tie the outfit down to one
specific design. Useful as a low-cost re-roll without paying 4× the
wall-clock.
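A re-roll request along these lines might be assembled as below. This is a sketch under stated assumptions, not the pipeline's actual code: the report confirms only the n parameter, the image[] attachment, the quality=low setting and the canvas sizes. The helper name, the prompt text, the file name and the exact form-field layout (prompt, n, size, quality as multipart fields) are assumptions modelled on an OpenAI-style image edit form. The function only builds the payload and sends nothing.

```python
def build_edit_form(prompt, reference_path, n=1, size="1024x1024", quality="low"):
    """Return (fields, files) for one multipart /edits request.

    Hypothetical helper: the field names mirror an OpenAI-style image
    edit form and are an assumption, not a confirmed Azure contract.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    fields = {
        "prompt": prompt,
        "n": str(n),          # number of variants sampled in this one call
        "size": size,
        "quality": quality,
    }
    # The same single reference photo is attached as image[] on every call.
    files = [("image[]", reference_path)]
    return fields, files


# One call that asks for four variants instead of four separate calls.
fields, files = build_edit_form(
    "same person, dreamscape style, morning coffee",  # placeholder prompt
    "reference.png",                                  # placeholder path
    n=4,
)
```

The returned pair can be handed to whichever HTTP client the worker already uses; keeping payload construction separate from sending makes the n value easy to vary per re-roll.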
Mode C — 2×2 composite, one image
A single 1024×1024 image with four panels of the same person:
TL = morning coffee, TR = midday park walk, BL = afternoon reading,
BR = evening at a balcony. Identity locks across all four panels.
Face shape, dark hair, skin tone, apparent age — all consistent.
Outfits vary appropriately by time-of-day cue (pajamas → jacket →
cardigan). The layout block in the canonical spec
(panel_top_left, panel_top_right, etc.)
gives the model enough structure to subdivide the canvas and
coordinate the panels.
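For illustration, a 2×2 layout block of this kind might look like the following. Only panel_top_left and panel_top_right are named in this report; the bottom-row keys, the grid field and the overall nesting are assumptions made for the sketch, and the real canonical spec may differ.

```python
import json

# Keys in reading order: top-left, top-right, bottom-left, bottom-right.
# The two bottom keys are assumed by analogy with the documented top keys.
PANEL_KEYS = [
    "panel_top_left",
    "panel_top_right",
    "panel_bottom_left",
    "panel_bottom_right",
]


def build_layout(beats):
    """Map four scene descriptions, in reading order, onto the panel keys."""
    if len(beats) != len(PANEL_KEYS):
        raise ValueError("a 2x2 composite needs exactly four beats")
    layout = {"grid": "2x2"}  # hypothetical field describing the grid shape
    layout.update(zip(PANEL_KEYS, beats))
    return layout


layout = build_layout([
    "morning coffee",
    "midday park walk",
    "afternoon reading",
    "evening at a balcony",
])
print(json.dumps({"layout": layout}, indent=2))
```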
Mode D — 2×4 storyboard, 8 panels in one image, square
The 8-frame coherence ceiling. Eight chronological beats in a single image: early-morning bedroom, coffee, train commute, cafe lunch, afternoon reading, midday park, dinner cooking, dusk balcony. The model holds the same person's identity across all eight panels. At 1024² each sub-panel is small (~256×512, with two rows of four panels), so fine facial detail is compressed, but the silhouette, hair, and age are unmistakably one person across the whole grid.
Mode E — 2×4 storyboard at 1536×1024 landscape
Same 8-panel storyboard, larger landscape canvas. Each sub-panel is now ~384×512 instead of ~256×512, so facial features are visibly sharper and the model gets more pixels to express clothing and environment detail per panel. This is the recommended mode for an 8-frame day-in-the-life: same per-call cost as the 1024² variant (latency 45.2 s vs 47.0 s), with substantially more usable detail per sub-panel.
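The per-panel pixel budget follows directly from the canvas size and the grid. The sketch below assumes the canvas is split evenly into two rows of four panels with no gutters or borders; the model's actual panel boundaries are not guaranteed to match, so the results are nominal sizes only.

```python
def sub_panel_size(width: int, height: int, rows: int, cols: int):
    """Nominal (width, height) of one panel under an even grid split."""
    return width // cols, height // rows


# 2 rows x 4 columns, as in the 8-panel storyboard.
square = sub_panel_size(1024, 1024, rows=2, cols=4)
landscape = sub_panel_size(1536, 1024, rows=2, cols=4)
print(square, landscape)  # (256, 512) (384, 512)

# 2 x 2 composite on the square canvas, for comparison.
print(sub_panel_size(1024, 1024, rows=2, cols=2))  # (512, 512)
```

Under this assumption the landscape canvas gives each panel 50% more width at the same height, which matches the sharper per-panel detail observed.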
What this means for Pikumo
Today the worker pipeline issues one /edits call per panel
(2–4 panels per story → 2–4 calls). The findings above suggest two
levers:
- Use n for variants on a single panel. When the user explicitly re-rolls a panel, send n=2 or n=4 instead of N separate calls. Cheaper per re-roll on latency, identical on dollar cost.
- Consider the composite mode for the "story preview" surface. A 2×2 (4-panel) or 2×4 (8-panel) composite of the user's full story in one image, at quality=low, costs the same as one single-panel low-quality gen (~$0.067) and renders in ~45 s. Per-panel face fidelity is lower than separate gens (since each sub-panel is ~250×512 pixels instead of 1024×1024) but identity hold across the composite is robust. Useful as a thumbnail-tier "scrollable story" surface.
The composite mode does not replace per-panel high-quality generation for the user-facing final album — sub-panel resolution is too low. But it is a viable cheap-preview tier.
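As a rough illustration of the preview-tier trade-off, the sketch below compares one composite call with one low-quality call per panel. It assumes the ~$0.067 figure applies to every low-quality call regardless of canvas layout, and that a per-panel preview would also run at quality=low; both the constant name and the function are hypothetical.

```python
LOW_QUALITY_CALL_USD = 0.067  # approximate per-call price quoted in this report


def preview_cost(panels: int, composite: bool) -> float:
    """Estimated dollar cost of previewing a story of `panels` panels."""
    if panels < 1:
        raise ValueError("a story needs at least one panel")
    # A composite renders every panel in one call; otherwise one call per panel.
    calls = 1 if composite else panels
    return round(calls * LOW_QUALITY_CALL_USD, 3)


for panels in (2, 4):
    print(panels, preview_cost(panels, composite=True),
          preview_cost(panels, composite=False))
```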
Generated 2026-05-26. Subject: female_asian headshot from the eval cast pool.
Style: dreamscape. Model: Azure gpt-image-2.