Any Aspect Ratio: Rearchitecting the Pipeline’s Input Contract
Early users uploaded portrait and square videos. The pipeline rejected them. Here’s how SourceMetadata was refactored to support multiple output ratios.
The original pipeline had one job: take a 16:9 landscape video and produce a 9:16 portrait crop. SourceMetadata carried crop_width, crop_height, and max_crop_x — fields that only made sense for a single target ratio.
When early users uploaded portrait videos, the pipeline exited with code 1. Square videos failed the same way, and the error messages weren't helpful. The architecture needed to change before the product could scale.
The old architecture's problem
Every downstream module received crop dimensions baked into SourceMetadata. When the source wasn't 16:9, those dimensions were wrong. And with a single set of dimensions, there was no path to multiple output ratios (9:16, 1:1, 4:5) without duplicating the entire pipeline run.
The new data model
SourceMetadata loses all crop fields. A new CropSpec dataclass carries the per-ratio geometry: (ratio, crop_w, crop_h, max_crop_x, max_crop_y).
def compute_crop_spec(source_w, source_h, ratio) -> CropSpec:
    # height-constrained first, falls back to width-constrained
    # always produces even dimensions
    ...

def is_center_crop_path(source_w, source_h, ratio) -> bool:
    # when source AR ≤ target AR (e.g. portrait source, portrait target),
    # the crop is centered and static, so no attention analysis is needed
    ...
The fast path
is_center_crop_path is the key insight. When a portrait video is cropped to 9:16, the source is already as narrow as the target or narrower, so the crop is centered and static. No attention analysis, motion solving, or scene detection is needed: the pipeline skips Phase A (the once-per-job source analysis) entirely and goes directly to render.
This eliminated false rejections and made the "upload any video" promise actually work.
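To make the geometry concrete, here is a minimal sketch of how the two functions could be filled in. It is not the pipeline's actual code. It assumes ratio is a (width, height) tuple such as (9, 16), that "even dimensions" means rounding down to the nearest even number, and that max_crop_x and max_crop_y are the largest valid top-left offsets for the crop window. The _floor_even helper is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CropSpec:
    ratio: tuple      # assumed (width, height), e.g. (9, 16)
    crop_w: int
    crop_h: int
    max_crop_x: int   # assumed: largest valid left offset of the crop window
    max_crop_y: int   # assumed: largest valid top offset of the crop window


def _floor_even(n: int) -> int:
    # Hypothetical helper: round down to the nearest even integer.
    return n - (n % 2)


def compute_crop_spec(source_w: int, source_h: int, ratio: tuple) -> CropSpec:
    rw, rh = ratio
    # Height-constrained attempt: use the full source height.
    crop_h = _floor_even(source_h)
    crop_w = _floor_even(source_h * rw // rh)
    if crop_w > source_w:
        # Too wide for the source, so fall back to width-constrained.
        crop_w = _floor_even(source_w)
        crop_h = _floor_even(source_w * rh // rw)
    return CropSpec(ratio, crop_w, crop_h, source_w - crop_w, source_h - crop_h)


def is_center_crop_path(source_w: int, source_h: int, ratio: tuple) -> bool:
    rw, rh = ratio
    # source AR <= target AR, cross-multiplied to stay in integer arithmetic.
    return source_w * rh <= rw * source_h
```

Under these assumptions, a 1920x1080 source with a (9, 16) target yields a 606x1080 crop with max_crop_x of 1314 and takes the full analysis path, while a 1080x1920 source with the same target takes the center-crop path.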
The new execution model
Modules 1–4 (ingestion, attention, scene detection, effect classification) run once per job as Phase A, because what they compute is a property of the source, not the output. A per-ratio Phase B loop then runs motion solving → EDL generation → render → verify. One Phase A, multiple Phase B.
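The control flow can be sketched in a few lines. The names run_job, phase_a, and phase_b are hypothetical and the stand-in functions only count calls. The point is the shape: one analysis pass, then one output pass per requested ratio.

```python
from typing import Any, Callable, Dict, Iterable, Tuple

Ratio = Tuple[int, int]  # assumed (width, height)


def run_job(
    source_path: str,
    ratios: Iterable[Ratio],
    phase_a: Callable[[str], Any],
    phase_b: Callable[[Any, Ratio], Any],
) -> Dict[Ratio, Any]:
    """Run source analysis once, then one output pass per ratio."""
    analysis = phase_a(source_path)  # Phase A: source-level analysis, once per job
    outputs = {}
    for ratio in ratios:
        # Phase B: motion solving, EDL generation, render, verify
        outputs[ratio] = phase_b(analysis, ratio)
    return outputs


# Stand-ins that only record how often each phase runs.
calls = {"phase_a": 0, "phase_b": 0}


def fake_phase_a(path: str) -> dict:
    calls["phase_a"] += 1
    return {"source": path}


def fake_phase_b(analysis: dict, ratio: Ratio) -> str:
    calls["phase_b"] += 1
    return f"{analysis['source']}@{ratio[0]}x{ratio[1]}"


outputs = run_job("clip.mp4", [(9, 16), (1, 1), (4, 5)], fake_phase_a, fake_phase_b)
```

With three ratios requested, the stand-in Phase A runs once and the stand-in Phase B runs three times.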
The key invariant: the subject's location in the frame is a property of the source video, not of the output format. The motion solver translates subject pixels to crop coordinates per-ratio. Mixing these concerns — as the old code did — is what made multi-ratio expansion hard.
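As an illustration of that translation step, the sketch below maps one subject position, measured in source pixels, to a crop offset for a given ratio. It is a deliberate simplification and not the pipeline's motion solver: it handles a single frame and the horizontal axis only, with no smoothing. The function name crop_x_for_subject is hypothetical.

```python
def crop_x_for_subject(subject_x: float, crop_w: int, max_crop_x: int) -> int:
    """Left edge of a crop window centered on subject_x, clamped to the frame."""
    left = round(subject_x - crop_w / 2)
    return max(0, min(left, max_crop_x))


# Same subject position in a 1920x1080 source, two different output ratios.
subject_x = 960
x_9_16 = crop_x_for_subject(subject_x, crop_w=606, max_crop_x=1314)
x_1_1 = crop_x_for_subject(subject_x, crop_w=1080, max_crop_x=840)
```

The subject position is computed once from the source. Only crop_w and max_crop_x change between ratios, so each ratio gets its own offset from the same analysis.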