Product
How vdowrx works
A 9-module computer vision pipeline, 9 content-type presets, and a non-bypassable verifier.
Architecture
The 9-module pipeline
Every vdowrx job runs the same nine modules in sequence. Modules 1, 2, 5, 7, and 8 are blocking — a failure stops the job with a clear error. The others contribute to quality but have fallback behaviour. Jobs requesting multiple output ratios run modules 1–4 once (shared across ratios) and modules 5–9 independently per ratio.
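The shared versus per-ratio split can be pictured as a flat execution plan. The sketch below is illustrative only: the module names come from this page, while `execution_plan` and its tuple layout are hypothetical and not part of the vdowrx API.

```python
# Module names as listed in this section, in pipeline order.
MODULES = [
    "ingestion", "attention", "scene_detection", "effect_classifier",
    "motion_solver", "edl_generator", "renderer", "verifier",
    "narrative_validator",
]
SHARED_MODULES = MODULES[:4]      # modules 1-4: run once per job
PER_RATIO_MODULES = MODULES[4:]   # modules 5-9: run once per output ratio


def execution_plan(ratios):
    """Return (module, ratio) steps; ratio is None for the shared stages."""
    plan = [(name, None) for name in SHARED_MODULES]
    for ratio in ratios:
        plan.extend((name, ratio) for name in PER_RATIO_MODULES)
    return plan
```

Under this model a job requesting two ratios runs 4 + 2 × 5 = 14 module executions rather than 18.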
ingestion · Source validation & crop geometry · blocking
Probes the source with ffprobe and validates landscape orientation. Computes the even-floor crop spec for every requested output ratio; for 9:16, crop_w = floor(h × 9/16), rounded down to the nearest even number. Fails fast on portrait, zero-width, or equal-dimension inputs. The even-floor formula can produce a crop window 1–2 pixels narrower than naive rounding. This is intentional, and the verifier is calibrated to accept it.
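A minimal sketch of the even-floor calculation, assuming the same formula generalises to the other output ratios by swapping 9/16 for the requested ratio. The function name and error messages are hypothetical; only the formula and the fail-fast conditions come from the description above.

```python
def even_floor_crop_width(src_w: int, src_h: int,
                          ratio_w: int = 9, ratio_h: int = 16) -> int:
    """Crop width for a landscape source, floored to the nearest even number."""
    if src_w <= 0 or src_h <= 0:
        raise ValueError("source dimensions must be positive")
    if src_w <= src_h:
        # Covers both portrait and equal-dimension (square) inputs.
        raise ValueError("source must be landscape (width > height)")
    # Integer arithmetic avoids floating-point error in floor(h * 9/16).
    raw = (src_h * ratio_w) // ratio_h
    return raw - (raw % 2)
```

For a 1920×1080 source the exact width is 607.5 pixels: naive rounding gives 608, while the even floor gives 606, which is the 2-pixel difference the verifier tolerates.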
attention · Per-frame attention analysis · blocking
Computes a per-frame attention vector using a weighted sum of six signals: face centroid (MediaPipe), pose centroid (MediaPipe), optical flow dominant direction, YOLO object centroid, audio energy, and Whisper speech activity. The output is subject_cx, the horizontal centre of the dominant subject in source pixels. Content-type preset weights tune the contribution of each signal.
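The weighted sum can be sketched as follows. This assumes each signal has already been reduced to a horizontal position estimate in source pixels, which the description does not specify for the audio and speech signals. Renormalising over the signals that are present and falling back to the frame centre are also assumptions, and the function and key names are hypothetical.

```python
from typing import Mapping, Optional


def subject_cx(estimates: Mapping[str, Optional[float]],
               weights: Mapping[str, float],
               frame_w: int) -> float:
    """Weighted mean of per-signal horizontal estimates, in source pixels."""
    total = 0.0
    weight_sum = 0.0
    for name, weight in weights.items():
        cx = estimates.get(name)
        if cx is None:
            continue  # signal produced no estimate for this frame
        total += weight * cx
        weight_sum += weight
    if weight_sum == 0.0:
        return frame_w / 2.0  # assumed fallback: centre of the frame
    # Keep the result inside the source frame.
    return min(max(total / weight_sum, 0.0), float(frame_w))
```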
scene_detection · Scene boundary detection
Runs PySceneDetect to find scene cuts. Boundaries inform the effect classifier and motion solver, ensuring crop transitions don't cross hard cuts. Scenes shorter than a minimum duration threshold are merged into adjacent scenes.
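The merge step might look like the sketch below. It assumes scenes arrive as contiguous (start, end) pairs in seconds and that a short scene is folded into the preceding scene, or into the following one when it is first. The page does not say which neighbour is chosen or what the threshold is, so both are placeholders.

```python
def merge_short_scenes(scenes, min_duration):
    """Merge scenes shorter than min_duration into an adjacent scene.

    scenes: list of (start, end) tuples in seconds, sorted and contiguous.
    """
    merged = []
    for start, end in scenes:
        if merged and (end - start) < min_duration:
            # Fold the short scene into the previous one.
            prev_start, _ = merged[-1]
            merged[-1] = (prev_start, end)
        else:
            merged.append((start, end))
    # A short first scene has no predecessor, so fold it into the next scene.
    if len(merged) > 1 and (merged[0][1] - merged[0][0]) < min_duration:
        first_start, _ = merged[0]
        _, second_end = merged[1]
        merged[0:2] = [(first_start, second_end)]
    return merged
```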
effect_classifier · Per-scene effect classification
Classifies each scene into one of five effects: STATIC (crop doesn't move), BREATHE (slow gentle oscillation), PAN_FOLLOW (active subject tracking), FAST_CUT (tight rapid tracking for action), or INTERVIEW (two-subject alternation). Classification is based on subject movement statistics across the scene: standard deviation of crop_x, mean tracking confidence, and optical flow magnitude.
motion_solver · Per-frame crop position & zoom · blocking
Solves the temporal crop trajectory: smooths the attention signal with a Gaussian filter, applies per-scene effect behaviour, and clips zoom to [1.0, ceiling]. Critical detail: after Gaussian smoothing, any zoom value within 1e-9 of 1.0 is snapped exactly to 1.0, which prevents floating-point epsilon from introducing a spurious Lanczos scale step in the renderer.
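The smooth, clip, and snap sequence for zoom can be sketched in plain Python. The kernel radius, the default sigma, and the edge handling (clamping to the first and last frame) are assumptions; only the clip range and the 1e-9 snap tolerance come from the description. Per-scene effect behaviour is not modelled here.

```python
import math


def gaussian_smooth(values, sigma=2.0):
    """Smooth a sequence with a normalised Gaussian kernel, clamping at edges."""
    radius = max(1, int(3 * sigma))
    kernel = [math.exp(-(i * i) / (2 * sigma * sigma))
              for i in range(-radius, radius + 1)]
    norm = sum(kernel)
    kernel = [k / norm for k in kernel]
    n = len(values)
    out = []
    for i in range(n):
        acc = 0.0
        for k, weight in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)  # clamp index to valid range
            acc += weight * values[j]
        out.append(acc)
    return out


def solve_zoom(raw_zoom, ceiling, sigma=2.0, snap_eps=1e-9):
    """Smooth zoom, clip to [1.0, ceiling], then snap near-1.0 values to 1.0."""
    solved = []
    for z in gaussian_smooth(raw_zoom, sigma):
        z = min(max(z, 1.0), ceiling)
        if abs(z - 1.0) < snap_eps:
            z = 1.0  # avoid a spurious scale step from floating-point residue
        solved.append(z)
    return solved
```

The snap matters because a normalised kernel applied to a constant 1.0 signal can return a value like 1.0000000000000002, which a strict `zoom == 1.0` check in the renderer would treat as a real zoom.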
edl_generator · Edit Decision List
Writes the frame-by-frame crop decisions to an EDL file before the renderer starts. This makes the pipeline resumable from the EDL without re-running the CV stages. Zoom levels are rounded to 4 decimal places. The EDL is stored alongside output files and included in the API response.
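As a rough illustration, an EDL could be serialised as JSON like this. The actual vdowrx EDL format is not documented on this page, so the field names and the `CropDecision` type are hypothetical; the only detail taken from the description is the 4-decimal zoom rounding.

```python
import json
from dataclasses import dataclass


@dataclass
class CropDecision:
    frame: int     # frame index in the source
    crop_x: int    # left edge of the crop window, in source pixels
    zoom: float    # zoom factor, 1.0 means no zoom


def build_edl(decisions, crop_w, crop_h):
    """Build a JSON-serialisable EDL (hypothetical layout)."""
    return {
        "crop_w": crop_w,
        "crop_h": crop_h,
        "frames": [
            {"frame": d.frame, "crop_x": d.crop_x, "zoom": round(d.zoom, 4)}
            for d in decisions
        ],
    }


def dump_edl(edl) -> str:
    return json.dumps(edl, indent=2)


def load_edl(text: str):
    """Reload an EDL so rendering can resume without re-running CV stages."""
    return json.loads(text)
```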
renderer · FFmpeg crop + render · blocking
Builds a per-frame FFmpeg sendcmd file and runs the filter graph. Uses sendcmd rather than FFmpeg filter expressions because expressions hit a recursive stack overflow at approximately 3,600 nesting levels, roughly 120 seconds at 30fps. One sendcmd line per frame bypasses the expression evaluator entirely. No upscaling: if zoom is exactly 1.0, no scale filter is inserted.
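A sketch of generating one sendcmd line per frame for the zoom-1.0 case. It relies on FFmpeg's sendcmd interval syntax (`TIME TARGET COMMAND ARG;`) and on the crop filter accepting the `x` runtime command. The helper names, the file name, and the exact filter chain vdowrx builds are assumptions, and zoom handling is left out.

```python
def build_sendcmd(crop_xs, fps):
    """Return sendcmd file contents: one crop-x update per frame."""
    lines = []
    for frame, x in enumerate(crop_xs):
        timestamp = frame / fps
        # Each line targets the crop filter and sets its x offset at that time.
        lines.append(f"{timestamp:.6f} crop x {int(x)};")
    return "\n".join(lines) + "\n"


def build_filter_chain(cmd_path, crop_w, crop_h):
    """Filter chain for zoom == 1.0: sendcmd feeds crop, with no scale filter."""
    return f"sendcmd=f={cmd_path},crop=w={crop_w}:h={crop_h}:x=0:y=0"
```

The resulting string would be passed to FFmpeg's `-vf` option, with the sendcmd contents written to `cmd_path` beforehand.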
verifier · Non-bypassable output verification · blocking
Six tests run on every output. Blocking tests (any failure = REJECTED): playability (ffprobe decode check), dimensions (even-floor compatible), duration parity, audio preservation. Quality tests (tiered): pixel provenance (frame sample match rate ≥ 0.90 for VERIFIED, ≥ 0.70 for DELIVERED_WITH_QUALITY_FLAGS), sharpness ratio. Jobs between the floor and warn thresholds are delivered with quality_flags metadata rather than failed.
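The tiering can be sketched as a small decision function. Only the pixel provenance thresholds are given above, so the sharpness ratio is left out, and treating a match rate below 0.70 as REJECTED is an assumption. The function and argument names are hypothetical.

```python
WARN_THRESHOLD = 0.90   # at or above: VERIFIED
FLOOR_THRESHOLD = 0.70  # at or above, but below warn: delivered with flags


def verdict(blocking_results, match_rate):
    """Combine blocking test results with the pixel provenance match rate.

    blocking_results: mapping of test name -> bool (True means passed).
    """
    if not all(blocking_results.values()):
        return "REJECTED"  # any blocking failure rejects the output
    if match_rate >= WARN_THRESHOLD:
        return "VERIFIED"
    if match_rate >= FLOOR_THRESHOLD:
        return "DELIVERED_WITH_QUALITY_FLAGS"
    return "REJECTED"  # assumed outcome below the hard floor
```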
narrative_validator · Temporal coherence validation
Post-delivery validation of the crop trajectory. Measures subject_continuity (how consistently the subject stays centred across the output) and motion_smoothness (absence of jitter in the crop position). If either metric falls below its threshold, the pipeline re-solves the motion with adjusted parameters, with at most one re-solve loop per job.
Content intelligence
Content presets
Each preset is a named set of six attention weights (face, speech, pose, optical flow, YOLO objects, audio energy) that sum to 1.0. The weights tune Module 02 (attention analysis) — changing how the pipeline decides what to track in each frame.
Weights are resolved at job creation time and stored immutably on the job record. Enterprise accounts can define custom presets with arbitrary weight vectors, stored per-account and selectable by name in the job request.
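A sketch of validating a weight vector and freezing it at job creation. The signal key names and the sum tolerance are assumptions, and the page does not publish the numeric weights of any built-in preset, so none are shown. It also assumes custom presets must sum to 1.0 like the built-in ones.

```python
import math
from types import MappingProxyType

# Hypothetical key names for the six signals described above.
SIGNALS = ("face", "speech", "pose", "optical_flow", "yolo_objects", "audio_energy")


def resolve_preset(weights):
    """Validate a weight vector and return a read-only copy for the job record."""
    missing = set(SIGNALS) - set(weights)
    extra = set(weights) - set(SIGNALS)
    if missing or extra:
        raise ValueError(f"missing={sorted(missing)}, unexpected={sorted(extra)}")
    if any(w < 0.0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    total = sum(weights.values())
    if not math.isclose(total, 1.0, abs_tol=1e-6):
        raise ValueError(f"weights must sum to 1.0, got {total}")
    # MappingProxyType blocks later mutation of the resolved weights.
    return MappingProxyType(dict(weights))
```

Copying into a read-only mapping means later edits to the preset definition cannot change a job that has already been created.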
talking-head · Talking Head · Starter
Face ↑↑↑ · Speech ↑↑ · Pose ↑ · Flow ↓ · Objects ↓ · Audio ↓
Best for: Single-camera presenter, vlog, product demo with presenter
Prioritises face position and speech activity. The crop follows the presenter's face smoothly and centrally. Minimal reaction to background motion. Ideal for any content where one person is the primary subject.
podcast · Podcast · Starter
Face ↑↑ · Speech ↑↑↑ · Pose ↑ · Flow ↓ · Objects ↓ · Audio ↑
Best for: Two-person interview, panel discussion, roundtable
Speech activity weight is elevated above face weight. When the active speaker changes, the crop follows speech cues rather than waiting for face detection to confirm the switch. Handles multi-speaker scenarios more responsively.
action · Action · Starter
Face ↓ · Speech ↓ · Pose ↑ · Flow ↑↑↑ · Objects ↑↑ · Audio ↑
Best for: Fast-moving subjects, stunts, physical performance
High optical flow and object tracking weights. The crop follows movement energy rather than face position — appropriate when the subject's face may not be visible or when the body as a whole is more relevant than the face.
documentary · Documentary · Starter
Face ↑ · Speech ↑ · Pose ↑ · Flow ↑ · Objects ↑ · Audio ↑
Best for: Mixed content: interviews + B-roll + subject shots
Balanced across all six signals. Adapts to the current scene type: interview segments get face-led behaviour, B-roll segments respond to motion and objects. The effect classifier handles scene-level transitions.
cinematic · Cinematic · Pro
Face ↑ · Speech ↓ · Pose ↑↑ · Flow ↑ · Objects ↑ · Audio ↓
Best for: Narrative film, scripted drama, music video
Pose weight elevated above face weight. Respects composed shots — the crop doesn't chase faces aggressively when the shot has deliberate framing. Lower speech weight because dialogue is rarely the primary composition signal in cinematic content.
wide-screen · Wide Screen · Pro
Face ↓ · Speech ↓ · Pose ↓ · Flow ↑ · Objects ↑↑ · Audio ↓
Best for: Landscape content where scene preservation matters
Prioritises scene-level composition over individual subject tracking. Useful for wide establishing shots, travel content, or scenic footage where the environment is as important as any person in it.
educational · Educational · Pro
Face ↑↑ · Speech ↑↑ · Pose ↑ · Flow ↓ · Objects ↑ · Audio ↑
Best for: Tutorial, screen recording with presenter, lecture, how-to
Face and speech weighted similarly to talking-head, but objects weight is elevated — because the presenter often gestures toward or holds objects relevant to the instruction. The crop stays with the presenter while remaining sensitive to deliberate object placement.
drama · Drama · Pro
Face ↑↑↑ · Speech ↑ · Pose ↑↑ · Flow ↓ · Objects ↓ · Audio ↑
Best for: Scripted scenes, emotional performance, character-focused narrative
Very high face weight combined with elevated pose. Performance and expression are the primary signals. Audio weight is elevated to capture emotional beats from dialogue and ambient sound. Flow and object tracking are suppressed to prevent distraction by background elements.
nature · Nature · Pro
Face ↓ · Speech ↓ · Pose ↓ · Flow ↑↑ · Objects ↑↑↑ · Audio ↑
Best for: Wildlife, landscape, animal behaviour, outdoor documentary
Low face and speech weight — the subjects are not human. Object detection (YOLO) and optical flow are the primary tracking signals. Follows animals, birds, and environmental movement. Audio weight is elevated because environmental sound is often the attention cue.
Enterprise custom presets — define any weight vector, give it a name, and use it by name in your job requests. Custom presets are account-scoped, validated on write, and resolved at job creation. They flow through the same pipeline as built-in presets.
Plan details
Plan caps, features, and overages
| Feature | Starter (Free) | Pro ($99/mo) | Studio ($299/mo) | Enterprise (Custom) |
|---|---|---|---|---|
| Monthly cap | 120 minutes | 300 minutes | 900 minutes | Unlimited |
| Output ratios | 9:16 only | 9:16 · 1:1 · 4:5 | 9:16 · 1:1 · 4:5 | 9:16 · 1:1 · 4:5 |
| Built-in presets | 4 (Starter tier) | All 9 | All 9 | All 9 |
| Custom presets | — | — | — | ✓ Unlimited |
| Webhooks | — | ✓ | ✓ | ✓ |
| URL ingestion | — | ✓ | ✓ | ✓ |
| Source bucket ingestion | — | — | — | ✓ |
| Cloud push delivery | — | — | — | ✓ (S3 + GCS) |
| Queue priority | Standard (10) | Priority (20) | High (25) | Highest (30) |
| Price | Free | $99/mo | $299/mo | Custom |
What counts as a processing minute?
One processing minute is one minute of source video duration submitted to the pipeline. Output duration equals source duration: we crop, we don't change the length of your video. If you request multiple output ratios (e.g. 9:16 and 1:1) in the same job, the source duration is counted once, not once per ratio. Minutes are counted at job confirmation time, reset on your billing cycle start date, and do not roll over to the next period.
What happens when you hit the cap?
Jobs submitted after the cap is reached are queued and will be processed at the start of your next billing period. No jobs are lost and no automatic overage charges apply. You can upgrade your plan at any time to process immediately.
Enterprise plans are uncapped by default. Volume pricing is available for accounts that regularly exceed 1,000 minutes per month — contact us at hello@vdowrx.ai.
Queue priority
vdowrx operates four separate AWS Batch job queues with priority weights of 10 (Starter), 20 (Pro), 25 (Studio), and 30 (Enterprise). Higher-priority jobs are preferentially assigned to available compute capacity. During peak periods, Enterprise jobs will always start before Studio and Pro jobs. During off-peak periods, all tiers typically see sub-minute queue times.
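The ordering effect of those weights can be illustrated with a simple in-memory priority queue. This does not model AWS Batch itself, only the idea that a higher weight is dequeued first and that jobs with equal weight run in submission order. The class and method names are hypothetical.

```python
import heapq
import itertools

# Priority weights per plan, as listed above.
QUEUE_PRIORITY = {"starter": 10, "pro": 20, "studio": 25, "enterprise": 30}


class JobScheduler:
    """Toy scheduler: highest priority first, FIFO within the same priority."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker preserving submit order

    def submit(self, job_id, plan):
        priority = QUEUE_PRIORITY[plan]
        # heapq is a min-heap, so negate the priority to pop the largest first.
        heapq.heappush(self._heap, (-priority, next(self._counter), job_id))

    def next_job(self):
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```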
Verification and quality flags
Every output passes through a non-bypassable 6-test verifier (Module 08). Jobs that fail blocking tests (unplayable output, wrong dimensions, missing audio) are marked VERIFICATION_FAILED and do not consume processing minutes.
Jobs where quality metrics fall between the warn threshold and the hard floor are delivered as DELIVERED_WITH_QUALITY_FLAGS with per-metric values in the API response. You receive the output and can decide whether to accept it or resubmit. These jobs do count toward your monthly cap.
Ready to integrate?
Start with a free Starter account. No credit card required.