How Velora's New Engine Reduces Render Times by 40%
Team Velora
Engineering
- 40% render speed increase
- 0.8s average frame time
- 10× concurrent jobs
The Problem We Set Out to Solve
When we launched Velora, our rendering pipeline was built on a conventional sequential processing model in which each scene moved through one stage at a time: script → voiceover → imagery → compositing → export. For short-form content, this felt fast enough. But as creators began requesting 15- and 30-minute productions, cracks appeared. What should have been a 12-minute render routinely took 35–45 minutes per job. Creator satisfaction scores dropped. Compute costs ballooned.
We knew we needed a fundamental re-architecture: not a patch, but a ground-up rethink of how Velora's engine parallelizes work.
Introducing the Velora Parallel Scene Graph (PSG)
The core breakthrough is what we internally call the Parallel Scene Graph. Instead of processing scenes sequentially, PSG constructs a dependency graph of every scene in a project the moment you submit a job. Scenes with no interdependencies, which typically account for 70–80% of all scenes in a video, are dispatched simultaneously to separate GPU workers.
A 30-minute video that previously had its 90 scenes processed one by one now fans out across our cluster in waves. The first wave processes ~60 independent scenes in parallel; the second wave handles the remaining scenes that depend on earlier output. Total wall-clock render time collapses dramatically.
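To make the wave idea concrete, here is a minimal sketch of that scheduling logic in Python. This is not our production scheduler: `Scene`, `render_scene`, and the thread pool are illustrative stand-ins for the real job model and GPU dispatch, and the worker count is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field

@dataclass
class Scene:
    id: str
    deps: set[str] = field(default_factory=set)  # scenes whose output this one needs

def plan_waves(scenes: dict[str, Scene]) -> list[list[str]]:
    """Group scenes into waves: every scene in a wave depends only on earlier waves."""
    remaining = dict(scenes)
    done: set[str] = set()
    waves: list[list[str]] = []
    while remaining:
        # A scene is ready once all of its dependencies have been rendered.
        ready = [sid for sid, s in remaining.items() if s.deps <= done]
        if not ready:
            raise ValueError("cycle in scene graph")
        waves.append(ready)
        done.update(ready)
        for sid in ready:
            del remaining[sid]
    return waves

def render_scene(scene_id: str) -> str:
    return f"frames://{scene_id}"  # placeholder for the real GPU render call

def render_project(scenes: dict[str, Scene], workers: int = 60) -> dict[str, str]:
    outputs: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for wave in plan_waves(scenes):
            # Everything in a wave fans out to the workers at the same time.
            for sid, out in zip(wave, pool.map(render_scene, wave)):
                outputs[sid] = out
    return outputs
```

The key property is that a scene only enters a wave once everything it depends on has already been rendered; a real scheduler also has to handle worker failures and retries, but the wave structure is the same.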
Voiceover Pre-Synthesis Pipeline
A second major bottleneck was voiceover generation. Previously, TTS synthesis happened only after imagery was composited, a sequential dependency that stalled the entire pipeline while it waited on ElevenLabs API calls.
In the new engine, voiceover synthesis is kicked off in parallel with scene imagery generation the moment a job is submitted. By the time the final visual frame is composited, the audio track is already fully rendered and waiting. This alone shaved an average of 4.2 minutes off every job regardless of length.
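As a rough illustration of the overlap, here is a toy `asyncio` sketch. The function names, timings, and return types are hypothetical stand-ins, not our actual pipeline code; the point is only that the audio track starts the moment the job does.

```python
import asyncio

async def synthesize_voiceover(script: str) -> bytes:
    await asyncio.sleep(1.0)  # stands in for the TTS API round-trip
    return b"audio"

async def render_visuals(scenes: list[str]) -> list[bytes]:
    await asyncio.sleep(2.0)  # stands in for per-scene GPU rendering
    return [b"frames" for _ in scenes]

async def render_job(script: str, scenes: list[str]) -> tuple[bytes, list[bytes]]:
    # Kick both tracks off at submission time; neither waits on the other.
    audio_task = asyncio.create_task(synthesize_voiceover(script))
    visuals = await render_visuals(scenes)
    # By the time the visuals finish, the audio is usually already done.
    audio = await audio_task
    return audio, visuals

audio, visuals = asyncio.run(render_job("script text", ["scene-1", "scene-2"]))
```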
Smart Frame Caching
The third pillar of the new engine is a perceptual frame cache. If two scenes share very similar visual elements (same background, same avatar model, same lighting), Velora now reuses pre-computed frame data instead of re-rendering from scratch. Our perceptual hashing algorithm compares scenes at the semantic level rather than pixel by pixel, which means meaningful reuse even when scenes aren't identical.
In real-world creator projects, average cache hit rates are sitting at 22–31%, which translates directly into skipped compute cycles and faster delivery.
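For intuition, here is a minimal sketch of similarity-based cache lookup, assuming each scene has already been reduced to an embedding vector. The cosine metric, the 0.95 threshold, and the linear scan are illustrative choices, not Velora's actual implementation; a production cache would use an approximate-nearest-neighbor index instead of scanning every entry.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class PerceptualFrameCache:
    """Reuse rendered frames for scenes whose descriptors are close enough."""

    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self.entries: list[tuple[list[float], str]] = []  # (embedding, frame ref)

    def lookup(self, embedding: list[float]) -> str | None:
        # Nearest-neighbor scan; returns a hit only above the threshold.
        best_score, best_ref = 0.0, None
        for cached_embedding, ref in self.entries:
            score = cosine(embedding, cached_embedding)
            if score > best_score:
                best_score, best_ref = score, ref
        return best_ref if best_score >= self.threshold else None

    def store(self, embedding: list[float], frame_ref: str) -> None:
        self.entries.append((embedding, frame_ref))
```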
Real-World Results
After a two-week canary period that ramped to 100% of production traffic, the numbers confirm what our internal testing predicted:
- Average render time for 5-min videos: 3.1 min → 1.9 min (−39%)
- Average render time for 30-min videos: 41 min → 24.4 min (−40%)
- P99 tail latency: 68 min → 38 min
- GPU utilization efficiency: 61% → 89%
- Compute cost per job: down ~28% despite higher parallelism
What's Next
PSG v2 is already in development. The next iteration will introduce predictive pre-warming: using behavioral signals to spin up GPU workers before a creator even hits "Generate". We're also exploring streaming delivery, where completed scenes begin downloading to the creator's browser while later scenes are still rendering. Our target is sub-60-second delivery for 5-minute videos by Q3 2026.
Engineering posts like this one are part of our commitment to radical transparency with the creator community. If you're a developer or infrastructure engineer who's passionate about these challenges, we're hiring.