A real clip (Johnny Harris, "Mapping the Trump Shooting") was pulled, transcribed to word-level timing, then captioned four different ways by our engine and composited back over the footage. Same words, same timing; only the style changes. Click any clip to play it.
How it works.
1. yt-dlp pulls the clip.
2. Whisper transcribes it to a word-by-word timeline with start/end times.
3. That timeline is a plain JSON file you can edit: fix a misheard word or nudge timing.
4. Our engine renders the captions as a transparent animated layer and ffmpeg composites it over the video.
The style is just a setting, so the same transcript can become any look, and a style is reusable across every future video.
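
The editing and compositing steps above can be sketched roughly as follows. This is a minimal illustration, not our engine's actual code: the JSON shape (`word`, `start`, `end` in seconds), the sample words, the helper names, and the file names are all assumptions made for the example, and the caption layer is assumed to be a video file with an alpha channel.

```python
import json

# Hypothetical timeline shape: one entry per word, times in seconds.
# The words are placeholders, not taken from the real clip.
timeline = [
    {"word": "hello", "start": 0.0, "end": 0.4},
    {"word": "wurld", "start": 0.4, "end": 0.9},  # misheard word
    {"word": "again", "start": 0.9, "end": 1.3},
]


def fix_word(timeline, index, text):
    """Return a copy with the word at `index` replaced; timing is untouched."""
    fixed = [dict(w) for w in timeline]
    fixed[index]["word"] = text
    return fixed


def nudge(timeline, offset, first=0, last=None):
    """Return a copy with words first..last shifted by `offset` seconds (never below 0)."""
    last = len(timeline) - 1 if last is None else last
    out = []
    for i, w in enumerate(timeline):
        w = dict(w)
        if first <= i <= last:
            w["start"] = round(max(0.0, w["start"] + offset), 3)
            w["end"] = round(max(0.0, w["end"] + offset), 3)
        out.append(w)
    return out


def overlay_command(video, captions, output):
    """Build (but do not run) an ffmpeg command that overlays the caption layer."""
    return [
        "ffmpeg", "-i", video, "-i", captions,
        # Draw input 1 (captions, with alpha) on top of input 0 (the clip).
        "-filter_complex", "[0:v][1:v]overlay=0:0[v]",
        # Keep the composited video and, if present, the clip's original audio.
        "-map", "[v]", "-map", "0:a?", "-c:a", "copy",
        output,
    ]


# Fix the misheard word, then pull it and everything after it 50 ms earlier.
edited = nudge(fix_word(timeline, 1, "world"), -0.05, first=1)
print(json.dumps(edited, indent=2))
print(" ".join(overlay_command("clip.mp4", "captions.mov", "out.mp4")))
```

Both helpers return copies, so the original transcript stays intact and the same timeline can be re-rendered in another style later.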
The test segment is 8 seconds long. Timing here comes from the Whisper "base" model; a dedicated word-alignment tool (WhisperX or stable-ts) should get sub-100 ms word sync for true karaoke precision.
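
One way to check a sync target like this is to compare a transcript's word start times against a hand-corrected reference. The sketch below assumes the same hypothetical `word`/`start`/`end` JSON shape in seconds; the sample timings are invented for illustration and are not measurements from the clip.

```python
def sync_errors_ms(predicted, reference):
    """Per-word absolute start-time error, in milliseconds."""
    if len(predicted) != len(reference):
        raise ValueError("timelines must contain the same number of words")
    return [
        abs(p["start"] - r["start"]) * 1000.0
        for p, r in zip(predicted, reference)
    ]


def within_budget(predicted, reference, budget_ms=100.0):
    """True if the worst word start is off by less than `budget_ms`."""
    return max(sync_errors_ms(predicted, reference), default=0.0) < budget_ms


# Invented sample data: model output vs. a hand-corrected reference.
predicted = [
    {"word": "hello", "start": 0.00, "end": 0.40},
    {"word": "world", "start": 0.46, "end": 0.90},
    {"word": "again", "start": 0.90, "end": 1.30},
]
reference = [
    {"word": "hello", "start": 0.02, "end": 0.40},
    {"word": "world", "start": 0.40, "end": 0.88},
    {"word": "again", "start": 0.93, "end": 1.30},
]

print(within_budget(predicted, reference))
```

The check uses the worst-case error instead of the mean, since a single late word is what a viewer notices in karaoke-style captions.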