SUBTITLES STUDIO

One transcript, four caption styles, our own engine

A real clip (Johnny Harris, "Mapping the Trump Shooting") was pulled, transcribed to word-level timing, captioned four different ways by our engine, and composited back over the footage. The words and timing are identical in all four; only the style changes. Click any clip to play it.

Reveal (Fern / Johnny Harris)
Clean, centered. Each word pops in exactly when it's spoken. Documentary look.
Karaoke highlight
Whole line is shown; the spoken word lights up and grows as it's said.
Hormozi / punchy
Big bold caps, two or three words at a time, active word in a yellow box. The Shorts/TikTok look.
Active-word color
Line shown in white; each word flips to green precisely on time.
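
To make the four looks concrete, here is a minimal sketch of how a style could reduce to a small settings record applied to the same word timeline. Everything in it is an assumption for illustration: the `Word` type, the `STYLES` presets, their field names, and the effect labels are hypothetical and are not the engine's actual configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


# Hypothetical presets: field names and effect labels are illustrative only.
STYLES = {
    "reveal":       {"show_future": False, "highlight": None,         "uppercase": False, "max_words": None},
    "karaoke":      {"show_future": True,  "highlight": "grow",       "uppercase": False, "max_words": None},
    "hormozi":      {"show_future": True,  "highlight": "yellow_box", "uppercase": True,  "max_words": 2},
    "active_color": {"show_future": True,  "highlight": "green",      "uppercase": False, "max_words": None},
}


def visible_line(words, t, max_words):
    """Pick the group of words on screen at time t (whole line if max_words is unset)."""
    if not max_words:
        return list(words)
    groups = [words[i:i + max_words] for i in range(0, len(words), max_words)]
    for group in groups:
        if t < group[-1].end:
            return group
    return groups[-1] if groups else []


def render_states(words, t, style_name):
    """Return (text, effect) pairs for the words drawn at time t in the given style."""
    style = STYLES[style_name]
    states = []
    for w in visible_line(words, t, style["max_words"]):
        if t < w.start and not style["show_future"]:
            continue  # reveal-style: the word has not popped in yet
        text = w.text.upper() if style["uppercase"] else w.text
        effect = style["highlight"] if w.start <= t < w.end else None
        states.append((text, effect))
    return states


# Placeholder words, not the real transcript.
sample = [Word("one", 0.0, 0.4), Word("two", 0.4, 0.8),
          Word("three", 0.8, 1.2), Word("four", 1.2, 1.6)]
for name in STYLES:
    print(name, render_states(sample, 0.5, name))
```

The point of the sketch is that the timeline never changes between styles; only the preset passed to `render_states` does.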

How it works.
1. yt-dlp pulls the clip.
2. Whisper transcribes it to a word-by-word timeline with start/end times.
3. That timeline is a plain JSON file you can edit - fix a misheard word or nudge timing.
4. Our engine renders the captions as a transparent animated layer and ffmpeg composites it over the video.

The style is just a setting, so the same transcript can become any look - and a style is reusable across every future video.
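
Steps 3 and 4 can be sketched roughly as follows. This is an illustration under stated assumptions, not the engine's code: the timeline schema (a list of `{"word", "start", "end"}` objects), the helper names, and the file names are all hypothetical, and the caption layer is assumed to be a video file with an alpha channel that ffmpeg's `overlay` filter can blend.

```python
import json


def fix_word(timeline, index, new_text):
    """Return a copy of the timeline with one misheard word corrected."""
    edited = [dict(entry) for entry in timeline]
    edited[index]["word"] = new_text
    return edited


def nudge(timeline, index, delta):
    """Return a copy with one word shifted by delta seconds (never below 0)."""
    edited = [dict(entry) for entry in timeline]
    start = max(0.0, round(edited[index]["start"] + delta, 3))
    end = max(start, round(edited[index]["end"] + delta, 3))
    edited[index]["start"], edited[index]["end"] = start, end
    return edited


def composite_command(video_path, caption_layer_path, output_path):
    """Build an ffmpeg argument list that overlays the caption layer on the video."""
    return [
        "ffmpeg",
        "-i", video_path,            # input 0: original footage
        "-i", caption_layer_path,    # input 1: caption layer (assumed to carry alpha)
        "-filter_complex", "[0:v][1:v]overlay[v]",
        "-map", "[v]",               # composited video
        "-map", "0:a?",              # original audio, if the clip has any
        "-c:a", "copy",
        output_path,
    ]


# Placeholder timeline, not the real transcript.
timeline = [
    {"word": "helo", "start": 0.42, "end": 0.80},
    {"word": "world", "start": 0.80, "end": 1.20},
]
timeline = fix_word(timeline, 0, "hello")
timeline = nudge(timeline, 0, 0.05)
print(json.dumps(timeline, indent=2))
print(" ".join(composite_command("clip.mp4", "captions.mov", "out.mp4")))
```

The argument list can be passed to `subprocess.run(..., check=True)` once the caption layer exists. Both edit helpers return copies, so the original timeline stays available for re-rendering in another style.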

The test segment is 8 seconds long. Timing here comes from the Whisper "base" model; a tighter alignment tool (WhisperX or stable-ts) gets sub-100 ms word sync for true karaoke precision.
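
For reference, word-level timing of this kind can be obtained from the open-source `openai-whisper` package by passing `word_timestamps=True` to `transcribe`. The sketch below is one possible way to do it and may differ from how this page's pipeline calls Whisper; the output schema and the `flatten_words` and `transcribe_words` helpers are assumptions made for illustration.

```python
def flatten_words(result):
    """Flatten a Whisper-style result dict into a flat word timeline."""
    timeline = []
    for segment in result.get("segments", []):
        for w in segment.get("words", []):
            timeline.append({
                "word": w["word"].strip(),  # Whisper words usually carry a leading space
                "start": float(w["start"]),
                "end": float(w["end"]),
            })
    return timeline


def transcribe_words(audio_path, model_name="base"):
    """Transcribe a file and return its word timeline (needs openai-whisper installed)."""
    import whisper  # imported lazily so flatten_words works without the package

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path, word_timestamps=True)
    return flatten_words(result)


# Hand-written stand-in for a Whisper result, not real transcription output.
fake_result = {"segments": [{"words": [
    {"word": " one", "start": 0.0, "end": 0.4},
    {"word": " two", "start": 0.4, "end": 0.8},
]}]}
print(flatten_words(fake_result))
```

Swapping in a different aligner would only need a different function that produces the same flat list, so the rest of the pipeline would not have to change.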
