Transcribe any audio or video file with automatic speaker identification. WhisperX-powered diarization that runs locally — no cloud APIs, no data leaving your machine. Handles meetings, interviews, podcasts, and voice memos.
Also available in ai-voice-kit.
Point the skill at any audio or video file. It auto-detects the format, runs WhisperX with speaker diarization, and produces a clean transcript with speaker labels. It then asks whether you want to replace the generic labels with real names.
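The renaming step can be pictured with a small sketch. It assumes generic labels of the form `SPEAKER_00`, `SPEAKER_01`, and a bracketed transcript layout; both are illustrative assumptions, and `rename_speakers` is a hypothetical helper, not part of the skill or of WhisperX.

```python
import re

# Assumed label shape for this sketch: SPEAKER_00, SPEAKER_01, ...
SPEAKER_LABEL = re.compile(r"SPEAKER_\d+")


def rename_speakers(transcript: str, names: dict[str, str]) -> str:
    """Swap generic speaker labels for real names; unmapped labels are left unchanged."""
    return SPEAKER_LABEL.sub(lambda m: names.get(m.group(0), m.group(0)), transcript)


# Hypothetical transcript layout, one "[LABEL] text" entry per line.
transcript = (
    "[SPEAKER_00] Shall we start?\n"
    "[SPEAKER_01] Yes, go ahead.\n"
    "[SPEAKER_02] One moment."
)
renamed = rename_speakers(transcript, {"SPEAKER_00": "Alice", "SPEAKER_01": "Bob"})
print(renamed)
```

Leaving unmapped labels untouched means you can name only the speakers you recognize and keep the rest generic.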
WhisperX extracts the audio track from any supported container format. Point it at a video file and it just works — no need to extract audio manually first.
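For readers curious what that extraction amounts to, here is a rough sketch of the equivalent manual step using ffmpeg. It only builds the command; the helper name `extract_audio_command` and the `_16k.wav` output naming are assumptions made for illustration, and the skill may do this differently internally. The 16 kHz mono target reflects the input Whisper models expect.

```python
from pathlib import Path


def extract_audio_command(src: str) -> list[str]:
    """Build an ffmpeg command that writes a 16 kHz mono WAV next to the source file."""
    source = Path(src)
    # Hypothetical naming scheme: meeting.mp4 -> meeting_16k.wav
    target = source.with_name(source.stem + "_16k.wav")
    return [
        "ffmpeg",
        "-i", str(source),  # input container (mp4, mkv, webm, ...)
        "-vn",              # drop any video stream
        "-ac", "1",         # downmix to mono
        "-ar", "16000",     # resample to 16 kHz
        str(target),
    ]


command = extract_audio_command("call.mp4")
print(" ".join(command))
```

To actually run it, pass the list to `subprocess.run(command, check=True)` on a machine with ffmpeg installed.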
iPhone Voice Memos (m4a), Zoom recordings (mp4), podcast downloads (mp3), lossless archives (flac) — whatever you have, drop it in.
| Format | Type | Notes |
|---|---|---|
| m4a | Audio | iPhone recordings, Voice Memos |
| mp3 | Audio | Standard compressed audio |
| wav | Audio | Uncompressed, best quality |
| flac | Audio | Lossless compression |
| ogg | Audio | Open format |
| webm | Audio/Video | Web recordings |
| mp4 | Video | Standard video format |
| mkv | Video | Container format |