Transcribe audio with speaker labels.

Transcribe any audio or video file with automatic speaker identification. Diarization is powered by WhisperX and runs locally: no cloud APIs, no data leaving your machine. It handles meetings, interviews, podcasts, and voice memos.


Also available in ai-voice-kit.

/voice-to-text → transcript → /strategic-analysis → analysis + voice text → /text-to-voice → briefing.mp3

Drop in a file, get a transcript

Point the skill at any audio or video file. It auto-detects the format, runs WhisperX with speaker diarization, and produces a clean transcript with speaker labels. It then asks whether you want to replace the generic labels with real names.

  • Speaker diarization (who said what)
  • Supports m4a, mp3, wav, flac, ogg, webm, mp4, mkv, avi
  • Auto-checks ~/Downloads for recent recordings
  • Replace generic labels with real names
  • Output as txt, srt, vtt, or json
  • A 30-minute file transcribes in about 3-5 minutes on a Mac
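
The ~/Downloads check listed above could look roughly like the sketch below. It is an illustration only, not the skill's actual code: the function name `recent_recordings` and the 24-hour window are assumptions, and only the extension list comes from this page.

```python
import time
from pathlib import Path

# Extensions taken from the supported-formats bullet above.
SUPPORTED = {".m4a", ".mp3", ".wav", ".flac", ".ogg", ".webm", ".mp4", ".mkv", ".avi"}


def recent_recordings(folder, max_age_hours=24.0, now=None):
    """Return supported media files modified within the window, newest first.

    The 24-hour default is an assumption; the page only says "recent".
    """
    root = Path(folder).expanduser()
    if not root.is_dir():
        return []
    now = time.time() if now is None else now
    cutoff = now - max_age_hours * 3600
    candidates = []
    for path in root.iterdir():
        # Skip directories and anything that is not a supported media file.
        if not path.is_file() or path.suffix.lower() not in SUPPORTED:
            continue
        mtime = path.stat().st_mtime
        if mtime >= cutoff:
            candidates.append((mtime, path))
    # Newest file first, so the most likely recording is offered first.
    candidates.sort(key=lambda item: item[0], reverse=True)
    return [path for _, path in candidates]
```

Calling `recent_recordings("~/Downloads")` returns a list of `Path` objects; the first entry is the one a prompt like "Is this the right file?" would ask about.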

    > /voice-to-text
    Checking ~/Downloads for recent audio...
    Found: client-call-2026-02-27.m4a (42 min)
    Is this the right file?
    > yes
    Transcribing with WhisperX...
    Speakers: 2 detected
    Language: en
    Duration: 42:18

    SPEAKER_00: I wanted to walk you through our current document processing workflow...
    SPEAKER_01: Great, and how many documents are you processing per month roughly?

    Replace speaker labels with real names?
    > SPEAKER_00 = Aaron, SPEAKER_01 = Bruce
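
As a rough illustration of the label-replacement step and the srt output, here is a minimal sketch. The segment shape (a dict with `start`, `end`, `speaker`, and `text`) and all function names are assumptions made for this example, not the skill's internal format.

```python
def parse_mapping(reply):
    """Parse a reply such as 'SPEAKER_00 = Aaron, SPEAKER_01 = Bruce' into a dict."""
    mapping = {}
    for part in reply.split(","):
        label, sep, name = part.partition("=")
        # Ignore fragments that lack an '=' or have an empty side.
        if sep and label.strip() and name.strip():
            mapping[label.strip()] = name.strip()
    return mapping


def rename_speakers(segments, mapping):
    """Return new segment dicts; labels missing from the mapping stay unchanged."""
    return [
        {**seg, "speaker": mapping.get(seg["speaker"], seg["speaker"])}
        for seg in segments
    ]


def srt_timestamp(seconds):
    """Format seconds as the HH:MM:SS,mmm timestamp used by SRT files."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments):
    """Render segments as numbered SRT cues with the speaker name as a prefix."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        timing = f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}"
        cues.append(f"{index}\n{timing}\n{seg['speaker']}: {seg['text']}")
    return "\n\n".join(cues) + "\n"
```

Keeping the rename separate from the rendering means the same renamed segments could feed any of the listed output formats, not just srt.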

Audio and video, all handled

WhisperX extracts the audio track from any supported container format. Point it at a video file and it just works — no need to extract audio manually first.
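
For readers curious what such an extraction step involves, the sketch below builds an ffmpeg command that drops the video stream and writes mono 16 kHz audio. It is an assumption-laden illustration rather than the skill's or WhisperX's actual code: the function names, the `_16k.wav` output name, and the choice to call the ffmpeg CLI directly are all hypothetical, and running `extract_audio` requires ffmpeg to be installed.

```python
import subprocess
from pathlib import Path


def build_extract_command(source, target, sample_rate=16000):
    """Build an ffmpeg command that writes the audio track of `source` to `target`."""
    return [
        "ffmpeg",
        "-y",                     # overwrite the target if it already exists
        "-i", str(source),        # input container (mp4, mkv, webm, ...)
        "-vn",                    # drop any video stream
        "-ac", "1",               # downmix to mono
        "-ar", str(sample_rate),  # resample; 16 kHz is an assumed default
        str(target),
    ]


def extract_audio(source, target=None):
    """Run ffmpeg and return the path of the extracted WAV file."""
    source = Path(source)
    # Hypothetical naming scheme; avoids writing over a .wav input.
    target = Path(target) if target else source.with_name(source.stem + "_16k.wav")
    subprocess.run(build_extract_command(source, target), check=True)
    return target
```

Separating command construction from execution keeps the flags easy to inspect without needing ffmpeg or a media file on hand.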

iPhone Voice Memos (m4a), Zoom recordings (mp4), podcast downloads (mp3), lossless archives (flac) — whatever you have, drop it in.

Format   Type          Notes
m4a      Audio         iPhone recordings, Voice Memos
mp3      Audio         Standard compressed audio
wav      Audio         Uncompressed, best quality
flac     Audio         Lossless compression
ogg      Audio         Open format
webm     Audio/Video   Web recordings
mp4      Video         Standard video format
mkv      Video         Container format