Transcribe audio with speaker labels.

Transcribe any audio or video file with automatic speaker identification. WhisperX-powered diarization that runs locally — no cloud APIs, no data leaving your machine. Handles meetings, interviews, podcasts, and voice memos.


Also available in ai-voice-kit.

How It Works

Drop in a file, get a transcript

Point the skill at any audio or video file. It auto-detects the format, runs WhisperX with speaker diarization, and produces a clean transcript with speaker labels. It then asks whether you want to replace the generic labels with real names.

Speaker diarization (who said what)
Supports m4a, mp3, wav, flac, ogg, webm, mp4, mkv, avi
Auto-checks ~/Downloads for recent recordings
Replace generic labels with real names
Output as txt, srt, vtt, or json
30-min file transcribes in ~3-5 minutes on Mac
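
As an illustration of the srt output option listed above, here is a minimal sketch of how diarized segments could be rendered as SubRip text. The `Segment` type, its field names, and the helper functions are hypothetical, not the skill's actual internals, and the timings in the usage lines are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical shape for one diarized segment (not the skill's real type)."""
    start: float   # start time in seconds
    end: float     # end time in seconds
    speaker: str   # e.g. "SPEAKER_00" or a real name
    text: str


def srt_timestamp(seconds: float) -> str:
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: list[Segment]) -> str:
    """Render segments as numbered SRT cues, prefixing each line with its speaker."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}\n"
            f"{seg.speaker}: {seg.text}"
        )
    return "\n\n".join(cues) + "\n"


# Usage with made-up timings:
example = [
    Segment(0.0, 4.2, "SPEAKER_00", "I wanted to walk you through our workflow."),
    Segment(4.2, 8.0, "SPEAKER_01", "Great, and how many documents per month?"),
]
print(to_srt(example))
```

Putting the speaker name inside the cue text keeps the file valid for ordinary subtitle players, which have no separate speaker field.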

> /voice-to-text

Checking ~/Downloads for recent audio...
Found: client-call-2026-02-27.m4a (42 min)
Is this the right file?

> yes

Transcribing with WhisperX...
  Speakers: 2 detected
  Language: en
  Duration: 42:18

SPEAKER_00: I wanted to walk you through
our current document processing workflow...

SPEAKER_01: Great, and how many documents
are you processing per month roughly?

Replace speaker labels with real names?

> SPEAKER_00 = Aaron, SPEAKER_01 = Bruce
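
The renaming step at the end of the session above could be sketched as follows. This is an assumption about how a reply such as `SPEAKER_00 = Aaron, SPEAKER_01 = Bruce` might be parsed and applied. The function names `parse_mapping` and `rename_speakers` are hypothetical and not part of the skill or of WhisperX.

```python
import re


def parse_mapping(reply: str) -> dict[str, str]:
    """Parse 'SPEAKER_00 = Aaron, SPEAKER_01 = Bruce' into a label-to-name dict."""
    mapping = {}
    for pair in reply.split(","):
        if "=" not in pair:
            continue  # ignore fragments that are not 'label = name'
        label, name = pair.split("=", 1)
        label, name = label.strip(), name.strip()
        if label and name:
            mapping[label] = name
    return mapping


def rename_speakers(transcript: str, mapping: dict[str, str]) -> str:
    """Replace generic labels at the start of a line (just before the colon)."""
    if not mapping:
        return transcript
    labels = "|".join(re.escape(label) for label in mapping)
    # Only match a label that opens a line and is followed by ':', so a label
    # mentioned inside spoken text is left alone.
    pattern = re.compile(rf"^({labels})(?=:)", re.MULTILINE)
    return pattern.sub(lambda m: mapping[m.group(1)], transcript)


transcript = (
    "SPEAKER_00: I wanted to walk you through\n"
    "our current document processing workflow...\n"
    "\n"
    "SPEAKER_01: Great, and how many documents\n"
    "are you processing per month roughly?\n"
)
names = parse_mapping("SPEAKER_00 = Aaron, SPEAKER_01 = Bruce")
print(rename_speakers(transcript, names))
```

Anchoring the match to the line start is a deliberate choice here: a plain string replace would also rewrite any label that happens to appear inside the spoken text.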
      

Supported Formats

Audio and video, all handled

WhisperX extracts the audio track from any supported container format. Point it at a video file and it just works — no need to extract audio manually first.

iPhone Voice Memos (m4a), Zoom recordings (mp4), podcast downloads (mp3), lossless archives (flac) — whatever you have, drop it in.

Format	Type	Notes
m4a	Audio	iPhone recordings, Voice Memos
mp3	Audio	Standard compressed audio
wav	Audio	Uncompressed, best quality
flac	Audio	Lossless compression
ogg	Audio	Open format
webm	Audio/Video	Web recordings
mp4	Video	Standard video format
mkv	Video	Container format
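
The table above can be read as a simple lookup from file extension to media type. The sketch below mirrors only the rows shown in the table. The names `SUPPORTED_FORMATS` and `classify` are hypothetical, and the skill's real detection logic may work differently (for example by inspecting file contents instead of the extension).

```python
from pathlib import Path
from typing import Optional

# Extension-to-type lookup copied from the table above (illustrative only).
SUPPORTED_FORMATS = {
    ".m4a": "audio",
    ".mp3": "audio",
    ".wav": "audio",
    ".flac": "audio",
    ".ogg": "audio",
    ".webm": "audio/video",
    ".mp4": "video",
    ".mkv": "video",
}


def classify(path: str) -> Optional[str]:
    """Return the media type for a file name, or None if the extension is not listed."""
    return SUPPORTED_FORMATS.get(Path(path).suffix.lower())


print(classify("client-call-2026-02-27.m4a"))  # audio
print(classify("Zoom-recording.MP4"))          # video
print(classify("notes.txt"))                   # None
```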
