Text to natural speech in seconds.

Convert any text or markdown file to an MP3. Choose from three local TTS engines: Kokoro for speed (50 voices, audio in seconds), Orpheus for natural prosody (emotion tags like <laugh>), and Coqui XTTS v2 for voice cloning from any 6-second sample. Everything runs locally, with no cloud APIs.


Also available in ai-voice-kit.

/voice-to-text → transcript → /strategic-analysis → analysis + voice text → /text-to-voice → briefing.mp3

Point it at a file, get an MP3

The skill handles everything: markdown cleanup, acronym expansion, voice selection, WAV generation, MP3 conversion, and optional Dropbox copy. Works standalone or as the final step in the voice chain.

  • Auto-converts markdown to speech-friendly text
  • 50 voices (American & British English)
  • Three engines: Kokoro, Orpheus, Coqui XTTS v2
  • Generates MP3 via ffmpeg
  • Copies to Dropbox for mobile listening
  • Works standalone or chained after /strategic-analysis

> /text-to-voice meeting-notes.md
Reading meeting-notes.md...
  847 words, ~4 min estimated
Converting markdown to speech text...
  Stripping formatting
  Expanding acronyms (AI → A.I.)
  Adding spoken transitions
Generating with Kokoro (voice: bf_lily)...
Output:
  └── meeting-notes.mp3 (4m 12s, 3.8 MB)
  └── Engine: Kokoro, Voice: bf_lily
Dropbox:
  └── Voice Files Work/meeting-notes.mp3
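
To make the "markdown cleanup" and "acronym expansion" steps more concrete, here is a minimal Python sketch. It is not the skill's actual implementation: the function name markdown_to_speech_text and the ACRONYMS table are hypothetical, and only the AI → A.I. mapping comes from the sample run above. Spoken transitions are not covered.

```python
import re

# Hypothetical acronym table; the sample run only shows AI -> A.I.
ACRONYMS = {"AI": "A.I."}


def markdown_to_speech_text(markdown: str) -> str:
    """Strip common markdown syntax and expand acronyms for a TTS engine."""
    # Drop fenced code blocks entirely; they rarely read well aloud.
    text = re.sub(r"```.*?```", "", markdown, flags=re.DOTALL)
    # Inline code keeps its content but loses the backticks.
    text = re.sub(r"`([^`]*)`", r"\1", text)
    # Images become their alt text, links become their label.
    text = re.sub(r"!\[([^\]]*)\]\([^)]*\)", r"\1", text)
    text = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", text)
    # Remove heading markers and list markers at the start of a line.
    text = re.sub(r"^[ \t]{0,3}#{1,6}[ \t]+", "", text, flags=re.MULTILINE)
    text = re.sub(r"^[ \t]*(?:[-*+]|\d+\.)[ \t]+", "", text, flags=re.MULTILINE)
    # Unwrap **bold** and *italic*; underscores are left alone so that
    # identifiers such as snake_case names survive.
    text = re.sub(r"(\*\*|\*)(.+?)\1", r"\2", text)
    # Expand whole-word acronyms only.
    for short, spoken in ACRONYMS.items():
        text = re.sub(rf"\b{re.escape(short)}\b", spoken, text)
    # Collapse runs of blank lines left behind by the removals.
    return re.sub(r"\n{3,}", "\n\n", text).strip()


if __name__ == "__main__":
    print(markdown_to_speech_text("# Notes\n\n- **AI** is [here](http://x.y)\n"))
```

A regex pass like this is only an approximation of markdown parsing; nested or unusual syntax would need a real parser.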

Pick the right tool

Kokoro is the default and handles 95% of use cases — it generates audio in seconds with 50 built-in voices. Orpheus adds emotion tags like <laugh> for natural prosody. Coqui XTTS v2 clones any voice from a 6-second WAV sample.

The skill asks which engine you want. If you don't specify, it defaults to Kokoro.

Engine         Speed            Voices                   Best for
Kokoro         Seconds          50 built-in              Daily use, bulk generation
Orpheus        ~3.5x real-time  8 voices + emotion tags  Natural prosody, technical terms
Coqui XTTS v2  ~1x real-time    Clone any voice          Voice cloning from 6-sec WAV sample
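
The default-to-Kokoro behaviour can be pictured with a small Python sketch. The table values are copied from the comparison above; the lookup keys ("kokoro", "orpheus", "coqui"), the Engine class, and pick_engine are hypothetical names for illustration, not part of the skill.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Engine:
    name: str
    speed: str
    voices: str
    best_for: str


# Values taken from the comparison table; the dictionary keys are assumptions.
ENGINES = {
    "kokoro": Engine("Kokoro", "Seconds", "50 built-in",
                     "Daily use, bulk generation"),
    "orpheus": Engine("Orpheus", "~3.5x real-time", "8 voices + emotion tags",
                      "Natural prosody, technical terms"),
    "coqui": Engine("Coqui XTTS v2", "~1x real-time", "Clone any voice",
                    "Voice cloning from 6-sec WAV sample"),
}
DEFAULT_ENGINE = "kokoro"


def pick_engine(choice: Optional[str] = None) -> Engine:
    """Return the requested engine, falling back to Kokoro when none is given."""
    key = (choice or "").strip().lower() or DEFAULT_ENGINE
    if key not in ENGINES:
        raise ValueError(f"Unknown engine {choice!r}; expected one of {sorted(ENGINES)}")
    return ENGINES[key]


if __name__ == "__main__":
    print(pick_engine().name)           # no choice given, so the default applies
    print(pick_engine("Orpheus").name)  # lookup is case-insensitive
```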

50 voices, two accents

Kokoro ships with 50 built-in voices across American and British English. Each voice has a distinct character, from warm and conversational to clear and authoritative. The skill defaults to bf_lily (British female), but you can pick any voice when it asks.

Voice names follow a pattern: the first letter is accent (a = American, b = British), the second is gender (f = female, m = male), then the name.

bf_lily      British female (default)
af_heart     American female
am_adam      American male
bm_george    British male
af_bella     American female
af_sarah     American female
am_michael   American male
bf_emma      British female

Run kokoro-tts --help-voices for the full list of 50 voices.
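
The naming pattern is simple enough to decode mechanically. The Python sketch below splits a voice ID into accent, gender, and name; parse_voice and the Voice tuple are hypothetical helpers written for illustration, and they cover only the two accent codes and two gender codes described here.

```python
from typing import NamedTuple

ACCENTS = {"a": "American", "b": "British"}
GENDERS = {"f": "female", "m": "male"}


class Voice(NamedTuple):
    accent: str
    gender: str
    name: str


def parse_voice(voice_id: str) -> Voice:
    """Decode an ID such as 'bf_lily' into (accent, gender, name)."""
    prefix, sep, name = voice_id.partition("_")
    if len(prefix) != 2 or not sep or not name:
        raise ValueError(f"Expected '<accent><gender>_<name>', got {voice_id!r}")
    accent_code, gender_code = prefix
    if accent_code not in ACCENTS or gender_code not in GENDERS:
        raise ValueError(f"Unknown accent or gender code in {voice_id!r}")
    return Voice(ACCENTS[accent_code], GENDERS[gender_code], name)


if __name__ == "__main__":
    for voice_id in ("bf_lily", "am_adam", "bm_george"):
        print(voice_id, parse_voice(voice_id))
```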