Convert any text or markdown file to an MP3. Three local TTS engines: Kokoro for speed (50 voices, generates in seconds), Orpheus for natural prosody (emotion tags like <laugh>), and Coqui XTTS v2 for voice cloning from any 6-second sample. No cloud APIs.
Also available in ai-voice-kit.
The skill handles everything: markdown cleanup, acronym expansion, voice selection, WAV generation, MP3 conversion, and optional Dropbox copy. Works standalone or as the final step in the voice chain.
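As a rough illustration of the text-preparation and conversion steps listed above, here is a minimal Python sketch. It is not the skill's actual code: the function names, the `ACRONYMS` table, and the use of `ffmpeg` with `libmp3lame` for the WAV-to-MP3 step are all assumptions made for this example.

```python
import re

# Hypothetical acronym table; the skill's real expansion list is not documented here.
ACRONYMS = {
    "API": "A P I",
    "TTS": "text to speech",
    "CLI": "command line interface",
}

FENCE = "`" * 3  # markdown code-fence marker, built here to keep the regex readable


def clean_markdown(text: str) -> str:
    """Strip common markdown syntax so a TTS engine does not read it aloud."""
    fence = re.escape(FENCE)
    text = re.sub(fence + r".*?" + fence, "", text, flags=re.DOTALL)  # drop fenced code
    text = re.sub(r"!\[([^\]]*)\]\([^)]*\)", r"\1", text)  # images -> alt text
    text = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", text)   # links -> link text
    text = re.sub(r"^[ \t]{0,3}#{1,6}[ \t]*", "", text, flags=re.MULTILINE)  # heading marks
    text = re.sub(r"^[ \t]*[-*+][ \t]+", "", text, flags=re.MULTILINE)       # bullet marks
    text = re.sub(r"[*`]", "", text)  # emphasis asterisks and inline-code backticks
    return re.sub(r"\n{3,}", "\n\n", text).strip()


def expand_acronyms(text: str, table: dict = ACRONYMS) -> str:
    """Replace whole-word acronyms with a spoken form taken from the table."""
    if not table:
        return text
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, table)) + r")\b")
    return pattern.sub(lambda m: table[m.group(1)], text)


def mp3_command(wav_path: str, mp3_path: str) -> list:
    """Build an ffmpeg command line for WAV -> MP3 (assumes ffmpeg with libmp3lame)."""
    return ["ffmpeg", "-y", "-i", wav_path,
            "-codec:a", "libmp3lame", "-qscale:a", "2", mp3_path]
```

In this sketch the prepared text would be `expand_acronyms(clean_markdown(raw))`, passed to whichever engine is selected, and the resulting WAV would be converted by running the list from `mp3_command` with `subprocess.run`.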
Kokoro is the default and handles 95% of use cases: it generates audio in seconds with 50 built-in voices. Orpheus adds emotion tags like <laugh> for natural prosody. Coqui XTTS v2 clones any voice from a 6-second WAV sample.
The skill asks which engine you want. If you don't specify, it defaults to Kokoro.
| Engine | Speed | Voices | Best For |
|---|---|---|---|
| Kokoro | Seconds | 50 built-in | Daily use, bulk generation |
| Orpheus | ~3.5x real-time | 8 voices + emotion tags | Natural prosody, technical terms |
| Coqui XTTS v2 | ~1x real-time | Clone any voice | Voice cloning from 6-sec WAV sample |
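The comparison above, together with the default-to-Kokoro rule, could be modelled roughly as follows. This is a sketch only: the keys `kokoro`, `orpheus`, and `xtts`, the `Engine` class, and `resolve_engine` are hypothetical names chosen for the example, not identifiers from the skill.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Engine:
    name: str
    speed: str
    voices: str
    best_for: str


# Values copied from the comparison table; the dictionary keys are hypothetical.
ENGINES = {
    "kokoro": Engine("Kokoro", "Seconds", "50 built-in",
                     "Daily use, bulk generation"),
    "orpheus": Engine("Orpheus", "~3.5x real-time", "8 voices + emotion tags",
                      "Natural prosody, technical terms"),
    "xtts": Engine("Coqui XTTS v2", "~1x real-time", "Clone any voice",
                   "Voice cloning from 6-sec WAV sample"),
}

DEFAULT_ENGINE = "kokoro"


def resolve_engine(choice: Optional[str] = None) -> Engine:
    """Return the requested engine, falling back to Kokoro when none is given."""
    if choice is None or not choice.strip():
        return ENGINES[DEFAULT_ENGINE]
    key = choice.strip().lower()
    if key not in ENGINES:
        raise ValueError(f"unknown engine {choice!r}; expected one of {sorted(ENGINES)}")
    return ENGINES[key]
```

Calling `resolve_engine()` with no argument returns the Kokoro entry, mirroring the default described above. An unrecognised name raises an error rather than silently falling back.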
Kokoro ships with 50 built-in voices across American and British English. Each voice has a distinct character, ranging from warm and conversational to clear and authoritative. The skill defaults to bf_lily (British female), but you can pick any voice when it asks.
Voice names follow a pattern: the first letter is accent (a = American, b = British), the second is gender (f = female, m = male), then the name.
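The naming pattern can be decoded mechanically. The sketch below assumes the two-letter prefix is separated from the name by an underscore, as in bf_lily. The `Voice` type and the `parse_voice` function are hypothetical helpers, not part of Kokoro.

```python
from typing import NamedTuple

ACCENTS = {"a": "American", "b": "British"}
GENDERS = {"f": "female", "m": "male"}


class Voice(NamedTuple):
    accent: str
    gender: str
    name: str


def parse_voice(voice_id: str) -> Voice:
    """Split an id such as 'bf_lily' into accent, gender, and name."""
    prefix, sep, name = voice_id.partition("_")
    if not sep or len(prefix) != 2 or not name:
        raise ValueError(f"unexpected voice id format: {voice_id!r}")
    accent_code, gender_code = prefix
    if accent_code not in ACCENTS or gender_code not in GENDERS:
        raise ValueError(f"unknown accent or gender code in {voice_id!r}")
    return Voice(ACCENTS[accent_code], GENDERS[gender_code], name)
```

For example, `parse_voice("bf_lily")` yields a British female voice named `lily`.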
Run kokoro-tts --help-voices for the full list of 50 voices.