Text to natural speech in seconds.

Convert any text or Markdown file to an MP3 with one of three local TTS engines: Kokoro for speed (50 voices, generates in seconds), Orpheus for natural prosody (emotion tags like <laugh>), and Coqui XTTS v2 for voice cloning from any 6-second sample. No cloud APIs.


Also available in ai-voice-kit.

How It Works

Point it at a file, get an MP3

The skill handles everything: markdown cleanup, acronym expansion, voice selection, WAV generation, MP3 conversion, and optional Dropbox copy. Works standalone or as the final step in the voice chain.

Auto-converts markdown to speech-friendly text
50 voices (American & British English)
Three engines: Kokoro, Orpheus, Coqui XTTS v2
Generates MP3 via ffmpeg
Copies to Dropbox for mobile listening
Works standalone or chained after /strategic-analysis
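As a rough illustration of the markdown-to-speech step, here is a minimal Python sketch. It is not the skill's actual code: the function names, the regex rules, and the one-entry acronym table (taken from the AI → A.I. example) are all assumptions, and the 200 words-per-minute figure is only a guess that happens to fit "847 words, ~4 min".

```python
import re

# Hypothetical acronym table: the page only shows the AI -> A.I. example.
ACRONYMS = {"AI": "A.I."}


def strip_markdown(text: str) -> str:
    """Remove common Markdown syntax, keeping the readable words."""
    text = re.sub(r"```.*?```", "", text, flags=re.DOTALL)           # fenced code blocks
    text = re.sub(r"!\[([^\]]*)\]\([^)]*\)", r"\1", text)             # images -> alt text
    text = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", text)              # links -> link text
    text = re.sub(r"^[ \t]{0,3}#{1,6}[ \t]*", "", text, flags=re.M)   # heading markers
    text = re.sub(r"^[ \t]*[-*+][ \t]+", "", text, flags=re.M)        # bullet markers
    text = re.sub(r"[*`]", "", text)                                  # emphasis and inline-code marks
    return text.strip()


def expand_acronyms(text: str, table=ACRONYMS) -> str:
    """Replace whole-word acronyms with a spelled-out form the TTS engine reads letter by letter."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, table)) + r")\b")
    return pattern.sub(lambda m: table[m.group(1)], text)


def estimate_minutes(text: str, wpm: int = 200) -> float:
    """Estimate listening time; the 200 wpm default is an assumption."""
    return len(text.split()) / wpm


def to_speech_text(markdown: str) -> str:
    """Run the cleanup steps in order: strip formatting, then expand acronyms."""
    return expand_acronyms(strip_markdown(markdown))
```

Calling to_speech_text on a file's contents gives plain text ready to hand to an engine. A real cleanup pass would need more rules (tables, HTML, numbered lists) and a longer acronym table.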

> /text-to-voice meeting-notes.md

Reading meeting-notes.md...
847 words, ~4 min estimated

Converting markdown to speech text...
  Stripping formatting
  Expanding acronyms (AI → A.I.)
  Adding spoken transitions

Generating with Kokoro (voice: bf_lily)...

Output:
└── meeting-notes.mp3  (4m 12s, 3.8 MB)
└── Engine: Kokoro, Voice: bf_lily

Dropbox:
└── Voice Files Work/meeting-notes.mp3
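The WAV-to-MP3 step in a run like this could be sketched as follows. This is an assumption about how the conversion might be wired up, not the skill's own code: the helper names are hypothetical, and the ffmpeg flags shown (the libmp3lame encoder at VBR quality 2) are just one reasonable choice.

```python
import subprocess
from pathlib import Path
from typing import List


def mp3_path_for(source: Path) -> Path:
    """Name the output after the input file, e.g. meeting-notes.md -> meeting-notes.mp3."""
    return source.with_suffix(".mp3")


def ffmpeg_command(wav: Path, mp3: Path) -> List[str]:
    """Build an ffmpeg invocation that encodes a WAV file as MP3."""
    return [
        "ffmpeg",
        "-y",                      # overwrite the output file if it already exists
        "-i", str(wav),            # input WAV produced by the TTS engine
        "-codec:a", "libmp3lame",  # MP3 encoder
        "-qscale:a", "2",          # VBR quality level (assumed setting)
        str(mp3),
    ]


def wav_to_mp3(wav: Path, mp3: Path) -> None:
    """Run ffmpeg and raise CalledProcessError if the conversion fails."""
    subprocess.run(ffmpeg_command(wav, mp3), check=True)
```

Building the command as a list rather than a shell string avoids quoting problems with file names that contain spaces.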
      

Three Engines

Pick the right tool

Kokoro is the default and handles 95% of use cases — it generates audio in seconds with 50 built-in voices. Orpheus adds emotion tags like <laugh> for natural prosody. Coqui XTTS v2 clones any voice from a 6-second WAV sample.

The skill asks which engine you want. If you don't specify, it defaults to Kokoro.

Engine	Speed	Voices	Best For
Kokoro	Seconds	50 built-in	Daily use, bulk generation
Orpheus	~3.5x real-time	8 voices + emotion tags	Natural prosody, technical terms
Coqui XTTS v2	~1x real-time	Clone any voice	Voice cloning from 6-sec WAV sample
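The default-to-Kokoro behavior and the table above can be mirrored in a small Python sketch. The engine identifiers and both function names are hypothetical. The suggest_engine heuristic is an illustration of how the "Best For" column might guide a choice, not something the skill is documented to do.

```python
import re
from typing import Optional

# Hypothetical identifiers for the three engines, with their "Best For" notes.
ENGINES = {
    "kokoro": "Daily use, bulk generation",
    "orpheus": "Natural prosody, technical terms",
    "coqui-xtts-v2": "Voice cloning from 6-sec WAV sample",
}
DEFAULT_ENGINE = "kokoro"

# Matches simple lowercase tags such as <laugh>; it would also match HTML tags like <b>.
EMOTION_TAG = re.compile(r"<[a-z]+>")


def resolve_engine(requested: Optional[str] = None) -> str:
    """Return the requested engine, or Kokoro when none is specified."""
    if requested is None:
        return DEFAULT_ENGINE
    key = requested.strip().lower()
    if key not in ENGINES:
        raise ValueError(f"unknown engine: {requested!r}")
    return key


def suggest_engine(text: str, clone_sample: Optional[str] = None) -> str:
    """Suggest an engine from the input (assumed heuristic, for illustration only)."""
    if clone_sample is not None:
        return "coqui-xtts-v2"   # cloning needs a reference WAV sample
    if EMOTION_TAG.search(text):
        return "orpheus"         # emotion tags are an Orpheus feature
    return DEFAULT_ENGINE
```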

Voice Selection

50 voices, two accents

Kokoro ships with 50 built-in voices across American and British English. Each voice has a distinct character — from warm and conversational to clear and authoritative. The skill defaults to bf_lily (British female) but you can pick any voice when it asks.

Voice names follow a pattern: the first letter is accent (a = American, b = British), the second is gender (f = female, m = male), then the name.

Voice	Description
bf_lily	British female (default)
af_heart	American female
am_adam	American male
bm_george	British male
af_bella	American female
af_sarah	American female
am_michael	American male
bf_emma	British female

Run kokoro-tts --help-voices for the full list of 50 voices.
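The naming pattern can be decoded mechanically. The Python sketch below is illustrative only: the Voice class and parse_voice function are hypothetical names, and it handles just the two accents and two genders described on this page.

```python
from dataclasses import dataclass

ACCENTS = {"a": "American", "b": "British"}
GENDERS = {"f": "female", "m": "male"}


@dataclass(frozen=True)
class Voice:
    accent: str
    gender: str
    name: str


def parse_voice(voice_id: str) -> Voice:
    """Split an ID like 'bf_lily' into accent, gender, and name."""
    prefix, sep, name = voice_id.partition("_")
    if len(prefix) != 2 or not sep or not name:
        raise ValueError(f"not a voice ID: {voice_id!r}")
    try:
        accent = ACCENTS[prefix[0]]   # first letter: accent
        gender = GENDERS[prefix[1]]   # second letter: gender
    except KeyError:
        raise ValueError(f"unknown accent or gender code in {voice_id!r}") from None
    return Voice(accent, gender, name)
```

For example, parse_voice("bf_lily") yields a Voice with accent "British", gender "female", and name "lily".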
