AI Voice Kit — Local Text-to-Speech & Transcription for Claude Code

Three Skills, One Chain

      /voice-to-text → transcript
      → /strategic-analysis → analysis + voice text
      → /text-to-voice → briefing.mp3
    

/voice-to-text

Transcribe any audio or video file with automatic speaker identification. Handles meetings, interviews, podcasts, voice memos — any recording with one or more speakers.

Speaker diarization (who said what)
Supports m4a, mp3, wav, mp4, mkv, and more
Replace generic labels with real names
Output as txt, srt, vtt, or json
A 30-minute file transcribes in roughly 3 to 5 minutes
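
Replacing generic labels with real names amounts to a search-and-replace over the transcript. A minimal sketch, assuming the transcript marks speakers with placeholders such as `SPEAKER_00` (a hypothetical label format; the skill's actual labels may differ). `rename_speakers` is an illustrative helper, not part of the kit:

```shell
# Hypothetical helper: swap generic speaker labels for real names.
# Assumes labels such as SPEAKER_00 and names without "/" or "&".
rename_speakers() {
  local file="$1" script="" pair
  shift
  for pair in "$@"; do
    # "SPEAKER_00=Alice" becomes the sed command s/SPEAKER_00/Alice/g
    script="${script}s/${pair%%=*}/${pair#*=}/g;"
  done
  # Print the renamed transcript to stdout; the input file is left untouched.
  sed "$script" "$file"
}
```

Usage would look like `rename_speakers client-call.txt SPEAKER_00=Alice SPEAKER_01=Bob > client-call-named.txt`.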

/strategic-analysis

The bridge between transcription and audio. Reads your transcript, gathers full project context (CLAUDE.md, docs, git history), and produces a strategic analysis with a TTS-ready voice version.

4 modes: sales, consulting, competitive, debrief
Project-aware (reads CLAUDE.md, docs, git log)
Outputs analysis.md + analysis-voice.txt
Voice text written for natural spoken delivery
Works standalone or chained with the other skills
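
To make "project-aware" concrete, the context gathering might look roughly like the sketch below. It is a guess at the kind of inputs involved, not the skill's actual code; `gather_context` and the `docs/` folder layout are assumptions.

```shell
# Hypothetical sketch: collect the project context named in the list above.
gather_context() {
  local root="${1:-.}"
  if [ -f "$root/CLAUDE.md" ]; then
    echo "== CLAUDE.md =="
    cat "$root/CLAUDE.md"
  fi
  if [ -d "$root/docs" ]; then
    echo "== docs =="
    # List markdown docs only; reading them is left to the caller.
    find "$root/docs" -type f -name '*.md' | sort
  fi
  # Only query git when the folder is actually a repository.
  if git -C "$root" rev-parse --git-dir >/dev/null 2>&1; then
    echo "== recent commits =="
    git -C "$root" log --oneline -n 20
  fi
  return 0
}
```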

/text-to-voice

Convert any text or markdown file to an MP3. The skill handles markdown cleanup, acronym expansion, voice selection, and MP3 encoding automatically.

50 voices (American & British English)
Generates audio in seconds (Kokoro engine)
Auto-converts markdown to speech-friendly text
Outputs MP3 via ffmpeg
Optional: Orpheus (emotion tags) or Coqui (voice cloning)
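
To give a feel for the markdown cleanup step, here is a rough sketch using `sed`. It is not the skill's actual implementation: `md_to_speech` is a hypothetical name, and the rules cover only headings, bullet markers, links, bold text, and inline code.

```shell
# Hypothetical sketch: strip common markdown syntax so text reads well aloud.
md_to_speech() {
  sed -E \
    -e 's/^#{1,6}[[:space:]]+//' \
    -e 's/^[[:space:]]*[-*+][[:space:]]+//' \
    -e 's/\[([^]]*)\]\([^)]*\)/\1/g' \
    -e 's/(\*\*|__)([^*_]+)(\*\*|__)/\2/g' \
    -e 's/`([^`]*)`/\1/g' \
    "$@"
}
# Rules, in order: heading markers, bullet markers, [text](url) -> text,
# **bold** or __bold__ -> bold, `code` -> code.
```

It reads files given as arguments or, with none, standard input: `md_to_speech meeting-notes.md > meeting-notes-voice.txt`.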

# Each skill works standalone too
> /text-to-voice any-text-file.txt
> /voice-to-text any-recording.m4a
> /strategic-analysis any-document.md
        

Setup Guide

Install the skills

Clone the repo and copy the skill folders into your Claude Code config directory. That's all the "installation" there is for the skills themselves.

git clone https://github.com/pengasuzie/ai-voice-kit.git

# Create the skills directory first in case it does not exist yet
mkdir -p ~/.claude/skills

cp -r ai-voice-kit/skills/text-to-voice ~/.claude/skills/
cp -r ai-voice-kit/skills/voice-to-text ~/.claude/skills/
cp -r ai-voice-kit/skills/strategic-analysis ~/.claude/skills/
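
A quick way to confirm the copy worked is to check that all three folders landed where expected. A small sketch; `check_skills` is a hypothetical helper, and it only checks that the folders exist, not their contents:

```shell
# Hypothetical helper: report which of the three skill folders are installed.
check_skills() {
  local dir="${1:-$HOME/.claude/skills}"
  local missing=0 skill
  for skill in text-to-voice voice-to-text strategic-analysis; do
    if [ -d "$dir/$skill" ]; then
      echo "ok       $skill"
    else
      echo "missing  $skill"
      missing=1
    fi
  done
  # Exit status 0 only when every skill folder is present.
  return "$missing"
}
```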
        

Install Kokoro TTS

Kokoro is the default engine. Install the CLI with pipx, then download the model files (~350 MB). This is a one-time setup.

pipx install kokoro-tts

mkdir -p ~/.local/share/kokoro

curl -L -o ~/.local/share/kokoro/kokoro-v1.0.onnx \
  https://github.com/thewh1teagle/kokoro-onnx/releases/download/model-files-v1.0/kokoro-v1.0.onnx

curl -L -o ~/.local/share/kokoro/voices-v1.0.bin \
  https://github.com/thewh1teagle/kokoro-onnx/releases/download/model-files-v1.0/voices-v1.0.bin
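
Because the model files are large, an interrupted download can leave a truncated file behind. The sketch below wraps the same two downloads so that a file only appears under its final name once `curl` has finished. `fetch_kokoro_models` and the `KOKORO_DIR` variable are hypothetical names introduced for this example.

```shell
# Hypothetical wrapper around the two downloads above.
KOKORO_DIR="${KOKORO_DIR:-$HOME/.local/share/kokoro}"
KOKORO_BASE_URL="https://github.com/thewh1teagle/kokoro-onnx/releases/download/model-files-v1.0"

fetch_kokoro_models() {
  local f dest
  mkdir -p "$KOKORO_DIR" || return 1
  for f in kokoro-v1.0.onnx voices-v1.0.bin; do
    dest="$KOKORO_DIR/$f"
    if [ -s "$dest" ]; then
      echo "already present: $dest"
      continue
    fi
    # -f: fail on HTTP errors instead of saving an error page
    # -L: follow the redirect to the release asset
    # Download to a .part file and rename only on success.
    curl -fL -o "$dest.part" "$KOKORO_BASE_URL/$f" && mv "$dest.part" "$dest" || return 1
  done
}
```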
        

Install ffmpeg

Required for converting the WAV output to MP3. It is a single command on macOS and on Ubuntu or Debian.

# macOS
brew install ffmpeg

# Ubuntu / Debian
sudo apt install ffmpeg
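
For reference, a WAV-to-MP3 conversion with ffmpeg typically looks like the sketch below. The skill runs its own conversion; `wav_to_mp3` is a hypothetical helper, and the quality setting is an assumption rather than the skill's actual choice.

```shell
# Hypothetical helper: convert a WAV file to MP3 next to the original.
wav_to_mp3() {
  local wav="$1"
  local mp3="${2:-${wav%.wav}.mp3}"
  # -y overwrites an existing output file.
  # libmp3lame is ffmpeg's MP3 encoder; -qscale:a 2 selects high-quality VBR.
  ffmpeg -y -i "$wav" -codec:a libmp3lame -qscale:a 2 "$mp3"
}
```

Calling `wav_to_mp3 briefing.wav` would write `briefing.mp3` in the same folder.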
        

Use it

Open Claude Code and type the slash command. That's it. The skill handles engine selection, voice picking, format conversion, and output reporting.

> /text-to-voice meeting-notes.md

Reading meeting-notes.md... 847 words, ~4 min
Converting markdown to speech text...
Generating with Kokoro (voice: bf_lily)...

Output:
├── meeting-notes.mp3  (4m 12s, 3.8 MB)
└── Engine: Kokoro, Voice: bf_lily
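
The "847 words, ~4 min" line in the sample is consistent with a speaking rate of about 200 words per minute (847 words in 4m 12s). A rough estimator under that assumption; `estimate_minutes` is a hypothetical helper, and the 200 wpm default is inferred from the sample output, not documented by the kit:

```shell
# Hypothetical helper: estimate spoken duration in minutes from a word count.
estimate_minutes() {
  local file="$1" wpm="${2:-200}" words
  # tr strips the padding some wc implementations add around the count.
  words=$(wc -w < "$file" | tr -d '[:space:]')
  # Integer division, rounded to the nearest minute.
  echo $(( (words + wpm / 2) / wpm ))
}
```

For an 847-word file, `estimate_minutes meeting-notes.md` prints 4.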
        

Kokoro Voices (Selection of 50)

Voice	Description
bf_lily	British female (default)
af_heart	American female
am_adam	American male
bm_george	British male
af_bella	American female
af_sarah	American female
am_michael	American male
bf_emma	British female
af_nova	American female
am_eric	American male
bm_daniel	British male
bf_isabella	British female

Run kokoro-tts --help-voices for the full list of 50 voices.
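
The voice names listed above follow a visible pattern: the first letter gives the accent (`a` American, `b` British) and the second the gender (`f` female, `m` male). A small sketch that decodes a name on that basis; `describe_voice` is a hypothetical helper, and the pattern is inferred only from the voices shown here:

```shell
# Hypothetical helper: decode a voice name such as bf_lily from its prefix.
describe_voice() {
  local voice="$1" accent gender
  case "$voice" in
    a*) accent="American" ;;
    b*) accent="British" ;;
    *)  accent="Unknown accent" ;;
  esac
  case "$voice" in
    ?f*) gender="female" ;;
    ?m*) gender="male" ;;
    *)   gender="unknown gender" ;;
  esac
  echo "$accent $gender"
}
```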

The Full Chain

Meeting to audio briefing

Record a client call on your phone. Transcribe it with speaker labels. Run a strategic analysis that pulls in your full project context. Then generate an audio briefing you can listen to on the drive home.

The strategic analysis skill reads your CLAUDE.md, docs, competitor files, and git history — so the output is grounded in everything you know, not just the transcript.

# Step 1: Transcribe
> /voice-to-text client-call.m4a
2 speakers detected. 4,231 words.

# Step 2: Analyze
> /strategic-analysis client-call.txt
Gathering project context...
├── analysis.md
└── analysis-voice.txt

# Step 3: Generate audio
> /text-to-voice analysis-voice.txt
├── analysis-voice.mp3  (6m 18s)
└── Engine: Kokoro, Voice: bf_lily
      

Three Engines, One Skill

Pick the right tool

Kokoro is the default and handles 95% of use cases. But the skill also supports Orpheus (for emotional, natural-sounding speech with tags like <laugh>) and Coqui XTTS v2 (for cloning any voice from a 6-second WAV sample).

Just tell the skill which engine you want when it asks. Setup instructions for the optional engines are in the GitHub README.

Engine	Speed	Best For
Kokoro	Seconds	Daily use, bulk
Orpheus	~3.5x RT	Emotion, prosody
Coqui XTTS	~1x RT	Voice cloning
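
The table boils down to a simple decision rule. As an illustration only (the skill asks you which engine you want; `pick_engine` and its keywords are hypothetical):

```shell
# Hypothetical helper: map a need to the engine the table recommends.
pick_engine() {
  case "$1" in
    emotion|prosody) echo "orpheus" ;;
    clone|cloning)   echo "coqui" ;;
    *)               echo "kokoro" ;;  # default for daily and bulk use
  esac
}
```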

Local AI voice tools for Claude Code.
