## 🎙️ Three levels of voice
| Level | What it does | Requires | Cost |
|---|---|---|---|
| 1. Voice notes | Send voice → get text reply | Just Whisper STT (via OpenAI key) | ~$0.006/min |
| 2. TTS responses | Agent replies as audio | ElevenLabs API key | ~$5-15/mo |
| 3. Talk Mode | Hands-free, real-time conversation | macOS/iOS/Android node + mic | ~$10-20/mo |
✅ Start at Level 1 – voice notes on Telegram/WhatsApp work out of the box with just an OpenAI API key. Add ElevenLabs TTS in week 2 if you want spoken responses. Talk Mode is week 3.
## 🎤 Level 1: Voice notes (easiest)
Send a voice note on Telegram or WhatsApp. OpenClaw automatically transcribes it using Whisper and processes the transcript like a normal text message. No extra config needed if you have an OpenAI API key.
This works because:
- Telegram/WhatsApp voice notes arrive as audio files
- OpenClaw runs them through Whisper (OpenAI API or local)
- The transcript is treated as the message body
- Slash commands in speech work too ("slash reset" → /reset)
STT config (usually auto-detected from your OpenAI key):

```json
{
  "messages": {
    "stt": {
      "provider": "openai",
      "model": "whisper-1",
      "language": "en"
    }
  }
}
```
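The spoken slash-command mapping can be sketched as a small normalizer. This is a hypothetical helper, not OpenClaw source; the real parser may handle more phrasings.

```python
import re

def normalize_spoken_command(transcript: str) -> str:
    """Map a spoken command prefix ("slash reset") onto its typed form ("/reset")."""
    match = re.match(r"^\s*slash\s+(\w+)\s*(.*)$", transcript, re.IGNORECASE)
    if not match:
        # No spoken slash prefix: pass the transcript through as a normal message.
        return transcript.strip()
    command, rest = match.groups()
    return f"/{command.lower()} {rest}".strip()
```

For example, `normalize_spoken_command("slash reset")` yields `/reset`, while ordinary speech is passed through untouched.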
## 🗣️ Level 2: TTS responses
Make your agent talk back with ElevenLabs:
- Sign up at elevenlabs.io (free tier available for testing)
- Copy your API key from Profile → API Keys
- Browse the voice library and pick a voice ID
```bash
# Set the env var
export ELEVENLABS_API_KEY="your_key_here"
```

Or configure in `openclaw.json`:

```json
{
  "messages": {
    "tts": {
      "auto": "inbound",
      "provider": "elevenlabs",
      "voiceId": "EXAVITQu4vr4xnSDxMaL",
      "maxTextLength": 4000,
      "timeoutMs": 30000
    }
  }
}
```
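A sketch of the HTTP request this config implies. The endpoint and `xi-api-key` header follow ElevenLabs' public text-to-speech API, but the helper name and returned dict shape are illustrative, and the truncation mirrors the `maxTextLength` setting above.

```python
def build_tts_request(api_key: str, voice_id: str, text: str,
                      max_text_length: int = 4000) -> dict:
    """Build the pieces of an ElevenLabs TTS call without sending it."""
    return {
        # Public ElevenLabs text-to-speech endpoint, keyed by voice ID.
        "url": f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}",
        "headers": {"xi-api-key": api_key, "Content-Type": "application/json"},
        # Truncate to maxTextLength so a long reply doesn't burn credits.
        "json": {"text": text[:max_text_length], "model_id": "eleven_v3"},
    }
```

Passing the result to any HTTP client (`requests.post(req["url"], headers=req["headers"], json=req["json"])`) returns the audio bytes.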
### TTS auto modes

| Mode | Behavior |
|---|---|
| `"inbound"` | Only speaks when the user sent voice first (recommended) |
| `"always"` | Always responds with audio |
| `"tagged"` | Only speaks when the agent's response includes a voice tag |
| `false` | TTS disabled |
✅ Use "inbound" – the agent only talks back when you spoke first. Prevents surprise audio and unnecessary cost.
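The auto-modes table reduces to a small decision function. A sketch of the implied logic, not OpenClaw source:

```python
def should_speak(auto_mode, user_sent_voice: bool, reply_has_voice_tag: bool) -> bool:
    """Decide whether a reply gets a TTS rendering under a given auto mode."""
    if auto_mode == "always":
        return True
    if auto_mode == "inbound":
        # Only answer with audio when the user's message arrived as voice.
        return user_sent_voice
    if auto_mode == "tagged":
        return reply_has_voice_tag
    return False  # auto_mode is false: TTS disabled
```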
## 🎧 Level 3: Talk Mode
Full hands-free: you speak, the agent listens, processes, and speaks back. Requires a paired node device with microphone and speaker.
### Architecture

```
┌───────────────┐    WebSocket    ┌────────────────────────┐
│  Node device  │◄───────────────►│  Gateway               │
│ (mic + spkr)  │                 │  (model + tools)       │
│  macOS/iOS/   │                 │  ws://127.0.0.1:18789  │
│   Android     │                 └────────────────────────┘
└───────────────┘
  ↕ Whisper STT / ElevenLabs TTS
```
The Gateway handles AI processing; the node handles audio I/O. This keeps the Gateway headless and stable.
### Activate Talk Mode

- Web client: click the microphone icon at `localhost:18789`
- macOS app: push-to-talk overlay in the menu bar
- iOS/Android: via the companion node app
- Wake word: say "hey claw" (see below)
- Chat command: send "Start talk mode" or "Let's talk"
### Talk Mode config

```json
{
  "talk": {
    "voiceId": "EXAVITQu4vr4xnSDxMaL",
    "modelId": "eleven_v3",
    "outputFormat": "mp3_44100_128",
    "interruptOnSpeech": true,
    "stability": 0.5,
    "similarityBoost": 0.75
  }
}
```
Interrupt on speech: If you start talking while the agent is speaking, it stops and listens. Feels natural.
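The barge-in behavior amounts to a two-state machine. An illustrative sketch; the real logic lives inside the node app:

```python
class TalkSession:
    """Minimal model of interruptOnSpeech: listening vs. speaking."""

    def __init__(self, interrupt_on_speech: bool = True):
        self.interrupt_on_speech = interrupt_on_speech
        self.state = "listening"  # "listening" or "speaking"

    def agent_starts_reply(self) -> None:
        self.state = "speaking"

    def user_speech_detected(self) -> None:
        # If the user barges in while the agent is talking, stop playback
        # and go back to listening; otherwise the agent keeps speaking.
        if self.state == "speaking" and self.interrupt_on_speech:
            self.state = "listening"
```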
## 👂 Wake word detection
Always-on listening that activates on a keyword (like "Hey Siri" or "OK Google"):
```json
{
  "voice": {
    "wake_word": {
      "enabled": true,
      "engine": "porcupine",
      "keyword": "hey claw",
      "sensitivity": 0.5
    }
  }
}
```
| Engine | Notes |
|---|---|
| Porcupine | Most popular. Requires Picovoice API key. Custom keywords supported. |
| Snowboy | Open-source alternative. Less accurate but free. |
⚠️ Tune sensitivity carefully. Too high = false triggers on random words. Too low = won't hear you. Start at 0.5 and adjust. Use headphones to prevent echo.
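One way to picture what `sensitivity` does, as an illustrative sketch only (Porcupine's real scoring is internal to the engine): higher sensitivity lowers the confidence bar a detection must clear.

```python
def wake_word_triggered(confidence: float, sensitivity: float) -> bool:
    """Treat sensitivity as the inverse of a detection threshold.

    sensitivity 0.5 -> a candidate needs confidence >= 0.5;
    sensitivity 0.8 -> confidence >= 0.2 suffices (more false triggers).
    """
    threshold = 1.0 - sensitivity
    return confidence >= threshold
```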
## 📊 Provider comparison
| Provider | Type | Quality | Cost | Latency |
|---|---|---|---|---|
| ElevenLabs | TTS | ⭐⭐⭐ | $5-22/mo | ~1-2s |
| OpenAI TTS | TTS | ⭐⭐ | $15/1M chars | ~1s |
| Edge TTS | TTS | ⭐ | Free | ~2s |
| Whisper (API) | STT | ⭐⭐⭐ | $0.006/min | ~1s |
| Whisper (local) | STT | ⭐⭐⭐ | Free | ~2-5s |
| Deepgram | STT | ⭐⭐ | $0.0059/min | ~0.5s |
💡 Recommended stack: ElevenLabs for TTS (best natural voice) + Whisper API for STT (best accuracy). Run local Whisper as fallback. Total: ~$10-20/mo for moderate use.
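A back-of-envelope check of that estimate, using the table's prices (assumptions: Whisper API billed at $0.006/min, ElevenLabs on a flat-rate plan; real bills vary with usage):

```python
def monthly_voice_cost(stt_minutes: float, elevenlabs_plan_usd: float) -> float:
    """Estimate one month of voice spend: metered STT plus a flat TTS plan."""
    whisper_cost = stt_minutes * 0.006  # Whisper API price per audio minute
    return round(whisper_cost + elevenlabs_plan_usd, 2)
```

For example, 1,000 minutes of STT plus the $5 ElevenLabs tier comes to about $11/mo, consistent with the ~$10-20 range quoted above.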
## ⚙️ Full config reference
```json
{
  "messages": {
    "stt": {
      "provider": "openai",
      "model": "whisper-1",
      "language": "en"
    },
    "tts": {
      "auto": "inbound",
      "provider": "elevenlabs",
      "voiceId": "EXAVITQu4vr4xnSDxMaL",
      "maxTextLength": 4000,
      "timeoutMs": 30000
    }
  },
  "talk": {
    "voiceId": "EXAVITQu4vr4xnSDxMaL",
    "modelId": "eleven_v3",
    "interruptOnSpeech": true,
    "stability": 0.5,
    "similarityBoost": 0.75
  }
}
```
## 💡 Tips & best practices
- Start with voice notes – Level 1 costs almost nothing and works today
- Use `"inbound"` TTS mode – the agent only speaks when you spoke first
- Set `maxTextLength` – prevents TTS from reading 300-line stack traces
- Use headphones for Talk Mode – prevents echo and improves wake word detection
- Choose your voice wisely – ElevenLabs has dozens of pre-made voices. Browse the library to match your agent's personality.
- Monitor ElevenLabs credits – chatty setups can hit $50+/mo. The free tier is fine for testing.
- Local Whisper as fallback – if cloud STT is down, the local model keeps working
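The fallback tip can be sketched as a thin wrapper; both transcriber arguments are stand-ins for real cloud and local clients:

```python
def transcribe_with_fallback(audio_path: str, cloud_stt, local_stt) -> str:
    """Try cloud STT first; fall back to a local Whisper model on failure."""
    try:
        return cloud_stt(audio_path)
    except Exception:
        # Cloud unreachable or key invalid: the local model keeps voice working.
        return local_stt(audio_path)
```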
## 🔧 Troubleshooting
| Problem | Fix |
|---|---|
| Voice notes not transcribed | Check the STT config has a valid OpenAI API key. Try `openclaw logs --follow` for errors. |
| TTS audio not playing | Verify the ElevenLabs API key and voice ID. Check `tts.auto` is not `false`. |
| Talk Mode no audio | Ensure the node device is paired. Check mic permissions in OS settings. |
| High latency | Use the `eleven_v3` model (fastest). Try local Whisper for STT. Check network. |
| WebCrypto error in dashboard | Access the Control UI over HTTPS (Tailscale Serve) or localhost. Plain HTTP on remote hosts breaks WebCrypto. |
| Agent reads code/markdown aloud | Set `maxTextLength: 4000`. The TTS pipeline should strip markdown, but long responses still waste credits. |
| Wake word false positives | Lower `sensitivity` (closer to 0.0). Use a more unique wake phrase. A quiet environment helps. |