Voice Cloning
Create custom AI voices and generate speech with ElevenLabs integration.
Beta
Overview
Voice cloning on elizaOS Cloud enables you to:
- Clone voices: Create AI replicas of any voice
- Generate speech: Convert text to natural-sounding audio
- Custom voices: Use cloned voices in your agents
- Multi-language: Support for 29+ languages
Quick Start
Dashboard
Navigate to Dashboard → Voices for the visual interface.
API
Clone Voice
# Clone a voice from audio samples
curl -X POST "https://cloud.milady.ai/api/elevenlabs/voices/clone" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "name=My Voice Clone" \
  -F "files=@sample1.mp3" \
  -F "files=@sample2.mp3"

Voice Cloning
Prepare Audio Samples
Gather 1-3 minutes of clean audio from the target voice.
Requirements:
- Clear speech without background noise
- Single speaker only
- High quality (WAV, MP3, or M4A; 44.1kHz or higher)
Upload Samples
Upload audio files via dashboard or API.
Create Clone
Submit the cloning request and wait for processing.
Verify Quality
Test the cloned voice with sample text.
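The upload and create steps can be combined into a single multipart request. The sketch below mirrors the Quick Start curl example in JavaScript. `buildCloneRequest` is a hypothetical helper name, and the code assumes a runtime with global `FormData`, `Blob`, and `fetch` (modern browsers or Node.js 18+). The response format is not documented on this page, so the sketch does not parse it.

```javascript
// Endpoint taken from the Quick Start curl example.
const CLONE_URL = "https://cloud.milady.ai/api/elevenlabs/voices/clone";

// Hypothetical helper: builds the fetch() arguments for a clone request.
// `samples` is an array of { filename, data }, where `data` is a Blob.
function buildCloneRequest(apiKey, name, samples) {
  if (samples.length === 0) {
    throw new Error("At least one audio sample is required");
  }
  const form = new FormData();
  form.append("name", name);
  for (const sample of samples) {
    // Repeating the "files" field matches the repeated -F "files=@..." flags.
    form.append("files", sample.data, sample.filename);
  }
  return {
    url: CLONE_URL,
    options: {
      method: "POST",
      // No Content-Type header: fetch adds the multipart boundary itself.
      headers: { Authorization: `Bearer ${apiKey}` },
      body: form,
    },
  };
}

// Usage (performs a network call, so it is left commented out):
// const { url, options } = buildCloneRequest("YOUR_API_KEY", "My Voice Clone", samples);
// const response = await fetch(url, options);
```

Keeping request construction separate from the network call makes it easy to inspect the form fields before anything is uploaded.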
Sample Requirements
| Requirement | Recommendation |
|---|---|
| Duration | 1-3 minutes total |
| Format | WAV, MP3, M4A |
| Quality | 44.1kHz, 16-bit minimum |
| Content | Natural speech, varied intonation |
| Noise | Minimal background noise |
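These recommendations can be checked locally before uploading. The following is a minimal sketch, not part of the elizaOS Cloud API: `checkSamples` and the shape of the sample metadata objects are assumptions made for illustration, and the thresholds are copied from the table above. Bit depth is checked only when provided, since it is not meaningful for every format.

```javascript
// Thresholds copied from the Sample Requirements table.
const ALLOWED_FORMATS = ["wav", "mp3", "m4a"];
const MIN_SAMPLE_RATE_HZ = 44100;
const MIN_BIT_DEPTH = 16;
const MIN_TOTAL_SECONDS = 60;
const MAX_TOTAL_SECONDS = 180;

// Hypothetical helper: returns a list of warnings; an empty list means the
// samples match the recommendations.
// Each sample: { filename, format, sampleRateHz, bitDepth?, durationSeconds }.
function checkSamples(samples) {
  const warnings = [];
  let totalSeconds = 0;
  for (const s of samples) {
    totalSeconds += s.durationSeconds;
    if (!ALLOWED_FORMATS.includes(s.format.toLowerCase())) {
      warnings.push(`${s.filename}: format "${s.format}" is not WAV, MP3, or M4A`);
    }
    if (s.sampleRateHz < MIN_SAMPLE_RATE_HZ) {
      warnings.push(`${s.filename}: sample rate ${s.sampleRateHz} Hz is below 44.1kHz`);
    }
    if (s.bitDepth !== undefined && s.bitDepth < MIN_BIT_DEPTH) {
      warnings.push(`${s.filename}: bit depth ${s.bitDepth} is below 16-bit`);
    }
  }
  if (totalSeconds < MIN_TOTAL_SECONDS || totalSeconds > MAX_TOTAL_SECONDS) {
    warnings.push(`total duration ${totalSeconds}s is outside the 1-3 minute range`);
  }
  return warnings;
}
```

The function reports warnings rather than throwing, because the table lists recommendations rather than hard limits. Noise level and speech content still need to be judged by listening.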
Using someone’s voice without permission may violate their rights. Only clone voices you have rights to use.
Text-to-Speech
Generate Speech
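// Request speech for the given text with a specific voice and model
const response = await fetch("https://cloud.milady.ai/api/elevenlabs/tts", {
  method: "POST",
  headers: {
    Authorization: "Bearer YOUR_API_KEY",
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    text: "Welcome to elizaOS Cloud!",
    voice_id: "voice_abc123",
    model_id: "eleven_multilingual_v2",
    voice_settings: {
      stability: 0.5,
      similarity_boost: 0.75,
    },
  }),
});
// Stop on HTTP errors instead of treating an error body as audio
if (!response.ok) {
  throw new Error(`TTS request failed with status ${response.status}`);
}
// Wrap the returned audio in an object URL for playback
const audioBlob = await response.blob();
const audioUrl = URL.createObjectURL(audioBlob);

Voice Settings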
| Setting | Range | Description |
|---|---|---|
| stability | 0-1 | Higher = more consistent, lower = more expressive |
| similarity_boost | 0-1 | How closely to match the original voice |
| style | 0-1 | Style exaggeration (v2 models only) |
| use_speaker_boost | bool | Enhance speaker similarity |
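Out-of-range values are easy to send by accident, so it can help to normalize settings before building the request body. This is a minimal sketch under stated assumptions: `normalizeVoiceSettings` is a hypothetical helper, and the fallback values are simply the ones used in the Generate Speech example, not documented service defaults. How the service itself handles out-of-range values is not specified here.

```javascript
// Fallbacks taken from the Generate Speech example (an assumption, not
// documented service defaults).
const FALLBACK_SETTINGS = { stability: 0.5, similarity_boost: 0.75 };

// Settings documented with a 0-1 range in the table above.
const UNIT_RANGE_KEYS = ["stability", "similarity_boost", "style"];

// Hypothetical helper: clamps numeric settings into [0, 1] and coerces
// use_speaker_boost to a boolean. Unknown keys are dropped.
function normalizeVoiceSettings(settings = {}) {
  const result = { ...FALLBACK_SETTINGS };
  for (const key of UNIT_RANGE_KEYS) {
    const value = settings[key];
    if (value === undefined) continue;
    if (typeof value !== "number" || Number.isNaN(value)) {
      throw new TypeError(`${key} must be a number between 0 and 1`);
    }
    result[key] = Math.min(1, Math.max(0, value));
  }
  if (settings.use_speaker_boost !== undefined) {
    result.use_speaker_boost = Boolean(settings.use_speaker_boost);
  }
  return result;
}

// Usage: pass the result as `voice_settings` in the TTS request body.
const voiceSettings = normalizeVoiceSettings({ stability: 1.4, style: 0.3 });
```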
Available Models
| Model | Languages | Quality | Speed |
|---|---|---|---|
| eleven_multilingual_v2 | 29 | Highest | Medium |
| eleven_monolingual_v1 | English | High | Fast |
| eleven_turbo_v2 | English | Good | Fastest |
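The trade-offs in the table can be encoded as a small selection function. The sketch below is one possible reading of the table, not an official recommendation: `chooseModel`, its `priority` values, and the use of language codes such as "en" are all assumptions, and it also assumes English is among the 29 languages covered by `eleven_multilingual_v2`.

```javascript
// Hypothetical helper: picks a model_id from the Available Models table.
// priority: "quality" (default), "balanced", or "speed".
function chooseModel({ language = "en", priority = "quality" } = {}) {
  const isEnglish = language.toLowerCase().startsWith("en");
  if (!isEnglish) {
    // The other two models are listed as English-only.
    return "eleven_multilingual_v2";
  }
  switch (priority) {
    case "speed":
      return "eleven_turbo_v2"; // Good quality, Fastest
    case "balanced":
      return "eleven_monolingual_v1"; // High quality, Fast
    case "quality":
      return "eleven_multilingual_v2"; // Highest quality, Medium speed
    default:
      throw new Error(`Unknown priority: ${priority}`);
  }
}

// Usage: the result goes into the `model_id` field of the TTS request.
const modelId = chooseModel({ language: "en-US", priority: "speed" });
```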
Pre-built Voices
elizaOS Cloud provides pre-built voices:
curl -X GET "https://cloud.milady.ai/api/elevenlabs/voices" \
  -H "Authorization: Bearer YOUR_API_KEY"

{
  "voices": [
    {
      "voice_id": "21m00Tcm4TlvDq8ikWAM",
      "name": "Rachel",
      "labels": { "accent": "american", "age": "young" },
      "preview_url": "https://..."
    },
    {
      "voice_id": "AZnzlk1XvdvUeBnXmlld",
      "name": "Domi",
      "labels": { "accent": "american", "age": "young" }
    }
  ]
}

Voice Management
Get Voice Details
curl -X GET "https://cloud.milady.ai/api/elevenlabs/voices/voice_abc123" \
  -H "Authorization: Bearer YOUR_API_KEY"

Delete Voice
curl -X DELETE "https://cloud.milady.ai/api/elevenlabs/voices/voice_abc123" \
  -H "Authorization: Bearer YOUR_API_KEY"

Check Clone Status
curl -X GET "https://cloud.milady.ai/api/elevenlabs/voices/jobs" \
  -H "Authorization: Bearer YOUR_API_KEY"

Agent Integration
Use cloned voices with your agents:
{
  "name": "Voice Assistant",
  "bio": ["Helpful AI assistant with custom voice"],
  "settings": {
    "voice": {
      "provider": "elevenlabs",
      "voiceId": "voice_abc123",
      "model": "eleven_multilingual_v2"
    }
  }
}

Speech-to-Text
Convert audio to text:
curl -X POST "https://cloud.milady.ai/api/elevenlabs/stt" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@audio.mp3"

{
  "text": "This is the transcribed text from the audio.",
  "confidence": 0.95,
  "words": [
    { "word": "This", "start": 0.0, "end": 0.2, "confidence": 0.98 },
    { "word": "is", "start": 0.2, "end": 0.3, "confidence": 0.99 }
  ]
}

Pricing
See Billing & Credits for current pricing. Voice cloning costs 5 credits per clone; TTS and STT are billed by usage.
Monitor your voice usage in the billing dashboard.
Best Practices
- Quality Samples — Use high-quality, noise-free audio (44.1kHz+, minimal background noise)
- Natural Speech — Include varied intonation, pacing, and emotional range in samples
- Sufficient Length — Provide 1-3 minutes of audio for best clone quality
- Test Thoroughly — Verify clone quality with diverse text before production use