Audio nodes
Spaces includes a set of audio nodes that let you generate voiceovers, music, and sound effects — all from text descriptions. Use them to add narration, soundtracks, and ambient audio to your video workflows without recording or licensing anything.
In this article
- Available audio nodes
- Voiceover
- Voiceover models
- Music Generator
- Music Generator models
- Sound Effects
- Video Audio Mix
- Typical audio workflow
- Prompting tips
- Tips and best practices
Available audio nodes
| Node | What it does |
|---|---|
| Voiceover | Converts text into natural-sounding speech using AI voices |
| Music Generator | Composes original music tracks from text descriptions |
| Sound Effects | Generates AI-powered foley and ambient sounds from text |
| Video Audio Mix | Combines multiple audio tracks with video |
All audio nodes are new additions to Spaces.
Voiceover
The Voiceover node converts text into natural-sounding speech. Choose from hundreds of AI voices across multiple providers to narrate scripts, create dialogue, or produce voice content.
How to use the Voiceover node
Add the node
Add a Voiceover node to your Space.
Enter your script
Type or paste your script into the text field — or connect a Text node to the input port.
Select a model
Choose ElevenLabs v2, ElevenLabs v3, or Gemini 2.5 Pro.
Pick a voice
Click the voice chip to open the voice library — browse and preview voices before selecting one.
Adjust parameters
Set speed, stability, and similarity boost if needed.
Generate
Set the number of generations from 1 to 10 and run the node.
Voice parameters
| Parameter | Range | What it does |
|---|---|---|
| Speed | 0.7x to 1.2x | Speaking rate — lower is slower, higher is faster |
| Stability | 0 to 1 | Voice consistency — lower is more expressive, higher is more stable |
| Similarity Boost | 0 to 1 | How closely the output matches the selected voice |
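If you plan voice settings outside Spaces, for example in a script or spreadsheet, the ranges in the table can be checked before you enter them on the node. The following is a minimal sketch: `VOICE_PARAM_RANGES` and `clamp_voice_params` are hypothetical names used for illustration and are not part of any Spaces API. Only the numeric ranges come from the table.

```python
# Hypothetical helper: these names are illustrative, not a Spaces API.
# Only the numeric ranges are taken from the Voice parameters table.
VOICE_PARAM_RANGES = {
    "speed": (0.7, 1.2),
    "stability": (0.0, 1.0),
    "similarity_boost": (0.0, 1.0),
}


def clamp_voice_params(params):
    """Return a copy of params with each value clamped into its documented range."""
    clamped = {}
    for name, value in params.items():
        if name not in VOICE_PARAM_RANGES:
            raise KeyError(f"unknown voice parameter: {name}")
        low, high = VOICE_PARAM_RANGES[name]
        clamped[name] = min(max(value, low), high)
    return clamped


settings = clamp_voice_params(
    {"speed": 1.5, "stability": 0.3, "similarity_boost": -0.2}
)
print(settings)  # {'speed': 1.2, 'stability': 0.3, 'similarity_boost': 0.0}
```

Clamping is one possible design choice. Rejecting out-of-range values with an error would work equally well if you prefer to catch mistakes early.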
Input and output
The Voiceover node accepts a Text input (your script) and outputs Audio (the generated voiceover). You can type directly on the card or connect a Text node for dynamic scripts.
Use cases
- Video narration — Write a script, generate a voiceover, then combine it with your video using Video Audio Mix.
- Podcast creation — Generate voiceovers for individual segments and combine them into a full episode.
- Character dialogue — Use multiple Voiceover nodes with different voices, then mix them together for conversation scenes.
- Lip-sync video — Generate a voiceover, then connect the audio output to a lip-sync video model so your character speaks in sync.
Voiceover models
Three AI models are available for voice generation. Each has different strengths — pick the one that matches your project.
| Model | Provider | Speed | Quality | Best for |
|---|---|---|---|---|
| ElevenLabs v2 Turbo | ElevenLabs | Fast | Good | Quick narration, batch processing |
| ElevenLabs v3 | ElevenLabs | Moderate | High | Final production, emotional narration |
| Gemini 2.5 Pro | Google | Moderate | High | Multi-language content, conversational tone |
ElevenLabs v2 is the fastest option. Use it when you need to generate many voiceovers quickly or iterate on scripts during production. It supports Speed, Stability, and Similarity Boost parameters.
ElevenLabs v3 delivers more natural prosody and intonation than v2 — pacing, emphasis, and emotional tone feel closer to a human read. Use it for final output when quality matters most. Same parameters as v2.
Gemini 2.5 Pro excels at multi-language content and conversational delivery. It supports Temperature, System Instruction, and Language selection. Multi-speaker configuration is possible for dialogue-style output.
Quick guide: which model to choose
| You need... | Use this |
|---|---|
| Fast turnaround | ElevenLabs v2 |
| Best audio quality | ElevenLabs v3 |
| Multiple languages | Gemini 2.5 Pro |
| Emotional, expressive narration | ElevenLabs v3 |
| Conversational or dialogue tone | Gemini 2.5 Pro |
| Batch processing many clips | ElevenLabs v2 |
Music Generator
The Music Generator node creates original music from text descriptions. Describe the mood, genre, tempo, and instruments — the AI composes a unique track.
How to use the Music Generator node
Add the node
Add a Music Generator node to your Space.
Describe your music
Write a description of the music you want in the prompt field — or connect a Text node to the input port.
Select a model
Choose Google Lyria or ElevenLabs Music.
Set the duration
Choose up to 30 seconds for Google Lyria or up to 10 seconds for ElevenLabs Music.
Generate
Set the number of generations from 1 to 10 and run the node.
Input and output
The Music Generator node accepts a Text input (your music description) and outputs Audio (the generated track).
Use cases
- Video soundtrack — Describe a mood and generate a background track, then combine it with your video using Video Audio Mix.
- Podcast intro music — Generate a short, branded intro with ElevenLabs Music.
- Background music — Create lo-fi, ambient, or genre-specific loops for content.
- Compare styles — Write several genre descriptions, generate them all, and pick the best fit for your project.
Music Generator models
Two AI models are available for music generation, each optimized for different use cases.
| Model | Provider | Max duration | Best for |
|---|---|---|---|
| Google Lyria | Google | 30 seconds | Background music, soundtracks, ambient, varied genres |
| ElevenLabs Music | ElevenLabs | 10 seconds | Jingles, intros, sound logos, short loops |
Google Lyria excels at longer compositions with natural musical structure. It handles a wide range of genres and can produce pieces with evolving arrangement — intros, builds, and transitions.
ElevenLabs Music is optimized for short, concentrated pieces. The output is clean and well-defined — ideal for branding elements, transitions, and loop-ready clips.
Quick guide: which model to choose
| You need... | Use this |
|---|---|
| More than 10 seconds | Google Lyria |
| Short, punchy audio | ElevenLabs Music |
| Varied instrumentation | Google Lyria |
| Quick generation | ElevenLabs Music |
| Soundtrack or background music | Google Lyria |
| Jingle, intro, or sound logo | ElevenLabs Music |
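The duration limits are the hardest constraint in this choice, so they can be checked first. The sketch below is illustrative only: `MUSIC_MODEL_MAX_SECONDS` and `models_supporting` are hypothetical names, not a Spaces API, and the limits are copied from the models table above.

```python
# Hypothetical helper: not a Spaces API.
# Maximum durations are taken from the Music Generator models table.
MUSIC_MODEL_MAX_SECONDS = {
    "Google Lyria": 30,
    "ElevenLabs Music": 10,
}


def models_supporting(duration_seconds):
    """List the models whose maximum duration covers the requested length."""
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    return [
        model
        for model, max_seconds in MUSIC_MODEL_MAX_SECONDS.items()
        if duration_seconds <= max_seconds
    ]


print(models_supporting(8))   # ['Google Lyria', 'ElevenLabs Music']
print(models_supporting(20))  # ['Google Lyria']
print(models_supporting(45))  # []
```

An empty result means neither model can produce the track in a single generation, so the requested length would need to be reduced.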
Sound Effects
The Sound Effects node generates AI-powered audio from text descriptions. Describe any sound — from rain on a tin roof to a spaceship engine humming — and the AI creates it. Use it to add atmosphere and foley to video projects.
How to use the Sound Effects node
Add the node
Add a Sound Effects node to your Space.
Describe the sound
Write a description in the prompt field — or connect a Text node to the input port.
Set the duration
Choose the desired length for the sound effect.
Enable Loop if needed
Turn on Loop for sounds that need to play continuously, like ambient or background audio.
Generate
Set the number of generations from 1 to 10 and run the node.
Input and output
The Sound Effects node accepts a Text input (your sound description) and outputs Audio (the generated effect).
Use cases
- Video foley — Describe scene sounds, generate them, and layer them onto your video with Video Audio Mix.
- Ambient loops — Create continuous background audio like coffee shop ambiance or rain, with Loop enabled.
- Podcast intros — Create unique audio branding with dramatic stings or transition sounds.
- Game audio — Generate UI sounds, environmental ambience, or action effects for game prototyping.
Video Audio Mix
The Video Audio Mix node combines multiple audio tracks with a video. Use it as the final step in your audio workflow — connect your voiceover, music, and sound effects, then mix them with your generated or uploaded video.
Typical audio workflow
A common pattern for producing narrated video with a soundtrack in Spaces:
Write your script
Use a Text node or type directly into the Voiceover node.
Generate voiceover
The Voiceover node converts your script to natural speech.
Add music
The Music Generator creates a background track from a mood description.
Layer sound effects
The Sound Effects node generates ambient audio or foley.
Mix everything
Connect all audio outputs and your video to a Video Audio Mix node.
You can connect as many audio sources as you need before combining them with the final video.
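It can help to think of this workflow as a small graph in which every audio node feeds the Video Audio Mix node. The sketch below models that idea in plain Python. It is not a Spaces file format or API: the edge list, the `Video` node label, and `unmixed_audio_nodes` are assumptions made for illustration.

```python
# Hypothetical model of the workflow as a graph; not a Spaces format or API.
# Each edge is (source node, target node). "Video" stands in for your
# generated or uploaded video.
EDGES = [
    ("Text", "Voiceover"),
    ("Voiceover", "Video Audio Mix"),
    ("Music Generator", "Video Audio Mix"),
    ("Sound Effects", "Video Audio Mix"),
    ("Video", "Video Audio Mix"),
]

AUDIO_NODES = {"Voiceover", "Music Generator", "Sound Effects"}


def unmixed_audio_nodes(edges):
    """Return audio nodes in the graph that are not connected to Video Audio Mix."""
    used = {source for source, _ in edges} | {target for _, target in edges}
    mixed = {source for source, target in edges if target == "Video Audio Mix"}
    return sorted((AUDIO_NODES & used) - mixed)


print(unmixed_audio_nodes(EDGES))  # []
```

A non-empty result points at an audio node whose output was generated but never connected to the mix.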
Prompting tips
Good prompts lead to better audio. Here is what works for each node.
Voiceover
Your voiceover prompt is the script itself — write it exactly as you want it spoken. Keep sentences natural and conversational. If you need specific pacing or emphasis, choose the right model and adjust the voice parameters rather than trying to encode delivery instructions in the text.
Music Generator
Include genre, mood, tempo, and instruments for the most control.
Cinematic orchestral piece, dramatic, building tension, strings and brass, 120 BPM
Acoustic folk guitar, warm and nostalgic, fingerpicking style, 90 BPM
For ElevenLabs Music, which produces shorter pieces, keep descriptions focused on a single mood or purpose.
Upbeat electronic jingle, happy, synth lead
Dark ambient drone, eerie, low frequency hum
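The example prompts above follow a consistent pattern: genre, then mood, then instruments, then tempo. If you write many prompts, for instance to compare styles, a small helper can keep that order consistent. This is a sketch only: `build_music_prompt` is a hypothetical function, and the output is ordinary text that you paste into the prompt field or a Text node.

```python
def build_music_prompt(genre, mood, instruments=(), bpm=None):
    """Join genre, mood, instruments, and tempo into a comma-separated prompt."""
    parts = [genre, mood]
    if instruments:
        parts.append(" and ".join(instruments))
    if bpm is not None:
        parts.append(f"{bpm} BPM")
    # Skip empty fields so the prompt has no stray commas.
    return ", ".join(part for part in parts if part)


print(build_music_prompt(
    "Cinematic orchestral piece",
    "dramatic, building tension",
    instruments=["strings", "brass"],
    bpm=120,
))
# Cinematic orchestral piece, dramatic, building tension, strings and brass, 120 BPM

print(build_music_prompt("Dark ambient drone", "eerie"))
# Dark ambient drone, eerie
```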
Sound Effects
Be descriptive and specific. The more detail you include, the more closely the result will match the sound you have in mind.
"Heavy rain on a tin roof with distant thunder" works better than just "rain".
"Busy coffee shop ambiance with quiet conversation and espresso machine" works better than "cafe sounds".
Tips and best practices
Preview voices before committing. The voice library includes sample playback for every voice — listen before you generate.
Lower stability for expression. If your voiceover sounds too flat or robotic, try reducing the Stability parameter. This adds more variation and emotional range to the delivery.
Generate multiple variations. Audio generation — especially music and sound effects — is highly variable. Generate several versions of the same prompt and pick the best one.
Use Loop for ambient sounds. Enable the Loop toggle on the Sound Effects node when you need continuous background audio like rain, traffic, or office ambiance.
Draft with v2, finish with v3. Use ElevenLabs v2 for fast iteration on scripts, then switch to v3 for the final voiceover when quality matters.
Be specific about tempo in music prompts. Adding a BPM value like 120 BPM gives the AI a concrete target and produces more consistent results.
Use multiple Voiceover nodes for dialogue. Add several Voiceover nodes with different voices to create conversation scenes, then combine them with Video Audio Mix.
Combine Sound Effects for rich soundscapes. Layer multiple Sound Effects nodes — for example, rain plus distant traffic plus indoor echo — and mix them together for depth.