The Do's and Don'ts of AI Voice Generation with Sonantic: Avoiding Common Pitfalls

Practical, action-first guidance for getting the best results from Sonantic: script craft, SSML use, audio settings, common mistakes that make voices sound unnatural - and how to fix them.

Achieve natural, expressive AI voice - without the common mistakes

You want a voice that moves people. Natural emotion. Clear pacing. No robotic stumbles. This guide shows you exactly how to get that with Sonantic, and what to avoid that will otherwise sabotage your audio.

Read fast for the quick wins. Stay for the deep-dive examples and troubleshooting. By the end you’ll be able to deliver polished, believable voice performances that fit games, film, ads, or narration.


Why this matters right away

A great synthetic voice can raise production value, speed up iteration, and give creative teams tools to prototype performances instantly. A poor one wastes time, undermines immersion, and makes listeners distrust the content. Small mistakes in script formatting, prosody cues, or post-processing often cause the biggest quality losses.

Follow the do’s to sidestep those traps. Heed the don’ts to stop common audio disasters before they start.


Quick summary: Top do’s and don’ts

Do:

  • Use clear, actor-style scripts with emotional direction.
  • Add prosody hints via punctuation, SSML, and short stage directions.
  • Choose the right voice persona and sample frequently during iteration.
  • Normalize levels and target loudness (LUFS) in post.
  • Provide pronunciations for uncommon names or acronyms.
  • Respect legal and ethical limits (consent for voice cloning, disclosure when required).

Don’t:

  • Feed long, unpunctuated runs of text expecting natural breaks.
  • Overuse filler punctuation (excessive ellipses or commas) to force pauses.
  • Skip auditioning multiple emotional settings or speeds.
  • Export at low sample rates or use aggressive lossy codecs during edit passes.
  • Use synthetic voice to impersonate real people without consent.

Before you start: set the objective

What is the voice for? A 15-second ad spot. An interactive NPC. A long-form audiobook. The use case determines pacing, breath patterns, and emotional consistency. Decide the objective first. Then configure voice, pacing, and post-processing to match.


Do: Write scripts like an actor would read them

Good scripts are short, purposeful, and include emotional direction.

  • Break long lines into shorter phrases. Actors think in breaths. So should your text.
  • Add brief parenthetical cues - (warm), (angry), (hesitant), (breath). Keep them simple.
  • Put stage directions on their own lines to avoid them being spoken.

Example - before and after:

Before:

“Welcome to our product where everything is fast efficient and secure get started now”

After:

“Welcome to our product. (warm) We’re fast, efficient, and secure. (short breath) Get started now.”

The after example gives the model clear breakpoints and an emotional target.


Do: Use punctuation and SSML thoughtfully

Punctuation affects prosody. So does SSML.

  • Commas create short pauses. Periods create longer ones. Ellipses and em dashes add nuance. Use them sparingly.
  • Sonantic and other TTS platforms often accept SSML-like controls to set pitch, rate, emphasis, and breaks. Learn the platform’s parameters and test them.

SSML snippet (example):

<speak>
  <p>
    <prosody rate="95%" pitch="-1st">Welcome to the mission.</prosody>
  </p>
  <p>
    <break time="300ms"/>
    <prosody volume="-1dB">Are you ready?</prosody>
  </p>
</speak>

Reference: the W3C SSML specification for general guidance: https://www.w3.org/TR/speech-synthesis11/


Don’t: Try to brute-force emotion with punctuation only

Adding twenty exclamation marks or multiple ellipses won’t reliably produce a believable shout or a dramatic pause. Use emotion controls where available, and pair them with sensible script editing.

Also avoid stuffing stage directions inline like this: “I am so excited (scream)!!!” Put cues on separate lines or use semantic tags so the system can treat them as directives rather than text to speak.
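Where the platform supports it, SSML expresses the same intent far more reliably than stacked punctuation. A minimal sketch using standard W3C SSML tags (engine support for individual tags varies, so test on your platform):

<speak>
  <p>
    I am <emphasis level="strong">so excited</emphasis>.
    <break time="400ms"/>
    <prosody volume="+2dB" rate="105%">Let’s go!</prosody>
  </p>
</speak>

The emphasis tag carries the intensity and the explicit break carries the pause - no punctuation guesswork required.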


Do: Choose voice, emotion, and style deliberately - then A/B test

Sonantic provides expressive voices and emotional tuning. Don’t assume the first voice you pick is the one. A/B test multiple voices and emotional intensities with short clips:

  • Record reference audio from a human speaker if you have a target performance.
  • Create short, identical test scripts (10–30s) and generate 3–5 variants.
  • Compare for clarity, timbre, and emotional match.

If your project requires lip-sync or animation timing, test with the actual animations early. Tiny timing mismatches break immersion.
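Batch-rendering the variants makes the comparison systematic. A minimal Python sketch of the idea - the endpoint, parameter names, and response format below are hypothetical placeholders, since Sonantic’s API details aren’t public; substitute your platform’s actual TTS interface:

import requests

# Hypothetical endpoint and parameters - not a real Sonantic API.
API_URL = "https://api.example.com/v1/tts"
API_KEY = "YOUR_API_KEY"

SCRIPT = "Welcome to the mission. Are you ready?"  # identical 10-30s test copy
VOICES = ["voice_a", "voice_b", "voice_c"]         # personas to compare
EMOTIONS = ["neutral", "warm"]                     # emotional settings

for voice in VOICES:
    for emotion in EMOTIONS:
        resp = requests.post(
            API_URL,
            headers={"Authorization": f"Bearer {API_KEY}"},
            json={"text": SCRIPT, "voice": voice, "emotion": emotion},
            timeout=60,
        )
        resp.raise_for_status()
        path = f"test_{voice}_{emotion}.wav"
        with open(path, "wb") as f:
            f.write(resp.content)  # assumes the API returns raw audio bytes
        print("rendered", path)

Naming the files by voice and emotion keeps blind comparison easy later.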


Do: Give clear pronunciation guidance for names and acronyms

Phonetic hints are essential for uncommon names, brand words, or foreign phrases. Provide pronunciations in parentheses or use SSML phoneme tags so the engine produces consistent results.

Example: “My name is Xaeon (ZAY-on).”
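In SSML, the standard <phoneme> tag makes the pronunciation explicit. A sketch using IPA - the transcription below is one plausible reading of the invented name, and supported phonetic alphabets vary by engine:

<speak>
  My name is <phoneme alphabet="ipa" ph="ˈzeɪ.ɒn">Xaeon</phoneme>.
</speak>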


Don’t: Export and iterate on compressed audio formats

During editing and mastering, keep workfiles in high-quality formats (48 kHz / 24-bit preferred). Avoid repeatedly exporting and re-importing MP3s - lossy codecs accumulate artifacts. Save a high-resolution master and create compressed variants only for final delivery.


Audio technical do’s: sample rate, loudness, breath handling

  • Aim for a 48 kHz / 24-bit render if possible. Many games and films expect 48 kHz.
  • Target consistent loudness (e.g., -16 LUFS for streaming and -23 LUFS for broadcast workflows, depending on platform). See EBU R128 as a guideline: https://en.wikipedia.org/wiki/EBU_R_128 - a normalization sketch follows this list.
  • Use gentle de-essing to control sibilance.
  • Retain natural breaths unless the style calls for their removal. Over-aggressive automatic breath removal makes speech sound clipped.
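Loudness normalization is easy to script. A minimal Python sketch using the open-source soundfile and pyloudnorm libraries, assuming a -16 LUFS streaming target (pick whatever target your platform requires):

import soundfile as sf
import pyloudnorm as pyln

# Read the high-resolution master (floating-point samples).
data, rate = sf.read("master.wav")

# Measure integrated loudness per ITU-R BS.1770 / EBU R128.
meter = pyln.Meter(rate)
loudness = meter.integrated_loudness(data)

# Normalize to the target and write a 24-bit result.
normalized = pyln.normalize.loudness(data, loudness, -16.0)
sf.write("master_norm.wav", normalized, rate, subtype="PCM_24")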

Post-processing: subtlety wins

  • Apply light compression and gentle EQ to add presence; avoid over-compression that flattens expressiveness.
  • Use short crossfades when stitching segments to remove tiny clicks (see the sketch after this list).
  • If you layer effects (reverb, delay), keep them consistent across takes to maintain a coherent sonic identity.
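For the crossfade step, a small pydub sketch - assuming two WAV takes to join; a 15–30 ms crossfade is usually enough to kill clicks without smearing the edit:

from pydub import AudioSegment

take_a = AudioSegment.from_wav("take_01.wav")
take_b = AudioSegment.from_wav("take_02.wav")

# Append with a 20 ms crossfade to remove the click at the joint.
joined = take_a.append(take_b, crossfade=20)
joined.export("joined.wav", format="wav")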

Don’t: Expect perfect performance from the first pass

Iterate. Tweak rate, pitch, emotion, or the script itself. Small copy edits often fix stumbling prosody more than aggressive signal processing.


Legal and ethical do’s and don’ts

Do:

  • Obtain consent to synthesize voices derived from real people.
  • Disclose synthetic voices when required by law or platform policy.
  • Secure any training data or voice files that are subject to privacy agreements.

Don’t:

  • Clone or impersonate a living person without clear permission.
  • Use synthetic audio to deceive (fraud, misinformation, malicious impersonation).

Context: Sonantic’s technology and its role in the industry have prompted important ethical conversations; review platform policies and consult legal counsel when planning voice cloning or likeness reproduction. For background reporting on ownership and industry developments, see coverage of Sonantic’s acquisition: https://www.theverge.com/2022/7/21/23272050/spotify-sonantic-ai-voice-acquisition


Troubleshooting: common artifacts and fixes

Artifact: robotic or flat delivery

  • Cause - too-fast rate, no emotional setting, long unbroken input
  • Fix - slow the rate to 90–95%, add an emotional cue, break the copy into phrases

Artifact: popping, clicks

  • Cause - hard cuts between renders, misaligned edits, clipping
  • Fix - add short crossfades, normalize levels, ensure no clipping during rendering

Artifact: overly sibilant “s” and “sh”

  • Cause - synthetic high-frequency emphasis
  • Fix - de-esser or gentle EQ dip around 5–8 kHz, or retune voice emphasis

Artifact: inconsistent pronunciation

  • Cause - plain text without phonetic hints
  • Fix - add phonetic guidance or SSML phoneme tags

Artifact: unnatural breathing (too frequent or missing)

  • Cause - default breath modeling not matching desired style
  • Fix - add (breath) directives or force breath placements in the script; remove breaths in post when necessary

Production workflow checklist

  1. Define use case and target loudness.
  2. Write actor-style scripts with emotion cues and short phrases.
  3. Select a voice persona and create 3–5 short test clips.
  4. Iterate on SSML/prosody settings and pronunciations.
  5. Render high-resolution stems (48 kHz / 24-bit) and keep a master.
  6. Light post-processing - LUFS normalization, gentle EQ, de-essing, crossfades.
  7. Run a final quality check on multiple playback systems (headphones, phone, studio monitors).
  8. Verify legal/ethical compliance for voice usage.

Example: small script to polished audio (workflow)

  1. Original copy:

“Click here to learn more about our service and sign up today.”

  2. Actor-style edit:

“Click here to learn more about our service. (inviting) Sign up today. (brief breath)”

  3. SSML/prosody tweak:

<speak>
  <p>
    <prosody rate="95%">Click here to learn more about our service.</prosody>
  </p>
  <p>
    <break time="220ms"/>
    <prosody rate="98%" pitch="+1st">Sign up today.</prosody>
  </p>
</speak>

  4. Post - render at 48 kHz / 24-bit, normalize to the target LUFS, apply a small presence boost of +2–3 dB at 3–5 kHz if needed, and light 2:1 compression.

Final thought - what separates believable from distracting

Believability comes from consistent choices across script, voice settings, and post-production. There is no single magic knob. Small, concerted improvements in phrasing, prosody guidance, and audio hygiene compound into a natural performance. The strongest single lever? Treat the script like a performance - short lines, clear direction, and deliberate breaths - and the rest becomes polishing, not rescue.
