Audio Puppetry with AI Voices

Text to speech tools have become quite popular recently, now that they sound more and more like real human voices with somewhat natural inflection and speech patterns. That said, I'm not a big fan of text to speech for the projects I work on, mostly because the voices come so close to natural without being fully convincing. Upload text with an uncommon proper noun or a line of dialog that requires a bit of top-spin, and most of these AI voices fall solidly into the "uncanny valley," breaking whatever immersion you had worked so hard to establish. There is a better approach that can help AI voices sound more like professional, directable voices.

Side Note: I should say up front that I would prefer to use real voices from actual, flesh-and-blood people. I've worked with voice over artists in the past and beyond being lovely people, they can also inject unexpected sparks of life into your project. It's genuinely the best way to go. Alas, I don't have a real budget or collaborators to speak of, and assuming you're in the same boat because you're reading this, it's AI voices for us.

Text to speech tools are everywhere, and many of them can pull off basic narration or enthusiastic marketing copy and do it quite well. If you're looking for a conversational read with specific intonation and lots of proper nouns, you may be out of luck with even the fanciest text to speech tools.

Game dialog is a great example, because you're going to need a lot of funky proper nouns pronounced the same way by different voices. You can sometimes pull off made-up proper nouns with text to speech by writing them out phonetically, but emphasis can still be hit or miss. And when it misses, it can pull players right out of the immersion.

In my case, I've been working on an audio project called "We Will Have Always Had Time Travel," a podcast-like program with elements of sci fi that tries its best to be funny. I needed a main host voice along with several guest voices that would change frequently. I needed a consistent voice done cheaply, and for that I could go to a whole slew of sources.

However, what I really wanted was a way to direct the voices, and I couldn't find any way to do that. Then I found Speech to Speech. The name sounds very dumb, but I have had great luck with the feature. It allows me to record the inflection and pronunciation I want, then let the AI voice take over, a bit like vocal puppetry.

The Process

For speech to speech, first you record a guide track. My project is fully scripted because I tend to ramble and stammer, but you could just record yourself speaking extemporaneously. In fact, that might achieve an even more natural end result.

Then you can cut it all together in a guide track edit to make sure you didn't miss anything and that the inflections work together well. Then collapse the clips into separate tracks for each voice and export separate files of these compiled lines.

On many services there is a file size limit. If your files are too big you may need to further split them into parts based on the limitations of the AI voice service you're using.
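Here is a minimal sketch of that splitting step in Python, assuming your guide tracks are exported as uncompressed WAV files. The function name split_wav and the max_bytes value are my own placeholders, not anything from a particular service, so check the actual limit for the tool you use. It also cuts at fixed sizes rather than at pauses, so in practice you would probably nudge the cut points to fall between lines.

```python
import wave
from pathlib import Path


def split_wav(src, out_dir, max_bytes):
    """Split a WAV file into numbered parts of at most max_bytes each.

    max_bytes is a hypothetical upload limit; look up the real one
    for whatever voice service you are using.
    """
    src = Path(src)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    parts = []
    with wave.open(str(src), "rb") as reader:
        params = reader.getparams()
        # Bytes per frame = bytes per sample * number of channels.
        frame_size = params.sampwidth * params.nchannels
        # Leave room for the 44-byte header the wave module writes.
        frames_per_part = max(1, (max_bytes - 44) // frame_size)
        index = 1
        while True:
            frames = reader.readframes(frames_per_part)
            if not frames:
                break
            part_path = out_dir / f"{src.stem}_part{index:02d}.wav"
            with wave.open(str(part_path), "wb") as writer:
                # Copy the format; the frame count is corrected on close.
                writer.setparams(params)
                writer.writeframes(frames)
            parts.append(part_path)
            index += 1
    return parts
```

Calling split_wav("host_lines.wav", "parts", max_bytes=10_000_000) would write host_lines_part01.wav, host_lines_part02.wav, and so on into the parts folder. Both the file name and the 10 MB figure are made up for the example.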

Pick a voice you like, either through the gallery or by making your own. Each generated sample of text counts against your monthly limit, so be judicious. You can listen to the pre-baked samples to pick a voice as well. That doesn't touch your monthly limit.

Please note that some accented voices are available and work quite well for text to speech, but when they are used in speech to speech, the accent goes away. This may be a temporary bug, or it could be that the tool is picking up on my corn-fed Indiana accent from the guide track and stripping out the South London sound I was looking for.

Once you have your voices and samples arranged, run them through the speech to speech tool and monitor your usage to prevent running out of credits before your project is complete.
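If you want to check your budget before you hit the button, something like the sketch below can help. It assumes the service meters usage in seconds of audio, which may not match yours (some count characters or credits instead), and it holds back a reserve for re-running glitchy lines. The names wav_seconds and plan_batch and the 10 percent reserve are my own inventions for illustration.

```python
import wave


def wav_seconds(path):
    """Return the length of a WAV file in seconds."""
    with wave.open(str(path), "rb") as reader:
        return reader.getnframes() / reader.getframerate()


def plan_batch(clips, remaining_seconds, reserve=0.1):
    """Decide which clips fit in what is left of the monthly allowance.

    clips is a list of (name, seconds) pairs in priority order.
    reserve is the fraction of the allowance held back for retries.
    Returns (to_run, deferred) as two lists of clip names.
    """
    budget = remaining_seconds * (1 - reserve)
    to_run, deferred = [], []
    used = 0.0
    for name, seconds in clips:
        if used + seconds <= budget:
            to_run.append(name)
            used += seconds
        else:
            # Does not fit; save it for next month or a shorter edit.
            deferred.append(name)
    return to_run, deferred
```

For example, with 600 seconds left and clips of 300, 120, and 200 seconds, the first two fit inside the 540-second budget and the third is deferred.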

Then take these AI voices back into your audio editor. I sometimes use tools like Audition but I find that a free app like Audacity works better and makes me less reliant on Adobe.

Soapbox: Don't get me started on Adobe's pricing model. My mom worked as a commercial artist from the '70s on, and their "gougey" pricing policies made even her mad. You made my mom angry, Adobe, and that is something I cannot forgive.

Cut it all together and let the lines overlap a little if you want a more realistic feel. In real conversations people cut each other off all the time without even realizing it. When folks politely take turns to speak it starts to sound fake. Maybe that's just me.
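I do this by ear in the editor, but if you would rather rough out the timing first, here is a small sketch that works out where each line would start when every change of speaker steps on the previous line a little. The function name overlap_timeline and the 0.15-second default are assumptions of mine, not settings from any audio tool.

```python
def overlap_timeline(lines, overlap=0.15):
    """Place lines on a timeline so speaker changes overlap slightly.

    lines is a list of (speaker, seconds) pairs in script order.
    Returns a list of (speaker, start, end) tuples in seconds.
    """
    timeline = []
    cursor = 0.0
    previous_speaker = None
    for speaker, seconds in lines:
        start = cursor
        if previous_speaker is not None and speaker != previous_speaker:
            # Only a new speaker cuts in, and never by more than half
            # of their own line, so very short lines are not buried.
            start = max(0.0, cursor - min(overlap, seconds / 2))
        end = start + seconds
        timeline.append((speaker, round(start, 3), round(end, 3)))
        cursor = end
        previous_speaker = speaker
    return timeline
```

With overlap=0.25, a 4-second host line followed by a 2.5-second guest line puts the guest's start at 3.75 seconds, a quarter second before the host finishes.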

In this process, I also pulled text to speech versions of some of the content. You may be wondering why. Well, as I mentioned, some accented voices won't work properly with speech to speech. Sometimes speech to speech also produces glitches, or the AI just can't interpret my mumbly guide track. In those cases I patch a few words here or there with the text to speech version, and you would never know. In small chunks it sounds just fine. That's another advantage of AI voices I should mention: the consistency.

If you've ever recorded voice over from a real human at 9am and then attempted to record a punch-in replacement for a few words a few hours later, you know the voice will sound very slightly different and the patch will be obvious. You have to record the whole sentence or even a whole paragraph to keep the transition from sounding too jarring. You don't have to worry about that with AI voices, though, because voices made up of ones and zeros are perfectly consistent.

The podcast that isn't actually a podcast is out on YouTube if you're into sci-fi and poor attempts at humor.