I built some pretty decent phone voice prompts using Google’s Text-to-Speech API and a bit of file conversion. It saves a lot of the time otherwise wasted trying to record messages with humans.
I was looking to record a holiday message for Episodic’s IVR. We use 3CX as a PBX for some of our voice channels. 3CX lets you record messages directly or upload your own recordings. However, I didn’t want to waste time stammering and stuttering into my phone to get the “perfect take”, nor wait a few days for a voice artist to do it. I also find that I never quite know what my recording script should be until I hear it played back.
So I needed something I could play with in real time that didn’t sound half bad. Google, being a leader in machine learning, artificial intelligence and general computer wizardry, was a natural choice, so I fired up their Text-to-Speech API. This is how we got it done.
Choosing a Voice
The first challenge was choosing a voice. Google provides a number of voices for a number of languages and nationalities. However, South Africa being the global powerhouse that it isn’t, we didn’t find a South African language pack.
Now we could’ve gone and trained our own custom voice models (Google supports that), but I really didn’t want to spend too much time on this; I just wanted to use something that was already there. South Africa has 11 official languages and who knows how many dialects and accents. We do speak English though, so that was a start. We needed the English voice pack that would be the most understandable. American English was not going to work. I tried the Australian packs but they didn’t quite nail it. UK English was going to be the best bet.
Long story short, I found that en-GB-Wavenet-C and en-GB-Wavenet-F were the best options (I leaned towards F over C) for a voice prompt that was clear, had good intonation and wasn’t too accented.
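If you want to see which voices exist for a language before settling on one, the API has a voices listing you can query. A minimal sketch, assuming the Cloud SDK and credentials described in the next section are already in place; the function name list_voices is purely illustrative:

```shell
# list_voices LANGUAGE_CODE
# Prints the JSON list of voices available for a language code such as en-GB.
# Assumes curl and gcloud are installed and application-default credentials are set.
list_voices() {
  curl -s \
    -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    "https://texttospeech.googleapis.com/v1/voices?languageCode=$1"
}
```

Calling `list_voices en-GB` should return the UK English voices, including the Wavenet ones mentioned above.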
Setting up the project
Starting with a Google Cloud project with a linked billing account, you simply enable the API. You can find it in APIs and Services.
Next up we need some credentials so that we can use the API. We create a service account with no role and download the JSON key file.
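If you prefer the command line to clicking through the console, the same steps can probably be scripted with the Cloud SDK (installing it is covered just below). This is only a sketch: the function name and the service account name tts-ivr are made up for illustration, and the account email format assumes a standard project ID.

```shell
# setup_tts_credentials PROJECT_ID KEY_FILE
# Enables the Text-to-Speech API, creates a service account with no role
# and downloads its JSON key. "tts-ivr" is a hypothetical account name.
setup_tts_credentials() {
  project="$1"
  key_file="$2"
  account="tts-ivr"

  gcloud services enable texttospeech.googleapis.com --project="$project" &&
  gcloud iam service-accounts create "$account" --project="$project" &&
  gcloud iam service-accounts keys create "$key_file" \
    --iam-account="${account}@${project}.iam.gserviceaccount.com"
}
```

You would call it as `setup_tts_credentials my-project-id key.json`, with your own project ID and key file name.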
I was following a command-line approach as per the Quick Start article which, for me at least, proved to be the quickest and easiest approach. To “load” the credentials, we just set the GOOGLE_APPLICATION_CREDENTIALS environment variable to point to that JSON file. Something like:
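A minimal sketch, assuming the key file was saved to a hypothetical keys folder in your home directory:

```shell
# Hypothetical path: point this at wherever you saved the downloaded JSON key.
export GOOGLE_APPLICATION_CREDENTIALS="$HOME/keys/tts-service-account.json"
```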
Pretty easy. The shell session will just use these credentials with various Cloud SDK calls. Which reminds me, you’ll need to install and initialise the Google Cloud SDK.
When making the API calls, we’ll generate a Bearer token using the SDK as below (you’ll see this in action later).
gcloud auth application-default print-access-token
Calling the API
To start making API calls, we simply use our favourite command-line HTTP client, like curl or httpie. Nothing stops you from using Python, Perl or whatever else, of course. We just need to create a JSON request and POST it to the API endpoint.
The JSON request tells the API what you want. It’s got the script, which voice to use, how to modulate it and what format the output must be in. My example looks like this:
{
  "input": {
    "text": "This is a recorded message. I can give email addresses like firstname.lastname@example.org, phone numbers like 021 001 2490 and much more. The intonation is not bad for IVR. Press 1 for sales, 2 for service and 3 to repeat."
  },
  "voice": {
    "languageCode": "en-GB",
    "name": "en-GB-Wavenet-F"
  },
  "audioConfig": {
    "audioEncoding": "MP3"
  }
}
Making the API call results in a response JSON payload which you want to capture in a file.
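With curl, the call could look something like the sketch below. It wraps the request in a small helper so the file names are easy to swap; the function name synthesize is my own invention, while the endpoint is the API’s text:synthesize method.

```shell
# synthesize REQUEST_JSON OUTPUT_FILE
# POSTs the request file to the text:synthesize endpoint and saves the JSON response.
# Assumes curl and gcloud are installed and application-default credentials are set.
synthesize() {
  curl -s -X POST \
    -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    -H "Content-Type: application/json; charset=utf-8" \
    -d @"$1" \
    "https://texttospeech.googleapis.com/v1/text:synthesize" > "$2"
}
```

Running `synthesize request.json api-output.txt` sends the request and captures the response in a file.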
api-output.txt will contain the JSON response. You’re looking for the audioContent field, which holds the base64-encoded audio data.
"audioContent": "base64 encoded data"
Edit the file and keep just the base64 data. You can rename the file if you want, but that’s just a convenience. What we really want is to decode it into an MP3 in the next step.
base64 api-output.txt --decode > audio.mp3
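If hand-editing the file gets tedious while you iterate on the script, the extraction and the decoding can be done in one go. A rough sketch using sed, assuming the audioContent field and its value sit on a single line of the response; the function name extract_audio is hypothetical:

```shell
# extract_audio RESPONSE_JSON OUTPUT_MP3
# Pulls the audioContent value out of the saved response and base64-decodes it.
# Assumes the field name and its quoted value are on one line of the file.
extract_audio() {
  sed -n 's/.*"audioContent": *"\([^"]*\)".*/\1/p' "$1" | base64 --decode > "$2"
}
```

For example, `extract_audio api-output.txt audio.mp3` leaves the original response file untouched and writes the MP3 next to it.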
The audio is not bad. Now you can just play with the request.json until you get the message just right and the timing etc. spot on.
Loading up to 3CX
3CX expects a WAV file and not an MP3 so we need to convert this. Now you can use a tool like Audacity or any number of command-line tools, but I found the quickest and simplest way was to use the 3CX Online Audio Converter. Quality is not the highest but it’s fine for a phone message.
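For anyone who would rather stay on the command line, ffmpeg is one of those tools. The sketch below is not what I used, and the target format (mono, 8 kHz, 16-bit PCM) is my assumption about what 3CX accepts, so check it against the 3CX documentation. The function name to_3cx_wav is made up.

```shell
# to_3cx_wav INPUT_MP3 OUTPUT_WAV
# Converts the MP3 to a mono, 8 kHz, 16-bit PCM WAV file with ffmpeg.
# The format settings are an assumption about 3CX's requirements.
to_3cx_wav() {
  ffmpeg -y -i "$1" -ac 1 -ar 8000 -c:a pcm_s16le "$2"
}
```

Usage would be `to_3cx_wav audio.mp3 prompt.wav`; the -y flag overwrites an existing output file, which is handy when regenerating the prompt repeatedly.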
Then we add it to an IVR option in 3CX.
This really is a simple and quick way to generate a voice prompt from a script in a voice that isn’t too bad at all. The advantage is that you don’t have to constantly record and rerecord every time you sneeze, your voice croaks or a dog barks. You also don’t have to worry about your microphone quality.
The synthesised voice is fine for an IVR and Google have done pretty well to intonate phone numbers and email addresses. Pretty impressive stuff.
If you get stuck, you can use Google’s own quick start documentation as a reference. It takes you as far as building the MP3.