Batch Transcription
Last updated
Consider the following scenario:
Your job as service desk manager is to keep the wheels on the helpdesk spinning smoothly. Part of that is ensuring that phone conversations follow company policy and stay polite
You need to perform this task once per week, and even with a random selection of calls, it still can take 1-2 hours to review
The end result of this is a simple summary of the call and general sentiment. You realize this can be automated!
With the above criteria, we have what we need to start automating! But let's get a handle on how this API works first
Before we process speech, we need something to give the API to process. With batch transcriptions, you must provide a URI to download the call audio from. In practice, you would want to point this to the recordings endpoint of your calling software that houses the recordings. For the purposes of demonstration, I used AI Text-to-Speech to make a demo call and published it to a publicly available storage blob for ease of access. You can listen to it here. We can use this same link to feed the audio into the speech service
We'll be using our API keys for this demonstration. These can be found in the Keys and Endpoint section of your Speech Service in the Azure portal:
Copy Key 1 somewhere safe for now. We need a secure way to get these keys, so let's use a KeyVault!
Configure the deployment details on the Basics page
Similar to the Speech service, the cost of a KeyVault is very small. 10,000 secrets transactions, far more than we will need, cost only $0.03. The Standard pricing tier is sufficient for most things.
Proceed to Review + Create, then create the resource
In your keyvault, head to the secrets page and select Generate/Import
Enter the name and paste your API key into the Secret Value box. Ensure the secret is enabled and click Create
Now that we've securely stored our key somewhere we can pull from, we need a way to authenticate to the vault itself. If you've worked with Azure Service Principals, these steps should be familiar
Enter the name, select the Single Tenant account type, and create the registration
In the app registration, go to the Secrets page, then create a new secret
This secret will not be displayed again! Save it somewhere safe
Go to the overview and also note down the Application (client) ID and the Directory (tenant) ID. I've saved all as environment variables in my API interaction tool, Insomnia
Lastly, we need to authorize this app registration to get secrets from our KeyVault. In the KeyVault, go to the Access Control page and add a new role assignment
Select the role Key Vault Secrets User, then add the Service Principal we just created
First up, we need our API key to interact with the service. Let's get it from our KeyVault. Using our API tool, we can send the following request to get an access token:
The response should contain access_token, token_type, and expiration values. Note the access token down.
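For reference, here is a minimal sketch of what that token request might look like in Python rather than an API tool. It assumes the standard Microsoft identity platform client-credentials flow and the `https://vault.azure.net/.default` scope. The function name `build_token_request` is my own, and the request is only built here, not sent.

```python
import urllib.parse
import urllib.request


def build_token_request(tenant_id: str, client_id: str, client_secret: str) -> urllib.request.Request:
    """Build (but do not send) a client-credentials token request."""
    # Tenant-specific token endpoint of the Microsoft identity platform.
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    form = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        # Assumption: this scope asks for a token that Key Vault will accept.
        "scope": "https://vault.azure.net/.default",
    }).encode("utf-8")
    return urllib.request.Request(
        url,
        data=form,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
```

Passing the result to `urllib.request.urlopen` and decoding the JSON body should give you the `access_token` to use in the next step.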
We can now request our secret from the vault by using the following request:
The response should contain a "value" field, which is the secret. Note this down
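As a rough sketch, the secret lookup could be written like this in Python. The vault name, secret name, and the `7.4` API version are assumptions for illustration, and the helper names are hypothetical. Swap in whatever your vault actually uses.

```python
import json
import urllib.request

# Assumption: 7.4 is a Key Vault REST API version your vault accepts.
KEY_VAULT_API_VERSION = "7.4"


def build_secret_request(vault_name: str, secret_name: str, access_token: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for the latest version of a secret."""
    url = (
        f"https://{vault_name}.vault.azure.net/secrets/{secret_name}"
        f"?api-version={KEY_VAULT_API_VERSION}"
    )
    # The access token from the previous step goes in as a bearer token.
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {access_token}"})


def extract_secret(response_body: bytes) -> str:
    """Pull the secret out of the JSON response body."""
    return json.loads(response_body)["value"]
```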
Now that we have our authentication, let's ask the Speech API to make us a new batch job. The endpoint you need to send it to is based on the region the service was deployed in. For example, I deployed to US West 3, so my endpoint URL starts with https://westus3.api.cognitive.microsoft.com. You can find this on the overview of your provisioned service.
We want to ask the Transcriptions service to do something, so the full URL should look like this: https://westus3.api.cognitive.microsoft.com/speechtotext/v3.1/transcriptions. Reading left to right, the URL names the region our service lives in, the service we'll be using, the API version, and finally the task we want performed.
The body for this request should be formatted as:
This sets the properties of our batch transcription job. To provide multiple URLs, add them to the contentUrls array. The properties object configures the job itself; be sure to add any additional candidateLocales if needed.
The display name of the batch job must be unique. Since we will be creating these jobs via an automation once this is implemented in your favorite RPA platform, we can simply use a UNIX timestamp to ensure a unique name
Lastly, we need to authenticate the request. The API key we grabbed from KeyVault should be put in a header named "Ocp-Apim-Subscription-Key". All together, the request looks like:
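Here is one way to assemble that request in Python, as a sketch rather than the exact request I sent. The `call-review-` name prefix, the `en-US` default locale, and the function name are my own choices for illustration. The `languageIdentification` block is only added if you pass extra candidate locales.

```python
import json
import time
import urllib.request


def build_transcription_request(region, api_key, content_urls, locale="en-US",
                                candidate_locales=None, now=None):
    """Build (but do not send) the POST that creates a batch transcription job."""
    url = f"https://{region}.api.cognitive.microsoft.com/speechtotext/v3.1/transcriptions"
    # A UNIX timestamp keeps the display name unique per run.
    timestamp = int(time.time() if now is None else now)
    body = {
        "contentUrls": list(content_urls),
        "locale": locale,
        "displayName": f"call-review-{timestamp}",  # hypothetical naming scheme
        "properties": {},
    }
    if candidate_locales:
        # Assumption: v3.1 expects candidate locales nested under languageIdentification.
        body["properties"]["languageIdentification"] = {
            "candidateLocales": list(candidate_locales)
        }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Ocp-Apim-Subscription-Key": api_key,
            "Content-Type": "application/json",
        },
    )
```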
Then we can expect a response of:
The important bits here are the displayName and the top-level self link. Note them down.
Now that we've created and started our batch job, we want to make sure it has finished. By running a simple GET request against the self URL, we will receive a report for the batch job. This will contain all the above attributes, plus a status. This status string will show Succeeded once processed. Once successful, you'll also see a new Files object with a new URL. Query the batch job until it shows Succeeded:
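If you are scripting this instead of clicking Send repeatedly, a polling loop along these lines should do the job. It is a sketch with assumed defaults: a 10 second interval, at most 60 polls, and treating a `Failed` status as an error. The `fetch` and `sleep` parameters exist only so the loop can be exercised without touching the network.

```python
import json
import time
import urllib.request


def fetch_json(url, api_key):
    """GET a Speech API URL and decode the JSON response."""
    req = urllib.request.Request(url, headers={"Ocp-Apim-Subscription-Key": api_key})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def wait_for_transcription(self_url, api_key, fetch=fetch_json,
                           interval=10.0, max_polls=60, sleep=time.sleep):
    """Poll the job's self URL until its status is Succeeded, then return the job."""
    for _ in range(max_polls):
        job = fetch(self_url, api_key)
        status = job.get("status")
        if status == "Succeeded":
            return job
        if status == "Failed":
            raise RuntimeError(f"Batch transcription failed: {job}")
        # Any other status means the job is not done yet, so wait and retry.
        sleep(interval)
    raise TimeoutError(f"Job did not finish after {max_polls} polls")
```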
Now we can check where our content can be found for this transcription. Query the same URL, but with /files appended
Your response should look like:
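In code, picking the transcription file out of that response could look roughly like the sketch below. I'm assuming the response carries a `values` array whose entries each have a `kind` and a `links.contentUrl`. Those field names are my reading of the v3.1 API, so check them against your own response.

```python
from typing import Any


def transcription_content_urls(files_response: dict[str, Any]) -> list[str]:
    """Return the content URLs of transcription files, skipping the job report."""
    return [
        entry["links"]["contentUrl"]
        for entry in files_response.get("values", [])
        # Assumption: transcript files are tagged with kind == "Transcription".
        if entry.get("kind") == "Transcription"
    ]
```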
With that, we're just about done! The very last step is to actually get the results. Query the contentUrl for the contenturl_0.json file and inspect the results. You'll see a combinedRecognizedPhrases object, then a lexical value within it. That's our transcription!
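To pull just the transcript text out of that file programmatically, something like this sketch should work. It assumes the key is spelled `combinedRecognizedPhrases` in the JSON and that each entry carries a `lexical` string. The function name is hypothetical.

```python
import json


def lexical_transcript(result_json: str) -> str:
    """Join the lexical text of every combined phrase, one entry per line."""
    result = json.loads(result_json)
    phrases = result.get("combinedRecognizedPhrases", [])
    # Entries without a lexical field are skipped rather than raising.
    return "\n".join(p["lexical"] for p in phrases if "lexical" in p)
```

The returned string is what you would hand to a summarization step.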
As you can see, that output is not quite human readable. But that's OK, now we can do a number of things to this output to make it something useful! Now that we have a transcript, we can send it to an AI Language Model like GPT-4 to create a summary, for example. That's a bit outside the scope of this doc, but here's what you can expect as an output:
Provision the keyvault resource. Similar to the speech service, click the Create button in your Resource Group, then search for and choose Key Vault
In the Entra Identity portal, select Applications, then App Registrations. Create a New Registration