How to Convert Audio to .WAV for Azure Speech Service Using MoviePy

Azure Speech Service offers robust speech recognition, translation, text-to-speech and many more capabilities, providing developers with powerful tools to integrate voice-based interactions into their applications.

However, to ensure seamless compatibility and optimal performance, Azure Speech Service requires audio files to adhere to specific standards regarding format, bitrate, sampling rate, and channel configuration.

Let's dive into how we can use MoviePy to convert audio files in other formats into one that is compatible with Azure Speech Service.

Leveraging MoviePy for Audio Conversion

According to its documentation, MoviePy is a Python module for video editing, which can be used for basic operations (like cuts, concatenations, and title insertions), video compositing (a.k.a. non-linear editing), video processing, or to create advanced effects. It can read and write the most common video formats, including GIF.

Although MoviePy is designed primarily for video editing, it is equally adept at manipulating audio. Let's explore how it can help convert various audio formats to comply with Azure Speech Service requirements.

Understanding Azure Speech Service Audio Requirements

For the use case in this article, we first need to know the target audio configuration. Azure Speech Service expects audio files in the WAV format with a specific bit depth, sampling rate, and channel configuration. Here are the key criteria:

File Format: WAV (Microsoft PCM)
Bit Depth: 16-bit
Sampling Rate: 16 kHz or 8 kHz (some scenarios support 32 kHz or 48 kHz)
Channels: Mono or Stereo (Mono recommended for speech recognition)
Bitrate: 256 kbps (kilobits per second) for mono, 512 kbps for stereo (at 16 kHz, 16-bit)
Codec: PCM (Pulse Code Modulation)
File Size: Generally up to 4 GB for continuous recognition
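
The bitrate figures in the list are not an independent setting: for uncompressed PCM they follow from the sampling rate, bit depth, and channel count. The short sketch below shows that arithmetic. The function name pcm_bitrate_kbps is made up for illustration and is not part of MoviePy or the Azure SDK.

```python
def pcm_bitrate_kbps(sample_rate_hz, bit_depth, channels):
    """Bitrate of uncompressed PCM audio in kilobits per second."""
    # bits per second = samples per second * bits per sample * channels
    return sample_rate_hz * bit_depth * channels / 1000


print(pcm_bitrate_kbps(16000, 16, 1))  # 256.0 (16 kHz, 16-bit, mono)
print(pcm_bitrate_kbps(16000, 16, 2))  # 512.0 (16 kHz, 16-bit, stereo)
print(pcm_bitrate_kbps(8000, 16, 1))   # 128.0 (8 kHz, 16-bit, mono)
```

So if you pick the 8 kHz sampling rate instead of 16 kHz, the resulting mono file has a bitrate of 128 kbps rather than 256 kbps.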

Step-by-Step Guide

To convert an audio file in another format to this configuration, follow the steps below.

Step 1 - Install MoviePy

Begin by installing MoviePy with the command pip install moviepy in your terminal. This installs MoviePy in your Python environment along with the dependencies it needs to run.

Step 2 - Import Required Functions

Import the AudioFileClip class from MoviePy's editor module. It loads audio files for manipulation, and its write_audiofile method lets you set the output file, codec, sampling rate, and the other configurations you need. Note that the moviepy.editor module belongs to MoviePy 1.x; in MoviePy 2.x it was removed, and AudioFileClip is imported from moviepy directly.

from moviepy.editor import AudioFileClip

Step 3 - Load the Audio File

Load your audio file using the AudioFileClip class. It takes the path to the audio file and returns a MoviePy AudioFileClip object, which can then be written out in the format required for compatibility with the Azure Speech Service API and/or SDK.

Assuming the file format is .ogg and the file is saved as audio.ogg next to your Python script, load the file as shown below. If the audio file is in a different directory from your script, pass its relative or absolute path instead.

audioclip = AudioFileClip("audio.ogg")
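
A wrong path is a common reason for the loading step to fail, so it can help to resolve and check the path before handing it to MoviePy. The sketch below uses only the standard library; the helper name resolve_audio_paths is hypothetical, and placing the .wav output next to the source file is an assumption made for this example.

```python
from pathlib import Path


def resolve_audio_paths(input_path):
    """Return (source, target): the absolute input path and a matching .wav path."""
    source = Path(input_path).expanduser().resolve()
    if not source.is_file():
        raise FileNotFoundError(f"Audio file not found: {source}")
    # Same directory and file name as the source, with the extension swapped to .wav
    return source, source.with_suffix(".wav")


# Possible usage with the loading step above:
# source, target = resolve_audio_paths("audio.ogg")
# audioclip = AudioFileClip(str(source))
```

If the input already has a .wav extension, the target path equals the source path, so you would want to choose a different output name in that case.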

Step 4 - Set Audio Configurations

Define the audio parameters necessary for compatibility as required for your solution.

audio_params = {
    "codec": "pcm_s16le",  # 16-bit little-endian PCM, the codec used for WAV
    "fps": 16000,  # Set the desired sampling rate: 16000 Hz
    # "fps": 8000,  # Alternatively, set the sampling rate to 8000 Hz
    "nchannels": 1,  # Mono audio
    "bitrate": "256k",  # 16000 Hz x 16 bits x 1 channel (use "128k" at 8000 Hz)
}

Step 5 - Pass Configurations Parameters and Convert Audio

Define the file path for the output. Since we want to convert our audio file to the .wav format, we give the output file the .wav extension.

Pass the audio parameters you defined to the write_audiofile method of your AudioFileClip. This converts the audio to the format indicated by the output file's extension and applies the parameters you pass in. write_audiofile has no argument for the number of channels, so the mono setting is forwarded to FFmpeg through ffmpeg_params as the -ac option.

output_file = "audio.wav"
# nbytes=2 gives 16-bit samples; "-ac" tells FFmpeg how many channels to write
audioclip.write_audiofile(output_file, codec=audio_params["codec"], fps=audio_params["fps"], nbytes=2, bitrate=audio_params["bitrate"], ffmpeg_params=["-ac", str(audio_params["nchannels"])])
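
After the conversion, it can be worth confirming that the file really has the properties listed in the requirements before sending it to Azure. The sketch below reads the WAV header with Python's standard wave module. The helper names describe_wav and is_speech_ready are hypothetical, and the accepted sampling rates (8 kHz and 16 kHz) are an assumption taken from the criteria above.

```python
import wave


def describe_wav(path):
    """Read channel count, bit depth, and sampling rate from a PCM WAV header."""
    with wave.open(str(path), "rb") as wav_file:
        return {
            "channels": wav_file.getnchannels(),
            "bit_depth": wav_file.getsampwidth() * 8,  # sample width is in bytes
            "sample_rate": wav_file.getframerate(),
        }


def is_speech_ready(info, sample_rates=(8000, 16000)):
    """True when the file is mono, 16-bit, and at one of the accepted rates."""
    return (
        info["channels"] == 1
        and info["bit_depth"] == 16
        and info["sample_rate"] in sample_rates
    )


# Possible usage once audio.wav has been written:
# info = describe_wav("audio.wav")
# print(info, is_speech_ready(info))
```

The wave module only opens uncompressed PCM files, so a wave.Error at this point is itself a hint that the output was not written as PCM.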

Step 6 - Load Your New Audio File To Use with Azure Speech Service

Your new audio file should be located at the file path you defined. If you followed the naming convention above, you will find your "audio.wav" file in the same directory as your Python script.

You can then load this file with the Azure Speech Service SDK, for example, to create a simple transcription solution.

Install the SDK with pip install azure-cognitiveservices-speech. Then, after getting your Speech Service key and region from the Azure portal and saving them as environment variables, run the code below to test your new audio file.

import os
import azure.cognitiveservices.speech as speechsdk

def recognize_from_file():
    # This example requires environment variables named "SPEECH_KEY" and "SPEECH_REGION"
    speech_config = speechsdk.SpeechConfig(subscription=os.environ.get('SPEECH_KEY'), region=os.environ.get('SPEECH_REGION'))
    speech_config.speech_recognition_language="en-US"

    audio_config = speechsdk.audio.AudioConfig(filename="audio.wav")
    speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

    speech_recognition_result = speech_recognizer.recognize_once_async().get()

    if speech_recognition_result.reason == speechsdk.ResultReason.RecognizedSpeech:
        print("Recognized: {}".format(speech_recognition_result.text))
    elif speech_recognition_result.reason == speechsdk.ResultReason.NoMatch:
        print("No speech could be recognized: {}".format(speech_recognition_result.no_match_details))
    elif speech_recognition_result.reason == speechsdk.ResultReason.Canceled:
        cancellation_details = speech_recognition_result.cancellation_details
        print("Speech Recognition canceled: {}".format(cancellation_details.reason))
        if cancellation_details.reason == speechsdk.CancellationReason.Error:
            print("Error details: {}".format(cancellation_details.error_details))
            print("Did you set the speech resource key and region values?")

recognize_from_file()

This example uses the recognize_once_async operation to transcribe utterances of up to 30 seconds, or until silence is detected.
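
Because a single-shot call like this stops after the first utterance, it helps to know how long your file is before choosing it. The sketch below computes the duration of a PCM WAV file with the standard wave module; the helper name wav_duration_seconds is hypothetical. For longer recordings, the Speech SDK also provides continuous recognition (start_continuous_recognition), which the Azure documentation covers.

```python
import wave


def wav_duration_seconds(path):
    """Duration of a PCM WAV file: number of frames divided by frames per second."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


# Possible usage:
# if wav_duration_seconds("audio.wav") > 30:
#     print("Long recording: single-shot recognition may return only the first utterance.")
```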

This sample code was taken from the Azure Speech Service documentation. You can explore the documentation to learn more about Azure Speech Service, and watch the tutorial below on building transcription and translation services with Azure Speech Service and the Translator service.

https://youtu.be/ikNPMomeZKs?si=Bup8lJ5P-voImVqc

Conclusion

MoviePy simplifies the process of converting audio files to the precise format required by Azure Speech Service. By adhering to these guidelines and utilizing MoviePy's flexibility, you can seamlessly prepare your audio data for optimal performance within Azure's powerful Speech Service.

That's it!!! We have successfully converted our audio to be compatible with Azure Speech Service. 😌