Introduction
The Voice Live API supports multiple languages and configuration options. In this document, you learn which languages the Voice Live API supports and how to configure them.
Depending on which model is being used, Voice Live speech input is processed either by one of the multimodal models (for example, gpt-realtime, gpt-realtime-mini, and phi4-mm-realtime) or by Azure speech to text models.
Azure speech to text supported languages
Azure speech to text is used for all configurations where a non-multimodal model is being used and for speech input transcription with phi4-mm-realtime.
It supports all languages documented on the Speech to text tab of the Language and voice support for the Speech service page.
There are three options for Voice Live language processing:
- Automatic multilingual configuration using a multilingual model (default): When you set an empty language configuration, Voice Live uses a multilingual model that works well for multiple languages. This is the default and recommended configuration for most customers.
- Single language configuration: You can specify a single language to restrict the languages detected for transcription.
- Multilingual configuration using up to 10 defined languages: Use this option only if the input voice includes multiple languages that aren't fully covered by the automatic multilingual mode. The order of languages matters: the first language in the list is treated as the primary language. This option can incur extra latency, and in some cases, for example with short sentences, transcript quality can be lower.
The current multilingual model supports the following languages:
- Chinese (China) [zh-CN]
- English (Australia) [en-AU]
- English (Canada) [en-CA]
- English (India) [en-IN]
- English (United Kingdom) [en-GB]
- English (United States) [en-US]
- French (Canada) [fr-CA]
- French (France) [fr-FR]
- German (Germany) [de-DE]
- Hindi (India) [hi-IN]
- Italian (Italy) [it-IT]
- Japanese (Japan) [ja-JP]
- Korean (Korea) [ko-KR]
- Spanish (Mexico) [es-MX]
- Spanish (Spain) [es-ES]
To use the automatic multilingual configuration with the multilingual model, no extra configuration is required. If you do add the language string to the session.update message, make sure to leave it empty.
{
  "session": {
    "input_audio_transcription": {
      "model": "azure-speech",
      "language": ""
    }
  }
}
Note
If no language is defined, the multilingual model generates results even for unsupported languages, but the transcription quality is low. Make sure to configure defined languages if you're setting up an application with languages that the multilingual model doesn't support.
To configure a single language or multiple languages that aren't supported by the multilingual model, you must add them to the language string in the session.update message. A maximum of 10 languages is supported. When you specify multiple languages, the order matters: the first language in the list is treated as the primary language.
{
  "session": {
    "input_audio_transcription": {
      "model": "azure-speech",
      "language": "en-US,fr-FR,de-DE"
    }
  }
}
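To restrict transcription to a single language instead, the same message shape applies with one locale in the language string. A minimal sketch, using German as an illustrative locale:

{
  "session": {
    "input_audio_transcription": {
      "model": "azure-speech",
      "language": "de-DE"
    }
  }
}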
gpt-realtime and gpt-realtime-mini supported languages
While the underlying model was trained on 98 languages, OpenAI lists only the languages that fell below a 50% word error rate (WER), an industry-standard benchmark for speech to text model accuracy. The model returns results for languages that aren't listed, but the quality will be low.
The following languages are supported by gpt-realtime and gpt-realtime-mini:
- Afrikaans
- Arabic
- Armenian
- Azerbaijani
- Belarusian
- Bosnian
- Bulgarian
- Catalan
- Chinese
- Croatian
- Czech
- Danish
- Dutch
- English
- Estonian
- Finnish
- French
- Galician
- German
- Greek
- Hebrew
- Hindi
- Hungarian
- Icelandic
- Indonesian
- Italian
- Japanese
- Kannada
- Kazakh
- Korean
- Latvian
- Lithuanian
- Macedonian
- Malay
- Marathi
- Maori
- Nepali
- Norwegian
- Persian
- Polish
- Portuguese
- Romanian
- Russian
- Serbian
- Slovak
- Slovenian
- Spanish
- Swahili
- Swedish
- Tagalog
- Tamil
- Thai
- Turkish
- Ukrainian
- Urdu
- Vietnamese
- Welsh
Multimodal models don't require a language configuration for general processing. If you configure input audio transcription, you can provide the transcription model with a single language hint as an ISO 639-1 language code to improve transcription quality. In this case, you need to add the language string to the session.update message.
{
  "session": {
    "input_audio_transcription": {
      "model": "gpt-4o-transcribe",
      "language": "en"
    }
  }
}
phi4-mm-realtime supported languages
The following languages are supported by phi4-mm-realtime:
- Chinese
- English
- French
- German
- Italian
- Japanese
- Portuguese
- Spanish
Multimodal models don't require a language configuration for general processing. If you configure input audio transcription for phi4-mm-realtime, use the same configuration as for non-multimodal models, where azure-speech is used for transcription, as described earlier. A configuration sketch follows the note below.
Note
Multimodal phi models support only the following transcription model: azure-speech.
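For example, a phi4-mm-realtime session that transcribes input audio might reuse the azure-speech configuration shown earlier. A minimal sketch, with an illustrative language list:

{
  "session": {
    "input_audio_transcription": {
      "model": "azure-speech",
      "language": "en-US,de-DE"
    }
  }
}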