This behavior is consistent with how the TTS system’s text normalization and grapheme‑to‑phoneme (G2P) components can vary pronunciation based on casing, especially for single-word inputs and less common words. The provided context does not list this specific Malay capitalization issue as a known bug, but it does document the supported mechanisms to override pronunciation deterministically.
Based on the available documentation, the reliable way to avoid capitalization‑dependent pronunciation is to explicitly control pronunciation via SSML, rather than relying on the default text normalization.
Programmatic mitigations supported by the service:
- Use `<phoneme>` to force pronunciation
  - For words where capitalization causes abnormal output (for example, “pharmacy”, “farmasi”), specify the phonetic form explicitly using the `ipa` alphabet and the Malay phonetic set.
  - Example pattern (illustrative only; actual IPA needs to match the desired Malay pronunciation):
    ```xml
    <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="ms-MY">
      <voice name="ms-MY-YasminNeural">
        <phoneme alphabet="ipa" ph="faɾmasi">Farmasi</phoneme>
      </voice>
    </speak>
    ```
    The Malay IPA inventory for vowels and consonants is documented in the `ms-MY` phonetic set, so phoneme strings can be constructed using the tables for `ipa` symbols (for example, `a`, `i`, `u`, `p`, `b`, `t`, `d`, `k`, `g`, `s`, `f`, `v`, `m`, `n`, `ŋ`, etc.).
- Use `<sub>` to decouple written form from spoken form
  - When the visual casing must be preserved (for UI labels, proper nouns, acronyms), but a stable spoken form is needed, use `<sub>` so that the displayed text is the original word and the `alias` is the normalized form that should be spoken:
    ```xml
    <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="ms-MY">
      <voice name="ms-MY-YasminNeural">
        <sub alias="farmasi">Farmasi</sub>
      </voice>
    </speak>
    ```
    This ensures consistent pronunciation regardless of the casing of the visible text.
- Combine `<sub>` with `<phoneme>` for full control
  - For maximum robustness (especially for medical terms or domain‑specific vocabulary), combine both tags so that:
    - The user sees the original casing.
    - The engine uses a fixed phonetic pronunciation.
    ```xml
    <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="ms-MY">
      <voice name="ms-MY-YasminNeural">
        <sub alias="">
          <phoneme alphabet="ipa" ph="faɾmasi">Farmasi</phoneme>
        </sub>
      </voice>
    </speak>
    ```
    The W3C SSML 1.0 specification defines `<sub>` as containing text only, so this nesting (and the empty `alias`) may be rejected; validate it against the service before relying on it. Alternatively, if the visual text is handled outside SSML, use only `<phoneme>` inside the SSML and keep the UI label in the application.
- Use SSML systematically for single‑word and critical terms
  - The transparency guidance for neural TTS explicitly recommends SSML as best practice to improve quality and control pronunciation, especially for specialized or domain‑specific vocabulary. For a system that frequently synthesizes isolated words (buttons, labels, medical terms), routing those terms through SSML with either `<sub>` or `<phoneme>` is the supported way to avoid unexpected behavior from text normalization.
- Custom lexicon for broader vocabularies
  - For larger controlled vocabularies (for example, many medical terms), the service supports custom lexicons to improve pronunciation of domain‑specific terms. This is the recommended approach when many words need stable, capitalization‑independent pronunciation.
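For the custom lexicon option, a minimal Python sketch of generating a lexicon file in the W3C Pronunciation Lexicon Specification (PLS) format is shown here. The function name `build_lexicon`, the choice to emit lowercase, capitalized, and uppercase graphemes for each word, and the IPA string (taken from the illustrative example above) are assumptions for illustration. Whether the service accepts a custom lexicon for `ms-MY`, and how grapheme casing is matched, should be confirmed against the service documentation.

```python
import xml.etree.ElementTree as ET

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_lexicon(entries, lang="ms-MY"):
    """Return PLS XML text for a dict mapping a word to an IPA string.

    Assumption: each casing variant is listed as its own <grapheme>,
    so the same pronunciation applies however the input is cased.
    """
    # Serialize PLS elements in the default namespace (no prefix).
    ET.register_namespace("", PLS_NS)
    root = ET.Element(
        f"{{{PLS_NS}}}lexicon",
        {"version": "1.0", "alphabet": "ipa", f"{{{XML_NS}}}lang": lang},
    )
    for word, ipa in entries.items():
        lexeme = ET.SubElement(root, f"{{{PLS_NS}}}lexeme")
        # dict.fromkeys removes duplicate variants while keeping order.
        variants = dict.fromkeys([word.lower(), word.capitalize(), word.upper()])
        for variant in variants:
            ET.SubElement(lexeme, f"{{{PLS_NS}}}grapheme").text = variant
        ET.SubElement(lexeme, f"{{{PLS_NS}}}phoneme").text = ipa
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Illustrative IPA only; verify against the ms-MY phonetic set.
    print(build_lexicon({"farmasi": "faɾmasi"}))
```

The generated file would then be hosted at a URL the service can reach and referenced from SSML with a `<lexicon>` element; the exact hosting and caching requirements are service-specific.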
Regarding the specific questions:
- Known issue status
  - The provided known‑issues list for Foundry/Speech does not mention this Malay capitalization case. No explicit confirmation is present in the context that this is a tracked bug.
- Relation to text normalization / G2P
  - The behavior is consistent with capitalization‑sensitive text normalization and G2P, especially for single‑word inputs, but the context does not provide internal implementation details.
- Recommended SSML tags/attributes
  - `<phoneme>` with `ipa` (using the documented `ms-MY` phonetic set) to force pronunciation.
  - `<sub>` to keep the written form while overriding the spoken form.
  - Prosody tags like `<prosody>` can be used for rate, pitch, and volume, but they do not address capitalization‑driven pronunciation changes.
- Plans or timelines for a backend fix
  - The context does not provide any roadmap or ETA for changes related to capitalization handling in Malay TTS.
In practice, for a production system that cannot fully control input casing, the robust mitigation is to:
- Maintain a list of critical terms (for example, medical vocabulary, UI labels).
- Map them to SSML snippets using `<sub>` and/or `<phoneme>` so pronunciation is stable regardless of how the original text is cased.
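The two steps above could be sketched in Python roughly as follows. This is a minimal illustration, not a service API: `CRITICAL_TERMS`, `to_ssml`, and the sample entries are hypothetical names and data, and the IPA value is the illustrative one used earlier rather than a verified transcription. Terms are matched case-insensitively, the original casing is kept as the element text, and everything else is XML-escaped.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical term table: lowercase term -> IPA string,
# or None to fall back to <sub> with the lowercase form as alias.
CRITICAL_TERMS = {
    "farmasi": "faɾmasi",  # illustrative IPA, not verified
    "klinik": None,
}

# One case-insensitive pattern over all terms, longest first.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(t) for t in sorted(CRITICAL_TERMS, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def _wrap(match):
    """Wrap one matched term in <phoneme> or <sub>, keeping its casing."""
    written = match.group(0)
    key = written.lower()
    ipa = CRITICAL_TERMS[key]
    if ipa is not None:
        return f'<phoneme alphabet="ipa" ph={quoteattr(ipa)}>{escape(written)}</phoneme>'
    return f"<sub alias={quoteattr(key)}>{escape(written)}</sub>"


def to_ssml(text, voice="ms-MY-YasminNeural", lang="ms-MY"):
    """Build an SSML document with critical terms pinned to a fixed form."""
    parts, pos = [], 0
    for m in _PATTERN.finditer(text):
        parts.append(escape(text[pos:m.start()]))  # escape untouched text
        parts.append(_wrap(m))
        pos = m.end()
    parts.append(escape(text[pos:]))
    return (
        f'<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f'xml:lang="{lang}"><voice name="{voice}">{"".join(parts)}</voice></speak>'
    )


if __name__ == "__main__":
    print(to_ssml("FARMASI"))
    print(to_ssml("Klinik & Farmasi"))
```

Keeping the table keyed by lowercase terms means a single entry covers `farmasi`, `Farmasi`, and `FARMASI`, which is the casing independence the mitigation is after.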