TTS Pronunciation Issue with Uppercase Letters in Single-Word SSML Requests (Malay Language, ms-MY-YasminNeural)

Zhizhen Peng 彭志真 0 Reputation points
2026-05-06T11:09:55.2733333+00:00

Issue Description:

We are encountering inconsistent text-to-speech (TTS) pronunciation when synthesizing individual words via the Azure Cognitive Services TTS API, specifically for the Malay (ms) language using the ms-MY-YasminNeural voice. The issue appears to be related to capitalization patterns within words.

Observed Behavior:

  1. For the word "pharmacy":
    • Lowercase "p": Pronunciation is correct.
    • ❌ Uppercase "P" (e.g., "Pharmacy"): Pronunciation is abnormal.
  2. For the word "farmasi" (Malay for "pharmacy"):
    • ❌ Abnormal pronunciation occurs regardless of capitalization.
  3. Additional testing reveals a pattern:
    • Pronunciation is normal if only the middle letters are capitalized (e.g., "pHArMaCy").
    • Pronunciation becomes abnormal only when the first or last letter is capitalized (e.g., "Pharmacy", "pharmacY", "Farmasi", "farmasI").

SSML Request Example:

curl --location --request POST "https://southeastasia.tts.speech.microsoft.com/cognitiveservices/v1" \
--header "Ocp-Apim-Subscription-Key: ${subscriptionKey}" \
--header 'Content-Type: application/ssml+xml' \
--header 'X-Microsoft-OutputFormat: audio-48khz-96kbitrate-mono-mp3' \
--header 'User-Agent: curl' \
--data-raw '<speak version="1.0" xml:lang="ms" xmlns:mstts="https://www.w3.org/2001/mstts">
    <voice xml:lang="ms" name="ms-MY-YasminNeural">
        <prosody rate="+0%">
            pharmacy
        </prosody>
    </voice>
</speak>'

Impact:

Our use case involves TTS synthesis for individual words (e.g., labels, buttons, or medical terms), where we cannot fully control input text formatting. The inconsistency disrupts user experience, especially in critical scenarios like accessibility tools or multilingual applications.

Questions for Azure Engineering Team:

  1. Is this a known issue with the neural TTS engine, particularly for Malay or other languages?
  2. Could it be related to text normalization or grapheme-to-phoneme conversion when processing uppercase letters at word boundaries?
  3. Are there recommended SSML tags or attributes (e.g., <say-as>, <phoneme>) to enforce consistent pronunciation regardless of capitalization?
  4. If this is a bug, are there plans to address it in future updates?

Suggested Workarounds (Attempted/Considered):

  • Pre-processing text to lowercase (not always feasible for proper nouns or acronyms).
  • Using SSML <say-as interpret-as="verbatim">, but this may not suit all use cases.

Request:

We seek guidance on how to programmatically avoid this issue (e.g., API parameters, SSML configurations) or an estimated timeline for a backend fix. Detailed documentation on TTS capitalization handling would also be helpful.

Environment:

  • Region: southeastasia
  • Voice: ms-MY-YasminNeural
  • Output Format: audio-48khz-96kbitrate-mono-mp3
  • Language: Malay (xml:lang="ms")

Thank you for your support!

Azure Translator in Foundry Tools

1 answer

  1. Q&A Assist
    2026-05-06T11:10:17.7566667+00:00
    AI-generated content. This question contains an answer created with AI from Q&A Assist.

    This behavior is consistent with casing-sensitive text normalization and grapheme-to-phoneme (G2P) conversion, which can change pronunciation for single-word inputs and less common words. The documentation available to this answer does not list this specific Malay capitalization issue as a known bug, but it does describe the supported mechanisms for overriding pronunciation deterministically.

    Based on the available documentation, the reliable way to avoid capitalization‑dependent pronunciation is to explicitly control pronunciation via SSML, rather than relying on the default text normalization.

    Programmatic mitigations supported by the service:

    1. Use <phoneme> to force pronunciation
      • For words where capitalization causes abnormal output (for example, “pharmacy”, “farmasi”), specify the phonetic form explicitly using the ipa alphabet and the Malay phonetic set.
      • Example pattern (illustrative only; actual IPA needs to match the desired Malay pronunciation):
         <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="ms-MY">
           <voice name="ms-MY-YasminNeural">
             <phoneme alphabet="ipa" ph="faɾmasi">Farmasi</phoneme>
           </voice>
         </speak>
      
      The Malay IPA inventory for vowels and consonants is documented in the ms-MY phonetic set, so phoneme strings can be constructed using the tables for ipa symbols (for example, a, i, u, p, b, t, d, k, g, s, f, v, m, n, ŋ, etc.).
    2. Use <sub> to decouple written form from spoken form
      • When the visual casing must be preserved (for UI labels, proper nouns, acronyms), but a stable spoken form is needed, use <sub> so that the displayed text is the original word and the alias is the normalized form that should be spoken:
         <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="ms-MY">
           <voice name="ms-MY-YasminNeural">
             <sub alias="farmasi">Farmasi</sub>
           </voice>
         </speak>
      
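      This makes the spoken form independent of the casing of the visible text, provided the alias itself is pronounced correctly. The question reports that "farmasi" is mispronounced regardless of capitalization, so for that word <phoneme> is the safer choice.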
    3. Pick <sub> or <phoneme> per term; do not nest them
      • Both elements accept only text content, so a <phoneme> cannot be placed inside a <sub>, and the alias attribute of <sub> is required and must hold the text to speak. Neither tag needs the other, because in both cases:
        • The element content keeps the original casing.
        • The engine speaks the override (the ph value for <phoneme>, the alias for <sub>) instead of the content.
         <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="ms-MY">
           <voice name="ms-MY-YasminNeural">
             <phoneme alphabet="ipa" ph="faɾmasi">Farmasi</phoneme>
           </voice>
         </speak>
      
      If the visual text is handled outside SSML, use only <phoneme> inside the SSML and keep the UI label in the application.
    4. Use SSML systematically for single‑word and critical terms
      • The transparency guidance for neural TTS explicitly recommends SSML as best practice to improve quality and control pronunciation, especially for specialized or domain‑specific vocabulary. For a system that frequently synthesizes isolated words (buttons, labels, medical terms), routing those terms through SSML with either <sub> or <phoneme> is the supported way to avoid unexpected behavior from text normalization.
    5. Custom lexicon for broader vocabularies
      • For larger controlled vocabularies (for example, many medical terms), the service supports custom lexicons to improve pronunciation of domain‑specific terms. This is the recommended approach when many words need stable, capitalization‑independent pronunciation.
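
As a rough illustration of the custom lexicon option, the sketch below generates a lexicon file in the W3C Pronunciation Lexicon Specification (PLS) format. It rests on several assumptions that should be checked against the custom lexicon documentation: that grapheme matching is case-sensitive (hence one grapheme per casing variant), that custom lexicons are available for ms-MY, and that the IPA string, copied from the illustrative example above, is the desired pronunciation. The helper names `casing_variants` and `build_lexicon` are hypothetical.

```python
import xml.etree.ElementTree as ET

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def casing_variants(word: str) -> list[str]:
    """Return the casing patterns from the question for a non-empty word."""
    lower = word.lower()
    variants = [
        lower,                              # farmasi
        lower.capitalize(),                 # Farmasi (first letter upper)
        lower.upper(),                      # FARMASI
        lower[:-1] + lower[-1].upper(),     # farmasI (last letter upper)
    ]
    return list(dict.fromkeys(variants))    # de-duplicate, keep order


def build_lexicon(entries: dict[str, str], lang: str = "ms-MY") -> str:
    """Build a PLS lexicon string mapping each word (all variants) to an IPA string."""
    ET.register_namespace("", PLS_NS)       # serialize PLS as the default namespace
    root = ET.Element(
        f"{{{PLS_NS}}}lexicon",
        {"version": "1.0", "alphabet": "ipa", f"{{{XML_NS}}}lang": lang},
    )
    for word, ipa in entries.items():
        lexeme = ET.SubElement(root, f"{{{PLS_NS}}}lexeme")
        for variant in casing_variants(word):
            ET.SubElement(lexeme, f"{{{PLS_NS}}}grapheme").text = variant
        ET.SubElement(lexeme, f"{{{PLS_NS}}}phoneme").text = ipa
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Illustrative, unverified IPA taken from the example earlier in this answer.
    print(build_lexicon({"farmasi": "faɾmasi"}))
```

The generated file would then be hosted somewhere the service can reach and referenced from the SSML with a <lexicon uri="…"/> element inside <voice>; the URI is left elided because it depends on the deployment.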

    Regarding the specific questions:

    1. Known issue status
      • The provided known‑issues list for Foundry/Speech does not mention this Malay capitalization case. No explicit confirmation is present in the context that this is a tracked bug.
    2. Relation to text normalization / G2P
      • The behavior is consistent with capitalization‑sensitive text normalization and G2P, especially for single‑word inputs, but the context does not provide internal implementation details.
    3. Recommended SSML tags/attributes
      • <phoneme> with ipa (using the documented ms-MY phonetic set) to force pronunciation.
      • <sub> to keep the written form while overriding the spoken form.
      • Prosody tags like <prosody> can be used for rate, pitch, and volume, but they do not address capitalization‑driven pronunciation changes.
    4. Plans or timelines for a backend fix
      • The context does not provide any roadmap or ETA for changes related to capitalization handling in Malay TTS.

    In practice, for a production system that cannot fully control input casing, the robust mitigation is to:

    • Maintain a list of critical terms (for example, medical vocabulary, UI labels).
    • Map them to SSML snippets using <sub> and/or <phoneme> so pronunciation is stable regardless of how the original text is cased.
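
A minimal sketch of such a term map, assuming a small in-application table keyed by the lowercased word. The table contents, the function names `term_to_ssml` and `build_ssml`, and the choice of strategy per word are all hypothetical: "pharmacy" uses <sub> because the question reports that the lowercase form is pronounced correctly, while "farmasi" uses <phoneme> with the illustrative (unverified) IPA from the example above.

```python
from xml.sax.saxutils import escape, quoteattr

# Hypothetical term table: lowercased word -> (strategy, override value).
TERMS = {
    "pharmacy": ("sub", "pharmacy"),      # alias is the known-good lowercase form
    "farmasi": ("phoneme", "faɾmasi"),    # illustrative IPA, not verified
}


def term_to_ssml(text: str) -> str:
    """Wrap one word in <sub> or <phoneme>, keeping its original casing as content."""
    word = text.strip()
    body = escape(word)                   # escape &, <, > in the visible text
    entry = TERMS.get(word.lower())       # lookup ignores the input casing
    if entry is None:
        return body                       # unknown term: default normalization applies
    strategy, value = entry
    if strategy == "phoneme":
        return f'<phoneme alphabet="ipa" ph={quoteattr(value)}>{body}</phoneme>'
    return f"<sub alias={quoteattr(value)}>{body}</sub>"


def build_ssml(text: str, voice: str = "ms-MY-YasminNeural", lang: str = "ms-MY") -> str:
    """Build a complete single-word SSML document for the request body."""
    return (
        f'<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="{lang}">'
        f'<voice name="{voice}">{term_to_ssml(text)}</voice>'
        "</speak>"
    )


if __name__ == "__main__":
    for label in ("Pharmacy", "pharmacY", "Farmasi", "Klinik"):
        print(build_ssml(label))
```

The resulting string can be sent as the request body in place of the hand-written SSML in the curl example. Words missing from the table pass through unchanged, so the table only needs to cover terms where mispronunciation has actually been observed.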
