zh-CN voices: mstts:express-as styles and paralinguistic tags produce identical output regardless of value

Ming-Li Lin 0 Reputation points
2026-05-05T19:20:38.75+00:00
curl --location --request POST "https://${SPEECH_REGION}.tts.speech.microsoft.com/cognitiveservices/v1" \
--header "Ocp-Apim-Subscription-Key: ${SPEECH_KEY}" \
--header 'Content-Type: application/ssml+xml' \
--header 'X-Microsoft-OutputFormat: audio-16khz-128kbitrate-mono-mp3' \
--header 'User-Agent: curl' \
--data-raw '
<speak version="1.0" xml:lang="en-US" xmlns:mstts="http://www.w3.org/2001/mstts">
<voice name="zh-CN-Xiaochen:DragonHDLatestNeural">
[laughter] 拜託別說出去
</voice>
<voice name="zh-CN-Xiaochen:DragonHDLatestNeural">
[coughing] 拜託別說出去
</voice>
<voice name="zh-CN-Xiaochen:DragonHDLatestNeural">
[throat_clearing] 拜託別說出去
</voice>
<voice name="zh-CN-Xiaochen:DragonHDLatestNeural">
[breathing] 拜託別說出去
</voice>
<voice name="zh-CN-Xiaochen:DragonHDLatestNeural">
[sighing] 拜託別說出去
</voice>
<voice name="zh-CN-Xiaochen:DragonHDLatestNeural">
[yawning] 拜託別說出去
</voice>
</speak>
' > styleParalingZh.mp3

The paralinguistic tags are not working at all.

According to the wiki page, it should work: "Styles are available on all English content for all voices. Style results are strongly relevant to the input content: the model adapts style application based on the semantic meaning of the text. Paralinguistics are available on all voices with all languages."

I also tried fetching the voice details and using the styles listed in the styleList. Still no difference.

Switching the root element to <speak version="1.0" xml:lang="zh-CN" xmlns:mstts="http://www.w3.org/2001/mstts"> makes no difference for all 3 sentences either.

Azure Speech in Foundry Tools

1 answer

  1. Q&A Assist
    2026-05-05T19:20:51.16+00:00
    AI generated content. This question contains an answer created with AI from Q&A Assist.

    Paralinguistic tags and speaking styles are only honored by voices that explicitly support them. For Chinese (zh-CN), support is voice- and style-specific, and not all Dragon HD / Dragon HD Flash voices implement paralinguistics or the same style set.

    From the language support tables:

    • zh-CN-Xiaochen:DragonHDFlashLatestNeural lists supported styles (cheerful, debating, empathetic, live-commercial, poetry-reading, sad, sorry) and explicitly notes Roles: Not supported.
    • Other zh-CN voices list different style sets, and only some support roles (for example, zh-CN-XiaomoNeural and zh-CN-YunyeNeural support roles like Boy, Girl, YoungAdultFemale, etc.).
    • The documentation for SSML styles and roles shows that paralinguistics and roles are controlled via mstts:express-as and role attributes, not by inserting bracketed tokens like [laughter] in the plain text.

    In the provided SSML, the text uses literal tokens like [laughter] instead of SSML elements. For Azure Speech, paralinguistic effects are not triggered by bracketed words; they require supported SSML constructs (for example, mstts:express-as with a supported style value, or role where the voice supports roles). Since zh-CN-Xiaochen:DragonHDLatestNeural is not listed with paralinguistic support in the tables and the text does not use SSML paralinguistic elements, the synthesized audio will be identical regardless of the bracketed markers.

    To get audible differences:

    1. Use a zh-CN voice that lists the desired styles or roles in the language support table (for example, zh-CN-XiaomoNeural, zh-CN-YunyeNeural, or other zh-CN voices with rich style sets).
    2. Apply styles via mstts:express-as with supported style values (such as cheerful, sad, serious, etc.) and, where available, role attributes as documented in the SSML customization page.
    3. Do not rely on bracketed tokens like [laughter] in the text; instead, use the documented SSML mechanisms for styles, paralinguistics, and roles.
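    Steps 1 and 2 above could combine into an SSML payload like the following sketch. This is an assumption-laden illustration, not a verified fix: zh-CN-XiaomoNeural and the cheerful style are examples of a voice/style pairing from the language support tables, and the xmlns and xmlns:mstts namespace URIs follow the SSML documentation rather than the question's payload.

    ```xml
    <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
           xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="zh-CN">
      <!-- zh-CN-XiaomoNeural lists styles such as cheerful in its styleList -->
      <voice name="zh-CN-XiaomoNeural">
        <!-- styledegree scales the style intensity (0.01 to 2) -->
        <mstts:express-as style="cheerful" styledegree="2">
          拜託別說出去
        </mstts:express-as>
      </voice>
    </speak>
    ```

    Posting this body with the same curl command from the question (same endpoint, headers, and output format) should produce audibly different output when the style value is changed to another entry from the voice's styleList.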

    If a specific Dragon HD voice does not list paralinguistic or role support in the language support tables, style and paralinguistic tags will not change the output for that voice.
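    To check what a specific voice actually supports before synthesizing, the voices/list REST endpoint returns per-voice metadata including StyleList and RolePlayList. A sketch, assuming jq is installed and SPEECH_KEY/SPEECH_REGION are set as in the question:

    ```shell
    # Fetch the region's voice inventory (requires a valid subscription key),
    # then show the style and role support for one voice of interest.
    curl --silent \
      --header "Ocp-Apim-Subscription-Key: ${SPEECH_KEY}" \
      "https://${SPEECH_REGION}.tts.speech.microsoft.com/cognitiveservices/voices/list" \
    | jq '.[] | select(.ShortName == "zh-CN-XiaomoNeural")
              | {ShortName, StyleList, RolePlayList}'
    ```

    If StyleList is missing or empty for a voice, mstts:express-as styles will not change that voice's output.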
