The v1 real-time STT endpoint returns a constant Confidence value of 0.039347406 for all NBest results in every region tested.
The v1 real-time Speech-to-Text endpoint (/speech/recognition/{mode}/cognitiveservices/v1) returns a fixed Confidence value of 0.039347406 in the NBest array for every recognition result, regardless of audio content, region, resource type, or SDK version in every configuration tested. The recognized text is accurate; only the Confidence field is wrong.
Affected endpoint
https://{region}.stt.speech.microsoft.com/speech/recognition/{mode}/cognitiveservices/v1
All modes affected: conversation, dictation, interactive
To Reproduce
curl -X POST \
'https://eastus.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1?language=en-US&format=detailed' \
-H 'Ocp-Apim-Subscription-Key: <ANY_VALID_KEY>' \
-H 'Content-Type: audio/wav' \
--data-binary @clear_speech.wav
v1 Response (broken):
{
  "RecognitionStatus": "Success",
  "DisplayText": "Hello, I'd love to help you order a pizza. What type of pizza would you like?",
  "NBest": [
    {
      "Confidence": 0.039347406,
      "Lexical": "hello i'd love to help you order a pizza what type of pizza would you like"
    },
    {
      "Confidence": 0.039347406,
      "Lexical": "hello i'd loved to help you order a pizza what type of pizza would you like"
    }
  ]
}
Note: both NBest hypotheses have identical Confidence, which should not occur for different hypotheses.
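For anyone who wants to check their own responses, here is a minimal Python sketch that flags a detailed-format response whose NBest confidences look pinned. The helper names (`nbest_confidences`, `looks_pinned`) and the tolerance are my own assumptions for illustration, not part of any SDK; the sample payload is the trimmed response shown above.

```python
import json

# Constant observed in this report; used only as a known-bad marker.
KNOWN_BAD = 0.039347406


def nbest_confidences(payload):
    """Return the Confidence of each NBest entry, in response order."""
    return [entry["Confidence"] for entry in payload.get("NBest", [])]


def looks_pinned(payload, tolerance=1e-9):
    """True if every score equals the known-bad constant, or if two or
    more hypotheses share exactly one score."""
    scores = nbest_confidences(payload)
    if not scores:
        return False
    matches_known_bad = all(abs(s - KNOWN_BAD) <= tolerance for s in scores)
    identical = len(scores) > 1 and max(scores) - min(scores) <= tolerance
    return matches_known_bad or identical


# Trimmed copy of the v1 response from this report.
sample = json.loads("""
{
  "RecognitionStatus": "Success",
  "NBest": [
    {"Confidence": 0.039347406,
     "Lexical": "hello i'd love to help you order a pizza what type of pizza would you like"},
    {"Confidence": 0.039347406,
     "Lexical": "hello i'd loved to help you order a pizza what type of pizza would you like"}
  ]
}
""")

print(looks_pinned(sample))  # True
```

The check treats identical scores across different hypotheses as suspicious on its own, so it should still fire if the constant changes to some other value.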
Testing performed
| Variable | Values tested | v1 Confidence |
|---|---|---|
| Region | westus2, eastus | 0.039347406 |
| Resource type | CognitiveServices.S0 (multi-service), SpeechServices.F0 (dedicated) | 0.039347406 |
| Subscription key | key1, key2 | 0.039347406 |
| SDK version | 1.40.0, 1.49.0 | 0.039347406 |
| Client | Raw REST call via curl (no SDK) | 0.039347406 |
| SpeechConfig method | fromSubscription, fromEndpoint, fromHost | 0.039347406 |
| enableDictation | on, off | 0.039347406 |
| Recognition mode | conversation, dictation, interactive | 0.039347406 |
| Output format | Simple, Detailed | 0.039347406 |
| wordLevelTimestamps | on, off | 0.039347406 |
| profanity | masked, raw | 0.039347406 |
| lidEnabled | true, false | 0.039347406 |
| initialSilenceTimeoutMs | default, 5000 | 0.039347406 |
| storeAudio | true, false | 0.039347406 |
| Audio | Clean 24kHz 16-bit PCM TTS-generated WAV | 0.039347406 |
All combinations return the same broken confidence. The Fast Transcription API returns correct confidence (0.986) for the same audio.
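To make the region and mode rows of the table easier to re-run, the request URLs can be generated rather than typed by hand. This is only a sketch: `v1_url` and `request_matrix` are hypothetical helper names, and the region and mode lists simply mirror the values tested above. It builds URLs only and sends nothing.

```python
from itertools import product
from urllib.parse import urlencode

# Values taken from the test table above.
REGIONS = ["eastus", "westus2"]
MODES = ["conversation", "dictation", "interactive"]


def v1_url(region, mode, language="en-US", fmt="detailed"):
    """Build the v1 real-time STT URL for one region and recognition mode."""
    query = urlencode({"language": language, "format": fmt})
    return (
        f"https://{region}.stt.speech.microsoft.com"
        f"/speech/recognition/{mode}/cognitiveservices/v1?{query}"
    )


def request_matrix():
    """Return (region, mode, url) for every region/mode combination."""
    return [(r, m, v1_url(r, m)) for r, m in product(REGIONS, MODES)]


for region, mode, url in request_matrix():
    print(region, mode, url)
```

Each URL can then be passed to the same curl command from the reproduction steps, with the subscription key and audio file unchanged.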
Expected behavior
NBest Confidence should vary based on recognition quality (typically 0.8–0.97 for clear speech). Different NBest hypotheses should have different confidence values.
Actual behavior
Confidence is pinned at exactly 0.039347406 for every NBest entry, every request, across all regions and resource types tested. The value is identical for all hypotheses and does not change with audio content — short words ("yes"), phrases ("hello how are you"), and long sentences all return the same value.
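The cross-utterance symptom can be checked the same way: collect the detailed response for several different utterances and see whether the top confidence ever changes. A rough Python sketch follows. The function names are hypothetical, the "observed" payloads are trimmed to the one field that matters, and the "healthy" values are made-up numbers for contrast only, not measurements.

```python
def top_confidence(payload):
    """Highest Confidence among the NBest entries, or None if there are none."""
    scores = [entry["Confidence"] for entry in payload.get("NBest", [])]
    return max(scores) if scores else None


def confidence_varies(responses, decimals=6):
    """True if at least two utterances produced different top confidences."""
    values = set()
    for payload in responses.values():
        score = top_confidence(payload)
        if score is not None:
            values.add(round(score, decimals))
    return len(values) > 1


# Shape of what this report observed: same value for every utterance.
observed = {
    "yes": {"NBest": [{"Confidence": 0.039347406}]},
    "hello how are you": {"NBest": [{"Confidence": 0.039347406}]},
}

# Illustrative values only (assumed), showing what varying output looks like.
healthy = {
    "yes": {"NBest": [{"Confidence": 0.91}]},
    "hello how are you": {"NBest": [{"Confidence": 0.95}]},
}

print(confidence_varies(observed))  # False
print(confidence_varies(healthy))   # True
```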
Environment
- Speech SDK: 1.49.0 (also tested 1.40.0)
- Also reproduced via raw REST API (no SDK)
- Regions tested: eastus, westus2
- Resource types tested: SpeechServices F0, CognitiveServices S0
- Runtime: Node.js v20.19.6 on Ubuntu 22.04
- Date first observed: April 23, 2026
- Last known good: March 24, 2026 (confidence was 0.88 on same subscription)