The v1 real-time STT endpoint returns a constant Confidence value of 0.039347406 for all NBest results globally.

OneReachAdmin-8533 0 Reputation points
2026-04-23T13:50:50.8333333+00:00

The v1 real-time Speech-to-Text endpoint (/speech/recognition/{mode}/cognitiveservices/v1) returns a fixed Confidence value of 0.039347406 in the NBest array for every recognition result, regardless of audio quality, region, resource type, or SDK version. Text recognition is accurate — only the Confidence field is broken.

Affected endpoint

https://{region}.stt.speech.microsoft.com/speech/recognition/{mode}/cognitiveservices/v1

All modes affected: conversation, dictation, interactive
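For anyone reproducing this from Node.js rather than curl, here is a minimal sketch that builds the endpoint URL for each mode and posts a WAV buffer with the same headers as the curl command in To Reproduce. The helper names `buildRecognitionUrl` and `recognize` are hypothetical, and the sketch assumes the global `fetch` available in Node.js 18 and later.

```javascript
// Sketch only: buildRecognitionUrl and recognize are hypothetical helper names.
const MODES = ['conversation', 'dictation', 'interactive'];

// Builds the v1 real-time STT URL for a given region and recognition mode.
function buildRecognitionUrl(region, mode, language = 'en-US', format = 'detailed') {
  if (!MODES.includes(mode)) {
    throw new Error(`Unknown recognition mode: ${mode}`);
  }
  const url = new URL(
    `https://${region}.stt.speech.microsoft.com/speech/recognition/${mode}/cognitiveservices/v1`
  );
  url.searchParams.set('language', language);
  url.searchParams.set('format', format);
  return url.toString();
}

// Sends one WAV buffer to the endpoint and returns the parsed JSON body.
// Assumes the global fetch that ships with Node.js 18+.
async function recognize(url, subscriptionKey, wavBytes) {
  const response = await fetch(url, {
    method: 'POST',
    headers: {
      'Ocp-Apim-Subscription-Key': subscriptionKey,
      'Content-Type': 'audio/wav',
    },
    body: wavBytes,
  });
  if (!response.ok) {
    throw new Error(`Request failed with HTTP ${response.status}`);
  }
  return response.json();
}
```

Looping over `MODES` and calling `recognize(buildRecognitionUrl('eastus', mode), key, wavBytes)` covers all three modes with the same audio.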

To Reproduce

curl -X POST \
  'https://eastus.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1?language=en-US&format=detailed' \
  -H 'Ocp-Apim-Subscription-Key: <ANY_VALID_KEY>' \
  -H 'Content-Type: audio/wav' \
  --data-binary @clear_speech.wav

v1 Response (broken):

{
  "RecognitionStatus": "Success",
  "DisplayText": "Hello, I'd love to help you order a pizza. What type of pizza would you like?",
  "NBest": [
    {
      "Confidence": 0.039347406,
      "Lexical": "hello i'd love to help you order a pizza what type of pizza would you like"
    },
    {
      "Confidence": 0.039347406,
      "Lexical": "hello i'd loved to help you order a pizza what type of pizza would you like"
    }
  ]
}

Note: both NBest hypotheses have identical Confidence, which should not occur for different hypotheses.
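A small check for this symptom, as a sketch: it flags a detailed-format response when hypotheses with different Lexical text all share one Confidence value. The function name `summarizeConfidence` and the shape of its return value are hypothetical; the `NBest`, `Confidence`, and `Lexical` fields are taken from the response above.

```javascript
// Sketch only: summarizeConfidence is a hypothetical helper name.
// Flags a response whose distinct hypotheses all carry the same Confidence.
function summarizeConfidence(response) {
  const nbest = Array.isArray(response.NBest) ? response.NBest : [];
  const distinctConfidences = new Set(nbest.map((h) => h.Confidence));
  const distinctTexts = new Set(nbest.map((h) => h.Lexical));
  return {
    hypotheses: nbest.length,
    distinctConfidences: distinctConfidences.size,
    // Suspicious only when the texts differ but the scores do not.
    suspicious: distinctTexts.size > 1 && distinctConfidences.size === 1,
  };
}
```

Applied to the response above, this reports two hypotheses, one distinct confidence, and `suspicious: true`.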

Testing performed

| Variable | Values tested | v1 Confidence |
| --- | --- | --- |
| Region | westus2, eastus | 0.039347406 |
| Resource type | CognitiveServices.S0 (multi-service), SpeechServices.F0 (dedicated) | 0.039347406 |
| Subscription key | key1, key2 | 0.039347406 |
| SDK version | 1.40.0, 1.49.0 | 0.039347406 |
| No SDK (raw curl) | REST API directly | 0.039347406 |
| SpeechConfig method | fromSubscription, fromEndpoint, fromHost | 0.039347406 |
| enableDictation | on, off | 0.039347406 |
| Recognition mode | conversation, dictation, interactive | 0.039347406 |
| Output format | Simple, Detailed | 0.039347406 |
| wordLevelTimestamps | on, off | 0.039347406 |
| profanity | masked, raw | 0.039347406 |
| lidEnabled | true, false | 0.039347406 |
| initialSilenceTimeoutMs | default, 5000 | 0.039347406 |
| storeAudio | true, false | 0.039347406 |
| Audio | Clean 24kHz 16-bit PCM TTS-generated WAV | 0.039347406 |

All combinations return the same broken confidence. The Fast Transcription API returns correct confidence (0.986) for the same audio.

Expected behavior

NBest Confidence should vary based on recognition quality (typically 0.8–0.97 for clear speech). Different NBest hypotheses should have different confidence values.

Actual behavior

Confidence is pinned at exactly 0.039347406 for every NBest entry, every request, across all regions and resource types tested. The value is identical for all hypotheses and does not change with audio content — short words ("yes"), phrases ("hello how are you"), and long sentences all return the same value.
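To confirm the value is pinned across requests rather than within a single response, the parsed responses for several different utterances can be pooled and checked together. This is a sketch under the assumption that each response uses the detailed format shown earlier; `findPinnedConfidence` is a hypothetical helper name.

```javascript
// Sketch only: findPinnedConfidence is a hypothetical helper name.
// Returns the single Confidence value shared by every NBest entry across all
// responses, or null when the values vary or there are fewer than two entries.
function findPinnedConfidence(responses) {
  const seen = new Set();
  let total = 0;
  for (const response of responses) {
    for (const hypothesis of response.NBest ?? []) {
      seen.add(hypothesis.Confidence);
      total += 1;
    }
  }
  if (total < 2 || seen.size !== 1) {
    return null;
  }
  return [...seen][0];
}
```

A non-null result for audio as different as "yes", "hello how are you", and a long sentence indicates the score is not derived from the audio.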

Environment

  • Speech SDK: 1.49.0 (also tested 1.40.0)
  • Also reproduced via raw REST API (no SDK)
  • Regions tested: eastus, westus2
  • Resource types tested: SpeechServices F0, CognitiveServices S0
  • Runtime: Node.js v20.19.6 on Ubuntu 22.04
  • Date first observed: April 23, 2026
  • Last known good: March 24, 2026 (confidence was 0.88 on same subscription)
