Choose how to handle speech (preview)

[This article is prerelease documentation and is subject to change.]

After you choose conversation control, you need to make another decision for your voice agent: the speech architecture.

Important

  • This is a preview feature.
  • Preview features aren’t meant for production use and might have restricted functionality. These features are subject to supplemental terms of use, and are available before an official release so that customers can get early access and provide feedback.

Pattern 1: Basic voice mode

Speech > Text > NLU/NLU+ > Classic orchestration > Speech

In this pattern, the caller's speech is transcribed first, then Copilot Studio dialog flows process the text. Finally, the text is converted back to speech.
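The cascaded flow above can be sketched as a chain of discrete steps. This is an illustrative sketch only: every function here is a hypothetical stand-in (with simulated transcripts and responses), not a Copilot Studio or Azure API.

```python
# Sketch of the basic voice mode pipeline:
# Speech > Text > NLU > Classic orchestration > Speech
# All functions are hypothetical stand-ins with simulated behavior.

def speech_to_text(audio: bytes) -> str:
    """Stand-in STT step: transcribe the caller's audio to text."""
    return "check my order status"  # simulated transcript

def classify_intent(transcript: str) -> str:
    """Stand-in NLU step: map the transcript to a dialog topic."""
    intents = {"order": "OrderStatus", "billing": "Billing"}
    for keyword, topic in intents.items():
        if keyword in transcript:
            return topic
    return "Fallback"

def run_dialog(topic: str) -> str:
    """Stand-in classic orchestration: a deterministic topic flow returns text."""
    responses = {"OrderStatus": "Your order shipped yesterday."}
    return responses.get(topic, "Sorry, I didn't catch that.")

def text_to_speech(text: str) -> bytes:
    """Stand-in TTS step: synthesize the reply audio."""
    return text.encode("utf-8")  # simulated audio payload

def handle_turn(audio: bytes) -> bytes:
    """One conversational turn through the full cascade."""
    transcript = speech_to_text(audio)
    topic = classify_intent(transcript)
    reply = run_dialog(topic)
    return text_to_speech(reply)
```

Because each stage is a separate, explicit step, this is where the pattern's strengths come from: you can swap in a custom or neural voice at the TTS stage, tune recognition at the STT stage, and keep the dialog flow fully deterministic.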

Use this pattern when

  • You're using a fully classic, deterministic flow.

  • Cost minimization is critical.

  • You need a custom or neural voice.

  • You need fine-grained control over speech recognition.

  • You're working with DTMF-heavy flows.

Tradeoffs

  • Works with classic orchestration only; hybrid and generative orchestration aren't supported.

  • Supporting multilingual and mixed-language input takes more work. It requires language detection, language-specific prompts and grammar, Speech-to-Text (STT) locale setup, and fallback handling.
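To make the last tradeoff concrete, the extra multilingual plumbing can be sketched as follows. The locale names follow the common BCP 47 convention, and the detection logic is a simulated stand-in; a real system would use a language-identification model and per-locale STT configuration.

```python
# Hypothetical sketch of the multilingual plumbing basic voice mode requires:
# language detection, STT locale setup, language-specific prompts, and fallback.

SUPPORTED_STT_LOCALES = {"en-US", "fr-FR", "de-DE"}  # assumed configured locales
DEFAULT_LOCALE = "en-US"

def detect_language(transcript: str) -> str:
    """Stand-in language detection (a real system would use an ID model)."""
    if any(word in transcript for word in ("bonjour", "merci")):
        return "fr-FR"
    return "en-US"

def resolve_locale(detected: str) -> str:
    """Fallback handling: revert to the default when a locale isn't configured."""
    return detected if detected in SUPPORTED_STT_LOCALES else DEFAULT_LOCALE

def pick_prompt(locale: str) -> str:
    """Language-specific prompts must be authored and maintained per locale."""
    prompts = {"en-US": "How can I help?", "fr-FR": "Comment puis-je aider ?"}
    return prompts.get(locale, prompts[DEFAULT_LOCALE])
```

Each of these pieces is something you build and maintain yourself in this pattern, whereas the streaming pattern described next handles language switching natively.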

Important

Basic voice mode isn't just a "voice model choice." It fundamentally constrains orchestration.

Pattern 2: Streaming mode

Speech > AI model > Speech

In this pattern, a single language model processes audio end to end, natively handling both audio input and output. There's no separate STT or Text-to-Speech (TTS) step. The model receives the caller's audio stream directly and returns a synthesized audio response in real time.

This architecture uses a tightly integrated, real-time model pipeline to deliver ultra-low latency, natural conversation flow, and simpler deployment. It works best when speed and natural conversation are top priorities, such as high-volume customer interactions in well-supported languages and regions. However, it offers a limited number of available voices and limited customization options.

Key benefit: Ultra-low latency, natural conversational turn-taking.
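The streaming shape can be sketched as a single model that consumes and produces audio chunks directly, with no intermediate transcript. The model class below is a hypothetical stand-in, not a real Copilot Studio or Azure API; its echo-style replies only simulate the streaming contract.

```python
# Sketch of the streaming pattern: Speech > AI model > Speech.
# One stand-in model handles audio in and audio out; there is no STT or TTS stage.

from typing import Iterator

class SpeechToSpeechModel:
    """Hypothetical stand-in for an end-to-end real-time speech model."""

    def stream(self, audio_chunks: Iterator[bytes]) -> Iterator[bytes]:
        for chunk in audio_chunks:
            # A real model starts synthesizing a reply while the caller is
            # still speaking, which is where the low latency and natural
            # turn-taking come from. Here we just echo a simulated reply.
            yield b"reply:" + chunk

def caller_audio() -> Iterator[bytes]:
    """Simulated incoming audio stream from the telephony channel."""
    yield from (b"chunk1", b"chunk2")

model = SpeechToSpeechModel()
responses = list(model.stream(caller_audio()))
```

Note what's absent: there is no transcript to inspect between input and output, which is why this pattern offers fewer customization points than the cascaded one.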

Use this pattern when

  • Conversational naturalness and enhanced prosody are a top priority.

  • The business wants a premium conversational experience.

  • Superior handling of multilingual and mixed-language input is required, including seamless language switching.

  • Contextual understanding matters (tone, intent, and conversational nuance), reducing reliance on explicit translation layers.

  • Low-latency, real-time responsiveness is essential to the experience.

  • The team is ready to invest in testing, tuning, evaluation, and guardrails.

Tradeoffs

  • Fewer customization points.

  • Limited voice options.

  • Strong dependency on prompt quality.

  • Pricing and model choice matter more.

  • The real-time speech model limits reasoning depth compared with text language model orchestration, which gives you the flexibility to use higher-capacity models or specialized agents for complex reasoning when needed.