Hello Datou!
Thank you for posting on MS Learn Q&A.
I think you had an issue posting the question on the platform, so I will try to answer your questions here.
So for your first question: Azure Speech does use a WebSocket v2 endpoint internally, but input text streaming for TTS is not exposed as a supported raw WebSocket protocol.
The feature is only available through the Speech SDK, and there is no official WebSocket protocol documentation for implementing it directly without the SDK.
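For reference, the SDK path looks roughly like this. Treat it as a sketch: the class names match the current Speech SDK text-streaming docs, but verify them against your installed azure-cognitiveservices-speech version, and note that the voice name and environment variable names are placeholders.

```python
import os
import azure.cognitiveservices.speech as speechsdk

# Text streaming requires the websocket v2 endpoint to be set explicitly.
speech_config = speechsdk.SpeechConfig(
    endpoint=f"wss://{os.environ['SPEECH_REGION']}.tts.speech.microsoft.com/"
             "cognitiveservices/websocket/v2",
    subscription=os.environ["SPEECH_KEY"],
)
speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)

# Open a text-stream request and start synthesis before all text is known.
tts_request = speechsdk.SpeechSynthesisRequest(
    input_type=speechsdk.SpeechSynthesisRequestInputType.TextStream
)
tts_task = synthesizer.speak_async(tts_request)

for chunk in ["Hello ", "from ", "text ", "streaming."]:
    tts_request.input_stream.write(chunk)  # feed text as it arrives
tts_request.input_stream.close()  # signal end of input

result = tts_task.get()  # block until synthesis finishes
```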
If you want a non-SDK path for TTS, Azure does publish a REST TTS API, but that is a different interface and does not give you the same SDK-managed text-streaming behavior.
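If the REST path is enough for your scenario, a minimal sketch looks like this, assuming the standard regional TTS endpoint, a key and region in the SPEECH_KEY/SPEECH_REGION environment variables, and the requests package; the voice and output format are just examples.

```python
import os
import requests

region = os.environ["SPEECH_REGION"]
url = f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1"

# The body is SSML; the whole text is sent up front (no input streaming).
ssml = (
    "<speak version='1.0' xml:lang='en-US'>"
    "<voice name='en-US-JennyNeural'>Hello from the REST API.</voice>"
    "</speak>"
)

resp = requests.post(
    url,
    headers={
        "Ocp-Apim-Subscription-Key": os.environ["SPEECH_KEY"],
        "Content-Type": "application/ssml+xml",
        "X-Microsoft-OutputFormat": "audio-16khz-32kbitrate-mono-mp3",
    },
    data=ssml.encode("utf-8"),
)
resp.raise_for_status()

with open("out.mp3", "wb") as f:
    f.write(resp.content)
```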
Azure also has a newer Voice Live API in preview, which is a documented WebSocket API for real-time voice scenarios, including TTS and bidirectional communication, but that is not the same thing as using the Speech SDK text-streaming feature directly over the old Speech WebSocket protocol. It is a separate preview API surface.
For your Python 3.13 question, what I can say from the current release notes is that the Speech SDK recently shipped Python-specific fixes:
https://learn.microsoft.com/en-us/azure/ai-services/speech-service/releasenotes
So the most practical path is to upgrade azure-cognitiveservices-speech to the latest available SDK first, because there were recent Python speech-synthesis leak fixes, and then test on Python 3.12 or 3.11.
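For example, after running pip install --upgrade azure-cognitiveservices-speech, you can confirm which version your app is actually loading:

```python
import azure.cognitiveservices.speech as speechsdk

# Confirm the upgraded SDK version is the one your environment imports.
print(speechsdk.__version__)
```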
I noticed that in your code, you create long-lived native SDK objects and let GC eventually clean them up. With C-backed SDKs, that is exactly where hangs often appear.
In your code, I would change the lifecycle like this:
- Create the synthesizer once.
- On shutdown, explicitly: close the input stream, wait for tts_task.get() to finish or time out, disconnect callbacks if possible, cancel listen_completed_task, and drop references (self.tts_request = None, self.tts_task = None, self.speech_synthesizer = None) before the request scope ends.
- Run the Speech SDK work in a dedicated worker thread or separate process, not mixed into the main FastAPI event loop lifecycle (see the sketch after this list).
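On that last point, one simple way to keep blocking native SDK calls off the event loop is a single dedicated executor. This is a hypothetical helper of my own, not part of the SDK:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# One dedicated thread for all blocking Speech SDK calls, so they never
# run on (or block) the FastAPI event loop.
_speech_executor = ThreadPoolExecutor(max_workers=1, thread_name_prefix="speech-sdk")

async def run_speech(fn, *args):
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(_speech_executor, fn, *args)
```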
The most important issue I see in your current code is that close() does not guarantee a deterministic teardown of the native SDK objects. It calls stop_speaking_async, but it does not clearly wait for all callbacks to drain before those objects become collectible, and that makes the GC path much more likely to bite you.
I did a simple test and it is working.
Would love to see your feedback.
```python
# Assumes module-level imports:
#   import asyncio, logging
#   from fastapi.concurrency import run_in_threadpool
#   logger = logging.getLogger(__name__)
async def close(self):
    self.listening = False

    # 1. Close the SDK input stream so the synthesizer knows no more text is coming.
    async with self._azure_input_stream_ops_lock:
        if not self._azure_input_stream_closed and self.tts_request:
            await run_in_threadpool(self.tts_request.input_stream.close)
            self._azure_input_stream_closed = True

    # 2. Wait (with a bound) for the synthesis task so native callbacks can drain.
    if self.tts_task:
        try:
            await asyncio.wait_for(run_in_threadpool(self.tts_task.get), timeout=10)
        except asyncio.TimeoutError:
            logger.warning("tts_task.get timed out")
        except Exception:
            logger.exception("tts_task.get failed")

    # 3. Stop the synthesizer after stream/task completion. stop_speaking_async
    #    returns a future, so wait on it with .get() to make the stop deterministic.
    if self.speech_synthesizer:
        try:
            await run_in_threadpool(
                lambda: self.speech_synthesizer.stop_speaking_async().get()
            )
        except Exception:
            logger.exception("stop_speaking_async failed")

    # 4. Cancel the asyncio task that waits for completion events.
    if self.listen_completed_task:
        self.listen_completed_task.cancel()
        try:
            await self.listen_completed_task
        except asyncio.CancelledError:
            pass
        except Exception:
            logger.exception("listen_completed_task failed")

    # 5. Drop references explicitly so teardown does not depend on GC timing.
    self.tts_request = None
    self.tts_task = None
    self.speech_synthesizer = None
```