Hello @Manuel Tospann,
Thank you for your detailed breakdown.
1. Custom Avatar – Cost Components
Training (one-time)
- €13.021 per compute hour
- Typical duration: 20–40 hours
Estimated cost:
- ~€260 to €520
- There is also a service-side cap (~96 hours), so costs will not exceed ~€1,250
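As a rough check of the training figures above, the arithmetic can be sketched in Python. The rate and the ~96-hour cap come from this answer; the function name and the assumption that hours beyond the cap are simply not billed are illustrative only.

```python
AVATAR_TRAINING_RATE_EUR = 13.021  # per compute hour, as quoted above
TRAINING_CAP_HOURS = 96            # approximate service-side cap quoted above


def avatar_training_cost(compute_hours: float) -> float:
    """Estimate training cost in EUR, assuming hours beyond the cap are not billed."""
    billable_hours = min(compute_hours, TRAINING_CAP_HOURS)
    return round(billable_hours * AVATAR_TRAINING_RATE_EUR, 2)


# Typical range (20 to 40 hours), the cap itself, and a run that would exceed it.
for hours in (20, 40, 96, 120):
    print(f"{hours:>3} h -> EUR {avatar_training_cost(hours):,.2f}")
```

With these inputs, the 20-hour and 40-hour cases land at about €260 and €521, and anything at or above the cap stays at about €1,250.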
Endpoint Hosting
- €0.521 per model per hour
- If kept running continuously:
- ~720 hours/month → ~€375/month
This is charged as long as the endpoint is deployed, regardless of usage. You can stop/delete the endpoint to control costs.
Real-time Avatar Usage
- €0.521 per minute (standard)
- €0.695 per minute (4K)
Billing is based on session duration, not just active speech. If a real-time session remains open, idle time may still be billed.
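To illustrate why idle time matters, here is a minimal sketch that compares the cost of a session's full open duration with the cost of the speech portion alone. It assumes the whole open duration is billed pro rata with no rounding to whole minutes; the actual billing granularity is not stated here, and the function name is hypothetical.

```python
STANDARD_RATE_EUR_PER_MIN = 0.521  # real-time avatar, standard
RATE_4K_EUR_PER_MIN = 0.695        # real-time avatar, 4K


def realtime_session_cost(session_seconds: float, use_4k: bool = False) -> float:
    """Estimate the cost of one session from its open duration (assumed pro rata)."""
    rate = RATE_4K_EUR_PER_MIN if use_4k else STANDARD_RATE_EUR_PER_MIN
    return round(session_seconds / 60 * rate, 2)


# Hypothetical session: open for 10 minutes, avatar speaking for only 3 of them.
billed_if_left_open = realtime_session_cost(10 * 60)
speech_portion_only = realtime_session_cost(3 * 60)
print(f"Open 10 min: EUR {billed_if_left_open:.2f}")
print(f"Speech only: EUR {speech_portion_only:.2f}")
print(f"Idle overhead: EUR {billed_if_left_open - speech_portion_only:.2f}")
```

Closing sessions promptly when the user is done is the simplest way to avoid the idle overhead shown in the last line.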
2. Custom Voice – Cost Components
You are correct that custom voice is a separate cost from the avatar.
Voice Training (one-time)
- €45.137 per compute hour
- Training is typically capped
Estimated cost:
- Up to ~€812 per model
Voice Endpoint Hosting
- €3.50 per hour
- If kept running continuously: ~720 hours/month → ~€2,520/month
This is typically the largest recurring cost component.
Voice Synthesis
- €20.833 per 1M characters (standard)
- €41.665 per 1M characters (Neural HD)
This is fully usage-based and depends on your workload.
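Because synthesis is priced per million characters, a quick estimate only needs the character count. The sketch below uses the two rates quoted above; the function name is hypothetical, and it assumes linear pricing with no free tier or volume discount.

```python
STANDARD_RATE_EUR_PER_MILLION = 20.833   # standard custom voice synthesis
NEURAL_HD_RATE_EUR_PER_MILLION = 41.665  # Neural HD synthesis


def synthesis_cost(characters: int, neural_hd: bool = False) -> float:
    """Estimate synthesis cost in EUR for a given number of characters."""
    rate = NEURAL_HD_RATE_EUR_PER_MILLION if neural_hd else STANDARD_RATE_EUR_PER_MILLION
    return round(characters / 1_000_000 * rate, 2)


print(f"1M chars, standard:  EUR {synthesis_cost(1_000_000):.2f}")
print(f"2M chars, Neural HD: EUR {synthesis_cost(2_000_000, neural_hd=True):.2f}")
```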
3. Combined Cost View
One-time costs
- Avatar training: ~€260–€520
- Voice training: up to ~€812
Recurring monthly costs
- Avatar hosting: ~€375
- Voice hosting: ~€2,520
- Avatar runtime: based on minutes used (e.g., 1,000 min → ~€521)
- Voice synthesis: based on characters (e.g., 2M chars HD → ~€83)
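The recurring figures above can be combined into one monthly estimate. This is a sketch only: the rates are the ones quoted in this answer, while the `MonthlyUsage` class, its field names, and the assumption that both endpoints are deployed for the same number of hours are illustrative.

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 720  # 30 days x 24 hours, as used in the estimates above


@dataclass
class MonthlyUsage:
    avatar_minutes: float                   # real-time avatar minutes (standard)
    hd_characters: int                      # characters synthesized with Neural HD
    hosting_hours: float = HOURS_PER_MONTH  # hours both endpoints stay deployed


def monthly_cost_breakdown(usage: MonthlyUsage) -> dict[str, float]:
    """Return each recurring cost component in EUR, plus their total."""
    costs = {
        "avatar_hosting": usage.hosting_hours * 0.521,
        "voice_hosting": usage.hosting_hours * 3.50,
        "avatar_runtime": usage.avatar_minutes * 0.521,
        "voice_synthesis": usage.hd_characters / 1_000_000 * 41.665,
    }
    costs = {name: round(value, 2) for name, value in costs.items()}
    costs["total"] = round(sum(costs.values()), 2)
    return costs


# The example workload above: 1,000 avatar minutes and 2M Neural HD characters.
for name, value in monthly_cost_breakdown(MonthlyUsage(1_000, 2_000_000)).items():
    print(f"{name:<16} EUR {value:>9,.2f}")
```

For that workload with always-on endpoints, the total comes to roughly €3,500 per month, most of it from voice endpoint hosting.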
4. Key Points
- Avatar and voice are billed separately; both must be included in your estimate
- Real-time avatar sessions may incur cost while open (including idle time)
- Endpoint hosting is billed continuously while deployed, even with no traffic
- Voice endpoint hosting is the primary cost driver in most scenarios
5. Cost Optimization Guidance
To manage costs effectively:
- Avoid running endpoints 24/7 unless required
- Start/stop deployments based on usage patterns
- Consider batch avatar generation if real-time interaction is not required
- Estimate usage based on:
  - Minutes of avatar interaction
  - Characters synthesized
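To see how much endpoint uptime affects the bill, the sketch below compares always-on hosting with a part-time schedule. The hourly rates are the ones quoted above; the schedule (10 hours a day on 22 working days) is a hypothetical example, and the calculation assumes hosting charges stop as soon as the endpoints are stopped or deleted.

```python
AVATAR_HOSTING_EUR_PER_HOUR = 0.521
VOICE_HOSTING_EUR_PER_HOUR = 3.50


def hosting_cost(hours_deployed: float) -> float:
    """Estimate combined avatar + voice endpoint hosting cost in EUR."""
    hourly_total = AVATAR_HOSTING_EUR_PER_HOUR + VOICE_HOSTING_EUR_PER_HOUR
    return round(hours_deployed * hourly_total, 2)


always_on = hosting_cost(720)           # 24/7 for a 30-day month
business_hours = hosting_cost(10 * 22)  # hypothetical: 10 h/day, 22 days/month

print(f"Always on:      EUR {always_on:,.2f}")
print(f"Business hours: EUR {business_hours:,.2f}")
print(f"Saving:         EUR {always_on - business_hours:,.2f}")
```

Under these assumptions, the part-time schedule cuts hosting cost by roughly two thirds, which is why controlling uptime is the first lever to consider.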
Your calculations are generally correct.
The main adjustments are around:
- Session-based billing for real-time avatar
- High recurring cost of voice hosting
- Opportunity to reduce cost by controlling endpoint uptime
Please refer to the following resources:
Speech service pricing (avatar & TTS): https://azure.microsoft.com/pricing/details/speech/
Custom avatar training time & cap: https://learn.microsoft.com/azure/ai-services/speech-service/text-to-speech?wt.mc_id=knowledgesearch_inproduct_azure-cxp-community-insider#pricing-note
Custom Neural Voice info & limits: https://learn.microsoft.com/azure/cognitive-services/speech-service/custom-neural-voice
TTS avatar feature overview & billing: https://learn.microsoft.com/azure/ai-services/speech-service/text-to-speech-avatar/what-is-text-to-speech-avatar
I hope this helps. Do let me know if you have any further queries.
If this answers your query, please click "Accept Answer" and select "Yes" for "Was this answer helpful?".
Thank you!