
How to estimate pricing (training + compute) for a custom TTS avatar?

Manuel Tospann 336 Reputation points
2026-04-17T07:43:02.3566667+00:00

We want to create a custom avatar in Speech Studio.

Of course, we need to estimate the costs for this project.

We need the custom avatar + custom voice.

Azure Pricing (https://azure.microsoft.com/en-us/pricing/details/speech/) is a bit confusing.

For the custom avatar in West Europe, I see the following costs.

Custom:
• Avatar model training: €13.021 per compute hour
• Interactive avatar (real-time): €0.521 per minute
• Interactive 4K avatar (real-time): €0.695 per minute
• Avatar video (batch): €1.737 per minute
• 4K avatar video (batch): €2.344 per minute
• Endpoint hosting: €0.521 per model per hour

According to MS Learn (https://learn.microsoft.com/en-us/azure/ai-services/speech-service/text-to-speech-avatar/custom-avatar-create?pivots=ai-foundry-portal#:~:text=It%20normally%20takes-,20%2D40,-compute%20hours%20on) training takes 20 to 40 hours.

Avatar model training: €260.42 to €520.84

Endpoint hosting: 720h per month = €375.12

Then we have €0.521 per minute for the real-time avatar. This doesn't include idle time, does it?

Do we have to add custom voice to it? I assume that the avatar is just the visual model.

Professional Voice:
• Synthesis (real-time and batch): €20.833 per 1M characters
• Synthesis (neural HD real-time and batch): €41.665 per 1M characters
• Voice model training: €45.137 per compute hour, up to €812.465 per training
• Endpoint hosting: €3.50 per model per hour

Then we'd have to add up to €812.465 for training, €2,520 for endpoint hosting and synthesis (x * €41.665 per 1M characters).

Does this add up? Did I miss anything?

Thanks in advance for your help.

Azure Speech in Foundry Tools

2 answers

  1. SRILAKSHMI C 18,035 Reputation points Microsoft External Staff Moderator
    2026-04-17T14:29:32.0833333+00:00

    Hello @Manuel Tospann,

    Thank you for your detailed breakdown.

    1. Custom Avatar – Cost Components

    Training (one-time)

    • €13.021 per compute hour
    • Typical duration: 20–40 hours

    Estimated cost:

    • ~€260 to €520
    • Billing is also capped on the service side at 96 compute hours per training, so the training charge will not exceed ~€1,250 (96 × €13.021)
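
The training estimate above can be reproduced with a few lines of Python. This is only a sketch: the rate and the 96-hour cap are the figures quoted in this thread, and the function name `avatar_training_cost` is made up for illustration. Check the pricing page for current values.

```python
# Figures quoted in this thread (West Europe); verify against the pricing page.
AVATAR_TRAINING_RATE_EUR = 13.021  # per compute hour
BILLING_CAP_HOURS = 96             # billed hours are capped per training


def avatar_training_cost(compute_hours: float) -> float:
    """Return the billed training cost in EUR after applying the hour cap."""
    billed_hours = min(compute_hours, BILLING_CAP_HOURS)
    return billed_hours * AVATAR_TRAINING_RATE_EUR


# Typical range (20 and 40 hours) plus a run that exceeds the cap.
for hours in (20, 40, 120):
    print(f"{hours:>3} h -> EUR {avatar_training_cost(hours):,.2f}")
```

A 120-hour run is billed as 96 hours, which is where the ~€1,250 ceiling comes from.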

    Endpoint Hosting

    • €0.521 per model per hour
    • If kept running continuously:
      • ~720 hours/month → ~€375/month

    This is charged as long as the endpoint is deployed, regardless of usage. You can stop/delete the endpoint to control costs.

    Real-time Avatar Usage

    • €0.521 per minute (standard)
    • €0.695 per minute (4K)

    Billing is based on session duration, not just active speech. If a real-time session remains open, idle time may still be billed.
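
To see why session length matters more than speaking time, here is a small sketch. The 10-minute session with 3 minutes of speech is a hypothetical example, and `realtime_session_cost` is an illustrative helper, not part of any SDK.

```python
REALTIME_RATE_EUR_PER_MIN = 0.521  # standard real-time avatar, as quoted in the question


def realtime_session_cost(session_seconds: float,
                          rate_per_min: float = REALTIME_RATE_EUR_PER_MIN) -> float:
    """Cost of one real-time session, based on how long the session stays open."""
    return session_seconds / 60 * rate_per_min


# Hypothetical session: open for 10 minutes, avatar speaks for only 3 of them.
billed = realtime_session_cost(10 * 60)
speech_only = realtime_session_cost(3 * 60)
print(f"Billed for the open session: EUR {billed:.2f}")
print(f"If only speech were billed:  EUR {speech_only:.2f}")
```

Closing sessions as soon as a conversation ends keeps the billed duration close to the time that is actually useful.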

    2. Custom Voice – Cost Components

    You are correct that custom voice is a separate cost from the avatar.

    Voice Training (one-time)

    • €45.137 per compute hour
    • Capped at €812.465 per training (the maximum quoted in your question)

    Estimated cost: up to ~€812 per training

    Voice Endpoint Hosting

    • €3.50 per hour
    • If running continuously ~720 hours/month → ~€2,520/month

    This is typically the largest recurring cost component.

    Voice Synthesis

    • €20.833 per 1M characters (standard)
    • €41.665 per 1M characters (Neural HD)

    This is fully usage-based and depends on your workload.

    3. Combined Cost View

    One-time costs

    • Avatar training: ~€260–€520
    • Voice training: up to ~€812

    Recurring monthly costs

    • Avatar hosting: ~€375
    • Voice hosting: ~€2,520
    • Avatar runtime: based on minutes used (e.g., 1,000 min → ~€521)
    • Voice synthesis: based on characters (e.g., 2M chars HD → ~€83)
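
The combined view can be checked with a short calculation. This is a sketch using the West Europe rates quoted in the question and the example workload above (1,000 real-time minutes, 2M HD characters, both endpoints hosted for 720 hours); the function names are illustrative only.

```python
HOURS_PER_MONTH = 720  # 30 days x 24 h, the approximation used in this thread

# Rates in EUR, copied from the question; verify against the pricing page.
AVATAR_TRAINING_PER_HOUR = 13.021
AVATAR_HOSTING_PER_HOUR = 0.521
AVATAR_REALTIME_PER_MIN = 0.521
VOICE_TRAINING_MAX = 812.465
VOICE_HOSTING_PER_HOUR = 3.50
VOICE_HD_PER_MILLION_CHARS = 41.665


def one_time_cost(avatar_training_hours: float) -> float:
    """Avatar training plus the worst-case (capped) voice training charge."""
    return avatar_training_hours * AVATAR_TRAINING_PER_HOUR + VOICE_TRAINING_MAX


def monthly_cost(realtime_minutes: float, hd_characters: int) -> float:
    """Both endpoints hosted all month, plus usage-based avatar and voice charges."""
    hosting = HOURS_PER_MONTH * (AVATAR_HOSTING_PER_HOUR + VOICE_HOSTING_PER_HOUR)
    runtime = realtime_minutes * AVATAR_REALTIME_PER_MIN
    synthesis = hd_characters / 1_000_000 * VOICE_HD_PER_MILLION_CHARS
    return hosting + runtime + synthesis


print(f"One-time: EUR {one_time_cost(20):,.2f} to EUR {one_time_cost(40):,.2f}")
print(f"Monthly:  EUR {monthly_cost(1_000, 2_000_000):,.2f}")
```

With these inputs the one-time cost lands between roughly €1,073 and €1,333, and the recurring cost is about €3,499 per month, most of it hosting.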

    4. Key Points

    • Avatar and voice are billed separately; both must be included in your estimate
    • Real-time avatar sessions may incur cost while open (including idle time)
    • Endpoint hosting is billed continuously while deployed, even with no traffic
    • Voice endpoint hosting is the primary cost driver in most scenarios

    5. Cost Optimization Guidance

    To manage costs effectively:

    • Avoid running endpoints 24/7 unless required
    • Start/stop deployments based on usage patterns
    • Consider batch avatar generation if real-time interaction is not required
    • Estimate usage based on:
      • Minutes of avatar interaction
      • Characters synthesized
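
As a rough illustration of how much endpoint uptime matters, the sketch below compares always-on hosting with a business-hours schedule. The schedule (10 hours a day, 22 days a month) is a hypothetical assumption, the rates are those quoted in the question, and the time needed to redeploy an endpoint is not modeled.

```python
AVATAR_HOSTING_PER_HOUR = 0.521  # EUR, as quoted in the question
VOICE_HOSTING_PER_HOUR = 3.50    # EUR, as quoted in the question


def monthly_hosting_cost(hours_per_day: float, days_per_month: int) -> float:
    """Hosting cost for both endpoints when they share the same uptime schedule."""
    hosted_hours = hours_per_day * days_per_month
    return hosted_hours * (AVATAR_HOSTING_PER_HOUR + VOICE_HOSTING_PER_HOUR)


always_on = monthly_hosting_cost(24, 30)       # 720 hosted hours
business_hours = monthly_hosting_cost(10, 22)  # 220 hosted hours (assumed schedule)

print(f"Always on:      EUR {always_on:,.2f}")
print(f"Business hours: EUR {business_hours:,.2f}")
print(f"Saving:         EUR {always_on - business_hours:,.2f}")
```

Under these assumptions hosting drops from about €2,895 to about €885 per month, before any usage charges.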

    Your calculations are generally correct.

    The main adjustments are around:

    • Session-based billing for real-time avatar
    • High recurring cost of voice hosting
    • Opportunity to reduce cost by controlling endpoint uptime

    Please refer to the following:

    Speech service pricing (avatar & TTS): https://azure.microsoft.com/pricing/details/speech/

    Custom avatar training time & cap: https://learn.microsoft.com/azure/ai-services/speech-service/text-to-speech?wt.mc_id=knowledgesearch_inproduct_azure-cxp-community-insider#pricing-note

    Custom Neural Voice info & limits: https://learn.microsoft.com/azure/cognitive-services/speech-service/custom-neural-voice

    TTS avatar feature overview & billing: https://learn.microsoft.com/azure/ai-services/speech-service/text-to-speech-avatar/what-is-text-to-speech-avatar

    I hope this helps. Let me know if you have any further questions.


    If this answers your query, please click Accept Answer and select Yes for "Was this answer helpful".

    Thank you!


  2. Q&A Assist
    2026-04-17T07:43:24.57+00:00
    AI generated content. This question contains an answer created with AI from Q&A Assist.

    For a custom text-to-speech avatar with a custom voice, the cost components in the documentation break down as follows.

    1. Avatar model training
      • Training is billed per compute hour.
      • Training time is typically 20–40 compute hours for a custom avatar.
      • Billing is per second with a cap of 96 compute hours per training job. If training takes more than 96 hours, only 96 hours are billed.
      • Formula:
        Avatar training cost ≈ (avatar training rate per compute hour) × (20–40 hours, capped at 96)
    2. Avatar hosting (endpoint)
      There are two related hosting concepts in the docs:
      • Avatar endpoint hosting (for the avatar model):
        • “Avatar hosting is billed per second per endpoint. You can suspend your endpoint to save costs. If you want to suspend your endpoint, you can delete it directly. To use it again, redeploy the endpoint.”
        • Effectively, this is the “endpoint hosting: €X per model per hour” line in the pricing page.
        • Hosting time is measured as actual running time; if the endpoint is not hosted, it is not billed.
      • Custom voice endpoint hosting (for the voice model):
        • “Custom voice endpoint hosting is measured by the actual time (hour). The hosting time (hours) for each endpoint is calculated at 00:00 UTC every day for the previous 24 hours… If the endpoint isn't currently hosted, it isn't billed.”
      If using both a custom avatar and a custom voice, there are two endpoints to consider:
      • Avatar endpoint (avatar hosting rate).
      • Custom voice endpoint (custom voice hosting rate).
    3. Avatar usage (real-time vs batch)
      • For text-to-speech avatar:
        • “When you use the text-to-speech avatar feature, charges are billed per second based on the length of video output. However, for the real-time avatar, charges are billed per second based on the time when the avatar is active, regardless of whether it's speaking or remaining silent.”
      • This means:
        • Batch avatar video: billed on the duration of the generated video.
        • Real-time avatar: billed on wall-clock active time of the avatar session, including idle/silent time. The “€0.521 per minute” rate applies to the full active time, not just speaking time.
    4. Custom voice (Professional/Personal voice)
      For a custom voice used with the avatar:
      • Voice model training:
        • “Voice sync for avatar (via custom avatar training) is charged the same as a personal voice in terms of voice creation and synthesis. The storage of the voice is free.”
        • Custom voice training is billed per compute hour, with a maximum cap per training as shown on the pricing page.
      • Voice endpoint hosting:
        • As above, billed per actual hosted hour for the custom voice endpoint.
      • Synthesis:
        • “When you use the personal voice feature, you're billed for both profile storage and synthesis.
          Synthesis: Billed per character.”
        • For Professional/Personal voice, the pricing page gives a per 1M characters rate.
        • This synthesis charge is in addition to avatar video charges. The avatar is the visual layer; the voice synthesis is a separate meter.
    5. Putting it together for estimation
      For a solution with custom avatar + custom voice, the cost components are:
      1. Avatar training:
        • (avatar training rate per compute hour) × (20–40 hours, capped at 96).
      2. Avatar hosting:
        • (avatar endpoint hosting rate per hour) × (hours the avatar endpoint is hosted).
      3. Avatar usage:
        • Real-time: (real-time avatar rate per second/minute) × (total active session time, including idle).
        • Batch: (batch avatar video rate per second/minute) × (total rendered video duration).
      4. Custom voice training:
        • (custom voice training rate per compute hour) × (training hours, up to the documented cap per training).
      5. Custom voice hosting:
        • (custom voice endpoint hosting rate per hour) × (hours the voice endpoint is hosted).
      6. Custom voice synthesis:
        • (synthesis rate per 1M characters) × (total characters synthesized).
      The documentation confirms:
      • Avatar training and hosting are separate from voice training, hosting, and synthesis.
      • Real-time avatar billing includes idle time while the avatar is active.
      • Voice sync for avatar is billed like personal voice (voice creation + synthesis), and voice storage itself is free.
      The numeric examples in the question follow the correct structure: avatar training + avatar hosting + avatar real-time/batch usage plus custom voice training + custom voice hosting + custom voice synthesis. The exact euro amounts should be computed directly from the current pricing page using the formulas above.
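
The six formulas above can be collected into one small estimator. This is a sketch under stated assumptions: the `Rates` class, the `estimate` function, and all field names are illustrative and not part of any Azure SDK, the 96-hour cap is the one described in item 1, and the rates must be filled in from the current pricing page.

```python
from dataclasses import dataclass

AVATAR_TRAINING_CAP_HOURS = 96  # billing cap per avatar training job, per item 1


@dataclass
class Rates:
    """Per-unit prices; copy the values from the pricing page for your region."""
    avatar_training_per_hour: float
    avatar_hosting_per_hour: float
    avatar_realtime_per_minute: float
    avatar_batch_per_minute: float
    voice_training_per_hour: float
    voice_training_max: float  # maximum charge per voice training
    voice_hosting_per_hour: float
    voice_synthesis_per_million_chars: float


def estimate(rates: Rates, *, avatar_training_hours: float, avatar_hosted_hours: float,
             realtime_active_seconds: float, batch_video_seconds: float,
             voice_training_hours: float, voice_hosted_hours: float,
             characters: int) -> dict[str, float]:
    """Return one cost line per meter; usage is taken in seconds (per-second billing)."""
    return {
        "avatar_training": min(avatar_training_hours, AVATAR_TRAINING_CAP_HOURS)
        * rates.avatar_training_per_hour,
        "avatar_hosting": avatar_hosted_hours * rates.avatar_hosting_per_hour,
        # Real-time time counts the whole active session, including silence.
        "avatar_realtime": realtime_active_seconds / 60 * rates.avatar_realtime_per_minute,
        "avatar_batch": batch_video_seconds / 60 * rates.avatar_batch_per_minute,
        "voice_training": min(voice_training_hours * rates.voice_training_per_hour,
                              rates.voice_training_max),
        "voice_hosting": voice_hosted_hours * rates.voice_hosting_per_hour,
        "voice_synthesis": characters / 1_000_000 * rates.voice_synthesis_per_million_chars,
    }


# Example with the West Europe figures quoted in the question.
west_europe = Rates(13.021, 0.521, 0.521, 1.737, 45.137, 812.465, 3.50, 41.665)
costs = estimate(west_europe, avatar_training_hours=30, avatar_hosted_hours=720,
                 realtime_active_seconds=60_000, batch_video_seconds=0,
                 voice_training_hours=25, voice_hosted_hours=720, characters=2_000_000)
for meter, amount in costs.items():
    print(f"{meter:<16} EUR {amount:>10,.2f}")
print(f"{'total':<16} EUR {sum(costs.values()):>10,.2f}")
```

Keeping each meter as a separate line makes it easy to see which component dominates and to swap in updated rates when the pricing page changes.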
