This article is a data-plane reference for models deployed on Foundry Local on Azure Local. It covers inference endpoint paths and methods, request and response payload shapes, authentication header options, and client request examples for chat, transcription, and predictive inference.
For platform API surface and control-plane contracts, see Foundry inference API reference for Foundry Local on Azure Local.
For authentication architecture and authorization behavior, see Authentication and authorization in Foundry Local enabled by Azure Arc.
Important
- Foundry Local is available in preview. Preview releases provide early access to features that are in active development.
- Features, approaches, and processes can change or have limited capabilities before general availability (GA).
API endpoints
Each deployed model exposes the following endpoints. Replace <base-url> with your ingress address or internal cluster URL.
| Endpoint | Method | Description |
|---|---|---|
| /health | GET | Liveness check. Returns 200 OK when the service is running. |
| /ready | GET | Readiness check. Returns 200 OK when the model is loaded and ready to serve requests. |
| /v1/model | GET | Model information. Returns metadata about the loaded model. |
| /v1/chat/completions | POST | Generative inference. Use for chat and text generation workloads. For models with tool calling capabilities, include the tool_choice field in the request payload. |
| /v1/audio/transcriptions | POST | Generative inference. Use for audio-to-text transcription with models that have automatic speech recognition capabilities (for example, Whisper). |
| /v1/predict | POST | Predictive inference. Use for ONNX-based classification, regression, and other machine learning tasks. |
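As a quick illustration of the probe endpoints, the following sketch exercises the /health pattern with curl. Because <base-url> is specific to your deployment, the sketch stands in a local python3 http.server serving a file named health; against a real deployment you would use your base URL and add an authentication header (for example, -H "api-key: $API_KEY").

```shell
# Local stand-in for the service: python3's http.server returns 200 for any
# path that maps to an existing file, so a file named "health" lets us
# exercise the /health probe pattern end to end.
mkdir -p /tmp/health-demo && cd /tmp/health-demo
echo ok > health
python3 -m http.server 8099 >/dev/null 2>&1 &
SERVER_PID=$!
sleep 1

# Probe: a 200 status code means the service is up. Against a real
# deployment, the URL would be https://<base-url>/health.
STATUS=$(curl -s -o /dev/null -w '%{http_code}' http://127.0.0.1:8099/health)
echo "liveness status: $STATUS"

kill $SERVER_PID
```

The same pattern applies to /ready; the difference is that /ready returns 200 only after the model finishes loading, so it's the appropriate probe to gate inference traffic.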
Authentication
All endpoints require authentication. The platform supports two methods: API key authentication and Microsoft Entra ID JSON Web Token (JWT) authentication. Include your credential in the request by using one of these header formats:
| Header format | Example |
|---|---|
| Bearer token (standard) | Authorization: Bearer <api-key> |
| api-key header (OpenAI-compatible) | api-key: <api-key> |
| Entra ID JWT (enterprise) | Authorization: Bearer <jwt-token> |
For API keys sent in either the Authorization: Bearer or api-key format, the application-layer authentication middleware validates the key against the deployment's primary and secondary keys and rejects invalid keys with 401 Unauthorized.
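The accept/reject decision can be pictured with a small illustrative function. This is not the platform's actual code, and the key values are placeholders: the point is that a request is accepted when its key matches either the primary or the secondary key.

```shell
# Illustrative only: mimics the key check described above.
# Key values are placeholders, not a real deployment's keys.
PRIMARY_KEY="primary-key-placeholder"
SECONDARY_KEY="secondary-key-placeholder"

check_api_key() {
  if [ "$1" = "$PRIMARY_KEY" ] || [ "$1" = "$SECONDARY_KEY" ]; then
    echo "200 OK"
  else
    echo "401 Unauthorized"
  fi
}

check_api_key "secondary-key-placeholder"   # accepted: matches the secondary key
check_api_key "some-other-key"              # rejected: matches neither key
```

Accepting either key is what typically enables key rotation without downtime: clients can move to a regenerated key while the other key remains valid.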
To use Microsoft Entra ID authentication, acquire a JWT and send it in the Authorization: Bearer header.
For token acquisition steps, see Run inference on Foundry Local on Azure Local.
For JWT validation, API key detection, and Azure RBAC authorization behavior, see Authentication and authorization in Foundry Local enabled by Azure Arc.
Generative inference request examples
The /v1/chat/completions endpoint follows OpenAI Chat Completions conventions.
Authorization: Bearer
The following example authenticates by using an API key in a standard Bearer token header.
curl -X POST https://<your-domain>/phi-3.5-gpu/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $API_KEY" \
-d '{
"model": "Phi-3.5-mini-instruct-cuda-gpu:1",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital/major city of France?"}
],
"temperature": 0.7,
"max_tokens": 100
}'
api-key
Use this format for OpenAI-compatible clients that send the key in an api-key header.
curl -X POST https://<your-domain>/phi-3.5-gpu/v1/chat/completions \
-H "Content-Type: application/json" \
-H "api-key: $API_KEY" \
-d '{
"model": "Phi-3.5-mini-instruct-cuda-gpu:1",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital/major city of France?"}
],
"temperature": 0.7,
"max_tokens": 100
}'
Authorization: Bearer (Entra ID JWT)
For enterprise scenarios, you can authenticate by using a Microsoft Entra ID JSON Web Token instead of an API key.
curl -X POST https://<your-domain>/phi-3.5-gpu/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $JWT_TOKEN" \
-d '{
"model": "Phi-3.5-mini-instruct-cuda-gpu:1",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital/major city of France?"}
],
"temperature": 0.7,
"max_tokens": 100
}'
Generative response
The following example shows the response shape from a successful chat completion request.
{
"model": "Phi-3.5-mini-instruct-cuda-gpu:1",
"choices": [{
"message": {
"role": "assistant",
"content": "The capital/major city of France is Paris."
},
"index": 0,
"finish_reason": "stop"
}],
"object": "chat.completion"
}
Predictive inference request examples
The /v1/predict endpoint accepts ONNX model inputs. The exact payload structure depends on your model's input schema.
Image input (base64-encoded)
Convert your image to base64 by using the following command:
BASE64_IMAGE=$(base64 -w 0 <PATH_TO_IMAGE_FILE>)
curl -k -X POST "https://<URL>/v1/predict" \
-H "Content-Type: application/json" \
-H "X-API-KEY: $API_KEY" \
-d "{
\"items\": [{
\"content_type\": \"image/jpeg\",
\"encoder\": \"base64\",
\"data\": \"$BASE64_IMAGE\"
}]
}"
Note
The -k flag (curl) and -SkipCertificateCheck (PowerShell) skip certificate validation for self-signed certificates. In production, configure proper TLS certificates.