Foundry Local on Azure Local supports two runtimes for generative inference: ONNX Runtime and vLLM. Each runtime is optimized for different scenarios, and the model you choose determines which runtime is used. The selected runtime affects hardware requirements, model format, and performance behavior. This article explains how runtime selection works and when each runtime is the better fit.
Important
- Foundry Local is available in preview. Preview releases provide early access to features that are in active development.
- Features, approaches, and processes can change or have limited capabilities before general availability (GA).
How the runtime is selected
The model you choose determines the runtime. Each model in the Foundry catalog includes a framework field that specifies which runtime it uses. If the same model is available for both runtimes, it appears as two separate entries in the catalog, each with its own alias and framework.
For example, a model might appear as:
| Alias | Device | Framework | Runtime used |
|---|---|---|---|
| Phi-4-generic-cpu | CPU | ONNX | ONNX Runtime |
| Phi-4-cuda-gpu | GPU | ONNX | ONNX Runtime |
| Phi-4 | GPU | vllm | vLLM |
When you deploy a model, the operator reads the framework from the catalog and automatically selects the correct container image and configuration. You don't need to set the runtime manually for catalog models.
For custom (BYO) models, set the runtime field on the ModelDeployment spec to specify which engine to use. The default is onnx-genai.
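For custom models, the runtime is set directly on the ModelDeployment resource. The following minimal sketch creates such a resource with the Kubernetes Python client. The API group, version, plural, and every spec field other than runtime are placeholders, not the documented schema; check the ModelDeployment reference for the real field names.

```python
# Minimal sketch: deploy a custom (BYO) model with an explicit runtime.
# The API group/version, plural, and spec fields other than `runtime` are
# assumptions; consult the ModelDeployment reference for the real schema.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster

model_deployment = {
    "apiVersion": "example.azure.com/v1",   # placeholder API group/version
    "kind": "ModelDeployment",
    "metadata": {"name": "my-byo-model", "namespace": "foundry-local"},
    "spec": {
        "runtime": "vllm",  # value shown is assumed; "onnx-genai" is the documented default
        "model": {"source": "<your-registry>/<your-model>"},  # placeholder model source
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="example.azure.com",  # placeholder; must match apiVersion above
    version="v1",
    namespace="foundry-local",
    plural="modeldeployments",
    body=model_deployment,
)
```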
ONNX Runtime
ONNX Runtime is the default inference engine. It uses the ONNX-GenAI runtime through the Microsoft Foundry Local SDK to serve generative models in ONNX format. It supports both CPU and GPU execution.
When to use
Use ONNX Runtime when you need broad hardware support or want a lower-overhead option for generative inference.
- CPU inference — The only runtime that supports CPU-based execution. Use it when GPU hardware isn't available.
- Smaller models — Well-suited for compact models such as Phi-4 and Qwen 2.5 that fit in CPU memory or a single GPU.
- Edge and constrained environments — Lower resource overhead than vLLM.
Key characteristics
The following characteristics describe how ONNX Runtime behaves in Foundry Local on Azure Local.
- Runs on CPU (default) or GPU (CUDA).
- Serves ONNX-format models from the Foundry catalog or custom (BYO) registries.
- Exposes OpenAI-compatible endpoints: /v1/chat/completions and /v1/models (see the example after this list).
- Supports streaming responses and tool calling (depending on the model).
- Single model per pod.
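Because the endpoints are OpenAI-compatible, a standard OpenAI client can call a deployed model directly. The sketch below is illustrative only: the base URL, the authentication handling, and the assumption that the Phi-4-generic-cpu alias from the table above is deployed all need to be adjusted for your environment.

```python
# Minimal sketch of a chat completion against the OpenAI-compatible endpoint.
# The base URL, API key handling, and model alias are placeholders; substitute
# the address of your deployed service and the alias you actually deployed.
from openai import OpenAI

client = OpenAI(
    base_url="http://<foundry-local-endpoint>/v1",  # placeholder address
    api_key="not-needed-if-unauthenticated",        # assumption; supply a real key if required
)

# List the models the endpoint is serving (backed by /v1/models).
for model in client.models.list():
    print(model.id)

# Send a single chat completion request (backed by /v1/chat/completions).
response = client.chat.completions.create(
    model="Phi-4-generic-cpu",  # catalog alias from the table above
    messages=[{"role": "user", "content": "Summarize what ONNX Runtime is in one sentence."}],
)
print(response.choices[0].message.content)
```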
vLLM
vLLM is a high-throughput inference engine for large language models on GPU hardware. It uses PagedAttention for efficient GPU memory management and continuous batching to maximize throughput under concurrent load.
When to use
Use vLLM when your workload runs on GPUs and you want higher throughput or more efficient memory use for large generative models.
- High throughput — Continuous batching and PagedAttention deliver higher tokens-per-second than ONNX Runtime under concurrent load.
- Large models — Efficient memory management allows serving models that might otherwise exceed GPU memory.
- Production GPU workloads — Built-in GPU memory planning automatically sizes batch parameters and context length based on available hardware.
Key characteristics
The following characteristics highlight how vLLM is optimized for GPU-based generative inference.
- Requires GPU (CUDA). CPU isn't supported.
- Serves HuggingFace-format models (safetensors) from the Foundry catalog or custom (BYO) registries.
- Exposes OpenAI-compatible endpoints: /v1/chat/completions and /v1/models.
- Supports streaming responses and tool calling (depending on the model); a streaming sketch follows this list.
- Includes a GPU-aware planner that automatically tunes memory utilization, context length, and batch sizes.
- Tunable through the spec.vllm.preferences field on the ModelDeployment.
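As noted in the list above, vLLM deployments support streaming responses through the same OpenAI-compatible interface. The following is a minimal sketch of consuming a streamed completion; the endpoint address and the Phi-4 alias are placeholders for whatever you actually deployed.

```python
# Minimal sketch of a streamed chat completion against a vLLM-served model.
# The endpoint address and model alias are placeholders for your deployment.
from openai import OpenAI

client = OpenAI(
    base_url="http://<foundry-local-endpoint>/v1",  # placeholder address
    api_key="not-needed-if-unauthenticated",        # assumption
)

stream = client.chat.completions.create(
    model="Phi-4",  # vLLM catalog alias from the table above
    messages=[{"role": "user", "content": "Explain continuous batching briefly."}],
    stream=True,
)

# Print tokens as they arrive instead of waiting for the full response.
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```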
Comparison
Use the following comparison to quickly identify which runtime best matches your model format, hardware, and performance requirements.
| Criteria | ONNX Runtime | vLLM |
|---|---|---|
| GPU required | No (CPU or GPU) | Yes (GPU only) |
| Model format | ONNX | Hugging Face safetensors |
| Best for | Smaller models, CPU inference, edge scenarios | Large models, high concurrency, maximum throughput |
| Memory optimization | Standard ONNX Runtime memory management | PagedAttention, floating-point 8 (FP8), key-value (KV) cache, chunked prefill |
| Auto-tuning | None | GPU-aware planner sizes parameters automatically |
| Catalog models | Yes | Yes |
| Custom (BYO) models | Yes | Yes |
| API compatibility | OpenAI chat completions | OpenAI chat completions |
Predictive workloads
For non-generative workloads such as classification, object detection, and regression, Foundry Local uses a separate predictive inference engine based on ONNX Runtime. Predictive workloads use the /v1/predict endpoint and support custom (BYO) ONNX models. The runtime selection described earlier applies to generative workloads only.
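The predict endpoint's request body is defined by the input schema of the ONNX model you deploy, so the payload in the sketch below is purely illustrative; only the /v1/predict path comes from this article.

```python
# Minimal sketch of calling the predictive endpoint. The endpoint path comes
# from this article; the payload is entirely illustrative because the input
# schema is defined by the ONNX model you deploy.
import requests

resp = requests.post(
    "http://<foundry-local-endpoint>/v1/predict",  # placeholder address
    json={"inputs": [[5.1, 3.5, 1.4, 0.2]]},       # made-up feature vector; match your model's schema
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```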
For more information, see Predictive models in Inference operator and model lifecycle.
Running Foundry Local on multiple nodes
Foundry Local enabled by Azure Arc supports deployment on multinode Kubernetes clusters. This architecture extends AI inference beyond single-node setups and supports production-scale deployments. As an Azure Arc extension, Foundry Local installs on Azure Arc-enabled Kubernetes clusters and uses a Kubernetes-native operator to manage model lifecycle operations across nodes, including model caching, deployment, and serving. You can deploy multiple AI models at the same time, and each model workload is scheduled to a node that meets its CPU, memory, and GPU requirements. The operator uses standard Kubernetes scheduling controls, including resource requests and limits, node selectors, and affinity rules. This approach supports heterogeneous clusters where some nodes are CPU-only and others are GPU-capable.
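The scheduling controls named above are standard Kubernetes primitives. The fragment below illustrates them generically with resource requests and limits plus a node selector targeting GPU-capable nodes; exactly where these fields surface on a ModelDeployment is operator-specific, so treat the placement shown here as an assumption.

```python
# Generic illustration of the Kubernetes scheduling controls mentioned above:
# resource requests/limits and a node selector for GPU-capable nodes. The
# structure mirrors a standard pod spec; how the operator exposes these fields
# on a ModelDeployment is an assumption, not documented schema.
gpu_workload_spec = {
    "nodeSelector": {"accelerator": "nvidia-gpu"},  # placeholder node label
    "containers": [
        {
            "name": "inference",
            "resources": {
                "requests": {"cpu": "4", "memory": "16Gi"},
                "limits": {"cpu": "8", "memory": "32Gi", "nvidia.com/gpu": "1"},
            },
        }
    ],
}
```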
The platform supports both generative AI and predictive inference under one operational model. For GPU-based models, the system validates GPU resource limits and schedules workloads to GPU-capable nodes. CPU-based models are scheduled to nodes with sufficient CPU and memory capacity. This cluster-aware scheduling model lets organizations scale inference capacity by adding nodes to the cluster.