New · Day-zero quality gate is wired into every catalog model
Production inference, on your cluster

Production inference is mostly the boring parts. We’ve built them.

Open-model inference on your own GPUs, with the cold-start, observability, canary, and quality-gate machinery a serious team would otherwise spend a year writing. NVIDIA Dynamo under the hood. Bring your cluster.

Invitation-only · BYOC · Air-gap ready · NVIDIA Dynamo

Day-zero GA catalog · 9 models
Frontier open models — Llama 3.1 70B, Mixtral 8x7B, DeepSeek V3, Qwen 2.5 72B, Phi-4, Nemotron 70B, and three more — built per engine × GPU combination, benchmarked, deployable in one click.
Engines: vLLM · SGLang · TRT-LLM
What’s in the box

Fifteen capabilities. Three jobs.

Most “inference platforms” sell you an API. We sell you the platform an API team would build if they had eighteen months. Grouped by what you’re actually trying to do.

Run
Get tokens out of GPUs
  • Autoscaling + scale-to-zero
    5 named knobs
    Pay for traffic, not headroom
  • Cold-start engineering
    Cluster-local weight cache
    First token in seconds, not minutes
  • Multi-model pipelines
    DAG-native
    Whisper → LLM → TTS as one billable unit
  • MoE expert parallelism presets
    Preset library
    DeepSeek V3 + Mixtral without a math degree
  • Prefill / decode disaggregation
    Toggle, not config file
    Tune prefill for throughput and decode for latency, not one compromise for both (see the sketch after this list)
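
To make the Run column concrete, here is a minimal sketch of what a deployment spec with these knobs could look like. Every field name below is illustrative (not InfraStacks' actual schema), and the autoscaling keys are stand-ins for the five named knobs.

```python
# Hypothetical deployment spec. Field names are illustrative only; they are
# not the platform's real configuration schema.
deployment = {
    "model": "meta-llama/Llama-3.1-70B-Instruct",
    "engine": "vllm",
    "gpu_sku": "NC80adis_H100_v5",
    "autoscaling": {                        # stand-ins for the "5 named knobs"
        "min_replicas": 0,                  # scale-to-zero: pay for traffic, not headroom
        "max_replicas": 8,
        "target_queue_depth": 4,
        "scale_up_cooldown_s": 30,
        "scale_down_cooldown_s": 300,
    },
    "weight_cache": "cluster-local",        # cold-start: first token in seconds
    "prefill_decode_disaggregation": True,  # the toggle, not a config file
    "pipeline": ["whisper", "llm", "tts"],  # DAG-native: one billable unit
}
```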
Operate
Day two, every day
  • Canary + shadow deployments
    Per-stage dwell + auto-rollback
    Promote with confidence, roll back in seconds
  • Quality gate on every model
    MMLU + HumanEval + streaming-sanity
    Refuses GA promotion if eval scores regress (sketched after this list)
  • Datadog · Grafana · PagerDuty
    Prometheus + OTLP
    Your SRE team's stack, not ours
  • Load testing as a service
    Standard workloads + custom
    Validate the deployment before you trust it
  • Per-team cost showback
    TCO calculator
    Know who is spending your GPU budget
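
The quality-gate item above is, at its core, a comparison against a baseline. A minimal sketch of that decision, assuming a 0.02 regression threshold and made-up scores; the eval names come from the list, the threshold is whatever you set.

```python
# Minimal sketch of the GA quality gate: refuse promotion if any eval score
# regresses past a threshold. Threshold and scores below are illustrative.
EVALS = ("mmlu", "humaneval", "streaming_sanity")

def allow_ga_promotion(baseline: dict, candidate: dict,
                       max_regression: float = 0.02) -> bool:
    """True only if no eval drops more than max_regression below baseline."""
    return all(candidate[e] >= baseline[e] - max_regression for e in EVALS)

# Example: HumanEval drops five points, so the candidate stays out of GA.
baseline  = {"mmlu": 0.79, "humaneval": 0.71, "streaming_sanity": 1.00}
candidate = {"mmlu": 0.80, "humaneval": 0.66, "streaming_sanity": 1.00}
assert allow_ga_promotion(baseline, candidate) is False
```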
Govern
Pass the audit
  • Five named compliance modes
    See section below
    Standard, FIPS, HIPAA, GovCloud, Air-gapped
  • LoRA-aware multi-tenant serving
    Cache-aware routing
    One base model, hundreds of tenants
  • KV cache tiering
    GPU → host → NVMe → network
    128k context windows without re-paying for prefill
  • Per-request inference metering
    Streamed, not sampled
    Bill by token, by team, by model
  • Priority queues + admission control
    Per-tier latency floors
    Free tier doesn't crowd out enterprise (sketched after this list)
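
Priority queues plus admission control boils down to a small scheduling rule: serve higher tiers first, and reject a request whose projected wait would blow its tier's latency budget. A sketch under assumed tier names, budgets, and a flat per-request service-time estimate; none of these numbers are the platform's defaults.

```python
# Sketch of tiered admission control: higher tiers are served first, and a
# request is rejected when the work already ahead of its tier would exceed
# that tier's latency budget. All names and numbers are illustrative.
import heapq

TIER_PRIORITY    = {"enterprise": 0, "pro": 1, "free": 2}   # lower = served first
LATENCY_BUDGET_S = {"enterprise": 2.0, "pro": 5.0, "free": 30.0}
EST_SERVICE_S    = 0.25                                     # rough per-request cost

queue: list[tuple[int, int, str]] = []                      # (priority, seq, request_id)
_seq = 0

def admit(request_id: str, tier: str) -> bool:
    """Admit only if the requests served before this one fit in its budget."""
    global _seq
    ahead = sum(1 for p, _, _ in queue if p <= TIER_PRIORITY[tier])
    if ahead * EST_SERVICE_S > LATENCY_BUDGET_S[tier]:
        return False                      # free-tier floods never delay enterprise
    heapq.heappush(queue, (TIER_PRIORITY[tier], _seq, request_id))
    _seq += 1
    return True
```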
How teams get here

Two ways in. Same destination.

InfraStacks is the production layer for open-model inference. Most teams arrive from one of two starting points.

From a managed endpoint

“This workload needs to run on our infrastructure.”

A managed API got the feature shipped. But this particular workload now needs sovereignty, weight-level control, in-cluster latency, or all three — regulated data, a fine-tune your team owns, a voice agent where the round-trip is the product. We stand the workload up on dedicated GPUs in your existing cloud, often in your existing AKS / EKS / GKE. The OpenAI-compatible gateway means your application code doesn't change. The frontier APIs you keep using elsewhere keep working — this is the workload that has to live closer.

Azure AI Foundry · AWS Bedrock · Vertex AI · Self-managed APIs

Especially clean if you’re already on Azure AKS; we’re a Microsoft for Startups partner.
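
The "your application code doesn't change" claim above is the standard OpenAI-compatibility move: repoint the client at the in-cluster gateway. The gateway URL and token below are placeholders, not a documented InfraStacks endpoint.

```python
# Same OpenAI SDK, different base_url. The gateway address and token are
# placeholders for whatever your deployment exposes.
from openai import OpenAI

client = OpenAI(
    base_url="http://inference-gateway.internal.example/v1",  # placeholder
    api_key="YOUR_GATEWAY_TOKEN",
)

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "Summarize this discharge note: ..."}],
)
print(resp.choices[0].message.content)
```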

From a self-built cluster

“We’re spending more time on inference than on the product.”

You already run vLLM or SGLang on your K8s. It works. But cold-start is whatever K8s does, observability is Grafana-by-vibes, you can't promote a new revision without holding your breath, and nobody on the team wants to touch the LoRA serving path. We give you the day-two operations layer you've been writing piecemeal — without taking your cluster away from you.

vLLM on K8s · SGLang · TRT-LLM · BYOC
Deployment modes

Five modes. Pick the one your auditor signs.

Some workloads — classified, PHI, regulated IP, sovereign data — can only live on infrastructure you control. The same control plane runs in five postures so each workload lands on the posture it actually needs. The mode you choose determines what we enforce on your behalf — TLS floor, registry allowlist, pod security profile, egress allowlist, audit retention. Air-gap means zero outbound.

01 · Standard

Teams without regulatory constraints

TLS 1.2 floor, baseline pod security, open egress, 90-day audit retention. Workload plane on your K8s; control plane on Cloudflare. Telemetry is opt-out, never opt-in.

Azure AKS · AWS EKS · GKE · on-prem

02 · FIPS

Federal contractors, FIPS 140-3 boundary

FIPS crypto, TLS 1.3, restricted pod security, restricted registry allowlist, egress restricted to control plane only, 365-day audit retention. Compute-agent enforces. Profile version stamped on every push.

FIPS 140-3 · TLS 1.3 · restricted pod security

03 · HIPAA

Healthcare, PHI workloads

TLS 1.2 floor, restricted pod security, restricted registries, egress restricted to control plane only, and 6-year audit retention per HIPAA §164.316(b)(2)(i). Compute-agent enforces.

6-year audit · BAA-ready · PHI-safe egress

04 · GovCloud

Gov-cloud, FedRAMP-bound deployments

All of FIPS plus 6-year audit retention. Workload plane runs in your gov-cloud subscription; control plane stays on Cloudflare unless you require otherwise (a sovereign-tenancy mode is on the roadmap, not shipped today).

Azure Government · AWS GovCloud · GCC High

05 · Air-gap

DoD, intelligence, classified workloads

All of GovCloud plus zero outbound network and indefinite audit retention. Compute-agent refuses to dial out. Workload images served from your private registry via --registry-config overrides. Control plane connectivity is the part you accept; the workload plane is dark.

Zero outbound · indefinite audit · private registry only
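
Condensed into data, the five postures above differ roughly like this. Field names are illustrative; the values are the ones each mode card says it enforces (registry allowlists omitted for brevity).

```python
# The five postures from the cards above, as a single lookup. Field names are
# illustrative; values reflect what each mode enforces.
MODES = {
    "standard": {"tls_floor": "1.2", "pod_security": "baseline",
                 "egress": "open",               "audit_retention": "90d"},
    "fips":     {"tls_floor": "1.3", "pod_security": "restricted",
                 "egress": "control-plane-only", "audit_retention": "365d"},
    "hipaa":    {"tls_floor": "1.2", "pod_security": "restricted",
                 "egress": "control-plane-only", "audit_retention": "6y"},
    "govcloud": {"tls_floor": "1.3", "pod_security": "restricted",
                 "egress": "control-plane-only", "audit_retention": "6y"},
    "airgap":   {"tls_floor": "1.3", "pod_security": "restricted",
                 "egress": "none",               "audit_retention": "indefinite"},
}
```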

SOC 2 · HIPAA · ISO 27001 · FedRAMP Moderate — control mappings documented in profiles. External certifications in progress.

Day-zero model support

Nine models in day-zero GA. The tenth one is one CI run away.

Every model in the catalog is built per (engine × GPU × quant) combination, benchmarked against published TTFT targets, and gated on quality before it can flip to GA. When DeepSeek V4 ships at 11pm, the catalog runs the matrix overnight and the platform refuses the promotion if the eval scores regress past your threshold.
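
In miniature, that catalog matrix is an enumeration: one build candidate per (engine × GPU × quant) combination, each benchmarked against its published TTFT target and gated on quality. The quant set below is illustrative; the real set varies per model.

```python
from itertools import product

ENGINES = ("vllm", "sglang", "trt-llm")
GPUS    = ("A100", "H100")      # e.g. NC48ads_A100_v4, NC80adis_H100_v5
QUANTS  = (None, "fp8")         # illustrative; the real set varies per model

def build_matrix(model: str):
    """One build candidate per (engine × GPU × quant) combination.
    Each candidate is benchmarked against its published TTFT target and must
    clear the quality gate before it can flip to GA."""
    for engine, gpu, quant in product(ENGINES, GPUS, QUANTS):
        yield {"model": model, "engine": engine, "gpu": gpu, "quant": quant}

print(sum(1 for _ in build_matrix("meta-llama/Llama-3.1-70B-Instruct")))  # 12 candidates
```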

Llama 3.1 70B Instruct
meta-llama/Llama-3.1-70B-Instruct · min VRAM 140 GB · General Availability

GPUs: A100 (NC48ads_A100_v4) · H100 (NC80adis_H100_v5)
vLLM: 300 ms TTFT target · quant none
SGLang: 200 ms TTFT target · quant fp8
Published TTFT target per combo. Lower is better. Recommended configurations are one-click deployable from the catalog.

One model shown. Browse the full catalog after sign-in.

What it costs

Priced on your GPU spend, not your seat count.

InfraStacks is the control plane. You pay for the GPUs, in your account, at your negotiated rate. We charge a percentage on top — the assessment tells you exactly how much.

Open waitlist
We’re onboarding design partners. Sign up to get a slot when your industry or use case opens.
Showback today, chargeback tomorrow
The dashboard splits cost by team, model, and tenant — across InfraStacks deployments and your own Azure-native ones alike.
TCO calculator
For workloads where dedicated capacity is the right architecture, project the GPU-hour math against current per-token spend and see the crossover. Available the moment you sign in.
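
A back-of-the-envelope version of the crossover the calculator does for you. Every number below is a placeholder; plug in your negotiated GPU rate, the fee from your assessment, and your current per-token pricing.

```python
# Placeholder numbers throughout; substitute your own rates.
gpu_hourly_rate  = 6.98      # $/GPU-hour at your negotiated rate
gpus             = 2
hours_per_month  = 730
platform_fee_pct = 0.15      # illustrative; the assessment gives the real figure

dedicated_monthly = gpu_hourly_rate * gpus * hours_per_month * (1 + platform_fee_pct)

api_price_per_1m_tokens = 3.50   # blended $/1M tokens you pay today
crossover_tokens = dedicated_monthly / api_price_per_1m_tokens * 1_000_000

print(f"Dedicated: ${dedicated_monthly:,.0f}/mo; "
      f"break-even near {crossover_tokens / 1e9:.1f}B tokens/mo")
```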