New · Day-zero quality gate is wired into every catalog model
Production inference, on your cluster

Production inference is mostly the boring parts. We’ve built them.

Open-model inference on your own GPUs, with the cold-start, observability, canary, and quality-gate machinery a serious team would otherwise spend a year writing. NVIDIA Dynamo under the hood. Bring your cluster.

Invitation-only · BYOC · Air-gap ready · NVIDIA Dynamo

Day-zero GA catalog · 9 models
Frontier open models — Llama 3.1 70B, Mixtral 8x7B, DeepSeek V3, Qwen 2.5 72B, Phi-4, Nemotron 70B, and three more — built per engine × GPU combination, benchmarked, deployable in one click.
Engines: vLLM · SGLang · TRT-LLM
What’s in the box

Fifteen capabilities. Three jobs.

Most “inference platforms” sell you an API. We sell you the platform an API team would build if they had eighteen months. Grouped by what you’re actually trying to do.

Run
Get tokens out of GPUs
  • Autoscaling + scale-to-zero
    5 named knobs
    Pay for traffic, not headroom
  • Cold-start engineering
    Cluster-local weight cache
    First token in seconds, not minutes
  • Multi-model pipelines
    DAG-native
    Whisper → LLM → TTS as one billable unit
  • MoE expert parallelism presets
    Preset library
    DeepSeek V3 + Mixtral without a math degree
  • Prefill / decode disaggregation
    Toggle, not config file
    Tune prefill for throughput and decode for latency, not one compromise for both (see the sketch after this list)
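
To make the Run column concrete, here is a minimal sketch of what a deployment spec with these knobs could look like. Every field name below is illustrative (not InfraStacks' actual schema), and the autoscaling keys are stand-ins for the five named knobs.

```python
# Hypothetical deployment spec. Field names are illustrative only; they are
# not the platform's real configuration schema.
deployment = {
    "model": "meta-llama/Llama-3.1-70B-Instruct",
    "engine": "vllm",
    "gpu_sku": "NC80adis_H100_v5",
    "autoscaling": {                        # stand-ins for the "5 named knobs"
        "min_replicas": 0,                  # scale-to-zero: pay for traffic, not headroom
        "max_replicas": 8,
        "target_queue_depth": 4,
        "scale_up_cooldown_s": 30,
        "scale_down_cooldown_s": 300,
    },
    "weight_cache": "cluster-local",        # cold-start: first token in seconds
    "prefill_decode_disaggregation": True,  # the toggle, not a config file
    "pipeline": ["whisper", "llm", "tts"],  # DAG-native: one billable unit
}
```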
Operate
Day two, every day
  • Canary + shadow deployments
    Per-stage dwell + auto-rollback
    Promote with confidence, roll back in seconds
  • Quality gate on every model
    MMLU + HumanEval + streaming-sanity
    Refuses GA promotion if eval scores regress (sketched after this list)
  • Datadog · Grafana · PagerDuty
    Prometheus + OTLP
    Your SRE team's stack, not ours
  • Load testing as a service
    Standard workloads + custom
    Validate the deployment before you trust it
  • Per-team cost showback
    TCO calculator
    Know who is spending your GPU budget
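
The quality-gate item above is, at its core, a comparison against a baseline. A minimal sketch of that decision, assuming a 0.02 regression threshold and made-up scores; the eval names come from the list, the threshold is whatever you set.

```python
# Minimal sketch of the GA quality gate: refuse promotion if any eval score
# regresses past a threshold. Threshold and scores below are illustrative.
EVALS = ("mmlu", "humaneval", "streaming_sanity")

def allow_ga_promotion(baseline: dict, candidate: dict,
                       max_regression: float = 0.02) -> bool:
    """True only if no eval drops more than max_regression below baseline."""
    return all(candidate[e] >= baseline[e] - max_regression for e in EVALS)

# Example: HumanEval drops five points, so the candidate stays out of GA.
baseline  = {"mmlu": 0.79, "humaneval": 0.71, "streaming_sanity": 1.00}
candidate = {"mmlu": 0.80, "humaneval": 0.66, "streaming_sanity": 1.00}
assert allow_ga_promotion(baseline, candidate) is False
```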
Govern
Pass the audit
  • Five named compliance modes
    See section below
    Standard, FIPS, HIPAA, GovCloud, Air-gapped
  • LoRA-aware multi-tenant serving
    Cache-aware routing
    One base model, hundreds of tenants
  • KV cache tiering
    GPU → host → NVMe → network
    128k context windows without re-paying for prefill
  • Per-request inference metering
    Streamed, not sampled
    Bill by token, by team, by model
  • Priority queues + admission control
    Per-tier latency floors
    Free tier doesn't crowd out enterprise (sketched after this list)
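
Priority queues plus admission control boils down to a small scheduling rule: serve higher tiers first, and reject a request whose projected wait would blow its tier's latency budget. A sketch under assumed tier names, budgets, and a flat per-request service-time estimate; none of these numbers are the platform's defaults.

```python
# Sketch of tiered admission control: higher tiers are served first, and a
# request is rejected when the work already ahead of its tier would exceed
# that tier's latency budget. All names and numbers are illustrative.
import heapq

TIER_PRIORITY    = {"enterprise": 0, "pro": 1, "free": 2}   # lower = served first
LATENCY_BUDGET_S = {"enterprise": 2.0, "pro": 5.0, "free": 30.0}
EST_SERVICE_S    = 0.25                                     # rough per-request cost

queue: list[tuple[int, int, str]] = []                      # (priority, seq, request_id)
_seq = 0

def admit(request_id: str, tier: str) -> bool:
    """Admit only if the requests served before this one fit in its budget."""
    global _seq
    ahead = sum(1 for p, _, _ in queue if p <= TIER_PRIORITY[tier])
    if ahead * EST_SERVICE_S > LATENCY_BUDGET_S[tier]:
        return False                      # free-tier floods never delay enterprise
    heapq.heappush(queue, (TIER_PRIORITY[tier], _seq, request_id))
    _seq += 1
    return True
```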
How teams get here

Two ways in. Same destination.

InfraStacks is the production layer for open-model inference. Most teams arrive from one of two starting points.

From a managed endpoint

“This workload needs to run on our infrastructure.”

A managed API got the feature shipped. But this particular workload now needs sovereignty, weight-level control, in-cluster latency, or all three — regulated data, a fine-tune your team owns, a voice agent where the round-trip is the product. We stand the workload up on dedicated GPUs in your existing cloud, often in your existing AKS / EKS / GKE. The OpenAI-compatible gateway means your application code doesn't change. The frontier APIs you keep using elsewhere keep working — this is the workload that has to live closer.

Azure AI Foundry · AWS Bedrock · Vertex AI · Self-managed APIs

Especially clean if you’re already on Azure AKS; we’re a Microsoft for Startups partner.
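
The "your application code doesn't change" claim above is the standard OpenAI-compatibility move: repoint the client at the in-cluster gateway. The gateway URL and token below are placeholders, not a documented InfraStacks endpoint.

```python
# Same OpenAI SDK, different base_url. The gateway address and token are
# placeholders for whatever your deployment exposes.
from openai import OpenAI

client = OpenAI(
    base_url="http://inference-gateway.internal.example/v1",  # placeholder
    api_key="YOUR_GATEWAY_TOKEN",
)

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "Summarize this discharge note: ..."}],
)
print(resp.choices[0].message.content)
```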

From a self-built cluster

“We’re spending more time on inference than on the product.”

You already run vLLM or SGLang on your K8s. It works. But cold-start is whatever K8s does, observability is Grafana-by-vibes, you can't promote a new revision without holding your breath, and nobody on the team wants to touch the LoRA serving path. We give you the day-two operations layer you've been writing piecemeal — without taking your cluster away from you.

vLLM on K8s · SGLang · TRT-LLM · BYOC
Deployment modes

Five modes. Pick the one your auditor signs.

Some workloads — classified, PHI, regulated IP, sovereign data — can only live on infrastructure you control. The same control plane runs in five postures so each workload lands on the posture it actually needs. The mode you choose determines what we enforce on your behalf — TLS floor, registry allowlist, pod security profile, egress allowlist, audit retention. Air-gap means zero outbound.

01 · Standard

Teams without regulatory constraints

TLS 1.2 floor, baseline pod security, open egress, 90-day audit retention. Workload plane on your K8s; control plane on Cloudflare. Telemetry is opt-out, never opt-in.

Azure AKS · AWS EKS · GKE · on-prem

02 · FIPS

Federal contractors, FIPS 140-3 boundary

FIPS crypto, TLS 1.3, restricted pod security, restricted registry allowlist, egress restricted to control plane only, 365-day audit retention. Compute-agent enforces. Profile version stamped on every push.

FIPS 140-3 · TLS 1.3 · restricted pod security

03 · HIPAA

Healthcare, PHI workloads

TLS 1.2 floor, restricted pod security, restricted registries, egress restricted to control plane only, and 6-year audit retention per HIPAA §164.316(b)(2)(i). Compute-agent enforces.

6-year audit · BAA-ready · PHI-safe egress

04 · GovCloud

Gov-cloud, FedRAMP-bound deployments

All of FIPS plus 6-year audit retention. Workload plane runs in your gov-cloud subscription; control plane stays on Cloudflare unless you require otherwise (a sovereign-tenancy mode is on the roadmap, not shipped today).

Azure Government · AWS GovCloud · GCC High

05 · Air-gap

DoD, intelligence, classified workloads

All of GovCloud plus zero outbound network and indefinite audit retention. Compute-agent refuses to dial out. Workload images served from your private registry via --registry-config overrides. Control plane connectivity is the part you accept; the workload plane is dark.

Zero outbound · indefinite audit · private registry only
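
Condensed into data, the five postures above differ roughly like this. Field names are illustrative; the values are the ones each mode card says it enforces (registry allowlists omitted for brevity).

```python
# The five postures from the cards above, as a single lookup. Field names are
# illustrative; values reflect what each mode enforces.
MODES = {
    "standard": {"tls_floor": "1.2", "pod_security": "baseline",
                 "egress": "open",               "audit_retention": "90d"},
    "fips":     {"tls_floor": "1.3", "pod_security": "restricted",
                 "egress": "control-plane-only", "audit_retention": "365d"},
    "hipaa":    {"tls_floor": "1.2", "pod_security": "restricted",
                 "egress": "control-plane-only", "audit_retention": "6y"},
    "govcloud": {"tls_floor": "1.3", "pod_security": "restricted",
                 "egress": "control-plane-only", "audit_retention": "6y"},
    "airgap":   {"tls_floor": "1.3", "pod_security": "restricted",
                 "egress": "none",               "audit_retention": "indefinite"},
}
```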

SOC 2 · HIPAA · ISO 27001 · FedRAMP Moderate — control mappings documented in profiles. External certifications in progress.

Day-zero model support

Nine models in day-zero GA. The tenth one is one CI run away.

Every model in the catalog is built per (engine × GPU × quant) combination, benchmarked against published TTFT targets, and gated on quality before it can flip to GA. When DeepSeek V4 ships at 11pm, the catalog runs the matrix overnight and the platform refuses the promotion if the eval scores regress past your threshold.
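
In miniature, that catalog matrix is an enumeration: one build candidate per (engine × GPU × quant) combination, each benchmarked against its published TTFT target and gated on quality. The quant set below is illustrative; the real set varies per model.

```python
from itertools import product

ENGINES = ("vllm", "sglang", "trt-llm")
GPUS    = ("A100", "H100")      # e.g. NC48ads_A100_v4, NC80adis_H100_v5
QUANTS  = (None, "fp8")         # illustrative; the real set varies per model

def build_matrix(model: str):
    """One build candidate per (engine × GPU × quant) combination.
    Each candidate is benchmarked against its published TTFT target and must
    clear the quality gate before it can flip to GA."""
    for engine, gpu, quant in product(ENGINES, GPUS, QUANTS):
        yield {"model": model, "engine": engine, "gpu": gpu, "quant": quant}

print(sum(1 for _ in build_matrix("meta-llama/Llama-3.1-70B-Instruct")))  # 12 candidates
```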

Llama 3.1 70B Instruct
meta-llama/Llama-3.1-70B-Instruct · min VRAM 140 GB · General Availability

GPUs: A100 (NC48ads_A100_v4) · H100 (NC80adis_H100_v5)
vLLM: 300 ms TTFT target · quant none
SGLang: 200 ms TTFT target · quant fp8
Published TTFT target per combo. Lower is better. Recommended configurations are one-click deployable from the catalog.

One model shown. Browse the full catalog after sign-in.

What it costs

Priced on your GPU spend, not your seat count.

InfraStacks is the control plane. You pay for the GPUs, in your account, at your negotiated rate. We charge a percentage on top — the assessment tells you exactly how much.

Open waitlist
We’re onboarding design partners. Sign up to get a slot when your industry or use case opens.
Showback today, chargeback tomorrow
The dashboard splits cost by team, model, and tenant — across InfraStacks deployments and your own Azure-native ones alike.
TCO calculator
For workloads where dedicated capacity is the right architecture, project the GPU-hour math against current per-token spend and see the crossover. Available the moment you sign in.
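
A back-of-the-envelope version of the crossover the calculator does for you. Every number below is a placeholder; plug in your negotiated GPU rate, the fee from your assessment, and your current per-token pricing.

```python
# Placeholder numbers throughout; substitute your own rates.
gpu_hourly_rate  = 6.98      # $/GPU-hour at your negotiated rate
gpus             = 2
hours_per_month  = 730
platform_fee_pct = 0.15      # illustrative; the assessment gives the real figure

dedicated_monthly = gpu_hourly_rate * gpus * hours_per_month * (1 + platform_fee_pct)

api_price_per_1m_tokens = 3.50   # blended $/1M tokens you pay today
crossover_tokens = dedicated_monthly / api_price_per_1m_tokens * 1_000_000

print(f"Dedicated: ${dedicated_monthly:,.0f}/mo; "
      f"break-even near {crossover_tokens / 1e9:.1f}B tokens/mo")
```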