Production inference is mostly the boring parts. We’ve built them.
Open-model inference on your own GPUs, with the cold-start, observability, canary, and quality-gate machinery a serious team would otherwise spend a year writing. NVIDIA Dynamo under the hood. Bring your cluster.
Invitation-only · BYOC · Air-gap ready · NVIDIA Dynamo
Fifteen capabilities. Three jobs.
Most “inference platforms” sell you an API. We sell you the platform an API team would build if they had eighteen months. Grouped by what you’re actually trying to do.
| Capability | How | Payoff |
|---|---|---|
| Autoscaling + scale-to-zero | 5 named knobs (sketched below) | Pay for traffic, not headroom |
| Cold-start engineering | Cluster-local weight cache | First token in seconds, not minutes |
| Multi-model pipelines | DAG-native | Whisper → LLM → TTS as one billable unit |
| MoE expert parallelism presets | Preset library | DeepSeek V3 + Mixtral without a math degree |
| Prefill / decode disaggregation | Toggle, not config file | Tune for latency or throughput, not both |
| Canary + shadow deployments | Per-stage dwell + auto-rollback | Promote with confidence, roll back in seconds |
| Quality gate on every model | MMLU + HumanEval + streaming-sanity | Refuses GA promotion if eval scores regress |
| Datadog · Grafana · PagerDuty | Prometheus + OTLP | Your SRE team's stack, not ours |
| Load testing as a service | Standard workloads + custom | Validate the deployment before you trust it |
| Per-team cost showback | TCO calculator | Know who is spending your GPU budget |
| Five named compliance modes | See section below | Standard, FIPS, HIPAA, GovCloud, Air-gapped |
| LoRA-aware multi-tenant serving | Cache-aware routing | One base model, hundreds of tenants |
| KV cache tiering | GPU → host → NVMe → network | 128k context windows without re-paying for prefill |
| Per-request inference metering | Streamed, not sampled | Bill by token, by team, by model |
| Priority queues + admission control | Per-tier latency floors | Free tier doesn't crowd out enterprise |
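What "5 named knobs" means in practice, as a minimal sketch. The knob names and values below are illustrative assumptions, not the shipped schema; the point is that the whole autoscaling surface is five fields, not a custom autoscaler.

```python
# Illustrative only: knob names and values are assumptions, not the shipped schema.
autoscaling = {
    "min_replicas": 0,           # scale-to-zero when idle; cold-start engineering covers the wake-up
    "max_replicas": 8,           # hard ceiling on GPU spend
    "target_queue_depth": 4,     # waiting requests per replica before scaling up
    "scale_up_window_s": 30,     # how long pressure must persist before adding a replica
    "scale_down_window_s": 300,  # how long idle before removing one
}
```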
Two ways in. Same destination.
InfraStacks is the production layer for open-model inference. Most teams arrive from one of two starting points.
“This workload needs to run on our infrastructure.”
A managed API got the feature shipped. But this particular workload now needs sovereignty, weight-level control, in-cluster latency, or all three — regulated data, a fine-tune your team owns, a voice agent where the round-trip is the product. We stand the workload up on dedicated GPUs in your existing cloud, often in your existing AKS / EKS / GKE. The OpenAI-compatible gateway means your application code doesn't change. The frontier APIs you keep using elsewhere keep working — this is the workload that has to live closer.
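Because the gateway speaks the OpenAI wire protocol, the migration is typically one line: repoint the client's base URL. A minimal sketch using the standard `openai` Python SDK; the gateway URL and model name are hypothetical placeholders, not shipped values.

```python
from openai import OpenAI

# Hypothetical gateway URL and model name. Only base_url and api_key change;
# the rest of your application code stays as it was against the managed API.
client = OpenAI(
    base_url="https://inference.internal.example.com/v1",  # your in-cluster gateway
    api_key="YOUR_INFRASTACKS_TOKEN",
)

resp = client.chat.completions.create(
    model="deepseek-v3",  # a catalog model, served from your GPUs
    messages=[{"role": "user", "content": "Summarize last night's on-call page."}],
    stream=True,
)
for chunk in resp:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```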
Especially clean if you're already on Azure AKS; we're a Microsoft for Startups partner.
“We’re spending more time on inference than on the product.”
You already run vLLM or SGLang on your K8s. It works. But cold-start is whatever K8s does, observability is Grafana-by-vibes, you can't promote a new revision without holding your breath, and nobody on the team wants to touch the LoRA serving path. We give you the day-two operations layer you've been writing piecemeal, without taking your cluster away from you.
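For a concrete picture of "per-stage dwell + auto-rollback" from the capability table, here is a hedged sketch of what a canary policy could look like. Field names and thresholds are illustrative assumptions, not the shipped schema.

```python
# Hypothetical canary policy; field names and thresholds are illustrative.
canary_policy = {
    "stages": [
        {"traffic_pct": 1,  "dwell_min": 15},   # trickle traffic, watch the dashboards
        {"traffic_pct": 10, "dwell_min": 60},
        {"traffic_pct": 50, "dwell_min": 120},  # last stop before full promotion
    ],
    "auto_rollback_when": {
        "p99_ttft_ms_above": 450,   # latency floor breached
        "error_rate_above": 0.005,  # more than 0.5% failed requests
    },
}
```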
Five modes. Pick the one your auditor signs.
Some workloads — classified, PHI, regulated IP, sovereign data — can only live on infrastructure you control. The same control plane runs in five postures so each workload lands on the posture it actually needs. The mode you choose determines what we enforce on your behalf — TLS floor, registry allowlist, pod security profile, egress allowlist, audit retention. Air-gap means zero outbound.
Standard
For teams without regulatory constraints. TLS 1.2 floor, baseline pod security, open egress, 90-day audit retention. Workload plane on your K8s; control plane on Cloudflare. Telemetry is opt-out, never opt-in.
FIPS
For federal contractors inside a FIPS 140-3 boundary. FIPS crypto, TLS 1.3, restricted pod security, restricted registry allowlist, egress restricted to the control plane only, 365-day audit retention. Compute-agent enforces; profile version stamped on every push.
HIPAA
For healthcare and PHI workloads. TLS 1.2 floor, restricted pod security, restricted registries, egress restricted to the control plane only, and 6-year audit retention per HIPAA §164.316(b)(2)(i). Compute-agent enforces.
GovCloud
For gov-cloud, FedRAMP-bound deployments. Everything in FIPS plus 6-year audit retention. Workload plane runs in your gov-cloud subscription; control plane stays on Cloudflare for now (a sovereign-tenancy mode is on the roadmap, not shipped today).
Air-gap
For DoD, intelligence, and classified workloads. Everything in GovCloud plus zero outbound network and indefinite audit retention. The compute-agent refuses to dial out; workload images are served from your private registry via --registry-config overrides. Giving up control-plane connectivity is the trade-off you accept: the workload plane stays dark.
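Read as data, the five postures collapse to one small table. A sketch with field names of our choosing; the values simply restate the modes above.

```python
# Illustrative field names; the controls and values restate the five modes above.
PROFILES = {
    "standard": {"tls_floor": "1.2", "pod_security": "baseline",
                 "egress": "open",               "audit_retention": "90d"},
    "fips":     {"tls_floor": "1.3", "pod_security": "restricted",
                 "egress": "control-plane-only", "audit_retention": "365d"},
    "hipaa":    {"tls_floor": "1.2", "pod_security": "restricted",
                 "egress": "control-plane-only", "audit_retention": "6y"},
    "govcloud": {"tls_floor": "1.3", "pod_security": "restricted",
                 "egress": "control-plane-only", "audit_retention": "6y"},
    "airgap":   {"tls_floor": "1.3", "pod_security": "restricted",
                 "egress": "none",               "audit_retention": "indefinite"},
}
```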
SOC 2 · HIPAA · ISO 27001 · FedRAMP Moderate — control mappings documented in profiles. External certifications in progress.
Nine models in day-zero GA. The tenth one is one CI run away.
Every model in the catalog is built per (engine × GPU × quant) combination, benchmarked against published TTFT targets, and gated on quality before it can flip to GA. When DeepSeek V4 ships at 11pm, the catalog runs the matrix overnight and the platform refuses the promotion if the eval scores regress past your threshold.
| Engine ↓ / GPU → | A100 (NC48ads_A100_v4) | H100 (NC80adis_H100_v5) |
|---|---|---|
| vLLM | 300 ms TTFT target · quant: none | — |
| SGLang | — | 200 ms TTFT target · quant: fp8 |
One model shown. Browse the full catalog after sign-in.
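"Refuses the promotion" is a mechanical comparison, not a judgment call. A minimal sketch of the gate logic, with illustrative names and a made-up regression threshold:

```python
# Sketch of the GA quality gate; names and threshold are illustrative assumptions.
EVAL_SUITES = ("mmlu", "humaneval", "streaming_sanity")
MAX_REGRESSION = 0.02  # refuse GA if any suite drops more than this vs. the GA baseline

def gate_promotion(candidate: dict, ga_baseline: dict) -> bool:
    """True only if no eval suite regresses past the threshold."""
    return all(
        ga_baseline[suite] - candidate[suite] <= MAX_REGRESSION
        for suite in EVAL_SUITES
    )

# Overnight matrix run for a hypothetical new build:
baseline  = {"mmlu": 0.71, "humaneval": 0.48, "streaming_sanity": 1.00}
candidate = {"mmlu": 0.72, "humaneval": 0.44, "streaming_sanity": 1.00}
assert not gate_promotion(candidate, baseline)  # HumanEval dropped 0.04 > 0.02: no GA flip
```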
Priced on your GPU spend, not your seat count.
InfraStacks is the control plane. You pay for the GPUs, in your account, at your negotiated rate. We charge a percentage on top; the assessment tells you exactly how much.