
AI / GPU Cloud Infrastructure

Production-grade GPU infrastructure on Kubernetes for LLM inference and AI workloads — from NVIDIA GPU Operator setup and MIG partitioning to vLLM serving, autoscaling, and multi-tenant platform design.

  • +65% GPU utilization improvement
  • 7× models per A100 with MIG
  • −55% inference cost per request
  • Full tenant isolation

MIG profile sizing — matched to model size

Instead of wasting a full A100 on a 7B model, MIG lets you carve out exactly the right slice.

  • 1g.10gb: ≤ 7B params, 7× per GPU
  • 2g.20gb: 7B – 13B params, 3× per GPU
  • 4g.40gb: 30B – 40B params, 1× per GPU
  • 7g.80gb: 70B+ params, 1× per GPU
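The tiers above boil down to a simple lookup. A minimal sketch, with thresholds taken from the table (the function name and exact cutoffs are illustrative):

```python
# Map a model's parameter count (in billions) to an A100 80GB MIG
# profile, following the sizing tiers above.
def mig_profile(params_b: float) -> str:
    if params_b <= 7:
        return "1g.10gb"   # up to 7 instances per GPU
    if params_b <= 13:
        return "2g.20gb"   # 3 instances per GPU
    if params_b <= 40:
        return "4g.40gb"   # 1 instance per GPU
    return "7g.80gb"       # the full GPU
```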

What's covered

Building AI infrastructure on Kubernetes involves six interconnected layers. I work across all of them so nothing falls through the cracks.

GPU Node Pools & Operator

I set up and configure NVIDIA GPU node pools on your cloud provider and deploy the NVIDIA GPU Operator — which handles driver installation, container runtime configuration, and device plugin management automatically across all GPU nodes.

  • GPU node group setup (AWS, GCP, Azure)
  • NVIDIA GPU Operator deployment
  • Driver and container runtime automation
  • Node labels and taints for GPU scheduling
  • Spot / preemptible GPU pools for cost savings
  • Multi-GPU node support (A100, H100, L40S)
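In practice, steering pods onto those nodes combines a node selector, a toleration for the GPU taint, and a GPU resource limit. A sketch of the relevant pod-spec fields as a Python dict, assuming the common `nvidia.com/gpu` taint key and the `nvidia.com/gpu.product` label that GPU Feature Discovery applies (adjust to whatever your node groups actually use):

```python
# Illustrative pod scheduling fields for a dedicated GPU node pool.
gpu_pod_scheduling = {
    # Pin to a specific GPU product via the GFD-applied node label.
    "nodeSelector": {"nvidia.com/gpu.product": "NVIDIA-A100-SXM4-80GB"},
    # Tolerate the taint that keeps non-GPU workloads off these nodes.
    "tolerations": [{
        "key": "nvidia.com/gpu",
        "operator": "Exists",
        "effect": "NoSchedule",
    }],
    # Request one whole GPU from the device plugin.
    "resources": {"limits": {"nvidia.com/gpu": 1}},
}
```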

MIG Partitioning (A100 / H100)

NVIDIA's Multi-Instance GPU (MIG) technology lets you slice a single A100 or H100 into isolated GPU instances — each with dedicated memory, compute, and bandwidth. I configure MIG profiles matched to your model size tiers so every GPU cycle is used efficiently.

  • MIG mode enablement on all GPU nodes
  • Profile sizing: 1g.10gb / 2g.20gb / 4g.40gb / 7g.80gb
  • mig-parted dynamic reconfiguration pipeline
  • Kubernetes device plugin for MIG resource exposure
  • Custom node labels per MIG profile
  • Namespace resource quotas per tenant / model size
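With the device plugin's `mixed` strategy, each MIG profile is exposed as its own extended resource, so a pod requests a specific slice rather than a whole GPU. An illustrative fragment:

```python
# A pod requesting one 1g.10gb MIG slice instead of a full GPU
# (resource name as exposed by the device plugin's "mixed" strategy).
mig_pod_resources = {
    "resources": {
        "limits": {"nvidia.com/mig-1g.10gb": 1}
    }
}
```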

LLM Inference Serving

I deploy and configure the right inference server for your models — vLLM for OpenAI-compatible APIs, Triton for multi-framework serving, or Ollama for simpler setups. Each is tuned for throughput, latency, and memory efficiency.

  • vLLM deployment with PagedAttention optimization
  • NVIDIA Triton Inference Server setup
  • Ollama for lightweight model serving
  • OpenAI-compatible API endpoint configuration
  • Model loading from S3 / HuggingFace Hub
  • Batching and KV cache configuration
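Because vLLM speaks the OpenAI API, clients simply POST a standard chat-completions payload to the service endpoint. A minimal payload builder as a sketch (the model name and endpoint path are placeholders for your deployment):

```python
# Build an OpenAI-style chat completion request body for a vLLM
# endpoint. POST it as JSON to http://<vllm-service>/v1/chat/completions
def chat_payload(model: str, prompt: str, max_tokens: int = 256) -> dict:
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

payload = chat_payload("meta-llama/Llama-3.1-8B-Instruct", "Hello")
```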

Autoscaling for Inference

GPU inference workloads have spiky demand patterns. I set up KEDA-based autoscaling that reacts to queue depth, request rate, or custom metrics — spinning up inference replicas fast and scaling down to save cost when idle.

  • KEDA event-driven autoscaling setup
  • Queue-depth scaling (Kafka, SQS, Redis)
  • Request-rate based HPA configuration
  • Scale-to-zero for idle models
  • Warm-up strategies to reduce cold start latency
  • Spot instance interruption handling
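At its core, queue-depth scaling targets a fixed number of in-flight requests per replica. A sketch of the arithmetic KEDA effectively performs (all numbers illustrative):

```python
import math

# Desired replica count for queue-depth scaling: target a fixed
# number of queued requests per replica, clamped to [min, max],
# with scale-to-zero when the queue is empty.
def desired_replicas(queue_depth: int, per_replica: int,
                     min_replicas: int = 0, max_replicas: int = 20) -> int:
    if queue_depth == 0:
        return min_replicas  # idle: scale to zero if allowed
    n = math.ceil(queue_depth / per_replica)
    return max(min_replicas, min(n, max_replicas))
```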

Multi-Tenant AI Platform

When multiple teams or customers share GPU infrastructure, isolation and fairness matter. I design namespace-based multi-tenancy with GPU quotas, priority classes, and network policies so no single team can starve others.

  • Namespace isolation per team / customer
  • GPU ResourceQuota enforcement
  • PriorityClass design for model tiers
  • Network policy isolation between tenants
  • Chargeback metrics per namespace
  • Model registry with access control
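Per-tenant GPU quotas are ordinary Kubernetes ResourceQuota objects over extended resources. An illustrative generator (the resource name assumes the device plugin's `mixed` MIG strategy; the quota name is my own):

```python
# Generate a ResourceQuota capping the number of MIG slices of a
# given profile that one tenant namespace can request.
def gpu_quota(namespace: str, profile: str, count: int) -> dict:
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "gpu-quota", "namespace": namespace},
        "spec": {
            "hard": {f"requests.nvidia.com/mig-{profile}": str(count)},
        },
    }
```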

Observability & Cost Control

GPU infrastructure is expensive — you need to know exactly what's using what, and when. I set up DCGM-based GPU metrics, Grafana dashboards, and cost allocation tools so you can optimize spend and catch waste early.

  • DCGM Exporter for per-GPU metrics
  • Grafana dashboards: utilization, memory, throughput
  • Inference latency and throughput SLO tracking
  • Cost allocation by namespace / team
  • Idle GPU alerting
  • Spot vs on-demand cost optimization reports
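Idle-GPU alerting reduces to a threshold check over DCGM utilization samples. A minimal sketch (the 5% threshold and sample shape are illustrative):

```python
# Flag GPUs whose utilization never exceeded `threshold` percent
# across the sampled window (e.g. scraped DCGM_FI_DEV_GPU_UTIL values).
def idle_gpus(samples, threshold=5.0):
    return sorted(
        gpu for gpu, utils in samples.items()
        if utils and max(utils) < threshold
    )
```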

Common questions

What GPU types do you work with?

Primarily NVIDIA A100 (40GB and 80GB) and H100. I've also worked with L40S and A10G for smaller inference workloads. The infrastructure patterns are similar across GPU types, but MIG is available only on certain data-center GPUs (A100, H100, and a few others); the L40S and A10G don't support it.

We just have one GPU node. Is this still worth it?

Absolutely — especially MIG partitioning. A single A100 80GB can serve up to 7 isolated model instances simultaneously. Proper Kubernetes integration also gives you scheduling, health checks, and autoscaling even with one node.

How do you handle model storage and loading?

Models are stored in object storage (S3, GCS) or pulled from HuggingFace Hub and cached on shared persistent volumes. For large models (70B+), I design pre-loading strategies to minimize cold start time — often using init containers or node-level model caching.

Can you help with fine-tuning infrastructure too?

Yes. Distributed fine-tuning (multi-node, multi-GPU) requires different infrastructure from inference — typically PyTorch DDP or FSDP with high-bandwidth networking (EFA on AWS, GPUDirect). I've set up training clusters for both LoRA fine-tuning and full-parameter training.

What's the typical cost saving from MIG partitioning?

On inference workloads with a mix of model sizes, MIG typically improves GPU utilization from 20–30% (one model per GPU) to 70–85% (multiple MIG instances per GPU). That directly translates to fewer GPUs needed for the same throughput — often 40–60% cost reduction.
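The arithmetic behind that claim is simple: at a fixed aggregate demand, the number of GPUs you need scales inversely with per-GPU utilization. A worked example using numbers inside the quoted ranges:

```python
import math

# GPUs required to absorb `demand`, measured in fully-utilized-GPU units.
def gpus_needed(demand: float, util_per_gpu: float) -> int:
    return math.ceil(demand / util_per_gpu)

# Illustrative figures from the ranges above:
before = gpus_needed(6.0, 0.30)  # ~30% utilization, one model per GPU
after = gpus_needed(6.0, 0.70)   # ~70% utilization with MIG packing
savings = 1 - after / before     # fraction of GPUs no longer needed
```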

Ready to put your GPUs to work?

Book a free 30-minute call. I'll ask about your model sizes, traffic patterns, and cloud setup — and give you a clear picture of what the right infrastructure looks like.

Book a Free Call