Using Kubernetes

The vLLM documentation on Deploying with Kubernetes is a comprehensive guide to configuring model deployments on Kubernetes. This guide highlights the key differences when deploying on Kubernetes with Spyre accelerators.

Deploying on Spyre Accelerators

Note

Prerequisite: Ensure that you have a running Kubernetes cluster with Spyre accelerators.
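
To verify that the cluster advertises the Spyre accelerators, you can check each node's allocatable resources (a quick sanity check against the ibm.com/aiu_pf resource name used in the manifests below):

    kubectl get nodes -o custom-columns='NAME:.metadata.name,SPYRE:.status.allocatable.ibm\.com/aiu_pf'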

  1. (Optional) Create PVCs and a secret for vLLM. The PVCs cache the downloaded model weights and compiled graphs, and the secret holds your Hugging Face Hub token.

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: hf-cache
      namespace: default
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 50Gi
      storageClassName: default
      volumeMode: Filesystem
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: graph-cache
      namespace: default
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 50Gi
      storageClassName: default
      volumeMode: Filesystem
    ---
    apiVersion: v1
    kind: Secret
    metadata:
      name: hf-token-secret
      namespace: default
    type: Opaque
    stringData:
      token: "REPLACE_WITH_TOKEN"
    
  2. Create a deployment and service for the model you want to deploy. This example demonstrates how to deploy ibm-granite/granite-3.3-8b-instruct.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: granite-8b-instruct
      namespace: default
      labels:
        app: granite-8b-instruct
    spec:
      # Defaults to 600; set this higher if your startupProbe needs to wait longer than that
      progressDeadlineSeconds: 1200
      replicas: 1
      selector:
        matchLabels:
          app: granite-8b-instruct
      template:
        metadata:
          labels:
            app: granite-8b-instruct
        spec:
          # Required for scheduling spyre cards
          schedulerName: aiu-scheduler
          volumes:
          - name: hf-cache-volume
            persistentVolumeClaim:
              claimName: hf-cache
          # vLLM needs to access the host's shared memory for tensor parallel inference.
          - name: shm
            emptyDir:
              medium: Memory
              sizeLimit: "2Gi"
          # vLLM can cache model graphs previously compiled on Spyre cards
          - name: graph-cache-volume
            persistentVolumeClaim:
              claimName: graph-cache
          containers:
          - name: vllm
            image: quay.io/ibm-aiu/vllm-spyre:latest.amd64
            args: [
              "ibm-granite/granite-3.3-8b-instruct"
            ]
            env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token-secret
                  key: token
            - name: TORCH_SENDNN_CACHE_ENABLE
              value: "1"
            - name: TORCH_SENDNN_CACHE_DIR
              value: /root/.cache/torch
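            # The i-th entries of the three VLLM_SPYRE_WARMUP_* lists below
            # together define one (batch size, prompt length, new tokens)
            # shape that is compiled during warmup at startup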
            - name: VLLM_SPYRE_WARMUP_BATCH_SIZES
              value: "1,4"
            - name: VLLM_SPYRE_WARMUP_PROMPT_LENS
              value: "1024,256"
            - name: VLLM_SPYRE_WARMUP_NEW_TOKENS
              value: "256,64"
            ports:
            - containerPort: 8000
            resources:
              limits:
                cpu: "10"
                memory: 20G
                ibm.com/aiu_pf: "1"
              requests:
                cpu: "2"
                memory: 6G
                ibm.com/aiu_pf: "1"
            volumeMounts:
            - mountPath: /root/.cache/huggingface
              name: hf-cache-volume
            - mountPath: /dev/shm
              name: shm
            - mountPath: /root/.cache/torch
              name: graph-cache-volume
            livenessProbe:
              httpGet:
                path: /health
                port: 8000
              periodSeconds: 10
            readinessProbe:
              httpGet:
                path: /health
                port: 8000
              periodSeconds: 5
            startupProbe:
              httpGet:
                path: /health
                port: 8000
              periodSeconds: 10
              # Long startup delays are necessary for graph compilation
              failureThreshold: 120
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: granite-8b-instruct
      namespace: default
    spec:
      ports:
      - name: http-granite-8b-instruct
        port: 80
        protocol: TCP
        targetPort: 8000
      selector:
        app: granite-8b-instruct
      sessionAffinity: None
      type: ClusterIP
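
    The shared-memory volume in the deployment above exists to support tensor parallel inference across multiple Spyre cards. As an illustrative sketch (not part of the example deployment): running the model with 4-way tensor parallelism would mean passing vLLM's --tensor-parallel-size flag and requesting a matching number of cards in the container spec:

    args: [
      "ibm-granite/granite-3.3-8b-instruct",
      "--tensor-parallel-size", "4"
    ]
    resources:
      limits:
        ibm.com/aiu_pf: "4"   # one card per tensor-parallel rank
      requests:
        ibm.com/aiu_pf: "4"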
    
  3. Deploy and test the model.

    Apply the manifests using kubectl apply -f <filename>:

    kubectl apply -f pvcs.yaml
    kubectl apply -f deployment.yaml
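
    Graph compilation can take several minutes on first startup, which is why the startupProbe allows up to 20 minutes (failureThreshold 120 x periodSeconds 10). You can watch the rollout with standard kubectl commands:

    kubectl rollout status deployment/granite-8b-instruct
    kubectl get pods -l app=granite-8b-instruct -w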
    

    To test the deployment, run the following curl command from a pod inside the cluster (the ClusterIP service is only reachable at its cluster-internal DNS name):

    curl http://granite-8b-instruct.default.svc.cluster.local/v1/completions \
      -H "Content-Type: application/json" \
      -d '{
            "model": "ibm-granite/granite-3.3-8b-instruct",
            "prompt": "San Francisco is a",
            "max_tokens": 7,
            "temperature": 0
          }'
    

    If the service is correctly deployed, you should receive a response from the vLLM model.
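
    If you are testing from outside the cluster, you can instead port-forward the service to your local machine and send the same request to http://localhost:8000/v1/completions:

    kubectl port-forward svc/granite-8b-instruct 8000:80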