Using Red Hat OpenShift AI

Red Hat OpenShift AI (RHOAI) is a cloud-native AI platform that bundles many popular model management projects, including KServe.

This example shows how to use KServe with RHOAI to deploy a model on OpenShift, using a modelcar image so the model can be loaded without any connection to the Hugging Face Hub.

Deploying with KServe

Prerequisites

  • A running OpenShift cluster with RHOAI installed
  • Image pull credentials for registry.redhat.io/rhelai1 (an example pull secret is shown after this list)
  • Spyre accelerators available in the cluster
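
The InferenceService in step 2 pulls the modelcar image using an image pull secret named oci-registry. As a sketch (substitute your own registry credentials and namespace), such a secret can be created with:

    oc create secret docker-registry oci-registry \
      --docker-server=registry.redhat.io \
      --docker-username=<registry-user> \
      --docker-password=<registry-password> \
      -n <your-namespace>
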
  1. Create a ServingRuntime to serve your models.

      oc apply -f - <<EOF
      apiVersion: serving.kserve.io/v1alpha1
      kind: ServingRuntime
      metadata:
        name: vllm-spyre-runtime
        annotations:
          openshift.io/display-name: vLLM IBM Spyre ServingRuntime for KServe
          opendatahub.io/recommended-accelerators: '["ibm.com/aiu_pf"]'
        labels:
          opendatahub.io/dashboard: "true"
      spec:
        multiModel: false
        supportedModelFormats:
          - autoSelect: true
            name: vLLM
        containers:
          - name: kserve-container
            image: quay.io/ibm-aiu/vllm-spyre:latest.amd64
            args:
              - /mnt/models
              - --served-model-name={{.Name}}
            env:
              - name: HF_HOME
                value: /tmp/hf_home
              # Static batching configurations can also be set on each InferenceService (see the sketch after this block)
              - name: VLLM_SPYRE_WARMUP_BATCH_SIZES
                value: '4'
              - name: VLLM_SPYRE_WARMUP_PROMPT_LENS
                value: '1024'
              - name: VLLM_SPYRE_WARMUP_NEW_TOKENS
                value: '256'
            ports:
              - containerPort: 8000
                protocol: TCP
      EOF
    
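     The warmup variables above set a default static batching shape for every model served by this runtime. As the comment notes, they can also be overridden per model; the sketch below shows what such an override could look like on an InferenceService (the predictor's model section accepts container fields such as env, and the values are only illustrative):

      # Sketch: per-model override of the static batching (warmup) shape
      spec:
        predictor:
          model:
            env:
              - name: VLLM_SPYRE_WARMUP_BATCH_SIZES
                value: '1'
              - name: VLLM_SPYRE_WARMUP_PROMPT_LENS
                value: '2048'
              - name: VLLM_SPYRE_WARMUP_NEW_TOKENS
                value: '256'
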
  2. Create an InferenceService for each model you want to deploy. This example demonstrates how to deploy the Granite model ibm-granite/granite-3.1-8b-instruct.

    oc apply -f - <<EOF
    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      annotations:
        openshift.io/display-name: granite-3-1-8b-instruct
        serving.kserve.io/deploymentMode: RawDeployment
      name: granite-3-1-8b-instruct
      labels:
        opendatahub.io/dashboard: 'true'
    spec:
      predictor:
        imagePullSecrets:
          - name: oci-registry
        maxReplicas: 1
        minReplicas: 1
        model:
          modelFormat:
            name: vLLM
          name: ''
          resources:
            limits:
              ibm.com/aiu_pf: '1'
            requests:
              ibm.com/aiu_pf: '1'
          runtime: vllm-spyre-runtime
          storageUri: 'oci://registry.redhat.io/rhelai1/modelcar-granite-3-1-8b-instruct:1.5'
          volumeMounts:
            - mountPath: /dev/shm
              name: shm
        schedulerName: aiu-scheduler
        tolerations:
          - effect: NoSchedule
            key: ibm.com/aiu_pf
            operator: Exists
        volumes:
          # This volume may need to be larger for bigger models or when running tensor-parallel inference across more cards
          - name: shm
            emptyDir:
              medium: Memory
              sizeLimit: "2Gi"
    EOF
    
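     On Spyre, the server warms up the configured batch shapes before it reports ready, so the first startup can take a while. One way to watch for readiness (a sketch; adjust the name, namespace, and timeout as needed):

      # Wait for the InferenceService to report Ready; warmup can take several minutes
      oc wait --for=condition=Ready inferenceservice/granite-3-1-8b-instruct --timeout=60m
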
  3. To test your InferenceService, refer to the KServe documentation on model inference with vLLM.
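
     For a quick smoke test, one option is to port-forward the predictor service and call the OpenAI-compatible completions endpoint directly. This is a sketch: the service name below assumes KServe's usual <name>-predictor convention, and the service port may differ on your cluster (check with oc get svc). The model name matches the InferenceService name because the runtime passes --served-model-name={{.Name}}:

      # Forward the predictor service locally (verify the service name and port first)
      oc port-forward svc/granite-3-1-8b-instruct-predictor 8000:80 &

      # Query the OpenAI-compatible completions endpoint
      curl -s http://localhost:8000/v1/completions \
        -H "Content-Type: application/json" \
        -d '{
              "model": "granite-3-1-8b-instruct",
              "prompt": "Hello, my name is",
              "max_tokens": 32
            }'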