Serving LLMs using Hugging Face Text Generation Inference Server on Kubernetes

Serving LLMs with the Hugging Face Text Generation Inference Server robustly in minutes

# Intro

If you want to host your own instance of a popular LLM like Mistral or Llama-2, or a fine-tuned version of one of these models, Hugging Face Text Generation Inference (TGI) is a great tool to get the job done. Often this means running inference on a Kubernetes or OpenShift cluster that provides the necessary GPU infrastructure.

In this post, I’ll show a quick example of how to get your own LLM instance up and running on your Kubernetes cluster.

# Prerequisite

You need a suitable GPU (e.g. A10, A100, H100, L40S) available within your cluster, depending on the model you want to run. Usually this involves installing the NVIDIA Node Feature Discovery (NFD) Operator and the NVIDIA GPU Operator to make the GPU available to your pods, as you can verify with the commands below.
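
As a quick sanity check (a minimal sketch, assuming the NVIDIA device plugin advertises GPUs as the `nvidia.com/gpu` resource), you can verify that your nodes actually expose GPUs:

```bash
# Check that GPUs are advertised as an allocatable resource on your nodes
kubectl describe nodes | grep "nvidia.com/gpu"

# If GPU Feature Discovery is running, GPU nodes also carry this label
kubectl get nodes -l nvidia.com/gpu.present=true
```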

# Example: Deploying Mistral using TGI in Kubernetes

Running TGI on Kubernetes is pretty straightforward once you’ve figured out the basic setup. To avoid downloading the model weights every time the pod restarts, create a PersistentVolumeClaim (PVC) to cache them:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: hf-cache
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 50Gi
  selector:          # optional: bind to a pre-created PV by label; remove for dynamic provisioning
    matchLabels:
      pv: local
  storageClassName: "" # "" binds only to PVs without a class; set your storage class here, or omit the field to use the default class
```
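
Assuming you saved the manifest above as `pvc.yaml` (the filename is just an example), you can create the claim and check its status. Depending on your storage class, the PVC may stay in `Pending` until the first pod mounts it (WaitForFirstConsumer binding):

```bash
# Create the PVC and check whether it is bound
kubectl apply -f pvc.yaml -n <deployment-namespace>
kubectl get pvc hf-cache -n <deployment-namespace>
```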

Now that you have created a PVC, you can create the deployment using the following YAML. The deployment first downloads the model weights and then starts the Text Generation Inference server. In this example, it runs the mistralai/Mistral-7B-Instruct-v0.2 model:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tgi-mistral
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tgi-mistral
  template:
    metadata:
      labels:
        app: tgi-mistral
    spec:
      volumes:
        - name: hub   # model weight cache backed by the PVC created above
          persistentVolumeClaim:
            claimName: hf-cache
        - name: dshm  # in-memory volume mounted at /dev/shm (shared memory)
          emptyDir:
            medium: Memory
      containers:
        - name: model
          image: ghcr.io/huggingface/text-generation-inference:1.4.0
          command: ["bash", "-c"]
          # download the weights into the cache first, then start the server
          args:
            - text-generation-server download-weights mistralai/Mistral-7B-Instruct-v0.2;
              text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2 --port 8080
          env:
            - name: PORT
              value: "8080"
            - name: HF_HUB_CACHE
              value: /data
            - name: HF_HOME
              value: /data
          volumeMounts:
            - mountPath: /dev/shm
              name: dshm
            - mountPath: /data
              name: hub
          resources:
            requests:
              cpu: 1.0
              memory: 8Gi
              nvidia.com/gpu: 1
            limits:
              cpu: 5.0
              memory: 32Gi
              nvidia.com/gpu: 1
          ports:
            - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: tgi-mistral-service
  labels:
    app: tgi-mistral-service
spec:
  selector:
    app: tgi-mistral
  type: NodePort # ClusterIP is sufficient if you only need in-cluster access
  ports:
    - protocol: TCP
      port: 8080
```
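
Assuming you saved the Deployment and Service as `tgi-mistral.yaml` (again, the filename is just an example), you can apply the manifest and follow the startup logs. The first start takes a while because the weights are downloaded into the PVC:

```bash
# Deploy TGI and watch it come up
kubectl apply -f tgi-mistral.yaml -n <deployment-namespace>
kubectl get pods -n <deployment-namespace> -l app=tgi-mistral
kubectl logs -f deployment/tgi-mistral -n <deployment-namespace>
```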

Within your cluster, you can now reach the TGI service with the following curl command from another pod. Make sure to replace `<deployment-namespace>` with the namespace you deployed TGI to:

```bash
curl http://tgi-mistral-service.<deployment-namespace>.svc.cluster.local:8080/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'
```
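
If you just want to test the endpoint from your local machine instead of from another pod, a port-forward to the service works as well (a quick sketch, not meant for production traffic):

```bash
# Forward the service to localhost:8080
kubectl port-forward svc/tgi-mistral-service 8080:8080 -n <deployment-namespace>

# In a second terminal:
curl http://localhost:8080/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'
```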

If you want to expose TGI to the public internet, you can create a Route (OpenShift) or an Ingress (Kubernetes).

Please make sure to adjust the timeout of your Route/Ingress if you have long-running inference requests. Larger models like Mixtral can take more than 60 seconds (the default timeout for Routes and many Ingress controllers) to respond to large prompts with long contexts.
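
As a rough sketch, assuming the ingress-nginx controller, an Ingress with raised proxy timeouts could look like the following; the hostname, ingress class and 300-second values are placeholders you would adjust:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: tgi-mistral-ingress
  annotations:
    # annotation names are specific to ingress-nginx; other controllers use different ones
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
spec:
  ingressClassName: nginx
  rules:
    - host: tgi.example.com # placeholder hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: tgi-mistral-service
                port:
                  number: 8080
```

On OpenShift, the equivalent for a Route is the `haproxy.router.openshift.io/timeout` annotation.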

# Further information

For those interested in diving deeper, please refer to the Hugging Face TGI documentation for CLI parameters, the Python SDK, and more.
