# Intro
If you want to host your own instance of a popular LLM like Mistral or Llama 2, or a fine-tuned version of one of these models, Hugging Face Text Generation Inference (TGI) is a great tool to get the job done. Often this means running your inference workload on a Kubernetes or OpenShift cluster that provides the necessary GPU infrastructure.
In this post, I’ll show a quick example of how to get your own LLM instance up and running on your Kubernetes cluster.
# Prerequisites
You need a suitable GPU (e.g. A10, A100, H100, L40S) available within your cluster, depending on the model you want to run. Making the GPU available to your pods usually involves installing the Node Feature Discovery (NFD) Operator and the NVIDIA GPU Operator.
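Once both operators are up and running, you can check whether the GPU is actually advertised as an allocatable resource. A quick sanity check (the node name is a placeholder):

```bash
# Replace <gpu-node-name> with the name of one of your GPU nodes;
# the output should list nvidia.com/gpu under Capacity and Allocatable.
kubectl describe node <gpu-node-name> | grep -i "nvidia.com/gpu"
```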
# Example: Deploying Mistral using TGI in Kubernetes
Running TGI on Kubernetes is pretty straightforward once you’ve figured out the basic setup. To avoid downloading the model weights every time the pod restarts, create a PersistentVolumeClaim (PVC) to persist the model weights:
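A minimal sketch of such a PVC could look like this; the name `tgi-model-cache`, the 100Gi size and the storage class are assumptions you should adapt to your cluster and model:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: tgi-model-cache
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  # storageClassName: <your-storage-class>  # set this if you don't want the default class
```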
Now that you have created the PVC, you can create the deployment using the following deployment YAML. It will download the model weights and then start the Text Generation Inference server. In this example, it runs the mistralai/Mistral-7B-Instruct-v0.2 model:
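The following is a sketch of what such a deployment (plus a Service to reach it) might look like. The names, labels, image tag, resource limits and the accompanying Service are assumptions; adapt them to your cluster, and add a Hugging Face token if the model you use requires gated access:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tgi
  labels:
    app: tgi
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tgi
  template:
    metadata:
      labels:
        app: tgi
    spec:
      containers:
        - name: tgi
          image: ghcr.io/huggingface/text-generation-inference:1.4
          args:
            - "--model-id"
            - "mistralai/Mistral-7B-Instruct-v0.2"
          ports:
            - containerPort: 80
          env:
            # The TGI image caches downloaded weights under /data
            - name: HUGGINGFACE_HUB_CACHE
              value: /data
            # Uncomment if the model requires an access token
            # - name: HUGGING_FACE_HUB_TOKEN
            #   valueFrom:
            #     secretKeyRef:
            #       name: hf-token
            #       key: token
          resources:
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: model-cache
              mountPath: /data
            # TGI benefits from a larger shared memory segment
            - name: shm
              mountPath: /dev/shm
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: tgi-model-cache
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 1Gi
---
apiVersion: v1
kind: Service
metadata:
  name: tgi
spec:
  selector:
    app: tgi
  ports:
    - port: 80
      targetPort: 80
```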
Within your cluster, you can now reach the TGI service with the following curl command from another pod. Make sure to replace `deployment-namespace` with the name of the namespace you deployed TGI to:
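Assuming the Service from the sketch above is called `tgi`, a request against TGI's `/generate` endpoint could look like this:

```bash
curl http://tgi.deployment-namespace.svc.cluster.local/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "What is deep learning?", "parameters": {"max_new_tokens": 100}}'
```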
If you want to expose TGI to the public internet, you can create a Route (OpenShift) or an Ingress (Kubernetes).
Please make sure to adjust the timeout of your Route/Ingress if you have long-running inference jobs. For large prompts with long contexts, larger models like Mixtral might run for more than 60 seconds, which is the default timeout for Routes/Ingresses.
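As an example, with the ingress-nginx controller the proxy timeouts can be raised via annotations; the host name, service name and timeout value below are placeholders:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: tgi
  annotations:
    # ingress-nginx specific: allow long-running generation requests
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
spec:
  ingressClassName: nginx
  rules:
    - host: tgi.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: tgi
                port:
                  number: 80
```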
# Further information
For those interested in diving deeper, please refer to the Hugging Face TGI documentation for CLI parameters, the Python SDK, and more.