Deploying and Managing NVIDIA GPUs with K3s and Rancher

In this article, we will explore how to leverage the Nvidia GPU operator in a K3s cluster, using an inexpensive Nvidia T400 GPU that appears on the operator's supported hardware list. We will walk through the steps to install and configure the GPU operator, along with some tips and tricks to keep in mind when working with this technology.

First, we will create a new VM in vSphere with PCI Passthrough enabled, so that the Nvidia GPU can be presented directly to the VM and used for compute workloads. Once the VM is up, we will install the Nvidia GPU operator chart within K3s and configure the embedded Containerd runtime to use the GPU.

To install the Nvidia GPU operator chart, we first need to import the existing cluster into Rancher, which lets us manage the K3s cluster and deploy charts to it. Once the cluster is imported, navigate to Cluster -> Apps -> Charts, search for "GPU", select the Nvidia GPU operator chart, and click "Install" to deploy it to the K3s cluster.
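If you prefer the command line to the Rancher UI, the same chart can also be installed with Helm directly. The repository URL and chart name below follow Nvidia's published Helm repository; check the operator documentation for the current values before running this:

```shell
# Add Nvidia's Helm repository and refresh the local index
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install the GPU operator into its own namespace
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace
```

Either route deploys the same operator; the Rancher UI simply wraps this Helm install.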

Once the GPU operator chart is installed, the operator's container-toolkit component configures the Containerd runtime to use the Nvidia container runtime. One point worth noting: K3s does not run a standalone containerd service, but ships its own embedded containerd, whose configuration lives at /var/lib/rancher/k3s/agent/etc/containerd/config.toml. After the toolkit has run, that file should contain a runtime entry along these lines (the exact plugin keys and binary path may differ between containerd and toolkit versions):

```

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"

  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
    BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"

```

If you need to apply runtime changes manually, restart the K3s service (which restarts the embedded containerd) rather than a containerd unit:

```

sudo systemctl restart k3s

```

At this point, we should be able to run containerized workloads on our Nvidia GPU. However, there are a few things to keep in mind when working with the GPU operator.
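To verify that scheduling works end to end, you can run a test Pod that requests the GPU resource advertised by the operator's device plugin. The CUDA image tag below is illustrative; substitute whichever CUDA base image matches your installed driver version:

```

apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.2.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1

```

If everything is wired up correctly, the logs of this Pod should show the T400 in the nvidia-smi output.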

First, you will notice that the Nvidia GPU is exposed as two PCI devices: one for the video controller, and one for the audio controller (used for HDMI audio output). Only the video controller is needed, so the audio controller device can be left out of the passthrough configuration.
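You can confirm which PCI functions are visible from the guest OS with lspci; the exact IDs and descriptions will vary by card and passthrough configuration:

```shell
# List Nvidia PCI devices; expect a VGA/3D controller entry, plus an
# audio controller entry if that function was also passed through
lspci | grep -i nvidia
```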

Second, you may encounter issues with the GPU driver not being installed correctly. To resolve this, you can try installing the Nvidia driver manually before installing the GPU operator chart. This will ensure that the driver is properly installed and configured.
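If you take this route, you can tell the operator not to deploy its own driver containers: the GPU operator chart exposes a `driver.enabled` value for exactly this case. The example below assumes the chart was installed via Helm under the release name `gpu-operator`:

```shell
# Skip the operator's driver daemonset when the host driver is pre-installed
helm upgrade gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --set driver.enabled=false
```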

Finally, you may notice that the GPU operator Pods are in a crashloop state initially. This is expected until the nvidia-driver-daemonset Pod has finished building and installing the Nvidia drivers. You can follow the Pod logs to get more insight into what’s occurring.
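The driver build can take several minutes on first boot. The namespace and label selector below assume the operator was installed into a `gpu-operator` namespace with its default labels; adjust them to match your installation:

```shell
# Follow the driver daemonset Pod while it builds and loads the kernel modules
kubectl logs -n gpu-operator -l app=nvidia-driver-daemonset -f

# Check overall operator Pod status; crashlooping Pods should settle
# once the driver is in place
kubectl get pods -n gpu-operator
```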

In conclusion, leveraging the Nvidia GPU operator in a K3s cluster is a convenient way to accelerate containerized workloads. By following these steps and keeping the tips above in mind, you should be able to successfully install and configure the GPU operator within your K3s cluster.