Python Decorators for Monitoring GPU Usage
We typically need to monitor GPU usage for various reasons, such as checking whether we are maximising utilisation (i.e., maximising training throughput) or over-utilising GPU memory. The following assumes Nvidia GPUs.
Monitoring with nvidia-smi
nvidia-smi might be one of the first tools a command-line ML person learns, because of its universality and convenience. However, it is quite clunky to print everything and, even worse, to monitor/eyeball the numbers at the refresh rate while we are training. Imagine you have 8 GPUs and only 3 of them (and which three, exactly?) are being used by your script while the rest are used by others. Or imagine you need to check the performance of multiple functions.
pynvml: Python bindings for the Nvidia Management Library
Recently I discovered the Python bindings for the Nvidia Management Library: pynvml, a library for monitoring and managing various states of NVIDIA GPUs. I was thrilled to learn it is the underlying library for the nvidia-smi tool! With it, we can construct a simple GPU utilisation function, print_gpu_utilisation(), and insert it alongside our training code. Before training, I would recommend running one epoch just to check there are no further OOM errors, and then checking GPU utilisation (to check for underutilisation).
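A minimal sketch of such a function is shown below; it uses pynvml's nvmlDeviceGetMemoryInfo and nvmlDeviceGetUtilizationRates, and the output formatting and GiB conversion are just one way to do it:

```python
import pynvml


def print_gpu_utilisation():
    """Print memory usage and utilisation for every visible GPU (via pynvml)."""
    pynvml.nvmlInit()
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)          # bytes
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)   # percentages
            print(
                f"GPU {i}: {mem.used / 1024**3:.1f}/{mem.total / 1024**3:.1f} GiB used "
                f"| GPU util {util.gpu}% | memory util {util.memory}%"
            )
    finally:
        pynvml.nvmlShutdown()
```

Calling this between steps or epochs gives a quick snapshot of memory and compute utilisation without leaving Python.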
Monitoring GPU utilisation with a Python decorator
We can also turn this into a decorator, so that we can do something like the following:
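Here is a sketch of one way to write it; the decorator name track_gpu_usage is illustrative, and it assumes print_gpu_utilisation() from the previous snippet is in scope:

```python
import functools


def track_gpu_usage(func):
    """Print GPU utilisation before and after the wrapped function runs."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        print(f"GPU usage before {func.__name__}:")
        print_gpu_utilisation()
        result = func(*args, **kwargs)
        print(f"GPU usage after {func.__name__}:")
        print_gpu_utilisation()
        return result
    return wrapper


@track_gpu_usage
def train_one_epoch(model, dataloader):
    ...  # training loop goes here
```

Every decorated function then reports GPU state on entry and exit, which makes it easy to compare the footprint of multiple functions without touching their bodies.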
Ultimately we need to trade off training stability, memory requirements, resource requirements, and time. The main thing to pin down first is training stability, i.e., batch size and learning rate. It is far better to take longer to train the models than to launch many jobs, only to later realise that convergence was poor and have to restart the experiments anyway.
References
HuggingFace: Efficient Training Techniques
Nvidia Management Library API