NVIDIA GPU Utilization


Published: August 29, 2025

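This post reviews the techniques related to NVIDIA GPU utilization.

What exactly is GPU utilization?

When running a job, most people check GPU utilization with nvidia-smi, in the "Volatile GPU Util" column (or with nvidia-smi pmon; users of Tegra products use tegrastats).

NVIDIA's official definition of utilization: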

Percent of time over the past sample period during which one or more kernels was executing on the GPU. The sample period may be between 1 second and 1/6 second depending on the product.

https://developer.download.nvidia.com/compute/DCGM/docs/nvidia-smi-367.38.pdf

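In other words, it is the fraction of the sample period during which one or more kernels were running. The length of the sample period depends on the product and may be anywhere from 1/6 s to 1 s.

Note the wording "one or more kernels". As long as a single kernel is executing, the reported GPU utilization can be very high, yet a single kernel does not necessarily keep all of the GPU's compute units (cores) busy.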
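The gap between the two quantities can be written down as a pair of ratios. The symbols here ($T$ for the sample period, $t_{\text{busy}}$ for the time with at least one kernel resident, $N_{\text{SM}}$ for the number of SMs, $n_{\text{active}}$ for the SMs actually running work) and the figure of 100 SMs are introduced only for illustration; they are not NVIDIA's notation.

```latex
\[
\text{GPU-Util} = \frac{t_{\text{busy}}}{T} \times 100\%,
\qquad
\text{fraction of SMs in use} = \frac{n_{\text{active}}}{N_{\text{SM}}}
\]
\[
t_{\text{busy}} = T,\quad n_{\text{active}} = 1,\quad N_{\text{SM}} = 100
\;\Longrightarrow\;
\text{GPU-Util} = 100\%,\qquad \frac{n_{\text{active}}}{N_{\text{SM}}} = 1\%
\]
```

A single-block kernel that stays resident for the whole sample period therefore reports full utilization while touching only one SM in this hypothetical device.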

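So watch out for the following two misconceptions:

GPU utilization = the fraction of the GPU's compute units that are doing work. The higher the utilization, the more fully the compute capacity is being used.
Under the same conditions, higher utilization always means shorter run time.

Just remember one fact: 99% GPU utilization does not mean your GPU is delivering 99% of its compute capacity. The figure only tells you that a kernel is running. How well it is running, whether the compute capacity is actually being exercised, and whether the cores in each SM are all busy cannot be determined from this number alone. Even an empty kernel can create the illusion of very high GPU utilization: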

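#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Minimal definition of the CheckCudaErrors helper: abort if a CUDA runtime call fails.
static void CheckCudaErrors(cudaError_t err) {
  if (err == cudaSuccess) return;
  fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
  exit(EXIT_FAILURE);
}

// Does no useful work: a single thread spins forever.
__global__ void warmup_kernel() {
  while (1) {}
}

void gpu_warmup() {
  warmup_kernel<<<1, 1>>>();                 // one block, one thread
  CheckCudaErrors(cudaDeviceSynchronize());  // does not return while the kernel keeps spinning
}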
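A bounded variant of the spin kernel makes the effect observable from code. The sketch below is an illustration under stated assumptions, not the author's method: `spin_kernel` and `seconds_to_cycles` are names made up here, NVML index 0 is assumed to be the same physical GPU as CUDA device 0, and the spin duration is only approximate because it is derived from the peak clock rate. It launches one thread that spins for a few seconds and polls `nvmlDeviceGetUtilizationRates` in the meantime.

```cuda
#include <chrono>
#include <cstdio>
#include <thread>
#include <cuda_runtime.h>
#include <nvml.h>

// Convert a duration in seconds to GPU clock ticks, given the clock rate in kHz.
long long seconds_to_cycles(int clock_khz, double seconds) {
  return static_cast<long long>(clock_khz * 1000.0 * seconds);
}

// One thread spins until about `cycles` ticks have elapsed; it computes nothing.
__global__ void spin_kernel(long long cycles) {
  long long start = clock64();
  while (clock64() - start < cycles) {
  }
}

int main() {
  int clock_khz = 0;
  if (cudaDeviceGetAttribute(&clock_khz, cudaDevAttrClockRate, 0) != cudaSuccess) {
    fprintf(stderr, "no usable CUDA device\n");
    return 1;
  }

  nvmlDevice_t dev;
  // Assumption: NVML index 0 and CUDA device 0 refer to the same physical GPU.
  if (nvmlInit() != NVML_SUCCESS ||
      nvmlDeviceGetHandleByIndex(0, &dev) != NVML_SUCCESS) {
    fprintf(stderr, "NVML initialisation failed\n");
    return 1;
  }

  // The launch is asynchronous, so the host can poll NVML while the kernel spins.
  spin_kernel<<<1, 1>>>(seconds_to_cycles(clock_khz, 5.0));

  for (int i = 0; i < 4; ++i) {
    std::this_thread::sleep_for(std::chrono::seconds(1));
    nvmlUtilization_t util;
    if (nvmlDeviceGetUtilizationRates(dev, &util) == NVML_SUCCESS) {
      printf("sample %d: gpu=%u%% memory=%u%%\n", i, util.gpu, util.memory);
    }
  }

  cudaDeviceSynchronize();  // wait for the spin kernel to finish
  nvmlShutdown();
  return 0;
}
```

It can be built with something like `nvcc spin_util.cu -o spin_util -lnvidia-ml` (the file name is arbitrary). If the reasoning above holds, the `gpu` field should read high on an otherwise idle device even though only one thread on one SM is busy.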

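Higher utilization does not imply shorter run time, and lower utilization does not imply longer run time. For example, after enabling MPS (NVIDIA Multi-Process Service), utilization dropped, yet the job ran faster.
So, under the same conditions, run time gets shorter only when the GPU's compute capacity is used more fully.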

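Improving real SM utilization

GPU-Util on its own is not a trustworthy indicator.
Improving real utilization can start from the task itself, for example:

Increase the amount of data per step (a larger batch_size, a larger network)
Reduce I/O time
Load data ahead of time, and similar techniques
Submit multiple tasks to the GPU at the same time so that their operations complement each other (some operations do not need the GPU), keeping the GPU from sitting idle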
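Two of the items above, loading data ahead of time and keeping several pieces of work in flight, can be sketched with CUDA streams. This is a minimal illustration under assumptions: the `scale` kernel, the `blocks_for` helper, and the chunk sizes are hypothetical stand-ins for a real workload, and most error checking is omitted for brevity. Each chunk gets its own stream, so the copy for one chunk may overlap with the kernel of another.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for real work: scale every element in place.
__global__ void scale(float* data, int n, float factor) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= factor;
}

// Number of blocks needed to cover n elements with `threads` threads per block.
int blocks_for(int n, int threads) { return (n + threads - 1) / threads; }

int main() {
  const int kChunks = 4, kN = 1 << 20, kThreads = 256;
  const size_t bytes = kN * sizeof(float);

  float* host = nullptr;
  // Pinned host memory is required for cudaMemcpyAsync to overlap with kernels.
  if (cudaMallocHost(&host, kChunks * bytes) != cudaSuccess) {
    fprintf(stderr, "no usable CUDA device\n");
    return 1;
  }
  for (int i = 0; i < kChunks * kN; ++i) host[i] = 1.0f;

  float* dev[kChunks];
  cudaStream_t streams[kChunks];
  for (int c = 0; c < kChunks; ++c) {
    cudaMalloc(&dev[c], bytes);
    cudaStreamCreate(&streams[c]);
  }

  // Operations within one stream run in order; different streams may overlap.
  for (int c = 0; c < kChunks; ++c) {
    float* h = host + static_cast<size_t>(c) * kN;
    cudaMemcpyAsync(dev[c], h, bytes, cudaMemcpyHostToDevice, streams[c]);
    scale<<<blocks_for(kN, kThreads), kThreads, 0, streams[c]>>>(dev[c], kN, 2.0f);
    cudaMemcpyAsync(h, dev[c], bytes, cudaMemcpyDeviceToHost, streams[c]);
  }
  cudaDeviceSynchronize();  // wait for every stream to drain

  printf("host[0] = %.1f\n", host[0]);

  for (int c = 0; c < kChunks; ++c) {
    cudaStreamDestroy(streams[c]);
    cudaFree(dev[c]);
  }
  cudaFreeHost(host);
  return 0;
}
```

Whether the copies and kernels actually overlap depends on the device and on how much of the GPU each kernel occupies, so a profiler timeline is a better check than GPU-Util.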

Ref

https://github.com/wookayin/gpustat
https://docs.nvidia.com/deploy/nvml-api/structnvmlUtilization__t.html#structnvmlUtilization__t
https://github.com/NVIDIA/gpu-monitoring-tools/issues/12
https://docs.nvidia.com/deploy/nvidia-smi/index.html
https://www.trainy.ai/blog/gpu-utilization-misleading
https://prophetstor.com/white-papers/why-100-gpu-utilization-doesnt-mean-100-heat/
https://stackoverflow.com/questions/8223811/a-top-like-utility-for-monitoring-cuda-activity-on-a-gpu
https://rafay.co/ai-and-cloud-native-blog/deep-dive-into-nvidia-smi-monitoring-your-nvidia-gpu-with-real-examples
https://forums.developer.nvidia.com/t/something-like-top-to-monitor-the-gpu/20714/14
https://stackoverflow.com/questions/40937894/nvidia-smi-volatile-gpu-utilization-explanation


Jinwen Li
