Implementing HPA in a Kubernetes Cluster Based on GPU Metrics

Preface

  1. Prometheus server and the dcgm-exporter components are already deployed; dcgm-exporter consists of the exporter itself plus a Service and a ServiceMonitor.
  2. The Prometheus ConfigMap has been adjusted so that Prometheus can scrape the metrics exposed by dcgm-exporter.
  3. See the official dcgm-exporter 2.0 documentation for the available metric fields.

How it works: Kubernetes supports horizontal scaling of workloads through the HPA module, which natively handles metrics such as CPU and memory. The original Heapster-based HPA does not support scaling on GPU metrics, but the HPA metric set can be extended through the Custom Metrics API. By deploying Prometheus Adapter as a custom metrics server, Prometheus metrics are registered with the API server and exposed for the HPA to consume. With the appropriate configuration, the HPA can then use these custom metrics as scaling signals, enabling elastic scaling on GPU utilization.
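As a quick sanity check of this chain, the Custom Metrics API registration can be inspected directly with kubectl. This is a sketch and assumes cluster access; the APIService object only exists once Prometheus Adapter (installed later in this post) is running:

```shell
# The APIService for the custom metrics group is created by Prometheus Adapter
kubectl get apiservice v1beta1.custom.metrics.k8s.io

# Enumerate all metrics the adapter currently exposes to the HPA
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | python3 -m json.tool | head -40
```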

Deploying the dcgm-exporter Components

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: "dcgm-exporter"
  namespace: monitoring
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.1.0"
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app.kubernetes.io/name: "dcgm-exporter"
      app.kubernetes.io/version: "2.1.0"
  template:
    metadata:
      labels:
        app.kubernetes.io/name: "dcgm-exporter"
        app.kubernetes.io/version: "2.1.0"
      name: "dcgm-exporter"
    spec:
      containers:
      - image: "mirrors.com:80/rancher/dcgm-exporter:2.0.13-2.1.1-ubuntu18.04"
        env:
        - name: "DCGM_EXPORTER_LISTEN"
          value: ":9400"
        - name: "DCGM_EXPORTER_KUBERNETES"
          value: "true"
        name: "dcgm-exporter"
        ports:
        - name: "metrics"
          containerPort: 9400
        securityContext:
          runAsNonRoot: false
          runAsUser: 0
        volumeMounts:
        - name: "pod-gpu-resources"
          readOnly: true
          mountPath: "/var/lib/kubelet/pod-resources"
      volumes:
      - name: "pod-gpu-resources"
        hostPath:
          path: "/var/lib/kubelet/pod-resources"
      nodeSelector:
        gpu-type: T4
---
kind: Service
apiVersion: v1
metadata:
  name: "dcgm-exporter"
  namespace: monitoring
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.1.0"
spec:
  selector:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.1.0"
  ports:
  - name: "metrics"
    port: 9400
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: monitoring
  labels:
    app.kubernetes.io/name: dcgm-exporter
    app.kubernetes.io/version: "2.1.0"
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: dcgm-exporter
      app.kubernetes.io/version: "2.1.0"
  endpoints:
  - port: metrics
    interval: 30s # adjust the scrape interval as needed
    scheme: http
  namespaceSelector:
    matchNames:
    - monitoring # namespace in which to discover the dcgm-exporter Service
```
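Before wiring up Prometheus, it is worth confirming that the exporter pods are actually serving metrics. A minimal sketch, assuming the DaemonSet above is running in the monitoring namespace:

```shell
# Pick one dcgm-exporter pod
POD=$(kubectl -n monitoring get pods -l app.kubernetes.io/name=dcgm-exporter \
  -o jsonpath='{.items[0].metadata.name}')
kubectl -n monitoring port-forward "$POD" 9400:9400 &

# The GPU utilization series used later for scaling should appear here
curl -s http://localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL
```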

Installing Prometheus Adapter with Helm

Project: https://github.com/kubernetes-sigs/prometheus-adapter

Helm releases: https://github.com/helm/helm/releases

```shell
# For offline deployment, pull the chart in advance from a machine with internet access
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm pull prometheus-community/prometheus-adapter
```
  • Edit values.yaml

Adapter rules documentation: https://github.com/kubernetes-sigs/prometheus-adapter/blob/master/docs/config-walkthrough.md

```yaml
...
image:
  repository: mirrors.com:80/monitoring/prometheus-adapter
  tag: v0.12.0

...
prometheus:
  # Value is templated
  url: http://prometheus-k8s.monitoring.svc # the Prometheus Service address in the current environment
  port: 9090
  path: ""

...
rules:
  default: true

  custom: # rules that translate Prometheus series into custom metrics
  - seriesQuery: '{UUID!=""}'
    resources:
      overrides:
        node: {resource: "node"}
        exported_pod: {resource: "pod"}
        exported_namespace: {resource: "namespace"}
    name:
      matches: ^DCGM_FI_(.*)$
      as: "${1}_over_time"
    metricsQuery: ceil(avg_over_time(<<.Series>>{<<.LabelMatchers>>}[3m]))
  - seriesQuery: '{UUID!=""}'
    resources:
      overrides:
        node: {resource: "node"}
        exported_pod: {resource: "pod"}
        exported_namespace: {resource: "namespace"}
    name:
      matches: ^DCGM_FI_(.*)$
      as: "${1}_current"
    metricsQuery: <<.Series>>{<<.LabelMatchers>>}
```
  • Install Prometheus Adapter
```shell
helm install prometheus-adapter -f values.yaml -n kube-system .

# Verify that the converted metric is exposed
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | grep 'DEV_GPU_UTIL_current'
```
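Beyond checking that the metric name is listed, the current per-pod values can be read from the same API. This is a sketch; the default namespace and the pod wildcard are assumptions to be adapted to where the GPU workload runs:

```shell
kubectl get --raw \
  "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/DEV_GPU_UTIL_current" \
  | python3 -m json.tool
```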
  • Edit the Prometheus ConfigMap and confirm that service discovery for the exporter is configured
```yaml
$ kubectl edit cm -n monitoring prometheus-config
apiVersion: v1
data:
  config.yml: |
    basic_auth_users:
      admin: $2y$12$jlwC.4777WgcQaSb14aFROxK6sRvQCKNBAxgYzM6guEjD.E2/HH4e
  prometheus.yml: |
    global:
      scrape_interval: 15s
      scrape_timeout: 10s
      evaluation_interval: 1m
    scrape_configs:
    - job_name: 'kubernetes-gpu'
      kubernetes_sd_configs:
      - role: pod
        namespaces:
          own_namespace: false
          names:
          - monitoring
      relabel_configs:
      - source_labels: [__address__]
        action: keep
        regex: '(.*):9400'
      - source_labels: [__meta_kubernetes_pod_controller_name]
        action: keep
        regex: 'dcgm-exporter'
      - source_labels: [__meta_kubernetes_pod_node_name]
        action: replace
        target_label: node
      - source_labels: [__meta_kubernetes_pod_host_ip]
        action: replace
        target_label: node_ip
```
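To confirm that the kubernetes-gpu job is actually scraping the exporter, the Prometheus HTTP API can be queried from inside the cluster. A sketch, assuming the Service URL configured for the adapter above and the admin credentials from basic_auth_users (password placeholder to be substituted):

```shell
# Every dcgm-exporter target for the job should report up == 1
curl -s -u admin:<password> \
  'http://prometheus-k8s.monitoring.svc:9090/api/v1/query' \
  --data-urlencode 'query=up{job="kubernetes-gpu"}'
```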

Testing Elastic Scaling of the GPU Service

  • Deploy an inference service
```yaml
# bert.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: bert-intent-detection
spec:
  replicas: 1
  selector:
    matchLabels:
      app: bert-intent-detection
  template:
    metadata:
      labels:
        app: bert-intent-detection
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: gpu-type
                operator: In
                values:
                - T4
      containers:
      - name: bert-container
        image: mirrors.com:80/xiaomishu/bert-intent-detection:1.0.1
        ports:
        - containerPort: 80
        resources:
          limits:
            nvidia.com/gpu: 1
---
apiVersion: v1
kind: Service
metadata:
  name: bert-intent-detection-svc
  labels:
    app: bert-intent-detection
spec:
  selector:
    app: bert-intent-detection
  ports:
  - protocol: TCP
    name: http
    port: 8081
    targetPort: 80
```

```shell
# Smoke-test the service
curl -v http://10.43.214.241:8081/predict?query=Music
* Trying 10.43.214.241:8081...
* Connected to 10.43.214.241 (10.43.214.241) port 8081 (#0)
> GET /predict?query=Music HTTP/1.1
> Host: 10.43.214.241:8081
> User-Agent: curl/7.71.1
> Accept: */*
>
* Mark bundle as not supporting multiuse
* HTTP 1.0, assume close after body
< HTTP/1.0 200 OK
< Content-Type: text/html; charset=utf-8
< Content-Length: 9
< Server: Werkzeug/1.0.1 Python/3.6.9
< Date: Tue, 10 Dec 2024 06:06:15 GMT
<
* Closing connection 0
PlayMusic
```
  • Create an HPA that scales on GPU utilization
```yaml
# bert-hpa.yaml, mind the cluster version: this environment is older than 1.23
apiVersion: autoscaling/v2beta1 # use the autoscaling/v2beta1 HPA API
kind: HorizontalPodAutoscaler
metadata:
  name: gpu-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: bert-intent-detection
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metricName: DEV_GPU_UTIL_current # current GPU utilization
      targetAverageValue: 20
```
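The replica count the HPA derives from this metric follows the standard scaling rule, desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue). A minimal sketch in Python, using the numbers observed in the load test below:

```python
import math

def desired_replicas(current_replicas: int, current_value: float, target: float) -> int:
    """Core HPA rule: ceil(currentReplicas * currentMetricValue / desiredMetricValue)."""
    return math.ceil(current_replicas * current_value / target)

# Average GPU utilization of 36 against a target of 20 with 1 replica -> scale to 2
print(desired_replicas(1, 36, 20))  # 2

# At 19/20 the ratio is within tolerance and the HPA stays at 1 replica
print(desired_replicas(1, 19, 20))  # 1
```

When utilization drops back to 0, the computed count falls below minReplicas and the HPA floors it at 1, which matches the scale-down seen after the load test stops.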
```shell
$ kubectl get hpa
NAME      REFERENCE                          TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
gpu-hpa   Deployment/bert-intent-detection   0/20      1         10        1          4h1m

$ kubectl describe hpa gpu-hpa
Name:               gpu-hpa
Namespace:          default
Labels:             <none>
Annotations:        <none>
CreationTimestamp:  Tue, 10 Dec 2024 10:01:52 +0800
Reference:          Deployment/bert-intent-detection
Metrics:            ( current / target )
  "DEV_GPU_UTIL_current" on pods:  0 / 20
Min replicas:       1
Max replicas:       10
Deployment pods:    1 current / 1 desired
Conditions:
  Type            Status  Reason            Message
  ----            ------  ------            -------
  AbleToScale     True    ReadyForNewScale  recommended size matches current size
  ScalingActive   True    ValidMetricFound  the HPA was able to successfully calculate a replica count from pods metric DEV_GPU_UTIL_current
  ScalingLimited  True    TooFewReplicas    the desired replica count is less than the minimum replica count
Events:             <none>
```
  • Load testing

hey load-testing tool: https://gitcode.com/gh_mirrors/he/hey/

```shell
# Generate load to trigger a scale-up
hey -n 10000 -c 200 "http://10.43.214.241:8081/predict?query=Music"

$ kubectl get hpa -w
NAME      REFERENCE                          TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
gpu-hpa   Deployment/bert-intent-detection   19/20     1         10        1          4h7m
gpu-hpa   Deployment/bert-intent-detection   36/20     1         10        1          4h7m
gpu-hpa   Deployment/bert-intent-detection   36/20     1         10        2          4h8m

$ kubectl get pod -w
NAME                                    READY   STATUS    RESTARTS   AGE
bert-intent-detection-985fd9b57-cq25p   1/1     Running   0          16s
bert-intent-detection-985fd9b57-lqbxb   1/1     Running   0          4h21m

# Once the load stops and GPU utilization drops below 20%, the pods are scaled back down about five minutes later
$ kubectl get hpa
NAME      REFERENCE                          TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
gpu-hpa   Deployment/bert-intent-detection   0/20      1         10        2          4h11m

$ kubectl get pod
NAME                                    READY   STATUS    RESTARTS   AGE
bert-intent-detection-985fd9b57-lqbxb   1/1     Running   0          4h35m
```

Implementing HPA in a Kubernetes Cluster Based on GPU Metrics
http://example.com/2025/04/25/k8s集群基于GPU显卡实现hpa 功能/
Author: 种田人
Published: April 25, 2025
License