Preface
The Prometheus server and the dcgm-exporter stack are already deployed; dcgm-exporter consists of the exporter DaemonSet, a Service, and a ServiceMonitor.
Adjusting the Prometheus ConfigMap lets Prometheus scrape the metrics that dcgm-exporter collects.
For the full list of exported metrics, see the official dcgm-exporter 2.0 documentation.
How it works: Kubernetes provides the HPA controller for scaling workloads, and it natively supports CPU and memory metrics. The original Heapster-based HPA cannot scale on GPU metrics, but the HPA can be extended through the Custom Metrics API. By deploying Prometheus Adapter as a custom metrics server, Prometheus metrics are registered with the API server (under custom.metrics.k8s.io) where the HPA can query them. With the right configuration, the HPA then uses a custom metric as its scaling signal, which enables elastic scaling on GPU utilization.
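Regardless of which metric drives it, the HPA's scaling decision follows the standard Kubernetes formula: desired = ceil(currentReplicas × currentMetric / targetMetric), clamped to the configured replica bounds. A small Python sketch for illustration:

```python
import math

def desired_replicas(current_replicas: int, metric_value: float,
                     target_value: float, min_r: int = 1, max_r: int = 10) -> int:
    """Kubernetes HPA core formula:
    desired = ceil(currentReplicas * currentMetric / targetMetric),
    clamped to [minReplicas, maxReplicas]."""
    desired = math.ceil(current_replicas * metric_value / target_value)
    return max(min_r, min(max_r, desired))

# One replica averaging 36% GPU utilization against a target of 20
# scales out to 2 replicas:
print(desired_replicas(1, 36, 20))  # -> 2
```

The same formula drives scale-in: when the metric falls back below the target, the desired count shrinks toward minReplicas.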
dcgm-exporter component deployment

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: "dcgm-exporter"
  namespace: monitoring
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.1.0"
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app.kubernetes.io/name: "dcgm-exporter"
      app.kubernetes.io/version: "2.1.0"
  template:
    metadata:
      labels:
        app.kubernetes.io/name: "dcgm-exporter"
        app.kubernetes.io/version: "2.1.0"
      name: "dcgm-exporter"
    spec:
      containers:
      - image: "mirrors.com:80/rancher/dcgm-exporter:2.0.13-2.1.1-ubuntu18.04"
        env:
        - name: "DCGM_EXPORTER_LISTEN"
          value: ":9400"
        - name: "DCGM_EXPORTER_KUBERNETES"
          value: "true"
        name: "dcgm-exporter"
        ports:
        - name: "metrics"
          containerPort: 9400
        securityContext:
          runAsNonRoot: false
          runAsUser: 0
        volumeMounts:
        - name: "pod-gpu-resources"
          readOnly: true
          mountPath: "/var/lib/kubelet/pod-resources"
      volumes:
      - name: "pod-gpu-resources"
        hostPath:
          path: "/var/lib/kubelet/pod-resources"
      nodeSelector:
        gpu-type: T4
---
kind: Service
apiVersion: v1
metadata:
  name: "dcgm-exporter"
  namespace: monitoring
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.1.0"
spec:
  selector:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.1.0"
  ports:
  - name: "metrics"
    port: 9400
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: monitoring
  labels:
    app.kubernetes.io/name: dcgm-exporter
    app.kubernetes.io/version: "2.1.0"
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: dcgm-exporter
      app.kubernetes.io/version: "2.1.0"
  endpoints:
  - port: metrics
    interval: 30s
    scheme: http
  namespaceSelector:
    matchNames:
    - monitoring
```
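Once the DaemonSet is up, each pod serves metrics on :9400 in the Prometheus text exposition format (reachable for a spot check via `kubectl port-forward` and `curl`). A minimal sketch of parsing one such line in Python; the sample line is hypothetical but follows the shape dcgm-exporter 2.x emits:

```python
import re

# Hypothetical sample line in Prometheus text exposition format:
sample = 'DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-aaaa",pod="bert-x"} 36'

def parse_metric(line: str):
    """Split a Prometheus text-format sample into (name, labels, value)."""
    m = re.match(r'^(\w+)\{(.*)\}\s+(\S+)$', line)
    name, raw_labels, value = m.group(1), m.group(2), float(m.group(3))
    labels = dict(kv.split('=', 1) for kv in raw_labels.split(','))
    labels = {k: v.strip('"') for k, v in labels.items()}
    return name, labels, value

name, labels, value = parse_metric(sample)
print(name, labels["gpu"], value)  # DCGM_FI_DEV_GPU_UTIL 0 36.0
```

The `UUID` label on every series is what the adapter rules below key off when discovering GPU metrics.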
Deploy the Prometheus Adapter component with Helm
Project: https://github.com/kubernetes-sigs/prometheus-adapter
Helm releases: https://github.com/helm/helm/releases

```shell
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm pull prometheus-community/prometheus-adapter --untar
```
Adapter rules documentation: https://github.com/kubernetes-sigs/prometheus-adapter/blob/master/docs/config-walkthrough.md
```yaml
...
image:
  repository: mirrors.com:80/monitoring/prometheus-adapter
  tag: v0.12.0
...
prometheus:
  url: http://prometheus-k8s.monitoring.svc
  port: 9090
  path: ""
...
rules:
  default: true
  custom:
  - seriesQuery: '{UUID!=""}'
    resources:
      overrides:
        node: {resource: "node"}
        exported_pod: {resource: "pod"}
        exported_namespace: {resource: "namespace"}
    name:
      matches: ^DCGM_FI_(.*)$
      as: "${1}_over_time"
    metricsQuery: ceil(avg_over_time(<<.Series>>{<<.LabelMatchers>>}[3m]))
  - seriesQuery: '{UUID!=""}'
    resources:
      overrides:
        node: {resource: "node"}
        exported_pod: {resource: "pod"}
        exported_namespace: {resource: "namespace"}
    name:
      matches: ^DCGM_FI_(.*)$
      as: "${1}_current"
    metricsQuery: <<.Series>>{<<.LabelMatchers>>}
```
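The `name` stanza renames each discovered series before it is exposed on the custom metrics API: `matches` is a regex over the Prometheus metric name and `as` substitutes its capture groups. The adapter does this internally; a small Python sketch, purely to illustrate the mapping these two rules produce:

```python
import re

def rewrite(metric: str, suffix: str) -> str:
    """Mimic the adapter rule: matches ^DCGM_FI_(.*)$, as "${1}<suffix>"."""
    return re.sub(r'^DCGM_FI_(.*)$', r'\1' + suffix, metric)

# DCGM_FI_DEV_GPU_UTIL is therefore exposed twice, under two custom-metric names:
print(rewrite('DCGM_FI_DEV_GPU_UTIL', '_over_time'))  # DEV_GPU_UTIL_over_time
print(rewrite('DCGM_FI_DEV_GPU_UTIL', '_current'))    # DEV_GPU_UTIL_current
```

The `_over_time` variant serves a 3-minute average (smoother scaling signal), while `_current` serves the instantaneous value.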
```shell
helm install prometheus-adapter -f values.yaml -n kube-system .
# Verify that the custom metrics are registered
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | grep 'DEV_GPU_UTIL_current'
```
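Per-pod values can also be queried from the same API (e.g. `kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/DEV_GPU_UTIL_current`), which returns a `MetricValueList` JSON document. A sketch of extracting the per-pod value, using a hypothetical, abridged response body:

```python
import json

# Hypothetical (abridged) MetricValueList response:
body = '''
{
  "kind": "MetricValueList",
  "apiVersion": "custom.metrics.k8s.io/v1beta1",
  "items": [
    {
      "describedObject": {"kind": "Pod", "namespace": "default",
                          "name": "bert-intent-detection-985fd9b57-lqbxb"},
      "metricName": "DEV_GPU_UTIL_current",
      "value": "36"
    }
  ]
}
'''

def pod_values(doc: str) -> dict:
    """Map pod name -> metric value from a MetricValueList document."""
    data = json.loads(doc)
    return {i["describedObject"]["name"]: int(i["value"]) for i in data["items"]}

print(pod_values(body))
```

This is the same payload the HPA controller consumes when averaging the metric across the target Deployment's pods.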
Modify the Prometheus ConfigMap and confirm that pod auto-discovery for dcgm-exporter is configured:
```shell
$ kubectl edit cm -n monitoring prometheus-config
```

```yaml
apiVersion: v1
data:
  config.yml: |
    basic_auth_users:
      admin: $2y$12$jlwC.4777WgcQaSb14aFROxK6sRvQCKNBAxgYzM6guEjD.E2/HH4e
  prometheus.yml: |
    global:
      scrape_interval: 15s
      scrape_timeout: 10s
      evaluation_interval: 1m
    scrape_configs:
    - job_name: 'kubernetes-gpu'
      kubernetes_sd_configs:
      - role: pod
        namespaces:
          own_namespace: false
          names:
          - monitoring
      relabel_configs:
      - source_labels: [__address__]
        action: keep
        regex: '(.*):9400'
      - source_labels: [__meta_kubernetes_pod_controller_name]
        action: keep
        regex: 'dcgm-exporter'
      - source_labels: [__meta_kubernetes_pod_node_name]
        action: replace
        target_label: node
      - source_labels: [__meta_kubernetes_pod_host_ip]
        action: replace
        target_label: node_ip
```
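The two `keep` rules drop every discovered pod whose address is not on port 9400 or whose controller is not dcgm-exporter (Prometheus anchors these regexes as full matches), and the two `replace` rules copy node metadata into the `node`/`node_ip` labels. A small Python sketch of that relabel pipeline applied to a hypothetical discovered target:

```python
import re

def relabel(labels):
    """Apply the job's relabel_configs: two anchored 'keep' filters,
    then two 'replace' actions. Returns None when the target is dropped."""
    if not re.fullmatch(r'(.*):9400', labels['__address__']):
        return None
    if not re.fullmatch(r'dcgm-exporter', labels['__meta_kubernetes_pod_controller_name']):
        return None
    out = dict(labels)
    out['node'] = labels['__meta_kubernetes_pod_node_name']
    out['node_ip'] = labels['__meta_kubernetes_pod_host_ip']
    return out

# Hypothetical target discovered via the pod role:
target = {
    '__address__': '10.42.0.7:9400',
    '__meta_kubernetes_pod_controller_name': 'dcgm-exporter',
    '__meta_kubernetes_pod_node_name': 'gpu-node-1',
    '__meta_kubernetes_pod_host_ip': '192.168.1.10',
}
print(relabel(target)['node'])                               # gpu-node-1
print(relabel({**target, '__address__': '10.42.0.7:8080'}))  # None
```

The resulting `node` label is what the adapter's `resources.overrides` above maps onto the Kubernetes node resource.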
Testing elastic scaling of the GPU service
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: bert-intent-detection
spec:
  replicas: 1
  selector:
    matchLabels:
      app: bert-intent-detection
  template:
    metadata:
      labels:
        app: bert-intent-detection
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: gpu-type
                operator: In
                values:
                - T4
      containers:
      - name: bert-container
        image: mirrors.com:80/xiaomishu/bert-intent-detection:1.0.1
        ports:
        - containerPort: 80
        resources:
          limits:
            nvidia.com/gpu: 1
---
apiVersion: v1
kind: Service
metadata:
  name: bert-intent-detection-svc
  labels:
    app: bert-intent-detection
spec:
  selector:
    app: bert-intent-detection
  ports:
  - protocol: TCP
    name: http
    port: 8081
    targetPort: 80
```

```shell
curl -v http://10.43.214.241:8081/predict?query=Music
*   Trying 10.43.214.241:8081...
* Connected to 10.43.214.241 (10.43.214.241) port 8081 (#0)
> GET /predict?query=Music HTTP/1.1
> Host: 10.43.214.241:8081
> User-Agent: curl/7.71.1
> Accept: */*
>
* Mark bundle as not supporting multiuse
* HTTP 1.0, assume close after body
< HTTP/1.0 200 OK
< Content-Type: text/html; charset=utf-8
< Content-Length: 9
< Server: Werkzeug/1.0.1 Python/3.6.9
< Date: Tue, 10 Dec 2024 06:06:15 GMT
<
* Closing connection 0
PlayMusic
```
```yaml
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: gpu-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: bert-intent-detection
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metricName: DEV_GPU_UTIL_current
      targetAverageValue: 20
```
```shell
$ kubectl get hpa
NAME      REFERENCE                          TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
gpu-hpa   Deployment/bert-intent-detection   0/20      1         10        1          4h1m

$ kubectl describe hpa gpu-hpa
Name:               gpu-hpa
Namespace:          default
Labels:             <none>
Annotations:        <none>
CreationTimestamp:  Tue, 10 Dec 2024 10:01:52 +0800
Reference:          Deployment/bert-intent-detection
Metrics:            ( current / target )
  "DEV_GPU_UTIL_current" on pods:  0 / 20
Min replicas:       1
Max replicas:       10
Deployment pods:    1 current / 1 desired
Conditions:
  Type            Status  Reason            Message
  ----            ------  ------            -------
  AbleToScale     True    ReadyForNewScale  recommended size matches current size
  ScalingActive   True    ValidMetricFound  the HPA was able to successfully calculate a replica count from pods metric DEV_GPU_UTIL_current
  ScalingLimited  True    TooFewReplicas    the desired replica count is less than the minimum replica count
Events:           <none>
```
hey load-testing tool: https://gitcode.com/gh_mirrors/he/hey/
```shell
hey -n 10000 -c 200 "http://10.43.214.241:8081/predict?query=Music"

$ kubectl get hpa -w
NAME      REFERENCE                          TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
gpu-hpa   Deployment/bert-intent-detection   19/20     1         10        1          4h7m
gpu-hpa   Deployment/bert-intent-detection   36/20     1         10        1          4h7m
gpu-hpa   Deployment/bert-intent-detection   36/20     1         10        2          4h8m

$ kubectl get pod -w
NAME                                    READY   STATUS    RESTARTS   AGE
bert-intent-detection-985fd9b57-cq25p   1/1     Running   0          16s
bert-intent-detection-985fd9b57-lqbxb   1/1     Running   0          4h21m

$ kubectl get hpa
NAME      REFERENCE                          TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
gpu-hpa   Deployment/bert-intent-detection   0/20      1         10        2          4h11m

$ kubectl get pod
NAME                                    READY   STATUS    RESTARTS   AGE
bert-intent-detection-985fd9b57-lqbxb   1/1     Running   0          4h35m
```