Integrating an etcd Cluster with Prometheus Monitoring

In a Kubernetes cluster, etcd runs either as a pod or as an external etcd cluster. Both deployments need monitoring.

The work falls into three parts:

  • Configure a target so that Prometheus scrapes the etcd metrics
  • Configure a Grafana dashboard template
  • Configure alerting rules

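Understanding etcd metrics

Official documentation: https://etcd.io/docs/v3.5/metrics/

By default, etcd serves metrics at the /metrics path of its client port, 2379.

Check that the metrics endpoint responds: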

$ curl -I 127.0.0.1:2379/metrics
HTTP/1.1 200 OK
Access-Control-Allow-Headers: accept, content-type, authorization
Access-Control-Allow-Methods: POST, GET, OPTIONS, PUT, DELETE
Access-Control-Allow-Origin: *
Content-Type: text/plain; version=0.0.4; charset=utf-8
Date: Wed, 27 Dec 2023 08:01:43 GMT

For an etcd cluster with TLS enabled, you must also pass a client certificate, for example:

certs=/etc/kubernetes/pki/etcd/
curl --cacert $certs/ca.crt --cert $certs/peer.crt --key $certs/peer.key https://192.168.10.55:2379/metrics
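
The same check can be scripted. The following is a minimal Python sketch, not part of the original setup: `fetch_metrics` and `parse_samples` are hypothetical helper names, the parser assumes samples carry no timestamps, and hostname verification assumes the etcd server certificate lists the address you connect to.

```python
import ssl
import urllib.request


def parse_samples(text):
    """Parse Prometheus text-format output into {series: value}.

    Comment lines (# HELP / # TYPE) and blank lines are skipped. The key
    keeps the label set, e.g. 'metric_bucket{le="0.001"}'.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Assumption: no trailing timestamp, so the last field is the value.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


def fetch_metrics(url, ca_file, cert_file, key_file):
    """Fetch /metrics from a TLS-enabled etcd member with a client certificate."""
    context = ssl.create_default_context(cafile=ca_file)
    context.load_cert_chain(certfile=cert_file, keyfile=key_file)
    with urllib.request.urlopen(url, context=context, timeout=5) as response:
        return response.read().decode("utf-8")


# Usage, with the paths from the curl example above:
#   certs = "/etc/kubernetes/pki/etcd"
#   text = fetch_metrics("https://192.168.10.55:2379/metrics",
#                        f"{certs}/ca.crt", f"{certs}/peer.crt", f"{certs}/peer.key")
#   print(parse_samples(text)["etcd_server_has_leader"])
```

Keeping parsing separate from fetching makes it possible to test the parser against a saved copy of the metrics output.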

etcd metrics fall into three broad categories:

  • Server
  • Disk
  • Network

Server

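The following metrics share the prefix etcd_server_

  • has_leader: whether the member sees a leader
    • Without a leader, the cluster is in an abnormal state
  • leader_changes_seen_total: number of leader changes seen
    • Frequent changes indicate an unstable cluster that is prone to problems, including data loss
  • proposals_committed_total: total number of consensus proposals committed
    • A large gap between a member and the leader suggests the member has a problem, for example that it is running slowly
  • proposals_applied_total: total number of consensus proposals applied
    • The difference between committed and applied proposals should stay small; a steadily growing gap indicates that etcd is under heavy load
  • proposals_pending: number of proposals waiting to be committed
    • A rising value indicates heavy load on etcd or a member that cannot commit proposals
  • proposals_failed_total: total number of failed proposals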
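
As a rough illustration of how these values are read together, here is a small Python sketch. The function name and both thresholds (`max_apply_lag`, `max_pending`) are hypothetical placeholders for illustration, not values recommended by etcd; tune them for your own cluster.

```python
def server_warnings(samples, max_apply_lag=5000, max_pending=100):
    """Return human-readable warnings derived from etcd_server_* samples.

    samples maps metric names to their current values for ONE member.
    The thresholds are illustrative assumptions only.
    """
    warnings = []

    # A missing has_leader sample is treated like "no leader".
    if samples.get("etcd_server_has_leader", 0) == 0:
        warnings.append("member reports no leader")

    # Committed minus applied should stay small on a healthy member.
    lag = (samples.get("etcd_server_proposals_committed_total", 0)
           - samples.get("etcd_server_proposals_applied_total", 0))
    if lag > max_apply_lag:
        warnings.append(f"apply lag is {lag:.0f} proposals")

    # A growing pending queue points at load or commit problems.
    pending = samples.get("etcd_server_proposals_pending", 0)
    if pending > max_pending:
        warnings.append(f"{pending:.0f} proposals pending")

    return warnings
```

An empty list means none of the three checks fired for that member.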

Disk

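The following metrics share the prefix etcd_disk_

  • wal_fsync_duration_seconds_bucket: latency of the fdatasync system call issued when writing the WAL
  • backend_commit_duration_seconds_bucket: latency of backend commits

High disk operation latency (wal_fsync_duration_seconds or backend_commit_duration_seconds) usually indicates a disk problem. It can cause high request latency or make the cluster unstable.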
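
Both disk metrics are histograms, so a latency figure such as p99 has to be estimated from the cumulative `_bucket` counters. The Python sketch below mirrors the linear interpolation that PromQL's `histogram_quantile` performs, as far as I understand it. In PromQL the function is normally applied to `rate()` of the buckets rather than to raw counters; the bucket values here are made up for illustration.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets is a list of (upper_bound, cumulative_count) pairs and must
    include the (math.inf, total_count) bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan

    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    for index, (upper, count) in enumerate(buckets):
        if count >= rank:
            break

    # If the rank falls in the +Inf bucket, report the highest finite bound.
    if math.isinf(upper):
        return buckets[-2][0]

    # Interpolate linearly inside the bucket.
    lower, below = (0.0, 0.0) if index == 0 else buckets[index - 1]
    return lower + (upper - lower) * (rank - below) / (count - below)


# Hypothetical WAL fsync buckets: (le in seconds, cumulative observations).
wal_fsync = [(0.001, 500), (0.002, 900), (0.004, 980),
             (0.008, 995), (0.016, 1000), (math.inf, 1000)]
p99 = histogram_quantile(0.99, wal_fsync)
print(f"p99 WAL fsync latency: {p99 * 1000:.2f} ms")
```

Because the result is interpolated within a bucket, it is only as precise as the bucket boundaries allow.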

Network

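The following metrics share the prefix etcd_network_

  • peer_sent_bytes_total: bytes sent to other members
  • peer_received_bytes_total: bytes received from other members
  • peer_sent_failures_total: number of send failures
  • peer_received_failures_total: number of receive failures
  • peer_round_trip_time_seconds: round-trip time (RTT) between members
  • client_grpc_sent_bytes_total: bytes sent to clients
  • client_grpc_received_bytes_total: bytes received from clients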

Core metrics

Of the etcd metrics listed above, the most important are:

  • **etcd_server_has_leader**: whether the etcd cluster has a leader; 0 means there is no leader and the whole cluster is unavailable
  • **etcd_server_leader_changes_seen_total**: cumulative number of leader changes in the etcd cluster
  • **etcd_disk_wal_fsync_duration_seconds_bucket**: WAL fsync latency; should normally stay below 10 ms
  • **etcd_disk_backend_commit_duration_seconds_bucket**: backend (db) commit latency; should normally stay below 120 ms
  • **etcd_network_peer_round_trip_time_seconds_bucket**: RTT between members

Configuring etcd monitoring

Preparing the etcd certificates for Prometheus

$ tree /etc/kubernetes/ssl/
/etc/kubernetes/ssl/
├── aggregator-proxy-key.pem
├── aggregator-proxy.pem
├── ca-key.pem
├── ca.pem
├── etcd
│   ├── ca.crt
│   ├── healthcheck-client.crt
│   └── healthcheck-client.key
├── etcd-key.pem
├── etcd.pem
├── kubelet-key.pem
├── kubelet.pem
├── kubernetes-key.pem
└── kubernetes.pem
# Certificate mapping
ca.crt  ->  ca.pem
healthcheck-client.crt ->  etcd.pem
healthcheck-client.key ->  etcd-key.pem

# Check that the certificates are valid
$ ETCDCTL_API=3 etcdctl --endpoints=https://192.168.1.70:2379 --cacert=/etc/kubernetes/ssl/etcd/ca.crt --cert=/etc/kubernetes/ssl/etcd/healthcheck-client.crt --key=/etc/kubernetes/ssl/etcd/healthcheck-client.key --write-out=table endpoint status
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|         ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://192.168.1.70:2379 | fd5ffac0d7256dc2 |   3.5.9 |   17 MB |      true |      false |         8 |    3029796 |            3029796 |        |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

Create a secret for Prometheus to mount

$ kubectl -n monitoring create secret generic etcd-certs --from-file=/etc/kubernetes/ssl/etcd/healthcheck-client.crt --from-file=/etc/kubernetes/ssl/etcd/healthcheck-client.key --from-file=/etc/kubernetes/ssl/etcd/ca.crt
$ kubectl describe secrets -n monitoring etcd-certs
Name:         etcd-certs
Namespace:    monitoring
Labels:       <none>
Annotations:  <none>

Type:  Opaque

Data
====
ca.crt:                  1310 bytes
healthcheck-client.crt:  1415 bytes
healthcheck-client.key:  1675 bytes
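
If you prefer to keep the secret as a manifest rather than create it imperatively, the following Python sketch builds an equivalent object. `build_secret_manifest` is a hypothetical helper; it relies on the fact that a generic secret created with --from-file is of type Opaque and stores each file under its base name, base64-encoded.

```python
import base64
import json
import os


def build_secret_manifest(name, namespace, paths):
    """Build an Opaque Secret whose keys are the base names of the given files."""
    data = {}
    for path in paths:
        with open(path, "rb") as handle:
            # Secret "data" values must be base64-encoded strings.
            data[os.path.basename(path)] = base64.b64encode(handle.read()).decode("ascii")
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        "data": data,
    }


# Usage (prints JSON, which kubectl also accepts, e.g. via "kubectl apply -f -"):
#   etcd_dir = "/etc/kubernetes/ssl/etcd"
#   files = [f"{etcd_dir}/{n}" for n in
#            ("ca.crt", "healthcheck-client.crt", "healthcheck-client.key")]
#   print(json.dumps(build_secret_manifest("etcd-certs", "monitoring", files), indent=2))
```

Treat the generated output as sensitive, since it contains the private key.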

Mount etcd-certs into the Prometheus server.

File path: /apps/kube-prometheus/manifests/prometheus-prometheus.yaml

---
nodeSelector:
  beta.kubernetes.io/os: linux
replicas: 2
secrets:
- etcd-certs

Apply the change, then check the certificate files inside the pod

$ kubectl apply -f /apps/kube-prometheus/manifests/prometheus-prometheus.yaml
$ kubectl exec -it -n monitoring prometheus-k8s-1 -- tree /etc/prometheus/secrets/etcd-certs/
/etc/prometheus/secrets/etcd-certs/
├── ca.crt -> ..data/ca.crt
├── healthcheck-client.crt -> ..data/healthcheck-client.crt
└── healthcheck-client.key -> ..data/healthcheck-client.key

0 directories, 3 files

Creating a ServiceMonitor

Create a ServiceMonitor object so that Prometheus scrapes the etcd metrics.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: etcd-k8s
  namespace: monitoring
  labels:
    k8s-app: etcd-k8s
spec:
  jobLabel: k8s-app
  endpoints:
  - port: https-metrics # must match spec.ports[].name of the etcd Service
    interval: 30s
    scheme: https
    tlsConfig:
      caFile: /etc/prometheus/secrets/etcd-certs/ca.crt
      certFile: /etc/prometheus/secrets/etcd-certs/healthcheck-client.crt
      keyFile: /etc/prometheus/secrets/etcd-certs/healthcheck-client.key
      insecureSkipVerify: true
  selector:
    matchLabels:
      k8s-app: etcd   # must match the labels on the etcd Service
  namespaceSelector:
    matchNames:
    - kube-system

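Creating the etcd Service and Endpoints

Neither an external etcd cluster nor one started as static pods gets a Service in Kubernetes. Prometheus discovers its scrape targets through a Service and its Endpoints, so both objects have to be created manually.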

apiVersion: v1
kind: Service
metadata:
  name: etcd-k8s
  namespace: kube-system
  labels:
    k8s-app: etcd
spec:
  type: ClusterIP
  clusterIP: None
  ports:
  - name: https-metrics
    port: 2379
    protocol: TCP
---
apiVersion: v1
kind: Endpoints
metadata:
  name: etcd-k8s
  namespace: kube-system
  labels:
    k8s-app: etcd
subsets:
- addresses:
  - ip: 192.168.1.70   # IP of the etcd host node
    nodeName: master-70
  ports:
  - name: https-metrics
    port: 2379
    protocol: TCP
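
The ServiceMonitor, Service, and Endpoints only work together when their labels, namespaces, and port names line up, and a mismatch leaves the target missing without an obvious error. The Python sketch below checks that wiring on plain dictionaries that mirror the manifests above. `check_wiring` is a hypothetical helper, and the namespace fallback reflects my understanding of the operator's behaviour rather than a quoted specification.

```python
def check_wiring(service_monitor, service, endpoints):
    """Return a list of mismatches between the three objects (empty if none)."""
    problems = []
    sm_spec = service_monitor["spec"]
    svc_meta = service["metadata"]

    # 1. selector.matchLabels must be a subset of the Service labels.
    for key, value in sm_spec["selector"].get("matchLabels", {}).items():
        if svc_meta.get("labels", {}).get(key) != value:
            problems.append(f"Service lacks label {key}={value}")

    # 2. The Service namespace must be listed in namespaceSelector.matchNames.
    #    Assumption: without matchNames only the ServiceMonitor's namespace is searched.
    namespaces = sm_spec.get("namespaceSelector", {}).get(
        "matchNames", [service_monitor["metadata"]["namespace"]])
    if svc_meta["namespace"] not in namespaces:
        problems.append(f"namespace {svc_meta['namespace']} is not selected")

    # 3. Endpoints belong to a Service through identical name and namespace.
    ep_meta = endpoints["metadata"]
    if (ep_meta["name"], ep_meta["namespace"]) != (svc_meta["name"], svc_meta["namespace"]):
        problems.append("Endpoints name/namespace differs from the Service")

    # 4. Each scraped port must be a named port on both Service and Endpoints.
    svc_ports = {port["name"] for port in service["spec"]["ports"]}
    ep_ports = {port["name"] for subset in endpoints["subsets"] for port in subset["ports"]}
    for endpoint in sm_spec["endpoints"]:
        if endpoint["port"] not in svc_ports:
            problems.append(f"port {endpoint['port']} is not defined on the Service")
        if endpoint["port"] not in ep_ports:
            problems.append(f"port {endpoint['port']} is not defined on the Endpoints")
    return problems


service_monitor = {
    "metadata": {"name": "etcd-k8s", "namespace": "monitoring"},
    "spec": {
        "endpoints": [{"port": "https-metrics", "scheme": "https"}],
        "selector": {"matchLabels": {"k8s-app": "etcd"}},
        "namespaceSelector": {"matchNames": ["kube-system"]},
    },
}
service = {
    "metadata": {"name": "etcd-k8s", "namespace": "kube-system", "labels": {"k8s-app": "etcd"}},
    "spec": {"ports": [{"name": "https-metrics", "port": 2379}]},
}
endpoints = {
    "metadata": {"name": "etcd-k8s", "namespace": "kube-system"},
    "subsets": [{"addresses": [{"ip": "192.168.1.70"}],
                 "ports": [{"name": "https-metrics", "port": 2379}]}],
}
print(check_wiring(service_monitor, service, endpoints) or "wiring looks consistent")
```

The check is deliberately narrow: it ignores matchExpressions and namespaceSelector.any, which the manifests in this article do not use.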

Configuring alerts

Official rules

The etcd project also publishes a set of alerting rules; see etcd3_alert.rules.yml.

A simplified version is shown below:

# these rules synced manually from https://github.com/etcd-io/etcd/blob/master/Documentation/etcd-mixin/mixin.libsonnet
groups:
  - name: etcd
    rules:
      - alert: etcdInsufficientMembers
        annotations:
          message: 'etcd cluster "{{ $labels.job }}": insufficient members ({{ $value }}).'
        expr: |
          sum(up{job=~".*etcd.*"} == bool 1) by (job) < ((count(up{job=~".*etcd.*"}) by (job) + 1) / 2)
        for: 3m
        labels:
          severity: critical
      - alert: etcdNoLeader
        annotations:
          message: 'etcd cluster "{{ $labels.job }}": member {{ $labels.instance }} has no leader.'
        expr: |
          etcd_server_has_leader{job=~".*etcd.*"} == 0
        for: 1m
        labels:
          severity: critical
      - alert: etcdHighNumberOfLeaderChanges
        annotations:
          message: 'etcd cluster "{{ $labels.job }}": instance {{ $labels.instance }} has seen {{ $value }} leader changes within the last hour.'
        expr: |
          rate(etcd_server_leader_changes_seen_total{job=~".*etcd.*"}[15m]) > 3
        for: 15m
        labels:
          severity: warning
      - alert: etcdHighNumberOfFailedGRPCRequests
        annotations:
          message: 'etcd cluster "{{ $labels.job }}": {{ $value }}% of requests for {{ $labels.grpc_method }} failed on etcd instance {{ $labels.instance }}.'
        expr: |
          100 * sum(rate(grpc_server_handled_total{job=~".*etcd.*", grpc_code!="OK"}[5m])) BY (job, instance, grpc_service, grpc_method)
            /
          sum(rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) BY (job, instance, grpc_service, grpc_method)
            > 1
        for: 10m
        labels:
          severity: warning
      - alert: etcdHighNumberOfFailedGRPCRequests
        annotations:
          message: 'etcd cluster "{{ $labels.job }}": {{ $value }}% of requests for {{ $labels.grpc_method }} failed on etcd instance {{ $labels.instance }}.'
        expr: |
          100 * sum(rate(grpc_server_handled_total{job=~".*etcd.*", grpc_code!="OK"}[5m])) BY (job, instance, grpc_service, grpc_method)
            /
          sum(rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) BY (job, instance, grpc_service, grpc_method)
            > 5
        for: 5m
        labels:
          severity: critical
      - alert: etcdHighNumberOfFailedProposals
        annotations:
          message: 'etcd cluster "{{ $labels.job }}": {{ $value }} proposal failures within the last hour on etcd instance {{ $labels.instance }}.'
        expr: |
          rate(etcd_server_proposals_failed_total{job=~".*etcd.*"}[15m]) > 5
        for: 15m
        labels:
          severity: warning
      - alert: etcdHighFsyncDurations
        annotations:
          message: 'etcd cluster "{{ $labels.job }}": 99th percentile fsync durations are {{ $value }}s on etcd instance {{ $labels.instance }}.'
        expr: |
          histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~".*etcd.*"}[5m]))
          > 0.5
        for: 10m
        labels:
          severity: warning
      - alert: etcdHighCommitDurations
        annotations:
          message: 'etcd cluster "{{ $labels.job }}": 99th percentile commit durations {{ $value }}s on etcd instance {{ $labels.instance }}.'
        expr: |
          histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket{job=~".*etcd.*"}[5m]))
          > 0.25
        for: 10m
        labels:
          severity: warning
      - alert: etcdHighNodeRTTDurations
        annotations:
          message: 'etcd cluster "{{ $labels.job }}": node RTT durations {{ $value }}s on etcd instance {{ $labels.instance }}.'
        expr: |
          histogram_quantile(0.99,rate(etcd_network_peer_round_trip_time_seconds_bucket[5m])) > 0.5
        for: 10m
        labels:
          severity: warning
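
The etcdInsufficientMembers expression is compact, so it helps to work through its arithmetic. It compares the number of targets with up == 1 against (n + 1) / 2, where n counts all targets of the job, including those that are down. The Python sketch below reproduces that comparison; `insufficient_members` is a hypothetical name and the input lists are invented examples.

```python
def insufficient_members(up_values):
    """Evaluate alive < (n + 1) / 2 for one job.

    up_values holds the `up` sample (0 or 1) of every etcd scrape target.
    """
    alive = sum(1 for value in up_values if value == 1)
    total = len(up_values)
    return alive < (total + 1) / 2


for ups in ([1, 1, 1], [1, 1, 0], [1, 0, 0], [1, 1, 0, 0], [1, 1, 1, 0, 0]):
    state = "FIRING" if insufficient_members(ups) else "ok"
    print(f"{sum(ups)}/{len(ups)} members up -> {state}")
```

For a three-member cluster the rule tolerates one member being down and fires when a second is lost, which is the point at which quorum is gone.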

Core rules

In practice, it is enough to configure alerting rules for the core metrics mentioned earlier.

# Fewer than (n+1)/2 etcd members are currently up
sum(up{job=~".*etcd.*"} == bool 1) by (job) < ((count(up{job=~".*etcd.*"}) by (job) + 1) / 2)

# etcd has no leader (value 0), which means the cluster is unavailable
etcd_server_has_leader{job=~".*etcd.*"} == 0

# More than 3 leader changes within 15 minutes
increase(etcd_server_leader_changes_seen_total{job=~".*etcd.*"}[15m]) > 3

# p99 WAL fsync latency over 5 minutes exceeds 500 ms
histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~".*etcd.*"}[5m])) > 0.5

# p99 backend (db) commit latency over 5 minutes exceeds 250 ms
histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket{job=~".*etcd.*"}[5m])) > 0.25

# p99 RTT between members over 5 minutes exceeds 500 ms
histogram_quantile(0.99,rate(etcd_network_peer_round_trip_time_seconds_bucket[5m])) > 0.5
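
For thresholds on counters such as etcd_server_leader_changes_seen_total, it matters whether the expression uses rate() or increase(): rate() yields a per-second average over the window, while increase() yields the growth over the whole window. The Python sketch below uses deliberately simplified versions of both (no counter-reset handling and no extrapolation, unlike Prometheus) with invented sample values.

```python
def rate_per_second(first, last, window_seconds):
    """Simplified rate(): average per-second growth of a counter."""
    return (last - first) / window_seconds


def increase(first, last, window_seconds):
    """Simplified increase(): growth over the window, i.e. rate times window."""
    return rate_per_second(first, last, window_seconds) * window_seconds


# Hypothetical counter values 15 minutes apart: four leader changes.
window = 15 * 60
first, last = 8, 12

print(f"rate     = {rate_per_second(first, last, window):.5f} changes/s")
print(f"increase = {increase(first, last, window):.1f} changes")
print("rate > 3:    ", rate_per_second(first, last, window) > 3)
print("increase > 3:", increase(first, last, window) > 3)
```

With these numbers a threshold of 3 is crossed only by the increase-style comparison; the per-second rate would need thousands of leader changes in the window to exceed it.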

A trimmed-down rule file containing only alerts on the metrics above

# these rules synced manually from https://github.com/etcd-io/etcd/blob/master/Documentation/etcd-mixin/mixin.libsonnet
groups:
  - name: etcd
    rules:
      - alert: etcdInsufficientMembers
        annotations:
          message: 'etcd cluster "{{ $labels.job }}": insufficient members ({{ $value }}).'
        expr: |
          sum(up{job=~".*etcd.*"} == bool 1) by (job) < ((count(up{job=~".*etcd.*"}) by (job) + 1) / 2)
        for: 3m
        labels:
          severity: critical
      - alert: etcdNoLeader
        annotations:
          message: 'etcd cluster "{{ $labels.job }}": member {{ $labels.instance }} has no leader.'
        expr: |
          etcd_server_has_leader{job=~".*etcd.*"} == 0
        for: 1m
        labels:
          severity: critical
      - alert: etcdHighFsyncDurations
        annotations:
          message: 'etcd cluster "{{ $labels.job }}": 99th percentile fsync durations are {{ $value }}s on etcd instance {{ $labels.instance }}.'
        expr: |
          histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~".*etcd.*"}[5m]))
          > 0.5
        for: 10m
        labels:
          severity: warning
      - alert: etcdHighCommitDurations
        annotations:
          message: 'etcd cluster "{{ $labels.job }}": 99th percentile commit durations {{ $value }}s on etcd instance {{ $labels.instance }}.'
        expr: |
          histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket{job=~".*etcd.*"}[5m]))
          > 0.25
        for: 10m
        labels:
          severity: warning
      - alert: etcdHighNodeRTTDurations
        annotations:
          message: 'etcd cluster "{{ $labels.job }}": node RTT durations {{ $value }}s on etcd instance {{ $labels.instance }}.'
        expr: |
          histogram_quantile(0.99,rate(etcd_network_peer_round_trip_time_seconds_bucket[5m])) > 0.5
        for: 10m
        labels:
          severity: warning

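Loading the rules into Prometheus

Creating a PrometheusRule

Use a PrometheusRule object to store these rules.

The labels are the important part: Prometheus later uses them to select this rule object.

The labels used here match those of the built-in PrometheusRule objects generated by default, so the other rules are not affected.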

cat > pr.yaml << "EOF"
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: k8s
    role: alert-rules
  name: etcd-rules
  namespace: monitoring
spec:
  groups:
    - name: etcd
      rules:
        - alert: etcdInsufficientMembers
          annotations:
            message: 'etcd cluster "{{ $labels.job }}": insufficient members ({{ $value }}).'
          expr: |
            sum(up{job=~".*etcd.*"} == bool 1) by (job) < ((count(up{job=~".*etcd.*"}) by (job) + 1) / 2)
          for: 3m
          labels:
            severity: critical
        - alert: etcdNoLeader
          annotations:
            message: 'etcd cluster "{{ $labels.job }}": member {{ $labels.instance }} has no leader.'
          expr: |
            etcd_server_has_leader{job=~".*etcd.*"} == 0
          for: 1m
          labels:
            severity: critical
        - alert: etcdHighFsyncDurations
          annotations:
            message: 'etcd cluster "{{ $labels.job }}": 99th percentile fsync durations are {{ $value }}s (normal is < 10ms) on etcd instance {{ $labels.instance }}.'
          expr: |
            histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~".*etcd.*"}[5m])) > 0.5
          for: 3m
          labels:
            severity: warning
        - alert: etcdHighCommitDurations
          annotations:
            message: 'etcd cluster "{{ $labels.job }}": 99th percentile commit durations {{ $value }}s (normal is < 120ms) on etcd instance {{ $labels.instance }}.'
          expr: |
            histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket{job=~".*etcd.*"}[5m])) > 0.25
          for: 3m
          labels:
            severity: warning
        - alert: etcdHighNodeRTTDurations
          annotations:
            message: 'etcd cluster "{{ $labels.job }}": node RTT durations {{ $value }}s on etcd instance {{ $labels.instance }}.'
          expr: |
            histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket[5m])) > 0.5
          for: 3m
          labels:
            severity: warning
EOF
$ kubectl apply -f pr.yaml

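Prometheus server configuration

In the Prometheus server manifest shown below, ruleSelector is set to {} (an empty selector), which matches PrometheusRule objects regardless of their labels. No further change is needed for Prometheus to pick up the new PrometheusRule automatically.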

---
  ruleNamespaceSelector: {}
  ruleSelector: {}
  securityContext:
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  serviceAccountName: prometheus-k8s
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector: {}
  version: 2.46.0
---

http://www.qiqios.cn/2023/12/19/etcd集群接入Prometheus监控系统/
Author: 一亩三分地
Published: December 19, 2023
License