K8s Cluster Etcd Backup and Restore

Common Etcd Operations

Check cluster health

$ ETCDCTL_API=3 etcdctl --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/kubernetes/ssl/etcd.pem --key=/etc/kubernetes/ssl/etcd-key.pem --endpoints=https://192.168.1.60:2379 endpoint health
https://192.168.1.60:2379 is healthy: successfully committed proposal: took = 11.099322ms

View cluster member and leader information

$ ETCDCTL_API=3 etcdctl --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/kubernetes/ssl/etcd.pem --key=/etc/kubernetes/ssl/etcd-key.pem --endpoints=https://192.168.1.60:2379 member list --write-out=table
+------------------+---------+-------------------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |       NAME        |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+-------------------+---------------------------+---------------------------+------------+
| 2175f1dc23237781 | started | etcd-192.168.1.60 | https://192.168.1.60:2380 | https://192.168.1.60:2379 |      false |
+------------------+---------+-------------------+---------------------------+---------------------------+------------+
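Note that member list shows learner status but not which member is the leader. To see the leader, etcdctl's endpoint status subcommand in etcd 3.5 prints an IS LEADER column:

$ ETCDCTL_API=3 etcdctl --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/kubernetes/ssl/etcd.pem --key=/etc/kubernetes/ssl/etcd-key.pem --endpoints=https://192.168.1.60:2379 endpoint status --write-out=table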

Get the value of a specific key

# Get the information stored for an apiserver key
$ ETCDCTL_API=3 etcdctl --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/kubernetes/ssl/etcd.pem --key=/etc/kubernetes/ssl/etcd-key.pem --endpoints=https://192.168.1.60:2379 get /registry/apiregistration.k8s.io/apiservices/v1.apps
/registry/apiregistration.k8s.io/apiservices/v1.apps
{"kind":"APIService","apiVersion":"apiregistration.k8s.io/v1","metadata":{"name":"v1.apps","uid":"3c4ffb30-8956-40dc-bf06-18936c38f27c","creationTimestamp":"2024-03-04T12:44:36Z","labels":{"kube-aggregator.kubernetes.io/automanaged":"onstart"},"managedFields":[{"manager":"kube-apiserver","operation":"Update","apiVersion":"apiregistration.k8s.io/v1","time":"2024-03-04T12:44:36Z","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:labels":{".":{},"f:kube-aggregator.kubernetes.io/automanaged":{}}},"f:spec":{"f:group":{},"f:groupPriorityMinimum":{},"f:version":{},"f:versionPriority":{}}}}]},"spec":{"group":"apps","version":"v1","groupPriorityMinimum":17800,"versionPriority":15},"status":{"conditions":[{"type":"Available","status":"True","lastTransitionTime":"2024-03-04T12:44:36Z","reason":"Local","message":"Local APIServices are always available"}]}}

# Query the IP blocks that the Calico network has allocated to each node
$ ETCDCTL_API=3 etcdctl --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/kubernetes/ssl/etcd.pem --key=/etc/kubernetes/ssl/etcd-key.pem --endpoints=https://192.168.1.60:2379  get  --prefix /calico/ipam/v2/host
/calico/ipam/v2/host/master-60/ipv4/block/172.20.125.128-26
{"state":"confirmed","deleted":false}
/calico/ipam/v2/host/worker-61/ipv4/block/172.20.92.192-26
{"state":"confirmed","deleted":false}
/calico/ipam/v2/host/worker-62/ipv4/block/172.20.248.128-26
{"state":"confirmed","deleted":false}
/calico/ipam/v2/host/worker-63/ipv4/block/172.20.234.192-26
{"state":"confirmed","deleted":false}

List all keys stored in Etcd

$ ETCDCTL_API=3 etcdctl --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/kubernetes/ssl/etcd.pem --key=/etc/kubernetes/ssl/etcd-key.pem --endpoints=https://192.168.1.60:2379  get / --prefix --keys-only
...
/registry/services/specs/kube-system/kube-dns

/registry/services/specs/kube-system/kube-dns-upstream

/registry/services/specs/kube-system/metrics-server

/registry/services/specs/kube-system/node-local-dns

/registry/storageclasses/qiqios-nfs-storage
...
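Because every Kubernetes object lives under the /registry prefix, the same flags can scope a query to a single resource type or namespace. For example, a quick sketch to list the Pod keys in kube-system, then count all Pods in the cluster (--keys-only separates keys with blank lines, hence the grep):

$ ETCDCTL_API=3 etcdctl --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/kubernetes/ssl/etcd.pem --key=/etc/kubernetes/ssl/etcd-key.pem --endpoints=https://192.168.1.60:2379 get /registry/pods/kube-system --prefix --keys-only
$ ETCDCTL_API=3 etcdctl --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/kubernetes/ssl/etcd.pem --key=/etc/kubernetes/ssl/etcd-key.pem --endpoints=https://192.168.1.60:2379 get /registry/pods --prefix --keys-only | grep -v '^$' | wc -l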

Get the Etcd version

$ ETCDCTL_API=3 etcdctl --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/kubernetes/ssl/etcd.pem --key=/etc/kubernetes/ssl/etcd-key.pem --endpoints=https://192.168.1.60:2379  version
etcdctl version: 3.5.9
API version: 3.5

Backing Up Etcd

Use snapshot save to back up the data on the target node. The script is as follows:

$ cat /data/backup/etcd_backup.sh
#!/bin/sh

ETCDCTL_PATH='/opt/kube/bin/etcdctl'
BACKUP_DIR="/data/backup/etcd/"
node="192.168.1.60"
# Single date call so the date and time parts cannot straddle a midnight boundary
BackupFile="snapshot$(date +%Y%m%d_%H%M%S).db"

ETCDCTL_CERT="/etc/kubernetes/ssl/etcd.pem"
ETCDCTL_KEY="/etc/kubernetes/ssl/etcd-key.pem"
ETCDCTL_CA_FILE="/etc/kubernetes/ssl/ca.pem"

[ ! -d "$BACKUP_DIR" ] && mkdir -p "$BACKUP_DIR"

export ETCDCTL_API=3
$ETCDCTL_PATH --endpoints="https://$node:2379" \
snapshot save "$BACKUP_DIR/$BackupFile" \
--cacert="$ETCDCTL_CA_FILE" \
--cert="$ETCDCTL_CERT" \
--key="$ETCDCTL_KEY"

sleep 3

# Rotate: keep only the 10 newest snapshots (ls -t lists names newest
# first and, unlike ls -lt, emits no "total" line to throw off the count)
cd "$BACKUP_DIR" && ls -t | awk 'NR>10' | xargs -r rm -f --
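The script snapshots a single member, which is enough on a healthy cluster because etcd replicates every write to all members. For an HA etcd cluster, a hedged variant could probe the endpoints and snapshot the first healthy one (the 192.168.1.61 endpoint below is hypothetical):

# Sketch: snapshot the first healthy member of a multi-node cluster
for ep in https://192.168.1.60:2379 https://192.168.1.61:2379; do
    if $ETCDCTL_PATH --endpoints="$ep" --cacert="$ETCDCTL_CA_FILE" \
        --cert="$ETCDCTL_CERT" --key="$ETCDCTL_KEY" endpoint health >/dev/null 2>&1; then
        $ETCDCTL_PATH --endpoints="$ep" --cacert="$ETCDCTL_CA_FILE" \
            --cert="$ETCDCTL_CERT" --key="$ETCDCTL_KEY" \
            snapshot save "$BACKUP_DIR/$BackupFile"
        break
    fi
done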

Configure the cron schedule

0 3 * * * /bin/sh /data/backup/etcd_backup.sh
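The entry can be installed non-interactively, for example with this minimal sketch that appends it to root's existing crontab:

$ (crontab -l 2>/dev/null; echo '0 3 * * * /bin/sh /data/backup/etcd_backup.sh') | crontab -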

Restoring Etcd Data

Create a test pod

$ cat nginx_pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx1
spec:
  containers:
  - name: nginx1
    image: swr.cn-north-4.myhuaweicloud.com/qiqios/nginx:1.25.2-alpine
    ports:
    - containerPort: 80
$ kubectl apply -f nginx_pod.yaml
$ kubectl get pod  -o wide
NAME     READY   STATUS    RESTARTS   AGE   IP              NODE        NOMINATED NODE   READINESS GATES
nginx1   1/1     Running   0          14s   172.20.92.193   worker-61   <none>           <none>
$ curl -I 172.20.92.193
HTTP/1.1 200 OK
Server: nginx/1.25.2
Date: Wed, 06 Mar 2024 03:35:21 GMT
Content-Type: text/html
Content-Length: 615
Last-Modified: Tue, 15 Aug 2023 19:24:07 GMT
Connection: keep-alive
ETag: "64dbd0d7-267"
Accept-Ranges: bytes

Manually run the Etcd backup

$ /data/backup/etcd_backup.sh
{"level":"info","ts":"2024-03-06T11:50:12.538329+0800","caller":"snapshot/v3_snapshot.go:65","msg":"created temporary db file","path":"/data/backup/etcd//snapshot20240306_115012.db.part"}
{"level":"info","ts":"2024-03-06T11:50:12.54825+0800","logger":"client","caller":"v3@v3.5.9/maintenance.go:212","msg":"opened snapshot stream; downloading"}
{"level":"info","ts":"2024-03-06T11:50:12.548298+0800","caller":"snapshot/v3_snapshot.go:73","msg":"fetching snapshot","endpoint":"https://192.168.1.60:2379"}
{"level":"info","ts":"2024-03-06T11:50:12.610273+0800","logger":"client","caller":"v3@v3.5.9/maintenance.go:220","msg":"completed snapshot read; closing"}
{"level":"info","ts":"2024-03-06T11:50:12.733733+0800","caller":"snapshot/v3_snapshot.go:88","msg":"fetched snapshot","endpoint":"https://192.168.1.60:2379","size":"3.4 MB","took":"now"}
{"level":"info","ts":"2024-03-06T11:50:12.73383+0800","caller":"snapshot/v3_snapshot.go:97","msg":"saved","path":"/data/backup/etcd//snapshot20240306_115012.db"}
Snapshot saved at /data/backup/etcd//snapshot20240306_115012.db
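Before relying on a snapshot it is worth verifying it. In etcd 3.5 the etcdutl tool (shipped alongside etcdctl; the /opt/kube/bin path below assumes it was installed next to etcdctl) reports the snapshot's hash, revision, key count, and size:

$ /opt/kube/bin/etcdutl snapshot status /data/backup/etcd/snapshot20240306_115012.db --write-out=table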

Delete the nginx pod

$ kubectl delete pod nginx1

Stop the kube-apiserver and etcd services

Stop the kube-apiserver service and make sure it is no longer running:

$ systemctl stop kube-apiserver
$ systemctl stop etcd
# The order matters: kube-apiserver must stop first so that no new data is written to etcd during the restore
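A quick sanity check that both processes are really down before touching the data directories (both units should report inactive, and nothing should still be listening on the etcd ports):

$ systemctl is-active kube-apiserver etcd
$ ss -lntp | grep -E '2379|2380'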

Move the existing Etcd data aside

$ mv /var/lib/etcd/ data.bak
$ mv /data/etcd_wal/ wal.bak

Restore the Etcd data from the snapshot backup

# First check the etcd.service startup parameters
$ systemctl cat etcd.service
# /etc/systemd/system/etcd.service
[Unit]
Description=Etcd Server
After=network.target
After=network-online.target
Wants=network-online.target
Documentation=https://github.com/coreos

[Service]
Type=notify
WorkingDirectory=/var/lib/etcd
ExecStart=/opt/kube/bin/etcd \
  --name=etcd-192.168.1.60 \
  --cert-file=/etc/kubernetes/ssl/etcd.pem \
  --key-file=/etc/kubernetes/ssl/etcd-key.pem \
  --peer-cert-file=/etc/kubernetes/ssl/etcd.pem \
  --peer-key-file=/etc/kubernetes/ssl/etcd-key.pem \
  --trusted-ca-file=/etc/kubernetes/ssl/ca.pem \
  --peer-trusted-ca-file=/etc/kubernetes/ssl/ca.pem \
  --initial-advertise-peer-urls=https://192.168.1.60:2380 \
  --listen-peer-urls=https://192.168.1.60:2380 \
  --listen-client-urls=https://192.168.1.60:2379,http://127.0.0.1:2379 \
  --advertise-client-urls=https://192.168.1.60:2379 \
  --initial-cluster-token=etcd-cluster-0 \
  --initial-cluster=etcd-192.168.1.60=https://192.168.1.60:2380 \
  --initial-cluster-state=new \
  --data-dir=/var/lib/etcd \
  --wal-dir=/data/etcd_wal \
  --snapshot-count=50000 \
  --auto-compaction-retention=1 \
  --auto-compaction-mode=periodic \
  --max-request-bytes=10485760 \
  --quota-backend-bytes=8589934592
Restart=always
RestartSec=15
LimitNOFILE=65536
OOMScoreAdjust=-999

[Install]
WantedBy=multi-user.target

# Restore command
$ ETCDCTL_API=3 etcdctl  snapshot restore /data/backup/etcd//snapshot20240306_115012.db \
> --name etcd-192.168.1.60 \
> --initial-advertise-peer-urls=https://192.168.1.60:2380 \
> --initial-cluster-token=etcd-cluster-0 \
> --initial-cluster=etcd-192.168.1.60=https://192.168.1.60:2380 \
> --data-dir=/var/lib/etcd \
> --wal-dir=/data/etcd_wal
Deprecated: Use `etcdutl snapshot restore` instead.

2024-03-06T12:04:48+08:00       info    snapshot/v3_snapshot.go:248     restoring snapshot      {"path": "/data/backup/etcd//snapshot20240306_115012.db", "wal-dir": "/data/etcd_wal", "data-dir": "/var/lib/etcd", "snap-dir": "/var/lib/etcd/member/snap", "stack": "go.etcd.io/etcd/etcdutl/v3/snapshot.(*v3Manager).Restore\n\tgo.etcd.io/etcd/etcdutl/v3@v3.5.9/snapshot/v3_snapshot.go:254\ngo.etcd.io/etcd/etcdutl/v3/etcdutl.SnapshotRestoreCommandFunc\n\tgo.etcd.io/etcd/etcdutl/v3@v3.5.9/etcdutl/snapshot_command.go:147\ngo.etcd.io/etcd/etcdctl/v3/ctlv3/command.snapshotRestoreCommandFunc\n\tgo.etcd.io/etcd/etcdctl/v3/ctlv3/command/snapshot_command.go:129\ngithub.com/spf13/cobra.(*Command).execute\n\tgithub.com/spf13/cobra@v1.1.3/command.go:856\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\tgithub.com/spf13/cobra@v1.1.3/command.go:960\ngithub.com/spf13/cobra.(*Command).Execute\n\tgithub.com/spf13/cobra@v1.1.3/command.go:897\ngo.etcd.io/etcd/etcdctl/v3/ctlv3.Start\n\tgo.etcd.io/etcd/etcdctl/v3/ctlv3/ctl.go:107\ngo.etcd.io/etcd/etcdctl/v3/ctlv3.MustStart\n\tgo.etcd.io/etcd/etcdctl/v3/ctlv3/ctl.go:111\nmain.main\n\tgo.etcd.io/etcd/etcdctl/v3/main.go:59\nruntime.main\n\truntime/proc.go:250"}
2024-03-06T12:04:48+08:00       info    membership/store.go:141 Trimming membership information from the backend...
2024-03-06T12:04:48+08:00       info    membership/cluster.go:421       added member    {"cluster-id": "7d4c6b3b6b302fc3", "local-member-id": "0", "added-peer-id": "2175f1dc23237781", "added-peer-peer-urls": ["https://192.168.1.60:2380"]}
2024-03-06T12:04:48+08:00       info    snapshot/v3_snapshot.go:269     restored snapshot       {"path": "/data/backup/etcd//snapshot20240306_115012.db", "wal-dir": "/data/etcd_wal", "data-dir": "/var/lib/etcd", "snap-dir": "/var/lib/etcd/member/snap"}
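The restore above matches this single-member cluster. In a multi-member cluster, snapshot restore must be run on every member with the same snapshot file, each member using its own --name and peer URL plus the full --initial-cluster list. A hedged sketch for a hypothetical second member at 192.168.1.61:

$ ETCDCTL_API=3 etcdctl snapshot restore /data/backup/etcd//snapshot20240306_115012.db \
> --name etcd-192.168.1.61 \
> --initial-advertise-peer-urls=https://192.168.1.61:2380 \
> --initial-cluster-token=etcd-cluster-0 \
> --initial-cluster=etcd-192.168.1.60=https://192.168.1.60:2380,etcd-192.168.1.61=https://192.168.1.61:2380 \
> --data-dir=/var/lib/etcd \
> --wal-dir=/data/etcd_wal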

Start the etcd and kube-apiserver services

$ systemctl start etcd
$ systemctl status etcd
● etcd.service - Etcd Server
     Loaded: loaded (/etc/systemd/system/etcd.service; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2024-03-06 12:05:45 CST; 8s ago
       Docs: https://github.com/coreos
   Main PID: 1393603 (etcd)
      Tasks: 9 (limit: 4594)
     Memory: 10.5M
     CGroup: /system.slice/etcd.service
             └─1393603 /opt/kube/bin/etcd --name=etcd-192.168.1.60 --cert-file=/etc/kubernetes/ssl/etcd.pem --key-file=/etc/kubernetes/ssl/et>

Mar 06 12:05:45 master-60 etcd[1393603]: {"level":"info","ts":"2024-03-06T12:05:45.261796+0800","caller":"embed/serve.go:103","msg":"ready to>
Mar 06 12:05:45 master-60 etcd[1393603]: {"level":"info","ts":"2024-03-06T12:05:45.262102+0800","caller":"embed/serve.go:103","msg":"ready to>
Mar 06 12:05:45 master-60 etcd[1393603]: {"level":"info","ts":"2024-03-06T12:05:45.262297+0800","caller":"membership/cluster.go:584","msg":"s>
Mar 06 12:05:45 master-60 etcd[1393603]: {"level":"info","ts":"2024-03-06T12:05:45.26249+0800","caller":"api/capability.go:75","msg":"enabled>
Mar 06 12:05:45 master-60 etcd[1393603]: {"level":"info","ts":"2024-03-06T12:05:45.263593+0800","caller":"etcdserver/server.go:2595","msg":"c>
Mar 06 12:05:45 master-60 etcd[1393603]: {"level":"info","ts":"2024-03-06T12:05:45.263096+0800","caller":"embed/serve.go:187","msg":"serving >
Mar 06 12:05:45 master-60 etcd[1393603]: {"level":"info","ts":"2024-03-06T12:05:45.262388+0800","caller":"etcdmain/main.go:44","msg":"notifyi>
Mar 06 12:05:45 master-60 etcd[1393603]: {"level":"info","ts":"2024-03-06T12:05:45.264109+0800","caller":"etcdmain/main.go:50","msg":"success>
Mar 06 12:05:45 master-60 systemd[1]: Started Etcd Server.

# Etcd cluster health check
$ ETCDCTL_API=3 etcdctl --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/kubernetes/ssl/etcd.pem --key=/etc/kubernetes/ssl/etcd-key.pem --endpoints=https://192.168.1.60:2379 endpoint health
https://192.168.1.60:2379 is healthy: successfully committed proposal: took = 12.293125ms

$ systemctl start kube-apiserver

Check the cluster and verify that the nginx service has been restored

$ kubectl get node
NAME        STATUS                     ROLES       AGE   VERSION
master-60   Ready,SchedulingDisabled   master      39h   v1.28.1
worker-61   Ready                      node        39h   v1.28.1
worker-62   Ready                      node        39h   v1.28.1
worker-63   Ready                      edge,node   39h   v1.28.1

$ kubectl get pod -o wide
NAME     READY   STATUS    RESTARTS   AGE   IP              NODE        NOMINATED NODE   READINESS GATES
nginx1   1/1     Running   0          6s    172.20.92.194   worker-61   <none>           <none>

$ curl -I 172.20.92.194
HTTP/1.1 200 OK
Server: nginx/1.25.2
Date: Wed, 06 Mar 2024 04:13:35 GMT
Content-Type: text/html
Content-Length: 615
Last-Modified: Tue, 15 Aug 2023 19:24:07 GMT
Connection: keep-alive
ETag: "64dbd0d7-267"
Accept-Ranges: bytes

Etcd Data Restore Summary

Backing up a Kubernetes cluster is mainly a matter of backing up the Etcd cluster. When restoring, the order of operations is critical:

Stop kube-apiserver -> Stop Etcd -> Restore the data -> Start Etcd -> Start kube-apiserver
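The whole sequence fits in a single wrapper; a minimal sketch, assuming the same single-member topology and paths used throughout this post, with the snapshot file passed as the first argument:

#!/bin/sh
# Hypothetical one-shot restore wrapper; usage: etcd_restore.sh <snapshot.db>
set -e
SNAPSHOT="$1"

systemctl stop kube-apiserver          # stop writers first
systemctl stop etcd

# Keep the old data directories in case the restore has to be rolled back
mv /var/lib/etcd "/var/lib/etcd.bak.$(date +%s)"
mv /data/etcd_wal "/data/etcd_wal.bak.$(date +%s)"

ETCDCTL_API=3 etcdctl snapshot restore "$SNAPSHOT" \
  --name etcd-192.168.1.60 \
  --initial-advertise-peer-urls=https://192.168.1.60:2380 \
  --initial-cluster-token=etcd-cluster-0 \
  --initial-cluster=etcd-192.168.1.60=https://192.168.1.60:2380 \
  --data-dir=/var/lib/etcd \
  --wal-dir=/data/etcd_wal

systemctl start etcd
systemctl start kube-apiserver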

