High-Availability Configuration for Nginx Ingress on Kubernetes
How do you configure high availability, elasticity, and canary rollouts when deploying Ingress-Nginx on Kubernetes? This article provides working YAML, canary traffic splitting, and a load-testing and chaos-drill playbook.
1. Basic Deployment (Example)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web
spec:
  ingressClassName: nginx  # replaces the deprecated kubernetes.io/ingress.class annotation
  rules:
    - host: demo.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web-svc
                port:
                  number: 80
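For completeness, a minimal sketch of the web-svc Service the Ingress references, assuming a backend Deployment whose Pods carry the label app: web (the label and container port are illustrative):

apiVersion: v1
kind: Service
metadata:
  name: web-svc
spec:
  selector:
    app: web            # assumed Pod label; match your Deployment
  ports:
    - port: 80
      targetPort: 8080  # assumed container port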
2. Canary Releases
metadata:
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "20"  # 20% of traffic
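These annotations belong on a second Ingress that duplicates the host and path of the primary one but routes to the canary Service. A minimal sketch, assuming a web-svc-canary Service (the name is illustrative):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "20"
spec:
  ingressClassName: nginx
  rules:
    - host: demo.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web-svc-canary  # hypothetical canary Service
                port:
                  number: 80

Raising canary-weight step by step (e.g. 20 → 50 → 100) and then deleting the canary Ingress completes the rollout.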
3. HA Topologies
- DaemonSet + hostNetwork + externalTrafficPolicy=Local, preserving the client source IP;
- or an L4 load-balancer Service in front of a multi-replica Ingress Controller (see the Service sketch below).
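For the second option, a minimal sketch of the fronting Service, assuming the standard ingress-nginx controller labels:

apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local  # preserves source IP; nodes without controller Pods fail the LB health check
  selector:
    app.kubernetes.io/name: ingress-nginx
    app.kubernetes.io/component: controller
  ports:
    - name: http
      port: 80
      targetPort: 80
    - name: https
      port: 443
      targetPort: 443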
4. Load Testing and Drills
- Load test with fortio/vegeta for 10-30 minutes; watch 2xx/4xx/5xx rates and P95 latency (a fortio Job sketch follows).
- Drills: kill nodes/Pods and simulate LB flapping; verify that session affinity and the retry policy behave as expected.
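A minimal sketch of running fortio as a Kubernetes Job; the QPS, connection count, duration, and target URL are illustrative:

apiVersion: batch/v1
kind: Job
metadata:
  name: fortio-load
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: fortio
          image: fortio/fortio
          # 500 QPS over 64 connections for 15 minutes against the demo host
          args: ["load", "-qps", "500", "-c", "64", "-t", "15m", "http://demo.example.com/"]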
5. Topology and Scheduling (Anti-Affinity / Zone Spread / PDB / Tolerations)
controller:
  replicaCount: 3
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/name: ingress-nginx
              app.kubernetes.io/component: controller
          topologyKey: kubernetes.io/hostname
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: ScheduleAnyway
      labelSelector:
        matchLabels:
          app.kubernetes.io/name: ingress-nginx
          app.kubernetes.io/component: controller
  tolerations:
    - key: "node-role.kubernetes.io/ingress"
      operator: "Exists"
      effect: "NoSchedule"
PodDisruptionBudget (prevents all replicas from being evicted at once during rolling updates or node maintenance):
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ingress-nginx-pdb
  namespace: ingress-nginx
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: ingress-nginx
      app.kubernetes.io/component: controller
6. Upgrades and Graceful Drain
controller:
  updateStrategy:
    type: RollingUpdate
  minReadySeconds: 10
  terminationGracePeriodSeconds: 60
  config:
    worker-shutdown-timeout: "30s"
    proxy-next-upstream: "error timeout http_502 http_503 http_504"
    proxy-next-upstream-tries: "2"
    proxy-read-timeout: "30"  # ConfigMap value is in seconds, without a unit suffix
    proxy-send-timeout: "30"
Key points:
- minReadySeconds ensures a new Pod only takes traffic after it has stayed ready; worker-shutdown-timeout gives in-flight connections time to drain.
- Keep upstream retry counts and timeout ceilings conservative to avoid retry storms; for per-route overrides, see the annotation sketch below.
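For non-idempotent routes you can tighten retries on an individual Ingress instead of globally; a minimal sketch using the standard ingress-nginx annotations:

metadata:
  annotations:
    nginx.ingress.kubernetes.io/proxy-next-upstream: "error timeout"
    nginx.ingress.kubernetes.io/proxy-next-upstream-tries: "1"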
7. Capacity and Elasticity (HPA)
controller:
  metrics:
    enabled: true
    serviceMonitor:
      enabled: true
  autoscaling:
    enabled: true
    minReplicas: 3
    maxReplicas: 20
    targetCPUUtilizationPercentage: 60
    targetMemoryUtilizationPercentage: 70
Recommendation: scale on CPU/memory, or on custom QPS/connection-count metrics (this requires a custom metrics adapter); a sketch follows.
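A minimal autoscaling/v2 sketch for a custom QPS signal, assuming an adapter (e.g. prometheus-adapter) exposes a per-Pod metric; the metric name nginx_ingress_requests_per_second and the target value are invented for illustration:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ingress-nginx-hpa
  namespace: ingress-nginx
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ingress-nginx-controller
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: nginx_ingress_requests_per_second  # hypothetical adapter-exposed metric
        target:
          type: AverageValue
          averageValue: "2000"  # illustrative per-Pod QPS target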
8. Upstream Fault Tolerance and Connection Reuse
controller:
  config:
    keep-alive-requests: "1000"            # max requests per client <-> ingress keep-alive connection
    upstream-keepalive-connections: "512"  # idle keep-alive connections kept to upstreams
    max-worker-connections: "65536"
    retries: "1"
    retry-non-idempotent: "false"
Note: if the upstream holds per-session state (e.g. login sessions), pair this with session-cookie affinity or consistent hashing so that requests from one session do not scatter across backends; a sketch follows.
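A minimal sketch of cookie-based session affinity via the standard ingress-nginx annotations (the cookie name and max-age are illustrative):

metadata:
  annotations:
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/affinity-mode: "persistent"
    nginx.ingress.kubernetes.io/session-cookie-name: "route"
    nginx.ingress.kubernetes.io/session-cookie-max-age: "3600"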
9. Observability and Drill Checklist (Extended)
- Metrics: active connections, $upstream_response_time percentiles, 5xx rate, queue depth and file-descriptor usage.
- Logs: a uniform JSON format that retains version/route/upstream fields for post-incident tracing (a ConfigMap sketch follows the list).
- Drills:
  - kill Pods/nodes;
  - inject upstream errors to verify proxy-next-upstream behavior;
  - LB flapping / cross-zone failures;
  - session affinity and connection reuse during scale-out/in and rolling upgrades.
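A minimal sketch of JSON access logs via the controller config; the field selection is illustrative, while log-format-upstream and log-format-escape-json are documented ConfigMap keys:

controller:
  config:
    log-format-escape-json: "true"
    log-format-upstream: '{"time":"$time_iso8601","host":"$host","uri":"$uri","status":"$status","upstream_addr":"$upstream_addr","upstream_response_time":"$upstream_response_time","request_time":"$request_time"}'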