監控面板紅了四天,壞的從來不是服務

監控面板紅了四天,壞的從來不是服務

一萬一千次失敗的告警

健康探針連續回報失敗,FailingStreak 數千次,監控面板整整四天一片紅。直覺反應是服務掛了,但那個服務從頭到尾都在正常處理請求——cron 跑得動,hook 觸發得了,對外 API 沒有任何異常回應。

技術環境

容器化服務環境,以 Docker HEALTHCHECK 指令配置健康探針,定期以 curl -sf 打特定管理端點;exit code 0 為健康、非零為不健康。底層為 Node.js 應用,管理端點的存取規則由框架版本控制,服務本身不感知探針的存在,兩者的關聯僅存在於監控系統的解讀層。問題模式與框架無關——任何「探針目標端點」與「框架版本存取規則」不同步更新的部署流程,都會複現相同行為。

這就像 Gogoro app 上的橘色警示連亮四天,但每次換電池、上路都跑得順。最後發現是韌體升級後感測器回傳格式跑掉了,電池本身沒事。巷口的車行師傅接過手機看一眼,抬頭說「沒問題啊,好好的」——問題出在讀取的那一層,不在被讀取的東西。

框架升級後的靜默改動

根因不難找,但要找到需要把直覺壓下去。框架從 4.x 升到 5.x 之後,某個管理端點悄悄加上了認證要求——沒有帶 token,一律回 401。健康探針用的是早在 4.x 時期寫死的 curl 指令,指向那個現在需要認證的路徑。curl 遇到 4xx 回傳非零 exit code,監控系統把這個訊號解讀成「服務不健康」。

Health.Log 每筆都是 ExitCode: 1,Output 是空的。空的 Output 是關鍵:curl -sf 對 4xx 不輸出任何內容,只靜默地退出。監控系統看到的是「失敗了」,卻沒有任何可讀的錯誤訊息告訴你失敗在哪一層。

錯誤傳染鏈(時序)

監控系統              curl 探針              應用服務
    |                      |                    |
    |── 觸發健康探測 ──────>|                    |
    |                      |── GET /mgmt/health─>|
    |                      |                    |── auth check
    |                      |                    |   no token ← 框架 5.x 新增規則
    |                      |<── 401 Unauthorized─|
    |                      | exit code: 1        |
    |<── ExitCode: 1 ───────|                    |
    |    [UNHEALTHY] ✗      |                    |

實際服務狀態:cron 跑動 ✓ / hook 觸發 ✓ / API 正常回應 ✓
探針回報狀態:UNHEALTHY ✗(四天持續)

問題不在服務本身,而在探針目標端點與框架新版存取規則之間的靜默不同步。

最容易誤判的那個時刻

容易誤診的原因只有一個:探針和服務在監控儀表板上是同一個東西。狀態欄顯示 unhealthy,視線自然移向服務本身——有沒有 crash、有沒有資源耗盡、有沒有依賴項出問題。沿著這條路查下去,什麼都找不到,因為服務確實沒事。

分界點就在那次框架升級。升級前,curl 打那個端點拿到 200,探針正常。升級後,同一條指令拿到 401,探針失敗。服務本身的行為沒有改變,改變的是那個端點的存取規則。healthcheck 的目標端點沒有跟著升級一起更新,這個不同步在第一時間沒有任何顯式警告。

確認方式直接:手動跑那條 curl 指令,看 exit code 和 HTTP 狀態碼,再跑一次不需要認證的根路徑,對比兩個結果。如果服務真的壞了,兩條路徑都會失敗;如果只有原本的探針路徑失敗,問題在探針的目標,不在服務。

探針測到的,不一定是服務的狀態

修法很簡單,把探針改指到不需要認證的根路徑,重啟容器,FailingStreak 歸零,面板恢復綠色。那個服務早就一直健在。

這件事留下一個值得記著的節點:監控報告的是「探針測量到的狀態」,不是「服務本身的狀態」。這兩件事看起來是同一件,但在框架升級、端點行為靜默改變的那一刻,它們會悄悄分岔。下次框架有 breaking change,值得主動檢查 healthcheck 的目標端點在新版本裡的存取規則是否還一致——不等告警燈亮,主動去敲那扇門。

Code 對照:修法前後

修法前(探針目標為需要認證的管理端點)

# Dockerfile HEALTHCHECK(4.x 時期寫死)
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
  CMD curl -sf http://localhost:8080/management/health || exit 1
  # ↑ framework 升至 5.x 後,/management/health 需要 Bearer token
  #   curl 收到 401,-sf 靜默退出,exit code: 1 → 監控判定 UNHEALTHY

修法後(改指不需要認證的根路徑)

# Dockerfile HEALTHCHECK(修正版)
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
  CMD curl -sf http://localhost:8080/health || exit 1
  # ↑ /health 不要求認證,服務存活時回 200,exit code: 0 → HEALTHY
  # 驗證指令:curl -sv http://localhost:8080/health 2>&1 | grep "HTTP/"

該被隔離的側效應類型

  • 探針目標端點版本審查:升版時明確檢查探針指向路徑是否仍符合新版存取規則,不能依賴升級說明文件的完整性。
  • 全域認證中介層範疇:新增全域 auth middleware 或調整認證規則時,應排查是否覆蓋到 healthcheck / readiness / liveness probe 路徑。
  • 監控告警閾值觸發邏輯:FailingStreak 超過閾值應同步觸發「服務直連驗證」,不能只依賴探針狀態,避免誤報驅動不必要的應急回應。
  • 容器重建後探針驗證:容器重建後應有自動探針路徑可達性驗證步驟,確認新版本的存取規則與 HEALTHCHECK 指令一致。
  • 靜默失敗的日誌缺失curl -sf 對 4xx/5xx 不輸出任何內容;關鍵路徑探針應改用 --write-out "%{http_code}",讓失敗有可讀的診斷訊號。
  • 多環境配置漂移:dev/staging/prod HEALTHCHECK 設定若不一致,問題只在生產環境的特定版本才觸發,增加排查成本。
  • CI/CD 部署煙霧測試:部署流程的 smoke test 應包含對 healthcheck 路徑的實際 HTTP 驗證,而非僅確認容器啟動狀態。
  • 依賴服務的健康傳遞:若服務本身的 /health 端點依賴下游服務(DB、cache),應區分「基礎存活探針」與「依賴健康探針」,避免下游抖動誤殺上游的健康判定。

判斷標準:如果這段設定在某個版本邊界靜默改變後,會讓探針狀態與服務狀態出現落差,它就需要顯式的版本升級審查清單。

Code Diff: Before and After

Before (probe targeting an authenticated management endpoint)

# Dockerfile HEALTHCHECK (written during 4.x era)
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
  CMD curl -sf http://localhost:8080/management/health || exit 1
  # ↑ After framework 5.x, /management/health requires Bearer token
  #   curl receives 401, -sf exits silently, exit code: 1 → UNHEALTHY

After (redirected to unauthenticated root path)

# Dockerfile HEALTHCHECK (fixed)
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
  CMD curl -sf http://localhost:8080/health || exit 1
  # ↑ /health requires no authentication; 200 on alive → exit code: 0 → HEALTHY
  # Verify: curl -sv http://localhost:8080/health 2>&1 | grep "HTTP/"

Side Effects That Should Be Isolated

  • Probe target endpoint versioning: Explicitly audit whether the probe’s target path still satisfies the new version’s access rules on every major/minor upgrade — don’t rely on release notes being exhaustive.
  • Global auth middleware scope: When adding a global auth middleware or tightening access rules, audit whether the change reaches healthcheck, readiness, and liveness probe paths.
  • Alert threshold trigger logic: When FailingStreak exceeds a threshold, trigger a direct service verification step alongside the alert — not just a probe status read — to catch probe-vs-service divergence before escalating.
  • Post-rebuild probe verification: After a container rebuild, run an automatic probe path reachability check to confirm the new version’s access rules align with the HEALTHCHECK instruction.
  • Silent failures without log output: curl -sf produces no output on 4xx/5xx. Add --write-out "%{http_code}" to critical probe paths to generate a readable diagnostic signal on failure.
  • Multi-environment config drift: Inconsistent HEALTHCHECK configurations across dev/staging/prod mean issues only surface in production at a specific version boundary, compounding debugging cost.
  • CI/CD deployment smoke tests: Post-deploy smoke tests should include a live HTTP check of the healthcheck endpoint path — not just container start state — to validate probe reachability after every deploy.
  • Downstream health propagation: If the /health endpoint depends on downstream services (DB, cache), separate “liveness probes” from “dependency health probes” to prevent downstream jitter from misclassifying upstream health.

Rule of thumb: if a configuration can silently diverge across a version boundary such that probe state and service state no longer agree, it belongs on an explicit version-upgrade audit checklist.

— 邱柏宇

延伸閱讀


Four Days of Red Alerts, Zero Service Failures

Eleven Thousand Failures, One Functioning Service

The health probe kept firing failures. FailingStreak climbed into the thousands. The monitoring dashboard stayed red for four straight days. The obvious read: something is broken. The actual situation: the service was handling every request normally — cron jobs ran, hooks fired, external API responses were clean throughout.

Think of a Gogoro app flashing an orange battery warning for four days while the bike runs perfectly every ride. A firmware update had scrambled the sensor’s return format; the battery itself was fine. The mechanic at the corner shop glances at your phone and says, without looking up: “Nothing wrong with it.” The fault is in the reading layer, not what’s being read.

Technical Environment

Containerized service deployed via Docker, with a HEALTHCHECK instruction running curl -sf against a specific management endpoint on a fixed schedule; exit code 0 signals healthy, any non-zero exit signals unhealthy. The underlying application is Node.js; endpoint access rules are controlled by the framework version. The service has no awareness of the probe — their connection exists only in the monitoring system’s interpretation layer. The failure pattern is framework-agnostic: any deployment workflow where the probe’s target endpoint and the framework’s access rules drift out of sync will reproduce the same behavior.

The Silent Change Inside a Version Bump

The root cause wasn’t complicated, but finding it required suppressing the wrong instinct first. When the framework moved from 4.x to 5.x, a management endpoint quietly gained an authentication requirement — no token, always a 401. The healthcheck command had been written back in the 4.x era, hardcoded to hit that exact path. curl encountering a 4xx exits with a non-zero code. The monitoring system reads that signal as “service unhealthy.”

Every entry in Health.Log showed ExitCode: 1, Output: empty. That empty output is the tell. curl -sf produces no output on 4xx — it just exits silently. The monitoring system registered “failed” with nothing to indicate which layer had failed.

Error Propagation Sequence

Monitoring System      curl Probe          Application Server
      |                     |                      |
      |── trigger probe ───>|                      |
      |                     |── GET /mgmt/health ──>|
      |                     |                      |── auth check
      |                     |                      |   no token ← added in 5.x
      |                     |<── 401 Unauthorized ─|
      |                     | exit code: 1          |
      |<── ExitCode: 1 ──────|                      |
      |    [UNHEALTHY] ✗     |                      |

Actual service state:  cron running ✓ / hooks firing ✓ / API responding normally ✓
Probe-reported state:  UNHEALTHY ✗  (four straight days)

The fault sits between the probe’s hardcoded target and the access rule the new framework version silently applied to that path.

Why the Misdiagnosis Sticks

The probe and the service share the same status indicator on the dashboard. Seeing “unhealthy” pulls attention toward the service: memory exhaustion, crash, dependency failure. That path leads nowhere, because the service is fine. The divergence happened at the version upgrade. Before: curl hits the endpoint, gets 200, probe passes. After: same command, gets 401, probe fails. The service behavior didn’t change — the access rule on that endpoint did. The healthcheck target wasn’t updated alongside the upgrade, and nothing announced that mismatch explicitly.

The verification step is direct: run the curl command manually, check the exit code and HTTP status, then run the same against an unauthenticated root path. If the service is actually broken, both paths fail. If only the original probe target fails, the problem is the probe’s destination, not the service behind it.

What the Probe Measures Isn’t Always the Service

The fix was a single-line change — redirect the probe to an unauthenticated root path, restart the container, FailingStreak resets to zero, dashboard goes green. The service had been healthy the entire time.

The thing worth keeping from this: a monitoring system reports the state the probe measured, not the state of the service itself. Most of the time those two things are identical. At the moment a framework upgrade silently changes endpoint access rules, they quietly diverge. Next time a dependency ships a breaking change, check whether the healthcheck target’s access rules still match — before the alert fires, not after.

— 邱柏宇

Related Posts