加了一個保險,服務開始每天死一次

就像同時裝了兩個門鈴,一個正常、一個壞掉會自己亂響。家裡的規則是任何一個響就去開門——壞的那個每天晚上自己響,你就每天白跑。這個比喻不是比喻,是實況。

技術環境

Shell 維運腳本,運行於 Linux 容器化環境(Docker container runtime)。腳本透過 curl 呼叫服務的 health endpoint(HTTP 狀態碼判斷),並以 pgrep 掃描進程名稱,兩者取 OR 邏輯決定是否執行 systemctl restart。服務本體為長駐守護進程(Node.js 或類似 runtime),由容器啟動命令決定進程命名格式。觸發點為 cron job 定時執行此腳本,每次執行均完整跑兩道確認邏輯。問題模式與具體語言無關——任何在 OR 邏輯中接入未驗證感測器的設計,都能複現相同行為。

現象

維運腳本的邏輯很直覺:兩道確認,取 OR。向 health endpoint 打一次請求,再用 pgrep 掃一次進程名稱。任一個回報「沒了」,就重啟服務。

服務開始每天被重啟一次。沒有警報,沒有錯誤日誌,沒有流量異常。Health endpoint 全程回 200,服務從頭到尾活著。但重啟還是發生了。

錯誤傳染鏈(時序)

Cron Job          Ops Script              Container         Service
    |                  |                      |                 |
    |── 觸發腳本 ──────>|                      |                 |
    |                  |── curl /health ──────>|                 |
    |                  |                      |── HTTP 200 ────>|  ← 服務存活 ✓
    |                  |<── 200 OK ───────────|                 |
    |                  |                      |                 |
    |                  |── pgrep 進程名 ───────>|                 |
    |                  |                      |   (空結果)       |  ← 命名格式不符 ✗
    |                  |<── exit 1(空) ──────|                 |
    |                  |                      |                 |
    |                  |  [OR: 任一失敗 → 重啟]                  |
    |                  |── systemctl restart ─────────────────>|  ← 不必要重啟 ✗
    |                  |                      |                 |
實際狀態:服務全程健康 ✓  /  腳本判定:需要重啟 ✗

pgrep 的空結果從第一天就固定回傳;OR 邏輯確保這個靜默的假陽性每次都足以觸發重啟動作。

分界點

加入 pgrep 那一天,是問題的起點。

容器化環境裡,進程的啟動方式和命名格式跟預期的 pattern 不一致。pgrep 每次掃都回空結果,空結果在這段邏輯裡等於「不存在」,等於觸發重啟條件。OR 邏輯確保只要其中一條腿斷了,動作就執行。

那條腿從第一天就是斷的。

容易誤判的原因

第一時間,重啟本身不會引起懷疑。服務重啟是正常維運操作,短時間就恢復,影響不明顯。Health endpoint 一直健康,更容易讓人往「環境不穩」或「記憶體問題」的方向猜。

真正難發現的是:診斷工具本身出了問題,而工具的失效恰好觸發了行動。這和一般的「監控沒偵測到問題」完全反向——監控偵測到了,只是偵測到的是假陽性。系統在回應一個不存在的問題。

OR 邏輯在這裡是放大器。如果是 AND——兩道確認都失敗才重啟——pgrep 的空結果就不會單獨觸發任何事。但 OR 把一個失效的感測器直接升格成了決策者。

確認方式

pgrep 那段單獨跑一次,在容器內手動執行,看回傳值。空結果,結案。

不需要複雜的 debug 流程。問題在腳本裡,確認在腳本裡,修法也在腳本裡:把 pgrep 那條腿拿掉,只信 health endpoint。

移除一個偵測機制,系統反而穩了。

Code 對照:修法前後

修法前(OR 邏輯,pgrep 未在目標環境驗證)

#!/bin/bash
# ops-health-check.sh — 問題版本

HEALTH=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:3000/health)
PROCESS=$(pgrep -x "node server.js")  # ← 問題在這裡:容器內命名格式不符,永遠回空

if [[ "$HEALTH" != "200" ]] || [[ -z "$PROCESS" ]]; then
    # OR 邏輯:任一條件成立就重啟
    # pgrep 永遠空 → 條件永遠成立 → 每天重啟
    echo "Service check failed, restarting..."
    systemctl restart myservice
fi

修法後(移除未驗證的感測器,只信 health endpoint)

#!/bin/bash
# ops-health-check.sh — 修法版本

# --max-time 5:限制請求時間,endpoint 卡住時判定為失敗,而不是讓腳本無限等待
HEALTH=$(curl -s -o /dev/null --max-time 5 -w "%{http_code}" http://localhost:3000/health)

if [[ "$HEALTH" != "200" ]]; then
    echo "Health endpoint returned $HEALTH, restarting..."
    systemctl restart myservice
fi

# pgrep 已移除
# 原因:容器內進程命名格式由啟動命令決定,未驗證前無法保證 pattern 匹配
# 原則:接入 OR 邏輯的每一條腿,必須先在目標環境獨立驗證通過

該被隔離的側效應類型

  • OR 邏輯中的未驗證感測器:任何加入 OR 監控的偵測腳本,在接入前必須先在目標環境獨立跑通,確認回傳值符合預期;未驗證的感測器在 OR 邏輯裡等同於「永遠失敗」。
  • 容器內進程命名假設:pgrep、pidof、ps aux | grep 等工具的匹配結果取決於容器啟動命令和 entrypoint 格式;不可將裸機或 VM 環境的命名 pattern 直接沿用。
  • 靜默的假陽性重啟:服務重啟若沒有配套的 reason log(記錄是哪一個條件觸發),後續 debug 極難還原現場;重啟動作應附加觸發原因欄位。
  • 健康狀態的多源衝突:當多道確認機制回傳不一致的結果,系統需要明確的裁決優先級;不加思索地 OR 合併可能讓最不可靠的來源成為實際的決策者。
  • cron job 的累積靜默失效:周期性腳本的錯誤若不產生任何 alerting output,可能靜默運行數天甚至數週;維運腳本必須有獨立的執行結果日誌與異常通知路徑。
  • AND vs OR 邏輯選擇的副作用:OR 邏輯對假陽性極度敏感(任一失效就觸發),AND 邏輯對假陰性極度敏感(任一成功就忽略);選擇前應明確定義「寧可誤殺還是寧可漏殺」的策略。
  • 診斷工具與被診斷系統的耦合:診斷工具和服務本體應能獨立測試;若診斷工具只能在完整系統啟動後才能驗證,則它對 CI/部署前驗證流程沒有貢獻,且部署後才會暴露問題。

如果這段邏輯失敗不應讓服務被重啟,它就需要先在隔離環境驗證,再接進任何自動化決策鏈。

留給未來的話

OR 邏輯的監控設計,每一條腿都必須先獨立驗證過才能接進去。一個從未在目標環境裡跑通過的檢查,加進去之前等於零,加進去之後等於噪音。

容器化環境裡,pgrep 能不能找到目標進程,取決於容器內的啟動方式和進程命名。這不是 pgrep 的問題,是假設沒驗過的問題。下次在容器環境裡用任何進程名稱工具之前,先確認那個名稱在容器內真的存在。

— 邱柏宇


The Safety Check That Killed the Service Daily

Two doorbells installed at the same time: one working, one broken and ringing on its own. The rule is simple — if either rings, open the door. The broken one rings every night. You walk to the door every night for nothing. That’s not a metaphor. That’s the incident.

Technical Environment

A shell-based ops script running in a Linux containerized environment (Docker container runtime). The script checks service health via curl to an HTTP health endpoint and uses pgrep to scan for the process name — both wired together with OR logic to decide whether to invoke systemctl restart. The service itself is a long-running daemon (Node.js or similar runtime) whose internal process naming is determined by the container’s launch command. The script is executed on a fixed cron schedule; every run evaluates both checks in full. The failure pattern is language-agnostic — any OR-gated monitor wired to an unvalidated sensor reproduces this behavior.

What Was Observed

The ops script's logic was straightforward: two checks, combined with OR. It sent one HTTP request to the health endpoint, then ran one pgrep scan for the process name. If either check reported the service missing, the service got restarted.

The service started restarting once a day. No alerts, no error logs, no traffic anomaly. The health endpoint returned 200 the entire time. The service never actually went down. The restarts happened anyway.

Error Propagation Sequence

Cron Job          Ops Script              Container         Service
    |                  |                      |                 |
    |── trigger ───────>|                      |                 |
    |                  |── curl /health ──────>|                 |
    |                  |                      |── HTTP 200 ────>|  ← service alive ✓
    |                  |<── 200 OK ───────────|                 |
    |                  |                      |                 |
    |                  |── pgrep process_name ─>|                |
    |                  |                      |   (empty result) |  ← naming mismatch ✗
    |                  |<── exit 1 (empty) ───|                 |
    |                  |                      |                 |
    |                  |  [OR: either fails → restart]          |
    |                  |── systemctl restart ─────────────────>|  ← unnecessary restart ✗
    |                  |                      |                 |
Actual state: service healthy throughout ✓  /  Script verdict: needs restart ✗

The pgrep empty result was consistent from day one; OR logic guaranteed this silent false positive was sufficient to fire the restart action on every single run.

Where It Broke

The day pgrep was added is when everything changed.

Inside a containerized environment, the process launch method and naming format didn’t match the expected pattern. Every pgrep scan returned empty. Empty result meant “not running.” “Not running” in OR logic meant restart. The OR gate guaranteed that one broken leg was enough to fire the action.

That leg was broken from day one.

Why It Wasn’t Caught Immediately

A restart on its own doesn’t raise flags. Services restart for legitimate reasons. Recovery is fast. And with the health endpoint always returning 200, the easier guesses were environmental instability or memory pressure — not “the monitoring tool is lying.”

The harder-to-see failure mode is when the diagnostic tool itself is broken, and its failure is what triggers action. This is the inverse of normal monitoring blindness. The monitor fired. It just fired on a false positive. The system was responding to a problem that didn’t exist.

OR logic amplified this. AND logic would have required both checks to fail simultaneously — the empty pgrep result alone would have been ignored. OR promoted a malfunctioning sensor into the role of sole decision-maker.
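The difference between the two policies can be made concrete with a small helper. This is a minimal sketch, not the script from the incident: the function name `should_restart` and the `any`/`all` policy labels are hypothetical, chosen only for illustration.

```shell
#!/bin/bash
# Hypothetical helper: decide whether to restart from a list of check results.
# Usage: should_restart POLICY RESULT...
#   POLICY is "any" (OR: one failed check is enough) or "all" (AND: every check must fail).
#   Each RESULT is "ok" or "fail".
# Returns 0 when a restart is warranted, 1 when it is not, 2 on an unknown policy.
should_restart() {
    local policy=$1
    shift
    local total=$# failed=0 result
    for result in "$@"; do
        if [[ $result == fail ]]; then
            failed=$((failed + 1))
        fi
    done
    case $policy in
        any) (( failed > 0 )) ;;
        all) (( total > 0 && failed == total )) ;;
        *)   echo "unknown policy: $policy" >&2; return 2 ;;
    esac
}

# The incident's inputs: health endpoint ok, pgrep check failing.
if should_restart any ok fail; then echo "any: restart"; else echo "any: leave running"; fi
if should_restart all ok fail; then echo "all: restart"; else echo "all: leave running"; fi
```

With the incident's inputs, the `any` policy restarts and the `all` policy does not, which is the amplification described above. Neither policy is correct by default; the point is that the choice becomes a visible argument rather than an implicit `||`.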

How It Was Confirmed

Run pgrep manually inside the container. Watch it return nothing. That’s the confirmation. No elaborate debug sequence needed.

The fix was equally direct: remove the pgrep leg, trust only the health endpoint. One fewer check, and the system became more stable.
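When running pgrep by hand, it may help to try more than one matching mode. On procps-style pgrep, the pattern is matched against the process name unless -f is given, and -x requires the whole name to match. Under that reading, a pattern that includes an argument (such as "node server.js") cannot match without -f in any environment. Whether that fully explains the empty result in this incident is an assumption; the write-up only records that the result was empty. The sketch below uses `sleep 300` as a stand-in for the real service command line.

```shell
#!/bin/bash
# Sketch: see which pgrep invocations can match a process launched as "sleep 300".
# "sleep 300" is a stand-in for the real service command line (an assumption).

sleep 300 &
pid=$!
sleep 1   # give the child time to exec so its process name is "sleep"

# try_pgrep ARGS...: print "match" if pgrep finds anything, "empty" otherwise.
try_pgrep() {
    if pgrep "$@" > /dev/null; then echo match; else echo empty; fi
}

name_with_args=$(try_pgrep -x "sleep 300")    # exact match against the process name only
name_only=$(try_pgrep -x "sleep")             # exact match against the bare process name
full_cmdline=$(try_pgrep -f -x "sleep 300")   # exact match against the full command line

kill "$pid"

echo "-x 'sleep 300'    -> $name_with_args"
echo "-x 'sleep'        -> $name_only"
echo "-f -x 'sleep 300' -> $full_cmdline"
```

Running the same three variants against the real service pattern inside the container shows quickly whether any form of the check is usable, or whether the process-name leg should stay removed.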

Code Diff: Before and After

Before (OR logic, pgrep never validated in target environment)

#!/bin/bash
# ops-health-check.sh — broken version

HEALTH=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:3000/health)
PROCESS=$(pgrep -x "node server.js")  # ← problem here: naming mismatch inside container, always empty

if [[ "$HEALTH" != "200" ]] || [[ -z "$PROCESS" ]]; then
    # OR logic: either condition fires restart
    # pgrep always empty → condition always true → restart every day
    echo "Service check failed, restarting..."
    systemctl restart myservice
fi

After (unvalidated sensor removed, health endpoint is the sole source of truth)

#!/bin/bash
# ops-health-check.sh — fixed version

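# --max-time 5: bound the request so a hung endpoint fails the check instead of blocking the script
HEALTH=$(curl -s -o /dev/null --max-time 5 -w "%{http_code}" http://localhost:3000/health)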

if [[ "$HEALTH" != "200" ]]; then
    echo "Health endpoint returned $HEALTH, restarting..."
    systemctl restart myservice
fi

# pgrep removed
# Reason: process naming inside a container depends on launch command; pattern cannot be
#         assumed without explicit validation in the target environment
# Rule: every leg wired into OR logic must be independently validated in its target env first

Side Effects That Should Be Isolated

  • Unvalidated sensors in OR logic: Any detection script wired into an OR-logic monitor must first be run on its own in the target environment and shown to return the expected value under normal conditions. An unvalidated sensor in OR logic can behave exactly like one that always fails, as it did here.
  • Container process naming assumptions: Tools like pgrep, pidof, and ps aux | grep match against names determined by the container’s launch command and entrypoint format. Naming patterns from bare-metal or VM environments cannot be assumed to transfer directly.
  • Silent false-positive restarts: Service restarts without a reason log (recording which condition triggered them) make incident reconstruction extremely difficult; restart actions should log the triggering condition explicitly.
  • Multi-source health status conflicts: When multiple health checks return conflicting results, the system needs an explicit resolution priority. Naïvely OR-merging all sources risks promoting the least reliable source into the actual decision-maker.
  • Accumulating silent cron job failures: Periodic scripts that produce no alerting output on error can fail silently for days or weeks. Ops scripts require independent execution result logging and anomaly notification paths.
  • AND vs OR logic side effects: OR logic is extremely sensitive to false positives (any single failure triggers action); AND logic is extremely sensitive to false negatives (any single success suppresses action). The choice should be driven by an explicit “prefer false alarm vs prefer miss” policy.
  • Diagnostic tool coupling to the diagnosed system: Diagnostic tooling should be independently testable. If a diagnostic tool can only be validated after the full system is running, it contributes nothing to CI or pre-deployment verification — and will only surface problems post-deployment.

If a failure of the check itself should not restart the service, the check needs to be validated in isolation before it is wired into any automated decision chain.
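The reason-log point from the list above can be sketched as follows. The function names (`failed_checks`, `reason_line`) and the `name=status` argument format are hypothetical; the idea is only that every decision leaves a line naming the checks that triggered it.

```shell
#!/bin/bash
# Hypothetical helpers: record which checks triggered a restart decision.
# Each argument is NAME=STATUS, where STATUS is "ok" or "fail".

# failed_checks NAME=STATUS...: print the names of failing checks, comma-separated.
failed_checks() {
    local pair
    local failed=()
    for pair in "$@"; do
        if [[ ${pair#*=} == fail ]]; then
            failed+=("${pair%%=*}")
        fi
    done
    local IFS=,
    echo "${failed[*]}"
}

# reason_line NAME=STATUS...: print one log line describing the decision.
reason_line() {
    local failed
    failed=$(failed_checks "$@")
    if [[ -n $failed ]]; then
        echo "action=restart triggered_by=$failed"
    else
        echo "action=none"
    fi
}

# The incident's inputs would have produced a line pointing straight at pgrep.
reason_line health=ok pgrep=fail
```

Appending that line, with a timestamp, to a log file before calling `systemctl restart` would have shown on day one that every restart was triggered by the process-name check alone.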

Worth Noting Next Time

Any check wired into an OR-logic monitor needs to be validated independently in the target environment before it’s connected. A check that has never successfully matched anything in its environment is noise from the moment it’s added — and in OR logic, noise fires actions.
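One way to make "validated independently" concrete is a small pre-flight gate that runs the candidate check several times while the service is known to be healthy. This is a sketch under that assumption; `validate_check` is a hypothetical name, and the number of runs is arbitrary.

```shell
#!/bin/bash
# Hypothetical pre-flight gate for a candidate check.
# Usage: validate_check RUNS COMMAND [ARGS...]
# Runs COMMAND that many times, prints how many runs passed, and returns 0
# only if every run passed. Meant to be run while the service is known healthy.
validate_check() {
    local runs=$1
    shift
    local passed=0 i
    for (( i = 0; i < runs; i++ )); do
        if "$@" > /dev/null 2>&1; then
            passed=$((passed + 1))
        fi
    done
    echo "passed=$passed/$runs"
    (( passed == runs ))
}

# Stand-in commands: "true" plays a check that works, "false" one that never matches.
validate_check 3 true  && echo "safe to wire in"
validate_check 3 false || echo "do not wire in"
```

Inside the container, the call for the check from this incident would look like `validate_check 5 pgrep -x "node server.js"`; a result of `passed=0/5` against a healthy service is the signal that the check is not ready for an OR gate.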

In containerized environments specifically, whether pgrep can find a target process depends entirely on how that process is launched and named inside the container. That’s not a pgrep limitation — it’s an unverified assumption. Before using any process-name-based tool inside a container, confirm the name actually exists in that context first.
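To confirm what name actually exists, the kernel's own view of a process can be read from `/proc` on Linux. The sketch below assumes a Linux container with `/proc` mounted; `describe_pid` is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: print the two strings that process-name tools match against.
# Usage: describe_pid PID
#   comm    is the short process name
#   cmdline is the full command line (its NUL separators are shown as spaces)
describe_pid() {
    local pid=$1
    printf 'comm=%s\n' "$(cat "/proc/$pid/comm")"
    printf 'cmdline=%s\n' "$(tr '\0' ' ' < "/proc/$pid/cmdline")"
}

# Example: inspect the current shell. Inside a container, PID 1 is often the
# service or its wrapper, so "describe_pid 1" is usually the first thing to try.
describe_pid "$$"
```

Comparing this output with the pattern in the monitoring script shows whether the name being searched for exists in that container at all.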

— 邱柏宇
