It Ran for Months But Was Never Really Alive

The automatic doors at Taiwan’s convenience stores never need a staff member to trigger them by hand: the circuit is wired directly into the building’s main power, so when electricity returns after an outage the doors simply start working again. A reverse proxy started by hand is more like a fan plugged into an extension cord: when the power comes back, no one is there to press the button.

curl returns 000

The HTTPS dashboard became unreachable. curl reported an HTTP code of 000, which means no HTTP response ever came back: the TLS handshake never completed. The machine itself was up, the gateway was running, and the Caddy process showed in the process list. The first instinct was to suspect the network layer, but a full pass there found nothing wrong, and the application layer looked fine too.
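A 000 on its own does not say which layer failed, but curl's exit status usually does. The sketch below is one way to surface both; the URL in the usage comment is a hypothetical placeholder, and the mapping covers only a handful of the exit codes documented in the curl man page.

```shell
# Translate a curl exit status into the layer that most likely failed.
# Only a few documented exit codes are covered here.
explain_curl_exit() {
  case "$1" in
    0)  echo "HTTP response received" ;;
    6)  echo "DNS: could not resolve host" ;;
    7)  echo "TCP: could not connect" ;;
    28) echo "timeout" ;;
    35) echo "TLS: handshake failed" ;;
    60) echo "TLS: certificate could not be verified" ;;
    *)  echo "other curl error" ;;
  esac
}

# Probe a URL and print the HTTP code together with curl's exit status.
probe() {
  rc=0
  code=$(curl -sS -o /dev/null --max-time 10 -w '%{http_code}' "$1" 2>/dev/null) || rc=$?
  echo "http_code=${code:-000} curl_exit=$rc ($(explain_curl_exit "$rc"))"
}

# Usage (hypothetical hostname):
#   probe https://dashboard.example.ts.net/
```

An exit status of 7 or 35 alongside 000 points at the listener or the TLS setup on the target machine, which is a cheaper lead than auditing firewalls first.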

Only then did it click: the machine had just lost power and rebooted.

The reverse proxy had been started manually several weeks earlier and left alone. It ran cleanly, no alerts, dashboard green the whole time. The problem was that it had never been added to the boot sequence. Every previous “normal” state relied entirely on no one pulling the plug. This time, the plug got pulled. When power returned, nothing told the service to start.

The root cause wasn’t buried

TLS had its own side story. The architecture was a Caddy reverse proxy plus a Tailscale certificate proxy, and TLS used certificate files issued by tailscale cert. Those files are valid for about 90 days, and nothing renewed them automatically. After the service died, Caddy kept attempting handshakes with the old certificate, the handshakes failed, and curl got 000.
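A renewal check for those certificate files could look roughly like the sketch below. It is not the setup described in this post: the file path, the domain, and the 14-day threshold are assumptions, and it relies on tailscale cert writing the .crt and .key files into the current directory under the domain's name.

```shell
# Re-issue a tailscale cert file when it expires within N days.
# Assumes cert_file lives in the directory where `tailscale cert`
# should write <domain>.crt and <domain>.key (hypothetical layout).
renew_if_expiring() {
  cert_file=$1
  domain=$2
  days=${3:-14}
  # -checkend exits 0 if the cert is still valid N seconds from now.
  if openssl x509 -in "$cert_file" -noout -checkend $((days * 86400)) >/dev/null 2>&1; then
    echo "ok: $cert_file valid for more than $days days"
  else
    echo "renewing: $domain"
    (cd "$(dirname "$cert_file")" && tailscale cert "$domain")
  fi
}

# Usage (hypothetical path and domain), e.g. from a daily timer or cron job:
#   renew_if_expiring /etc/caddy/certs/host.example.ts.net.crt host.example.ts.net
```

Caddy still has to be reloaded or restarted after a renewal before it serves the new files.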

Two failures, one inflection point: the reboot. There was no systemd unit file, no WantedBy=multi-user.target, and no Restart=always. The power outage did not create the weakness; it only made it visible for the first time.
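For reference, a unit carrying those two directives can be quite small. The sketch below writes one to disk; the unit name caddy-proxy.service, the binary path, and the Caddyfile location are assumptions for illustration, not the configuration used on this machine.

```shell
# Write a minimal systemd unit for a hand-started Caddy.
# Unit name and all paths are hypothetical; adjust to the real install.
write_caddy_unit() {
  unit_path=${1:-/etc/systemd/system/caddy-proxy.service}
  cat > "$unit_path" <<'EOF'
[Unit]
Description=Caddy reverse proxy (example unit)
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=/usr/bin/caddy run --config /etc/caddy/Caddyfile
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF
}

# Usage (as root):
#   write_caddy_unit
#   systemctl daemon-reload
```

WantedBy=multi-user.target is what systemctl enable acts on to hook the unit into a normal boot, while Restart=always covers crashes between boots.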

The easy misdiagnosis is the network layer: firewall rules, DNS resolution, blocked ports, or a Tailscale node that dropped offline. Ruling all of that out takes time, and the actual problem was never in that layer. The shorter diagnostic path starts with one question: did this machine recently reboot?
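Answering that question takes seconds. The helper below is a small sketch for Linux that reads /proc/uptime; the 24-hour window in the usage comment is an arbitrary assumption.

```shell
# Uptime in whole seconds, read from /proc/uptime (Linux only).
uptime_seconds() {
  cut -d. -f1 /proc/uptime
}

# Succeeds if the machine booted less than N hours ago.
# The second argument overrides the measured uptime (useful for testing).
rebooted_within_hours() {
  hours=$1
  up=${2:-$(uptime_seconds)}
  [ "$up" -lt $((hours * 3600)) ]
}

# Usage:
#   rebooted_within_hours 24 && echo "rebooted in the last day: check boot-time services first"
# Related commands that print the boot time directly:
#   uptime -s
#   who -b
#   journalctl --list-boots
```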

Verification is a single command

To confirm whether a service is wired into the boot sequence:

systemctl is-enabled <service-name>

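If the response is disabled or static, that’s the answer: disabled means nothing starts the unit at boot, and static means the unit has no [Install] section, so it starts only when another unit pulls it in. If no unit file exists at all, the command fails with a not-found error, which points the same way. No log trawling, no packet capture, and no certificate chain investigation are needed. The fix followed directly: write a proper service unit with WantedBy and Restart, run systemctl enable, re-sign the certificate, and restart Caddy. Everything recovered.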
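The check and the follow-up can be wrapped in a few lines. This is a sketch that handles only the most common is-enabled states; the function name is made up for illustration, and <service-name> stays a placeholder as above.

```shell
# Turn the output of `systemctl is-enabled` into a plain verdict.
# Only the common states are handled; anything else needs a closer look.
boot_verdict() {
  case "$1" in
    enabled)            echo "starts at boot" ;;
    static|indirect)    echo "starts only if another unit pulls it in" ;;
    disabled|masked|"") echo "will not start at boot" ;;
    *)                  echo "unclear ($1): inspect the unit" ;;
  esac
}

# Usage:
#   boot_verdict "$(systemctl is-enabled <service-name> 2>/dev/null)"
# Once a unit file with an [Install] section exists (as root):
#   systemctl daemon-reload
#   systemctl enable --now <service-name>
```

The empty-string case covers a missing unit file, where is-enabled prints nothing on stdout once stderr is discarded.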

One thing worth keeping in mind

The same pattern has shown up elsewhere: on another server, MySQL had been installed but never enabled with systemctl enable. After a restart, every backend container went into a crash loop and restarted more than twenty times before anyone noticed. Different symptoms, same inflection point: a reboot is a stress test, and every hand-started service fails it.
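One way to catch the MySQL-style case before the next outage is to list services that are running right now but are not enabled. The sketch below does that with systemctl; it is an illustration rather than a complete audit, since a process launched from a shell outside systemd (like the Caddy instance above) has no service unit and will not appear in the list.

```shell
# List running service units that are not enabled at boot.
# Units that are static, indirect, alias, or generated are skipped
# because other units or generators are responsible for starting them.
running_but_not_enabled() {
  systemctl list-units --type=service --state=running --no-legend --plain |
  while read -r unit _rest; do
    # is-enabled exits non-zero for disabled units, hence the `|| true`.
    state=$(systemctl is-enabled "$unit" 2>/dev/null || true)
    case "$state" in
      enabled|static|indirect|alias|generated) ;;
      *) echo "$unit ${state:-unknown}" ;;
    esac
  done
}

# Usage:
#   running_but_not_enabled
```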

Whenever a service has “been running for a long time without issues,” that’s worth a quick check: is it in the boot sequence? A long uptime doesn’t prove stability. It only proves no one has cut the power yet.

— 邱柏宇
