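The automated billing system went quiet all at once on a Tuesday morning. Every workflow showed green and reported a successful run, yet the backend had not received a single request. The logs were as clean as a fresh install, without a single error message.
It was like a convenience-store restocker finding the shelves still empty while the delivery slip says “delivered”: the goods did ship, they just went to the old store address one street over.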
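Green Lights and Thin Air
The eeriest part was how peaceful the monitoring looked. The HTTP requests went out and came back with status code 200, with no timeouts and no connection refused. But the access log on the receiving end was blank. It was as if the requests had been sent into a parallel universe.
Troubleshooting started at the application layer. We checked the API endpoint: permissions were fine. We checked the request payload: the format was correct. We checked the middleware: the logic had not changed. A whole morning went by, and every module had proven that it was not the broken one.
Until someone remarked: “The containers restarted last night.”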
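Addresses Expire
After the container was renamed and brought back up, its IP address changed with it. But the target IP hardcoded in the workflow was still the old set of numbers. The requests really were sent, just to an address where the backend no longer lived. Nothing raised an error, because the network layer has no idea who you originally meant to reach.
The root cause was a misjudgment about container networking. A container IP is not a fixed asset; it is more like a temporary house number that may change on every rebuild, restart, or redeployment. If your system assumes addresses never change, you are managing containers the way you would manage physical machines.
The fix runs against intuition: do not trust container IPs. Bind to the host’s fixed IP instead, or use a service name and let DNS resolve it automatically. The container orchestration tool handles routing changes underneath, but only if you do not bypass it by specifying an IP directly.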
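Assumptions and Reality
What problems like this have in common is that an assumption’s warranty has run out. The assumption “container IPs don’t change” may hold in a development environment, because you start things by hand once and use them until shutdown. In production, though, autoscaling, failed health checks, and scheduled restarts can all replace a container and change its IP.
Worse, this kind of problem does not blow up right away. The system can run normally for weeks until some routine maintenance happens to trigger a container rebuild. By then you have forgotten why that IP was hardcoded in the first place, or even that the code exists.
Looking back at the monitoring dashboard after the fix, those green lights were still on. They were not lying; they were answering a different question from the one you actually care about. The requests were sent, but not to the right place. Between success and effectiveness, there is sometimes a street’s worth of distance.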
— 邱柏宇
Further Reading
When Your Automation Still Knocks on the Old Door
The automated billing system went silent on a Tuesday morning. Every workflow showed green. Execution successful. But the backend received nothing. Logs were spotless, not a single error message.
It’s like a courier marking a restocking order as “delivered” after dropping it at the store’s old address one street over. The goods left the warehouse. They just never reached the shelves.
Green Lights and Empty Queues
The strangest part was how calm the monitoring dashboard looked. HTTP requests went out. Status code 200. No timeouts, no connection refused. But the receiving end’s access log was blank. Requests vanished into a parallel dimension.
Investigation started at the application layer. API endpoints checked out. Permissions unchanged. Request payloads valid. Middleware logic intact. An entire morning passed proving that nothing was broken.
Then someone mentioned: “The containers restarted last night.”
Addresses Expire
When the container was renamed and brought back up, its IP address changed. But the hardcoded target IP in the workflow configuration still pointed to the old numbers. Requests were sent successfully, just to an address where the backend no longer lived. No errors, because the network layer didn’t know who you were trying to reach.
The root issue was a misunderstanding of container networking. Container IPs aren’t fixed assets. They’re temporary door numbers. Every rebuild, restart, or redeployment can assign a new one. If your system assumes addresses stay constant, you’re managing containers like physical servers.
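One way to find where this assumption is hiding is to scan workflow target URLs for raw IP literals. The following is a minimal sketch, not the tooling used in this incident: the function names, the `billing-backend` hostname, and the `172.18.0.5` address are all hypothetical, and it assumes targets are stored as full URLs with a scheme.

```python
import ipaddress
from urllib.parse import urlsplit


def is_ip_literal(url: str) -> bool:
    """Return True if the URL's host is a raw IPv4/IPv6 address rather than a name."""
    host = urlsplit(url).hostname  # None when the URL has no scheme or host part
    if host is None:
        return False
    try:
        ipaddress.ip_address(host)
    except ValueError:
        # Not parseable as an IP address, so it is a hostname or service name.
        return False
    return True


def hardcoded_targets(urls: list[str]) -> list[str]:
    """Return the URLs that point at an IP literal, preserving input order."""
    return [u for u in urls if is_ip_literal(u)]


if __name__ == "__main__":
    # Hypothetical workflow targets, for illustration only.
    targets = [
        "http://172.18.0.5:8080/invoices",
        "http://billing-backend:8080/invoices",
    ]
    for url in hardcoded_targets(targets):
        print("hardcoded IP target:", url)
```

A check like this only flags candidates. Some IP targets are deliberate, such as a host machine's fixed address, so the output is a list to review rather than a list of bugs.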
The fix was counterintuitive: stop trusting container IPs. Bind to the host machine’s fixed IP instead, or use service names and let DNS handle resolution. Container orchestration tools manage routing changes automatically—but only if you don’t bypass them by hardcoding IPs.
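As a rough illustration of the service-name approach, the sketch below builds the target URL from a name instead of an IP and resolves that name at call time. The environment variable names `BILLING_BACKEND_HOST` and `BILLING_BACKEND_PORT`, the default name `billing-backend`, and port `8080` are assumptions made for the example, not details from the incident.

```python
import os
import socket
from urllib.parse import urlunsplit


def backend_url(path: str = "/") -> str:
    """Build the target URL from a service name (hypothetical env vars), never an IP."""
    host = os.environ.get("BILLING_BACKEND_HOST", "billing-backend")
    port = int(os.environ.get("BILLING_BACKEND_PORT", "8080"))
    return urlunsplit(("http", f"{host}:{port}", path, "", ""))


def resolve_service(host: str, port: int) -> list[str]:
    """Resolve the name now and return the unique IPs it currently maps to."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    addresses: list[str] = []
    for *_, sockaddr in infos:
        ip = sockaddr[0]  # first element of sockaddr is the address string
        if ip not in addresses:
            addresses.append(ip)
    return addresses


if __name__ == "__main__":
    print(backend_url("/invoices"))
    # Resolving at request time means a container that comes back with a new IP
    # is picked up on the next lookup, instead of being baked into configuration.
    print(resolve_service("localhost", 8080))
```

The point of the design is that the workflow stores only the name. Whatever address the orchestrator assigns after a restart is looked up when the request is made, so there is no stored number to go stale.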
When Assumptions Expire
This class of problem shares a common trait: expired assumptions. “Container IPs won’t change” might hold in development, where you manually start things once and leave them running. But in production, autoscaling, failed health checks, and scheduled restarts can all replace a container and hand it a new IP.
Worse, these issues don’t explode immediately. The system can run fine for weeks until routine maintenance happens to trigger a container rebuild. By then, you’ve forgotten why that IP was hardcoded—or that the code even exists.
After fixing it, I looked back at the monitoring dashboard. Those green lights were still glowing. They hadn’t lied. They just answered a different question than the one that mattered. Requests were sent. They just didn’t reach the right place. Sometimes success and effectiveness are separated by a city block.
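One way to make the green light answer the question that matters is to check who replied, not only that someone replied. The sketch below assumes the backend identifies itself in a response header. The header name `X-Service-Name` and the service name `billing-backend` are hypothetical, and nothing in the original setup is known to have provided them.

```python
from typing import Mapping


def delivered_to(expected_service: str, status: int, headers: Mapping[str, str]) -> bool:
    """Return True only for a 2xx reply that also names the expected service."""
    if not 200 <= status < 300:
        return False
    # HTTP header names are case-insensitive, so normalize before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    # "x-service-name" is an assumed identity header, not a standard one.
    return lowered.get("x-service-name") == expected_service


if __name__ == "__main__":
    # A 200 from the wrong responder is reported as not delivered.
    print(delivered_to("billing-backend", 200, {"X-Service-Name": "billing-backend"}))
    print(delivered_to("billing-backend", 200, {"Server": "something-else"}))
```

With a check like this in the workflow, a 200 from whatever answers at a stale address would turn the step red instead of green.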
— 邱柏宇