The Flag Arrived After the Crash

The Symptom

PID alive. Port listening. Health check returning 200. From the outside, everything looked fine. Yet every real request came back as HTTP 000 (the placeholder status that tools such as curl print when no HTTP response arrives at all), and the hooks were silent. The logs showed what was really going on: the gateway had been restarting every 30 seconds, with over 1,000 “starting…” lines accumulated.

The cause was a bonjour mDNS plugin. A name collision had appeared on the LAN, so the plugin’s startup probe failed with CIAO PROBING CANCELLED. Nothing caught that rejection, and the unhandled promise rejection took down the entire gateway. launchd’s KeepAlive brought it back up, it crashed again, and the loop continued.

Why It Was Easy to Miss

The boot loop was well disguised. Each cycle, the gateway reached a ready state in roughly 9.4 seconds, so the log read like a normal startup. The crash came 28 seconds in, during plugin initialization rather than request handling, so the port kept listening and the health check kept passing. Checking only that the process was alive and the port was open revealed nothing.
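Since liveness and port checks could not see the loop, counting startup lines in the log is one way to catch it. Below is a minimal sketch of that idea. The `LogEntry` shape, the function name, and the thresholds are all assumptions made for illustration; the only detail taken from the incident is that each restart logs a line containing “starting”.

```typescript
// Hypothetical log entry shape: a millisecond timestamp plus the raw message.
interface LogEntry {
  timestampMs: number;
  message: string;
}

// Returns true when at least `threshold` startup lines fall inside some
// window of `windowMs` milliseconds.
function looksLikeBootLoop(
  entries: LogEntry[],
  windowMs: number,
  threshold: number,
  startMarker: string = "starting",
): boolean {
  const starts = entries
    .filter((e) => e.message.includes(startMarker))
    .map((e) => e.timestampMs)
    .sort((a, b) => a - b);

  // Sliding window over the sorted startup timestamps.
  const recent: number[] = [];
  for (const t of starts) {
    recent.push(t);
    while (true) {
      const oldest = recent[0];
      if (oldest === undefined || t - oldest <= windowMs) break;
      recent.shift();
    }
    if (recent.length >= threshold) return true;
  }
  return false;
}
```

With a restart every 30 seconds, a rule like "five starts within five minutes" would have fired within the first few minutes, long before the count passed 1,000.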

The first instinct was to set an environment variable to disable the feature. The flag existed in the documentation. The reasoning was sound. After restart, the crash continued.

Where the Timing Breaks

A convenience-store analogy fits: the automated ordering systems in Taiwan’s convenience stores dispatch the restocking trucks at midnight based on inventory data. Telling the store manager at 9 a.m. that you don’t want the delivery isn’t wrong; it’s just nine hours too late. The trucks have already left.

Reading the source code clarified the mechanism. The environment variable controlled a runtime feature flag, read at the application logic layer. But the bonjour plugin’s module initialization happened earlier: the module loader ran the plugin, which completed its probe and crashed, before app code had any opportunity to read the flag. However quickly the variable was set, the flag arrived after the damage.
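The ordering problem can be reproduced in a few lines. This is a simplified, synchronous sketch, not the framework’s actual loader: `initMdnsPlugin`, `startApp`, and the variable name `DISABLE_MDNS` are hypothetical stand-ins, and a plain `throw` stands in for the unhandled rejection.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical plugin: its init work runs as soon as the loader touches it
// and never consults any flag.
function initMdnsPlugin(events: string[]): void {
  events.push("plugin:init");
  throw new Error("CIAO PROBING CANCELLED"); // stand-in for the real failure
}

// Hypothetical app code: the first place the environment variable is read.
function startApp(env: Env, events: string[]): void {
  events.push("app:read-flag");
  if (env["DISABLE_MDNS"] === "1") events.push("app:mdns-disabled");
}

// Runs the two phases in the order described above and records what happened.
function simulateBoot(env: Env): string[] {
  const events: string[] = [];
  try {
    initMdnsPlugin(events); // module-loading phase
    startApp(env, events);  // runtime phase, never reached
  } catch {
    events.push("process:crashed");
  }
  return events;
}
```

Calling `simulateBoot({ DISABLE_MDNS: "1" })` yields `["plugin:init", "process:crashed"]`. The flag is set correctly, but the code that would read it never runs.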

The actual fix was a config-level disable: removing the plugin entirely from the framework’s configuration file, so the module loader never loaded that code at all. With the plugin removed, the remaining six plugins all ran stably.
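Why removal works where the flag did not can be sketched with a toy loader. This assumes a loader that only initializes plugins named in its configuration; `loadPlugins`, the registry, and the plugin names are illustrative, not the framework’s real API or config format.

```typescript
type PluginInit = () => void;

// Initializes only the plugins named in `configured`. Anything absent from
// the list is never looked up, so none of its code gets a chance to run.
function loadPlugins(
  configured: string[],
  registry: Map<string, PluginInit>,
): string[] {
  const loaded: string[] = [];
  for (const name of configured) {
    const init = registry.get(name);
    if (init === undefined) throw new Error(`unknown plugin: ${name}`);
    init();
    loaded.push(name);
  }
  return loaded;
}

// Hypothetical registry: one plugin that crashes on init, one that is fine.
const registry = new Map<string, PluginInit>([
  ["bonjour", () => { throw new Error("CIAO PROBING CANCELLED"); }],
  ["hooks", () => {}],
]);
```

Here `loadPlugins(["hooks"], registry)` returns `["hooks"]`, while `loadPlugins(["bonjour", "hooks"], registry)` throws before `hooks` is ever reached. The decision is made before any plugin code executes, which is exactly the point a runtime flag could not reach.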

One Check Worth Keeping

To tell a runtime feature flag from module loading quickly: after changing the environment variable, add a log line or a throw at the very first line of the plugin’s initialization. If that line executes before the flag is ever read, the runtime flag cannot reach this code. There is no need to guess the execution order; the program will show you directly.
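If the check uses log lines, reading the result can be automated. The sketch below assumes you added two distinct marker strings yourself, one at the top of the plugin’s initialization and one where the flag is read, and that they land on separate log lines; the marker text and the function name are made up for this example.

```typescript
type Verdict = "init-first" | "flag-first" | "init-only" | "flag-only" | "neither";

// Reports which of the two markers shows up first in the log lines.
function whichRanFirst(
  lines: string[],
  initMarker: string,
  flagMarker: string,
): Verdict {
  const initAt = lines.findIndex((l) => l.includes(initMarker));
  const flagAt = lines.findIndex((l) => l.includes(flagMarker));
  if (initAt === -1 && flagAt === -1) return "neither";
  if (flagAt === -1) return "init-only";
  if (initAt === -1) return "flag-only";
  return initAt < flagAt ? "init-first" : "flag-first";
}
```

A verdict of `"init-first"` or `"init-only"` means the plugin ran before the flag was consulted, or the flag was never consulted at all, so a runtime flag cannot be the fix.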

Next Time

When a flag has no effect, the first question isn’t whether the flag was passed in correctly. It’s whether the moment the flag gets read is before or after the moment the problem fires. If it’s after, the flag never arrives in time.

— 邱柏宇
