Three Names on the Backup List, No Phone Numbers
The agent kept timing out. The error message landed directly in the chat interface: “The model did not produce a response before the model idle timeout.” The config clearly listed fallbacks — two backup provider names, right there in the file, looking solid.
It was like pressing a service option on a convenience store kiosk and getting “This service is currently unavailable.” The menu entry exists. The interface looks complete. Nothing is actually connected behind it.
Where the failure lives
The fallbacks field accepts any string. It does not validate whether the named provider actually exists inside the models.providers block. The config listed two provider names as fallbacks, but the providers block defined only the primary model’s provider and nothing else. When the primary timed out, the runtime tried the fallbacks in order: the first had no API key, so it failed; the second failed the same way. With every entry exhausted, the error was thrown straight back to the user.
The fallback chain was an empty shell. It had been running that way for months without a single warning sign.
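The failure sequence above can be reproduced with a small toy model. This is not the runtime's actual code. `ToyRuntime`, `ProviderError` and `run_with_fallbacks` are hypothetical names, and the model references are assumed to be written as "provider/model". The model shows one thing: an undefined fallback is accepted as a plain string and only fails when the runtime actually tries to call it.

```python
from dataclasses import dataclass, field


class ProviderError(Exception):
    """A provider could not serve the request (toy model, hypothetical)."""


@dataclass
class ToyRuntime:
    # Hypothetical stand-in for a real provider table:
    # provider name -> API key. Only defined providers can be called.
    providers: dict[str, str] = field(default_factory=dict)
    # Providers we pretend are hitting the idle timeout.
    timing_out: set[str] = field(default_factory=set)

    def call(self, model_ref: str, prompt: str) -> str:
        # Assumption: model references look like "provider/model".
        provider = model_ref.split("/", 1)[0]
        if provider not in self.providers:
            # The fallback string was accepted at config time; the gap
            # only shows up here, when the call is actually attempted.
            raise ProviderError(f"{provider}: no API key found")
        if provider in self.timing_out:
            raise ProviderError(f"{provider}: idle timeout")
        return f"{model_ref} answered: {prompt}"


def run_with_fallbacks(rt: ToyRuntime, chain: list[str], prompt: str) -> str:
    """Try the primary, then each fallback in order; raise if all fail."""
    errors: list[str] = []
    for ref in chain:
        try:
            return rt.call(ref, prompt)
        except ProviderError as exc:
            errors.append(str(exc))
    # Every entry failed, so the error goes straight back to the caller.
    raise ProviderError("all models failed: " + "; ".join(errors))


if __name__ == "__main__":
    # Reproduce the incident: the primary times out, both fallbacks are undefined.
    rt = ToyRuntime(providers={"main": "key"}, timing_out={"main"})
    chain = ["main/model-a", "backup-a/model-b", "backup-b/model-c"]
    try:
        print(run_with_fallbacks(rt, chain, "hello"))
    except ProviderError as exc:
        # prints: all models failed: main: idle timeout; backup-a: no API key found; backup-b: no API key found
        print(exc)
```

Adding a "backup-a" entry to `providers` is enough to make the same chain return an answer from the first fallback. That one-line difference is the entire gap between a shell and a working chain.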
Why it’s easy to misread
The config looked normal. The providers block existed, complete with an API key and a models array; it just contained only one provider. The fallback strings were syntactically correct. No lint error, no deployment failure, no startup warning. The file sat there quietly, giving no indication anything was wrong.
The natural instinct is to suspect timeout thresholds, provider instability, or network issues — not that the names on the backup list had no corresponding definitions anywhere in the file. That class of error leaves no visual trace.
How to verify
Read the config and extract the set of defined provider keys. Then check the provider prefix of the primary model, and of every fallback, against that set. Any provider name missing from the set is a shell. A few lines of Python surface it immediately: 🚨 PROVIDER NOT DEFINED.
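The check might look like the following minimal sketch. It makes several assumptions. The config is JSON, and providers live under models.providers, as the post describes. The primary and fallbacks sit at a hypothetical agents.defaults.model path and are written as "provider/model" strings, so adjust both paths to your own layout. The function names are illustrative.

```python
import json
import sys
from typing import Any


def provider_of(model_ref: str) -> str:
    # Assumption: model references look like "provider/model-name",
    # so the provider is everything before the first "/".
    return model_ref.split("/", 1)[0]


def audit_chain(config: dict[str, Any]) -> list[tuple[str, str, bool]]:
    """Return (role, model_ref, provider_defined) for the primary and each fallback."""
    # Set of provider names that actually have a definition.
    defined = set(config.get("models", {}).get("providers", {}))
    # Hypothetical key path for the model chain; adjust to your config.
    model_cfg = config.get("agents", {}).get("defaults", {}).get("model", {})
    chain = [("primary", model_cfg.get("primary", ""))]
    chain += [(f"fallback[{i}]", ref) for i, ref in enumerate(model_cfg.get("fallbacks", []))]
    return [(role, ref, provider_of(ref) in defined) for role, ref in chain]


def main(path: str) -> int:
    with open(path, encoding="utf-8") as f:
        config = json.load(f)
    results = audit_chain(config)
    for role, ref, ok in results:
        status = "OK" if ok else "🚨 PROVIDER NOT DEFINED"
        print(f"{role:<12} {ref:<40} {status}")
    # Non-zero exit code if any entry in the chain is a shell.
    return 0 if all(ok for _, _, ok in results) else 1


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("usage: audit_fallbacks.py CONFIG.json", file=sys.stderr)
    else:
        sys.exit(main(sys.argv[1]))
```

Because it exits non-zero on any undefined provider, a script like this could also run as a deploy-time gate. That would supply the warning signal the config itself never produced.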
The fix is straightforward: add a complete provider definition (baseUrl, apiKey, models array) for every name referenced in the fallback chain. Some providers also require an auth-profile token. Once the definitions are in place, the config hot-reloads and the gateway log shows a reload confirmation. The same task that had been triggering the timeout now completes reliably in 12 seconds.
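A defined but half-filled provider entry is just as hollow as a missing one, so it can help to check for completeness as well as existence. The sketch below treats the three fields named above as required and counts empty values as missing. The function name and the sample provider names are hypothetical. Auth-profile tokens are not modeled, because where they are stored depends on the provider.

```python
from typing import Any

# Assumption: the fields the post lists for a complete definition.
REQUIRED_FIELDS = ("baseUrl", "apiKey", "models")


def missing_fields(providers: dict[str, Any], referenced: list[str]) -> dict[str, list[str]]:
    """Map each referenced provider name to the required fields it lacks."""
    report: dict[str, list[str]] = {}
    for name in referenced:
        entry = providers.get(name)
        if entry is None:
            # Not defined at all: every required field is missing.
            report[name] = list(REQUIRED_FIELDS)
            continue
        # Absent or empty values ("" or []) count as missing.
        gaps = [f for f in REQUIRED_FIELDS if not entry.get(f)]
        if gaps:
            report[name] = gaps
    return report


if __name__ == "__main__":
    # Hypothetical provider names mirroring the incident: only the primary is defined.
    providers = {"primary-co": {"baseUrl": "https://api.example.com", "apiKey": "sk-test", "models": [{"id": "m1"}]}}
    print(missing_fields(providers, ["primary-co", "backup-a", "backup-b"]))
    # prints: {'backup-a': ['baseUrl', 'apiKey', 'models'], 'backup-b': ['baseUrl', 'apiKey', 'models']}
```

An empty report means every name in the chain at least has the fields filled in. It says nothing about whether the key is valid, which is what the drill below is for.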
One thing worth remembering
A fallback chain cannot be validated by reading the config alone. A file that looks complete is not the same as a chain that works. The only real check is to trigger it: force the primary to fail and watch whether the backup actually takes over. That test wasn’t done at first deployment, or on any deployment after that, which is why the shell condition persisted undetected for months. Next time a fallback entry is added, break the primary on purpose and confirm the backup picks up the request.
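One way to break the primary without touching the live file is to generate a separate drill config. In that copy, the primary provider's baseUrl points at an address where nothing is listening. You then load the copy into a staging instance and send a request. This is a sketch under several assumptions:

- the config is JSON;
- the key paths are hypothetical;
- the runtime treats a refused connection as a primary failure that triggers fallback.

If your runtime only falls back on timeouts, swap in an endpoint that accepts connections but never responds. If the primary's provider is itself undefined, the script stops with a `KeyError`, which is a finding in its own right.

```python
import copy
import json
import sys
from typing import Any

# Assumption: nothing listens on port 9 (discard) on localhost,
# so calls to the primary are refused immediately.
BLACKHOLE_URL = "http://127.0.0.1:9"


def make_drill_config(config: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of config whose primary provider points at a dead endpoint."""
    drill = copy.deepcopy(config)  # never mutate the live config
    # Hypothetical key paths; adjust to your own config layout.
    primary = drill["agents"]["defaults"]["model"]["primary"]
    provider = primary.split("/", 1)[0]
    drill["models"]["providers"][provider]["baseUrl"] = BLACKHOLE_URL
    return drill


if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("usage: make_drill.py LIVE_CONFIG.json DRILL_CONFIG.json", file=sys.stderr)
    else:
        with open(sys.argv[1], encoding="utf-8") as f:
            live = json.load(f)
        with open(sys.argv[2], "w", encoding="utf-8") as f:
            json.dump(make_drill_config(live), f, indent=2, ensure_ascii=False)
```

With the drill config loaded, a request should still succeed, and the logs should show a fallback model answering. If the request fails, the chain is a shell, whatever the config looks like.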
— 邱柏宇