系統一直都沒壞,但沒人去問原始來源

系統一直都沒壞,但沒人去問原始來源

快取裡躺著一份兩天前的錯誤

一個廣告管理子系統某天遭遇 API 驗證錯誤,把「認證失敗」的狀態寫進快取檔。接著就沒有再更新。那份狀態靜靜躺在那裡,像超商裡標示機卡住、持續顯示昨天舊讀數的螢幕——新來的工讀生看了說「沒問題」,但那個螢幕的資料早就不對了。

後來,負責分析優化的子系統定期來讀這份快取。看到「認證失敗」,就以這個前提開始工作:寫出診斷報告,提出改善方向,邏輯清晰,條理分明,完整地建立在一個已經不存在的問題上。整個過程沒有任何 HTTP 錯誤,日誌看起來完全正常。兩天後有人直接呼叫平台 API 做驗證,回傳:認證有效、永不過期。廣告管理子系統的認證從來沒有壞過。

分界點不在錯誤本身,在更新沒有發生

這個案例的分界點不是「API 驗證失敗了」。那次失敗可能是網路抖動、逾時、任何瞬間的干擾。真正的問題發生在下一步:子系統把當時的錯誤狀態寫進快取,然後沒有機制去更新它。

快取的設計本意是加速、減少重複呼叫。但加速的前提是假設快取裡的內容和現實是一致的。一旦這個假設無法被驗證,快取就從加速工具變成固定謊言的容器。這兩件事之間沒有任何系統級的報警,因為沒有任何系統知道自己拿到的是舊的。

為什麼第一時間沒看出來

日誌乾淨。HTTP 回應正常。分析子系統盡職地讀資料、寫報告,每一個動作本身都沒有錯。這是這類問題最難被察覺的地方:問題不在任何單一元件的行為,而在元件之間的資料信任關係。

每個子系統都在做自己被設計來做的事。沒有人拋出例外,沒有人觸發警報。只是沒有人回頭問一句:這份資料是不是還是真的?

多代理人架構正在被快速引入生產環境。有案例顯示,一個企業廣告活動的製作成本已被壓縮到原來的千分之一,40 小時完成。在這樣的架構規模下,代理人之間的 stale state 信任傳播問題只會被放大,而不是自動消失。

確認方式只有一個動作

直接呼叫原始來源。不透過快取,不透過任何中間層,就直接打 API,看回傳值。

這個動作花不了幾秒,但它是整個診斷鏈裡唯一能打破「資料自我引用」迴圈的方法。快取讀快取、分析讀快取、報告寫快取,整個系統可以在完全錯誤的前提下自洽地運行很長時間。只有繞過所有中間層、直接問原始來源,才能確認那份資料是不是真的。

下次遇到類似情境值得注意的一件事:當診斷報告的結論聽起來格外合理,值得問一下——這份診斷的資料來自哪裡,上次驗證是什麼時候。合理的分析不保證前提是真的。

— 邱柏宇

延伸閱讀


The Cache Said Broken. The API Said Fine.

A Status That Never Got Updated

An ad management subsystem hit an API authentication error one day. It wrote “auth failed” into a cache file. Then nothing updated it. That status just sat there.

A separate analysis subsystem came along later, read the cache on schedule, saw “auth failed,” and got to work — producing diagnostics, improvement recommendations, all logically sound, all built entirely on a problem that no longer existed. No HTTP errors anywhere. Logs looked clean. Two days later, someone called the platform API directly: authentication valid, never expired. The ad subsystem’s credentials had never been broken.

The Break Point Wasn’t the Error

The API validation failure itself was probably a blip — a timeout, a network hiccup. The actual break point came immediately after: the subsystem wrote that error state into cache, and no mechanism existed to refresh it.

Cache is designed to accelerate by avoiding redundant calls. That only works if the cache stays consistent with reality. Once that assumption can’t be verified, cache stops being a shortcut and becomes a fixed container for stale facts. No system-level alert fires, because no system knows it’s reading old data.

Why It Wasn’t Caught Earlier

Every component behaved exactly as designed. The analysis subsystem read data and wrote reports — both correct behaviors. The problem wasn’t in any individual component. It was in the trust relationship between components: one read what another had written, assumed it was current, and never checked.

Multi-agent architectures are moving fast into production. There are documented cases of enterprise ad campaigns produced at one-thousandth of the original cost in 40 hours. At that scale, stale state propagation between agents doesn’t shrink — it compounds.

One Verification Step

Call the original source directly. Skip the cache, skip the intermediary layers, hit the API and read the response.

That’s the only action that breaks the self-referential loop — where cache reads cache, analysis reads cache, and the whole system runs coherently on a false premise indefinitely.

One thing worth remembering for next time: when a diagnostic report sounds especially well-reasoned, ask where its input data came from and when it was last verified against the source. A coherent argument doesn’t guarantee a true premise.

— 邱柏宇

Related Posts