System Restart Troubleshooting | Why Errors Persist After Deployment | Taiwan-Japan Cultural Dialogue JustFLY~JustBlog~

— 邱柏宇

Your Code Deployed But Never Ran

The restaurant updated its menu, but the delivery app still shows the old version. You didn’t make a mistake. The system never refreshed.

The subtitle module was writing files in which every line of dialogue carried the speaker’s name as a prefix: “賈丁丁: xxx”. That looked like the parser’s job. I adjusted the parsing logic, rewrote the function, and added a prefix-stripping helper. Tested three times. Failed three times.

On the fifth patch I rewrote the entire Code node. Deployed. Still failed.
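For reference, a prefix-stripping helper of the kind described above might look like the sketch below. It is not the code from my Code node (which would be JavaScript inside n8n); `strip_speaker_prefix` and `SPEAKER_PREFIX` are names made up for illustration, and the pattern assumes a short speaker name followed by an ASCII or full-width colon.

```python
import re

# Assumed format: optional whitespace, a short speaker name without spaces
# or colons, then an ASCII ":" or full-width ":" and optional whitespace.
SPEAKER_PREFIX = re.compile(r"^\s*[^\s::]{1,20}\s*[::]\s*")


def strip_speaker_prefix(line: str) -> str:
    """Remove one leading 'Name: ' prefix if present; otherwise return the line unchanged."""
    return SPEAKER_PREFIX.sub("", line, count=1)


cleaned = strip_speaker_prefix("賈丁丁: xxx")
untouched = strip_speaker_prefix("no prefix here")
```

One caveat with this approach: a line of dialogue that legitimately starts with a single word and a colon would be trimmed too, so a real parser would likely check against a known list of character names.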

The Problem Wasn’t the Code

Only when I deactivated and reactivated the workflow did the problem disappear. The code wasn’t wrong. The system wasn’t running the new code.

n8n’s task-runner maintains its own execution context. When you PATCH a workflow, the system only updates the workflow definition in storage. The runner keeps using the old Code node from memory. You think you’re fixing code. You’re actually editing a document no one’s reading.
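The gap between storage and memory can be modeled in a few lines. This is a toy sketch, not n8n’s actual internals: `WorkflowStore`, `Runner`, and their methods are hypothetical names invented for illustration. It only shows how a process that loads code once at startup keeps running that copy after the stored definition changes.

```python
class WorkflowStore:
    """Hypothetical stand-in for persistent workflow storage."""

    def __init__(self, code: str) -> None:
        self.code = code

    def patch(self, new_code: str) -> None:
        # A PATCH only rewrites the stored definition.
        self.code = new_code


class Runner:
    """Hypothetical long-running process that loads code once, at startup."""

    def __init__(self, store: WorkflowStore) -> None:
        self.store = store
        self.loaded_code = store.code  # snapshot taken at startup

    def run(self) -> str:
        # Executes the in-memory snapshot and never re-reads the store.
        return self.loaded_code

    def restart(self) -> None:
        # Models an off-and-on cycle: drop the snapshot and reload from storage.
        self.loaded_code = self.store.code


store = WorkflowStore("v1: keeps the name prefix")
runner = Runner(store)

store.patch("v2: strips the name prefix")
stale = runner.run()  # still the v1 snapshot

runner.restart()
fresh = runner.run()  # now the v2 definition
```

Between `patch` and `restart`, the store and the runner disagree about what the code is, and only the runner’s copy ever executes.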

This isn’t unique to n8n. Any system with long-running processes can hit this: AWS Lambda’s warm containers, Docker Compose services that were never restarted, Kubernetes deployments that were never rolled out. Updating the definition and updating the runtime are two different things. The former is writing. The latter is replacement.

The Vending Machine and Runtime Memory

Taiwan has over 4,000 convenience stores. Late at night you stand outside a FamilyMart while the vending machine’s panel flashes the wrong item code. Your colleague says it’s been updated, but the machine keeps dispensing the wrong drink, because the update went to the backend database, not to the machine’s own cache.

Execution context is that cache. It doesn’t check for new versions. It only runs what it has, until you explicitly tell it: stop, reload.

Deactivating and then reactivating does exactly that. Not a fix, a replacement. The old runner instance terminates, and the new one reads the updated Code node on startup. This is why many deployment scripts force a restart or reload step. Not redundant. The only guarantee.
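In a scripted deployment, the same off-and-on cycle can be automated. The sketch below assumes that n8n’s public REST API exposes `POST /api/v1/workflows/{id}/deactivate` and `POST /api/v1/workflows/{id}/activate`, authenticated with an `X-N8N-API-KEY` header; check those paths against the API reference for your n8n version before relying on them. The base URL, workflow id, and key are placeholders, and `build_reactivation_requests` and `reactivate` are hypothetical helper names.

```python
import urllib.request


def build_reactivation_requests(
    base_url: str, workflow_id: str, api_key: str
) -> list[urllib.request.Request]:
    """Build the deactivate-then-activate request pair. Nothing is sent here."""
    requests = []
    for action in ("deactivate", "activate"):  # order matters: off, then on
        # Assumed endpoint layout; verify against your n8n version.
        url = f"{base_url.rstrip('/')}/api/v1/workflows/{workflow_id}/{action}"
        requests.append(
            urllib.request.Request(
                url, method="POST", headers={"X-N8N-API-KEY": api_key}
            )
        )
    return requests


def reactivate(
    base_url: str, workflow_id: str, api_key: str, send=urllib.request.urlopen
) -> None:
    """Send both requests in order; `send` is injectable so it can be faked in tests."""
    for request in build_reactivation_requests(base_url, workflow_id, api_key):
        response = send(request)  # urlopen raises HTTPError on 4xx/5xx responses
        response.close()
```

Calling `reactivate("https://n8n.example.com", "42", "my-api-key")` as the last step of a deploy script would cycle the workflow after the PATCH; all three argument values here are made up.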

What Five Patches Taught Me

I modified the code five times, convinced each time I’d got it right. But every test result was identical, because the tests weren’t running my latest version.

The hardest part about this kind of bug: your debugging direction is completely correct, the program logic genuinely has issues, but the fixes have no effect. So you doubt your judgment, your implementation, even the language’s behavior. Until you realize the problem isn’t the content. It’s the delivery.

My checklist now has one more item: after deploy, confirm the runtime actually restarted. Not by checking API responses. By checking process uptime, container restart timestamps, or workflow activation logs. Numbers speak. Timestamps don’t lie.
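That checklist item boils down to one comparison: did the runtime start after the deploy? Here is a minimal sketch, assuming you can get either a start timestamp or an uptime in seconds from your platform. The function names and all timestamps are hypothetical example values.

```python
from datetime import datetime, timedelta, timezone


def started_at(now: datetime, uptime_seconds: float) -> datetime:
    """Derive a process start time from its reported uptime."""
    return now - timedelta(seconds=uptime_seconds)


def restarted_since_deploy(deployed_at: datetime, runtime_started_at: datetime) -> bool:
    """True only if the runtime came up at or after the deploy finished."""
    return runtime_started_at >= deployed_at


# Made-up example values.
now = datetime(2024, 1, 1, 12, 10, tzinfo=timezone.utc)
deployed_at = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)

# Up for 3 hours: started at 09:10, before the deploy, so it runs old code.
stale = restarted_since_deploy(deployed_at, started_at(now, 3 * 3600))

# Up for 5 minutes: started at 12:05, after the deploy.
fresh = restarted_since_deploy(deployed_at, started_at(now, 5 * 60))
```

A deploy script can fail loudly when this check returns `False`, instead of reporting success because the API call returned 200.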

— 邱柏宇
