The Catch-Up Mechanism Kicked the System into a Second Crash

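The typhoon warning lifts, and a full day’s backlog of commuters pours onto the platform at once. The first train cannot take them all: people jam the doorways, the train cannot move, and the station staff announce that the platform must be cleared and the queue re-formed. That picture is an accurate description of one way a scheduling system dies.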

How it started

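The machine restarted after being without power for more than ten hours. The scheduler scanned its list of overdue tasks, found a batch of cron jobs that had not run at their scheduled times, and triggered its catch-up logic, sending every overdue task into the runtime almost simultaneously.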

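The problem is in the words “almost simultaneously.” Several tasks in the batch were especially heavy: browser automation tasks that pull all of their accumulated execution records into context on each run, so a single run reaches 200k+ tokens. Several of these landed in the same embedded runtime at once and began competing for compaction and for queue positions. Within minutes the runtime was pushed past its limit and every session was stuck in the running state with zero progress. System load climbed to more than six times its normal value and stayed there for over fifteen minutes.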

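From the outside the behavior looked a lot like OOM: resources exhausted, no response. But the root cause was not one task being too heavy. Eight tasks started at the same time with no stagger, and with no cap on context they burned through the remaining headroom in one go.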

Why it was not obvious at first

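Catch-up is part of the system’s design, and it normally registers as “a silent good thing.” Under normal conditions there are few catch-up tasks after downtime and they are spaced apart in time, so the whole thing goes almost unnoticed. This time the outage was long, the backlog was large, and a few tasks with especially high context consumption happened to be in the mix. It took all three conditions at once to hit this blind spot.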

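The early symptom of a stuck session looks like “the task is still running, just slowly,” not like an obvious error message. Only after load spiked and every session had gone more than fifteen minutes with zero progress was it confirmed as a hang rather than a slow run. That recognition window lengthened the response time.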

How to confirm this is the problem

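After restarting the gateway and clearing all stuck sessions, check the task_runs table to confirm there are no new catch-up entries, check whether nextWakeAtMs in cron status has returned to the normal single-schedule rhythm, and check that load falls back in step. Verification is complete only when all three indicators agree.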

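Abandon this batch of catch-up runs and let each task re-run with a clean context at its own next natural scheduled time. The choice is counterintuitive: “if the overdue work never runs, how is that back to normal?” But the price of forcing the catch-up is a second crash, which costs far more than a delay of one schedule cycle.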

One thing to leave for next time

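A catch-up storm has a prevention window: the first ten minutes after boot. If you check task_runs during that time and see a batch of catch-up entries appear together, there is a chance to intervene manually before the runtime is overwhelmed. Reacting after load has already spiked is chasing the train.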

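Giving browser-heavy tasks a short timeout and limiting the number of actions per run lowers the context ceiling of a single task. Catch-up stagger and max-concurrent settings, if they exist, should be enabled by default rather than switched on after the first incident. The fault-tolerance mechanism did not account for resource contention between tasks. That is not a design error but a boundary condition of the design, and it happens to be the one most easily triggered after a power outage.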

— 邱柏宇

The Catch-Up Storm That Crashed the Recovery

Typhoon warning lifted. Hundreds of commuters flood the platform at once. The first train can’t absorb them — doors jam, the train stalls, the station attendant broadcasts “please clear the platform.” That’s a precise description of how a scheduler can kill itself during recovery.

What happened

The machine came back online after a power outage lasting over ten hours. The scheduler scanned its backlog, found a batch of cron jobs that had missed their windows, and fired them all — nearly simultaneously — as catch-up runs.

The problem lives in “nearly simultaneously.” Several of those tasks were resource-heavy: browser automation jobs that pull their full execution history into context on every run, inflating to 200k+ tokens per session. Multiple sessions like that hitting the same embedded runtime at once started competing for compaction slots and queue positions. The runtime hit its context ceiling within minutes. Every session froze at running status, zero progress, for over fifteen minutes. System load climbed to six times normal and stayed there.

From the outside, it looked like an out-of-memory (OOM) failure: resources exhausted, nothing responding. The real cause wasn’t any single task being too heavy. Eight tasks launched with no stagger, no concurrency cap, and no per-task context limit. The system had survived the outage, then immediately tried to do five times the normal workload and hit the wall a second time.
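To make “stagger plus a concurrency cap” concrete, here is a minimal sketch in Python. It is not the scheduler’s actual code: `run_catch_up`, `fake_heavy_job`, and the parameter names are all hypothetical, and the jobs are stand-in coroutines rather than real browser tasks.

```python
import asyncio


async def run_catch_up(jobs, max_concurrent=2, stagger_s=0.05):
    """Run missed jobs with a concurrency cap and a delay between launches.

    `jobs` is a list of zero-argument coroutine functions (hypothetical
    stand-ins for cron tasks). Returns the peak number running at once.
    """
    gate = asyncio.Semaphore(max_concurrent)
    running = 0
    peak = 0

    async def guarded(index, job):
        nonlocal running, peak
        # Stagger: job i waits i * stagger_s before competing for a slot.
        await asyncio.sleep(index * stagger_s)
        async with gate:
            running += 1
            peak = max(peak, running)
            try:
                await job()
            finally:
                running -= 1

    await asyncio.gather(*(guarded(i, job) for i, job in enumerate(jobs)))
    return peak


async def fake_heavy_job():
    # Placeholder for a heavy browser-automation run.
    await asyncio.sleep(0.1)


def demo_peak(n_jobs=8, max_concurrent=2):
    jobs = [fake_heavy_job] * n_jobs
    return asyncio.run(
        run_catch_up(jobs, max_concurrent=max_concurrent, stagger_s=0.01)
    )
```

With eight jobs and `max_concurrent=2`, no more than two ever overlap, however many were overdue. The stagger alone only spreads out the start times. The semaphore is what bounds the load.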

Why it wasn’t obvious at first

Catch-up logic is the quiet good part of a scheduler. Under normal conditions, a short outage means a handful of missed jobs, slight time offsets between them, no visible impact. This time: a long outage, a large backlog, and a few context-heavy jobs mixed in. Three conditions converging in the same window.

Early symptoms looked like “tasks running slow,” not “tasks are dead.” Nothing surfaced a clear error. It took fifteen-plus minutes of zero progress before the diagnosis flipped from “slow” to “hung.” That recognition lag cost real time.
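One way to shorten that lag is to judge sessions by time since last progress rather than by status alone. The sketch below assumes each session exposes a last-progress timestamp. The `SessionSnapshot` fields and the five-minute threshold are illustrative assumptions, not values taken from the real system.

```python
from dataclasses import dataclass


@dataclass
class SessionSnapshot:
    session_id: str
    status: str            # e.g. "running"
    last_progress_ms: int  # hypothetical field: time of last observed progress


def classify(snapshot, now_ms, stall_after_ms=5 * 60 * 1000):
    """Label a running session as 'progressing' or 'stalled'."""
    if snapshot.status != "running":
        return snapshot.status
    idle_ms = now_ms - snapshot.last_progress_ms
    return "stalled" if idle_ms >= stall_after_ms else "progressing"


def all_stalled(snapshots, now_ms, stall_after_ms=5 * 60 * 1000):
    """True when there is at least one session and every one is stalled."""
    labels = [classify(s, now_ms, stall_after_ms) for s in snapshots]
    return bool(labels) and all(label == "stalled" for label in labels)
```

A session that reports `running` but has made no progress for longer than the threshold is treated as stalled. If every session is in that state at once, that points to a hang rather than a slow run.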

How to confirm it

Restart the gateway, clear all stuck sessions, then check three things: no new catch-up entries in task_runs, nextWakeAtMs in cron status showing a single natural next schedule, and load returning to baseline. All three together confirm the storm is over and won’t re-trigger.
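The three checks can be written down as one function so that nobody declares victory on two out of three. This is a sketch under assumptions: how the values are read from task_runs and cron status is left out, “a single natural next schedule” is approximated as “nextWakeAtMs lies in the future,” and the 1.5x load tolerance is arbitrary.

```python
from dataclasses import dataclass


@dataclass
class RecoveryCheck:
    new_catch_up_entries: int  # catch-up rows in task_runs since the restart
    next_wake_at_ms: int       # nextWakeAtMs as reported by cron status
    now_ms: int
    load: float
    baseline_load: float


def failed_checks(check, load_tolerance=1.5):
    """Return the list of indicators that do not yet look healthy."""
    failures = []
    if check.new_catch_up_entries > 0:
        failures.append("catch-up entries still appearing")
    if check.next_wake_at_ms <= check.now_ms:
        failures.append("nextWakeAtMs is not in the future")
    if check.load > check.baseline_load * load_tolerance:
        failures.append("load above baseline")
    return failures
```

An empty list means all three indicators agree. Anything else names what still needs a look.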

The catch-up batch itself was abandoned: each job runs again at its next natural schedule with a clean context. That is the counterintuitive part, because letting the backlog go feels like an incomplete recovery. But forcing a re-run into an already-stressed runtime risks another crash, which is more expensive than waiting one schedule cycle.
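“Next natural schedule” is easy to compute for fixed-interval jobs. The sketch below assumes a job defined by an anchor time and an interval in milliseconds. Real cron expressions need a proper parser, and `next_natural_run` is a hypothetical helper, not part of the scheduler.

```python
def next_natural_run(anchor_ms, interval_ms, now_ms):
    """First scheduled slot strictly after now, skipping every missed slot."""
    if interval_ms <= 0:
        raise ValueError("interval_ms must be positive")
    if now_ms < anchor_ms:
        return anchor_ms
    slots_elapsed = (now_ms - anchor_ms) // interval_ms + 1
    return anchor_ms + slots_elapsed * interval_ms
```

However many slots were missed during the outage, the job gets exactly one future run time and no replay of the past.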

One thing worth noting next time

There’s a ten-minute window after reboot where intervention is still cheap. If task_runs shows a cluster of catch-up entries appearing at the same timestamp during that window, there’s time to stagger or cancel before the runtime gets overwhelmed. By the time load spikes, the train has already left and the doors are jammed.
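Spotting that cluster can be automated. The sketch below takes catch-up start timestamps (however they are pulled from task_runs, which is not shown) and reports the densest one-second bucket inside the post-boot window. The bucket size, the threshold of three, and the function name are assumptions made for illustration.

```python
from collections import Counter


def catch_up_burst(started_at_ms, boot_ms, window_ms=10 * 60 * 1000,
                   bucket_ms=1000, threshold=3):
    """Return (bucket_start_ms, count) for the densest bucket, or None.

    Only timestamps inside [boot_ms, boot_ms + window_ms) are considered.
    """
    buckets = Counter(
        (ts - boot_ms) // bucket_ms
        for ts in started_at_ms
        if boot_ms <= ts < boot_ms + window_ms
    )
    if not buckets:
        return None
    # Highest count wins; ties go to the earliest bucket.
    bucket, count = max(buckets.items(), key=lambda kv: (kv[1], -kv[0]))
    if count < threshold:
        return None
    return boot_ms + bucket * bucket_ms, count
```

A non-`None` result during the first ten minutes is the signal to stagger or cancel by hand.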

Shorter timeouts and per-run action limits on browser-heavy crons keep individual context footprints manageable. Catch-up stagger and max-concurrent settings, if available, should be on by default — not treated as optional tuning after the first incident. The catch-up mechanism wasn’t wrong; it just had no model for resource contention across concurrent jobs. That’s a boundary condition that gets hit most reliably right after a power failure.
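A short timeout and a per-run action limit can both be enforced by a small wrapper around the task. This is a sketch, not the runtime’s real interface: `run_bounded` is hypothetical, and each action is modelled as a zero-argument coroutine function.

```python
import asyncio


async def run_bounded(actions, timeout_s, max_actions):
    """Run actions in order, stopping at the action cap or the timeout.

    Returns (outcome, actions_completed), where outcome is one of
    "completed", "action_limit", or "timeout".
    """
    done = 0

    async def loop():
        nonlocal done
        for action in actions:
            if done >= max_actions:
                return "action_limit"
            await action()
            done += 1
        return "completed"

    try:
        outcome = await asyncio.wait_for(loop(), timeout=timeout_s)
    except asyncio.TimeoutError:
        outcome = "timeout"
    return outcome, done


async def quick_step():
    await asyncio.sleep(0)


async def slow_step():
    await asyncio.sleep(0.5)
```

Fewer actions per run means less history added to the context on each run. The timeout means a stuck run gives its slot back instead of holding it indefinitely.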

— 邱柏宇
