Scheduled for 6pm, Runs at 11:34pm Every Time
A night market stall owner spent two weeks adjusting the flame, trying every temperature setting, only to discover the gas tank’s output valve wasn’t fully open. The pan was fine, the stove was fine. The problem was upstream, at the supply end.
I ran into the same thing. An automated publishing workflow was scheduled for 18:00, yet it ran at 23:34, every single time. I spent fourteen days examining the workflow engine: schedule config, timezone settings, queue status. The deeper I dug, the more lost I got.
Where the Symptom Appeared
The workflow engine logs showed the task executed at 23:34. Execution time was normal. No timeout, no retry. I modified the cron expression three times, verified the system timezone was UTC+8, checked daylight saving settings (though Taiwan doesn’t even use that). I even suspected the queue worker’s polling interval was too long and combed through the entire config file.
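One of those checks can be settled with arithmetic rather than suspicion: a timezone mix-up between Asia/Taipei and UTC would shift the schedule by exactly eight hours, never by 5 h 34 min. A quick sanity check (dates are illustrative; `zoneinfo` is in the standard library since Python 3.9):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# 18:00 in Taipei (UTC+8, no daylight saving) is always 10:00 UTC.
local = datetime(2025, 1, 6, 18, 0, tzinfo=ZoneInfo("Asia/Taipei"))
utc = local.astimezone(ZoneInfo("UTC"))
print(utc.strftime("%H:%M"))  # 10:00
```

A whole-hour shift would point at timezone handling; a 5 h 34 min shift rules it out.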
Every setting pointed to 18:00. But it never ran at 18:00.
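Worth noting: the gap was the same 5 h 34 min on every run, which already hints at something deterministic upstream rather than random latency. With illustrative timestamps:

```python
from datetime import datetime

scheduled = datetime(2025, 1, 6, 18, 0)  # what every config file said
actual = datetime(2025, 1, 6, 23, 34)    # what the logs showed, every run
print(actual - scheduled)  # 5:34:00
```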
Day seven, I suspected database locks. Day ten, I checked container resource limits. Day thirteen, I wondered if network latency was queueing some API call. Every direction explored, every hypothesis rejected.
Where the Problem Actually Was
Then I figured out something fundamental: that workflow engine never handled scheduling at all. It was just a webhook receiver: a request comes in, it runs. It never decided when anything ran. The trigger timing was determined by an upstream scheduling service on a local host, and that host sometimes went to sleep, sometimes had other tasks queued ahead of mine.
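The receiver's role can be caricatured in a few lines; the names here (`handle_webhook`, `run_publish_job`) are mine, not the actual service's API. The point is what's missing: there is no clock and no schedule anywhere in it.

```python
def run_publish_job(payload: dict) -> str:
    """Stand-in for the real publishing work."""
    return f"published {payload.get('post_id')}"

def handle_webhook(payload: dict) -> str:
    # The receiver's entire "scheduling" logic: none.
    # Whoever sends the request decides when this runs.
    return run_publish_job(payload)

print(handle_webhook({"post_id": 7}))
```

Auditing a component like this for schedule bugs can never succeed, because the schedule lives somewhere else entirely.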
The local scheduler used a systemd timer, and its unit file did say 18:00. But that machine also ran a dozen other scheduled tasks; some hogged the CPU, some saturated disk I/O. When the webhook request actually went out depended on the machine's load at that moment. Meanwhile I had spent the whole time staring at the downstream workflow engine, assuming the problem was on the receiving end.
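A timer of roughly this shape would produce exactly this behavior; the unit name and directives below are my reconstruction, not the actual config. The key is that `OnCalendar=` only controls when the service is *started*, and `Persistent=true` replays a trigger missed while the machine was asleep, so "the timer says 18:00" guarantees very little about when the webhook actually goes out.

```ini
# publish.timer (hypothetical reconstruction)
[Unit]
Description=Fire the publishing webhook daily

[Timer]
# Fires at 18:00 host-local time; this starts the unit, nothing more.
OnCalendar=*-*-* 18:00:00
# Slack systemd allows itself when coalescing wakeups (1min is the default).
AccuracySec=1min
# If the host was asleep at 18:00, fire as soon as it wakes.
Persistent=true

[Install]
WantedBy=timers.target
```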
The problem was never in the workflow engine from day one. I just mistook “where the symptom appeared” for “where the root cause was,” then spent two weeks looking for keys in the wrong room.
The Diagnostic Offset
This kind of offset is hard to catch because the symptom genuinely is there. The logs really do show execution at 23:34, and the workflow engine really is the last link you can observe. But the place where you observe a problem is not necessarily the place where it happens.
I later checked that host's systemd journal. The timer triggered at 18:00 as configured, but the job entered a queue and waited over five hours for its turn, stuck behind backup scripts, log rotation, and a few monitoring agents nobody used anymore. All of it was crammed onto the same aging VM, and nobody had ever thought to split it apart.
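To see how a constant 5 h 34 min delay falls out of a single busy box, here is a toy single-worker queue. The job names and durations are invented, chosen only so the arithmetic lands on 23:34; the real journal showed the same shape, with the timer firing on time and the job start drifting by whatever ran ahead of it.

```python
from collections import deque

# One worker, strict arrival order: a caricature of the overloaded VM.
# Durations (in minutes) are invented for illustration.
queue = deque([
    ("nightly-backup", 190),
    ("log-rotation", 45),
    ("stale-monitoring-agent", 99),
    ("publish-webhook", 1),  # triggered punctually at 18:00
])

clock = 18 * 60  # minutes since midnight; everything queued at 18:00
starts = {}
while queue:
    name, duration = queue.popleft()
    starts[name] = f"{clock // 60:02d}:{clock % 60:02d}"
    clock += duration

print(starts["publish-webhook"])  # 23:34
```

As long as the jobs ahead of it are the same every evening, the "random-looking" delay is perfectly reproducible.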
The fix was simple: move the scheduler to another machine, or replace the local timer with a cloud scheduler. But finding the problem took fourteen days.
Next time I see a mismatch between “configured value” and “actual behavior,” I’ll look upstream first, not dig deeper downstream. Wrong flame? Check the gas first.
— 邱柏宇