觸發記錄全綠,但任務從來沒有真正開始

觸發記錄全綠,但任務從來沒有真正開始

有點像在台灣用便利商店店到店寄件:追蹤頁面一直顯示「已出庫、運送中」,但包裹在第一個中繼站就靜默卡住,從沒真正派出去。你每天查、查了四天,系統從沒報過一個錯。

技術環境

Node.js + OpenClaw gateway,排程由 cron plugin 管理,任務以 agentTurn payload 觸發,sessionTarget 設為 "isolated" 模式。Isolated runner 在每次 job 觸發時初始化一個獨立沙箱——加載 context engine、tools config 與 agent persona——整個 boot 流程必須在 60 秒內完成,超時則靜默退出。問題發生在 boot 階段,早於任何 prompt 執行邏輯,與 task 內容本身無關。任何在低規格或長時間未重啟的機器上、使用 isolated session target 的排程任務,都可能在升級後遇到相同的靜默超時。

這次碰到的狀況結構幾乎一樣。某個自動化排程任務,升級之後連續四天沒有輸出。觸發記錄完整,時間準確,日誌欄位沒有空缺,一切看起來都在運作。唯一的異狀是:預期的結果從沒出現。

錯誤傳染鏈(時序)

Cron Plugin          Gateway Scheduler      Isolated Runner
    |                       |                      |
    |── trigger agentTurn ─>|                      |
    |<── 202 Accepted ──────|                      |
    |                       |── spawn runner ──────>|
    |                       |                      |── boot context engine...
    |                       |                      |── load tools config...
    |                       |                      |── load agent persona...
    |                       |                      |── [T+60s timeout] ←── 超時點 ✗
    |                       |                      |<── silent exit (no output)
    |                       |   (no failure event propagated back)
    |                       |                      |
    trigger log: ✓ success                     task: never started ✗

排程層在收到 202 後就認為觸發完成;失敗發生在更底層的 runner boot 階段,那個層級沒有義務向排程層回報——這就是全程零告警的根源。

分界點在升級那一刻

升級前,任務跑得好好的。升級後,靜默失敗開始。這個分界點本身就是第一條線索——但它容易被忽略,因為「升級」通常被預設為無害的背景事件,不是排查的起點。

深挖之後確認的根因是:那次升級改變了沙箱執行環境(isolated runner)的初始化流程。執行器在完成啟動準備之前,就因為超時而直接退出。任務根本沒有開始執行任何邏輯,連第一行程式碼都沒有碰到。

根據踩坑紀錄,超時的觸發點發生在 runner 尚未 boot up 的階段,計時器在 60 秒後直接放棄。這不是 prompt 執行到一半超時,是 runner 連進場都沒進場就結束了。

為什麼沒有人發現

這個問題的核心難點不在技術,而在通知架構。排程系統負責「觸發」這一層,它做到了——它確實觸發了,也確實記錄了成功。但執行環境的超時發生在另一個更底層的層級,那個層級沒有義務向排程系統回報失敗,兩者之間沒有橋接。

結果就是全程零告警。排程系統認為自己的工作完成了,執行環境在角落裡靜靜超時,沒有人叫。這種「跨層靜默」比單一層的錯誤更難察覺,因為每一層各自都沒有問題。

同一台機器上,使用 sessionTarget: "main" 的其他排程任務一直正常運作。這個對比是最關鍵的確認——同樣的機器、同樣的時間、相同的排程基礎設施,差別只在 session 模式。

修法比預想的簡單

把有問題的任務從 sessionTarget: "isolated" 改成 "main",繞過那個啟動超時的 isolated runner,直接走 main session。修改寫入設定檔之後,gateway 自動重讀,不需要重啟任何服務。

這個修法不是解決根本問題,而是繞過它。Isolated runner 的初始化為什麼在升級後在特定機器上會超過 60 秒——這個問題還開著。已知的可能條件包括低規格環境、長期未重啟、context engine 卡住等,但沒有一個被完整驗證為唯一原因。

繞過之後,cron 恢復正常。代價是 isolated 沙箱提供的 context 隔離不再有效,main session 是共用的。這個取捨在當下是合理的,但值得記住它不是零成本。

Code 對照:修法前後

修法前(isolated agentTurn,boot 超時靜默失敗)

{
  "name": "daily-content-task",
  "sessionTarget": "isolated",   // ← 問題在這裡:isolated runner 60s boot timeout
  "schedule": { "kind": "cron", "expr": "0 13 * * *" },
  "payload": {
    "kind": "agentTurn",         // ← agentTurn 需要 isolated/current/session
    "message": "執行今日內容排程任務..."
  }
}

修法後(main systemEvent,繞過 isolated runner boot)

{
  "name": "daily-content-task",
  "sessionTarget": "main",       // ← 改用 main session,直接注入,不需 spawn runner
  "schedule": { "kind": "cron", "expr": "0 13 * * *" },
  "payload": {
    "kind": "systemEvent",       // ← main 須用 systemEvent,agentTurn 不合法
    "text": "執行今日內容排程任務..."  // ← 欄位名從 message 改為 text
  }
}

該被隔離的側效應類型

  • 排程觸發記錄(event log):觸發事件寫入日誌的動作必須與執行層解耦——觸發成功不應被視為執行成功的代理指標。
  • Runner 狀態回報(runner lifecycle hook):Boot 成功、超時退出、執行完成這三個事件,應分別向上層回報,不應讓排程層只能看見「觸發 202」。
  • 跨層失敗通知(cross-layer failure propagation):當 isolated runner boot 超時,應有一條明確的失敗事件路徑送回 cron plugin,而非靜默丟棄。
  • 可觀測性輸出(observability sink):Runner 生命週期各階段的時間戳、退出原因、退出碼,應寫入可查詢的 log store,而非僅存在 runner 本地記憶體。
  • 健康探針(health probe):對 isolated runner 的 boot 狀態定期探測,超過閾值應主動 alert,不應等待任務結果缺失才被發現。
  • 重試隊列(retry queue):Boot 超時屬於暫態失敗,應有獨立 retry 機制,而非讓該次任務執行窗口直接作廢。
  • 機器狀態警告(host resource alert):低規格機器或長時間未重啟的環境,應在 boot 耗時接近閾值時提前發出警告,而非等到超時才反應。
  • 版本升級回歸測試(upgrade regression gate):涉及 runner 初始化流程的升級,應在 staging 環境對各 session target 模式跑完整的 boot time 測試,不應升級後才在生產環境發現靜默失敗。

判斷標準:如果這段邏輯的失敗不應讓排程系統誤認為任務已成功執行,它就需要獨立的狀態回報路徑。

留給下次的一件事

觸發記錄正常,不等於任務開始執行。這兩件事在大部分情況下是同步的,但在跨層架構裡,它們是可以分離的——排程層只管觸發,執行層有自己的生命週期,而兩者之間的通道可以在無聲中斷掉。

下次排程任務靜默失敗、觸發記錄卻一片綠燈,第一個問題應該是:執行環境本身有沒有成功啟動?不是任務的邏輯跑到哪裡,而是它有沒有進門。

— 邱柏宇

延伸閱讀


Every Trigger Logged Success. Nothing Ever Ran.

It resembles the convenience store parcel-forwarding service common across Taiwan: the tracking page keeps showing “dispatched, in transit,” but the package silently timed out at the first relay point and never actually moved. You check daily for four days. The system never throws an error.

Technical Environment

Node.js + OpenClaw gateway with a cron plugin managing scheduled jobs. Tasks are dispatched as agentTurn payloads targeting sessionTarget: "isolated" mode. An isolated runner initializes a sandboxed environment on each job trigger — loading the context engine, tools config, and agent persona — and must complete the full boot sequence within 60 seconds or exit silently. The failure occurs at the boot stage, before any prompt logic executes, and is independent of task content. Any scheduled task using isolated session target on a low-spec or long-uptime machine can reproduce this behavior after a platform upgrade.

The structure here was nearly identical. After a minor platform upgrade, a scheduled automation task produced no output for four consecutive days. Trigger logs were intact, timestamps accurate, log fields fully populated. Everything looked operational. The only anomaly was the absence of any result.

Error Propagation Sequence

Cron Plugin          Gateway Scheduler      Isolated Runner
    |                       |                      |
    |── trigger agentTurn ─>|                      |
    |<── 202 Accepted ──────|                      |
    |                       |── spawn runner ──────>|
    |                       |                      |── boot context engine...
    |                       |                      |── load tools config...
    |                       |                      |── load agent persona...
    |                       |                      |── [T+60s timeout] ←── failure point ✗
    |                       |                      |<── silent exit (no output)
    |                       |   (no failure event propagated back)
    |                       |                      |
    trigger log: ✓ success                     task: never started ✗

The scheduling layer considers its work complete on receiving 202; the failure happens at the runner boot layer, which has no obligation to report upward — that gap is why no alert ever fired.

The Boundary Was the Upgrade

Before the upgrade, the task ran without issue. After it, silent failure began. That boundary is the first thread worth pulling — but it gets overlooked because upgrades tend to be treated as neutral background events rather than suspects.

The confirmed root cause: the upgrade changed the initialization sequence of the isolated agent runner. The runner timed out before completing its startup sequence. The task never executed a single line of logic. According to the incident record, the timeout counter hit 60 seconds and the system simply abandoned the attempt — not a mid-execution timeout, but a failure to enter the room at all.

Why No One Was Notified

The scheduling layer did its job. It triggered the task and logged a success. The execution environment’s timeout happened at a different, lower layer — one that had no obligation to report failure back up to the scheduler. No bridge between them. No alert fired.

Every layer, viewed in isolation, had nothing to report. The scheduler saw a clean trigger. The runner silently expired. This cross-layer silence is harder to catch than a single-layer error precisely because neither layer is technically wrong about its own scope.

The clearest confirmation came from comparing session modes on the same machine. Tasks using sessionTarget: "main" continued running normally throughout those four days — same host, same schedule, same infrastructure. The only difference was the session target.

The Fix Was Simpler Than Expected

Changing the failing tasks from sessionTarget: "isolated" to "main" bypassed the broken startup layer entirely. The gateway re-reads the config file automatically; no service restart required.

This is a workaround, not a root fix. Why the isolated runner’s initialization exceeds 60 seconds on certain machines after this particular upgrade remains an open question. Known contributing conditions include low-spec hardware, long uptime without restart, and a stalled context engine — but none have been confirmed as the sole cause.

The tradeoff is real: main session is shared, so the context isolation that isolated mode provides is gone. In this case that tradeoff was acceptable. It’s worth knowing it exists.

Code Diff: Before and After

Before (isolated agentTurn — boot timeout, silent failure)

{
  "name": "daily-content-task",
  "sessionTarget": "isolated",   // ← root cause: isolated runner 60s boot timeout
  "schedule": { "kind": "cron", "expr": "0 13 * * *" },
  "payload": {
    "kind": "agentTurn",         // ← agentTurn requires isolated/current/session target
    "message": "Run today's scheduled content task..."
  }
}

After (main systemEvent — bypasses isolated runner boot entirely)

{
  "name": "daily-content-task",
  "sessionTarget": "main",       // ← main session: no runner spawn, inject directly
  "schedule": { "kind": "cron", "expr": "0 13 * * *" },
  "payload": {
    "kind": "systemEvent",       // ← main requires systemEvent; agentTurn is invalid here
    "text": "Run today's scheduled content task..."  // ← field name: message → text
  }
}

Side Effects That Should Be Isolated

  • Trigger event logging: Writing a trigger record must be decoupled from execution — a successful trigger log must never be treated as a proxy for successful task execution.
  • Runner lifecycle hooks: Boot success, timeout exit, and execution completion should each emit separate upstream events. The scheduler should not be limited to seeing only a “202 trigger accepted.”
  • Cross-layer failure propagation: When an isolated runner boot times out, a failure event must travel back to the cron plugin via a defined path — not be silently discarded.
  • Observability sink: Runner lifecycle timestamps, exit reasons, and exit codes should be written to a queryable log store, not held only in the runner’s in-process memory.
  • Health probes: Boot duration for isolated runners should be monitored continuously; approaching the threshold should trigger an alert before the timeout fires.
  • Retry queue: A boot timeout is a transient failure and should enter a retry queue automatically — not silently consume the execution window.
  • Host resource alerting: Low-spec hardware or machines with long uptime should surface a warning when boot latency approaches the threshold, not after it exceeds it.
  • Upgrade regression gate: Any upgrade touching the runner initialization path should include a boot-time benchmark across all session target modes in a staging environment before reaching production.

The test: if this layer’s failure should not cause the scheduler to record a false success, it needs an independent status reporting path.

One Thing for Next Time

A clean trigger log does not mean the task started running. In most setups these are the same event. In a layered architecture they can be separate — the scheduler manages triggers, the execution environment has its own lifecycle, and the channel between them can go dark without a sound.

Next time a scheduled task silently produces nothing while the trigger log stays green: the first question is whether the execution environment itself booted successfully. Not where in the logic the task stalled — whether it ever got through the door.

— 邱柏宇

Related Posts