就像冷氣設了省電模式,但房間角落有台電暖器一直插著——電費帳單來了,才發現暖氣燒掉九成。
帳單數字出來之前,這件事完全沉默。月帳約七十美元,遠超預期。跑了一支用量稽核腳本,把各機器的 session 記錄按模型估價加總,近十四天的費用有九成集中在單一機器。再往下追,元兇是一個早已退役的 agent 的心跳機制——每十五分鐘一次的健康檢查,靜靜地跑最重的模型,兩週燒掉三十三美元。
分界點
心跳機制有兩層設定:機器層級的全域預設,以及各 agent 自帶的模型配置。先前「心跳輕量化」的修復,只套用到一台機器的全域設定。其它機器上的 agent,仍然各自讀自己的 models.json,用自己指定的模型跑心跳。退役 agent 停了對外服務、加了勿派工標記,但心跳的 cron job 還在背景默默執行。沒人注意到。
問題的核心不是設定複雜,而是優先順序鏈沒有被文件化。直覺上會以為「改了機器預設就全域生效」,實際上 agent 層級的設定優先度更高。全域有輕量模型的 override,但各 agent 又覆寫回重型模型——這個覆蓋關係,在任何地方都沒有明確寫出來。
容易誤判的原因
退役流程走完了:停 gateway、加 banner、標記勿派工。這些步驟都做了,看起來 agent 已經安全下線。但「停止接受任務」和「停止所有背景行為」是兩件事。心跳不屬於任務派發系統,它走獨立的 cron,不受 gateway 狀態影響。
更難察覺的是:這類靜默背景任務本來就沒有可觀測性。沒有 dashboard 顯示各 agent 心跳正在用哪個模型、近期燒了多少錢。帳單是月結,問題累積一整個計費週期才浮出水面。
確認方式
稽核腳本的做法直接:拉各機器 session 的用量記錄,按模型單價估價加總,歸因到具體 agent。單一 agent 佔掉九成費用之後,進該機器翻設定檔,才看到全域有輕量模型的 override,但該 agent 的 models.json 把心跳覆寫回重型模型。數字和現象對齊,根因確認。
根本解法分兩層:機器層級把 agents.defaults.heartbeat 設為 null,全面關閉心跳(健康監控有替代機制);agent 層級,退役流程明確加入「停 heartbeat cron」這一步,避免殭屍空轉。修復後重啟 gateway 套用,月帳從七十美元降到約三十五美元。
留給未來
分層配置系統的優先順序鏈,值得在文件裡明確列出:agent 層級 > 機器層級 > 全域預設。這不是直覺可以推導的,尤其當多人維護、多台機器分散部署時,誤判的機率只會更高。
成本稽核腳本應定期跑,例如月初,而不是等帳單超標才回頭查。任何靜默的背景任務——心跳、cron、定時健康檢查——都需要可觀測性:哪個 agent 在跑、用什麼模型、近期花了多少。看不見的任務,就是下一張沒預期到的帳單。
— 邱柏宇
延伸閱讀
The Heartbeat Nobody Watched Was Running Opus All Along
The air conditioner was set to eco mode. But a space heater in the corner of the room had been plugged in the whole time. The electricity bill arrived, and the heater had consumed ninety percent of it.
Before the bill landed, everything was quiet. Monthly Claude API spend came in around $70 — well past what was budgeted. A usage audit script pulled session records from each machine, estimated cost by model pricing, and summed the totals. Almost ninety percent of the past fourteen days’ charges traced back to a single machine. Drilling further: a retired agent’s heartbeat mechanism — a health check firing every fifteen minutes — had been running on the most expensive model the whole time. Thirty-three dollars in two weeks, from a process nobody was watching.
The Dividing Line
Heartbeat configuration exists at two levels: a machine-wide default, and each agent’s own model config. A prior “heartbeat lightweighting” fix only applied to one machine’s global setting. Other machines’ agents kept reading their own models.json files, running heartbeats on whatever model they specified. The retired agent had its gateway stopped and a do-not-dispatch banner added — but the heartbeat cron kept running in the background. Nobody caught it.
The core issue wasn’t configuration complexity. It was that the priority chain was never documented. The reasonable assumption — “changing the machine default covers everything” — is wrong. Agent-level config takes precedence. The global override set a lightweight model, but individual agents overrode it back to heavier ones. That override relationship existed nowhere in writing.
Why It Wasn’t Caught Earlier
The retirement checklist was followed: stop the gateway, add the banner, mark as inactive. All done, agent visibly offline. But “no longer accepting tasks” and “no longer running background processes” are different things. The heartbeat cron runs independently of the gateway. Its behavior isn’t controlled by dispatch status.
The harder part: silent background tasks have no natural observability. No dashboard showed which model each agent’s heartbeat was using, or how much it had spent recently. Billing is monthly. Problems accumulate for an entire billing cycle before surfacing.
How It Was Confirmed
The audit script’s approach was direct: pull session usage records per machine, estimate cost using per-model pricing, attribute to specific agents. Once a single agent accounted for ninety percent of charges, the next step was opening the machine’s config file — where the global setting showed a lightweight model override, but that agent’s own models.json had re-overridden it back to a heavy model. Numbers matched the pattern. Root cause confirmed.
The fix ran at two levels. At the machine level: set agents.defaults.heartbeat to null in openclaw.json, disabling heartbeats entirely — health monitoring had substitute mechanisms. At the agent level: retirement procedures now explicitly include stopping the heartbeat cron, preventing zombie idle-running. After restarting the gateway to apply changes, monthly spend dropped from roughly $70 to roughly $35.
One Thing Worth Noting Next Time
A layered configuration system needs its priority chain written down somewhere: agent level overrides machine level overrides global default. That ordering isn’t intuitive. With multiple maintainers and distributed machines, the chance of a wrong assumption compounds quickly.
Cost audit scripts should run on a schedule — beginning of the month, not after the bill spikes. And any silent background process — heartbeat, cron, health check — deserves visibility: which agent is running, which model, how much it’s spent recently. The task nobody can see is the next bill nobody expected.
— 邱柏宇
Related Posts
https://justfly.idv.tw/s/KUUzFUY