The Request Timed Out — But It Already Arrived

At a convenience store payment kiosk, you press confirm and the screen spins for thirty seconds without a response. You assume it failed and press confirm again. You get billed twice. The machine had already sent the data; the confirmation screen was just slower than your patience. The engineering incident that surfaced this week has exactly the same shape.

What Happened

A scheduled automation service sends an HTTP request to a downstream system on each run. One time, the caller received a non-zero exit code after five seconds, decided the request had failed, and immediately sent a duplicate. The downstream system received two identical tasks and nearly executed both.

Digging through execution logs revealed the truth: the first request had already arrived and started processing before the timeout fired. The downstream response was sent after the caller disconnected — the caller never received it, but the task was already running.

Where the Logic Splits

The root cause is a misreading of what a timeout means. --max-time controls how long the caller is willing to wait, not whether the request was received. When the downstream system answers quickly, the two coincide and the distinction stays invisible. When the downstream response is slow, they diverge.

Whether the connection was established and the request transmitted is one event; whether the caller received a response is another. A timeout exit code means “I stopped waiting.” It does not mean “the other side didn’t receive it.”
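To make the split concrete, here is a minimal Python sketch. It is not the service from the incident: it assumes a hypothetical local server (SlowHandler) that records the request and then takes longer to answer than the caller is willing to wait. The names run_demo and SlowHandler, the /task path, and both durations are invented for illustration.

```python
import http.client
import http.server
import socket
import threading
import time


def run_demo(client_timeout=0.3, server_delay=1.0):
    """Send one request to a slow local server; return (timed_out, received_count)."""
    received = []                # request bodies the server has accepted
    arrived = threading.Event()  # set as soon as the server records the request

    class SlowHandler(http.server.BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            received.append(self.rfile.read(length))  # the task is now "received"
            arrived.set()
            time.sleep(server_delay)  # processing outlasts the caller's patience
            try:
                self.send_response(200)
                self.end_headers()
            except OSError:
                pass  # the caller has already hung up

        def log_message(self, *args):
            pass  # keep the demo quiet

    server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), SlowHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()

    conn = http.client.HTTPConnection(
        "127.0.0.1", server.server_address[1], timeout=client_timeout
    )
    timed_out = False
    try:
        conn.request("POST", "/task", body=b"job-1")
        conn.getresponse()  # blocks until a response arrives or the timeout fires
    except socket.timeout:
        timed_out = True    # "I stopped waiting", nothing more
    finally:
        conn.close()

    arrived.wait(timeout=2.0)
    server.shutdown()
    server.server_close()
    return timed_out, len(received)


if __name__ == "__main__":
    print(run_demo())
```

In this setup the caller times out while the server's list already holds the request, which is the state the execution logs showed.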

Why It’s Easy to Get Wrong

In most scenarios, a non-zero exit code really does mean failure: a refused connection, a DNS resolution failure, a dropped network. Retrying is correct in all of those cases. The problem is that a timeout on a fire-and-forget request also produces a non-zero exit code. To retry logic that only checks for non-zero, it looks identical to those errors, yet it means something entirely different.

Timeout in this context means “sent, outcome unknown” — not “failed to send, please retry.” That semantic gap hides behind a uniform-looking exit code. Without cross-referencing downstream logs, it won’t surface on its own.

The Check

The verification is straightforward: compare timestamps in the downstream system’s receiving logs. If the first request’s arrival time is earlier than the time the caller logged its timeout, the task was delivered and only the response failed to come back. One check, no elaborate tracing needed.
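The comparison itself can be sketched in a few lines of Python. This assumes both logs carry ISO 8601 timestamps with an explicit UTC offset and that the two machines' clocks are reasonably in sync. The function name and the sample timestamps are hypothetical, not taken from the real logs.

```python
from datetime import datetime


def delivered_before_timeout(arrival_iso: str, timeout_iso: str) -> bool:
    """True if the downstream logged the request before the caller gave up."""
    arrival = datetime.fromisoformat(arrival_iso)  # downstream receive time
    gave_up = datetime.fromisoformat(timeout_iso)  # caller's timeout time
    return arrival < gave_up


if __name__ == "__main__":
    # Hypothetical values: the request arrives one second in, the caller gives up at five.
    print(delivered_before_timeout(
        "2024-01-01T10:00:01+00:00",
        "2024-01-01T10:00:05+00:00",
    ))
```

If the clocks can drift by more than the gap between the two timestamps, the comparison is inconclusive and the outcome should stay "unknown".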

The Thing Worth Watching Next Time

The corrected retry logic splits the condition: genuine connection errors still trigger a retry; timeouts are always treated as “sent, outcome unknown” and do not auto-resend. If the downstream system has no idempotency design, this is the exact branch where duplicate execution originates.
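A rough Python sketch of that split follows. It assumes the caller shells out to curl, which the --max-time flag suggests but the post does not state. The codes 6 (could not resolve host), 7 (failed to connect) and 28 (operation timed out) are curl's documented exit codes. Outcome, classify and send_with_retry are hypothetical names, and this is not the actual fix that was deployed.

```python
from enum import Enum
from typing import Callable

# curl exit codes (assumption: the caller is curl)
COULD_NOT_RESOLVE_HOST = 6
FAILED_TO_CONNECT = 7
OPERATION_TIMED_OUT = 28


class Outcome(Enum):
    DELIVERED = "delivered"
    NOT_DELIVERED = "not delivered"
    UNKNOWN = "sent, outcome unknown"


def classify(exit_code: int) -> Outcome:
    """Map an exit code to what it says about delivery."""
    if exit_code == 0:
        return Outcome.DELIVERED
    if exit_code in (COULD_NOT_RESOLVE_HOST, FAILED_TO_CONNECT):
        return Outcome.NOT_DELIVERED  # the request never left; retrying is safe
    # Timeouts (28) and anything unrecognised: the request may have arrived.
    return Outcome.UNKNOWN


def send_with_retry(send: Callable[[], int], max_attempts: int = 3) -> Outcome:
    """Call send() until it is delivered or possibly delivered; never resend on UNKNOWN."""
    outcome = Outcome.NOT_DELIVERED
    for _ in range(max_attempts):
        outcome = classify(send())
        if outcome is not Outcome.NOT_DELIVERED:
            break  # delivered, or unknown: stop and let a human or a log check decide
    return outcome
```

Treating every unrecognised code as "unknown" is a deliberately conservative choice. Code 28 can also fire while curl is still connecting, so it does not prove delivery; it only means delivery cannot be ruled out.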

Before wiring any retry to a non-zero exit code, one question is worth asking first: does this error mean “never delivered,” or does it mean “delivered but no response came back”? The handling is different. To a check that only asks whether the code is non-zero, the two look the same.

— 邱柏宇
