資料寫進去了,但 500 自己爬上來了

資料寫進去了,但 500 自己爬上來了

去超商機台列印文件,紙從出紙口滑出來了,文件在手上,結果螢幕彈出「列印失敗,請重試」。台灣的超商機台幾乎承擔一切服務——繳費、取票、列印——這種場景不是假設,是很多人真的愣在那裡不知道要不要相信那個紅色畫面的瞬間。後端的 500 就是這件事,只是發生在 API 層。

現象

前端送出一個編輯操作,後端回了 500,前端顯示失敗。但打開資料庫——紀錄確實更新了,時間戳正確,欄位值也對。主操作完全正常,500 是從別處傳染過來的。

追進去發現,寫入完成之後,系統會自動觸發一段事件通知邏輯。那段 INSERT 用了一個新的事件類型值:group_expense_updated。問題在於這個值不在資料表的 ENUM 列表裡。一丟進去就炸,而且整段通知邏輯沒有被 try/catch 包住,例外一路往上傳,把整個請求的 HTTP 回應拖成了 500。

容易誤判的原因

第一時間看起來是寫入失敗。500 就是伺服器錯誤,前端收到 500 就判定操作沒完成——這是合理的推論,幾乎任何框架的預設行為都是這樣。沒有人會先去資料庫確認主操作是否成功,再回來看是哪個側效應炸了。

更深的問題是:這個事件通知是個「沒人在意的側效應」。它不是主流程,不回傳任何對前端有意義的資料,理論上失敗了也不應該影響主操作的結果。但它沒有被隔離。側效應的例外直接接管了主請求的回應碼,喧賓奪主。

分界點

問題出現的精確位置:新增了一個事件類型,但沒有同步更新 schema 的 ENUM 列表。ENUM 是資料庫層的強型別約束,插入不在列表裡的值,直接報錯,不商量。

這個設計選擇在系統早期是合理的——ENUM 確保類型值乾淨可控。但系統在成長,事件類型會增加,每次新增都需要一次 schema migration,而 migration 很容易在部署流程中被遺漏。這次就是被遺漏了。

修法兩步,都很直接:把 ENUM 欄位改成 VARCHAR,以後新增事件類型不再需要 schema migration;把所有事件通知邏輯包進獨立的 try/catch,讓側效應的失敗不再決定主操作的回應碼。前者是 migration 010,後者是錯誤邊界的補課。

留給下次的一件事

任何在主操作完成後才執行的邏輯——通知、稽核日誌、事件廣播——都應該預設被隔離。它們失敗,主操作不應該跟著死。這不是防禦性編程的風格問題,是邊界設計。側效應拿到了整個請求的控制權,這件事從一開始就不該發生。

下次看到「操作失敗」的時候,值得多問一句:資料庫裡長什麼樣子?

— 邱柏宇

延伸閱讀


The Write Succeeded. The 500 Didn’t Care.

You print a document at a convenience store kiosk. The paper slides out. You hold it in your hand. The screen flashes red: “Print failed. Please try again.” The write succeeded. The error came from somewhere else entirely.

That’s exactly what happened here — just at the API layer.

What Was Observed

A client submitted an edit operation. The server returned a 500. The frontend reported failure. But opening the database showed the record had been correctly updated — right timestamp, right values. The primary write completed without issue. The 500 came from a side effect.

Digging in: after the write, the system automatically triggered an event notification. That notification attempted an INSERT with a new event type value, group_expense_updated, which wasn’t in the table’s ENUM list. The insert blew up. The notification logic had no try/catch wrapping it. The exception propagated all the way up and hijacked the HTTP response code for the entire request.

Why It Was Misread

A 500 means server error. The frontend received a 500 and concluded the operation had failed — that’s the default assumption for any sane client. Nobody’s first instinct is to check the database directly, then trace back to figure out which side effect detonated.

The deeper issue: this event notification was an afterthought — no meaningful return value, no bearing on the primary result. It should have been irrelevant to the outcome. But it wasn’t isolated. An unhandled exception in a side effect claimed ownership of the entire request’s response code.

The Exact Boundary

The fault point: a new event type was added without a corresponding update to the schema’s ENUM list. ENUM is a hard constraint at the database level — insert a value outside the list and it errors immediately, unconditionally.

ENUM was a reasonable choice early on, when the set of event types was small and stable. But event types grow. Every addition requires a schema migration, and migrations are easy to miss in deployment. This one was missed.

The fix was two steps. Replace the ENUM column with VARCHAR — new event types no longer require a migration. Wrap all event notification logic in its own try/catch — side-effect failures no longer decide the primary operation’s response code.

One Thing Worth Remembering

Any logic that runs after the primary operation completes — notifications, audit logs, event broadcasts — should be isolated by default. If it fails, the primary operation shouldn’t go down with it. This isn’t a style preference. It’s a boundary decision. A side effect having full control over the request’s response code was always the wrong design.

Next time “operation failed” shows up: it’s worth checking what the database actually says first.

— 邱柏宇

Related Posts