
去超商機台列印文件,紙從出紙口滑出來了,文件在手上,結果螢幕彈出「列印失敗,請重試」。台灣的超商機台幾乎承擔一切服務——繳費、取票、列印——這種場景不是假設,是很多人真的愣在那裡不知道要不要相信那個紅色畫面的瞬間。後端的 500 就是這件事,只是發生在 API 層。
技術環境
Node.js + Express 後端,MySQL 資料庫。event_logs 資料表用 ENUM 欄位限制事件類型;事件通知為 await 同步呼叫,與主操作在同一個 request handler 內執行。問題模式與框架無關——任何在主操作後同步執行側效應、且側效應例外未被攔截的設計,都會複現相同行為。
現象
前端送出一個編輯操作,後端回了 500,前端顯示失敗。但打開資料庫——紀錄確實更新了,時間戳正確,欄位值也對。主操作完全正常,500 是從別處傳染過來的。
追進去發現,寫入完成之後,系統會自動觸發一段事件通知邏輯。那段 INSERT 用了一個新的事件類型值:group_expense_updated。問題在於這個值不在資料表的 ENUM 列表裡。一丟進去就炸,而且整段通知邏輯沒有被 try/catch 包住,例外一路往上傳,把整個請求的 HTTP 回應拖成了 500。
錯誤傳染鏈(時序)
Client API Server MySQL | | | |── PUT /expense ──────>| | | |── UPDATE expenses ───>| | |<── OK ────────────────| ← 資料已寫入 ✓ | | | | |── INSERT event_log ──>| ← 'group_expense_updated' | | | 不在 ENUM 列表 → SQL Error | |<── Error ─────────────| | | (uncaught exception) | |<── 500 ──────────────| | | DB state: row updated ✓ Frontend: operation failed ✗
主操作在第三步就已完成。500 來自第六步:一個不在 ENUM 列表裡的值觸發 SQL 錯誤,例外沒有被攔截,一路往上傳,接管了整個 HTTP 回應。
容易誤判的原因
第一時間看起來是寫入失敗。500 就是伺服器錯誤,前端收到 500 就判定操作沒完成——這是合理的推論,幾乎任何框架的預設行為都是這樣。沒有人會先去資料庫確認主操作是否成功,再回來看是哪個側效應炸了。
更深的問題是:這個事件通知是個「沒人在意的側效應」。它不是主流程,不回傳任何對前端有意義的資料,理論上失敗了也不應該影響主操作的結果。但它沒有被隔離。側效應的例外直接接管了主請求的回應碼,喧賓奪主。
分界點
問題出現的精確位置:新增了一個事件類型,但沒有同步更新 schema 的 ENUM 列表。ENUM 是資料庫層的強型別約束,插入不在列表裡的值,直接報錯,不商量。
這個設計選擇在系統早期是合理的——ENUM 確保類型值乾淨可控。但系統在成長,事件類型會增加,每次新增都需要一次 schema migration,而 migration 很容易在部署流程中被遺漏。這次就是被遺漏了。
修法兩步,都很直接:把 ENUM 欄位改成 VARCHAR,以後新增事件類型不再需要 schema migration;把所有事件通知邏輯包進獨立的 try/catch,讓側效應的失敗不再決定主操作的回應碼。前者是 migration 010,後者是錯誤邊界的補課。
Code 對照:修法前後
修法前:ENUM 約束 + 無 try/catch
-- event_logs schema(修法前)
ALTER TABLE event_logs
ADD COLUMN event_type ENUM(
'user_login',
'expense_added',
'group_created'
-- ← 'group_expense_updated' 不在這裡
) NOT NULL;
// Request handler(修法前)
app.put('/groups/:id/expenses/:expId', async (req, res) => {
await db.query('UPDATE expenses SET amount = ? WHERE id = ?', [...]);
await notifyExpenseUpdated(req.params.id, req.user.id);
// ↑ ENUM 裡沒有 'group_expense_updated',INSERT 炸掉
// ↑ 例外未攔截,500 直接回給 client
res.json({ success: true });
});
修法後:VARCHAR + 獨立 try/catch
-- Migration 010:ENUM 改 VARCHAR
ALTER TABLE event_logs
MODIFY COLUMN event_type VARCHAR(100) NOT NULL;
-- 之後新增事件類型不再需要 schema migration
// 側效應函式(修法後)
async function notifyExpenseUpdated(groupId, userId) {
try {
await db.query(
'INSERT INTO event_logs (event_type, group_id, user_id) VALUES (?, ?, ?)',
['group_expense_updated', groupId, userId]
);
} catch (err) {
logger.error('Event notification failed', { groupId, userId, err: err.message });
// 記 log,不往上丟
}
}
// Request handler(修法後)
app.put('/groups/:id/expenses/:expId', async (req, res) => {
await db.query('UPDATE expenses SET amount = ? WHERE id = ?', [...]);
notifyExpenseUpdated(req.params.id, req.user.id); // fire-and-forget
res.json({ success: true }); // 主操作結果不被側效應劫持
});
該被隔離的側效應類型
任何在主操作完成後才執行的邏輯,失敗不應影響主操作的回應碼。常見需要隔離的模式:
- Event log / 稽核日誌:寫入 event_logs、audit_logs 等資料表
- 推播通知:發 push notification、Email、SMS
- Webhook 觸發:呼叫外部系統的 outbound webhook
- 快取失效:清除 Redis / Memcached 快取項目
- 搜尋索引更新:同步 Elasticsearch / Algolia 索引
- Analytics 事件:傳送 Mixpanel、GA 等追蹤事件
- 非同步任務派發:丟 job 到 queue(BullMQ、SQS 等)
判斷標準:如果這段邏輯失敗不應讓使用者看到「操作失敗」,它就需要邊界隔離。
留給下次的一件事
任何在主操作完成後才執行的邏輯——通知、稽核日誌、事件廣播——都應該預設被隔離。它們失敗,主操作不應該跟著死。這不是防禦性編程的風格問題,是邊界設計。側效應拿到了整個請求的控制權,這件事從一開始就不該發生。
下次看到「操作失敗」的時候,值得多問一句:資料庫裡長什麼樣子?
— 邱柏宇
延伸閱讀
The Write Succeeded. The 500 Didn’t Care.
You print a document at a convenience store kiosk. The paper slides out. You hold it in your hand. The screen flashes red: “Print failed. Please try again.” The write succeeded. The error came from somewhere else entirely.
That’s exactly what happened here — just at the API layer.
Technical Environment
Node.js + Express backend, MySQL database. The event_logs table used an ENUM column to constrain event type values. The event notification ran as an await-ed synchronous call inside the same request handler as the primary write. The pattern isn't framework-specific — any design that runs side effects synchronously after a primary operation, without catching their exceptions, will reproduce the same failure.
What Was Observed
A client submitted an edit operation. The server returned a 500. The frontend reported failure. But opening the database showed the record had been correctly updated — right timestamp, right values. The primary write completed without issue. The 500 came from a side effect.
Digging in: after the write, the system automatically triggered an event notification. That notification attempted an INSERT with a new event type value, group_expense_updated, which wasn’t in the table’s ENUM list. The insert blew up. The notification logic had no try/catch wrapping it. The exception propagated all the way up and hijacked the HTTP response code for the entire request.
Error Propagation Sequence
Client API Server MySQL | | | |── PUT /expense ──────>| | | |── UPDATE expenses ───>| | |<── OK ────────────────| ← row written ✓ | | | | |── INSERT event_log ──>| ← 'group_expense_updated' | | | not in ENUM → SQL Error | |<── Error ─────────────| | | (uncaught exception) | |<── 500 ──────────────| | | DB state: row updated ✓ Frontend: operation failed ✗
The primary operation completed at step three. The 500 came from step six: a value outside the ENUM triggered a SQL error, the exception went uncaught, propagated upward, and claimed the HTTP response for the entire request.
Why It Was Misread
A 500 means server error. The frontend received a 500 and concluded the operation had failed — that’s the default assumption for any sane client. Nobody’s first instinct is to check the database directly, then trace back to figure out which side effect detonated.
The deeper issue: this event notification was an afterthought — no meaningful return value, no bearing on the primary result. It should have been irrelevant to the outcome. But it wasn’t isolated. An unhandled exception in a side effect claimed ownership of the entire request’s response code.
The Exact Boundary
The fault point: a new event type was added without a corresponding update to the schema’s ENUM list. ENUM is a hard constraint at the database level — insert a value outside the list and it errors immediately, unconditionally.
ENUM was a reasonable choice early on, when the set of event types was small and stable. But event types grow. Every addition requires a schema migration, and migrations are easy to miss in deployment. This one was missed.
The fix was two steps. Replace the ENUM column with VARCHAR — new event types no longer require a migration. Wrap all event notification logic in its own try/catch — side-effect failures no longer decide the primary operation’s response code.
Code Diff: Before and After
Before: ENUM constraint + no try/catch
-- event_logs schema (before)
ALTER TABLE event_logs
ADD COLUMN event_type ENUM(
'user_login',
'expense_added',
'group_created'
-- ← 'group_expense_updated' missing
) NOT NULL;
// Request handler (before)
app.put('/groups/:id/expenses/:expId', async (req, res) => {
await db.query('UPDATE expenses SET amount = ? WHERE id = ?', [...]);
await notifyExpenseUpdated(req.params.id, req.user.id);
// ↑ 'group_expense_updated' not in ENUM — INSERT throws
// ↑ Exception uncaught — 500 returned to client
res.json({ success: true });
});
After: VARCHAR + isolated try/catch
-- Migration 010: ENUM → VARCHAR
ALTER TABLE event_logs
MODIFY COLUMN event_type VARCHAR(100) NOT NULL;
-- New event types no longer require a schema migration
// Side-effect function (after)
async function notifyExpenseUpdated(groupId, userId) {
try {
await db.query(
'INSERT INTO event_logs (event_type, group_id, user_id) VALUES (?, ?, ?)',
['group_expense_updated', groupId, userId]
);
} catch (err) {
logger.error('Event notification failed', { groupId, userId, err: err.message });
// Log it. Don't rethrow.
}
}
// Request handler (after)
app.put('/groups/:id/expenses/:expId', async (req, res) => {
await db.query('UPDATE expenses SET amount = ? WHERE id = ?', [...]);
notifyExpenseUpdated(req.params.id, req.user.id); // fire-and-forget
res.json({ success: true }); // primary result no longer hijacked
});
Side Effects That Should Be Isolated
Any logic that runs after the primary operation completes should be treated as a side effect. If its failure shouldn't surface as a user-facing error, it needs a boundary. Common patterns that need isolation:
- Event logs / audit trails: writes to event_logs, audit_logs tables
- Push notifications: push, email, SMS delivery
- Outbound webhooks: calls to external systems
- Cache invalidation: Redis / Memcached evictions
- Search index updates: Elasticsearch / Algolia sync
- Analytics events: Mixpanel, GA tracking calls
- Async job dispatch: enqueuing work into BullMQ, SQS, etc.
The test: if this logic failing shouldn't cause the user to see "operation failed," it needs an isolation boundary.
One Thing Worth Remembering
Any logic that runs after the primary operation completes — notifications, audit logs, event broadcasts — should be isolated by default. If it fails, the primary operation shouldn’t go down with it. This isn’t a style preference. It’s a boundary decision. A side effect having full control over the request’s response code was always the wrong design.
Next time “operation failed” shows up: it’s worth checking what the database actually says first.
— 邱柏宇
Related Posts
https://justfly.idv.tw/s/BkeDjYT