When the Ledger Lies: 9 Duplicates, 30 Ghost Positions, One Reset
The simulated trading system ran for over two months without throwing errors, without downtime, without anything alarming in the logs. Then someone tried to export a reconciliation report. Inside: 9 duplicate transaction records and 30 operations targeting positions that never existed—the system had sold what it never owned.
Think of a handwritten ledger at a night market stall. The same line modified three times, crossed out, rewritten, annotated in parentheses. You no longer know if the final number is the balance or the sum of all mistakes. Taiwan has over 230,000 night market stalls, many still keeping handwritten ledgers. This scene plays out daily. The difference: the stall owner knows they edited the ledger. The system doesn’t.
The problem wasn’t one bad write—it was no one stopping it
The duplicate records had clear origins: the same transaction written twice at different times, likely from poor retry logic or duplicate frontend requests. The 30 ghost positions came from sell logic with no upfront validation—it trusted your claim of ownership and deducted directly.
These errors were scattered across time. Some appeared in week one, others in week five. By the time they were discovered, you faced not a bug but a distorted history. Backtracking cost more than rebuilding, not because the problem was complex, but because you could no longer tell which records to trust.
Patching assumes you trust what remains
In theory you could script away the duplicates and flag the ghost positions as invalid. But how do you confirm that these 9 and these 30 are all of them? How do you know the other seemingly normal records aren't another error type you haven't discovered yet?
When data integrity cracks, trust collapses with it. The final choice was reset. Clear the database, reload clean initial state, then install three defensive layers at the gate.
Defenses aren’t patches—they’re design
First layer: validate holdings before a sell. Not a cache lookup; a database query. Confirm the position exists with sufficient quantity before allowing the sale. Second layer: deduplicate on every write. Use the transaction ID plus timestamp as a unique key and block duplicate writes at the database level. Third layer: daily automated reconciliation. Run a scheduled job that compares booked balances against transaction details, and alert on any mismatch.
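The first two layers can be sketched in a few lines. This is a minimal illustration using an in-memory SQLite database; the table names (`positions`, `trades`), columns, and the `sell` helper are assumptions for the example, not the system's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE positions (symbol TEXT PRIMARY KEY, qty INTEGER NOT NULL);
    -- Layer 2: duplicate writes are rejected at the database level.
    CREATE TABLE trades (
        tx_id  TEXT NOT NULL,
        ts     TEXT NOT NULL,
        symbol TEXT NOT NULL,
        qty    INTEGER NOT NULL,
        UNIQUE (tx_id, ts)
    );
""")

def sell(tx_id: str, ts: str, symbol: str, qty: int) -> bool:
    # Layer 1: validate holdings against the database, not a cache.
    row = conn.execute(
        "SELECT qty FROM positions WHERE symbol = ?", (symbol,)
    ).fetchone()
    if row is None or row[0] < qty:
        return False  # ghost position or insufficient quantity: reject
    try:
        # Layer 2: the UNIQUE constraint blocks a retried duplicate write.
        conn.execute(
            "INSERT INTO trades (tx_id, ts, symbol, qty) VALUES (?, ?, ?, ?)",
            (tx_id, ts, symbol, -qty),
        )
    except sqlite3.IntegrityError:
        return False  # duplicate write: dropped, the ledger stays clean
    conn.execute(
        "UPDATE positions SET qty = qty - ? WHERE symbol = ?", (qty, symbol)
    )
    conn.commit()
    return True

conn.execute("INSERT INTO positions VALUES ('2330.TW', 100)")
print(sell("tx-1", "2025-01-01T09:00", "2330.TW", 40))  # True
print(sell("tx-1", "2025-01-01T09:00", "2330.TW", 40))  # False: duplicate
print(sell("tx-2", "2025-01-01T09:05", "0050.TW", 10))  # False: ghost position
```

The point of putting the unique key in the schema rather than in application code is exactly the lesson above: the database refuses the duplicate even when retry logic misbehaves.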
These defenses cost little. Validation adds one query, deduplication is a unique constraint, reconciliation is a cron job plus a bit of SQL. The real cost is admitting that systems don't stay correct on their own: you must actively design structures that make these failures impossible.
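The third layer, the reconciliation pass, is essentially one aggregate query. A minimal sketch, again with an assumed schema and an in-memory SQLite database; in production this would run from cron against the real database, and the `print` would be replaced by an alerting hook.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE positions (symbol TEXT PRIMARY KEY, qty INTEGER NOT NULL);
    CREATE TABLE trades (symbol TEXT NOT NULL, qty INTEGER NOT NULL);
    INSERT INTO positions VALUES ('2330.TW', 60);
    INSERT INTO trades VALUES ('2330.TW', 100), ('2330.TW', -40);
""")

def reconcile(conn):
    """Return every symbol whose booked balance disagrees with its trades."""
    return conn.execute("""
        SELECT p.symbol, p.qty, COALESCE(SUM(t.qty), 0)
        FROM positions p
        LEFT JOIN trades t ON t.symbol = p.symbol
        GROUP BY p.symbol
        HAVING p.qty != COALESCE(SUM(t.qty), 0)
    """).fetchall()

# Simulate a silent drift: the booked balance no longer matches the trades.
conn.execute("UPDATE positions SET qty = 65 WHERE symbol = '2330.TW'")
for symbol, booked, derived in reconcile(conn):
    print(f"ALERT: {symbol} booked={booked} derived={derived}")
```

A mismatch here doesn't tell you which record is wrong, only that the ledger and its history have diverged, which is enough to stop the drift from accumulating unseen.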
After the reset the system went live again. No silent drift this time, because every layer had eyes on it. Errors still happen, but they no longer accumulate to the point where the whole ledger becomes untrustworthy.
— 邱柏宇