
用保鮮膜包一整隻烤乳豬,膜在中途裂開——裂的位置是腿還是頭,跟問題本身無關。換成烘焙紙,分段包覆,才有辦法收尾。這次備份事故的結構,幾乎是同一個形狀。
技術環境
CMS 備份套件(PHP-based plugin)呼叫 ZipArchive 壓縮整站 wp-content/uploads/,備份目標 9 到 16 GB。雲端儲存為 S3-compatible 服務,單檔上傳上限 5 GB,超過需走 multipart upload,套件未實作。執行環境的記憶體上限受 PHP memory_limit 與 container runtime 雙重限制,實際可用量無法容納 ZIP 隨機存取索引結構的建立。問題模式與 CMS 平台無關——任何在記憶體受限環境中對 GB 級資料使用 ZIP 壓縮的流程,都會複現相同行為。
現象
某 CMS 平台的每月完整備份,連續數月在同一個節點失敗。Log 混雜兩種錯誤:「temp 檔 rename 失敗,No such file」,以及「9.6 GB 超過雲端儲存單檔 5 GB 上限」。初步懷疑磁碟滿、路徑權限錯誤、網路逾時——這三個方向都是工程師看到這類訊息的直覺反應,也都是錯的。
錯誤傳染鏈(時序)
備份工具 ZIP 壓縮層 本地檔案系統 雲端儲存 | | | | |── compress(9-16GB) ─>| | | | |── 建立記憶體索引... | | | | [記憶體牆,靜默卡死 ~27h] | | | temp.zip 從未寫入 | | |<── OOM / freeze ──────| | | | | | |── rename(temp.zip → backup.zip) ───────────>| | |<── No such file or directory ───────────────| | ← 錯誤① | | | |── upload(backup.zip, 9.6 GB) ────────────────────────────────> | |<── 超過單檔 5 GB 上限 ─────────────────────────────────────────| ← 錯誤② | [兩個錯誤訊息都是真的;都是症狀。根因在壓縮層:記憶體撐不住 ZIP 索引]
壓縮層靜默卡死後,後續所有操作都在對空氣動作——temp 檔不存在,rename 找不到檔,upload 沒有檔可送。兩條錯誤訊息全指向下游,都沒說出記憶體牆。
分界點
真正的轉折藏在備份 log 的時間戳記裡:壓縮階段卡住整整 27 小時,觸發「5 分鐘無動作自動重啟」機制,反覆重試,每次都在同一個位置停住。磁碟使用率 62%,空間充足。問題不在磁碟,不在網路,不在權限。
根因是 ZIP 格式本身的記憶體模型。備份套件用 ZIP 壓縮整站,含 9 到 16 GB 的媒體檔。ZIP 需要隨機存取索引,無法串流,必須一次性載入記憶體。在該 runtime 環境,記憶體承受不住這個量級,壓縮程式靜悄悄地卡死。temp 檔從未產生,後續的 rename 與上傳自然全部炸掉——錯誤訊息指向的「找不到檔案」和「超過 5 GB 上限」,都是下游症狀,不是根因。
容易誤判的原因
「No such file」這個訊息太具體,讓手忍不住往路徑和權限查。「超過 5 GB 上限」又像是個可以操作的問題——切分檔案,或申請更大的上傳額度,方向看起來清晰。這兩條線索都是真實的,只是描述的是結果,不是原因。壓縮程式在記憶體牆前面卡死,不會主動說「記憶體不夠」,只會讓後面所有流程在空氣裡炸開。
27 小時這個數字是關鍵。備份流程在「幾分鐘內」失敗,問題通常在設定層;卡在壓縮階段長達數十小時,記憶體耗盡才是合理的懷疑對象。時間長度本身就是線索,只是很少在第一時間被看見。
確認方式
改用 TAR 格式——循序寫入、可串流、不吃記憶體——搭配雲端儲存的 multipart upload 支援超過 5 GB 的單檔,實測同樣的備份目標從卡死 27 小時降至 9 到 13 分鐘完成。對照測試確認 ZIP 在該環境確實有記憶體牆。兩個變數,一次切換,結果差了兩個數量級。
Code 對照:修法前後
修法前(ZIP,需隨機存取索引,記憶體無法承受)
#!/bin/bash
BACKUP="/tmp/backup_$(date +%Y%m%d).zip"
# ZipArchive 將整個目錄索引一次性載入記憶體 ← 記憶體牆在這裡
zip -r "$BACKUP" /var/www/html/wp-content/uploads/
# 9-16 GB → 靜默卡死 ~27h,temp 從未寫入
mv "$BACKUP" /backups/backup_final.zip # No such file ← 錯誤①
aws s3 cp /backups/backup_final.zip s3://bucket/ # 9.6GB > 5GB ← 錯誤②
修法後(TAR + gzip 串流,multipart upload)
#!/bin/bash
BACKUP="/tmp/backup_$(date +%Y%m%d).tar.gz"
# TAR 循序寫入、可串流,記憶體用量固定,不隨檔案大小增長 ✓
tar -czf "$BACKUP" /var/www/html/wp-content/uploads/
# 同樣目標,9 到 13 分鐘完成 ✓
# S3 multipart upload 支援超過 5 GB 單檔
aws s3 cp "$BACKUP" s3://bucket/ --expected-size "$(stat -f%z "$BACKUP")" # 正常上傳 ✓
該被隔離的側效應類型
- 壓縮操作:記憶體密集,應在獨立 process 或 worker 執行,failure 不能靜默傳染整條鏈。
- 臨時檔案清理:壓縮失敗殘留的 temp 需獨立清理機制,不能依賴主流程。
- 雲端上傳:傳輸應與壓縮解耦,失敗可單獨重試,不需重跑壓縮。
- 舊備份清除政策:清除過期備份只能在主備份成功確認後觸發,避免提前刪除。
- 完成通知(email / webhook):通知需在雲端上傳驗證完成後觸發,不在壓縮完成時送出。
- 備份完整性驗證:checksum / hash 對比是獨立側效應,失敗需獨立記錄,不阻斷通知。
- 監控與 log 更新:壓縮耗時、檔案大小、上傳速度寫入 monitoring;哪一段失敗就記哪一段,而非整筆 null。
判斷標準:如果這個操作失敗不應讓整條備份鏈停擺,它就需要邊界隔離與獨立重試。壓縮失敗 ≠ 上傳失敗;上傳失敗 ≠ 備份完全無效。
留給未來的話
壓縮格式不只是副檔名的差異。ZIP 索引要隨機存取,TAR 循序串流——這個機制差異在小檔案時完全看不出來,一旦備份目標跨過 GB 級且環境記憶體受限,就會變成一堵隱形的牆。
排查大檔案處理失敗時,優先懷疑壓縮與序列化層的記憶體模型,而不是檔案系統或網路。錯誤訊息幾乎永遠指向下游——找不到檔案、逾時、超限——因為上游卡死時,下游只能看到空氣。
— 邱柏宇
延伸閱讀
The Error Lied. ZIP Was the Culprit.
A ZIP file builds a central directory index that requires random access — which means the entire structure must be held in memory before finalization. That detail is harmless at small scale. At 9 to 16 GB of media files, it becomes a wall.
Technical Environment
A PHP-based CMS backup plugin using ZipArchive to compress the entire wp-content/uploads/ directory — 9 to 16 GB of media files. Cloud storage: an S3-compatible service with a 5 GB single-file upload limit; anything over 5 GB requires multipart upload, which the plugin had not implemented. The runtime environment imposed a hard memory ceiling through a combination of PHP's memory_limit and the container's own constraint. The actual available memory was insufficient to build ZIP's random-access central directory index. The failure pattern is platform-agnostic — any process compressing GB-scale data with ZIP in a memory-constrained runtime will reproduce the same behaviour.
What Was Happening
A CMS platform’s monthly full-site backup had been failing for months at the same point. The logs showed two errors: a temp file rename failure with “No such file”, and an upload rejection because 9.6 GB exceeded a 5 GB single-file limit. The natural suspects were disk space, path permissions, and network timeouts. Disk usage was at 62%. None of those three checked out.
Where the Real Break Was
The timestamp in the backup log told a different story: the compression stage had been stuck for 27 hours each time, repeatedly triggering a “5-minute inactivity restart” loop. The backup tool was using ZIP to compress the entire site. The ZIP library in that runtime environment ran out of memory on multi-GB files and silently stalled. The temp file never got written. Everything downstream — the rename, the upload — exploded against a file that simply didn’t exist yet.
The error messages weren’t lying, exactly. Just describing consequences, not causes.
Error Propagation Sequence
Backup Tool ZIP Library Local Filesystem Cloud Storage
| | | |
|── compress(9-16GB) >| | |
| |── building index... | |
| | [memory wall: silent stall ~27h] |
| | temp.zip never written |
|<── OOM / freeze ────| | |
| | |
|── rename(temp.zip → backup.zip) ──────────>| |
|<── No such file or directory ──────────────| | ← Error ①
| | |
|── upload(backup.zip, 9.6 GB) ──────────────────────────────> |
|<── Exceeds 5 GB single-file limit ─────────────────────────── | ← Error ②
|
[Both errors are real. Both are downstream. Neither mentions the memory wall.]
Once the compression layer froze silently, every subsequent operation was acting against a file that didn't exist. The rename failed because there was nothing to rename. The upload failed because there was nothing to upload. Both error messages were accurate descriptions of what happened — just not of why.
Why It Was Easy to Misread
“No such file” points directly at paths and permissions — concrete, actionable, wrong. “Exceeds 5 GB limit” looks like a solvable configuration problem: split the file, or request a higher quota. Both lines of investigation are real responses to real error messages. Neither gets anywhere near the actual failure point.
The 27-hour stall is the tell. A failure within minutes usually lives in the config layer. A process that runs silently for tens of hours before dying is almost always hitting a resource ceiling — memory, file descriptors, something the process is quietly consuming until continuation stops. The duration is diagnostic information. Rarely gets read that way.
How It Was Confirmed
Switching to TAR format — sequential writes, streamable, no in-memory index requirement — combined with multipart upload to handle files over 5 GB, brought the same backup from a 27-hour stall down to 9 to 13 minutes. The contrast test confirmed ZIP had a hard memory ceiling in that environment. Two variables changed at once, and the result was two orders of magnitude faster.
Code Diff: Before and After
Before (ZIP — random-access index, full memory load)
#!/bin/bash
BACKUP="/tmp/backup_$(date +%Y%m%d).zip"
# ZipArchive loads the entire directory index into memory at once ← memory wall here
zip -r "$BACKUP" /var/www/html/wp-content/uploads/
# 9–16 GB → silent stall ~27h, temp.zip never written
mv "$BACKUP" /backups/backup_final.zip # No such file ← Error ①
aws s3 cp /backups/backup_final.zip s3://bucket/ # 9.6 GB > 5 GB ← Error ②
After (TAR + gzip stream, multipart upload)
#!/bin/bash
BACKUP="/tmp/backup_$(date +%Y%m%d).tar.gz"
# TAR writes sequentially and streams — memory usage stays flat regardless of file size ✓
tar -czf "$BACKUP" /var/www/html/wp-content/uploads/
# Same target: completes in 9 to 13 minutes ✓
# S3 multipart upload handles files over 5 GB
aws s3 cp "$BACKUP" s3://bucket/ --expected-size "$(stat -f%z "$BACKUP")" # Uploads cleanly ✓
Side Effects That Should Be Isolated
- Compression: Memory-intensive; should run in an isolated process or worker. Failure must not silently propagate down the chain.
- Temp file cleanup: Stale temp files from a failed compression run need their own cleanup path, not one that depends on the main flow succeeding.
- Cloud upload: Decoupled from compression. Upload failures should be independently retryable without re-running compression.
- Old backup rotation: Expired backup deletion should only trigger after the new backup is confirmed uploaded — not before.
- Completion notification (email / webhook): Notifications should fire after upload verification, not after compression completes.
- Integrity verification: Checksum or hash comparison is an independent side effect. Its failure should be logged separately without blocking the notification step.
- Monitoring and log writes: Compression time, file size, upload speed — written per stage. When something fails, record which stage failed, not a null entry for the whole run.
The test: if this operation failing shouldn't halt the entire backup chain, it needs boundary isolation and independent retry. Compression failure ≠ upload failure. Upload failure ≠ backup completely lost.
The Note Worth Keeping
The format choice isn’t cosmetic. ZIP needs random-access indexing; TAR writes sequentially and streams. At small file sizes, the difference is invisible. Once the backup target crosses into GB territory in a memory-constrained runtime, ZIP can quietly freeze the entire pipeline — and the errors left behind will always point downstream. “File not found”, “timeout”, “upload limit exceeded”. The layer that actually failed won’t say “ran out of memory.” Just stops.
When large-file processing fails in ways that don’t immediately make sense, the compression and serialization layer’s memory model is worth suspecting before the filesystem or the network.
— 邱柏宇
Related Posts
https://justfly.idv.tw/s/qS6vDMr