A temple square in Tainan hosts an event. The audio crew sets up the mixer, runs all the cables, and plugs in every output, but one device never had a microphone connected to its input. The mixer doesn't complain. Only when the host steps on stage and speaks does anyone notice that the channel is completely silent. The problem isn't low volume. It's that nothing was ever plugged into the other end of the cable.
What generate_audio=false Actually Means
Video generation APIs expose a generate_audio parameter. Most people assume generate_audio=false produces a silent video: picture present, audio track present, volume at zero. It doesn't. What actually happens is that the output contains no audio stream at all, not even the container-level track for one.
The difference is invisible in most players. You open the file, scrub through the timeline, everything looks normal, and you move on. The problem shows up downstream in the processing pipeline.
ffmpeg commands like -af apad and -ac 2 require an audio input to operate. Feed them a clip with no audio stream and they throw immediately. Worse, an intermediate normalize step has a fallback: when it fails, it silently takes the raw clip and keeps going, without any warning.
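The fallback described above can be sketched like this. This is a minimal illustration of the anti-pattern, not the pipeline's actual code; `run_normalize` is a hypothetical stand-in for the real ffmpeg invocation:

```python
def normalize_with_fallback(clip_path, run_normalize):
    """The anti-pattern: swallow any normalize failure and reuse the raw clip."""
    try:
        return run_normalize(clip_path)
    except Exception:
        # No log, no re-raise: the caller cannot tell the step was skipped,
        # and a clip with no audio stream sails through untouched.
        return clip_path
```

Every later stage now receives a clip it assumes was normalized, so the failure can only surface when something finally needs the audio stream.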
Errors Swallowed Quietly
The error gets swallowed quietly and the pipeline keeps running, until the final BGM mixing step, where the amix filter tries to reference a nonexistent audio stream and the whole process crashes, seconds away from a finished output.
This is the cost of fallback logic. When normalization fails, the system chooses "keep going" instead of "stop and check". The error is wrapped, consumed, and treated as acceptable. The pipeline runs all the way down until the last stage needs the missing audio track, and only then discovers the assumption was wrong from the start.
A Two-Part Fix
First, the normalize step now runs ffprobe to detect whether an audio stream exists. If none does, it injects a silent track, so every downstream command that expects audio input can operate normally.
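A minimal sketch of that check, assuming ffprobe's JSON report format. The command construction is illustrative and the file names are hypothetical; in the real pipeline the report would come from running `ffprobe -v error -show_streams -of json <clip>`:

```python
import json

def has_audio_stream(probe_json: str) -> bool:
    """True if an ffprobe -show_streams JSON report lists an audio stream."""
    streams = json.loads(probe_json).get("streams", [])
    return any(s.get("codec_type") == "audio" for s in streams)

def normalize_cmd(src: str, dst: str, has_audio: bool) -> list[str]:
    """Build the normalize command; inject silence when the stream is absent."""
    if has_audio:
        return ["ffmpeg", "-i", src, "-af", "apad", "-ac", "2", dst]
    # anullsrc synthesizes a silent stereo source; -shortest trims it to the
    # video's duration so the injected track matches the clip.
    return ["ffmpeg", "-i", src,
            "-f", "lavfi", "-i", "anullsrc=channel_layout=stereo:sample_rate=44100",
            "-shortest", "-c:v", "copy", "-c:a", "aac", dst]
```

With the silent track in place, apad, channel mapping, and the final amix all see a real audio stream, even for clips generated with generate_audio=false.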
Second, remove the fallback entirely. A normalize failure now hard-throws, so the problem explodes at the source instead of at the very end.
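With the fallback gone, the step reduces to something like the following sketch, assuming the normalize step shells out to ffmpeg; the function name is hypothetical:

```python
import subprocess

def run_normalize(cmd: list[str], clip: str) -> None:
    """Run the normalize command and hard-throw on failure: no fallback."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        # Fail at the source: name the clip so the bad input is identifiable
        # immediately, instead of crashing later at the BGM mix.
        raise RuntimeError(f"normalize failed for {clip}: {result.stderr.strip()}")
```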
Fail-fast doesn't make the system more fragile. It makes errors occur where you can still understand them. A normalize failure tells you which clip is broken; a crash at the BGM mix tells you nothing about where to look.
Verify Core Assumptions Proactively
The assumption "a video always has an audio track" holds 99% of the time, but an API parameter can break it. The problem is that the pipeline never verified the assumption; it only discovered the track was missing at the final moment it was needed.
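One way to make that assumption explicit is to validate it at intake, before any processing starts. A sketch under stated assumptions: `reports` maps clip names to ffprobe JSON reports, and the function name is hypothetical:

```python
import json

def assert_all_clips_have_audio(reports: dict[str, str]) -> None:
    """Fail on the first clip whose ffprobe report lists no audio stream."""
    for name, report in reports.items():
        streams = json.loads(report).get("streams", [])
        if not any(s.get("codec_type") == "audio" for s in streams):
            raise ValueError(
                f"{name} has no audio stream "
                f"(was it generated with generate_audio=false?)"
            )
```

The guard turns an implicit pipeline-wide assumption into a named precondition that fails loudly, with the offending clip in the error message.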
Fallback mechanisms make these failures harder to trace. They wrap the error and let the process continue, but the underlying problem isn't solved, only postponed. You end up with a pipeline that appears to complete but actually broke somewhere in the middle, with no indication of where.
generate_audio=false means absence, not silence. Absence is a structural omission; silence is merely a numerical zero. The processing pipeline needs the latter, but the API delivers the former.
— 邱柏宇