spec 寫對了,女角開口卻是男聲

影片跑完,女角開口說話,聲音是男的。

不是 AI 出問題,不是語音模型壞掉,也不是 spec 填錯。spec 欄位寫的是 female,資料完整,上游沒有任何一筆損壞的記錄。但聲音就是男的。

像在台灣診所用英文掛號單填了 “female”,但診間系統只認中文「女」,空白就補預設值,護士叫號叫你「先生」。這個情境在台灣不難想像——同一件事有兩種完全正確的說法,共存於同一個系統的不同層次,彼此不認識。

比對邏輯只活在一個語言裡

問題的核心不是資料錯,是判斷性別那段 code 的比對邏輯只覆蓋了一種語言。

系統上游用英文字串 “female” 描述角色性別,下游的語音選擇邏輯拿這個字串去比對中文字「女」——兩者各自正確,但對不上。對不上就落入預設值,預設是男聲。沒有例外拋出,沒有 warning,整個流程安靜地走完,輸出了一段聽起來「有問題但資料沒問題」的結果。

這是多語言字串比對的隱藏假設:寫這段邏輯的人預設輸入一定是中文,或者預設輸入一定是英文,兩件事有一件是錯的。

為什麼沒立刻看出來

這個 bug 藏得夠深,原因是它在每一層都顯得「正常」。

spec 沒寫錯。上游資料沒壞。AI 照規矩辦事。語音模型本身收到指令也是正確執行。每個節點單獨看都過關,問題只出現在兩個節點的交接處——一個輸出英文,一個只讀中文,中間沒有翻譯層,也沒有人覺得需要。

Seedance 的語音本來就不可預測,語言、位置、性別都會飄移,需要多重 prompt 防護。這讓問題更難定位:聲音出錯,第一直覺是語音模型又漂移了,不會先去查字串比對邏輯。誤診的成本是整整一次排查循環。

確認方式

把語音選擇那段邏輯拉出來單獨測:輸入 “female”,看它走到哪一條分支。如果它直接掉進預設值而不是走女聲分支,問題就在這裡,不在語音模型。

反向驗證:把 spec 欄位改成中文「女」,重跑一次。聲音對了,就確認是比對層的問題,不是別的。

修法不複雜,但需要主動去想到

讓性別偵測同時接受中英兩種寫法,“female” 和「女」都能識別,任何一種命中就走對應分支。兩行條件,不需要重構。

麻煩的不是修,是要先想到「這個欄位可能跨越兩個語言生態」。當一個欄位的資料來源是多語言混合的,任何只比對其中一邊的邏輯都是定時炸彈,安靜地等待某個特定輸入組合觸發。

下次碰到「資料看起來對但輸出不對」這種情況,值得先問:比對邏輯和資料的語言假設是否一致?不用先翻模型,先看字串。

— 邱柏宇



The Spec Said Female. The Voice Was Male.

The video finished rendering. The female character opened her mouth. The voice was male.

Not a broken AI. Not a corrupted dataset. The spec field said female — correctly filled, clean data all the way up the chain. But the voice was male.

Think of a clinic in Taiwan where a patient fills out an English-language registration form with “female”, but the clinic's system only recognizes the Chinese character 「女」. The system treats the field as blank, fills in the default, and the nurse calls the patient “先生” (“Mister”). Same information. Two correct representations. Neither one recognizes the other.

The Matching Logic Only Spoke One Language

The upstream system described character gender as the English string “female”. The downstream voice-selection logic compared that string against the Chinese character 「女」. Both values were correct. They never matched. No match means default. Default means male voice.

No exception thrown. No warning. The pipeline completed quietly and produced audio that sounded wrong but carried no error state anywhere in the system.
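The failure mode can be sketched in a few lines of Python. This is a hypothetical reconstruction, not the pipeline's actual code: the function name `select_voice_buggy` and the preset names `female_voice` and `male_voice` are assumptions made for illustration.

```python
# Hypothetical reconstruction of the failure mode; all names are illustrative.
DEFAULT_VOICE = "male_voice"


def select_voice_buggy(gender: str) -> str:
    """Pick a voice preset, recognising only the Chinese label."""
    if gender == "女":
        return "female_voice"
    # Anything unrecognised, including the English "female", lands here.
    # Nothing is raised and nothing is logged.
    return DEFAULT_VOICE


print(select_voice_buggy("女"))      # female_voice
print(select_voice_buggy("female"))  # male_voice
```

Both inputs are valid descriptions of the same character, yet only one reaches the female branch, and the other produces no signal that anything went wrong.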

The hidden assumption: whoever wrote the comparison logic assumed the input would always arrive in one language. Here the comparison expected Chinese, and the upstream wrote English.

Why It Wasn’t Caught Immediately

Every layer looked normal in isolation. The spec was clean. The upstream data was intact. The AI followed its instructions. The voice model executed correctly on what it received. The failure lived entirely in the handoff between two nodes — one outputting English, one reading only Chinese, with no translation layer between them and no one who thought one was needed.

Voice behavior in generative video pipelines can already be unpredictable — language, position, and gender can all drift, and multiple layers of prompt constraints are sometimes needed just to keep things stable. That instability made the misdiagnosis easy: when a voice sounds wrong, the first instinct is to check the voice model, not a string comparison buried in gender-detection logic. An entire debugging cycle spent in the wrong place.

How to Confirm It

Pull the voice-selection logic out and run it in isolation. Feed it “female” and trace which branch it takes. If it falls through to the default rather than routing to the female voice branch, the problem is here — not in the model.

Reverse-verify: change the spec field to the Chinese character 「女」 and rerun. If the voice comes out female, that confirms the matching layer is the cause.
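Both checks can be run against the selection logic alone, without rendering a video. The sketch below assumes the logic can be called as a plain function. `select_voice` here is a stand-in with the suspected behaviour, to be swapped for the real function, and `check_branch` is a hypothetical helper.

```python
# Stand-in for the voice-selection logic under test (assumed behaviour).
# Replace it with the real function when running the check.
def select_voice(gender: str) -> str:
    if gender == "女":
        return "female_voice"
    return "male_voice"  # default branch


def check_branch(select, value: str, expected: str) -> bool:
    """Call `select(value)`, print the result, and report whether it matched."""
    actual = select(value)
    print(f"{value!r} -> {actual!r} (expected {expected!r})")
    return actual == expected


# Check 1: feed the English label and see which branch it takes.
english_ok = check_branch(select_voice, "female", "female_voice")
# Check 2 (reverse verification): feed the Chinese label instead.
chinese_ok = check_branch(select_voice, "女", "female_voice")

if chinese_ok and not english_ok:
    print("The matching layer only recognises the Chinese label.")
```

If the Chinese label routes correctly and the English one does not, the model can be ruled out before any further debugging.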

The Fix Is Two Lines. The Insight Takes Longer.

Make the gender-detection logic accept both forms — “female” and 「女」, “male” and 「男」. Either one hits the correct branch. No refactor needed.
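A minimal sketch of that fix, assuming the same hypothetical preset names as before. The `strip().lower()` normalisation is an extra assumption that goes beyond what the post describes; it can be dropped if the field is known to be clean.

```python
# Hypothetical names throughout; only the four labels come from the post.
FEMALE_LABELS = {"female", "女"}
MALE_LABELS = {"male", "男"}
DEFAULT_VOICE = "male_voice"  # the default described in the post


def select_voice(gender: str) -> str:
    """Pick a voice preset, accepting either the English or the Chinese label."""
    # Assumption: tolerate stray whitespace and capitalisation such as "Female".
    label = gender.strip().lower()
    if label in FEMALE_LABELS:
        return "female_voice"
    if label in MALE_LABELS:
        return "male_voice"
    return DEFAULT_VOICE


print(select_voice("female"))  # female_voice
print(select_voice("女"))      # female_voice
```

Exact set membership is used rather than a substring check, because "male" is a substring of "female" and a test like `"male" in gender` would send the female label down the wrong branch.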

The hard part isn’t writing the fix. It’s remembering to ask: does this field’s data ever cross a language boundary? When a single field can be populated in two different languages depending on which part of the system writes to it, any logic that only checks one side is waiting for the right input combination to fail silently.
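One way to keep such a mismatch from staying silent is to log whenever the default branch is taken. This is not part of the fix described above, only a sketch of the idea; the logger name, the mapping, and the preset names are all assumptions.

```python
import logging

logger = logging.getLogger("voice_selection")  # hypothetical logger name

# Hypothetical mapping from known labels to voice presets.
KNOWN_LABELS = {
    "female": "female_voice",
    "女": "female_voice",
    "male": "male_voice",
    "男": "male_voice",
}
DEFAULT_VOICE = "male_voice"


def select_voice(gender: str) -> str:
    """Pick a voice preset and warn when the label is not recognised."""
    voice = KNOWN_LABELS.get(gender.strip().lower())
    if voice is None:
        # The fallback still happens, but it now leaves a trace in the logs.
        logger.warning(
            "Unrecognised gender label %r; using default %s", gender, DEFAULT_VOICE
        )
        return DEFAULT_VOICE
    return voice
```

With this in place, a label in a third language or an unexpected spelling still gets the default voice, but the log points straight at the matching layer instead of at the model.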

Next time output is wrong but data looks clean — before touching the model, check whether the matching logic and the data share the same language assumption.

— 邱柏宇
