六年回測跑贏,十七年翻車——你的樣本窗口就是你的偏見

七月初去玉井買芒果,愛文攤排最長隊,每斤破百塊。你會以為芒果永遠是水果界冠軍。但果農會告訴你,荔枝在六月初已經過氣,中秋前後白文旦才是主場,草莓要等到冬天。你的七月樣本只是一個切片,不是全年規律。

回測策略時,我們犯的錯誤跟這個一模一樣。

六年樣本的陷阱

2020 到 2025 這六年,有疫情崩盤、V 型反彈、升息週期,看起來樣本夠豐富了。我們拿這段資料驗證一個總體指標驅動的倉位策略,結論很清楚:這個策略在每一段都跑輸直接持有指數,差距超過 80 個百分點。整套框架差點被砍掉。

後來有人說等一下。六年裡每次下跌都是 V 型反彈。系統從來沒遇過真正需要它發揮作用的熊市——那種跌幅超過 50%、復甦期長達兩年的週期。這個樣本有系統偏差。

把樣本拉到 17 年,含 2008 金融海嘯。結論完全翻轉。同一個策略變成第一名,最大回撤砍了一半。

二十五年又換一個答案

再拉到 25 年,含 2000 科技泡沫。冠軍再度換人。純均線策略變成跨樣本最穩的選擇,因為指標投票系統在漫長的反彈期間一直誤報,反而拖累績效。

三個樣本窗口,三組完全不同的冠軍。會讓你做出截然不同的架構決策。

我們沒有犯策略錯誤。犯的是採樣錯誤。

你需要跑過完整週期

短樣本的問題不是資料不夠多,是週期不完整。2020 到 2025 這六年,市場沒有經歷過一次真正的熊市——跌幅 50% 以上、復甦期兩年以上的那種。所有下跌都在幾個月內收復,防禦型策略根本沒機會證明自己的價值。

至少需要跑過一個完整週期——多頭、熊市、復甦都要有——再加上兩個獨立週期,結論才有統計意義。這通常意味著 15 年起跳,最好 20 年以上。

2008 金融海嘯從高點到谷底跌了 57%,花了四年才收復。2000 科技泡沫從高點到谷底跌了 49%,完整復甦花了七年。這兩段週期跟 2020 年後的 V 型反彈完全不同。你的策略如果只在 V 型反彈裡測試過,你根本不知道它在漫長熊市裡會怎麼表現。

窗口就是立場

回測樣本窗口不是技術細節,是你的立場。選六年,你假設未來市場會像過去六年一樣快速反彈。選二十五年,你假設未來市場會經歷跟過去類似的完整週期。

這不是哪個窗口「比較正確」的問題。是你要知道自己選了什麼、排除了什麼,以及這個選擇會如何影響結論。

我們現在的規則:任何策略驗證,樣本至少要含兩次完整熊市週期,否則不做架構決策。六年資料可以拿來做初步篩選,但不能拿來定生死。

— 邱柏宇


Your Backtest Window Is Your Bias

You go to Yujing in early July to buy mangoes. The Irwin mango stand has the longest queue, at over NT$100 per catty. You conclude that mangoes are the permanent champion of the fruit world. But a grower will tell you that lychees were already past their prime by early June, that white pomelos own the weeks around Mid-Autumn Festival, and that strawberries have to wait for winter. Your July sample is a slice, not the annual pattern.

We make the exact same mistake when backtesting strategies.

The Six-Year Trap

From 2020 to 2025 we had a pandemic crash, a V-shaped recovery, and a rate-hike cycle, so the sample looked rich enough. We used it to validate a position-sizing strategy driven by macro indicators. The conclusion was clear: the strategy underperformed buy-and-hold in every segment, and the gap exceeded 80 percentage points. The entire framework nearly got scrapped.

Then someone said: wait. Every decline in those six years ended in a V-shaped rebound. The system never met the kind of bear market it was built for, one with a drawdown of more than 50% and a recovery lasting two years. The sample is systematically biased.

We extended the sample to 17 years, which brings in the 2008 financial crisis. The conclusion flipped completely. The same strategy took first place, with maximum drawdown cut in half.

Twenty-Five Years, Different Winner

We extended again to 25 years, which brings in the 2000 tech bubble. The champion changed once more. A pure moving-average strategy became the most stable choice across samples, because the indicator voting system kept raising false alarms during the long rebounds, and those false alarms dragged down performance.
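To make the two signal styles concrete, here is a minimal Python sketch. It is not the system described above: the function names, the 0/1 position encoding, the moving-average window, and the 50% vote threshold are all assumptions chosen for illustration.

```python
from typing import List, Sequence


def sma_signal(prices: Sequence[float], window: int) -> List[int]:
    """Return 1 (invested) or 0 (in cash) for each period.

    Assumed rule: stay invested while the price is above its trailing
    simple moving average; stay out until enough history exists.
    """
    signals = []
    for i in range(len(prices)):
        if i + 1 < window:
            signals.append(0)  # not enough history for a full window yet
            continue
        average = sum(prices[i + 1 - window : i + 1]) / window
        signals.append(1 if prices[i] > average else 0)
    return signals


def vote_signal(votes_per_period: Sequence[Sequence[int]],
                threshold: float = 0.5) -> List[int]:
    """Return 1 when the share of risk-on votes (1s) exceeds the threshold.

    Each inner sequence holds one period's indicator votes and is assumed
    to be non-empty. A tie at exactly the threshold counts as risk-off.
    """
    return [1 if sum(votes) / len(votes) > threshold else 0
            for votes in votes_per_period]
```

One design difference is visible even in a toy: the moving-average rule reacts only to price, while the voting rule can stay risk-off through a rebound for as long as most of its indicators disagree with the price.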

Three sample windows, three completely different champions, and each one would push you toward a completely different architectural decision.

We didn’t make a strategy error. We made a sampling error.
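The flip is easy to reproduce with toy numbers. The sketch below ranks strategies by cumulative return over several trailing windows. The return series are invented for illustration and are not our backtest results; `total_return` and `champion_by_window` are hypothetical helper names.

```python
from typing import Dict, List


def total_return(returns: List[float]) -> float:
    """Compound a list of periodic returns into one cumulative return."""
    growth = 1.0
    for r in returns:
        growth *= 1.0 + r
    return growth - 1.0


def champion_by_window(strategies: Dict[str, List[float]],
                       windows: Dict[str, int]) -> Dict[str, str]:
    """For each window (the last n periods), name the best strategy."""
    champions = {}
    for label, n_periods in windows.items():
        champions[label] = max(
            strategies,
            key=lambda name: total_return(strategies[name][-n_periods:]),
        )
    return champions


# Hypothetical yearly returns, oldest first. Only the longer window
# contains the crash year that the defensive strategy mostly avoids.
toy_returns = {
    "buy_and_hold": [-0.50, 0.10, 0.10, 0.30, 0.30, 0.30],
    "defensive": [-0.05, 0.05, 0.05, 0.10, 0.10, 0.10],
}
toy_windows = {"last_3_years": 3, "all_6_years": 6}

if __name__ == "__main__":
    print(champion_by_window(toy_returns, toy_windows))
```

With these made-up numbers, buy-and-hold wins the three-year window and the defensive strategy wins once the crash year is included. Nothing about either strategy changed, only the slice of history.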

You Need Complete Cycles

The problem with short samples isn’t too little data. It’s incomplete cycles. From 2020 to 2025 the market never went through a real bear market, meaning a decline of 50% or more followed by a recovery of two years or more. Every decline was recovered within months, so defensive strategies never got a chance to prove their value.

A sample needs at least one complete cycle, with bull market, bear market, and recovery all present, and it takes two independent cycles before the conclusion means anything statistically. That typically means 15 years at minimum, and preferably 20 or more.

The 2008 financial crisis took the market down 57% from peak to trough, and it needed four years to recover. The 2000 tech bubble took it down 49%, and full recovery needed seven years. These cycles look nothing like the V-shaped rebounds after 2020. If your strategy has only been tested in V-shaped rebounds, you have no idea how it will behave in a prolonged bear market.
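Depth and recovery time can both be read straight off a price series. The sketch below is one possible way to do it, assuming a plain list of closing prices. The names `Drawdown`, `drawdown_episodes`, and `required_gain_to_recover` are hypothetical, and "recovered" is assumed to mean the price is back at or above the prior peak.

```python
from typing import List, NamedTuple, Optional


class Drawdown(NamedTuple):
    peak_index: int
    trough_index: int
    recovery_index: Optional[int]  # None if the old peak was never regained
    depth: float                   # peak-to-trough loss as a positive fraction


def drawdown_episodes(prices: List[float]) -> List[Drawdown]:
    """Split a price series into peak -> trough -> recovery episodes."""
    if not prices:
        return []
    episodes = []
    peak_i = trough_i = 0
    for i in range(1, len(prices)):
        if prices[i] >= prices[peak_i]:
            # Back at (or above) the old peak: close any open episode.
            if prices[trough_i] < prices[peak_i]:
                depth = 1.0 - prices[trough_i] / prices[peak_i]
                episodes.append(Drawdown(peak_i, trough_i, i, depth))
            peak_i = trough_i = i
        elif prices[i] < prices[trough_i]:
            trough_i = i
    if prices[trough_i] < prices[peak_i]:
        # The series ended while still below the last peak.
        depth = 1.0 - prices[trough_i] / prices[peak_i]
        episodes.append(Drawdown(peak_i, trough_i, None, depth))
    return episodes


def required_gain_to_recover(depth: float) -> float:
    """Gain needed from the trough to return to the prior peak."""
    if not 0.0 <= depth < 1.0:
        raise ValueError("depth must be in [0, 1)")
    return depth / (1.0 - depth)
```

The second function is why deep bear markets take so long to heal: a 57% loss needs roughly a 133% gain to get back to the old high, and a 49% loss needs about 96%. A shallow dip needs far less, which is part of why V-shaped rebounds close within months.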

Your Window Is Your Stance

The backtest sample window isn’t a technical detail. It’s a stance. Choose six years, and you assume the future will rebound as quickly as the past six years did. Choose twenty-five years, and you assume the future will go through complete cycles like those of the past.

This isn’t about which window is “more correct.” It’s about knowing what you chose, what you excluded, and how that choice shapes your conclusion.

Our rule now: any strategy validation must include at least two complete bear market cycles, or we don’t make architectural decisions. Six-year data works for preliminary screening. Not for life-or-death calls.
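A rule like this can be written down as a small gate that runs before any comparison of results. The sketch below is one hypothetical encoding, not our production check: `BearCycle`, `decision_scope`, and the `min_depth` parameter are assumed names, and what counts as a "complete" cycle is reduced here to "deep enough and fully recovered".

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BearCycle:
    label: str
    depth: float      # peak-to-trough loss as a positive fraction
    recovered: bool   # True if the prior peak was regained inside the sample


def decision_scope(cycles: List[BearCycle], min_depth: float,
                   required_cycles: int = 2) -> str:
    """Say what a sample may be used for, given the bear cycles it contains."""
    complete = [c for c in cycles if c.recovered and c.depth >= min_depth]
    if len(complete) >= required_cycles:
        return "architectural decision"
    return "preliminary screening only"


# Depths for 2000 and 2008 are the figures quoted in this post.
# The shallow dip is an illustrative placeholder, not a measured value.
sample_25y = [
    BearCycle("2000 tech bubble", 0.49, True),
    BearCycle("2008 financial crisis", 0.57, True),
]
sample_6y = [BearCycle("shallow V-shaped dip", 0.30, True)]

if __name__ == "__main__":
    print(decision_scope(sample_25y, min_depth=0.45))
    print(decision_scope(sample_6y, min_depth=0.45))
```

The threshold is a judgment call and deserves to be an explicit parameter. With a strict `min_depth=0.50`, the 49% decline of 2000 would not count, and the 25-year sample would fall back to screening only.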

— 邱柏宇
