Financial Backtesting: A Cautionary Tale

Consider the following market timing strategy, which we’ll call “daily momentum”:

(1) If the market’s total return for the day, measured from yesterday’s close to today’s close, is positive, then buy the market at today’s close and hold for one day.

(2) If the market’s total return for the day, measured from yesterday’s close to today’s close, is negative, then sell the market at today’s close and hold the proceeds in a short-term interest bearing deposit account for one day.
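To make the rule concrete, here is a minimal sketch of the strategy in Python (pandas assumed). The return series are hypothetical placeholders rather than the actual CRSP data used below, and the zero-return case, which the rules above leave unspecified, is treated here as a sell signal.

```python
import pandas as pd

def daily_momentum_returns(market_ret: pd.Series, cash_ret: pd.Series) -> pd.Series:
    """Daily total returns of the "daily momentum" timing strategy.

    market_ret : daily close-to-close total returns of the aggregate market
    cash_ret   : daily returns of the short-term interest-bearing deposit

    A positive market return today means the strategy holds the market
    tomorrow; otherwise it holds cash tomorrow.
    """
    in_market = (market_ret > 0).shift(1, fill_value=False)  # yesterday's sign sets today's holding
    return market_ret.where(in_market, cash_ret)

# Hypothetical usage, given two aligned daily return series `mkt` and `cash`:
# strat = daily_momentum_returns(mkt, cash)
# growth_of_a_dollar = (1 + strat).cumprod()
```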

The two charts below show the hypothetical performance of the strategy in the aggregate, capitalization-weighted U.S. equity market from February 1st of 1928 to December 31st of 1999 (1st chart right y-axis: linear; 2nd chart right y-axis: log; data source: CRSP):

[Charts: dmalin99, dmalog99]

The blue line is the total return of daily momentum, the timing strategy being tested.  The black line is the total return of a buy and hold strategy.  The yellow line is the cash total return.  The gray columns are U.S. recession dates.

The red line is the total return of the X/Y portfolio.  The X/Y portfolio is a mixed portfolio with an allocation to equity and cash that matches the timing strategy’s cumulative ex-post exposure to each of those assets.  The timing strategy spends 55% of its time in equities, and 45% of its time in cash.  The corresponding X/Y portfolio is then a 55/45 equity/cash portfolio, a portfolio that is continually rebalanced to hold 55% of its assets in equities and 45% of its assets in cash at all times.

I introduce the concept of an X/Y portfolio to serve as a benchmark or control sample. I need that benchmark or control sample to be able to conduct appropriate statistical analysis on the timing strategy’s performance.  If “timing” itself were of no value, and all that mattered to returns were asset exposures, then the return of any timing strategy would be expected to match the return of its corresponding X/Y portfolio.  The returns would be expected to match because the cumulative asset exposures would be exactly the same–the only difference would be in the specific timing of the exposures.  If the timing strategy outperforms the X/Y portfolio in a statistically significant fashion, then we know that it’s adding value through its timing.  It’s taking the same cumulative asset exposures, and turning them into “something more.”
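As a sketch of how the control might be computed (continuing the hypothetical series from the earlier snippet, where `in_market` is the strategy’s daily exposure flag), the X/Y weights come straight from the strategy’s realized time in each asset, and the mix is rebalanced back to those weights every day:

```python
import pandas as pd

def xy_benchmark_returns(market_ret: pd.Series, cash_ret: pd.Series,
                         in_market: pd.Series) -> pd.Series:
    """Daily returns of the X/Y control portfolio.

    The equity weight x is the fraction of days the timing strategy spent in
    the market (ex post, roughly 0.55 for daily momentum).  Rebalancing to
    x / (1 - x) every day makes each day's return a fixed blend of the two assets.
    """
    x = in_market.mean()
    return x * market_ret + (1 - x) * cash_ret
```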

The green line is the most important line in the chart.  It shows the timing strategy’s cumulative outperformance over the market, defined as the ratio of the trailing total return of the timing strategy to the trailing total return of a buy and hold strategy.  It takes its measurement off of the right y-axis, shown in linear scale in the first chart, and logarithmic scale in the second.

As you can see in the chart, the timing strategy performs unbelievably well.  From the beginning of 1928 to the end of 1999, it produces a total return more than 5,000 times larger than the market’s total return, with less volatility and a lower maximum drawdown. It earns 25.1% per year, 1400 bps more than the market.  The idea that a timing strategy would be able to beat the market by 14% per year, not only over the short or medium term, but over a period of seven decades, is almost inconceivable.

Now, imagine that it’s late December of 1999, and I’m trying to sell this strategy to you. What would be the best way for me to sell it?  If you’re familiar with the current intellectual vogue in finance, then you know the answer.  The best way for me to sell it would be to package it as a strategy that’s “data-driven.”  Other investors are employing investment strategies that are grounded in sloppy, unreliable guesses and hunches.  I, however, am employing an investment strategy whose efficacy is demonstrated in “the data.”  All of the success that you see in the well-established fields of science–physics, chemistry, biology, engineering, medicine–you can expect to see from my strategy, because my strategy originated in the same empirical, evidence-based approach.

On the totality of what I’ve seen, active investors that center their investment processes on “the data” do not perform any better in real-world investing environments than active investors that invest based on their own market analysis, or investors that simply index. Granted, some investors have done extremely well using data-driven approaches, but others have done poorly–some, spectacularly poorly, to a degree that was not expected beforehand.  The failure to see consistent outperformance from the group as a whole has made me increasingly skeptical of investment approaches that claim to be data-driven.  In my view, such approaches receive too much trust and respect, and not enough scrutiny.  They hold a reputation for scientific credibility that is not deserved.

In this piece, I’m going to use the timing strategy presented above to distinguish between valid and invalid uses of data in an investment process.  In the conventional practice, we take a claim or a strategy and we “backtest” it–i.e., we test it in historical data.  We then draw probabilistic conclusions about the future from the results, conclusions that become the foundations for investment decisions.  To use the timing strategy as an example, we take the strategy and test it back to 1928.  We observe very strong performance.  From that performance, we conclude that the strategy will “probably” perform well into the future. But is this conclusion valid?  If it is valid, what makes it valid?  What is its basis?  Those are the kinds of questions that I’m going to pursue in the piece.

Now, if we want to use the results of a backtest to make statements about the returns that investors are likely to receive if they put the strategy to use in the real-world, the first thing we need to do is properly account for the real-world frictions associated with the strategy’s transactions.  The daily momentum strategy transacts extremely frequently, trading on 44% of all trading days and amassing a total of 8,338 trades across the tested period.  In addition to brokerage fees, these trades entail the cost of buying at the ask and selling at the bid, a cost equal to the spread between the two, incurred on each round-trip (buy-sell pair).

In 1999, the bid-ask spread for the market’s most liquid ETF–the SPDR S&P 500 ETF $SPY–was less than 10 cents, which equated to around 0.08% of market value. The lowest available transaction fee from an online broker was around $10, which, if we assume a trade size of $50,000, amounted to around 0.02% of assets. Summing these together, we arrive at 0.10% as a conservative friction, or “slippage” cost, to apply to each trade.  Of course, the actual average slippage cost in the 1928 to 1999 period was much higher than 0.10%.  But an investor who employs the strategy from 1999 onward, as we are about to do, is not going to see that higher cost; she is going to see the 0.10% cost, which is the cost we want to build into the test.
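One hedged way to build that friction into the backtest is to charge half of the 0.10% round-trip figure on each day the strategy switches between the market and cash, since every switch is one leg of a round trip. The sketch below reuses the hypothetical `in_market` exposure series from earlier; more refined treatments of spreads and commissions are of course possible.

```python
import pandas as pd

def apply_slippage(strategy_ret: pd.Series, in_market: pd.Series,
                   round_trip_cost: float = 0.0010) -> pd.Series:
    """Deduct trading frictions from the strategy's daily returns.

    A switch day is any day whose asset exposure differs from the prior
    day's; half of the round-trip cost is charged on each such day.
    """
    switched = in_market != in_market.shift(1, fill_value=bool(in_market.iloc[0]))
    return strategy_ret - switched * (round_trip_cost / 2.0)
```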

The following charts show the strategy’s performance on an assumed round-trip slip of 0.10%:

[Charts: dmalin99slip, dmalog99slip]

As you can see, with slippage costs appropriately factored in, the annual return falls from 25.1% to 18.0%–a sizeable drop.  But the strategy still strongly outperforms, beating the market by more than 700 bps per year.  We therefore conclude that an investor who employs the strategy from 1999 onward is likely to enjoy strong returns–maybe not returns that equal or exceed 18%, but returns that will more than likely beat the market.

The chief threat to our conclusion is the possibility that randomness is driving the performance seen in the backtest.  Of course, we need to clarify what exactly the term “random” would mean in this context.  Consider an example.  In May of 2015, the New York Rangers played the Tampa Bay Lightning in Game 7 of the Stanley Cup semi-finals.  The game was a home game for the Rangers, held at Madison Square Garden (MSG).  The following table shows the Rangers’ performance in game sevens at MSG up to that point:

[Table: Rangers’ performance in Game 7s at MSG]

As you can see, the Rangers were a perfect 7 for 7 in at-home game sevens.  Given this past performance, would it have been valid to conclude that the Rangers would “probably” win the game seven that they were about to play?  Intuitively, we recognize the answer to be no.  The statistic “7 for 7 in at-home game sevens” is a purely random, coincidental occurrence that has little if any bearing on the team’s true probability of victory in any game (Note: the Rangers went on to lose the game).

But this intuition is hard to square with “the data.”  Suppose, hypothetically, that in every at-home game seven that the Rangers play, the true probability of victory is at most 50%–a coin flip.  For a given sampling of at-home game sevens, what is the probability that the team would win all of them?  The answer: 0.5^7 = 0.8%, an extremely low probability. The Rangers successfully defied that extremely low probability and won all seven at-home game sevens that they played over the period.  The implication, then, is that their true probability of winning at-home game sevens must have been higher than 50%–that the coin being flipped cannot have been a fair coin, but must have been a coin biased towards victory.

Consider the two possibilities:

(1) The Rangers’ probability of victory in any at-home game seven is less than or equal to 50%.  Seven at-home game sevens are played over the period, and the Rangers win all seven–an outcome with a probability of less than 1%.

(2) The Rangers’ probability of victory in any at-home game seven is greater than 50%.

Since, from a statistical perspective, (1) is exceedingly unlikely to occur, we feel forced to accept the alternative, (2).  The problem, of course, is that our delineation of those seven games as a “sampling” of the Rangers’ likelihood of winning the upcoming game is entirely invalid.  The sample is biased by the fact that we intentionally picked it non-randomly, out of a large group of possible samples, precisely because it carried the unique pro-Rangers results that we were looking for.

If the probability of victory in each competition in the NHL is exactly 50%, what is the likelihood that over a fifty year period, a few teams out of the group of 30 will secure an unusually long string of victories?  Extremely high.  If we search the data, we will surely be able to find one of those teams.  Nothing stops us from then picking out unique facts about the victories–for example, that they involved the Rangers, that the Rangers were playing game sevens, that the game sevens occurred at MSG–and arguing that those facts are somehow causally relevant, that they affected the likelihood that the team would win. What we should not do, however, is try to claim that “the data” supports this conclusion. The data simply produced an expected anomaly that we picked out from the bunch and identified as special, after the fact.
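A quick simulation illustrates the point. Assuming every game really is a fair coin flip, a league of 30 teams will hand at least one of them a perfect seven-game record far more often than the naive 0.8% figure suggests (the parameters are illustrative, not a model of actual NHL scheduling):

```python
import numpy as np

rng = np.random.default_rng(0)

def prob_some_team_goes_7_for_7(n_teams: int = 30, n_games: int = 7,
                                n_sims: int = 20_000) -> float:
    """Chance that at least one fair-coin team wins all of its n_games."""
    wins = rng.random((n_sims, n_teams, n_games)) < 0.5   # True = win, 50/50 each game
    some_perfect = wins.all(axis=2).any(axis=1)           # any team with a perfect record?
    return float(some_perfect.mean())

# Analytically: 1 - (1 - 0.5**7)**30 is roughly 0.21, i.e. about one league
# in five will contain a "perfect" team purely by chance.
print(prob_some_team_goes_7_for_7())
```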

When we sample a system to test claims about the likelihood that it will produce certain outcomes, the sample needs to be random and blind.  We cannot choose our sample, and present it as a valid test, when we already know that the results confirm the hypothesis. And so if we believe that there is something special about the New York Rangers, Game Sevens, and MSG as a venue–if we believe that the presence of those variables in a game changes the probability of victory–the appropriate way to test that belief is not to cite, as evidence, the seven at-home game sevens that we know the Rangers did win, the very games that led us to associate those variables with increased victory odds in the first place. Rather, the appropriate way to test the belief is to identify a different set of Rangers games with those properties, a set that we haven’t yet seen and haven’t yet extracted a hypothesis from, and look to see whether that sample yields an outsized number of Rangers victories. If it does, then we can legitimately claim that we’ve tested our belief in “the data.”

A Rangers fan who desperately wants people to believe that the Rangers will beat the Lightning will scour the universe to find random patterns that support her desired view. If she manages to find a random pattern, the finding, in itself, will not tell us anything about the team’s true probability of victory.  The fan is not showing us the number of potential patterns that she had to sift through and discard in order to find a pattern that actually did what she wanted it to do, and therefore she is hiding the significant possibility that the pattern that she found is a generic anomaly that would be expected to randomly occur in any large population.

In the context of the timing strategy, how many other strategies did I have to sift through and discard–explicitly, or implicitly–in order to find the strategy that I showed to you, the strategy that I am now trying to make a big deal out of?  You don’t know, because I haven’t told you.  You therefore can’t put accurate odds on the possibility that the strategy’s impressive results were just a random anomaly that would be expected to be found in any sufficiently large population of potential strategies, when searched for.

In practice, the best way to rule out the possibility that we may have preferentially identified random success is to first define the strategy, and then, after we’ve defined it and committed ourselves to it, test it live, in real-world data, data that we have not yet seen and could not possibly have molded our strategy to fit with.  If I suspect that there is something special about the Rangers, game sevens, and MSG, then the solution is to pull those variables together in live experiments, and see whether or not the victories keep happening.  We set up, say, 50 game sevens in MSG for the Rangers to play, and 50 normal games in other stadiums for them to play as a control, and if they end up winning many more of the MSG game sevens than the control games, then we can correctly conclude that the identified game 7 MSG success was not an expected random anomaly, but a reflection of true causal impact in the relevant variables.

Unfortunately, in economic and financial contexts, such tests are not feasible, because they would take too long to play out.  Our only option is to test our strategies in historical data. Even though historical data are inferior for that purpose, they can still be useful.  The key is to conduct the tests out-of-sample, in historical data that we haven’t yet seen or worked with.  Out-of-sample tests prevent us from picking out expected random anomalies in a large population, and making special claims about them, when there is nothing that is actually special about them.

In a financial context, the optimal way to conduct an out-of-sample test on an investment strategy is to use data from foreign markets–ideally, foreign markets whose price movements are unrelated to the price movements in our own market. Unfortunately, in this case, the daily data necessary for such a test are difficult to find, particularly if we want to go back to the 1920s, as we did with the current test.

For an out-of-sample test, the best I can offer in the current case is a test in the 10 different sectors of the market, data that is available from CRSP.  The tests will not be fully independent of the previous test on the total market index, because the stocks in the individual sectors overlap with the stocks in the larger index.  But the sectors still carry a uniqueness and differentiation from the index that will challenge the strategy in a way that might shed light on the possibility that its success was, in fact, random.

The following table shows the performance of the strategy in ten separate market sectors back to 1928:

[Table: Daily momentum strategy performance in ten market sectors, 1928 to 1999]

As the table reveals, the only sector in which the strategy fails to strongly outperform is the Telecom sector.  We can therefore conclude that the strategy’s success is unlikely to be random.

Now, the Rangers game 7 MSG streak was only a seven game streak.  On the assumption that the Rangers had an even chance of winning or losing each game, the probability of such a streak occurring in a seven game trial would have been 0.8%–a low number, but not a number so low as to preclude the generic occurrence of the streak somewhere in a large population.  If 100 or 200 or 300 different teams with the same probability of victory played seven games, it’s reasonable to expect that a few would win seven straight.

The same point cannot be made, however, about the timing strategy–at least not as easily. The timing strategy contained 19,135 days, which amounts to 19,135 independent tests of the strategy’s prowess.  In those tests, the strategy’s average daily excess return over the risk-free rate was 0.058%.  The average daily excess return of the control, the X/Y portfolio, was 0.0154%, with a standard deviation of 0.718%.  Statistically, we can ask the following question.  If my timing strategy did not add any value through its timing–that is, if it carried the same expected return as the X/Y portfolio, a portfolio with the same cumulative asset exposures, which is expected to produce an excess return of 0.0154% per day, with a standard deviation of 0.718%–what is the probability that we would conduct 19,135 independent tests of the timing strategy, and get an average return of 0.058% or better?  If we assume that daily returns follow a normal distribution, we can give a precise statistical answer.  The answer: as close to zero as you can possibly imagine. So close that not even separate trials on an inordinately large number of different strategies, repeated over and over and over, would be able to produce the outcome randomly.
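Under that normality assumption, the question reduces to a one-sided test of the observed 19,135-day average against the X/Y portfolio’s distribution. A sketch of the arithmetic, using the figures quoted above (scipy assumed):

```python
from math import sqrt
from scipy.stats import norm

n = 19_135            # number of daily observations
mu_strategy = 0.058   # strategy's average daily excess return, in percent
mu_xy = 0.0154        # X/Y control's average daily excess return, in percent
sigma = 0.718         # daily standard deviation of the control, in percent

se = sigma / sqrt(n)             # standard error of a 19,135-day average, ~0.0052%
z = (mu_strategy - mu_xy) / se   # ~8.2 standard errors above the control
p_value = norm.sf(z)             # one-sided p-value, on the order of 1e-16
print(z, p_value)
```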

For us to find one strategy out of a thousand that adds no value over the X/Y portfolio, but that still manages to randomly produce such extreme outperformance over such a large sample size, would be like our finding one “average” team out of 100 in a sports league that manages to win 1,000 straight games, entirely by luck.  One “average” team out of 100 “average” teams will likely get lucky and win seven straight games in a seven game trial. But  one “average” team out of 100 average teams is not going to get lucky and win 1,000 straight games in a 1,000 game trial.  If there is an “average” team–out of 100, or 1,000, or even 10,000–that manages to win that many games in a row, then we were simply wrong to call the team “average.” The team’s probability of victory cannot realistically have been a mere 50%–if it had been, we would be forced to believe that “that which simply does not take place” actually took place.

I am willing to accept the same claim about the daily momentum strategy.  Indeed, I am forced to accept it.  The probability that daily momentum–or any timing strategy that we might conjure up–would be no better than its corresponding X/Y portfolio, and yet go on to produce such extreme outperformance over such a large number of independent trials, is effectively zero.  It follows that the strategy must have been capturing a non-random pattern in the data–some causally relevant fact that made prices more likely to go up tomorrow if they went up today, and down tomorrow if they went down today.

I should clarify, at this point, that what makes this conclusion forceful is the large number of independent trials contained in the backtest.  Crucially, a large number of independent trials is not the same as a large number of years tested.  If the time horizon of each trial in a test is long, the test can span a large number of years and yet still only contain a small number of independent trials.

To illustrate, suppose that I’ve formulated a technique that purports to predict returns on a 10 year time horizon.  If I backtest that technique on a monthly basis over a 50 year period, the number of independent trials in my test will not be 12 months per year * 50 years = 600. Rather, it will be 50 years / 10 years = 5, a much smaller number.  The reason is that the 10 year periods inside the 50 year period are not independent of each other–they overlap. The 10 year period that ranges from February 1965 to February 1975, for example, overlaps with the 10 year period that ranges from March 1965 to March 1975, in every month except one.  Given the overlap, if the technique works well in predicting the 10 year return from February 1965 onward, it’s almost certainly going to work well in predicting the 10 year return from March onward–and likewise for April, May, June, and so on.  The independence will increase until we get to February 1975, at which point full independence from the February 1965 trial will exist.
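The overlap is easy to see numerically. In the hypothetical setup below, consecutive monthly observations of the trailing 10-year return share 119 of their 120 months, so adjacent “trials” are nearly identical and contribute almost no independent information:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# 50 years of hypothetical monthly returns
monthly = pd.Series(rng.normal(0.006, 0.04, 600))

# Trailing 10-year (120-month) cumulative return, observed every month
ten_year = (1 + monthly).rolling(120).apply(np.prod, raw=True) - 1

print(ten_year.autocorr(lag=1))   # close to 1: adjacent observations overlap heavily
print(len(monthly) // 120)        # only 5 fully non-overlapping 10-year windows
```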

To summarize where we are at this point, we accept that the source of the daily momentum strategy’s success is a real pattern in the data–a pattern that cannot have reasonably occurred by chance, and that must have some causal explanation underneath it, even though we don’t know what that explanation is.  The next question we need to address is the following.  What is our basis for concluding that, because the system produced that pattern in the past, it will continue to produce the pattern in the future?

The answer, of course, is that we assume that the causal forces in the system that produced the pattern will remain in the system to keep producing it.  When we make appeals to the results of historical backtests, as I did in this case, that is the assumption that we are making.  Unfortunately, we frequently fail to appreciate how tenuous and unreliable that assumption can be, particularly in the context of a dynamic financial market influenced by an exceedingly large number of complicated forces.

To claim that all of the relevant causal forces in a system–all of the conditions that played a role in producing a particular pattern–remain in the system, and that the conditions will reliably produce the pattern again, we need to know, at a minimum, what those causal forces are.  And to know what those causal forces are, we need an accurate theoretical model of how the system works, how it produces the outcomes that we observe.  With respect to the timing strategy, what is the accurate theoretical model that explains how the market produces the daily momentum pattern that we’ve observed?  I’ve given you no model.  All I’ve given you is “the data.”  Should you trust me?

Suppose that after deep investigation, we come to find out that the driver of the timing strategy’s success in the 1928 to 1999 period was a peculiarity associated with the way in which large market participants initiated (or terminated) their positions.  When they initiated (or terminated) their positions, they did so by sending out buy (or sell) orders to a broker many miles away.  Those orders were then fractionally executed over a period of many days, without communication from the sender, and without the possibility of being pulled.  The result might conceivably be a short-term momentum pattern in the price, a pattern that the daily momentum strategy could then exploit.  This phenomenon, if it were the true driver for the strategy’s success–and I’m not saying that it is, I made it up from nothing, simply to illustrate a point–would be an example of a driver that would be extremely fragile and ephemeral.  Any number of changes in market structure could cause it to disappear as a phenomenon. The strategy’s outperformance would then evaporate.

Until we work out an accurate account of what is going on with this peculiar result–and my guess is as good as yours, feel free to float your own theories–we won’t be able to rule out the possibility that the result is due to something fragile and ephemeral, such as a quirk in how people traded historically.  We won’t even be able to put a probability on that possibility.  We are flying blind.

Investors tend to be skeptical of theoretical explanations.  The reason they tend to be skeptical is that it is easy to conjure up flaky stories to explain observed data after the fact.  In the case of the daily momentum results, you saw how easy it was for me to make up exactly that type of story.  But the fact that flaky stories are easy to conjure up doesn’t mean that sound theoretical explanations aren’t important.  They’re extremely important–arguably just as important as “the data.”  Without an accurate understanding of how and why a system produces the patterns that we see, there’s no way for us to know whether or for how long the system will continue to produce those patterns.  And, if the system that we’re talking about is a financial market, it’s hardly a given that the system will continue to produce them.

Now, to be fair, if a system has not been perturbed, it’s reasonable to expect that the system will continue to produce the types of outcomes that it’s been producing up to now. But if we choose to use that expectation as a justification for extrapolating past performance into the future, we need to favor recent data, recent observations of the system’s functioning, in the extrapolation.  Successful performance in recent data is more likely to be a consequence of conditions that remain in the system to produce the successful performance again.  In contrast, successful performance that is found only in the distant past, and not in recent data, is likely to have resulted from conditions that are no longer present in the system.

Some investors like to pooh-pooh this emphasis on recency.  They interpret it to be a kind of arrogant and dismissive trashing of the sacred market wisdoms that our investor ancestors carved out for us, through their experiences.  But, hyperbole aside, there’s a sound basis for emphasizing recent performance over antiquated performance in the evaluation of data.  Recent performance is more likely to be an accurate guide to future performance, because it is more likely to have arisen out of causal conditions that are still there in the system, as opposed to conditions that have since fallen away.

This fact should give us pause in our evaluation of the strategy.  Speaking from the perspective of 1999, over the last two decades–the 1980s and 1990s–the strategy has failed to reliably outperform the market.  Why?  What happened to the pattern that it was supposedly exploiting?  Why are we no longer seeing that pattern in the data?  Given that we never had an understanding of the factors that brought about the pattern in the first place, we can’t even begin to offer up an answer.  We have to simply take it on faith that there is some latent structural property of the market system that causes it to produce the pattern that our strategy exploits, and that even though we haven’t seen the system produce that pattern in over 20 years, we’re eventually going to see the pattern come up again.  Good luck with that.

If you’ve made it this far, congratulations.  We’re now in a position to open up the curtain and see how the strategy would have performed from 1999 onward.  The following chart shows the performance:

[Chart: dmalin15slipedit, daily momentum performance with slippage, 1999 onward]

As you can see, the strategy would have performed atrociously.  It would have inflicted a cumulative total return loss of 71%.  That loss would have been spread out over a multi-decade period in which almost all asset classes outside of the technology sector saw substantial price appreciation, and in which general prices in the economy increased by more than a third.

So much for basing an investment process on “the data.”  The pattern that the strategy had been exploiting was significantly more fragile than anticipated.  Something changed somewhere in time, and caused it to disappear.  We tend to assume that this kind of thing can’t happen, that a market system is like a physical system whose governing “laws” never change.  That assumption would be true, of course, if we were modeling a market system physically, at the level of the neurons in each participant’s brain, the ultimate source of everything that subsequently happens in the system.  But the assumption is not true if we’re modeling the system at a macroscopic level.  It’s entirely possible for the “macro” rules that describe outcomes in a market system to change in relevant ways over time.  As quantitative investors, we should worry deeply about that possibility.

With respect to the strategy’s dismal performance, the writing was on the wall.  The strategy itself–buying after daily gains and selling after daily losses–was weird and counter-intuitive. We had no understanding whatsoever of the causal forces that were driving its success.  We therefore had no reliable way to assess the robustness or ephemerality of those forces, no way to estimate the likelihood that they would remain in the system to keep the success going.  Granted, if we know, for a fact, that relevant conditions in the system have not been perturbed, we can reasonably extrapolate past performance into the future, without necessarily understanding its basis.  But in this case, the strategy’s recent historical performance–the performance that conveys the most information about the strategy’s likely future performance–had not been good.  If we had appropriately given that performance a greater weight in the assessment, we would have rightly set the strategy aside.

We are left with two important investing takeaways:

(1) From an investment perspective, a theoretical understanding of how the market produces a given outcome is important–arguably just as important as “the data” showing that it does produce that outcome.  We need such an understanding in order to be able to evaluate the robustness of the outcome, the likelihood that the outcome will continue to be seen in the future.  Those who have spent time testing quantitative approaches in the real world can attest that the risk of a well-backtested strategy failing to work in the future is significant.

(2) When we extrapolate future performance from past performance–a move that can be justified, if conditions in the system have remained the same–we need to favor recent data over data from the distant past.  Recent data is more likely to share common causal factors with the data of the future–the data that matter.

Now, a critic could argue that my construction here is arbitrary, that I went out and intentionally found a previously well-working strategy that subsequently blew up, specifically so that I could make all of these points.  But actually, I stumbled onto the result while playing around with a different test in the 1928 to 1999 period: a test of the famous Fama-French-Asness momentum factor, which sorts stocks in the market on the basis of prior one year total returns.  That factor also contains an apparent deviation in its performance that starts around the year 2000.
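As a rough sketch of how such a sort might be implemented (the data layout is hypothetical, and the actual factor construction on CRSP data involves additional refinements), each month stocks are ranked on their trailing return and bucketed into ten portfolios:

```python
import numpy as np
import pandas as pd

def momentum_deciles(monthly_prices: pd.DataFrame) -> pd.DataFrame:
    """Assign each stock to a momentum decile (1 = lowest, 10 = highest) each month.

    monthly_prices : month-end total-return index levels, one column per stock.
    The signal here is simply the trailing 12-month return.
    """
    trailing = monthly_prices.pct_change(12)
    ranks = trailing.rank(axis=1, pct=True)   # percentile rank within each month
    return np.ceil(ranks * 10)                # bucket the ranks into deciles 1 through 10
```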

The following chart and table show the performance of the market’s 10 momentum deciles from 1928 up to the year 2000:

[Chart: momto2k, momentum decile performance, 1928 to 2000]

[Table: momtableto2k, momentum decile performance statistics, 1928 to 2000]

As the chart and table confirm, the returns sort perfectly on the momentum factor.  Higher-momentum deciles earn higher returns, and lower-momentum deciles earn lower returns.

But now consider the results for the period from the year 2000 to today:

[Chart: mom2k, momentum decile performance, 2000 to today]

[Table: momtable2k, momentum decile performance statistics, 2000 to today]

The results are off from what we would have expected.  The top performer on total return ends up being the 4/10 decile, with the 5/10 decile, the Sharpe winner, a close second.  The highest momentum decile–10/10–ends up in 6th place, with the 9/10 decile in 5th place.

To be fair, it may be possible to explain the unexpected shuffling of the performance rankings as a random statistical deviation.  But the shuffling represents a reason for caution, especially given that the post-2000 period is a recent period like our own, a period in which momentum was already a known factor to market participants.  For all we know, momentum could be a highly fragile market phenomenon that could be perturbed out of existence if only a few smart investors with large footprints were to try to implement it.  Or it could disappear for entirely unrelated reasons–a butterfly could flap its wings somewhere else in the market, and mess things up.  Or it could be robust, and stick like glue in the system no matter how the financial world changes. Without an understanding of the causal drivers of its historical outperformance, it’s difficult to confidently assess the likelihood of any of these possibilities.

The daily momentum strategy’s outperformance was so weird that I asked a quant friend, @econompic, to do his own testing on the strategy, to see if he could reproduce the results. It turns out that he had already reproduced them.  In a short blog post from June, he tested the strategy, which is originally attributable to John Orford, in the daily S&P 500 price index (dividends excluded).  Lo and behold, he observed similarly extreme outperformance, with a massive unexplained break in the year 2000.  This result allayed my chief concern, which was that the outperformance was being driven by some unique quirk in the way that the CRSP indexes were being put together, a quirk that then abruptly changed in the year 2000.  The same result, after all, is found in an index produced by a completely separate entity: S&P, i.e., the S&P 500.

In his testing, @econompic also found that the inverse of the daily momentum strategy–daily mean reversion–which had worked horribly up to the year 2000, has since outperformed, at least before frictions.  The chart below reproduces his result on the total market, with dividends included:

[Chart: mrlog15edited, daily mean reversion on the total market, dividends included]

What should we conclude from all of this?  We don’t have to conclude anything–all I’ve offered is an example.  I’m sticking with what I had already concluded–that the currently fashionable project of using “the data” to build superior investment strategies, or to make claims about the future, deserves significant scrutiny.  It’s not useless, it’s not without a place, but it’s worthy of caution and skepticism–more than it typically receives.  Markets are too complex, too dynamic, too adaptive, for it to be able to succeed on its own.

As an investor evaluating a potential strategy, what I want to see is not just an impressive backtest, but a compelling, accurate, reductionistic explanation of what is actually happening in the strategy–who in the market is doing what, where, when and why, and how the agglomeration is producing the result, the pattern that the strategy is successfully exploiting.  I want an explanation that I know to be accurate, an explanation that will allow me to reliably gauge the likelihood that the pattern and the associated outperformance will persist into the future–which is the only thing I care about.

If I’m going to run a systematic strategy, I want the strategy to work now, when I run it, as I run it.  I don’t want to have to put faith in an eventual reversion to a past period of glory. That’s too risky–the exploited pattern could have been ephemeral, relevant conditions could have changed.  If a strategy can’t deliver success on a near-term basis, in the out-of-sample test that reality is putting it through, then I’d rather just abandon the systematic approach altogether and invest on my own concrete analysis of the situation, my own gut feel for where things are likely headed, given the present facts.  If I don’t have confidence in my own analysis, if I can’t trust my gut, then I shouldn’t be actively investing.  I should save the time and effort and just toss the money in an index.

To make the point with an analogy from the game of chess, suppose that there’s a certain position on the board.  You’ve done a large statistical analysis of historical games with similar positions, and you’ve observed that a certain move showed a high frequency of subsequent victories.  If you’re going to tell me that I should make that move, I want you to tell me why it’s a good move.  Explain it to me, in terms of the logic of the position itself.  If you can’t do that, I’m going to be skeptical.  And if I make that move, and things turn south for me in the game, well, I’m going to go back to working off of my own analysis of the position.  I’m not going to blindly trust in the recommendations that your statistical analysis is pumping out–nor should I.  Finally, if you’re going to tell me that my attempt to find the right move through a direct analysis of the position isn’t going to work–that my impulses and irrationalities will inevitably lead me astray–what you’re telling me is that I shouldn’t be playing chess.  And maybe you’re right.

Of course, the situation isn’t entirely the same in markets, but there’s a loose analogy.  The actual truth of what the right move is in a chess game is contained in the position that is there on the board, not in positions seen in prior games.  Likewise, the actual truth of what the right move is in a market is contained in the conditions that presently define the market, not in conditions observed in the past–and especially not in conditions observed in the distant past.  The goal is to get to that truth.  Looking to the past isn’t necessarily the best way to get to it.
