How to Measure Trading Strategy Performance in Python
A strategy that made 20% last year sounds great until you learn it lost 40% at peak drawdown, won only 35% of trades, and got lucky on one tech spike. Returns alone tell you almost nothing. The question is always: was this skill or luck, and is it repeatable?
This post covers the five performance metrics I use in TradeSight to evaluate whether a strategy is worth running live. All implementations use Python, pandas, and numpy; no special libraries required.
1. Sharpe Ratio: Risk-Adjusted Return
The Sharpe ratio measures how much return you earn per unit of volatility. A ratio of 1.0 means the strategy's annualized return above the risk-free rate equals its annualized volatility. Above 2.0 is excellent in practice. Below 0.5 is noise.
Formula: Sharpe = (Annualized Return - Risk-Free Rate) / Annualized Volatility
import numpy as np
import pandas as pd


def sharpe_ratio(returns: pd.Series, risk_free_rate: float = 0.05) -> float:
    """
    Calculate annualized Sharpe ratio from daily returns.

    Args:
        returns: Daily return series (e.g., [0.01, -0.005, 0.02, ...])
        risk_free_rate: Annual risk-free rate (default 5% = current T-bill yield)

    Returns:
        Annualized Sharpe ratio
    """
    # Need at least two points and some variation to define volatility
    if len(returns) < 2 or returns.std() == 0:
        return 0.0
    daily_rf = risk_free_rate / 252  # Convert annual to daily
    excess_returns = returns - daily_rf
    # Annualize: multiply daily Sharpe by sqrt(252 trading days)
    return (excess_returns.mean() / excess_returns.std()) * np.sqrt(252)


# Example usage with TradeSight backtest output
trades = pd.DataFrame({
    'date': pd.date_range('2025-01-01', periods=60, freq='B'),
    'pnl': [12, -5, 8, 22, -3, 15, -8, 31, 7, -2] * 6
})
trades['return'] = trades['pnl'] / 500  # assuming $500 starting capital
sharpe = sharpe_ratio(trades['return'])
# Prints ~9.98 here: this toy P&L is far smoother than real trading
print(f"Sharpe Ratio: {sharpe:.2f}")
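The example above divides every trade's P&L by the fixed $500 starting capital. A minimal sketch of the alternative, assuming you would rather measure each return against the equity the trade started from; `annualized_sharpe` and the variable names here are illustrative, not taken from TradeSight:

```python
import numpy as np
import pandas as pd

# Same toy P&L and $500 starting capital as the example above
pnl = pd.Series([12, -5, 8, 22, -3, 15, -8, 31, 7, -2] * 6, dtype=float)
start_capital = 500.0

# Equity after each trade, with the starting capital prepended so the
# first trade also gets a return
equity = pd.concat(
    [pd.Series([start_capital]), start_capital + pnl.cumsum()],
    ignore_index=True,
)

# Return of each trade relative to the equity it started from
equity_returns = equity.pct_change().dropna().reset_index(drop=True)

# Fixed-capital returns, as in the example above
fixed_returns = pnl / start_capital


def annualized_sharpe(returns: pd.Series, risk_free_rate: float = 0.05,
                      periods: int = 252) -> float:
    """Annualized Sharpe ratio from per-period returns (illustrative helper)."""
    excess = returns - risk_free_rate / periods
    return float(excess.mean() / excess.std() * np.sqrt(periods))


print(f"Fixed-capital Sharpe: {annualized_sharpe(fixed_returns):.2f}")
print(f"Equity-based Sharpe:  {annualized_sharpe(equity_returns):.2f}")
```

As the account grows, the same dollar P&L becomes a smaller percentage return, so the two series drift apart over a long backtest.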
2. Maximum Drawdown: Worst-Case Loss
Max drawdown answers: "What's the worst the strategy has ever done from a peak to a trough?" It's the metric that kills live accounts. A strategy can look great on paper but have a 60% drawdown that forces you to stop out at the worst possible moment.
def max_drawdown(equity_curve: pd.Series) -> float:
    """
    Calculate maximum drawdown from an equity curve.

    Args:
        equity_curve: Running portfolio value (e.g., [500, 512, 498, 531, ...])

    Returns:
        Max drawdown as a negative decimal (e.g., -0.15 = 15% drawdown)
    """
    # Running peak (highest value seen so far at each point)
    running_peak = equity_curve.cummax()
    # Drawdown at each point: (current value - peak) / peak
    drawdown = (equity_curve - running_peak) / running_peak
    return drawdown.min()  # Most negative value = worst drawdown


# Build equity curve from trade P&L
trades['equity'] = 500 + trades['pnl'].cumsum()
mdd = max_drawdown(trades['equity'])
print(f"Max Drawdown: {mdd:.1%}")  # "Max Drawdown: -1.5%" on the toy data
In TradeSight, I use max drawdown as a circuit breaker. If any strategy exceeds 15% drawdown in a rolling 30-day window, it's automatically removed from the tournament and put in cooldown. No matter how good the Sharpe is, a deep drawdown is almost always luck running out.
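The post doesn't show how the rolling check is computed, so here is one possible sketch. It assumes "drawdown in a rolling 30-day window" means the drop from the highest equity seen in the trailing 30 calendar days; `rolling_drawdown`, `breaches_circuit_breaker`, and the equity values are hypothetical:

```python
import pandas as pd


def rolling_drawdown(equity: pd.Series, window: str = "30D") -> pd.Series:
    """Drawdown of each point versus the peak of the trailing time window.

    Assumes `equity` has a sorted DatetimeIndex.
    """
    trailing_peak = equity.rolling(window).max()
    return equity / trailing_peak - 1.0


def breaches_circuit_breaker(equity: pd.Series, limit: float = -0.15,
                             window: str = "30D") -> bool:
    """True if the rolling drawdown ever falls below `limit`."""
    return bool((rolling_drawdown(equity, window) < limit).any())


# Hypothetical equity curve: climbs to 600, then slides to 500 within days
dates = pd.date_range("2025-01-01", periods=8, freq="B")
equity = pd.Series([500, 550, 600, 580, 540, 510, 500, 520],
                   index=dates, dtype=float)

dd = rolling_drawdown(equity)
print(dd.round(3))
print("Circuit breaker tripped:", breaches_circuit_breaker(equity))
```

Using a time-based window rather than a fixed row count means gaps in trading (weekends, days with no trades) do not stretch the lookback.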
3. Win Rate and Profit Factor
Win rate alone is useless. A strategy can win 80% of trades and still lose money if the 20% losing trades are huge. Profit factor combines both: it's the ratio of gross profit to gross loss.
def win_rate(pnl_series: pd.Series) -> float:
    """Percentage of trades that were profitable."""
    wins = (pnl_series > 0).sum()
    total = len(pnl_series)
    return wins / total if total > 0 else 0.0


def profit_factor(pnl_series: pd.Series) -> float:
    """
    Ratio of total gains to total losses.
    > 1.0 = profitable overall
    > 1.5 = decent edge
    > 2.0 = strong edge
    """
    gross_profit = pnl_series[pnl_series > 0].sum()
    gross_loss = abs(pnl_series[pnl_series < 0].sum())
    if gross_loss == 0:
        return float('inf')  # No losing trades (suspicious; check sample size)
    return gross_profit / gross_loss


wr = win_rate(trades['pnl'])
pf = profit_factor(trades['pnl'])
print(f"Win Rate: {wr:.1%} | Profit Factor: {pf:.2f}")
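To make the point about win rate concrete, here is a small worked example with a hypothetical trade log: eight small winners and two large losers. The numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical trade log: 8 small winners, 2 large losers
demo_pnl = pd.Series([10, 10, 10, 10, 10, 10, 10, 10, -50, -60], dtype=float)

demo_win_rate = (demo_pnl > 0).mean()
demo_gross_profit = demo_pnl[demo_pnl > 0].sum()
demo_gross_loss = abs(demo_pnl[demo_pnl < 0].sum())
demo_profit_factor = demo_gross_profit / demo_gross_loss
demo_net = demo_pnl.sum()

print(f"Win rate:      {demo_win_rate:.1%}")       # 80.0%
print(f"Profit factor: {demo_profit_factor:.2f}")  # 0.73
print(f"Net P&L:       {demo_net:+.0f}")           # -30
```

An 80% win rate with a profit factor below 1.0 still loses money, which is why the two numbers are read together.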
⚠️ Sample Size Warning
Win rate and profit factor are meaningless on small samples. I don't trust any metric on fewer than 30 trades. TradeSight requires 50+ backtest trades before a strategy enters the live tournament.
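A rough way to see why small samples are untrustworthy is to put an interval around the observed win rate. The sketch below uses the normal approximation to the binomial, which is itself crude for small samples; `win_rate_interval` is an illustrative helper, not part of TradeSight.

```python
import math


def win_rate_interval(wins: int, total: int, z: float = 1.96):
    """Approximate 95% interval for the true win rate (normal approximation)."""
    p = wins / total
    se = math.sqrt(p * (1 - p) / total)  # standard error of a proportion
    return max(0.0, p - z * se), min(1.0, p + z * se)


# Same observed 60% win rate at different sample sizes
for n in (10, 30, 50, 200):
    wins = round(0.6 * n)
    lo, hi = win_rate_interval(wins, n)
    print(f"{n:>3} trades: observed 60%, plausible range {lo:.0%} to {hi:.0%}")
```

At 30 trades, the interval around a 60% win rate still includes 50%, so the result cannot be told apart from a coin flip; it only clears 50% with a much larger sample.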
4. Calmar Ratio: Return vs. Pain
The Calmar ratio is annualized return divided by max drawdown (absolute value). It answers: "How much return am I getting per unit of worst-case pain?" A Calmar above 1.0 is solid. Above 3.0 is rare outside of quant funds.
def calmar_ratio(returns: pd.Series, equity_curve: pd.Series) -> float:
    """
    Calmar ratio: annualized return / abs(max drawdown).
    Higher = better risk-adjusted performance.
    """
    annual_return = returns.mean() * 252  # Annualize daily returns
    mdd = abs(max_drawdown(equity_curve))
    if mdd == 0:
        return float('inf')
    return annual_return / mdd


calmar = calmar_ratio(trades['return'], trades['equity'])
print(f"Calmar Ratio: {calmar:.2f}")
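The function above annualizes by multiplying the mean daily return by 252, which ignores compounding. One alternative, sketched here as an assumption rather than as TradeSight's method, is to take the compound annual growth rate (CAGR) straight from the equity curve; `cagr` and `calmar_from_equity` are hypothetical names.

```python
import pandas as pd


def cagr(equity_curve: pd.Series, periods_per_year: int = 252) -> float:
    """Compound annual growth rate, treating each row as one period."""
    n_periods = len(equity_curve) - 1
    if n_periods <= 0 or equity_curve.iloc[0] <= 0:
        return 0.0
    total_growth = equity_curve.iloc[-1] / equity_curve.iloc[0]
    return float(total_growth ** (periods_per_year / n_periods) - 1.0)


def calmar_from_equity(equity_curve: pd.Series,
                       periods_per_year: int = 252) -> float:
    """Calmar ratio using CAGR in the numerator."""
    running_peak = equity_curve.cummax()
    mdd = abs(((equity_curve - running_peak) / running_peak).min())
    if mdd == 0:
        return float("inf")
    return cagr(equity_curve, periods_per_year) / mdd


# Tiny hypothetical curve; 3 periods per year keeps the arithmetic checkable:
# growth 100 -> 120 is a 20% CAGR, worst drop 110 -> 99 is a 10% drawdown
curve = pd.Series([100.0, 110.0, 99.0, 120.0])
demo_calmar = calmar_from_equity(curve, periods_per_year=3)
print(f"Calmar (CAGR-based): {demo_calmar:.2f}")  # 2.00
```

For short backtests the two annualization methods can differ noticeably, so it helps to state which one a reported Calmar uses.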
5. Putting It All Together: A Strategy Scorecard
In TradeSight's tournament system, I score every strategy at the end of each overnight backtest using a weighted composite of these metrics:
| Metric | Weight | Minimum to Pass | Why |
|---|---|---|---|
| Sharpe Ratio | 35% | 0.8 | Primary edge signal |
| Profit Factor | 25% | 1.2 | Net edge confirmation |
| Max Drawdown | 25% | -20% | Risk gate (hard floor) |
| Calmar Ratio | 15% | 0.5 | Efficiency check |
def strategy_score(returns: pd.Series, pnl_series: pd.Series, equity_curve: pd.Series) -> dict:
    """
    Composite strategy scorecard.
    Returns a dict of all metrics + a 0-100 composite score.
    """
    sharpe = sharpe_ratio(returns)
    mdd = max_drawdown(equity_curve)
    pf = profit_factor(pnl_series)
    calmar = calmar_ratio(returns, equity_curve)
    wr = win_rate(pnl_series)

    # Hard failure conditions. These gates are looser than the table's
    # minimums, which this function does not check individually.
    if mdd < -0.20 or pf < 1.0 or sharpe < 0.0:
        composite = 0.0
    else:
        # Normalize each metric to 0-100 scale, then weight
        sharpe_score = min(sharpe / 3.0, 1.0) * 100        # 3.0 Sharpe = 100%
        pf_score = min((pf - 1.0) / 2.0, 1.0) * 100        # 3.0 PF = 100%
        mdd_score = min((0.20 + mdd) / 0.20, 1.0) * 100    # 0% MDD = 100%
        calmar_score = min(calmar / 3.0, 1.0) * 100        # 3.0 Calmar = 100%
        composite = (
            sharpe_score * 0.35 +
            pf_score * 0.25 +
            mdd_score * 0.25 +
            calmar_score * 0.15
        )

    return {
        'sharpe': round(sharpe, 2),
        'max_drawdown': round(mdd, 3),
        'profit_factor': round(pf, 2),
        'calmar': round(calmar, 2),
        'win_rate': round(wr, 3),
        'composite_score': round(composite, 1),
        'passes': bool(composite > 40.0)  # plain bool rather than numpy.bool_
    }


# Example output (illustrative values, not produced by the toy data above):
# {'sharpe': 1.87, 'max_drawdown': -0.083, 'profit_factor': 1.64,
#  'calmar': 2.34, 'win_rate': 0.567, 'composite_score': 56.1, 'passes': True}
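To see how the weights combine, the sketch below applies the same normalization to a set of hypothetical metric values and prints each metric's contribution. `composite_breakdown` is an illustrative helper that mirrors the scaling in strategy_score(); it is not part of TradeSight.

```python
def composite_breakdown(sharpe: float, mdd: float, pf: float, calmar: float) -> dict:
    """Weighted contribution of each metric to the 0-100 composite."""
    # Same hard failure conditions as strategy_score()
    if mdd < -0.20 or pf < 1.0 or sharpe < 0.0:
        return {"sharpe": 0.0, "profit_factor": 0.0,
                "max_drawdown": 0.0, "calmar": 0.0, "composite": 0.0}
    parts = {
        "sharpe": min(sharpe / 3.0, 1.0) * 100 * 0.35,
        "profit_factor": min((pf - 1.0) / 2.0, 1.0) * 100 * 0.25,
        "max_drawdown": min((0.20 + mdd) / 0.20, 1.0) * 100 * 0.25,
        "calmar": min(calmar / 3.0, 1.0) * 100 * 0.15,
    }
    parts["composite"] = sum(parts.values())
    return parts


# Hypothetical metrics: Sharpe 1.5, max drawdown -10%, PF 1.5, Calmar 1.5
breakdown = composite_breakdown(sharpe=1.5, mdd=-0.10, pf=1.5, calmar=1.5)
for name, points in breakdown.items():
    print(f"{name:>14}: {points:5.2f}")
# Contributions: 17.50 + 6.25 + 12.50 + 7.50 = 43.75
```

A profit factor of 1.5 contributes only 6.25 of a possible 25 points, because the scale runs from 1.0 to 3.0; a middling strategy like this one lands just above the 40-point pass line.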
Real Results from TradeSight's Tournament
After running 200+ strategy variants through the overnight tournament, here's what the distribution looks like in practice. About 60% of strategies fail the composite score minimum on first run. Of those that pass, only ~20% survive 5 consecutive tournament nights.
TradeSight Tournament Stats (April 2026)
Best live strategy Sharpe: 2.53 | Max drawdown: -6.8% | Profit factor: 1.91
Average of top-5 survivors: Sharpe 1.74 | Composite score: 58.3
Tournament pass rate: ~38% on 50+ trade backtests
The code for all these metrics is in TradeSight's metrics.py. The tournament runner uses strategy_score() to rank all strategies after each overnight backtest and promotes the top performers to live paper trading.
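The ranking step can be sketched in a few lines. Everything here is an assumption for illustration: the strategy names and scores are made up, `promote_top` is a hypothetical helper, and the real tournament runner may rank differently.

```python
# Hypothetical scorecards, trimmed to the two fields the ranking needs
scorecards = {
    "mean_reversion_v3": {"composite_score": 58.3, "passes": True},
    "breakout_v1": {"composite_score": 0.0, "passes": False},
    "momentum_v7": {"composite_score": 71.0, "passes": True},
    "pairs_v2": {"composite_score": 44.9, "passes": True},
}


def promote_top(cards: dict, top_n: int = 2) -> list:
    """Names of the highest-scoring strategies that passed, best first."""
    passing = [
        (name, card["composite_score"])
        for name, card in cards.items()
        if card["passes"]
    ]
    passing.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in passing[:top_n]]


promoted = promote_top(scorecards)
print(promoted)  # ['momentum_v7', 'mean_reversion_v3']
```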
What's Next
Two things I'm adding in the next phase: Monte Carlo simulation to stress-test drawdown estimates beyond historical data, and a sector correlation guard to prevent the portfolio from going all-in on correlated positions during the same regime. Both are open in the GitHub repo.
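As a rough idea of what a Monte Carlo drawdown test can look like, the sketch below resamples trade P&L with replacement and records the max drawdown of each simulated history. It is not TradeSight's planned implementation: `simulate_max_drawdowns` is a hypothetical helper, and the resampling assumes trades are independent, which ignores streaks and regime effects.

```python
import numpy as np


def simulate_max_drawdowns(pnl, start_capital: float = 500.0,
                           n_sims: int = 2000, seed: int = 42) -> np.ndarray:
    """Max drawdown of each bootstrap-resampled trade sequence."""
    rng = np.random.default_rng(seed)
    pnl = np.asarray(pnl, dtype=float)
    # Each row is one alternative history drawn from the observed trades
    samples = rng.choice(pnl, size=(n_sims, len(pnl)), replace=True)
    equity = start_capital + np.cumsum(samples, axis=1)
    # Prepend the starting capital so it counts as the first peak
    equity = np.hstack([np.full((n_sims, 1), start_capital), equity])
    running_peak = np.maximum.accumulate(equity, axis=1)
    drawdowns = (equity - running_peak) / running_peak
    return drawdowns.min(axis=1)


# Toy P&L from earlier in the post
toy_pnl = [12, -5, 8, 22, -3, 15, -8, 31, 7, -2] * 6
simulated = simulate_max_drawdowns(toy_pnl)

print(f"Median simulated max drawdown: {np.median(simulated):.1%}")
print(f"5th percentile (bad luck):     {np.percentile(simulated, 5):.1%}")
```

The gap between the historical max drawdown and the 5th-percentile simulated one gives a sense of how much worse the same trades could have looked in a different order.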
If you found this useful, the full TradeSight codebase is open source: github.com/rmbell09-lang/tradesight. Star it if it saves you from deploying a bad strategy live.