Linear/logistic regression, SVM, decision trees, and the bias-variance tradeoff
In supervised learning, we observe a dataset $\{(x_i, y_i)\}_{i=1}^n$, where $x_i \in \mathbb{R}^d$ are features and $y_i$ is the target. The goal is to learn a function $f$ that maps new inputs $x$ to accurate predictions $\hat{y} = f(x)$. When $y$ is continuous, this is regression; when $y$ is categorical, this is classification.
Every quant interview involving ML starts here. The interviewer wants to know whether you understand the assumptions behind each model and when those assumptions break.
The ordinary least squares estimator minimizes squared residuals:

$$\hat{\beta} = \arg\min_\beta \|y - X\beta\|_2^2 = (X^T X)^{-1} X^T y$$
Key properties to know cold:

- Under the Gauss-Markov assumptions (linear model, exogenous errors, homoskedastic and uncorrelated noise), OLS is BLUE: the best linear unbiased estimator.
- The residuals are orthogonal to the columns of $X$.
- Squared loss makes OLS sensitive to outliers, and the estimate is unstable when $X^T X$ is near-singular (multicollinearity).
In a quant context, OLS is the backbone of factor models. When you regress returns on Fama-French factors, you are running OLS.
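As a sketch of this factor-model use case, here is OLS via the normal equations on synthetic data (pure NumPy; the factor loadings and noise level are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "factor" data: returns driven by two factors plus noise.
# The loadings and noise level are made up for illustration.
n = 500
X = rng.standard_normal((n, 2))           # factor returns (e.g. market, size)
beta_true = np.array([0.5, 0.3])          # true factor loadings
y = X @ beta_true + 0.01 * rng.standard_normal(n)

# OLS via the normal equations: solve (X'X) beta = X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # recovers loadings close to [0.5, 0.3]
```

In practice you would use a library routine (e.g. `np.linalg.lstsq`) rather than forming $X^T X$ explicitly, since the normal equations square the condition number.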
When $d$ is large relative to $n$ (many features, limited history), OLS overfits. Regularization adds a penalty on the size of the coefficients:
Ridge regression (L2):

$$\hat{\beta}^{\text{ridge}} = \arg\min_\beta \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2$$
Ridge shrinks coefficients toward zero but never sets them exactly to zero. The closed-form solution is $\hat{\beta}^{\text{ridge}} = (X^T X + \lambda I)^{-1} X^T y$. Notice that adding $\lambda I$ makes the matrix invertible even when $X^T X$ is singular: this is why Ridge works when OLS fails due to multicollinearity.
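A minimal sketch of that invertibility point, using two perfectly collinear columns so that $X^T X$ is singular (synthetic data; $\lambda = 1$ is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two perfectly collinear features: X'X is rank-deficient, so plain OLS fails
n = 100
x1 = rng.standard_normal(n)
X = np.column_stack([x1, 2 * x1])          # second column is exactly 2x the first
y = x1 + 0.1 * rng.standard_normal(n)

lam = 1.0                                  # illustrative penalty strength
d = X.shape[1]
# Ridge closed form: (X'X + lambda*I)^(-1) X'y is well-defined despite collinearity
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print(beta_ridge)  # finite coefficients, with weight split across the collinear pair
```

Because the L2 penalty prefers small coefficients, Ridge spreads the predictive weight across the collinear pair instead of picking one column arbitrarily.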
Lasso regression (L1):

$$\hat{\beta}^{\text{lasso}} = \arg\min_\beta \|y - X\beta\|_2^2 + \lambda \|\beta\|_1$$
Lasso performs feature selection by driving some coefficients exactly to zero. This is invaluable in quant finance where you might have hundreds of candidate alpha signals but only a handful are truly predictive.
Elastic Net combines both penalties: $\lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2$. Use it when features are correlated (Lasso tends to pick one feature from a group of correlated features; Elastic Net keeps the group).
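A sketch of Lasso's selection behavior, assuming scikit-learn is available (the signal counts, coefficients, and `alpha` value are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)

# 50 candidate "signals", only the first 3 truly predictive
n, d = 200, 50
X = rng.standard_normal((n, d))
beta = np.zeros(d)
beta[:3] = [1.0, -0.8, 0.5]
y = X @ beta + 0.1 * rng.standard_normal(n)

# L1 penalty drives most coefficients exactly to zero
lasso = Lasso(alpha=0.05).fit(X, y)
n_selected = int(np.sum(lasso.coef_ != 0))
print(n_selected)  # far fewer than 50: the noise signals are zeroed out
```

Note that scikit-learn's `Lasso` scales the squared-error term by $1/(2n)$, so its `alpha` is not directly comparable to the $\lambda$ in the formula above.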
For binary classification ($y \in \{0, 1\}$), logistic regression models the probability:

$$P(y = 1 \mid x) = \sigma(\beta^T x)$$
where $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function. The decision boundary is at $P(y = 1 \mid x) = 0.5$, i.e., $\beta^T x = 0$.
Parameters are estimated by maximum likelihood (no closed form; use gradient descent or Newton's method). The loss function is binary cross-entropy:

$$\mathcal{L}(\beta) = -\frac{1}{n} \sum_{i=1}^n \left[ y_i \log \hat{p}_i + (1 - y_i) \log(1 - \hat{p}_i) \right], \qquad \hat{p}_i = \sigma(\beta^T x_i)$$
Quant application: Logistic regression is widely used for trade classification (buy vs. sell), default prediction, and regime detection (risk-on vs. risk-off).
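A bare-bones sketch of the maximum-likelihood fit by gradient descent on synthetic binary data (pure NumPy; the true coefficients, learning rate, and iteration count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy binary data: label probability depends on one feature through a sigmoid
n = 1000
x = rng.standard_normal((n, 1))
X = np.column_stack([np.ones(n), x])      # intercept + feature
beta_true = np.array([-0.5, 2.0])
p = 1 / (1 + np.exp(-(X @ beta_true)))
y = (rng.uniform(size=n) < p).astype(float)

# Maximum likelihood by gradient descent on the binary cross-entropy loss
beta = np.zeros(2)
lr = 0.1
for _ in range(2000):
    p_hat = 1 / (1 + np.exp(-(X @ beta)))
    grad = X.T @ (p_hat - y) / n          # gradient of BCE w.r.t. beta
    beta -= lr * grad
print(beta)  # approaches beta_true as n grows
```

The gradient $X^T(\hat{p} - y)/n$ has the same form as the OLS gradient with predictions passed through the sigmoid, which is why the two models are often taught together.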
SVM finds the maximum-margin hyperplane that separates two classes. Given a linear decision boundary $w^T x + b = 0$ and labels $y_i \in \{-1, +1\}$, the margin is $\frac{2}{\|w\|}$. The SVM optimization problem is:

$$\min_{w, b} \ \frac{1}{2} \|w\|^2 \quad \text{subject to} \quad y_i (w^T x_i + b) \ge 1 \ \text{ for all } i$$
Support vectors are the data points that lie on the margin boundaries ($y_i (w^T x_i + b) = 1$). Only these points determine the decision boundary; the rest could be removed without changing the classifier.
When data is not linearly separable, map it to a higher-dimensional space via a feature map $\phi(x)$. The kernel trick computes $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$ without ever computing $\phi$ explicitly.
Common kernels:

- Linear: $K(x_i, x_j) = x_i^T x_j$
- Polynomial: $K(x_i, x_j) = (x_i^T x_j + c)^p$
- RBF (Gaussian): $K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$, the usual default for non-linear problems
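To see why the kernel choice matters, a sketch assuming scikit-learn: concentric rings are hopeless for a linear kernel but easy for RBF (the geometry and sample sizes are invented for illustration):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)

# Two concentric rings: not linearly separable in the original 2-D space
n = 200
r = np.concatenate([rng.uniform(0, 1, n), rng.uniform(2, 3, n)])
theta = rng.uniform(0, 2 * np.pi, 2 * n)
X = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
y = np.concatenate([np.zeros(n), np.ones(n)])

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf").fit(X, y)
# RBF training accuracy is near 1.0; the linear kernel is near chance
print(linear.score(X, y), rbf.score(X, y))
```

In radial coordinates the classes are separated by a simple threshold on $r$, which is exactly the kind of transformation the RBF kernel captures implicitly.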
A decision tree recursively partitions the feature space by choosing splits that maximize purity of the resulting subsets.
Gini impurity for a node with class proportions $p_1, \dots, p_K$:

$$G = 1 - \sum_{k=1}^{K} p_k^2$$
Information gain uses entropy $H = -\sum_{k=1}^{K} p_k \log_2 p_k$. The split that maximizes $H(\text{parent}) - \sum_j \frac{n_j}{n} H(\text{child}_j)$ is chosen.
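Both impurity measures are one-liners; a minimal sketch:

```python
import numpy as np

def gini(p):
    """Gini impurity 1 - sum(p_k^2) for class proportions p."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """Entropy -sum(p_k log2 p_k), treating 0 * log(0) as 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(gini([0.5, 0.5]))     # 0.5  (maximally impure binary node)
print(gini([1.0, 0.0]))     # 0.0  (pure node)
print(entropy([0.5, 0.5]))  # 1.0
```

Both measures are maximized at uniform class proportions and zero for a pure node, which is why trees trained with either criterion usually look similar.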
Decision trees are interpretable and handle non-linear relationships naturally, but they overfit aggressively. This motivates ensembles (next lesson).
The expected prediction error at a point $x$ decomposes as:

$$\mathbb{E}\left[(y - \hat{f}(x))^2\right] = \underbrace{\left(\mathbb{E}[\hat{f}(x)] - f(x)\right)^2}_{\text{bias}^2} + \underbrace{\mathrm{Var}[\hat{f}(x)]}_{\text{variance}} + \underbrace{\sigma^2}_{\text{irreducible noise}}$$
| Model | Bias | Variance |
|---|---|---|
| OLS (few features) | Low | Low |
| OLS (many features) | Low | High |
| Ridge/Lasso | Slightly higher | Much lower |
| Decision tree (deep) | Low | Very high |
| SVM (RBF kernel) | Low | Can be high |
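The decomposition can be estimated by simulation: refit a model on many noisy resamples of the same underlying function and measure the offset and spread of its prediction at one point. A sketch (the sine target, noise level, and polynomial degrees are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(5)

def true_f(x):
    return np.sin(x)

x_train = np.linspace(0, 3, 20)   # fixed design points
x0 = 1.5                          # evaluate bias/variance at this point
n_trials = 500

results = {}
for degree in (1, 10):
    preds = []
    for _ in range(n_trials):
        # fresh noisy sample of the same underlying function each trial
        y = true_f(x_train) + 0.3 * rng.standard_normal(x_train.size)
        coefs = np.polyfit(x_train, y, degree)
        preds.append(np.polyval(coefs, x0))
    preds = np.array(preds)
    bias2 = (preds.mean() - true_f(x0)) ** 2
    variance = preds.var()
    results[degree] = (bias2, variance)
    print(degree, bias2, variance)
# degree 1: high bias, low variance; degree 10: low bias, high variance
```

The degree-1 fit misses the curvature of the sine (bias) but barely moves across resamples; the degree-10 fit tracks every noisy sample (variance) while nailing the shape on average.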
Understanding this tradeoff is the single most tested ML concept in quant interviews.
| Scenario | Best choice | Why |
|---|---|---|
| Few features, linear relationship | OLS or Ridge | Simple, interpretable, fast |
| Many features, need selection | Lasso or Elastic Net | Automatic sparsity |
| Binary classification, need probabilities | Logistic regression | Calibrated probabilities |
| Complex boundary, small-to-medium data | SVM with RBF kernel | Effective in high dimensions |
| Need interpretability + non-linear | Decision tree (shallow) | Easy to explain to PMs |
Interview tip: When asked "which model would you use?", never jump to a complex model. Start with logistic regression or Ridge, explain the baseline, then discuss when to graduate to more complex methods. Interviewers reward structured thinking over name-dropping XGBoost.