Linear/logistic regression, SVM, decision trees, and the bias-variance tradeoff
In supervised learning, we observe a dataset $\{(x_i, y_i)\}_{i=1}^n$, where $x_i \in \mathbb{R}^d$ are features and $y_i$ is the target. The goal is to learn a function $f$ that maps new inputs $x$ to accurate predictions $\hat{y} = f(x)$. When $y$ is continuous, this is regression; when $y$ is categorical, this is classification.
Every quant interview involving ML starts here. The interviewer wants to know whether you understand the assumptions behind each model and when those assumptions break.
The ordinary least squares estimator minimizes squared residuals:

$$\hat{\beta} = \arg\min_\beta \|y - X\beta\|_2^2 = (X^T X)^{-1} X^T y$$
Key properties to know cold:

- Under the Gauss-Markov assumptions (linear model, exogenous errors, homoskedastic and uncorrelated noise), OLS is BLUE: the best linear unbiased estimator.
- The residuals are orthogonal to the columns of $X$.
- Squared loss makes OLS sensitive to outliers, and the estimate is unstable when $X^T X$ is near-singular (multicollinearity).
In a quant context, OLS is the backbone of factor models. When you regress returns on Fama-French factors, you are running OLS.
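As a sketch of this factor-model use case, here is OLS via the normal equations on synthetic data (pure NumPy; the factor loadings and noise level are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "factor" data: returns driven by two factors plus noise.
# The loadings and noise level are made up for illustration.
n = 500
X = rng.standard_normal((n, 2))           # factor returns (e.g. market, size)
beta_true = np.array([0.5, 0.3])          # true factor loadings
y = X @ beta_true + 0.01 * rng.standard_normal(n)

# OLS via the normal equations: solve (X'X) beta = X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # recovers loadings close to [0.5, 0.3]
```

In practice you would use a library routine (e.g. `np.linalg.lstsq`) rather than forming $X^T X$ explicitly, since the normal equations square the condition number.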
When $d$ is large relative to $n$ (many features, limited history), OLS overfits. Regularization adds a penalty on the size of the coefficients:
Ridge regression (L2):

$$\hat{\beta}^{\text{ridge}} = \arg\min_\beta \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2$$
Ridge shrinks coefficients toward zero but never sets them exactly to zero. The closed-form solution is $\hat{\beta}^{\text{ridge}} = (X^T X + \lambda I)^{-1} X^T y$. Notice that adding $\lambda I$ makes the matrix invertible even when $X^T X$ is singular: this is why Ridge works when OLS fails due to multicollinearity.
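A minimal sketch of that invertibility point, using two perfectly collinear columns so that $X^T X$ is singular (synthetic data; $\lambda = 1$ is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two perfectly collinear features: X'X is rank-deficient, so plain OLS fails
n = 100
x1 = rng.standard_normal(n)
X = np.column_stack([x1, 2 * x1])          # second column is exactly 2x the first
y = x1 + 0.1 * rng.standard_normal(n)

lam = 1.0                                  # illustrative penalty strength
d = X.shape[1]
# Ridge closed form: (X'X + lambda*I)^(-1) X'y is well-defined despite collinearity
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print(beta_ridge)  # finite coefficients, with weight split across the collinear pair
```

Because the L2 penalty prefers small coefficients, Ridge spreads the predictive weight across the collinear pair instead of picking one column arbitrarily.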
Lasso regression (L1):

$$\hat{\beta}^{\text{lasso}} = \arg\min_\beta \|y - X\beta\|_2^2 + \lambda \|\beta\|_1$$
Lasso performs feature selection by driving some coefficients exactly to zero. This is invaluable in quant finance where you might have hundreds of candidate alpha signals but only a handful are truly predictive.
Elastic Net combines both penalties: $\lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2$. Use it when features are correlated (Lasso tends to pick one feature from a group of correlated features; Elastic Net keeps the group).
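A sketch of Lasso's selection behavior, assuming scikit-learn is available (the signal counts, coefficients, and `alpha` value are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)

# 50 candidate "signals", only the first 3 truly predictive
n, d = 200, 50
X = rng.standard_normal((n, d))
beta = np.zeros(d)
beta[:3] = [1.0, -0.8, 0.5]
y = X @ beta + 0.1 * rng.standard_normal(n)

# L1 penalty drives most coefficients exactly to zero
lasso = Lasso(alpha=0.05).fit(X, y)
n_selected = int(np.sum(lasso.coef_ != 0))
print(n_selected)  # far fewer than 50: the noise signals are zeroed out
```

Note that scikit-learn's `Lasso` scales the squared-error term by $1/(2n)$, so its `alpha` is not directly comparable to the $\lambda$ in the formula above.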
For binary classification ($y \in \{0, 1\}$), logistic regression models the probability:

$$P(y = 1 \mid x) = \sigma(\beta^T x)$$
where $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function. The decision boundary is at $P(y = 1 \mid x) = 0.5$, i.e., $\beta^T x = 0$.
Parameters are estimated by maximum likelihood (no closed form; use gradient descent or Newton's method). The loss function is binary cross-entropy:

$$\mathcal{L}(\beta) = -\frac{1}{n} \sum_{i=1}^n \left[ y_i \log \hat{p}_i + (1 - y_i) \log(1 - \hat{p}_i) \right], \qquad \hat{p}_i = \sigma(\beta^T x_i)$$
Quant application: Logistic regression is widely used for trade classification (buy vs. sell), default prediction, and regime detection (risk-on vs. risk-off).
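A bare-bones sketch of the maximum-likelihood fit by gradient descent on synthetic binary data (pure NumPy; the true coefficients, learning rate, and iteration count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy binary data: label probability depends on one feature through a sigmoid
n = 1000
x = rng.standard_normal((n, 1))
X = np.column_stack([np.ones(n), x])      # intercept + feature
beta_true = np.array([-0.5, 2.0])
p = 1 / (1 + np.exp(-(X @ beta_true)))
y = (rng.uniform(size=n) < p).astype(float)

# Maximum likelihood by gradient descent on the binary cross-entropy loss
beta = np.zeros(2)
lr = 0.1
for _ in range(2000):
    p_hat = 1 / (1 + np.exp(-(X @ beta)))
    grad = X.T @ (p_hat - y) / n          # gradient of BCE w.r.t. beta
    beta -= lr * grad
print(beta)  # approaches beta_true as n grows
```

The gradient $X^T(\hat{p} - y)/n$ has the same form as the OLS gradient with predictions passed through the sigmoid, which is why the two models are often taught together.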
SVM finds the maximum-margin hyperplane that separates two classes. Given a linear decision boundary $w^T x + b = 0$ and labels $y_i \in \{-1, +1\}$, the margin is $\frac{2}{\|w\|}$. The SVM optimization problem is:

$$\min_{w, b} \ \frac{1}{2} \|w\|^2 \quad \text{subject to} \quad y_i (w^T x_i + b) \ge 1 \ \text{ for all } i$$
Support vectors are the data points that lie on the margin boundaries ($y_i (w^T x_i + b) = 1$). Only these points determine the decision boundary; the rest could be removed without changing the classifier.
When data is not linearly separable, map it to a higher-dimensional space via a feature map $\phi(x)$. The kernel trick computes $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$ without ever computing $\phi$ explicitly.
Common kernels:

- Linear: $K(x_i, x_j) = x_i^T x_j$
- Polynomial: $K(x_i, x_j) = (x_i^T x_j + c)^p$
- RBF (Gaussian): $K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$, the usual default for non-linear problems
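To see why the kernel choice matters, a sketch assuming scikit-learn: concentric rings are hopeless for a linear kernel but easy for RBF (the geometry and sample sizes are invented for illustration):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)

# Two concentric rings: not linearly separable in the original 2-D space
n = 200
r = np.concatenate([rng.uniform(0, 1, n), rng.uniform(2, 3, n)])
theta = rng.uniform(0, 2 * np.pi, 2 * n)
X = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
y = np.concatenate([np.zeros(n), np.ones(n)])

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf").fit(X, y)
# RBF training accuracy is near 1.0; the linear kernel is near chance
print(linear.score(X, y), rbf.score(X, y))
```

In radial coordinates the classes are separated by a simple threshold on $r$, which is exactly the kind of transformation the RBF kernel captures implicitly.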
A decision tree recursively partitions the feature space by choosing splits that maximize purity of the resulting subsets.
Gini impurity for a node with class proportions $p_1, \dots, p_K$:

$$G = 1 - \sum_{k=1}^{K} p_k^2$$
Information gain uses entropy $H = -\sum_{k=1}^{K} p_k \log_2 p_k$. The split that maximizes $H(\text{parent}) - \sum_j \frac{n_j}{n} H(\text{child}_j)$ is chosen.
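Both impurity measures are one-liners; a minimal sketch:

```python
import numpy as np

def gini(p):
    """Gini impurity 1 - sum(p_k^2) for class proportions p."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """Entropy -sum(p_k log2 p_k), treating 0 * log(0) as 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(gini([0.5, 0.5]))     # 0.5  (maximally impure binary node)
print(gini([1.0, 0.0]))     # 0.0  (pure node)
print(entropy([0.5, 0.5]))  # 1.0
```

Both measures are maximized at uniform class proportions and zero for a pure node, which is why trees trained with either criterion usually look similar.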
Decision trees are interpretable and handle non-linear relationships naturally, but they overfit aggressively. This motivates ensembles (next lesson).
The expected prediction error at a point $x$ decomposes as:

$$\mathbb{E}\left[(y - \hat{f}(x))^2\right] = \underbrace{\left(\mathbb{E}[\hat{f}(x)] - f(x)\right)^2}_{\text{bias}^2} + \underbrace{\mathrm{Var}[\hat{f}(x)]}_{\text{variance}} + \underbrace{\sigma^2}_{\text{irreducible noise}}$$
| Model | Bias | Variance |
|---|---|---|
| OLS (few features) | Low | Low |
| OLS (many features) | Low | High |
| Ridge/Lasso | Slightly higher | Much lower |
| Decision tree (deep) | Low | Very high |
| SVM (RBF kernel) | Low | Can be high |
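The decomposition can be estimated by simulation: refit a model on many noisy resamples of the same underlying function and measure the offset and spread of its prediction at one point. A sketch (the sine target, noise level, and polynomial degrees are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(5)

def true_f(x):
    return np.sin(x)

x_train = np.linspace(0, 3, 20)   # fixed design points
x0 = 1.5                          # evaluate bias/variance at this point
n_trials = 500

results = {}
for degree in (1, 10):
    preds = []
    for _ in range(n_trials):
        # fresh noisy sample of the same underlying function each trial
        y = true_f(x_train) + 0.3 * rng.standard_normal(x_train.size)
        coefs = np.polyfit(x_train, y, degree)
        preds.append(np.polyval(coefs, x0))
    preds = np.array(preds)
    bias2 = (preds.mean() - true_f(x0)) ** 2
    variance = preds.var()
    results[degree] = (bias2, variance)
    print(degree, bias2, variance)
# degree 1: high bias, low variance; degree 10: low bias, high variance
```

The degree-1 fit misses the curvature of the sine (bias) but barely moves across resamples; the degree-10 fit tracks every noisy sample (variance) while nailing the shape on average.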
Understanding this tradeoff is the single most tested ML concept in quant interviews.
| Scenario | Best choice | Why |
|---|---|---|
| Few features, linear relationship | OLS or Ridge | Simple, interpretable, fast |
| Many features, need selection | Lasso or Elastic Net | Automatic sparsity |
| Binary classification, need probabilities | Logistic regression | Calibrated probabilities |
| Complex boundary, small-to-medium data | SVM with RBF kernel | Effective in high dimensions |
| Need interpretability + non-linear | Decision tree (shallow) | Easy to explain to PMs |
Interview tip: When asked "which model would you use?", never jump to a complex model. Start with logistic regression or Ridge, explain the baseline, then discuss when to graduate to more complex methods. Interviewers reward structured thinking over name-dropping XGBoost.