Machine Learning Cheat Sheet

Preparing for ML Breadth and Depth for Interviews
ML
Author

Zeel B Patel

Published

April 24, 2025

Probability

Expectation

\[\begin{align*} \mathbb{E}[X] &= \sum_{x} x P(X=x) \\ \mathbb{E}[X] &= \int_{-\infty}^{\infty} x p(x) dx \\ \bar{x} &= \frac{1}{n} \sum_{i=1}^{n} x_i \\ \mathbb{E}[aX + b] &= a\mathbb{E}[X] + b \end{align*}\]

Variance

\[\begin{align*} \text{Var}(X) &= \mathbb{E}[X^2] - (\mathbb{E}[X])^2 \\ \text{Var}(X) &= \sum_{x} (x - \mathbb{E}[X])^2 P(X=x) \\ \text{Var}(X) &= \int_{-\infty}^{\infty} (x - \mathbb{E}[X])^2 p(x) dx\\ \text{Var}(X) &= \frac{1}{n} \sum_{i=1}^{n} (x_i - \mathbb{E}[X])^2 \\ \text{Var}(aX + b) = a^2 \text{Var}(X) \end{align*}\]

Standard Deviation

\[\begin{align*} \text{SD}(X) &= \sqrt{\text{Var}(X)} \end{align*}\]

Covariance

\[\begin{align*} \text{Cov}(X, Y) &= \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])] \\ \text{Cov}(X, Y) &= \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y] \\ \text{Cov}(X, Y) &= \sum_{x,y} (x - \mathbb{E}[X])(y - \mathbb{E}[Y]) P(X=x, Y=y) \\ \text{Cov}(X, Y) &= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (x - \mathbb{E}[X])(y - \mathbb{E}[Y]) p(x,y) dx dy \\ \text{Cov}(X, Y) &= \frac{1}{n} \sum_{i=1}^{n} (x_i - \mathbb{E}[X])(y_i - \mathbb{E}[Y]) \\ \end{align*}\]

Correlation

\[\begin{align*} \text{Corr}(X, Y) &= \frac{\text{Cov}(X, Y)}{\sqrt{\text{Var}(X) \text{Var}(Y)}} \\ &= \frac{\text{Cov}(X, Y)}{\text{SD}(X) \text{SD}(Y)} \\ \end{align*}\]

Linearity of Expectation

\[\begin{align*} \mathbb{E}[X + Y] &= \mathbb{E}[X] + \mathbb{E}[Y] \\ \mathbb{E}\left[\sum_{i=1}^{n} X_i\right] &= \sum_{i=1}^{n} \mathbb{E}[X_i] \end{align*}\]

Classification

Logistic Regression

Entity Math Shape Type
Features \(X\) (n, d) Continuous
Weights \(W\) (d, 1) Continuous
Bias \(b\) (1, 1) Continuous
Output \(\boldsymbol{y}\) (n, 1) Binary

\[\begin{align*} \hat{\boldsymbol{y}} &= \sigma(XW + b) \\ &= \sigma(\underbrace{X}_{n \times d} \underbrace{W}_{d \times 1} + \underbrace{b}_{1}) \\ &= \sigma(\underbrace{\boldsymbol{z}}_{n \times 1}) \\ \text{where } \sigma(a) &= \frac{1}{1 + e^{-a}} \end{align*}\]

\[\begin{align*} \text{Loss} = -\frac{1}{n} \sum_{i=1}^{n} \begin{cases} \log(\hat{\boldsymbol{y}}_i) & \text{if } \boldsymbol{y}_i = 1 \\ \log(1 - \hat{\boldsymbol{y}}_i) & \text{if } \boldsymbol{y}_i = 0 \end{cases} \end{align*}\]

Another way to write the loss function is:

\[\begin{align*} \text{Loss} = -\frac{1}{n} \sum_{i=1}^{n} \left( \boldsymbol{y}_i \log(\hat{\boldsymbol{y}}_i) + (1 - \boldsymbol{y}_i) \log(1 - \hat{\boldsymbol{y}}_i)\right) \end{align*}\]

Multi-Class Logistic Regression

Entity Math Shape Type
Features \(X\) (n, d) Continuous
Weights \(W\) (d, k) Continuous
Bias \(\boldsymbol{b}\) (1, k) Continuous
Output \(Y\) (n, k) One-Hot

\[\begin{align*} \hat{Y} &= \text{softmax}(XW + \boldsymbol{b}) \\ &= \text{softmax}(\underbrace{X}_{n \times d} \underbrace{W}_{d \times k} + \underbrace{\boldsymbol{b}}_{1 \times k}) \\ &= \text{softmax}(\underbrace{Z}_{n \times k}) \\ \text{where } \hat{Y}_{ij} = \text{softmax}(Z_{ij}) &= \frac{e^{Z_{ij}}}{\sum_{j=1}^{k} e^{Z_{ij}}} \end{align*}\]

\[\begin{align*} \text{Loss} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{k} y_{ij} \log(\hat{y}_{ij}) \end{align*}\]

Metrics

Accuracy

\[\begin{align*} \text{Accuracy} &= \frac{\text{Correct Predictions}}{\text{Total Predictions}} \\ &= \frac{TP + TN}{TP + TN + FP + FN} \end{align*}\]

True Positive Rate (TPR)

\[\begin{align*} \text{True Positive Rate} &= \frac{\text{True Positives}}{\text{Actual Positives}} \\ &= \frac{TP}{TP + FN} \end{align*}\]

False Positive Rate (FPR)

\[\begin{align*} \text{False Positive Rate} &= \frac{\text{False Positives}}{\text{Actual Negatives}} \\ &= \frac{FP}{FP + TN} \end{align*}\]

ROC Curve (Receiver Operating Characteristic Curve)

  • X-axis: FPR, Y-axis: TPR

Area Under the ROC Curve (AUC)

  • AUC = 1: Perfect model
  • AUC = 0.5: Random model
  • AUC < 0.5: Model is worse than random

Precision

\[\begin{align*} \text{Precision} &= \frac{\text{True Positives}}{\text{Predicted Positives}} \\ &= \frac{TP}{TP + FP} \end{align*}\]

Recall

\[\begin{align*} \text{Recall} &= \frac{\text{True Positives}}{\text{Actual Positives}} \\ &= \frac{TP}{TP + FN} \end{align*}\]

F1 Score

Harmonic mean of precision and recall.

\[\begin{align*} \text{F1 Score} &= 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \\ &= \frac{2TP}{2TP + FP + FN} \end{align*}\]

Matthews Correlation Coefficient

Similar to Pearson correlation coefficient, but for binary classification.

The Matthews correlation coefficient (MCC) is pearson correlation coefficient between the observed and predicted binary classifications.

Derivation: (Homework) Range: -1 to 1

  • 1: Total disagreement
  • 0: Random agreement
  • 1: Perfect agreement

Formula:

\[\begin{align*} \text{MCC} &= \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \end{align*}\]

Cohen’s Kappa

Definition: Cohen’s Kappa is a statistic that measures agreement between two rates while correcting for chance. Range: -1 to 1

  • 1: Total disagreement
  • 0: Random agreement
  • 1: Perfect agreement

Formula:

\[\begin{align*} \text{Cohen's Kappa} &= \frac{P_o - P_e}{1 - P_e} \\ P_o &= \text{Accuracy} \\ P_e &= \sum_{i=1}^{k} \frac{\text{predicted}_i}{\text{total samples}} \cdot \frac{\text{actual}_i}{\text{total samples}} \end{align*}\]

Regression

Linear Regression

Entity Math Shape Type
Features \(X\) (n, d) Continuous
Weights \(W\) (d, 1) Continuous
Bias \(b\) (1, 1) Continuous
Output \(\boldsymbol{y}\) (n, 1) Continuous

\[\begin{align*} \hat{\boldsymbol{y}} &= XW + b \\ &= \underbrace{X}_{n \times d} \underbrace{W}_{d \times 1} + \underbrace{b}_{1} \\ &= \underbrace{\boldsymbol{z}}_{n \times 1} \end{align*}\]

\[\begin{align*} \text{Mean Squared Loss} &= \frac{1}{n} \sum_{i=1}^{n} (\boldsymbol{y}_i - \hat{\boldsymbol{y}}_i)^2 \end{align*}\]

Normal Equation:

\[\begin{align*} \tilde{X} &= \begin{bmatrix} 1 & X \end{bmatrix} \\ \tilde{W} &= \begin{bmatrix} b \\ W \end{bmatrix} \\ \hat{\boldsymbol{y}} &= \tilde{X} \tilde{W} \\ \tilde{W} &= (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T \boldsymbol{y} \end{align*}\]

Metrics

Mean Absolute Error (MAE)

\[\begin{align*} \text{MAE} &= \frac{1}{n} \sum_{i=1}^{n} |\boldsymbol{y}_i - \hat{\boldsymbol{y}}_i| \end{align*}\]

Mean Squared Error (MSE)

\[\begin{align*} \text{MSE} &= \frac{1}{n} \sum_{i=1}^{n} (\boldsymbol{y}_i - \hat{\boldsymbol{y}}_i)^2 \end{align*}\]

Root Mean Squared Error (RMSE)

\[\begin{align*} \text{RMSE} &= \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\boldsymbol{y}_i - \hat{\boldsymbol{y}}_i)^2} \end{align*}\]

R-squared (Coefficient of Determination)

\[\begin{align*} \text{R-squared} &= 1 - \frac{\text{SS}_{\text{res}}}{\text{SS}_{\text{tot}}} \\ \text{where } \text{SS}_{\text{res}} &= \sum_{i=1}^{n} (\boldsymbol{y}_i - \hat{\boldsymbol{y}}_i)^2 \\ \text{and } \text{SS}_{\text{tot}} &= \sum_{i=1}^{n} (\boldsymbol{y}_i - \bar{\boldsymbol{y}})^2 \end{align*}\]

Mean Absolute Percentage Error (MAPE)

\[\begin{align*} \text{MAPE} &= \frac{1}{n} \sum_{i=1}^{n} \left| \frac{\boldsymbol{y}_i - \hat{\boldsymbol{y}}_i}{\boldsymbol{y}_i} \right| \times 100 \end{align*}\]

Clustering

K-Means Clustering

  • Randomly initialize \(k\) centroids

\[\begin{align*} \mu_j &= \text{randomly initialize } j \text{th centroid} \end{align*}\]

  • Assign each data point to the nearest centroid

\[\begin{align*} \hat{y}_i &= \arg\min_{j} ||x_i - \mu_j||^2 \end{align*}\]

  • Update the centroids

\[\begin{align*} \mu_j &= \frac{1}{n_j} \sum_{i: \hat{y}_i = j} x_i \end{align*}\]

  • K-Means is guaranteed to converge, but not necessarily to the global optimum.
  • K-Means is sensitive to the initial placement of centroids.
  • K-Means is sensitive to outliers.

Bias-Variance Tradeoff

High Bias High Variance
Simple model Complex model
Underfitting Overfitting
High training error Low training error
High test error High test error
Not sensitive to noise Sensitive to noise

Decompose the error into three components:

  • This decomposition is applied to a single point \((x, y)\).
  • Train K models on different training datasets of the same distribution.
  • All expectations are taken over predictions \(\hat{y}_i\) of the K models on the same point \((x, y)\) where \(i = 1, \ldots, K\).

\[\begin{align*} \text{Error} &= \text{Variance} + \text{Bias}^2 + \sigma^2 \\ E[(\hat{y} - y)^2] &= E[(\hat{y} - \mathbb{E}[\hat{y}])^2] + (\mathbb{E}[\hat{y}] - y)^2 + \sigma^2 \end{align*}\]

Optimization

Gradient Descent

\[\begin{align*} \theta_{t+1} &= \theta_t - \alpha \nabla J(\theta_t) \\ \text{where } \alpha &= \text{learning rate} \\ \text{and } J(\theta) &= \text{cost function} \end{align*}\]

Stochastic Gradient Descent (SGD)

\[\begin{align*} \theta_{t+1} &= \theta_t - \alpha \nabla J(\theta_t, x_i, y_i) \\ \text{where } (x_i, y_i) &= \text{randomly selected training example} \end{align*}\]

Mini-Batch Gradient Descent

\[\begin{align*} \theta_{t+1} &= \theta_t - \alpha \nabla J(\theta_t, B) \\ \text{where } B &= \text{mini-batch of training examples} \end{align*}\]

SGD with Momentum

\[\begin{align*} v_t &= \beta v_{t-1} + (1 - \beta) \nabla J(\theta_t) \\ \theta_{t+1} &= \theta_t - \alpha v_t \end{align*}\] where \(\beta\) is the momentum term (usually set to 0.9).

RMSProp

\[\begin{align*} s_t &= \beta s_{t-1} + (1 - \beta) (\nabla J(\theta_t))^2 \\ \theta_{t+1} &= \theta_t - \frac{\alpha}{\sqrt{s_t} + \epsilon} \nabla J(\theta_t) \end{align*}\] where \(\epsilon\) is a small constant to prevent division by zero.

Adam

\[\begin{align*} m_t &= \beta_1 m_{t-1} + (1 - \beta_1) \nabla J(\theta_t) \\ s_t &= \beta_2 s_{t-1} + (1 - \beta_2) (\nabla J(\theta_t))^2 \\ \hat{m}_t &= \frac{m_t}{1 - \beta_1^t} \\ \hat{s}_t &= \frac{s_t}{1 - \beta_2^t} \\ \theta_{t+1} &= \theta_t - \frac{\alpha}{\sqrt{\hat{s}_t} + \epsilon} \hat{m}_t \end{align*}\] where \(\beta_1\) and \(\beta_2\) are the exponential decay rates for the first and second moment estimates, respectively (usually set to 0.9 and 0.999).