Machine Learning Cheat Sheet

Probability

Expectation

\[\begin{align*} \mathbb{E}[X] &= \sum_{x} x P(X=x) \\ \mathbb{E}[X] &= \int_{-\infty}^{\infty} x p(x) dx \\ \bar{x} &= \frac{1}{n} \sum_{i=1}^{n} x_i \\ \mathbb{E}[aX + b] &= a\mathbb{E}[X] + b \end{align*}\]

Variance

\[\begin{align*} \text{Var}(X) &= \mathbb{E}[X^2] - (\mathbb{E}[X])^2 \\ \text{Var}(X) &= \sum_{x} (x - \mathbb{E}[X])^2 P(X=x) \\ \text{Var}(X) &= \int_{-\infty}^{\infty} (x - \mathbb{E}[X])^2 p(x) dx\\ \text{Var}(X) &= \frac{1}{n} \sum_{i=1}^{n} (x_i - \mathbb{E}[X])^2 \\ \text{Var}(aX + b) = a^2 \text{Var}(X) \end{align*}\]

Standard Deviation

\[\begin{align*} \text{SD}(X) &= \sqrt{\text{Var}(X)} \end{align*}\]

Covariance

\[\begin{align*} \text{Cov}(X, Y) &= \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])] \\ \text{Cov}(X, Y) &= \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y] \\ \text{Cov}(X, Y) &= \sum_{x,y} (x - \mathbb{E}[X])(y - \mathbb{E}[Y]) P(X=x, Y=y) \\ \text{Cov}(X, Y) &= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (x - \mathbb{E}[X])(y - \mathbb{E}[Y]) p(x,y) dx dy \\ \text{Cov}(X, Y) &= \frac{1}{n} \sum_{i=1}^{n} (x_i - \mathbb{E}[X])(y_i - \mathbb{E}[Y]) \\ \end{align*}\]

Correlation

\[\begin{align*} \text{Corr}(X, Y) &= \frac{\text{Cov}(X, Y)}{\sqrt{\text{Var}(X) \text{Var}(Y)}} \\ &= \frac{\text{Cov}(X, Y)}{\text{SD}(X) \text{SD}(Y)} \\ \end{align*}\]

Linearity of Expectation

\[\begin{align*} \mathbb{E}[X + Y] &= \mathbb{E}[X] + \mathbb{E}[Y] \\ \mathbb{E}\left[\sum_{i=1}^{n} X_i\right] &= \sum_{i=1}^{n} \mathbb{E}[X_i] \end{align*}\]

Classification

Logistic Regression

Entity	Math	Shape	Type
Features	\(X\)	(n, d)	Continuous
Weights	\(W\)	(d, 1)	Continuous
Bias	\(b\)	(1, 1)	Continuous
Output	\(\boldsymbol{y}\)	(n, 1)	Binary

\[\begin{align*} \hat{\boldsymbol{y}} &= \sigma(XW + b) \\ &= \sigma(\underbrace{X}_{n \times d} \underbrace{W}_{d \times 1} + \underbrace{b}_{1}) \\ &= \sigma(\underbrace{\boldsymbol{z}}_{n \times 1}) \\ \text{where } \sigma(a) &= \frac{1}{1 + e^{-a}} \end{align*}\]

\[\begin{align*} \text{Loss} = -\frac{1}{n} \sum_{i=1}^{n} \begin{cases} \log(\hat{\boldsymbol{y}}_i) & \text{if } \boldsymbol{y}_i = 1 \\ \log(1 - \hat{\boldsymbol{y}}_i) & \text{if } \boldsymbol{y}_i = 0 \end{cases} \end{align*}\]

Another way to write the loss function is:

\[\begin{align*} \text{Loss} = -\frac{1}{n} \sum_{i=1}^{n} \left( \boldsymbol{y}_i \log(\hat{\boldsymbol{y}}_i) + (1 - \boldsymbol{y}_i) \log(1 - \hat{\boldsymbol{y}}_i)\right) \end{align*}\]

Multi-Class Logistic Regression

Entity	Math	Shape	Type
Features	\(X\)	(n, d)	Continuous
Weights	\(W\)	(d, k)	Continuous
Bias	\(\boldsymbol{b}\)	(1, k)	Continuous
Output	\(Y\)	(n, k)	One-Hot

\[\begin{align*} \hat{Y} &= \text{softmax}(XW + \boldsymbol{b}) \\ &= \text{softmax}(\underbrace{X}_{n \times d} \underbrace{W}_{d \times k} + \underbrace{\boldsymbol{b}}_{1 \times k}) \\ &= \text{softmax}(\underbrace{Z}_{n \times k}) \\ \text{where } \hat{Y}_{ij} = \text{softmax}(Z_{ij}) &= \frac{e^{Z_{ij}}}{\sum_{j=1}^{k} e^{Z_{ij}}} \end{align*}\]

\[\begin{align*} \text{Loss} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{k} y_{ij} \log(\hat{y}_{ij}) \end{align*}\]

Metrics

Accuracy

\[\begin{align*} \text{Accuracy} &= \frac{\text{Correct Predictions}}{\text{Total Predictions}} \\ &= \frac{TP + TN}{TP + TN + FP + FN} \end{align*}\]

True Positive Rate (TPR)

\[\begin{align*} \text{True Positive Rate} &= \frac{\text{True Positives}}{\text{Actual Positives}} \\ &= \frac{TP}{TP + FN} \end{align*}\]

False Positive Rate (FPR)

\[\begin{align*} \text{False Positive Rate} &= \frac{\text{False Positives}}{\text{Actual Negatives}} \\ &= \frac{FP}{FP + TN} \end{align*}\]

ROC Curve (Receiver Operating Characteristic Curve)

X-axis: FPR, Y-axis: TPR

Area Under the ROC Curve (AUC)

AUC = 1: Perfect model
AUC = 0.5: Random model
AUC < 0.5: Model is worse than random

Precision

\[\begin{align*} \text{Precision} &= \frac{\text{True Positives}}{\text{Predicted Positives}} \\ &= \frac{TP}{TP + FP} \end{align*}\]

Recall

\[\begin{align*} \text{Recall} &= \frac{\text{True Positives}}{\text{Actual Positives}} \\ &= \frac{TP}{TP + FN} \end{align*}\]

F1 Score

Harmonic mean of precision and recall.

\[\begin{align*} \text{F1 Score} &= 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \\ &= \frac{2TP}{2TP + FP + FN} \end{align*}\]

Matthews Correlation Coefficient

Similar to Pearson correlation coefficient, but for binary classification.

The Matthews correlation coefficient (MCC) is pearson correlation coefficient between the observed and predicted binary classifications.

Derivation: (Homework) Range: -1 to 1

1: Total disagreement
0: Random agreement
1: Perfect agreement

Formula:

\[\begin{align*} \text{MCC} &= \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \end{align*}\]

Cohen’s Kappa

Definition: Cohen’s Kappa is a statistic that measures agreement between two rates while correcting for chance. Range: -1 to 1

1: Total disagreement
0: Random agreement
1: Perfect agreement

Formula:

\[\begin{align*} \text{Cohen's Kappa} &= \frac{P_o - P_e}{1 - P_e} \\ P_o &= \text{Accuracy} \\ P_e &= \sum_{i=1}^{k} \frac{\text{predicted}_i}{\text{total samples}} \cdot \frac{\text{actual}_i}{\text{total samples}} \end{align*}\]

Regression

Linear Regression

Entity	Math	Shape	Type
Features	\(X\)	(n, d)	Continuous
Weights	\(W\)	(d, 1)	Continuous
Bias	\(b\)	(1, 1)	Continuous
Output	\(\boldsymbol{y}\)	(n, 1)	Continuous

\[\begin{align*} \hat{\boldsymbol{y}} &= XW + b \\ &= \underbrace{X}_{n \times d} \underbrace{W}_{d \times 1} + \underbrace{b}_{1} \\ &= \underbrace{\boldsymbol{z}}_{n \times 1} \end{align*}\]

\[\begin{align*} \text{Mean Squared Loss} &= \frac{1}{n} \sum_{i=1}^{n} (\boldsymbol{y}_i - \hat{\boldsymbol{y}}_i)^2 \end{align*}\]

Normal Equation:

\[\begin{align*} \tilde{X} &= \begin{bmatrix} 1 & X \end{bmatrix} \\ \tilde{W} &= \begin{bmatrix} b \\ W \end{bmatrix} \\ \hat{\boldsymbol{y}} &= \tilde{X} \tilde{W} \\ \tilde{W} &= (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T \boldsymbol{y} \end{align*}\]

Metrics

Mean Absolute Error (MAE)

\[\begin{align*} \text{MAE} &= \frac{1}{n} \sum_{i=1}^{n} |\boldsymbol{y}_i - \hat{\boldsymbol{y}}_i| \end{align*}\]

Mean Squared Error (MSE)

\[\begin{align*} \text{MSE} &= \frac{1}{n} \sum_{i=1}^{n} (\boldsymbol{y}_i - \hat{\boldsymbol{y}}_i)^2 \end{align*}\]

Root Mean Squared Error (RMSE)

\[\begin{align*} \text{RMSE} &= \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\boldsymbol{y}_i - \hat{\boldsymbol{y}}_i)^2} \end{align*}\]

R-squared (Coefficient of Determination)

\[\begin{align*} \text{R-squared} &= 1 - \frac{\text{SS}_{\text{res}}}{\text{SS}_{\text{tot}}} \\ \text{where } \text{SS}_{\text{res}} &= \sum_{i=1}^{n} (\boldsymbol{y}_i - \hat{\boldsymbol{y}}_i)^2 \\ \text{and } \text{SS}_{\text{tot}} &= \sum_{i=1}^{n} (\boldsymbol{y}_i - \bar{\boldsymbol{y}})^2 \end{align*}\]

Mean Absolute Percentage Error (MAPE)

\[\begin{align*} \text{MAPE} &= \frac{1}{n} \sum_{i=1}^{n} \left| \frac{\boldsymbol{y}_i - \hat{\boldsymbol{y}}_i}{\boldsymbol{y}_i} \right| \times 100 \end{align*}\]

Clustering

K-Means Clustering

Randomly initialize \(k\) centroids

\[\begin{align*} \mu_j &= \text{randomly initialize } j \text{th centroid} \end{align*}\]

Assign each data point to the nearest centroid

\[\begin{align*} \hat{y}_i &= \arg\min_{j} ||x_i - \mu_j||^2 \end{align*}\]

Update the centroids

\[\begin{align*} \mu_j &= \frac{1}{n_j} \sum_{i: \hat{y}_i = j} x_i \end{align*}\]

K-Means is guaranteed to converge, but not necessarily to the global optimum.
K-Means is sensitive to the initial placement of centroids.
K-Means is sensitive to outliers.

Bias-Variance Tradeoff

High Bias	High Variance
Simple model	Complex model
Underfitting	Overfitting
High training error	Low training error
High test error	High test error
Not sensitive to noise	Sensitive to noise

Decompose the error into three components:

This decomposition is applied to a single point \((x, y)\).
Train K models on different training datasets of the same distribution.
All expectations are taken over predictions \(\hat{y}_i\) of the K models on the same point \((x, y)\) where \(i = 1, \ldots, K\).

\[\begin{align*} \text{Error} &= \text{Variance} + \text{Bias}^2 + \sigma^2 \\ E[(\hat{y} - y)^2] &= E[(\hat{y} - \mathbb{E}[\hat{y}])^2] + (\mathbb{E}[\hat{y}] - y)^2 + \sigma^2 \end{align*}\]

Optimization

Gradient Descent

\[\begin{align*} \theta_{t+1} &= \theta_t - \alpha \nabla J(\theta_t) \\ \text{where } \alpha &= \text{learning rate} \\ \text{and } J(\theta) &= \text{cost function} \end{align*}\]

Stochastic Gradient Descent (SGD)

\[\begin{align*} \theta_{t+1} &= \theta_t - \alpha \nabla J(\theta_t, x_i, y_i) \\ \text{where } (x_i, y_i) &= \text{randomly selected training example} \end{align*}\]

Mini-Batch Gradient Descent

\[\begin{align*} \theta_{t+1} &= \theta_t - \alpha \nabla J(\theta_t, B) \\ \text{where } B &= \text{mini-batch of training examples} \end{align*}\]

SGD with Momentum

\[\begin{align*} v_t &= \beta v_{t-1} + (1 - \beta) \nabla J(\theta_t) \\ \theta_{t+1} &= \theta_t - \alpha v_t \end{align*}\] where \(\beta\) is the momentum term (usually set to 0.9).

RMSProp

\[\begin{align*} s_t &= \beta s_{t-1} + (1 - \beta) (\nabla J(\theta_t))^2 \\ \theta_{t+1} &= \theta_t - \frac{\alpha}{\sqrt{s_t} + \epsilon} \nabla J(\theta_t) \end{align*}\] where \(\epsilon\) is a small constant to prevent division by zero.

Adam

\[\begin{align*} m_t &= \beta_1 m_{t-1} + (1 - \beta_1) \nabla J(\theta_t) \\ s_t &= \beta_2 s_{t-1} + (1 - \beta_2) (\nabla J(\theta_t))^2 \\ \hat{m}_t &= \frac{m_t}{1 - \beta_1^t} \\ \hat{s}_t &= \frac{s_t}{1 - \beta_2^t} \\ \theta_{t+1} &= \theta_t - \frac{\alpha}{\sqrt{\hat{s}_t} + \epsilon} \hat{m}_t \end{align*}\] where \(\beta_1\) and \(\beta_2\) are the exponential decay rates for the first and second moment estimates, respectively (usually set to 0.9 and 0.999).