Probability
Expectation
\[\begin{align*} \mathbb{E}[X] &= \sum_{x} x P(X=x) \\ \mathbb{E}[X] &= \int_{-\infty}^{\infty} x p(x) dx \\ \bar{x} &= \frac{1}{n} \sum_{i=1}^{n} x_i \\ \mathbb{E}[aX + b] &= a\mathbb{E}[X] + b \end{align*}\]
Variance
\[\begin{align*} \text{Var}(X) &= \mathbb{E}[X^2] - (\mathbb{E}[X])^2 \\ \text{Var}(X) &= \sum_{x} (x - \mathbb{E}[X])^2 P(X=x) \\ \text{Var}(X) &= \int_{-\infty}^{\infty} (x - \mathbb{E}[X])^2 p(x) dx\\ \text{Var}(X) &= \frac{1}{n} \sum_{i=1}^{n} (x_i - \mathbb{E}[X])^2 \\ \text{Var}(aX + b) = a^2 \text{Var}(X) \end{align*}\]
Standard Deviation
\[\begin{align*} \text{SD}(X) &= \sqrt{\text{Var}(X)} \end{align*}\]
Covariance
\[\begin{align*} \text{Cov}(X, Y) &= \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])] \\ \text{Cov}(X, Y) &= \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y] \\ \text{Cov}(X, Y) &= \sum_{x,y} (x - \mathbb{E}[X])(y - \mathbb{E}[Y]) P(X=x, Y=y) \\ \text{Cov}(X, Y) &= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (x - \mathbb{E}[X])(y - \mathbb{E}[Y]) p(x,y) dx dy \\ \text{Cov}(X, Y) &= \frac{1}{n} \sum_{i=1}^{n} (x_i - \mathbb{E}[X])(y_i - \mathbb{E}[Y]) \\ \end{align*}\]
Correlation
\[\begin{align*} \text{Corr}(X, Y) &= \frac{\text{Cov}(X, Y)}{\sqrt{\text{Var}(X) \text{Var}(Y)}} \\ &= \frac{\text{Cov}(X, Y)}{\text{SD}(X) \text{SD}(Y)} \\ \end{align*}\]
Linearity of Expectation
\[\begin{align*} \mathbb{E}[X + Y] &= \mathbb{E}[X] + \mathbb{E}[Y] \\ \mathbb{E}\left[\sum_{i=1}^{n} X_i\right] &= \sum_{i=1}^{n} \mathbb{E}[X_i] \end{align*}\]
Classification
Logistic Regression
Entity | Math | Shape | Type |
---|---|---|---|
Features | \(X\) | (n, d) | Continuous |
Weights | \(W\) | (d, 1) | Continuous |
Bias | \(b\) | (1, 1) | Continuous |
Output | \(\boldsymbol{y}\) | (n, 1) | Binary |
\[\begin{align*} \hat{\boldsymbol{y}} &= \sigma(XW + b) \\ &= \sigma(\underbrace{X}_{n \times d} \underbrace{W}_{d \times 1} + \underbrace{b}_{1}) \\ &= \sigma(\underbrace{\boldsymbol{z}}_{n \times 1}) \\ \text{where } \sigma(a) &= \frac{1}{1 + e^{-a}} \end{align*}\]
\[\begin{align*} \text{Loss} = -\frac{1}{n} \sum_{i=1}^{n} \begin{cases} \log(\hat{\boldsymbol{y}}_i) & \text{if } \boldsymbol{y}_i = 1 \\ \log(1 - \hat{\boldsymbol{y}}_i) & \text{if } \boldsymbol{y}_i = 0 \end{cases} \end{align*}\]
Another way to write the loss function is:
\[\begin{align*} \text{Loss} = -\frac{1}{n} \sum_{i=1}^{n} \left( \boldsymbol{y}_i \log(\hat{\boldsymbol{y}}_i) + (1 - \boldsymbol{y}_i) \log(1 - \hat{\boldsymbol{y}}_i)\right) \end{align*}\]
Multi-Class Logistic Regression
Entity | Math | Shape | Type |
---|---|---|---|
Features | \(X\) | (n, d) | Continuous |
Weights | \(W\) | (d, k) | Continuous |
Bias | \(\boldsymbol{b}\) | (1, k) | Continuous |
Output | \(Y\) | (n, k) | One-Hot |
\[\begin{align*} \hat{Y} &= \text{softmax}(XW + \boldsymbol{b}) \\ &= \text{softmax}(\underbrace{X}_{n \times d} \underbrace{W}_{d \times k} + \underbrace{\boldsymbol{b}}_{1 \times k}) \\ &= \text{softmax}(\underbrace{Z}_{n \times k}) \\ \text{where } \hat{Y}_{ij} = \text{softmax}(Z_{ij}) &= \frac{e^{Z_{ij}}}{\sum_{j=1}^{k} e^{Z_{ij}}} \end{align*}\]
\[\begin{align*} \text{Loss} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{k} y_{ij} \log(\hat{y}_{ij}) \end{align*}\]
Metrics
Accuracy
\[\begin{align*} \text{Accuracy} &= \frac{\text{Correct Predictions}}{\text{Total Predictions}} \\ &= \frac{TP + TN}{TP + TN + FP + FN} \end{align*}\]
True Positive Rate (TPR)
\[\begin{align*} \text{True Positive Rate} &= \frac{\text{True Positives}}{\text{Actual Positives}} \\ &= \frac{TP}{TP + FN} \end{align*}\]
False Positive Rate (FPR)
\[\begin{align*} \text{False Positive Rate} &= \frac{\text{False Positives}}{\text{Actual Negatives}} \\ &= \frac{FP}{FP + TN} \end{align*}\]
ROC Curve (Receiver Operating Characteristic Curve)
- X-axis: FPR, Y-axis: TPR
Area Under the ROC Curve (AUC)
- AUC = 1: Perfect model
- AUC = 0.5: Random model
- AUC < 0.5: Model is worse than random
Precision
\[\begin{align*} \text{Precision} &= \frac{\text{True Positives}}{\text{Predicted Positives}} \\ &= \frac{TP}{TP + FP} \end{align*}\]
Recall
\[\begin{align*} \text{Recall} &= \frac{\text{True Positives}}{\text{Actual Positives}} \\ &= \frac{TP}{TP + FN} \end{align*}\]
F1 Score
Harmonic mean of precision and recall.
\[\begin{align*} \text{F1 Score} &= 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \\ &= \frac{2TP}{2TP + FP + FN} \end{align*}\]
Matthews Correlation Coefficient
Similar to Pearson correlation coefficient, but for binary classification.
The Matthews correlation coefficient (MCC) is pearson correlation coefficient between the observed and predicted binary classifications.
Derivation: (Homework) Range: -1 to 1
- 1: Total disagreement
- 0: Random agreement
- 1: Perfect agreement
Formula:
\[\begin{align*} \text{MCC} &= \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \end{align*}\]
Cohen’s Kappa
Definition: Cohen’s Kappa is a statistic that measures agreement between two rates while correcting for chance. Range: -1 to 1
- 1: Total disagreement
- 0: Random agreement
- 1: Perfect agreement
Formula:
\[\begin{align*} \text{Cohen's Kappa} &= \frac{P_o - P_e}{1 - P_e} \\ P_o &= \text{Accuracy} \\ P_e &= \sum_{i=1}^{k} \frac{\text{predicted}_i}{\text{total samples}} \cdot \frac{\text{actual}_i}{\text{total samples}} \end{align*}\]
Regression
Linear Regression
Entity | Math | Shape | Type |
---|---|---|---|
Features | \(X\) | (n, d) | Continuous |
Weights | \(W\) | (d, 1) | Continuous |
Bias | \(b\) | (1, 1) | Continuous |
Output | \(\boldsymbol{y}\) | (n, 1) | Continuous |
\[\begin{align*} \hat{\boldsymbol{y}} &= XW + b \\ &= \underbrace{X}_{n \times d} \underbrace{W}_{d \times 1} + \underbrace{b}_{1} \\ &= \underbrace{\boldsymbol{z}}_{n \times 1} \end{align*}\]
\[\begin{align*} \text{Mean Squared Loss} &= \frac{1}{n} \sum_{i=1}^{n} (\boldsymbol{y}_i - \hat{\boldsymbol{y}}_i)^2 \end{align*}\]
Normal Equation:
\[\begin{align*} \tilde{X} &= \begin{bmatrix} 1 & X \end{bmatrix} \\ \tilde{W} &= \begin{bmatrix} b \\ W \end{bmatrix} \\ \hat{\boldsymbol{y}} &= \tilde{X} \tilde{W} \\ \tilde{W} &= (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T \boldsymbol{y} \end{align*}\]
Metrics
Mean Absolute Error (MAE)
\[\begin{align*} \text{MAE} &= \frac{1}{n} \sum_{i=1}^{n} |\boldsymbol{y}_i - \hat{\boldsymbol{y}}_i| \end{align*}\]
Mean Squared Error (MSE)
\[\begin{align*} \text{MSE} &= \frac{1}{n} \sum_{i=1}^{n} (\boldsymbol{y}_i - \hat{\boldsymbol{y}}_i)^2 \end{align*}\]
Root Mean Squared Error (RMSE)
\[\begin{align*} \text{RMSE} &= \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\boldsymbol{y}_i - \hat{\boldsymbol{y}}_i)^2} \end{align*}\]
R-squared (Coefficient of Determination)
\[\begin{align*} \text{R-squared} &= 1 - \frac{\text{SS}_{\text{res}}}{\text{SS}_{\text{tot}}} \\ \text{where } \text{SS}_{\text{res}} &= \sum_{i=1}^{n} (\boldsymbol{y}_i - \hat{\boldsymbol{y}}_i)^2 \\ \text{and } \text{SS}_{\text{tot}} &= \sum_{i=1}^{n} (\boldsymbol{y}_i - \bar{\boldsymbol{y}})^2 \end{align*}\]
Mean Absolute Percentage Error (MAPE)
\[\begin{align*} \text{MAPE} &= \frac{1}{n} \sum_{i=1}^{n} \left| \frac{\boldsymbol{y}_i - \hat{\boldsymbol{y}}_i}{\boldsymbol{y}_i} \right| \times 100 \end{align*}\]
Clustering
K-Means Clustering
- Randomly initialize \(k\) centroids
\[\begin{align*} \mu_j &= \text{randomly initialize } j \text{th centroid} \end{align*}\]
- Assign each data point to the nearest centroid
\[\begin{align*} \hat{y}_i &= \arg\min_{j} ||x_i - \mu_j||^2 \end{align*}\]
- Update the centroids
\[\begin{align*} \mu_j &= \frac{1}{n_j} \sum_{i: \hat{y}_i = j} x_i \end{align*}\]
- K-Means is guaranteed to converge, but not necessarily to the global optimum.
- K-Means is sensitive to the initial placement of centroids.
- K-Means is sensitive to outliers.
Bias-Variance Tradeoff
High Bias | High Variance |
---|---|
Simple model | Complex model |
Underfitting | Overfitting |
High training error | Low training error |
High test error | High test error |
Not sensitive to noise | Sensitive to noise |
Decompose the error into three components:
- This decomposition is applied to a single point \((x, y)\).
- Train K models on different training datasets of the same distribution.
- All expectations are taken over predictions \(\hat{y}_i\) of the K models on the same point \((x, y)\) where \(i = 1, \ldots, K\).
\[\begin{align*} \text{Error} &= \text{Variance} + \text{Bias}^2 + \sigma^2 \\ E[(\hat{y} - y)^2] &= E[(\hat{y} - \mathbb{E}[\hat{y}])^2] + (\mathbb{E}[\hat{y}] - y)^2 + \sigma^2 \end{align*}\]
Optimization
Gradient Descent
\[\begin{align*} \theta_{t+1} &= \theta_t - \alpha \nabla J(\theta_t) \\ \text{where } \alpha &= \text{learning rate} \\ \text{and } J(\theta) &= \text{cost function} \end{align*}\]
Stochastic Gradient Descent (SGD)
\[\begin{align*} \theta_{t+1} &= \theta_t - \alpha \nabla J(\theta_t, x_i, y_i) \\ \text{where } (x_i, y_i) &= \text{randomly selected training example} \end{align*}\]
Mini-Batch Gradient Descent
\[\begin{align*} \theta_{t+1} &= \theta_t - \alpha \nabla J(\theta_t, B) \\ \text{where } B &= \text{mini-batch of training examples} \end{align*}\]
SGD with Momentum
\[\begin{align*} v_t &= \beta v_{t-1} + (1 - \beta) \nabla J(\theta_t) \\ \theta_{t+1} &= \theta_t - \alpha v_t \end{align*}\] where \(\beta\) is the momentum term (usually set to 0.9).
RMSProp
\[\begin{align*} s_t &= \beta s_{t-1} + (1 - \beta) (\nabla J(\theta_t))^2 \\ \theta_{t+1} &= \theta_t - \frac{\alpha}{\sqrt{s_t} + \epsilon} \nabla J(\theta_t) \end{align*}\] where \(\epsilon\) is a small constant to prevent division by zero.
Adam
\[\begin{align*} m_t &= \beta_1 m_{t-1} + (1 - \beta_1) \nabla J(\theta_t) \\ s_t &= \beta_2 s_{t-1} + (1 - \beta_2) (\nabla J(\theta_t))^2 \\ \hat{m}_t &= \frac{m_t}{1 - \beta_1^t} \\ \hat{s}_t &= \frac{s_t}{1 - \beta_2^t} \\ \theta_{t+1} &= \theta_t - \frac{\alpha}{\sqrt{\hat{s}_t} + \epsilon} \hat{m}_t \end{align*}\] where \(\beta_1\) and \(\beta_2\) are the exponential decay rates for the first and second moment estimates, respectively (usually set to 0.9 and 0.999).