Machine Learning Fundamentals

- Supervised learning uses labeled outcomes (e.g., regression, classification).
- Unsupervised learning has no labels and discovers structure in the data (e.g., clustering, PCA); see the sketch below.
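A minimal sketch of the contrast, assuming scikit-learn is available (the dataset generator and model choices are illustrative):

```python
# Minimal sketch (scikit-learn assumed): the classifier needs labels y;
# clustering and PCA use only X.
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, random_state=0)
LogisticRegression(max_iter=1000).fit(X, y)  # supervised: uses labels y
KMeans(n_clusters=2, n_init=10).fit(X)       # unsupervised: finds clusters in X
PCA(n_components=2).fit(X)                   # unsupervised: finds directions of variance
```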
The bias-variance tradeoff is the tension between model simplicity and flexibility:
- High bias → underfitting
- High variance → overfitting
A common decomposition (squared error) is:
\[\mathbb{E}[(\hat f(x)-y)^2] = \text{Bias}(\hat f(x))^2 + \text{Var}(\hat f(x)) + \sigma^2\]
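The decomposition above can be estimated by simulation. A minimal sketch, assuming NumPy; the true function, noise level, evaluation point, and polynomial degrees are illustrative choices, with degree 1 showing high bias and degree 9 high variance:

```python
# Minimal sketch (NumPy assumed): estimate Bias^2 and Var at one point x0
# by refitting polynomials on many resampled training sets.
import numpy as np

rng = np.random.default_rng(0)
f_true = lambda x: np.sin(2 * np.pi * x)  # assumed "true" function
sigma = 0.3                               # irreducible noise (the sigma^2 term)
x0 = 0.25                                 # point where we evaluate the estimator

for degree in (1, 9):                     # degree 1: rigid; degree 9: flexible
    preds = []
    for _ in range(500):                  # 500 independent training sets
        x = rng.uniform(0, 1, 30)
        y = f_true(x) + rng.normal(0, sigma, 30)
        preds.append(np.polyval(np.polyfit(x, y, degree), x0))
    preds = np.asarray(preds)
    bias_sq = (preds.mean() - f_true(x0)) ** 2
    print(f"degree {degree}: bias^2={bias_sq:.4f}, variance={preds.var():.4f}")
```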
- Overfitting: model captures noise and performs poorly on new data.
- Underfitting: model is too simple to capture underlying patterns.
A typical three-way split (sketched after the notes below):
- Train (≈ 60–80%): fit model parameters
- Validation (≈ 10–20%): tune hyperparameters and select models
- Test (≈ 10–20%): estimate final generalization performance
Notes:
- Small datasets often use cross-validation.
- Time series requires ordered splits (no shuffling).
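A minimal 60/20/20 split, assuming scikit-learn (the dataset is illustrative); since `train_test_split` returns two partitions per call, it is applied twice:

```python
# Minimal sketch (scikit-learn assumed): two-stage 60/20/20 split.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, random_state=42)            # 60% train
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42)    # 20% validation, 20% test
```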
Training performance is optimistically biased and can hide overfitting; you need out-of-sample evaluation.
Cross-validation is a resampling method that trains and evaluates on multiple splits to estimate out-of-sample performance (e.g., k-fold CV).
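A minimal k-fold sketch, assuming scikit-learn (the model and data are illustrative); each fold is held out exactly once, so every score is out-of-sample:

```python
# Minimal sketch (scikit-learn assumed): 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")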
Common evaluation pitfalls:
- Data leakage
- Time-series data split randomly (violates time order)
- Non-independent observations (groups/duplicates)
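The second and third pitfalls above are typically handled with structure-aware splitters. A minimal sketch, assuming scikit-learn; the toy data and group layout are illustrative:

```python
# Minimal sketch (scikit-learn assumed): TimeSeriesSplit keeps training
# indices strictly before test indices; GroupKFold keeps each group's
# rows in a single fold.
import numpy as np
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)
groups = np.repeat(np.arange(5), 4)  # e.g., 5 subjects with 4 rows each

for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    assert train_idx.max() < test_idx.min()  # no look-ahead

for train_idx, test_idx in GroupKFold(n_splits=5).split(X, groups=groups):
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])  # no group straddles folds
```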
Data leakage means using information during training that would not be available at prediction time, inflating performance estimates. Common examples:
- Scaling/normalizing the full dataset before splitting
- Using future information (look-ahead)
- Features derived from the target (post-outcome features)
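The first example above is avoidable by fitting preprocessing inside each training fold. A minimal sketch, assuming scikit-learn:

```python
# Minimal sketch (scikit-learn assumed): the StandardScaler lives inside
# the pipeline, so during cross-validation it is fit on each training
# fold only and never sees the held-out fold.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, random_state=0)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
print(f"leak-free CV accuracy: {cross_val_score(pipe, X, y, cv=5).mean():.3f}")
```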
Tree-based models (decision trees, random forests, gradient boosting) are generally insensitive to scaling.
Regularization adds a penalty to the loss to reduce model complexity and overfitting (trading a little bias for lower variance):
- L1 (Lasso): encourages sparsity (feature selection).
- L2 (Ridge): shrinks coefficients smoothly toward zero.
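A minimal comparison of the two penalties, assuming scikit-learn; the data generator and `alpha` values are illustrative:

```python
# Minimal sketch (scikit-learn assumed): on data with only 3 informative
# features out of 10, L1 tends to zero out the rest; L2 only shrinks them.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)
lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)
print("L1 exact-zero coefficients:", int(np.sum(lasso.coef_ == 0)))
print("L2 exact-zero coefficients:", int(np.sum(ridge.coef_ == 0)))  # usually 0
```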
- Parameters: learned from data (e.g., weights).
- Hyperparameters: set before training (e.g., learning rate, depth, C).
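A minimal sketch of the distinction, assuming scikit-learn: `C` is a hyperparameter chosen by cross-validated search, while `coef_` holds the parameters learned by `fit`:

```python
# Minimal sketch (scikit-learn assumed): hyperparameters are searched
# over; parameters are learned from the data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid={"C": [0.01, 0.1, 1, 10]}, cv=5)
search.fit(X, y)
print("chosen hyperparameter:", search.best_params_)
print("learned parameter shape:", search.best_estimator_.coef_.shape)
```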
Final evaluation estimates generalization performance on a truly unseen test set (or via another unbiased evaluation protocol).
Reusing the same data for both selection and evaluation biases results; keep the test set "locked" until the end.
Inductive bias is the set of assumptions a model/algorithm uses to generalize beyond the training data (e.g., linearity, smoothness, locality).
Simpler models typically have lower variance and are less sensitive to noise, which often improves out-of-sample performance.