Machine Learning
From data fundamentals to model evaluation
Definition
Collection
Preparation
Analysis
Building & Tuning
Monitor
AI is the broad field of making machines simulate human intelligence. ML is a subset where machines learn from data. DL uses deep neural networks for complex patterns.
Understanding the hierarchy helps you choose the right approach — rule-based AI, statistical ML, or deep neural nets — based on your data size, interpretability needs, and compute budget.
Broad field of making machines simulate human intelligence — reasoning, problem-solving, language.
Subset of AI where machines learn patterns from data without explicit programming rules.
Subset of ML using multi-layered neural networks for images, audio, text.
| Scenario | Use | Why |
|---|---|---|
| Rule-based chatbot, fixed logic | AI | No learning needed |
| Predict prices, detect fraud | ML | Structured data, interpretable |
| Image/speech recognition | DL | Complex unstructured patterns |
| Small dataset (<1000 rows) | ML | DL needs large data |
| Need explainability (medical) | ML | DL is a black box |
ML is categorized by the type of feedback a model receives during learning — labeled data (supervised), no labels (unsupervised), or rewards from actions (reinforcement).
Choosing the wrong type means your model can't learn at all. A supervised model needs labels — if you don't have them, you need a different approach entirely.
Model learns from labeled data — input + correct output pairs.
Classification
Output = category
- Logistic Regression
- Decision Tree
- Random Forest
- SVM, KNN
- XGBoost
Regression
Output = number
- Linear Regression
- Ridge (L2)
- Lasso (L1)
- ElasticNet
- XGBoost
Finds hidden patterns with no labels.
Clustering
- K-Means
- DBSCAN
- Hierarchical
Dimensionality Reduction
- PCA, t-SNE, UMAP
Association
- Apriori, FP-Growth
Agent learns by interacting with environment, receiving rewards or penalties.
Data can be categorical (labels/groups) or numerical (measured quantities). Each subtype — nominal, ordinal, discrete, continuous — determines how you can process and model it.
Data type determines which encoding to apply, which algorithms work, and which statistical tests are valid. Treating ordinal as nominal (or vice versa) leads to wrong models.
Nominal (No Order)
Ordinal (Has Order)
Discrete (Countable)
Continuous (Measurable)
Pandas is Python's most powerful data manipulation library. It provides DataFrame and Series structures for loading, cleaning, transforming, and exploring tabular data.
80% of ML work is data preparation. Pandas is how you inspect nulls, filter rows, encode features, merge datasets, and create new columns — all before training begins.
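As a quick illustration, a first-look inspection in pandas might read as follows (the tiny DataFrame here is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41],
    "city": ["Pune", "Delhi", "Delhi", None],
    "salary": [50000, 64000, 58000, 72000],
})

print(df.shape)           # (rows, columns)
print(df.dtypes)          # data type of each column
print(df.isnull().sum())  # null count per column
print(df.describe())      # summary statistics for numeric columns
```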
Data preprocessing step where missing values (nulls/NaN) and duplicate rows are detected and handled before any analysis or modeling. Dirty data = wrong models. Clean data = reliable results.
Most ML algorithms cannot handle NaN values — they either crash or produce silently wrong results. Duplicate rows bias your model by over-representing certain patterns and inflating evaluation metrics.
Missing values appear as NaN, None, NULL, empty strings, or placeholder values like -999, "unknown", "N/A". First step is always detection.
Detection Code
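A minimal detection sketch, assuming a hypothetical DataFrame where -999 and "unknown" act as placeholder values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"salary": [50000, np.nan, -999, 61000],
                   "dept": ["HR", "IT", "unknown", "IT"]})

# Standardize placeholder values to NaN before counting
df = df.replace({-999: np.nan, "unknown": np.nan, "N/A": np.nan, "": np.nan})

print(df.isnull().sum())            # nulls per column
print(df.isnull().mean() * 100)     # percent missing per column
print(df[df.isnull().any(axis=1)])  # rows containing any null
```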
Types of Missingness
| Type | Meaning | Example |
|---|---|---|
| MCAR | Missing Completely At Random — no pattern | Random data entry errors |
| MAR | Missing At Random — depends on other columns | Income missing more for younger people |
| MNAR | Missing Not At Random — depends on the missing value itself | High earners skip salary questions |
Strategy 1: Drop Rows/Columns
Strategy 2: Fill with Statistics
Strategy 3: Forward/Backward Fill (Time Series)
Strategy 4: Advanced Imputation
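The four strategies above can be sketched together on a small hypothetical DataFrame (column names are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0],
                   "b": [10.0, 20.0, np.nan, 40.0]})

# Strategy 1: drop rows (or columns) containing nulls
dropped = df.dropna()

# Strategy 2: fill with a statistic (median for skewed, mean for symmetric)
filled = df.fillna({"a": df["a"].median(), "b": df["b"].mean()})

# Strategy 3: forward/backward fill (time series)
ffilled = df.ffill().bfill()

# Strategy 4: advanced imputation, e.g. KNN uses similar rows to estimate values
knn = pd.DataFrame(KNNImputer(n_neighbors=2).fit_transform(df),
                   columns=df.columns)
```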
Which Strategy to Use?
| Situation | Recommended Strategy | Why |
|---|---|---|
| Numerical, few nulls (<5%), symmetric distribution | Fill with Mean | Mean preserves overall average |
| Numerical, skewed distribution or outliers | Fill with Median | Median is robust to outliers |
| Categorical column | Fill with Mode | Most frequent value is safest assumption |
| Time series data | Forward/Backward Fill | Preserves temporal continuity |
| Many nulls, complex relationships | KNN or Iterative Imputer | Uses relationships between columns |
| Column has >70% nulls | Drop the Column | Too little data to be reliable |
| MNAR — missingness has meaning | Create "was_null" flag column | Preserves information in the missingness |
Rows that have identical values across all (or selected) columns. They arise from data merges, scraping, re-submissions, or system errors.
Duplicates inflate dataset size, bias model training (the model sees certain patterns more often than they actually occur), and artificially inflate evaluation metrics.
Detection
Treatment
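Detection and treatment can both be done in pandas; a minimal sketch on hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 2, 3], "city": ["A", "B", "B", "C"]})

# Detection
print(df.duplicated().sum())          # count of fully duplicated rows
print(df[df.duplicated(keep=False)])  # view every copy of each duplicate

# Treatment: keep the first occurrence, drop the rest
df = df.drop_duplicates(keep="first").reset_index(drop=True)
```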
EDA is the process of systematically investigating your dataset before building any model. You examine shape, distributions, relationships, missing values, and anomalies using statistics and visualizations.
Without EDA, you're modeling blindly. EDA prevents you from feeding garbage to your model — you catch outliers, skewed distributions, data leakage, encoding issues, and multicollinearity early.
1 variable at a time
2 variables together
3+ variables
Analyzing one variable at a time. You look at how values of a single column are distributed โ their shape, center, spread, and extreme values.
Before comparing features, you need to understand each individually. A bimodal distribution, heavy skew, or extreme outlier in a single column can break models if not addressed.
- Count each category → frequency table
- Find the most common value → Mode
- Check if any category dominates (>90%)
Visualizations
- Mean → average (affected by outliers)
- Median → middle value (robust)
- Mode → most frequent value
- Variance & SD → spread
- IQR → middle 50% range
- Skewness → distribution symmetry
Visualizations
Measures the spread of the middle 50% of data. Robust to extreme outliers.
Variance measures how far each value is spread from the mean. Standard Deviation (SD) is simply the square root of variance — bringing it back to the original unit so it's easier to interpret. A high SD means values are widely scattered; a low SD means they're tightly clustered around the mean.
Mean alone doesn't tell the full story. Two datasets can have the same mean but completely different spreads. SD tells you how reliable and consistent the data is — essential for detecting outliers, comparing features, and understanding model uncertainty.
Identifying data points that deviate significantly from the rest. Methods include IQR fences (box plot whiskers) and Z-score thresholds based on standard deviations from the mean.
Outliers distort model coefficients, inflate RMSE, and skew correlations. A single extreme value can move a regression line significantly. Must detect and decide: fix or remove.
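Both methods can be sketched on a small hypothetical series:

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 11, 95])  # 95 is an obvious outlier

# IQR fences (box-plot whiskers)
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = s[(s < lower) | (s > upper)]

# Z-score threshold: note that in tiny samples the outlier itself inflates
# the SD, so |Z| > 3 can miss what the IQR fences catch
z = (s - s.mean()) / s.std()
z_outliers = s[z.abs() > 3]
```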
Right Skew (Positive)
Examples: Income data, house prices
Left Skew (Negative)
Examples: Test scores (few very low)
A Z-Score tells you how many standard deviations a value is away from the mean. A Z of 0 means the value is exactly at the mean. A Z of +2 means it's 2 standard deviations above. It standardizes any numerical column to a common scale so values from different features can be compared fairly.
Raw values alone don't tell you if something is unusual. A salary of ₹5,00,000 could be normal or extreme depending on the dataset. Z-Score gives context — |Z| > 3 flags a value as a statistical outlier, sitting beyond 99.7% of the distribution. It's also the foundation of standardization (StandardScaler) used before many ML models.
Empirical Rule (68-95-99.7)
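A minimal standardization sketch (the values are hypothetical):

```python
import numpy as np

values = np.array([40.0, 45.0, 50.0, 55.0, 60.0])

mean, sd = values.mean(), values.std()
z = (values - mean) / sd   # Z of 0 = exactly at the mean

# Empirical rule for a normal distribution: ~68% of values fall within
# |Z| <= 1, ~95% within |Z| <= 2, and ~99.7% within |Z| <= 3
```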
Analyzing the relationship between two variables simultaneously — how they move together (numerical vs numerical), differ across groups (categorical vs numerical), or associate (categorical vs categorical).
Feature selection, correlation detection, and hypothesis testing all start here. You find which features actually relate to the target — and which are noise.
Covariance measures whether two variables move in the same direction or opposite directions. It shows the direction of the relationship, but NOT how strong it is because it depends on the units of measurement.
When X increases, Y also increases. They deviate from their means in the same direction.
When X increases, Y decreases. They deviate in opposite directions.
| Property | Covariance | Correlation |
|---|---|---|
| What it tells | Direction only | Direction + Strength |
| Range | −∞ to +∞ | −1 to +1 |
| Unit dependent? | Yes | No (unitless) |
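The contrast is easy to see in NumPy on a hypothetical pair of variables:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])  # y = 2x: a perfect positive relationship

cov = np.cov(x, y)[0, 1]      # unit-dependent: tells direction only
r = np.corrcoef(x, y)[0, 1]   # unitless: direction + strength, always in [-1, 1]
```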
| r Value | Meaning |
|---|---|
| +0.8 to +1.0 | Strong positive |
| +0.4 to +0.7 | Moderate positive |
| ≈ 0 | No relationship |
| −0.7 to −0.4 | Moderate negative |
| −1.0 to −0.8 | Strong negative |
Visualizations: Scatter Plot, Heatmap, Pair Plot
Testing whether a numerical variable (e.g. salary, score) significantly differs across categories (e.g. department, gender). We set up a null hypothesis H₀ and use a test statistic to decide whether to reject it.
Tells you whether a categorical feature has real predictive power for a numerical target. If salary doesn't differ by department, that feature is noise.
| Hypothesis | Meaning |
|---|---|
| H₀ (Null) | No difference exists — any observed gap is just random chance |
| H₁ (Alternative) | A real difference exists — it's not just random noise |
The test statistic (t or F) is placed on its probability distribution. The P-value = area in the tail beyond that value — the probability of seeing a result this extreme by pure chance alone, assuming H₀ is true.
Use when categorical variable has exactly 2 groups.
| Result | What it means | Action |
|---|---|---|
| p ≤ 0.05 | Groups are significantly different | Keep feature |
| p > 0.05 | No significant difference | Feature may be useless |
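A sketch with SciPy on two hypothetical salary groups:

```python
import numpy as np
from scipy import stats

group_a = np.array([52, 55, 53, 58, 54], dtype=float)
group_b = np.array([70, 72, 68, 75, 71], dtype=float)

# Independent two-sample t-test: H0 = the group means are equal
t_stat, p_value = stats.ttest_ind(group_a, group_b)
significant = p_value <= 0.05   # reject H0 -> the groups differ
```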
Use when categorical variable has 3 or more groups.
| Result | What it means | Next Step |
|---|---|---|
| p ≤ 0.05 | At least one group differs significantly | Run Post-Hoc (Tukey) |
| p > 0.05 | No significant difference | Feature may not add value |
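One-way ANOVA in SciPy, with three hypothetical departments:

```python
import numpy as np
from scipy import stats

hr = np.array([50, 52, 51, 53], dtype=float)
it = np.array([70, 72, 71, 69], dtype=float)
sales = np.array([60, 61, 59, 62], dtype=float)

# H0 = all group means are equal
f_stat, p_value = stats.f_oneway(hr, it, sales)
# p <= 0.05 -> at least one department's mean differs; follow up with Tukey's HSD
```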
Testing whether two categorical variables are associated or independent. We compare observed data to what we would expect if the variables had absolutely no relationship.
Tells you if a categorical feature is related to a categorical target. If Gender and Movie Preference are independent, Gender won't help predict preferences.
The Chi-Square test measures how much the observed counts in a cross-table differ from what we'd expect if the two variables had absolutely no relationship. The larger the χ² value, the bigger the gap between observed and expected — meaning a stronger association exists between the two categories.
When both your feature and target are categorical, you can't use correlation or t-tests. Chi-Square is the go-to test to decide whether a categorical feature is worth keeping — if Gender and Loan Default are independent, Gender adds zero predictive value to your model.
| P-value | Decision | Feature usefulness |
|---|---|---|
| p ≤ 0.05 | Reject H₀ ✓ | Variables associated → keep feature |
| p > 0.05 | Fail to reject H₀ ✗ | Variables independent → may be useless |
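A sketch with SciPy; the observed cross-table counts are hypothetical:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical observed counts: rows = gender, columns = preference
observed = np.array([[30, 10],
                     [10, 30]])

chi2, p_value, dof, expected = chi2_contingency(observed)
associated = p_value <= 0.05   # reject H0 (independence) -> keep the feature
```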
Cramér's V is a normalized version of the Chi-Square statistic that measures the strength of association between two categorical variables. It always ranges from 0 to 1 — where 0 means no association at all and 1 means a perfect relationship — regardless of table size or sample size.
Chi-Square only tells you whether an association exists, not how strong it is. A huge dataset can produce a significant χ² even for a trivially weak relationship. Cramér's V fixes this — it gives you a comparable, scaled strength score so you can rank which categorical features are most useful for your model.
| Cramér's V | Strength | What it means |
|---|---|---|
| 0.00–0.10 | Negligible | Barely any association → ignore |
| 0.10–0.30 | Weak–Moderate | Some association → use with caution |
| 0.30–0.60 | Moderate–Strong | Meaningful association → likely useful |
| 0.60–1.00 | Very Strong | Strong relationship → important feature |
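Cramér's V is computed from the χ² statistic as V = √(χ² / (n·(k−1))), where n is the sample size and k the smaller table dimension. A sketch on the same kind of hypothetical cross-table:

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[30, 10],
                     [10, 30]])

# correction=False so the raw chi-square statistic is used
chi2, _, _, _ = chi2_contingency(observed, correction=False)
n = observed.sum()
k = min(observed.shape) - 1
cramers_v = np.sqrt(chi2 / (n * k))   # always between 0 and 1
```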
Examining 3 or more variables simultaneously. The key tool is the correlation heatmap — a matrix where each cell shows Pearson's r between two features, color-coded by strength and direction.
Reveals multicollinearity (features too similar to each other), which bloats variance and hurts models. Also shows which features correlate with the target.
VIF (Variance Inflation Factor) quantifies how much the variance of a regression coefficient is inflated by multicollinearity — that is, how much one feature can be linearly predicted from the other features.
When features are highly correlated with each other, the model gets confused about which feature deserves credit. Coefficients become unstable, unreliable, and have exploding standard errors, making the model meaningless.
How VIF is Calculated โ Step by Step
For each feature X in your dataset, VIF asks: "How well can I predict X using all the OTHER features?"
Interpreting VIF Values
| VIF Value | Meaning | Action |
|---|---|---|
| VIF = 1 | No multicollinearity at all — feature is completely independent | Keep ✓ |
| 1 < VIF ≤ 5 | Low to moderate — some correlation but acceptable | Keep ✓ |
| 5 < VIF ≤ 10 | Moderate to high — investigate carefully | Consider dropping |
| VIF > 10 | Severe multicollinearity — this feature is nearly redundant | Drop or combine ✗ |
Intuitive Example
Suppose your dataset has both Height_cm and Height_inches:
VIF vs Correlation Matrix โ Key Difference
| Property | Correlation Matrix | VIF |
|---|---|---|
| What it checks | Pairwise (X₁ vs X₂) relationship only | X vs ALL other features combined |
| Detects 3-way collinearity? | ✗ No — misses it | ✓ Yes — catches it |
| Output | −1 to +1 per pair | 1 to ∞ per feature |
| Example blind spot | X₃ = X₁ + X₂ might show low pairwise r | VIF for X₃ will be very high |
How to Fix High VIF
If two features are highly correlated, drop the one with less predictive value for the target.
Create a single feature from the correlated ones (e.g. ratio, sum, average).
Transform correlated features into uncorrelated principal components — VIF = 1 for all components.
VIF Calculation Code
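A sketch computing VIF directly from its definition, VIF = 1 / (1 − R²), using scikit-learn (statsmodels' `variance_inflation_factor` is the usual shortcut; the Height example data here is synthetic):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = pd.DataFrame({"height_cm": rng.normal(170, 10, 200)})
X["height_in"] = X["height_cm"] / 2.54 + rng.normal(0, 0.1, 200)  # near-duplicate
X["weight"] = rng.normal(70, 8, 200)                              # independent

def vif(df: pd.DataFrame) -> pd.Series:
    """For each feature: regress it on all the others, then VIF = 1/(1 - R^2)."""
    out = {}
    for col in df.columns:
        others, target = df.drop(columns=col), df[col]
        r2 = LinearRegression().fit(others, target).score(others, target)
        out[col] = 1.0 / (1.0 - r2)
    return pd.Series(out)

print(vif(X))  # the two height columns get huge VIFs; weight stays near 1
```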
Mathematical transformations applied to a column's values to reduce skewness. They compress large values, making the distribution closer to normal so models work better.
Many ML algorithms assume normally distributed features. Highly skewed data gives disproportionate weight to extreme values. Log and Box-Cox transformations fix this without removing any rows.
Log transformation applies the logarithm function to every value in a column — compressing large values and spreading small ones. It turns a right-skewed distribution into something closer to a normal bell curve.
Many ML models assume features are roughly normally distributed. Income, house prices, and population counts are heavily right-skewed — a few huge values dominate. Log squashes those extremes so the model treats all values fairly.
| Situation | Use? |
|---|---|
| Right-skewed (skewness > +1) | ✓ Yes |
| Data has zeros | ⚠️ log(x+1) |
| Negative values | ✗ No |
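A minimal sketch on synthetic right-skewed data (lognormal income is an assumption for illustration):

```python
import numpy as np
from scipy.stats import skew

income = np.random.default_rng(42).lognormal(mean=10, sigma=1.0, size=1000)

# log1p = log(x + 1): safe even when zeros are present
logged = np.log1p(income)

print(skew(income), skew(logged))  # strong right skew -> near zero
```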
Box-Cox is a family of power transformations that automatically finds the best exponent λ (lambda) to make your data as normal as possible. Instead of you guessing whether to use log, square root, or reciprocal — Box-Cox tests them all and picks the optimal one.
Log only fixes right skew. Box-Cox handles both left and right skew by tuning λ. When skewness is above +2 or the distribution is complex, Box-Cox finds a better transformation than log alone — but requires all values to be strictly positive.
Automatically finds the best λ (lambda) to normalize data.
| λ | Effect | Skew Type |
|---|---|---|
| 2 | y² (Square) | Left-skewed |
| 0.5 | √y | Moderate right |
| 0 | ln(y) | Strong right |
| −1 | 1/y | Severe right |
| Skewness | Action |
|---|---|
| −0.5 to +0.5 | No transform needed |
| +0.5 to +1.0 | Consider Log or √ |
| > +1.0 | Apply Log |
| > +2.0 | Apply Box-Cox |
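SciPy's `boxcox` fits λ by maximum likelihood; a sketch on synthetic strictly positive, right-skewed data:

```python
import numpy as np
from scipy.stats import boxcox, skew

# Exponential data: all positive, right-skewed (theoretical skewness = 2)
data = np.random.default_rng(0).exponential(scale=2.0, size=1000)

transformed, best_lambda = boxcox(data)   # lambda chosen automatically

print(best_lambda, skew(data), skew(transformed))
```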
The go-to replacement for Box-Cox when negatives are present. Works on any value — positive, zero, or negative. Automatically finds the best λ just like Box-Cox.
λ ≠ 2, x < 0: −((−x + 1)^(2−λ) − 1) / (2−λ) | λ = 2, x < 0: −ln(−x + 1)
| Property | Detail |
|---|---|
| Handles negatives? | ✓ Yes |
| Handles zeros? | ✓ Yes |
| Auto-finds λ? | ✓ Yes |
| Best for | Any skewed data, mixed signs |
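SciPy's `yeojohnson` mirrors `boxcox` but accepts any sign; a sketch on synthetic skewed data shifted to include negatives:

```python
import numpy as np
from scipy.stats import skew, yeojohnson

# Skewed data containing negatives -- Box-Cox would raise an error here
data = np.random.default_rng(1).exponential(2.0, 1000) - 1.0

transformed, best_lambda = yeojohnson(data)   # lambda fitted automatically

print(best_lambda, skew(data), skew(transformed))
```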
Simple manual method. The cube root of a negative number is still negative — so it naturally handles all signs. Good for moderate skew with negatives.
| Property | Detail |
|---|---|
| Handles negatives? | ✓ Yes |
| Handles zeros? | ✓ Yes |
| Auto-finds λ? | ✗ Fixed power (1/3) |
| Best for | Moderate skew, simple fix |
| Your Data | Recommended Transform | Why |
|---|---|---|
| All positive, right-skewed | Log / Box-Cox | Classic choice, well-understood |
| Has zeros, right-skewed | log(x+1) / Yeo-Johnson | Avoids log(0) error |
| Has negatives, any skew | Yeo-Johnson | Works on all signs, auto-finds λ |
| Has negatives, moderate skew | Cube Root | Simple, no fitting needed |
Principal Component Analysis transforms many correlated features into fewer uncorrelated "principal components" that capture the maximum variance in the data.
Too many correlated features (multicollinearity) cause unstable model weights and slow training. PCA compresses information into independent components — directly eliminating multicollinearity.
| Variance Retained | Quality |
|---|---|
| <60% | Too low ✗ |
| 75–90% | Good ✓ |
| 90–95% | Excellent ✓ |
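A sketch with scikit-learn on synthetic correlated features; passing a float to `n_components` keeps just enough components to reach that variance target:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
# Three highly correlated features plus one independent feature
X = np.hstack([base,
               base * 2 + rng.normal(0, 0.1, (200, 1)),
               base * -1 + rng.normal(0, 0.1, (200, 1)),
               rng.normal(size=(200, 1))])

pca = PCA(n_components=0.95)   # keep enough components for 95% variance
X_reduced = pca.fit_transform(X)

print(X_reduced.shape, pca.explained_variance_ratio_.sum())
```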
Synthetic Minority Over-sampling Technique creates new artificial minority samples by interpolating between existing real minority samples — making class sizes more equal.
When 99% of samples are class 0 and 1% are class 1, a model can achieve 99% accuracy by always predicting 0 — useless! SMOTE forces the model to actually learn the minority class.
| Variant | When to Use |
|---|---|
| SMOTE (default) | Standard imbalance, continuous features |
| Borderline-SMOTE | Focus on hard boundary cases |
| ADASYN | Adaptive — more samples where density is low |
| SMOTE-Tomek | SMOTE + clean overlapping points |
| SMOTENC | Mixed data (numerical + categorical) |
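The core interpolation idea can be sketched in plain NumPy (in practice you would use `SMOTE` from the imbalanced-learn package, if available; `smote_like` here is a simplified toy, not the library algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)
minority = rng.normal(loc=5.0, scale=1.0, size=(10, 2))  # 10 real minority samples

def smote_like(X, n_new, k=3, rng=rng):
    """Toy SMOTE: interpolate between a sample and one of its k nearest neighbours."""
    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(X))
        d = np.linalg.norm(X - X[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]   # skip the point itself
        j = rng.choice(neighbours)
        gap = rng.random()                    # random position on the segment
        new_points.append(X[i] + gap * (X[j] - X[i]))
    return np.array(new_points)

synthetic = smote_like(minority, n_new=40)    # 10 real -> 50 total minority samples
```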
A structured overview of the major ML model families — what problem each solves, what data it expects, and which real-world use cases it's best suited for.
Choosing the wrong model type is the single most common ML mistake. The model is a tool — match the tool to the job.
| Model | Output | Best Use Case | Notes |
|---|---|---|---|
| Linear Regression | Continuous number | House price, salary prediction | Assumes linear relationship; sensitive to outliers |
| Ridge Regression | Continuous number | Same as linear but with correlated features | L2 regularization built-in |
| Lasso Regression | Continuous number | High-dimensional data, feature selection needed | L1 regularization — zeros out features |
| Logistic Regression | Probability (0–1) | Spam detection, default prediction, disease yes/no | Despite the name, it's a classifier |
| Model | Output | Best Use Case | Notes |
|---|---|---|---|
| Decision Tree | Class or number | Simple classification, interpretable rules | Easy to explain; prone to overfitting |
| Random Forest | Class or number | General-purpose: fraud, churn, credit scoring | Ensemble of trees; robust, handles missing values |
| XGBoost / LightGBM | Class or number | Tabular data competitions, ranking, clickthrough rate | Gradient boosting; state-of-the-art on structured data |
| Model | Output | Best Use Case | Notes |
|---|---|---|---|
| SVM | Class | Text classification, small datasets | Maximum-margin hyperplane; effective in high dimensions |
| K-Nearest Neighbors | Class or number | Recommendation systems, anomaly detection | No training phase; slow on large data |
| Naive Bayes | Class probability | Email spam, sentiment analysis | Fast, simple, works well with text/NLP |
| Model | Output | Best Use Case | Notes |
|---|---|---|---|
| ANN (MLP) | Any | Complex tabular patterns | Fully connected layers; general purpose |
| CNN | Class/label | Image classification, object detection | Convolutional layers capture spatial patterns |
| RNN / LSTM | Sequence | Time series, language modeling | Remembers long-term dependencies |
| Transformer | Any | NLP (ChatGPT, BERT), translation | Attention mechanism; state-of-the-art in NLP |
Use when your target variable has a time dimension: tomorrow's sales, next month's temperature, stock prices. Regular ML ignores temporal order — forecasting preserves it.
Time Series Components
| Component | What It Is | Example |
|---|---|---|
| Trend | Long-term upward/downward direction | Rising sales year over year |
| Seasonality | Repeating pattern at fixed intervals | High sales every December |
| Cyclic | Longer irregular waves | Economic recession cycles |
| Noise | Random unexplainable variation | One-off spike due to event |
Classical Forecasting Models
| Model | Formula / Key Idea | Best Use Case |
|---|---|---|
| Moving Average | avg of last N observations | Stable, no trend/seasonality — quick baseline |
| Exponential Smoothing (ETS) | F(t+1) = α·Y(t) + (1−α)·F(t) | Short-term forecast, sales, inventory |
| ARIMA(p,d,q) | AR + Differencing + MA | Stationary or easily differenced series |
| SARIMA | ARIMA + seasonal(P,D,Q,m) | Clear seasonality (retail, weather, energy) |
Modern / ML-Based Forecasting Models
| Model | Key Idea | Best Use Case |
|---|---|---|
| Prophet (Meta) | Trend + seasonality + holidays decomposition | Business forecasting, holiday effects |
| XGBoost + Lag Features | ML with t-1, t-7, t-30 as input features | Complex relationships, external regressors |
| LSTM | Recurrent NN with long-term memory | Long sequences, complex temporal patterns |
| Temporal Fusion Transformer | Attention + multi-horizon forecasting | Large-scale production forecasting |
Which Forecasting Model?
| Situation | Recommended Model |
|---|---|
| Quick baseline, stable series | Moving Average / ETS |
| No seasonality, stationary | ARIMA |
| Clear seasonality (monthly/weekly) | SARIMA / Holt-Winters |
| Business data with holidays | Prophet |
| Complex, many features | XGBoost + lag features |
| Long sequence, big data | LSTM / Transformer |
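The two simplest entries above, a moving-average baseline and lag features for an ML model, can be sketched in pandas (the sales series is hypothetical):

```python
import pandas as pd

# Hypothetical monthly sales series
sales = pd.Series([100, 120, 130, 125, 140, 150, 160, 155, 170, 180],
                  index=pd.date_range("2024-01-01", periods=10, freq="MS"))

# Moving-average baseline: next forecast = mean of the last 3 observations
forecast = sales.rolling(window=3).mean().iloc[-1]

# Lag features for an ML model (XGBoost-style): t-1 and t-3 as inputs
df = pd.DataFrame({"y": sales,
                   "lag_1": sales.shift(1),
                   "lag_3": sales.shift(3)}).dropna()
```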
| Your Problem | Recommended Model | Why |
|---|---|---|
| Predict a number (house price, salary) | Linear / Ridge Regression | Interpretable, fast, good baseline |
| Binary yes/no (spam, fraud, churn) | Logistic Regression / XGBoost | Logistic for interpretability, XGBoost for performance |
| Multi-class (A/B/C/D categories) | Random Forest / XGBoost | Handles multi-class natively; robust |
| Customer churn prediction | XGBoost + SMOTE | Handles imbalance; captures complex patterns |
| Text classification / sentiment | Naive Bayes / Transformer | Naive Bayes for speed; Transformer for accuracy |
| Image recognition | CNN | Designed for spatial/visual patterns |
| Time series / forecasting | ARIMA / Prophet / LSTM | Depends on seasonality and data volume |
| Anomaly / fraud detection | Isolation Forest / XGBoost | Handles extreme class imbalance |
| Small dataset (<1000 rows) | SVM / Logistic Regression | Work well in low-data regimes |
| Very large structured data | XGBoost / LightGBM | Scales well; battle-tested on tabular data |
Dividing your dataset into separate subsets so the model is trained on one portion and evaluated on data it has never seen before. This simulates real-world performance.
If you evaluate on training data, your model looks perfect — it just memorized the answers. A separate test set reveals whether it actually learned the pattern or just memorized noise.
The model learns from this — adjusts weights, fits patterns. Never used for final evaluation.
Used to tune hyperparameters (learning rate, depth, regularization) during development.
Final unseen evaluation — touched only ONCE at the very end. Reports real-world performance.
When data is limited, K-Fold CV rotates which portion is used as validation — every sample is tested exactly once.
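A sketch with scikit-learn on a synthetic imbalanced target, combining a stratified split with K-Fold CV:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

X = np.arange(100).reshape(-1, 1).astype(float)
y = np.array([0] * 80 + [1] * 20)   # imbalanced binary target

# Stratified split preserves the 80/20 class ratio in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# 5-fold stratified CV: every sample is used for validation exactly once
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
folds = list(skf.split(X, y))
```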
The two fundamental problems in model training. Underfitting = model is too simple to learn the pattern. Overfitting = model memorizes training noise instead of learning the real pattern. Regularization is the fix for overfitting.
A model that performs perfectly on training data but fails on new data is useless. The goal is generalization — learning the real underlying pattern. This is the core challenge of ML.
| Set | Score |
|---|---|
| Train | Low ✗ |
| Test | Low ✗ |
| Set | Score |
|---|---|
| Train | High ✓ |
| Test | High ✓ |
| Set | Score |
|---|---|
| Train | Very High ✓ |
| Test | Low ✗ |
Regularization adds a penalty term to the loss function that punishes large model weights. This forces the model to stay simple — it can't just assign huge weights to every feature to perfectly fit training noise.
λ (lambda) is the regularization strength. Higher λ = more penalty = simpler model. Too high → underfitting. Tune with cross-validation.
Penalizes the absolute value of weights. Pushes small/irrelevant feature weights all the way to exactly zero — automatically removing them.
| Property | Detail |
|---|---|
| Effect on weights | Shrinks weak ones to exactly 0 |
| Feature selection | ✓ Automatic |
| Correlated features | Randomly drops one of them |
| Best for | High-dim data, sparse features |
Penalizes the squared value of weights. Shrinks all weights toward zero but never reaches exactly zero — keeps all features, just with smaller influence.
| Property | Detail |
|---|---|
| Effect on weights | Shrinks all, never to exactly 0 |
| Feature selection | ✗ Keeps all features |
| Correlated features | Distributes weight across them |
| Best for | Correlated features, regression |
| Method | Penalty | Zeros out features? | Best Use Case |
|---|---|---|---|
| L1 Lasso | λΣ\|w\| | ✓ Yes | Feature selection, sparse data |
| L2 Ridge | λΣw² | ✗ No | Correlated features, regression |
| ElasticNet | L1 + L2 | ✓ Sometimes | Both benefits, general purpose |
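The contrasting behaviour is easy to observe with scikit-learn on synthetic data where only two of five features carry signal (`alpha` plays the role of λ):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 0.5, 200)  # features 2-4 are pure noise

lasso = Lasso(alpha=0.5).fit(X, y)   # L1: drives weak weights to exactly 0
ridge = Ridge(alpha=0.5).fit(X, y)   # L2: shrinks all weights, zeroes none

print(lasso.coef_)  # noise features zeroed out
print(ridge.coef_)  # all features kept, just shrunk
```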
Quantitative measures that tell you how well your model performs. Different metrics are appropriate for regression vs classification vs forecasting models.
Accuracy alone is misleading (class imbalance). RMSE penalizes large errors more than MAE. R² tells you the proportion of variance explained. Choosing the right metric defines what "good" means for your problem.
| R² Value | Interpretation |
|---|---|
| 0.90–1.00 | Excellent |
| 0.70–0.89 | Good |
| 0.50–0.69 | Moderate |
| < 0.50 | Weak |
| F1 Value | Quality |
|---|---|
| 0.85+ | Excellent |
| 0.70–0.84 | Good |
| 0.50–0.69 | Fair |
| <0.50 | Poor |
| Problem Type | Primary Metric | Secondary Metric |
|---|---|---|
| Regression (general) | RMSE | R² |
| Regression (outliers present) | MAE | RMSE |
| Regression (business %) | MAPE | MAE |
| Binary Classification (balanced) | Accuracy | AUC-ROC |
| Binary Classification (imbalanced) | F1 Score | Precision/Recall |
| Fraud Detection (miss = bad) | Recall | F1 |
| Spam Filter (FP = bad) | Precision | F1 |
| Multi-class Classification | Macro F1 | Accuracy |
| Forecasting (time series) | MAPE | MAE |
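A sketch with scikit-learn metrics on hypothetical predictions, including the imbalanced-accuracy trap described above:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score,
                             mean_absolute_error, mean_squared_error, r2_score)

# Regression example
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 145.0, 190.0, 260.0])
mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
r2 = r2_score(y_true, y_pred)

# Classification example (imbalanced): accuracy looks fine, F1 tells the truth
yc_true = np.array([0] * 95 + [1] * 5)
yc_pred = np.zeros(100, dtype=int)       # always predict the majority class
acc = accuracy_score(yc_true, yc_pred)   # 0.95: misleading
f1 = f1_score(yc_true, yc_pred, zero_division=0)  # 0.0: never finds class 1
```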
A consolidated checklist of the most important habits, rules, and decisions that separate a model that actually works in production from one that only looks good on paper.
Knowing individual techniques is not enough. How you combine them, in what order, and what mistakes to avoid determines whether your model generalizes or fails silently on real data.
- Always clean before you model — handle nulls, duplicates, and wrong data types before any EDA or training.
- Check for data leakage — never let future information or target-derived features sneak into your training set.
- Understand your data source — know how it was collected, what each column means, and what could go wrong.
- Validate data types — a column stored as string that should be numeric will silently break everything.
- Split train/test first — before scaling, encoding, imputing, or any transformation.
- Fit transformers on train only — then apply (transform) to both train and test. Never fit on the full dataset.
- Never peek at the test set — it must stay completely unseen until final evaluation. Touching it earlier inflates your metrics.
- Use stratified split for classification — ensures class proportions are preserved in both sets.
- Build the simplest model first — Logistic Regression for classification, Linear Regression for regression.
- A baseline tells you the minimum bar — if your complex model barely beats it, the complexity isn't worth it.
- Compare all models against the baseline — not against each other in isolation.
- A dummy classifier (always predict majority class) is your floor — your model must beat it or it's useless.
- Better features beat better models — a simple model with great features outperforms a complex model with raw features.
- Encode categoricals correctly — ordinal data needs label encoding, nominal needs one-hot or target encoding.
- Scale features for distance-based models — KNN, SVM, and neural networks are sensitive to feature scale. Tree models are not.
- Remove highly correlated features — check VIF and correlation heatmap before training linear models.
- Create domain-driven features — house age, price per sqft, interaction terms often matter more than raw columns.
| Mistake | Why It's Dangerous |
|---|---|
| Fitting scaler on full data | Test data leaks into preprocessing → inflated metrics |
| Using accuracy on imbalanced data | 99% accuracy by predicting majority class → useless model |
| Applying SMOTE before splitting | Synthetic samples in test set → fake performance |
| Dropping nulls without checking % | Losing 30% of data silently biases the model |
| Label encoding nominal features | Implies false order → model learns wrong relationships |
| Mistake | Why It's Dangerous |
|---|---|
| Shuffling time series data | Future data trains the model → data leakage |
| Tuning on test set | Test set becomes part of training → overly optimistic results |
| Ignoring class imbalance | Model never learns minority class → fails in production |
| Skipping EDA | Outliers, skew, and wrong types break models silently |
| Using R² alone for regression | High R² can still mean terrible predictions on new data |
- Use K-Fold cross-validation when data is limited — gives a more reliable estimate than a single train/test split.
- Use stratified K-Fold for classification — preserves class balance in every fold.
- Report mean ± std of CV scores — a model with high variance across folds is unstable.
- Pick the right metric for your problem — F1 for imbalanced, RMSE for regression, AUC for ranking.
- Never report only training accuracy — always report validation or test performance.