Machine Learning

From data fundamentals to model evaluation

🔄 ML Lifecycle
The end-to-end cycle every ML project follows
🎯 Problem Definition → 📦 Data Collection → 🧹 Data Preparation → 🔬 EDA & Analysis → 🧮 Model Building → 📐 Evaluation & Tuning → 🚀 Deploy & Monitor → ♾️ Iterative (loop back to the start)
📌 ML Learning Roadmap
🤖 AI / ML / Deep Learning · 🎓 Types of ML · 📋 Data Types · 🐼 Pandas · 🧹 Null & Duplicate Treatment · 🔬 EDA — Exploratory Data Analysis · 📊 Univariate Analysis · 🔗 Bivariate Analysis · 🌐 Multivariate Analysis · ⚙️ Feature Engineering · 📈 Log / Box-Cox · ⚖️ SMOTE · 🔻 PCA · 🧮 ML Model Types · ✂️ Train / Test / Validation · 📉 Overfitting / L1 / L2 · 📐 Evaluation Metrics
AI, ML & Deep Learning
What is it?

AI is the broad field of making machines simulate human intelligence. ML is a subset where machines learn from data. DL uses deep neural networks for complex patterns.

Why does it matter?

Understanding the hierarchy helps you choose the right approach — rule-based AI, statistical ML, or deep neural nets — based on your data size, interpretability needs, and compute budget.

🤖 Artificial Intelligence

Broad field of making machines simulate human intelligence — reasoning, problem-solving, language.

AI is the Field of Study. Everything lives inside it.
📊 Machine Learning

Subset of AI where machines learn patterns from data without explicit programming rules.

Learns from examples instead of hard-coded rules.
🧠 Deep Learning

Subset of ML using multi-layered neural networks for images, audio, text.

DL = ML with many layers of feature extraction.
When to Use What?
Scenario | Use | Why
Rule-based chatbot, fixed logic | AI | No learning needed
Predict prices, detect fraud | ML | Structured data, interpretable
Image/speech recognition | DL | Complex unstructured patterns
Small dataset (<1000 rows) | ML | DL needs large data
Need explainability (medical) | ML | DL is a black box
Types of Machine Learning
What is it?

ML is categorized by the type of feedback a model receives during learning — labeled data (supervised), no labels (unsupervised), or rewards from actions (reinforcement).

Why does it matter?

Choosing the wrong type means your model can't learn at all. A supervised model needs labels โ€” if you don't have them, you need a different approach entirely.

🎓 Supervised Learning

Model learns from labeled data — input + correct output pairs.

Classification

Output = category

  • Logistic Regression
  • Decision Tree
  • Random Forest
  • SVM, KNN
  • XGBoost
Spam Detection
Regression

Output = number

  • Linear Regression
  • Ridge (L2)
  • Lasso (L1)
  • ElasticNet
  • XGBoost
House Price
๐Ÿ” Unsupervised Learning

Finds hidden patterns with no labels.

Clustering
  • K-Means
  • DBSCAN
  • Hierarchical
Customer Segmentation
Dimensionality Reduction
  • PCA, t-SNE, UMAP
Association
  • Apriori, FP-Growth
Market Basket
🎮 Reinforcement Learning

Agent learns by interacting with the environment, receiving rewards or penalties.

Agent → Action → Environment → Reward/Penalty → Agent learns
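This loop can be sketched as a minimal epsilon-greedy bandit (a toy illustration only; the payout probabilities and the 10% exploration rate are made-up values):

```python
import numpy as np

rng = np.random.default_rng(0)
true_rewards = [0.2, 0.8]   # hidden payout probability of each action (assumed)
Q = [0.0, 0.0]              # agent's learned value estimates
counts = [0, 0]

for step in range(2000):
    # Action: explore 10% of the time, otherwise exploit the best estimate
    a = int(rng.integers(2)) if rng.random() < 0.1 else int(np.argmax(Q))
    # Environment returns a reward (1) or a penalty (0)
    reward = float(rng.random() < true_rewards[a])
    # Agent learns: incremental mean update of its value estimate
    counts[a] += 1
    Q[a] += (reward - Q[a]) / counts[a]
```

After enough steps the agent's estimates approach the true payout rates and it spends most of its pulls on the better action.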
Types of Data
What is it?

Data can be categorical (labels/groups) or numerical (measured quantities). Each subtype — nominal, ordinal, discrete, continuous — determines how you can process and model it.

Why does it matter?

Data type determines which encoding to apply, which algorithms work, and which statistical tests are valid. Treating ordinal as nominal (or vice versa) leads to wrong models.

📋 Categorical Data
Nominal (No Order)
Colors: Red, Blue, Green · Gender: Male, Female, Other
Red is NOT greater than Blue — no order exists.
Ordinal (Has Order)
Education: School < UG < PG < PhD · Rating: 1★ < 2★ < 3★ < 4★ < 5★
🔢 Numerical Data
Discrete (Countable)
Children: 0, 1, 2, 3... · Cars sold: 0, 1, 2...
You can't have 2.5 children!
Continuous (Measurable)
Height: 172.5 cm, 168.2 cm · Salary: ₹45,230.50
Pandas — Data Manipulation
What is it?

Pandas is Python's most powerful data manipulation library. It provides DataFrame and Series structures for loading, cleaning, transforming, and exploring tabular data.

Why does it matter?

80% of ML work is data preparation. Pandas is how you inspect nulls, filter rows, encode features, merge datasets, and create new columns — all before training begins.

Series — 1D labeled array
import pandas as pd

s = pd.Series([85, 92, 78], index=['Alice', 'Bob', 'Carol'])
DataFrame — 2D table
df = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Age': [25, 30],
    'Score': [85, 92]
})
Essential Pandas Operations
df.head()             # First 5 rows
df.info()             # Types, null counts
df.describe()         # Statistical summary
df.isnull().sum()     # Missing values per column
df.dropna()           # Remove rows with nulls
df[df['Age'] > 25]    # Filter rows
df[['Name','Score']]  # Select columns
Null & Duplicate Value Treatment
What is it?

Data preprocessing step where missing values (nulls/NaN) and duplicate rows are detected and handled before any analysis or modeling. Dirty data = wrong models. Clean data = reliable results.

Why does it matter?

Most ML algorithms cannot handle NaN values — they either crash or produce silently wrong results. Duplicate rows bias your model by over-representing certain patterns and inflating evaluation metrics.

๐Ÿ•ณ๏ธ Null / Missing Values โ€” Detection

Missing values appear as NaN, None, NULL, empty strings, or placeholder values like -999, "unknown", "N/A". First step is always detection.

Detection Code
# Check nulls per column
df.isnull().sum()

# Percentage of nulls
df.isnull().mean() * 100

# Heatmap of nulls
import missingno as msno
msno.matrix(df)

# Which rows have nulls
df[df.isnull().any(axis=1)]
Types of Missingness
Type | Meaning | Example
MCAR | Missing Completely At Random — no pattern | Random data entry errors
MAR | Missing At Random — depends on other columns | Income missing more for younger people
MNAR | Missing Not At Random — depends on the missing value itself | High earners skip salary questions
MNAR is the hardest to handle — the missingness itself carries information.
🔧 Null Treatment Strategies — How to Handle
Strategy 1: Drop Rows/Columns
# Drop rows with any null
df.dropna(inplace=True)

# Drop columns with >50% null (thresh = minimum non-null values required to keep)
thresh = len(df) * 0.5
df.dropna(axis=1, thresh=thresh)
⚠️ Only drop rows if <5% are missing. Never drop a column without analysis — it may contain information.
Strategy 2: Fill with Statistics
# Numerical: fill with mean or median
df['Age'].fillna(df['Age'].mean())
df['Salary'].fillna(df['Salary'].median())

# Categorical: fill with mode
df['City'].fillna(df['City'].mode()[0])
Strategy 3: Forward/Backward Fill (Time Series)
# Use previous value (time series)
df.ffill()   # df.fillna(method='ffill') is deprecated in modern pandas

# Use next value
df.bfill()
Strategy 4: Advanced Imputation
from sklearn.impute import KNNImputer

# KNN Imputation (uses similar rows)
imputer = KNNImputer(n_neighbors=5)
df_imputed = imputer.fit_transform(df)

# Iterative Imputer (model-based)
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
Which Strategy to Use?
Situation | Recommended Strategy | Why
Numerical, few nulls (<5%), symmetric distribution | Fill with Mean | Mean preserves overall average
Numerical, skewed distribution or outliers | Fill with Median | Median is robust to outliers
Categorical column | Fill with Mode | Most frequent value is safest assumption
Time series data | Forward/Backward Fill | Preserves temporal continuity
Many nulls, complex relationships | KNN or Iterative Imputer | Uses relationships between columns
Column has >70% nulls | Drop the Column | Too little data to be reliable
MNAR — missingness has meaning | Create "was_null" flag column | Preserves information in the missingness
👥 Duplicate Values — Detection & Treatment
What are duplicates?

Rows that have identical values across all (or selected) columns. They arise from data merges, scraping, re-submissions, or system errors.

Why remove them?

Duplicates inflate dataset size, bias model training (the model sees certain patterns more often than they actually occur), and artificially inflate evaluation metrics.

Detection
# Count total duplicates
df.duplicated().sum()

# View duplicate rows
df[df.duplicated(keep=False)]

# Duplicates on specific columns
df.duplicated(subset=['Name','Email']).sum()
Treatment
# Remove all duplicates (keep first)
df.drop_duplicates(inplace=True)

# Keep last occurrence
df.drop_duplicates(keep='last')

# Based on key columns only
df.drop_duplicates(
    subset=['CustomerID', 'OrderDate'],
    keep='first'
)
🔑 keep='first' keeps the earliest entry (usually more reliable). keep='last' keeps the most recent (useful if latest data is more accurate). keep=False marks ALL duplicates — useful for investigation.
🔄 Complete Data Cleaning Pipeline
import pandas as pd
import numpy as np

# 1. Load data
df = pd.read_csv('data.csv')
print(f"Shape: {df.shape}")

# 2. Check nulls
null_pct = df.isnull().mean() * 100
print(null_pct[null_pct > 0])

# 3. Drop high-null columns (>70%)
df.drop(columns=null_pct[null_pct > 70].index, inplace=True)

# 4. Fill numerical with median
num_cols = df.select_dtypes(include='number').columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# 5. Fill categorical with mode
cat_cols = df.select_dtypes(include='object').columns
for col in cat_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

# 6. Remove duplicates
df.drop_duplicates(inplace=True)
print(f"Clean Shape: {df.shape}")
df.isnull().sum()  # Should be all zeros
✅ Always run this pipeline BEFORE EDA. You want to analyze clean data, not discover during modeling that you have 15% nulls.
EDA — Exploratory Data Analysis
What is EDA?

EDA is the process of systematically investigating your dataset before building any model. You examine shape, distributions, relationships, missing values, and anomalies using statistics and visualizations.

Why do we need EDA?

Without EDA, you're modeling blindly. EDA prevents you from feeding garbage to your model — you catch outliers, skewed distributions, data leakage, encoding issues, and multicollinearity early.

EDA Process Overview
📥 Raw Dataset (after null/duplicate cleaning)
↓
📊 Univariate (1 variable at a time) · 🔗 Bivariate (2 variables together) · 🌐 Multivariate (3+ variables)
↓
🧹 Handle Outliers, Skew, Imbalance
↓
✅ Data Ready for Feature Engineering & Modeling
EDA does not change your data — it only helps you understand it. Transformations come later in Feature Engineering.
Univariate Analysis
What is it?

Analyzing one variable at a time. You look at how values of a single column are distributed — their shape, center, spread, and extreme values.

Why do we need it?

Before comparing features, you need to understand each individually. A bimodal distribution, heavy skew, or extreme outlier in a single column can break models if not addressed.

Categorical Variable
  • Count each category → frequency table
  • Find the most common value → Mode
  • Check if any category dominates (>90%)
Visualizations
Bar Chart, Pie Chart, Count Plot
Numerical Variable
  • Mean — average (affected by outliers)
  • Median — middle value (robust)
  • Mode — most frequent value
  • Variance & SD — spread
  • IQR — middle 50% range
  • Skewness — distribution symmetry
Visualizations
Histogram, Box Plot, KDE Plot
IQR, Variance & Standard Deviation
📦 IQR — Interquartile Range

Measures the spread of the middle 50% of data. Robust to extreme outliers.

IQR = Q3 − Q1  |  Lower = Q1 − 1.5×IQR  |  Upper = Q3 + 1.5×IQR
Dataset: [4, 7, 10, 13, 16, 20, 25]
Q1 = 7, Median = 13, Q3 = 20 → IQR = 20 − 7 = 13
Lower = 7 − 19.5 = −12.5 | Upper = 20 + 19.5 = 39.5 → No outliers ✅
(Box plot: Min — Q1 — Median — Q3 — Max; points beyond the 1.5×IQR whiskers are outliers.)
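The worked example can be checked with NumPy. Note that NumPy's default quartile method uses linear interpolation, so it gives slightly different quartiles (Q1 = 8.5, Q3 = 18.0) than the hand-computed median-of-halves values above; the conclusion (no outliers) is the same:

```python
import numpy as np

data = np.array([4, 7, 10, 13, 16, 20, 25])

q1, q3 = np.percentile(data, [25, 75])   # linear-interpolation quartiles
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = data[(data < lower) | (data > upper)]   # empty here
```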
๐Ÿ“ Variance & Standard Deviation
What is it?

Variance measures how far each value is spread from the mean. Standard Deviation (SD) is simply the square root of variance — bringing it back to the original unit so it's easier to interpret. A high SD means values are widely scattered; a low SD means they're tightly clustered around the mean.

Why do we need it?

Mean alone doesn't tell the full story. Two datasets can have the same mean but completely different spreads. SD tells you how reliable and consistent the data is — essential for detecting outliers, comparing features, and understanding model uncertainty.

Population: σ² = Σ(xᵢ − μ)²/N  |  Sample: s² = Σ(xᵢ − x̄)²/(n−1)  |  SD = √(variance)
1. Calculate Mean (x̄)
2. Subtract: (xᵢ − x̄)
3. Square: (xᵢ − x̄)²
4. Sum and divide by n−1: s²
5. √s² = Standard Deviation
Divide by n−1 for a sample (Bessel's correction) — avoids underestimating variance.
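A quick sanity check of the steps above with NumPy (the scores are arbitrary example values); ddof=1 applies Bessel's correction:

```python
import numpy as np

scores = np.array([85, 92, 78, 88, 95])
mean = scores.mean()             # x̄ = 87.6
var_sample = scores.var(ddof=1)  # divide by n−1 (Bessel's correction) → 43.3
std_sample = scores.std(ddof=1)  # back in the original units (≈ 6.58)
```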
Outlier Detection
What is it?

Identifying data points that deviate significantly from the rest. Methods include IQR fences (box plot whiskers) and Z-score thresholds based on standard deviations from the mean.

Why do we need it?

Outliers distort model coefficients, inflate RMSE, and skew correlations. A single extreme value can move a regression line significantly. You must detect each outlier and decide: fix it or remove it.

Skewness & What It Means
Right Skew (Positive)
Mean > Median — outliers pulling the mean up (tail points RIGHT)

Examples: Income data, house prices

Left Skew (Negative)
Mean < Median — outliers pulling the mean down (tail points LEFT)

Examples: Test scores (a few very low)

Symmetric: Mean ≈ Median ≈ Mode
Z-Score Method
What is it?

A Z-Score tells you how many standard deviations a value is away from the mean. A Z of 0 means the value is exactly at the mean. A Z of +2 means it's 2 standard deviations above. It standardizes any numerical column to a common scale so values from different features can be compared fairly.

Why do we need it?

Raw values alone don't tell you if something is unusual. A salary of ₹5,00,000 could be normal or extreme depending on the dataset. Z-Score gives context — |Z| > 3 flags a value as a statistical outlier, sitting beyond 99.7% of the distribution. It's also the foundation of standardization (StandardScaler) used before many ML models.

Z = (X − μ) / σ    →    |Z| > 3 = Outlier
Empirical Rule (68-95-99.7)
±1σ → 68% | ±2σ → 95% | ±3σ → 99.7%
|Z| > 3 = Outlier — beyond 99.7% of the normal distribution.
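A minimal sketch of Z-score outlier flagging (simulated data with one planted extreme value; the 50/5 distribution parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
data = np.concatenate([rng.normal(50, 5, 1000), [120.0]])  # one planted outlier

z = (data - data.mean()) / data.std()   # standardize to Z-scores
outliers = data[np.abs(z) > 3]          # |Z| > 3 rule flags the 120.0
```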
Bivariate Analysis
What is it?

Analyzing the relationship between two variables simultaneously — how they move together (numerical vs numerical), differ across groups (categorical vs numerical), or associate (categorical vs categorical).

Why do we need it?

Feature selection, correlation detection, and hypothesis testing all start here. You find which features actually relate to the target — and which are noise.

🔢 Num vs Num — Step 1: Covariance (Direction)

Covariance measures whether two variables move in the same direction or opposite directions. It shows the direction of the relationship, but NOT how strong it is because it depends on the units of measurement.

Cov(X,Y) = Σ(xᵢ − x̄)(yᵢ − ȳ) / (n−1)
✅ Positive Covariance

When X increases, Y also increases. They deviate from their means in the same direction.

📌 Height & Weight — taller people tend to weigh more → Cov > 0
🔻 Negative Covariance

When X increases, Y decreases. They deviate in opposite directions.

📌 Study hours & Exam errors — more study, fewer mistakes → Cov < 0
⚠️ Problem: Covariance has no fixed scale. That's why we normalize it → Correlation.
📏 Num vs Num — Step 2: Pearson Correlation (Strength + Direction)
r = Cov(X,Y) / (σₓ × σᵧ)     Range: −1 to +1
Property | Covariance | Correlation
What it tells | Direction only | Direction + Strength
Range | −∞ to +∞ | −1 to +1
Unit dependent? | Yes | No (unitless)

r Value | Meaning
+0.8 to +1.0 | Strong positive
+0.4 to +0.7 | Moderate positive
≈ 0 | No relationship
−0.7 to −0.4 | Moderate negative
−1.0 to −0.8 | Strong negative
🔑 If two features have r > 0.8 → multicollinearity! One can be dropped safely.

Visualizations: Scatter Plot, Heatmap, Pair Plot
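Both steps can be verified numerically. The height/weight values below are made up to be perfectly linear, so r comes out at 1 (up to floating-point error) while the covariance stays in raw cm·kg units:

```python
import numpy as np

height = np.array([150, 160, 170, 180, 190])   # cm
weight = np.array([50, 58, 66, 74, 82])        # kg, exactly linear in height

cov = np.cov(height, weight, ddof=1)[0, 1]     # direction, unit-dependent (200.0)
r = np.corrcoef(height, weight)[0, 1]          # direction + strength, unitless (1.0)
```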

Categorical vs Numerical
What is it?

Testing whether a numerical variable (e.g. salary, score) significantly differs across categories (e.g. department, gender). We set up a null hypothesis H₀ and use a test statistic to decide whether to reject it.

Why do we need it?

Tells you whether a categorical feature has real predictive power for a numerical target. If salary doesn't differ by department, that feature is noise.

🧪 How Hypothesis Testing Works
Hypothesis | Meaning
H₀ (Null) | No difference exists — any observed gap is just random chance
H₁ (Alternative) | A real difference exists — it's not just random noise
Example: H₀ = "Male and female scores are the same"

The test statistic (t or F) is placed on its probability distribution. The P-value is the area in the tail beyond that value — the probability of seeing a result this extreme by pure chance alone, assuming H₀ is true.

Small P-value → result is very unlikely by chance → reject H₀ → real difference exists
T-Test — Comparing 2 Groups

Use when the categorical variable has exactly 2 groups.

t = (x̄₁ − x̄₂) / √(s₁²/n₁ + s₂²/n₂)
Result | What it means | Action
p ≤ 0.05 | Groups are significantly different | Keep feature
p > 0.05 | No significant difference | Feature may be useless
ANOVA — Comparing 3+ Groups

Use when the categorical variable has 3 or more groups.

F = Between-Group Variance / Within-Group Variance
Result | What it means | Next Step
p ≤ 0.05 | At least one group differs significantly | Run Post-Hoc (Tukey)
p > 0.05 | No significant difference | Feature may not add value
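A sketch of both tests with scipy.stats (assumed available; the simulated group means are arbitrary but deliberately far apart, so both p-values land below 0.05):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Two groups with genuinely different means → Welch's t-test
group_a = rng.normal(70, 10, 50)
group_b = rng.normal(80, 10, 50)
t_stat, p_t = stats.ttest_ind(group_a, group_b, equal_var=False)

# Three groups → one-way ANOVA
d1, d2, d3 = (rng.normal(m, 10, 40) for m in (60, 65, 75))
f_stat, p_f = stats.f_oneway(d1, d2, d3)
```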
Categorical vs Categorical
What is it?

Testing whether two categorical variables are associated or independent. We compare observed data to what we would expect if the variables had absolutely no relationship.

Why do we need it?

Tells you if a categorical feature is related to a categorical target. If Gender and Movie Preference are independent, Gender won't help predict preferences.

χ² Chi-Square Test
What is it?

The Chi-Square test measures how much the observed counts in a cross-table differ from what we'd expect if the two variables had absolutely no relationship. The larger the χ² value, the bigger the gap between observed and expected — meaning a stronger association exists between the two categories.

Why do we need it?

When both your feature and target are categorical, you can't use correlation or t-tests. Chi-Square is the go-to test to decide whether a categorical feature is worth keeping — if Gender and Loan Default are independent, Gender adds zero predictive value to your model.

χ² = Σ [(Observed − Expected)² / Expected]
Cross-tab: Gender × Movie Genre
         Action  Romance  Comedy  Total
Male:      60      20       20     100
Female:    15      55       30     100
Total:     75      75       50     200
Expected(Male, Action) = (100 × 75) / 200 = 37.5
χ² contribution = (60 − 37.5)² / 37.5 = 13.5
→ p < 0.001 → Reject H₀ ✅
P-value | Decision | Feature usefulness
p ≤ 0.05 | Reject H₀ ✅ | Variables associated — keep feature
p > 0.05 | Fail to reject H₀ ❌ | Variables independent — may be useless
⚠️ Chi-square only tells you whether an association exists. Use Cramér's V for strength.
๐Ÿ“ Cramรฉr's V โ€” Association Strength
What is it?

Cramรฉr's V is a normalized version of the Chi-Square statistic that measures the strength of association between two categorical variables. It always ranges from 0 to 1 โ€” where 0 means no association at all and 1 means a perfect relationship โ€” regardless of table size or sample size.

Why do we need it?

Chi-Square only tells you whether an association exists, not how strong it is. A huge dataset can produce a significant ฯ‡ยฒ even for a trivially weak relationship. Cramรฉr's V fixes this โ€” it gives you a comparable, scaled strength score so you can rank which categorical features are most useful for your model.

V = โˆš( ฯ‡ยฒ / (n ร— min(rโˆ’1, cโˆ’1)) )   Range: 0 to 1
Cramรฉr's VStrengthWhat it means
0.00โ€“0.10NegligibleBarely any association โ€” ignore
0.10โ€“0.30Weakโ€“ModerateSome association โ€” use with caution
0.30โ€“0.60Moderateโ€“StrongMeaningful association โ€” likely useful
0.60โ€“1.00Very StrongStrong relationship โ€” important feature
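Cramér's V is a one-liner on top of the χ² statistic. A small helper (cramers_v is a made-up name, not a library function), applied to the Gender × Genre table from the previous section:

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(table):
    """V = sqrt(chi2 / (n * min(r-1, c-1))), ranges 0 (independent) to 1 (perfect)."""
    chi2 = chi2_contingency(table)[0]
    n = table.sum()
    r, c = table.shape
    return float(np.sqrt(chi2 / (n * min(r - 1, c - 1))))

table = np.array([[60, 20, 20], [15, 55, 30]])
v = cramers_v(table)   # ≈ 0.48 → moderate-to-strong association
```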
Multivariate Analysis
What is it?

Examining 3 or more variables simultaneously. The key tool is the correlation heatmap — a matrix where each cell shows Pearson's r between two features, color-coded by strength and direction.

Why do we need it?

Reveals multicollinearity (features too similar to each other), which bloats variance and hurts models. Also shows which features correlate with the target.

๐ŸŒก๏ธ Correlation Heatmap
Correlation Matrix โ€” Example Dataset
Age Income Score Exp Target
Age 1.00 0.72 โˆ’0.41 0.81 0.53
Income 0.72 1.00 0.28 0.61 0.78
Score โˆ’0.41 0.28 1.00 โˆ’0.15 0.19
Exp 0.81 0.61 โˆ’0.15 1.00 0.67
Target 0.53 0.78 0.19 0.67 1.00
โˆ’1.0
+1.0 Dark Red = High Positive Correlation
Reading the heatmap: Ageโ€“Exp = 0.81 (multicollinearity โš ๏ธ). Scoreโ€“Target = 0.19 (weak predictor). Incomeโ€“Target = 0.78 (strong predictor โœ…).
import seaborn as sns corr = df.corr() sns.heatmap(corr, annot=True, fmt=".2f", cmap="RdBu_r", center=0)
๐Ÿ” VIF โ€” Variance Inflation Factor (Full Explanation)
What is VIF?

VIF (Variance Inflation Factor) quantifies how much the variance of a regression coefficient is inflated due to multicollinearity — that is, how much one feature can be linearly predicted from the other features.

Why does it matter?

When features are highly correlated with each other, the model gets confused about which feature deserves credit. Coefficients become unstable, unreliable, and have exploding standard errors, making the model meaningless.

How VIF is Calculated — Step by Step

For each feature X in your dataset, VIF asks: "How well can I predict X using all the OTHER features?"

1. Pick one feature — say Age — and set it as the target variable
2. Regress Age on ALL other features (Income, Score, Exp, etc.) using linear regression
3. Record the R² of this regression — how well the other features explain Age
4. Calculate VIF = 1 / (1 − R²). If R² = 0.9 → VIF = 1/(1 − 0.9) = 10 → severe!
5. Repeat for every feature in your dataset
VIF_x = 1 / (1 − R²_x)    where R²_x = R² from regressing X on all other features
Interpreting VIF Values
VIF Value | Meaning | Action
VIF = 1 | No multicollinearity at all — feature is completely independent | Keep ✅
1 < VIF ≤ 5 | Low to moderate — some correlation but acceptable | Keep ✅
5 < VIF ≤ 10 | Moderate to high — investigate carefully | Consider dropping
VIF > 10 | Severe multicollinearity — this feature is nearly redundant | Drop or combine ❌
Intuitive Example

Suppose your dataset has both Height_cm and Height_inches:

Height_cm = 2.54 × Height_inches
R² ≈ 1.0 (perfect prediction) → VIF = 1/(1 − 1.0) = ∞ → Perfect multicollinearity!
The model cannot decide: "Is the price high because Height_cm is 180, or because Height_inches is 70.9?"
→ Coefficients explode, the model is unstable
When VIF is very high, regression coefficients have huge standard errors — small changes in data cause massive swings in coefficients.
VIF vs Correlation Matrix — Key Difference
Property | Correlation Matrix | VIF
What it checks | Pairwise (X₁ vs X₂) relationship only | X vs ALL other features combined
Detects 3-way collinearity? | ❌ No — misses it | ✅ Yes — catches it
Output | −1 to +1 per pair | 1 to ∞ per feature
Example blind spot | X₃ = X₁ + X₂ might show low pairwise r | VIF for X₃ will be very high
🔑 Always use VIF alongside the correlation heatmap. Correlation shows pairwise problems. VIF catches complex multi-way collinearity that correlation matrices miss.
How to Fix High VIF
Option 1: Drop a Feature

If two features are highly correlated, drop the one with less predictive value for the target.

Option 2: Combine Features

Create a single feature from the correlated ones (e.g. ratio, sum, average).

Option 3: Use PCA

Transform correlated features into uncorrelated principal components — VIF = 1 for all components.

VIF Calculation Code
from statsmodels.stats.outliers_influence import variance_inflation_factor
import pandas as pd

# X should be your feature matrix (no target column)
vif_data = pd.DataFrame()
vif_data["Feature"] = X.columns
vif_data["VIF"] = [
    variance_inflation_factor(X.values, i)
    for i in range(len(X.columns))
]
print(vif_data.sort_values("VIF", ascending=False))

# Drop features with VIF > 10
high_vif = vif_data[vif_data["VIF"] > 10]["Feature"].tolist()
X_clean = X.drop(columns=high_vif)

⚙️ Feature Engineering
Data Prep → Transformation → Ready to Model
The full Feature Engineering guide covers: Label Encoding, One-Hot Encoding, Binary Encoding, Target Encoding, Frequency Encoding, plus Feature Selection, Transformation & Creation.
Outlier Treatment — Log & Box-Cox
What is it?

Mathematical transformations applied to a column's values to reduce skewness. They compress large values, making the distribution closer to normal so models work better.

Why do we need it?

Many ML algorithms assume normally distributed features. Highly skewed data gives disproportionate weight to extreme values. Log and Box-Cox transformations fix this without removing any rows.

📈 Log Transformation
What is it?

Log transformation applies the logarithm function to every value in a column — compressing large values and spreading small ones. It turns a right-skewed distribution into something closer to a normal bell curve.

Why do we need it?

Many ML models assume features are roughly normally distributed. Income, house prices, and population counts are heavily right-skewed — a few huge values dominate. Log squashes those extremes so the model treats all values fairly.

y' = log(x)  — x must be > 0. Use log(x+1) if zeros are present.
Situation | Use?
Right-skewed (skewness > +1) | ✅ Yes
Data has zeros | ⚠️ log(x+1)
Negative values | ❌ No
🔧 Box-Cox Transformation
What is it?

Box-Cox is a family of power transformations that automatically finds the best exponent λ (lambda) to make your data as normal as possible. Instead of you guessing whether to use log, square root, or reciprocal — Box-Cox tests them all and picks the optimal one.

Why do we need it?

Log only fixes right skew. Box-Cox handles both left and right skew by tuning λ. When skewness is above +2 or the distribution is complex, Box-Cox finds a better transformation than log alone — but it requires all values to be strictly positive.

Automatically finds the best λ (lambda) to normalize data.

y'(λ) = (y^λ − 1)/λ if λ ≠ 0  |  y' = ln(y) if λ = 0
λ | Effect | Skew Type
2 | y² (Square) | Left-skewed
0.5 | √y | Moderate right
0 | ln(y) | Strong right
−1 | 1/y | Severe right

Skewness | Action
−0.5 to +0.5 | No transform needed
+0.5 to +1.0 | Consider Log or √
> +1.0 | Apply Log
> +2.0 | Apply Box-Cox
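A sketch of both transforms on simulated right-skewed data (lognormal, so the true best λ is near 0 and the log transform restores a near-normal shape):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
income = rng.lognormal(mean=10, sigma=1, size=5000)   # heavily right-skewed

log_t = np.log1p(income)               # log(x+1): safe even with zeros
boxcox_t, lam = stats.boxcox(income)   # finds the best λ automatically

skew_before = stats.skew(income)       # large positive skew
skew_after = stats.skew(log_t)         # close to 0 after the transform
```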
โš ๏ธ When Data Has Negative Values โ€” Use These Instead
Log and Box-Cox both require x > 0. If your column has negative values (e.g. temperature, profit/loss, z-scores), use the methods below instead.
๐Ÿ”„ Yeo-Johnson Transformation

The go-to replacement for Box-Cox when negatives are present. Works on any value โ€” positive, zero, or negative. Automatically finds the best ฮป just like Box-Cox.

ฮปโ‰ 0,xโ‰ฅ0: ((x+1)^ฮปโˆ’1)/ฮป  |  ฮป=0,xโ‰ฅ0: ln(x+1)
ฮปโ‰ 2,x<0: โˆ’((โˆ’x+1)^(2โˆ’ฮป)โˆ’1)/(2โˆ’ฮป)  |  ฮป=2,x<0: โˆ’ln(โˆ’x+1)
PropertyDetail
Handles negatives?โœ… Yes
Handles zeros?โœ… Yes
Auto-finds ฮป?โœ… Yes
Best forAny skewed data, mixed signs
from sklearn.preprocessing import PowerTransformer

pt = PowerTransformer(method='yeo-johnson')
df['col_transformed'] = pt.fit_transform(df[['col']])
# Works with negatives, zeros, positives ✅
∛ Cube Root Transformation

Simple manual method. The cube root of a negative number is still negative — so it naturally handles all signs. Good for moderate skew with negatives.

y' = x^(1/3)  — works for all x, including negatives
Property | Detail
Handles negatives? | ✅ Yes
Handles zeros? | ✅ Yes
Auto-finds λ? | ❌ Fixed power (1/3)
Best for | Moderate skew, simple fix
import numpy as np

# np.cbrt handles negatives correctly
df['col_cbrt'] = np.cbrt(df['col'])
# e.g. cbrt(-27) = -3 ✅   cbrt(0) = 0 ✅
Your Data | Recommended Transform | Why
All positive, right-skewed | Log / Box-Cox | Classic choice, well-understood
Has zeros, right-skewed | log(x+1) / Yeo-Johnson | Avoids log(0) error
Has negatives, any skew | Yeo-Johnson | Works on all signs, auto-finds λ
Has negatives, moderate skew | Cube Root | Simple, no fitting needed
PCA — Dimensionality Reduction
What is it?

Principal Component Analysis transforms many correlated features into fewer uncorrelated "principal components" that capture the maximum variance in the data.

Why do we need it?

Too many correlated features (multicollinearity) cause unstable model weights and slow training. PCA compresses information into independent components — directly eliminating multicollinearity.

PCA Steps
1. Standardize all features (mean=0, SD=1)
2. Compute the Covariance Matrix
3. Calculate Eigenvalues & Eigenvectors
4. Sort by eigenvalue — higher = more variance explained
5. Select the top K using Kaiser's Rule (eigenvalue ≥ 1) or a scree plot
6. Project data onto the K components → new feature matrix
from sklearn.decomposition import PCA

pca = PCA(n_components=0.95)  # keep 95% of variance
X_pca = pca.fit_transform(X_scaled)
Variance Retained Guide
Variance Retained | Quality
<60% | Too low ❌
75–90% | Good ✅
90–95% | Excellent ✅
SMOTE — Class Imbalance
What is it?

Synthetic Minority Over-sampling Technique creates new artificial minority samples by interpolating between existing real minority samples — making class sizes more equal.

Why do we need it?

When 99% of samples are class 0 and 1% are class 1, a model can achieve 99% accuracy by always predicting 0 — useless! SMOTE forces the model to actually learn the minority class.

How SMOTE Works
x_new = x_original + λ × (x_neighbor − x_original)    λ ∈ [0,1]
1. Find all minority class samples
2. For each: find its K nearest minority neighbors (default K=5)
3. Randomly pick one neighbor and generate a random λ ∈ [0,1]
4. Create the synthetic point: x_new = original + λ × (neighbor − original)
5. Repeat until the desired balance is achieved
Critical: Apply SMOTE ONLY on training data — never on test/validation data.
Variants
Variant | When to Use
SMOTE (default) | Standard imbalance, continuous features
Borderline-SMOTE | Focus on hard boundary cases
ADASYN | Adaptive — more samples where density is low
SMOTE-Tomek | SMOTE + clean overlapping points
SMOTENC | Mixed data (numerical + categorical)
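In practice you would use SMOTE from the imbalanced-learn package (fit_resample, on the training split only). The interpolation step itself is simple enough to sketch directly in NumPy (illustrative only; smote_sample is a made-up helper, not a library API):

```python
import numpy as np

def smote_sample(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority points by interpolating toward neighbors."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        x = X_min[i]
        # k nearest minority neighbors (index 0 is the point itself, distance 0)
        d = np.linalg.norm(X_min - x, axis=1)
        neighbors = np.argsort(d)[1:k + 1]
        nb = X_min[rng.choice(neighbors)]
        lam = rng.random()                      # λ ∈ [0, 1]
        synthetic.append(x + lam * (nb - x))    # x_new lies on the segment x → nb
    return np.array(synthetic)

X_min = np.random.default_rng(1).normal(size=(20, 3))  # 20 minority samples
X_new = smote_sample(X_min, n_new=30)                  # 30 synthetic samples
```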
ML Model Types — When to Use What
What is it?

A structured overview of the major ML model families — what problem each solves, what data it expects, and which real-world use cases it's best suited for.

Why do we need it?

Choosing the wrong model type is the single most common ML mistake. The model is a tool — match the tool to the job.

๐Ÿ“ Linear Models
ModelOutputBest Use CaseNotes
Linear RegressionContinuous numberHouse price, salary predictionAssumes linear relationship; sensitive to outliers
Ridge RegressionContinuous numberSame as linear but with correlated featuresL2 regularization built-in
Lasso RegressionContinuous numberHigh-dimensional data, feature selection neededL1 regularization โ€” zeros out features
Logistic RegressionProbability (0โ€“1)Spam detection, default prediction, disease yes/noDespite the name, it's a classifier
โœ… Start here first โ€” linear models give you a strong baseline and are fully interpretable.
🌲 Tree-Based & Non-Linear Models
Model | Output | Best Use Case | Notes
Decision Tree | Class or number | Simple classification, interpretable rules | Easy to explain; prone to overfitting
Random Forest | Class or number | General-purpose: fraud, churn, credit scoring | Ensemble of trees; robust, handles missing values
XGBoost / LightGBM | Class or number | Tabular data competitions, ranking, clickthrough rate | Gradient boosting; state-of-the-art on structured data
✅ For most tabular/structured data problems, XGBoost or Random Forest is the go-to choice.
🎯 Classification-Specific Models
Model | Output | Best Use Case | Notes
SVM | Class | Text classification, small datasets | Maximum-margin hyperplane; effective in high dimensions
K-Nearest Neighbors | Class or number | Recommendation systems, anomaly detection | No training phase; slow on large data
Naive Bayes | Class probability | Email spam, sentiment analysis | Fast, simple, works well with text/NLP
🧠 Deep Learning Models
Model | Output | Best Use Case | Notes
ANN (MLP) | Any | Complex tabular patterns | Fully connected layers; general purpose
CNN | Class/label | Image classification, object detection | Convolutional layers capture spatial patterns
RNN / LSTM | Sequence | Time series, language modeling | Remembers long-term dependencies
Transformer | Any | NLP (ChatGPT, BERT), translation | Attention mechanism; state-of-the-art in NLP
⚠️ Deep learning needs a lot of data (>10k rows minimum). For small datasets, tree-based models almost always win.
📈 Forecasting Models (Time Series)

Use when your target variable has a time dimension: tomorrow's sales, next month's temperature, stock prices. Regular ML ignores temporal order — forecasting preserves it.

Time Series Components
Component | What It Is | Example
Trend | Long-term upward/downward direction | Rising sales year over year
Seasonality | Repeating pattern at fixed intervals | High sales every December
Cyclic | Longer irregular waves | Economic recession cycles
Noise | Random unexplainable variation | One-off spike due to an event
Classical Forecasting Models
Model | Formula / Key Idea | Best Use Case
Moving Average | avg of last N observations | Stable, no trend/seasonality — quick baseline
Exponential Smoothing (ETS) | F(t+1) = α·Y(t) + (1−α)·F(t) | Short-term forecast, sales, inventory
ARIMA(p,d,q) | AR + Differencing + MA | Stationary or easily differenced series
SARIMA | ARIMA + seasonal (P,D,Q,m) | Clear seasonality (retail, weather, energy)
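The exponential-smoothing recurrence from the table can be implemented in a few lines of plain Python; the sales numbers and α value here are invented for illustration:

```python
def exponential_smoothing(series, alpha=0.3):
    """Simple exponential smoothing: F(t+1) = alpha*Y(t) + (1-alpha)*F(t)."""
    forecasts = [series[0]]            # seed the first forecast with the first actual
    for y in series[:-1]:
        forecasts.append(alpha * y + (1 - alpha) * forecasts[-1])
    return forecasts

sales = [100, 110, 105, 120, 118, 130]
print(exponential_smoothing(sales, alpha=0.5))
```

Higher α reacts faster to recent values; lower α smooths more aggressively.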
Modern / ML-Based Forecasting Models
Model | Key Idea | Best Use Case
Prophet (Meta) | Trend + seasonality + holidays decomposition | Business forecasting, holiday effects
XGBoost + Lag Features | ML with t-1, t-7, t-30 as input features | Complex relationships, external regressors
LSTM | Recurrent NN with long-term memory | Long sequences, complex temporal patterns
Temporal Fusion Transformer | Attention + multi-horizon forecasting | Large-scale production forecasting
Which Forecasting Model?
Situation | Recommended Model
Quick baseline, stable series | Moving Average / ETS
No seasonality, stationary | ARIMA
Clear seasonality (monthly/weekly) | SARIMA / Holt-Winters
Business data with holidays | Prophet
Complex, many features | XGBoost + lag features
Long sequence, big data | LSTM / Transformer
Key rule: Always split time series by time — train on past, test on future. Never shuffle! Shuffling creates data leakage in time series.
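The time-ordering rule can be enforced with scikit-learn's TimeSeriesSplit, which always places training indices strictly before test indices; the data here is a stand-in:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(100).reshape(-1, 1)    # stand-in for 100 time-ordered observations

for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    # Train is always entirely in the past relative to test; no shuffling
    assert train_idx.max() < test_idx.min()
    print(f"train: 0..{train_idx.max()}  test: {test_idx.min()}..{test_idx.max()}")
```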
๐Ÿ—บ๏ธ Model Selection Guide
Your ProblemRecommended ModelWhy
Predict a number (house price, salary)Linear / Ridge RegressionInterpretable, fast, good baseline
Binary yes/no (spam, fraud, churn)Logistic Regression / XGBoostLogistic for interpretability, XGBoost for performance
Multi-class (A/B/C/D categories)Random Forest / XGBoostHandles multi-class natively; robust
Customer churn predictionXGBoost + SMOTEHandles imbalance; captures complex patterns
Text classification / sentimentNaive Bayes / TransformerNaive Bayes for speed; Transformer for accuracy
Image recognitionCNNDesigned for spatial/visual patterns
Time series / forecastingARIMA / Prophet / LSTMDepends on seasonality and data volume
Anomaly / fraud detectionIsolation Forest / XGBoostHandles extreme class imbalance
Small dataset (<1000 rows)SVM / Logistic RegressionWork well in low-data regimes
Very large structured dataXGBoost / LightGBMScales well; battle-tested on tabular data
๐Ÿ”‘ General rule: Always start with Logistic/Linear Regression as your baseline. If performance is insufficient, move to Random Forest, then XGBoost. Only use Deep Learning if you have large data and unstructured inputs.
Train / Test / Validation Split
What is it?

Dividing your dataset into separate subsets so the model is trained on one portion and evaluated on data it has never seen before. This simulates real-world performance.

Why do we need it?

If you evaluate on training data, your model looks perfect — it just memorized the answers. A separate test set reveals whether it actually learned the pattern or just memorized noise.

Standard Split Ratios
70/30 Split (Simple)
Train (70%)
Test (30%)
80/10/10 Split (With Validation)
Train (80%)
Val (10%)
Test (10%)
60/20/20 Split (Common in DL)
Train (60%)
Val (20%)
Test (20%)
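One way to get the 80/10/10 split with scikit-learn is two successive calls to train_test_split; the X and y below are hypothetical stand-ins:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)   # hypothetical feature matrix
y = np.repeat([0, 1], 500)           # hypothetical balanced binary target

# First carve off the 80% training portion
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Then split the remaining 20% evenly into validation and test
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42)

print(len(X_train), len(X_val), len(X_test))   # 800 100 100
```

The stratify arguments keep the 50/50 class balance intact in all three subsets.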
🟢 Train Set

The model learns from this — adjusts weights, fits patterns. Never used for final evaluation.

🟠 Validation Set

Used to tune hyperparameters (learning rate, depth, regularization) during development.

🔴 Test Set

Final unseen evaluation — touched only ONCE at the very end. Reports real-world performance.

Cross-Validation (K-Fold)

When data is limited, K-Fold CV rotates which portion is used as validation — every sample ends up in the validation fold exactly once.

from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X, y, cv=5, scoring='r2')
print(scores.mean(), scores.std())
Overfitting, Underfitting & Regularization
What is it?

The two fundamental problems in model training. Underfitting = model is too simple to learn the pattern. Overfitting = model memorizes training noise instead of learning the real pattern. Regularization is the fix for overfitting.

Why does it matter?

A model that performs perfectly on training data but fails on new data is useless. The goal is generalization — learning the real underlying pattern. This is the core challenge of ML.

The Bias-Variance Tradeoff
📉
Underfitting
Model is too simple — misses the pattern in both train and test data.
Set | Score
Train | Low ❌
Test | Low ❌
Cause: Too few features, too shallow a model
Fix: Add more features, use a more complex model
✅
Good Fit
Model generalizes well — learns the real pattern, not the noise.
Set | Score
Train | High ✅
Test | High ✅
This is the goal — balanced complexity, good generalization
📈
Overfitting
Model memorizes training noise — great on train, fails on new data.
Set | Score
Train | Very High ✅
Test | Low ❌
Cause: Model too complex, too many features
Fix: Regularization (L1/L2), more data, pruning
🔒 Regularization — The Fix for Overfitting

Regularization adds a penalty term to the loss function that punishes large model weights. This forces the model to stay simple — it can't just assign huge weights to every feature to perfectly fit training noise.

Without Reg:  Minimize  Σ(y − ŷ)²
L1 (Lasso):   Minimize  Σ(y − ŷ)² + λΣ|w|    ← penalizes |weight|
L2 (Ridge):   Minimize  Σ(y − ŷ)² + λΣw²    ← penalizes weight²
ElasticNet:   Combine both L1 + L2

λ (lambda) is the regularization strength. Higher λ = more penalty = simpler model. Too high → underfitting. Tune with cross-validation.

L1 — Lasso Regularization

Penalizes the absolute value of weights. Pushes small/irrelevant feature weights all the way to exactly zero — automatically removing them.

Property | Detail
Effect on weights | Shrinks weak ones to exactly 0
Feature selection | ✅ Automatic
Correlated features | Randomly drops one of them
Best for | High-dim data, sparse features
Use L1 when you suspect many irrelevant features exist and want automatic feature selection.
L2 — Ridge Regularization

Penalizes the squared value of weights. Shrinks all weights toward zero but never reaches exactly zero — keeps all features, just with smaller influence.

Property | Detail
Effect on weights | Shrinks all, never to exactly 0
Feature selection | ❌ Keeps all features
Correlated features | Distributes weight across them
Best for | Correlated features, regression
Use L2 when all features might be relevant, especially with correlated inputs.
L1 vs L2 vs ElasticNet
Method | Penalty | Zeros out features? | Best Use Case
L1 Lasso | λΣ|w| | ✅ Yes | Feature selection, sparse data
L2 Ridge | λΣw² | ❌ No | Correlated features, regression
ElasticNet | L1 + L2 | ✅ Sometimes | Both benefits, general purpose
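The zeroing behavior in the comparison is easy to verify on synthetic data where only two of ten features matter; the feature counts and alpha values below are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only features 0 and 1 drive the target; the other 8 are pure noise
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)   # alpha plays the role of lambda
ridge = Ridge(alpha=1.0).fit(X, y)

print("Lasso zero weights:", (lasso.coef_ == 0).sum())   # noise weights cut to exactly 0
print("Ridge zero weights:", (ridge.coef_ == 0).sum())   # shrunk, but none exactly 0
```

Lasso keeps the two real signals and drops the noise columns; Ridge keeps all ten with small weights.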
Evaluation Metrics
What is it?

Quantitative measures that tell you how well your model performs. Different metrics are appropriate for regression vs classification vs forecasting models.

Why does it matter?

Accuracy alone is misleading (class imbalance). RMSE penalizes large errors more than MAE. R² tells you the proportion of variance explained. Choosing the right metric defines what "good" means for your problem.

Quick Cheat Sheet: Regression → MAE/RMSE/R²  |  Classification → Accuracy/F1/AUC  |  Imbalanced → F1/Precision/Recall
📉 Regression Metrics
MAE — Mean Absolute Error
MAE = Σ|yᵢ − ŷᵢ| / n
Average absolute difference. Easy to interpret — same units as target. Not sensitive to large outliers. Best for: Robust eval when outliers exist
MSE — Mean Squared Error
MSE = Σ(yᵢ − ŷᵢ)² / n
Squares errors — penalizes large errors heavily. Units are squared. Used as loss function during training. Best for: Penalizing big mistakes
RMSE — Root Mean Squared Error
RMSE = √MSE
Square root of MSE — back to original units. Penalizes outliers more than MAE. Most commonly reported regression metric. Best for: General-purpose regression evaluation
R² — Coefficient of Determination
R² = 1 − (SS_res / SS_tot)
Proportion of variance in target explained by the model. R²=1 is perfect, R²=0 means the model is no better than predicting the mean.
R² Value | Interpretation
0.90–1.00 | Excellent
0.70–0.89 | Good
0.50–0.69 | Moderate
< 0.50 | Weak
MAPE — Mean Absolute Percentage Error
MAPE = (1/n) Σ |yᵢ − ŷᵢ| / |yᵢ| × 100%
Error as a percentage of the actual value. Scale-independent — compare across datasets. Problem: undefined when yᵢ = 0. Best for: Forecasting, business reporting
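All of the regression formulas above can be computed directly with NumPy; the tiny y_true/y_pred arrays are made up for illustration:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mae = np.mean(np.abs(y_true - y_pred))        # 0.75
mse = np.mean((y_true - y_pred) ** 2)         # 0.875
rmse = np.sqrt(mse)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
mape = np.mean(np.abs(y_true - y_pred) / np.abs(y_true)) * 100  # breaks if any y is 0

print(f"MAE={mae}  RMSE={rmse:.3f}  R2={r2:.3f}  MAPE={mape:.1f}%")
```

scikit-learn exposes the same quantities as mean_absolute_error, mean_squared_error, and r2_score.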
✅ Classification Metrics
Confusion Matrix (Foundation)
                  Predicted Positive   Predicted Negative
Actual Positive   TP (True Pos)        FN (False Neg)  ← Type II Error
Actual Negative   FP (False Pos)       TN (True Neg)
                  ↑ Type I Error

TP = correctly predicted positive
TN = correctly predicted negative
FP = predicted positive but was negative (false alarm)
FN = predicted negative but was positive (missed)
Accuracy
(TP + TN) / (TP+TN+FP+FN)
Overall correct predictions. Misleading with imbalanced classes. Use only when classes are balanced
Precision
TP / (TP + FP)
Of all predicted positives, how many were actually positive? Use when: False Positives are costly (spam filter, legal)
Recall (Sensitivity)
TP / (TP + FN)
Of all actual positives, how many did we catch? Use when: False Negatives are costly (cancer detection, fraud)
F1 Score
2 × (Precision × Recall) / (Precision + Recall)
Harmonic mean of Precision and Recall — balanced single metric. Best for imbalanced datasets.
F1 Value | Quality
0.85+ | Excellent
0.70–0.84 | Good
0.50–0.69 | Fair
< 0.50 | Poor
Best for: Imbalanced classification (fraud, disease)
AUC-ROC
Area Under the ROC Curve
ROC plots True Positive Rate vs False Positive Rate at every threshold. AUC=1 perfect, AUC=0.5 random. Threshold-independent.
AUC 0.5
Random
AUC 0.7
Acceptable
AUC 0.9
Excellent
Best for: Binary classification, comparing models
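All four core classification metrics fall straight out of the confusion-matrix counts; the counts below are invented for illustration:

```python
# Hypothetical confusion-matrix counts for 1000 predictions
tp, fp, fn, tn = 80, 10, 20, 890

accuracy = (tp + tn) / (tp + tn + fp + fn)          # 0.97
precision = tp / (tp + fp)                          # of predicted positives, how many real
recall = tp / (tp + fn)                             # of actual positives, how many caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"acc={accuracy:.2f}  prec={precision:.3f}  rec={recall:.3f}  f1={f1:.3f}")
```

Note how accuracy looks excellent (0.97) even though a fifth of the actual positives were missed — exactly the imbalance trap described above.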
Quick Reference — Which Metric?
Problem Type | Primary Metric | Secondary Metric
Regression (general) | RMSE | R²
Regression (outliers present) | MAE | RMSE
Regression (business %) | MAPE | MAE
Binary Classification (balanced) | Accuracy | AUC-ROC
Binary Classification (imbalanced) | F1 Score | Precision/Recall
Fraud Detection (miss = bad) | Recall | F1
Spam Filter (FP = bad) | Precision | F1
Multi-class Classification | Macro F1 | Accuracy
Forecasting (time series) | MAPE | MAE

Best Practices — Train a Better Model
What is it?

A consolidated checklist of the most important habits, rules, and decisions that separate a model that actually works in production from one that only looks good on paper.

Why does it matter?

Knowing individual techniques is not enough. How you combine them, in what order, and what mistakes to avoid determines whether your model generalizes or fails silently on real data.

🧹 1. Data Quality First
  • Always clean before you model — handle nulls, duplicates, and wrong data types before any EDA or training.
  • Check for data leakage — never let future information or target-derived features sneak into your training set.
  • Understand your data source — know how it was collected, what each column means, and what could go wrong.
  • Validate data types — a column stored as string that should be numeric will silently break everything.
Garbage in = garbage out. No model can fix fundamentally bad data.
✂️ 2. Always Split Before You Touch the Data
  • Split train/test first — before scaling, encoding, imputing, or any transformation.
  • Fit transformers on train only — then apply (transform) to both train and test. Never fit on the full dataset.
  • Never peek at the test set — it must stay completely unseen until final evaluation. Touching it earlier inflates your metrics.
  • Use stratified split for classification — ensures class proportions are preserved in both sets.
⚠️ Fitting a scaler or encoder on the full dataset before splitting is one of the most common and damaging mistakes in ML.
📊 3. Always Start with a Baseline
  • Build the simplest model first — Logistic Regression for classification, Linear Regression for regression.
  • A baseline tells you the minimum bar — if your complex model barely beats it, the complexity isn't worth it.
  • Compare all models against the baseline — not against each other in isolation.
  • A dummy classifier (always predict majority class) is your floor — your model must beat it or it's useless.
✅ Baseline first → then iterate. Never jump straight to XGBoost or deep learning.
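A sketch of the floor-then-baseline idea with scikit-learn's DummyClassifier; the synthetic target is deliberately learnable, so the real model should clear the floor easily:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)     # hypothetical linearly separable target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# Floor: always predict the majority class
floor = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr).score(X_te, y_te)
# Baseline: the simplest real model
baseline = LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te)

print(f"dummy floor: {floor:.2f}   logistic baseline: {baseline:.2f}")
```

Any candidate model that cannot beat both numbers is not worth its complexity.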
🔧 4. Feature Engineering Over Model Complexity
  • Better features beat better models — a simple model with great features outperforms a complex model with raw features.
  • Encode categoricals correctly — ordinal data needs label encoding, nominal needs one-hot or target encoding.
  • Scale features for distance-based models — KNN, SVM, and neural networks are sensitive to feature scale. Tree models are not.
  • Remove highly correlated features — check VIF and correlation heatmap before training linear models.
  • Create domain-driven features — house age, price per sqft, interaction terms often matter more than raw columns.
🚫 5. Avoid These Common Mistakes
Mistake | Why It's Dangerous
Fitting scaler on full data | Test data leaks into preprocessing — inflated metrics
Using accuracy on imbalanced data | 99% accuracy by predicting majority class — useless model
Applying SMOTE before splitting | Synthetic samples in test set — fake performance
Dropping nulls without checking % | Losing 30% of data silently biases the model
Label encoding nominal features | Implies false order — model learns wrong relationships
Shuffling time series data | Future data trains the model — data leakage
Tuning on test set | Test set becomes part of training — overly optimistic results
Ignoring class imbalance | Model never learns minority class — fails in production
Skipping EDA | Outliers, skew, and wrong types break models silently
Using R² alone for regression | High R² can still mean terrible predictions on new data
๐Ÿ” 6. Validate Properly
  • Use K-Fold cross-validation when data is limited โ€” gives a more reliable estimate than a single train/test split.
  • Use stratified K-Fold for classification โ€” preserves class balance in every fold.
  • Report mean ยฑ std of CV scores โ€” a model with high variance across folds is unstable.
  • Pick the right metric for your problem โ€” F1 for imbalanced, RMSE for regression, AUC for ranking.
  • Never report only training accuracy โ€” always report validation or test performance.
A model with Train=98%, Test=62% is overfit. A model with Train=78%, Test=76% is production-ready.
🚀 7. The Golden Checklist Before Training
Clean data — nulls handled, duplicates removed, types correct
Split train/test — before any transformation or encoding
EDA done — distributions understood, outliers identified, correlations checked
Features engineered — encoded, scaled, skew fixed, irrelevant features dropped
Imbalance handled — SMOTE or class weights applied on training data only
Baseline built — simple model trained and evaluated first
Right metric chosen — matches the business problem, not just accuracy
Cross-validation used — not just a single train/test split
Test set touched only once — at the very end for final reporting
✅ Follow this order every time and you will avoid 90% of the mistakes that make models fail in production.