Machine Learning
From data fundamentals to model evaluation
Definition
Collection
Preparation
Analysis
Building & Tuning
Monitor
AI is the broad field of making machines simulate human intelligence. ML is a subset where machines learn from data. DL uses deep neural networks for complex patterns.
Understanding the hierarchy helps you choose the right approach — rule-based AI, statistical ML, or deep neural nets — based on your data size, interpretability needs, and compute budget.
Broad field of making machines simulate human intelligence — reasoning, problem-solving, language.
Subset of AI where machines learn patterns from data without explicit programming rules.
Subset of ML using multi-layered neural networks for images, audio, text.
| Scenario | Use | Why |
|---|---|---|
| Rule-based chatbot, fixed logic | AI | No learning needed |
| Predict prices, detect fraud | ML | Structured data, interpretable |
| Image/speech recognition | DL | Complex unstructured patterns |
| Small dataset (<1000 rows) | ML | DL needs large data |
| Need explainability (medical) | ML | DL is a black box |
ML is categorized by the type of feedback a model receives during learning — labeled data (supervised), no labels (unsupervised), or rewards from actions (reinforcement).
Choosing the wrong type means your model can't learn at all. A supervised model needs labels — if you don't have them, you need a different approach entirely.
Model learns from labeled data — input + correct output pairs.
Classification
Output = category
- Logistic Regression
- Decision Tree
- Random Forest
- SVM, KNN
- XGBoost
Regression
Output = number
- Linear Regression
- Ridge (L2)
- Lasso (L1)
- ElasticNet
- XGBoost
Finds hidden patterns with no labels.
Clustering
- K-Means
- DBSCAN
- Hierarchical
Dimensionality Reduction
- PCA, t-SNE, UMAP
Association
- Apriori, FP-Growth
Agent learns by interacting with environment, receiving rewards or penalties.
Data can be categorical (labels/groups) or numerical (measured quantities). Each subtype — nominal, ordinal, discrete, continuous — determines how you can process and model it.
Data type determines which encoding to apply, which algorithms work, and which statistical tests are valid. Treating ordinal as nominal (or vice versa) leads to wrong models.
Nominal (No Order)
Ordinal (Has Order)
Discrete (Countable)
Continuous (Measurable)
Pandas is Python's most powerful data manipulation library. It provides DataFrame and Series structures for loading, cleaning, transforming, and exploring tabular data.
80% of ML work is data preparation. Pandas is how you inspect nulls, filter rows, encode features, merge datasets, and create new columns — all before training begins.
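As a quick illustration, a first-look inspection in pandas might read as follows (the tiny DataFrame here is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41],
    "city": ["Pune", "Delhi", "Delhi", None],
    "salary": [50000, 64000, 58000, 72000],
})

print(df.shape)           # (rows, columns)
print(df.dtypes)          # data type of each column
print(df.isnull().sum())  # null count per column
print(df.describe())      # summary statistics for numeric columns
```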
Data preprocessing step where missing values (nulls/NaN) and duplicate rows are detected and handled before any analysis or modeling. Dirty data = wrong models. Clean data = reliable results.
Most ML algorithms cannot handle NaN values — they either crash or produce silently wrong results. Duplicate rows bias your model by over-representing certain patterns and inflating evaluation metrics.
Missing values appear as NaN, None, NULL, empty strings, or placeholder values like -999, "unknown", "N/A". First step is always detection.
Detection Code
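A minimal detection sketch, assuming a hypothetical DataFrame where -999 and "unknown" act as placeholder values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"salary": [50000, np.nan, -999, 61000],
                   "dept": ["HR", "IT", "unknown", "IT"]})

# Standardize placeholder values to NaN before counting
df = df.replace({-999: np.nan, "unknown": np.nan, "N/A": np.nan, "": np.nan})

print(df.isnull().sum())            # nulls per column
print(df.isnull().mean() * 100)     # percent missing per column
print(df[df.isnull().any(axis=1)])  # rows containing any null
```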
Types of Missingness
| Type | Meaning | Example |
|---|---|---|
| MCAR | Missing Completely At Random — no pattern | Random data entry errors |
| MAR | Missing At Random — depends on other columns | Income missing more for younger people |
| MNAR | Missing Not At Random — depends on the missing value itself | High earners skip salary questions |
Strategy 1: Drop Rows/Columns
Strategy 2: Fill with Statistics
Strategy 3: Forward/Backward Fill (Time Series)
Strategy 4: Advanced Imputation
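The four strategies above can be sketched together on a small hypothetical DataFrame (column names are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0],
                   "b": [10.0, 20.0, np.nan, 40.0]})

# Strategy 1: drop rows (or columns) containing nulls
dropped = df.dropna()

# Strategy 2: fill with a statistic (median for skewed, mean for symmetric)
filled = df.fillna({"a": df["a"].median(), "b": df["b"].mean()})

# Strategy 3: forward/backward fill (time series)
ffilled = df.ffill().bfill()

# Strategy 4: advanced imputation, e.g. KNN uses similar rows to estimate values
knn = pd.DataFrame(KNNImputer(n_neighbors=2).fit_transform(df),
                   columns=df.columns)
```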
Which Strategy to Use?
| Situation | Recommended Strategy | Why |
|---|---|---|
| Numerical, few nulls (<5%), symmetric distribution | Fill with Mean | Mean preserves overall average |
| Numerical, skewed distribution or outliers | Fill with Median | Median is robust to outliers |
| Categorical column | Fill with Mode | Most frequent value is safest assumption |
| Time series data | Forward/Backward Fill | Preserves temporal continuity |
| Many nulls, complex relationships | KNN or Iterative Imputer | Uses relationships between columns |
| Column has >70% nulls | Drop the Column | Too little data to be reliable |
| MNAR — missingness has meaning | Create "was_null" flag column | Preserves information in the missingness |
Rows that have identical values across all (or selected) columns. They arise from data merges, scraping, re-submissions, or system errors.
Duplicates inflate dataset size, bias model training (the model sees certain patterns more often than they actually occur), and artificially inflate evaluation metrics.
Detection
Treatment
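Detection and treatment can both be done in pandas; a minimal sketch on hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 2, 3], "city": ["A", "B", "B", "C"]})

# Detection
print(df.duplicated().sum())          # count of fully duplicated rows
print(df[df.duplicated(keep=False)])  # view every copy of each duplicate

# Treatment: keep the first occurrence, drop the rest
df = df.drop_duplicates(keep="first").reset_index(drop=True)
```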
EDA is the process of systematically investigating your dataset before building any model. You examine shape, distributions, relationships, missing values, and anomalies using statistics and visualizations.
Without EDA, you're modeling blindly. EDA prevents you from feeding garbage to your model — you catch outliers, skewed distributions, data leakage, encoding issues, and multicollinearity early.
1 variable at a time
2 variables together
3+ variables
Analyzing one variable at a time. You look at how values of a single column are distributed โ their shape, center, spread, and extreme values.
Before comparing features, you need to understand each individually. A bimodal distribution, heavy skew, or extreme outlier in a single column can break models if not addressed.
- Count each category → frequency table
- Find the most common value → Mode
- Check if any category dominates (>90%)
Visualizations
- Mean → average (affected by outliers)
- Median → middle value (robust)
- Mode → most frequent value
- Variance & SD → spread
- IQR → middle 50% range
- Skewness → distribution symmetry
Visualizations
Measures the spread of the middle 50% of data. Robust to extreme outliers.
Variance measures how far each value is spread from the mean. Standard Deviation (SD) is simply the square root of variance — bringing it back to the original unit so it's easier to interpret. A high SD means values are widely scattered; a low SD means they're tightly clustered around the mean.
Mean alone doesn't tell the full story. Two datasets can have the same mean but completely different spreads. SD tells you how reliable and consistent the data is — essential for detecting outliers, comparing features, and understanding model uncertainty.
Identifying data points that deviate significantly from the rest. Methods include IQR fences (box plot whiskers) and Z-score thresholds based on standard deviations from the mean.
Outliers distort model coefficients, inflate RMSE, and skew correlations. A single extreme value can move a regression line significantly. Must detect and decide: fix or remove.
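Both methods can be sketched on a small hypothetical series:

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 11, 95])  # 95 is an obvious outlier

# IQR fences (box-plot whiskers)
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = s[(s < lower) | (s > upper)]

# Z-score threshold: note that in tiny samples the outlier itself inflates
# the SD, so |Z| > 3 can miss what the IQR fences catch
z = (s - s.mean()) / s.std()
z_outliers = s[z.abs() > 3]
```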
Right Skew (Positive)
Examples: Income data, house prices
Left Skew (Negative)
Examples: Test scores (few very low)
A Z-Score tells you how many standard deviations a value is away from the mean. A Z of 0 means the value is exactly at the mean. A Z of +2 means it's 2 standard deviations above. It standardizes any numerical column to a common scale so values from different features can be compared fairly.
Raw values alone don't tell you if something is unusual. A salary of ₹5,00,000 could be normal or extreme depending on the dataset. Z-Score gives context — |Z| > 3 flags a value as a statistical outlier, sitting beyond 99.7% of the distribution. It's also the foundation of standardization (StandardScaler) used before many ML models.
Empirical Rule (68-95-99.7)
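A minimal standardization sketch (the values are hypothetical):

```python
import numpy as np

values = np.array([40.0, 45.0, 50.0, 55.0, 60.0])

mean, sd = values.mean(), values.std()
z = (values - mean) / sd   # Z of 0 = exactly at the mean

# Empirical rule for a normal distribution: ~68% of values fall within
# |Z| <= 1, ~95% within |Z| <= 2, and ~99.7% within |Z| <= 3
```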
Analyzing the relationship between two variables simultaneously — how they move together (numerical vs numerical), differ across groups (categorical vs numerical), or associate (categorical vs categorical).
Feature selection, correlation detection, and hypothesis testing all start here. You find which features actually relate to the target — and which are noise.
Covariance measures whether two variables move in the same direction or opposite directions. It shows the direction of the relationship, but NOT how strong it is because it depends on the units of measurement.
When X increases, Y also increases. They deviate from their means in the same direction.
When X increases, Y decreases. They deviate in opposite directions.
| Property | Covariance | Correlation |
|---|---|---|
| What it tells | Direction only | Direction + Strength |
| Range | −∞ to +∞ | −1 to +1 |
| Unit dependent? | Yes | No (unitless) |
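The contrast is easy to see in NumPy on a hypothetical pair of variables:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])  # y = 2x: a perfect positive relationship

cov = np.cov(x, y)[0, 1]      # unit-dependent: tells direction only
r = np.corrcoef(x, y)[0, 1]   # unitless: direction + strength, always in [-1, 1]
```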
| r Value | Meaning |
|---|---|
| +0.8 to +1.0 | Strong positive |
| +0.4 to +0.7 | Moderate positive |
| ≈ 0 | No relationship |
| −0.7 to −0.4 | Moderate negative |
| −1.0 to −0.8 | Strong negative |
Visualizations: Scatter Plot, Heatmap, Pair Plot
Testing whether a numerical variable (e.g. salary, score) significantly differs across categories (e.g. department, gender). We set up a null hypothesis H₀ and use a test statistic to decide whether to reject it.
Tells you whether a categorical feature has real predictive power for a numerical target. If salary doesn't differ by department, that feature is noise.
| Hypothesis | Meaning |
|---|---|
| H₀ (Null) | No difference exists — any observed gap is just random chance |
| H₁ (Alternative) | A real difference exists — it's not just random noise |
The test statistic (t or F) is placed on its probability distribution. The P-value = area in the tail beyond that value — the probability of seeing a result this extreme by pure chance alone, assuming H₀ is true.
Use when categorical variable has exactly 2 groups.
| Result | What it means | Action |
|---|---|---|
| p ≤ 0.05 | Groups are significantly different | Keep feature |
| p > 0.05 | No significant difference | Feature may be useless |
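A sketch with SciPy on two hypothetical salary groups:

```python
import numpy as np
from scipy import stats

group_a = np.array([52, 55, 53, 58, 54], dtype=float)
group_b = np.array([70, 72, 68, 75, 71], dtype=float)

# Independent two-sample t-test: H0 = the group means are equal
t_stat, p_value = stats.ttest_ind(group_a, group_b)
significant = p_value <= 0.05   # reject H0 -> the groups differ
```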
Use when categorical variable has 3 or more groups.
| Result | What it means | Next Step |
|---|---|---|
| p ≤ 0.05 | At least one group differs significantly | Run Post-Hoc (Tukey) |
| p > 0.05 | No significant difference | Feature may not add value |
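One-way ANOVA in SciPy, with three hypothetical departments:

```python
import numpy as np
from scipy import stats

hr = np.array([50, 52, 51, 53], dtype=float)
it = np.array([70, 72, 71, 69], dtype=float)
sales = np.array([60, 61, 59, 62], dtype=float)

# H0 = all group means are equal
f_stat, p_value = stats.f_oneway(hr, it, sales)
# p <= 0.05 -> at least one department's mean differs; follow up with Tukey's HSD
```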
Testing whether two categorical variables are associated or independent. We compare observed data to what we would expect if the variables had absolutely no relationship.
Tells you if a categorical feature is related to a categorical target. If Gender and Movie Preference are independent, Gender won't help predict preferences.
The Chi-Square test measures how much the observed counts in a cross-table differ from what we'd expect if the two variables had absolutely no relationship. The larger the χ² value, the bigger the gap between observed and expected — meaning a stronger association exists between the two categories.
When both your feature and target are categorical, you can't use correlation or t-tests. Chi-Square is the go-to test to decide whether a categorical feature is worth keeping — if Gender and Loan Default are independent, Gender adds zero predictive value to your model.
| P-value | Decision | Feature usefulness |
|---|---|---|
| p ≤ 0.05 | Reject H₀ ✓ | Variables associated → keep feature |
| p > 0.05 | Fail to reject H₀ ✗ | Variables independent → may be useless |
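A sketch with SciPy; the observed cross-table counts are hypothetical:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical observed counts: rows = gender, columns = preference
observed = np.array([[30, 10],
                     [10, 30]])

chi2, p_value, dof, expected = chi2_contingency(observed)
associated = p_value <= 0.05   # reject H0 (independence) -> keep the feature
```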
Cramér's V is a normalized version of the Chi-Square statistic that measures the strength of association between two categorical variables. It always ranges from 0 to 1 — where 0 means no association at all and 1 means a perfect relationship — regardless of table size or sample size.
Chi-Square only tells you whether an association exists, not how strong it is. A huge dataset can produce a significant χ² even for a trivially weak relationship. Cramér's V fixes this — it gives you a comparable, scaled strength score so you can rank which categorical features are most useful for your model.
| Cramér's V | Strength | What it means |
|---|---|---|
| 0.00–0.10 | Negligible | Barely any association → ignore |
| 0.10–0.30 | Weak–Moderate | Some association → use with caution |
| 0.30–0.60 | Moderate–Strong | Meaningful association → likely useful |
| 0.60–1.00 | Very Strong | Strong relationship → important feature |
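Cramér's V is computed from the χ² statistic as V = √(χ² / (n·(k−1))), where n is the sample size and k the smaller table dimension. A sketch on the same kind of hypothetical cross-table:

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[30, 10],
                     [10, 30]])

# correction=False so the raw chi-square statistic is used
chi2, _, _, _ = chi2_contingency(observed, correction=False)
n = observed.sum()
k = min(observed.shape) - 1
cramers_v = np.sqrt(chi2 / (n * k))   # always between 0 and 1
```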
Examining 3 or more variables simultaneously. The key tool is the correlation heatmap — a matrix where each cell shows Pearson's r between two features, color-coded by strength and direction.
Reveals multicollinearity (features too similar to each other), which bloats variance and hurts models. Also shows which features correlate with the target.
VIF (Variance Inflation Factor) quantifies how much the variance of a regression coefficient is inflated by multicollinearity — that is, how much one feature can be linearly predicted from the other features.
When features are highly correlated with each other, the model gets confused about which feature deserves credit. Coefficients become unstable, unreliable, and have exploding standard errors, making the model meaningless.
How VIF is Calculated โ Step by Step
For each feature X in your dataset, VIF asks: "How well can I predict X using all the OTHER features?"
Interpreting VIF Values
| VIF Value | Meaning | Action |
|---|---|---|
| VIF = 1 | No multicollinearity at all — feature is completely independent | Keep ✓ |
| 1 < VIF ≤ 5 | Low to moderate — some correlation but acceptable | Keep ✓ |
| 5 < VIF ≤ 10 | Moderate to high — investigate carefully | Consider dropping |
| VIF > 10 | Severe multicollinearity — this feature is nearly redundant | Drop or combine ✗ |
Intuitive Example
Suppose your dataset has both Height_cm and Height_inches:
VIF vs Correlation Matrix โ Key Difference
| Property | Correlation Matrix | VIF |
|---|---|---|
| What it checks | Pairwise (X₁ vs X₂) relationship only | X vs ALL other features combined |
| Detects 3-way collinearity? | ✗ No — misses it | ✓ Yes — catches it |
| Output | −1 to +1 per pair | 1 to ∞ per feature |
| Example blind spot | X₃ = X₁ + X₂ might show low pairwise r | VIF for X₃ will be very high |
How to Fix High VIF
If two features are highly correlated, drop the one with less predictive value for the target.
Create a single feature from the correlated ones (e.g. ratio, sum, average).
Transform correlated features into uncorrelated principal components — VIF = 1 for all components.
VIF Calculation Code
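A sketch computing VIF directly from its definition, VIF = 1 / (1 − R²), using scikit-learn (statsmodels' `variance_inflation_factor` is the usual shortcut; the Height example data here is synthetic):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = pd.DataFrame({"height_cm": rng.normal(170, 10, 200)})
X["height_in"] = X["height_cm"] / 2.54 + rng.normal(0, 0.1, 200)  # near-duplicate
X["weight"] = rng.normal(70, 8, 200)                              # independent

def vif(df: pd.DataFrame) -> pd.Series:
    """For each feature: regress it on all the others, then VIF = 1/(1 - R^2)."""
    out = {}
    for col in df.columns:
        others, target = df.drop(columns=col), df[col]
        r2 = LinearRegression().fit(others, target).score(others, target)
        out[col] = 1.0 / (1.0 - r2)
    return pd.Series(out)

print(vif(X))  # the two height columns get huge VIFs; weight stays near 1
```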
Mathematical transformations applied to a column's values to reduce skewness. They compress large values, making the distribution closer to normal so models work better.
Many ML algorithms assume normally distributed features. Highly skewed data gives disproportionate weight to extreme values. Log and Box-Cox transformations fix this without removing any rows.
Log transformation applies the logarithm function to every value in a column — compressing large values and spreading small ones. It turns a right-skewed distribution into something closer to a normal bell curve.
Many ML models assume features are roughly normally distributed. Income, house prices, and population counts are heavily right-skewed — a few huge values dominate. Log squashes those extremes so the model treats all values fairly.
| Situation | Use? |
|---|---|
| Right-skewed (skewness > +1) | ✓ Yes |
| Data has zeros | ⚠️ log(x+1) |
| Negative values | ✗ No |
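A minimal sketch on synthetic right-skewed data (lognormal income is an assumption for illustration):

```python
import numpy as np
from scipy.stats import skew

income = np.random.default_rng(42).lognormal(mean=10, sigma=1.0, size=1000)

# log1p = log(x + 1): safe even when zeros are present
logged = np.log1p(income)

print(skew(income), skew(logged))  # strong right skew -> near zero
```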
Box-Cox is a family of power transformations that automatically finds the best exponent λ (lambda) to make your data as normal as possible. Instead of you guessing whether to use log, square root, or reciprocal — Box-Cox tests them all and picks the optimal one.
Log only fixes right skew. Box-Cox handles both left and right skew by tuning λ. When skewness is above +2 or the distribution is complex, Box-Cox finds a better transformation than log alone — but requires all values to be strictly positive.
Automatically finds the best λ (lambda) to normalize data.
| λ | Effect | Skew Type |
|---|---|---|
| 2 | y² (Square) | Left-skewed |
| 0.5 | √y | Moderate right |
| 0 | ln(y) | Strong right |
| −1 | 1/y | Severe right |
| Skewness | Action |
|---|---|
| −0.5 to +0.5 | No transform needed |
| +0.5 to +1.0 | Consider Log or √ |
| > +1.0 | Apply Log |
| > +2.0 | Apply Box-Cox |
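SciPy's `boxcox` fits λ by maximum likelihood; a sketch on synthetic strictly positive, right-skewed data:

```python
import numpy as np
from scipy.stats import boxcox, skew

# Exponential data: all positive, right-skewed (theoretical skewness = 2)
data = np.random.default_rng(0).exponential(scale=2.0, size=1000)

transformed, best_lambda = boxcox(data)   # lambda chosen automatically

print(best_lambda, skew(data), skew(transformed))
```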
The go-to replacement for Box-Cox when negatives are present. Works on any value — positive, zero, or negative. Automatically finds the best λ just like Box-Cox.
λ ≠ 2, x < 0: −((−x + 1)^(2−λ) − 1) / (2−λ) | λ = 2, x < 0: −ln(−x + 1)
| Property | Detail |
|---|---|
| Handles negatives? | ✓ Yes |
| Handles zeros? | ✓ Yes |
| Auto-finds λ? | ✓ Yes |
| Best for | Any skewed data, mixed signs |
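SciPy's `yeojohnson` mirrors `boxcox` but accepts any sign; a sketch on synthetic skewed data shifted to include negatives:

```python
import numpy as np
from scipy.stats import skew, yeojohnson

# Skewed data containing negatives -- Box-Cox would raise an error here
data = np.random.default_rng(1).exponential(2.0, 1000) - 1.0

transformed, best_lambda = yeojohnson(data)   # lambda fitted automatically

print(best_lambda, skew(data), skew(transformed))
```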
Simple manual method. The cube root of a negative number is still negative — so it naturally handles all signs. Good for moderate skew with negatives.
| Property | Detail |
|---|---|
| Handles negatives? | ✓ Yes |
| Handles zeros? | ✓ Yes |
| Auto-finds λ? | ✗ Fixed power (1/3) |
| Best for | Moderate skew, simple fix |
| Your Data | Recommended Transform | Why |
|---|---|---|
| All positive, right-skewed | Log / Box-Cox | Classic choice, well-understood |
| Has zeros, right-skewed | log(x+1) / Yeo-Johnson | Avoids log(0) error |
| Has negatives, any skew | Yeo-Johnson | Works on all signs, auto-finds λ |
| Has negatives, moderate skew | Cube Root | Simple, no fitting needed |
Principal Component Analysis transforms many correlated features into fewer uncorrelated "principal components" that capture the maximum variance in the data.
Too many correlated features (multicollinearity) cause unstable model weights and slow training. PCA compresses information into independent components — directly eliminating multicollinearity.
| Variance Retained | Quality |
|---|---|
| <60% | Too low ✗ |
| 75–90% | Good ✓ |
| 90–95% | Excellent ✓ |
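A sketch with scikit-learn on synthetic correlated features; passing a float to `n_components` keeps just enough components to reach that variance target:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
# Three highly correlated features plus one independent feature
X = np.hstack([base,
               base * 2 + rng.normal(0, 0.1, (200, 1)),
               base * -1 + rng.normal(0, 0.1, (200, 1)),
               rng.normal(size=(200, 1))])

pca = PCA(n_components=0.95)   # keep enough components for 95% variance
X_reduced = pca.fit_transform(X)

print(X_reduced.shape, pca.explained_variance_ratio_.sum())
```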
Synthetic Minority Over-sampling Technique creates new artificial minority samples by interpolating between existing real minority samples — making class sizes more equal.
When 99% of samples are class 0 and 1% are class 1, a model can achieve 99% accuracy by always predicting 0 — useless! SMOTE forces the model to actually learn the minority class.
| Variant | When to Use |
|---|---|
| SMOTE (default) | Standard imbalance, continuous features |
| Borderline-SMOTE | Focus on hard boundary cases |
| ADASYN | Adaptive — more samples where density is low |
| SMOTE-Tomek | SMOTE + clean overlapping points |
| SMOTENC | Mixed data (numerical + categorical) |
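The core interpolation idea can be sketched in plain NumPy (in practice you would use `SMOTE` from the imbalanced-learn package, if available; `smote_like` here is a simplified toy, not the library algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)
minority = rng.normal(loc=5.0, scale=1.0, size=(10, 2))  # 10 real minority samples

def smote_like(X, n_new, k=3, rng=rng):
    """Toy SMOTE: interpolate between a sample and one of its k nearest neighbours."""
    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(X))
        d = np.linalg.norm(X - X[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]   # skip the point itself
        j = rng.choice(neighbours)
        gap = rng.random()                    # random position on the segment
        new_points.append(X[i] + gap * (X[j] - X[i]))
    return np.array(new_points)

synthetic = smote_like(minority, n_new=40)    # 10 real -> 50 total minority samples
```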
A structured overview of the major ML model families — what problem each solves, what data it expects, and which real-world use cases it's best suited for.
Choosing the wrong model type is the single most common ML mistake. The model is a tool — match the tool to the job.
| Model | Output | Best Use Case | Notes |
|---|---|---|---|
| Linear Regression | Continuous number | House price, salary prediction | Assumes linear relationship; sensitive to outliers |
| Ridge Regression | Continuous number | Same as linear but with correlated features | L2 regularization built-in |
| Lasso Regression | Continuous number | High-dimensional data, feature selection needed | L1 regularization — zeros out features |
| Logistic Regression | Probability (0–1) | Spam detection, default prediction, disease yes/no | Despite the name, it's a classifier |
| Model | Output | Best Use Case | Notes |
|---|---|---|---|
| Decision Tree | Class or number | Simple classification, interpretable rules | Easy to explain; prone to overfitting |
| Random Forest | Class or number | General-purpose: fraud, churn, credit scoring | Ensemble of trees; robust, handles missing values |
| XGBoost / LightGBM | Class or number | Tabular data competitions, ranking, clickthrough rate | Gradient boosting; state-of-the-art on structured data |
| Model | Output | Best Use Case | Notes |
|---|---|---|---|
| SVM | Class | Text classification, small datasets | Maximum-margin hyperplane; effective in high dimensions |
| K-Nearest Neighbors | Class or number | Recommendation systems, anomaly detection | No training phase; slow on large data |
| Naive Bayes | Class probability | Email spam, sentiment analysis | Fast, simple, works well with text/NLP |
| Model | Output | Best Use Case | Notes |
|---|---|---|---|
| ANN (MLP) | Any | Complex tabular patterns | Fully connected layers; general purpose |
| CNN | Class/label | Image classification, object detection | Convolutional layers capture spatial patterns |
| RNN / LSTM | Sequence | Time series, language modeling | Remembers long-term dependencies |
| Transformer | Any | NLP (ChatGPT, BERT), translation | Attention mechanism; state-of-the-art in NLP |
Use when your target variable has a time dimension: tomorrow's sales, next month's temperature, stock prices. Regular ML ignores temporal order — forecasting preserves it.
Time Series Components
| Component | What It Is | Example |
|---|---|---|
| Trend | Long-term upward/downward direction | Rising sales year over year |
| Seasonality | Repeating pattern at fixed intervals | High sales every December |
| Cyclic | Longer irregular waves | Economic recession cycles |
| Noise | Random unexplainable variation | One-off spike due to event |
Classical Forecasting Models
| Model | Formula / Key Idea | Best Use Case |
|---|---|---|
| Moving Average | avg of last N observations | Stable, no trend/seasonality — quick baseline |
| Exponential Smoothing (ETS) | F(t+1) = α·Y(t) + (1−α)·F(t) | Short-term forecast, sales, inventory |
| ARIMA(p,d,q) | AR + Differencing + MA | Stationary or easily differenced series |
| SARIMA | ARIMA + seasonal(P,D,Q,m) | Clear seasonality (retail, weather, energy) |
Modern / ML-Based Forecasting Models
| Model | Key Idea | Best Use Case |
|---|---|---|
| Prophet (Meta) | Trend + seasonality + holidays decomposition | Business forecasting, holiday effects |
| XGBoost + Lag Features | ML with t-1, t-7, t-30 as input features | Complex relationships, external regressors |
| LSTM | Recurrent NN with long-term memory | Long sequences, complex temporal patterns |
| Temporal Fusion Transformer | Attention + multi-horizon forecasting | Large-scale production forecasting |
Which Forecasting Model?
| Situation | Recommended Model |
|---|---|
| Quick baseline, stable series | Moving Average / ETS |
| No seasonality, stationary | ARIMA |
| Clear seasonality (monthly/weekly) | SARIMA / Holt-Winters |
| Business data with holidays | Prophet |
| Complex, many features | XGBoost + lag features |
| Long sequence, big data | LSTM / Transformer |
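The two simplest entries above, a moving-average baseline and lag features for an ML model, can be sketched in pandas (the sales series is hypothetical):

```python
import pandas as pd

# Hypothetical monthly sales series
sales = pd.Series([100, 120, 130, 125, 140, 150, 160, 155, 170, 180],
                  index=pd.date_range("2024-01-01", periods=10, freq="MS"))

# Moving-average baseline: next forecast = mean of the last 3 observations
forecast = sales.rolling(window=3).mean().iloc[-1]

# Lag features for an ML model (XGBoost-style): t-1 and t-3 as inputs
df = pd.DataFrame({"y": sales,
                   "lag_1": sales.shift(1),
                   "lag_3": sales.shift(3)}).dropna()
```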
| Your Problem | Recommended Model | Why |
|---|---|---|
| Predict a number (house price, salary) | Linear / Ridge Regression | Interpretable, fast, good baseline |
| Binary yes/no (spam, fraud, churn) | Logistic Regression / XGBoost | Logistic for interpretability, XGBoost for performance |
| Multi-class (A/B/C/D categories) | Random Forest / XGBoost | Handles multi-class natively; robust |
| Customer churn prediction | XGBoost + SMOTE | Handles imbalance; captures complex patterns |
| Text classification / sentiment | Naive Bayes / Transformer | Naive Bayes for speed; Transformer for accuracy |
| Image recognition | CNN | Designed for spatial/visual patterns |
| Time series / forecasting | ARIMA / Prophet / LSTM | Depends on seasonality and data volume |
| Anomaly / fraud detection | Isolation Forest / XGBoost | Handles extreme class imbalance |
| Small dataset (<1000 rows) | SVM / Logistic Regression | Work well in low-data regimes |
| Very large structured data | XGBoost / LightGBM | Scales well; battle-tested on tabular data |
Dividing your dataset into separate subsets so the model is trained on one portion and evaluated on data it has never seen before. This simulates real-world performance.
If you evaluate on training data, your model looks perfect — it just memorized the answers. A separate test set reveals whether it actually learned the pattern or just memorized noise.
The model learns from this — adjusts weights, fits patterns. Never used for final evaluation.
Used to tune hyperparameters (learning rate, depth, regularization) during development.
Final unseen evaluation — touched only ONCE at the very end. Reports real-world performance.
When data is limited, K-Fold CV rotates which portion is used as validation — every sample is tested exactly once.
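A sketch with scikit-learn on a synthetic imbalanced target, combining a stratified split with K-Fold CV:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

X = np.arange(100).reshape(-1, 1).astype(float)
y = np.array([0] * 80 + [1] * 20)   # imbalanced binary target

# Stratified split preserves the 80/20 class ratio in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# 5-fold stratified CV: every sample is used for validation exactly once
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
folds = list(skf.split(X, y))
```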
The two fundamental problems in model training. Underfitting = model is too simple to learn the pattern. Overfitting = model memorizes training noise instead of learning the real pattern. Regularization is the fix for overfitting.
A model that performs perfectly on training data but fails on new data is useless. The goal is generalization — learning the real underlying pattern. This is the core challenge of ML.
| Set | Score |
|---|---|
| Train | Low ✗ |
| Test | Low ✗ |
| Set | Score |
|---|---|
| Train | High ✓ |
| Test | High ✓ |
| Set | Score |
|---|---|
| Train | Very High ✓ |
| Test | Low ✗ |
Regularization adds a penalty term to the loss function that punishes large model weights. This forces the model to stay simple — it can't just assign huge weights to every feature to perfectly fit training noise.
λ (lambda) is the regularization strength. Higher λ = more penalty = simpler model. Too high → underfitting. Tune with cross-validation.
Penalizes the absolute value of weights. Pushes small/irrelevant feature weights all the way to exactly zero — automatically removing them.
| Property | Detail |
|---|---|
| Effect on weights | Shrinks weak ones to exactly 0 |
| Feature selection | ✓ Automatic |
| Correlated features | Randomly drops one of them |
| Best for | High-dim data, sparse features |
Penalizes the squared value of weights. Shrinks all weights toward zero but never reaches exactly zero — keeps all features, just with smaller influence.
| Property | Detail |
|---|---|
| Effect on weights | Shrinks all, never to exactly 0 |
| Feature selection | ✗ Keeps all features |
| Correlated features | Distributes weight across them |
| Best for | Correlated features, regression |
| Method | Penalty | Zeros out features? | Best Use Case |
|---|---|---|---|
| L1 Lasso | λΣ\|w\| | ✓ Yes | Feature selection, sparse data |
| L2 Ridge | λΣw² | ✗ No | Correlated features, regression |
| ElasticNet | L1 + L2 | ✓ Sometimes | Both benefits, general purpose |
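The contrasting behaviour is easy to observe with scikit-learn on synthetic data where only two of five features carry signal (`alpha` plays the role of λ):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 0.5, 200)  # features 2-4 are pure noise

lasso = Lasso(alpha=0.5).fit(X, y)   # L1: drives weak weights to exactly 0
ridge = Ridge(alpha=0.5).fit(X, y)   # L2: shrinks all weights, zeroes none

print(lasso.coef_)  # noise features zeroed out
print(ridge.coef_)  # all features kept, just shrunk
```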
Quantitative measures that tell you how well your model performs. Different metrics are appropriate for regression vs classification vs forecasting models.
Accuracy alone is misleading (class imbalance). RMSE penalizes large errors more than MAE. R² tells you the proportion of variance explained. Choosing the right metric defines what "good" means for your problem.
| R² Value | Interpretation |
|---|---|
| 0.90–1.00 | Excellent |
| 0.70–0.89 | Good |
| 0.50–0.69 | Moderate |
| < 0.50 | Weak |
| F1 Value | Quality |
|---|---|
| 0.85+ | Excellent |
| 0.70–0.84 | Good |
| 0.50–0.69 | Fair |
| <0.50 | Poor |
| Problem Type | Primary Metric | Secondary Metric |
|---|---|---|
| Regression (general) | RMSE | R² |
| Regression (outliers present) | MAE | RMSE |
| Regression (business %) | MAPE | MAE |
| Binary Classification (balanced) | Accuracy | AUC-ROC |
| Binary Classification (imbalanced) | F1 Score | Precision/Recall |
| Fraud Detection (miss = bad) | Recall | F1 |
| Spam Filter (FP = bad) | Precision | F1 |
| Multi-class Classification | Macro F1 | Accuracy |
| Forecasting (time series) | MAPE | MAE |
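A sketch with scikit-learn metrics on hypothetical predictions, including the imbalanced-accuracy trap described above:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score,
                             mean_absolute_error, mean_squared_error, r2_score)

# Regression example
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 145.0, 190.0, 260.0])
mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
r2 = r2_score(y_true, y_pred)

# Classification example (imbalanced): accuracy looks fine, F1 tells the truth
yc_true = np.array([0] * 95 + [1] * 5)
yc_pred = np.zeros(100, dtype=int)       # always predict the majority class
acc = accuracy_score(yc_true, yc_pred)   # 0.95: misleading
f1 = f1_score(yc_true, yc_pred, zero_division=0)  # 0.0: never finds class 1
```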
A consolidated checklist of the most important habits, rules, and decisions that separate a model that actually works in production from one that only looks good on paper.
Knowing individual techniques is not enough. How you combine them, in what order, and what mistakes to avoid determines whether your model generalizes or fails silently on real data.
- Always clean before you model — handle nulls, duplicates, and wrong data types before any EDA or training.
- Check for data leakage — never let future information or target-derived features sneak into your training set.
- Understand your data source — know how it was collected, what each column means, and what could go wrong.
- Validate data types — a column stored as string that should be numeric will silently break everything.
- Split train/test first — before scaling, encoding, imputing, or any transformation.
- Fit transformers on train only — then apply (transform) to both train and test. Never fit on the full dataset.
- Never peek at the test set — it must stay completely unseen until final evaluation. Touching it earlier inflates your metrics.
- Use stratified split for classification — ensures class proportions are preserved in both sets.
- Build the simplest model first — Logistic Regression for classification, Linear Regression for regression.
- A baseline tells you the minimum bar — if your complex model barely beats it, the complexity isn't worth it.
- Compare all models against the baseline — not against each other in isolation.
- A dummy classifier (always predict majority class) is your floor — your model must beat it or it's useless.
- Better features beat better models — a simple model with great features outperforms a complex model with raw features.
- Encode categoricals correctly — ordinal data needs label encoding, nominal needs one-hot or target encoding.
- Scale features for distance-based models — KNN, SVM, and neural networks are sensitive to feature scale. Tree models are not.
- Remove highly correlated features — check VIF and correlation heatmap before training linear models.
- Create domain-driven features — house age, price per sqft, interaction terms often matter more than raw columns.
| Mistake | Why It's Dangerous |
|---|---|
| Fitting scaler on full data | Test data leaks into preprocessing → inflated metrics |
| Using accuracy on imbalanced data | 99% accuracy by predicting majority class → useless model |
| Applying SMOTE before splitting | Synthetic samples in test set → fake performance |
| Dropping nulls without checking % | Losing 30% of data silently biases the model |
| Label encoding nominal features | Implies false order → model learns wrong relationships |
| Mistake | Why It's Dangerous |
|---|---|
| Shuffling time series data | Future data trains the model → data leakage |
| Tuning on test set | Test set becomes part of training → overly optimistic results |
| Ignoring class imbalance | Model never learns minority class → fails in production |
| Skipping EDA | Outliers, skew, and wrong types break models silently |
| Using R² alone for regression | High R² can still mean terrible predictions on new data |
- Use K-Fold cross-validation when data is limited — gives a more reliable estimate than a single train/test split.
- Use stratified K-Fold for classification — preserves class balance in every fold.
- Report mean ± std of CV scores — a model with high variance across folds is unstable.
- Pick the right metric for your problem — F1 for imbalanced, RMSE for regression, AUC for ranking.
- Never report only training accuracy — always report validation or test performance.