
Feature Engineering

A complete hands-on guide to encoding, selecting, transforming, and creating features — taught through a consistent real-world dataset.

🏠

Our Working Dataset: House Price Prediction

We'll use the same dataset throughout every topic so you can see how each technique transforms the same raw data.

Raw Dataset (8 rows)

| # | City | Neighborhood Quality | House Style | Year Built | Area (sqft) | Bedrooms | Has Pool | Price ($) |
|---|-----------|----------|-----------|------|------|---|-----|---------|
| 1 | Mumbai    | Excellent | Bungalow  | 2010 | 1800 | 3 | Yes | 82,000  |
| 2 | Delhi     | Good      | Apartment | 2015 | 950  | 2 | No  | 45,000  |
| 3 | Bangalore | Average   | Villa     | 2005 | 2500 | 4 | Yes | 110,000 |
| 4 | Mumbai    | Poor      | Apartment | 2000 | 700  | 1 | No  | 28,000  |
| 5 | Chennai   | Good      | Bungalow  | 2018 | 1600 | 3 | No  | 67,000  |
| 6 | Delhi     | Excellent | Villa     | 2020 | 3000 | 5 | Yes | 145,000 |
| 7 | Bangalore | Average   | Apartment | 2012 | 1100 | 2 | No  | 52,000  |
| 8 | Chennai   | Poor      | Bungalow  | 1998 | 1200 | 3 | No  | 31,000  |

01
Label / Ordinal Encoding
For Ordinal data — where order matters
✓ Use for: Ordinal (ordered) categories · Column: Neighborhood Quality

We assign a whole number to each category, preserving their natural order. Poor < Average < Good < Excellent becomes 0 < 1 < 2 < 3. The model can now understand that Excellent is "greater than" Good — which is exactly true here.

Before

| # | Neighborhood Quality |
|---|----------------------|
| 1 | Excellent |
| 2 | Good |
| 3 | Average |
| 4 | Poor |

After Label Encoding

| # | Neighborhood Quality (encoded) |
|---|--------------------------------|
| 1 | 3 |
| 2 | 2 |
| 3 | 1 |
| 4 | 0 |
📋 Mapping: Poor=0 · Average=1 · Good=2 · Excellent=3 — Notice the numbers respect the real-world rank. This is what makes it "ordinal."
# Python — Label Encoding for Ordinal Data
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd

order = [['Poor', 'Average', 'Good', 'Excellent']]
enc = OrdinalEncoder(categories=order)
df['Neighborhood_Encoded'] = enc.fit_transform(df[['Neighborhood Quality']])
# Poor→0, Average→1, Good→2, Excellent→3
⚠️ Never use Label Encoding on nominal data (like City). If you encode Mumbai=0, Delhi=1, Bangalore=2 — the model wrongly assumes Bangalore > Delhi > Mumbai. Use One-Hot instead.
02
One-Hot Encoding
For Nominal data — where order doesn't matter
✓ Use for: Nominal (unordered) categories · Column: City (4 unique values)

Creates one binary column per category. A row gets 1 in its city's column and 0 everywhere else. No false ordering is implied — Mumbai is not "greater than" Delhi.

| # | City (original) | City_Mumbai | City_Delhi | City_Bangalore | City_Chennai |
|---|-----------|---|---|---|---|
| 1 | Mumbai    | 1 | 0 | 0 | 0 |
| 2 | Delhi     | 0 | 1 | 0 | 0 |
| 3 | Bangalore | 0 | 0 | 1 | 0 |
| 4 | Mumbai    | 1 | 0 | 0 | 0 |
| 5 | Chennai   | 0 | 0 | 0 | 1 |
| 6 | Delhi     | 0 | 1 | 0 | 0 |
| 7 | Bangalore | 0 | 0 | 1 | 0 |
| 8 | Chennai   | 0 | 0 | 0 | 1 |
💡 Dummy Variable Trap: Drop one column (e.g., City_Chennai) when using linear models, since if all other city columns are 0, we already know it's Chennai. Use drop_first=True in pandas.
# One-Hot Encoding — 2 ways
pd.get_dummies(df, columns=['City'], drop_first=True)

# OR using sklearn
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(sparse_output=False, drop='first')
enc.fit_transform(df[['City']])
⚠️ Problem with high cardinality: If City had 500 unique values, you'd get 500 new columns. That's the curse of dimensionality. Use Binary or Target Encoding instead for high-cardinality columns.
03
Binary Encoding
For High-Cardinality nominal data — saves columns
✓ Use for: Nominal with many categories (50+) · Example: City (imagine 500 cities)

Binary Encoding first converts each category to a number (like Label Encoding), then writes that number in its binary representation, and splits each binary digit into a separate column. With 0-based numbering, 4 categories need 2 bits (columns) and 8 categories need 3; the example below numbers from 1, so 4 categories take 3 bits. Either way, 1000 categories need just 10 columns!

Step 1: Assign integer → Mumbai=1, Delhi=2, Bangalore=3, Chennai=4
Step 2: Convert to binary → 1=001, 2=010, 3=011, 4=100
Step 3: Each bit becomes a column → col_1, col_2, col_3
| City | Integer | Binary | City_b1 | City_b2 | City_b3 |
|-----------|---|-----|---|---|---|
| Mumbai    | 1 | 001 | 0 | 0 | 1 |
| Delhi     | 2 | 010 | 0 | 1 | 0 |
| Bangalore | 3 | 011 | 0 | 1 | 1 |
| Chennai   | 4 | 100 | 1 | 0 | 0 |
🎯 Key advantage: One-Hot for 4 cities = 4 columns. Binary Encoding = only 3 columns. For 1000 cities: One-Hot needs 1000 columns, Binary needs only 10. Massive saving!
# pip install category_encoders
import category_encoders as ce

encoder = ce.BinaryEncoder(cols=['City'])
df_encoded = encoder.fit_transform(df)
04
Target Encoding (Mean Encoding)
Replace category with the mean of the target variable
✓ Use for: High-cardinality nominal, regression tasks · Column: City → Target: Price

Replace each category with the mean of the target variable for rows in that category. If Mumbai houses average $55,000, every row with City=Mumbai gets replaced with 55,000. This directly encodes the relationship between category and target.

Step 1: Calculate Mean Price per City

| City | Rows | Avg Price |
|-----------|------|---------|
| Mumbai    | 1, 4 | $55,000 |
| Delhi     | 2, 6 | $95,000 |
| Bangalore | 3, 7 | $81,000 |
| Chennai   | 5, 8 | $49,000 |

Step 2: Replace City with Mean

| # | City (original) | City_encoded |
|---|-----------|--------|
| 1 | Mumbai    | 55,000 |
| 2 | Delhi     | 95,000 |
| 3 | Bangalore | 81,000 |
| 4 | Mumbai    | 55,000 |
| 5 | Chennai   | 49,000 |
| 6 | Delhi     | 95,000 |
⚠️ Target Leakage! If you compute the mean on the same data you train on, the model learns the target directly. Always use cross-fold target encoding (compute mean from other folds) or add smoothing. Use category_encoders.TargetEncoder which handles this.
import category_encoders as ce

enc = ce.TargetEncoder(cols=['City'], smoothing=10)
# smoothing blends category mean with global mean for rare categories
df['City_encoded'] = enc.fit_transform(df['City'], df['Price'])
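The cross-fold idea from the warning above can also be sketched by hand with sklearn's KFold: each row is encoded using a mean computed only from the *other* fold, so a row never sees its own target. The 4-row frame and 2-fold split here are purely illustrative:

```python
import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({
    'City':  ['Mumbai', 'Delhi', 'Mumbai', 'Delhi'],
    'Price': [82000, 45000, 28000, 145000],
})

global_mean = df['Price'].mean()
df['City_enc'] = 0.0

# For each fold, encode the held-out rows with means from the OTHER rows only
for train_idx, val_idx in KFold(n_splits=2, shuffle=False).split(df):
    fold_means = df.iloc[train_idx].groupby('City')['Price'].mean()
    df.loc[df.index[val_idx], 'City_enc'] = (
        df.iloc[val_idx]['City'].map(fold_means).fillna(global_mean).values
    )
```

Row 1 (Mumbai, 82,000) is encoded as 28,000 — the Mumbai mean from the other fold — rather than a mean that includes its own price.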
05
Frequency / Count Encoding
Replace category with how often it appears
✓ Use for: Nominal where frequency carries signal · Columns: City, House Style

Replace each category with either its count (how many rows have that value) or its frequency (proportion of total). No target variable needed — purely unsupervised. Mumbai appears 2 times → 2 (count) or 0.25 (frequency).

| # | City | House Style | City Count | City Freq | Style Count |
|---|-----------|-----------|---|------|---|
| 1 | Mumbai    | Bungalow  | 2 | 0.25 | 3 |
| 2 | Delhi     | Apartment | 2 | 0.25 | 3 |
| 3 | Bangalore | Villa     | 2 | 0.25 | 2 |
| 4 | Mumbai    | Apartment | 2 | 0.25 | 3 |
| 5 | Chennai   | Bungalow  | 2 | 0.25 | 3 |
| 6 | Delhi     | Villa     | 2 | 0.25 | 2 |
| 7 | Bangalore | Apartment | 2 | 0.25 | 3 |
| 8 | Chennai   | Bungalow  | 2 | 0.25 | 3 |
💡 When is this useful? When rare categories might indicate anomalies or special cases (e.g., a city that appears only once in loan data might indicate a geographic outlier). Also useful when no target exists (unsupervised preprocessing).
# Frequency Encoding — simple manual approach
freq_map = df['City'].value_counts() / len(df)
df['City_Freq'] = df['City'].map(freq_map)

count_map = df['City'].value_counts()
df['City_Count'] = df['City'].map(count_map)
06
Other Encodings
Hash, Helmert, Leave-One-Out & more
Hashing Encoding

Applies a hash function to categories, maps them to a fixed number of columns. Handles unseen categories. Memory efficient. Used in NLP feature hashing.

n_categories → fixed n_components (e.g., 8)
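The hashing trick described above is available in sklearn as FeatureHasher; a minimal sketch (n_features=8 and the city names are illustrative — collisions between categories are possible by design):

```python
from sklearn.feature_extraction import FeatureHasher

# Each sample is a list of string tokens; every token hashes into one
# of 8 fixed columns, so unseen categories need no refitting
hasher = FeatureHasher(n_features=8, input_type='string')
X = hasher.transform([['Mumbai'], ['Delhi'], ['SomeUnseenCity']]).toarray()
```

The output always has 8 columns no matter how many distinct cities appear.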
Leave-One-Out

Like Target Encoding but computes the mean excluding the current row. Significantly reduces target leakage. Best for small datasets.

val = mean(target) where City=X, excluding current row
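The formula above can be written directly with a pandas groupby trick: subtract the row's own target from its group sum before averaging. A small sketch on illustrative data:

```python
import pandas as pd

df = pd.DataFrame({
    'City':  ['Mumbai', 'Mumbai', 'Delhi', 'Delhi'],
    'Price': [82000, 28000, 45000, 145000],
})

g = df.groupby('City')['Price']
# (group sum - this row's own price) / (group count - 1)
df['City_loo'] = (g.transform('sum') - df['Price']) / (g.transform('count') - 1)
```

The first Mumbai row (82,000) is encoded as 28,000 — the mean of the *other* Mumbai rows — so its own target never leaks into its encoding.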
Helmert Encoding

Compares each level of a category to the mean of all subsequent levels. Useful in ANOVA-style statistical analysis. Rarely used in ML.

Statistical contrasts between ordered groups
Boolean / Binary Flag

For columns like Has Pool (Yes/No), simply map to 0 and 1. This is actually a special case of label encoding for binary nominal data.

Yes → 1, No → 0 (or True/False)
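In pandas this is a one-line map; the toy frame here mirrors the Has Pool column of the working dataset:

```python
import pandas as pd

df = pd.DataFrame({'Has Pool': ['Yes', 'No', 'Yes', 'No']})
# Map the binary category straight to 0/1
df['Has_Pool'] = df['Has Pool'].map({'Yes': 1, 'No': 0})
```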

07
Feature Selection
Choose only the columns that actually help predict the target

Feature Selection = picking the RIGHT features. Too many features = overfitting, slow training, noise. Three main approaches:

① Filter Methods — Fastest, No Model Needed

Score each feature independently using statistical tests. Select top-k features based on score.

Correlation

For numeric features. Drop if |corr| with target < 0.1, or if two features have |corr| > 0.9 with each other.

Chi-Square (χ²)

For categorical features + categorical target. Tests statistical independence between feature and target.

ANOVA F-test

For numeric features + categorical target. Variance between groups vs within groups.

| Feature | Correlation with Price | Decision |
|---------|------|----------|
| Area (sqft) | +0.94 | ✓ Keep — strong positive correlation |
| Bedrooms | +0.87 | ✓ Keep |
| Year Built | +0.71 | ✓ Keep |
| Neighborhood Quality | +0.82 | ✓ Keep (ordinal encoded) |
| Has Pool | +0.12 | ✗ Maybe drop — weak signal |
# Correlation filter
corr = df.corr()['Price'].abs().sort_values(ascending=False)
keep = corr[corr > 0.2].index.tolist()

# SelectKBest with ANOVA
from sklearn.feature_selection import SelectKBest, f_regression
selector = SelectKBest(f_regression, k=4)
X_new = selector.fit_transform(X, y)
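The chi-square test from the filter list has its own sklearn scorer, chi2, which requires non-negative features and a categorical target. A minimal sketch on made-up one-hot style data (the arrays are purely illustrative):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Toy data: 2 binary features, binary target; feature 0 tracks the target
X = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [1, 0], [0, 1]])
y = np.array([1, 1, 0, 0, 1, 0])

# chi2 scores how statistically dependent each feature is on the target
selector = SelectKBest(chi2, k=1)
X_new = selector.fit_transform(X, y)
```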
② Wrapper Methods — Model-based, More Accurate

Train a model with different subsets of features, pick the subset that gives best performance. Expensive but powerful.

RFE (Recursive Feature Elimination)

Train model → remove weakest feature → retrain → repeat until k features remain. Like peeling an onion.

Forward / Backward Selection

Start empty, add best feature each step (Forward). OR start full, remove worst feature each step (Backward).

from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

model = LinearRegression()
rfe = RFE(model, n_features_to_select=3)
rfe.fit(X, y)
print(dict(zip(X.columns, rfe.ranking_)))
# ranking_ = 1 means SELECTED, >1 means eliminated
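Forward/backward selection as described above is available in scikit-learn (0.24+) as SequentialFeatureSelector. A sketch on synthetic data where only features 0 and 2 truly drive the target (the data and n_features_to_select are illustrative):

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = 3 * X[:, 0] + 2 * X[:, 2] + rng.normal(scale=0.1, size=50)

# Forward: start empty, greedily add the feature that most improves CV score
sfs = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=2, direction='forward', cv=3
)
sfs.fit(X, y)
selected = sfs.get_support()  # boolean mask over the 4 features
```

Setting direction='backward' instead starts from all features and removes the worst one at each step.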
③ Embedded Methods — Best of Both Worlds

Feature selection happens INSIDE the model training itself. Lasso (L1 regularization) drives useless feature weights to exactly zero. Tree models provide feature importances.

| Feature | Random Forest Importance | Lasso Coefficient | Decision |
|---------|------|------|----------|
| Area (sqft) | 0.42 | 18.5 | ✓ Most important |
| Neighborhood Quality | 0.28 | 12.1 | ✓ Important |
| Year Built | 0.16 | 8.3 | ✓ Moderate |
| Bedrooms | 0.11 | 3.2 | ✓ Keep |
| Has Pool | 0.03 | 0.0 | ✗ Lasso set to 0 — drop |
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100)
rf.fit(X, y)
importances = pd.Series(rf.feature_importances_, index=X.columns)
importances.sort_values().plot(kind='barh')  # Visualize

# Keep features with importance > 0.05
good_features = importances[importances > 0.05].index
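The Lasso side of the table can be sketched the same way. On synthetic data with one irrelevant feature (the data and alpha=0.1 are illustrative), the L1 penalty soft-thresholds the useless coefficient to exactly zero:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = 5 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=100)  # feature 2 is pure noise

# Standardize first so the L1 penalty treats all features fairly
X_scaled = StandardScaler().fit_transform(X)
lasso = Lasso(alpha=0.1)
lasso.fit(X_scaled, y)
coefs = lasso.coef_  # the noise feature's coefficient lands at exactly 0
```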

08
Feature Transformation
Change the scale or distribution of existing features

Many ML algorithms assume features are on similar scales or have a Gaussian distribution. Transformation fixes skewed or differently-scaled features.

① Scaling — Normalize the Range

Area ranges 700–3000, but Bedrooms ranges 1–5. Without scaling, Area dominates the model unfairly.

Min-Max Scaling (Normalization)

x_new = (x − min) / (max − min) → range [0, 1]

Use when: Neural networks, KNN, SVM. Sensitive to outliers.

Standard Scaling (Standardization)

x_new = (x − mean) / std → mean=0, std=1

Use when: Linear regression, Logistic regression, PCA. Less distorted by outliers than Min-Max, though not immune — for heavy outliers, consider RobustScaler.

| # | Area (original) | Area MinMax | Area Standard |
|---|------|------|-------|
| 1 | 1800 | 0.48 | +0.04 |
| 2 | 950  | 0.11 | −0.77 |
| 3 | 2500 | 0.78 | +0.71 |
| 4 | 700  | 0.00 | −1.01 |
| 5 | 1600 | 0.39 | −0.15 |
| 6 | 3000 | 1.00 | +1.19 |
from sklearn.preprocessing import MinMaxScaler, StandardScaler

mm = MinMaxScaler()
df['Area_minmax'] = mm.fit_transform(df[['Area']])

ss = StandardScaler()
df['Area_std'] = ss.fit_transform(df[['Area']])
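To see where the table's Min-Max numbers come from, row 1 can be checked by hand: the Area column's minimum is 700 and its maximum 3000.

```python
# Min-Max by hand for Area = 1800, with column min = 700 and max = 3000
x, x_min, x_max = 1800, 700, 3000
minmax = (x - x_min) / (x_max - x_min)  # 1100 / 2300, which rounds to 0.48
```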
② Log / Power Transform — Fix Skewed Distributions

House prices are often right-skewed (a few very expensive homes pull the distribution right). Log transform brings them closer to Gaussian, which many models prefer.

Before (Right-Skewed)

28,000 · 31,000 · 45,000
52,000 · 67,000 · 82,000
110,000 · 145,000

Skew: +1.2 (right-skewed)

After Log Transform

10.24 · 10.34 · 10.71
10.86 · 11.11 · 11.31
11.61 · 11.88

Skew: −0.1 (near-normal ✓)

import numpy as np
df['Log_Price'] = np.log1p(df['Price'])  # log(1+x) avoids log(0)

# Or Box-Cox (auto-finds best power)
from scipy.stats import boxcox
df['Price_BC'], _ = boxcox(df['Price'])

# Yeo-Johnson (works with negatives too)
from sklearn.preprocessing import PowerTransformer
pt = PowerTransformer(method='yeo-johnson')
df['Price_PT'] = pt.fit_transform(df[['Price']])
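The before/after skew can be reproduced with pandas' built-in skew estimator on the 8 prices (exact figures depend on the estimator used, but the pattern holds: clearly positive before, near zero after):

```python
import numpy as np
import pandas as pd

prices = pd.Series([28000, 31000, 45000, 52000, 67000, 82000, 110000, 145000])
raw_skew = prices.skew()            # clearly positive: right-skewed
log_skew = np.log1p(prices).skew()  # much closer to zero after the log
```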
③ Binning / Discretization — Turn Continuous into Categorical

Group continuous values into bins. Useful when the relationship isn't linear (e.g., "new" vs "old" houses matters more than exact year). Can also help with outliers.

| # | Year Built | House Age | Age Bin | Age Label |
|---|------|----|---|-------------|
| 1 | 2010 | 16 | 1 | Recent |
| 2 | 2015 | 11 | 1 | Recent |
| 3 | 2005 | 21 | 2 | Middle-Aged |
| 4 | 2000 | 26 | 2 | Middle-Aged |
| 5 | 2018 | 8  | 0 | New |
| 6 | 2020 | 6  | 0 | New |
| 7 | 2012 | 14 | 1 | Recent |
| 8 | 1998 | 28 | 2 | Middle-Aged |
# Equal-width bins
df['Age_Bin'] = pd.cut(df['House_Age'], bins=[0, 10, 20, 30, 100],
                       labels=['New', 'Recent', 'Middle-Aged', 'Old'])

# Equal-frequency bins (quantile-based)
df['Age_Qbin'] = pd.qcut(df['House_Age'], q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])

09
Feature Creation
Engineer entirely new features from existing ones

Often the best features aren't in your raw data — you have to build them. This is where domain knowledge meets data science.

Math Combinations — Multiply, Divide, Subtract

Create new numeric features by combining existing ones in meaningful ways.

| # | Area | Bedrooms | Year Built | Area/Bedroom | House Age | Price/sqft (target) |
|---|------|---|------|-----|----|------|
| 1 | 1800 | 3 | 2010 | 600 | 16 | 45.6 |
| 2 | 950  | 2 | 2015 | 475 | 11 | 47.4 |
| 3 | 2500 | 4 | 2005 | 625 | 21 | 44.0 |
| 4 | 700  | 1 | 2000 | 700 | 26 | 40.0 |
| 5 | 1600 | 3 | 2018 | 533 | 8  | 41.9 |
df['Area_Per_Bedroom'] = df['Area'] / df['Bedrooms']
df['House_Age'] = 2026 - df['Year_Built']
df['Age_x_Quality'] = df['House_Age'] * df['Quality_Encoded']
df['Log_Area'] = np.log1p(df['Area'])
df['Area_Squared'] = df['Area'] ** 2  # polynomial feature
Interaction Features — Combine Two Columns

The combined effect of two features can be stronger than either alone. A "Good neighborhood with large area" might deserve its own feature.

| # | Neighborhood Quality | Area | Luxury Score = Quality × Area | Is Luxury? |
|---|---------------|------|-------|---|
| 1 | Excellent (3) | 1800 | 5,400 | 1 |
| 2 | Good (2)      | 950  | 1,900 | 0 |
| 3 | Average (1)   | 2500 | 2,500 | 0 |
| 4 | Poor (0)      | 700  | 0     | 0 |
| 6 | Excellent (3) | 3000 | 9,000 | 1 |
df['Luxury_Score'] = df['Quality_Encoded'] * df['Area']
df['Is_Luxury'] = ((df['Quality_Encoded'] >= 2) & (df['Area'] > 1500)).astype(int)
df['New_And_Spacious'] = ((df['House_Age'] < 10) & (df['Area'] > 1500)).astype(int)

# Automated polynomial features
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(df[['Area', 'House_Age']])
Aggregation Features — Group Statistics

Compute statistics (mean, max, count, std) per group. "What's the average house size in this city?" can be a powerful feature.

# Group-level statistics as new features
city_stats = df.groupby('City')['Area'].agg(['mean', 'max', 'std'])
city_stats.columns = ['City_AvgArea', 'City_MaxArea', 'City_StdArea']
df = df.merge(city_stats, on='City', how='left')

# Deviation from group mean (is this house larger than typical in its city?)
df['Area_vs_City_Avg'] = df['Area'] - df['City_AvgArea']
Date/Time Features — Extract Hidden Signals

If you have timestamps, decompose them into rich features that models can use.

df['Year_Built'] = pd.to_datetime(df['Year_Built'], format='%Y')
df['Decade_Built'] = (df['Year_Built'].dt.year // 10) * 10
df['Is_Post_2010'] = (df['Year_Built'].dt.year >= 2010).astype(int)
df['Age_at_Purchase'] = 2026 - df['Year_Built'].dt.year

10
When to Use What
Quick reference cheat sheet
| Technique | Data Type | When to Use | Watch Out For |
|-----------|-----------|-------------|---------------|
| Label Encoding | Ordinal categorical | Order matters (Poor < Good < Excellent) | Never on nominal data — implies false order |
| One-Hot Encoding | Nominal categorical | Low cardinality (<20 unique values), tree or linear models | Dummy variable trap; explodes with high cardinality |
| Binary Encoding | Nominal categorical | High cardinality (50–500+ unique values) | Bit patterns may not have intuitive meaning |
| Target Encoding | Nominal, high cardinality | Strong cat–target relationship, regression tasks | Target leakage! Use cross-fold + smoothing |
| Frequency Encoding | Nominal categorical | Frequency carries signal; unsupervised preprocessing | Multiple categories with the same frequency get the same encoded value |
| Feature Selection | Any | Too many features, overfitting, slow training | Don't select on the full dataset; use CV to avoid bias |
| Scaling | Numeric | Distance-based models (KNN, SVM), neural networks, PCA | Not needed for tree-based models (RF, XGBoost) |
| Log Transform | Numeric, right-skewed | Income, prices, counts — power-law distributed data | Can't take log of 0 or negatives → use log1p or Yeo-Johnson |
| Binning | Continuous numeric | Non-linear relationships, outlier-robust models | Loses information; choose bin boundaries wisely |
| Feature Creation | Any | Domain expertise suggests interaction effects | Too many engineered features → overfitting again |
πŸ† The Golden Rule of Feature Engineering: Always apply transformations AFTER splitting into train/test sets. Fit scalers, encoders, and imputers on TRAINING data only, then transform both train and test. Fitting on the full dataset causes data leakage β€” your test set secretly influences your preprocessing.