Feature Engineering
A complete hands-on guide to encoding, selecting, transforming, and creating features, taught through a consistent real-world dataset.
| # | City | Neighborhood Quality | House Style | Year Built | Area (sqft) | Bedrooms | Has Pool | Price ($) |
|---|---|---|---|---|---|---|---|---|
| 1 | Mumbai | Excellent | Bungalow | 2010 | 1800 | 3 | Yes | 82,000 |
| 2 | Delhi | Good | Apartment | 2015 | 950 | 2 | No | 45,000 |
| 3 | Bangalore | Average | Villa | 2005 | 2500 | 4 | Yes | 110,000 |
| 4 | Mumbai | Poor | Apartment | 2000 | 700 | 1 | No | 28,000 |
| 5 | Chennai | Good | Bungalow | 2018 | 1600 | 3 | No | 67,000 |
| 6 | Delhi | Excellent | Villa | 2020 | 3000 | 5 | Yes | 145,000 |
| 7 | Bangalore | Average | Apartment | 2012 | 1100 | 2 | No | 52,000 |
| 8 | Chennai | Poor | Bungalow | 1998 | 1200 | 3 | No | 31,000 |
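Every example below can be reproduced in pandas. A minimal setup sketch that builds the dataset above as a DataFrame (the short column names are my own shorthand, not part of the original table):

```python
import pandas as pd

# The guide's 8-row housing dataset.
df = pd.DataFrame({
    "City":      ["Mumbai", "Delhi", "Bangalore", "Mumbai",
                  "Chennai", "Delhi", "Bangalore", "Chennai"],
    "Quality":   ["Excellent", "Good", "Average", "Poor",
                  "Good", "Excellent", "Average", "Poor"],
    "Style":     ["Bungalow", "Apartment", "Villa", "Apartment",
                  "Bungalow", "Villa", "Apartment", "Bungalow"],
    "YearBuilt": [2010, 2015, 2005, 2000, 2018, 2020, 2012, 1998],
    "Area":      [1800, 950, 2500, 700, 1600, 3000, 1100, 1200],
    "Bedrooms":  [3, 2, 4, 1, 3, 5, 2, 3],
    "HasPool":   ["Yes", "No", "Yes", "No", "No", "Yes", "No", "No"],
    "Price":     [82000, 45000, 110000, 28000, 67000, 145000, 52000, 31000],
})
```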
Label Encoding (also called ordinal encoding) assigns a whole number to each category, preserving their natural order: Poor < Average < Good < Excellent becomes 0 < 1 < 2 < 3. The model can now understand that Excellent is "greater than" Good, which is exactly true here.
Before
| # | Neighborhood Quality |
|---|---|
| 1 | Excellent |
| 2 | Good |
| 3 | Average |
| 4 | Poor |
After Label Encoding
| # | Neighborhood Quality |
|---|---|
| 1 | 3 |
| 2 | 2 |
| 3 | 1 |
| 4 | 0 |
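A minimal sketch, continuing with the df from the setup block. An explicit mapping keeps the order under our control; scikit-learn's OrdinalEncoder does the same job as a reusable transformer:

```python
# Explicit ordinal mapping: Poor < Average < Good < Excellent -> 0..3
quality_order = {"Poor": 0, "Average": 1, "Good": 2, "Excellent": 3}
df["Quality_encoded"] = df["Quality"].map(quality_order)

# Equivalent with scikit-learn:
from sklearn.preprocessing import OrdinalEncoder
enc = OrdinalEncoder(categories=[["Poor", "Average", "Good", "Excellent"]])
df["Quality_encoded"] = enc.fit_transform(df[["Quality"]])
```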
One-Hot Encoding creates one binary column per category: a row gets 1 in its city's column and 0 everywhere else. No false ordering is implied; Mumbai is not "greater than" Delhi.
| # | City (original) | City_Mumbai | City_Delhi | City_Bangalore | City_Chennai |
|---|---|---|---|---|---|
| 1 | Mumbai | 1 | 0 | 0 | 0 |
| 2 | Delhi | 0 | 1 | 0 | 0 |
| 3 | Bangalore | 0 | 0 | 1 | 0 |
| 4 | Mumbai | 1 | 0 | 0 | 0 |
| 5 | Chennai | 0 | 0 | 0 | 1 |
| 6 | Delhi | 0 | 1 | 0 | 0 |
| 7 | Bangalore | 0 | 0 | 1 | 0 |
| 8 | Chennai | 0 | 0 | 0 | 1 |
To avoid the dummy variable trap with linear models, drop one of the columns: pass drop_first=True in pandas.
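A minimal sketch with pd.get_dummies, continuing with the df from the setup block:

```python
# One binary column per city; dtype=int gives 0/1 instead of True/False
city_dummies = pd.get_dummies(df["City"], prefix="City", dtype=int)

# For linear models, drop one column to avoid the dummy variable trap
city_dummies_lin = pd.get_dummies(df["City"], prefix="City",
                                  drop_first=True, dtype=int)
```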
Binary Encoding first converts each category to a number (like Label Encoding), then converts that number into its binary representation and splits each binary digit into a separate column. 4 categories → 2 bits (columns). 8 categories → 3 bits. 1,000 categories → just 10 columns!
Step 1: Assign an integer to each category → Mumbai=1, Delhi=2, Bangalore=3, Chennai=4
Step 2: Convert to binary → 1=001, 2=010, 3=011, 4=100
Step 3: Each bit becomes a column → City_b1, City_b2, City_b3
| City | Integer | Binary | City_b1 | City_b2 | City_b3 |
|---|---|---|---|---|---|
| Mumbai | 1 | 001 | 0 | 0 | 1 |
| Delhi | 2 | 010 | 0 | 1 | 0 |
| Bangalore | 3 | 011 | 0 | 1 | 1 |
| Chennai | 4 | 100 | 1 | 0 | 0 |
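The third-party category_encoders package implements this directly; a minimal sketch, continuing with the df from the setup block:

```python
# pip install category_encoders
import category_encoders as ce

# Replaces City with bit columns (three of them for 4 categories)
encoder = ce.BinaryEncoder(cols=["City"])
df_binary = encoder.fit_transform(df)
```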
Target (Mean) Encoding replaces each category with the mean of the target variable for rows in that category. If Mumbai houses average $55,000, every row with City=Mumbai gets the value 55,000. This directly encodes the relationship between category and target.
Step 1: Calculate Mean Price per City
| City | Rows | Avg Price |
|---|---|---|
| Mumbai | 1, 4 | $55,000 |
| Delhi | 2, 6 | $95,000 |
| Bangalore | 3, 7 | $81,000 |
| Chennai | 5, 8 | $49,000 |
Step 2: Replace City with Mean
| # | City (original) | City_encoded |
|---|---|---|
| 1 | Mumbai | 55,000 |
| 2 | Delhi | 95,000 |
| 3 | Bangalore | 81,000 |
| 4 | Mumbai | 55,000 |
| 5 | Chennai | 49,000 |
| 6 | Delhi | 95,000 |
| 7 | Bangalore | 81,000 |
| 8 | Chennai | 49,000 |
Caution: computing these means on the full training data leaks the target into the features. In practice, use smoothing and out-of-fold estimation, e.g. category_encoders.TargetEncoder, which handles this.
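A minimal sketch of both the naive and the safer version, continuing with the df from the setup block (the smoothing value is an arbitrary choice here):

```python
import category_encoders as ce

# Naive version, shown for intuition only -- it leaks the target
df["City_encoded"] = df.groupby("City")["Price"].transform("mean")

# Safer version: blends each city's mean with the global mean (smoothing),
# which tames rare categories and reduces leakage
te = ce.TargetEncoder(cols=["City"], smoothing=1.0)
city_target_encoded = te.fit_transform(df[["City"]], df["Price"])
```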
Count / Frequency Encoding replaces each category with either its count (how many rows have that value) or its frequency (proportion of the total). No target variable is needed, making it purely unsupervised. Mumbai appears 2 times out of 8 → 2 (count) or 0.25 (frequency).
| # | City | House Style | City Count | City Freq | Style Count |
|---|---|---|---|---|---|
| 1 | Mumbai | Bungalow | 2 | 0.25 | 3 |
| 2 | Delhi | Apartment | 2 | 0.25 | 3 |
| 3 | Bangalore | Villa | 2 | 0.25 | 2 |
| 4 | Mumbai | Apartment | 2 | 0.25 | 3 |
| 5 | Chennai | Bungalow | 2 | 0.25 | 3 |
| 6 | Delhi | Villa | 2 | 0.25 | 2 |
| 7 | Bangalore | Apartment | 2 | 0.25 | 3 |
| 8 | Chennai | Bungalow | 2 | 0.25 | 3 |
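A minimal sketch, continuing with the df from the setup block:

```python
# Count encoding: how many rows share this value
df["City_count"] = df["City"].map(df["City"].value_counts())

# Frequency encoding: proportion of all rows
df["City_freq"] = df["City"].map(df["City"].value_counts(normalize=True))
```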
Other encoders worth knowing (see the sketch just below this list):
- Hashing Encoding: applies a hash function to categories and maps them to a fixed number of columns. Handles unseen categories and is memory-efficient; used for feature hashing in NLP.
- Leave-One-Out Encoding: like Target Encoding, but computes each row's mean with the current row excluded. Significantly reduces target leakage. Best for small datasets.
- Helmert Encoding: compares each level of a category to the mean of all subsequent levels. Useful in ANOVA-style statistical analysis; rarely used in ML.
- Binary Yes/No mapping: for columns like Has Pool (Yes/No), simply map to 0 and 1. This is actually a special case of label encoding for binary nominal data.
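The hashing and Yes/No cases are easy to show; a minimal sketch, continuing with the df from the setup block (the hash width n_features=8 is an arbitrary choice):

```python
from sklearn.feature_extraction import FeatureHasher

# Binary Yes/No mapping
df["HasPool_bin"] = df["HasPool"].map({"Yes": 1, "No": 0})

# Hashing encoding: each city is hashed into a fixed-width vector
hasher = FeatureHasher(n_features=8, input_type="string")
city_hashed = hasher.transform([[c] for c in df["City"]])  # 8x8 sparse matrix
```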
Feature Selection = picking the RIGHT features. Too many features = overfitting, slow training, noise. Three main approaches:
Filter Methods: score each feature independently using statistical tests, then select the top-k features by score.
Correlation
For numeric features. Drop if |corr| with target < 0.1, or if two features have |corr| > 0.9 with each other.
Chi-Square (χ²)
For categorical features + categorical target. Tests statistical independence between feature and target.
ANOVA F-test
For numeric features + categorical target. Variance between groups vs within groups.
| Feature | Correlation with Price | Decision |
|---|---|---|
| Area (sqft) | +0.94 | ✓ Keep (strong positive correlation) |
| Bedrooms | +0.87 | ✓ Keep |
| Year Built | +0.71 | ✓ Keep |
| Neighborhood Quality | +0.82 | ✓ Keep (ordinal encoded) |
| Has Pool | +0.12 | ✗ Maybe drop (weak signal) |
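A minimal sketch of the filter approach with scikit-learn's SelectKBest, continuing with the df from the setup block (f_regression suits a numeric target; the feature list and k are illustrative):

```python
from sklearn.feature_selection import SelectKBest, f_regression

num_cols = ["Area", "Bedrooms", "YearBuilt"]

# Quick look: correlation of each numeric feature with the target
print(df[num_cols + ["Price"]].corr()["Price"])

# Keep the top 2 features by F-score
selector = SelectKBest(score_func=f_regression, k=2)
X_top = selector.fit_transform(df[num_cols], df["Price"])
print(selector.get_feature_names_out())
```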
Wrapper Methods: train a model with different subsets of features and pick the subset that gives the best performance. Expensive but powerful.
RFE (Recursive Feature Elimination)
Train model → remove weakest feature → retrain → repeat until k features remain. Like peeling an onion.
Forward / Backward Selection
Start empty, add best feature each step (Forward). OR start full, remove worst feature each step (Backward).
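Both are available in scikit-learn; a minimal sketch continuing with the df from the setup block (cv=2 only because the toy dataset has 8 rows):

```python
from sklearn.feature_selection import RFE, SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = df[["Area", "Bedrooms", "YearBuilt"]], df["Price"]

# RFE: repeatedly drop the weakest feature until 2 remain
rfe = RFE(LinearRegression(), n_features_to_select=2).fit(X, y)
print(X.columns[rfe.support_])

# Forward selection: start empty, greedily add the best feature
sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select=2,
                                direction="forward", cv=2).fit(X, y)
print(X.columns[sfs.get_support()])
```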
Embedded Methods: feature selection happens inside the model training itself. Lasso (L1 regularization) drives useless feature weights to exactly zero; tree models provide feature importances.
| Feature | Random Forest Importance | Lasso Coefficient | Decision |
|---|---|---|---|
| Area (sqft) | 0.42 | 18.5 | ✓ Most important |
| Neighborhood Quality | 0.28 | 12.1 | ✓ Important |
| Year Built | 0.16 | 8.3 | ✓ Moderate |
| Bedrooms | 0.11 | 3.2 | ✓ Keep |
| Has Pool | 0.03 | 0.0 | ✗ Lasso set it to 0 → drop |
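A minimal sketch of both embedded routes, continuing with the df from the setup block. The table's numbers are illustrative and not reproduced by this code; alpha is an arbitrary choice, and the features are standardized first because the L1 penalty is scale-sensitive:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = df[["Area", "Bedrooms", "YearBuilt"]], df["Price"]

# Tree route: importances come free with the fitted model
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
print(dict(zip(X.columns, rf.feature_importances_.round(2))))

# Lasso route: a large alpha pushes useless coefficients toward exactly 0
lasso = Lasso(alpha=1000).fit(StandardScaler().fit_transform(X), y)
print(dict(zip(X.columns, lasso.coef_.round(1))))
```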
Many ML algorithms assume features are on similar scales or have a Gaussian distribution. Transformation fixes skewed or differently-scaled features.
Area ranges 700–3000, but Bedrooms ranges 1–5. Without scaling, Area dominates the model unfairly.
Min-Max Scaling (Normalization)
Formula: x_scaled = (x − min) / (max − min), squashing values into [0, 1]. Use when: Neural networks, KNN, SVM. Sensitive to outliers.
Standard Scaling (Standardization)
Formula: z = (x − mean) / std, centering on 0 with unit variance. Use when: Linear regression, Logistic regression, PCA. Less sensitive to outliers than min-max scaling, though extreme outliers still distort the mean and standard deviation.
Applying both to Area, with the scalers fit on all 8 houses (min 700, max 3000, mean ≈ 1606, std ≈ 746):
| # | Area (original) | Area MinMax | Area StandardScaled |
|---|---|---|---|
| 1 | 1800 | 0.48 | +0.26 |
| 2 | 950 | 0.11 | −0.88 |
| 3 | 2500 | 0.78 | +1.20 |
| 4 | 700 | 0.00 | −1.21 |
| 5 | 1600 | 0.39 | −0.01 |
| 6 | 3000 | 1.00 | +1.87 |
| 7 | 1100 | 0.17 | −0.68 |
| 8 | 1200 | 0.22 | −0.54 |
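A minimal sketch, continuing with the df from the setup block (in a real pipeline, fit the scalers on the training split only, then transform the test split with the same fitted object):

```python
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df["Area_minmax"] = MinMaxScaler().fit_transform(df[["Area"]])
df["Area_std"] = StandardScaler().fit_transform(df[["Area"]])
```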
House prices are often right-skewed (a few very expensive homes pull the distribution right). Log transform brings them closer to Gaussian, which many models prefer.
Before (Right-Skewed)
52,000 · 67,000 · 82,000 · 110,000 · 145,000
Skew: +1.2 (right-skewed)
After Log Transform
10.86 · 11.11 · 11.31 · 11.61 · 11.88
Skew: −0.1 (near-normal ✓)
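A minimal sketch with NumPy, continuing with the df from the setup block (log1p computes log(1 + x), so zeros are handled gracefully):

```python
import numpy as np

df["Price_log"] = np.log1p(df["Price"])
print(round(df["Price"].skew(), 2), round(df["Price_log"].skew(), 2))
```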
Binning (discretization) groups continuous values into bins. Useful when the relationship isn't linear (e.g., "new" vs "old" houses matters more than exact year). Can also help with outliers.
| # | Year Built | House Age | Age Bin | Age Label |
|---|---|---|---|---|
| 1 | 2010 | 16 | 1 | Recent |
| 2 | 2015 | 11 | 1 | Recent |
| 3 | 2005 | 21 | 2 | Middle-Aged |
| 4 | 2000 | 26 | 2 | Middle-Aged |
| 5 | 2018 | 8 | 0 | New |
| 6 | 2020 | 6 | 0 | New |
| 7 | 2012 | 14 | 1 | Recent |
| 8 | 1998 | 28 | 3 | Old |
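A minimal sketch with pd.cut, continuing with the df from the setup block. The bin edges (10, 20, 27 years) are my reconstruction from the table above, not stated in the original:

```python
# The ages in the table imply a reference year of 2026
df["HouseAge"] = 2026 - df["YearBuilt"]

df["AgeLabel"] = pd.cut(df["HouseAge"],
                        bins=[0, 10, 20, 27, 200],  # assumed edges
                        labels=["New", "Recent", "Middle-Aged", "Old"])
```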
Often the best features aren't in your raw data β you have to build them. This is where domain knowledge meets data science.
Create new numeric features by combining existing ones in meaningful ways.
| # | Area | Bedrooms | Year Built | Area/Bedroom | House Age | Price/sqft (target) |
|---|---|---|---|---|---|---|
| 1 | 1800 | 3 | 2010 | 600 | 16 | 45.6 |
| 2 | 950 | 2 | 2015 | 475 | 11 | 47.4 |
| 3 | 2500 | 4 | 2005 | 625 | 21 | 44.0 |
| 4 | 700 | 1 | 2000 | 700 | 26 | 40.0 |
| 5 | 1600 | 3 | 2018 | 533 | 8 | 41.9 |
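A minimal sketch, continuing with the df from the setup block. Note that Price/sqft is built from the target: fine for analysis, but leaky if fed back into a model as an input:

```python
df["Area_per_Bedroom"] = df["Area"] / df["Bedrooms"]
df["HouseAge"] = 2026 - df["YearBuilt"]  # reference year implied by the table

# Derived from the target -- illustration only, not a training feature
df["Price_per_sqft"] = (df["Price"] / df["Area"]).round(1)
```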
Interaction Features: the combined effect of two features can be stronger than either alone. A "Good neighborhood with large area" might deserve its own feature.
| # | Neighborhood Quality | Area | Luxury Score = Quality Γ Area | Is Luxury? |
|---|---|---|---|---|
| 1 | Excellent (3) | 1800 | 5,400 | 1 |
| 2 | Good (2) | 950 | 1,900 | 0 |
| 3 | Average (1) | 2500 | 2,500 | 0 |
| 4 | Poor (0) | 700 | 0 | 0 |
| 6 | Excellent (3) | 3000 | 9,000 | 1 |
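A minimal sketch, continuing with the df from the setup block; the luxury threshold of 5,000 is an assumption that reproduces the table's Is Luxury flags:

```python
quality_order = {"Poor": 0, "Average": 1, "Good": 2, "Excellent": 3}
df["LuxuryScore"] = df["Quality"].map(quality_order) * df["Area"]

# Assumed threshold: any cut-off between 3,200 and 5,400 matches the table
df["IsLuxury"] = (df["LuxuryScore"] >= 5000).astype(int)
```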
Aggregation Features: compute statistics (mean, max, count, std) per group. "What's the average house size in this city?" can be a powerful feature.
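A minimal sketch with groupby, continuing with the df from the setup block (the choice of stats is illustrative):

```python
# Per-city average area, broadcast back to every row
df["City_avg_area"] = df.groupby("City")["Area"].transform("mean")

# Several stats at once, merged back by key
city_stats = (df.groupby("City")["Price"]
                .agg(["mean", "max", "count"])
                .add_prefix("CityPrice_"))
df = df.merge(city_stats, left_on="City", right_index=True)
```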
Date/Time Features: if you have timestamps, decompose them into rich features that models can use.
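The housing dataset has no timestamp column, so this sketch uses a hypothetical listing-date series purely for illustration:

```python
# Hypothetical dates -- not part of the housing dataset
listed = pd.to_datetime(pd.Series(["2024-03-15", "2024-11-02"]))

date_features = pd.DataFrame({
    "year":       listed.dt.year,
    "month":      listed.dt.month,
    "dayofweek":  listed.dt.dayofweek,              # 0 = Monday
    "is_weekend": (listed.dt.dayofweek >= 5).astype(int),
    "quarter":    listed.dt.quarter,
})
```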
Summary
| Technique | Data Type | When to Use | Watch Out For |
|---|---|---|---|
| Label Encoding | Ordinal categorical | Order matters (Poor < Good < Excellent) | Never on nominal data: implies a false order |
| One-Hot Encoding | Nominal categorical | Low cardinality (<20 unique values), tree or linear models | Dummy variable trap; explodes with high cardinality |
| Binary Encoding | Nominal categorical | High cardinality (50β500+ unique values) | Bit patterns may not have intuitive meaning |
| Target Encoding | Nominal, high cardinality | Strong category–target relationship, regression tasks | Target leakage! Use cross-fold + smoothing |
| Frequency Encoding | Nominal categorical | Frequency carries signal; unsupervised preprocessing | Categories with the same frequency get the same encoded value |
| Feature Selection | Any | Too many features, overfitting, slow training | Don't select on full dataset; use CV to avoid bias |
| Scaling | Numeric | Distance-based models (KNN, SVM), Neural networks, PCA | Not needed for tree-based models (RF, XGBoost) |
| Log Transform | Numeric, right-skewed | Income, prices, counts: power-law distributed data | Can't take log of 0 or negatives → use log1p or Yeo-Johnson |
| Binning | Continuous numeric | Non-linear relationships, outlier-robust models | Loses information; choose bin boundaries wisely |
| Feature Creation | Any | Domain expertise suggests interaction effects | Too many engineered features → overfitting again |