Feature Engineering
A complete hands-on guide to encoding, selecting, transforming, and creating features, taught through a consistent real-world dataset.
| # | City | Neighborhood Quality | House Style | Year Built | Area (sqft) | Bedrooms | Has Pool | Price ($) |
|---|---|---|---|---|---|---|---|---|
| 1 | Mumbai | Excellent | Bungalow | 2010 | 1800 | 3 | Yes | 82,000 |
| 2 | Delhi | Good | Apartment | 2015 | 950 | 2 | No | 45,000 |
| 3 | Bangalore | Average | Villa | 2005 | 2500 | 4 | Yes | 110,000 |
| 4 | Mumbai | Poor | Apartment | 2000 | 700 | 1 | No | 28,000 |
| 5 | Chennai | Good | Bungalow | 2018 | 1600 | 3 | No | 67,000 |
| 6 | Delhi | Excellent | Villa | 2020 | 3000 | 5 | Yes | 145,000 |
| 7 | Bangalore | Average | Apartment | 2012 | 1100 | 2 | No | 52,000 |
| 8 | Chennai | Poor | Bungalow | 1998 | 1200 | 3 | No | 31,000 |
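Every example below can be reproduced in pandas. A minimal setup sketch that builds the dataset above as a DataFrame (the short column names are my own shorthand, not part of the original table):

```python
import pandas as pd

# The guide's 8-row housing dataset.
df = pd.DataFrame({
    "City":      ["Mumbai", "Delhi", "Bangalore", "Mumbai",
                  "Chennai", "Delhi", "Bangalore", "Chennai"],
    "Quality":   ["Excellent", "Good", "Average", "Poor",
                  "Good", "Excellent", "Average", "Poor"],
    "Style":     ["Bungalow", "Apartment", "Villa", "Apartment",
                  "Bungalow", "Villa", "Apartment", "Bungalow"],
    "YearBuilt": [2010, 2015, 2005, 2000, 2018, 2020, 2012, 1998],
    "Area":      [1800, 950, 2500, 700, 1600, 3000, 1100, 1200],
    "Bedrooms":  [3, 2, 4, 1, 3, 5, 2, 3],
    "HasPool":   ["Yes", "No", "Yes", "No", "No", "Yes", "No", "No"],
    "Price":     [82000, 45000, 110000, 28000, 67000, 145000, 52000, 31000],
})
```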
Label Encoding (also called ordinal encoding) assigns a whole number to each category, preserving their natural order: Poor < Average < Good < Excellent becomes 0 < 1 < 2 < 3. The model can now understand that Excellent is "greater than" Good, which is exactly true here.
Before
| # | Neighborhood Quality |
|---|---|
| 1 | Excellent |
| 2 | Good |
| 3 | Average |
| 4 | Poor |
After Label Encoding
| # | Neighborhood Quality |
|---|---|
| 1 | 3 |
| 2 | 2 |
| 3 | 1 |
| 4 | 0 |
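A minimal sketch, continuing with the df from the setup block. An explicit mapping keeps the order under our control; scikit-learn's OrdinalEncoder does the same job as a reusable transformer:

```python
# Explicit ordinal mapping: Poor < Average < Good < Excellent -> 0..3
quality_order = {"Poor": 0, "Average": 1, "Good": 2, "Excellent": 3}
df["Quality_encoded"] = df["Quality"].map(quality_order)

# Equivalent with scikit-learn:
from sklearn.preprocessing import OrdinalEncoder
enc = OrdinalEncoder(categories=[["Poor", "Average", "Good", "Excellent"]])
df["Quality_encoded"] = enc.fit_transform(df[["Quality"]])
```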
One-Hot Encoding creates one binary column per category: a row gets 1 in its city's column and 0 everywhere else. No false ordering is implied; Mumbai is not "greater than" Delhi.
| # | City (original) | City_Mumbai | City_Delhi | City_Bangalore | City_Chennai |
|---|---|---|---|---|---|
| 1 | Mumbai | 1 | 0 | 0 | 0 |
| 2 | Delhi | 0 | 1 | 0 | 0 |
| 3 | Bangalore | 0 | 0 | 1 | 0 |
| 4 | Mumbai | 1 | 0 | 0 | 0 |
| 5 | Chennai | 0 | 0 | 0 | 1 |
| 6 | Delhi | 0 | 1 | 0 | 0 |
| 7 | Bangalore | 0 | 0 | 1 | 0 |
| 8 | Chennai | 0 | 0 | 0 | 1 |
To avoid the dummy variable trap with linear models, drop one of the columns: pass drop_first=True in pandas.
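A minimal sketch with pd.get_dummies, continuing with the df from the setup block:

```python
# One binary column per city; dtype=int gives 0/1 instead of True/False
city_dummies = pd.get_dummies(df["City"], prefix="City", dtype=int)

# For linear models, drop one column to avoid the dummy variable trap
city_dummies_lin = pd.get_dummies(df["City"], prefix="City",
                                  drop_first=True, dtype=int)
```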
Binary Encoding first converts each category to a number (like Label Encoding), then converts that number into its binary representation and splits each binary digit into a separate column. 4 categories → 2 bits (columns). 8 categories → 3 bits. 1,000 categories → just 10 columns!
Step 1: Assign an integer to each category → Mumbai=1, Delhi=2, Bangalore=3, Chennai=4
Step 2: Convert to binary → 1=001, 2=010, 3=011, 4=100
Step 3: Each bit becomes a column → City_b1, City_b2, City_b3
| City | Integer | Binary | City_b1 | City_b2 | City_b3 |
|---|---|---|---|---|---|
| Mumbai | 1 | 001 | 0 | 0 | 1 |
| Delhi | 2 | 010 | 0 | 1 | 0 |
| Bangalore | 3 | 011 | 0 | 1 | 1 |
| Chennai | 4 | 100 | 1 | 0 | 0 |
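The third-party category_encoders package implements this directly; a minimal sketch, continuing with the df from the setup block:

```python
# pip install category_encoders
import category_encoders as ce

# Replaces City with bit columns (three of them for 4 categories)
encoder = ce.BinaryEncoder(cols=["City"])
df_binary = encoder.fit_transform(df)
```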
Target (Mean) Encoding replaces each category with the mean of the target variable for rows in that category. If Mumbai houses average $55,000, every row with City=Mumbai gets the value 55,000. This directly encodes the relationship between category and target.
Step 1: Calculate Mean Price per City
| City | Rows | Avg Price |
|---|---|---|
| Mumbai | 1, 4 | $55,000 |
| Delhi | 2, 6 | $95,000 |
| Bangalore | 3, 7 | $81,000 |
| Chennai | 5, 8 | $49,000 |
Step 2: Replace City with Mean
| # | City (original) | City_encoded |
|---|---|---|
| 1 | Mumbai | 55,000 |
| 2 | Delhi | 95,000 |
| 3 | Bangalore | 81,000 |
| 4 | Mumbai | 55,000 |
| 5 | Chennai | 49,000 |
| 6 | Delhi | 95,000 |
| 7 | Bangalore | 81,000 |
| 8 | Chennai | 49,000 |
Caution: computing these means on the full training data leaks the target into the features. In practice, use smoothing and out-of-fold estimation, e.g. category_encoders.TargetEncoder, which handles this.
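A minimal sketch of both the naive and the safer version, continuing with the df from the setup block (the smoothing value is an arbitrary choice here):

```python
import category_encoders as ce

# Naive version, shown for intuition only -- it leaks the target
df["City_encoded"] = df.groupby("City")["Price"].transform("mean")

# Safer version: blends each city's mean with the global mean (smoothing),
# which tames rare categories and reduces leakage
te = ce.TargetEncoder(cols=["City"], smoothing=1.0)
city_target_encoded = te.fit_transform(df[["City"]], df["Price"])
```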
Count / Frequency Encoding replaces each category with either its count (how many rows have that value) or its frequency (proportion of the total). No target variable is needed, making it purely unsupervised. Mumbai appears 2 times out of 8 → 2 (count) or 0.25 (frequency).
| # | City | House Style | City Count | City Freq | Style Count |
|---|---|---|---|---|---|
| 1 | Mumbai | Bungalow | 2 | 0.25 | 3 |
| 2 | Delhi | Apartment | 2 | 0.25 | 3 |
| 3 | Bangalore | Villa | 2 | 0.25 | 2 |
| 4 | Mumbai | Apartment | 2 | 0.25 | 3 |
| 5 | Chennai | Bungalow | 2 | 0.25 | 3 |
| 6 | Delhi | Villa | 2 | 0.25 | 2 |
| 7 | Bangalore | Apartment | 2 | 0.25 | 3 |
| 8 | Chennai | Bungalow | 2 | 0.25 | 3 |
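A minimal sketch, continuing with the df from the setup block:

```python
# Count encoding: how many rows share this value
df["City_count"] = df["City"].map(df["City"].value_counts())

# Frequency encoding: proportion of all rows
df["City_freq"] = df["City"].map(df["City"].value_counts(normalize=True))
```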
Other encoders worth knowing (see the sketch just below this list):
- Hashing Encoding: applies a hash function to categories and maps them to a fixed number of columns. Handles unseen categories and is memory-efficient; used for feature hashing in NLP.
- Leave-One-Out Encoding: like Target Encoding, but computes each row's mean with the current row excluded. Significantly reduces target leakage. Best for small datasets.
- Helmert Encoding: compares each level of a category to the mean of all subsequent levels. Useful in ANOVA-style statistical analysis; rarely used in ML.
- Binary Yes/No mapping: for columns like Has Pool (Yes/No), simply map to 0 and 1. This is actually a special case of label encoding for binary nominal data.
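The hashing and Yes/No cases are easy to show; a minimal sketch, continuing with the df from the setup block (the hash width n_features=8 is an arbitrary choice):

```python
from sklearn.feature_extraction import FeatureHasher

# Binary Yes/No mapping
df["HasPool_bin"] = df["HasPool"].map({"Yes": 1, "No": 0})

# Hashing encoding: each city is hashed into a fixed-width vector
hasher = FeatureHasher(n_features=8, input_type="string")
city_hashed = hasher.transform([[c] for c in df["City"]])  # 8x8 sparse matrix
```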
Feature Selection = picking the RIGHT features. Too many features = overfitting, slow training, noise. Three main approaches:
Filter Methods: score each feature independently using statistical tests, then select the top-k features by score.
Correlation
For numeric features. Drop if |corr| with target < 0.1, or if two features have |corr| > 0.9 with each other.
Chi-Square (χ²)
For categorical features + categorical target. Tests statistical independence between feature and target.
ANOVA F-test
For numeric features + categorical target. Variance between groups vs within groups.
| Feature | Correlation with Price | Decision |
|---|---|---|
| Area (sqft) | +0.94 | ✓ Keep (strong positive correlation) |
| Bedrooms | +0.87 | ✓ Keep |
| Year Built | +0.71 | ✓ Keep |
| Neighborhood Quality | +0.82 | ✓ Keep (ordinal encoded) |
| Has Pool | +0.12 | ✗ Maybe drop (weak signal) |
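A minimal sketch of the filter approach with scikit-learn's SelectKBest, continuing with the df from the setup block (f_regression suits a numeric target; the feature list and k are illustrative):

```python
from sklearn.feature_selection import SelectKBest, f_regression

num_cols = ["Area", "Bedrooms", "YearBuilt"]

# Quick look: correlation of each numeric feature with the target
print(df[num_cols + ["Price"]].corr()["Price"])

# Keep the top 2 features by F-score
selector = SelectKBest(score_func=f_regression, k=2)
X_top = selector.fit_transform(df[num_cols], df["Price"])
print(selector.get_feature_names_out())
```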
Wrapper Methods: train a model with different subsets of features and pick the subset that gives the best performance. Expensive but powerful.
RFE (Recursive Feature Elimination)
Train model → remove weakest feature → retrain → repeat until k features remain. Like peeling an onion.
Forward / Backward Selection
Start empty, add best feature each step (Forward). OR start full, remove worst feature each step (Backward).
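Both are available in scikit-learn; a minimal sketch continuing with the df from the setup block (cv=2 only because the toy dataset has 8 rows):

```python
from sklearn.feature_selection import RFE, SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = df[["Area", "Bedrooms", "YearBuilt"]], df["Price"]

# RFE: repeatedly drop the weakest feature until 2 remain
rfe = RFE(LinearRegression(), n_features_to_select=2).fit(X, y)
print(X.columns[rfe.support_])

# Forward selection: start empty, greedily add the best feature
sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select=2,
                                direction="forward", cv=2).fit(X, y)
print(X.columns[sfs.get_support()])
```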
Embedded Methods: feature selection happens inside the model training itself. Lasso (L1 regularization) drives useless feature weights to exactly zero; tree models provide feature importances.
| Feature | Random Forest Importance | Lasso Coefficient | Decision |
|---|---|---|---|
| Area (sqft) | 0.42 | 18.5 | ✓ Most important |
| Neighborhood Quality | 0.28 | 12.1 | ✓ Important |
| Year Built | 0.16 | 8.3 | ✓ Moderate |
| Bedrooms | 0.11 | 3.2 | ✓ Keep |
| Has Pool | 0.03 | 0.0 | ✗ Lasso set it to 0 → drop |
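A minimal sketch of both embedded routes, continuing with the df from the setup block. The table's numbers are illustrative and not reproduced by this code; alpha is an arbitrary choice, and the features are standardized first because the L1 penalty is scale-sensitive:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = df[["Area", "Bedrooms", "YearBuilt"]], df["Price"]

# Tree route: importances come free with the fitted model
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
print(dict(zip(X.columns, rf.feature_importances_.round(2))))

# Lasso route: a large alpha pushes useless coefficients toward exactly 0
lasso = Lasso(alpha=1000).fit(StandardScaler().fit_transform(X), y)
print(dict(zip(X.columns, lasso.coef_.round(1))))
```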
Many ML algorithms assume features are on similar scales or have a Gaussian distribution. Transformation fixes skewed or differently-scaled features.
Area ranges 700–3000, but Bedrooms ranges 1–5. Without scaling, Area dominates the model unfairly.
Min-Max Scaling (Normalization)
Formula: x_scaled = (x − min) / (max − min), squashing values into [0, 1]. Use when: Neural networks, KNN, SVM. Sensitive to outliers.
Standard Scaling (Standardization)
Formula: z = (x − mean) / std, centering on 0 with unit variance. Use when: Linear regression, Logistic regression, PCA. Less sensitive to outliers than min-max scaling, though extreme outliers still distort the mean and standard deviation.
Applying both to Area, with the scalers fit on all 8 houses (min 700, max 3000, mean ≈ 1606, std ≈ 746):
| # | Area (original) | Area MinMax | Area StandardScaled |
|---|---|---|---|
| 1 | 1800 | 0.48 | +0.26 |
| 2 | 950 | 0.11 | −0.88 |
| 3 | 2500 | 0.78 | +1.20 |
| 4 | 700 | 0.00 | −1.21 |
| 5 | 1600 | 0.39 | −0.01 |
| 6 | 3000 | 1.00 | +1.87 |
| 7 | 1100 | 0.17 | −0.68 |
| 8 | 1200 | 0.22 | −0.54 |
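A minimal sketch, continuing with the df from the setup block (in a real pipeline, fit the scalers on the training split only, then transform the test split with the same fitted object):

```python
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df["Area_minmax"] = MinMaxScaler().fit_transform(df[["Area"]])
df["Area_std"] = StandardScaler().fit_transform(df[["Area"]])
```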
House prices are often right-skewed (a few very expensive homes pull the distribution right). Log transform brings them closer to Gaussian, which many models prefer.
Before (Right-Skewed)
52,000 · 67,000 · 82,000 · 110,000 · 145,000
Skew: +1.2 (right-skewed)
After Log Transform
10.86 · 11.11 · 11.31 · 11.61 · 11.88
Skew: −0.1 (near-normal ✓)
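A minimal sketch with NumPy, continuing with the df from the setup block (log1p computes log(1 + x), so zeros are handled gracefully):

```python
import numpy as np

df["Price_log"] = np.log1p(df["Price"])
print(round(df["Price"].skew(), 2), round(df["Price_log"].skew(), 2))
```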
Binning (discretization) groups continuous values into bins. Useful when the relationship isn't linear (e.g., "new" vs "old" houses matters more than exact year). Can also help with outliers.
| # | Year Built | House Age | Age Bin | Age Label |
|---|---|---|---|---|
| 1 | 2010 | 16 | 1 | Recent |
| 2 | 2015 | 11 | 1 | Recent |
| 3 | 2005 | 21 | 2 | Middle-Aged |
| 4 | 2000 | 26 | 2 | Middle-Aged |
| 5 | 2018 | 8 | 0 | New |
| 6 | 2020 | 6 | 0 | New |
| 7 | 2012 | 14 | 1 | Recent |
| 8 | 1998 | 28 | 3 | Old |
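A minimal sketch with pd.cut, continuing with the df from the setup block. The bin edges (10, 20, 27 years) are my reconstruction from the table above, not stated in the original:

```python
# The ages in the table imply a reference year of 2026
df["HouseAge"] = 2026 - df["YearBuilt"]

df["AgeLabel"] = pd.cut(df["HouseAge"],
                        bins=[0, 10, 20, 27, 200],  # assumed edges
                        labels=["New", "Recent", "Middle-Aged", "Old"])
```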
Often the best features aren't in your raw data β you have to build them. This is where domain knowledge meets data science.
Create new numeric features by combining existing ones in meaningful ways.
| # | Area | Bedrooms | Year Built | Area/Bedroom | House Age | Price/sqft (target) |
|---|---|---|---|---|---|---|
| 1 | 1800 | 3 | 2010 | 600 | 16 | 45.6 |
| 2 | 950 | 2 | 2015 | 475 | 11 | 47.4 |
| 3 | 2500 | 4 | 2005 | 625 | 21 | 44.0 |
| 4 | 700 | 1 | 2000 | 700 | 26 | 40.0 |
| 5 | 1600 | 3 | 2018 | 533 | 8 | 41.9 |
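A minimal sketch, continuing with the df from the setup block. Note that Price/sqft is built from the target: fine for analysis, but leaky if fed back into a model as an input:

```python
df["Area_per_Bedroom"] = df["Area"] / df["Bedrooms"]
df["HouseAge"] = 2026 - df["YearBuilt"]  # reference year implied by the table

# Derived from the target -- illustration only, not a training feature
df["Price_per_sqft"] = (df["Price"] / df["Area"]).round(1)
```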
Interaction Features: the combined effect of two features can be stronger than either alone. A "Good neighborhood with large area" might deserve its own feature.
| # | Neighborhood Quality | Area | Luxury Score = Quality Γ Area | Is Luxury? |
|---|---|---|---|---|
| 1 | Excellent (3) | 1800 | 5,400 | 1 |
| 2 | Good (2) | 950 | 1,900 | 0 |
| 3 | Average (1) | 2500 | 2,500 | 0 |
| 4 | Poor (0) | 700 | 0 | 0 |
| 6 | Excellent (3) | 3000 | 9,000 | 1 |
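A minimal sketch, continuing with the df from the setup block; the luxury threshold of 5,000 is an assumption that reproduces the table's Is Luxury flags:

```python
quality_order = {"Poor": 0, "Average": 1, "Good": 2, "Excellent": 3}
df["LuxuryScore"] = df["Quality"].map(quality_order) * df["Area"]

# Assumed threshold: any cut-off between 3,200 and 5,400 matches the table
df["IsLuxury"] = (df["LuxuryScore"] >= 5000).astype(int)
```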
Aggregation Features: compute statistics (mean, max, count, std) per group. "What's the average house size in this city?" can be a powerful feature.
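A minimal sketch with groupby, continuing with the df from the setup block (the choice of stats is illustrative):

```python
# Per-city average area, broadcast back to every row
df["City_avg_area"] = df.groupby("City")["Area"].transform("mean")

# Several stats at once, merged back by key
city_stats = (df.groupby("City")["Price"]
                .agg(["mean", "max", "count"])
                .add_prefix("CityPrice_"))
df = df.merge(city_stats, left_on="City", right_index=True)
```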
Date/Time Features: if you have timestamps, decompose them into rich features that models can use.
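The housing dataset has no timestamp column, so this sketch uses a hypothetical listing-date series purely for illustration:

```python
# Hypothetical dates -- not part of the housing dataset
listed = pd.to_datetime(pd.Series(["2024-03-15", "2024-11-02"]))

date_features = pd.DataFrame({
    "year":       listed.dt.year,
    "month":      listed.dt.month,
    "dayofweek":  listed.dt.dayofweek,              # 0 = Monday
    "is_weekend": (listed.dt.dayofweek >= 5).astype(int),
    "quarter":    listed.dt.quarter,
})
```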
Summary
| Technique | Data Type | When to Use | Watch Out For |
|---|---|---|---|
| Label Encoding | Ordinal categorical | Order matters (Poor < Good < Excellent) | Never on nominal data: implies a false order |
| One-Hot Encoding | Nominal categorical | Low cardinality (<20 unique values), tree or linear models | Dummy variable trap; explodes with high cardinality |
| Binary Encoding | Nominal categorical | High cardinality (50β500+ unique values) | Bit patterns may not have intuitive meaning |
| Target Encoding | Nominal, high cardinality | Strong category–target relationship, regression tasks | Target leakage! Use cross-fold + smoothing |
| Frequency Encoding | Nominal categorical | Frequency carries signal; unsupervised preprocessing | Categories with the same frequency get the same encoded value |
| Feature Selection | Any | Too many features, overfitting, slow training | Don't select on full dataset; use CV to avoid bias |
| Scaling | Numeric | Distance-based models (KNN, SVM), Neural networks, PCA | Not needed for tree-based models (RF, XGBoost) |
| Log Transform | Numeric, right-skewed | Income, prices, counts: power-law distributed data | Can't take log of 0 or negatives → use log1p or Yeo-Johnson |
| Binning | Continuous numeric | Non-linear relationships, outlier-robust models | Loses information; choose bin boundaries wisely |
| Feature Creation | Any | Domain expertise suggests interaction effects | Too many engineered features → overfitting again |