Predicting house prices is one of the best projects to learn machine learning from scratch. It covers every stage of a real ML workflow — exploring data, cleaning it, engineering useful features, training models and measuring how well they work.

By the end of this guide you will have a working model that takes inputs like square footage, number of bedrooms, neighbourhood and year built, and produces a predicted sale price. Along the way you will understand exactly why every step matters and what happens if you skip it.


The Problem We Are Solving

A real estate company wants to estimate the market value of a house before listing it. Instead of hiring an appraiser for every property, they want a model that can take the house's characteristics and instantly output a predicted price.

This is a supervised regression problem. Supervised because we have labelled training data — houses with known sale prices. Regression because the output is a continuous number (price) rather than a category.

The inputs (also called features) are things we know about the house: size in square feet, number of bedrooms and bathrooms, the neighbourhood, whether it has a garage, the year it was built and so on. The model learns the relationship between these features and the sale price from historical data, then uses that pattern to estimate prices for new houses it has never seen.

The ML Pipeline — Every Step in Order

Every machine learning project follows the same general sequence. Understanding this sequence first makes all the individual steps easier to follow.

1. Load the data: read the dataset into a pandas DataFrame and take a first look.
2. Explore the data (EDA): look at distributions, check for outliers, understand relationships between features.
3. Preprocess: fill missing values, encode categorical columns, scale numeric features.
4. Feature engineering: create new, more informative features from existing ones.
5. Split the data: separate into a training set (the model learns from this) and a test set (the model is evaluated on this).
6. Train models: fit a Linear Regression and a Random Forest on the training data.
7. Evaluate: measure predictions on the test set using MAE, RMSE and R² score.
8. Predict new houses: use the trained model to estimate prices for houses it has never seen.
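Before working through the real dataset, the whole eight-step sequence can be run in miniature. This sketch uses synthetic houses with made-up column names and values, purely for illustration:

```python
# Toy run of the full pipeline on synthetic data (illustrative only,
# not the California Housing workflow that follows).
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# 1-2. "Load" and explore: fake houses where size and bedrooms drive price
df_toy = pd.DataFrame({
    'sqft': rng.uniform(500, 3000, 500),
    'bedrooms': rng.integers(1, 6, 500),
})
df_toy['price'] = 100 * df_toy['sqft'] + 5000 * df_toy['bedrooms'] + rng.normal(0, 10000, 500)

# 3-4. Preprocess / feature engineer: add a ratio feature
df_toy['sqft_per_bedroom'] = df_toy['sqft'] / df_toy['bedrooms']

# 5. Split before scaling
X = df_toy[['sqft', 'bedrooms', 'sqft_per_bedroom']]
y = df_toy['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 6. Scale (fit on training data only) and train
scaler = StandardScaler()
model = LinearRegression().fit(scaler.fit_transform(X_train), y_train)

# 7. Evaluate on the held-out test set
preds = model.predict(scaler.transform(X_test))
print(f"Toy R²: {r2_score(y_test, preds):.3f}")  # close to 1, the data is nearly linear

# 8. Predict one new house
new = scaler.transform(pd.DataFrame([{'sqft': 1800, 'bedrooms': 3, 'sqft_per_bedroom': 600.0}]))
print(f"Toy prediction: ${model.predict(new)[0]:,.0f}")
```

Every section below expands one of these steps on the real dataset.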

Setup and Loading the Data

Python — install packages and load the dataset
# Install if needed
# pip install pandas numpy scikit-learn matplotlib seaborn

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.impute import SimpleImputer

# Load the California Housing dataset (built into scikit-learn)
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing()
df = pd.DataFrame(housing.data, columns=housing.feature_names)
df['Price'] = housing.target * 100000  # convert to actual dollar values

# First look at the data
print(df.shape)       # (20640, 9) — 20640 houses, 9 columns
print(df.head())      # first 5 rows
df.info()             # column names, data types, nulls (info() prints directly)
print(df.describe())  # mean, min, max, std for every column
ℹ️ About the dataset: the California Housing dataset has 20,640 houses with features like median income, house age, average rooms per household, population and location (latitude and longitude). The target is median house value. It is a clean, well-known dataset perfect for learning regression.

Exploring the Data

Before building any model, you need to understand your data. Exploratory Data Analysis (EDA) answers three questions: What does the data look like? Are there any obvious problems? Which features seem most related to price?

Looking at Distributions

Python — exploring the price distribution and spotting issues
# Check the target variable (Price) distribution
print("Price stats:")
print(df['Price'].describe())
print(f"Min: ${df['Price'].min():,.0f}")
print(f"Max: ${df['Price'].max():,.0f}")
print(f"Mean: ${df['Price'].mean():,.0f}")

# Plot price distribution
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
df['Price'].hist(bins=50)
plt.title('Price Distribution (original)')

# Log transform reduces skew — many ML models work better with symmetric data
plt.subplot(1, 2, 2)
np.log1p(df['Price']).hist(bins=50)
plt.title('Log Price Distribution (more symmetric)')
plt.tight_layout()
plt.show()

# Check how many missing values each column has
print(df.isnull().sum())

Finding Correlations

A correlation tells you how strongly two variables move together. A correlation of 1 means they move perfectly together. A correlation of 0 means no relationship at all. A correlation of -1 means they move in opposite directions. Features with high correlation to Price are the most useful for prediction.
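Those three cases can be checked directly with NumPy's `corrcoef` on toy arrays:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Perfectly together: y is a straight-line function of x
print(np.corrcoef(x, 2 * x + 1)[0, 1])    # 1.0

# Perfectly opposite: y decreases as x increases
print(np.corrcoef(x, -3 * x + 10)[0, 1])  # -1.0

# No real relationship: arbitrary values unrelated to x
noise = np.array([0.3, -1.2, 0.8, -0.5, 0.1])
print(np.corrcoef(x, noise)[0, 1])        # close to 0
```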

Python — correlation matrix to find which features matter most
# Compute correlation of every feature with Price
corr_with_price = df.corr()['Price'].sort_values(ascending=False)
print(corr_with_price)
# MedInc (median income) is most correlated with price — makes sense
# Latitude and Longitude have moderate negative correlation

# Heatmap of full correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(), annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Feature Correlation Heatmap')
plt.show()

# Scatter plot of the most important feature vs Price
plt.figure(figsize=(8, 5))
plt.scatter(df['MedInc'], df['Price'], alpha=0.1, s=5)
plt.xlabel('Median Income')
plt.ylabel('House Price')
plt.title('Income vs Price')
plt.show()

Preprocessing — Cleaning the Data

Raw data is almost never clean. Before you can train a model you need to handle missing values, convert text categories into numbers and scale your numeric features. Skipping any of these steps leads to a model that either crashes or performs poorly.

Handling Missing Values

Most ML models cannot handle missing values — they will either crash or silently produce wrong answers. You have two options: remove the rows with missing values (if there are not many) or fill them in with a sensible estimate.
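Both options in miniature, on a hypothetical four-row frame (the column names and values are made up for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'sqft': [1500.0, np.nan, 2200.0, 1800.0],
                    'price': [300000, 250000, 410000, 340000]})

# Option 1: drop rows with missing values — fine when only a few rows are affected
dropped = toy.dropna()
print(len(dropped))  # 3 rows remain

# Option 2: fill with the column median — keeps every row
filled = toy.fillna({'sqft': toy['sqft'].median()})
print(filled['sqft'].tolist())  # [1500.0, 1800.0, 2200.0, 1800.0]
```

Dropping loses a quarter of this tiny dataset; filling keeps all rows at the cost of a small amount of invented data, which is usually the better trade.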

Python — filling missing values with SimpleImputer
from sklearn.impute import SimpleImputer

# Check what is missing
print(df.isnull().sum())

# For numeric columns — fill missing with the median of that column
# Median is better than mean when there are outliers
num_imputer = SimpleImputer(strategy='median')
numeric_cols = df.select_dtypes(include=['number']).columns.tolist()
numeric_cols.remove('Price')  # do not impute the target
df[numeric_cols] = num_imputer.fit_transform(df[numeric_cols])

# For categorical columns — fill missing with the most common value
cat_cols = df.select_dtypes(include=['object']).columns
if len(cat_cols) > 0:
    cat_imputer = SimpleImputer(strategy='most_frequent')
    df[cat_cols] = cat_imputer.fit_transform(df[cat_cols])

Encoding Categorical Features

Machine learning models only work with numbers. If you have a column like "neighbourhood" with values like "downtown", "suburbs" and "rural", you need to convert those text values into numbers. There are two common approaches.

Python — label encoding vs one-hot encoding
import pandas as pd

# Example: a house type column
df_example = pd.DataFrame({
    'house_type': ['detached', 'flat', 'terraced', 'detached']
})

# One-hot encoding — creates a separate column for each category
# Better when there is no order between categories (detached is not "bigger" than flat)
encoded = pd.get_dummies(df_example, columns=['house_type'], drop_first=True)
print(encoded)
#    house_type_flat  house_type_terraced
# 0                0                    0   (detached = 0 0)
# 1                1                    0   (flat     = 1 0)
# 2                0                    1   (terraced = 0 1)
# 3                0                    0   (detached = 0 0)

# Apply one-hot encoding to the main dataframe
df = pd.get_dummies(df, drop_first=True)
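The block above covers one-hot encoding; the other common approach is label encoding, which maps each category to a single integer. A short sketch with scikit-learn's `LabelEncoder`:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df_example = pd.DataFrame({'house_type': ['detached', 'flat', 'terraced', 'detached']})

# Label encoding — one integer per category, classes sorted alphabetically
le = LabelEncoder()
codes = le.fit_transform(df_example['house_type'])
print(list(codes))        # [0, 1, 2, 0]
print(list(le.classes_))  # ['detached', 'flat', 'terraced']

# Caution: these integers imply an order (detached < flat < terraced) that
# does not really exist, which can mislead linear models.
# Prefer one-hot encoding for unordered categories.
```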

Feature Scaling

Different features often have very different ranges. A house might have 3 bedrooms but 2000 square feet. If you feed these directly into a linear model, the large-valued features will dominate simply because they are bigger numbers, not because they are more important. Scaling puts all features on the same scale.

Python — scaling features to the same range
from sklearn.preprocessing import StandardScaler

# StandardScaler transforms each feature to have mean=0 and std=1
# Important for Linear Regression and any distance-based algorithms
# Tree-based models like Random Forest do not need scaling
features = [c for c in df.columns if c != 'Price']
X = df[features].values
y = df['Price'].values

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(f"Before scaling: mean={X[:,0].mean():.2f}, std={X[:,0].std():.2f}")
print(f"After scaling:  mean={X_scaled[:,0].mean():.2f}, std={X_scaled[:,0].std():.2f}")
⚠️ Fit the scaler only on training data. A common mistake is to scale the entire dataset before splitting. This leaks information from the test set into the scaler. Always split first, then fit the scaler on training data only and transform both sets.

Feature Engineering — Creating Better Inputs

Feature engineering means creating new columns from existing ones that the model can learn from more easily. Instead of giving the model raw numbers, you give it combinations and ratios that carry more meaning.

Python — creating new features from existing columns
# .clip(lower=1) below guards against division by zero

# Rooms per person — divides average rooms by population in the block
# A house with 10 rooms but 50 people is much more crowded than 10 rooms and 3 people
df['RoomsPerPerson'] = df['AveRooms'] / df['Population'].clip(lower=1)

# Bedrooms per room ratio — high ratio means fewer living spaces relative to bedrooms
df['BedroomRatio'] = df['AveBedrms'] / df['AveRooms'].clip(lower=1)

# Number of households in the block — population divided by average occupancy
df['HouseholdDensity'] = df['Population'] / df['AveOccup'].clip(lower=1)

# Income squared — captures non-linear relationship
# Doubling income often more than doubles house value
df['MedInc_squared'] = df['MedInc'] ** 2

# Distance from San Francisco bay area centre (approximate)
# Proximity to desirable areas affects price significantly
bay_lat, bay_lon = 37.77, -122.42
df['DistFromSF'] = np.sqrt(
    (df['Latitude'] - bay_lat) ** 2 + (df['Longitude'] - bay_lon) ** 2
)

print(f"Dataset now has {df.shape[1]} columns after feature engineering")

Train Test Split — Keeping an Honest Score

You need two separate chunks of your data. The training set is what the model learns from. The test set is a chunk the model never sees during training — you use it at the end to get an honest measurement of how well the model performs on data it has never encountered.

If you evaluated the model on the same data it trained on, it would look great — it already memorised those examples. The test set simulates real-world performance.
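To see why evaluating on training data is misleading, here is a small demonstration on synthetic data using an unconstrained decision tree (not one of the models used later) that memorises its training examples:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (300, 1))
y = X.ravel() + rng.normal(0, 2.0, 300)  # linear signal plus heavy noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# An unconstrained tree memorises the training data...
tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
print(f"Train R²: {tree.score(X_tr, y_tr):.3f}")  # 1.000 — looks perfect
print(f"Test  R²: {tree.score(X_te, y_te):.3f}")  # much lower — the honest score
```

The training score says the model is perfect; only the held-out test score reveals how much of that was memorised noise.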

Python — splitting into train and test sets
from sklearn.model_selection import train_test_split

feature_cols = [c for c in df.columns if c != 'Price']
X = df[feature_cols]
y = df['Price']

# 80% training, 20% testing
# random_state=42 ensures you get the same split every time you run the code
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training set: {X_train.shape[0]} houses")
print(f"Test set: {X_test.shape[0]} houses")

# Now scale AFTER splitting — fit only on training data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit + transform train
X_test_scaled = scaler.transform(X_test)        # only transform test — no fit!

Linear Regression — The Simple Baseline

Linear Regression assumes the price is a straight-line combination of all the features. It multiplies each feature by a learned weight and adds them up to produce a prediction. It is simple, fast and easy to interpret. It is always a good idea to start with it as a baseline before trying more complex models.

Python — training and evaluating Linear Regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Train
lr = LinearRegression()
lr.fit(X_train_scaled, y_train)

# Predict on test set
lr_preds = lr.predict(X_test_scaled)

# Evaluate
lr_mae = mean_absolute_error(y_test, lr_preds)
lr_rmse = np.sqrt(mean_squared_error(y_test, lr_preds))  # sqrt of MSE; the old squared=False flag was removed in newer scikit-learn
lr_r2 = r2_score(y_test, lr_preds)

print("Linear Regression Results:")
print(f"  MAE:  ${lr_mae:,.0f}")
print(f"  RMSE: ${lr_rmse:,.0f}")
print(f"  R²:   {lr_r2:.3f}")
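To make the "weights times features" idea concrete, this self-contained check (toy synthetic data, not the housing set) verifies that a LinearRegression prediction is literally a dot product plus an intercept:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Noiseless toy data with known weights: y = 4*x0 - 2*x1 + 0.5*x2 + 7
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = 4.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * X[:, 2] + 7.0

lr = LinearRegression().fit(X, y)
print(np.round(lr.coef_, 2))    # recovers [ 4.  -2.   0.5], one weight per feature
print(round(lr.intercept_, 2))  # recovers 7.0

# A prediction is just weights · features + intercept
x_new = X[0]
manual = float(np.dot(lr.coef_, x_new) + lr.intercept_)
print(np.isclose(manual, lr.predict(x_new.reshape(1, -1))[0]))  # True
```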

Random Forest — A More Powerful Model

A Random Forest is a collection of many Decision Trees that each vote on the answer. Each tree is trained on a slightly different random sample of the data and uses a random subset of features at each split. The final prediction is the average of all the trees' predictions.

This approach — many imperfect models voting together — is called ensemble learning. It works much better than any single tree because the mistakes of individual trees cancel each other out.
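You can see the averaging directly: a fitted `RandomForestRegressor` exposes its individual trees via `estimators_`, and the forest's prediction equals the mean of the trees' predictions. A small sketch on synthetic data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (400, 1))
y = np.sin(X.ravel()) * 3 + rng.normal(0, 0.5, 400)  # noisy non-linear signal

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Ask every tree individually, then compare with the forest
x_new = np.array([[5.0]])
tree_preds = [tree.predict(x_new)[0] for tree in rf.estimators_]
print(f"Individual trees range: {min(tree_preds):.2f} to {max(tree_preds):.2f}")
print(f"Forest (mean of 50 trees): {np.mean(tree_preds):.2f}")
print(np.isclose(np.mean(tree_preds), rf.predict(x_new)[0]))  # True
```

The individual trees disagree with each other; the averaged answer is more stable than any single one.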

Python — training a Random Forest with manually chosen hyperparameters
from sklearn.ensemble import RandomForestRegressor

# Train a Random Forest
# n_estimators = number of trees (more trees = better but slower)
# max_depth = how deep each tree can grow (prevents overfitting)
# n_jobs=-1 = use all CPU cores for faster training
rf = RandomForestRegressor(
    n_estimators=200,
    max_depth=15,
    min_samples_split=5,
    random_state=42,
    n_jobs=-1
)

# Note: Random Forest does not need scaled data
rf.fit(X_train, y_train)
rf_preds = rf.predict(X_test)

rf_mae = mean_absolute_error(y_test, rf_preds)
rf_rmse = np.sqrt(mean_squared_error(y_test, rf_preds))  # sqrt of MSE; the old squared=False flag was removed in newer scikit-learn
rf_r2 = r2_score(y_test, rf_preds)

print("Random Forest Results:")
print(f"  MAE:  ${rf_mae:,.0f}")
print(f"  RMSE: ${rf_rmse:,.0f}")
print(f"  R²:   {rf_r2:.3f}")

Evaluation Metrics — How Do You Know If It Is Good

There are three main metrics for regression problems. Each one tells you something slightly different.

Metric | What It Measures | Ideal Value | Units
MAE | Mean Absolute Error — average prediction error | As low as possible | Same as target (dollars)
RMSE | Root Mean Squared Error — penalises large errors more | As low as possible | Same as target (dollars)
R² | How much of the variance the model explains | Closer to 1 | 0 to 1 (unitless)
ℹ️ Interpreting R²: an R² of 0.85 means the model explains 85% of the variation in house prices. The remaining 15% is noise the model cannot capture — location-specific quirks, renovations, negotiation etc. For real estate, R² above 0.80 is generally considered good.
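All three metrics are simple enough to compute by hand. This sketch uses four hypothetical predictions and checks the hand computation against scikit-learn:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Four made-up houses: actual sale prices vs model predictions
actual = np.array([300_000.0, 250_000.0, 410_000.0, 340_000.0])
predicted = np.array([310_000.0, 240_000.0, 400_000.0, 360_000.0])

errors = actual - predicted              # [-10000, 10000, 10000, -20000]
mae = np.mean(np.abs(errors))            # 12,500 — average miss in dollars
rmse = np.sqrt(np.mean(errors ** 2))     # ≈ 13,229 — the big $20k miss weighs more
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2 = 1 - ss_res / ss_tot                 # share of price variance explained

print(f"MAE:  ${mae:,.0f}")
print(f"RMSE: ${rmse:,.0f}")
print(f"R²:   {r2:.3f}")

# Same numbers from scikit-learn
print(np.isclose(mae, mean_absolute_error(actual, predicted)))        # True
print(np.isclose(rmse, np.sqrt(mean_squared_error(actual, predicted))))  # True
print(np.isclose(r2, r2_score(actual, predicted)))                    # True
```

Note that RMSE exceeds MAE here precisely because of the single large $20,000 error; on a model with uniform errors the two would be equal.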
Python — plotting actual vs predicted prices
# Actual vs Predicted scatter plot — perfect model = all dots on the diagonal
plt.figure(figsize=(8, 6))
plt.scatter(y_test, rf_preds, alpha=0.2, s=10)

# Perfect prediction line
min_val = min(y_test.min(), rf_preds.min())
max_val = max(y_test.max(), rf_preds.max())
plt.plot([min_val, max_val], [min_val, max_val], 'r--', lw=1.5)
plt.xlabel('Actual Price')
plt.ylabel('Predicted Price')
plt.title(f'Random Forest: Actual vs Predicted (R² = {rf_r2:.3f})')
plt.show()

# Residuals plot — good model has residuals randomly scattered around zero
residuals = y_test - rf_preds
plt.figure(figsize=(8, 4))
plt.scatter(rf_preds, residuals, alpha=0.2, s=5)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Predicted Price')
plt.ylabel('Residual (Actual - Predicted)')
plt.title('Residuals Plot')
plt.show()

Feature Importance — What the Model Learned

One of the most valuable things about Random Forest is that it tells you which features mattered most. This insight is useful both for improving the model and for understanding the business problem — which characteristics of a house most strongly drive its price.

Python — plotting feature importance from the Random Forest
# Get feature importances from the trained Random Forest
importance_df = pd.DataFrame({
    'Feature': feature_cols,
    'Importance': rf.feature_importances_
}).sort_values('Importance', ascending=False)

print(importance_df.head(10))

# Plot top 10 most important features
plt.figure(figsize=(10, 6))
top10 = importance_df.head(10)
plt.barh(top10['Feature'], top10['Importance'])
plt.title('Top 10 Most Important Features')
plt.xlabel('Importance Score')
plt.gca().invert_yaxis()
plt.show()

# Expected result: MedInc (median income) and location features
# will likely top the list — income is the strongest predictor of home price

Predicting Prices for New Houses

Once the model is trained, using it is straightforward. You build a dictionary with the house's features, convert it to a DataFrame so the column order matches, and pass it to predict().

Python — predicting the price of a new unseen house
def predict_house_price(model, house_features, feature_cols):
    """Predict price for one house given its features as a dictionary."""
    input_df = pd.DataFrame([house_features])

    # Add engineered features (must mirror the training-time feature engineering)
    input_df['RoomsPerPerson'] = input_df['AveRooms'] / input_df['Population'].clip(lower=1)
    input_df['BedroomRatio'] = input_df['AveBedrms'] / input_df['AveRooms'].clip(lower=1)
    input_df['HouseholdDensity'] = input_df['Population'] / input_df['AveOccup'].clip(lower=1)
    input_df['MedInc_squared'] = input_df['MedInc'] ** 2
    bay_lat, bay_lon = 37.77, -122.42
    input_df['DistFromSF'] = np.sqrt(
        (input_df['Latitude'] - bay_lat) ** 2 + (input_df['Longitude'] - bay_lon) ** 2
    )

    input_df = input_df[feature_cols]  # ensure column order matches training
    price = model.predict(input_df)[0]
    return price

# Example: a house in a high-income area near San Francisco
new_house = {
    'MedInc': 8.5,        # high median income area
    'HouseAge': 15,       # relatively new
    'AveRooms': 6.5,
    'AveBedrms': 1.1,
    'Population': 1200,
    'AveOccup': 2.8,
    'Latitude': 37.78,    # very close to SF
    'Longitude': -122.43
}

predicted_price = predict_house_price(rf, new_house, feature_cols)
print(f"Predicted house price: ${predicted_price:,.0f}")

Model Comparison

Model | MAE | RMSE | R² | Needs Scaling | Interpretable
Linear Regression | ~$52,000 | ~$72,000 | ~0.64 | Yes | Very
Random Forest | ~$34,000 | ~$50,000 | ~0.82 | No | Moderate
Next steps: to push performance further, try XGBoost or LightGBM (even more powerful ensemble methods), use cross-validation instead of a single train/test split, and use GridSearchCV or RandomizedSearchCV to find the best hyperparameters systematically.
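As a starting point for those next steps, here is a hedged sketch of cross-validation and grid search on stand-in data (the toy features and the parameter grid are purely illustrative, not tuned recommendations):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.ensemble import RandomForestRegressor

# Toy data standing in for X_train / y_train (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X[:, 0] * 3 + X[:, 1] ** 2 + rng.normal(0, 0.5, 300)

# Cross-validation: score on 5 different train/validation splits instead of one
rf = RandomForestRegressor(n_estimators=50, random_state=0)
scores = cross_val_score(rf, X, y, cv=5, scoring='r2')
print(f"CV R²: {scores.mean():.3f} ± {scores.std():.3f}")

# GridSearchCV tries every parameter combination and keeps the best by CV score
grid = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={'n_estimators': [50, 100], 'max_depth': [5, None]},
    cv=3, scoring='r2', n_jobs=-1,
)
grid.fit(X, y)
print("Best params:", grid.best_params_)
print(f"Best CV R²: {grid.best_score_:.3f}")
```

The same pattern applies to the housing model: pass `X_train`, `y_train` and a grid over `n_estimators`, `max_depth` and `min_samples_split`, then evaluate `grid.best_estimator_` on the held-out test set exactly once.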

⚡ Key Takeaways
  • House price prediction is a supervised regression problem — you have labelled examples (houses with known prices) and the output is a continuous number.
  • Every ML project follows the same pipeline: load, explore, preprocess, feature engineer, split, train, evaluate, predict. Do not skip steps — each one directly affects model quality.
  • EDA (Exploratory Data Analysis) is not optional. Correlation matrices reveal which features matter most. Distribution plots reveal skew and outliers that need handling before training.
  • Always fill missing values before training. Use median for numeric columns (robust to outliers) and most_frequent for categorical columns.
  • Fit the scaler only on training data, then transform both sets. Fitting on the whole dataset leaks test information and gives an over-optimistic accuracy score.
  • Feature engineering — creating ratios, squared values and derived features — can significantly improve accuracy without adding more data. RoomsPerPerson is more informative than AveRooms alone.
  • Start with Linear Regression as a baseline. It gives you a reference point and reveals whether your features have any linear relationship with the target at all.
  • Random Forest handles non-linear relationships, interactions between features and outliers far better than linear regression. It almost always beats linear regression on real-world datasets.
  • Use all three metrics: MAE (average error in dollars), RMSE (punishes big errors more), and R² (percentage of variance explained). One metric alone does not give the full picture.
  • Feature importance from Random Forest shows which input features actually drove the predictions. In real estate, median income and location typically top the list.