Predicting house prices is one of the best projects to learn machine learning from scratch. It covers every stage of a real ML workflow — exploring data, cleaning it, engineering useful features, training models and measuring how well they work.

By the end of this guide you will have a working model that takes inputs like square footage, number of bedrooms, neighbourhood and year built, and produces a predicted sale price. Along the way you will understand exactly why every step matters and what happens if you skip it.


The Problem We Are Solving

A real estate company wants to estimate the market value of a house before listing it. Instead of hiring an appraiser for every property, they want a model that can take the house's characteristics and instantly output a predicted price.

This is a supervised regression problem. Supervised because we have labelled training data — houses with known sale prices. Regression because the output is a continuous number (price) rather than a category.

The inputs (also called features) are things we know about the house: size in square feet, number of bedrooms and bathrooms, the neighbourhood, whether it has a garage, the year it was built and so on. The model learns the relationship between these features and the sale price from historical data, then uses that pattern to estimate prices for new houses it has never seen.

The ML Pipeline — Every Step in Order

Every machine learning project follows the same general sequence. Understanding this sequence first makes all the individual steps easier to follow.

1. Load the data: read the dataset into a pandas DataFrame and take a first look.
2. Explore the data (EDA): look at distributions, check for outliers, understand relationships between features.
3. Preprocess: fill missing values, encode categorical columns, scale numeric features.
4. Feature engineering: create new, more informative features from existing ones.
5. Split the data: separate into a training set (the model learns from this) and a test set (the model is evaluated on this).
6. Train models: fit a Linear Regression and a Random Forest on the training data.
7. Evaluate: measure predictions on the test set using MAE, RMSE and R² score.
8. Predict new houses: use the trained model to estimate prices for houses it has never seen.
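Before working through the real dataset, the whole eight-step sequence can be run in miniature. This sketch uses synthetic houses with made-up column names and values, purely for illustration:

```python
# Toy run of the full pipeline on synthetic data (illustrative only,
# not the California Housing workflow that follows).
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# 1-2. "Load" and explore: fake houses where size and bedrooms drive price
df_toy = pd.DataFrame({
    'sqft': rng.uniform(500, 3000, 500),
    'bedrooms': rng.integers(1, 6, 500),
})
df_toy['price'] = 100 * df_toy['sqft'] + 5000 * df_toy['bedrooms'] + rng.normal(0, 10000, 500)

# 3-4. Preprocess / feature engineer: add a ratio feature
df_toy['sqft_per_bedroom'] = df_toy['sqft'] / df_toy['bedrooms']

# 5. Split before scaling
X = df_toy[['sqft', 'bedrooms', 'sqft_per_bedroom']]
y = df_toy['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 6. Scale (fit on training data only) and train
scaler = StandardScaler()
model = LinearRegression().fit(scaler.fit_transform(X_train), y_train)

# 7. Evaluate on the held-out test set
preds = model.predict(scaler.transform(X_test))
print(f"Toy R²: {r2_score(y_test, preds):.3f}")  # close to 1, the data is nearly linear

# 8. Predict one new house
new = scaler.transform(pd.DataFrame([{'sqft': 1800, 'bedrooms': 3, 'sqft_per_bedroom': 600.0}]))
print(f"Toy prediction: ${model.predict(new)[0]:,.0f}")
```

Every section below expands one of these steps on the real dataset.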

Setup and Loading the Data

Python — install packages and load the dataset
# Install if needed
# pip install pandas numpy scikit-learn matplotlib seaborn

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.impute import SimpleImputer

# Load the California Housing dataset (built into scikit-learn)
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing()
df = pd.DataFrame(housing.data, columns=housing.feature_names)
df['Price'] = housing.target * 100000  # convert to actual dollar values

# First look at the data
print(df.shape)       # (20640, 9) — 20640 houses, 9 columns
print(df.head())      # first 5 rows
df.info()             # column names, data types, nulls (info() prints directly)
print(df.describe())  # mean, min, max, std for every column
ℹ️ About the dataset: the California Housing dataset has 20,640 houses with features like median income, house age, average rooms per household, population and location (latitude and longitude). The target is median house value. It is a clean, well-known dataset perfect for learning regression.

Exploring the Data

Before building any model, you need to understand your data. Exploratory Data Analysis (EDA) answers three questions: What does the data look like? Are there any obvious problems? Which features seem most related to price?

Looking at Distributions

Python — exploring the price distribution and spotting issues
# Check the target variable (Price) distribution
print("Price stats:")
print(df['Price'].describe())
print(f"Min: ${df['Price'].min():,.0f}")
print(f"Max: ${df['Price'].max():,.0f}")
print(f"Mean: ${df['Price'].mean():,.0f}")

# Plot price distribution
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
df['Price'].hist(bins=50)
plt.title('Price Distribution (original)')

# Log transform reduces skew — many ML models work better with symmetric data
plt.subplot(1, 2, 2)
np.log1p(df['Price']).hist(bins=50)
plt.title('Log Price Distribution (more symmetric)')
plt.tight_layout()
plt.show()

# Check how many missing values each column has
print(df.isnull().sum())

Finding Correlations

A correlation tells you how strongly two variables move together. A correlation of 1 means they move perfectly together. A correlation of 0 means no relationship at all. A correlation of -1 means they move in opposite directions. Features with high correlation to Price are the most useful for prediction.
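Those three cases can be checked directly with NumPy's `corrcoef` on toy arrays:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Perfectly together: y is a straight-line function of x
print(np.corrcoef(x, 2 * x + 1)[0, 1])    # 1.0

# Perfectly opposite: y decreases as x increases
print(np.corrcoef(x, -3 * x + 10)[0, 1])  # -1.0

# No real relationship: arbitrary values unrelated to x
noise = np.array([0.3, -1.2, 0.8, -0.5, 0.1])
print(np.corrcoef(x, noise)[0, 1])        # close to 0
```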

Python — correlation matrix to find which features matter most
# Compute correlation of every feature with Price
corr_with_price = df.corr()['Price'].sort_values(ascending=False)
print(corr_with_price)
# MedInc (median income) is most correlated with price — makes sense
# Latitude and Longitude have moderate negative correlation

# Heatmap of full correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(), annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Feature Correlation Heatmap')
plt.show()

# Scatter plot of the most important feature vs Price
plt.figure(figsize=(8, 5))
plt.scatter(df['MedInc'], df['Price'], alpha=0.1, s=5)
plt.xlabel('Median Income')
plt.ylabel('House Price')
plt.title('Income vs Price')
plt.show()

Preprocessing — Cleaning the Data

Raw data is almost never clean. Before you can train a model you need to handle missing values, convert text categories into numbers and scale your numeric features. Skipping any of these steps leads to a model that either crashes or performs poorly.

Handling Missing Values

Most ML models cannot handle missing values — they will either crash or silently produce wrong answers. You have two options: remove the rows with missing values (if there are not many) or fill them in with a sensible estimate.
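Both options in miniature, on a hypothetical four-row frame (the column names and values are made up for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'sqft': [1500.0, np.nan, 2200.0, 1800.0],
                    'price': [300000, 250000, 410000, 340000]})

# Option 1: drop rows with missing values — fine when only a few rows are affected
dropped = toy.dropna()
print(len(dropped))  # 3 rows remain

# Option 2: fill with the column median — keeps every row
filled = toy.fillna({'sqft': toy['sqft'].median()})
print(filled['sqft'].tolist())  # [1500.0, 1800.0, 2200.0, 1800.0]
```

Dropping loses a quarter of this tiny dataset; filling keeps all rows at the cost of a small amount of invented data, which is usually the better trade.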

Python — filling missing values with SimpleImputer
from sklearn.impute import SimpleImputer

# Check what is missing
print(df.isnull().sum())

# For numeric columns — fill missing with the median of that column
# Median is better than mean when there are outliers
num_imputer = SimpleImputer(strategy='median')
numeric_cols = df.select_dtypes(include=['number']).columns.tolist()
numeric_cols.remove('Price')  # do not impute the target
df[numeric_cols] = num_imputer.fit_transform(df[numeric_cols])

# For categorical columns — fill missing with the most common value
cat_cols = df.select_dtypes(include=['object']).columns
if len(cat_cols) > 0:
    cat_imputer = SimpleImputer(strategy='most_frequent')
    df[cat_cols] = cat_imputer.fit_transform(df[cat_cols])

Encoding Categorical Features

Machine learning models only work with numbers. If you have a column like "neighbourhood" with values like "downtown", "suburbs" and "rural", you need to convert those text values into numbers. There are two common approaches.

Python — label encoding vs one-hot encoding
import pandas as pd

# Example: a house type column
df_example = pd.DataFrame({
    'house_type': ['detached', 'flat', 'terraced', 'detached']
})

# One-hot encoding — creates a separate column for each category
# Better when there is no order between categories (detached is not "bigger" than flat)
encoded = pd.get_dummies(df_example, columns=['house_type'], drop_first=True)
print(encoded)
#    house_type_flat  house_type_terraced
# 0                0                    0   (detached = 0 0)
# 1                1                    0   (flat     = 1 0)
# 2                0                    1   (terraced = 0 1)
# 3                0                    0   (detached = 0 0)

# Apply one-hot encoding to the main dataframe
df = pd.get_dummies(df, drop_first=True)
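The block above covers one-hot encoding; the other common approach is label encoding, which maps each category to a single integer. A short sketch with scikit-learn's `LabelEncoder`:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df_example = pd.DataFrame({'house_type': ['detached', 'flat', 'terraced', 'detached']})

# Label encoding — one integer per category, classes sorted alphabetically
le = LabelEncoder()
codes = le.fit_transform(df_example['house_type'])
print(list(codes))        # [0, 1, 2, 0]
print(list(le.classes_))  # ['detached', 'flat', 'terraced']

# Caution: these integers imply an order (detached < flat < terraced) that
# does not really exist, which can mislead linear models.
# Prefer one-hot encoding for unordered categories.
```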

Feature Scaling

Different features often have very different ranges. A house might have 3 bedrooms but 2000 square feet. If you feed these directly into a linear model, the large-valued features will dominate simply because they are bigger numbers, not because they are more important. Scaling puts all features on the same scale.

Python — scaling features to the same range
from sklearn.preprocessing import StandardScaler

# StandardScaler transforms each feature to have mean=0 and std=1
# Important for Linear Regression and any distance-based algorithms
# Tree-based models like Random Forest do not need scaling
features = [c for c in df.columns if c != 'Price']
X = df[features].values
y = df['Price'].values

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(f"Before scaling: mean={X[:,0].mean():.2f}, std={X[:,0].std():.2f}")
print(f"After scaling:  mean={X_scaled[:,0].mean():.2f}, std={X_scaled[:,0].std():.2f}")
⚠️ Fit the scaler only on training data. A common mistake is to scale the entire dataset before splitting. This leaks information from the test set into the scaler. Always split first, then fit the scaler on training data only and transform both sets.

Feature Engineering — Creating Better Inputs

Feature engineering means creating new columns from existing ones that the model can learn from more easily. Instead of giving the model raw numbers, you give it combinations and ratios that carry more meaning.

Python — creating new features from existing columns
# .clip(lower=1) below guards against division by zero

# Rooms per person — divides average rooms by population in the block
# A house with 10 rooms but 50 people is much more crowded than 10 rooms and 3 people
df['RoomsPerPerson'] = df['AveRooms'] / df['Population'].clip(lower=1)

# Bedrooms per room ratio — high ratio means fewer living spaces relative to bedrooms
df['BedroomRatio'] = df['AveBedrms'] / df['AveRooms'].clip(lower=1)

# Number of households in the block — population divided by average occupancy
df['HouseholdDensity'] = df['Population'] / df['AveOccup'].clip(lower=1)

# Income squared — captures non-linear relationship
# Doubling income often more than doubles house value
df['MedInc_squared'] = df['MedInc'] ** 2

# Distance from San Francisco bay area centre (approximate)
# Proximity to desirable areas affects price significantly
bay_lat, bay_lon = 37.77, -122.42
df['DistFromSF'] = np.sqrt(
    (df['Latitude'] - bay_lat) ** 2 + (df['Longitude'] - bay_lon) ** 2
)

print(f"Dataset now has {df.shape[1]} columns after feature engineering")

Train Test Split — Keeping an Honest Score

You need two separate chunks of your data. The training set is what the model learns from. The test set is a chunk the model never sees during training — you use it at the end to get an honest measurement of how well the model performs on data it has never encountered.

If you evaluated the model on the same data it trained on, it would look great — it already memorised those examples. The test set simulates real-world performance.
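To see why evaluating on training data is misleading, here is a small demonstration on synthetic data using an unconstrained decision tree (not one of the models used later) that memorises its training examples:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (300, 1))
y = X.ravel() + rng.normal(0, 2.0, 300)  # linear signal plus heavy noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# An unconstrained tree memorises the training data...
tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
print(f"Train R²: {tree.score(X_tr, y_tr):.3f}")  # 1.000 — looks perfect
print(f"Test  R²: {tree.score(X_te, y_te):.3f}")  # much lower — the honest score
```

The training score says the model is perfect; only the held-out test score reveals how much of that was memorised noise.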

Python — splitting into train and test sets
from sklearn.model_selection import train_test_split

feature_cols = [c for c in df.columns if c != 'Price']
X = df[feature_cols]
y = df['Price']

# 80% training, 20% testing
# random_state=42 ensures you get the same split every time you run the code
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training set: {X_train.shape[0]} houses")
print(f"Test set: {X_test.shape[0]} houses")

# Now scale AFTER splitting — fit only on training data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit + transform train
X_test_scaled = scaler.transform(X_test)        # only transform test — no fit!

Linear Regression — The Simple Baseline

Linear Regression assumes the price is a straight-line combination of all the features. It multiplies each feature by a learned weight and adds them up to produce a prediction. It is simple, fast and easy to interpret. It is always a good idea to start with it as a baseline before trying more complex models.

Python — training and evaluating Linear Regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Train
lr = LinearRegression()
lr.fit(X_train_scaled, y_train)

# Predict on test set
lr_preds = lr.predict(X_test_scaled)

# Evaluate
lr_mae = mean_absolute_error(y_test, lr_preds)
lr_rmse = np.sqrt(mean_squared_error(y_test, lr_preds))  # sqrt of MSE; the old squared=False flag was removed in newer scikit-learn
lr_r2 = r2_score(y_test, lr_preds)

print("Linear Regression Results:")
print(f"  MAE:  ${lr_mae:,.0f}")
print(f"  RMSE: ${lr_rmse:,.0f}")
print(f"  R²:   {lr_r2:.3f}")
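To make the "weights times features" idea concrete, this self-contained check (toy synthetic data, not the housing set) verifies that a LinearRegression prediction is literally a dot product plus an intercept:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Noiseless toy data with known weights: y = 4*x0 - 2*x1 + 0.5*x2 + 7
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = 4.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * X[:, 2] + 7.0

lr = LinearRegression().fit(X, y)
print(np.round(lr.coef_, 2))    # recovers [ 4.  -2.   0.5], one weight per feature
print(round(lr.intercept_, 2))  # recovers 7.0

# A prediction is just weights · features + intercept
x_new = X[0]
manual = float(np.dot(lr.coef_, x_new) + lr.intercept_)
print(np.isclose(manual, lr.predict(x_new.reshape(1, -1))[0]))  # True
```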

Random Forest — A More Powerful Model

A Random Forest is a collection of many Decision Trees that each vote on the answer. Each tree is trained on a slightly different random sample of the data and uses a random subset of features at each split. The final prediction is the average of all the trees' predictions.

This approach — many imperfect models voting together — is called ensemble learning. It works much better than any single tree because the mistakes of individual trees cancel each other out.
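You can see the averaging directly: a fitted `RandomForestRegressor` exposes its individual trees via `estimators_`, and the forest's prediction equals the mean of the trees' predictions. A small sketch on synthetic data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (400, 1))
y = np.sin(X.ravel()) * 3 + rng.normal(0, 0.5, 400)  # noisy non-linear signal

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Ask every tree individually, then compare with the forest
x_new = np.array([[5.0]])
tree_preds = [tree.predict(x_new)[0] for tree in rf.estimators_]
print(f"Individual trees range: {min(tree_preds):.2f} to {max(tree_preds):.2f}")
print(f"Forest (mean of 50 trees): {np.mean(tree_preds):.2f}")
print(np.isclose(np.mean(tree_preds), rf.predict(x_new)[0]))  # True
```

The individual trees disagree with each other; the averaged answer is more stable than any single one.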

Python — training a Random Forest with manually chosen hyperparameters
from sklearn.ensemble import RandomForestRegressor

# Train a Random Forest
# n_estimators = number of trees (more trees = better but slower)
# max_depth = how deep each tree can grow (prevents overfitting)
# n_jobs=-1 = use all CPU cores for faster training
rf = RandomForestRegressor(
    n_estimators=200,
    max_depth=15,
    min_samples_split=5,
    random_state=42,
    n_jobs=-1
)

# Note: Random Forest does not need scaled data
rf.fit(X_train, y_train)
rf_preds = rf.predict(X_test)

rf_mae = mean_absolute_error(y_test, rf_preds)
rf_rmse = np.sqrt(mean_squared_error(y_test, rf_preds))  # sqrt of MSE; the old squared=False flag was removed in newer scikit-learn
rf_r2 = r2_score(y_test, rf_preds)

print("Random Forest Results:")
print(f"  MAE:  ${rf_mae:,.0f}")
print(f"  RMSE: ${rf_rmse:,.0f}")
print(f"  R²:   {rf_r2:.3f}")

Evaluation Metrics — How Do You Know If It Is Good

There are three main metrics for regression problems. Each one tells you something slightly different.

Metric | What It Measures | Ideal Value | Units
MAE | Mean Absolute Error — average prediction error | As low as possible | Same as target (dollars)
RMSE | Root Mean Squared Error — penalises large errors more | As low as possible | Same as target (dollars)
R² | How much of the variance the model explains | Closer to 1 | 0 to 1 (unitless)
ℹ️ Interpreting R²: an R² of 0.85 means the model explains 85% of the variation in house prices. The remaining 15% is noise the model cannot capture — location-specific quirks, renovations, negotiation etc. For real estate, R² above 0.80 is generally considered good.
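All three metrics are simple enough to compute by hand. This sketch uses four hypothetical predictions and checks the hand computation against scikit-learn:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Four made-up houses: actual sale prices vs model predictions
actual = np.array([300_000.0, 250_000.0, 410_000.0, 340_000.0])
predicted = np.array([310_000.0, 240_000.0, 400_000.0, 360_000.0])

errors = actual - predicted              # [-10000, 10000, 10000, -20000]
mae = np.mean(np.abs(errors))            # 12,500 — average miss in dollars
rmse = np.sqrt(np.mean(errors ** 2))     # ≈ 13,229 — the big $20k miss weighs more
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2 = 1 - ss_res / ss_tot                 # share of price variance explained

print(f"MAE:  ${mae:,.0f}")
print(f"RMSE: ${rmse:,.0f}")
print(f"R²:   {r2:.3f}")

# Same numbers from scikit-learn
print(np.isclose(mae, mean_absolute_error(actual, predicted)))        # True
print(np.isclose(rmse, np.sqrt(mean_squared_error(actual, predicted))))  # True
print(np.isclose(r2, r2_score(actual, predicted)))                    # True
```

Note that RMSE exceeds MAE here precisely because of the single large $20,000 error; on a model with uniform errors the two would be equal.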
Python — plotting actual vs predicted prices
# Actual vs Predicted scatter plot — perfect model = all dots on the diagonal
plt.figure(figsize=(8, 6))
plt.scatter(y_test, rf_preds, alpha=0.2, s=10)

# Perfect prediction line
min_val = min(y_test.min(), rf_preds.min())
max_val = max(y_test.max(), rf_preds.max())
plt.plot([min_val, max_val], [min_val, max_val], 'r--', lw=1.5)
plt.xlabel('Actual Price')
plt.ylabel('Predicted Price')
plt.title(f'Random Forest: Actual vs Predicted (R² = {rf_r2:.3f})')
plt.show()

# Residuals plot — good model has residuals randomly scattered around zero
residuals = y_test - rf_preds
plt.figure(figsize=(8, 4))
plt.scatter(rf_preds, residuals, alpha=0.2, s=5)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Predicted Price')
plt.ylabel('Residual (Actual - Predicted)')
plt.title('Residuals Plot')
plt.show()

Feature Importance — What the Model Learned

One of the most valuable things about Random Forest is that it tells you which features mattered most. This insight is useful both for improving the model and for understanding the business problem — which characteristics of a house most strongly drive its price.

Python — plotting feature importance from the Random Forest
# Get feature importances from the trained Random Forest
importance_df = pd.DataFrame({
    'Feature': feature_cols,
    'Importance': rf.feature_importances_
}).sort_values('Importance', ascending=False)

print(importance_df.head(10))

# Plot top 10 most important features
plt.figure(figsize=(10, 6))
top10 = importance_df.head(10)
plt.barh(top10['Feature'], top10['Importance'])
plt.title('Top 10 Most Important Features')
plt.xlabel('Importance Score')
plt.gca().invert_yaxis()
plt.show()

# Expected result: MedInc (median income) and location features
# will likely top the list — income is the strongest predictor of home price

Predicting Prices for New Houses

Once the model is trained, using it is straightforward. You build a dictionary with the house's features, convert it to a DataFrame so the column order matches, and pass it to predict().

Python — predicting the price of a new unseen house
def predict_house_price(model, house_features, feature_cols):
    """Predict price for one house given its features as a dictionary."""
    input_df = pd.DataFrame([house_features])

    # Add engineered features (must mirror the training-time feature engineering)
    input_df['RoomsPerPerson'] = input_df['AveRooms'] / input_df['Population'].clip(lower=1)
    input_df['BedroomRatio'] = input_df['AveBedrms'] / input_df['AveRooms'].clip(lower=1)
    input_df['HouseholdDensity'] = input_df['Population'] / input_df['AveOccup'].clip(lower=1)
    input_df['MedInc_squared'] = input_df['MedInc'] ** 2
    bay_lat, bay_lon = 37.77, -122.42
    input_df['DistFromSF'] = np.sqrt(
        (input_df['Latitude'] - bay_lat) ** 2 + (input_df['Longitude'] - bay_lon) ** 2
    )

    input_df = input_df[feature_cols]  # ensure column order matches training
    price = model.predict(input_df)[0]
    return price

# Example: a house in a high-income area near San Francisco
new_house = {
    'MedInc': 8.5,        # high median income area
    'HouseAge': 15,       # relatively new
    'AveRooms': 6.5,
    'AveBedrms': 1.1,
    'Population': 1200,
    'AveOccup': 2.8,
    'Latitude': 37.78,    # very close to SF
    'Longitude': -122.43
}

predicted_price = predict_house_price(rf, new_house, feature_cols)
print(f"Predicted house price: ${predicted_price:,.0f}")

Model Comparison

Model | MAE | RMSE | R² | Needs Scaling | Interpretable
Linear Regression | ~$52,000 | ~$72,000 | ~0.64 | Yes | Very
Random Forest | ~$34,000 | ~$50,000 | ~0.82 | No | Moderate
Next steps: to push performance further, try XGBoost or LightGBM (even more powerful ensemble methods), use cross-validation instead of a single train/test split, and use GridSearchCV or RandomizedSearchCV to find the best hyperparameters systematically.
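As a starting point for those next steps, here is a hedged sketch of cross-validation and grid search on stand-in data (the toy features and the parameter grid are purely illustrative, not tuned recommendations):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.ensemble import RandomForestRegressor

# Toy data standing in for X_train / y_train (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X[:, 0] * 3 + X[:, 1] ** 2 + rng.normal(0, 0.5, 300)

# Cross-validation: score on 5 different train/validation splits instead of one
rf = RandomForestRegressor(n_estimators=50, random_state=0)
scores = cross_val_score(rf, X, y, cv=5, scoring='r2')
print(f"CV R²: {scores.mean():.3f} ± {scores.std():.3f}")

# GridSearchCV tries every parameter combination and keeps the best by CV score
grid = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={'n_estimators': [50, 100], 'max_depth': [5, None]},
    cv=3, scoring='r2', n_jobs=-1,
)
grid.fit(X, y)
print("Best params:", grid.best_params_)
print(f"Best CV R²: {grid.best_score_:.3f}")
```

The same pattern applies to the housing model: pass `X_train`, `y_train` and a grid over `n_estimators`, `max_depth` and `min_samples_split`, then evaluate `grid.best_estimator_` on the held-out test set exactly once.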

⚡ Key Takeaways
  • House price prediction is a supervised regression problem — you have labelled examples (houses with known prices) and the output is a continuous number.
  • Every ML project follows the same pipeline: load, explore, preprocess, feature engineer, split, train, evaluate, predict. Do not skip steps — each one directly affects model quality.
  • EDA (Exploratory Data Analysis) is not optional. Correlation matrices reveal which features matter most. Distribution plots reveal skew and outliers that need handling before training.
  • Always fill missing values before training. Use median for numeric columns (robust to outliers) and most_frequent for categorical columns.
  • Fit the scaler only on training data, then transform both sets. Fitting on the whole dataset leaks test information and gives an over-optimistic accuracy score.
  • Feature engineering — creating ratios, squared values and derived features — can significantly improve accuracy without adding more data. RoomsPerPerson is more informative than AveRooms alone.
  • Start with Linear Regression as a baseline. It gives you a reference point and reveals whether your features have any linear relationship with the target at all.
  • Random Forest handles non-linear relationships, interactions between features and outliers far better than linear regression. It almost always beats linear regression on real-world datasets.
  • Use all three metrics: MAE (average error in dollars), RMSE (punishes big errors more), and R² (percentage of variance explained). One metric alone does not give the full picture.
  • Feature importance from Random Forest shows which input features actually drove the predictions. In real estate, median income and location typically top the list.