Predicting House Prices using Machine Learning: A Step-by-Step Guide

As the real estate market continues to grow, there is an increasing demand for accurate predictions of house prices. With the power of machine learning, it is now possible to predict the price of a house based on its features like the number of bedrooms, bathrooms, and square footage. In this post, we will walk through a step-by-step guide to predicting house prices using machine learning.

Table of Contents

1 Data Collection and Exploration
2 Data Preprocessing
3 Machine Learning Model
4 Conclusion

Data Collection and Exploration

The first step is to collect the dataset of housing prices. We will use the Boston Housing dataset from the UCI Machine Learning Repository. This dataset contains information about housing prices in Boston, Massachusetts. The dataset contains 13 features including the number of rooms, crime rate, and distance to employment centers. We will use the Pandas library in Python to load and explore the dataset.

import pandas as pd

# Load dataset
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data',
                 header=None, sep=r'\s+')  # raw string: columns are separated by runs of whitespace

# Add column names
df.columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 
              'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']

# Show first five rows
print(df.head())

The output should display the first five rows of the dataset.

      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO       B  LSTAT  MEDV
0  0.00632  18.0   2.31     0  0.538  6.575  65.2  4.0900    1  296.0     15.3  396.90   4.98  24.0
1  0.02731   0.0   7.07     0  0.469  6.421  78.9  4.9671    2  242.0     17.8  396.90   9.14  21.6
2  0.02729   0.0   7.07     0  0.469  7.185  61.1  4.9671    2  242.0     17.8  392.83   4.03  34.7
3  0.03237   0.0   2.18     0  0.458  6.998  45.8  6.0622    3  222.0     18.7  394.63   2.94  33.4
4  0.06905   0.0   2.18     0  0.458  7.147  54.2  6.0622    3  222.0     18.7  396.90   5.33  36.2

Each row holds the 13 input features followed by the target, MEDV, in the last column.
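
Beyond `head()`, summary statistics and correlations with the target are a common next step in exploration. The sketch below is only an illustration: it rebuilds the five rows printed above by hand so that it runs without a network connection, and the name `sample` is introduced here. On the full dataset the same calls would be made on `df`, and correlations computed from five rows are not meaningful in themselves.

```python
import pandas as pd

# The five rows printed above, re-entered by hand so the snippet runs offline.
columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
           'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
rows = [
    [0.00632, 18.0, 2.31, 0, 0.538, 6.575, 65.2, 4.0900, 1, 296.0, 15.3, 396.90, 4.98, 24.0],
    [0.02731, 0.0, 7.07, 0, 0.469, 6.421, 78.9, 4.9671, 2, 242.0, 17.8, 396.90, 9.14, 21.6],
    [0.02729, 0.0, 7.07, 0, 0.469, 7.185, 61.1, 4.9671, 2, 242.0, 17.8, 392.83, 4.03, 34.7],
    [0.03237, 0.0, 2.18, 0, 0.458, 6.998, 45.8, 6.0622, 3, 222.0, 18.7, 394.63, 2.94, 33.4],
    [0.06905, 0.0, 2.18, 0, 0.458, 7.147, 54.2, 6.0622, 3, 222.0, 18.7, 396.90, 5.33, 36.2],
]
sample = pd.DataFrame(rows, columns=columns)

# Summary statistics (count, mean, std, min, quartiles, max) for every column
summary = sample.describe()
print(summary.loc[['mean', 'min', 'max'], ['RM', 'LSTAT', 'MEDV']])

# Pearson correlation of a few features with the target, sorted ascending
subset = sample[['CRIM', 'RM', 'LSTAT', 'MEDV']]
corr_with_target = subset.corr()['MEDV'].drop('MEDV').sort_values()
print(corr_with_target)
```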

Data Preprocessing

Before we can use the dataset to train a machine learning model, we need to preprocess it. This involves cleaning the data, handling missing values, and converting categorical variables into numerical ones.

# Check for missing values
print(df.isnull().sum())

The output should display the number of missing values in each feature.

CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
MEDV       0
dtype: int64

The output shows that there are no missing values in the dataset.
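
This dataset needs no imputation or encoding, but other housing datasets often do. As a rough sketch of the two preprocessing steps mentioned earlier, the example below fills missing values and one-hot encodes a categorical column. The `toy` frame, its column names, and its values are all invented for illustration and are not part of the Boston Housing data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names and values are made up for this example.
toy = pd.DataFrame({
    'rooms': [6.5, np.nan, 7.1, 6.9],
    'heating': ['gas', 'electric', np.nan, 'gas'],
    'price': [24.0, 21.6, 34.7, 33.4],
})

# Numeric column: fill gaps with the column median
toy['rooms'] = toy['rooms'].fillna(toy['rooms'].median())

# Categorical column: fill gaps with the most frequent value
toy['heating'] = toy['heating'].fillna(toy['heating'].mode()[0])

# Convert the categorical column into one indicator column per category
encoded = pd.get_dummies(toy, columns=['heating'])
print(encoded)
```

In a real project the median and the most frequent value would normally be computed on the training set only and then applied to the test set, so that no information leaks from the test data.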

Next, we will split the dataset into input features (X) and the target variable (y). The target variable is the ‘MEDV’ column, which represents the median value of owner-occupied homes in thousands of dollars.

# Split dataset into input features (X) and target variable (y)
X = df.drop('MEDV', axis=1)
y = df['MEDV']

We will also split the dataset into training and testing sets. The training set will be used to train the machine learning model, while the testing set will be used to evaluate its performance.

# Split dataset into training and testing sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
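
To see what `test_size=0.2` and `random_state=0` do in practice, the sketch below applies the same call to random stand-in arrays with the same shape as the full Boston Housing data (506 rows, 13 features). The arrays `X_demo` and `y_demo` are placeholders introduced here, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random stand-in data with the same shape as the Boston Housing features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(506, 13))
y_demo = rng.normal(size=506)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)
print(X_tr.shape, X_te.shape)  # (404, 13) (102, 13)

# Repeating the call with the same random_state reproduces the same split
X_tr_again, _, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)
print(np.array_equal(X_tr, X_tr_again))  # True
```

Fixing `random_state` matters mainly for reproducibility: without it, each run would evaluate the model on a different 20% of the rows.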

Machine Learning Model

Now that we have preprocessed the data, we can train a machine-learning model to predict house prices. We will use the Gradient Boosting Regressor from the Scikit-learn library in Python.

# Train a Gradient Boosting Regressor
from sklearn.ensemble import GradientBoostingRegressor

gb = GradientBoostingRegressor()
gb.fit(X_train, y_train)
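
The call above relies on the library defaults. As a hedged variation, the sketch below spells out the main defaults (`n_estimators=100`, `learning_rate=0.1`, `max_depth=3`), adds `random_state=0` so repeated runs give the same model, and prints the fitted feature importances. It trains on synthetic data from `make_regression` so that it runs on its own; `X_demo`, `y_demo`, and `gb_demo` are names introduced for this example.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in data: 200 rows, 5 features, only 2 of them informative
X_demo, y_demo = make_regression(n_samples=200, n_features=5, n_informative=2,
                                 noise=10.0, random_state=0)

# The three hyperparameters below are the scikit-learn defaults, written out
# explicitly; random_state fixes the seed for reproducibility.
gb_demo = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
                                    max_depth=3, random_state=0)
gb_demo.fit(X_demo, y_demo)

# Importances are normalized to sum to 1; higher means the feature was used
# for more impactful splits across the trees.
importances = gb_demo.feature_importances_
for i, importance in enumerate(importances):
    print(f'feature {i}: {importance:.3f}')
```

On the housing data, the same attribute on `gb` can be paired with `X_train.columns` to see which features the model leans on most.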

We have trained the Gradient Boosting Regressor on the training data. We will now use the testing data to evaluate the performance of the model.

# Evaluate performance on testing data
from sklearn.metrics import mean_squared_error, r2_score

y_pred = gb.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print('Mean Squared Error:', mse)
print('R^2 Score:', r2)

The output should display the mean squared error and R^2 score of the model.

Mean Squared Error: 13.031738158696378
R^2 Score: 0.8674611428007859

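The output shows that the model has a mean squared error of 13.03 and an R^2 score of 0.87 on the test set. Because MEDV is measured in thousands of dollars, the root mean squared error is about 3.61, which corresponds to a typical error of roughly $3,600 in the predicted median home value. An R^2 of 0.87 means the model explains about 87% of the variance in the test-set prices, which is a reasonably good fit for this dataset. The exact numbers may vary slightly between runs and library versions.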
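
A single train/test split gives only one estimate of performance. The sketch below first takes the square root of the reported MSE to express the error in the target's own units, then shows how 5-fold cross-validation could give a more stable R^2 estimate. The cross-validation part runs on synthetic data from `make_regression` so that it is self-contained; `X_demo` and `y_demo` are placeholders, and on the housing data `X` and `y` would be passed instead.

```python
import math

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Root mean squared error from the MSE reported above (units: thousands of dollars)
mse = 13.031738158696378
rmse = math.sqrt(mse)
print(f'RMSE: {rmse:.2f}')  # RMSE: 3.61

# Synthetic stand-in data so the cross-validation example runs on its own
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0,
                                 random_state=0)

# 5-fold cross-validation: train on 4 folds, score R^2 on the held-out fold
scores = cross_val_score(GradientBoostingRegressor(random_state=0),
                         X_demo, y_demo, cv=5, scoring='r2')
print(f'R^2 per fold: {scores.round(3)}')
print(f'Mean R^2: {scores.mean():.3f} (std {scores.std():.3f})')
```

A large spread between folds would suggest that the score from any single split should be read with caution.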

Conclusion

In this post, we have walked through a step-by-step guide to predicting house prices using machine learning. We collected the dataset, explored it, preprocessed it, and trained a machine-learning model. We used the Gradient Boosting Regressor to predict median home values from features such as the number of rooms, the crime rate, and the distance to employment centers. We also evaluated the performance of the model using the mean squared error and R^2 score. With a model like this, machine learning can give useful price estimates that support better decisions when buying or selling a house.


Jamaley Hussain: Hello, I am Jamaley. I graduated from Staffordshire University and have always been passionate about Computers, Technology, and Generative AI. Currently, I work as a Senior Data Scientist (AI/ML) and I’m also the founder of TechJunkGigs, a platform dedicated to helping programming students with tutorials on Machine Learning, Data Science, Python, LLM, RAG, Generative AI, and NLP. What started as a blog has now evolved into a valuable resource for students, and I'm committed to sharing knowledge to help them stay updated with industry trends
