
Predicting House Prices using Machine Learning: A Step-by-Step Guide

As the real estate market continues to grow, so does the demand for accurate house price predictions. Machine learning makes it possible to estimate the price of a house from its features, such as the number of bedrooms, bathrooms, and square footage. In this post, we walk through the process step by step.

Data Collection and Exploration

The first step is to collect a dataset of housing prices. We will use the Boston Housing dataset from the UCI Machine Learning Repository, which describes housing in the Boston, Massachusetts area. It contains 13 features, including the average number of rooms per dwelling, the crime rate, and the distance to employment centers, plus the target column. We will use the Pandas library in Python to load and explore the dataset.

 

import pandas as pd

# Load dataset
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data',
                 header=None, sep=r'\s+')  # raw string: columns are whitespace-separated

# Add column names
df.columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 
              'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']

# Show first five rows
print(df.head())

The output should display the first five rows of the dataset.

 

      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO       B  LSTAT  MEDV
0  0.00632  18.0   2.31     0  0.538  6.575  65.2  4.0900    1  296.0     15.3  396.90   4.98  24.0
1  0.02731   0.0   7.07     0  0.469  6.421  78.9  4.9671    2  242.0     17.8  396.90   9.14  21.6
2  0.02729   0.0   7.07     0  0.469  7.185  61.1  4.9671    2  242.0     17.8  392.83   4.03  34.7
3  0.03237   0.0   2.18     0  0.458  6.998  45.8  6.0622    3  222.0     18.7  394.63   2.94  33.4
4  0.06905   0.0   2.18     0  0.458  7.147  54.2  6.0622    3  222.0     18.7  396.90   5.33  36.2

Each row is one record: the 13 feature columns come first, followed by the target column, MEDV.
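
As a quick next step in exploration, one might check how individual features relate to the target. The sketch below is illustrative only: it reuses three columns from the five rows printed above (the names `sample` and `corr_with_target` are ours), so the numbers say nothing reliable about the full dataset.

```python
import pandas as pd

# Three columns copied from the five rows printed above (not the full dataset)
sample = pd.DataFrame({
    'RM':    [6.575, 6.421, 7.185, 6.998, 7.147],
    'LSTAT': [4.98, 9.14, 4.03, 2.94, 5.33],
    'MEDV':  [24.0, 21.6, 34.7, 33.4, 36.2],
})

# Summary statistics for each column
print(sample.describe())

# Pearson correlation of each feature with the target
corr_with_target = sample.corr()['MEDV'].drop('MEDV')
print(corr_with_target)
```

On the full DataFrame, the same two calls would be `df.describe()` and `df.corr()['MEDV']`.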

Data Preprocessing

Before we can use the dataset to train a machine learning model, we need to preprocess it. This involves cleaning the data, handling missing values, and converting categorical variables into numerical ones.

 

# Check for missing values
print(df.isnull().sum())

The output should display the number of missing values in each feature.

 

CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
MEDV       0
dtype: int64

The output shows that there are no missing values in the dataset.
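
This dataset needs no cleaning, but the other preprocessing steps mentioned earlier can be sketched on made-up data. The `toy` DataFrame and its `rooms` and `neighborhood` columns below are hypothetical and not part of the Boston dataset. Median imputation and one-hot encoding are just one common choice.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value and one categorical column
toy = pd.DataFrame({
    'rooms': [6.0, np.nan, 7.0],
    'neighborhood': ['north', 'south', 'north'],
})

# Fill the missing value with the column median (6.5 here)
toy['rooms'] = toy['rooms'].fillna(toy['rooms'].median())

# One-hot encode the categorical column into indicator columns
toy_encoded = pd.get_dummies(toy, columns=['neighborhood'])
print(toy_encoded)
```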

Next, we will split the dataset into input features (X) and the target variable (y). The target variable is the ‘MEDV’ column, which represents the median value of owner-occupied homes in thousands of dollars.

 

# Split dataset into input features (X) and target variable (y)
X = df.drop('MEDV', axis=1)
y = df['MEDV']

We will also split the dataset into training and testing sets. The training set will be used to train the machine learning model, while the testing set will be used to evaluate its performance.

 

# Split dataset into training and testing sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
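
To see what `test_size=0.2` does to the row counts, here is a small sketch using placeholder arrays. It assumes the dataset has 506 rows, which is the usual size of this dataset but is not stated above. The zero-filled arrays `X_demo` and `y_demo` stand in for the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the assumed shape of the data: 506 rows, 13 features
X_demo = np.zeros((506, 13))
y_demo = np.zeros(506)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# The test share is rounded up, so 102 rows are held out and 404 remain
print(X_tr.shape, X_te.shape)
```

Fixing `random_state` makes the split repeatable across runs.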

Machine Learning Model

Now that we have preprocessed the data, we can train a machine-learning model to predict house prices. We will use the Gradient Boosting Regressor from the Scikit-learn library in Python.

 

# Train a Gradient Boosting Regressor
from sklearn.ensemble import GradientBoostingRegressor

gb = GradientBoostingRegressor()
gb.fit(X_train, y_train)
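
Two practical notes on this step, shown as a hedged sketch on synthetic data rather than the housing data. First, `GradientBoostingRegressor()` without a `random_state` may give slightly different results from run to run, so fixing it helps reproducibility. Second, the fitted model exposes `feature_importances_`, which indicates which inputs the trees relied on. The arrays `X_syn` and `y_syn` are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data: only the first of three features drives the target
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 3))
y_syn = 3.0 * X_syn[:, 0] + rng.normal(scale=0.1, size=200)

# Fixing random_state makes repeated fits give the same model
model = GradientBoostingRegressor(random_state=0)
model.fit(X_syn, y_syn)

# Importances are normalized to sum to 1; feature 0 should dominate here
importances = model.feature_importances_
print(importances)
```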

We have trained the Gradient Boosting Regressor on the training data. We will now use the testing data to evaluate the performance of the model.

 

# Evaluate performance on testing data
from sklearn.metrics import mean_squared_error, r2_score

y_pred = gb.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print('Mean Squared Error:', mse)
print('R^2 Score:', r2)

The output should display the mean squared error and R^2 score of the model.

 

Mean Squared Error: 13.031738158696378
R^2 Score: 0.8674611428007859

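The output shows that the model has a mean squared error of 13.03 and an R^2 score of 0.87 on the test set. An R^2 of 0.87 means the model explains about 87% of the variance in the test-set prices, which indicates a good fit. Keep in mind that these figures come from a single train/test split, so they may vary somewhat with a different split or library version.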
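
Because MEDV is measured in thousands of dollars, the mean squared error is in squared units and is hard to read directly. Taking its square root gives the root mean squared error (RMSE), which is in the same units as the target. A minimal sketch using the MSE value printed above:

```python
import math

# MSE as printed in the output above (units: thousands of dollars, squared)
mse = 13.031738158696378

# RMSE is in the same units as MEDV (thousands of dollars)
rmse = math.sqrt(mse)
print(f'RMSE: {rmse:.2f}')  # about 3.61, i.e. roughly $3,600
```

With the variables from the evaluation code, the same value would come from `math.sqrt(mean_squared_error(y_test, y_pred))`.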

Conclusion

In this post, we walked through predicting house prices with machine learning. We loaded and explored the dataset, checked it for missing values, split it into training and testing sets, and trained a Gradient Boosting Regressor on features such as the number of rooms, crime rate, and distance to employment centers. We then evaluated the model using the mean squared error and R^2 score. Models like this can give useful price estimates to support better decisions when buying or selling a house.

Jamaley Hussain: Hello, I am Jamaley. I graduated from Staffordshire University, UK, and I am passionate about computers and technology.