As the real estate market continues to grow, there is an increasing demand for accurate predictions of house prices. With the power of machine learning, it is now possible to predict the price of a house based on its features like the number of bedrooms, bathrooms, and square footage. In this post, we will walk through a step-by-step guide to predicting house prices using machine learning.
Data Collection and Exploration
The first step is to collect a dataset of housing prices. We will use the Boston Housing dataset from the UCI Machine Learning Repository, which describes housing in the Boston, Massachusetts area. It has 13 features, including the average number of rooms, the crime rate, and the distance to employment centers, plus the target column, the median home value. We will use the Pandas library in Python to load and explore the dataset.
import pandas as pd

# Load dataset (whitespace-separated, no header row)
df = pd.read_csv(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data',
    header=None,
    sep=r'\s+',  # raw string so the backslash is passed to the regex unchanged
)

# Add column names
df.columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS',
              'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']

# Show first five rows
print(df.head())
The output should display the first five rows of the dataset, one column per feature plus the MEDV target.

      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO       B  LSTAT  MEDV
0  0.00632  18.0   2.31     0  0.538  6.575  65.2  4.0900    1  296.0     15.3  396.90   4.98  24.0
1  0.02731   0.0   7.07     0  0.469  6.421  78.9  4.9671    2  242.0     17.8  396.90   9.14  21.6
2  0.02729   0.0   7.07     0  0.469  7.185  61.1  4.9671    2  242.0     17.8  392.83   4.03  34.7
3  0.03237   0.0   2.18     0  0.458  6.998  45.8  6.0622    3  222.0     18.7  394.63   2.94  33.4
4  0.06905   0.0   2.18     0  0.458  7.147  54.2  6.0622    3  222.0     18.7  396.90   5.33  36.2
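Beyond looking at the first rows, a quick way to explore the data is to print summary statistics and check how each feature correlates with the target. The sketch below is only an illustration: it re-enters four columns of the five rows shown above so that it runs offline, and the helper name correlations_with_target is made up for this example. Five rows are far too few for meaningful correlations, so in practice you would pass the full df instead of sample.

```python
import pandas as pd

# Four columns of the five rows printed above, typed in by hand
# so the example does not need a network connection.
sample = pd.DataFrame(
    {
        "CRIM": [0.00632, 0.02731, 0.02729, 0.03237, 0.06905],
        "RM": [6.575, 6.421, 7.185, 6.998, 7.147],
        "LSTAT": [4.98, 9.14, 4.03, 2.94, 5.33],
        "MEDV": [24.0, 21.6, 34.7, 33.4, 36.2],
    }
)


def correlations_with_target(frame, target="MEDV"):
    """Return each feature's Pearson correlation with the target, strongest first.

    Hypothetical helper for this post, not part of pandas.
    """
    corr = frame.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Count, mean, standard deviation, min, quartiles and max per column
print(sample.describe())

# Features ranked by the strength of their linear relationship with MEDV
print(correlations_with_target(sample))
```

Sorting by absolute value keeps strongly negative relationships near the top of the list alongside strongly positive ones.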
Data Preprocessing
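Before we can use the dataset to train a machine learning model, we need to preprocess it. In general, this involves cleaning the data, handling missing values, and converting categorical variables into numerical ones. Every column in this dataset is already numeric, so the main check here is for missing values.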
# Check for missing values
print(df.isnull().sum())
The output should display the number of missing values in each feature.
CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
MEDV       0
dtype: int64
The output shows that there are no missing values in the dataset.
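The Boston data needs no further cleaning, but other housing datasets often do. As a rough sketch of the two preprocessing steps mentioned earlier, the example below fills a missing numeric value with the column median and one-hot encodes a categorical column. The homes DataFrame and its column names are invented for illustration and are not part of the Boston dataset.

```python
import pandas as pd

# Hypothetical toy data (not from the Boston dataset): one numeric column
# with a gap and one categorical column.
homes = pd.DataFrame(
    {
        "rooms": [3.0, None, 5.0, 4.0],
        "heating": ["gas", "electric", "gas", "oil"],
        "price": [200.0, 180.0, 320.0, 250.0],
    }
)

# Fill the missing value with the column median (the median of 3, 4 and 5 is 4).
homes["rooms"] = homes["rooms"].fillna(homes["rooms"].median())

# Replace the text column with one 0/1 indicator column per category.
encoded = pd.get_dummies(homes, columns=["heating"], dtype=int)

print(encoded)
```

The median is one common choice for imputation because it is less sensitive to outliers than the mean. Whether it suits a given dataset is a judgment call.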
Next, we will split the dataset into input features (X) and the target variable (y). The target variable is the ‘MEDV’ column, which represents the median value of owner-occupied homes in thousands of dollars.
# Split dataset into input features (X) and target variable (y)
X = df.drop('MEDV', axis=1)
y = df['MEDV']
We will also split the dataset into training and testing sets. The training set will be used to train the machine learning model, while the testing set will be used to evaluate its performance.
# Split dataset into training and testing sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
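To make the idea of a hold-out split concrete, here is a minimal plain-Python sketch of shuffling the rows and setting a fraction aside for testing. The function holdout_split is a hypothetical illustration, not scikit-learn's implementation, and the row count of 506 is the commonly cited size of the Boston dataset, which the post itself does not state.

```python
import math
import random


def holdout_split(rows, test_size=0.2, seed=0):
    """Shuffle row indices and reserve a fraction of them for testing.

    Hypothetical helper for illustration; train_test_split does the
    equivalent job for arrays and DataFrames.
    """
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded, so the split is repeatable
    n_test = math.ceil(len(rows) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


# Stand-in for 506 rows of data (assumed dataset size)
train, test = holdout_split(list(range(506)))
print(len(train), len(test))  # 404 102
```

Fixing the seed plays the same role as random_state=0 in the call above: the same rows land in the test set on every run, so results stay comparable between experiments.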
Machine Learning Model
Now that we have preprocessed the data, we can train a machine-learning model to predict house prices. We will use the Gradient Boosting Regressor from the Scikit-learn library in Python.
# Train a Gradient Boosting Regressor
from sklearn.ensemble import GradientBoostingRegressor

gb = GradientBoostingRegressor()
gb.fit(X_train, y_train)
We have trained the Gradient Boosting Regressor on the training data. We will now use the testing data to evaluate the performance of the model.
# Evaluate performance on testing data
from sklearn.metrics import mean_squared_error, r2_score

y_pred = gb.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print('Mean Squared Error:', mse)
print('R^2 Score:', r2)
The output should display the mean squared error and R^2 score of the model.
Mean Squared Error: 13.031738158696378
R^2 Score: 0.8674611428007859
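The output shows that the model has a mean squared error of 13.03 and an R^2 score of 0.87 on the test set. Because MEDV is measured in thousands of dollars, the root mean squared error is about 3.61, so a typical prediction is off by roughly $3,600. The R^2 score means the model explains about 87% of the variance in the test-set prices, which indicates a good fit for this dataset.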
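Both metrics are simple enough to compute by hand, which helps when interpreting them. The sketch below implements the standard definitions in plain Python on three made-up values, then takes the square root of the MSE reported above to express the error in the same units as MEDV. The function names and sample values are illustrative only. In practice you would keep using mean_squared_error and r2_score.

```python
import math


def mse(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical actual and predicted values, chosen to keep the arithmetic easy
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]

print(mse(y_true, y_pred))  # (1 + 0 + 4) / 3
print(r2(y_true, y_pred))   # 1 - 5/8 = 0.375

# Root mean squared error for the MSE reported above, in thousands of dollars
rmse = math.sqrt(13.031738158696378)
print(round(rmse, 2))       # 3.61
```

An R^2 of 1 would mean perfect predictions, while 0 would mean the model does no better than always predicting the mean of the actual values.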
Conclusion
In this post, we have walked through a step-by-step guide to predicting house prices using machine learning. We collected the dataset, explored it, preprocessed it, and trained a machine learning model. We used the Gradient Boosting Regressor to predict median house prices from features such as the average number of rooms, the crime rate, and the distance to employment centers. We also evaluated the performance of the model using the mean squared error and R^2 score. Machine learning can help us predict house prices accurately, and we can use this information to make better decisions when buying or selling a house.