Predicting House Prices With Machine Learning

If you’re going to sell a house, you need to know what price tag to put on it. For this purpose, I wrote a regression algorithm to predict home prices.

Hilal Alhwaiti
Nov 4, 2020

Introduction

The average sales price of new homes sold in the U.S. is US$388,000. A house’s price depends mainly on location, size and condition; these factors influence a home’s value.

The goal of this article is to predict house prices using two basic machine learning models, Linear Regression and Random Forest. We will mainly look at data exploration and data cleaning. I used data from Kaggle with 79 explanatory variables describing every aspect of residential homes, such as area, space, materials used …

Business Understanding

We are interested in answering the following four questions:

Q1: What does our target variable look like, and what distribution does it follow?

Q2: What are the most important variables for our target?

Q3: Which model gives us the best result?

Q4: Can we improve the accuracy of a model?

Data Cleaning and Exploration

First, we want to see whether we have missing values in our data.

It looks like a few columns have a lot of missing values.
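As a rough illustration of this check, here is a minimal sketch with pandas. The small DataFrame is made-up toy data standing in for the Kaggle training set, and the `Alley` column is only an assumed example of a sparse column.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Kaggle training data (all values are made up).
df = pd.DataFrame({
    "GrLivArea": [1710, 1262, np.nan, 1717, 2198],
    "GarageCars": [2, 2, 2, 3, 3],
    "Alley": [np.nan, np.nan, "Grvl", np.nan, np.nan],  # assumed sparse column
})

# Fraction of missing values per column, largest first.
missing_ratio = df.isnull().mean().sort_values(ascending=False)

# Show only the columns that actually have missing values.
print(missing_ratio[missing_ratio > 0])
```

Sorting by the missing fraction makes the sparsest columns easy to spot.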

Based on the above, we can safely drop the first five columns, since each of them is missing approximately 50% of its values.
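One way to drop such columns is to filter on the missing fraction. This is only a sketch on toy data: the `SparseA` and `SparseB` columns and the 0.5 cutoff are assumptions for illustration, not the exact columns or code from my notebook.

```python
import numpy as np
import pandas as pd

# Toy data: two hypothetical sparse columns and one complete column.
df = pd.DataFrame({
    "GrLivArea": [1710, 1262, 1786, 1717],
    "SparseA": [np.nan, np.nan, "x", np.nan],  # 75% missing
    "SparseB": [np.nan, 1.0, np.nan, np.nan],  # 75% missing
})

threshold = 0.5  # assumed cutoff for "too many missing values"

# Find the columns whose missing fraction reaches the cutoff and drop them.
missing_ratio = df.isnull().mean()
to_drop = missing_ratio[missing_ratio >= threshold].index
df_clean = df.drop(columns=to_drop)

print(list(df_clean.columns))
```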

Since we don't know much about our data, we can replace the missing values in our categorical variables with the mode. Similarly, for the continuous variables we will replace NaN with the mean.
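The imputation rule above can be sketched as follows. The DataFrame is toy data, and `Condition` is a hypothetical categorical column used only to show the mode branch.

```python
import numpy as np
import pandas as pd

# Toy data with one continuous and one hypothetical categorical column.
df = pd.DataFrame({
    "GarageArea": [548.0, 460.0, np.nan, 642.0],
    "Condition": ["Good", None, "Good", "Fair"],
})

for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        # Continuous variable: fill gaps with the column mean.
        df[col] = df[col].fillna(df[col].mean())
    else:
        # Categorical variable: fill gaps with the most frequent value.
        df[col] = df[col].fillna(df[col].mode()[0])

print(df)
```

Mean and mode imputation are simple defaults. They keep every row, at the cost of shrinking the variance of the imputed columns.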

Explore data

We want to get an overview of our target variable and find out what distribution it follows.

As we can see, the target variable SalePrice is not normally distributed.
This can reduce the performance of ML regression models, because some of them assume a normal distribution.

We will apply a log transformation; the resulting distribution looks much better.
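A small sketch of the idea, using made-up prices with one expensive outlier rather than the real SalePrice column. I use the natural log via `np.log` here as an assumption; `np.log1p` is a common alternative.

```python
import numpy as np
import pandas as pd

# Made-up, right-skewed prices (one very expensive house).
sale_price = pd.Series(
    [120000, 135000, 150000, 160000, 180000, 210000, 250000, 755000],
    name="SalePrice",
)

# Log-transform the target to reduce the right skew.
log_price = np.log(sale_price)

# Skewness before and after the transformation.
print("skew before:", sale_price.skew())
print("skew after: ", log_price.skew())

# Predictions made on the log scale can be mapped back with exp.
recovered = np.exp(log_price)
```

If a model is trained on the log of the price, its predictions need to be passed through `np.exp` to get back to dollars.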

Q2: What are the most important variables for our target?

Our target variable is SalePrice, so we want to see which variables have a strong relationship with it.

From the above we can see the most important variables for SalePrice. We will consider any variable with a correlation of more than 4% to have a good relationship with it:

1-GrLivArea 2-OverallQual 3-GarageCars 4-GarageArea 5-TotRmsAbvGrd 6-FullBath 7-1stFlrSF 8-TotalBsmtSF 9-YearBuilt
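A ranking like this can be produced by correlating every numeric column with the target. The sketch below uses synthetic data, a hypothetical `Unrelated` column, and an arbitrary placeholder cutoff of 0.5, so the numbers will not match the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data: SalePrice depends on GrLivArea but not on "Unrelated".
rng = np.random.default_rng(0)
n = 200
gr_liv_area = rng.normal(1500, 400, n)
df = pd.DataFrame({
    "GrLivArea": gr_liv_area,
    "Unrelated": rng.normal(0, 1, n),  # hypothetical noise column
    "SalePrice": 100 * gr_liv_area + rng.normal(0, 20000, n),
})

# Pearson correlation of every column with the target, strongest first.
corr = df.corr()["SalePrice"].drop("SalePrice").sort_values(ascending=False)

threshold = 0.5  # placeholder cutoff, chosen only for this example
strong = corr[corr.abs() > threshold].index.tolist()

print(corr)
print(strong)
```

Taking the absolute value keeps strongly negative correlations as well, which also carry predictive information.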

Data Analysis

I decided to use Linear Regression. We will also use Random Forest to try to improve our prediction accuracy.
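Here is a minimal sketch of fitting and comparing the two models with scikit-learn. The features are synthetic stand-ins for GrLivArea and OverallQual, and the split and hyperparameters are assumptions, so the scores are not the ones from my analysis.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic features standing in for GrLivArea and OverallQual.
rng = np.random.default_rng(42)
n = 300
X = np.column_stack([
    rng.normal(1500, 400, n),  # living area
    rng.integers(1, 11, n),    # quality rating from 1 to 10
])
y = 80 * X[:, 0] + 15000 * X[:, 1] + rng.normal(0, 15000, n)

# Hold out 20% of the rows to measure performance on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "LinearRegression": LinearRegression(),
    "RandomForest": RandomForestRegressor(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # For regressors, score() returns R^2 on the given data.
    scores[name] = model.score(X_test, y_test)

print(scores)
```

Note that `score` returns the R² coefficient for regressors, which is the kind of "accuracy" figure reported in this article.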

The plot above compares the predicted values with the actual values. The chart does not look right: the orange lines do not follow the actual values the way they should.

After dropping the categorical variables, the accuracy improved markedly: the test score rose from 40% to 81%.
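Dropping the categorical variables can be done by keeping only the numeric columns. A minimal sketch on toy data, where `Style` is a hypothetical categorical column:

```python
import pandas as pd

# Toy data with one hypothetical categorical column.
df = pd.DataFrame({
    "GrLivArea": [1710, 1262, 1786],
    "OverallQual": [7, 6, 7],
    "Style": ["A", "B", "A"],
    "SalePrice": [208500, 181500, 223500],
})

# Keep only numeric columns, then split into features and target.
numeric_df = df.select_dtypes(include="number")
X = numeric_df.drop(columns="SalePrice")
y = numeric_df["SalePrice"]

print(list(X.columns))
```

An alternative that keeps the information in those columns would be to encode them, for example with `pd.get_dummies`, but that is not what I did here.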

It looks much better than before.

Conclusion

In this article we went through some exploratory questions about our variables. We conclude that GrLivArea, OverallQual, GarageCars and GarageArea have a big influence on a home’s value. Random Forest is the most accurate model for predicting the house price. It scored an estimated accuracy of 81%. To improve on this, we may need to do more feature selection so that only the important variables are included in our model.
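One possible starting point for that feature selection is the importance ranking a fitted Random Forest provides. This is a sketch on synthetic data with a hypothetical `Noise` column, not a step from my analysis.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: two informative features and one pure-noise feature.
rng = np.random.default_rng(1)
n = 300
X = pd.DataFrame({
    "GrLivArea": rng.normal(1500, 400, n),
    "OverallQual": rng.integers(1, 11, n),
    "Noise": rng.normal(0, 1, n),  # hypothetical uninformative column
})
y = 80 * X["GrLivArea"] + 15000 * X["OverallQual"] + rng.normal(0, 15000, n)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances sum to 1; higher means more useful for splits.
importances = pd.Series(
    forest.feature_importances_, index=X.columns
).sort_values(ascending=False)

print(importances)
```

Features at the bottom of this ranking are candidates for removal, although the effect should be checked on a held-out test set.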

We improved the accuracy of our model by more than 50%. We noticed that the model did not perform well when the categorical variables were included. This might have happened because there is no relationship between SalePrice and these variables.

For details about the analysis, feel free to visit my GitHub: house_prices
