
Data Overview

I obtained a dataset from Kaggle containing information on nearly 21,000 houses, described by 21 columns. While exploring the data with Python, I ran into several challenges, including outliers and variables that are not normally distributed. For example, house prices vary widely, with some properties priced at over $5 million. These may be considered outliers, though not extreme ones. At the other end, some houses are priced below $100,000, which shows how wide the range of values in the dataset is.


From my analysis, the columns with the most influence on house prices include square_feet_living (houses with larger square footage tend to have higher prices), along with features like grade and waterfront. Bathrooms also appear important during basic exploratory data analysis. However, when I trained a machine learning model on this dataset, I found that some columns I initially thought were significant, like bathrooms, bedrooms, and floors, had little impact on the model's predictions. This was surprising, as these columns seem important when analyzing the data manually.


For example, houses with around 1,500 square feet typically have 2–3 bedrooms and 2 bathrooms. While deviations from this norm (e.g., fewer or more bedrooms or bathrooms) do affect prices, the impact appears minimal. On the other hand, features like view significantly influence prices: houses with better views are priced higher, while those with poor or no views tend to have lower prices.
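One possible reading of this gap between manual analysis and model behaviour is that bedroom and bathroom counts largely track square footage, so they add little once size is known. The sketch below illustrates the idea on synthetic data with a plain least-squares fit. The data, the coefficients used to generate it, and the choice of a linear model are all assumptions for illustration; this is not the model trained for this project.

```python
import numpy as np

# Synthetic data (assumption): bedrooms mostly follow house size.
rng = np.random.default_rng(2)
n = 2000
sqft = rng.uniform(800, 4000, size=n)
bedrooms = np.round(sqft / 600 + rng.normal(0, 0.5, size=n))

# Price is driven mainly by size; bedrooms add only a small amount on their own.
price = 200 * sqft + 5_000 * bedrooms + rng.normal(0, 50_000, size=n)


def standardize(x):
    """Rescale to zero mean and unit variance so coefficients are comparable."""
    return (x - x.mean()) / x.std()


# Looked at alone, bedrooms appear strongly related to price.
corr_bedrooms = np.corrcoef(bedrooms, price)[0, 1]

# In a joint fit, most of the weight goes to square footage.
X = np.column_stack([standardize(sqft), standardize(bedrooms)])
y = standardize(price)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

print(f"correlation of bedrooms with price: {corr_bedrooms:.2f}")
print(f"standardized coefficient, sqft:     {coef[0]:.2f}")
print(f"standardized coefficient, bedrooms: {coef[1]:.2f}")
```

In this toy setup the bedroom count correlates strongly with price by itself, yet receives a small coefficient once square footage is in the fit. Whether the same effect explains the real dataset would need to be checked against the actual model.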

I’ve also developed a machine learning model based on this dataset and might share it soon. It's been interesting to observe how the importance of certain features differs between exploratory data analysis and machine learning models.


Exploratory Data Analysis


The price data is right-skewed (positively skewed). The majority of houses in the dataset have relatively low prices, so the data clusters heavily toward the lower end of the price range, which is evident in the initial portion of the distribution graph. A few extremely high-priced houses, though very rare, stretch the distribution into a long tail on the right, which produces the pronounced skew.
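The direction of the skew can be checked numerically. The sketch below uses a synthetic, log-normally distributed price column as a stand-in (an assumption, not the Kaggle data): a positive skewness value and a mean above the median both indicate a long tail on the right. The log transform at the end is a common option for such data, not necessarily a step used in this analysis.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the price column (assumption: not the real data).
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=13.0, sigma=0.5, size=5000), name="price")

# Positive skewness means a long tail on the right (a few very expensive houses).
raw_skew = prices.skew()

# A log transform compresses the high end and makes the shape more symmetric.
log_skew = np.log(prices).skew()

print(f"mean:   {prices.mean():,.0f}")
print(f"median: {prices.median():,.0f}")
print(f"skewness of price:      {raw_skew:.2f}")
print(f"skewness of log(price): {log_skew:.2f}")
```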


Analyzing the distribution of individual columns helps us better understand the data. For example, most houses have between 2 and 5 bathrooms, which is reasonable. However, there is one house with over 30 bathrooms, yet its price is less than a million dollars, which doesn't make sense. To address this inconsistency, I set the number of bathrooms for this house to 3, in line with more realistic values.
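A correction like this can be sketched in pandas as follows. The tiny table and the column names `price` and `bathrooms` are made up for illustration; only the threshold (over 30) and the replacement value (3) come from the description above.

```python
import pandas as pd

# Tiny made-up sample; column names are assumptions.
df = pd.DataFrame({
    "price": [450_000, 620_000, 640_000, 380_000],
    "bathrooms": [2.0, 2.5, 33.0, 1.0],
})

# Flag the implausible rows first so the change can be reviewed.
mask = df["bathrooms"] > 30
print(df[mask])

# Overwrite only the flagged rows with the replacement value.
df.loc[mask, "bathrooms"] = 3

print(df)
```

Building the mask before overwriting keeps a record of which rows were changed, which helps if the fix needs to be revisited later.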

Regarding waterfront properties, there aren't many houses with this feature: only about 150 in the dataset. However, these houses tend to be significantly more expensive than houses with a standard view, with an average price of around $1.5 million. This highlights how strongly a feature like waterfront can influence house prices compared with properties that have a normal view.
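A comparison like this comes down to a group-by on the waterfront flag. The sketch below uses a handful of made-up rows and assumes the flag is stored as 0/1 in a column named `waterfront`; the numbers are illustrative, not taken from the dataset.

```python
import pandas as pd

# Made-up rows; the 0/1 encoding and column names are assumptions.
df = pd.DataFrame({
    "waterfront": [0, 0, 0, 1, 0, 1],
    "price": [400_000, 520_000, 610_000, 1_400_000, 350_000, 1_600_000],
})

# Count and average price for non-waterfront (0) and waterfront (1) houses.
summary = df.groupby("waterfront")["price"].agg(["count", "mean"])
print(summary)
```

Reporting the count next to the mean matters here, because an average over a small group is much more sensitive to individual houses.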


The correlation between price and sqft-living indicates a strong positive relationship: as the square footage of a house increases, the price tends to increase as well. Similarly, the grade of a house also shows a high correlation with price. This is because the grade represents the overall condition of the house and the quality of its infrastructure, which are significant factors influencing its value. These two features, sqft-living and grade, stand out as key predictors of house prices in the dataset.
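Correlations of this kind can be read off a single column of the correlation matrix. The sketch below generates synthetic data in which price depends on size and grade but not on floors. The column names (`sqft_living`, `grade`, `floors`) and all generated values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic data (assumption): price depends on size and grade, not on floors.
rng = np.random.default_rng(1)
n = 1000
sqft = rng.uniform(600, 5000, size=n)
grade = np.clip(np.round(sqft / 600 + rng.normal(0, 1, size=n)), 1, 13)
floors = rng.integers(1, 4, size=n)
price = 150 * sqft + 40_000 * grade + rng.normal(0, 80_000, size=n)

df = pd.DataFrame({
    "price": price,
    "sqft_living": sqft,
    "grade": grade,
    "floors": floors,
})

# Pearson correlation of every other column with price, strongest first.
corr_with_price = df.corr()["price"].drop("price").sort_values(ascending=False)
print(corr_with_price)
```

Pearson correlation only captures linear association, so on heavily skewed prices it can be worth comparing against `df.corr(method="spearman")` as well.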




The graph showing the average price in each view category reveals that as the quality of the view improves, the price also increases. Houses in the top view category have the highest average prices, while houses with zero view are priced at roughly half of that, indicating a significant price difference based on view quality.

Another graph displays the average price in each grade category, along with the number of houses in each grade. The "C" on top of the bars represents the count of houses in each grade. As I mentioned earlier, houses with a higher grade tend to have higher prices, and the graph reflects this: better quality houses generally command higher prices.
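The numbers behind both graphs, the average price per category and the count shown on top of each bar, can be produced with one small helper. This is a sketch on made-up rows. The helper name `price_by_category`, the column names, and the values are hypothetical and not taken from the original notebook.

```python
import pandas as pd


def price_by_category(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Average price and number of houses for each value of `column`."""
    return (
        df.groupby(column)["price"]
        .agg(avg_price="mean", n_houses="count")
        .sort_index()
    )


# Made-up rows for illustration only.
df = pd.DataFrame({
    "view": [0, 0, 0, 2, 4, 4],
    "grade": [6, 7, 7, 9, 11, 11],
    "price": [300_000, 400_000, 500_000, 650_000, 750_000, 850_000],
})

by_view = price_by_category(df, "view")
by_grade = price_by_category(df, "grade")

print(by_view)
print(by_grade)
```

The `n_houses` column plays the role of the count label on the bars, which makes it easy to see when a high average rests on only a few houses.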
