Boston Airbnb Data Analysis and Price Prediction

Suveesh Malachiyil
7 min read · Jan 4, 2021
Original photo by Anthony Delanoix on Unsplash

Airbnb is an online marketplace that lets people rent out their properties, rooms in their houses, or shared rooms to guests. This blog is an effort to interpret the Boston Airbnb dataset retrieved from Kaggle and answer a few business questions, listed below.

The dataset is a collection of property listings with their key features, such as property type, host type, neighborhood, reviews, and much more.

Size of the dataset: (3585, 95), that is, 3,585 rows and 95 columns.

The analysis and its findings are observational only and not the result of a formal study. The general business questions listed below guide the analysis toward a model that can predict the rental price from selected features. For a more detailed analysis, here are my GitHub repository and Kaggle notebook with all the code needed.

Who could benefit from these business questions, and how?

A new host, or someone who plans to list their property on Airbnb, could use this analysis to understand how pricing varies across property types and their features. A machine learning model has also been created and evaluated to predict the price of each property.

Business Questions:

1.) What are the features that influence the property pricing?

2.) What time of year has the highest rental prices?

3.) Do Superhosts perform better than other hosts?

4.) How are the listings distributed across the city of Boston?

Task:

1.) Create a model to predict property price

1.) What are the features that influence the property pricing?

We will compare the features below with the price column to identify the relationships and draw inferences from the results.

(i) Neighborhood and the price differences

(a) Bar Chart — Neighborhood and Price in $

Bar chart (a) of neighborhood vs price depicts how the price changes by neighborhood. Based on the visualization, South Boston Waterfront is among the most expensive neighborhoods, along with Bay Village and Leather District.
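As a rough sketch of how the numbers behind a chart like (a) could be computed, the snippet below averages price per neighborhood with pandas. The column names `neighbourhood_cleansed` and `price`, the `$1,200.00` string format, and the toy rows are assumptions made for illustration; they are not taken from the notebook.

```python
import pandas as pd

# Toy stand-in for the listings table. Column names and values are
# assumptions for illustration, not the real Boston data.
listings = pd.DataFrame({
    "neighbourhood_cleansed": ["Bay Village", "Bay Village", "Brighton", "Leather District"],
    "price": ["$250.00", "$300.00", "$100.00", "$1,200.00"],
})


def price_to_float(series: pd.Series) -> pd.Series:
    """Strip '$' and thousands separators, then convert to float."""
    return series.str.replace(r"[$,]", "", regex=True).astype(float)


listings["price_usd"] = price_to_float(listings["price"])

# Mean price per neighborhood, most expensive first.
mean_price_by_neighbourhood = (
    listings.groupby("neighbourhood_cleansed")["price_usd"]
    .mean()
    .sort_values(ascending=False)
)
print(mean_price_by_neighbourhood)
```

Calling `.plot(kind="bar")` on the resulting Series would give a bar chart in the same spirit as (a).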

(ii) Room Types and Bedrooms pricing

Bar charts of price in $ by (b) room type and number of bedrooms, (c) Superhost status and room type

The difference in listing prices is correlated with the room type and the number of bedrooms. Based on chart (b), Entire home/apt is the most expensive room type in the Boston dataset, and the price increases with the number of bedrooms. Chart (c) again shows high pricing for Entire home/apt; another insight we can draw is that properties with a Superhost are priced higher than properties with a regular host.
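A comparison like chart (b) can be approximated with a pivot table of mean price by room type and bedroom count. This is only a sketch: the column names `room_type`, `bedrooms`, and `price_usd` (an already-numeric price) and the toy rows are assumptions.

```python
import pandas as pd

# Toy rows; names and values are illustrative assumptions.
listings = pd.DataFrame({
    "room_type": ["Entire home/apt", "Entire home/apt", "Entire home/apt",
                  "Private room", "Private room", "Shared room"],
    "bedrooms": [1, 2, 2, 1, 1, 1],
    "price_usd": [150.0, 250.0, 270.0, 80.0, 90.0, 40.0],
})

# Rows are room types, columns are bedroom counts, cells are mean prices.
# Combinations that never occur in the data show up as NaN.
mean_price = listings.pivot_table(
    index="room_type", columns="bedrooms", values="price_usd", aggfunc="mean"
)
print(mean_price)
```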

(iii) Property Type and Pricing

(d) Box Plot — Property Type and price in $

Based on box plot (d), the Other property type has a large number of listings with high variance in price. Apart from that, a large number of Villa-type properties fall in the median price range. The analysis further shows that Condominium properties are more expensive than Apartments and Villas.

(iv) Cancellation Policy and Pricing

(e) Box plot — Cancellation Policy and Price in $

The cancellation policies in the listings dataset are Flexible, Moderate, Strict, and Super strict. In the data, the Super strict policy has the highest median price ($300). However, the outliers in the other three policies extend up to the maximum price range. Looking at the 75th percentile in the box plot, Flexible is below $200, while Moderate and Strict are between $200 and $300.
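The quartiles that a box plot like (e) draws can also be read off numerically. The sketch below uses `describe()` per policy; the column names `cancellation_policy` and `price_usd` and the toy values are assumptions for illustration.

```python
import pandas as pd

# Toy rows; names and values are illustrative assumptions.
listings = pd.DataFrame({
    "cancellation_policy": ["flexible"] * 4 + ["strict"] * 4,
    "price_usd": [50.0, 100.0, 150.0, 200.0, 100.0, 200.0, 300.0, 400.0],
})

# describe() reports count, mean, std, min, quartiles, and max per group;
# we keep only the quartiles that define the box in a box plot.
summary = (
    listings.groupby("cancellation_policy")["price_usd"]
    .describe()[["25%", "50%", "75%"]]
)
print(summary)
```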

(v) Price Trend based on the months in a year

(f) Line plot — Monthly trend on pricing

The line chart (f) depicts a seasonal trend in monthly pricing. Prices rise through the year starting from January, with the largest jump in August and September, and then drop between October and December. Pricing is fairly flat from January to March, with a small jump between March and April.

2.) What time of year has the highest rental prices?

The line chart (f) visualizes the change in pricing per month, and it shows that September is the most expensive month of the year to travel and stay in Airbnb listings in Boston.
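A monthly trend like the one in chart (f) can be sketched by grouping dated prices by calendar month. The table name `calendar`, its columns `date` and `price_usd`, and the values below are invented for illustration and are not the real figures.

```python
import pandas as pd

# Toy dated prices; names and values are illustrative assumptions.
calendar = pd.DataFrame({
    "date": ["2017-01-15", "2017-01-20", "2017-09-05", "2017-09-18", "2017-12-01"],
    "price_usd": [150.0, 170.0, 240.0, 260.0, 180.0],
})
calendar["date"] = pd.to_datetime(calendar["date"])

# Average price per calendar month (1 = January, ..., 12 = December).
monthly_mean = calendar.groupby(calendar["date"].dt.month)["price_usd"].mean()

# Month number with the highest average price in this toy sample.
peak_month = monthly_mean.idxmax()
print(monthly_mean)
print("Peak month:", peak_month)
```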

3.) Do Superhosts perform better than other hosts?

(i) Average response time by host type

(g) Bar Chart — Superhost and average response time

Bar chart (g) compares the superhost column against the share of each value in the host_response_time column. Comparing hosts and superhosts, more than 50% of superhost responses arrive within an hour. Superhosts also don't seem to take more than a day to respond, whereas 1.5% of responses from other hosts take a few days or more. In this comparison, superhosts have performed better than other hosts.
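The percentage split described above can be sketched with a row-normalized cross-tabulation. The `t`/`f` encoding of `host_is_superhost` and the toy rows are assumptions; only the `host_response_time` column name comes from the text.

```python
import pandas as pd

# Toy rows; the 't'/'f' flag encoding and the values are assumptions.
hosts = pd.DataFrame({
    "host_is_superhost": ["t", "t", "t", "t", "f", "f", "f", "f"],
    "host_response_time": [
        "within an hour", "within an hour", "within an hour", "within a few hours",
        "within an hour", "within a few hours", "within a day", "a few days or more",
    ],
})

# normalize="index" makes each row (host type) sum to 1; multiplying by 100
# gives the percentage of each response-time category per host type.
share = pd.crosstab(
    hosts["host_is_superhost"], hosts["host_response_time"], normalize="index"
) * 100
print(share.round(1))
```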

(ii) Host Response and Acceptance rate

Bar charts — Superhost vs (h) response rate by room type, (i) acceptance rate by room type

The response rate of superhosts is above 90% for all room types, while the response rate of other hosts is slightly lower, without a large difference. The acceptance rate differs for the Entire home/apt and Shared room types: superhosts perform better for Entire home/apt and fall below other hosts for shared rooms. From this visualization, we can infer that superhosts have a better response rate and a fairly equal acceptance rate compared with other hosts in the listings.
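A minimal sketch of how rates like those in (h) and (i) might be aggregated, assuming the rates are stored as strings such as `"95%"` in columns named `host_response_rate` and `host_acceptance_rate`. Both the names and the storage format are assumptions, as are the toy rows.

```python
import pandas as pd

# Toy rows; column names, string format, and values are assumptions.
hosts = pd.DataFrame({
    "host_is_superhost": ["t", "t", "f", "f"],
    "room_type": ["Entire home/apt", "Private room", "Entire home/apt", "Private room"],
    "host_response_rate": ["100%", "96%", "90%", None],
    "host_acceptance_rate": ["95%", "80%", "85%", "70%"],
})

rate_cols = ["host_response_rate", "host_acceptance_rate"]
for col in rate_cols:
    # Drop the trailing '%' and convert to a number; missing values become NaN.
    hosts[col] = pd.to_numeric(hosts[col].str.rstrip("%"), errors="coerce")

# Mean rate per (superhost flag, room type) pair.
rates = hosts.groupby(["host_is_superhost", "room_type"])[rate_cols].mean()
print(rates)
```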

(iii) Review for Cleanliness and Communication

(j) Table — Average review scores for cleanliness and communication for superhosts
(k) Scatter plot — Review scores for cleanliness for the superhost by each room type

The cleanliness review scores for superhosts in scatter plot (k) range from 6 to 10, and the average review score is 9.8 as per (j). Other hosts have received scores between 2 and 10, with an average review score of 9.3.

(l) Scatter plot — Review score for communication for the superhost by each room type

Similar to the cleanliness chart (k), superhosts have received high review scores for communication, between 8 and 10, with many of the scores at 10. The superhosts' average review score for communication as per table (j) is 9.96, while other hosts received 9.70. Based on these insights, superhosts have performed better than other hosts in terms of cleanliness and communication.
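A table like (j) boils down to a grouped mean. The sketch below assumes columns named `review_scores_cleanliness` and `review_scores_communication` and a `t`/`f` superhost flag; the rows are toy values, not the real scores.

```python
import numpy as np
import pandas as pd

# Toy rows; column names, flag encoding, and values are assumptions.
reviews = pd.DataFrame({
    "host_is_superhost": ["t", "t", "t", "f", "f", "f"],
    "review_scores_cleanliness": [10.0, 9.0, np.nan, 10.0, 8.0, 6.0],
    "review_scores_communication": [10.0, 10.0, 9.0, 10.0, 9.0, 8.0],
})

score_cols = ["review_scores_cleanliness", "review_scores_communication"]

# mean() skips missing scores, so listings without reviews do not drag
# the average down.
avg_scores = reviews.groupby("host_is_superhost")[score_cols].mean()
print(avg_scores)
```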

4.) How are the listings distributed across the city of Boston?

(q) Spatial data plotted over Boston Map

The map shows that listings are densely distributed along the motorway and railway. Listings are also concentrated around Boston North Station, Cambridgeport, Brighton, and East Boston. This suggests that most of the listings have easy access to transportation and shopping in the city.

Create a model to predict property price

In this part, we evaluate the Random Forest Regressor and Linear Regression algorithms for predicting the listing price, and compare the results to select the best-performing model.

Mean Absolute Error (MAE): This measures the average absolute difference between the actual values and the predicted values.
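To make the metric concrete, here is a small worked example with three made-up prices. The arrays are hypothetical; scikit-learn's `mean_absolute_error` computes the same quantity.

```python
import numpy as np

# Hypothetical actual and predicted prices in $.
y_true = np.array([100.0, 150.0, 200.0])
y_pred = np.array([110.0, 140.0, 230.0])

# Absolute errors are 10, 10, and 30; MAE is their mean.
abs_errors = np.abs(y_true - y_pred)
mae = abs_errors.mean()
print(abs_errors, mae)
```

An MAE of about 16.67 here would mean the predictions are off by roughly $16.67 per listing on average.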

We have already selected the required features for fitting the model, completed the data cleaning, and saved the dataframe in the df_int variable. df_int is then split into train and test datasets; please check my GitHub repository for the details.

(r) Mean Absolute Error of Random Forest Regressor and Linear Regression
(s) Scatter Plot — Difference between Predicted value and Actual Value by Linear Regression
(t) Scatter Plot — Difference between Predicted value and Actual Value by Random Forest Regressor

The MAE values of the Random Forest Regressor and Linear Regression show a large difference in their performance at predicting price. The scatter plots (s) and (t) visualize the difference between the predicted and actual values for each model. Random Forest Regressor is the best-performing model according to the MAE value, with room_type, bedrooms, and neighborhood as the features contributing most to the result.
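The comparison can be sketched end to end with scikit-learn. Everything below runs on synthetic data: the toy `df_int`, the price formula, the 70/30 split, and the hyperparameters are assumptions for illustration, so the MAE values it prints say nothing about the real dataset. The neighborhood feature is left out to keep the sketch short.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned dataframe (assumption, not real data).
rng = np.random.default_rng(0)
n = 300
df_int = pd.DataFrame({
    "bedrooms": rng.integers(1, 5, size=n),
    "room_type": rng.choice(["Entire home/apt", "Private room", "Shared room"], size=n),
})
base = df_int["room_type"].map(
    {"Entire home/apt": 150, "Private room": 70, "Shared room": 40}
)
df_int["price"] = base + 40 * df_int["bedrooms"] + rng.normal(0, 15, size=n)

# One-hot encode the categorical feature, then split into train and test sets.
X = pd.get_dummies(df_int[["bedrooms", "room_type"]], columns=["room_type"])
y = df_int["price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest Regressor": RandomForestRegressor(n_estimators=100, random_state=42),
}

# Fit each model and record its MAE on the held-out test set.
mae_by_model = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    mae_by_model[name] = mean_absolute_error(y_test, model.predict(X_test))
print(mae_by_model)

# Impurity-based feature importances from the fitted random forest.
importances = pd.Series(
    models["Random Forest Regressor"].feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

Plotting `y_test` against each model's predictions would give scatter plots in the spirit of (s) and (t).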

Conclusion

1.) We have found that property pricing depends on many features and is strongly correlated with property size and room type. That said, pricing is sensitive, and we cannot rely only on the features available in our observations; many other factors need to be considered to determine the price.
2.) Superhost is a recognition provided by Airbnb for performing well in hospitality. The data we analysed indicates that Superhosts do perform better than other hosts in the listings.
3.) September and October are the most expensive months to travel and stay in Airbnb rooms.
4.) Listings are densely distributed around Boston North Station, Cambridgeport, Brighton, and East Boston.
5.) We have created a model using the Random Forest Regressor algorithm to predict property pricing from a few features selected from the listings dataset.

I hope you like my presentation, please feel free to reach out.

LinkedIn Profile.
