afandigit/Market-Housing

Market Housing Project



Overview

I am using the Scrapy Framework to scrape datasets from websites offering real estate listings for sales and rentals. The goal is to analyze this data and develop a Machine Learning model that suggests listings to clients based on their preferences and financial capabilities.

This involves extracting relevant features from the listings, such as location, price, property type, and amenities, and using these features to train a recommendation system.

The system will leverage data analysis, data science techniques, and business intelligence to provide personalized and financially feasible real estate options for clients.

This project aims to enhance the user experience and optimize the property search process, ultimately driving better business outcomes for real estate platforms.

Features

  • Web Scraping : Scrapy Framework. [Done]
  • Data Cleaning. [Done]
  • Data Analysis - Explore and Visualize data : Python libraries - Microsoft Power BI. [Done]
  • MERN Stack. [Current step ...]
  • Data Preparation for Machine Learning : Data standardization and preprocessing. [Future step ...]
  • Machine Learning. [Future step ...]
  • Deploy the trained recommendation model in real time. [Future step ...]

Dataset

I have been scraping this dataset using the Scrapy framework for several days, totaling more than 50 hours, from two real estate websites in Morocco (Avito and Mubawab). Each website has its specific Spider. This initial data scraping is to kickstart the data cleaning process, which I plan to automate in the future. My goal is to scale this dataset to include the most common real estate websites worldwide. Currently, the dataset contains over 14,000 records, with each record representing data scraped from an individual real estate ad.

Dataset description

| Column Name | Description |
| --- | --- |
| advertisement_url | The full URL of the website page from which I retrieve the detailed information of the real estate ad. |
| title | The title of the property advertisement. |
| publication_date | The date the advertisement was published by its owner. |
| price | The price of the property (in DH = Moroccan Dirham, the official currency of Morocco). |
| location | The exact location of the property. |
| description | Details about the property. |
| complete_description | The complete description of the property written by its owner. |
| features_list | A list of property attributes: the property type (Apartment, House, Villa, Farmhouse, ...), property state, number of floors ... |
| insert_date | The date the ad was scraped. |
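
To make the column list above concrete, here is a minimal sketch of how one scraped record could be represented and loaded in Python. The README does not say how the dataset is stored, so the CSV layout is an assumption, and the names `RawAd` and `load_ads` are hypothetical helpers, not part of the project.

```python
import csv
from dataclasses import dataclass, fields


@dataclass
class RawAd:
    """One scraped ad; every value stays raw text until the cleaning steps."""
    advertisement_url: str
    title: str
    publication_date: str
    price: str
    location: str
    description: str
    complete_description: str
    features_list: str
    insert_date: str


def load_ads(lines):
    """Read ads from CSV lines (assumed export format), one RawAd per row."""
    names = [f.name for f in fields(RawAd)]
    reader = csv.DictReader(lines)
    # Missing columns or empty cells become empty strings rather than None.
    return [RawAd(**{n: (row.get(n) or "") for n in names}) for row in reader]
```

`load_ads` accepts an open file or any iterable of text lines whose header row uses the column names from the table.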

Data Cleaning - Using Power BI & Python

Note: the Power BI file is available above ... cleaning

Cleaning Process ....

  1. The first step is removing duplicate records so that they do not distort later analysis. In practice, some people repost the same real estate ad several times on these websites, so it appears multiple times to visitors.
  • Result: 3848 duplicate records were found.
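
The duplicate removal described above was done in Power BI and Python; the sketch below shows one way it could look in plain Python. Which columns define a "duplicate" is not stated in this README, so the default `key_fields` are an assumption (a repost usually gets a new URL, so the URL is left out of the key), and `drop_duplicates` is a hypothetical helper name.

```python
def drop_duplicates(records, key_fields=("title", "price", "location", "complete_description")):
    """Keep the first occurrence of each ad and count how many reposts were dropped.

    records: list of dicts, one per scraped ad.
    key_fields: columns compared to decide that two ads are the same (assumed).
    """
    seen = set()
    unique = []
    for record in records:
        # Compare on trimmed, lower-cased text so trivial edits still match.
        key = tuple(str(record.get(field, "")).strip().lower() for field in key_fields)
        if key in seen:
            continue
        seen.add(key)
        unique.append(record)
    return unique, len(records) - len(unique)
```

The second return value is the number of removed records, which is the figure reported in the result line above.
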
  1. Cleaning the advertisement_url column --> adding a "website_name" column derived from it, and visualizing the number of records per website.
  2. Cleaning the title column --> adding three new columns derived from it, with support from the features_list and price columns: 'ad_type', 'property_type' and 'property_surface'. This step also counts the most frequent words across all titles and analyzes the new 'property_surface' column.
    1. The ad_type column indicates the type of advertisement: Sale, Rental or Vacation Rental.
      • As the word-count visual in the Analysis part below shows, ad owners sometimes state the ad type clearly in the title or description, such as "rental" or "apartment for sale", but many entries contain spelling and grammar mistakes. After analyzing the title and description columns in depth, I built normalized dictionaries of the words a person might enter for each ad type.
    2. The property_type column holds one of these categories: Apartments, Villas and Riads, Houses, Rooms, Land and Farms, Desks, Flatsharing, Warehouses or Other Real Estate.
      • The price of a rental is sometimes expressed as "xxxx DH par jour", "xxxx DH/jour", "xxxx DH/night", "xxxx DH/day" ..., so the price can be used to classify the ad as a vacation rental.
    3. The property_surface column indicates the surface of the property (m²).
      • The title is checked first, then the features list, then the description, for a surface written as "xxxx m²" or "xxx م". In some cases the ad owner does not mention the unit "m²", which makes the value hard to extract. After deeper analysis, I found that these ads give the surface in the features list under the "Surface habitable" category, for example: "[ ...... ;Âge du bien;Neuf;Surface habitable;55;Étage;4; ....]".
  3. Cleaning the publication_date and insert_date columns --> adding year, month and day of publication columns derived from them.
    • At the start of a scrape, some ads were published the same day and their publication date reads "publié aujourd'hui". As the spider keeps scraping, it reaches ads published yesterday, a few days ago or months ago. The publication date is therefore relative, so the scrape date 'insert_date' is stored as well, and the real publication date is later computed as the difference between the two.
  4. Cleaning the price column --> adding 'property_price', 'price_currency' and 'price per period (for rental ads)' columns derived from it.
  5. Cleaning the location column --> this column is kept, but cleaned.
    • Some locations contain unnecessary details that can distort later analysis. For example, some ads have 'Secteur Touristique à Agadir' as location while others have 'Agadir'; they are the same location, but the computer treats them as different. These extra details need to be removed.
    • Ad owners often misspell the location (city) name, for example:
    • 'laäyoune' --> 'laayoune'
    • 'asilah' --> 'assilah'
    • 'béni yakhlef' --> 'Ben Yakhlef'
    • etc...
    • So I manually created a dictionary to normalize all city names misspelled by ad owners. I then used a JSON file containing all city names in French and Arabic, so that Arabic city names can be converted to French.
  6. Cleaning the features_list column --> adding a 'number of rooms' column, because the number of rooms is one of the most meaningful pieces of information in this list of property attributes, after what was extracted earlier.
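
Several of the cleaning steps above are rule based, so they can be sketched as small Python functions. This is an illustration under stated assumptions, not the project's actual code: the function names are hypothetical, the relative date labels other than "publié aujourd'hui" ("hier", "il y a N jours") are assumed wording, and taking the text after " à " as the city is a guess based on the single 'Secteur Touristique à Agadir' example.

```python
import re
from datetime import timedelta

# Spelling fixes quoted in the list above (lower-cased); the real dictionary is larger.
CITY_ALIASES = {"laäyoune": "laayoune", "asilah": "assilah", "béni yakhlef": "ben yakhlef"}

PER_DAY = re.compile(r"(?:/\s*|\bpar\s+)(?:jour|day|night)\b", re.IGNORECASE)
SURFACE = re.compile(r"(\d+(?:[.,]\d+)?)\s*(?:m²|m2|م)", re.IGNORECASE)


def parse_price(raw):
    """Split a raw price such as '1 500 DH/jour' into (amount, currency, period)."""
    before_unit = re.split(r"dh", raw, flags=re.IGNORECASE)[0]
    digits = re.sub(r"\D", "", before_unit)
    amount = int(digits) if digits else None
    currency = "DH" if re.search(r"dh", raw, re.IGNORECASE) else None
    # A per-day price hints at a vacation rental.
    period = "day" if PER_DAY.search(raw) else None
    return amount, currency, period


def extract_surface(title, features_list, description):
    """Look for 'xxx m²' or 'xxx م' in the title, then features list, then description."""
    for text in (title, features_list, description):
        match = SURFACE.search(text or "")
        if match:
            return float(match.group(1).replace(",", "."))
    # Fallback for ads without a unit: the value that follows 'Surface habitable'.
    items = [item.strip() for item in (features_list or "").strip("[] ").split(";")]
    for key, value in zip(items, items[1:]):
        if key.lower() == "surface habitable" and value.isdigit():
            return float(value)
    return None


def publication_date(label, insert_date):
    """Turn a relative label into a date, counting back from the scrape date."""
    text = label.lower()
    if "aujourd" in text:
        return insert_date
    if re.search(r"\bhier\b", text):
        return insert_date - timedelta(days=1)
    match = re.search(r"(\d+)\s*jours?", text)
    if match:
        return insert_date - timedelta(days=int(match.group(1)))
    return None  # months and unknown wording are left unresolved in this sketch


def normalize_city(location):
    """Keep the city part of the location and apply the spelling dictionary."""
    city = location.split(" à ")[-1].strip().lower()
    return CITY_ALIASES.get(city, city)
```

Each function returns `None` (or the unchanged lower-cased city) when no rule applies, so unmatched records can be listed and reviewed by hand, which is how the normalization dictionaries described above would grow over time.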

Data Analysis

  • Avito has 25.73% more records than Mubawab in this dataset.

In terms of property type, apartments are the largest category sold or rented in the Moroccan market.


Chart Improvement ...


I applied the data storytelling rule so that the chart conveys its message in less than 5 seconds.


Instead of using a line chart to compare the number of ads over time, we use a slope graph, which shows the degree of variation over time in the number of scraped ads.



  • The majority of properties have a surface area between 50 m² and 175 m².
  • The average number of rooms in this dataset is:

Dashboard number 1

In this dashboard, I compare statistical measures between the two websites, Avito and Mubawab, including the maximum sale price of a specific type of real estate per website and per location. In this case, the selection is the maximum sale price of apartments across all of Morocco with a surface between 100 and 200 m².


In the Sale category, Avito is the better market to search for an apartment for sale because its average price is lower than Mubawab's. The same holds for rental and vacation rental apartments.


However, if we select Agadir apartments with a surface between 100 and 120 m², for example, Mubawab has better prices than Avito for rentals and sales, while Avito remains a good choice for renting a vacation apartment.
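
The comparison above is a filtered average per website and ad type. The dashboards themselves are built in Power BI; the sketch below only illustrates the same calculation in Python. The function name `average_price` and the `city` key are hypothetical, while the other keys follow the columns created during cleaning.

```python
from collections import defaultdict
from statistics import mean


def average_price(rows, city=None, property_type="Apartments",
                  min_surface=None, max_surface=None):
    """Average property_price per (website_name, ad_type) after filtering.

    rows: list of dicts with the cleaned columns; 'city' is an assumed key.
    """
    groups = defaultdict(list)
    for row in rows:
        if property_type and row["property_type"] != property_type:
            continue
        if city and row["city"].lower() != city.lower():
            continue
        if min_surface is not None and row["property_surface"] < min_surface:
            continue
        if max_surface is not None and row["property_surface"] > max_surface:
            continue
        groups[(row["website_name"], row["ad_type"])].append(row["property_price"])
    return {key: mean(prices) for key, prices in groups.items()}
```

Calling `average_price(rows, city="Agadir", min_surface=100, max_surface=120)` would reproduce the Agadir selection, and dropping the `city` argument gives the country-wide view. Comparing the two results shows how the cheaper platform can change once the filter narrows.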




Dashboard number 2

This dashboard gives a general idea of the property market in terms of surface area and average number of rooms for each type of property in each city.




Dashboard number 3

If I want to invest my money, I can easily choose where and from which platform I can buy for a lower price with the same features.


Chart Improvement ...


This chart carries too much information, and the price difference between the two platforms is not clear enough for me to decide quickly where to invest my money.


I reduced the noise and applied the data storytelling rule so that the message reaches the audience in less than 5 seconds.

From this chart



To this chart



To this chart




Dashboard number 4



Dashboard number 5




Dashboard number 6



Dashboard number 6



For more details, or to try out the dashboards above interactively, please download the Power BI file provided in the files section above ...
