Scrape data from the web with Python: a simple case of realestate.com.au


Introduction

Collecting data is always the first step of a data analysis task. Apart from a company's internal database, open data from millions of websites can give us a better understanding of the nature and trends of our business. So it is always helpful to learn how to collect data from websites automatically.

Web scraping can be head-scratchingly complicated or as easy as pie, depending on the kind of data source you want to explore.

In this article, a simple case is introduced to illustrate the steps and tools involved in web scraping. Here "simple" means that the data is collected directly by sending requests and receiving responses, without intricacies such as cookies and authentication.

The purpose of the code below is to collect the number of properties on offer (the supply) in each suburb of the Greater Sydney area from realestate.com.au.

Generally, realestate.com.au receives a request containing a suburb name and a postcode, and shows the result page after the search button is clicked. The following picture shows an example for "Eastwood, NSW 2122".

A web crawler works by imitating what a human does, sending requests and handling responses, so that the desired results are obtained automatically.

We are led to web pages by URLs, so to build a web crawler it is always a good idea to start by analyzing the URL of the result page. The example page's URL is https://www.realestate.com.au/buy/in-eastwood,+nsw+2122/list-1. It is not hard to find the keywords "eastwood", "nsw" and "2122" that we entered.

Since we get a new result by sending a new request, replacing the keywords in the URL also works in most cases. This time we use "Epping, NSW 2121" as an example. After replacing "eastwood" with "epping" and "2122" with "2121", we get a new URL, https://www.realestate.com.au/buy/in-epping,+nsw+2121/list-1. As expected, the new URL returns the desired page.
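
The keyword substitution can be sketched as a small helper. This is only an illustration: the function name `build_url` is hypothetical, and joining multi-word suburb names with "+" is an assumption based on the URL format shown above.

```python
# Template taken from the example URL in the text; only the keywords vary.
URL_TEMPLATE = "https://www.realestate.com.au/buy/in-{suburb},+{state}+{postcode}/list-1"


def build_url(suburb, postcode, state="nsw"):
    """Build a search-result URL from a suburb name and a postcode."""
    # Lowercase the name and join its words with "+" (assumed convention).
    slug = "+".join(suburb.lower().split())
    return URL_TEMPLATE.format(suburb=slug, state=state.lower(), postcode=postcode)


print(build_url("Eastwood", 2122))
# https://www.realestate.com.au/buy/in-eastwood,+nsw+2122/list-1
print(build_url("Epping", 2121))
# https://www.realestate.com.au/buy/in-epping,+nsw+2121/list-1
```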

So the whole job can be broken into 2 steps.
1. Find all Sydney suburbs and their postcodes, and concatenate them into URLs that are ready to visit.
2. Send each URL to the server and record the value in the response.

1. Concatenating URLs

Suburbs are listed at the Statistical Area Level 2 (SA2) level of the Australian Statistical Geography Standard (ASGS), published by the Australian Bureau of Statistics (ABS) in 2011. Postcodes can be found in the ABS correspondence between the ASGS and the previous Australian Standard Geographical Classification (ASGC).

The zip file "Postcode 2011 to Statistical Area Level 2 2011" can be downloaded from http://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/1270.0.55.006July%202011?OpenDocument.

To simplify the process, an Excel file containing the postcodes and URLs has been prepared and can be downloaded from https://docs.google.com/spreadsheets/d/1C-ugqEQOuYDMdoeI2TTo2LBGKbzP8Onkp37OSWlFRao/edit?usp=sharing.

We import pandas to read the Excel file.
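
A minimal sketch of this step, assuming the spreadsheet has been saved locally as an .xlsx file. The helper name `load_table` and the filename in the comment are hypothetical; the "url" column name comes from the spreadsheet itself.

```python
import pandas as pd


def load_table(path):
    """Read the prepared spreadsheet into a DataFrame."""
    # pandas relies on an Excel engine such as openpyxl to read .xlsx files.
    return pd.read_excel(path)


def readlist(path):
    """Return the values of the "url" column as a plain Python list."""
    table = load_table(path)
    return table["url"].tolist()


# Example call (hypothetical local filename):
# urls = readlist("sydney_suburbs.xlsx")
```

Calling `head()` on the returned DataFrame previews its first five rows, as in the listing that follows.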

   index  regions     Suburb Postcode  Suburb Name  url                                                supply
0  1      Sydney CBD  2000             Sydney City  https://www.realestate.com.au/buy/in-Sydney+Ci...  125
1  2      Sydney CBD  2007             Ultimo       https://www.realestate.com.au/buy/in-Ultimonsw...  41
2  3      Sydney CBD  2008             Chippendale  https://www.realestate.com.au/buy/in-Chippenda...  56
3  4      Sydney CBD  2009             Pyrmont      https://www.realestate.com.au/buy/in-Pyrmontns...  63
4  5      Sydney CBD  2010             Surry Hills  https://www.realestate.com.au/buy/in-Surry+Hil...  47

Besides "Suburb Name", "Suburb Postcode" and "url", there is a column named "supply". The purpose of our program is to update the values in that column.

2. Recording the response values

Three functions are defined in the following code.
- "readlist" receives the path of the Excel file as a string and returns a list of URLs that are ready to visit.
- "sendrequest" receives the list of URLs, sends a request for each one to the web server and then parses the returned HTML with a regular expression. It returns a list of supply values.
- "exportfile" receives the list of supply values, writes them into the Excel file and renames the file by adding a date suffix.
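
The request, parsing and file-naming parts of these functions might look like the sketch below, which uses only the standard library and leaves the spreadsheet reading and writing to pandas. Several things here are assumptions rather than the original code: the regular expression (the real markup of the result page is not shown in this article), the browser-like User-Agent header, the pause between requests, and the helper names `parse_supply`, `fetch` and `dated_path`.

```python
import re
import time
import urllib.request
from datetime import date
from pathlib import Path

# ASSUMPTION: the result page contains text such as "of 125 properties".
# Inspect a real response and adjust this pattern before relying on it.
SUPPLY_PATTERN = re.compile(r"of\s+(\d[\d,]*)\s+properties", re.IGNORECASE)


def parse_supply(html):
    """Return the listing count found in html, or None if nothing matches."""
    match = SUPPLY_PATTERN.search(html)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


def fetch(url, timeout=10):
    """Download one page and return its body as text."""
    # ASSUMPTION: a browser-like User-Agent header is sent with each request.
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def sendrequest(urls, delay=1.0, fetcher=fetch):
    """Visit each URL in turn and collect one supply value per URL."""
    supplies = []
    for url in urls:
        try:
            supplies.append(parse_supply(fetcher(url)))
        except OSError:
            # URLError, HTTPError and timeouts are OSError subclasses;
            # record a gap and carry on with the next URL.
            supplies.append(None)
        time.sleep(delay)  # pause between requests
    return supplies


def dated_path(path, today=None):
    """Return path with a date suffix added before the file extension."""
    today = today or date.today()
    path = Path(path)
    return path.with_name(f"{path.stem}_{today.isoformat()}{path.suffix}")
```

The `fetcher` parameter exists only so that the parsing logic can be tried on saved HTML without touching the network; for example, `sendrequest(["a"], delay=0, fetcher=lambda url: "1 - 25 of 125 properties")` returns `[125]`.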

The result is an Excel file, as the following screenshot shows.
