In this post, we look at the web scraping protection on Redfin, a well-known platform for real estate listings. The technique relies on plain HTTP requests from Python, with no need for a web driver such as Selenium. That means it can run in a variety of environments, including Colab and others without browser support.
Before proceeding with the web scraping technique described herein, it is important to understand and acknowledge the legal and ethical implications associated with scraping web data. This disclaimer serves as a reminder that the use of web scraping tools or techniques to extract data from websites may be subject to legal restrictions or terms of service imposed by the website owners.
Initially, I employed Postman to test the request, using the following link as an example: https://www.redfin.com/zipcode/10001/filter/include=sold-6mo.
Regrettably, Redfin’s web scraping protection detected the suspicious request from Postman and returned an “Oops!” page. You can see the screenshot below:
I then switched back to the Edge browser to inspect the network traffic for that request. To do so, press F12 and open the Network tab.
My next step was to investigate whether any JSON data was being transferred between the client and server. To do this, I pressed Ctrl + F and searched for the text “410 W 25th St,” which is the address on the first listing card. The results showed that this text appeared only in the response to the main page request. You can refer to the screenshot below:
Unfortunately, this meant there was no separate JSON payload to retrieve. The data appears to be hidden using some React technique that I am unfamiliar with, which is intriguing in itself.
Given the above, I went back to the idea I had tried with Postman, replaying the request by hand, because I found that the browser’s original request could be edited and resent without any data loss: no new headers or updated cookies were required, and no access state was being tracked yet. I therefore assumed that the headers could be copied and, by manipulating them, I could craft a plain HTTP request that exposed the data without involving JavaScript.
To achieve this, I clicked on “Edit and Resend,” which sent the request to another tab at the bottom of the Network Console. See the screenshot below:
I successfully sent multiple requests using those headers without losing any data. Refer to the screenshot below:
Next, I attempted to simulate those headers by copying the cURL command (bash) and pasting it into an online converter such as https://curlconverter.com/. When I ran the converted code for the first time, it provided the desired response with the required data. The next step involved removing unnecessary elements one by one to reduce the size. See the screenshot below:
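The trimming step can also be automated. The following is only a minimal sketch of that idea, not the exact procedure I followed: `minimize` and `still_works` are hypothetical names, and the extra cookie names in the demo are placeholders rather than real Redfin cookies. The check is passed in as a function so the loop itself needs no network access.

```python
from typing import Callable, Dict


def minimize(params: Dict[str, str],
             still_works: Callable[[Dict[str, str]], bool]) -> Dict[str, str]:
    """Drop entries one at a time, keeping only those the request needs."""
    kept = dict(params)
    for name in list(kept):
        # Try the request without this one entry.
        trial = {k: v for k, v in kept.items() if k != name}
        if still_works(trial):
            kept = trial  # The entry was unnecessary, so leave it out.
    return kept


def fake_still_works(cookies: Dict[str, str]) -> bool:
    # Stand-in for a real request: pretend only RF_UNBLOCK_ID matters.
    return "RF_UNBLOCK_ID" in cookies


# Placeholder cookie names standing in for whatever the browser copied.
copied = {"RF_UNBLOCK_ID": "avWazYWD", "placeholder_a": "1", "placeholder_b": "2"}
print(minimize(copied, fake_still_works))  # {'RF_UNBLOCK_ID': 'avWazYWD'}
```

In real use, `still_works` would send the request with the trial cookies (or headers) and return whether a known listing such as “410 W 25th St” still appears in the response text.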
Finally, I present a simplified code snippet that accomplishes the task:
import requests

# The only cookie left after trimming the converted request.
cookies = {
    'RF_UNBLOCK_ID': 'avWazYWD',
}
response = requests.get('https://www.redfin.com/zipcode/10001/filter/include=sold-6mo', cookies=cookies)
With these modifications, the remaining steps become much simpler. I will utilize BeautifulSoup to handle the HTML, specifically extracting the price and address fields for demonstration purposes:
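from bs4 import BeautifulSoup
import requests

cookies = {
    'RF_UNBLOCK_ID': 'avWazYWD',
}
response = requests.get('https://www.redfin.com/zipcode/10001/filter/include=sold-6mo', cookies=cookies)
soup = BeautifulSoup(response.text, 'html.parser')

# Each listing card keeps its price and address inside a div with class "bottomV2".
frames = soup.find_all('div', {'class': 'bottomV2'})
for frame in frames:
    price = frame.find('span', {'class': 'homecardV2Price'})
    if price is None:
        # Fall back to the alternative label when the regular price span is absent.
        price = frame.find('span', {'class': 'priceLabelV2 font-size-smaller'})
    address = frame.find('div', {'class': 'homeAddressV2'})
    print({
        # Guard against missing elements instead of raising AttributeError.
        'price': price.text if price is not None else None,
        'address': address.text if address is not None else None,
    })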
The result obtained from executing the code is as follows:
{'price': ' Sign in for price ', 'address': '410 W 25th St Unit PHB, New York, NY 10001'}
{'price': ' Sign in for price ', 'address': '357 W 29th St Unit GB, New York, NY 10001'}
{'price': ' Sign in for price ', 'address': '20 W 27th St #5, New York, NY 10001'}
{'price': ' Sign in for price ', 'address': '146 W 29th St Unit 9RE, New York, NY 10001'}
{'price': ' Sign in for price ', 'address': '408 W 34th St Unit 6B, New York, NY 10001'}
{'price': ' Sign in for price ', 'address': '11 W 30th St Unit 9-F, New York, NY 10001'}
{'price': ' Sign in for price ', 'address': '433 W 34th St Unit 3-E, New York, NY 10001'}
{'price': ' Sign in for price ', 'address': '520 W 27th St #601, New York, NY 10001'}
{'price': ' Sign in for price ', 'address': '315 Seventh Ave Unit 8C, New York, NY 10001'}
{'price': ' Sign in for price ', 'address': '144 W 27th St Unit 4F, New York, NY 10001'}
{'price': ' Sign in for price ', 'address': '252 Seventh Ave Unit 9O, New York, NY 10001'}
{'price': ' Sign in for price ', 'address': '252 Seventh Ave Unit 10X, New York, NY 10001'}
{'price': ' Sign in for price ', 'address': '252 Seventh Ave Unit 17C, New York, NY 10001'}
{'price': ' Sign in for price ', 'address': '225 W 25th St Unit 5E, New York, NY 10001'}
{'price': ' Sign in for price ', 'address': '408 W 25th St Unit 4FW, New York, NY 10001'}
{'price': ' Sign in for price ', 'address': '261 W 25th St Unit 3E, New York, NY 10001'}
{'price': ' Sign in for price ', 'address': '261 W 25th St Unit 3B, New York, NY 10001'}
{'price': ' Sign in for price ', 'address': '50 W 30th St Unit 4-C, New York, NY 10001'}
{'price': ' Sign in for price ', 'address': '1200 Broadway Unit 7A, New York, NY 10001'}
{'price': ' Sign in for price ', 'address': '520 W 28th St #10, New York, NY 10001'}
{'price': ' Sign in for price ', 'address': '15 Hudson Yards Unit 71E, New York, NY 10001'}
{'price': ' Sign in for price ', 'address': '522 W 29th St Unit 5-B, New York, NY 10001'}
{'price': '$2,800,000', 'address': '233 W 26th St Unit 7-E, New York, NY 10001'}
{'price': ' Sign in for price ', 'address': '35 Hudson Yards #6101, New York, NY 10001'}
{'price': ' Sign in for price ', 'address': '35 Hudson Yards #5704, New York, NY 10001'}
{'price': ' Sign in for price ', 'address': '35 Hudson Yards #5702, New York, NY 10001'}
{'price': ' Sign in for price ', 'address': '35 Hudson Yards #5901, New York, NY 10001'}
{'price': '$2,750,000', 'address': '1182 Broadway Unit 13A, New York, NY 10001'}
{'price': ' Sign in for price ', 'address': '35 Hudson Yards #8102, New York, NY 10001'}
{'price': ' Sign in for price ', 'address': '35 Hudson Yards #6001, New York, NY 10001'}
{'price': ' Sign in for price ', 'address': '215 W 28th St Unit 14B, New York, NY 10001'}
{'price': ' Sign in for price ', 'address': '215 W 28th St Unit 8D, New York, NY 10001'}
{'price': ' Sign in for price ', 'address': '35 Hudson Yards #5302, New York, NY 10001'}
{'price': ' Sign in for price ', 'address': '35 Hudson Yards #5502, New York, NY 10001'}
{'price': ' Sign in for price ', 'address': '35 Hudson Yards #6201, New York, NY 10001'}
{'price': ' Sign in for price ', 'address': '35 Hudson Yards #6202, New York, NY 10001'}
{'price': ' Sign in for price ', 'address': '35 Hudson Yards #8103, New York, NY 10001'}
{'price': ' Sign in for price ', 'address': '35 Hudson Yards #8201, New York, NY 10001'}
{'price': ' Sign in for price ', 'address': '35 Hudson Yards #8501, New York, NY 10001'}
{'price': ' Sign in for price ', 'address': '25 W 28th St Ph 40D, New York, NY 10001'}
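Most rows in the output above carry the space-padded placeholder “ Sign in for price ” rather than a figure. If the prices are going to be processed further, a small helper can turn each string into a number or `None`. This is a sketch only; `parse_price` is a hypothetical name, and it assumes prices always look like the two dollar amounts shown above.

```python
import re
from typing import Optional


def parse_price(text: str) -> Optional[int]:
    """Return the price in whole dollars, or None when no figure is shown."""
    # Accept strings such as "$2,800,000"; anything else counts as "no price".
    match = re.fullmatch(r"\$(\d[\d,]*)", text.strip())
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


print(parse_price("$2,800,000"))           # 2800000
print(parse_price(" Sign in for price "))  # None
```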
It is worth noting that the hardcoded cookie 'RF_UNBLOCK_ID': 'avWazYWD' may change in the near future. However, we can address this by making an initial request to obtain a fresh RF_UNBLOCK_ID and then using it for subsequent requests.
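The renewal idea could look roughly like the sketch below. It rests on an assumption I have not verified: that the server sets RF_UNBLOCK_ID through a normal Set-Cookie response header on a plain request. If the cookie is set by JavaScript instead, this will not work. The function names are hypothetical, and `session` is expected to behave like a `requests.Session`.

```python
from typing import Any, Optional

COOKIE_NAME = "RF_UNBLOCK_ID"


def refresh_unblock_cookie(session: Any, url: str) -> Optional[str]:
    """Make one warm-up request and return the cookie value, or None."""
    # Any cookies the server sets are stored on the session automatically.
    session.get(url, timeout=30)
    return session.cookies.get(COOKIE_NAME)


def fetch_listing_page(session: Any, url: str) -> Any:
    """Renew the cookie first, then request the page with the same session."""
    if refresh_unblock_cookie(session, url) is None:
        # Assumption failed: the cookie was not delivered via Set-Cookie.
        raise RuntimeError(f"{COOKIE_NAME} was not set by the initial request")
    # The session resends its stored cookies on this second request.
    return session.get(url, timeout=30)


# Intended usage (requires the requests package and network access):
#   import requests
#   response = fetch_listing_page(
#       requests.Session(),
#       "https://www.redfin.com/zipcode/10001/filter/include=sold-6mo",
#   )
```

Using one session for both requests keeps the cookie handling out of the scraping code, so nothing needs to be hardcoded.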
To go further and replace the “Sign in for price” placeholder with actual prices, one possibility would be to implement a sign-in process and apply the same techniques to obtain and use the cookies associated with a logged-in session. This would give access to specific prices and potentially more detailed information. However, be cautious when employing these methods, as there is a risk of raising suspicion and having your account locked. With those risks in mind, feel free to investigate further if you want to take this approach beyond what is shown here.