Abdessettar's Blog.

Serverless at Scale: Building a Resilient Real Estate Scraper on AWS

The code related to this article can be found in the following repository.

To understand how the real estate market behaves over time, simply browsing listing websites is not enough. Relying solely on published statistics and market analyses is not sufficient either, since these are released with a delay and therefore reflect past conditions rather than the current market climate. To accurately track market trends such as price evolution and time on market, access to raw, high-frequency data is essential.

With this in mind, we decided to build a system to capture the daily flow of the Belgian real estate market. Our focus is limited to two asset types and two transaction types: apartments and houses for the former, sale and rent for the latter, which gives us four target categories. By collecting this data on a daily basis, one can observe market movements as they happen rather than after the fact.

The Context

The real estate market is highly fragmented: transactions happen through multiple channels, including real estate agencies, direct sales between private parties, and notaries. As a result, no single source provides complete market coverage. However, several online platforms attempt to aggregate a large share of publicly listed properties into a single interface.

While these platforms do not capture the full set of market transactions, some of them achieve sufficient scale and consistency to serve as meaningful proxies. When analyzed with an understanding of their coverage limitations and potential biases, the data they offer can still provide valuable signals about current market conditions and dynamics.

Therefore, we deliberately focused on the market-leading platform. Although it does not represent the entire market, its widespread adoption across most active market players provides sufficient coverage and consistency to make it a reliable source.

The goal was not to simply get the data but to do it in an efficient and maintainable way. This implies a set of constraints across different dimensions:

  1. Cost: Adopting a frugal architecture mindset, the system needed to remain cost-efficient, incurring costs only when strictly necessary while operating primarily within the AWS Free Tier.
  2. Volume: Each target category contains approximately 10,000 active listings at any given time, spread across more than 300 paginated search result pages.
  3. Defense: The target platform discourages automated access and deploys several anti-scraping measures that need to be bypassed.
  4. Time: Attempting to process all listings sequentially through a monolithic process would be prohibitively slow.

Consequently, the solution had to strike a balance across all of these constraints, along with other underlying ones.

I. Reconnaissance and Reverse Engineering

Before moving to code or infrastructure, we start at the browser level. The goal is to understand how the target website is structured and to observe precisely which data is exchanged between the browser and the backend. This analysis allows us to identify the minimal set of requests and responses that carry the information of interest: separate the signal from the noise.

Modern websites introduce significant bloat through advertising, tracking, and client-side logic that is not relevant to data extraction. By isolating and ignoring these components, we can focus exclusively on the data-bearing signals. This approach reduces the volume of data retrieved, simplifies downstream processing, and directly improves the speed and reliability of the system.

The approach was to model the system as if the data was being retrieved manually, page by page and listing by listing. In addition to exploring the visible structure of the website, we conducted a detailed analysis using browser Developer Tools to observe the network interactions triggered by different user actions and page loads.

This systematic inspection of the underlying request–response patterns revealed two critical insights that directly informed the architectural design of the system.

I.A. The Search Logic

We began by navigating search result pages as a user would, moving through results while monitoring network activity in the browser’s Developer Tools. This revealed a consistent URL pattern used to retrieve paginated results:

…/fr/search-results/{transaction_type}?page=2

Closer inspection showed that the URL returns a server-generated JSON payload rather than rendered markup. The response contains a structured list of IDs (listing identifiers) for the requested search page, along with a set of associated data. It also contains the total number of online listings for the transaction type/category.

This observation had direct architectural implications. Because the response exposes the IDs of all listings visible on the platform, we can retrieve all available IDs through the search pages. This significantly reduces the number of requests required and enables an efficient, ID-driven data collection strategy.
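The ID-driven strategy above boils down to a simple pagination walk. The sketch below illustrates it with the page-fetching function injected, so the logic is independent of the actual HTTP layer; the payload field names (`totalCount`, `results`, `id`) and the page size are assumptions, not the platform's real schema.

```python
import math
from typing import Callable

def collect_listing_ids(fetch_page: Callable[[int], dict], page_size: int = 30) -> list:
    """Walk the paginated search endpoint and collect every listing ID.

    `fetch_page(n)` must return the parsed JSON for search page n. The key
    names used here ("totalCount", "results", "id") are illustrative.
    """
    first = fetch_page(1)
    total = first["totalCount"]             # number of online listings, from page 1
    pages = math.ceil(total / page_size)    # e.g. ~10,000 listings -> 300+ pages
    ids = [item["id"] for item in first["results"]]
    for page in range(2, pages + 1):
        ids.extend(item["id"] for item in fetch_page(page)["results"])
    return ids
```

Because the total count ships with the first response, the number of requests is known up front, which later makes it easy to fan the pages out as independent units of work.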

I.B. The Listings Logic

A key insight resulted from inspecting (no pun intended) individual listing pages. A straightforward approach would be to download the full HTML document and extract the desired fields using a parser such as BeautifulSoup (e.g., by targeting specific CSS classes or DOM elements).

In practice, this approach is fragile. HTML-based scraping is tightly coupled to the website's presentation layer: changes to CSS classes or page structure can break the scraper and create an infuriating maintenance burden. Having encountered this limitation in earlier attempts, we looked beyond the rendered HTML by inspecting the underlying data flows.

By analyzing network activity in the browser’s Developer Tools, we noticed that listing pages are populated dynamically via an internal API endpoint, just like search pages. Given a listing’s unique identifier, this endpoint returns a server-generated response at:

…/fr/classified/get-result/{ad_id}

The response is a lightweight JSON payload containing the attributes of the listing, which makes it a more stable and efficient data source for extraction, free of the presentation-layer bloat (images, headers, footers) that comes with full HTML pages.
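Extracting from JSON then reduces to flattening a dictionary rather than walking a DOM. A minimal sketch of such a parser is shown below; every key name in it (`price`, `property`, `bedroomCount`, ...) is hypothetical and would have to be read off the actual get-result response.

```python
def parse_listing(payload: dict) -> dict:
    """Flatten the fields of interest from a listing's JSON payload.

    All key names here are illustrative placeholders for the real schema.
    .get() is used throughout so missing fields degrade to None instead of
    raising, which keeps a single malformed listing from failing a batch.
    """
    prop = payload.get("property", {})
    return {
        "id": payload.get("id"),
        "price": payload.get("price", {}).get("mainValue"),
        "type": prop.get("type"),
        "bedrooms": prop.get("bedroomCount"),
        "surface": prop.get("netHabitableSurface"),
        "postal_code": prop.get("location", {}).get("postalCode"),
    }
```

A parser like this only breaks when the API contract changes, not when a designer renames a CSS class, which is exactly the stability argument made above.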

II. The Architecture

The key finding of Phase 1 is that the target data exists at two distinct layers: search results pages, which contain lists of listing IDs, and classified details, which contain the attribute set for each listing. A straightforward implementation would iterate through search pages, retrieve each identifier, and fetch the corresponding listing details sequentially.

However, this monolithic approach does not scale well and offers limited robustness to failures. To address scalability and fault tolerance, we adopted a fan-out architecture: each unit of work, whether a batch of search result pages or an individual listing, is processed independently, enabling parallel execution and better isolation of failures.

II.A. The Pipeline Overview

The system is modular and involves three distinct stages in a fan-out pattern, where each stage is a Lambda function connected to Amazon SQS to enable scalable, event-driven processing:

II.A.1. Dispatcher
II.A.2. Discovery Worker
II.A.3. Extractor Worker

[Figure: fan-out architecture diagram]

Why SQS? Well, the "easier" solution would be to invoke one Lambda function directly from another. However, using a queue provides two features our system needs:

1. Robustness: If a worker crashes, the message isn't lost: once its visibility timeout expires it returns to the queue to be retried later, and after repeated failures it can be routed to a Dead Letter Queue. No data is lost.

2. Throttling: If we tried to scrape 10,000 pages instantly, we could trigger the site's DDoS protection. SQS lets us control concurrency and acts as a shock absorber.
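In practice the Dispatcher's job is just to turn 300+ page numbers into SQS messages. Since `SendMessageBatch` accepts at most 10 messages per call, the work is chunked first; a sketch of that chunking is below (queue URL and message shape are illustrative, not the project's actual values).

```python
import json

def to_sqs_entries(page_numbers, batch_size: int = 10):
    """Split page numbers into SendMessageBatch-sized groups of entries.

    SQS's SendMessageBatch API caps each call at 10 messages, so the
    dispatcher fans the ~300 search pages out in chunks of 10.
    """
    batches = []
    for start in range(0, len(page_numbers), batch_size):
        chunk = page_numbers[start:start + batch_size]
        batches.append([
            {"Id": str(p), "MessageBody": json.dumps({"page": p})}
            for p in chunk
        ])
    return batches

# Sending, sketched (requires AWS credentials and a real queue URL):
# sqs = boto3.client("sqs")
# for entries in to_sqs_entries(list(range(1, 335))):
#     sqs.send_message_batch(QueueUrl=DISCOVERY_QUEUE_URL, Entries=entries)
```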

III. Challenges

During the implementation of our pipeline we faced several issues, some appearing across all Lambda functions, others only in individual ones. In this section, we cover the ones worth discussing, how we tackled them, and potential optimizations to further improve performance and maintainability.

III.A. IP blocking

Sending a large volume of requests from a single IP address within a short time window typically triggers automated bot-detection and rate-limiting mechanisms, resulting in IP-based blocking. This issue appeared when running the workload on AWS Lambda, where all outgoing requests originate from one IP address. From the target website's perspective, the traffic is textbook automated behavior (i.e., high request rate, low entropy in timing and headers), causing it to be flagged as non-human. The outcome is an immediate 403 Forbidden response or the presentation of a CAPTCHA challenge. To bypass this, the outbound traffic has to be distributed across multiple IP addresses.

After evaluating several options, we chose the open-source library requests-ip-rotator. This library programmatically provisions an AWS API Gateway to act as a proxy between our Lambda functions and the target platform, enabling automated IP rotation. API Gateway offers a large pool of IP addresses distributed across multiple AWS regions, providing not only rotation but also geographic diversity for outgoing requests, which helps reduce fingerprinting by the target service.

Alternative approaches could also address this issue, most notably residential proxy networks. In our testing, this was the most effective solution: residential proxies provide access to a large, diverse pool of IP addresses that are far less detectable than those originating from major cloud provider datacenters. Combined with parallel request execution using Python's concurrent.futures (an excellent combo), this approach delivered strong performance and efficiency, with batches completing in record time and a near-perfect success rate. However, given the volume of data we process and the project's budget constraints, residential proxies were not a cost-effective option. They are better suited to entities with a dedicated budget for large-scale data collection (which is one of the reasons the residential proxy market exists).
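To make the gateway-plus-parallelism combination concrete, here is a sketch of a batch fetcher built on concurrent.futures. The fetch function is injected so the logic is independent of the transport; in the real worker it would be a `requests.Session` call routed through the rotating gateway (the target URL below is a placeholder, and the mounting pattern follows requests-ip-rotator's documented usage).

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# With requests-ip-rotator the session would be prepared like this (sketch):
#   gateway = ApiGateway("https://www.example-target.be")
#   gateway.start()
#   session = requests.Session()
#   session.mount("https://www.example-target.be", gateway)
#   fetch = lambda ad_id: session.get(f"{BASE}/fr/classified/get-result/{ad_id}").json()

def fetch_batch(ids, fetch, max_workers: int = 8):
    """Fetch listing payloads in parallel, separating successes from failures.

    Returns (results, failures): a dict of id -> payload, and a list of IDs
    whose fetch raised, so they can be reported back for retry.
    """
    results, failures = {}, []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, ad_id): ad_id for ad_id in ids}
        for fut in as_completed(futures):
            ad_id = futures[fut]
            try:
                results[ad_id] = fut.result()
            except Exception:
                failures.append(ad_id)
    return results, failures
```

Keeping `max_workers` modest doubles as a politeness throttle on top of the SQS-level concurrency control.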

III.B. The "Dangling" resources

The solution to the previous issue created a new problem of its own: if a worker crashes, exceeds the 15-minute execution timeout, or is terminated before gateway.shutdown() runs, the associated API Gateway remains in the wild (note that it can still be reused in subsequent runs). Unfortunately, these orphaned resources are not cleaned up automatically. Over time, repeated failures can lead to an accumulation of unused API Gateways, increasing operational load and potentially exhausting account-level resource limits.

To tackle this, we took advantage of our batch-based architecture. First, we amortized the "cold start" cost of the gateway across an entire batch (gateway.start() can take a few seconds), instead of creating a new one for each request; requests are still randomly rotated across different IPs. Second, we wrap our logic in a try...finally block to ensure cleanup happens even if the code fails with errors. Finally, we check context.get_remaining_time_in_millis() to make sure enough time is left for cleanup in each run.
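The three measures fit together in a pattern like the following sketch, with the gateway, the per-item processing, and the Lambda context all injected; the 30-second safety margin is an assumed value, not a measured one.

```python
SAFETY_MARGIN_MS = 30_000  # assumed headroom reserved for gateway shutdown

def run_batch(ad_ids, gateway, process, context):
    """Process a batch while guaranteeing gateway cleanup.

    gateway.start() is paid once per batch (amortized cold start); the
    finally block ensures gateway.shutdown() runs even on a crash, and the
    remaining-time check stops work early so cleanup still has time to run.
    """
    gateway.start()
    processed = []
    try:
        for ad_id in ad_ids:
            if context.get_remaining_time_in_millis() < SAFETY_MARGIN_MS:
                break  # bail out early; unprocessed IDs stay on the queue
            process(ad_id)
            processed.append(ad_id)
    finally:
        gateway.shutdown()  # runs on success, crash, or early exit
    return processed
```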

This solution is not without limitations as the time required for cleanup can still exceed expectations. One possible optimization is to introduce a dedicated "janitor" worker responsible for identifying and removing orphaned API Gateways left behind. Ideally, such a worker would be at the end of the pipeline to perform cleanup after each run.

In practice, however, we opted for a simpler approach: a locally scheduled cleanup script running daily on a Raspberry Pi. The reason was to avoid exceeding the AWS Lambda free-tier execution limits, even though the current implementation leaves enough room for such a worker.

III.C. Partial failures

In a distributed system, failure is not a question of if but when. When processing a batch of 100 listings, it is statistically likely that at least one item will fail for some reason, ranging from a network timeout on the target service to malformed input. If 99 listings are processed successfully but the last item fails, the Lambda function raises an error and AWS treats the entire batch as failed: an all-or-nothing game.

A seemingly straightforward solution is to delete messages from the queue as they are processed (e.g., calling sqs.delete_message inside the processing loop). However, this approach is an anti-pattern: it couples functional logic with infrastructure concerns and introduces a critical failure mode. If the Lambda function crashes (e.g., out of memory) after deleting a message but before sending the data to S3, that batch is permanently lost. If we delete after sending the data instead, the function can still crash in between, in which case the message reappears in the queue and the work is redone.

To solve this, we use SQS's partial batch response feature (ReportBatchItemFailures), which introduces fault tolerance without crashing the function or manually deleting messages. Concretely, we wrap the processing of each individual URL in a try...except block that appends failed IDs to a list of failures. That list is returned as the function's payload and sent back to SQS so that only the failed items are re-processed.
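The handler shape for this pattern is small and follows AWS's documented contract: return a `batchItemFailures` list of `itemIdentifier`s, and SQS will requeue only those messages (provided ReportBatchItemFailures is enabled on the event source mapping). The per-listing `process` function is injected here for clarity; in the real worker it fetches and stores one listing.

```python
import json

def handler(event, context, process=lambda body: None):
    """SQS batch handler using partial batch response.

    Any record whose processing raises is reported back via
    batchItemFailures; the rest of the batch is deleted normally.
    """
    failures = []
    for record in event["Records"]:
        try:
            process(json.loads(record["body"]))
        except Exception:
            # only this message returns to the queue for retry
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

An empty `batchItemFailures` list means the whole batch succeeded; raising out of the handler instead would fail all 100 items at once, which is exactly the behavior we are avoiding.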

Further optimizations can guard against unknown failure modes. If a URL cannot be retrieved by the current logic and fails with an unhandled error, the message could be retried indefinitely (until its retention period expires). A more robust approach is to configure a Dead Letter Queue on SQS so that, after a set number of failed processing attempts, the message is automatically moved to a separate queue. This mechanism helps catch unknown edge cases without blocking the pipeline or incurring excessive retry costs.
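Wiring up such a DLQ is a one-attribute change on the source queue: a RedrivePolicy naming the DLQ's ARN and a maxReceiveCount threshold. A sketch, with an illustrative ARN and threshold:

```python
import json

def redrive_attributes(dlq_arn: str, max_receives: int = 5) -> dict:
    """Build the queue attributes that route poisoned messages to a DLQ.

    After `max_receives` failed receive attempts, SQS moves the message to
    the DLQ instead of retrying forever. Both values here are illustrative.
    """
    return {
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": dlq_arn,
            "maxReceiveCount": max_receives,
        })
    }

# Applied with boto3 (sketch):
# sqs.set_queue_attributes(QueueUrl=queue_url,
#                          Attributes=redrive_attributes(dlq_arn))
```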

III.D. Statelessness

When a Lambda function finishes execution, it forgets everything, which is problematic for a daily script: how do we track what has already been done? In our case, many of the listings available online were already retrieved in a previous run, so retrieving them again would add both storage and compute cost. Furthermore, the extra unnecessary requests increase the chance of triggering the website's bot-detection mechanisms.

To address this in a cost-effective manner, we went with AWS Systems Manager (SSM) Parameter Store as a lightweight persistence layer. Although SSM is typically used for configuration and secrets management, it also works well as a key–value store for small amounts of state. Since each new listing ID is incremented from the latest one created, we store the highest processed listing ID as a parameter. At the start of each batch, the worker retrieves this value and filters search results accordingly, ensuring that only newly created listings are processed (i.e., the delta since the last run).
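This is a classic high-water-mark pattern. The filtering itself is a one-liner; the SSM reads and writes around it are sketched in comments using boto3's standard get_parameter/put_parameter calls (the parameter name is an assumption):

```python
def filter_new(listing_ids, last_seen_id):
    """Keep only IDs created after the stored high-water mark."""
    return [i for i in listing_ids if i > last_seen_id]

# Reading and advancing the mark with SSM (sketch; parameter name assumed):
#   ssm = boto3.client("ssm")
#   last = int(ssm.get_parameter(Name="/scraper/last-listing-id")
#              ["Parameter"]["Value"])
#   new_ids = filter_new(all_ids, last)
#   if new_ids:
#       ssm.put_parameter(Name="/scraper/last-listing-id",
#                         Value=str(max(new_ids)),
#                         Type="String", Overwrite=True)
```

Advancing the mark only after a successful run keeps the scheme safe: a crashed run simply re-scans the same delta the next day.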

If strict consistency were required, if the IDs were not incrementing integers, or if tracking every retrieved ID mattered, we would write the IDs to a dedicated database table instead (at additional cost).

Conclusion

This project serves as a case study in navigating the constraints of modern cloud engineering, from balancing cost and resilience, to scaling against aggressive defensive measures. By moving beyond a monolithic approach to an event-driven "Fan-Out" architecture, we transformed a fragile, long-running process into a robust pipeline that scales to zero when idle.

The implementation tackles the twin challenges of access and efficiency by combining ephemeral infrastructure for IP rotation with the precision of reverse-engineered internal APIs, reducing compute costs considerably. While we accepted calculated trade-offs, such as prioritizing the simplicity of SSM state management over the strict consistency of a database, the result is a system that is functionally antifragile: it handles network volatility, rate limits, and partial failures without needing human intervention. Ultimately, this demonstrates that the difference between a script that breaks and a system that endures lies not in the complexity of the code, but in the architectural decisions made before the first line is even written.