Abdessettar's Blog.

Belgian Houses Fair Value

The webapp can be accessed through this link
The code related to this article can be found in the following repository. Feel free to reach out for any questions or suggestions.

I. Introduction
    I.A. The problem
    I.B. What we deliver
    I.C. Why we wrote this
    I.D. What this article is not

II. Related Work

III. The Data
    III.A. Sources
    III.B. Collection and integration
    III.C. Cleaning and the scraping ethics line

IV. Feature Engineering
    IV.A. Raw listing fields
    IV.B. Derived property features
    IV.C. Spatial features
    IV.D. Socioeconomic enrichment
    IV.E. The target encoding trap

V. Model
    V.A. Why LightGBM
    V.B. Quantile regression, not point regression
    V.C. Temporal split
    V.D. Hyperparameter tuning

VI. Uncertainty Quantification
    VI.A. Why prediction intervals matter for a buyer
    VI.B. Split conformal prediction in plain language
    VI.C. The story of how the band went from 61% to 79%
    VI.D. Final coverage and what it costs

VII. Explainability
    VII.A. Global view: which features the model leans on
    VII.B. Local view: per listing drivers, in euros
    VII.C. The narrative layer

VIII. Results
    VIII.A. Headline numbers
    VIII.B. The model blindspots
    VIII.C. Honest residual analysis

IX. What we would do differently

X. Closing

I. Introduction

I.A The problem

Every house listed for sale online comes with one number: the asking price. The seller chooses this number, sometimes with the help of a real estate agent, sometimes by looking at what the neighbours got last year, and sometimes, we suspect, by adding ten thousand euros for the "new" kitchen they installed in 2004. The average buyer, on the other side of the table, has no easy way to tell whether this number is reasonable or not. They have a gut feeling, they have a budget, they have anxiety, but they do not have a second number to compare against. This is the problem we set out to address. We want to give the buyer a second number, and to be very clear about how confident we are in it (plus how and why we got it).

I.B What we deliver

For every active for sale house listing in Belgium, our system produces three principal things:

For this project, the architecture is not original, and the data pipeline is not revolutionary either. What is, perhaps, slightly original is the level of paranoia we applied to coverage on a temporally shifted test set, and the work we put into making the per-listing explanations readable by a person who is not a data analytics person.

I.C Why we wrote this

We spent a lot of time, possibly more than was wise, reading the literature on real estate price prediction, on quantile regression, on conformal calibration, and on feature engineering for tabular data, before we touched a single line of code. Some of that reading was useful, but some of it was, in retrospect, a polite form of procrastination (you know how it goes with personal projects sometimes). This article is the version of what we wish we had read on the first day, where someone tells us which ideas survive contact with a real, messy, non-stationary dataset, and which ideas break the moment we apply them to our actual problem.

We also wrote this because we got things wrong before we got them right, and the wrong attempts are, in our experience, the most instructive part of the whole exercise. The first version of our model produced an 80% prediction band that, on the held out test set, only covered 61% of the true prices. That is a serious miss, and a very frustrating one. We will spend a section explaining why this happened, what we tried, what did not work, and which intervention finally closed the gap. If you should take only one thing away from this article, we would prefer it to be that section, and not the headline metric at the end.

I.D What this article is not

This article is not a tutorial on gradient boosting or anything else. There are many excellent ones already, written by people who understand the underlying mathematics far better than we do. Where we use a result from the literature, we try to cite it briefly if we can, and move on.

This article is also not a deployment story. The fact that the entire product runs at zero hosting cost on Cloudflare Pages, with the analytical database loaded directly inside the user's browser, with daily retraining triggered by a GitHub Actions cron, with an AI powered natural language search proxy that costs a tenth of a cent per request, all of that is genuinely interesting and deserves its own article. It will get one, hopefully, but for the present text, we draw a clean line at the boundary between "the model and its evaluation" and "everything that happens after a parquet file is written to disk".

Finally, this article is not a defence of any particular library or framework. We use LightGBM because it works well on heterogeneous tabular data with many missing values, not because we are loyal to it. We use Optuna because it is convenient, not because we believe random search is fundamentally inferior. Where we made a tooling choice that mattered for the result, we will say so. Where the tooling choice was incidental, we will skip the discussion.

II. Related Work

The goal here is not to map out the entire field, but to give the reader just enough context to understand which design decisions we adopted from the literature, and which ones we made on our own. The first body of work we leaned on is the long tradition of real estate price modeling. Classical hedonic pricing models, going back to Rosen in 1974, decompose the observed price of a property into contributions from each of its measurable attributes. These models have the advantage of being easy to interpret, but they assume an additive linear structure that is a poor fit for a market where the value of a swimming pool depends strongly on whether it is open air or not, the value of a garden depends on its orientation, and the value of an extra bedroom is not constant across surface ranges. Unsurprisingly, recent work has moved to non-parametric methods, with random forests and gradient boosted trees as the typical choices. The work we found most useful in this line is the one that reports on actual deployments, where authors are forced to confront temporal drift, missing data, and the question of what the model should do when a listing falls outside the training distribution. We have tried to follow their example by being explicit about each of these issues.

The second body of work concerns the part of the system that we consider, frankly, more interesting than the price model itself: the quantification of uncertainty. The original quantile regression formulation by Koenker and Bassett in 1978 gives us a way to fit a model that targets a specific quantile of the conditional response distribution rather than its mean. Combined with gradient boosting, this becomes a practical tool for producing prediction intervals from heterogeneous tabular data. However, the bands produced by quantile regression alone do not carry any formal coverage guarantee, and in practice they are often overconfident on data that differs in distribution from the training set. Conformal prediction, developed by Vovk and collaborators starting in the early 2000s, provides a distribution free way to calibrate these bands using a held out set. Recent work, including the line of papers around conformalised quantile regression, has made this technique very practical. Our contribution here is not theoretical: we are simply applying a method that the literature has matured. What we do try to add is an honest account of how the method behaves when validation and test sets are not exchangeable, which is the situation we faced with our temporal split, and which is a situation that the textbook treatment tends to gloss over.

The third body of work, less famous but in our view equally important for this type of prediction project, is the small but persistent literature on feature engineering for heterogeneous tabular data. While much of the recent attention in machine learning has gone to deep learning approaches on images, text, and time series, tabular data with a few dozen features of mixed types remains the case where careful manual feature engineering, combined with a strong gradient boosting baseline, still wins against deep models with very few exceptions. We took several ideas from this literature, including out of fold target encoding for high cardinality categorical variables, multiplicative ratios such as surface per bedroom, and the careful inclusion of socioeconomic features that are external to the listing itself. Each of these choices is defended in a section below, where we describe our feature set in detail. For the present section, the relevant point is that we did not invent any of this. We selected and combined techniques that other people had already shown to work, and we paid attention to the mistakes those people had documented.

There are several adjacent areas that we deliberately did not pursue, and that we mention only to be honest about the gap. We did not work with image features from the listing photographs, despite the fact that images contain rich information about a listing. We did not use graph based representations of neighbourhood structure. We did not attempt to model temporal price dynamics jointly with cross sectional prices. Each of these directions is plausible, and each of them is something we would consider in a future iteration. For now, we made the deliberate choice to do the simpler things well, rather than doing the more sophisticated things badly, since each of them adds complexity to the system.

III. The Data

Before we discuss the model, we owe the reader a careful account of the data, because in our experience the difference between a useful (real estate pricing) model and a useless one is almost entirely determined at this stage. Choosing LightGBM over CatBoost, or tuning sixty Optuna trials instead of thirty, contributes at most a small percentage of improvement to the final metric. Cleaning the messy features properly, choosing the right window of history, and joining the socioeconomic features without leakage, contributes far more. So we will describe what we did and, where it is interesting (from our point of view at least), what we considered but did not do.

III.A Sources

Our data combines three sources of information, each of them imperfect in its own way. The first source is the listings themselves from immoweb.be. Every house currently offered for sale in Belgium appears, sooner or later, on a small number of online portals, and immoweb effectively serves as an aggregator, as it is the main platform in Belgium where both private sellers and real estate agencies publish their listings. We collect a daily snapshot of these listings and persist the subset of fields that are useful for our model, most notably: the asking price, the geographic coordinates, the postal code, the surface in square meters, the number of bedrooms and bathrooms, the construction year, the PEB energy score, the property subtype, binary indicators for amenities such as a garden, a terrace, or a swimming pool, and a few timestamps that record when the listing was first created and when it was last modified. The listing is the only source that gives us an asking price, which is what we are ultimately trying to model, so without this source there is no project.

The second source is engineered features, which are not data in the strict sense but are quantities we compute from the listings themselves. We will discuss feature engineering in detail in section four, so we mention only the principle here: every additional column that the model can use, beyond the raw fields the listing provides, costs us implementation time and risks introducing a subtle leak. We add a feature only when the prospective gain in predictive accuracy is large enough to justify both costs.

The third source is socioeconomic data published by Statbel, the Belgian federal statistical office. From their open data portal we extract three signals at the commune level. The first is the median sell price for the most recent period for which the data is published. The second is the share of the local population at risk of monetary poverty, which serves as a proxy for the desirability and the average income of the area. The third is the total population of the commune. None of these three signals is decisive on its own, but together they give the model a way to distinguish between two properties that look identical on every listing field but sit in very different parts of the country.

We considered, and chose not to integrate (for now at least), several other potentially informative sources. We did not use nearby points of interest, which would tell us how close a property is to a school, a tram stop, or a supermarket. We did not use commercial datasets that promise per-street "valuable" data, because we could not validate their methodology and because they are expensive. The reader should treat the absence of these sources as a deliberate choice, not as an oversight. Each of them is on the list of things we would add in a future iteration.

III.B Collection and integration

The pipeline that brings these three sources together has three stages, and in each stage there is at least one detail that took us much longer to get right than we expected. In the first stage, we collect the daily listing snapshot, which is produced by an upstream scraping process that writes a JSON file per day into an S3 bucket (this process is documented in this article). Our pipeline reads the most recent file from that bucket and persists the subset of columns we actually use, dropping the rest at this very early stage to keep downstream files small. We deliberately do not retain any field that we do not currently consume, because experience taught us that retaining "just in case" columns leads to big files, slow loads, and eventually slow iterations.

In the second stage, we maintain a historical accumulation of all listings ever observed, as a separate table in BigQuery. The reason we maintain history rather than working only with the daily snapshot (i.e., only listings online) is that our model is trained on a twelve month window, and a single day's snapshot (~10000 listings) would not be enough to fit a credible model. The historical accumulation handles deduplication on the listing id, so that a property that stays online for ninety days appears as one row in the modelling dataset rather than ninety. We keep the most recently modified version of each id, which is the version that reflects any price reduction or feature update that happened over the listing's life.
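The deduplication rule described above can be sketched in a few lines. This is a minimal illustration in plain Python rather than the BigQuery logic the pipeline actually uses; the field names `listing_id`, `modified_at`, and `price`, and the sample rows, are assumptions made up for the example.

```python
from datetime import date


def deduplicate_listings(rows):
    """Keep one row per listing id: the most recently modified version."""
    latest = {}
    for row in rows:
        current = latest.get(row["listing_id"])
        if current is None or row["modified_at"] > current["modified_at"]:
            latest[row["listing_id"]] = row
    return list(latest.values())


# Hypothetical history: listing 1 was observed twice, with a price reduction.
snapshots = [
    {"listing_id": 1, "modified_at": date(2024, 1, 5), "price": 350_000},
    {"listing_id": 1, "modified_at": date(2024, 3, 1), "price": 335_000},
    {"listing_id": 2, "modified_at": date(2024, 2, 10), "price": 420_000},
]
deduped = deduplicate_listings(snapshots)
```

In this sketch, listing 1 survives as a single row carrying the reduced price, which is the behaviour the paragraph describes.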

In the third stage, we join the socioeconomic data. The join is the single awkward part of the pipeline, and it deserves a paragraph of its own, because it bit us repeatedly. Belgian postal codes and the codes used by the authorities for statistical data treatment (called NIS codes in administrative practice) are not always in a one to one relationship. A single postal code can span several communes, especially in semi rural areas where a postal agency serves multiple villages. Conversely, a single large commune can contain several postal codes. Our socioeconomic data is published at the NIS level, while our listings are tagged at the postal code level, so a naive join produces duplicate rows whenever the postal code maps to more than one NIS code. We worked around this by joining to a representative NIS code per postal code, chosen as the most populous one, which on average minimizes the introduced bias. We are not entirely happy with this choice, and a more principled solution would project listings onto a NIS code using the actual coordinates of each property and a polygon lookup. We have this on the improvement list and we have not yet implemented it.
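To make the workaround concrete, here is a minimal sketch of picking the most populous NIS code per postal code before joining, so that each listing matches at most one row of statistics. All codes, populations, and field names below are invented placeholders, and the helper names `representative_nis` and `join_socio` are hypothetical.

```python
def representative_nis(postal_to_nis, population_by_nis):
    """For each postal code, pick the candidate NIS code with the largest population."""
    return {
        postal: max(candidates, key=lambda nis: population_by_nis.get(nis, 0))
        for postal, candidates in postal_to_nis.items()
    }


def join_socio(listings, mapping, stats_by_nis):
    """Attach commune statistics to each listing; exactly one output row per listing."""
    joined = []
    for listing in listings:
        nis = mapping.get(listing["postal_code"])
        stats = stats_by_nis.get(nis, {})  # unknown postal code: no statistics
        joined.append({**listing, "nis_code": nis, **stats})
    return joined


# Placeholder data: postal code "P1" maps to two NIS codes, which is the
# situation that makes a naive join fan out.
postal_to_nis = {"P1": ["N1", "N2"], "P2": ["N3"]}
population_by_nis = {"N1": 4_000, "N2": 25_000, "N3": 9_000}
stats_by_nis = {nis: {"population": pop} for nis, pop in population_by_nis.items()}
listings = [
    {"listing_id": 1, "postal_code": "P1"},
    {"listing_id": 2, "postal_code": "P2"},
    {"listing_id": 3, "postal_code": "P9"},
]

mapping = representative_nis(postal_to_nis, population_by_nis)
joined = join_socio(listings, mapping, stats_by_nis)
```

Because the mapping is resolved to a single NIS code before the join, the output always has the same number of rows as the input, whatever the shape of the postal code table.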

III.C Cleaning and the scraping ethics line

Even after these three stages, the data is far from clean, and we spend a non trivial amount of effort applying a series of filters that remove rows we do not want to model on. The first filter is on the price: we keep only listings whose asking price falls between 20,000 and 5 million euros. Below twenty thousand we are looking at rows where the price is missing or encoded incorrectly, or where the listing represents a parking spot or a derelict structure that does not belong in the same model as houses. Above five million we are looking at the very long right tail of the housing market, where prices are driven by features that our model cannot observe (e.g., interior decoration, historic value, private parks) and where we have very few examples to learn from. Both extremes were degrading the model's behaviour on the central mass of the distribution, and removing them improved every metric we care about.

The second filter is on duplicates. The historical accumulation in BigQuery, despite our deduplication on listing id, occasionally produces duplicate rows when the upstream join with socioeconomic data fans out. We sort by the modification timestamp and keep the last copy. In the most recent run, this dropped roughly 1800 rows out of 170k. Small in percentage, but enough to bias evaluation if we had ignored it.

The third filter, which is more an engineering choice than a statistical one, is on listings that are clearly outside the training distribution. Life annuities, where the buyer pays a small upfront amount and a monthly pension to the seller for the rest of the seller's life, follow a price formation process that has nothing to do with regular sales. New construction projects, which are listed before the building exists and where the price reflects a mixture of land cost, construction cost, and developer margin, are similarly outside the regular distribution. We do not remove these listings from our display, because a buyer browsing the map still expects to see them, but we do flag them as not priceable, and we withhold any fair value estimate. The model abstains rather than guesses. We consider this an important honesty feature, and we discuss it again in section eight when we look at error cases.
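The price filter and the "not priceable" flag can be sketched together as follows. This is an illustration under assumptions: the `sale_type` field and its labels are hypothetical, and treating the price bounds as inclusive is our guess, since the text does not say.

```python
MIN_PRICE, MAX_PRICE = 20_000, 5_000_000  # euros; inclusive bounds are an assumption

# Hypothetical labels for listings whose price formation differs from regular sales.
NOT_PRICEABLE_TYPES = {"life_annuity", "new_construction"}


def has_plausible_price(listing):
    """True when the asking price is present and inside the modelled range."""
    price = listing.get("price")
    return price is not None and MIN_PRICE <= price <= MAX_PRICE


def is_priceable(listing):
    """False for listings that stay on display but get no fair value estimate."""
    return listing.get("sale_type") not in NOT_PRICEABLE_TYPES


listings = [
    {"listing_id": 1, "price": 350_000, "sale_type": "regular"},
    {"listing_id": 2, "price": 5_000, "sale_type": "regular"},        # likely a parking spot
    {"listing_id": 3, "price": 450_000, "sale_type": "life_annuity"},
    {"listing_id": 4, "price": None, "sale_type": "regular"},         # missing price
]

training_rows = [l for l in listings if has_plausible_price(l) and is_priceable(l)]
displayed_without_estimate = [l for l in listings if not is_priceable(l)]
```

Keeping the two predicates separate mirrors the distinction made above: a row can be excluded from training while still being shown to the buyer with the estimate withheld.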

The data discussion would be incomplete without a few words on ethics. Scraping a commercial portal, even for a project that is free and non commercial, sits in a delicate legal and ethical position. We have made a number of deliberate choices to stay on the careful side of that line. We rate limit our snapshot collection to a rhythm that is comparable to a curious human user. We do not republish raw listing photographs, descriptions, or any other content that would be considered the source portal's intellectual property. We retain only the structured fields, which on their own do not constitute a substantial reproduction of the original content. On the front end, we removed every reference to the source portal's brand name from the user interface, with the exception of an explicit credit on each listing card that links back to the original page, and on the landing page where we mention the source of the data. We also configured the site to suppress the HTTP referrer header on every outbound navigation, so that the source portal cannot easily detect that a visitor arrived from our tool. None of this is legal advice, and we are not lawyers. It is simply an honest account of the choices we made after considering the question and discussing it with our other lawyer, Maitre Claude.

We mention these choices because they affected the data we work with. A more aggressive scraping strategy would have given us more listings, more historical depth, and more frequent snapshots, all of which would probably have produced a marginally better model. We chose to trade some predictive accuracy for a more comfortable position on the ethics question. Others might choose differently but we think the trade off is worth being explicit about.

IV. Feature Engineering

The previous section described the data as it arrives in the pipeline, while the present section describes how we transform that data into the matrix that the model actually sees. The transformations are not, individually, very sophisticated and most of them are obvious to anyone who has spent ten minutes thinking about real estate prices. What is more interesting, in our view, is the cumulative effect of applying many small transformations carefully, and the one place where a careless implementation would have produced a large but invisible amount of target leakage. We will describe the careless version and then the careful version, because the contrast is the most instructive part of the whole section.

IV.A Raw listing fields

We begin with the fields the listing provides directly, before any derivation. Roughly forty raw fields make it through our pipeline, of which we keep around twenty five for modelling. The selection is a mixture of common sense and empirical investigation: a field has to be filled often enough to be useful, and it has to carry enough signal about the price to justify the column it occupies. The numeric fields we keep include the surface in square meters, the number of bedrooms, the number of bathrooms, the number of toilets, the number of facades, the construction year, the total room count, and a small number of further counts such as parking spots and shower rooms. We keep the geographic coordinates, latitude and longitude, both as inputs to the model and as anchors for the spatial features described in section IV.C. We keep the postal code as a string, not as a numeric value, because Belgian postal codes have an administrative structure where neighbouring numbers belong to neighbouring areas in some parts of the country and not in others. Treating them as a number would have implied a smoothness that is not real: the model might incorrectly infer, for example, that Namur (5000) is "greater than" Brussels (1000).

The categorical fields we keep include the property subtype (house, villa, mansion, mixed use, and so on), the kitchen type (unequipped, semi equipped, fully equipped, hyper equipped, in roughly increasing order of value), the heating type (gas, oil, electric, heat pump, wood), the garden orientation, the terrace orientation, the EPC energy score (a letter grade from A through G that summarizes the property's thermal performance and that has become a strong price signal in Belgium since the regulation tightened in the last few years), and the postal code treated as a categorical string.

We also keep a long list of binary indicators that the listing exposes as yes or no flags. These include whether the property has a garden, a terrace, a swimming pool, a heat pump, photovoltaic panels, double glazing, an attic, a basement, a fireplace, a lift, a sauna, a secure access alarm, and so on. Some of these flags have a clear positive contribution to the price, some have a more subtle effect, and some, frankly, are redundant with the property subtype. We keep them all rather than try to prune them by hand, because a tree based model handles redundant binary features gracefully and we did not want to make selection decisions that we would have to revisit every few months as the market progresses.

Several raw fields are present in the source data but we discard them. The full text of the listing description is one of them, for two reasons. First, processing free text would push us into a different modelling regime that we deliberately want to avoid for this iteration. Second, the description is the part of the listing that is most clearly the source portal's intellectual property, and we wish to stay clear of it for the reasons discussed in section III.C. We also discard fields that are filled in too rarely to be useful, such as the year of the last renovation (filled in fewer than 15% of cases), or the listing's commercial description (similar coverage). For these very sparse fields, imputation would do more harm than good. We are aware that discarding them slightly hurts the model on the small subset of listings where they are present and informative, but we have not yet found a clean solution to this trade off.

IV.B Derived property features

To the raw fields we add a small number of derived numeric features. The principle here is that gradient boosted trees are perfectly capable of representing non linear interactions between two raw features, but they often need many trees to do so. By providing the useful combination as a column of its own, we help the model find the relationship faster and with less overfitting risk on the small sub regions of the input space.

The first derived feature is the property's age, computed as the current year minus the construction year, then clipped to the range zero to two hundred years. Clipping protects against rare construction years that are clearly typographical errors (we have seen 19, 1819, 1019. No one is buying ruins), and against the model treating a building from 1850 as fundamentally different from a building from 1700, when in reality both belong to the same large category of "very old".

The second derived feature is the construction decade, which is the construction year rounded down to the nearest ten and bounded between 1800 and the current year. The motivation here is that the Belgian housing market has clear cohorts: houses built in the 1970s, in the 2000s, in the 2010s each have characteristic floor plans, insulation standards, and resale dynamics. The raw construction year already contains this information, but the model picks up the cohort structure faster when we present it explicitly.

The third derived feature is the surface per bedroom, computed as the total habitable surface divided by the bedroom count. This ratio is informative because two listings with the same total surface but different bedroom counts can carry very different per square meter prices. A ninety square meter property with one bedroom offers more living space per bedroom than a ninety square meter property with three, so the market values the former and the latter differently. We could have given the model a more elaborate feature here, for instance the ratio of bathrooms to bedrooms or the ratio of total rooms to surface, and we tried both. Neither added enough on a held out evaluation to be worth keeping. Also, note that this feature is a proxy rather than the exact surface of each bedroom, which is good enough in our case.
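The three derived features above are simple enough to sketch directly. The field names (`construction_year`, `surface_m2`, `bedrooms`) are assumptions, and so is the choice to return `None` when the bedroom count is missing or zero, which the text does not specify.

```python
def derive_features(listing, current_year):
    """Compute age, construction decade, and surface per bedroom for one listing."""
    year = listing.get("construction_year")
    surface = listing.get("surface_m2")
    bedrooms = listing.get("bedrooms")

    age = decade = None
    if year is not None:
        # Age clipped to [0, 200] years, which also absorbs typos such as 19 or 1019.
        age = min(max(current_year - year, 0), 200)
        # Decade rounded down to the nearest ten, bounded to [1800, current_year].
        decade = min(max((year // 10) * 10, 1800), current_year)

    # Assumption: a missing or zero bedroom count yields a missing ratio.
    surface_per_bedroom = surface / bedrooms if surface is not None and bedrooms else None

    return {
        "age": age,
        "construction_decade": decade,
        "surface_per_bedroom": surface_per_bedroom,
    }


regular = derive_features(
    {"construction_year": 1974, "surface_m2": 90, "bedrooms": 3}, current_year=2024
)
typo = derive_features(
    {"construction_year": 19, "surface_m2": 90, "bedrooms": 0}, current_year=2024
)
```

Passing `current_year` explicitly, rather than reading the clock inside the function, keeps the features reproducible when the same listing is scored on different days.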

We considered, and rejected, several other derived features that looked tempting. We rejected an "is renovated" binary computed from a heuristic on the description text, because it would add complexity by requiring other algorithms for an uncertain gain, and because we discard the description text entirely. We rejected a "luxury score" that combined swimming pool, sauna, and lift into a single index, because the model already had access to those three flags and the combination did not help. We rejected a price per square meter feature on the listing's own price, because it would have been a direct leak from the target. The temptation to add features is constant, and we tried to be disciplined about adding only those that justified themselves on a clean validation set.

IV.C Spatial features

Location, in real estate, is the dominant feature. A property in the Brussels region and a property in a poor town in southern Wallonia, with otherwise identical characteristics, can differ in price by a factor of three or four (e.g., Woluwé-St-Pierre vs Charleroi). The model needs a way to capture this, and the way we provide that capture is the topic of this subsection where we give the model two complementary views of location.

The first view is categorical and granular, in the form of the postal code treated as a categorical variable with around eleven hundred unique values. LightGBM handles this natively (one of the reasons we chose it), by searching at each split for a partition of the categories into two groups, with regularisation against rare categories. The second view is geometric and continuous, in the form of the haversine distance, in kilometres, from the property's coordinates to each of seven Belgian cities: Brussels, Antwerp, Ghent, Liege, Bruges, Charleroi, and Namur. We chose these seven because they are, by population, the largest urban centres in the country, and because they cover both linguistic regions reasonably well. A property close to one of them is, on average, in a more expensive part of the country than a property far from all of them.
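The distance features can be sketched with the standard haversine formula. The city coordinates below are approximate city-centre values that we supply for illustration; the exact reference points used in the project are not stated, and the feature names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius

# Approximate (latitude, longitude) of each city centre; assumed reference points.
CITY_COORDS = {
    "brussels": (50.8503, 4.3517),
    "antwerp": (51.2194, 4.4025),
    "ghent": (51.0543, 3.7174),
    "liege": (50.6326, 5.5797),
    "bruges": (51.2093, 3.2247),
    "charleroi": (50.4108, 4.4446),
    "namur": (50.4674, 4.8720),
}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distance_features(lat, lon):
    """One distance column per reference city, e.g. 'dist_brussels_km'."""
    return {
        f"dist_{city}_km": haversine_km(lat, lon, city_lat, city_lon)
        for city, (city_lat, city_lon) in CITY_COORDS.items()
    }


features = distance_features(50.8503, 4.3517)  # a property in central Brussels
```

Each of the seven columns is expressed in kilometres, which is what allows a per-listing explanation to be phrased in terms of distance to a named city.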

We hesitated for some time about whether to use a more sophisticated location representation. The candidates we considered were H3 hexagonal cells (Uber's geographic indexing system, which projects coordinates onto hexagons of a chosen resolution), learned position embeddings (where the model assigns a small dense vector to each cell or coordinate), and Gaussian random features over the coordinates. Each of these has known advantages on certain types of spatial problem, but we did not adopt any of them, for two reasons. First, the postal code categorical, combined with the seven distance features, was already capturing the bulk of the spatial signal. Second, more sophisticated representations carry an interpretability cost. We want to be able to tell a buyer that the model deducted ten thousand euros from the fair value because the property is forty kilometres from Brussels rather than ten kilometres. We would not be able to tell the same buyer, in equally simple terms, what embedding dimension three contributed. Given that the simpler representation already captures the signal, the more complex representation would be paying an interpretability cost for no guaranteed predictive gain.

We would revisit this decision if we had reason to believe that the spatial structure of Belgian house prices is significantly more complex than the seven city distances can express. Concretely, if we saw systematic residuals concentrated in particular regions, the extra modelling capacity would be the natural next step. We do not, at the moment, see such residuals.

IV.D Socioeconomic enrichment

The third group of features is the one we extract from the Statbel commune level statistics, joined to each listing through the postal code mapping discussed in section III.B. We retain three signals: the median sell price for the most recent published period, the share of the local population at risk of monetary poverty, and the total population of the commune.

It is worth being honest about what each of these signals is actually picking up, because they are not the variables their names suggest. The median sell price is the most direct of the three, and it is essentially a coarse prior on the local price level. It is correlated with the listings themselves, but it is not the same as the listings: Statbel reports recorded transactions, which are sales that have actually closed, while our listings are asking prices, which are intentions and demands. The two diverge most strongly in markets where sellers are unrealistic, which is precisely the kind of market the model needs to understand. As an anecdote, I was recently speaking with someone who is house-hunting. They found a property listed at €420k and submitted an offer, but a bidding war drove the price past €460k, with bidding still ongoing. That’s roughly a 10% increase, which is crazy, especially given that the sellers claimed to be in a hurry to sell.

The poverty index is more interesting. It is supposed to measure the share of the population at risk of monetary poverty, which sounds like an income measure. In our experience, it correlates with the median sell price almost perfectly in the centre of cities, but it diverges in two specific situations. The first is rural communes where measured income is low but housing prices are stable, because the local cost structure is genuinely lower. The second is what we see as gentrifying neighbourhoods, where the measured population is mostly long term residents on lower incomes, but the housing market is being driven by incoming buyers from elsewhere (e.g., the center of Brussels). The poverty index, in those cases, adds information that the median sell price alone would miss.

The total population is, on the surface, the least informative of the three, but we kept it because it acts, in practice, as a smoothness regulariser. Two communes with similar median sell price and similar poverty index can have very different reliability of those estimates. A commune with a hundred thousand residents has a stable, well estimated median sell price. A commune with a thousand residents has a noisy estimate based on a handful of transactions per year. The total population gives the model a way to discount the noisier estimates, and we found in ablation that removing it slightly degraded the metric on small communes.

We considered including additional Statbel signals, in particular the unemployment rate and the share of the population with a tertiary education. We did not include them, because we see them as strongly correlated with the three signals we already have, and adding them would introduce an interpretability problem (i.e., which of the four was the model relying on for a given prediction?) without a guaranteed predictive gain. We will consider adding more meaningful ones in future iterations to boost the model's performance.

IV.E The target encoding trap

We promised at the start of this section that we would describe one specific feature where a careless implementation produces a large amount of invisible target leakage. The feature in question is the commune level median price per square meter, which we add as one final numeric column. The careless version of this feature is computed as follows: we group all training listings by postal code, compute the median ratio of price to surface within each group, and use that group level median as the value of the feature for each property. This is a form of target encoding, where a high cardinality categorical (the postal code) is replaced by a numeric summary of the target conditional on that category. Target encoding is a known technique with known benefits and known risks.

The main risk, in our case, is that the median price per square meter within a postal code is computed on the very listings that the model will be trained on. If a listing is in a small postal code with only a handful of training rows, then the listing's own price, divided by its own surface, is itself one of the contributors to the median. The model, when it sees the feature, is therefore seeing a transformed version of its own target, and it will use that signal to predict the target with apparent ease on the training set, and the training metrics will look excellent. The validation metrics, which are computed on rows the model never saw, will look fine. The test metrics, also unseen, will also look fine. Everything will look fine, except that on a fresh dataset where the leakage is not present, the model will perform much worse than the validation and test numbers suggested.

The careful implementation that prevents this is what is called out of fold target encoding. We partition the training set into five folds, and for each fold we compute the postal code level median price per square meter using only the other four folds, and we use that value to encode the rows in the held out fold. No row is encoded using its own target. For the validation set and the test set, we compute the median once on the entire training set, and we apply that single lookup, because in production these splits play the role of "rows the model has never seen". The arithmetic is slightly more complicated than the careless version, but the cost is small and the benefit is large. How large? On a controlled experiment, where we trained the same model first with the careless version of the feature and then with the out of fold version, the careless version reported a training mean absolute error that was about 30% lower than the out of fold version. On the test set, both versions performed identically. The careless version, in other words, was buying nothing on unseen data while reporting a training metric that suggested a much better model. We mention this experiment because it is the clearest example we have of a feature engineering decision that was silently wrong before we noticed and made it carefully right.
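The out of fold procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the pipeline's actual code: the field names (postal_code, price, surface), the index based fold assignment, and the fallback to the median of the available rows for postal codes that are missing from the lookup are all hypothetical choices made for the example.

```python
from collections import defaultdict
from statistics import median


def ratio(row):
    """Price per square metre of one listing."""
    return row["price"] / row["surface"]


def group_medians(rows):
    """Median price per square metre for each postal code present in rows."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["postal_code"]].append(ratio(row))
    return {code: median(values) for code, values in groups.items()}


def oof_encode(train_rows, n_folds=5):
    """Encode every training row using only rows from the other folds."""
    encoded = [None] * len(train_rows)
    for fold in range(n_folds):
        # Assumption: fold membership is the row index modulo n_folds.
        others = [r for i, r in enumerate(train_rows) if i % n_folds != fold]
        lookup = group_medians(others)
        fallback = median(ratio(r) for r in others)
        for i, row in enumerate(train_rows):
            if i % n_folds == fold:
                encoded[i] = lookup.get(row["postal_code"], fallback)
    return encoded


def encode_unseen(train_rows, new_rows):
    """Validation and test rows use one lookup fitted on the whole training set."""
    lookup = group_medians(train_rows)
    fallback = median(ratio(r) for r in train_rows)
    return [lookup.get(r["postal_code"], fallback) for r in new_rows]


# Toy data: three listings in postal code 1000, one lonely listing in 9999.
train = [
    {"postal_code": "1000", "price": 300_000, "surface": 100},
    {"postal_code": "1000", "price": 400_000, "surface": 100},
    {"postal_code": "1000", "price": 500_000, "surface": 100},
    {"postal_code": "9999", "price": 100_000, "surface": 100},
]
naive = [group_medians(train)[r["postal_code"]] for r in train]
oof = oof_encode(train, n_folds=2)
```

In the toy data, the naive encoding of the lonely listing in 9999 is exactly its own price per square metre, which is the leak in its purest form. The out of fold encoding of the same row is computed without it.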

The lesson generalises: whenever a feature is computed as a function of the target, even an indirect function such as a per group average, the computation has to be done in a way that no row sees its own contribution. Out of fold encoding is one way, leave one out encoding is another, pre computing the encoding on a separate historical period before training is a third. The choice depends on the data structure. The non choice, which is to compute the encoding on the training set and apply it to the same training set, is the one that quietly inflates every metric and that we would like to discourage.

V. Model

We can now describe the model itself since the previous sections were the careful preparation that the model needs in order to do anything useful. The present section is comparatively short, because the modelling choices, taken individually, are not surprising. We will nevertheless try to defend each choice with a sentence or two of honest reasoning, and we will be explicit about the few places where our setup departs from the textbook recipe.

V.A Why LightGBM

The honest answer to the question of why we chose LightGBM is that on the kind of data we have, gradient boosted decision trees are the default. They handle heterogeneous tabular data with mixed types, they handle missing values without explicit imputation, they are robust to feature scaling, they pick up non linear interactions between features without help, and they train two orders of magnitude faster than any deep architecture we considered. None of these properties is unique to LightGBM: CatBoost and XGBoost share most of them. Our preference for LightGBM over the others is mostly a question of taste and of iteration speed: we found it slightly faster to fit, and its categorical handling, while not as polished as CatBoost's, is good enough for our use case.

We did, for honesty, train a Ridge regression baseline on the same features and a CatBoost model with default hyperparameters, just to make sure we were not missing something obvious. The Ridge model arrived at a mean absolute error of about 135k€ on the held out test set. The commune median price per square meter, used as a single feature predictor, gave about 147k€. The global mean of the training prices gave about 257k€, which sets a useful upper bound on the error: it is what a model that knows nothing achieves. The LightGBM quantile regressor, after tuning, gave about 90k€. CatBoost with default settings gave a comparable number, slightly worse but within the noise of the experiment. We did not invest in tuning CatBoost, because LightGBM was already ahead and because the architectural switch did not, in our opinion, promise a large enough additional gain to justify the work.

A note on neural network alternatives before we continue. Tabular deep learning has matured in recent years, with architectures such as TabNet, NODE, and SAINT claiming competitive performance against gradient boosting. We considered them and decided not to use them, and the reasons are pragmatic ones: the data scales we work with are well below the regime where deep models start to win, and the iteration time on those models is significantly higher. A future iteration of the project, with more data and a willingness to spend more compute time per training run, might be the right context for revisiting this decision. For now, gradient boosting is the obviously correct choice, and we did not lose much sleep over it.

V.B Quantile regression, not point regression

The single most important architectural choice in our model is that we do not predict a single number per listing: we predict three. Specifically, we fit three separate LightGBM boosters, each one trained with the quantile regression objective at a different target quantile of the conditional response distribution. The lower booster targets the 0.025 quantile, which means it tries to find the value below which roughly 2.5% of the prices for similar listings would fall. The middle booster targets the 0.5 quantile, which is the median, while the upper booster targets the 0.975 quantile. The interval between the lower and upper boosters is a nominal 95% prediction band. In a nutshell, the three boosters answer three questions: how low could the price plausibly be, what is the typical price, and how high could it plausibly be?
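To make the quantile objective concrete, here is a small sketch of the pinball loss that a quantile regressor minimises, together with the parameter sets for three boosters. The loss function is written out by hand for illustration. In LightGBM the same behaviour is selected with objective="quantile" and the alpha parameter. The names QUANTILES, quantile_params and booster_params are assumptions made for this example.

```python
def pinball_loss(y_true, y_pred, alpha):
    """Mean quantile (pinball) loss at level alpha."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        diff = y - p
        # Under-predictions cost alpha per unit, over-predictions cost 1 - alpha.
        total += alpha * diff if diff >= 0 else (alpha - 1.0) * diff
    return total / len(y_true)


# The three target quantiles named in the text.
QUANTILES = {"lower": 0.025, "median": 0.5, "upper": 0.975}


def quantile_params(alpha, shared=None):
    """LightGBM parameters for one quantile booster (shared holds tuned values)."""
    params = dict(shared or {})
    params.update({"objective": "quantile", "alpha": alpha})
    return params


booster_params = {name: quantile_params(a) for name, a in QUANTILES.items()}

# For the upper booster, predicting too low is far more costly than too high.
under = pinball_loss([100.0], [90.0], alpha=0.975)
over = pinball_loss([100.0], [110.0], alpha=0.975)
```

The asymmetry between under and over is what pushes the upper booster towards the top of the conditional distribution, and the mirror image does the same for the lower booster.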

A reader familiar with the textbook treatment of prediction intervals will notice that we use 0.025 and 0.975 rather than the more common and standard 0.1 and 0.9. The widening from a nominal 80% to a nominal 95% at the booster level is a deliberate choice, and it is the most consequential single choice in the whole pipeline. We will explain it in detail in section 6, where we discuss the calibration of the prediction band. For the present subsection we will simply state that without this widening, the empirical coverage of the band on a temporally held out test set was about 61% rather than the 80% we wanted, and that no amount of split conformal calibration could close that gap, because the validation set on which the calibration is computed was not exchangeable with the test set on which we measure coverage. Widening the underlying quantiles, before any calibration, was the intervention that worked.

We also use a transformation on the target, which is worth mentioning briefly. Rather than fitting the quantile regressor on the raw price, we fit it on the natural logarithm of one plus the price. This is the standard log price transformation, and we use it because the distribution of Belgian house prices is heavy tailed. A property at 300k€ and a property at three million euros are not in the same regime of the distribution, and training a model on raw prices would let the very expensive properties dominate the loss. The log transformation makes the quantile loss more nearly symmetric across the price range, and it also has the practical benefit that an additive contribution in log space corresponds to a multiplicative contribution in euros, which is closer to how prices actually compose (e.g., a swimming pool adds roughly the same percentage to a property's value at any price level, not the same absolute amount).
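The claim that an additive shift in log space behaves like a percentage in euros is easy to check numerically. The sketch below uses only the standard library. The shift of 0.05 and the two example prices are hypothetical values chosen for illustration.

```python
import math


def to_log_price(price):
    """Forward transform applied to the target: log(1 + price)."""
    return math.log1p(price)


def to_euros(log_price):
    """Inverse transform applied to the model output: exp(x) - 1."""
    return math.expm1(log_price)


log_shift = 0.05  # hypothetical additive contribution in log space
uplifts = {}
for price in (300_000, 3_000_000):
    new_price = to_euros(to_log_price(price) + log_shift)
    uplifts[price] = new_price / price - 1.0  # relative uplift in euros
```

Both entries of uplifts come out at roughly 5.1%, even though the absolute amounts differ by a factor of ten between the two properties.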

One last detail: we weight training rows by recency, with rows modified within the last six months receiving a weight three times larger than rows modified more than twelve months ago. The intermediate range receives a linearly interpolated weight. The motivation is that the Belgian housing market drifts month to month (well, maybe less frequently), and a model that gives equal weight to a listing from fifteen months ago and a listing from last week will be biased towards the historical price level rather than the current one, so recency weighting is a small but reliable improvement. It is also the kind of detail that is easy to forget, and that we forgot in the first version of the pipeline, with predictable consequences. The computation and selection of these weights could be improved with more experiments and tuning.
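The weighting scheme can be written as a small piecewise linear function. This is a sketch under assumptions: the text only fixes the three to one ratio and the six and twelve month breakpoints, so the absolute weights 3.0 and 1.0 and the function name are illustrative choices. The resulting values would be passed as per row weights to the training call, for example through the weight argument of a LightGBM Dataset.

```python
def recency_weight(age_months, recent=6.0, old=12.0, w_recent=3.0, w_old=1.0):
    """Weight of a training row given the age of its last modification."""
    if age_months <= recent:
        return w_recent
    if age_months >= old:
        return w_old
    # Linear interpolation between the two plateaus.
    frac = (age_months - recent) / (old - recent)
    return w_recent + frac * (w_old - w_recent)


ages = [1.0, 6.0, 9.0, 12.0, 15.0]
weights = [recency_weight(a) for a in ages]
```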

V.C Temporal split

We split the data into training, validation, and test sets with a temporal split, not a random one. We sort all listings by their last modification date, we take the oldest seventy percent as the training set, the next fifteen percent as the validation set, and the most recent fifteen percent as the test set. The training, validation, and test sets are therefore three consecutive windows of time, with no overlap.

A random split would have been simpler to implement, and it would have produced metrics that look a few percentage points better, but those metrics would have been wrong in a sneaky and dangerous way. A random split allows the model to train on a listing from December and validate on a listing from February of the same year. The two listings, even if they describe different properties, share a lot of context: similar economic climate, similar buyer expectations, similar market conditions. A model that learns to exploit the December conditions will look like it generalises to February, but only because February happens to be in the same regime. When the model is deployed on listings from a later month, in a different regime (a recent concrete example would be December 2025 vs March 2026), it generalises much worse than the random split would have suggested.

We discovered the importance of this point the hard, embarrassing way, as an early version of our pipeline used a random split. The reported metrics were better than the model deserved: when we switched to a temporal split and re ran the same code, the validation mean absolute error increased by roughly fifteen percent, and the test mean absolute error increased by roughly twenty percent. The model itself had not changed, only the evaluation had become honest. The earlier numbers were the result of look ahead bias, where the model effectively had information from the future of the training set leaking into the past of the validation set. Since then, we have used a temporal split for every evaluation we report.

A subtle point about the temporal split is that the validation set and the test set are themselves not exchangeable with each other. The validation set covers a period roughly six to nine months before the test set, and if the market drifts during that interval, then a calibration computed on the validation set will be miscalibrated on the test set, in a direction that depends on which way the market drifted. We will return to this point in section 6, because it is the source of the coverage gap that the quantile widening had to close.
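A temporal split of this kind takes only a sort and two cuts. The sketch below is a minimal version, assuming each listing is a dictionary with a last_modified date. That field name, the integer percentages, and the synthetic rows are hypothetical.

```python
from datetime import date, timedelta


def temporal_split(rows, date_key="last_modified", train_pct=70, valid_pct=15):
    """Oldest rows go to train, the next slice to validation, the newest to test."""
    ordered = sorted(rows, key=lambda r: r[date_key])
    n = len(ordered)
    n_train = n * train_pct // 100
    n_valid = n * valid_pct // 100
    return (
        ordered[:n_train],
        ordered[n_train:n_train + n_valid],
        ordered[n_train + n_valid:],
    )


# Twenty synthetic listings with scrambled modification dates.
rows = [
    {"id": i, "last_modified": date(2025, 1, 1) + timedelta(days=(i * 37) % 200)}
    for i in range(20)
]
train, valid, test = temporal_split(rows)
```

Every row in train is older than every row in valid, which in turn is older than every row in test, so nothing from the future can inform an earlier window.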

V.D Hyperparameter tuning

We tune the LightGBM hyperparameters with Optuna, using its tree structured Parzen estimator sampler, on the same train and validation split described above. We run sixty trials, with the mean absolute error on the validation set as the objective. The search space covers the learning rate (between 0.01 and 0.2 on a logarithmic scale), the number of leaves (between 31 and 512), the maximum tree depth (between 4 and 15), the minimum number of data points per leaf (between 10 and 200), and the L1 and L2 regularisation coefficients (each between 0.001 and 10 on a logarithmic scale). The tuning takes about 15-20 minutes on a laptop, which is fast enough to re run after any meaningful change to the feature set. Once the best parameters are selected, we use them to fit the three quantile boosters, with each booster choosing its own iteration count on a separate early stopping schedule, because quantile heads typically need two to three times more iterations than a mean squared error head to calibrate properly.
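The search space above maps onto Optuna's suggest calls as follows. This is a sketch, not the exact tuning script: the mapping from the prose to LightGBM parameter names (min_child_samples for the minimum data points per leaf, reg_alpha and reg_lambda for the L1 and L2 coefficients) is an assumption, and MidpointTrial is a hypothetical stand-in that lets the example run without Optuna installed.

```python
import math


def suggest_lgbm_params(trial):
    """Draw one LightGBM configuration from the search space described above."""
    return {
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.2, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 31, 512),
        "max_depth": trial.suggest_int("max_depth", 4, 15),
        "min_child_samples": trial.suggest_int("min_child_samples", 10, 200),
        "reg_alpha": trial.suggest_float("reg_alpha", 1e-3, 10.0, log=True),
        "reg_lambda": trial.suggest_float("reg_lambda", 1e-3, 10.0, log=True),
    }


class MidpointTrial:
    """Hypothetical stand-in for an Optuna trial: always returns the midpoint."""

    def suggest_float(self, name, low, high, log=False):
        # Geometric midpoint on a log scale, arithmetic midpoint otherwise.
        return math.sqrt(low * high) if log else (low + high) / 2

    def suggest_int(self, name, low, high):
        return (low + high) // 2


params = suggest_lgbm_params(MidpointTrial())
```

With Optuna itself, an objective function would call suggest_lgbm_params(trial), fit the model, and return the validation mean absolute error. The study would be created with optuna.create_study(direction="minimize", sampler=optuna.samplers.TPESampler()) and run with study.optimize(objective, n_trials=60).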

VI. Uncertainty Quantification

This is the section we found the most interesting to work on, and the one we believe makes the model genuinely useful rather than merely accurate. The previous sections produced a single number per listing, the median fair value, while the present section produces a range around that number, and ensures that the range carries an honest empirical guarantee. The work that earned its keep here was not the choice of method, which is well established in the literature and industry, but rather the diagnostic patience required to figure out why the textbook method was failing on our data, and the willingness to keep running experiments until the failure mode revealed itself.

VI.A Why prediction intervals matter for a buyer

A model that returns a single number per listing is, in our opinion, incomplete. Suppose our model returns a fair value of 390k€ for a particular property. The buyer reads this number, and what is the buyer supposed to do with it? Is the fair value precisely 390k€, plus or minus a few hundred euros? Or could it equally plausibly be anywhere between 340k€ and 440k€? The two scenarios call for very different decisions, and the buyer cannot distinguish between them from a point prediction alone. A prediction interval addresses this directly. Instead of returning one number, we return three: a lower bound, a central estimate, and an upper bound. The lower and upper bounds together form a band, and the band is calibrated so that, on average, the true price falls inside it with a known probability. We chose 80% as our target probability, which is the level at which the band is narrow enough to be informative and wide enough to capture most cases. A 95% band would be more conservative but often too wide to support a decision. A fifty percent band would be more optimistic but would leave the buyer in doubt half of the time. The eighty percent figure is a compromise, and like all compromises it is open to criticism, but in our humble opinion it is the level at which buyers and reviewers feel comfortable.

There is a second reason we care about prediction intervals: the width of the band, by itself, is informative. A property where the model is very confident produces a narrow band, perhaps fifteen percent of the central estimate. A property where the model is uncertain, perhaps because it has unusual features or because it sits in a postal code with very few comparable training examples, produces a wide band, perhaps 40% of the central estimate. The buyer can read the band width as a kind of confidence score, and adjust their behaviour accordingly. A wide band is a signal to look at the listing more carefully and to consider talking to an expert.

VI.B Split conformal prediction in plain language

The technique we use to calibrate the prediction band is called split conformal prediction. It comes from a line of work originally developed by Vovk and collaborators, and it has been refined and popularised over the last decades. We will describe it in the plainest language we can, because the literature on conformal prediction can be intimidating to a reader who is not familiar with the formalism (like most literature).

The idea is the following: our quantile boosters produce a raw band for each listing. The lower booster predicts a value the price should be above with high probability, and the upper booster predicts a value the price should be below with high probability. If the boosters were perfectly calibrated, the empirical coverage on unseen data would match the nominal coverage exactly. In practice they are not perfectly calibrated, and the empirical coverage deviates from the nominal coverage by some amount that depends on the data and the method. Split conformal prediction is a way to correct for this deviation, using a held out set that the boosters have not seen during training.

Concretely, we compute, for each listing in the validation set, the signed distance between the listing's true price and the nearest edge of the raw band, with a positive sign if the price is above the upper bound or below the lower bound, and a negative sign if the price is inside. This quantity is called the non conformity score. We compute the 80th percentile of these non conformity scores, call it q hat, and we widen the raw band by q hat on each side (a negative q hat tightens the band instead). The adjusted band, on a fresh test set drawn from the same distribution as the validation set, will have empirical coverage close to eighty percent by construction.

The mathematical guarantee that justifies this construction is distribution free, in the sense that it does not depend on the internal structure of the boosters or on any parametric assumption about the data. It depends only on the assumption that the validation set and the test set are exchangeable. That last word, exchangeable, will be doing a lot of work in the next subsection, because in our case the validation set and the test set are not exchangeable as explained earlier, and the failure of the assumption is precisely the problem we had to solve.

A practical detail. The version of conformal prediction we use is sometimes called conformalised quantile regression, because the raw band itself comes from a quantile regressor rather than from a single regressor with a symmetric residual scale. The quantile regressor produces a band that is already adapted to the local shape of the conditional distribution, narrower in regions where the model is confident and wider in regions where it is not. The conformal step then applies a uniform correction on top, so the empirical coverage matches the nominal target. The combination is strictly more flexible than a constant width band, and it is the combination we prefer.
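The calibration step can be sketched in plain Python. The example assumes the usual conformalised quantile regression score, which is the signed distance to the nearest band edge (positive outside the band, negative inside), and the common (n + 1) finite sample correction on the percentile. The toy numbers and function names are hypothetical, and the scale is arbitrary.

```python
import math


def cqr_scores(y_true, lower, upper):
    """Signed distance to the nearest band edge: > 0 outside, < 0 inside."""
    return [max(lo - y, y - hi) for y, lo, hi in zip(y_true, lower, upper)]


def conformal_quantile(scores, target=0.80):
    """Empirical quantile of the scores with the (n + 1) correction."""
    n = len(scores)
    rank = min(n, math.ceil((n + 1) * target))
    return sorted(scores)[rank - 1]


def apply_q_hat(lower, upper, q_hat):
    """Widen the band by q_hat on each side (tighten it if q_hat < 0)."""
    return [lo - q_hat for lo in lower], [hi + q_hat for hi in upper]


def coverage(y_true, lower, upper):
    """Share of true values that fall inside the band."""
    hits = sum(lo <= y <= hi for y, lo, hi in zip(y_true, lower, upper))
    return hits / len(y_true)


# Toy validation set with deliberately over wide raw bands.
y_val = [100.0] * 9
half_widths = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
raw_lower = [y - w for y, w in zip(y_val, half_widths)]
raw_upper = [y + w for y, w in zip(y_val, half_widths)]

q_hat = conformal_quantile(cqr_scores(y_val, raw_lower, raw_upper))
cal_lower, cal_upper = apply_q_hat(raw_lower, raw_upper, q_hat)
raw_cov = coverage(y_val, raw_lower, raw_upper)
cal_cov = coverage(y_val, cal_lower, cal_upper)
```

Because the raw bands cover every validation point, q hat comes out negative and the correction tightens the bands until coverage on the calibration set sits just above the target. The same coverage on new data is only expected when those data are exchangeable with the calibration set.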

VI.C The story of how the band went from 61% to 79%

When we first ran the pipeline end to end, we set the lower booster to the 0.1 quantile and the upper booster to the 0.9 quantile, and we calibrated with a single q hat computed on the validation set, exactly as the textbook prescribes. The validation coverage came out at about 80%, comfortably close to the target, so we were pleased. We then evaluated on the held out test set, and the coverage came out at sixty one percent. We were way less pleased.

A nineteen point gap between validation coverage and test coverage is a substantial failure of the conformal guarantee. The textbook treatment of conformal prediction would suggest that this should not happen, because the conformal procedure does not know about the boosters' internal calibration and corrects for whatever miscalibration exists. If validation coverage is at the target, test coverage should also be at the target. The fact that it was not told us that something in our setup was violating the assumptions of the textbook treatment. We spent several weeks running diagnostic experiments, and we will describe them honestly, because the eventual fix becomes much more intuitive once one understands what did not work.

The first experiment we tried was a recency weighted version of the calibration, where the reasoning was that the validation set covers a period of several months, and if the market drifts during that period, then the older listings in the validation set might be less representative of the test conditions than the more recent ones. We recomputed q hat using only validation listings from the last thirty days, and we recomputed it again using exponential recency weights that gave more importance to recent rows. Neither change improved test coverage by more than one or two percentage points. The validation set was already producing a q hat that, when applied to the test set, gave 61% coverage. Adjusting the calibration set's effective composition was not going to fix that. The second experiment was a conditional conformal procedure, where we computed a separate q hat per price decile of the validation set, under the hypothesis that the miscalibration might be larger in some price ranges than in others. The conditional approach is known in the literature, and it does help in some scenarios, but in ours it did not help much. Test coverage moved by another small amount, in the wrong direction in some deciles and the right direction in others, with a net effect close to zero. We were systematically miscalibrated across the entire price range, not in some deciles specifically.

The third experiment was a hybrid of the first two, where we combined recency weighting with per decile q hats. This produced the same null result, slightly noisier. At this point we had four calibration variants and a baseline, all giving roughly the same test coverage, all roughly 19 points below the target. We had to step back and ask why, in principle, none of these adjustments was helping as we wished for. The answer, when we finally saw it, was uncomfortable but obvious in retrospect. The conformal guarantee assumes that the validation set and the test set are exchangeable. In our temporal split, the validation set covers, roughly, months 6 through 9 of the training window, and the test set covers months 10 through 12. If the housing market drifts during those three or four months, then the conditional distribution of price given features is not the same in the validation set and in the test set. The two sets are not exchangeable. The conformal correction is computed on the validation set, where it works correctly: it takes a model that under covers there and widens it just enough to reach 80 percent. But when we apply that same correction on a test set drawn from a different distribution, the correction is no longer the right one. Specifically, the model under covers more on the test set than it under covers on the validation set, because the test set is harder. The validation calibration was, in a sense, fitting to the wrong problem.

Once this was clear, we asked what intervention could possibly fix it. Recency weighting and conditional conformal both operated on the validation set, and the validation set was the wrong place to intervene, because the validation set was already doing what conformal expected it to do. The intervention had to happen elsewhere. The natural place to intervene was at the level of the nominal quantiles themselves. If the boosters at 0.1 and 0.9 produced a band that under covered both validation and test by similar amounts, the conformal correction would close the gap. The problem was that they produced a band that was correctly calibrated on validation and severely under covered on test. The fix was to widen the nominal quantiles enough that the boosters over covered the validation set, then let the conformal procedure pull the band back down. The widening shifted the bias direction so that the test set, while still harder than validation, would be only slightly under covered after calibration rather than catastrophically under covered. We tried 0.05 and 0.95 first, and we got an improvement of about ten percentage points on test coverage. We tried 0.025 and 0.975, and we got the rest of the gap. The validation coverage at the 0.025/0.975 setting was about 96% before conformal, which the conformal step then pulled down to about 80% on test. The conformal q hat in this setting was a negative number, which meant that the procedure was tightening the band rather than widening it. We were, in effect, using conformal as a calibration that subtracts uncertainty rather than adds it, because the over wide booster band was already covering generously.

The lesson, more general than our specific case, is that when validation and test sets are not exchangeable, no procedure that operates only on the validation set can fix the resulting miscalibration. The intervention has to change the model itself, in a direction that introduces a bias that the conformal correction can then absorb. We arrived at this lesson by running four experiments that did not work, and we believe the four failed experiments are at least as instructive as the one that did if not more.

VI.D Final coverage and what it costs

After the widening, the empirical coverage on the held out test set sits at 79.5% against a target of 80%. We do not consider the half a percentage point gap to be significant: it is well within the noise of a roughly eight thousand row test set, and we have observed it move by one to two percentage points between consecutive retraining runs. We accept this level of noise as the cost of operating on a non stationary market. The cost of the widening, in band width, is real but moderate. The final bands are about ten to fifteen percent wider, on average, than the bands a naive 0.1-0.9 booster pair would have produced without conformal correction. We argue that this cost is justified, as the naive bands, despite being narrower, would have covered only 61% of the test prices, which means that for 39% of listings the buyer would have been told that the true price was inside a range that did not contain it. That is a silent failure mode of the worst kind. The widened bands cover 79.5%, which is honest, and the additional 15% of band width is a small price to pay for the honesty.

In euros, the typical eighty percent band is roughly forty thousand euros wide on a property with a fair value of 300k€, which is about plus or minus 6% of the central estimate. The width grows roughly proportionally with the fair value, so a property at one million euros has a band of about 120 thousand euros, and a property at 150k€ has a band of about seventeen thousand euros. A reader who finds these widths uncomfortably large is, in our view, correctly perceiving the limits of what the model can promise on a market with this much heterogeneity. A model that promises much narrower bands and still hits eighty percent coverage on a temporally held out test set would be a next target for us to reach.

VII. Explainability

A model that returns a fair value and a band around it is already more useful than a single number. A model that, in addition, can tell the buyer why it arrived at that fair value, and which features of the listing pushed the value up or down, is even more useful. The technical machinery for producing such per-listing explanations is mature and well known: it is the family of methods known as SHAP (SHapley Additive exPlanations). The interesting work, in our case, was not in implementing SHAP, which is essentially one library call. The interesting work was in deciding what to show to the user, on what scale, and with how much narrative scaffolding around the raw numbers.

VII.A Global view: which features the model leans on

Before we discuss per-listing explanations, it is useful to look at the model's behaviour at the population level. The standard diagnostic for this is the global SHAP summary, which ranks features by the mean absolute SHAP value across all listings. A feature with a high mean absolute SHAP is one that, on average, moves the prediction by a large amount in either direction, regardless of which direction. It is a measure of how much the model leans on that feature overall.

The top of our global ranking contains very few surprises. The single most influential feature is the net habitable surface, which is the obvious answer. The second is the postal code treated as a categorical, which captures most of the location signal that the model uses. The third is the commune level median price per square meter, the target encoded feature we discussed at length in section 4.5. The fourth is the construction year, the fifth is the EPC energy score, and the sixth is the Statbel median sell price for the local area. Together these six features account for the majority of the model's effective behaviour. A reader who has spent any time thinking about house prices will find this list unremarkable.

Where the global ranking became more informative is in the middle of the list, where the rankings of features were less obvious in advance. The distance to Brussels and the distance to Antwerp rank above the distances to the smaller cities, which is expected, but the distance to Ghent ranks higher than the distance to Liege, which initially surprised us considering the number of people living near each one. On reflection, this makes sense: the Ghent metropolitan area extends along several axes that intersect many of the high price postal codes in East Flanders, while Liege's influence on prices is more localised. The distance feature is picking up an axis of demand that simply does not extend as far geometrically around Liege.

Two further entries surprised us in a more useful way. The construction decade, which we added as a derived feature, ranks above several of the binary amenity flags, which suggests that the model is picking up cohort effects in the housing data that are not captured by the construction year alone. The surface per bedroom ratio, the other derived numeric feature, ranks above the bathroom count, which suggests that the geometry of how the surface is divided matters more to the model than the simple count of any one room type. Both of these confirmed that the small set of derived features we kept were earning their place.

We also produce a beeswarm visualisation of the SHAP values, which shows for each feature both the magnitude of the contributions and their direction as a function of the feature value. The beeswarm diagram is useful for understanding non monotonic relationships, where a feature pushes the price up in some regions of its range and down in others. The construction year, in particular, exhibits a non monotonic shape: very recent construction adds value, very old construction also adds value (the heritage premium for pre 1900 houses), and the middle of the range, particularly the 1950s to 1970s, contributes negatively to the price. The model captures this shape correctly without us having to encode it manually, which is one of the practical advantages of using a gradient boosted tree ensemble over a parametric form.
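The global ranking discussed in this subsection is nothing more than a mean of absolute values per feature. The sketch below shows the computation on a hand made table of SHAP values. The feature names and numbers are hypothetical, and in practice the per listing values would come from a SHAP explainer run on the fitted booster.

```python
def global_ranking(shap_rows):
    """Rank features by mean absolute SHAP value across listings."""
    totals = {}
    for row in shap_rows:
        for name, value in row.items():
            totals[name] = totals.get(name, 0.0) + abs(value)
    n = len(shap_rows)
    means = {name: total / n for name, total in totals.items()}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)


# Hypothetical log space SHAP values for three listings.
shap_rows = [
    {"habitable_surface": 0.30, "construction_year": -0.05, "epc_score": 0.02},
    {"habitable_surface": -0.20, "construction_year": 0.10, "epc_score": -0.04},
    {"habitable_surface": 0.10, "construction_year": 0.15, "epc_score": 0.00},
]
ranking = global_ranking(shap_rows)
```

Taking the absolute value before averaging is what makes the ranking insensitive to direction: a feature that pushes half of the listings up and half of them down still ranks high.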

VII.B Local view: per listing drivers, in euros

When we first implemented per listing explanations, we did the obvious thing: we computed the SHAP values for each listing, selected the three with the largest absolute magnitude, and displayed them on the listing card alongside the fair value. Each driver appeared as a small row with an arrow indicating direction and a label naming the feature. The result looked clean, and it was technically correct, and it was, in retrospect, almost useless to the buyer.

The problem was that the SHAP values themselves are not on a scale that a non technical reader can interpret. Recall from section 5.2 that we fit our boosters on the natural logarithm of one plus the price. This means that the SHAP values, which decompose the booster's prediction additively, are also on the log price scale. A SHAP value of +0.1 for the surface feature means that the surface contributed an additive 0.1 to the model's log price output. The buyer sees this +0.1 and has no way to translate it into anything actionable. Is plus 0.1 a lot? Is it a little? Does it correspond to ten cents, or to one hundred thousand euros? The answer depends on the magnitude of the prediction itself, on the values of the other SHAP contributions, and on the non linearity of the exponential function that maps log prices back to euros. There is no simple translation from a log space SHAP value to a number the buyer can reason about.

We fixed this by converting the SHAP values from log space to euros. The conversion is straightforward in principle. The model's log space prediction for a listing is the sum of a base value (the average log price across the training set) and the per feature SHAP contributions. To compute the euro contribution of feature i, we take the difference between the actual euro prediction and a counterfactual euro prediction in which feature i has been removed from the sum. In symbols, if pred_log is the model's log prediction and shap_i is the log space SHAP value of feature i, then the euro contribution of feature i is the actual price expm1(pred_log) minus the counterfactual price expm1(pred_log - shap_i). The sign of the result matches the sign of shap_i, but the magnitude is now in actual money, and varies with the listing's predicted price as one would expect.
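The conversion can be sketched in a few lines of Python. This is a minimal illustration rather than the production code: the function name `euro_contributions`, the feature names, and the toy SHAP values are assumptions made for the example.

```python
import math

def euro_contributions(pred_log, shap_log):
    """Convert log space SHAP values to euro contributions.

    pred_log: the model's prediction on the log1p(price) scale.
    shap_log: mapping of feature name -> log space SHAP value.
    """
    actual = math.expm1(pred_log)  # prediction in euros
    return {
        # Actual price minus the counterfactual price without this feature.
        name: actual - math.expm1(pred_log - value)
        for name, value in shap_log.items()
    }

# Hypothetical listing with a predicted price of 300,000 euros.
pred_log = math.log1p(300_000)
toy_shap = {"habitable_surface": 0.10, "postal_code": -0.10}
euros = euro_contributions(pred_log, toy_shap)
```

On this toy listing a SHAP value of +0.10 is worth roughly 28.5k€ and a value of -0.10 roughly minus 31.6k€. The same log space values on a listing predicted at twice the price would be worth about twice as much, which is exactly why the raw log space numbers were hard to read.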

The decomposition is not perfectly additive, in the sense that the sum of the per feature euro contributions is not exactly equal to the difference between the listing's prediction and the population mean. This is because the conversion from log space to euros is multiplicative rather than additive, and the multiplicative combination of contributions does not in general decompose as a clean sum. We considered several ways to enforce additivity, such as redistributing the residual across features in proportion to their absolute magnitudes, and we ultimately decided not to bother. The residual is small enough that the contributions still sum approximately to the right total, and a buyer who reads three or four contributions does not need them to balance to the last euro. We considered the cost of explaining the residual, in interface and in cognitive load, to be larger than the cost of accepting it.
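The size of the residual is easy to inspect for any given listing. The sketch below computes it for a toy case. The function name, the base value, and the SHAP values are invented for illustration, so the resulting figure says nothing about the residuals on our real listings.

```python
import math

def additivity_residual(base_log, shap_log):
    """Gap between the true euro difference and the summed euro contributions."""
    pred_log = base_log + sum(shap_log)
    actual = math.expm1(pred_log)
    per_feature = [actual - math.expm1(pred_log - s) for s in shap_log]
    target = actual - math.expm1(base_log)  # prediction minus base value, in euros
    return target - sum(per_feature)

# Toy numbers: a base value of 350,000 euros on the log1p scale.
base_log = math.log1p(350_000)
single = additivity_residual(base_log, [0.10])
several = additivity_residual(base_log, [0.10, -0.05, 0.03])
```

With a single contribution the decomposition is exact up to floating point error, because removing the only feature recovers the base value. With several contributions of mixed sign, the euro values no longer sum to the total and a residual appears.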

Let's be concrete with an example: take a listing in Verviers, priced at 390k euros, with a fair value of about 263k euros, 234 square meters of habitable surface, an EPC score of C, and a construction year of 1850. The model's six largest contributions by absolute value, in euros, are: postal code ('4800', the Verviers area) at minus 32.4k€, construction year at minus 28.4k€, the Statbel median sell price at minus 26.9k€, habitable surface at plus 25.1k€, the commune median price per square meter at minus 24.6k€, and the EPC score at plus 9.2k€. A buyer reading these six numbers does not need a SHAP tutorial: they can see directly that the model considers the property to be in a low price area and penalises its age (four contributions pulling the value down by a little over one hundred thousand euros in total), and that the size and the EPC score are the main features pushing the value back up. The asking price of 390k€, against this decomposition, looks ambitious.

We also give the user the option to expand the list and see all six top drivers, rather than only the top three. We considered showing all sixty or so contributions, and decided that the cost in screen space was not worth the marginal information. Most listings have a long tail of small contributions that sum to something useful in aggregate but are individually too small to report. We display the top six because, on inspection, the seventh and eighth contributions for most listings are below five thousand euros and below the threshold where we believe a buyer would reasonably act on them.

A note on visualisation. We display each contribution as a small horizontal bar, with the bar extending right of a centre line for positive contributions and left for negative contributions. The bar length is proportional to the absolute euro contribution, scaled within the listing so that the largest contribution fills the available width. The bar is a small visual addition that turns the list of numbers into something a reader can scan in a fraction of a second, and we think that one can grasp the structure of an explanation faster from the bars than from the numbers alone.
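The selection of the top drivers and the scaling of the bars can be sketched as follows. The euro values are those of the Verviers example. The function names, the dictionary keys, and the choice of a 100 unit maximum width are assumptions made for illustration.

```python
def top_drivers(euro_contribs, k=6):
    """Keep the k contributions with the largest absolute euro value."""
    ranked = sorted(euro_contribs.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return ranked[:k]

def bar_widths(drivers, max_width=100):
    """Signed bar widths: negative extends left, positive extends right.

    The largest absolute contribution fills max_width.
    """
    largest = max(abs(value) for _, value in drivers)
    return [(name, round(max_width * value / largest)) for name, value in drivers]

# Euro contributions from the Verviers example (keys are hypothetical names).
verviers = {
    "postal_code": -32_400,
    "construction_year": -28_400,
    "habitable_surface": 25_100,
    "statbel_median_price": -26_900,
    "commune_median_price_m2": -24_600,
    "epc_score": 9_200,
}
bars = bar_widths(top_drivers(verviers, k=6))
```

Scaling within the listing, rather than against a global maximum, keeps the bars readable for cheap and expensive properties alike, at the cost of making bar lengths incomparable across listings.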

VII.C The narrative layer

The euro denominated SHAP values, displayed as a list with mini bars, are already a substantial improvement over the original arrow only display. They are still, however, a list. A reader who is in a hurry, or who does not enjoy reading tables, can miss the overall shape of the explanation and walk away without a clear sense of why the model arrived at the fair value it did. We addressed this by adding a narrative layer on top of the explanations. Above the list of contributions, we display a single sentence, auto generated from the top contributions, that summarises the model's reasoning in plain language. The sentence has a small number of templates, chosen by which combination of positive and negative contributions is present in the top of the list. When both positive and negative contributions are present, the sentence has the form "the fair value is lifted by feature A (value) and feature B (value), and held back by feature C (value) and feature D (value)". When only positive contributions are in the top, the sentence shifts to "the fair value is mostly driven up by feature A (value) and feature B (value)". When only negative contributions are in the top, the sentence shifts to "the fair value is held back by feature A (value) and feature B (value)". The templates are mechanical, and a careful reader will notice this. We do not consider this a defect as the sentence is intended as a quick orientation, not as natural prose.

Returning to the Verviers example from the previous subsection, the auto generated sentence reads roughly: "the fair value is lifted by habitable surface (234 m²) and EPC score (C), and held back by postal code (4800) and construction year (1850)". A reader of this sentence gets, in one line, the same picture they would get from reading and aggregating the six numbered contributions. A reader who wants more detail can then look at the list. A reader who is in a hurry has already extracted the most useful piece of information.
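The template logic is small enough to sketch in full. The version below is an illustration under stated assumptions, not the production code: the function name, the limit of two features per side, and the fallback sentence for an empty list are all hypothetical.

```python
def narrative(drivers, per_side=2):
    """Build the one sentence summary from (label, display_value, euros) tuples.

    drivers must already be sorted by absolute euro contribution, largest first.
    """
    def fmt(items):
        return " and ".join(f"{label} ({value})" for label, value, _ in items)

    ups = [d for d in drivers if d[2] > 0][:per_side]
    downs = [d for d in drivers if d[2] < 0][:per_side]
    if ups and downs:
        return f"the fair value is lifted by {fmt(ups)}, and held back by {fmt(downs)}"
    if ups:
        return f"the fair value is mostly driven up by {fmt(ups)}"
    if downs:
        return f"the fair value is held back by {fmt(downs)}"
    # Hypothetical fallback, not one of the three templates described in the text.
    return "no single feature stands out for this listing"

# The four Verviers drivers that appear in the sentence quoted above.
verviers = [
    ("postal code", "4800", -32_400),
    ("construction year", "1850", -28_400),
    ("habitable surface", "234 m²", 25_100),
    ("EPC score", "C", 9_200),
]
sentence = narrative(verviers)
```

Keeping the templates this mechanical means the sentence can never say anything that the list of contributions does not support.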

We consider the combination of euro denominated SHAP values, mini bars, and a one sentence narrative as the minimum useful level of per listing explanation for a model of this kind. Each component on its own is incomplete: the numbers without the bars are slow to scan, the bars without the narrative require the reader to do the synthesis, and the narrative without the numbers does not support a reader who wants to dig further. The three together provide an explanation that a non technical reader can use, and a technical reader can audit, with neither group being patronised.

VIII. Results

We have spent the previous sections describing what the model is, how it was built, how it is calibrated, and how it explains itself. The present section reports how well it works. We will be deliberate about presenting the headline numbers honestly, about showing the cases where the model is wrong, and about being specific on where the residual errors concentrate.

VIII.A Headline numbers

We evaluate on the held out test set, which is the most recent fifteen percent of the modelling window, ordered by the listing's last modification date. The test set contains roughly eight thousand listings, none of which the model has seen during training or during validation, and none of which contributed to the conformal calibration. The reported numbers are therefore genuinely out of sample. On this test set, the LightGBM quantile regressor at the median booster produces a mean absolute error of approximately 90.3k euros, a median absolute percentage error of 11.4%, and a coefficient of determination of 81.5%. The eighty percent prediction band, after conformal calibration, achieves an empirical coverage of 79.5%, which is within half a percentage point of the nominal target. We consider these four numbers, together, to be the headline performance of the system. For the numbers from the most recent run, please refer to the "About the model" section in the left panel of the website.
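For readers who want to reproduce the three point metrics on their own predictions, the definitions can be sketched with the standard library alone. The function names and the toy prices are assumptions made for illustration. We assume here that all three metrics are computed on prices in euros, after mapping the model output back from log space. Band coverage is left out of this sketch.

```python
from statistics import fmean, median

def mean_absolute_error(y_true, y_pred):
    """Average absolute gap between truth and prediction, in euros."""
    return fmean(abs(t - p) for t, p in zip(y_true, y_pred))

def median_absolute_percentage_error(y_true, y_pred):
    """Median of |error| / truth, expressed in percent."""
    return 100 * median(abs(t - p) / t for t, p in zip(y_true, y_pred))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy prices in euros, invented for illustration.
y_true = [200_000, 300_000, 400_000, 500_000]
y_pred = [220_000, 270_000, 400_000, 450_000]
mae = mean_absolute_error(y_true, y_pred)
mdape = median_absolute_percentage_error(y_true, y_pred)
r2 = r_squared(y_true, y_pred)
```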

For comparison, we ran three baselines on the same test set, with the same evaluation procedure. The first baseline is the global mean, which predicts every listing's price as the mean of the training set's prices. This baseline produces a mean absolute error of approximately 256k euros and a median absolute percentage error of 42%. The second baseline is the commune median price per square meter, which multiplies the listing's surface by the median price per square meter in its postal code. This baseline produces a mean absolute error of approximately 147k euros and a median absolute percentage error of 20.9%. The third baseline is a ridge regression on the same numeric features, with the categorical features handled through a simple target encoding. This baseline produces a mean absolute error of approximately 134k euros and a median absolute percentage error of 17.8%.
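The first two baselines are simple enough to sketch directly. The code below is an illustration, not the evaluation script: the function names, the dictionary keys, the toy listings, and the fallback to the overall median for unseen postal codes are all assumptions.

```python
from collections import defaultdict
from statistics import fmean, median

def global_mean_baseline(train_listings):
    """Predict the training mean price for every listing."""
    mean_price = fmean(row["price"] for row in train_listings)
    return lambda listing: mean_price

def median_price_per_m2_baseline(train_listings):
    """Predict surface times the median price per square meter of the area."""
    per_area = defaultdict(list)
    for row in train_listings:
        per_area[row["postal_code"]].append(row["price"] / row["surface"])
    area_median = {code: median(vals) for code, vals in per_area.items()}
    overall = median(v for vals in per_area.values() for v in vals)
    # Unseen postal codes fall back to the overall median (an assumption).
    return lambda listing: listing["surface"] * area_median.get(
        listing["postal_code"], overall
    )

# Toy training listings, invented for illustration.
train = [
    {"postal_code": "4800", "surface": 100, "price": 150_000},
    {"postal_code": "4800", "surface": 200, "price": 340_000},
    {"postal_code": "9000", "surface": 100, "price": 400_000},
    {"postal_code": "9000", "surface": 150, "price": 570_000},
]
mean_model = global_mean_baseline(train)
m2_model = median_price_per_m2_baseline(train)
```

Both baselines are fitted on the training listings only and then applied unchanged to the test set, so they are subject to the same temporal split as the main model.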

The LightGBM quantile regressor improves on the commune median baseline by approximately a factor of 1.6 on mean absolute error, and on the ridge regression baseline by approximately a factor of 1.5 on the same metric. The ratio is not as dramatic as one sometimes sees in machine learning research, and we want to be honest about that. The commune median baseline is, in itself, a surprisingly strong heuristic for Belgian real estate, because location dominates the price signal and the commune median already captures most of the location effect. A model that did not beat the commune median by a comfortable margin would not have been worth building. A model that beat it by a factor of three on mean absolute error would, in our view, be suspiciously good and would call for a careful audit for leakage. Our model sits in the comfortable range, and we sleep easier thanks to it.

A note on the choice of metrics. We report mean absolute error because it is what a buyer cares about: the average size of the gap between the model's prediction and the truth, in euros. We report median absolute percentage error because it is robust to the long right tail of property prices and gives a scale free sense of relative error. We report the coefficient of determination because it summarises how much of the variance in prices the model explains. We do not report root mean squared error, despite it being more standard in some venues, because it gives disproportionate weight to a small number of very large errors and we found it less useful for the diagnostic work. The choice of metric is, as always, partly a matter of taste.

VIII.B The model blindspots

A summary statistic is, by construction, a summary. The mean absolute error of 90k€ is the average across the test set, and inside that average there are a handful of cases where the model is wrong by a much larger amount. Some of these cases are interesting in themselves, because they reveal something about the shape of the problem that the metric alone does not expose. We will describe three of them.

The first case is a property in Verviers, a medium sized city in eastern Wallonia, listed at 390k€. The model values it at approximately 263k€ and is confident: the eighty percent band runs from about 225k€ to about 310k€, and the listing's asking price falls well outside this band on the upper side. The verdict the model emits is "overvalued by roughly forty percent". The decomposition we discussed in section 7.2 shows that the postal code, the Statbel median sell price, and the commune median price per square meter together pull the fair value down by close to 100k€. The surface (234 square meters) and the EPC score (C) push it back up, but not enough. Is the model right? We do not know with certainty.

What we know is that the model has reasoned consistently from the features it can observe, and that an asking price forty percent above the model's estimate is, statistically, a strong signal that the seller is testing the market. A buyer who reads this verdict and decides to negotiate aggressively, or to walk away, is acting on a reasonable signal. A buyer who decides to trust the seller's price has a small but non zero probability of being right and a larger probability of overpaying.

The second case is the family of new construction projects in Brussels. These are listings that describe a building that does not yet exist, where the seller is a developer rather than an owner, and where the asking price reflects a mixture of land cost, construction cost, and developer margin. They sometimes carry prices that look implausibly high relative to existing buildings of similar specifications, because the buyer is paying for a brand new property with the corresponding warranty and the absence of any maintenance backlog. Our model, trained on existing properties and on their corresponding existing prices, has no way to handle these listings well, because they are outside the training distribution. We do not pretend otherwise. As discussed in section 3.3, we flag these listings as "not priceable" and we withhold any fair value estimate. The user interface displays a verdict that explicitly says the model is not in a position to comment, along with a short explanation of why. We consider this abstention the correct behaviour because a model that produced a fair value for these listings would be guessing in a way that the user could not detect, and it would gradually erode the user's trust in the model on the listings where the model is genuinely competent.
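The verdict logic behind the first two cases can be sketched as a comparison of the asking price against the calibrated band, with abstention taking priority. This is an assumption about the shape of the rule, not the production code: the function name, the `priceable` flag, and every label other than "overvalued" and "not priceable" are hypothetical.

```python
def verdict(asking_price, band_low, band_high, priceable=True):
    """Classify an asking price against the eighty percent prediction band."""
    if not priceable:
        # Abstain rather than guess outside the training distribution.
        return "not priceable"
    if asking_price > band_high:
        return "overvalued"
    if asking_price < band_low:
        return "undervalued"
    return "within the band"

# The Verviers listing: asking 390k€ against a band of about 225k€ to 310k€.
verviers = verdict(asking_price=390_000, band_low=225_000, band_high=310_000)
# A new construction project: the band is ignored because the listing is flagged.
new_build = verdict(asking_price=650_000, band_low=0, band_high=0, priceable=False)
```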

The third case is properties with very old construction years, by which we mean built before approximately 1900. The Belgian housing landscape contains a non trivial number of these, particularly in the historic centres of cities like Brussels, Antwerp, Liege, and Ghent. They are difficult for the model for two reasons. First, the training set contains relatively few of them, so the model has limited evidence on how their prices behave. Second, their prices are driven in significant part by features that the model cannot observe: the historic value of the architecture, the quality of the restoration if the property has been renovated, the cultural prestige of the address if any. The model treats a property built in 1850 and a property built in 1750 as roughly equivalent, because in the training set both fall into the long flat tail of the construction year feature where the partial dependence has very little gradient. The Verviers case from earlier is on the edge of this regime: a building from 1850, where some of the construction year penalty the model applies may be too aggressive, because the heritage premium on well maintained stone houses is a real market force that the available features do not adequately capture. We acknowledge this and we do not have a clean fix (yet).

VIII.C Honest residual analysis

Beyond these named cases, we looked at the test set residuals as a whole and tried to understand where they concentrate. The findings are, on the whole, what one would expect, but it is worth stating them explicitly so that future work has a clear list to attack. Errors concentrate first in the high price tail, where listings priced above one and a half million euros account for a small fraction of the test set but a disproportionate share of the largest absolute errors. The reason is partly that the training set contains few examples in this range, so the model's local function fit is noisier, and partly that very expensive properties are differentiated by features (luxury finishes, exceptional views, historic significance) that the listings do not expose systematically. We could attempt to handle this by training a separate model for the high price segment, but the lack of examples would limit how much such a specialist model could learn. A more promising direction is to incorporate features extracted from the listing photographs, which we discuss in section 9.
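One way to quantify this kind of concentration is to compare a segment's share of the listings with its share of the largest errors. The sketch below does this for the high price tail. The function name, the choice of the three worst errors, and the toy prices and errors are invented for illustration.

```python
def share_of_largest_errors(prices, abs_errors, threshold=1_500_000, top_n=3):
    """Compare a segment's share of listings with its share of the worst errors."""
    in_segment = [price > threshold for price in prices]
    listing_share = sum(in_segment) / len(prices)
    # Indices of the top_n largest absolute errors.
    worst = sorted(range(len(prices)), key=lambda i: abs_errors[i], reverse=True)[:top_n]
    error_share = sum(in_segment[i] for i in worst) / top_n
    return listing_share, error_share

# Toy test set: ten listings, two of them above 1.5 million euros.
prices = [250_000, 300_000, 320_000, 410_000, 480_000,
          520_000, 650_000, 900_000, 1_600_000, 2_400_000]
abs_errors = [20_000, 35_000, 15_000, 60_000, 40_000,
              55_000, 90_000, 120_000, 450_000, 800_000]
listing_share, error_share = share_of_largest_errors(prices, abs_errors)
```

In this toy set the expensive segment holds a fifth of the listings and two thirds of the three worst errors. The same comparison works for any other segment by swapping the membership test.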

Errors concentrate second in postal codes with very few training examples. There are a handful of small rural postal codes in the country where the entire training set contains fewer than twenty listings. The model's per postal code categorical handling is designed to fall back on the postal code's first digit (the province) for sparse postal codes, but the fallback is necessarily coarser than we would like. A finer grained approach, where we incorporate continuous spatial information (e.g., proximity to neighbouring postal codes' price levels) more aggressively, would likely help.
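The fallback can be sketched as a small mapping applied before the postal code is handed to the model as a categorical. This is an illustration under stated assumptions: the function name, the threshold of twenty listings as the trigger, the "6xxx" style bucket label, and the toy counts are all hypothetical.

```python
from collections import Counter

def postal_code_category(postal_code, train_counts, min_listings=20):
    """Return the postal code itself, or a first digit bucket when it is sparse."""
    if train_counts.get(postal_code, 0) >= min_listings:
        return postal_code
    # Sparse or unseen postal code: fall back to its first digit.
    return f"{postal_code[0]}xxx"

# Toy counts of training listings per postal code.
train_counts = Counter({"4800": 310, "9000": 1250, "6929": 7})
dense = postal_code_category("4800", train_counts)
sparse = postal_code_category("6929", train_counts)
unseen = postal_code_category("6940", train_counts)
```

Treating unseen postal codes the same way as sparse ones means a listing from a postal code that first appears after training still receives a usable category.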

Errors concentrate third in the very old properties, as discussed in section 8.2 above. This concentration is the most stubborn of the three, because the missing information is genuinely not in our feature set, and adding it would require a different data source.

A fourth, smaller, concentration of errors appears in listings where the property type is ambiguous. The property types are not always cleanly assigned by the upstream data: a house listed as "mixed use" might have a substantial commercial component, a house listed as "manor" might be either a real heritage manor or a suburban single family house with a marketing flourish. The model treats the property type as a categorical feature, and it learns the average behaviour for each category, which is wrong on the sub categories where the label is unreliable. This is a data quality problem more than a modelling problem, and it would be addressed by better upstream classification rather than by changes to the model. An efficient approach would be to manually inspect these property types and expand the cleaning phase to handle such cases.

A note on what is not, in itself, a residual problem. The model performs comparably across the linguistic regions of the country, once one controls for price level. We were initially worried that the model might be systematically better in Flanders than in Wallonia, simply because Flanders has a higher density of training listings. The residual analysis shows a small effect in this direction, but smaller than we expected and not large enough to justify the engineering cost of separate per region models. We mention this because it is the kind of bias that would have been embarrassing if we had not checked.

Taken together, these four concentrations of error account for most of the cases where the model is more than fifty percent off the listed price. The cases that remain are individually small in volume, and they are spread across the feature space in ways that do not suggest a clean intervention. We consider the model, in its current form, to be performing close to what the available features and training set size will allow. The path to a substantially better model goes through richer features (notably photographs and points of interest) rather than through a better model architecture.

IX. What we would do differently

Every project that runs long enough produces a list of decisions the team would make differently if it could start over. We keep ours short, in the spirit of being useful to the reader rather than exhaustive. Four items, each in a paragraph, each with a concrete rationale.

If we were starting over, we would integrate OpenStreetMap points of interest into the feature set from the very first iteration. The spatial features we currently use, the seven city distances and the postal code categorical, are good but coarse for our liking. They tell the model where in the country a property sits, but they do not tell the model whether the property is two minutes from a tram stop or twenty, whether it is in walking distance of a supermarket or not, whether it is next to a school or next to the highway. Each of these has a clear effect on price, and the data to compute them is freely available, structured, and updated. Our reasons for not including them so far have been about engineering effort rather than data availability, and on reflection we should have prioritised this earlier. We expect that POI based features alone could shrink the mean absolute error by ten to twenty percent. These features will definitely make the cut for the next iteration of the project.
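A first POI feature of this kind, the distance from a listing to the nearest point of interest of a given type, could be sketched as follows. This is a hypothetical illustration of a feature we have not built yet: the function names are invented, the coordinates are arbitrary points in central Brussels rather than real tram stops, and a production version would use a spatial index instead of a linear scan.

```python
import math

EARTH_RADIUS_M = 6_371_000

def haversine_m(lat1, lon1, lat2, lon2):
    """Great circle distance in meters between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def nearest_poi_m(listing, pois):
    """Distance in meters from a listing to the closest point of interest."""
    lat, lon = listing
    return min(haversine_m(lat, lon, poi_lat, poi_lon) for poi_lat, poi_lon in pois)

# Arbitrary coordinates for illustration only.
listing = (50.8466, 4.3528)
tram_stops = [(50.8476, 4.3528), (50.8566, 4.3528)]
distance_to_tram = nearest_poi_m(listing, tram_stops)
```

The nearer of the two toy stops sits one thousandth of a degree of latitude away, which is a little over 111 meters.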

We would also work out from the beginning how to extract features from the listing photographs. The photographs carry information about interior quality, about the state of maintenance, about the aesthetic style, that no structured field captures. The methodology we would adopt is straightforward: encode each photograph with a pretrained vision language model (VLM), project the embeddings to a small number of dimensions through principal components analysis, and use the resulting numeric features as additional inputs to LightGBM. The implementation cost should be moderate, and the engineering cost of storing and updating the embeddings is real but manageable. We expect the gain to be largest precisely on the listings where our current model struggles, namely the very expensive properties and the heritage ones, where the photograph carries information that the structured fields simply cannot encode. As the old saying goes: a picture is worth a thousand features.

Next, we would diagnose the exchangeability problem in the conformal calibration earlier than we did. The four calibration experiments described in section 6.3 were instructive in retrospect, but they consumed several weeks (no shame about it) that, with a sharper diagnostic up front, we could have spent on the eventual fix. The diagnostic itself is simple. After fitting the boosters, plot the empirical coverage of the raw band on validation and on test as separate numbers. If the two are far apart, the conformal correction will not save the situation, and the issue is upstream. We did not produce this plot until late in the process, and once we did, the path to the fix became immediate. A future project would include this diagnostic as a routine step in the evaluation, before attempting any calibration.
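The diagnostic amounts to computing the same coverage number twice and comparing. A minimal sketch, with hypothetical function names and toy values, could look like this. How large a gap counts as "far apart" is a judgment call that the sketch deliberately leaves to the reader.

```python
def empirical_coverage(y_true, lower, upper):
    """Fraction of true values that fall inside [lower, upper]."""
    hits = sum(lo <= y <= hi for y, lo, hi in zip(y_true, lower, upper))
    return hits / len(y_true)

def coverage_gap(validation, test):
    """Raw band coverage on validation and on test, plus the gap between them.

    Each argument is a (y_true, lower, upper) triple of equal length lists.
    """
    cov_val = empirical_coverage(*validation)
    cov_test = empirical_coverage(*test)
    return cov_val, cov_test, cov_val - cov_test

# Toy values in thousands of euros, invented for illustration.
y = [100, 200, 300, 400, 500]
validation = (y, [90, 180, 310, 350, 450], [120, 220, 340, 420, 520])
test = (y, [110, 180, 310, 410, 450], [120, 220, 340, 420, 520])
cov_val, cov_test, gap = coverage_gap(validation, test)
```

Split conformal calibration shifts the band using validation residuals, so it can only transfer to the test period if the raw band behaves similarly on both. A large gap is a sign that it does not.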

If we were starting over, we would prune the long list of binary amenity indicators more aggressively. We currently include seventeen binary flags, ranging from "has a garden" (very informative, kept) to "has a sauna" or "has a secure access alarm" (very rare, very weakly correlated with the price, almost certainly noise contributors). The model handles them gracefully, in the sense that the gradient boosted trees learn to ignore the non informative ones, but their presence widens the feature matrix, slows training slightly, and complicates the global SHAP ranking by introducing a long tail of features with marginal contributions. A future iteration would keep five or six of the flags that survive a credible variable selection step, and would discard the rest. This is the place where we most clearly chose the lazy default ("keep everything, the model will sort it out") over the more disciplined approach, and on reflection it was a small mistake.

X. Closing

This article described a model that estimates the fair value of Belgian for sale listings, calibrates an honest eighty percent prediction band around that estimate, and explains its reasoning to the buyer in euros and in plain language. We presented the data sources, the feature engineering, the choice of LightGBM quantile regression, the temporal split, the conformal calibration and the diagnostic story behind it, the global and local explainability layers, the headline metrics on a temporally held out test set, and an honest account of where the model falls short. We also listed four concrete directions in which future iteration(s) could improve.

We deliberately left out the deployment side of the project, which is genuinely interesting and which we will treat in a separate article. That story includes how we run the analytical database inside the user's browser, how the entire product fits within a zero cost hosting envelope on Cloudflare Pages, how a nightly GitHub Actions cron retrains the model and refreshes the data, and how a small natural language search proxy handles user queries through a large language model. None of those choices affect the model itself, but each is worth its own discussion, and we did not want to dilute the present text by mixing the two stories.

The full source code, the training scripts, the data pipeline, and the calibration experiments are all available in a public repository (see link at the start of the article). We hope the article is useful to readers who are either building something similar, or evaluating someone else's build, or simply curious about how a real estate price model behaves once it is exposed to a real, messy, slightly non stationary market. Comments, corrections, and counter examples are always welcome.