Gloria Gaynor: At first I was afraid, I was petrified…
At Opendoor, we buy and sell thousands of homes a month. The longer a home remains in our inventory, the more exposure we have to macroeconomic shifts. Moreover, property taxes, home financing, and maintenance fees all increase with time. These costs—which are sometimes referred to as “liquidity risk”—directly impact our unit economics. It is critical that we’re able to understand and predict how long a house will take to sell.
In this post, we’ll discuss modeling the holding time of our homes using a technique called survival analysis. This dramatically improves our ability to respond to changing market conditions, and to more accurately predict our unit economics.
Describing the Data
Opendoor analyzes a large collection of historical market transactions. We have details about the individual homes that sell—like the number of bedrooms and bathrooms they have, or their square footage—as well as information about each particular listing, like the sale date, sale price, when the home listed, and at what price.
We also look at similar data for homes that are currently active on the market, or that are taken off the market without transacting (we call these “delistings”). For these listings, we observe a lower bound on their time-on-market, but not its true value.
The image above is an example of Opendoor’s dataset. Each row is a listing on the market. We have a variety of home-level details (e.g., square feet) and listing-level details (list price) for each transaction. Our target variable is days-on-market, which we do not get to observe for all listings (some are still active, awaiting buyers, while others delist without ever selling).
Framing as Regression
Given our goal of predicting how long it will take to sell each home, the first approach you might try is to directly predict the variable days-on-market as a function of home and listing features. To do this, we remove transactions that haven’t yet closed, or that ended up dropping off the market without selling—since we don’t know their actual label—and train a model using the remaining data points.
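This naive approach can be sketched as follows. The listing data, column names, and feature values here are all hypothetical toy examples, not Opendoor’s actual schema or model:

```python
import numpy as np

# Hypothetical toy data: one row per listing.
# days_on_market is None for listings that are still active or that
# delisted without selling (censored observations).
listings = [
    # (square_feet, list_price_in_thousands, days_on_market)
    (1500, 250, 20),
    (2200, 340, 45),
    (1800, 300, None),   # still active -> dropped by this approach
    (1300, 210, 15),
    (2600, 410, 90),
    (2000, 330, None),   # delisted without selling -> dropped
]

# Keep only closed sales -- this is exactly the filtering step that
# introduces the bias discussed below.
sold = [(sqft, price, dom) for sqft, price, dom in listings if dom is not None]

# Design matrix with an intercept column, target = days-on-market.
X = np.array([[1.0, sqft, price] for sqft, price, _ in sold])
y = np.array([dom for *_, dom in sold], dtype=float)

# Ordinary least squares fit on the surviving (uncensored) rows only.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
predicted = X @ coef
```

Note that one-third of the listings never make it into the training set, and they are not a random sample: they are precisely the slow-moving homes.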
One way we understand our model’s performance is to take the predictions for all homes that list in a certain month, and then measure the accuracy of these predictions after a suitable amount of time has elapsed (for example, a year later). Below we plot the average error by predicted days-on-market on a synthetic hold-out set of data.
The plot above shows the average error for a (simulated) group of homes listed around the same time, with predictions ordered by their expected time on market. The deviation between the solid “linear regression” line and the dashed “unbiased model” line in this simulation demonstrates the systematic bias that results from dropping censored observations. The riskiest transactions end up being the most poorly explained!
As you can see from the plot of average sale-date errors, our model is negatively biased on homes that take a long time to sell: it under-predicts days-on-market for exactly the listings that take longest to close. These are the riskiest homes, and we’re doing a poor job of predicting them—our model is too optimistic!
The grayed listings above are censored observations: we observe only partial information about our target.
By removing unsold homes from our training data, we end up introducing a significant bias into our model. Opendoor’s transactions are systematically different from standard market sales—we must sell our homes. The active or delisted homes that we discarded are like Wald’s planes that didn’t return; they’re censored observations. This is dangerous because it indicates a discrepancy between our offline backtested metrics and our online metrics. Our machine learning objective function doesn’t match our business objective function.
Reframing as Classification
Another way of viewing the problem is as a binary classification problem: a home is either “risky” or “not-risky”. Homes that sell fast are considered less risky. Listings that move slowly, or that never close at all, are highly risky. The only thing required is choosing a threshold for “risky”, such as homes that don’t close within 100 days on the market. Many time-to-event prediction problems, like churn prediction, are handled in this manner (we often ask “did a user churn or not”, rather than “when, if ever, did the churn occur”).
It’s a simple trick—just a variable transformation and a switch from regression to classification. By looking at the problem this way, we side-step a lot of the censoring issues from before. We get to use data from delistings (that have survived at least 100 days), homes that we’ve defined to be inherently risky, as well as active listings that have exceeded our risk threshold.
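A minimal sketch of this labeling rule, assuming a hypothetical 100-day risk threshold as in the text (the function name and inputs are illustrative, not Opendoor’s actual pipeline):

```python
RISK_THRESHOLD = 100  # days-on-market cutoff separating "risky" from "not-risky"

def risk_label(days_observed, sold):
    """Return 1 (risky), 0 (not risky), or None (unusable) for a listing.

    days_observed -- days on market so far (only a lower bound if not sold)
    sold          -- whether the listing actually closed
    """
    if sold:
        return 1 if days_observed > RISK_THRESHOLD else 0
    # Censored observations: active listings and delistings. Once they
    # survive past the threshold, the label is known (risky) even though
    # we never saw a sale; before that, we still have to discard them.
    if days_observed > RISK_THRESHOLD:
        return 1
    return None
```

For example, a delisting that survived 120 days gets a definite “risky” label (`risk_label(120, False) == 1`), whereas an active 40-day listing remains unlabeled.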
Changing our target and reframing the problem as classification allows us to use more data, particularly about homes that are still active (52 Downtown Ave) or that withdrew from the market without selling (90 Outskirts Lane). However, this approach still throws away information. By quantizing our target, we treat all homes in the same category as equivalent, which is unlikely to be true (is a listing that closed in 7 days really equivalent to one that closed in 99 days?).
Additionally, we discard homes that haven’t exceeded the risk threshold; these are by definition recent transactions. This causes us to respond sluggishly to evolving market conditions. If the market rapidly turns, our model will lag the change significantly, unintentionally increasing the risk the company takes on.
Finally, it’s not clear how a non-technical stakeholder should interpret the output: “Okay, this home is riskier, but how much riskier?” It’s also difficult to answer questions like “When should I expect to finish clearing this cohort of homes?”.
A General Recipe for Survival Analysis
Fortunately, there’s a sub-area of statistics focused on modeling problems exactly like this: survival analysis. By slightly reframing our classification problem, we can avoid the issues mentioned above and leverage a range of techniques related to both discrete-time models and classification problems.
To do this, we leverage the fact that our target variable is discrete, and expand our dataset to include examples of listing-day pairs. Each day a home is on the market without closing is marked as a negative example, and whenever we observe a sale, we label it as a positive example. The censored active transactions and delistings get full representation in this formulation. They simply have no corresponding positive examples.
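The expansion step can be sketched as a small helper function (the function name and row layout are illustrative assumptions, not Opendoor’s actual data pipeline):

```python
def expand_to_listing_days(listing_id, days_observed, sold):
    """Expand one listing into (listing_id, day, label) rows.

    Every day a home sits on the market without closing is a negative
    example; the day of sale (if any) is the single positive example.
    Censored listings -- active homes and delistings -- contribute only
    negative examples, so they get full representation too.
    """
    rows = []
    for day in range(1, days_observed + 1):
        label = 1 if (sold and day == days_observed) else 0
        rows.append((listing_id, day, label))
    return rows
```

For instance, a home that sold on day 3 expands into labels `[0, 0, 1]`, while a delisting observed for 2 days expands into `[0, 0]`.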
While this dramatically increases the size of our dataset, the ecosystem for solving large-scale classification problems is mature and well-understood.
To map back to our original problem of predicting the days-on-market for a home, we predict the probability of a sale on each of a fixed number of listing-days for the particular property. This corresponds to the conditional probability of a listing closing on a particular day, given that it survived on the market until then (in survival analysis parlance, this is the “hazard rate”). With these probabilities, we can stitch together the probability mass function (pmf) for a sale occurring on a specific day, and therefore the expected days until sale, with the following relations.
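In standard survival-analysis notation (writing $T$ for the sale day and $h(t)$ for the hazard our classifier predicts), these relations are:

```latex
h(t) = P(T = t \mid T \ge t)
\qquad
p(t) = h(t) \prod_{s=1}^{t-1} \bigl(1 - h(s)\bigr)
\qquad
\mathbb{E}[T] = \sum_{t} t \, p(t)
```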
Our classifier predicts the conditional probability of a home selling in a particular period, conditioned on that home not having sold before then. To convert the hazard to the more standard pmf, we compute the probability that the home failed to sell on any of the previous days it was on the market, and then multiply by the probability that it sold on the specific day in question (i.e., the hazard).
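This hazard-to-pmf conversion is a few lines of code. The helper names are illustrative; a real horizon would be long enough that the pmf sums to nearly one:

```python
def pmf_from_hazard(hazards):
    """Convert per-day hazard rates h(t) into the sale-date pmf p(t).

    p(t) = h(t) * prod_{s < t} (1 - h(s)): the home survived every
    earlier day on the market, then sold on day t.
    """
    pmf, survival = [], 1.0
    for h in hazards:
        pmf.append(survival * h)
        survival *= 1.0 - h
    return pmf

def expected_days_on_market(hazards):
    """Expected sale day implied by the hazards (days are 1-indexed).

    With a truncated horizon this is only an approximation, since some
    probability mass lies beyond the last modeled day.
    """
    pmf = pmf_from_hazard(hazards)
    return sum(day * p for day, p in enumerate(pmf, start=1))
```

As a sanity check, a constant daily hazard of 0.5 over three days yields the pmf `[0.5, 0.25, 0.125]`.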
One major advantage of this approach is that it provides an intuitive mechanism for adding time-varying features. To encode features like the number of active competitors, the current list price of a home, or whether a home is active during a seasonally slow period (e.g., Thanksgiving), we simply add new features with the appropriate instantaneous value of the attribute.
A second benefit with this structure is that it is immediately intuitive to business stakeholders and industry partners. In real estate, a key market-level indicator is the inventory turnover rate. This measures the ratio of sales in a given period to the number of active listings in that same period. By defining our classification problem in the listing-day space, we model this pervasive and widely-understood relationship. To convert from our unit-level representation to this aggregate view, we can take the mean of sold-in-the-next-day grouped by time.
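Under the assumption that each listing-day row carries a period key and a sold-in-the-next-day indicator, the aggregation is just a grouped mean (the row format here is a hypothetical sketch):

```python
from collections import defaultdict

def turnover_rate(listing_day_rows):
    """Aggregate listing-day rows into a per-period clearance rate.

    Each row is (period, sold_next_day). The rate is the mean of
    sold_next_day within each period -- i.e., sales in the period
    divided by active listings in the period.
    """
    sales = defaultdict(int)
    active = defaultdict(int)
    for period, sold in listing_day_rows:
        sales[period] += sold
        active[period] += 1
    return {period: sales[period] / active[period] for period in active}
```

For example, a period with three active listing-days and one sale has a turnover rate of 1/3.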
Operating in the listing-day space provides a natural framework for adding time-varying features. In the example above, 410 Main Street undergoes a series of price drops over time, from the initial list price of $200,000 down to $170,000. To encode this behavior, we just add a feature for “current list price” and assign the value to be the list price on the specific day.
The chart above shows the inventory turnover rate (also referred to as “clearance rate”) in Phoenix for three consecutive years. The rate is defined as the ratio of sales in a given period to the total number of active listings in that same period. As we can see, for most of 2018, homes were clearing faster than in previous years, but the seasonal dip this year appears to be more dramatic than in the past.
At Opendoor, our mission is to empower everyone with the freedom to move. A key enabler of this is our ability to accurately understand, model, and price liquidity risk. This methodology has improved our ability to accurately forecast our cost structure and dynamically respond to changing market conditions. Because of this, we can charge fairer and more transparent fees.
Thanks to Chris Said, Jackson Gorham, Frank Xia, Todd Small, Sherwin Wu, Mike Chen, Jules Landry-Simard, Joseph Gomez, and many others for their helpful comments and contributions!
Interested in joining our team?
From data science to home renovation, we’re looking for great people who love solving tough challenges.