Taming missing features at serving time

October 3, 2019 — Written by Leo Pekelis

“We skated over the matter of where our data comes from in the first place, and what we plan to ultimately do with the outputs from our models.” – D2L

At Opendoor, the Data Science team has built predictive models to power multiple business applications. In previous blog posts, we discussed valuing homes in order to make competitive offers to home sellers, as well as forecasting how long our inventory takes to sell to accurately estimate liquidity risk.

Transience in real estate markets is a challenge to all such models. If prices of homes have increased in the past month, then a model trained with last month’s list prices as a feature has difficulty operating this month because it is now forced to make predictions out-of-sample. More subtle and insidious examples abound.

Dive into Deep Learning calls this phenomenon covariate shift and attaches a mathematical framing. Imagine we are tasked with learning a conditional distribution $$p(\mathbf{y}|\mathbf{x})$$ of some output(s) $$\mathbf{y}$$, given some other input(s) $$\mathbf{x}$$. Even if the conditional distribution stays the same between development and deployment, a shift in the input distribution $$p(\mathbf{x})$$ can degrade deployed performance.

In this blog post, I’ll talk about a very specific example of covariate shift, and a neat data augmentation trick that tames it and works with any modeling architecture.

The data

To serve as a reproducible and practical example, I’ll use the house price dataset from Kaggle, freely available at:

https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

The training set contains 1,460 home sales in Ames, Iowa from 2006 to 2010. Our objective will be to build a model that predicts the sale price using other features in the dataset.

A self-contained jupyter notebook that reproduces all the results shown below can be found here.

Occlusion: missing features at serving time (the problem)

Keeping the example going, now imagine that we want to deploy our model into production so that we can use it to predict market value and make a competitive offer to homeowners wanting to sell their home to Opendoor. There is just one wrinkle…

The overall quality score (a feature in the model identified by OverallQual) requires the expert eye of an Opendoor technician to physically survey the home, score it, and submit their report. Furthermore, we don’t always have a technician’s report finished by the time we need to make an offer. This is an example of serving occlusion, where we at times need to make predictions with features missing at serving time, but will eventually see them in time for model training.

The OverallQual feature is important for the model. It improves the mean absolute error (MAE) of our sale price predictions by 12% when used alongside 31 other features. It is doing more than its fair share of the work and is actually the most predictive feature in the model.

But what happens when some percentage of OverallQual values are not available?

In the following simulation, I train a simple model that scales all features to have mean 0 and variance 1, imputes any missing feature values with 0 (the overall training set mean), and finally fits a Poisson regression to explain SalePrice with 32 numeric features in the data. The Poisson regression is mostly to emphasize that all this works for any modeling architecture, and also because home prices (rounded to the nearest $1k) roughly follow a Poisson distribution. I’ll refer to this model as the “naive” estimator.
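For concreteness, here is a minimal sketch of such a pipeline. It assumes a recent scikit-learn (for PoissonRegressor), that the Kaggle training file has been downloaded as train.csv, and it simply takes all numeric columns rather than the exact 32-feature selection used in the post.

```python
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import PoissonRegressor

# Load the Kaggle training data and keep the numeric columns.
df = pd.read_csv("train.csv")
X = df.select_dtypes("number").drop(columns=["SalePrice", "Id"])
y = df["SalePrice"] / 1000  # work in $1k units, as in the post

# Scale to mean 0 / variance 1, impute missing values with 0 (the scaled mean),
# then fit a Poisson regression: the "naive" estimator.
naive = make_pipeline(
    StandardScaler(),
    SimpleImputer(strategy="constant", fill_value=0),
    PoissonRegressor(max_iter=1000),
)
naive.fit(X, y)
```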

I run 100 simulations. For each, I train the naive estimator, mask a random fraction of OverallQual values in the training set (the serving set is the training set plus masking), use the estimator to predict SalePrice, and calculate the MAE. The violin plots below illustrate the distribution of MAE across different amounts of occlusion. Even though the estimator can serve results when features are occluded, that doesn’t mean it does so well. In fact, if more than 30% of OverallQual values are occluded at serving time, it’s better not to include the feature in the naive estimator at all!
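The masking loop itself is only a few lines. The sketch below reuses the naive pipeline and X, y from above; the occlusion fractions and simulation count are illustrative, and the MAE is computed in-sample, as in the post.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)

def occluded_mae(model, X, y, frac, n_sims=100):
    """Average MAE when a random fraction of OverallQual is masked at serving time."""
    maes = []
    for _ in range(n_sims):
        X_serve = X.astype(float)  # copy; float dtype so NaN assignment is safe
        mask = rng.random(len(X_serve)) < frac
        X_serve.loc[mask, "OverallQual"] = np.nan  # occlude the feature
        maes.append(mean_absolute_error(y, model.predict(X_serve)))
    return np.mean(maes)

for frac in [0.0, 0.1, 0.3, 0.5, 0.9]:
    print(f"{frac:.0%} occluded: MAE ~ {occluded_mae(naive, X, y, frac):.2f} ($1k)")
```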

Since we train on the fully visible dataset, and especially since we have been calculating training (in-sample) error this whole time (only which rows will be occluded is unknown at training time), surely we can do better.

Build multiple models (one solution)

Some motivation for improvement comes from the correlation of OverallQual with other columns in the data.

Besides SalePrice, OverallQual is also highly correlated with other features we are using as inputs in the model. For example, it is positively correlated with YearBuilt. However, we naively impute 0 or average quality for all homes, no matter when they’re built, leading to poor predictions. What we should be doing is imputing with the level of quality we expect, given the other feature values that we observe. Luckily, removing the OverallQual feature from the model does roughly that.
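As a quick check, reusing the df loaded in the naive-estimator sketch above, the correlations can be ranked directly:

```python
# Rank numeric columns by their correlation with OverallQual.
corr = df.select_dtypes("number").corr()["OverallQual"].sort_values(ascending=False)
print(corr.head(10))  # SalePrice and YearBuilt are among the most correlated
```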

This suggests building two models: one with the OverallQual feature and one without. Then at serving time, we can simply pick which model we use to predict depending on whether or not quality is available. I’ll call this the “multiple models” estimator.
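Here is one way such a switcher could be wrapped, reusing the naive pipeline pieces as the underlying model; the class and argument names are mine, not from the post.

```python
import numpy as np

class MultipleModelsEstimator:
    """Fit one model with the quality feature and one without,
    then route each row to the right model at predict time."""

    def __init__(self, make_model, occluded_col="OverallQual"):
        self.make_model = make_model      # factory returning a fresh, unfit pipeline
        self.occluded_col = occluded_col

    def fit(self, X, y):
        self.with_feature_ = self.make_model().fit(X, y)
        self.without_feature_ = self.make_model().fit(
            X.drop(columns=[self.occluded_col]), y
        )
        return self

    def predict(self, X):
        occluded = X[self.occluded_col].isna().to_numpy()
        preds = np.empty(len(X))
        if (~occluded).any():
            preds[~occluded] = self.with_feature_.predict(X.loc[~occluded])
        if occluded.any():
            preds[occluded] = self.without_feature_.predict(
                X.loc[occluded].drop(columns=[self.occluded_col])
            )
        return preds

# Example fit, reusing the imports from the naive pipeline sketch above.
mm = MultipleModelsEstimator(
    lambda: make_pipeline(
        StandardScaler(),
        SimpleImputer(strategy="constant", fill_value=0),
        PoissonRegressor(max_iter=1000),
    )
).fit(X, y)
```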

Running the same simulation with the multiple models estimator recovers the behavior we expect. The multiple models estimator always performs better than the naive estimator without the OverallQual feature (orange line), and its improvement over it shrinks as more of the feature is occluded at serving time.

Data stacking (an upstream solution)

One defect of the multiple models estimator is that it increases serving complexity. When we deploy it into production, we’ll also have to build a ModelSwitcher module that chooses between our models based on the input. While the switching logic is simple in this example, multiple sources of occlusion across multiple features is harder to resolve, and any additional overhead is a potential source of failure in a critical part of the system.

A one-model solution is to use data stacking. First, concatenate two copies of our training data, replacing all values of OverallQual in the second copy with NaN (or the missing indicator of your choice). Second, add an indicator feature for occlusion that is 0 over the first copy and 1 over the second. Third, add interactions between the indicator and all other features besides OverallQual.
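In pandas, the three steps might look like the sketch below, reusing X and y from the earlier snippets; the column and function names are mine.

```python
import numpy as np
import pandas as pd

def stack_for_occlusion(X, occluded_col="OverallQual"):
    """Duplicate the data with the feature occluded in the copy,
    add an occlusion indicator, and interact it with the other features."""
    # Step 1: second copy with the feature replaced by NaN.
    occluded = X.copy()
    occluded[occluded_col] = np.nan
    stacked = pd.concat([X, occluded], ignore_index=True)

    # Step 2: indicator that is 0 on the original rows, 1 on the occluded copy.
    stacked["is_occluded"] = np.repeat([0, 1], len(X))

    # Step 3: interactions between the indicator and every other feature.
    for col in X.columns:
        if col != occluded_col:
            stacked[f"is_occluded_x_{col}"] = stacked["is_occluded"] * stacked[col]
    return stacked

X_stacked = stack_for_occlusion(X)
y_stacked = pd.concat([y, y], ignore_index=True)  # targets are duplicated to match
```

The same transform has to be applied at serving time: occluded rows get is_occluded = 1 and interactions computed from their visible features, while fully observed rows get zeros.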

Now our model has both kinds of examples to learn from, occluded and non-occluded, and it has enough flexibility to encode all the differences between predicting when a feature is occluded and when it’s visible. Intuitively, the interactions allow features correlated with OverallQual to impute it when it is missing. I’ll call this model the “data stacking” estimator.

On the same 100 simulations of occlusion, the stacked estimator achieves almost identical MAE to the multiple models estimator.

Look to the appendix below for more simulation results.

Conclusion

Data stacking is a slightly more technical way to solve for serving occlusion, but it pushes code in the right direction—upstream to the feature and model training pipelines. This typically demands fewer nines of uptime. Also, the interaction terms (steps 2 and 3) are not strictly necessary in architectures that do automated feature engineering, like random forests or deep neural nets.

At Opendoor, data scientists and engineers work closely together to truly solve problems end to end. If this type of work interests you, find out more at our careers page.


Appendix

More simulation results

The predictions of the two estimators are also very similar: 90% of predictions are less than $2,500 apart, which is 10% of the naive MAE. And even the coefficients are roughly the same, up to an arbitrary scaling between feature and coefficient: 90% of coefficient values differ by less than 20%.

A note on other imputation techniques

There are lots of other imputation techniques I could have used. Many are substantially more powerful than the SimpleImputer in the naive estimator above. For example, the IterativeImputer included in sklearn predicts missing feature values from a multivariate model fit to all the other features, in a round-robin fashion. Combining it with step 1 of data stacking gives very similar results to full data stacking, though it is computationally more complex.
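As a sketch, swapping the constant imputer in the naive pipeline for IterativeImputer looks like this (the imputer is still experimental in scikit-learn and needs an explicit enable import):

```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import PoissonRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same pipeline as the naive estimator, but with round-robin, model-based imputation.
iterative = make_pipeline(
    StandardScaler(),
    IterativeImputer(max_iter=10, random_state=0),
    PoissonRegressor(max_iter=1000),
)
```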

While the efficiency of imputation algorithms is an interesting topic on its own (and maybe enough content for a separate blog post), my main goal here was to remind data scientists that augmenting data is a valid and sometimes powerful item in our toolbox.