Published May 2, 2020
A common data science problem is that data could get very sparse after breakdowns along several dimensions.
For example, a friend of mine at Uber had to tackle this problem for surge pricing: They get tons of rides data everyday, but when doing pricing, they have to break them down into groups by several key factors. The price multiples for people going from golden gate bridge to the airport could be very different from those from twin peak to Berkeley. On top of that, the multiples fluctuate hourly and daily. In order to take these factors into consideration, they have to break down their data with combos of these factors. The number of rides very quickly shrinks into single digit for each breakdown group .
This is just one example. We also ran into this issue when building models for dynamic pricing. Breaking parking transactions down into hour of entry and day of week, together with length of stay could get the number of transactions in each group down to almost 0.
Sparse data is rarely good for model building, and in this case it’s caused by the curse of dimensionality.
The issue with sparse group data is that we cannot train models for each group independently. We have to somehow link models for each group together and train them at the same time.
I happened to go through some notes I took during my PhD days recently, and found an interesting one about hierarchical Bayes models. It is one of many approaches for tackling this problem.
The hierarchical Bayes model takes a compromise between two extreme approaches: 1) aggregate all data and ignore the heterogeneity of each group, and 2) analyze at the group level with few and noisy data. Hierarchical Bayes models assume that each group’s data come from a probability distribution specific for that group, but their parameters are drawn from a shared distribution. One can also use more than two levels, thus the “hierarchy” in the name.
Consider a table of weights of 50 rats in every 10 days for a year (a 50 * 36 table). One can either pool all data and do a regression for the weight over time :
Or do a regression for each rat :
Hierarchical Bayes model instead assumes that and come from common distributions and . With this assumption, one can use computational approaches, most commonly Markov Chain Monte Carlo (MCMC), to estimate the parameters , , , and .
With the numbers of home-game win-loss of all NFL football teams, one can either pool all the data and get the average home-game win rate, or take the average for each team. Hierarchical Bayes model does a partial pooling, assuming the win rates are random draws from a larger population.
For this example, we can have the following assumptions:
is the distribution of the number of winning home games for team , and is the number of home games played by team .
is the distribution of the home-game winning probability for team .
With some algebra, given the observed number of winning home games , we have:
Using this result, we can obtain the distribution of the number of winning home games for each team.
This is an over-simplified model for illustration purposes. In reality, hierarchical models need to incorporate many other factors, such as opponents, timing, improvement, injuries, etc. The model can quickly become very complicated. One would need computational tools to do the estimations. The most widely used approach is again Markov Chain Monte Carlo (MCMC).
Hierarchical Bayes models provide a way to tackle the problem of data sparsity when a large amount of data are broken down along several dimensions. It artificially creates a “hierarchy” of parameters that follow certain distributions. This way each group of data follows a group-specific probability distribution, while parameters of these distributions are tied together with a common distribution, thus creating a connection between all models. This is sometimes also referred to as “learning from others”.
Note that hierarchical Bayes model is more of a statistical approach than a machine learning approach. It depends on reasonable assumptions about probability distributions and model hierarchies for the problem at hand. This typically requires deep understanding of the data and its source. But when used in right contexts, it could be more accurate and take significantly less time to train than machine learning models.