A $1M qualifying result was achieved on the public Netflix test set by a 3-way ensemble team. This is just in time for Yehuda's presentation at KDD, which I'm sure will be one of the best attended ever.

This isn’t quite over—there are a few days for another super-conglomerate team to come together and there is some small chance that the performance is nonrepresentative of the final test set, but I expect not.

Regardless of the final outcome, the biggest lesson for ML from the Netflix contest has been the formidable performance edge of ensemble methods.

Here is my take on the most important scientific contributions produced by this contest:

1. Matrix factorization methods are very powerful. By assigning a vector to each user and each movie and fitting them with simple gradient descent, you get very accurate predictions. In addition, this method can be extended to include other information. I believe matrix factorization methods will spread to almost all other fields of machine learning.
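The core of that first point fits in a few lines. Here is a minimal sketch of Funk-style matrix factorization by stochastic gradient descent; the toy data, hyperparameters, and variable names are all illustrative, not taken from any contestant's code:

```python
import numpy as np

# Hypothetical toy data: (user, movie, rating) triples on a 1-5 scale.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0)]
n_users, n_movies, k = 3, 3, 2          # k latent dimensions per user/movie
lr, reg, epochs = 0.05, 0.02, 500       # step size, L2 penalty, passes
mu = np.mean([r for *_, r in ratings])  # global mean rating

rng = np.random.default_rng(0)
P = rng.normal(scale=0.3, size=(n_users, k))   # one vector per user
Q = rng.normal(scale=0.3, size=(n_movies, k))  # one vector per movie

for _ in range(epochs):
    for u, i, r in ratings:
        err = r - (mu + P[u] @ Q[i])            # prediction error
        P[u] += lr * (err * Q[i] - reg * P[u])  # regularized SGD step
        Q[i] += lr * (err * P[u] - reg * Q[i])

# Training RMSE after fitting; on real data you would measure held-out RMSE.
rmse = np.sqrt(np.mean([(r - (mu + P[u] @ Q[i])) ** 2 for u, i, r in ratings]))
```

The "extended to include other information" part is exactly what the later models in this thread do: bias terms, implicit-feedback terms, and temporal terms all just add more addends inside the prediction before the same SGD update.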

2. ASVD is probably the best neighborhood method published to date in any field of machine learning. Before the contest, collaborative filtering relied heavily on Pearson correlation. Now, Pearson correlation is used as a small component in the overall blend.
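For readers who haven't seen the ASVD/NSVD idea: a user gets no learned vector of their own; instead the user is represented by the (normalized) sum of factor vectors of the movies they rated, so merely *which* movies were rated carries signal. A hedged sketch, with all names and the random initialization purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n_movies, k = 4, 3
Q = rng.normal(scale=0.1, size=(n_movies, k))  # movie "output" factors
Y = rng.normal(scale=0.1, size=(n_movies, k))  # movie "input" (implicit) factors

def predict(rated_movies, target, mu=3.6):
    """Score `target` for a user known only through the set of movies rated.

    The user vector is the sum of Y-factors of rated movies, scaled by
    |N(u)|^{-1/2}; no per-user parameters are needed, which also lets the
    model score users who appear only in the qualifying set.
    """
    user_vec = Y[rated_movies].sum(axis=0) / np.sqrt(len(rated_movies))
    return mu + Q[target] @ user_vec

score = predict([0, 2, 3], target=1)  # untrained factors, so score stays near mu
```

In a real system Q and Y would be trained by the same SGD loop as the plain factorization, with the error backpropagated into every Y row the user touched.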

3. The best way to blend, when possible, is to simultaneously train multiple algorithms in parallel. In other words, use a gradient descent on all algorithms at once and repeat multiple times.
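To make the joint-training point concrete, here is a sketch (same toy data as above, all names illustrative) where a bias model and a factor model are not fit sequentially on residuals but updated together from one shared error each step:

```python
import numpy as np

ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0)]
n_users, n_movies, k = 3, 3, 2
rng = np.random.default_rng(0)
bu = np.zeros(n_users)                          # "model A": user biases
bi = np.zeros(n_movies)                         #            movie biases
P = rng.normal(scale=0.3, size=(n_users, k))    # "model B": user factors
Q = rng.normal(scale=0.3, size=(n_movies, k))   #            movie factors
mu = np.mean([r for *_, r in ratings])
lr, reg = 0.05, 0.02

for _ in range(500):
    for u, i, r in ratings:
        err = r - (mu + bu[u] + bi[i] + P[u] @ Q[i])  # one shared error...
        bu[u] += lr * (err - reg * bu[u])             # ...drives every model
        bi[i] += lr * (err - reg * bi[i])
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * P[u] - reg * Q[i])

rmse = np.sqrt(np.mean([(r - (mu + bu[u] + bi[i] + P[u] @ Q[i])) ** 2
                        for u, i, r in ratings]))
```

Because every component sees the combined prediction's error, each model only learns the structure the others cannot explain, which is the advantage over train-then-freeze residual fitting.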

I don’t know of any more significant advance than these, other than the restricted Boltzmann machine, to come out of the machine learning community in the past few years.

Each method can be implemented in tens of lines of code, can process 100M entries in under an hour on a standard PC, and significantly advanced the state-of-the-art performance.

Can you provide refs for these key ideas? I’m not familiar with them, but they sound good to know.

This is a great summary of matrix factorization techniques known for collaborative filtering:

http://www.research.att.com/~volinsky/netflix/Bellkor2008.pdf

The first published paper on NSVD (called NSVD in this paper, not ASVD):

http://rainbow.mimuw.edu.pl/~ap/ap_kdd.pdf

I believe the most significant breakthroughs came approximately in this order:

1) Basic per-feature factorization by stochastic gradient descent with Tikhonov regularization: Simon Funk’s infamous blog post.

2) Extension of this model to include bias terms and train over multiple features simultaneously.

3) Asymmetric factor model (was this indeed Arek’s?)

4) Integration of #3 with #2 (Bellkor possibly – SVD++)

5) Integration of neighborhood training with model #4. (might be simultaneous discovery).

6) Integration of temporal effects that consider changing customer preference. Koren just got deserved praise for his approach on this. I have a model that is performing under .8730 (~8.3%) that I might publish.

Had this contest continued, I think ensembling and blending would have become a growing focus of publications. I don’t believe there was a single NetflixPrize-specific paper devoted to the concepts of ensembling or blending.

What makes you describe Simon Funk’s blog post as “infamous”??

Could you explain why you call NSVD (Arek’s term) a neighborhood model? I believe its principal advantage over the simpler factorizations is the ability to usefully incorporate the fact that we know users have rated certain movies (those in the qualifying set), even though we don’t have the actual ratings on hand. This is validated by comparing performance across user support levels.

The value of neighborhood support as far as I can tell starts to get pretty slim at higher feature counts, but works really well if you run a low-feature factorization.

Can anyone suggest a very good paper or book for learning about why ensemble methods work so well? Per John’s post, this is a topic I want to understand better now.

The first hit on Google looks good enough to me: machine learning ensemble methods.

It does explain the reasons, but the conclusion I drew is that ensembles do well because of our lack of understanding of what a proper learning algorithm would be: one fully covering the hypothesis space, computationally stable, and returning a weighted distribution over ALL the matching hypotheses instead of a haphazard sampling of one or some of them.

John Elder has a great talk on why ensembles work well that became a chapter, “Model Complexity (and How Ensembles Help),” in the new book Statistical Analysis & Data Mining Applications (Nisbet, Elder, Miner); it describes the idea of generalized degrees of freedom. As it turns out, ensembles actually “act” more simply than many individual models, so that even with thousands of models in the ensemble, it is often smoother and better behaved than single models. I’ve written a few things also (like http://www.tdan.com/view-articles/4960/), and the one cited in this thread by Dietterich is good (he is always very readable). Lastly, at my blog a couple of years ago there was a thread on the PAKDD07 competition. One comment by Phil was particularly interesting, as he described different ways ensembles of the model ensembles could be built that beat even the individual ensembles themselves. http://abbottanalytics.blogspot.com/2007/05/comparison-of-algorithms-at-pakdd2007.html

Cheers.

Jean-Yves reminds me that Jensen’s Inequality applied to a PAC-Bayes bound (tutorial) provides a direct weak-assumption theoretical justification for ensemble methods for squared loss.
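The convexity step behind that is one line; this is a sketch of the standard Jensen argument for squared loss, not Jean-Yves's exact derivation:

```latex
% The map z -> (z - y)^2 is convex, so by Jensen's inequality the uniform
% ensemble's squared loss never exceeds the average member's squared loss:
\left(\frac{1}{n}\sum_{i=1}^{n} f_i(x) - y\right)^{\!2}
\;\le\;
\frac{1}{n}\sum_{i=1}^{n}\bigl(f_i(x) - y\bigr)^{2}
```

A PAC-Bayes bound then controls the right-hand side (the average member's loss) under weak assumptions, so the bound transfers directly to the ensemble.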

The table for minimum separation bounds is not showing up correctly in the PDF. FYI