Okay! So you want to learn about gradient boosting. Well, first let me point you to the obvious source https://en.wikipedia.org/wiki/Gradient_boosting I'll wait for you to go read it and come back.
Back? Think you understand it? Good! Then you don't need to read further.... probably. I should warn you now in the strictest sense this post is entirely backstory on getting to the point of implementing gradient boosting. You might want to skip to part 2 if you want more explanation of what gradient boosting is doing.
When I first tackled Gradient boosting, I tried it and it didn't work. What I mean to say is I got worse results than Random Forest https://en.wikipedia.org/wiki/Random_forest. Perhaps I'm getting ahead of myself. Let me back up a little more and explain my perspective.
Most people at https://www.kaggle.com/ use tool kits or languages with libraries and built in implementations of all the core functionality they need. That is, the tool kits that they use have everything written for them. They make API calls that perform the bulk of the work. They just assemble the pieces and configure the settings for the calls.
So why does that make difference (toolkit vs handwritten)? Well, i try stuff and have to figure things out. And because of that my version of the technology might work in a fundamentally different way. Or prehaps what I settle on isn't as good (though it probably seems better at the time). Then when I try and leverage that code for the implementation of gradient boosting, it doesn't work like it should.
The core of gradient boosting and random forests is the decision tree https://en.wikipedia.org/wiki/Decision_tree_learning. When it comes to random forest I have been very pleased with the tree algorithm I've designed. However they just didn't seem to work well stubbed out for gradient boosting. I can only think of three explanations for this.
- My stubbed out trees tended to be biased a certain way.
- They have a lot of noise in the output.
- My gradient boosting algorithm got mucked up due to poor implementation.
That at least is the best I can figure.
Recently I made a giant leap forward in my decision tree technology, I had an 'AH HA!' moment and changed my tree implementation to use my new method. When I got it all right my scores went up like 10% (ball-parking here) and all of a sudden when I tried gradient boosting it worked as well.The results I got with that were fully another 15-20% better still! All at once I felt like a contender. The rest of the world had long since moved on to XGBoost https://github.com/dmlc/xgboost which was leaving me in the dust. Not so much anymore, but i still haven't won any contests or like made million on the stock market :) .
What changed in my trees that made all the difference? I started using the sigmoid https://en.wikipedia.org/wiki/Logistic_function to make my splitters. I had tried that as well, many years ago but the specific implementation was key. The ah-ha moment was when I realize how to directly translate my data in to a perfect splitter without any messing around "honing in" on a the best split. I think this technology not only gives me an great tree based on better statistics model than using accuracy alone (accuracy is how my old tree worked). But the results are more "smooth" so noise is less of an issue.