Since this blog is starting over, I think I'll give a little background on my exploits. I never use toolkits when it comes to the primary algorithm I use for data mining. So things like R, scikit-learn, xgboost and MATLAB, where the code is already implemented, I don't use. In my mind half the fun IS developing algorithms and trying to improve on what has been made. The latter is generally the better approach, as truly new algorithms seem to fall short.
Now there are times when I do use libraries or someone else's code. I use them for data transformation. So while I might write my own distance function or my own correlation coefficient, some things like TSNE or SVD would just be exercises in implementation. And really, I don't want to do that, as I'm not trying to speed them up or somehow improve them. (Though I did make my own version of TSNE, I still started with someone else's code.)
When it comes to actually data mining a new thing (read: kaggle contest), I generally throw some stock stuff I've written at it and then try new variations on how it is implemented. Most of the variations are new ways to improve the mechanism in general and have very little to do with that specific data. Usually I then move to a new contest, find that the variations weren't all that good in general, and backtrack some.
The 2 best tools I have to offer at the moment are variations on random forests. It's worth mentioning that every time I try to implement gradient boosting the results are lackluster. Someday, maybe, I'll get a good version of that. It is a bit disappointing 'cause I have never been in the running to win a contest at https://www.kaggle.com/. And lately xgboost ( https://www.youtube.com/watch?v=X47SGnTMZIU ), which is a very, very good implementation of gradient boosting, has dominated.
Yeah, despite thousands of man-hours I've had 2 top 10% finishes. That's really the difference between developing something new and using out-of-the-box implementations built on standard, well-tested, well-thought-out, proven techniques. So, you know, if your thing is winning, stick with known mechanisms. Most people who had spent the time I've spent using those would have had at least 1 top 10 finish or even an in-the-money finish by now. Well, I would hope so anyway.
So what's different about my 2 random forest implementations? It's the splitters. One uses a splitter that does the best I can come up with to get a pure left and right at each tree node. It picks the purity based on property sorted percentages that weight the accuracy of the correct or incorrect split using normal curves.
It is also written so it can use multiple features at once, which can consolidate down to either a point or a line. It tries all 4 possibilities: 2 points, 1 point and 1 line (either way around), or 2 lines. The distance from said point/line dictates the side a sample falls on (closer is better).
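To make the point/line idea a bit more concrete, here is a rough sketch in Python. It is not my actual splitter. It assumes a "point" is just the mean of a class's rows and a "line" runs through that mean along the class's main direction (taken from an SVD), and it scores each of the 4 pairings by plain accuracy. The names `fit_point`, `fit_line`, `distance` and `best_split` are made up for this example.

```python
import itertools
import numpy as np

def fit_point(rows):
    # Assumption: a "point" reference is simply the mean of the rows.
    return ("point", rows.mean(axis=0))

def fit_line(rows):
    # Assumption: a "line" reference passes through the mean and follows
    # the direction of largest spread (first right singular vector).
    center = rows.mean(axis=0)
    _, _, vt = np.linalg.svd(rows - center, full_matrices=False)
    return ("line", center, vt[0])

def distance(X, ref):
    # Euclidean distance from every row of X to a point or a line.
    if ref[0] == "point":
        return np.linalg.norm(X - ref[1], axis=1)
    _, center, direction = ref
    diff = X - center
    along = diff @ direction                    # component along the line
    perp = diff - np.outer(along, direction)    # what is left is the offset
    return np.linalg.norm(perp, axis=1)

def best_split(X, y):
    # Try point/point, point/line, line/point and line/line.
    # A row goes left when it is closer to the left reference.
    left_rows, right_rows = X[y == 0], X[y == 1]
    best = None
    for fit_l, fit_r in itertools.product((fit_point, fit_line), repeat=2):
        left_ref, right_ref = fit_l(left_rows), fit_r(right_rows)
        goes_left = distance(X, left_ref) <= distance(X, right_ref)
        accuracy = np.mean(goes_left == (y == 0))
        if best is None or accuracy > best[0]:
            best = (accuracy, left_ref, right_ref)
    return best

X = np.array([[0, 0], [1, 0], [0, 1], [1, 1],
              [5, 5], [6, 5], [5, 6], [6, 6]], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
accuracy, left_ref, right_ref = best_split(X, y)
print(accuracy, left_ref[0], right_ref[0])
```

Scoring by raw accuracy is only a stand-in here; the real splitter weights things with normal curves as described above.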
The other version uses logistic (sigmoid) functions based on normal distributions of the values at that point in the tree. It was an "AH HA!" moment about a month and a half ago, when I realized there was a direct translation between normal curves and the sigmoid-shaped cumulative distribution function of the logistic distribution. You don't need to do gradient descent to get optimal values. You just need a normal curve and it translates over almost perfectly. You can see here https://en.wikipedia.org/wiki/Logistic_distribution that the average and standard deviation are known from the normal curve. You just need to do some simple math to figure out the values you need. This works amazingly well. There are more details to it in terms of which side is the "high side" (left or right). The upshot of doing it this way is you can make a multi-dimensional splitter (at any level of dimensionality) in ridiculously fast time. You can make a sigmoid curve that is either based on the sum of features or has the features multiplied out (since combining standard deviations and averages is super simple). I find the sums version works better. This makes sense if your features are all independent. You definitely could mix and match features, combining some and making sums out of others. I haven't tried to figure out if there are any gains from doing that yet.
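Here is a small sketch of that "simple math", going by the Wikipedia page. A logistic distribution with location mu and scale s has mean mu and variance s^2 * pi^2 / 3, so matching a normal curve's average and standard deviation gives s = std * sqrt(3) / pi. The function names are made up for this example, and `combine_sum` assumes the features are independent. The sigmoid you get is a close approximation of the normal CDF, not an exact match.

```python
import math

def logistic_params_from_normal(mean, std):
    # Match mean and variance: logistic variance is scale^2 * pi^2 / 3.
    scale = std * math.sqrt(3) / math.pi
    return mean, scale

def sigmoid_cdf(x, mean, std):
    # Logistic CDF whose mean and standard deviation equal the normal's.
    loc, scale = logistic_params_from_normal(mean, std)
    return 1.0 / (1.0 + math.exp(-(x - loc) / scale))

def normal_cdf(x, mean, std):
    # Exact normal CDF, only here for comparison.
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2))))

def combine_sum(means, stds):
    # Assumption: independent features, so for their sum the means add
    # and the variances add.
    total_mean = sum(means)
    total_std = math.sqrt(sum(s * s for s in stds))
    return total_mean, total_std

# One sigmoid over the sum of two features, no gradient descent needed.
mean, std = combine_sum([1.0, 2.0], [3.0, 4.0])
for x in (mean - std, mean, mean + std):
    print(x, sigmoid_cdf(x, mean, std), normal_cdf(x, mean, std))
```

The comparison loop is just there to show how close the two curves sit; the splitter itself only needs `sigmoid_cdf`.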
So beyond those things and playing around with GBM, that's where I currently am. It's been a long, long road to get here and I'm really hoping for that breakthrough that puts me over the top. I've been looking at using indexes back into my tree to use data from further down the tree to change parent splitters, but the problem is tricky. Unlike xgboost, I want a fully populated tree, so doing that is problematic from a run time perspective, and really it's hard to know exactly what makes the best sense to do.
Tomorrow I'll give some examples of how well these work using a contest ( https://www.kaggle.com/c/homesite-quote-conversion ) I joined today that is ending in a week, just to give people who are curious an idea of how well things work.