Dim Red Glow

A blog about data mining, games, stocks and adventures.

A contest in a week (part 1)

I thought I'd give some examples of how things generally go for me. I imported the data from the https://www.kaggle.com/c/homesite-quote-conversion contest. I had to do this twice, as I messed it up the first time, which is par for the course. I have some stock data importers I wrote that handle most spreadsheet-style data with very little tweaking.

Next I ran a few out-of-the-box tests. I get my results back in a standard form. I believe this leaderboard is being calculated using ROC (which is essentially AUC, and to a lesser extent the gini score). You may notice I have an AUC of 1 and a LogLoss of NaN in all these tests. My AUC is probably being calculated wrong (maybe it has some rounding hardcoded in it, or expects values from 0 to 1, or some such); I just haven't gone and looked yet. My LogLoss is NaN because it specifically wants values from 0 to 1.
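Since the post mentions gini, AUC, and a LogLoss that goes NaN on out-of-range values, here is a minimal sketch of how the three relate (the function names are mine, not from the author's toolkit): gini is just a rescaled AUC, and log loss blows up unless predictions stay inside (0, 1).

```python
import math

def auc(labels, scores):
    # Rank-based AUC: probability a random positive outranks a random negative.
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def gini(labels, scores):
    # Gini is just a rescaled AUC.
    return 2 * auc(labels, scores) - 1

def log_loss(labels, probs, eps=1e-15):
    # Log loss is only defined for probabilities in (0, 1); unbounded scores
    # produce domain errors or NaN, which matches what the post describes.
    total = 0.0
    for l, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # clamp into (0, 1)
        total += -(l * math.log(p) + (1 - l) * math.log(1 - p))
    return total / len(labels)
```

Note that AUC only cares about rank order, which is why raw (unscaled) scores still produce a valid AUC while log loss needs genuine probabilities.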

Generally speaking, I use whichever test I have that is an exact match, or the closest one. They all approach a perfect score together, and spending the time to implement yet another scoring mechanism is not very high on my list of things to do. Basically, I don't usually worry about it too much unless it's radically different from what I already have.

As for getting the actual scores, I use cross validation like pretty much the rest of the world. I generally only do 3-fold cross validation because significant gains are very apparent even at 3. Increasing it to 4, 5, 6... 9, 10, etc. just makes the tests take longer for little extra information. Though if you are looking for accuracy, go with 10; that seems to be the sweet spot. I've toyed with going back to 5 or even 9, because all too often I do get into the weeds with this stuff, and finding little gains becomes important if you want to win.
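As a sketch of the 3-fold setup described above (the `train_fn`/`score_fn` hooks are hypothetical stand-ins for the author's models and metrics, not part of any stated toolkit):

```python
import random

def k_fold_indices(n, k, seed=0):
    # Shuffle row indices once, then slice them into k roughly equal folds.
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(rows, labels, train_fn, score_fn, k=3):
    # For each fold: train on the other k-1 folds, score on the held-out fold,
    # and average the k scores.
    folds = k_fold_indices(len(rows), k)
    scores = []
    for test_idx in folds:
        test_set = set(test_idx)
        train_idx = [j for j in range(len(rows)) if j not in test_set]
        model = train_fn([rows[j] for j in train_idx],
                         [labels[j] for j in train_idx])
        preds = [model(rows[j]) for j in test_idx]
        scores.append(score_fn([labels[j] for j in test_idx], preds))
    return sum(scores) / k
```

Going from k=3 to k=10 multiplies the training runs while shrinking each held-out fold, which is exactly the time-versus-information trade-off the paragraph describes.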

Legend

rfx1 = random forest experiment 1 ... basic normal random forest with my custom splitters

rfx2 = random forest experiment 2 ... I tested this but the results were really bad

rfx3 = random forest experiment 3 ... basic normal random forest with my logistic/sigmoid splitters. The number after "logistic" represents how many features are used in each split. I should note that I do not randomly select features with this mechanism; I use those that correlate the most with the final score.

rfx4 = random forest experiment 4 ... another logistic splitter. This one uses an additional mechanism that figures parent node scores and accuracies into the final score for the tree's leaves. It didn't improve results, but I thought I'd show it here just to give an example of the kind of things I try.

Results
rfx1 - 46 trees
gini-score:0.796266356063114
CorCoefRR:0.378157688020071
Accuracy:0.7456163354227
LogLoss:NaN
AUC:1
RMSE:1.5631840915599
MAE:1.07671401193999
Rmsle:0.405897189976455
Calibration:0.2543836645773

rfx3 - 46 trees (logistic 1) Leaderboard score with 46 trees 0.90187
gini-score:0.795058723660037
CorCoefRR:0.409256922331928
Accuracy:0.780350470167233
LogLoss:NaN
AUC:1
RMSE:1.52284323987427
MAE:0.979367384563889
Rmsle:0.389736465098101
Calibration:0.219649529832767

rfx3 - 46 trees (logistic 2)
gini-score:0.900834289021121
CorCoefRR:0.592472621731979
Accuracy:0.854328645381246
LogLoss:NaN
AUC:1
RMSE:1.25910166514491
MAE:0.658794207857713
Rmsle:0.316163290273993
Calibration:0.145671354618754

rfx3 - 46 trees (logistic 3) Leaderboard score with 460 trees 0.95695
gini-score:0.905913368972696
CorCoefRR:0.600941784768692
Accuracy:0.859935155804304
LogLoss:NaN
AUC:1
RMSE:1.24723196082576
MAE:0.640500763913664
Rmsle:0.311818233193152
Calibration:0.140064844195696

rfx3 - 46 trees (logistic 4)
gini-score:0.90578470857155
CorCoefRR:0.599962582265556
Accuracy:0.857539596865357
LogLoss:NaN
AUC:1
RMSE:1.25525929882265
MAE:0.657799102194174
Rmsle:0.312383492790915
Calibration:0.142460403134643

rfx4 - 46 trees (logistic 3)
gini-score:0.905049421077021
CorCoefRR:0.599288488034863
Accuracy:0.83152333785067
LogLoss:NaN
AUC:1
RMSE:1.24604271464082
MAE:0.714368432524074
Rmsle:0.326817108831641
Calibration:0.16847666214933

 

Calibration is something I change ad hoc. I have iterative processes that can test things while I'm not at the computer, and I use the calibration field to decide what is working and what is not. It is currently set to 1 - accuracy.

I should talk about accuracy. How do you determine how accurate something is? It's easy when the values are 0 to 100 and scores are evenly distributed, but that's rarely the case. Here's how I do it: I build a normal distribution of the expected results, then I see where the final result fell compared to where the expected result fell on the normal curve. Then I find the cumulative percentage for each of the two results, and 1 minus the difference between them is my accuracy.

An example might go something like this: the test value is -1 and the train value is 2. The average value is 0, and let's say the standard deviation is 1, so 50% lies on either side of the mean. -1 standard deviation falls at 15.9% on the curve and +2 standard deviations at 97.7%, so my accuracy is 1 - (0.977 - 0.159), or 18.2% (which is terrible!), but you see how it works. I do this for all the accuracies I need, whether for bagging, final scores, or splitting.
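The worked example above fits in a few lines; `normal_cdf` uses the standard error-function identity, and the numbers match the 15.9% / 97.7% figures (function names are mine):

```python
import math

def normal_cdf(x, mean, std):
    # Cumulative probability of a normal distribution via the error function.
    return 0.5 * (1 + math.erf((x - mean) / (std * math.sqrt(2))))

def accuracy(predicted, actual, mean, std):
    # Accuracy as described in the post: 1 minus the gap between where the
    # prediction and the true value fall on the normal curve of the targets.
    return 1 - abs(normal_cdf(predicted, mean, std) - normal_cdf(actual, mean, std))
```

With mean 0 and standard deviation 1, `accuracy(-1, 2, 0, 1)` gives roughly 0.18, matching the "terrible" 18.2% in the example.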

That's enough for today. I'll follow up in a few days with how things are going.

 

Current state of my datamining

Since this blog is starting over, I think I'll give a little background on my exploits. I never use toolkits when it comes to the primary algorithm I use for data mining, so things like R, scikit-learn, xgboost, and MATLAB, where the code is already implemented, I don't use. In my mind half the fun IS developing algorithms and trying to improve on what has been made. The latter is generally the better approach, as truly new algorithms seem to fall short.

Now, there are times when I do use libraries or someone else's code: for data transformation. So while I might write my own distance function or my own correlation coefficient, things like t-SNE or SVD would be just exercises in implementation, and really, I don't want to do that, as I'm not trying to speed them up or somehow improve them. (Though I did make my own version of t-SNE, I still started with someone else's code.)

When it comes to actually data mining a new thing (read: Kaggle contest), I generally throw some stock stuff I've written at it and then try new variations on how it is implemented. Most of the variations are new ways to improve the mechanism in general and have very little to do with that specific data. Usually I then move on to a new contest, find that the variations weren't all that good in general, and backtrack some.

The 2 tools I have at the moment that are the best I have to offer are variations on random forests. It's worth mentioning that every time I try to implement gradient boosting, the results are lackluster. Someday, maybe, I'll get a good version of that. It is a bit disappointing, 'cause I have never been in the running for winning a contest at https://www.kaggle.com/, and lately xgboost (https://www.youtube.com/watch?v=X47SGnTMZIU), which is a very, very good implementation of gradient boosting, has dominated.

Yeah, despite thousands of man-hours, I've had 2 top-10% finishes. That's really the difference between developing something new and using out-of-the-box implementations that have been built on standard, well-tested, well-thought-out, proven techniques. So, you know, if your thing is winning, stick with known mechanisms. Most people who had spent the time I've spent using those would have had at least one top-10 finish, or even an in-the-money finish, by now. Well, I would hope so, anyway.

So what's different about my 2 random forest implementations? It's the splitters. One uses a splitter that does the best I can come up with to get a pure left and right at each tree node. It picks the split based on property-sorted percentages that weight the accuracy of the correct or incorrect split using normal curves.

It is also written so it can use multiple features at once, which can consolidate down to either a point or a line. It tries all 4 combinations: 2 points, 1 point and 1 line (either way around), or 2 lines. The distance from that point/line dictates the side a row falls on (closer is better).
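The post doesn't spell out how the points are chosen, so here is a heavily hedged sketch of just the "2 points" case, under my own assumption (not stated in the post) that each side's anchor is the centroid of one class and a row falls on the side of the nearer anchor:

```python
import math

def centroid(rows):
    # Mean point of a set of feature vectors.
    n = len(rows)
    return [sum(r[i] for r in rows) / n for i in range(len(rows[0]))]

def two_point_split(rows, labels):
    # Hypothetical "2 points" splitter: one anchor point per class,
    # and a row goes to the side whose anchor is closer.
    left_pts = [r for r, l in zip(rows, labels) if l == 0]
    right_pts = [r for r, l in zip(rows, labels) if l == 1]
    a, b = centroid(left_pts), centroid(right_pts)
    def side(row):
        return 0 if math.dist(row, a) <= math.dist(row, b) else 1
    return side
```

The line-based variants would presumably swap the point-to-point distance for a point-to-line distance, but that detail isn't given in the post.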

The other version uses logistic (sigmoid) functions based on normal distributions of the values at that point in the tree. It came from an "AH HA!" moment about a month and a half ago, when I realized there is a direct translation between the normal curve and the logistic distribution's cumulative distribution function. You don't need to do gradient descent to get optimal values; you just need a normal curve, and it translates almost perfectly. As you can see at https://en.wikipedia.org/wiki/Logistic_distribution, the mean and standard deviation are known from the normal curve, and some simple math gives you the values you need. This works amazingly well. There are more details to it in terms of which side is the "high side" (left or right). The upshot of doing it this way is that you can make a multi-dimensional splitter (at any level of dimensionality) ridiculously fast. You can make a sigmoid curve based either on the sum of features or on the features multiplied out (since combining standard deviations and averages is super simple). I find the sums version works better, which makes sense if your features are all independent. You could definitely mix and match features, combining some and making sums out of others; I haven't tried to figure out whether there are any gains doing that yet.
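The normal-to-logistic translation is simple enough to show directly: a logistic distribution with scale s has variance s²π²/3, so matching a normal's mean and standard deviation gives s = σ·√3/π, and the resulting sigmoid is a close (though not exact) stand-in for the normal CDF. This is a minimal sketch of that identity, not the author's splitter:

```python
import math

def logistic_from_normal(mean, std):
    # Match the logistic distribution's mean and variance to the normal's:
    # a logistic with scale s has variance s^2 * pi^2 / 3, so s = std*sqrt(3)/pi.
    s = std * math.sqrt(3) / math.pi
    def sigmoid_cdf(x):
        return 1 / (1 + math.exp(-(x - mean) / s))
    return sigmoid_cdf

def normal_cdf(x, mean, std):
    # Reference normal CDF to compare against.
    return 0.5 * (1 + math.erf((x - mean) / (std * math.sqrt(2))))
```

The two curves agree to within a couple of percentage points everywhere, which is why a node's mean and standard deviation are all you need; no gradient descent, just one pass over the values.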

So beyond those things and playing around with GBM, that's where I currently am. It's been a long, long road to get here, and I'm really hoping for that breakthrough that puts me over the top. I've been looking at using indexes back into my tree, to use data from further down the tree to change parent splitters, but the problem is tricky. Unlike xgboost, I want a fully populated tree, so doing that is problematic from a run-time perspective, and really it's hard to know exactly what makes the best sense to do.

Tomorrow I'll give some examples of how well these work using a contest (https://www.kaggle.com/c/homesite-quote-conversion) I joined today that is ending in a week, just to give people who are curious an idea of how well things work.

 

Time for a reboot

I've decided to start my blog over with actual blog software I didn't write. I did this mainly because I wanted some additional features and didn't want to spend time writing them. Also, I have had 2 posts in 1 year, which means everything was stagnant anyway, and it seemed like a good point to start over with a clean slate. Hopefully this new site will give me the gumption to write a little more.