Dim Red Glow

A blog about data mining, games, stocks and adventures.

A contest in a week (part 2)

Here we are a few days later. If you look at the leaderboard https://www.kaggle.com/c/homesite-quote-conversion/leaderboard the current best is 0.97062. I am still way below that at 0.96305. That doesn't sound like much of a gap, but evidently this is a pretty straightforward set of data and the value is in eking out that last little bit of score. My score IS higher than last time I posted; it improved 2 different times.

It first went up when I realized the data I had imported had features that had been flagged categorical. My importer does the best it can to figure that sort of thing out, but it's very, very far from perfect. Just because a feature's values are whole numbers doesn't mean it is necessarily categorical, and even if it DOES only take whole numbers, that doesn't mean it should be treated as a linear progression. In short, I changed all the features to be considered real numbers, and this gave me a large gain.

I should mention I handle categorical features differently than real number features. Basically, I make bags that represent some of the feature values and try to make the bags the same size. In cases where I need to evaluate a categorical feature as a number and it hasn't been changed to a yes or a no, I look at the training data and find an average target value for that category. It's not perfect, but in some cases it's way faster than splitting out 10,000 categories, especially when you want to see if a particular feature has some sort of correlation with the scores.
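
That average-value trick is basically target (mean) encoding. A minimal Python sketch of the idea, assuming the data sits in a pandas DataFrame (the function and column names here are mine, not from my actual code):

import pandas as pd

def encode_category_as_mean(train: pd.DataFrame, feature: str, target: str) -> pd.Series:
    # Replace each category with the average target value it had in training.
    means = train.groupby(feature)[target].mean()
    return train[feature].map(means)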

The 2nd, much smaller gain came from flagging any feature with 10 or fewer distinct values and telling the system to treat it as a categorical field (10 was chosen mainly as a cutoff to keep the total feature count down). Then I went ahead and did a "one-hot encoding" style split on those features to break them into their component parts. That is, you take all the possible values for a feature and give each one its own feature: 5 different values means 5 different features. Each new feature then has either a 0 or a 1 indicating whether that value is present. I flag all those new features as real number features, not categorical features, and turn off the original feature. This gave me a small gain.
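
In code, that one-hot split looks roughly like this (again a sketch over a hypothetical pandas DataFrame, not my actual importer):

def one_hot_low_cardinality(df, max_values=10):
    # Expand any column with <= max_values distinct values into 0/1 indicator
    # columns, one per value, then drop (turn off) the original column.
    for col in list(df.columns):
        values = df[col].unique()
        if len(values) <= max_values:
            for v in values:
                df[str(col) + "=" + str(v)] = (df[col] == v).astype(float)
            df = df.drop(columns=[col])
    return df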

My current internal testing result looks like this:


rfx3 - 460 trees - new cat setup - 0.96305
gini-score:0.919898981627662
CorCoefRR:0.634927047984772
Accuracy:0.886946585874578
LogLoss:NaN
AUC:1
RMSE:1.18522649708472
MAE:0.530871232940142
Rmsle:0.295421422778189
Calibration:0.0801010183723376

I mentioned before that I should probably fix AUC, since that is really what the ROC metric is. I did take a glance at it and didn't see anything obviously wrong, but before this is over I'll almost definitely have to get in there and fix it. I've continued using Gini because it tracks AUC pretty closely.
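
For a binary target the two really are interchangeable: the normalized Gini is just 2*AUC - 1. A quick rank-based sketch in Python, assuming numpy arrays and ignoring tied scores (a real version would average the ranks of ties):

import numpy as np

def auc_from_scores(y_true, y_score):
    # Probability that a random positive is ranked above a random negative.
    order = np.argsort(y_score)
    ranks = np.empty(len(y_score))
    ranks[order] = np.arange(1, len(y_score) + 1)
    pos = y_true == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def gini_from_auc(auc):
    return 2 * auc - 1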

My next steps are to figure out if there is a way I can reduce noise and/or increase the weight of correct values. In a different contest I wrote a mechanism that attempts a version of Platt scaling https://en.wikipedia.org/wiki/Platt_scaling (which, really, is all my sigmoid splitters are) on the result, to better snap answers to an exact whole-number value. I didn't do a standard implementation, which is very me, but it worked really well. This contest does not use whole values though, so I'll have to go take a look at the code to see if I can make it work for that. To be clear, I intend to get the appropriate weight for any given result from the Platt scaling.
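
The standard version of Platt scaling just fits a sigmoid over the raw scores by minimizing log loss. A minimal sketch of that standard form (using scipy for the optimization; as I said, my own version differs):

import numpy as np
from scipy.optimize import minimize

def fit_platt(scores, labels):
    # Fit P(y=1 | s) = 1 / (1 + exp(a*s + b)) by minimizing log loss.
    def nll(params):
        a, b = params
        p = 1.0 / (1.0 + np.exp(a * scores + b))
        p = np.clip(p, 1e-12, 1 - 1e-12)
        return -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))
    a, b = minimize(nll, x0=[-1.0, 0.0]).x
    return lambda s: 1.0 / (1.0 + np.exp(a * s + b))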

This doesn't directly handle another problem I'd like to fix: noise in the data. Any given sample is fuzzy in nature. The values may be exact and correct, but the underlying truth of what they mean is bell shaped. Ideally we would get at that truth and have empirical flags (or real values) come out of each feature that always gave the right answer when used correctly. I don't have a magic way to do that... yet. :) In the meantime, the best thing to do is find the features that do more harm than good. Unfortunately, the only way I have to do that at this point is to brute-force test removing each feature and seeing if there are gains or not. Terrible, I know. It's been running for about 24 hours and has found only 1 feature it thinks is worth turning off, and even that improvement was right on the line of what could be considered variance in my results. This is why you want higher cross validation: so you can say with much more certainty whether a change helps or not. I might stop it and pick up the search later (I can always resume where I left off). It's a good way to fill time productively when you are working or just not interested in programming.
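
The brute-force search is nothing fancy; something like this sketch, where score_with is a stand-in for a full cross-validated training run:

def greedy_feature_removal(features, score_with):
    # Try turning off each feature in turn; keep the removal only if the
    # cross-validated score improves (ideally by more than run-to-run variance).
    active = set(features)
    baseline = score_with(active)
    for f in sorted(active):
        trial = active - {f}
        s = score_with(trial)
        if s > baseline:
            active, baseline = trial, s
    return active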

The only other way I can think of offhand to improve my overall score is to increase the size of the data set I train on. Currently I'm doing my training using a bagged training set. This means I hold out about 1/3 (1.0/e) of the data and use it to score the other 2/3rds (1.0 - 1.0/e) that I build the tree on. That held-out score then becomes my measure of accuracy for that tree. If I can find a good way to increase the training portion, I should in principle get a better, more accurate model. Ideally the ensemble nature of the random forest already takes care of that, so combined with the Platt scaling I'm looking at adding, there may be no real gains. I would think, though, that if you can minimize the statistical overhead and maximize the use of the data, you would get to a precise answer faster. We will see what I come up with.
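
The 1/e figure is just the out-of-bag fraction of a bootstrap sample: drawing n rows of n with replacement leaves each row out with probability (1 - 1/n)^n, which tends to 1/e. Easy to sanity-check:

import numpy as np

rng = np.random.default_rng(0)
n = 100_000
sample = rng.integers(0, n, size=n)    # bootstrap: n draws with replacement
held_out = 1 - len(np.unique(sample)) / n
print(held_out)                        # ~0.368, i.e. about 1/e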


A contest in a week (part 1)

I thought I'd give some examples of how things generally go for me. I imported the data from the https://www.kaggle.com/c/homesite-quote-conversion contest. I had to do this twice, as I messed it up the first time, which is par for the course. I have some stock data importers I wrote that handle most spreadsheet-style data with very little tweaking.

Next I ran a few out-of-the-box tests. I get my results back in a standard form. I believe this leaderboard is being calculated using ROC AUC (which is closely related to the Gini score). You may notice I have an AUC of 1 and a LogLoss of NaN in all these tests. My AUC is probably being calculated wrong (maybe it has some rounding hardcoded in it, or expects values from 0 to 1, or some such); I just haven't gone and looked yet. My LogLoss is NaN because that metric specifically wants values from 0 to 1.
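
Squashing the raw scores into (0, 1) before scoring would fix the NaN. A sketch of what that could look like (min-max scaling is just my placeholder choice here, not what the metric officially expects):

import numpy as np

def safe_log_loss(y_true, preds):
    # Log loss is only defined for probabilities, so map raw scores into
    # (0, 1) first and clip away exact 0s and 1s.
    p = (preds - preds.min()) / (preds.max() - preds.min() + 1e-12)
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))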

Generally speaking, I use whichever test I have that is an exact match for the contest metric, or the closest one. They all approach a perfect score together, and spending the time to implement yet another scoring mechanism really is not very high on my list of things to do. Basically, I don't usually worry about it too much unless it's radically different from what I already have.

As for getting the actual scores, I use cross validation like pretty much the rest of the world. I generally only do 3-fold cross validation, because significant gains are very apparent even at 3. Increasing it to 4, 5, 6... 9, 10... etc. just ends up making the tests take longer for little extra information. Though if you are looking for accuracy, go with 10; that seems to be the sweet spot. I've toyed with going back to 5 or even 9, because all too often I do get into the weeds with this stuff, and finding little gains becomes important if you want to win.
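
For reference, the whole procedure is only a few lines; train_and_score here is a stand-in for building a forest and scoring the held-out fold, and X and y are assumed to be numpy arrays:

import numpy as np

def k_fold_scores(X, y, train_and_score, k=3, seed=0):
    # Shuffle the row indices, carve them into k folds, and score each fold
    # using a model trained on the other k-1 folds.
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        scores.append(train_and_score(X[train], y[train], X[test], y[test]))
    return np.mean(scores), np.std(scores)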

Legend

rfx1 = random forest experiment 1 ... basic normal random forest with my custom splitters

rfx2 = random forest experiment 2 ... I tested this but the results were really bad

rfx3 = random forest experiment 3 ... basic normal random forest with my logistic/sigmoid splitters. The number behind "logistic" represents how many features are used in each split. I should note I do not randomly select features with this mechanism; I use those that correlate the most with the final score.

rfx4 = random forest experiment 4 ... another logistic splitter. This one uses an additional mechanism that figures parent node scores and accuracies into the final score for the tree's leaves. It didn't improve results, but I thought I'd show it here just to give an example of the kind of things I try.

Results
rfx1 - 46 trees
gini-score:0.796266356063114
CorCoefRR:0.378157688020071
Accuracy:0.7456163354227
LogLoss:NaN
AUC:1
RMSE:1.5631840915599
MAE:1.07671401193999
Rmsle:0.405897189976455
Calibration:0.2543836645773

rfx3 - 46 trees (logistic 1) Leaderboard score with 46 trees 0.90187
gini-score:0.795058723660037
CorCoefRR:0.409256922331928
Accuracy:0.780350470167233
LogLoss:NaN
AUC:1
RMSE:1.52284323987427
MAE:0.979367384563889
Rmsle:0.389736465098101
Calibration:0.219649529832767

rfx3 - 46 trees (logistic 2)
gini-score:0.900834289021121
CorCoefRR:0.592472621731979
Accuracy:0.854328645381246
LogLoss:NaN
AUC:1
RMSE:1.25910166514491
MAE:0.658794207857713
Rmsle:0.316163290273993
Calibration:0.145671354618754

rfx3 - 46 trees (logistic 3) Leaderboard score with 460 trees 0.95695
gini-score:0.905913368972696
CorCoefRR:0.600941784768692
Accuracy:0.859935155804304
LogLoss:NaN
AUC:1
RMSE:1.24723196082576
MAE:0.640500763913664
Rmsle:0.311818233193152
Calibration:0.140064844195696

rfx3 - 46 trees (logistic 4)
gini-score:0.90578470857155
CorCoefRR:0.599962582265556
Accuracy:0.857539596865357
LogLoss:NaN
AUC:1
RMSE:1.25525929882265
MAE:0.657799102194174
Rmsle:0.312383492790915
Calibration:0.142460403134643

rfx4 - 46 trees (logistic 3)
gini-score:0.905049421077021
CorCoefRR:0.599288488034863
Accuracy:0.83152333785067
LogLoss:NaN
AUC:1
RMSE:1.24604271464082
MAE:0.714368432524074
Rmsle:0.326817108831641
Calibration:0.16847666214933


Calibration is something I change ad hoc. I have iterative processes that can test things while I'm not at the computer, and I use the calibration field to decide what is working and what is not. It's currently set to 1 - accuracy.

I should talk about accuracy. How do you determine how accurate something is? It's easy when the values are 0 to 100 and scores are evenly distributed; that's rarely the case. Here's how I do it: I build a normal distribution of the expected results, then I see where the final result falls on that normal curve compared to where the expected result falls. The difference between those two percentiles, subtracted from 1, is my accuracy.

An example might go something like this: the test value is -1 and the train value is 2. The average value is 0, and let's say the standard deviation is 1. Then -1 standard deviation sits at the 15.9th percentile and +2 standard deviations at the 97.7th, so my accuracy is 1 - (0.977 - 0.159), or 18.2% (which is terrible!), but you see how it works. I do this for all accuracies I need, whether it is bagging, final scores or splitting.
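
In Python the whole thing is just the normal CDF evaluated twice (a sketch; the names are mine):

import math

def percentile(x, mean, sd):
    # Standard normal CDF evaluated at the z-score of x.
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))

def accuracy(test_value, train_value, mean, sd):
    return 1 - abs(percentile(test_value, mean, sd) - percentile(train_value, mean, sd))

print(accuracy(-1, 2, 0, 1))   # ~0.182, the 18.2% from the example above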

That's enough for today, I’ll follow up in a few days with how things are going.


Current state of my datamining

Since this blog is starting over, I think I'll give a little background on my exploits. I never use toolkits when it comes to the primary algorithm I use for data mining; things like R, scikit-learn, xgboost and MATLAB, where the code is already implemented, I don't use. In my mind half the fun IS developing algorithms and trying to improve on what has already been made. The latter is generally the better approach, as truly new algorithms seem to fall short.

Now, there are times when I do use libraries or someone else's code: I use them for data transformation. So while I might write my own distance function or my own correlation coefficient, things like t-SNE or SVD would just be exercises in implementation, and really, I don't want to do that, as I'm not trying to speed them up or somehow improve them. (Though I did make my own version of t-SNE, I still started with someone else's code.)

When it comes to actually data mining a new thing (read: Kaggle contest), I generally throw some stock stuff I've written at it and then try new variations on how it is implemented. Most of the variations are new ways to improve the mechanism in general and have very little to do with that specific data. Usually I then move to a new contest, find that the variations weren't all that good in general, and backtrack some.

The 2 tools I have at the moment that are the best I have to offer are variations on random forests. It's worth mentioning that every time I try to implement gradient boosting, the results are lackluster. Someday, maybe, I'll get a good version of that. It is a bit disappointing, 'cause I have never been in the running for winning a contest at https://www.kaggle.com/, and lately xgboost https://www.youtube.com/watch?v=X47SGnTMZIU, which is a very, very good implementation of gradient boosting, has dominated.

Yeah, despite thousands of man-hours, I've had 2 top-10% finishes. That's really the difference between developing something new and using out-of-the-box implementations built on standard, well-tested, well-thought-out techniques. So, you know, if your thing is winning, stick with known mechanisms. Most people who had spent the time I've spent using those would have had at least 1 top-10 finish, or even an in-the-money finish, by now. Well, I would hope so anyway.

So what's different about my 2 random forest implementations? It's the splitters. One uses a splitter that does the best job I can come up with of getting a pure left and right at each tree node. It picks the split based on sorted feature percentages that weight the accuracy of the correct or incorrect split using normal curves.

It is also written so it can use multiple features at once, which it can consolidate down to either a point or a line. It tries all 4 combinations: 2 points, a point and a line, a line and a point, or 2 lines. The distance from said point/line dictates the side a sample falls on (closer is better).
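
For the two-point case, the split is just a nearest-anchor test; something like this sketch (how the anchors get chosen is the hard part and not shown):

import numpy as np

def side_by_distance(points, anchor_left, anchor_right):
    # Each sample goes to whichever anchor it is closer to in feature space;
    # swap either anchor for a line and use point-to-line distance instead.
    d_left = np.linalg.norm(points - anchor_left, axis=1)
    d_right = np.linalg.norm(points - anchor_right, axis=1)
    return d_left <= d_right   # True -> left child, False -> right child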

The other version uses logistic (sigmoid) functions based on normal distributions of the values at that point in the tree. It was an "AH HA!" moment about a month and a half ago, when I realized there was a direct translation between normal curves and the sigmoid function, which is the logistic distribution's cumulative distribution function. You don't need to do gradient descent to get optimal values; you just need a normal curve and it translates almost perfectly. You can see here https://en.wikipedia.org/wiki/Logistic_distribution that the average and standard deviation are known from the normal curve; you just need some simple math to figure out the values you need. This works amazingly well. There are more details to it in terms of which side is the "high side" (left or right). The upshot of doing it this way is that you can make a multi-dimensional splitter (at any level of dimensionality) ridiculously fast. You can make a sigmoid curve that is either based on the sum of features or on the features multiplied out (since combining standard deviations and averages is super simple). I find the sums version works better, which makes sense if your features are all independent. You definitely could mix and match features, combining some and making sums out of others; I haven't tried to figure out if there are any gains doing that yet.
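
The translation itself is one line of algebra: a logistic distribution with scale s has variance s^2 * pi^2 / 3, so setting s = sd * sqrt(3) / pi gives a logistic CDF with the same mean and variance as the normal curve. As a sketch:

import math

def sigmoid_from_normal(mean, sd):
    # Build a logistic CDF whose mean and variance match the normal curve's.
    s = sd * math.sqrt(3) / math.pi
    return lambda x: 1.0 / (1.0 + math.exp(-(x - mean) / s))

split = sigmoid_from_normal(mean=0.0, sd=1.0)
print(split(0.0))   # 0.5 right at the mean, as you'd expect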

So beyond those things and playing around with GBM, that's where I currently am. It's been a long, long road to get here, and I'm really hoping for that breakthrough that puts me over the top. I've been looking at using indexes back into my tree to use data from further down the tree to change parent splitters, but the problem is tricky. Unlike xgboost, I want a fully populated tree, so doing that is problematic from a runtime perspective, and really it's hard to know exactly what makes the best sense to do.

Tomorrow I'll give some examples of how well these work using a contest ( https://www.kaggle.com/c/homesite-quote-conversion ) I joined today that is ending in a week, just to give people who are curious an idea of how well things work.


Time for a reboot

I've decided to start my blog over with actual blog software I didn't write. I did this mainly because I wanted some additional features and didn't want to spend time writing them. Also, I had 2 posts in 1 year, which means everything was stagnant anyway, and it seemed like a good point to start over with a clean slate. Hopefully this new site will give me the gumption to write a little more.