I thought I'd give some examples of how things generally go for me. I imported the data from the https://www.kaggle.com/c/homesite-quote-conversion contest. I had to do this twice because I messed it up the first time, which is par for the course. I have some stock data importers I wrote that handle most spreadsheet-style data with very little tweaking.
Next I ran a few out-of-the-box tests. I get my results back in a standard form. I believe the leaderboard is being scored with ROC AUC (which is closely related to the Gini score). You may notice I have an AUC of 1 and a LogLoss of NaN in all these tests. My AUC is probably being calculated wrong (maybe it has some rounding hardcoded in it, or expects values between 0 and 1, or some such); I just haven't gone and looked yet. My log loss is NaN because that one specifically requires values between 0 and 1.
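To make the NaN concrete, here's a minimal sketch of binary log loss (not my actual scoring code): the formula takes log(p) and log(1 - p), so any prediction outside the 0-1 range feeds a negative number into log and the whole average goes NaN.

```python
import numpy as np

def log_loss(y_true, y_pred):
    """Binary log loss; predictions must be strictly between 0 and 1."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    with np.errstate(invalid="ignore", divide="ignore"):  # silence log-of-negative warnings
        return float(-np.mean(y_true * np.log(y_pred)
                              + (1 - y_true) * np.log(1 - y_pred)))

# Predictions inside (0, 1) give a finite score...
print(log_loss([1, 0, 1], [0.9, 0.2, 0.8]))
# ...but unscaled raw scores produce NaN, because log(1 - p) is
# undefined for p > 1 (and log(p) for p < 0).
print(log_loss([1, 0, 1], [3.0, -1.0, 2.5]))
```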
Generally speaking I use whichever test I have that is an exact match, or the closest one. They all approach a perfect score together, and spending the time to implement yet another scoring mechanism is not very high on my list of things to do. Basically, I don't worry about it much unless a metric is radically different from what I already have.
As for getting the actual scores, I use cross validation like pretty much the rest of the world. I generally only do 3-fold cross validation because significant gains are already apparent at 3 folds. Increasing it to 4, 5, 6... 9, 10, etc. just makes the tests take longer for little extra information, though if you are after accuracy, 10 folds seems to be the sweet spot. I've toyed with going back to 5 or even 9, because all too often I get into the weeds with this stuff, and finding little gains becomes important if you want to win.
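The folds tradeoff is easy to see with off-the-shelf tools. This is a generic scikit-learn sketch (not my pipeline, which is custom): the 10-fold run fits over three times as many models as the 3-fold run for a mean AUC that usually lands in the same neighborhood.

```python
# Compare 3-fold vs 10-fold cross validation on synthetic data.
# Dataset sizes and model parameters here are arbitrary illustrations.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0)

for k in (3, 10):
    scores = cross_val_score(model, X, y, cv=k, scoring="roc_auc")
    print(f"{k}-fold mean AUC: {scores.mean():.4f} (std {scores.std():.4f})")
```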
rfx1 = random forest experiment 1 ... basic normal random forest with my custom splitters
rfx2 = random forest experiment 2 ... I tested this but the results were really bad
rfx3 = random forest experiment 3 ... basic normal random forest with my logistic/sigmoid splitters. The number after "logistic" is how many features are used in each split. I should note I do not randomly select features with this mechanism; I use those that correlate the most with the final score.
rfx4 = random forest experiment 4 ... another logistic splitter. This one uses an additional mechanism that factors parent-node scores and accuracies into the final score for the tree's leaves. It didn't improve results, but I thought I'd show it here just to give an example of the kind of things I try.
rfx1 - 46 trees
rfx3 - 46 trees (logistic 1), leaderboard score with 46 trees: 0.90187
rfx3 - 46 trees (logistic 2)
rfx3 - 46 trees (logistic 3), leaderboard score with 460 trees: 0.95695
rfx3 - 46 trees (logistic 4)
rfx4 - 46 trees (logistic 3)
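The non-random feature selection I mentioned for the logistic splitters can be sketched like this. The function name and setup are hypothetical, not my actual code: rank features by absolute correlation with the target and take the top k for each split, instead of sampling them at random.

```python
import numpy as np

def pick_split_features(X, y, k):
    """Return indices of the k features most correlated with the target."""
    corrs = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                      for j in range(X.shape[1])])
    corrs = np.nan_to_num(corrs)  # constant columns yield NaN correlation
    return np.argsort(corrs)[::-1][:k]

# Toy demo: feature 3 drives the target, so it should rank first.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 2 * X[:, 3] + 0.1 * rng.normal(size=200)
print(pick_split_features(X, y, 2))
```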
Calibration is something I change ad hoc. I have iterative processes that can test things while I'm not at the computer, and I use the calibration field to decide what is working and what is not. It's currently set to 1 - accuracy.
I should talk about accuracy. How do you determine how accurate something is? It's easy when the values run from 0 to 100 and scores are evenly distributed, but that's rarely the case. Here's how I do it: I build a normal distribution from the expected results, then I see where the final result fell on that curve compared to where the expected result fell. The difference between those two percentiles, subtracted from 1, is my accuracy.
An example might go something like this: the test value is -1 and the train value is 2, the mean is 0, and let's say the standard deviation is 1. On the normal curve, -1 standard deviation sits at the 15.9th percentile and +2 standard deviations at the 97.7th, so my accuracy is 1 - (0.977 - 0.159), or 18.2% (which is terrible!), but you see how it works. I do this for all the accuracies I need, whether for bagging, final scores, or splitting.
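That worked example can be checked with a few lines of standard-library Python. This is my reading of the description above, not the original implementation; normal_cdf is the standard normal CDF built from math.erf.

```python
import math

def normal_cdf(x, mean=0.0, std=1.0):
    """Cumulative probability of x under a normal(mean, std) curve."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def accuracy(test_value, train_value, mean, std):
    """1 minus the gap between the two values' percentile positions."""
    return 1.0 - abs(normal_cdf(train_value, mean, std)
                     - normal_cdf(test_value, mean, std))

# Worked example from the text: test = -1, train = 2, mean = 0, std = 1.
# Phi(-1) is about 0.159 and Phi(2) about 0.977, so accuracy lands
# near 0.18 (the 18.2% in the text comes from rounded percentiles).
print(accuracy(-1, 2, 0, 1))
```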
That's enough for today, I’ll follow up in a few days with how things are going.