Here we are a few days later. If you look at the leaderboard https://www.kaggle.com/c/homesite-quote-conversion/leaderboard the current best is 0.97062. I am still way below that at 0.96305. That doesn't sound like much, but evidently this is a pretty straightforward set of data and the value is in eking out that last little bit of score. My score IS higher than last time I posted. It improved two different times.
It first went up when I realized the data I had imported had features that had been flagged as categorical. My importer does the best it can to figure that sort of thing out, but it's very, very far from perfect. Just because a feature's values are whole numbers doesn't mean it is necessarily categorical. And even if it DOES have whole number values, that doesn't mean it should be treated as a linear progression. In short, I changed all the features to be considered real numbers and this gave me a large gain.
I should mention I handle categorical features differently than real number features. Basically I make bags that each represent some of the feature's values, and I try to make the bags the same size. In cases where I need to evaluate the feature as a number and it hasn't been changed to a yes or a no, I look at the training data and find an average value for that category's value. It's not perfect, but in some cases it's way faster than splitting out 10000 categories, especially when you want to see if a particular feature has some sort of correlation with the scores.
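Here is a minimal sketch of the "average value per category" idea. It rests on an assumption: that the average in question is the mean training target for each category, which is the common "mean encoding" trick and not necessarily what my own code does. The names category_means and encode are made up for this example, and unseen categories fall back to the overall mean.

```python
from collections import defaultdict

def category_means(categories, targets):
    """Map each category to the mean training target observed for it."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    return {cat: sums[cat] / counts[cat] for cat in sums}

def encode(categories, means, default):
    """Replace each category with its mean; unseen categories get the default."""
    return [means.get(cat, default) for cat in categories]

# Toy training data (hypothetical): a categorical feature and a 0/1 target.
train_cats = ["a", "b", "a", "c", "b", "a"]
train_y = [1, 0, 1, 0, 1, 0]

means = category_means(train_cats, train_y)
overall = sum(train_y) / len(train_y)
encoded = encode(["a", "b", "z"], means, overall)
```

One pass over the training data gives a single numeric column, which is cheap to correlate against the scores even when the feature has thousands of categories.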
The 2nd, much smaller gain came from flagging any feature with 10 or fewer distinct values and telling the system to treat it as a categorical field. (10 was chosen mainly as a cutoff to keep the total feature count down.) Then I went ahead and did a "one-hot encoding" style split on those features to break them into their component parts. That is, you take all the possible values for a feature and give each one its own feature. 5 different values means 5 different features. Each new feature then has either a 0 or a 1 indicating if that value is present. I flag all those new features as real number features and not categorical features, and I turn off the original feature. This gave me a small gain.
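The split described above can be sketched in a few lines of plain Python. This is an illustration, not my actual importer: the function name one_hot_low_cardinality, the list-of-dicts row format, and the "name=value" naming of the new features are all assumptions made for the example.

```python
def one_hot_low_cardinality(rows, max_values=10):
    """One-hot split every feature that has max_values or fewer distinct values.

    rows is a list of dicts mapping feature name -> value. Features over the
    cutoff are passed through unchanged; split features drop the original.
    """
    names = list(rows[0].keys())
    distinct = {n: sorted({row[n] for row in rows}, key=str) for n in names}
    encoded_rows = []
    for row in rows:
        new_row = {}
        for n in names:
            values = distinct[n]
            if len(values) <= max_values:
                # One 0/1 real-valued feature per possible value.
                for v in values:
                    new_row[f"{n}={v}"] = 1.0 if row[n] == v else 0.0
            else:
                new_row[n] = row[n]
        encoded_rows.append(new_row)
    return encoded_rows

# Toy data: "color" has 3 distinct values, "age" has 12 (over the cutoff).
colors = ["red", "green", "blue", "red"] * 3
rows = [{"color": c, "age": float(i)} for i, c in enumerate(colors)]
encoded = one_hot_low_cardinality(rows)
```

Here "color" becomes three 0/1 features and the original column disappears, while "age" has too many distinct values and is left alone.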
My current internal testing result looks like this:
rfx3 - 460 trees - new cat setup - 0.96305
I mentioned before that I should probably fix my AUC calculation, as AUC is really just the area under the ROC curve. I did take a glance at it and didn't see anything obviously wrong, but before this is over I'll almost definitely have to get in there and fix it. I've continued using Gini 'cause it seems pretty close.
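For reference, here is a small sketch of one standard way to compute AUC, using the rank-sum (Mann-Whitney) formula with average ranks for ties. It is not my own scoring code. The last function shows the usual relation between AUC and the normalized Gini used in binary contests, Gini = 2 * AUC - 1. Whether my internal Gini matches that definition is an assumption, but if it does, the two always rank models in the same order.

```python
def auc(labels, scores):
    """Area under the ROC curve from the rank-sum formula (ties share a rank)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Find the run of tied scores starting at position i.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # 1-based average rank for the run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

def gini_from_auc(auc_value):
    """Normalized Gini for binary labels, assuming the standard definition."""
    return 2.0 * auc_value - 1.0

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
a = auc(labels, scores)
g = gini_from_auc(a)
```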
My next steps are to figure out if there is a way I can reduce noise and/or if there is a way I can increase the weight of correct values. In a different contest I wrote a mechanism that attempts to do a version of Platt scaling https://en.wikipedia.org/wiki/Platt_scaling (which, really, is all my sigmoid splitters are) on the result to better map predictions onto exact whole number answers. I didn't do a standard implementation, which is very me. It worked really well. This contest does not use whole values though, so I'll have to go take a look at the code to see if I can make it work for that. To be clear, I intend on getting the appropriate weight for any given result from the Platt scaling.
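For anyone who hasn't seen it, this is roughly what textbook Platt scaling looks like: fit a sigmoid p = 1 / (1 + exp(a*s + b)) that maps a raw model score s to a probability. The sketch below is the standard form from the Wikipedia page, not my non-standard version. It fits a and b with plain gradient descent on log loss and skips the smoothed targets Platt's original method uses. The function names, learning rate, step count, and toy data are all assumptions for illustration.

```python
import math

def fit_platt(scores, labels, lr=0.5, steps=5000):
    """Fit a, b in p = 1 / (1 + exp(a*s + b)) by gradient descent on log loss."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = 0.0
        grad_b = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(a * s + b))
            # With this sign convention, d(log loss)/d(a*s + b) = y - p.
            grad_a += (y - p) * s
            grad_b += (y - p)
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def platt_probability(score, a, b):
    """Calibrated probability for one raw score."""
    return 1.0 / (1.0 + math.exp(a * score + b))

# Toy raw scores with some overlap between the classes (hypothetical data).
raw_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
raw_labels = [0, 0, 1, 0, 1, 0, 1, 1]

a, b = fit_platt(raw_scores, raw_labels)
low = platt_probability(0.1, a, b)
high = platt_probability(0.9, a, b)
```

Since higher raw scores go with more positives here, the fitted a comes out negative and the calibrated probability rises with the score.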
This doesn't directly handle another problem I'd like to fix: noise in the data. Any given sample is fuzzy in nature. The values may be exact and correct, but the underlying truth of what they mean is bell shaped. Ideally we would get at that truth and have empirical flags (or real values) come out of each feature that always gave the right answer when used correctly. I don't have a magic way to do that... yet. :) In the meantime the best thing we can do is find the features that do more harm than good. Unfortunately the only way I have to do that at this point is to brute force it: remove each feature in turn and see if there are gains or not. Terrible, I know. It's been running for about 24 hours and has found only 1 feature it thinks is worth turning off. And even that improvement was right on the line of what could be considered variance in my results. This is why you want more cross validation, so you can say with much more certainty whether you should make a change or not. I might stop it and pick up the search later (I can always resume where I left off). It's a good way to fill time productively when you are working or just not interested in programming.
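The brute force search boils down to a loop like the one below. This is a sketch under assumptions, not my actual code: score_fn stands in for a full train-and-cross-validate run (the slow part), toy_score and its weights are invented so the example runs instantly, and min_gain is a hypothetical threshold meant to sit above the run-to-run variance.

```python
def find_harmful_features(features, score_fn, min_gain=0.0):
    """Turn off each feature in turn; keep it off only if the score improves.

    score_fn takes a list of active feature names and returns a score where
    higher is better. A removal sticks only if it beats the current best by
    more than min_gain.
    """
    active = list(features)
    best = score_fn(active)
    removed = []
    for feature in list(features):
        trial = [f for f in active if f != feature]
        score = score_fn(trial)
        if score - best > min_gain:
            active, best = trial, score
            removed.append(feature)
    return active, removed, best

def toy_score(active):
    """Hypothetical stand-in for a real cross-validated score."""
    weights = {"age": 0.03, "region": 0.02, "noise": -0.01}
    return 0.9 + sum(weights[f] for f in active)

active, removed, best = find_harmful_features(
    ["age", "region", "noise"], toy_score, min_gain=0.001
)
```

With a real model every call to score_fn is a full training run, which is why one pass over a few hundred features can take a day or more.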
The only other way I can think of offhand to improve my overall score is to increase the size of the data set I train on. Currently I'm doing my training using a bagged training set. This means I hold out about 1/3 (1.0/e) of the data and use it to score the tree I build on the other 2/3rds (1.0 - 1.0/e). This then becomes my measure of accuracy for that tree. If I can find a good way to increase the share I train on, I should in principle have a better, more accurate model. Ideally the ensemble nature of the random forest takes care of that. Combined with the Platt scaling I'm looking at adding, there may be no real gains. I would think, though, that if you can minimize the statistics and maximize the use of the data you would get to a precise answer faster. We will see what I come up with.
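The 1.0/e figure falls out of sampling with replacement: if you draw n rows from n, the chance any one row is never picked is (1 - 1/n)^n, which approaches 1/e (about 0.368) as n grows. A quick simulation shows it, assuming the bagging is ordinary bootstrap sampling. The function name bootstrap_split and the sample size are made up for this example.

```python
import math
import random

def bootstrap_split(n, rng):
    """Draw n row indices with replacement; the rest are out-of-bag."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n) if i not in chosen]
    return in_bag, out_of_bag

n_rows = 100_000
rng = random.Random(0)  # fixed seed so the run is repeatable
in_bag, out_of_bag = bootstrap_split(n_rows, rng)

oob_fraction = len(out_of_bag) / n_rows
expected = 1.0 / math.e
```

The out-of-bag rows are the ones used to score the tree, and the in-bag rows (with their duplicates) are what the tree is built on.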