So, already back on track with data mining? Yeah, I got a rudimentary implementation done last night. And I've got to say, the new idea isn't too bad. I've already had to change it slightly from what I was originally thinking, but it's leading down a good path. I've got the next logical progression for improvements already scoped out. But here I'm getting way ahead of myself. Let me explain what I did, and what kind of results I got.
What have you got? Okay, the first version worked like this. I load a bootstrapped sample into the tree; there is no preprocessing or tree building initially. The tree takes the rows it has and a test row we are scoring. It looks at every feature, splitting the rows in half at that feature's average, then checks both halves against the test data. For the side the test data falls on, it compares that side's average to the test data's value: the further the test data is from the average, in terms of standard deviations, the better. We choose the feature that does this best and use the rows from that side to build the next level of the tree. Repeat the process until you have one row, or a set of rows that all share the same value.
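To make that concrete, here's a minimal sketch in Python of how I read the first version: split each feature at its average, keep the half the test row lands in, and prefer the feature where the test row is most extreme relative to that half. The function names and data layout (targets in the last column) are my own assumptions, not the author's code.

```python
import numpy as np

def score_row(rows, test_row):
    """First version: recursively keep the side of each split the test row
    falls on, choosing the feature where the test row is MOST extreme
    (furthest from the side's average, in standard deviations)."""
    X, y = rows[:, :-1], rows[:, -1]
    # stop at one row, or when all remaining rows share the same value
    if len(rows) == 1 or np.all(y == y[0]):
        return y.mean()
    best = None
    for f in range(X.shape[1]):
        split = X[:, f].mean()  # halve the rows at the feature's average
        if test_row[f] <= split:
            side = rows[X[:, f] <= split]
        else:
            side = rows[X[:, f] > split]
        if len(side) in (0, len(rows)):  # degenerate split, skip
            continue
        std = side[:, f].std()
        if std == 0:
            continue
        # farther from the side's average (in std units) scores higher
        dist = abs(test_row[f] - side[:, f].mean()) / std
        if best is None or dist > best[0]:
            best = (dist, side)
    if best is None:
        return y.mean()
    return score_row(best[1], test_row)
```

Because the chosen side is always a strict subset of the current rows, the recursion has to terminate at the one-row or same-value base case.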
What are you trying to accomplish? In short, I was trying to pick sets of data that best represented the most notable features of the test data. When I say most notable, I mean features that are iconic in their outlierness (ah, a made-up word). It turns out this method gives around 60% average accuracy on my mythical test data. That is on par with what random forest gives me, but random forest produces a result with a much lower log-loss error. Which is to say, when my code runs and guesses wrong, it is really wrong. That makes sense, as the extremes I'm using may not be the signal; they may be the noise.
You improved it though? I did. Instead, I decided to make it follow the average of the group. I did the same split of each feature and took the average of both halves. But then, instead of finding the most extreme case, I went for the feature that nails the group average the best, again in terms of the standard deviation for that side. This produced a 66% average instead. It still has the problem of a really terrible log-loss. I realize there are ways of mathematically adjusting predictions to minimize that error, but for now I'm just going to take the raw result, as accuracy is what I care about.
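If my reading is right, the second version is a one-line change to the selection criterion: minimize the standardized distance to the side's average instead of maximizing it. A hedged sketch, again with my own names and the target in the last column:

```python
import numpy as np

def score_row_avg(rows, test_row):
    """Second version: same splits as before, but prefer the feature whose
    side average the test row matches BEST (smallest standardized distance)."""
    X, y = rows[:, :-1], rows[:, -1]
    if len(rows) == 1 or np.all(y == y[0]):
        return y.mean()
    best = None
    for f in range(X.shape[1]):
        split = X[:, f].mean()
        if test_row[f] <= split:
            side = rows[X[:, f] <= split]
        else:
            side = rows[X[:, f] > split]
        if len(side) in (0, len(rows)) or side[:, f].std() == 0:
            continue
        dist = abs(test_row[f] - side[:, f].mean()) / side[:, f].std()
        if best is None or dist < best[0]:  # minimize: nail the group average
            best = (dist, side)
    return y.mean() if best is None else score_row_avg(best[1], test_row)
```

The comparison flips from `>` to `<`: we now want the test row to sit as close as possible to the center of the side it landed in, rather than out at the edge.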
Cool, so an improvement in accuracy! Yeah, but I think it's going to get better. If you think about what is going on here, I'm building two sets of results for each feature and trying to find the one my test data fits best, by being spot on the average: a representative example. I just happen to be doing this by halving the data when I do it. This makes sense from a searching/information-building perspective. We are making one decision, and with one decision we should be adding one bit of information to our answer. Thus we should be eliminating one bit's worth of solutions, i.e., half the number of possible solutions. If we approach it this way, there is more we can do here.
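The one-bit-per-decision framing has a nice consequence worth spelling out: if every split halves the candidate rows, narrowing n rows down to one takes about log2(n) decisions. A quick check of the arithmetic:

```python
import math

# If each decision halves the remaining rows (adds one bit of information),
# n rows need roughly log2(n) decisions to get down to a single row.
n = 100
decisions = math.ceil(math.log2(n))  # 100 -> 50 -> 25 -> ... -> 1
```

So a 100-row bootstrap sample resolves in about 7 halvings, which is why keeping the splits close to 50/50 matters for the depth of the tree.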
What is the improvement? That would be telling! Well, another way to think about the feature-matching process is that it is like nearest neighbor. Normally in nearest neighbor you find either the single nearest neighbor, or a group of K neighbors, and take the average or most popular value as your solution. Instead of building our list of neighbors, we are matching known neighborhoods against our test data. I think the improvement will be to build the neighborhood around our test data.
This is done by taking the rows closest to our test data on a particular feature. We want half of them, for the reasons I already pointed out. So if we had 100 training rows, we want the 50 closest to our test data; that is, the 50 closest on one particular feature. Then repeat the process for the other features. Each group we build is just that: a group with an average. Ideally our test data sits in the middle, but it won't, due to the data's distribution, the test sample being near an extreme, or just noise in the data. We pick the feature that places our test data closest to the middle, in terms of standard deviations of that feature's group. We then use those rows for the next iteration, and repeat the process until we are down to one row or all the rows have the same answer. I believe this should further improve the accuracy of the tree-building process. I'm hopeful we can see some sizable gains (5% again?) over the forced-neighborhood method.
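The proposed version can be sketched the same way. This is my guess at the shape of it, not the author's implementation: per feature, keep the half of the rows nearest the test value, then pick the feature whose group best centers the test row.

```python
import numpy as np

def score_row_nbhd(rows, test_row):
    """Proposed version: build each candidate neighborhood AROUND the test
    row (the half of the rows nearest it on one feature), then keep the
    feature whose group places the test row closest to its center."""
    X, y = rows[:, :-1], rows[:, -1]
    if len(rows) == 1 or np.all(y == y[0]):
        return y.mean()
    half = len(rows) // 2
    best = None
    for f in range(X.shape[1]):
        dists = np.abs(X[:, f] - test_row[f])
        group = rows[np.argsort(dists)[:half]]  # the `half` nearest rows on f
        std = group[:, f].std()
        if std == 0:
            continue
        # how close the test row sits to the group's center, in std units
        centered = abs(test_row[f] - group[:, f].mean()) / std
        if best is None or centered < best[0]:
            best = (centered, group)
    return y.mean() if best is None else score_row_nbhd(best[1], test_row)
```

This sketch uses `argsort` for clarity; the linear two-pass trick described next would replace that sort in practice.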
Don't you need to sort to get the top 50 that are closest? Isn't that expensive? Yes and no. You could definitely sort the distances and take the top half, but really all you need is that middle dividing line. I'll do a two-pass method: first figure every point's distance while keeping a running total of the distances, then divide to get the average distance and take any point that falls under it. That gives me the top and bottom halves, and my run-time stays linear instead of going to n log(n) for that step.
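The two-pass idea is simple enough to show directly. Note one caveat the author doesn't mention: the average distance only approximates the median, so when the distances are skewed the "near" group won't be exactly half the points.

```python
import numpy as np

def split_by_avg_distance(values, target):
    """Two-pass linear split: pass 1 computes every point's distance to
    `target`; pass 2 keeps the points whose distance is under the average.
    Approximates a closest-half split in O(n) instead of O(n log n)."""
    dists = np.abs(values - target)  # pass 1: all distances (running total)
    cutoff = dists.mean()            # the dividing line
    near = values[dists <= cutoff]   # pass 2: the "closest" group
    far = values[dists > cutoff]
    return near, far
```

For a roughly symmetric distance distribution the mean sits near the median and the split is close to 50/50; a true linear median (e.g. quickselect) would make the halves exact if that ever matters.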
Any other ideas for things to improve it? Nothing offhand, but if it does get more accurate and the log-loss doesn't improve, I might take a look at things I can do about that, or maybe just keep pushing on the accuracy. If you get the accuracy high enough, eventually the log-loss will come down, as they both approach zero together.