So... uh, Diablo 3? grin Yeah, I worked a lot on various code-related things Friday, and to a lesser extent Saturday and Sunday, but really, I played more Diablo 3 than anything this weekend. It's a pretty good game. I was a little worried I wouldn't be able to get into it and would have thrown away the money I paid for the game. But nah, that didn't happen. It's quite a bit of fun!
Did you do anything productive in data mining this weekend? Do I ever? Sort of, but first I should take a moment to say a few things about my binary tree classifier. Instead of using a Gini split or an entropy split, I'm using a split I've devised that models what the actual data is doing. Essentially, I build a normal distribution at each split in the tree for both the 1s and the 0s. I find the center spot between the two averages, and that is my splitting point. I'm glossing over the details, but that's the gist of it. Here's a picture to give you a visual of the idea OMG!
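In code, the idea boils down to something like this (a bare-bones sketch with made-up names and data, not the actual classifier; the standard deviations also come into play when choosing which attribute to split on, which I'm leaving out here):

```python
import statistics

def midpoint_split(values, labels):
    """One attribute's values plus 0/1 labels in; split point out."""
    ones = [v for v, y in zip(values, labels) if y == 1]
    zeros = [v for v, y in zip(values, labels) if y == 0]
    # Mean and standard deviation per class; the stds would feed
    # into picking WHICH attribute to split on (not shown).
    m1, m0 = statistics.mean(ones), statistics.mean(zeros)
    s1, s0 = statistics.pstdev(ones), statistics.pstdev(zeros)
    # The split point is the center spot between the two averages.
    return (m1 + m0) / 2.0

print(midpoint_split([0.5, 1.0, 1.5, 2.5, 3.0, 3.5], [0, 0, 0, 1, 1, 1]))  # 2.0
```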
Sounds kind of slow... It isn't. You make one pass over the data to find the averages and standard deviations, you use those for choosing a splitter, and that's it. More important than the speed, though, is that it produces better results than either of the other two methods at least in this test scenario. Where is this going? Well, other than nominal improvements on this method, I haven't really had any gains since I implemented random forest. My current ranking over the baseline random forest is only due to my slightly better splits than the ones the baseline used. So, I need something new, something that will improve the technique in a major way.
What about boosting and throwing out outliers? I tried boosting, but I'm not sure I ever did it right. My results were slightly poorer. And while they say you can "boost forever," the way my algorithms homed in on values, it made no sense to keep boosting after about 2 iterations, probably due to me using normal curves and not solid 1 and 0 values. I think maybe I need to better understand how you implement a binary classifier with boosting. Or maybe I'm doing it just fine and it's simply not any better than random forests. I should say, despite quite a bit of searching, I never did find a good explanation of how AdaBoost is implemented over a binary classifier. I mean, I found how to do the weighting and the math behind it, which works just fine. I also found the idea behind it spelled out, but the actual use of the weighted values inside the algorithm (and how the tree rebalances itself, or if it even does) I never found written out. So, I tried a bunch of different things, but it's all guesswork. And like I said, nothing ever improved over random forests. As for outliers, I'm putting that idea on the back burner for now. I think that's more of a tweaking technique and won't give me the gains I need to get to the front of the pack.
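For reference, the weighting math I did find is the standard AdaBoost update. Written out as code, one round looks like this (the textbook version, not my classifier):

```python
import math

def adaboost_round(weights, predictions, labels):
    """One round of the standard AdaBoost weight update.
    predictions and labels are 0/1; weights sum to 1.
    Assumes the weak learner's error is strictly between 0 and 1."""
    # Weighted error rate of this round's weak learner.
    err = sum(w for w, p, y in zip(weights, predictions, labels) if p != y)
    # How much "say" this learner gets in the final vote.
    alpha = 0.5 * math.log((1 - err) / err)
    # Up-weight the mistakes, down-weight the hits, then renormalize.
    new_w = [w * math.exp(alpha if p != y else -alpha)
             for w, p, y in zip(weights, predictions, labels)]
    total = sum(new_w)
    return [w / total for w in new_w], alpha

w, a = adaboost_round([0.25] * 4, [1, 1, 0, 0], [1, 0, 0, 0])
print(w)  # the one mistake (index 1) now carries half the total weight
```

The part the write-ups skip, and the part I'm still guessing at, is how those sample weights are consumed *inside* the tree-building step.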
What I am going to do is this... I'm going to try making a different kind of classifier, one loosely built off the ideas that seem strong in random forest but hopefully more accurate. Binary tree classifiers more or less halve the size of the data at each node based on some decision at that node. And unless that node's attribute is selected again later, that information is never used again. My classifier will make just 1 decision based on all the randomly selected attributes. Will it work? eh maybe Will it be awesome? unlikely Or will it be just a colossal waste of time?! i vote that Stay tuned!
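For what it's worth, here's one possible shape that single decision could take, just a sketch of the general idea (nearest class centroid over a random attribute subset), not a commitment to the final design:

```python
import random

def one_shot_classify(sample, X, y, n_attrs, rng):
    """Hypothetical sketch of 'one decision from all the randomly
    selected attributes': pick a random attribute subset, build each
    class's centroid on it, and label the sample by the nearer one.
    My own illustration, not the actual classifier."""
    attrs = rng.sample(range(len(X[0])), n_attrs)
    def centroid(cls):
        rows = [r for r, label in zip(X, y) if label == cls]
        return [sum(r[a] for r in rows) / len(rows) for a in attrs]
    c0, c1 = centroid(0), centroid(1)
    # One single decision that uses every selected attribute at once.
    d0 = sum((sample[a] - m) ** 2 for a, m in zip(attrs, c0))
    d1 = sum((sample[a] - m) ** 2 for a, m in zip(attrs, c1))
    return 1 if d1 < d0 else 0

X = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
y = [0, 0, 1, 1]
print(one_shot_classify([9.0, 9.0], X, y, 2, random.Random(0)))  # sides with the 1s
```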
I see the site is running again! Ah yes, it is indeed. I'm still not 100% sure what happened, but apparently the BIOS decided to reset itself, which brought everything down. Then when I fixed it, for some odd reason the network IP info was wiped. I didn't set the gateway right tsk tsk and well... 1 day later, here we are. :)
So, now... on bagging. I want to clear a few things up that have become much clearer since my last blog post or at least I think they have It's good in theory!. And I want to set the record as straight as I can, especially considering that I don't think I explained it right. When you combine two classifiers, you don't always get a better result than both by themselves, which is what I was thinking you would, hence my "forest of forests" thing. You will get a better result *if* there is some variance in the classifiers' predictions and they move around a point that is an improvement. Essentially, you are going to get the most probable thing back once every classifier has had its weighted say, well, if you combine enough of them. Said another way, each classifier is throwing out predictions that have some margin of variance around a center point that they all circle. By taking bunches of them, you get a dense spot at the center point.
so? So if you take a forest of forests... bagging or no, you probably aren't ever going to see an improvement, since they are all returning more or less the same thing! of course! you thought about that way too much. You could have been sleeping instead. You in essence NEED variance around a point in order for multiple predictions to produce said center point.
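A quick toy demo of that point: average a bunch of noisy-but-unbiased predictions and the dense spot at the center appears on its own. Take away the variance and averaging buys you nothing, you just get the same answer back.

```python
import random
import statistics

rng = random.Random(0)
center = 0.7  # the point every classifier's predictions circle

def one_prediction():
    # A single classifier: unbiased, but noisy around the center.
    return center + rng.gauss(0, 0.2)

# Average error of lone classifiers vs. 50-classifier averages.
single = [abs(one_prediction() - center) for _ in range(2000)]
combined = [abs(statistics.mean(one_prediction() for _ in range(50)) - center)
            for _ in range(2000)]
err_single = statistics.mean(single)
err_combined = statistics.mean(combined)
print(err_single, err_combined)  # the combined error is far smaller
```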
So the model can never be better than that best point the classifiers are moving around!?! Yes, I think hopes! That also explains why you approach a certain accuracy and never pass it. What I need to do is improve the accuracy of determining the center point (for all cases). One way of doing that is by removing the outliers... Technically, that's not an improvement so much as it is removing items that do more harm than good. They are completely valid cases and should be handled as such. But since they don't "fit" the model, you remove them for an improvement. This leads me to thinking that the outlier cases really need to be modeled in their own group of classifiers.
Okay, I took a stab at boosting once before, a few days ago. I don't think I implemented it correctly. I'm about to take another go at it, as the forest of forests thing didn't work. I basically got the same result I got with one big forest, except it was slower well, and a little less accurate. It may be that I didn't implement it right probably, but I think the real problem was that I wasn't really thinking about what bagging is actually "doing" you listen but you don't hear!. Bagging is adjusting for bias, as I understand it, and isn't so much a measure of training accuracy, which is how I was thinking of it. I was using the concept in the forest of forests to improve each level's accuracy, but since it's just bias, it didn't help any.
Shouldn't it have helped some? Well, it might have, but I was actually computing bias wrong too ugh hah. I wasn't doing my bagging with replacement. I was actually just selecting 36.8% of the training data samples to build my trees on. It worked, but probably not as well as it should have. I actually just reran my "best" method (with replacement) and got a similar result. So, ya, it was good enough for government work.
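For the curious, that 36.8%/63.2% business falls straight out of sampling with replacement: draw n rows from n rows and on average only about 63.2% of the distinct rows make it in (the limit is 1 − 1/e), leaving the other ~36.8% "out of bag" and free to score with:

```python
import random

rng = random.Random(42)
n = 100_000
# Bagging proper: draw n rows WITH replacement, duplicates allowed.
bootstrap = [rng.randrange(n) for _ in range(n)]
# Fraction of distinct rows that appeared at least once.
unique_frac = len(set(bootstrap)) / n
print(unique_frac)  # hovers right around 0.632
```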
This leads me back to boosting. In general, the idea of boosting, if I didn't say before, is to increase or decrease the weight of "something" based on whether that "something" produced a good result. Typically, the "something" is a classifier a tree. And you are doing this with the tree as a whole. so, run after bias is applied? I think so, since bias has no direct correlation with performance. But really, I'm not positive they can work together like that. It may be that the best application is to make a bunch of small forests and boost those. back to forest of forests! That's been explored by others. I'll need to experiment a little to see what makes the best sense. I could try to boost individual nodes or pieces of data going into the classifier, but that, I think, would be hard to do. I'll leave that for later.
Seriously? Okay, I admit that heading seemed a lot cooler in my head than it actually turned out, but there it is and I ain't changing it. Ya, I've continued to be focused on data mining, though I've got an itch to look at data compression think fractals. Why oh why do I have to go to a day job! poor baby. Annnyway, I've beaten random forests to death, as you can imagine, but nothing I've done seems to improve over the initial technique well, not significantly anyway. I've started looking at boosting and am doing a sort of crossbreed of the two ideas. It's running in the background as I write this. I don't really expect it to be all that, even if it works out pretty well.
So, once that's done, what next? Well, I'm thinking maybe I could make a forest of forests I'm sure that's been done by someone somewhere. It probably produces diminishing returns, but I'll give it a whirl all the same. I think the trick is to get as accurate a classifier as possible (aka a tree) so you can use as few trees as possible to make a decent forest (using a very small subset of the data). Then that forest works as a single classifier in a 2nd, larger forest. It wouldn't work any better than one big forest if I reused all of the data for each forest, so each will have its own subset. Which is why it's so important to have as strong a classifier as possible from very little data at the lowest level.
Normally, you set aside 36.8% of the data to use as a measure of accuracy. I expect to do that at each level down. So the top level sends 63.2% of the data to the forest classifiers; those send about 39.9% of the total data (63.2% of what they got) down to the trees. I guess you could take it out as far as you like, till you end up with forests of 2 trees with 1 tree to score the forest by. But really, that would be fairly useless for digging good data out of. I expect the minimum size is directly related to the number of attributes a sample has, with a lower-end forest size being around 18 (just so no matter what, you have some level of confidence in the results). Needless to say, I might do that. But even with the separation of the data like I'm proposing, it may actually not turn out to be any better than what I'm currently doing with 1 big forest. Or if it is, the processing time might be too horrendous to be useful. It's hard to see clearly in my head whether all I'm doing is approaching a limit faster or whether there actually is more information to be milked out of the data. I guess we will see! I'm sorry, were you talking?
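The per-level arithmetic, spelled out:

```python
# Each level keeps 36.8% of what it receives for scoring
# and passes the remaining 63.2% on down.
total = 1.0
to_forests = total * 0.632     # top level -> forest classifiers
to_trees = to_forests * 0.632  # forests -> trees: 63.2% of 63.2%
print(to_trees)  # about 0.399, i.e. roughly 39.9% of all the data
```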
Hello there! you seem in a good mood. Any reason? Nah, I mean other than it being Friday. Any new breakthroughs with data mining? Not really. I've been going back and forth between looking for optimizations to my random forest implementation and trying out some new ideas of mine. So far nothing has really panned out.
For the last few days, I have been watching all of the Foamy (the squirrel) shorts There are a LOT of them 300+. Some of them are really good, but most are kinda light on the laughs she lies they are awesome. Maybe I'm just saturated. Either way, it's something to do while running tests, and I enjoy them on the whole.
For some odd reason, today I find myself dwelling on one of my favorite "tricks" in compression. So for no particular reason other than that, I'll share: the Burrows–Wheeler transform. It occurs to me just now (related to compression, not data mining) that maybe I could redefine what "sorting" is slow down there cowboy and far improve the transformation. That is, if you take the last column after the transformation, you can reverse out the transformation. This would be true for ANY mechanism that reordered, as long as the mechanism was defined rigidly. I bet someone already did that. probably. most simple data compression is so thoroughly explored
Maybe there is some sort of application in my random forest implementation... hmmm...
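Since I brought it up, here's the naive version of the trick: sort all rotations, keep only the last column, then rebuild the whole thing from that column alone. (Using "\x00" as an end-of-string sentinel is just one common convention; real implementations usually store an index instead, and use much faster algorithms.)

```python
def bwt(s):
    """Burrows-Wheeler transform via sorted rotations."""
    s += "\x00"  # sentinel marks the original row
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)  # last column only

def ibwt(last):
    """Invert the transform from the last column alone, by
    repeatedly prepending it and re-sorting to rebuild the table."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    original = next(row for row in table if row.endswith("\x00"))
    return original[:-1]  # drop the sentinel

print(ibwt(bwt("banana")))  # round-trips back to 'banana'
```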
Yeah... So, I've been struggling to understand why my competitive score was so low. That is, the score based on the results I submitted to kaggle for the most recent competition I'm in. You see, when I run a test locally, my algorithms get scored and show a result that looks good. However, when I submit my results, they would always score fairly badly. Like, really badly. And last night I finally said "ok, I'm gonna figure out what the deal is". It turns out that while my training data, which I use for local testing, was fine, my test data was all kinds of messed up. The columns were off by 1, and 1 column was missing entirely. Needless to say, this was the equivalent of sending in random data and trying to score it. hah!
Until today, had you ever sent in good results? Ah... nope. Does that mean you might have skipped over a better algorithm? Probably not. Since my training data was OK, and I was testing locally with said training data, I got a good idea of how each of my algorithms should perform. The current mechanism I'm using is, by that standard, outperforming everything I've done before. So even though I don't know, really, how well my old algorithms were doing, I know they were worse than what I've currently got. So I guess it's all OK. :)
How are you doing?
You can see for yourself http://www.kaggle.com/c/bioresponse/leaderboard
I'm the user j_scheibel. My current score is 0.46041, which is JUST above the random forest benchmark. Position 197 bleh
:( Still a lot of people ahead of me, but at least now I'm on the positive side of the top benchmark! Really, the difference between the top performer and me isn't as much as you'd think. But like so many other things... it becomes increasingly hard to squeeze out a little more performance when you hit a certain wall. I still have a long time before this contest is over, and the thought of suddenly being much higher up in the ranking is very motivating.