5/15/2012
Okay, I took a stab at boosting once before a few days ago. I don't think I implemented it correctly. I'm about to take another go at it as the forest of forests thing didn't work. I basically got the same result I got with one big forest except it was slower well and a little less accurate. It may be that I didn't implement it right probably, but I think the real problem was, I wasn't really thinking about what bagging is really "doing" you listen but you don't hear!. Bagging is adjusting for bias as I understand it and isn't so much a measure of training accuracy which is how I was thinking of it. I was using the concept in the forest of forests to improve each level's accuracy with that, but since it's just bias, it didn't help any.
Shouldn't it have helped some? well, it might have but I was actually computing bias wrong tooughhah. I wasn't doing my bagging with replacement. I was actually just selecting 36.8% of the train data samples to build my trees on. It worked but probably not as well as it should have. I'm actually just reran my my "best" method (with replacement) and got a similar result. So, ya, it was good enough for government work.
This leads me back to boosting. In general the idea of boosting if I didn't say before is to increase or decrease the weight of "something" based on if that "something" produced a good result. Typically, the "something" is a classifier a tree. And you are doing this with the tree as a whole. so, run after bias is applied? I think so, Since Bias has no direct correlation with performance. But really I'm not positive they can work together like that. It may be the best application is make a bunch of small forests and boost those. back to forest of forests! That's been explored by others. I'll need to experiment a little to see what makes the best sense. I could try and boost individual nodes or pieces of data going in to the classifier but that I think would be hard to do. I'll leave that for later.
5/13/2012
Seriously? Okay. I admit that heading seemed a lot cooler in my head than it actually turned out, but there it is and I ain't changing it. Ya, I've continued to be focused on data mining though I've got an itch to look at data compression think fractals. Why oh why do I have to go to a day job! poor baby. Annnyway I've beaten random forests to death as you can imagine. but nothing I've done seems to improve over the initial technique well not significantly anyway. I've started looking at boosting and am doing a sort of cross breed of the two ideas. It's running in the background as I write this. I don't really expect it to be all that even if it works out pretty well.
So, once that's done what next? Well, I'm thinking maybe I could make a forest of forests I'm sure that's been done by someone somewhere. It probably produces diminishing returns, but I'll give it a whirl all the same. I think the trick is to get as accurate a classifier as possible (aka a tree) so you can use a few trees as possible to make a decent forest (using a very small subset of the data). Then that forest works as a single classifier in a 2nd larger forest. It wouldn't work any better than one big forest if I reused all of the data for each forest. so each will have it's own subset. Which is why it's so important to have a as strong classier as possible from very little data at the lowest level.
Normally, you set aside 36.8% of the data to use as a measure of accuracy. I expect to do that each level down. so the top level sends 63.2% of the data to the forest classifiers. those send 39.92% of the total data (63.2% of what they got) down to the trees. I guess you could take out as far as you like till you end up with forests of 2 trees with 1 tree to score the forest by. But really, that would be fairly useless for digging out good data from. I expect the minimum size is directly related to the number of attributes a sample has. with a lower end forest size being around 18 (just so no matter what you have some level of confidence in the results.). Needless to say, I might do that. But even with the separation of the data like I'm proposing it may actually not turn out to be any better than what I'm currently doing with 1 big forest. Or if it is, the processing time might be to horrendous to be useful. Its hard to see clearly in my head if all I'm doing is approaching a limit faster or if there actually is more information to be milked out of the data. I guess we will see! I'm sorry were you talking?
5/11/2012
Hello there! you seem in a good mood. Any reason? Nah, I mean other than it being Friday. Any new breakthroughs with data mining? Not really. I've been going back and forth between looking for optimizations to my random forest implementation and trying out some new ideas of mine. So far nothing has really panned out.
For the last few days, I have been watching all of the
foamy (the squirrel) shorts
There are a LOT of them 300+. Some of them are really good but most are kinda light on the laughs
he lies they are awesome. Maybe I'm just saturated. Either way it's something to do while running tests and I enjoy them on the whole.
For some odd reason, today I find myself dwelling on one of my favorite "tricks" in compression. So for no particular reason other than that, I'll share
Burrows–Wheeler transform. It occurs to me just now, (related to compression, not data mining) that maybe I could redefine what "sorting" is
slow down there cowboy and far improve the transformation. That is if you take the last column after the transformation you can reverse out the transformation. This would be true for ANY mechanism that reordered as long as the mechanism was defined rigidly.
I bet someone already did that. probably. most simple data compression is so thoroughly explored Maybe there is some sort of application in my random forest implementation... hmmm...
5/8/2012
Yeah.... So, I've been struggling with understanding why my competitive score was so low. That is, the score based on results I submitted to kaggle for the most recent competition I'm in. You see, when I run a test locally my algorithms would be scored and show a result that looked good. However, when I submit my results it would always score fairly badly. Like really badly. And last night I finally said "ok, I'm gonna figure out what the deal is". It turns out while my training data was fine which I use for local testing, my test data was all kinds of messed up. The columns were off by 1 and 1 column was missing entirely. Needless to say, this was the equivalent of sending in random data and trying to score it.hah!
Until today, had you ever sent in good results? Ah.. nope. Does that mean you might have skipped over a better algorithm? Probably not. Since my training data was OK, and I was testing locally with said training data, I got a good idea of how each of my algorithms should perform. The current mechanism I'm using is, by that standard, out performing everything I've done before. So even though I don't know, really, how well my old algorithm were doing, I know they were worse than what I've currently got. So I guess it's all OK.:)
How are you doing? You can see for yourself
http://www.kaggle.com/c/bioresponse/leaderboard I'm the user j_scheibel. My current score is 0.46041 which is JUST above the random forest benchmark. Position 197
bleh :( still a lot of people ahead of me. but at least now I'm on the positive side of the top benchmark! Really the difference between the top performer and my isn't as much as you think. But like so many other things... it becomes increasingly hard to squeeze out a little more performance when you hit a certain wall. I still have a long time before this contest is over and suddenly being much higher up on the ranking is very motivating.
5/4/2012
It's Friday! Doesn't that make you happy?!!? YES! shortest blog ever? Nah... here's what happened last night. I double checked my code, found a few things that could be done far more efficiently. I also double checked that I was using the best values to split on. I also ran a test or 10 to make sure everything was working well. It seems to be hooray! My big test ran over night but it didn't quite finish. So I still don't have a new set of numbers to submit to kaggle. bleh
What will you do next? Well, I actually think there must be lots of improvements to be made still, just in making my program the "stock" version of Random forest. My results from little tests still don't seem good enough to get a really good placement overall say, the top 170 aka the baseline random forest result the show. If I can figure out what I'm doing wrong, I'll start in on my first big modification I want to speed up mechanism (and hopefully accuracy) by selecting multiple attributes instead of one attribute to split on. And I want change the way it works in terms of the smoothness of the generated results. Think in terms of selecting points instead of blocks and then finding the line between two points. That's not how i'll do it per say, but it's an easy was to visualize it.
5/3/2012
First a little entertainment.
isn't this whole thing supposed to be entertainment???indeed! I've been watching a starcraft 2 video cast series called "when cheese fails"
They also have a will cheese fail series. The two casters watch a video sent in by viewers where someone tries to win a game using an attack that is typically either a cannon rush, 6 pools attacks or a proxy base. Thus far I think this is the best one I've seen. If you like starcraft 2, hopefully, you will enjoy it.
www.youtube.com/watch?v=0mPGYeuU2Oc
As for the random forests stuff. I finally got it working last night. It definitely looks like the results are going to be way better.=D I'm not 100% sure I've got everything optimized for the best traditional result yet. It's hard to tell because it runs so slow and as it currently operates and I was only able to do a test run with 800 "trees" in the forest. I'm pretty sure i need a couple thousand trees. But before I can do that I need to speed it up. I think I would like to take a crack at explaining random forest as I understand it now that I've got a version working. So here goes...
Random Forest Explained (core concept)
So let's say you are a doctor and you have a bunch of patients. For each patient you have height, weight, gender and age. 1 in 10 of your patients developed the flu last year. You know who they are. Predict who will get the flu this year. To do this you take all the data for last year and randomly select a few properties about the patient technically you randomly select the square root of the number of attributes ... but we have 4 which is 2. Let's say Age and Weight. Then for both of them you find out which has the highest entropy or some system of choosing one with a lot of variation. You process that one. Let's say it was age. Next You find the correct age to use for splitting the group in half. That is you will pick an age that will move half the entropy to one side and the other half to the other.That is, the exact mechanism by which you pick the split is also based on entropy not on an strict average number of occurrences.. You now have a two groups of people.
You then repeat this process on the two groups separately and then again on their groups and then their subgroups etc.. (subdividing them further and further), until you get to either a group with only people who were all sick, healthy or have identical information. You stop dividing at this point. It is completely valid to randomly select the same property again and again. Or possibly once then some other property a few times then come back to it. So you may end up splitting by age 3 times in a row or more... or maybe you wont split by age at all. The property you select to do the division must be random each time. Regardless of the choice, each time you divide you are halving the number of possible results to be processed in the next iteration. Eventually you will get to a scenario for stopping.
You have constructed 1 tree. This can be used to score the patents for this year. Granted, some of the stats for a patient are the same year to year, but age always changes. You might also have lost some patients and gotten some new ones. You can score each patient by starting at the top and following the logic of where they end up. "Oh their age is less than this, so they go to this question." When you reach a stopping spot you have an answer for that patent.
The thing is for a random forest you make 100s and 100s of these trees. Since there are lots and lots of different ways the data can be sliced up you are trying to build a tree for a good selection of those possibilities. Each time you follow the logic of a particular tree you get a vote for a possible answer. When you have take 1 patient down all the trees you will have a number a certain number of votes for will get the flu and won't get the flu. This is that patient's odds of getting the flu.
That doesn't sound so hard... At it's core it isn't. The trick lies in finding the right splitting point. Having properties that are independent of each other (don't for instance take age of the patient and decade they were born in. Those both relate to each other.) Oh and I suppose making the process fast. The last one helps, because the more trees you have the more accurate the results. You can test with 1 tree, but more is always better statistically. There are a world of improvements that can be made but that's outside the scope of that explanation, more on some of those another time.