Two blogs back to back?! Crazy!! Heh, so today something real quick. The contest that has been motivating me to a near frenzy of focus and programming ends in a week and a few days. I imagine at the end of that contest I'll be more human and talk about other things (we all know this isn't true... hehe). But in the meantime, work continues. Also, I need to say that the super simple classifier I was working on had a bug in it. The worst kind of bug... the kind that means it was using all the data to train on :( (it happens... never happens to me! liar! nope, perfect code, every time). By training on OOB data I was building on data I shouldn't have known about. As soon as I removed the extra data, the performance dropped radically, down to like 55%. Basically that idea is a bust and I need to go back to the drawing board for a better classifier.
Hmmm... on that note... it's interesting how thoughts change. I was just looking back on my comments on bias and accuracy from 3 weeks back. I tried to explain that they are different. It's worth saying that when it comes to out-of-bag (OOB) data, I think they are exactly the same, or very close. But that's not necessarily the case in general, as far as I can tell. I just wanted to throw that out there first; I'll come back to it in a moment.
So, for tonight's development? I have 2 new ideas to try. One involves eliminating attributes/variables/features (call them what you will, they're all the same thing to me) that do not contribute to the result, either because they act as noise or because they never vary. There has been a lot of work on this and it seems like a good source of improvement. That is, IF (here we go) my improved splitter and attribute selection mechanism isn't more or less already doing this (6 of 1, half dozen of the other). It isn't directly, I wrote it with other thoughts in mind, but it has a weighting system for picking the appropriate attribute to use, so it may in effect already be cutting those bad attributes out. I think attributes with a lot of noise are possibly still in there, so I'll be interested to see if those can be filtered out to improve the result.
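Here's roughly what I have in mind, as a throwaway Python sketch (the function name and the noise_threshold knob are invented for illustration, not from my actual code): drop anything that never varies, and anything whose values look the same for the 1s and the 0s.

```python
# Roughly what I mean by cutting useless attributes. Sketch only.
import numpy as np

def prune_attributes(X, y, noise_threshold=0.05):
    """Return indices of columns worth keeping.

    X: (rows, attributes) numeric array; y: (rows,) array of 0/1 labels.
    noise_threshold is a made-up knob, not a number from real code.
    """
    keep = []
    for j in range(X.shape[1]):
        col = X[:, j]
        if col.std() == 0:
            continue  # never varies -> carries no information at all
        # Crude noise test: does the attribute's mean differ between the
        # two classes, relative to its overall spread?
        separation = abs(col[y == 1].mean() - col[y == 0].mean()) / col.std()
        if separation < noise_threshold:
            continue  # looks the same for 1s and 0s -> treat as noise
        keep.append(j)
    return keep
```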
And the other idea? The other idea involves a new technique for identifying the effectiveness of a particular tree. I was talking about that yesterday with regards to the simple classifier, but the ideas I tried last night just didn't work. Hopefully, this new method will be better at it. If it does work, the idea is to use it as a filter, then measure accuracy on data that gets through the filter. If the filter works, it improves the accuracy of what is measured. And here is the bias thing I mentioned earlier. The bias is still there, but now a filter is in essence removing some of the cases that don't fit the model's bias (if not all of them! dream on), so the net effect is increased accuracy! The bias of any given tree will always remain the same; you just need to find the right cases to use it on. Incidentally, if I can remove attributes that are noisy, that should actually help both bias and accuracy.
I worked on the data mining stuff all weekend. And I learned quite a bit but didn't make all that much forward progress in terms of a better result. What about the idea you quit with? Did it produce a better result? In a word, no. Sending the data down both paths of the tree does seem to improve results, just not by much, and it increases run time a lot. Scoring data by running it down both sides of the completed tree, however, does not seem to improve results further. It seems that once the tree is built, it is best to use it as is.
No improvement at all? Well, if it does, it does so ever so marginally. As soon as you start going down multiple sides of a tree, you need to weight one side's importance against the other's. You can naively make an average, but then you aren't really making a decision and the accuracy drops. When you do weight it accurately, the result ends up being very, very close to what it would have been if you hadn't done both sides, because one side's weight is so very small. When building the tree, the same is true. The trick in building the tree, like I mentioned before, is in knowing exactly how much of the data just past the splitter to send down the wrong side, and I still don't have any "correct" numbers. It's guesswork. Take too much and it runs forever, as the tree has a billion million nodes (the data size hardly decreases from node to node); take too little and it's like you took nothing at all. That being said, I'm beginning to think the improvements I got were more due to the luck of the data's distribution than any serious form of insight into improved logic. It may well be true that there is always some margin you should take on the opposite side of the splitter, but without a way to determine exactly how much one should take based on the data, it's too random to be useful. That doesn't mean I won't do it! Just that I'll back-burner the idea as a last-minute thing I can do to eke out a few more points of accuracy. For performance and sanity's sake (no second-guessing that I got the right spot), I'm turning off that bit of logic for now.
All weekend working on that and nothing? 'fraid so! (bleh) I spent the bulk of my time looking for a 'smoothing' way to distribute the values moving down the tree. I wanted to do my calculations more accurately. It was a waste of effort in terms of improvement: it ran way slower and in the end produced the same or a worse result. It seems that weighting the rows based on previous node splits is just not what you want to do. Each node needs to make its decision independently of the previous ones to work right.
Ok, anything else? I got creative once I learned all that. I started thinking about how each tree's opinion is really only accurate for a particular type of data, and trying to merge them is a mistake. But how does one figure that out? I figured the OOB data could help with that, but that is actually what they do right now with the whole accuracy thing. Everyone who implements random forests runs the OOB data against each tree and finds out that, say, 66% of the time the tree matches the data you put in and gives an accurate result. You improve the result overall because the 34% of the time it's wrong, it gets outvoted by other trees. Of course, those trees are only accurate around 66% of the time as well, and so the overlap of trees built on independent data sets produces a net result of around 77-80%. You can increase that if you make your models more accurate, or if you don't let them vote on data they have no business voting on. But that's the hard part.
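As a sanity check on that "overlapping independent voters" math, here's a quick illustration (not my contest code) of what truly independent 66% voters would give under majority voting:

```python
# If each tree is right 66% of the time and the errors were truly
# independent, majority voting would push accuracy way up. Real trees'
# errors are correlated, which is why practice lands around 77-80%.
from math import comb

def majority_vote_accuracy(p=0.66, n_trees=11):
    """P(majority of n independent voters is right), each right w.p. p."""
    need = n_trees // 2 + 1
    return sum(comb(n_trees, k) * p**k * (1 - p)**(n_trees - k)
               for k in range(need, n_trees + 1))

for n in (3, 11, 101):
    print(n, round(majority_vote_accuracy(0.66, n), 3))
# prints: 3 0.732 / 11 0.868 / 101 1.0 (to three places)
```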
This idea of overlapping independent voters works great on just about any classifier. So I put it on a super simple classifier I wrote that just makes an average of the data with pseudo-randomly selected attributes. It returned a 75% result. It wasn't usable, though, because the log loss result was almost as bad as randomly guessing. I'm not quite sure why that was; perhaps I was selecting very nearly the same attributes over and over, as it seems it was consistent in its errors. So when it was wrong, it was wrong all the time in the exact same way, which is bad for this kind of scoring. At any rate, it's food for thought.
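To see why a 75%-accurate voter can still bomb on log loss, here's a tiny illustration (not the contest's actual scoring code): one confidently-wrong answer costs more than many right ones save.

```python
# Consistent, confident mistakes kill log loss even at decent accuracy.
import math

def log_loss(y_true, y_pred, eps=1e-15):
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

y      = [1, 1, 1, 0]
humble = [0.7, 0.7, 0.7, 0.6]       # wrong on the last row, but not confident
cocky  = [0.99, 0.99, 0.99, 0.99]   # same 75% accuracy, confidently wrong once
print(log_loss(y, humble))  # ~0.50
print(log_loss(y, cocky))   # ~1.16 -- worse than answering 0.5 everywhere (ln 2 ~ 0.69)
```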
Here we are once again! Things have been going well. I keep making gains and keep finding new improvements. :) It's not all roses and sunshine though (is that even a real idiom?). The biggest thing I'm struggling with is an absence of any ground-breaking improvements (it's 'cause you lack awesomeness); I just continue to refine the existing ideas. The most recent idea I tried that didn't work out was an attempt to improve my accuracy measure by turning the OOB data into a random forest itself, where each row was marked as successfully or unsuccessfully predicted. I was trying to make a better measure of accuracy by preprocessing each row and getting a row-by-row measure of accuracy, but it turned out to be slightly less accurate than the methods I've been using. And it was waaaaaaaay slower.
What have you done to improve your results recently? Well, two things. I've found a better way to move data down the tree. Normally you split your data as it moves down the tree, so it slowly diminishes till you hit the end point. What I do now is take a little bit extra from the wrong side (just past the split) and move that data down the wrong side, and I do the same for the opposite side. The details are still hazy; I found a sweet spot for how much overlap there is, but it was found experimentally. I would really like to get some theoretical estimates as to the best cutoffs to use, but it's probably not worth spending a whole lot of time figuring that out; I can always stick with experimentally finding the sweet spots. I think this makes a lot of sense. Essentially, you are certain to misclassify some data no matter where your split is, since data is normally distributed, and this is especially true just past the splitting spot. So the gains from taking a little data past that mark to be analyzed later outweigh any bad data you end up taking along with it, mainly because those rows will get moved around correctly in subsequent splits.
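In throwaway Python, the overlap trick looks something like this (the names and the margin value are invented; the real sweet spot was found experimentally):

```python
# Rows just past the split point go down BOTH sides, so borderline cases
# aren't committed too early. Sketch only.
import numpy as np

def overlap_split(rows, values, split_point, margin=0.1):
    """Partition row indices, duplicating those within `margin` of the split.

    rows:        array of row indices
    values:      the split attribute's value for each row
    split_point: threshold chosen by the splitter
    margin:      how far past the threshold we still take rows (a guess)
    """
    left = rows[values < split_point + margin]    # left side plus a bit extra
    right = rows[values >= split_point - margin]  # right side plus a bit extra
    return left, right  # the two sides now overlap near the split
```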
And the other improvement? Well, this is sort of just more of the same. Not only do you do this for analysis, but you do this for scoring! (????) That's right: you pass your data down both sides of the tree if it is close enough to the split that you can't be sure where to send it, and average the two results that come back. But wait, there's more! (oh joy) Eventually, when I get everything working well, I'll be weighting the data going down the tree during construction based on its distance from the center of that side's distribution, and I'll be doing the same for scoring. Essentially everything gets very, very fuzzy. And thus my smooth tree is born! Well, assuming someone else didn't make it like 20 years ago... which seems to happen with any idea I ever come up with (heh). I need to get the math figured out so I know I'm weighting things right as they move around, for tree building and for scoring, so as to maximize the gain. This will hopefully do away with experimentally found cutoffs. I also need to submit some new results to the contest. My current score doesn't have any of that logic in there; it is just the modified tree splitting/creation that uses values I found experimentally.
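A rough sketch of what scoring down both sides could look like, with everything here (node fields, the linear weighting, the margin) guessed at for illustration:

```python
# Score a row with the "smooth tree": near a split, follow both children
# and blend the answers by how far into the fuzzy band the row sits.
def smooth_score(node, row, margin=0.1):
    if node.is_leaf:
        return node.probability_of_one
    d = row[node.attribute] - node.split_point
    if d <= -margin:                   # clearly on the left side
        return smooth_score(node.left, row, margin)
    if d >= margin:                    # clearly on the right side
        return smooth_score(node.right, row, margin)
    # Inside the band: weight the right child by how far past the split
    # we are (0..1), the left child by the remainder, and blend.
    w_right = (d + margin) / (2 * margin)
    return ((1 - w_right) * smooth_score(node.left, row, margin)
            + w_right * smooth_score(node.right, row, margin))
```

A weight cutoff like the sub-1% one mentioned below would just stop recursing down a side once its accumulated weight becomes negligible.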
Any down side? Ya, it's slower. Not a lot slower, but slower. Tree construction time goes up a lot as the data at each node increases, not to mention the trees go deeper, as it requires more splits to get to a stopping spot. Scoring takes longer as well, since you don't go down a single path anymore; now you go down large sections of the tree. Not the whole tree, mind you. There are some cutoffs that can be made to speed things up, say, when one side's result represents less than 1% of the weight in the final result. Once I get things right, it'll be very interesting to see how much closer to the top competitors I get. I don't really expect to be THAT much closer, but some.
So this new classifier thing... how's that workin' out for ya? It works well! (really? Yes, really! will wonders never cease) I moved up the ranks another 50-60 spots with a score of .45002 (surprise, surprise!). Basically, when the classifier is built on multiple properties, it produces a better way to split up the groups at that level of the b-tree.
Details, man, what exactly did you end up doing? If you recall, a couple days back I said I'd looked into using a single large model instead of a b-tree, then making 1 decision based on that, and that it was over-fitting because of it. I then looked into adapting the whole thing to work as a single b-tree node, where the node had a 3-way split based on two decisions: do I know enough about this node to make an informed split, and if so, which items do I split? Well, I went back to a classic split with only 2 options at each node. When you use the classifier over and over again, there was no benefit to doing a 3-way split, and indeed it seemed to make things worse. It was better to just find the best split I could and move data around. (why?) I think it's mainly due to wanting the tree paths to be as homogeneous as possible (equal amounts of "final result" 1s and 0s in that branch of the tree) all the way to the bottom, at the leaf node, where you completely separate out the two, to prevent over-fitting.
How many attributes do you pick for each multi-attribute classifier? I settled on randomly picking between 1 and 2 times log of N (base 2), where N is the number of attributes available. I did toy around with other formulas, but this works well, and I'll tell you why I think that is. I want some randomization in the selection so the results can home in on different aspects of the data when splitting the data as it flows down the tree. The number of attributes it takes to make an informed decision, if each attribute leads to 2 others (read: each attribute halves the remaining selection, as in a decision tree), is log of N (base 2), assuming the tree is evenly balanced. I double the upper end because it's a random variable and I'd like it to average around there, and also because there could be cases where the tree isn't balanced.
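In code form, my reading of that count (sketch only, not the real implementation):

```python
# Per-node attribute count: uniform between 1 and 2*log2(N), which
# averages out near log2(N).
import math
import random

def attributes_to_pick(n_attributes):
    upper = max(1, round(2 * math.log2(n_attributes)))
    return random.randint(1, upper)

print(attributes_to_pick(1800))  # somewhere in 1..22, averaging ~11-12
```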
I pick the actual attributes to use in the exact same manner I did before: I randomly pick some square-root-of-N candidates and keep the one with the best deviation of ones and zeros, then assemble each of those into the model that the classifier will use to do the splitting. I tried varying the number I pick, but really, square root of N works well. The split that occurs for data passing down the tree is done by finding which group the data is collectively closer to over the randomly selected attributes at that node.
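And the selection loop itself, again as a guessed-at sketch (the "best deviation of ones and zeros" test here is my interpretation, not the actual code):

```python
# Fill each classifier slot with the best of sqrt(N) random candidates.
import math
import random
import numpy as np

def pick_attributes(X, y, n_slots):
    n = X.shape[1]
    n_candidates = max(1, int(math.sqrt(n)))
    chosen = []
    for _ in range(n_slots):
        candidates = random.sample(range(n), n_candidates)
        # "Best deviation of ones and zeros": biggest gap between the
        # attribute's mean over the 1-rows and over the 0-rows.
        best = max(candidates,
                   key=lambda j: abs(X[y == 1, j].mean() - X[y == 0, j].mean()))
        chosen.append(best)
    return chosen
```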
That's all well and good; did you do anything else? I did indeed. I also found a better way to measure the accuracy of a particular forest. Instead of using OOB data to test against the final tree, I can measure the success of a model by how well it splits its data while being built. At each node down the tree I tally the ones compared to the zeros and save that on the classifier. In the end you get a percentage that varies based on how well the nodes as a whole are splitting the data into a homogeneous mix. So, this isn't a measure of accuracy per se, but it is a measure of what we care about for weighting: how homogeneous the tree is tells us how impartial the tree is. I use a Gini index on the final tallies for the classifier and use that as my weighting for its results! This improved my results slightly over using OOB data to weight it, but more importantly, I don't have to use the OOB data to score it! Which means I can make a larger training data set if I want to. Buuuuut... (ah, there is always a but) I've left things alone as they stand. I want the training data for the trees to vary, so I still build my training data as a subset of the same size as the original, sampled from the original over and over again with replacement!
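Here's roughly how I picture the homogeneity weighting working, sketched with invented names (the real code may differ):

```python
# Tally 1s vs 0s at every node while building, then turn the tallies into
# a Gini-style purity score and use it as the tree's voting weight.
def gini_purity(ones, zeros):
    """1 - Gini impurity for one node's tally; 1.0 means perfectly pure."""
    total = ones + zeros
    if total == 0:
        return 0.0
    p1 = ones / total
    return 1.0 - 2.0 * p1 * (1.0 - p1)  # 0.5 at a 50/50 mix, 1.0 at 0/100

def tree_weight(node_tallies):
    """Average purity over all node tallies [(ones, zeros), ...]."""
    scores = [gini_purity(o, z) for o, z in node_tallies]
    return sum(scores) / len(scores)

# A tree whose nodes split cleanly gets a bigger vote:
print(tree_weight([(90, 10), (5, 95)]))   # ~0.86
print(tree_weight([(50, 50), (45, 55)]))  # ~0.50
```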
Seems a waste to throw away that extra data/information... My thoughts exactly. In my most recent endeavors, I've taken to using that data to improve my classifier tree by scoring the accuracy of those results and then going back and fixing the tree based on the scores! (sounds an awful lot like boosting...) I admit it's the same idea, but the mechanism by which I go about it is nothing like it. The short version is that I clip and rebuild any leaves in the tree that don't live up to their parents' expectations. (Hah, I bet parents all over the world wish they could do that.) That is, if one child far out-produces the other in accuracy, I re-create the problem child. I do this for the entire tree. I could make it iterative, but haven't yet (I plan to this afternoon). It is making things better :) (yay!) I could go on and on about the details of that mechanism and some of the variables and concerns I have, but that's enough for today.
1 day later. What has changed? I tried making a classifier that wasn't a b-tree classifier. I selected attributes with large deviations over the training data (well, from a randomly selected subset of attributes). Then I compared it to an out-of-bag set of data to get an accuracy measure. It worked really well according to the OOB data. The problem was, because the core of the mechanism was making a model of the data and not a classifier, its performance degrades terribly. (huh?)
Classifier vs model... what's the difference? In the nomenclature I'm spelling out here, a model represents an idealized version of the data. I make these by grouping and weighting various samples from the training data. By itself it can't classify any other data; it is just a model of the data as a whole (hence, model!). A classifier typically takes in some data and, via a function, produces a result that indicates some sort of classification (also, hence the name!). So what I did last night was build some really accurate models and use a very simple distance-style classifier to say whether a test row was part of the group or not. This, however, seriously over-fit the data by building exacting cases that, when used to compare data in general, failed far more often than one wants. It was generally really right or really wrong. This produced a terrible log loss performance, which is what the most recent contest I'm working on uses.
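To pin the distinction down, here's a bare-bones sketch of a model plus a distance-style classifier over it (all names invented for illustration):

```python
# The model is a weighted prototype (an idealized row); classification is
# just "are you close enough to the prototype?".
import numpy as np

def build_model(X_group):
    """Model = the attribute-wise mean of the group's training rows."""
    return X_group.mean(axis=0)

def classify(model, row, threshold):
    """In-group if the row sits within `threshold` of the prototype."""
    distance = np.linalg.norm(row - model)
    return distance <= threshold

# The over-fitting failure mode: the prototype is so exact that ordinary
# in-group rows land outside `threshold`, and the answers come back
# confidently wrong -- which is exactly what log loss punishes hardest.
```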
So start over and do something different? Well... hold your horses there, cowboy. What I built was basically a more complicated version of a classic splitter (read: Gini or entropy), or of what I described last blog. And many of the accuracies I got were in the 80-90% range. What's more, I can build a second model for the data that indicates whether the first model should or should not even be considered. This is done by looking at all the attributes of a piece of data that aren't used. So, say I have a model that uses 8 attributes (of the 1800-odd in the test data) to indicate whether the test data is in the group or out of the group. I can use the remaining 1792 attributes as a model as well, to indicate whether I should even look at the 8-attribute group. The decision is made by running a distance-from-the-model style classifier against that model.
Can I have some numbers? Sure! Let's say the model of 8 is 80% accurate and the model of 1792 indicates 60% of the records should be validated using the model of 8. That gives us a pretty decent splitter: 48% of the cases can be correctly classified, and the other 52% need to be reconsidered. Technically, 12% of those are misclassifications and we don't know it, and the other 40% are ones we just "gave up on" because they failed the 1792-model check to be tested against the 8. Using these models and classifications, we really should make a 3-path node: left (out-of-group case), right (in-group case) and center (unknown; failed the 1792 check). The center path gets the 40% of the test data which we have no idea about. The other 60% are in the left and right, with the 12% error somewhere across those two nodes. We build new nodes with new models down each path. This is what I intend to do tonight.
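The same arithmetic, spelled out as code for anyone who wants to poke at the numbers (illustrative values only, straight from the paragraph above):

```python
model8_accuracy = 0.80   # the 8-attribute model, when it is consulted
gate_pass_rate  = 0.60   # fraction the 1792-attribute gate lets through

settled    = gate_pass_rate * model8_accuracy        # correctly classified
hidden_err = gate_pass_rate * (1 - model8_accuracy)  # wrong, and we don't know it
deferred   = 1 - gate_pass_rate                      # sent down the center path
print(round(settled, 2), round(hidden_err, 2), round(deferred, 2))  # 0.48 0.12 0.4
```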
So... uh, Diablo 3? (grin) Ya, I worked a lot on various code-related things Friday and, to a lesser extent, Saturday and Sunday, but really, I played more Diablo 3 than anything this weekend. It's a pretty good game. I was a little worried I wouldn't be able to get into it, and that I'd thrown away the money I paid for the game. But nah, that didn't happen. It's quite a bit of fun!
Did you do anything productive in data mining this weekend? (Do I ever?) Sort of, but first I should take a moment to say a few things about my binary tree classifier. Instead of using a Gini split or an entropy split, I'm using a split I've devised that models what the actual data is doing: essentially, I build a normal distribution at each split in the tree for both the 1s cases and the 0s cases, then find the center spot between the two averages, and that is my splitting point. I'm glossing over the details, but that's the gist of it; here's a picture to give you a visual of the idea. (OMG!)
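In sketch form, the splitter as described (one pass for the statistics, midpoint of the two class means as the split; function and variable names are mine):

```python
import numpy as np

def choose_split(values, y):
    """values: one attribute's column; y: 0/1 labels. Returns a split point."""
    ones, zeros = values[y == 1], values[y == 0]
    mu1, mu0 = ones.mean(), zeros.mean()
    # The same pass yields the standard deviations, kept for later use
    # (e.g. the fuzzy margins discussed elsewhere in this series).
    sigma1, sigma0 = ones.std(), zeros.std()
    return (mu1 + mu0) / 2.0  # the center spot between the two averages
```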
Sounds kind of slow... It isn't. You make one pass over the data to find the averages and standard deviations, you use those for choosing a splitter, and that's it. More important than the speed, though, is that it produces better results than either of the other two methods (at least in this test scenario). Where is this going? Well, other than nominal improvements on this method, I haven't really had any gains since I implemented random forest. My current ranking over the baseline random forest is only due to my slightly better splits than the ones the baseline used. So, I need something new, something that will improve the technique in a major way.
What about boosting and throwing out outliers? I tried boosting, but I'm not sure I ever did it right. My results were slightly poorer. And while they say you can "boost forever", the way my algorithms honed in on values, it made no sense to keep boosting after about 2 iterations, probably due to me using normal curves and not solid 1 and 0 values. I think maybe I need to better understand how you implement a binary classifier with boosting. Or maybe I'm doing it just fine and it's simply not any better than random forests. I should say, despite quite a bit of searching, I never did find a good explanation of how AdaBoost is implemented over a binary classifier. I mean, I found how to do the weighting and the math behind it, which works just fine. I also found the idea behind it spelled out, but the actual use of the weighted values inside the algorithm (and how the tree rebalances itself, or if it even does) I never found written out. So I tried a bunch of different things, but it's all guesswork, and like I said, nothing ever improved over random forests. As for outliers, I'm putting that idea on the back burner for now. I think that's more of a tweaking technique and won't give me the gains I need to get to the front of the pack.
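For what it's worth, the weight math that is easy to find is the standard AdaBoost update below; the part that never seems to get written out, how those weights feed back into the tree, amounts (as best I can guess) to making every node tally a weighted count instead of a raw row count:

```python
# One round of the textbook AdaBoost weight update. Labels and
# predictions are +1/-1; weights are assumed normalized (sum to 1).
import math

def adaboost_round(weights, y_true, y_pred):
    """Returns (alpha, new_weights) for one boosting round.

    The weak learner must then be retrained against `new_weights`,
    i.e. its node tallies sum weights rather than count rows.
    """
    err = sum(w for w, y, p in zip(weights, y_true, y_pred) if y != p)
    err = min(max(err, 1e-10), 1 - 1e-10)    # keep alpha finite
    alpha = 0.5 * math.log((1 - err) / err)  # this learner's vote weight
    new = [w * math.exp(-alpha * y * p)
           for w, y, p in zip(weights, y_true, y_pred)]
    total = sum(new)
    return alpha, [w / total for w in new]   # re-normalize
```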
What I am going to do is this... I'm going to try making a different kind of classifier, one which is loosely built off the ideas that seem strong in random forest but is hopefully more accurate. Binary tree classifiers more or less halve the size of the data at each node based on some decision at that node, and unless that node's attribute is selected again later, that information is never used again. My classifier will make just 1 decision based on all the randomly selected attributes. Will it work? (eh, maybe) Will it be awesome? (unlikely) Or will it be just a colossal waste of time?! (i vote that) Stay tuned!