Nearing the home stretch... Just over 3 days left, and I think I'm out of mundane refinements to try on my program. That only took 4 weeks :). I managed to squeeze a little more score out of my program by rigorously testing every variation of any idea I could come up with. Things like how best to calculate percentage so-and-so. And what if I split this particular data in half... how about thirds... what if I make the program do a half pike somersault over the fire and land in the dried-up lake bed... etc. *that sounds like it should work!* Needless to say, all I've done is refine and refine. The ideas about filtering by types of groups and subsets were more or less a bust. Well, except for one variation that I do use: with a filter based on training data I can predict accuracy slightly better than with an OOB test set. That's where I squeezed out the few extra points, but it wasn't a ground-breaking discovery, just a slight improvement.
So just a few more points on the ladder, nothing else worth noting? Well, one other thing I did today. I made the program way... I mean WAY faster. Turns out creating tens of thousands of objects is really, really slow. And I mean beyond the overhead of using objects in the first place. Something odd is going on in the underpinnings of the computer. I'm not sure what CPU overhead is involved, but the program was starting to bottleneck at around 50-60% CPU usage. That is, all 8 cores sitting there, not willing to go any faster no matter how many threads I threw at them. Just to confirm it was in fact the objects, I made a class that had no data or methods to test with. I then commented out the entire program other than a single loop in each thread's execution that did nothing but create that object over and over again, so I could see the problem first hand. If I removed the object and, say, created an array instead, I'd peg all of the CPUs. If I used the object, usage would float at around 50-60%. So clearly the CPU is doing something with each object at creation, and that work is counting as idle time. Needless to say I went on a tear *rarr!* and rewrote all my objects as structures, and any function that got called over and over and needed an object was changed to use a global one that never got recreated. The program is night-and-day faster. A test that took 9 hours to run can now be run in about 15 minutes. Granted, a lot of that is from the use of primitives over objects, but some of it is from the fact that once again I can use all my CPU.
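The reuse-instead-of-recreate pattern can be sketched roughly like so. The original program wasn't Python, but the idea translates; `Accumulator`, `score_allocating`, and `score_reusing` are made-up names for illustration, not the real code:

```python
class Accumulator:
    """Tiny helper that tallies ones and zeros for a row."""
    def __init__(self):
        self.ones = 0
        self.zeros = 0

    def reset(self):
        self.ones = 0
        self.zeros = 0

    def feed(self, row):
        for v in row:
            if v == 1:
                self.ones += 1
            else:
                self.zeros += 1
        return self.ones - self.zeros

def score_allocating(rows):
    # Original style: one short-lived object per row (slow at scale).
    return [Accumulator().feed(r) for r in rows]

def score_reusing(rows, acc=Accumulator()):
    # Refactored style: the default argument plays the role of the
    # long-lived "global" instance; it is reset between rows, never recreated.
    out = []
    for r in rows:
        acc.reset()
        out.append(acc.feed(r))
    return out

rows = [[1, 0, 1], [0, 0, 0], [1, 1, 1]]
assert score_allocating(rows) == score_reusing(rows)  # same answers, fewer allocations
```

Both paths compute identical results; the second just stops paying the per-iteration allocation (and collection) cost in the hot loop.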
So that's it for now. Hopefully something will change in the next 2 days that will be awesome news! If not, there is always the next contest *grin*. *no comment*
Data mining, where do we stand!? 1 week left! I've been trying rather diligently to filter and slice and dice the data so that a voter in the random forest can only vote for entries that voter knows anything about. This still seems like the most logical improvement to me right now. *have you considered hiring monkeys?* But it has not gone well. I haven't made any improvements in some time now. Generally, I approach my best but never pass it. I did manage to get my old grouping method to work using an OOB mechanism, but while the accuracy was decent (75%), the log-loss was terrrrrrrible (.60). Basically, unusable.
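For reference on why decent accuracy and terrible log-loss can coexist: log-loss punishes confident wrong answers far harder than accuracy does. A minimal sketch of the metric (standard formula, not anything specific to the program above):

```python
import math

def log_loss(y_true, y_prob):
    """Average negative log-likelihood for binary labels; probabilities
    are clamped so a wrong-with-certainty answer can't blow up to infinity."""
    eps = 1e-15
    total = 0.0
    for yt, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(yt * math.log(p) + (1 - yt) * math.log(1 - p))
    return total / len(y_true)

truth = [1, 1, 1, 0]
confident = log_loss(truth, [0.95, 0.95, 0.95, 0.95])  # 75% accurate, overconfident
hedged = log_loss(truth, [0.75, 0.75, 0.75, 0.75])     # 75% accurate, calibrated
assert hedged < confident  # identical accuracy, much better log-loss
```

A model that is wrong the same way every time, with high confidence, bleeds log-loss on exactly those rows even while its accuracy number looks fine.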
I was hoping I could use one method with the other, since they work in fundamentally different ways. Basically, using one as validation for the other and modifying accuracy accordingly. But it hasn't worked out well as a direct comparison of two predictions on the same data; the results are always poorer. I expect that has something to do with the log-loss score on my grouping method. I did manage a small "improvement": better than what I had been doing, but not quite better than splitting the data down both sides of the tree, by using a filter based on the classifier tree to help determine whether the tree had a valid opinion/vote. I also found a way to merge two statistics using said filters to give a statistic better than the two children (most of the time anyway), much like voting does, but using those filters in tandem. These ideas, or variants thereof, I think have the most potential for improvement overall.
Basically, both ideas hinge on the same thing. One method weights an individual result, the other a group of results. In layman's terms, what I'm doing is asking the person voting *he just turned a classifier tree into a person!* how good they expect their vote to be based on what they learned, NOT based on the OOB data. Though that is figured in when they do vote. I just need to get better at doing it. I'm thinking I might need to introduce a multivariate normal curve to do that properly. That is definitely something I'm really looking forward to. *read: lots of work, not sure it will pan out when it's done.* The weekend lies ahead. The last weekend before this contest ends. So one way or the other this level of focus will end soon!
Two blogs back to back?! Crazy!! Heh, so today something real quick. The contest that has been motivating me to a near frenzy of focus and programming ends in a week and a few days. I imagine at the end of that contest I'll be more human and talk about other things... *we all know this isn't true hehe*. But in the mean time, work continues. Also, I need to say that the super simple classifier I was working on had a bug in it. The worst kind of bug... the kind that means it was using all the data to train on :( *it happens* *never happens to me* *liar!* *nope, perfect code, every time*. By training on OOB data I was building on data I shouldn't have known. As soon as I removed the extra data, the performance dropped radically, down to like 55%. Basically that idea is a bust and I need to go back to the drawing board for a better classifier.
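The fix boils down to keeping the bootstrap sample and the out-of-bag rows strictly separate. A minimal sketch of the bookkeeping (`bootstrap_split` is a hypothetical helper, not the actual code):

```python
import random

def bootstrap_split(n_rows, rng):
    """Draw a training sample of n_rows indices with replacement;
    every row that was never drawn is out-of-bag (OOB) and must be
    reserved for evaluation only -- never fed to training."""
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    oob = sorted(set(range(n_rows)) - set(in_bag))
    return in_bag, oob

in_bag, oob = bootstrap_split(1000, random.Random(1))
assert not set(in_bag) & set(oob)   # no leakage: OOB rows never trained on
assert len(in_bag) == 1000          # training sample keeps the original size
```

With sampling-with-replacement, roughly a third of rows typically land out-of-bag; the bug described above amounted to letting those rows leak back into training, which is why removing them dropped the score so hard.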
Hmmm.. on that note... It's interesting how thoughts change. I was just looking back on my comments on bias and accuracy from a few weeks back, where I tried to explain that they are different. It's worth saying that when it comes to out-of-bag data I think they are exactly the same, or very close. But that's not necessarily the case in general, as far as I can tell. I just wanted to throw that out there at first; I'll come back to it in a moment.
So for tonight's development? I have 2 new ideas to try. One involves eliminating attributes/variables/features (call them what you will, they are all the same thing to me) that do not contribute to the result, either because they act as noise or because they never vary. There has been a lot of work on this and it seems like a good source of improvement. That is, IF *here we go* my improved splitter and attribute selection mechanism isn't more or less already doing this. *6 of 1, half dozen of the other* It isn't directly; I wrote it with other thoughts in mind, but it has a weighting system for picking the appropriate attribute to use, so it may in effect already be cutting those bad attributes out. I think attributes with a lot of noise are possibly still in there, so I'll be interested to see if those can be filtered out to improve the result.
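The "never varies" half of that idea is easy to sketch: a constant column can never improve a split, so it can be dropped before training. Illustrative code, not the actual program:

```python
def drop_useless_columns(rows):
    """Remove attributes that never vary across the data set; a constant
    column can't help any split, so a tree only wastes candidate picks on it.
    Returns the filtered rows plus the indices of the columns kept."""
    cols = list(zip(*rows))
    keep = [i for i, c in enumerate(cols) if len(set(c)) > 1]
    return [[row[i] for i in keep] for row in rows], keep

rows = [[1, 7, 0], [0, 7, 1], [1, 7, 1]]
filtered, keep = drop_useless_columns(rows)
assert keep == [0, 2]                      # column 1 is constant, so it is dropped
assert filtered == [[1, 0], [0, 1], [1, 1]]
```

Noisy-but-varying attributes are the harder half; filtering those needs some measure of how much a column actually correlates with the result, which is what the weighting system above may already be approximating.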
And the other idea? The other idea involves a new technique for identifying the effectiveness of a particular tree. I was talking about that yesterday with regards to the simple classifier, but the ideas I tried last night just didn't work. Hopefully this new method will be better at it. If it does work, the idea is to use it as a filter, then measure accuracy on data that gets through the filter. If the filter works, it improves the accuracy of what is measured. And here is the bias thing I mentioned earlier: the bias is still there, but now a filter is in essence removing some of the cases that don't fit the model's bias *if not all of them!* *dream on*, and the net effect is increased accuracy! The bias of any given tree will always remain the same; you just need to find the right cases to use it on. Incidentally, if I can remove attributes that are noisy, that should actually help both bias and accuracy.
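In miniature, the filter idea might look something like this: each tree abstains on rows it isn't confident about, and only the qualified votes are averaged. Everything here, including the `(predict, confidence)` pairing, is a hypothetical stand-in for whatever the real filter turns out to be:

```python
def filtered_vote(trees, row, min_conf=0.7):
    """Average only the votes of trees whose self-estimated confidence
    on this row clears the bar; low-confidence trees abstain.
    Each 'tree' is a hypothetical (predict, confidence) pair of callables."""
    votes = [predict(row) for predict, confidence in trees
             if confidence(row) >= min_conf]
    if not votes:
        return None              # no tree was qualified to vote on this row
    return round(sum(votes) / len(votes))

# Two toy voters: one confident it sees a 1, one unsure guesser saying 0.
t_sure = (lambda r: 1, lambda r: 0.9)
t_shaky = (lambda r: 0, lambda r: 0.3)
assert filtered_vote([t_sure, t_shaky], row=[0]) == 1  # the shaky vote is filtered out
```

The measured accuracy improves exactly because the filter removes the rows a tree would have gotten wrong, which is the point made above: the tree's bias is unchanged, you're just picking its battles.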
I worked on the data mining stuff all weekend. I learned quite a bit, but didn't make all that much forward progress in terms of a better result. What about the idea you quit with? Did it produce a better result? In a word, no. Sending the data down both paths of the tree does seem to improve results, just not by much. It also increases run time a lot. Scoring data by running it down both sides of the completed tree, however, does not seem to improve results further. It seems once the tree is built, it is best to use it as is.
No improvement at all? Well, if it does improve things, it does so ever so marginally. As soon as you start going down multiple sides of a tree, you need to weight one side's importance against the other's. You can naively take an average, but then you aren't making a decision and the accuracy drops. When you do weight it accurately, the result ends up being very, very close to what it would have been if you hadn't done both sides, because one side's weight is so very small. When building the tree, something similar is true. The trick in building the tree, like I mentioned before, is in knowing exactly how much of the data that just passed the splitter to send down the wrong side, and I still don't have any "correct" numbers; it's guess work. Take too much and it runs forever, as the tree has a billion million nodes (the data size hardly decreases from node to node); take too little and it's like you took nothing at all. That being said, I'm beginning to think the improvements I got were more due to the luck of the data's distribution than any serious form of insight or improved logic. It may well be true that there is always some margin you should take from the opposite side of the splitter, but without a way to determine exactly how much to take based on the data, it's too random to be useful. That doesn't mean I won't do it! Just that I'll back-burner the idea as a last minute thing I can do to eke out a few more points of accuracy. For performance and sanity's sake *no 2nd guessing, I got the right spot*, I'm turning off that bit of logic for now.
All weekend working on that and nothing? 'Fraid so! *bleh* I spent the bulk of my time looking for a 'smoothing' way to distribute the values moving down the tree. I wanted to do my calculations more accurately. It was a waste of effort in terms of improvement: it ran way slower and in the end produced the same or a worse result. It seems that weighting the rows based on previous node splits is just not what you want to do. Each node needs to make its decision independent of the previous ones to work right.
Ok, anything else? I got creative once I learned all that. I started thinking about how each tree's opinion is really only accurate for a particular type of data, and trying to merge them is a mistake. But how does one figure that out? I figured the OOB data could help with that, but that is actually what they do right now with the whole accuracy thing. Everyone who implements random forests runs the test data against each tree and finds that, say, 66% of the time the tree matches the data you put in and gives an accurate result. You improve the result overall because the 34% of the time it's wrong, it is outvoted by other trees. Of course, those trees are only accurate around 66% of the time as well, and so the overlap of independent voters produces a net result of around 77-80%. You can increase that if you make your models more accurate, or if you don't let them vote on data they have no business voting on. But that's the hard part.
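The arithmetic behind that overlap-of-voters claim is just the binomial distribution: if n independent voters are each right with probability p, the majority is right far more often than any single voter. A quick check (textbook math, not the program's code):

```python
from math import comb

def majority_accuracy(p, n):
    """Probability that a strict majority of n independent voters, each
    correct with probability p, votes correctly (use odd n: no ties)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

single = 0.66
ensemble = majority_accuracy(single, 11)
assert ensemble > single   # eleven so-so but independent voters beat one
assert ensemble > 0.8
```

The catch, as noted above, is the word "independent": real trees share training data and attributes, so their errors correlate and the true gain lands lower, in the 77-80% ballpark rather than what the clean binomial math promises.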
This idea of overlapping independent voters works great on just about any classifier. So I put it on a super simple classifier I wrote that just makes an average of the data with pseudo-randomly selected attributes. It returned a 75% result. It wasn't usable, though, because the log-loss result was almost as bad as randomly guessing. I'm not quite sure why that was; perhaps I was selecting very nearly the same attributes over and over, as it seems it was consistent in its errors. So when it was wrong, it was wrong all the time in the exact same way, which is bad for this kind of scoring. At any rate, it's food for thought.
Here we are once again! Things have been going well. I keep making gains and keep finding new improvements. :) It's not all roses and sunshine though. *is that even a real idiom?* The biggest thing I'm struggling with is an absence of any ground-breaking improvements *it's cause you lack awesomeness*; I just continue to refine the existing ideas. The most recent idea that didn't work out was an attempt to improve my accuracy measure by turning the OOB data into a random forest itself, where each row was marked as successfully or unsuccessfully predicted. I was trying to make a better measure of accuracy by preprocessing each row and getting a row-by-row measure of accuracy, but it turned out to be slightly less accurate than the methods I've been using. And it was waaaaaaaay slower.
What have you done to improve your results recently? Well, two things. I've found a better way to move data down the tree. Normally you split your data as it moves down the tree, so it slowly diminishes till you hit the end point. What I do now is take a little bit extra from the wrong side (just past the split) and move that data down the wrong side, and I do the same for the opposite side. The details are still hazy. I found a sweet spot for how much overlap there is, but it was found experimentally. I would really like to get some theoretical estimates as to the best cutoffs to use. *It's probably not worth spending a whole lot of time figuring out. I can always stick with experimentally finding the sweet spots.* I think this makes a lot of sense. Essentially, you are certain to misclassify some data no matter where your split is, since data is normally distributed. This is especially true just past the splitting spot. So the gains from taking a little data past that mark to be analyzed later outweigh any bad data you end up taking along with it, mainly because those rows will get moved around correctly in subsequent splits.
And the other improvement? Well, this is sort of just more of the same. Not only do you do this for analysis, but you do this for scoring! *????* That's right: you pass your data down both sides of the tree if it is close enough to the split that you can't be sure where to send it, and average the two results that come back. But wait, there's more! *oh joy* Eventually, when I get everything working well, I'll be weighting the data going down the tree during construction based on its distance from the center of that side's distribution. I'll be doing the same for scoring. Essentially everything gets very, very fuzzy. And thus my smooth tree is born! Well, assuming someone else didn't make it like 20 years ago... which seems to happen with any idea I ever come up with *heh*. I need to get the math figured out so I know I'm weighting things right as they move around, for tree building and for scoring, so as to maximize the gain. This will hopefully do away with experimentally found cutoffs. I also need to submit some new results to the contest. My current score doesn't have any of that logic in there; it is just the modified tree splitting/creation that uses values I found experimentally.
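The both-sides scoring idea can be sketched as a toy recursive scorer: rows far from a split take one path, while rows inside a margin band take both, with weights that favor the side they're on. The node layout and the linear blend are my own stand-ins, not the post's actual math:

```python
def soft_score(node, x, weight=1.0, margin=0.5):
    """Score x against a toy tree, following BOTH branches when x falls
    within `margin` of the split and weighting each side's answer.
    Nodes are hypothetical dicts: leaves {'value': v}, internal nodes
    {'attr': i, 'split': s, 'lo': node, 'hi': node}."""
    if 'value' in node:
        return weight * node['value']
    d = x[node['attr']] - node['split']
    if d < -margin:
        return soft_score(node['lo'], x, weight, margin)
    if d > margin:
        return soft_score(node['hi'], x, weight, margin)
    # Near the split: blend both sides, favoring the side x is on.
    w_hi = 0.5 + d / (2 * margin)     # runs 0..1 as x crosses the band
    return (soft_score(node['lo'], x, weight * (1 - w_hi), margin)
            + soft_score(node['hi'], x, weight * w_hi, margin))

tree = {'attr': 0, 'split': 1.0,
        'lo': {'value': 0.0}, 'hi': {'value': 1.0}}
assert soft_score(tree, [2.0]) == 1.0   # far past the split: one path only
assert soft_score(tree, [1.0]) == 0.5   # right on the split: even blend
```

This also shows where the extra cost comes from: near-split rows fan out into multiple paths, so scoring visits large sections of the tree instead of a single root-to-leaf walk.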
Any down side? Ya, it's slower. Not a lot slower, but slower. Tree construction time goes up a lot as the data at each node increases. Not to mention the trees go deeper, as it requires more splits to get to a stopping spot. Scoring takes longer as well, since you don't go down a single path anymore; you go down large sections of the tree. Not the whole tree, mind you. There are some cutoffs that can be made to speed things up, say, when one side's result represents less than 1% of the weight in the final result. Once I get things right, it'll be very interesting to see how much closer to the top competitors I get. I don't really expect to be THAT much closer, but some.
So this new classifier thing... how's that workin' out for ya? It works well! *really?* Yes really! *will wonders never cease* I moved up the ranks another 50-60 spots, with a score of .45002. *surprise, surprise!* Basically, when the classifier is built on multiple properties, it produces a better way to split up the groups at that level of the b-tree.
Details man, what exactly did you end up doing? If you recall, for a couple days I said I was looking into using a single large model instead of a b-tree, then making one decision based on that. And it was over-fitting because of it. I then looked into adapting the whole thing to work as a single b-tree node, where the node had a 3-way split based on two decisions: do I know enough about this node to make an informed split, and if so, which items do I split? Well, I went back to a classic split with only 2 options at each node. When you use the classifier over and over again, there was no benefit to doing a 3-way split, and indeed it seemed to make things worse. It was better to just find the best split I could and move data around. *why?* I think it's mainly due to wanting the tree paths to be as homogeneous as possible *equal amounts of "final results" of 1s and 0s in that branch of the tree* all the way to the bottom, at the leaf node, where you completely separate out the two. To prevent over-fitting.
How many attributes do you pick for each multi-attribute classifier? I settled on randomly picking between 1 and 2 times log of N (base 2), where N is the number of attributes available. I did toy around with other choices, but this works well, and I'll tell you why I think that is. I want some randomization in the selection so the results can hone in on different aspects of the data for making decisions *splitting the data as it flows down the tree*. The number of attributes it takes to make an informed decision, if each attribute leads to 2 others *read: each attribute halves the remaining selection from a decision tree*, is log of N (base 2), assuming the tree is evenly balanced. I double it because it's a random variable and I'd like it to average around there, and because there could be cases where the tree isn't balanced.
I pick the actual attributes to use in the exact same manner I did before: I randomly pick some square root of N of them, pick the one with the best deviation of ones and zeros, and then assemble each of those into the model that classifier will use to do the splitting. I tried varying the number I pick, but really square root of N works well. The split that occurs for data passing down the tree is done by finding which group the data is collectively closer to for the randomly selected attributes at that node.
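Putting the two selection rules together, a sketch (the `deviation` scoring hook is a hypothetical stand-in for the ones/zeros deviation measure, and none of this is the actual program):

```python
import math
import random

def pick_attribute_count(n_attrs, rng):
    """Choose how many attributes one node's classifier uses:
    uniformly between log2(N) and 2*log2(N), per the reasoning above."""
    lo = max(1, int(math.log2(n_attrs)))
    return rng.randint(lo, 2 * lo)

def best_of_sqrt_n(attrs, deviation, rng):
    """Fill one attribute slot: sample sqrt(N) candidates and keep the
    one scoring best under the deviation measure."""
    pool = rng.sample(attrs, max(1, math.isqrt(len(attrs))))
    return max(pool, key=deviation)

rng = random.Random(7)
k = pick_attribute_count(64, rng)      # log2(64) = 6, so k lands in 6..12
assert 6 <= k <= 12
best = best_of_sqrt_n(list(range(16)), deviation=lambda a: a, rng=rng)
assert 0 <= best <= 15
```

Sampling the candidate pool (rather than scanning all N attributes) is what keeps the trees decorrelated: different nodes hone in on different aspects of the data.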
That's all well and good, did you do anything else? I did indeed. I also found a better way to measure the accuracy of a particular forest. Instead of using OOB data to test against the final tree, I can measure the success of a model by how well it splits its data while being made. At each node down the tree I tally the ones compared to the zeros and save it on the classifier. In the end you get a percentage that varies based on how well the nodes as a whole are splitting the data into a homogeneous mix. So this isn't a measure of accuracy per se, but it is a measure of what we care about for weighting: how homogeneous the tree is tells us how impartial the tree is. I use a Gini index on the final tallies for the classifier and use that as my weighting for those results! This improved my results slightly over using OOB to weight it, but more importantly, I don't have to use the OOB data to score it! Which means I can make a larger training data set if I want to. Buuuuut... *ah, there is always a but* I've left things alone as they stand. I want the training data for the trees to vary, so I still build my training data as a subset of the same size as the original, sampled from the original over and over again with replacement!
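A Gini-style weighting built from per-node tallies might look like this. The impurity formula is the standard one; the rescaling into a voting weight is my guess at the idea, not the author's exact formula:

```python
def gini(ones, zeros):
    """Gini impurity of one node's tally: 0 for a pure node, 0.5 at 50/50."""
    n = ones + zeros
    if n == 0:
        return 0.0
    p = ones / n
    return 1.0 - p * p - (1 - p) * (1 - p)

def tree_weight(tallies):
    """Turn per-node (ones, zeros) tallies into a voting weight:
    the purer the splits overall, the closer the weight is to 1.
    Hypothetical scheme sketching the idea described above."""
    total = sum(o + z for o, z in tallies)
    impurity = sum((o + z) / total * gini(o, z) for o, z in tallies)
    return 1.0 - 2.0 * impurity   # rescale so 50/50 everywhere maps to 0

assert gini(10, 0) == 0.0
assert gini(5, 5) == 0.5
assert tree_weight([(10, 0), (0, 10)]) == 1.0   # perfectly separating tree
```

The appeal is exactly what the post says: this score comes entirely from the training-time tallies, so no OOB data needs to be held back to compute it.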
Seems a waste to throw away that extra data/information... My thoughts exactly. In my most recent endeavors, I've taken to using that data to improve my classifier tree by scoring the accuracy of those results and then going back and fixing the tree based on the scores! *sounds an awful lot like boosting...* I admit it's the same idea, but the mechanism by which I go about this is nothing like it. The short version is I clip and rebuild any leaves in the tree that don't live up to their parents' expectations. Hah, I bet parents all over the world wish they could do that. That is, if one child far out-produces the other in accuracy, I re-create the problem child. I do this for the entire tree. I could make it iterative, but haven't yet. (I plan to this afternoon.) It is making it better :) *yay!* I could go on and on about the details of that mechanism and some of the variables and concerns I have, but that's enough for today.
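The clip-and-rebuild pass could be sketched like this: walk the tree, and wherever one sibling leaf far out-performs the other, re-create the weak one. The node shape, the `accuracy`/`rebuild` hooks, and the gap threshold are all hypothetical stand-ins for whatever the real mechanism does:

```python
def rebuild_weak_leaves(node, accuracy, rebuild, gap=0.2):
    """Walk a toy tree; wherever one child leaf out-performs its sibling
    by more than `gap` in accuracy, replace the weak leaf via `rebuild`.
    Leaves are dicts with 'value'; internal nodes have 'lo' and 'hi'."""
    if 'value' in node:
        return node
    lo, hi = node['lo'], node['hi']
    if 'value' in lo and 'value' in hi:
        a_lo, a_hi = accuracy(lo), accuracy(hi)
        if a_lo - a_hi > gap:
            node['hi'] = rebuild(hi)     # re-create the problem child
        elif a_hi - a_lo > gap:
            node['lo'] = rebuild(lo)
        return node
    node['lo'] = rebuild_weak_leaves(lo, accuracy, rebuild, gap)
    node['hi'] = rebuild_weak_leaves(hi, accuracy, rebuild, gap)
    return node

tree = {'lo': {'value': 0, 'acc': 0.9}, 'hi': {'value': 1, 'acc': 0.5}}
fixed = rebuild_weak_leaves(
    tree,
    accuracy=lambda leaf: leaf['acc'],
    rebuild=lambda leaf: dict(leaf, rebuilt=True),
)
assert fixed['hi']['rebuilt']   # the under-performing child was re-created
```

Making it iterative, as the post plans, would just mean repeating the pass until no sibling pair exceeds the gap.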