6/25/2012
A whole week and no new blog? Where's the thrill, the adventure, the action you are supposed to be blogging about? Apparently you've been reading some other blog. The action and adventure took a pass, and you're stuck with me talking about computers again. Seriously though, sometimes I do do other things, but rarely do I think they are worth talking about. For instance, last night I made a mental list of all the things I would need to do to my house to get it ready to sell. Are you planning on moving? No. They are just outstanding things I need to do that some people out there could empathize with. It's definitely something I could write about, but I don't think it's all that interesting.
What have you been doing for a week? The title gives it away, doesn't it? Not really, but if I say yes, can we talk about food instead? No, but let's talk about food briefly anyway. I'll save the data mining stuff for the last two paragraphs. I've been watching a little of the Cooking Channel, specifically old episodes of Food Detectives and Food(ography). I guess Food Network found they had too much stuff for one channel? Whatever the case, I'm convinced I don't eat enough interesting foods (see what TV does to you?). I thought about making a vacation out of it. The problem isn't finding places to eat; the problem is finding something else to do while you wait for the next meal. Maybe I'm being a little too one-track about it. Ya think? Pretty sure this is a "first world problem".
Data mining: last time I said I thought I had a good idea for a new direction. Turns out it was a pretty good idea! (Or thus far seems to be.) Here's what I tried. I started out by thinking about how I could apply dynamic time warping (DTW) to static data. DTW works by doing a closest match of time series data: it finds the closest match by minimizing the area between the curves of two series when they are compared to each other. DTW is very hard to beat when it comes to time series data. If you have enough samples to compare against, odds are it will match one of them correctly, and do so as well as any other algorithm you might have. Thing is, the data I feed to regression trees and random forests has no time component, so you can't make curves of the data changing over time. So, what is the mechanism to get one to work with the other?
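For anyone who hasn't seen DTW before, here is a minimal sketch of the textbook dynamic-programming version plus a nearest-match lookup. It is not my contest code. The names `dtw_distance` and `nearest_label` are made up for illustration, and it uses the sum of point-to-point gaps along the best alignment as a stand-in for "area between the curves".

```python
def dtw_distance(a, b):
    """Textbook DTW cost between two numeric sequences (illustrative sketch)."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i points of a
    # with the first j points of b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            gap = abs(a[i - 1] - b[j - 1])
            # Either series may "stall" while the other advances, or both advance
            cost[i][j] = gap + min(cost[i - 1][j],
                                   cost[i][j - 1],
                                   cost[i - 1][j - 1])
    return cost[n][m]


def nearest_label(query, labelled_series):
    """Return the label of the stored (series, label) pair closest to query."""
    best = min(labelled_series, key=lambda pair: dtw_distance(query, pair[0]))
    return best[1]
```

A series that is just a stretched copy of another, say `[1, 2, 3]` against `[1, 2, 2, 3]`, lines up with no gap at all, which is the "same thing, presented differently" idea.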
So, what I did is I stopped and considered why DTW works. It leverages the idea that one sample is just a version of another sample that has been distorted over time. They are the same thing, but one is just not being presented quite right. And what is time, strictly speaking? Time is change, or moving from one state to another. Now the stretch: isn't that what is happening to the data as it moves down the CART tree? You have a sample, and it moves either left or right based on some decision point. The data is not changing, but its movement down the tree produces a series of 1s and 0s (lefts and rights), and that series is what we are interested in. That is your time series. I'm going to stop there and explain more next time. There is quite a bit more to talk about.
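To make the "series of 1s and 0s" concrete, here is a small sketch of recording the lefts and rights a sample takes down a tree. The nested-dict tree layout, the `decision_path` name, and the thresholds are all hypothetical, picked only for illustration; my real tree code looks nothing like this, and how the series gets used comes next time.

```python
def decision_path(node, sample):
    """Return the 0/1 (left/right) choices a sample makes on its way to a leaf."""
    path = []
    while "leaf" not in node:
        go_right = sample[node["feature"]] > node["threshold"]
        path.append(1 if go_right else 0)
        node = node["right"] if go_right else node["left"]
    return path


# A made-up two-level tree: split on feature 0, then on feature 1
tree = {
    "feature": 0, "threshold": 5.0,
    "left": {"leaf": "A"},
    "right": {
        "feature": 1, "threshold": 2.0,
        "left": {"leaf": "B"},
        "right": {"leaf": "C"},
    },
}

print(decision_path(tree, [7.0, 1.0]))  # right at the root, then left
```

The sample itself never changes; only the list of turns grows as it moves down.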
6/19/2012
Do you have any surprises for us!? :D No :D :O ... :( Shortest blog ever? Er, no. Ok, so I have three bits of "news", if you can call it that. For starters, I've decided NOT to do the Heritage Health Prize contest over on Kaggle. Any particular reason? Lazy? A couple of reasons, I suppose. First, the data is kind of a mess. So any predictions I make from it will depend partly on my skill at understanding the true value of the various pieces of data they've given me. Heck, just getting that data into a working form I can process will be a large effort, and if I make a mistake... well, it could end up being a giant waste of time. So the project isn't a purely analysis-style project because of it. I want to work on the core algorithms, which this contest only lets you do after you try to get their data into shape. In short, it doesn't sound like any fun. There are other things too. There is a giant thread on Kaggle about why the contest doesn't appeal to another data scientist, and he and others make some convincing arguments. So yeah... I'll just wait for a different contest. And the money? It could be 300 million, and if it's no fun to do as a hobby... it's no fun to do as a hobby. If it was enough money, a person could form a business just to go after it. But then that's work, not a pastime.
NEXT! I'm going to start training for a marathon in October! wuwu! I've been meaning to do one for a couple years now, but always stopped training for one reason or another. It'll be a new thing for me to run one.
And finally... I'm going to continue working on my data from the last contest till a new one comes along. I have what I think might be a really good idea in a new direction... a way to make another jump in analysis. Well, or maybe not. You know how these things go. The point is, as I don't have a fitting contest to focus my efforts on, I'll work on refining the old data I have. Which reminds me, I did run my code on the contest from before this last one, the one from tunedit.org, which I've got the answers for. It turns out there is some key filtering you have to do to get results that are usable at all (a factor he seems to have forgotten since last he worked on it). My random forest implementation isn't designed to work with that data right now. I did do a test run, but without the filter in place the results were terrible: too many zeros in the data that mean nothing. I could retrofit the algorithm to make it work, but with no incentive to do so, and it being a couple of days' work, I'll let it sit till a good motivator comes along. For now I'm going to just wait till a new contest and continue to refine my algorithms.
6/15/2012
I take it by the header you didn't win... Nope. (Loser!) Technically it's not over till 7 CST tonight, and I'll do one more submission, but I won't be changing much, so it's not gonna win. It's more to see how the algorithm performs in a certain scenario, but... (here come the excuses) I know my next step! (Too late!) My next step is to implement probability-based estimates at the leaves of my tree instead of simple positive or negative indicators, and then to apply probability scaling both to the tree and to the result. I have no idea how much better that will make the forest, but the few things I've seen indicate that is the next logical improvement. I also don't know if my performance is all that good, considering my final algorithm is little more than a really good plain vanilla random forest. I mean, I improved on the splitter, found a good way to use all the training data during each classifier build, and found a good way to get a relatively accurate weight (better than standard OOB, it seems). I mean, it seems good... .44108 compared to a base of around .46182... and it should work that way out of the box every time on any data set. And I made my program lightning fast. So to me that seems like something. But I dunno, I was a long, long way from being number 1. (That's the spirit.)
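A rough sketch of the difference between a plain positive/negative leaf and a probability leaf, since that is the whole idea of the next step. Nothing here is implemented in my forest yet. The function names are made up, and the Laplace-style smoothing is my assumption about one common way to keep tiny leaves from claiming 0% or 100%; the probability scaling part is not shown.

```python
def leaf_vote(positives, total):
    """Plain indicator leaf: 1 if at least half the training rows were positive."""
    return 1 if positives * 2 >= total else 0


def leaf_probability(positives, total, smoothing=1.0):
    """Probability leaf with Laplace-style smoothing (an assumed choice).

    With smoothing=1.0 an empty leaf reports 0.5 instead of dividing by zero,
    and a leaf with 3 of 4 positives reports 4/6 rather than a flat "yes".
    """
    return (positives + smoothing) / (total + 2.0 * smoothing)
```

The vote throws away how sure the leaf is; the probability keeps it, which matters a lot when the contest is scored on how confident each prediction was.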
So, PET (probability estimation tree) and scaling? Yep! :) First one, then the other. I'll pick a new contest to start working on this weekend so I can get a real-world idea of how well it's going, and then start trying to implement the stuff. I might actually take a day off from in front of the computer too! (Crazy talk!)
6/14/2012
We join our narrator already in the process of hitting his head on a wall, one hit for each syllable. (Play along at home, kids! Don't.) MUST ... FIND ... BET ... TER ... WAY ... TO ... CLASS ... IF ... Y ... CRAP.
Ya... no improvements. (Only 1.5 days left! Get crackin'! That's 3 more submissions for those playing the home game.) What really gets me is the amount of room for improvement and the ever-so-small separation between the top competitors and me. In terms of score, that is. But Jamie, it seems like a 10% difference. Ya, well, don't let the numbers fool you. It's not. In some ways it's much more; in others it's much less. Less, you say? Yes. I can't remember if I explained this once before, but I will explain real quick to be clear on the small difference between one score and the next.
The scoring system is very simple. Each prediction for a row is between 0 and 1, and there is of course a correct answer that goes with that row. To score a row, the prediction is multiplied by the correct answer, and that is added to the inverse of the prediction multiplied by the inverse of the answer. Huh? Like so: Y = Answer * prediction + (1-Answer) * (1-prediction). The effect is that if you guess 0 and the answer is 0, the result is 1. If you guess 1 and the answer is 1, the result is also 1. If you guess 0.5, the result is 0.5 either way. Basically, it is built to adjust for which side the answer is on and give the same result for a positive response as it would for a negative response. If you guess .25 and the answer is 0 you get .75; if the answer is 1 you get .25.
So why is it called log loss? Well, that result is only part of it. You take the computed result and get the logarithm of it, and then negate it. You negate it because they want positive scores, and the log of any value between 0 and 1 is negative. So if you are right and the result is 1, -log of 1 is 0 and the score is zero (the best score you can have). If you are completely wrong and you get a result of 0, -log of 0 is infinity. So one wrong guess destroys your results entirely? Not really; they just use a large number instead of infinity, but technically yes, it should. If my program ever predicts a 1 or a 0, just to be safe I move it to .9999 or .0001, which if wrong scores around 9.2 (a heck of a lot better than infinity). All of the row scores are averaged, and that is your final score. That's it!
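Here is the scoring described above as a short sketch, in case the words don't do it for you. It assumes the natural logarithm and answers that are exactly 0 or 1. The function names and the `eps` parameter are mine, not the contest's, and the contest's own scorer may clip differently.

```python
import math


def row_result(answer, prediction):
    """Y = Answer * prediction + (1 - Answer) * (1 - prediction)."""
    return answer * prediction + (1 - answer) * (1 - prediction)


def log_loss(answers, predictions, eps=1e-4):
    """Average of -log(Y) over all rows, with predictions kept off 0 and 1."""
    total = 0.0
    for answer, prediction in zip(answers, predictions):
        # Move hard 0/1 guesses to eps / 1 - eps so a miss is not infinite
        clipped = min(max(prediction, eps), 1 - eps)
        total += -math.log(row_result(answer, clipped))
    return total / len(answers)


print(row_result(0, 0.25))  # 0.75
print(row_result(1, 0.25))  # 0.25
```

Lower is better: a perfect set of predictions scores close to 0, and every confident miss drags the average up.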
So the current best score is near 0.39 and you have a score near 0.44. Is that a lot? Yes and no. No, because if you assume every row scored the same, 0.39 works backwards to an accuracy of 67.7% on each row and 0.44 to 64.4%. So, not much difference there, and tons of room for improvement on both! The problem is, that's not the case! The wrong answers are what get you. That's the real problem. A few wrong answers can turn a great set of results into crap, which is why it's so hard to improve. You could have 95% accuracy overall, but if the 5% you got wrong each score 7, those misses alone add 0.35 to your average (0.05 × 7), nearly the leader's entire score! So, ya... if you don't have a clue what the answer is, it's better to guess 0.5, which scores 0.6931 (69!), because guessing wrong is penalized very heavily. All that being said, it's time for me to go back to staring at the screen and wondering how I can improve my results overall by 5% :) (Sounds easy! Hah!)
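The "working backwards" arithmetic can be checked with a couple of lines. This is only a back-of-the-envelope sketch assuming natural logs; `implied_accuracy` and `mixed_score` are names I made up for it.

```python
import math


def implied_accuracy(score):
    """Per-row result Y that would give this score if every row scored the same."""
    return math.exp(-score)


def mixed_score(fraction_wrong, wrong_penalty, right_penalty=0.0):
    """Average score when a fraction of rows take a big penalty and the rest a small one."""
    return fraction_wrong * wrong_penalty + (1 - fraction_wrong) * right_penalty


print(round(implied_accuracy(0.39), 3))  # 0.677
print(round(implied_accuracy(0.44), 3))  # 0.644
print(round(-math.log(0.5), 4))          # 0.6931, the "no clue" guess
```

Play with `mixed_score` to see how quickly a small share of confident misses swamps everything else in the average.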
6/11/2012
Nearing the home stretch... Just over 3 days left, and I think I'm out of mundane refinements that I could try on my program. (That only took 4 weeks :).) I managed to squeeze a little more score out of my program by basically rigorously testing every variation of any idea I could come up with. Things like how best to calculate percentage so-and-so. And what if I split this particular data in half... how about thirds... what if I make the program do a half pike somersault over the fire and land in the dried-up lake bed... etc. (That sounds like it should work!) Needless to say, all I've done is refine and refine. The ideas about filtering by types of groups and subsets were more or less a bust. Well, except for one variation that I do use: using a filter based on training data, I can predict accuracy slightly better than with an OOB test set. That is where I squeezed out the few extra points, but it wasn't a groundbreaking discovery, just a slight improvement.
So just a few more points on the ladder, nothing else worth noting? Well, one other thing I did today: I made the program way... I mean WAY faster. Turns out creating tens of thousands of objects is really, really slow. And I mean more than the overhead of using objects in the first place. Something odd is going on in the underpinnings of the computer. I'm not sure what CPU overhead is involved, but the program was starting to bottleneck at around 50-60% CPU usage. That is, all 8 cores sitting there, not willing to go any faster no matter how many threads I threw at them. Just to test that it was in fact the objects, I made a class that had no data or methods. I then commented out the entire program other than a single loop in each thread that did nothing but create an object over and over again, to see the problem first hand. If I removed the object and, say, created an array instead, I'd peg all of the CPUs. If I used the object, it would float at around 50-60%. So clearly the CPU is trying to do something with each object at creation, and that something is being counted as idle time. Needless to say, I went on a tear (rarr!) and rewrote all my objects to be structures, and any function that got called over and over again and needed objects was changed to use a global that never got recreated. The program is night and day faster. A test that took 9 hours to run can now be run in about 15 minutes. Granted, a lot of that is from the use of primitives over objects, but some of it is from the fact that once again I can use all my CPU.
So that's it for now. Hopefully something will change in the next 2 days that will be awesome news! If not, there is always the next contest *grin*. (No comment.)