Search
rss Logo
Pay no attention to the man behind the curtain!
5/3/2012
First a little entertainment.isn't this whole thing supposed to be entertainment???indeed! I've been watching a starcraft 2 video cast series called "when cheese fails" They also have a will cheese fail series. The two casters watch a video sent in by viewers where someone tries to win a game using an attack that is typically either a cannon rush, 6 pools attacks or a proxy base. Thus far I think this is the best one I've seen. If you like starcraft 2, hopefully, you will enjoy it. www.youtube.com/watch?v=0mPGYeuU2Oc
As for the random forests stuff. I finally got it working last night. It definitely looks like the results are going to be way better.=D I'm not 100% sure I've got everything optimized for the best traditional result yet. It's hard to tell because it runs so slow and as it currently operates and I was only able to do a test run with 800 "trees" in the forest. I'm pretty sure i need a couple thousand trees. But before I can do that I need to speed it up. I think I would like to take a crack at explaining random forest as I understand it now that I've got a version working. So here goes...
Random Forest Explained (core concept)

So let's say you are a doctor and you have a bunch of patients. For each patient you have height, weight, gender and age. 1 in 10 of your patients developed the flu last year. You know who they are. Predict who will get the flu this year. To do this you take all the data for last year and randomly select a few properties about the patient technically you randomly select the square root of the number of attributes ... but we have 4 which is 2. Let's say Age and Weight. Then for both of them you find out which has the highest entropy or some system of choosing one with a lot of variation. You process that one. Let's say it was age. Next You find the correct age to use for splitting the group in half. That is you will pick an age that will move half the entropy to one side and the other half to the other.That is, the exact mechanism by which you pick the split is also based on entropy not on an strict average number of occurrences.. You now have a two groups of people.
You then repeat this process on the two groups separately and then again on their groups and then their subgroups etc.. (subdividing them further and further), until you get to either a group with only people who were all sick, healthy or have identical information. You stop dividing at this point. It is completely valid to randomly select the same property again and again. Or possibly once then some other property a few times then come back to it. So you may end up splitting by age 3 times in a row or more... or maybe you wont split by age at all. The property you select to do the division must be random each time. Regardless of the choice, each time you divide you are halving the number of possible results to be processed in the next iteration. Eventually you will get to a scenario for stopping.
You have constructed 1 tree. This can be used to score the patents for this year. Granted, some of the stats for a patient are the same year to year, but age always changes. You might also have lost some patients and gotten some new ones. You can score each patient by starting at the top and following the logic of where they end up. "Oh their age is less than this, so they go to this question." When you reach a stopping spot you have an answer for that patent.
The thing is for a random forest you make 100s and 100s of these trees. Since there are lots and lots of different ways the data can be sliced up you are trying to build a tree for a good selection of those possibilities. Each time you follow the logic of a particular tree you get a vote for a possible answer. When you have take 1 patient down all the trees you will have a number a certain number of votes for will get the flu and won't get the flu. This is that patient's odds of getting the flu.
That doesn't sound so hard... At it's core it isn't. The trick lies in finding the right splitting point. Having properties that are independent of each other (don't for instance take age of the patient and decade they were born in. Those both relate to each other.) Oh and I suppose making the process fast. The last one helps, because the more trees you have the more accurate the results. You can test with 1 tree, but more is always better statistically. There are a world of improvements that can be made but that's outside the scope of that explanation, more on some of those another time.
"Random Forest" your time has come.
4/28/2012
O_o ? Random forest is a data mining technique that has been around quite a few years. To be fair I only first read about it about 6 months ago, but when I saw it I knew someday I would be implementing it. What is random forest? Well, you can look it up on wikipedia but I think you'll be disappointed in the explanation (I was). With a little searching I found this nice explanation. copied in its entirety to preserve it lest the linking page goes away. Random Forest explained
Author:Edwin Chen

Suppose you're very indecisive, so whenever you want to watch a movie, you ask your friend Willow if she thinks you'll like it. In order to answer, Willow first needs to figure out what movies you like, so you give her a bunch of movies and tell her whether you liked each one or not (i.e., you give her a labeled training set). Then, when you ask her if she thinks you'll like movie X or not, she plays a 20 questions-like game with IMDB, asking questions like "Is X a romantic movie?", "Does Johnny Depp star in X?", and so on. She asks more informative questions first (i.e., she maximizes the information gain of each question), and gives you a yes/no answer at the end.

Thus, Willow is a decision tree for your movie preferences.

But Willow is only human, so she doesn't always generalize your preferences very well (i.e., she overfits). In order to get more accurate recommendations, you'd like to ask a bunch of your friends, and watch movie X if most of them say they think you'll like it. That is, instead of asking only Willow, you want to ask Woody, Apple, and Cartman as well, and they vote on whether you'll like a movie (i.e., you build an ensemble classifier, aka a forest in this case).

Now you don't want each of your friends to do the same thing and give you the same answer, so you first give each of them slightly different data. After all, you're not absolutely sure of your preferences yourself -- you told Willow you loved Titanic, but maybe you were just happy that day because it was your birthday, so maybe some of your friends shouldn't use the fact that you liked Titanic in making their recommendations. Or maybe you told her you loved Cinderella, but actually you *really really* loved it, so some of your friends should give Cinderella more weight. So instead of giving your friends the same data you gave Willow, you give them slightly perturbed versions. You don't change your love/hate decisions, you just say you love/hate some movies a little more or less (you give each of your friends a bootstrapped version of your original training data). For example, whereas you told Willow that you liked Black Swan and Harry Potter and disliked Avatar, you tell Woody that you liked Black Swan so much you watched it twice, you disliked Avatar, and don't mention Harry Potter at all.

By using this ensemble, you hope that while each of your friends gives somewhat idiosyncratic recommendations (Willow thinks you like vampire movies more than you do, Woody thinks you like Pixar movies, and Cartman thinks you just hate everything), the errors get canceled out in the majority. Thus, your friends now form a bagged (bootstrap aggregated) forest of your movie preferences.

There's still one problem with your data, however. While you loved both Titanic and Inception, it wasn't because you like movies that star Leonardio DiCaprio. Maybe you liked both movies for other reasons. Thus, you don't want your friends to all base their recommendations on whether Leo is in a movie or not. So when each friend asks IMDB a question, only a random subset of the possible questions is allowed (i.e., when you're building a decision tree, at each node you use some randomness in selecting the attribute to split on, say by randomly selecting an attribute or by selecting an attribute from a random subset). This means your friends aren't allowed to ask whether Leonardo DiCaprio is in the movie whenever they want. So whereas previously you injected randomness at the data level, by perturbing your movie preferences slightly, now you're injecting randomness at the model level, by making your friends ask different questions at different times.

And so your friends now form a random forest.

So you are gonna implement that? But use what you've learned up to this point to try and improve it? Yes. Yes I am and it will be glorious Damn Straight.
a new contest and more scoring stuff
4/27/2012
I went and had a look at kaggle.com to see if I had missed any new contests starting. I had!wooo wooo! They have a new contest which is right up my ally, Predicting a Biological Response. This contest is setup exactly like the previous. That is, I have a training set of data and test set of data. All attributes for each datum are provided. The datum is classified either as being responsive "1" or not responsive "0". I can for my purposes consider that a single group that you are either in or not. Since they are polar opposites and you can never be in both groups, if I did consider it two groups I end up with the same result in reverse on each line (which would score to a one or a zero any way).
Great, so have you submit a result? It sounds like other than loading the data there is no work to do here Yes, I've loaded the data and submitted a result. But unfortunately there is work to do here. my initial submission put me dead last in 338 teams.hahahhahah loser but there is a very good reason.we already covered that. loser The way they are scoring isn't built for predicting a 1 or a 0. they are using a log loss model (you can read up on it here), which basically produces an infinity when you guess wrong with a 100% certainty. Now my score wasn't infinity, so they are clearly considering an infinity to be a very large number but not infinity. The moral of the story is you can dramatically improve your score just by moving the score over a tiny bit off of 1 or 0 (assuming you guessed at least 1 wrong).
And, how did you score? Initially, I came in dead last of 338 teams with a score of 14.+++ the number one score is currently around .40++ . When I moved my predictions over a tiny bit my score dropped to 4.+++ . any other details?There are still 48 days left in the contest it's about half over. The top prize is $10,000 dollars and they expect to see how you did it to award the prize (as well as an explanation).
What I need to get busy doing is figuring out a good way to turn my predictions in to a percentage chance. I think I know how I'm going to do it at first at least. I'll build a normal curve of the prediction's distance from 0 typically I score each group in terms of positive and negative distance from zero. positive is a success, negative is not. Then turn that in to a percentage. So zero would be a 50% which is exactly right. and -100 would be closer to 0 and +100 would be closer to 1. I'll probably do all my calculations then artificially scale the number down to 0.02 to 0.98 just to make sure no infinities crop in to the scoring.
I don't want to sound like a broken record, but I don't want sound like a broken record, but I don't want to sound...
4/25/2012
Dear lord, this is about data mining again isn't it? And I bet it's more crap without picturesIt so is!!:DAll right, what innocuous thing have use done this time? Well, I've made further improvements, and my locally tested score is getting better. I'll share some numbers but first a word about scoring....joy.
there are really 3 scores I look at. They all approach 1 one as the algorithm get's closer and closer to the perfect prediction but each score measures different things so, depending on what you are looking for 1 may be preferable over the other. They are: accuracy, F-Score average and F-Score total.
Accuracy is simply the percentage correct over the total possible answers. This may seem like the best way to score, but it's not. Here's why, if you are picking yes or no for each item and 90% of the answers are no. Then naively picking no every time will produce a 90% result. The thing is the value is in picking out the Yeses. not the Nos. I list it mainly for insight into, well, overall accuracy. if my program is not naively picking all one or the other there is going to be a uniformity of error. So, when I see 90% (in the above example) it's generally going to be more like 90% of the yeses were right and 90% of the nos were right. not always mind you but generally.
F-Test scores are more complicated but are essentially a measure of explained variability divided by unexplained variability ... times 2. if you don't double the score it's actually a 0 to 0.5 score 0 to 1 is what I want. So what is variability? Wow, ok... I don't want to give a long winded explanation here that no one will care about,read or understand. So I'm going to jump to the chase and show the final result and let you stew on it. Then explain a little.
private double fTestMeasure(Set userLabels, Set trueLabels) {
  double commonSize = intersection(userLabels, trueLabels).size();
  long userLabelsSize = userLabels.size();
  long trueLabelsSize = trueLabels.size();

  double fmeasure = 0D;

  if (userLabelsSize != 0L) {
    double precision = commonSize / userLabelsSize;
    double recall = commonSize / trueLabelsSize;

    if (precision + recall > 0D)
      fmeasure = 2.0D * precision * recall / (precision + recall);
  }
  return fmeasure;
}
Consider That userLabels is the set of Yeses we selected. Also consider trueLabels to be the real set of answers (the correct yeses). commonSize is the overlap (what we got right). Or, another way to think about it, really there are 4 possible out comes. We predict yes, and the answer is yes yea!group a. We predict yes and the answer is noboo!group b. we predict no and the answer yesboo!group c. we predict no and the answer is no yea!group d. commonSize is group a. userLabelsSize is group a and group b. trueLabelsSize is group a and group c. Using these values we get our measure.
But you have two measures for f-test? I measure two different sets of solutions. There are many different groups being predicted. I measure each result and average them for one score and for the other I measure results of everything all at once.
So...? when the contest ended I had a best score (f-test average) of 0.439 the winning score was 0.535 the score to beat!. I don't have the answer set so I can't continue to do the exact same scoring this changes at the end of the month when they give us the answers. To continue working on this, I split my training data in half and use half to do testing with and have to train with. This made my score worse but at least I now have a mechanism to self score. my base method (which scores 0.43 not 0.439. 0.439 has an improvement on the base method) scores 0.954139 (accuracy) 0.38439 (f-test avg) 0.40644 (f-test overall). My most recent best method has since score 0.943880 (accuracy) 0.41057543 (f-test avg) 0.41424 (f-test overall). which is a nice improvement! note that the actual accuracy has dropped and the f-test score has gone up. this is due to me actually picking out more correct answers but getting more of the "no"s wrong, but since yeses are so much more rare it's a gain. I think that's enough for today!
You think we're sleeping in Dusseldorf? You think we're taking a nap in Cologne? No, we're working at night - each night a new dial, a new knob, a diode, a transistor ...
4/20/2012
Not to much to say in terms interesting news. I've been quietly working on improving the DMT (data mining tool) acronyms for the win! and for the lazy. Thus far I have nothing to show for it. I could go on about what I'm doing exactly but really it would be far more easily understood with pictures. So at some point I'll throw them up for you all to see. Just not right now.
I did have one idea that is worth sharing. Once I get this thing working well, I think I'm going to apply it to financial analysis. How? First, I'll grab a list of stocks probably everything from AMEX, NYSE and NASDAQ and their daily stock prices for a couple years. Then I'll setup attributes for each stock representing key financial statistics. I think those statistics all have to be relative to stock price or possibly market capitalization. This is done so that there is some reference point day to day on how good that attribute is. I'll probably also have to organize the data by industry. For example, I won't want to compare agricultural to pharmaceutical. They may be similar but one industry works off of a difference set of ideals in valuation than the other. I can try mushing it all together to see if it does perform similarly but in case it doesn't they will be stored separately. Finally, I setup some sort of criteria for groupings of buy and sell (hold will be everything else) based on how a stock ended up performing in the following months. Once that's all setup, I run the tool over the data and presto if you did everything right stock picks.
Aren't other people doing that sort of thing? Yep! It'll be my take on it. I'll just be one of many... many people doing that sort of thing. I need to get my algorithm in the same league as top performers in the last contest first though. And it'll never be perfect because the world is full of random events that make the best analysis fail. Hurricanes, early freezes, earthquakes, war... etc. all of those things can throw a monkey wrench in to the predictions. Regardless, once everything is performing really well, I'll attack that stock data and see what I can do.
RSS feed works... maybe
4/11/2012
I put together an RSS feed for the site. No idea if it works. I'm not much of an RSS guy. I've had to work with them before but I don't actively use RSS for notifications or reading articles. So, if there are any problem with it, sorry in advance. If not well great! :D