Dim Red Glow

A blog about data mining, games, stocks and adventures.

Gradient Boosting (Part 3)

I've been trying to find my next logical step in data mining development, and I think I've settled on it. I'm going to make a tree that uses boosting in its fundamental splitting nature. It would be nice if the final results from the tree are better in general than my current trees, but if it can remove all trace of bias and move towards a good, uniformly evaluated answer, it will be a better tree to use in subsequent ensembles.

Currently my tree-building mechanism tries to get the most exact answer possible by grouping like-scoring training rows. The algorithm looks for features to use as splitters that move rows down the left side of the tree when they score closer to negative infinity and down the right side when they score closer to positive infinity. This lets me do feature selection using the correlation coefficient instead of random selection.

This turns out to be a great way to select features. Not only does the selected feature work towards a good separation, it also has the next step's split built into the separation; that is, the feature correlates with the final score as a whole. Subdividing the training set so part goes left and part goes right might be just as accurate with a different feature that doesn't correlate as well, but the mismatches should be closer to correct in the version with the higher correlation coefficient.
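To make that concrete, here's a rough Python sketch of the selection step (just the gist, not my actual implementation; `pearson` and `pick_splitter` are made-up names for illustration):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def pick_splitter(rows, target):
    """Pick the feature index whose values correlate most strongly
    (in absolute value) with the target score."""
    n_features = len(rows[0])
    scores = [abs(pearson([r[i] for r in rows], target))
              for i in range(n_features)]
    return max(range(n_features), key=lambda i: scores[i])
```

The winner becomes the splitter for the current node, instead of a feature drawn at random.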

I've been making my trees with some form of this method for years. I do lose the ability to discern good secondary features unless they show up due to bagging, which is something I'd like to address in the new version. This is also the bias I was talking about in the opening paragraph: my bias is towards the highly correlated features.

I also, much more recently, moved to a "logistic function" (sigmoid) style splitter. I build a curve for each feature that represents the feature's distribution. It then gets combined with the 2nd, 3rd, etc. features I choose to use (depending on the number of features or the type of data) to determine which path down the tree the data goes. This works wonderfully; it gave me my last big boost in random forest before I moved on to gradient boosting, which also makes good use of those stub trees.
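For a single feature, the idea looks roughly like this in Python (a simplified sketch; fitting a logistic curve to a plain mean/std is cruder than what the real distribution fit does):

```python
import math

def sigmoid_split(value, mean, std):
    """Probability of sending a row down the right branch, using a
    logistic curve fit to the feature's rough distribution (mean/std)."""
    if std == 0:
        return 0.5  # uninformative feature: coin flip
    z = (value - mean) / std
    return 1.0 / (1.0 + math.exp(-z))
```

A value at the feature's center gives 0.5; values far into either tail push the probability towards 0 or 1.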

The new tree I'm going to make will endeavor to take the best of all worlds, as it were. I'll use every feature and build a splitter from the net total sum of where the data says it should go. I'll weight each opinion by its overall correlation to the final score (this might change, but you should have some system for setting importance, otherwise it's noise), and finally I'll do the process iteratively where, like in GBM, each pass attempts to adjust the result of the previous one to hone in on the actual correct score.
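A back-of-the-napkin sketch of that weighted-vote splitter (illustrative only; the function names and the plain mean/std sigmoid are simplifications of what I have in mind):

```python
import math

def sigmoid_split(value, mean, std):
    """Per-feature opinion: probability of routing the row right."""
    if std == 0:
        return 0.5
    return 1.0 / (1.0 + math.exp(-(value - mean) / std))

def weighted_route(row, feature_stats, correlations):
    """Every feature votes on left vs. right via its sigmoid curve;
    each vote is weighted by |correlation with the final score|.
    feature_stats is a list of (mean, std) pairs, one per feature."""
    total, weight_sum = 0.0, 0.0
    for value, (mean, std), corr in zip(row, feature_stats, correlations):
        w = abs(corr)  # importance weight; without one, it's all noise
        total += w * sigmoid_split(value, mean, std)
        weight_sum += w
    p_right = total / weight_sum if weight_sum else 0.5
    return 'right' if p_right > 0.5 else 'left'
```

The iterative part would then wrap this: each pass re-fits the votes against the residual of the previous pass, GBM-style.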

Once a tree is finished I'll return it to a normal GBM tree ensemble mechanism, which will make a new tree and repeat the process. In the case of a binary classifier as your final score, the first tree will more likely than not be a single-depth tree, because it only has a binary classifier to work with. But the trees after that will start to get some depth to them as the process finds more and more resolution. I'll need to set a maximum allowed depth, much like in a normal tree-building exercise.

Give me a few days and we'll see how it goes.

Training for a marathon (part 2)

Just a small post today. I think I'm going to do updates on the marathon training once a week. I'll give mileage from the last update and the most recent timed-run data I have. I don't time most runs, as wearing the watch is a bit annoying, but I'll try to do it the day before or the day of the post. Originally I was going to do an update every time I ran, but that's just too much/boring.

But while I'm here I'll say: I did run today and put in another 4 miles. The way the weather is shaping up, I'll probably run some more this weekend. I also weighed myself (I usually do in the morning) and my weight was down to 196.2. I only mention this because it's important to know that weight tends to vary ±2 pounds around some actual center weight. That's just another reason not to obsess about it. If you see your own weight vary a lot from day to day (well, a few pounds anyway), that's just food and water moving through you.

Training for a marathon (part 1)

I enjoy running. Well, most of the time. I've run in some form regularly for most of my life. I started in track and field in junior high and then did it again in high school. I ran for fun and to stay in shape periodically in my 20s, and much more so in my 30s. In my mid 30s I decided it was time to try a marathon, partly because I wanted to see what it was like, partly because it was an 'Everest' sort of thing.

I ended up running 3 marathons in about 2 years. I did the Rock 'n' Roll Marathon in St. Louis 2 years in a row and the Beat the Blerch one in Seattle the first year they had it. The St. Louis Rock 'n' Roll event stopped having full marathons last year (I guess turnout was too low), so that wasn't an option. The Beat the Blerch one, while fun, was never meant to be a recurring thing for me. I just wanted to travel and try it.

So, I didn't run a marathon at all last year. I guess I could have found one, but I was 'meh' in terms of motivation. I've also had chronic plantar fasciitis. I think I've finally beaten that thanks to a number of things: stretching, lots of time off, changing my running shoes to Newtons (which emphasize toe striking, not heel striking), and changing my walking-around shoes to a minimalist cross trainer (over a running shoe). I've always had mild problems with foot pain because I tend to be a heel-toe walker/runner. I think, however, that what caused this chronic condition was something else. I trained for a 5k (I was running for time) at the same time I was training for the 2 marathons I did. When one marathon was done and the 5k was run, I was already suffering pretty constant pain, and I didn't give myself any real time off to heal since I had 1 more 2 months later. That, I think, sealed the condition.

Skipping ahead about a year and a half: I'm 40 now, and it seems like as good a time as any to try again. I never really thought marathons would be my thing, and they probably won't be long term, but I do have a few goals that I didn't accomplish in the 3 I ran. I wanted an under-4-hour time (4:25 was my best), and I actually even wanted to do a 3:30 time once I did the 4. Some day I thought it might be fun to do a biathlon or triathlon as well, but I want to hit those goals first.

I've spent the last 3 or 4 weeks running periodically and mildly dieting, trying to get my body back into some sort of shape. I can do my "stay in shape" run now without much thought. I did it today; here are the stats:

distance: 4.02 miles

time: 37:46.6

pace: 9:23 /mile

weight: 197.6

The watch I have that tracks this stuff has heart rate and GPS in it too, though I'm not sure if it keeps all of it. At some point I'll have to hook it up to my computer, download the data, and see what is what. My current weight is a heavy 197.6 (heavy for me anyway). A few things I'd like to point out: in a perfect world I'd weigh about 20 pounds less (I'm 5'9"–5'10"). This is especially true when I'm about to run the marathon (lighter means less weight to carry). I've never quite reached it when training previously and always end up around 180–185. So that's a goal of sorts to go along with the 4-hour run. One really feeds the other: if you put in the miles, the weight goes away, and that helps you run faster. I'm not what you'd call hardcore about this, though, which is why previous attempts have met with limited success.

The mileage I want to be putting in eventually will be around 40 a week, probably 4-5 days a week at 8-10 miles each, with maybe a big run every other week of 15-20. Right now I'm probably doing about 8 (two 4-mile runs), though the weather is changing and I'll be putting in a lot more soon. This weekend I'll probably put in 10 more, and I'll probably put in 4 more tomorrow. I won't go too nuts till I pick a marathon and get a timeline. Training for a marathon eats up a lot of time.

Diet is something I need to think about too. I've actually been losing weight, sort of... I started a low-carb diet, but I haven't been running enough or been strict enough to really shed the weight. I was actually around 201 when I started about a month ago, lost 5 pounds real quick, and have floated there for the last month, probably taking in around 60 or so grams of carbs a day on average. That's going to stop now, though. I want the energy to train with, and not eating starches puts undue strain on me. Yes, you will lose weight faster, but this isn't about losing weight overnight.

It's worth mentioning that a regular exercise routine is important for a long life and for looking and feeling healthy, and once the marathons are over I plan to continue running regularly. I'll try to give updates on the days I run so anyone interested can follow along.

Improving t-sne (part 3)

This post will be the last for the immediate future on t-sne, and really it will just cover what I've done and how it all ended up. I have a new direction to head, and while I may use t-sne, I don't think it will be the main focus (though I guess it could be! Surprises do happen).

So without further ado, let me summarize what I've written about so far. T-sne is an algorithm that reduces the dimensions of a data set while preserving the statistical probability dictated by the distance between rows of data and the features therein. It does this by randomly distributing the points in whatever N-dimensional space you choose, then slowly moving the points iteratively in a way that best reflects the same probability distribution that existed in the higher-order space. You could, I suppose, increase dimensions this way too if you wanted to.

I covered how I would speed this up at the end of part 1. I use a series of smaller "mini" t-snes and combine them together. All of those t-snes share a large number of the same data points, which are seeded into the same positions. I use those points to stitch together all the various frames, using the closest point as an anchor to find each point's final relative position in one key frame that holds all the points. In that way all points from all the frames are mapped onto one frame. This seems to work really well; the results aren't exact, but the run times are linear. (I'm calling this a win.)
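The stitching idea, sketched in Python (a toy version with 2-d points; `stitch` and the dict-based frame layout are just for illustration):

```python
def stitch(key_frame, mini_frame, shared_ids):
    """Map every point from a mini t-sne frame into the key frame's
    coordinates. Points present in both frames act as anchors: each
    other point keeps its offset from its nearest anchor.
    Frames are dicts: point_id -> (x, y)."""
    placed = dict(key_frame)
    anchors = [p for p in shared_ids if p in mini_frame and p in key_frame]
    for pid, (x, y) in mini_frame.items():
        if pid in placed:
            continue
        # nearest anchor to this point, measured inside the mini frame
        a = min(anchors, key=lambda p: (mini_frame[p][0] - x) ** 2
                                      + (mini_frame[p][1] - y) ** 2)
        ax, ay = mini_frame[a]
        kx, ky = key_frame[a]
        placed[pid] = (kx + (x - ax), ky + (y - ay))  # keep local offset
    return placed
```

Running this once per mini frame folds everything into the single key frame.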

In the 2nd part I covered how one can augment the results to get a clear separation of data points based on a particular score. In short, it's not magical: you build features with a data mining tool and provide the results as new features, then use only those as the t-sne inputs. In my case my GBM does some sub-selecting and bagging internally; because of this, my outputs are never exact. This makes for a very good set of inputs for seeing the separation of the scores.

Unfortunately, the results you get don't provide a way to improve the score. I had hoped they would, but truly it's just another way of looking at the final output. If you train on a subsection of the data, feed the t-sne results back into the GBM machine, and retrain using the other part, no overall improvement will be found. (I'm calling this a loss, or at best a draw. Yes, I did get wonderful separation, which was my starting goal, but it's nothing more than a visualization tool and didn't lead to something that actually "improves" anything from a data mining perspective.)

It's worth noting that if you don't remove the bias and go ahead and retrain from the t-sne result using the same data you originally fed it to get the scores that built the t-sne results, it will slowly get better and better, but this is just a serious case of over-fitting. GBM does such a good job of finding connections between data points that it will emphasize any aspects represented in the new t-sne feed that allow it to get closer to the right score.

I tried this out and found I could turn a .47 result (log-loss) into a .43 result by adding in the t-sne data. Then I could further reduce it all the way to .388 by letting the algorithm pick and choose the features to use (well, if brute-force trial and error counts as pick and choose). The point is, it's not useful for improving the results on the test set (which actually scores .48 when the cross-validation set scores .388).

I'll leave this right there for now.

Improving t-sne (part 2)

I ended last time saying, "This brings me to the big failing of t-sne. Sometimes (most of the time) the data is scored in a way that doesn't group on its own." I intend to try tackling that in today's post. I'll tell you what I know, what I think might work, and the initial results I've gotten so far.

Let me start by saying I've been casually active in two Kaggle contests as of late: https://www.kaggle.com/c/santander-customer-satisfaction and https://www.kaggle.com/c/bnp-paribas-cardif-claims-management . Both contests use a binary classifier for scores; one scores using log-loss (which penalizes you very harshly for predicting wrong) and the other uses AUC (ROC if you prefer), though their AUC is predicted probability vs. observed probability, which is a little weird for AUC, but whatever.

Both of these contests are kind of ideal for t-sne graphing in that there are only 2 colors to represent. They also represent different extremes: one has a very weak signal and the other is more of a 50/50. That is, in the weak-signal case the "no"s far outnumber the "yes"s, and it's easy to have the signal lost in the noise when the data is like that. If we can leverage t-sne to reduce our dimensions down to 2 or 3 and see via a picture that our groupings of positives and negatives are where they need to be, we probably really will have something.

T-sne gives us 1 very easy way to augment results and 1 not-so-easy way to augment them. The very easy way is to change how it calculates distances. It's important to remember that at its core it's trying to assign a probability to the distance between every point and every other point, and then rebalance all the point-to-point probabilities in lower dimensions. If we augment how it evaluates distance, say by expanding or collapsing the distance between points in a uniform way, we can change our results, as certain dimensions will have more or less importance. The not-so-easy way I'll talk about further down the article.
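The "very easy way" amounts to nothing more than a per-dimension scale inside the distance function. Something like this (illustrative sketch; the usage-based shrinking I describe below is just one way to pick the scales):

```python
import math

def weighted_distance(p, q, scales):
    """Euclidean distance where each dimension is first stretched or
    shrunk by a per-dimension scale, changing its importance to t-sne."""
    return math.sqrt(sum((s * (a - b)) ** 2
                         for a, b, s in zip(p, q, scales)))
```

A scale of 0 removes a dimension entirely; a scale above 1 makes it dominate the pairwise probabilities.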

I tried at first to improve the solutions using various mathematical transforms. Some were me taking dimensional values to powers (or roots). I tried principal component analysis. And I even tried periodically rebuilding the probabilities based on some ad hoc analysis of where the training data was laid out. All of this was a mess. Not only was I never really successful, but in the end I was trying to build a data mining tool inside my data mining tool!

Skipping a few days ahead, I started looking at ways to import results from a data mining tool to augment the dimensions. I tried using data from gradient boosting: I tracked which features got used and how often, then shrunk each dimension by the proportion it was used. So if a feature/dimension was used 10% of the time (which is a TON of usage), that dimension got reduced to 10% of its original size; if it was used 0.1% of the time, it got reduced to 1/1000th of its size. This produced... better results, but how much better wasn't clear, and I definitely wasn't producing results that made me go "OMG!" I was still missing something.

Now we come to the hard way to augment the results: we can build better data for the t-sne program to use. This is what I did. I abandoned the usage information entirely and tried using entirely new features generated by GBM. The first step was to build a number based on how the row flowed down the tree: flowing left was a 0, flowing right was a 1. I actually recorded this backwards, though, since at the final node I want the 0 or 1 to be the most significant digit, so that like values end up on the same end of the number line. Since GBM runs iteratively, I could take each loop and use it as a new feature.
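The path-to-number encoding, as a quick Python sketch (illustrative; I treat the path as a binary fraction so the deepest split carries the most weight, which is the "backwards" recording I mentioned):

```python
def path_feature(splits):
    """Turn a row's path through a tree into a single number.
    `splits` is the ordered list of 0/1 decisions (0 = left, 1 = right)
    from the root down. The LAST split becomes the most significant
    bit, so rows that end up in similar leaves land near each other
    on the number line."""
    value = 0.0
    for i, bit in enumerate(reversed(splits)):
        value += bit / (2 ** (i + 1))  # binary fraction, deepest split first
    return value
```

One such number per boosting iteration gives you one new feature per loop.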

This didn't really work, I think because a lot of the splits are really noisy, and even with me making the last split the most significant digit, it's still way too noisy to be a useful way to look at the data. This brings me to my most recent idea, which finally bears some fruit.

I thought, "Well, if I just use the final answers instead of the flow of data, not only do I get a method for generating features that can produce either real or categorical answers (depending on the data you are analyzing), but you can also use any method you like to generate the features. You aren't stuck with fixed-depth trees." I stuck with GBM but turned off the fixed-depth aspect of the tree and just built some results. I ran 5 iterations with 9-fold cross validation and kept the training set's predictions from the tree-building exercise. The results are really promising. See below. (Note: I will continue to use my pseudo-t-sne, which runs in linear time, due to time constraints. If you want "perfect" results you will have to go elsewhere for now.)
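The general shape of that feature-generation step, sketched in Python (the `train_and_predict` callback is a hypothetical stand-in for whatever model you use; GBM in my case):

```python
def out_of_fold_features(rows, targets, n_folds, train_and_predict):
    """Build a new feature column from k-fold out-of-fold predictions:
    each row's value comes from a model that never saw that row during
    training. `train_and_predict(train_rows, train_targets, test_rows)`
    returns one prediction per test row."""
    n = len(rows)
    preds = [None] * n
    for fold in range(n_folds):
        test_idx = [i for i in range(n) if i % n_folds == fold]
        train_idx = [i for i in range(n) if i % n_folds != fold]
        fold_preds = train_and_predict([rows[i] for i in train_idx],
                                       [targets[i] for i in train_idx],
                                       [rows[i] for i in test_idx])
        for i, p in zip(test_idx, fold_preds):
            preds[i] = p
    return preds
```

Repeat once per boosting iteration and you get one column per pass, which is then the entire t-sne input.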


You can see the different folds being separated, since in this implementation they don't cover all the data (it zero-filled the rows it was missing). I have since fixed this. But yes, there were 9 folds.


Santander is the one I'm most impressed with; it's really hard to isolate those few positives into a group. BNP... is, well, less impressive. It was an easier data set to separate, so the results are meh at best. The rubber will really meet the road when I put the test data in there and do some secondary analysis on the 2-d version of the data using GBM or random forest.

It's important to know that I'm "cheating": I build my boosted trees from the same training data that I'm then graphing. But due to GBM's baby-step approach to solving the problem, and the fact that I use bagging at each step of the way to get accuracy, the results SHOULD translate into something that works for the testing data set. I'll only know for sure once I take those results you see above, put the test data in as well, and then send the whole 2-d result into another GBM and see what it can do with it. I'm hopeful (though prior experience tells me I shouldn't be).

*Note: since I initially put this post together, I sat on it because I wanted to try it before I shared it with the world. I have finished using the initial results on the BNP data and they are biased... I think for obvious reasons: training on data, then using the results to train further. I'll talk more about it in t-sne (part 3). If you want to read more about it right now, check out this thread: https://www.kaggle.com/c/bnp-paribas-cardif-claims-management/forums/t/19940/t-sne-neatness/114259#post114259 *

For fun, here are the animations of those two images being made. (They are really big files, 27 and 14 megs... also Firefox doesn't like them much; try IE or Chrome.)





The ban update is in, Eye of Ugin is out

Well, they posted the new ban list for Magic. Poor Eye of Ugin, you were... so broken. It was totally the right choice, though. Two free mana for any Eldrazi spell when there are so many good, fairly cheap Eldrazi? Yes please. They made the right decision. And really, Eldrazi is still totally viable. I'm going to try keeping the Modern, Standard and Legacy decks I have together for now... though I might tweak the Modern one some.

They unbanned 2 cards, which I found surprising. Sword of the Meek enables a pretty ridiculous combo, and Ancestral Vision just makes blue decks, and possibly any cascade decks, better. If they have room for it, that is.

The sword combo deck should change the meta. It's too strong a combo for some people not to play it. I'll eventually get my hands on a playset and see if I like playing it, but I'm in no hurry.

However, Ancestral Vision will just make some decks better. There is no Shardless Agent or Bloodbraid Elf in Modern, so you can't get a decent creature AND trigger Visions in 1 go, at least not for 3 or 4 mana. I'm sure some decks will just throw it in for the draw advantage 4 turns later for the long game. There might be a way to cheat it into play besides cascade. It occurs to me just now, I suppose you could use the new Shadows over Innistrad card Brain in a Jar to cast it outright. You would have to remove the Jar from play before the ability resolves and have had no counters on it to begin with (it uses the last known information about the Brain in a Jar card when resolving the ability). Regardless, I picked some of these up because they seem like a staple for deck building in Modern now that the card is legal.

Speaking of the new set, I have just a few small observations from playing in the prerelease(s). The green hate bears seem... nuts. White has some really good creatures, as well as Archangel Avacyn (which just wins games). Red got some nice aggro support, and black got more zombies (never a bad thing), not to mention a vampire or two. I can see black aggro variants and green/white or red/green creature decks being a thing, especially since Collected Company has not rotated out yet.

As I already mentioned, I'll continue to play my Standard Eldrazi for a while, probably changing it slightly to use a few of the newer cards... but I think (eventually) I'm going to try my hand at either a blue or blue/black mill deck. It seems like that could be the next thing. Blue didn't seem to get all the obvious love the other colors did, but I think it got enough stall between Jace, Unraveler of Secrets and Engulf the Shore, and mill cards like Startled Awake and Manic Scribe, that it could be a real threat, especially with a nicely timed Languish. That being said, add Sphinx's Tutelage to Forgotten Creation or Dark Deal and all at once the game gets over in a hurry.