Dim Red Glow

A blog about data mining, games, stocks and adventures.

another long overdue blog

Hi again, faithful readers (you know who you are). So the marathon is back off the plate. Basically, I can't afford to spend the extra cash right now on a trip. Or rather, I'm not going to borrow the money for a trip like that. I overspent when I thought the trip was off, and upon reflection, when it came time to book the trip it just didn't make sense. So... maybe next year.

So for the last month I've been toying with my data mining code base. I removed tons of old code that wasn't being used or tested and honed it down to just GBM. Then I spent some time seeing if I could make a version that boosts accuracy over log loss (normal GBM produces a very balanced approach), and I was successful. I did it by taking the output from one GBM and sending it into another, in essence overfitting.

Why would I want to do that? Well, there is a contest ( https://www.kaggle.com/c/santander-product-recommendation ) I tried to work on a little, but the positives are few and far between. The predictions generally put any given positive at less than a 1% chance of being right, so I tried bumping the numbers, since right now it returned all negatives. The results are actually running right now. I still expect all negatives, but the potential positives should have higher percentages, and I can pick a cutoff point for rounding to a positive that is a little higher than the .0005% I would have had to use before. All this so I can send in a result other than all negatives (which scores you at the bottom of the leaderboard). Will this give me a better score than all negatives... no idea :)
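The cutoff trick is simple enough to show in a few lines. This is just a hedged sketch of the idea, not my contest code; the probabilities and cutoff values are made up for illustration.

```python
# Sketch: turning rare-event probabilities into 0/1 labels with a tunable cutoff.
# The prediction values and thresholds below are invented for illustration.

def label_with_cutoff(probabilities, cutoff):
    """Round each predicted probability to a 0/1 label using a cutoff."""
    return [1 if p >= cutoff else 0 for p in probabilities]

# With a very rare positive class, the raw probabilities all sit near zero,
# so a naive 0.5 cutoff predicts all negatives.
preds = [0.0003, 0.012, 0.0001, 0.0048, 0.0092]
print(label_with_cutoff(preds, 0.5))    # all negatives
print(label_with_cutoff(preds, 0.005))  # a lowered cutoff recovers some positives
```

The whole game is choosing that second cutoff so the few predicted positives are worth more than the penalty for being wrong.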

I tried making a variation on the GBM tree I was using that worked like some stuff I did years ago. It wasn't bad, but still not as good as the current GBM implementation. I also modified the tree to handle time series data, in that it can lock one row to another and put the columns in time-sensitive order. This allows me to process multi-month data really well. It also gave me a place to feed in fake data: if I want to stack one GBM on top of another, I can send in the previous GBM's results as new features to train on (along with the normal training data).
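The stacking part is easier to see in code than in prose. This is a minimal sketch of the idea only; `first_model` is a stand-in function I invented for the example, not my actual GBM.

```python
# Sketch of stacking: append a first model's prediction to each row's features,
# then a second model trains on the widened rows. first_model is a stand-in.

def first_model(row):
    # pretend this is a trained GBM returning a probability for the row
    return sum(row) / (len(row) * 10.0)

def widen(rows):
    """Add the first model's output as an extra feature on each row."""
    return [row + [first_model(row)] for row in rows]

rows = [[1.0, 2.0], [3.0, 4.0]]
stacked = widen(rows)
# each row now carries the upstream prediction as its last feature
```

The second GBM then trains on `stacked` exactly as it would on normal training data.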

This leads me to where I think the next evolution of this will be. I'm slowly building a multi-layer GBM, which essentially is a form of neural net. The thing I need to work out is how best to subdivide the things each layer should predict. That is, I could make it so the GBM makes 2 or 1000 different groups, predicts each row's result for each group, and feeds those in for the next prediction, and so on until we get to the final prediction. The division of the groups is something that can probably be done using a form of multivariable analysis that makes groups out of variables that change together. Figuring out how to divide it into multiple layers is a different problem altogether.
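One hedged way to "group variables that change together" is greedy grouping on pairwise Pearson correlation. The threshold and helper names here are my invention, just to make the idea concrete.

```python
# Hypothetical sketch: group features whose values move together, using
# pairwise Pearson correlation and an arbitrary 0.9 threshold.
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def group_features(columns, threshold=0.9):
    """Put each column into the first group it correlates strongly with."""
    groups = []
    for name, values in columns.items():
        for group in groups:
            if abs(pearson(values, columns[group[0]])) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups

cols = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [5, 1, 4, 2]}
print(group_features(cols))  # "a" and "b" move together, "c" stands alone
```

Each group could then become one thing a layer predicts, with the group predictions fed forward as features.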

Do you want an AI? 'Cause this is how you get AIs! Heh. Seriously, that's what it turns into. Once you have a program that takes in data, builds a great answer in layers by solving little problems, and assembles them into a final answer that is super great, well, you pretty much have an AI.

Incidentally, t-SNE might also help here, as I might just feed the t-SNE results for that layer's training data (fake data included if we are a level or two down) to give the system a better picture of how things group statistically.

In other news, I started using Blue Apron. This is my first time trying a service like this, and so far I'm really enjoying it. I'm pretty bad about going to the grocery store. And going every week... well, that ain't gonna happen. This is my way to do that, without doing that :) . I'm sure most people have similar thoughts when they sign up, even if the selling point is supposed to be the dishes you are making. Honestly, I've just been eating too much takeout. I don't mind cooking, and the dishes they send you to prepare are for the most part really good.


Gradient Boosting (Part 3)

I've been trying to find my next logical step in data mining development, and I think I've settled on it. I'm going to make a tree that uses boosting in its fundamental splitting nature. It would be nice if the final results from the tree were better in general than my current trees', but if it can remove all trace of bias and move toward a good, uniformly evaluated answer, it will be a better tree to use in subsequent ensembles.

Currently my tree-building mechanism tries to build trees that get the most exact answer possible by grouping like-scoring training rows. The algorithm looks for features to use as splitters that move rows down the left side of the tree when they score closer to negative infinity, and down the right side when they score closer to positive infinity. This allows me to do feature selection using the correlation coefficient instead of random selection.

This turns out to be a great way to select features. Not only does the selected feature work toward getting a good separation, it also has the next step's split built into the feature separation; that is, it correlates with the final score as a whole. Subdividing the training set so part goes left and part goes right might be just as accurate using a different feature that doesn't correlate as well, but the mismatches should be closer to correct in the version with a higher correlation coefficient.
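Correlation-driven splitter selection can be sketched in a few lines: score each feature by the absolute value of its Pearson correlation with the target and split on the strongest. This is only an illustration of the selection rule; the feature names and data are invented.

```python
# Sketch: pick the split feature with the highest |Pearson r| against the target.
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def best_splitter(features, target):
    """features: dict of name -> column values. Returns the name with max |r|."""
    return max(features, key=lambda name: abs(pearson(features[name], target)))

features = {"x1": [1, 2, 3, 4], "x2": [4, 1, 3, 2]}
target = [10, 20, 31, 39]
print(best_splitter(features, target))  # x1 tracks the target almost perfectly
```

Because correlation is signed, the chosen feature naturally sends low scorers left and high scorers right.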

I've been making my trees using some form of this method for years. I do lose the ability to discern good secondary features unless they show up due to bagging. This is something I would like to address in the new version, and it's also what I was talking about in terms of bias in the opening paragraph: my bias is toward the highly correlated features.

I also, much more recently, moved to the logistic function (sigmoid) style splitter. I build a curve for each feature that represents the feature's distribution. It then gets combined with the 2nd, 3rd, etc. features I choose to use as well (depending on the number of features or type of data) to determine which path down the tree the data is set to go. This works wonderfully; it gave me my last big boost in random forest before I moved on to gradient boosting, which also makes good use of those stub trees.
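A minimal sketch of what a sigmoid-style splitter could look like, assuming the curve is centered on the feature's mean and scaled by its spread (that centering choice is my assumption, not necessarily the exact method described above):

```python
# Hedged sketch of a logistic (sigmoid) splitter: a feature value maps to a
# probability of going right. Centering on mean / scaling by std dev is an
# assumption made for this example.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def make_splitter(values):
    mean = sum(values) / len(values)
    spread = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5 or 1.0
    def go_right_probability(v):
        return sigmoid((v - mean) / spread)
    return go_right_probability

split = make_splitter([1.0, 2.0, 3.0, 4.0, 5.0])
print(split(3.0))  # value at the mean: 0.5, a coin flip
print(split(5.0))  # well above the mean: leans right
```

The appeal is that the split is a smooth function of the whole distribution rather than a single hard threshold.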

The new tree I'm going to be making will endeavor to take the best of all worlds, as it were. I'll be using every feature and building a splitter from the net total sum of where the data says it should go. I'll be weighting each opinion by its overall correlation to the final score (this might change, but you should have some system of setting importance, otherwise it's noise), and finally I'll be doing the process iteratively, where, like in GBM, each pass attempts to adjust the result of the previous one to hone in on the actual correct score.

Once a tree is finished I'll return it out to a normal GBM tree ensemble mechanism, which will make a new tree and repeat the process. In the case of a binary classifier as your final score, the first tree will more likely than not be a single-depth tree, because it only has a binary classifier to deal with. But the trees after that will start to get some depth to them as the process finds more and more resolution. I'll need to set a maximum allowed depth, much like in a normal tree-building exercise.

Give me a few days and we'll see how it goes.

Improving t-sne (part 2)

I ended up last time saying, "This brings me to the big failing of t-sne. Sometimes (most of the time) the data is scored in a way that doesn't group on it's own." I intend to tackle that in today's post. I will tell you what I know, what I think might work, and the initial results I've gotten so far.

Let me start out by saying I've been casually active in two Kaggle contests as of late: https://www.kaggle.com/c/santander-customer-satisfaction and https://www.kaggle.com/c/bnp-paribas-cardif-claims-management . Both contests use a binary classifier for scores; one scores using log loss (which penalizes you very harshly for predicting wrong) and the other uses AUC (ROC if you prefer), though their AUC is predicted probability vs. observed probability, which is a little weird for AUC, but whatever.

Both of these contests are kind of ideal for t-SNE graphing in that there are only two colors to represent. These contests also represent different extremes: one has a very weak signal and the other is more of a 50/50. That is, in the weak-signal case the "no"s far outnumber the "yes"s. It's easy to have the signal lost in the noise when the data is like that. If we can leverage t-SNE to reduce our dimensions down to 2 or 3 and see via a picture that our groupings for positives and negatives are where they need to be, we probably really will have something.

T-SNE gives us one very easy way to augment results and one not-so-easy way to augment them. The very easy way is to change how it calculates distances. It's important to remember that at its core it is trying to assign a probability to a distance for every point vs. every other point, and then rebalance all the point-to-point probabilities in a lower dimension. If we augment how it evaluates distance, say by expanding or collapsing the distance between points in a uniform way, we can change our results, as certain dimensions will have more or less importance. The not-so-easy way I'll talk about later down the article.
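The "easy" augmentation is just a weighted distance. Here's a minimal sketch; the weight values are illustrative, not from any real run.

```python
# Sketch: scale each dimension by a weight before computing pairwise distance,
# so weighted dimensions matter more or less in the t-SNE probabilities.
import math

def weighted_distance(p, q, weights):
    return math.sqrt(sum((w * (a - b)) ** 2 for a, b, w in zip(p, q, weights)))

p, q = [0.0, 0.0], [1.0, 1.0]
print(weighted_distance(p, q, [1.0, 1.0]))  # plain Euclidean distance
print(weighted_distance(p, q, [1.0, 0.1]))  # second dimension nearly ignored
```

Feed these distances into the probability computation and the embedding reorganizes around whichever dimensions you kept large.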

I tried at first to improve the solutions using various mathematical transforms. Some were me taking dimensional values to powers (or roots). I tried principal component analysis. And I even tried periodically rebuilding the probabilities based on some ad hoc analysis of where the training data was laid out. All of this was a mess. Not only was I never really successful, but in the end I was trying to build a data mining tool inside my data mining tool!

Skipping a few days ahead, I started looking at ways to import results from a data mining tool to augment the dimensions. I tried using data from gradient boosting: I tracked which features got used and how often. Then, when I used that data, I shrunk the dimensions by the proportion they were used. So if a feature/dimension was used 10% of the time (which is a TON of usage), that dimension got reduced to 10% of its original size. If it was used 0.1% of the time, it got reduced to 1/1000th of its size. This produced... better results, but how much better wasn't clear, and I definitely wasn't producing results that made me go "OMG!" I was still missing something.
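That shrinking rule is a one-liner over the data. The usage fractions below are invented for the example.

```python
# Sketch of scaling dimensions by GBM feature usage: a feature used in 10% of
# splits keeps 10% of its size, one used 0.1% keeps 1/1000th, and so on.

def shrink_columns(rows, usage):
    """Multiply each column by how often the feature was used (0..1)."""
    return [[value * u for value, u in zip(row, usage)] for row in rows]

rows = [[10.0, 10.0, 10.0]]
usage = [0.10, 0.001, 0.05]
print(shrink_columns(rows, usage))  # each column shrinks to its usage fraction
```

After the scaling, ordinary Euclidean distance inside t-SNE automatically de-emphasizes the rarely used features.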

Now we come to the hard way to augment the results: we can build better data for the t-SNE program to use. This is what I did. I abandoned the usage information entirely and tried using entirely new features generated by GBM. The first step was to build a number based on how the row flowed down the tree: flowing left was a 0, flowing right was a 1. I actually recorded this backwards, though, since at the final node I want the 0 or 1 to be the most significant digit, so that like values end up on the same end of the number line. Since GBM is run iteratively, I could take each loop and use it as a new feature.
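The path-to-number encoding looks like this in a sketch (my helper name, but the reversal matches the description above):

```python
# Sketch: record left=0 / right=1 at each split, then read the path backwards
# so the final (deepest) decision is the most significant bit, putting similar
# leaves near each other on the number line.

def encode_path(path):
    """path: list of 0/1 split decisions from root to leaf."""
    bits = list(reversed(path))  # last decision becomes most significant
    value = 0
    for bit in bits:
        value = value * 2 + bit
    return value

print(encode_path([0, 1, 1]))  # read as binary 110 -> 6
print(encode_path([1, 1, 0]))  # read as binary 011 -> 3
```

Each GBM iteration's tree yields one such number per row, giving one new feature per iteration.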

This didn't really work. I think it's because a lot of the splits are really noisy, and even with me making the last split the most significant digit, it's still way too noisy to be a useful way to look at the data. This brings me to my most recent idea, which finally bears some fruit.

I thought, "Well, if I just use the final answers instead of the flow of data, not only do I get a method for generating features that can produce either reals or categorical answers (depending on the data you are analyzing), but I can also use any method I like to generate the features. I'm not stuck with fixed-depth trees." I stuck with GBM but turned off the fixed-depth aspect of the tree and just built some results. I ran 5 iterations with 9-fold cross validation and kept the training set's predictions from the tree-building exercise. The results are really promising. See below. (Note: I will continue to use my pseudo-t-SNE, which runs in linear time, due to time constraints. If you want "perfect results" you will have to go elsewhere for now.)
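One hedged version of that cross-validation loop, with the model replaced by a stand-in that just predicts the mean of its training fold (the real version would train a GBM per fold; the fold assignment by index is my simplification):

```python
# Sketch: for each fold, train on the rest and keep predictions for the
# held-out rows, so every row gets a model-derived value to feed into t-SNE.
# The per-fold "model" here is a stand-in mean, not a real GBM.

def kfold_oof_predictions(rows, scores, folds=3):
    oof = [0.0] * len(rows)
    for f in range(folds):
        train_scores = [s for i, s in enumerate(scores) if i % folds != f]
        fold_mean = sum(train_scores) / len(train_scores)  # stand-in model
        for i in range(len(rows)):
            if i % folds == f:
                oof[i] = fold_mean
    return oof

scores = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(kfold_oof_predictions(list(range(6)), scores))
```

Each row's value comes from a model that never saw it, which is what keeps the generated feature from being pure leakage.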

BNP

You can see the different folds being separated, since in this implementation they don't cover all the data (it zero-filled the rows it was missing). I have since fixed this. But yes, there were 9 folds.

Santander

Santander is the one I'm most impressed with; it's really hard to isolate those few positives into a group. BNP is, well, less impressive. It was an easier data set to separate, so the results are meh at best. The rubber will really meet the road when I put the test data in there and do some secondary analysis on the 2-D version of the data using GBM or random forest.

It's important to know that I'm "cheating": I build my boosted trees from the same training data that I'm then graphing. But due to GBM's baby-step approach to solving the problem, and the fact that I use bagging at each step of the way to get accuracy, the results SHOULD translate into something that works for the testing data set. I'll only know for sure once I take those results you see above, put the test data in as well, and then send the whole 2-D result into another GBM and see what it can do with it. I'm hopeful (though prior experience tells me I shouldn't be).

*Note: Since I initially put this post together, I sat on it, since I wanted to try it before I shared it with the world. I have finished using the initial results on the BNP data and they are biased... I think for obvious reasons: training on data, then using the results to train further. I'll talk more about it in t-sne (part 3). If you want to read more about it right now, check out this thread: https://www.kaggle.com/c/bnp-paribas-cardif-claims-management/forums/t/19940/t-sne-neatness/114259#post114259 *

For fun, here are the animations of those two images being made. (They are really big files, 27 and 14 megs... also Firefox doesn't like them much; try IE or Chrome.)

http://dimredglow.com/images/Santander.gif

http://dimredglow.com/images/bnp.gif


Gradient Boosting (Part 2)

This time around let's start from scratch. We have some training data. Each row of the data has various features with real-number values and a score attached to it. We want to build a system that produces those scores from the training data. If we forget for a second "how" and think about it generically, we can say that there is some magical box that gives us a result. The result isn't perfect, but it's better than a random guess.

The results that we get back from this black box are a good first step, but what we really need is to further improve them. To do that, we take the training data and, instead of using the scores we were given, we assign new scores to the rows. Each new score is how far off we were from the black box's prediction.

Now we send this data into a new black box and get back a result. This result isn't perfect either, but it's still better than a random guess. The two results combined then give us a better answer than just the first black box call. We repeat this process over and over, making a series of black box calls that slowly gets the sum of all the results to match the original scores. If we then send a single test row (that we don't know the score of) into each of these black boxes that have been trained on the training data, the results can be added up to produce the score the test data has.

In math this looks something like this:

f(x) = g(x) + g'(x) + g''(x) + g'''(x) + g''''(x) .... (etc)

where f(x) is our score and g(x) is our black box call. g'(x) is the 2nd black box call, which was trained with the adjusted scores; g''(x) is the 3rd black box call, with the scores still further adjusted; etc.
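A tiny end-to-end sketch of that f(x) = g(x) + g'(x) + ... idea, with each "black box" deliberately dumbed down to a predictor of the mean of its targets. Real boosting uses shallow trees, but the residual bookkeeping is the same.

```python
# Sketch of residual boosting: each round fits a model to what's left over
# from the previous rounds, and the final prediction is the sum of all calls.
# mean_model is a stand-in black box, not a decision tree.

def mean_model(targets):
    m = sum(targets) / len(targets)
    return lambda row: m

def boost(rows, scores, rounds):
    models = []
    residuals = list(scores)
    for _ in range(rounds):
        model = mean_model(residuals)
        models.append(model)
        residuals = [r - model(row) for r, row in zip(residuals, rows)]
    return lambda row: sum(m(row) for m in models)

rows = [[1], [2], [3]]
scores = [10.0, 20.0, 30.0]
predict = boost(rows, scores, rounds=3)
print(predict([2]))  # the summed calls recover the overall mean, 20.0
```

With a mean predictor the sum converges after one round; with trees, each round keeps carving the residuals smaller.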

A few questions should arise. First, how many subsequent calls do we do? And second, what exactly is the black box?

I'll answer the second question first. The black box (at least in my case) is the lowly decision tree. Specifically, it is a tree that has been limited so that it terminates before it gets to a singular answer. In fact, it is generally stopped while large fractions of the training data are still grouped together. The terminal nodes just give averages of the scores of the group at each terminal node. It is important that you limit the tree, because you want to build an answer that is good in lots of cases.

Why? Because if you build specific answers and the answer is wrong, correcting the result is nearly impossible. Was it wrong because of a lack of data? Was it wrong because you split your decision tree on the wrong feature three nodes down? Was it wrong because this is an outlier? Any one of these things could be true, so to eliminate them all as possibilities you stop relatively quickly in building your tree and get an answer that is never exactly right but at least puts the results into a group. If you go too far down, you start introducing noise into the results, noise that creeps in because your answers are too specific. It adds static to the output.

How far down should you go? It depends on how much data you have, how varied the answers are, and how the data is distributed. In general I use a depth of ((Math.Log(trainRows.Length) / Math.Log(2)) / 2.0), but it varies from data set to data set. This is just where I start, and I adjust it if need be. In short, I go halfway down to a specific answer.
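That depth heuristic, translated from the C# expression above into a quick Python sketch: half of log base 2 of the training row count.

```python
# The depth heuristic as written above: (log2(rowCount)) / 2.
import math

def starting_depth(train_row_count):
    return (math.log(train_row_count) / math.log(2)) / 2.0

print(starting_depth(4096))  # log2(4096) = 12, so start at depth 6
```

A fully specific answer would need depth log2(n), so this literally stops halfway down.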

Now, if you limit the depth uniformly (or nearly uniformly), the results from each node will have similar output. That is, the results will fall between the minimum and maximum score, just like any prediction you might make, but probably somewhere in the middle (that is important). Each result will also have the same number of decisions used to derive it, so the answers' information content should on average be about the same. Because of this, the next iteration will have newly calculated target values whose range in the worst case is identical to the previous range. In any other case the range is decreased, since the low and high ends got smoothed into average values. So it is probably shrinking the range of scores, and at worst leaving the range the same.

Also, the score input for the next black box call will still have most of the information in it, since all we have done is adjust scores based on one result, and we did it in the same way to a large number of results. Those results, by the previous tree's decisions, share certain traits. Doing this allows us to slowly tease out qualities of similar rows of data in each new tree. But since we are starting over with each new tree, the groupings end up different each time. In this way, subtle shared qualities of rows can be expressed together once their remaining scores (original score minus all the black box calls) line up.

This brings us to the first question: how many calls should I do? To answer that accurately, it's important to know that usually the result that is returned is not added to the row's final score unmodified. Usually you decrease the value by some fixed percentage. Why? This further reduces the influence any one decision tree has on the final prediction. In fact, I've seen times where people reduce the number to 1/100th of its returned value. Doing these tiny baby steps can help, but sometimes it just adds noise to the system, since each generated tree may have bias in it or might too strongly express a feature. In any case, it depends on your decision trees and your data.
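That shrinkage is one line of arithmetic. The learning rate of 0.01 below mirrors the "1/100th" mentioned above; the function name is mine.

```python
# Sketch of shrinkage: add only a fixed fraction of each tree's output to the
# running prediction, instead of the full value.

def apply_tree(prediction, tree_output, learning_rate=0.01):
    return prediction + learning_rate * tree_output

print(apply_tree(0.0, 5.0))  # each tree only nudges the prediction by 0.05
```

The smaller the fraction, the more trees you need, but the less damage any one biased tree can do.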

This goes for how many iterations to do as well. At some point you have gotten as accurate as you can get, and doing further iterations overfits to the training data and makes your results worse. Or, worse yet, it just adds noise as the program attempts to fit the noise of the data to the scores. In my case, I test the predictions after each tree is generated to see if it is still adding accuracy. If I get two trees in a row that fail to produce an improvement, I quit (throwing away the final trees). Most programs just have hard iteration cutoff points. This works pretty well, but leads to a guessing game based on the parameters you have set up.
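The stopping rule described above can be sketched as a patience loop. The error numbers are fabricated just to show the control flow, and the helper name is my invention.

```python
# Sketch of early stopping: keep trees while validation error improves, and
# quit after two non-improving trees in a row, discarding the failures.

def trees_to_keep(errors, patience=2):
    """errors: validation error observed after adding each candidate tree."""
    best = float("inf")
    misses = 0
    kept = 0
    for i, err in enumerate(errors, start=1):
        if err < best:
            best = err
            misses = 0
            kept = i
        else:
            misses += 1
            if misses >= patience:
                break  # quit, throwing away the non-improving trees
    return kept

print(trees_to_keep([0.9, 0.7, 0.6, 0.61, 0.62, 0.5]))  # stops after two misses
```

Unlike a hard iteration cap, the number of trees falls out of the data instead of a parameter guess.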


Gradient Boosting (Part 1)

Okay! So you want to learn about gradient boosting. Well, first let me point you to the obvious source: https://en.wikipedia.org/wiki/Gradient_boosting . I'll wait for you to go read it and come back.

Back? Think you understand it? Good! Then you don't need to read further... probably. I should warn you now that, in the strictest sense, this post is entirely backstory on getting to the point of implementing gradient boosting. You might want to skip to part 2 if you want more explanation of what gradient boosting is doing.

When I first tackled gradient boosting, I tried it and it didn't work. What I mean to say is, I got worse results than random forest https://en.wikipedia.org/wiki/Random_forest. Perhaps I'm getting ahead of myself. Let me back up a little more and explain my perspective.

Most people at https://www.kaggle.com/ use toolkits or languages with libraries and built-in implementations of all the core functionality they need. That is, the toolkits that they use have everything written for them. They make API calls that perform the bulk of the work; they just assemble the pieces and configure the settings for the calls.

I write my own code for pretty much everything when it comes to data mining. That said, I don't reinvent things I have no plans on improving, though there aren't too many things like that. I didn't write my own version of PCA https://en.wikipedia.org/wiki/Principal_component_analysis; I use the one that was out there in a library on the rare occasion I want to use it. And while I've got my own version of t-SNE https://lvdmaaten.github.io/tsne/, it was a rewrite of a JavaScript implementation that someone else had written. Granted, I've tweaked the code a lot for speed and to do some interesting things with it, but I didn't sit down with a blank class file and just write it. Everything else, though, I've written all by myself.

So why does that make a difference (toolkit vs. handwritten)? Well, I try stuff and have to figure things out. And because of that, my version of the technology might work in a fundamentally different way. Or perhaps what I settle on isn't as good (though it probably seems better at the time). Then, when I try to leverage that code for the implementation of gradient boosting, it doesn't work like it should.

The core of gradient boosting and random forests is the decision tree https://en.wikipedia.org/wiki/Decision_tree_learning. When it comes to random forest, I have been very pleased with the tree algorithm I've designed. However, those trees just didn't seem to work well stubbed out for gradient boosting. I can only think of three explanations for this:

  1. My stubbed out trees tended to be biased a certain way.
  2. They have a lot of noise in the output.
  3. My gradient boosting algorithm got mucked up due to poor implementation.

That, at least, is the best I can figure.

Recently I made a giant leap forward in my decision tree technology. I had an 'AH HA!' moment and changed my tree implementation to use my new method. When I got it all right, my scores went up by something like 10% (ball-parking here), and all of a sudden, when I tried gradient boosting, it worked as well. The results I got with that were fully another 15-20% better still! All at once I felt like a contender. The rest of the world had long since moved on to XGBoost https://github.com/dmlc/xgboost, which was leaving me in the dust. Not so much anymore, but I still haven't won any contests or, like, made millions on the stock market :) .

What changed in my trees that made all the difference? I started using the sigmoid https://en.wikipedia.org/wiki/Logistic_function to make my splitters. I had tried that too, many years ago, but the specific implementation was key. The ah-ha moment was when I realized how to directly translate my data into a perfect splitter without any messing around "honing in" on the best split. I think this technology not only gives me a great tree based on a better statistical model than using accuracy alone (accuracy is how my old tree worked), but the results are also more "smooth," so noise is less of an issue.