This post will be the last for the immediate future on t-sne and really will just cover what I've done and how it all ended up. I have a new direction to head. And while I may use t-sne , I don't think it will be the main focus (though i guess it could be! surprises do happen).
So without further adieu let me surmise what I've written about so far. T-sne is an algorithm that reduces dimensions of a data set while preserving the statistical probability dictated by distance between rows of data and the features there in. It does this by randomly distributing the points in whatever N dimensional space you choose. Then it slowly moves the points iterativly in such a way as best reflects the same probability distribution as what existed in higher order space. You could i suppose increase dimensions in this way too if you want to.
I covered how I would speed this up at the end of part 1. I use a series of smaller "mini" t-sne's and combine them together. All of those t-snes share a lot a large number of the same data points. these were seeded in to the same position. I use those points to stitch together all the various frames. using the closest point as an anchor to find its final relative position in 1 key frame i use to put all the points. In that way all points from all the frames are mapped to 1 of the frames. This seems to work really well though the results aren't exact but the run times are linear. (i'm calling this a win)
In the 2nd part I covered how one can augment the results to get a clear separation of data points based on a particular score. In short it's not magical, you build features from a data mining tool and provide the results as new features. you then use only those as the t-sne inputs. In my case my GBM does some sub selecting and bagging internally. because of this, my outputs are never exact. This makes for a very good set of inputs for seeing the separation of the scores.
Unfortunately, the results you get don't provide a way for improving the score. I had hoped they would, but truly it's just another way at looking at the final output. If you train on a sub section of the data and feed the t-sne results back in to the GBM machine and re train using the other part and no overall improvements will be found. (i'm calling this a loss or at best a draw. yes I did get wonderful separation which was my starting goal, but it's nothing more than a visualization tool and didn't lead to something that actually "improves" anything from a data mining perspective.)
It's worth noting that if you don't remove the bias and go ahead and retrain from the t-sne result using the same data you fed it originally to get scores to build the t-sne results, it will slowly get better and better but this is just a serious case of over-fitting. GBM does such a good job of finding connections between data points it will emphasize any aspects that are represented in the new t-sne feed that allow it to get closer to the right score.
I tried this out and found I could turn a .47 results (log-loss) in to a .43 result by adding in the t-sne data. then I could further reduce it all the way to .388 by letting the algorithm pick and choose the features to use (well if brute force trial and error is pick and choose). The point is it's not useful for improving the results of the test set (which actually score .48 when the cross validation set score .388).
I'll leave this right there for now.