Dim Red Glow

A blog about data mining, games, stocks and adventures.

Some animations from the cancer research

I ran the data through (well, my layered version of t-SNE, anyway) and got some pretty good results, good at least visually. I'm not sure if these are going to pan out, but I'm hopeful. Let me say two things before I get to them.

First, the more I use the new log version of the image, the more I think it is fundamentally the best way to look at the image data. In fact, I think it might be the best way to deal with just about any real-world "thing" that can't be modeled in discrete values. It just does everything you want, and in a way that makes good sense. It is essentially a waveform of the whole object broken down into one number. It might actually be a great way to deal with sounds, 3-D models, electrical signals, pictures... you name it. I think there is room for improvement in the details, but the idea itself seems good.

The second thing I want to say is that the files are huge, so I'm going to share still images of the final result and then link to the .gif animations. I might upload them to YouTube, but I'm pretty sure GIF is the most efficient way to send them to anyone who wants to see them. They are lossless, which is great, and smaller than the corresponding MP4/MPEG/FLV files. This is because normal video codecs don't handle thousands of moving pixels nearly as efficiently as a GIF with its simple difference frames does.

In these images, a purple dot represents a CT-scan image from a patient with no cancer. A green dot represents a CT-scan image from a patient who had cancer. In the final image, yellow is test data we don't have solutions for. The goal here is to identify which slices/images actually have cancer-telling information in them. So I would hope most of the images fall into a mix of purple and green dots.

Okay, this one is a mess, and your initial reaction might be: how is that useful? Well, there is an important thing to know here. The fields I fed in included the indexed location of the slices after they were sorted. While this is fairly useful for data mining, it is all but useless for visualization (a linear set of numbers doesn't have a meaningful standard deviation or localized average to group on). I knew this going into the processing, but I wanted to see the result all the same.

So now we remove the indexing features and try again. (Remember, each dot is actually a bunch of features created from multiple grids of images.)

Okay, that looks a LOT better. In the bottom right you see a group of green dots; that is some very nice auto-grouping. I went ahead and ran this once more, this time with the test data in there as well. It is in yellow.

If anything, that's even better. Adding more data does tend to help things. The top group is really good. The ones on the left might be something too. It's hard to know, but you don't have to know! That's what the data-mining tools will figure out for you.

I want to do even more runs and see if I can build a better picture. A few of my features are based on index distances, and they should probably be based on my log number instead. Either way, it's fun to share! The rubber will really meet the road when I see whether this actually gives me good results when I make a submission (probably at least a day away).

Here are the videos. They are 69, 47 and 77 MB each, so... they'll probably take a while to download.

http://dimredglow.com/images/animation1.gif

http://dimredglow.com/images/animation2.gif

http://dimredglow.com/images/animation3.gif

 

First steps to image processing improvement

Where to begin... First, I started looking at just how big 32x32 is on the 512x512 images. It seemed too small to clearly identify abnormalities, so I increased my grid segments to 128... which might be too big. I'll probably drop them down to 64 next run.
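
To make the grid idea concrete, here's a quick illustrative sketch (plain Python, not my actual code) of chopping a 512x512 slice into fixed-size cells and computing a couple of summary statistics per cell. The cell size of 64 is the one under discussion; the mean/std stats are just stand-ins for whatever per-cell features you actually build.

```python
import numpy as np

def grid_features(slice_512, cell=64):
    """Split a 512x512 slice into (512/cell)^2 cells and summarize each one.

    The mean/std summaries are placeholders; the real per-cell features could
    be anything (histograms, edges, the log number, etc.).
    """
    h, w = slice_512.shape
    feats = []
    for r in range(0, h, cell):
        for c in range(0, w, cell):
            patch = slice_512[r:r + cell, c:c + cell]
            feats.extend([patch.mean(), patch.std()])
    return np.array(feats)

# toy usage with a fake slice
fake_slice = np.random.rand(512, 512)
print(grid_features(fake_slice, cell=64).shape)   # (128,) -> 8x8 cells x 2 stats
```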

I found a post by a radiologist in Kaggle's forums that further reinforced 64 as a good size (based solely on his identification of a 33x32 lesion; at an arbitrary location, a 64x64 cell would capture it nicely). So that's on the short list of changes to make.

Also, I have fixed a bug in my sorting (it's really bad this existed :( ) and added a log version of the number used for sorting. (That's actually how I caught the sorting bug.) I don't generally need both, but in some contexts it is useful to use one and in others the other. The log number alone lacks specificity, since a 512x512 image is awfully big in number form, so rounding/lack of precision is a problem.
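
To show why the log alone is a problem, here's a toy example with a made-up sort key (the real key comes from the image data): once the log is rounded for storage, two nearby keys collapse to the same value, while the raw numbers still sort fine.

```python
import math

# hypothetical sort keys derived from two nearly identical 512x512 images
key_a = 981_273_645_120_334
key_b = 981_273_645_120_991          # slightly larger than key_a

log_a = round(math.log10(key_a), 6)  # log compressed and rounded for storage
log_b = round(math.log10(key_b), 6)

print(log_a == log_b)                # True: the rounded logs collide
print(key_a == key_b)                # False: the raw keys still distinguish them
```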

Unfortunately, even after fixing my garbage inputs from before, my score didn't improve. This leads me to the real problem I need to solve.

I need to be able to identify which image actually shows cancer. Patient-level identification is just not enough; that's why my score didn't improve when I fixed the inputs... but how?

I hope t-SNE can help here. I feed in all images with their respective grid data and let it self-organize. That's running right now. Hopefully the output puts the images into groups based on abnormalities, ideally grouping the images with cancer. Since I don't actually know which images have cancer, I'll be looking for groupings of images all marked 1 (from a cancer patient) and hope that does it. There will also hopefully be groups made of a mishmash of non-cancer images.

I'm building one of those animations from before for fun. I'll share it when it's finished.

If this works, I'll either feed that t-SNE data (x, y) in as a new feature or use k-means to group up the clusters and use that. We will see!
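
As a sketch of that plan using off-the-shelf scikit-learn pieces (rather than my own layered t-SNE), the two options look roughly like this:

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

# fake stand-in for the per-image grid features (rows = CT slices)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 40))

# 2-D embedding of the slices
xy = TSNE(n_components=2, random_state=0).fit_transform(X)

# option 1: use the coordinates directly as two new features
X_with_xy = np.hstack([X, xy])

# option 2: cluster the 2-D map and use the cluster id as a categorical feature
cluster_id = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(xy)
X_with_cluster = np.hstack([X, cluster_id.reshape(-1, 1)])

print(X_with_xy.shape, X_with_cluster.shape)
```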

Minecraft and data mining (no relation) and stocks (some relation)

It's the 2nd day of 2017 and soon to be 2018, or so it always seems when you look back at years. I've dawdled for too long on some old work I wanted to do, and the year turning over has put that in sharp relief.

First, I started a YouTube channel, https://www.youtube.com/channel/UCoTP8WbdsCW_6FSLz0pUSsA (Hardcore in a Hurry), for my video game exploits. I don't play a lot of video games these days, but I like playing games on the hardest setting possible. I'm not a fan of cheat modes or easy walkthrough settings (or AIs that cheat, for that matter; I mean, seriously, couldn't you write a better AI?). That said, Minecraft is the only thing on there now. I don't have plans to add any other types of videos right now, but things change.

Next, let me say I've spent some more time on my layered gradient boosting stuff. I learned a little bit about what t-SNE can do for me. Feeding t-SNE in initially can help if the groups self-organize out of the original data, but normally I think it is something you want to do after a few iterations have passed; it seems the further down the gradient you are, the better the effect. The effects aren't remarkable, but they aren't terrible either. The unfortunate takeaway is that it is very time-intensive. Running t-SNE on the current state of the gradient descent plus the source data just takes... well, a while. Maybe I can cherry-pick features to speed it up, but as it stands, unless the data set is pretty small it's not a good option for eking out more performance.
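
Roughly what I mean, with scikit-learn's GBM standing in for mine: boost for a few rounds, embed the source data together with the model's current predictions, and feed the coordinates back in as extra features for further boosting. The round counts and names here are made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.manifold import TSNE

X, y = make_classification(n_samples=600, n_features=20, random_state=0)

# run a few boosting iterations first
early = GradientBoostingClassifier(n_estimators=25, random_state=0).fit(X, y)

# current state of the gradient descent: the decision function after 25 rounds
stage_scores = early.decision_function(X).reshape(-1, 1)

# embed the source data plus the current boosting state
xy = TSNE(n_components=2, random_state=0).fit_transform(
    np.hstack([X, stage_scores]))

# continue with a fresh model that also sees the t-SNE coordinates
final = GradientBoostingClassifier(n_estimators=100, random_state=0)
final.fit(np.hstack([X, xy]), y)
print(final.score(np.hstack([X, xy]), y))   # training score only, to show the pipeline runs
```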

For a couple of years now I've wanted to implement a stock analysis program so I can do personal investing. I did this once a long time ago with a friend, using far less sophisticated methods; that's a story for another time. I've never been happy enough with my data-mining software to start the code/project for investing. I think I'm happy enough now. The real work will be in getting the data into a form that is easily updated daily and makes measurable, testable predictions. My recent work on https://www.kaggle.com/c/santander-product-recommendation got me to rewrite parts of my code to handle time series in a different way, which is key for the training and evaluation. (That's a contest I never got anywhere with; I've never implemented the MAP@<x> evaluation, which makes it hard to train for such a thing.)

 

Adjusting the layers to be t-SNE driven

Hello again. I've slowly turned over the ideas I wrote about last time, and I think there are two big flaws in them. First, fitting a GBM to data you already have in the training and test data adds nothing (that's huge). This is with regard to me selecting a few features to make something that has a correlation coefficient equal to some fixed amount of the final result. And second, making groups in any fashion that is not some form of transform will likely never give me any new information to work with.

I have a fix for both (of course :) ). I mentioned using t-SNE as well; I think that's where I need to get my groups from. The features I send in to t-SNE are how I get the various new groups. Everything else I've said, though, is still relevant. So it is no longer layers of GBM as much as layers of t-SNE. Once I have relevant results from that, I send it all into GBM and let it do its work.

The groups I make out of t-SNE (2-D, by the way) will utilize the linear t-SNE I've already written (as it's super fast and does what I need pretty well). And while I will send the results from the t-SNE right into the GBM, I'll do more than that: I will also isolate each region in the results and build groups from each of those.

To do that, I make a gravity-well map from every point on the map... like they are all little planets. The resulting map is evenly subdivided (like it's a giant square map of X by X; I've been using 50 for each side) and has a value of 0 or larger in every square (inverse squared distance: 1/(1 + (sourceX - x0)^2 + (sourceY - y0)^2)). I've been limiting the influence to 5 squares from each point just to keep the runtime speedy (it's only a 50x50 map anyway, so that's pretty good reach, and at 6 away the influence is 1/(1+36+36), which is only 0.013, so that's a decent cutoff anyway). Once I have the map, I look for any positive points that have lower or equal points in all directions. Those points are then used to make groups. So there may only be one... but likely there will be bunches.

The groups are based on distance from the centers found above in the gravity-well map (if you were to take the derivative of the map, these would be the zero points). So we just find the distance of every point from those centers. This has the added effect of possibly making multiple groups in one pass (which is great), not to mention the data is made via a transformation that works completely statistically and has no direct one-to-one correlation with the data you feed it. So you are actually adding something for the GBM mechanism to work with.
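
Here's a small sketch of that whole gravity-well step (illustrative Python, not my actual code): bin the 2-D points onto a 50x50 grid, accumulate the inverse-square influence out to 5 cells, keep the cells that have no higher neighbor, and emit one distance-to-center feature per center.

```python
import numpy as np

def gravity_well_groups(xy, grid=50, reach=5):
    """xy: (n, 2) t-SNE coordinates. Returns the map, the centers, and
    one distance column per center."""
    # map points onto grid cells
    mins, maxs = xy.min(axis=0), xy.max(axis=0)
    cells = ((xy - mins) / (maxs - mins + 1e-12) * (grid - 1)).astype(int)

    # accumulate inverse-square influence, limited to `reach` cells away
    gmap = np.zeros((grid, grid))
    for cx, cy in cells:
        for gx in range(max(0, cx - reach), min(grid, cx + reach + 1)):
            for gy in range(max(0, cy - reach), min(grid, cy + reach + 1)):
                gmap[gx, gy] += 1.0 / (1 + (cx - gx) ** 2 + (cy - gy) ** 2)

    # centers: positive cells with no strictly higher neighbor
    centers = []
    for gx in range(grid):
        for gy in range(grid):
            if gmap[gx, gy] <= 0:
                continue
            nb = gmap[max(0, gx - 1):gx + 2, max(0, gy - 1):gy + 2]
            if gmap[gx, gy] >= nb.max():
                centers.append((gx, gy))

    # one new feature per center: distance of every point (in cell space) to it
    feats = np.array([[np.hypot(cx - px, cy - py) for px, py in centers]
                      for cx, cy in cells])
    return gmap, centers, feats

xy = np.random.default_rng(1).normal(size=(300, 2))
_, centers, feats = gravity_well_groups(xy)
print(len(centers), feats.shape)
```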

At that point, the resulting groups should probably be filtered, using the correlation mechanism I mentioned in the last post, to see if they add any real value. Ideally, GBM would just ignore them if they are noisy.

Another long overdue blog

Hi again, faithful readers (you know who you are). So the marathon is back off the plate. Basically, I can't afford to spend the extra cash right now on a trip, or rather, I'm not going to borrow the money for a trip like that. I overspent when I thought the trip was off, and upon reflection, when it came time to book the trip it just didn't make sense. So... maybe next year.

So for the last month I've been toying with my data-mining code base. I removed tons of old code that wasn't being used/tested and honed it down to just GBM. Then I spent some time seeing if I could make a version that boosts accuracy over log loss (normal GBM produces a very balanced approach), and I was successful. I did it by taking the output from one GBM and sending it into another, in essence overfitting.
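
A minimal sketch of that stacking trick with scikit-learn standing in for my GBM: train one model, append its predictions on the training data as a new feature, and train a second model on top (which is exactly the deliberate overfitting I'm describing).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss, accuracy_score

X, y = make_classification(n_samples=1000, n_features=20, weights=[0.95],
                           random_state=0)

# first GBM: the usual balanced-ish probabilities
gbm1 = GradientBoostingClassifier(random_state=0).fit(X, y)
p1 = gbm1.predict_proba(X)[:, 1]

# second GBM trained on the data plus the first model's output:
# it pushes the probabilities harder toward 0/1 (the deliberate overfit)
X2 = np.hstack([X, p1.reshape(-1, 1)])
gbm2 = GradientBoostingClassifier(random_state=0).fit(X2, y)
p2 = gbm2.predict_proba(X2)[:, 1]

for name, p in [("single", p1), ("stacked", p2)]:
    pred = (p > 0.5).astype(int)
    print(name, round(accuracy_score(y, pred), 3), round(log_loss(y, p), 4))
```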

Why would I want to do that? Well, there is a contest ( https://www.kaggle.com/c/santander-product-recommendation ) I tried to work on a little, but the positives are few and far between. The predictions generally put any given positive at less than a 1% chance of being right, so I tried bumping the numbers up, since right now it returns all negatives. The results are actually running right now. I still expect mostly negatives, but the potential positives should have higher percentages, and I can pick a cutoff point for rounding to a positive that is a little higher than the 0.0005% I would have had to use before. All this so I can send in a result other than all negatives (which scores you at the bottom of the leaderboard). Will this give me a better score than all negatives... no idea :)

I tried making a variation on the GBM tree I was using that worked like some stuff I did years ago. It wasn't bad, but still not as good as the current GBM implementation. I also modified the tree to handle time-series data, in that it can lock one row to another and put the columns in time-sensitive order. This allows me to process multi-month data really well. It also gave me a place to feed in fake data: if I want to stack one GBM on top of another, I can send in the previous GBM's results as new features to train on (along with the normal training data).

This leads me to where I think the next evolution of this will be. I'm slowly building a multi-layer GBM... which is essentially a form of neural net. The thing I need to work out is how best to subdivide the things each layer should predict. That is, I could make it so the GBM makes 2 or 1000 different groups, predicts row results for each, and feeds those in for the next prediction... etc., until we get to the final prediction. The division into groups is something that can probably be done using a form of multivariable analysis that makes groups out of variables that change together. Figuring out how to divide it into multiple layers is a different problem altogether.
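
One way that could look, sketched with scikit-learn pieces (this is just one reading of the idea, not a finished design): cluster the features by correlation to get the groups, have a small layer-one GBM per group, and let a final GBM see the originals plus the layer-one outputs.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.cluster import AgglomerativeClustering

X, y = make_classification(n_samples=800, n_features=24, random_state=0)

# group variables that "change together": cluster features on their correlation
# (clustering the rows of the dissimilarity matrix is a quick heuristic)
corr = np.corrcoef(X, rowvar=False)
groups = AgglomerativeClustering(n_clusters=4).fit_predict(1 - np.abs(corr))

# layer 1: one small GBM per feature group, each predicting the target
layer1_out = []
for g in range(4):
    cols = np.where(groups == g)[0]
    m = GradientBoostingClassifier(n_estimators=50, random_state=g)
    m.fit(X[:, cols], y)
    layer1_out.append(m.predict_proba(X[:, cols])[:, 1])
layer1_out = np.column_stack(layer1_out)

# layer 2: the final GBM sees the original data plus the per-group predictions
final = GradientBoostingClassifier(random_state=0)
final.fit(np.hstack([X, layer1_out]), y)
print(final.score(np.hstack([X, layer1_out]), y))
```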

Do you want an AI? 'Cause this is how you get AIs! Heh, seriously, that's what it turns into. Once you have a program that takes in data, builds a great answer in layers by solving little problems, and assembles them into a final answer that is super great, well, you pretty much have an AI.

Incidentally, t-SNE might also help here: I might just feed in the t-SNE results for that layer's training data (fake data included if we are a level or two down) to give the system a better picture of how things group statistically.

In other news, I started using Blue Apron. This is my first time trying a service like this, and so far I'm really enjoying it. I'm pretty bad about going to the grocery store, and going every week... well, that ain't gonna happen. This is my way to do that, without doing that :) . I'm sure most people have similar thoughts when they sign up, even if the selling point is supposed to be the dishes you're making. Honestly, I've just been eating too much takeout. I don't mind cooking, and the dishes they send you to prepare are for the most part really good.

 

 

Improving t-SNE (part 3)

This post will be the last on t-SNE for the immediate future and will really just cover what I've done and how it all ended up. I have a new direction to head in. And while I may use t-SNE, I don't think it will be the main focus (though I guess it could be! Surprises do happen).

So without further ado, let me summarize what I've written about so far. T-SNE is an algorithm that reduces the dimensions of a data set while preserving the statistical probabilities dictated by the distances between rows of data and the features therein. It does this by randomly distributing the points in whatever N-dimensional space you choose, then slowly moving the points iteratively in a way that best reflects the same probability distribution that existed in the higher-order space. You could, I suppose, increase dimensions this way too if you wanted to.

I covered how I would speed this up at the end of part 1: I use a series of smaller "mini" t-SNEs and combine them together. All of those t-SNEs share a large number of the same data points, which were seeded into the same positions. I use those points to stitch together all the various frames, using the closest shared point as an anchor to find each point's final relative position in the one key frame where I place all the points. In that way, all points from all the frames are mapped onto one of the frames. This seems to work really well; the results aren't exact, but the run times are linear. (I'm calling this a win.)

In the second part I covered how one can augment the results to get a clear separation of data points based on a particular score. In short, it's not magical: you build features from a data-mining tool and provide the results as new features, then use only those as the t-SNE inputs. In my case, my GBM does some sub-selecting and bagging internally; because of this, my outputs are never exact. This makes for a very good set of inputs for seeing the separation of the scores.

Unfortunately, the results you get don't provide a way to improve the score. I had hoped they would, but truly it's just another way of looking at the final output. If you train on a subsection of the data, feed the t-SNE results back into the GBM machine, and retrain using the other part, no overall improvements will be found. (I'm calling this a loss, or at best a draw. Yes, I did get wonderful separation, which was my starting goal, but it's nothing more than a visualization tool and didn't lead to something that actually "improves" anything from a data-mining perspective.)

It's worth noting that if you don't remove the bias and go ahead and retrain on the t-SNE results using the same data you originally fed in to get the scores that built those results, it will slowly get better and better, but this is just a serious case of overfitting. GBM does such a good job of finding connections between data points that it will emphasize any aspects represented in the new t-SNE features that allow it to get closer to the right score.

I tried this out and found I could turn a .47 result (log loss) into a .43 result by adding in the t-SNE data. Then I could further reduce it all the way to .388 by letting the algorithm pick and choose the features to use (well, if brute-force trial and error counts as picking and choosing). The point is, it's not useful for improving the results on the test set (which actually scored .48 while the cross-validation set scored .388).

I'll leave this right there for now.

Improving t-SNE (part 2)

I ended last time by saying, "This brings me to the big failing of t-SNE: sometimes (most of the time) the data is scored in a way that doesn't group on its own." I intend to try to tackle that in today's post. I will tell you what I know, what I think might work, and the initial results I've gotten so far.

Let me start by saying I've been casually active in two Kaggle contests as of late: https://www.kaggle.com/c/santander-customer-satisfaction and https://www.kaggle.com/c/bnp-paribas-cardif-claims-management . Both contests use a binary classifier; one scores using log loss (which penalizes you very harshly for predicting wrong) and the other uses AUC (ROC if you prefer), though their AUC is predicted probability vs. observed probability, which is a little weird for AUC, but whatever.

Both of these contests are kind of ideal for t-SNE graphing in that there are only two colors to represent. These contests also represent different extremes: one has a very weak signal and the other is more of a 50/50. That is, in the weak-signal case the "no"s far outnumber the "yes"s. It's easy for the signal to get lost in the noise when the data is like that. If we can leverage t-SNE to reduce our dimensions down to 2 or 3 and see, via a picture, that our groupings of positives and negatives are where they need to be, we probably really will have something.

T-SNE gives us one very easy way to augment results and one not-so-easy way. The very easy way is to change how it calculates distances. It's important to remember that at its core it is trying to assign a probability to the distance between every point and every other point, and then rebalance all the point-to-point probabilities in a lower dimension. If we augment how it evaluates distance, say by expanding or collapsing certain dimensions in a uniform way, we can change our results, as those dimensions will have more or less importance. The not-so-easy way I'll talk about later in the article.

I tried at first to improve the solutions using various mathematical transforms. Some were me taking dimensional values to powers (or roots). I tried principal component analysis. I even tried periodically rebuilding the probabilities based on some ad-hoc analysis of where the training data was laid out. All of this was a mess. Not only was I never really successful, but in the end I was trying to build a data-mining tool inside my data-mining tool!

Skipping a few days ahead, I started looking at ways to import results from a data-mining tool to augment the dimensions. I tried using data from gradient boosting: I tracked which features get used and how often, and then I shrank each dimension in proportion to how much it was used. So if a feature/dimension was used 10% of the time (which is a TON of usage), that dimension got reduced to 10% of its original size. If it was used 0.1% of the time, it got reduced to 1/1000th of its size. This produced... better results, but how much better wasn't clear, and I definitely wasn't producing results that made me go "OMG!" I was still missing something.
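
The scaling step itself is simple; here it is sketched with scikit-learn's impurity-based feature_importances_ standing in for my usage counts:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.manifold import TSNE

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

gbm = GradientBoostingClassifier(random_state=0).fit(X, y)

# shrink each dimension by how much the booster actually used it
importance = gbm.feature_importances_   # sums to 1.0
X_scaled = X * importance               # a 10%-used feature keeps 10% of its size

xy = TSNE(n_components=2, random_state=0).fit_transform(X_scaled)
print(xy.shape)
```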

Now we come to the hard way to augment the results: we can build better data for the t-SNE program to use. This is what I did. I abandoned the usage information entirely and tried using entirely new features generated by GBM. The first step was to build a number based on how the row flowed down the tree: flowing left was a 0, flowing right was a 1. I actually recorded this backwards, though, since I want the 0 or 1 at the final node to be the most significant digit, so that like values end up on the same end of the number line. Since GBM is run iteratively, I could take each loop and use it as a new feature.
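
Here's one way to build that path number from a fitted tree (sklearn's tree internals standing in for mine): walk each row from the root to its leaf, record 0 for left and 1 for right, and make the last split the most significant digit. With fixed-depth trees every path has the same length, so the digits line up.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def path_number(tree, X):
    """Encode each row's root-to-leaf path, last split = most significant bit."""
    t = tree.tree_
    out = np.zeros(len(X))
    for i, x in enumerate(X):
        node, bits = 0, []
        while t.children_left[node] != -1:           # -1 marks a leaf
            go_right = x[t.feature[node]] > t.threshold[node]
            bits.append(1 if go_right else 0)
            node = t.children_right[node] if go_right else t.children_left[node]
        # enumerate from the root up: the final split gets the largest weight
        out[i] = sum(b << k for k, b in enumerate(bits))
    return out

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
print(path_number(tree, X)[:10])
```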

This didn't really work, I think because a lot of the splits are really noisy; even with the last split as the most significant digit, it's still way too noisy to be a useful way to look at the data. This brings me to my most recent idea, which finally bears some fruit.

I thought "Well if I just use the final answers instead of the flow of data, not only do i get a method for generating features that can produce either reals or categorical answers (depending on the data you are analyzing) but you also can use any method you like to generate the features. you arent stuck with fixed depth trees."  I stuck with GBM but turned off the fixed depth aspect of the tree and just built some results. I ran 5 iterations with 9 cross validation and kept the training set's predictions from the tree building exercise. The results are really promising. See below (note: I will continue to use my psedo-t-sne which runs in linear time due to time constraints. if you want "perfect results" you will have to go else where for now)

BNP

You can see the different folds being separated, since in this implementation they don't cover all the data (it zero-fills the rows it was missing). I have since fixed this. But yes, there were 9 folds.

Santander

Santander is the one I'm most impressed with; it's really hard to isolate those few positives into a group. BNP... is, well, less impressive. It was an easier data set to separate, so the results are meh at best. The rubber will really meet the road when I put the test data in there and do some secondary analysis on the 2-D version of the data using GBM or random forest.

It's important to know that I'm "cheating": I build my boosted trees from the same training data that I'm then graphing. But due to GBM's baby-step approach to solving the problem, and the fact that I use bagging at each step of the way for accuracy, the results SHOULD translate into something that works for the test data set. I'll only know for sure once I take those results you see above, put the test data in as well, and then send the whole 2-D result into another GBM and see what it can do with it. I'm hopeful (though prior experience tells me I shouldn't be).
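
Condensed, the pipeline looks something like this (scikit-learn stand-ins, and the regular t-SNE swapped in for my pseudo version): keep the iteration-by-iteration out-of-fold predictions, use only those as the t-SNE inputs, then hand the 2-D result to another model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold
from sklearn.manifold import TSNE

X, y = make_classification(n_samples=600, n_features=20, weights=[0.9],
                           random_state=0)

# out-of-fold predictions from a handful of boosting iterations
n_iter = 5
oof = np.zeros((len(X), n_iter))
for tr, va in KFold(n_splits=9, shuffle=True, random_state=0).split(X):
    gbm = GradientBoostingClassifier(n_estimators=n_iter, random_state=0)
    gbm.fit(X[tr], y[tr])
    # staged_predict_proba yields the model's prediction after each iteration
    for i, p in enumerate(gbm.staged_predict_proba(X[va])):
        oof[va, i] = p[:, 1]

# the iteration-by-iteration predictions are the ONLY t-SNE inputs
xy = TSNE(n_components=2, random_state=0).fit_transform(oof)

# secondary analysis on the 2-D picture
second = GradientBoostingClassifier(random_state=0).fit(xy, y)
print(second.score(xy, y))   # training score; see the note below about bias
```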

*Note: since I initially put this post together, I sat on it because I wanted to try it before I shared it with the world. I have finished using the initial results on the BNP data, and they are biased... I think for obvious reasons: training on data and then using the results to train further. I'll talk more about it in t-SNE (part 3). If you want to read more about it right now, check out this thread: https://www.kaggle.com/c/bnp-paribas-cardif-claims-management/forums/t/19940/t-sne-neatness/114259#post114259 *

For fun, here are the animations of those two images being made. (They are really big files, 27 and 14 MB... also Firefox doesn't like them much; try IE or Chrome.)

http://dimredglow.com/images/Santander.gif

http://dimredglow.com/images/bnp.gif

 

 

Improving t-SNE (part 1)

Well, that's what I've been doing (improving t-SNE). What is t-SNE? Here's a link and here is another link. The TL;DR version is that t-SNE is intended for dimensional reduction of higher-order data sets. This helps us make pretty pictures when you reduce the data set down to 2-D (or sometimes 3-D). All t-SNE is doing is trying to preserve the statistics of the features and re-represent the data in a smaller set of features (dimensions). This allows us to more easily see how the data points are organized. However, all too often your labeling and the reality of how the data is organized have very little in common. But when they do, the results are pretty impressive.

This isn't my first foray into messing with t-SNE, as you can see here and in other places on Kaggle if you do some digging. I should probably explain what I've already done, and then I can explain what I've been doing. When I first started looking at the algorithm, it was in the context of this contest/post. I did some investigation and found an implementation I could translate to C#. Then I started "messing" with it. It has a few limitations out of the box. First, the data needs to be real numbers, not categorical. Second, the run time does not scale to large sets of data.

The second problem has apparently been addressed using the Barnes-Hut n-body simulation technique, which (as I understand it) windows off areas that have little to no influence on other areas. But you can read about it there. This drops the runtime from N squared to N log N, which is a HUGE improvement. I tried to go one better and make it linear. The results I came up with are "meh": they run in linear time, but the results are fuzzy at best, not the perfect ones you see from the Barnes-Hut simulations.

Before I say what I did, let me explain what the algorithm does. It creates a statistical model of the different features of the data, getting a probability picture of any given value. Then it randomly scatters the data points in X dimensions (this will eventually be the output, and is where the dimensional reduction comes from). Then it starts moving those points step by step in a process that attempts to find a happy medium where the points are balanced statistically. That is, the points pull and push on each other based on how out of place they are with respect to all the other points. They self-organize via the algorithm. The movement/momentum each row/point gains in each direction is set by the influence those other points should have, based on their distance and how far away they should be given their particular likelihood. I've toyed with making an animation out of each step. I'll probably do that some day so people (like me) can see first-hand what is going on.
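
For anyone who wants to see that push and pull in code, here's a stripped-down sketch of the loop (a fixed Gaussian bandwidth instead of the usual perplexity search, no early exaggeration or momentum), just to show the mechanics:

```python
import numpy as np

def tiny_tsne(X, dims=2, steps=500, lr=10.0, sigma=1.0, seed=0):
    """Bare-bones t-SNE-style embedding; illustrative only."""
    n = len(X)
    rng = np.random.default_rng(seed)

    # high-dimensional affinities from a fixed-bandwidth Gaussian (no perplexity search)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    P = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(P, 0.0)
    P /= P.sum()

    # random scatter in the low-dimensional output space
    Y = rng.normal(scale=1e-2, size=(n, dims))

    for _ in range(steps):
        # low-dimensional affinities use the Student-t kernel
        q_num = 1.0 / (1.0 + ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1))
        np.fill_diagonal(q_num, 0.0)
        Q = q_num / q_num.sum()

        # gradient: mismatched pairs push or pull each point
        diff = Y[:, None, :] - Y[None, :, :]
        grad = 4 * ((P - Q) * q_num)[:, :, None] * diff
        Y -= lr * grad.sum(axis=1)
    return Y

Y = tiny_tsne(np.random.default_rng(1).normal(size=(100, 10)))
print(Y.shape)
```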

So to improve it, what I did is split the data into a whole bunch of smaller sections (windows), each of the same size. Each window, though, shares a set of points with one master section. These master points are placed in the same seed positions in all windows and have normal influence in all windows. They get moved and jostled in each window, but when it comes time to display results, we have some calculations to do. The zero window is left unaltered, but every other window has its non-shared points moved to positions relative to the closest shared point. So if, say, row 4000 isn't shared across windows and its closest shared row is row 7421, we see what its dimensional offset is and place it according to that.
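
A compact sketch of the windowing trick using scikit-learn's TSNE for each window (my real version shares the seed positions inside the optimizer itself; here I approximate that by passing the same init coordinates for the shared rows): embed each window, then place every non-shared point relative to its closest shared point's position in the zero window.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(1200, 15))

n_shared = 100                                   # rows shared by every window
shared_idx = rng.choice(len(X), n_shared, replace=False)
rest = np.setdiff1d(np.arange(len(X)), shared_idx)
windows = np.array_split(rest, 4)                # 4 windows of non-shared rows

shared_seed = rng.normal(scale=1e-2, size=(n_shared, 2))
final = np.zeros((len(X), 2))
win0_shared = None

for w, own in enumerate(windows):
    idx = np.concatenate([shared_idx, own])
    init = np.vstack([shared_seed,               # shared rows start identically
                      rng.normal(scale=1e-2, size=(len(own), 2))])
    emb = TSNE(n_components=2, init=init, random_state=0).fit_transform(X[idx])
    emb_shared, emb_own = emb[:n_shared], emb[n_shared:]

    if w == 0:
        win0_shared = emb_shared                 # the key frame is kept as-is
        final[shared_idx] = emb_shared
        final[own] = emb_own
    else:
        # place each non-shared point relative to its closest shared point
        for row, p in zip(own, emb_own):
            j = np.argmin(((emb_shared - p) ** 2).sum(axis=1))
            final[row] = win0_shared[j] + (p - emb_shared[j])

print(final.shape)
```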

The idea is that, statistically speaking, the influence of those shared points is the same as any other points', and the influence the points in the main window have on them is approximately the same as well. The net effect is that all the points move and are cajoled into groups as if they were one big window. This, of course, is why my results are fuzzy: truly only a tiny amount of data is drawing the picture, whereas normally all the interactions between all the points would be happening. But because I wrote it the way I did, my results scale linearly.

Here's an example from the original Otto Group data. First, a rendering of 2048 randomly selected rows (of the 61,876) with 1500 steps using the traditional t-SNE. It took about 10 minutes. A 4096-row run should have taken something like 30 minutes (probably 5 of the 10 minutes of processing were more or less linear)... I'm ballparking, but it definitely would have taken a long while.

Now here's my result for all 61,876 rows using the windowing technique, with a window size of 1024. (Note: if I had used 2048, each window would have taken as long as the image above.) 1500 steps... the run time was around 50 minutes.

You can see what I mean about it not being as good at making clear separations. The first image is MUCH better about keeping one group out of another... but the rendering is SOOOOO much faster in the second when you think about how much it did. It is also parallelizable (which I'm taking advantage of), since the various windows can run concurrently. Larger windows make the results better but slow everything down. I believe there are 9 different groups in there (the coloring is kind of bad; I really need a good system for making the colors as far apart from each other as possible).

Someday (though not likely soon) I will probably combine the n-body solution and my linear rendering to get the best of both worlds: each window will render as fast as possible (N log N in the window size) and the whole thing will scale out linearly and be parallelizable.

This brings me to the big failing of t-SNE: sometimes (most of the time) the data is scored in a way that doesn't group on its own. See, the groupings are based on the features you send in; the algorithm doesn't actually look at the score or category you have on the data. The only thing the score is used for is coloring the picture after it is rendered. What I need is a way to teach it what is really important, not just what is statistically significant. I'll talk about that next time!