Dim Red Glow

A blog about data mining, games, stocks and adventures.

making data mining better

I've been working on a new version (seems like it's been a while) of my data miner. The last one I worked on/got working used the logistic function to build curves that represented the left and right side of each splitter in the trees I built. This worked really well. I'm not sure whether it was state of the art or not, but when combined with either random forests or with gradient boosting, the results were better than my previous techniques.

So what's the new one do? Well, that would be telling :) . Maybe a better thing to say is how much better is it? Not much, but it does seem to be better. Some run-of-the-mill tests on the last Kaggle contest I worked on (Home Credit Default Risk) show it's producing around .74 AUC compared to .73 with the old technique. "That's terrible!" No, it's not :) I'm just using 1 table they gave us (the main one), and I'm doing nothing special to it. I only created 1 feature. My results aren't going to be all that impressive with so little work. To me, that seems pretty decent considering what it's working with. You can read the winner's solution (which was .8057 AUC on the leaderboard) here: https://www.kaggle.com/c/home-credit-default-risk/discussion/64821

With both Home Credit Default Risk and Santander Value Prediction, I didn't really put in the effort I should have to get the training data set up, especially with Santander. That one was starting to get interesting, then they discovered the leak. I tried naively to implement it with basically zero success and said, "Meh. Even if I get this working right, I'm pretty far behind the curve. This has stopped being a contest I really want to do."

Another thing I've come to realize: the genetics can do the same thing my normal data mining does, but a better way (at least for the run time) to use the genetics seems to be to just make new features. I'm still working on mechanisms (in my head) for deciding when a particular genetic output is ideal. Correlation with the solution seems ideal, but also ideal is something that correlates highly with the solution and not with any of the given features. That is, maximize one and minimize the other. I'm just not positive on the best fitness test to do that. Once you've done it once, you do it again and again. Since each feature is independent, they ~should~ add.
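A minimal sketch of that fitness idea (all names here are hypothetical, assuming candidate features and the target are numpy arrays): reward correlation with the target, penalize the strongest correlation with any existing feature.

```python
import numpy as np

def fitness(candidate, target, existing_features, penalty=1.0):
    """Score a generated feature: high |corr| with the target,
    low |corr| with every feature we already have."""
    target_corr = abs(np.corrcoef(candidate, target)[0, 1])
    feature_corr = max(
        (abs(np.corrcoef(candidate, f)[0, 1]) for f in existing_features),
        default=0.0,
    )
    return target_corr - penalty * feature_corr

rng = np.random.default_rng(0)
y = rng.normal(size=200)                    # toy target
x1 = y + rng.normal(scale=0.5, size=200)    # an existing feature
good = y + rng.normal(scale=0.5, size=200)  # candidate with fresh noise
```

A duplicate of an existing feature scores poorly even if it tracks the target well, which is exactly the "maximize one, minimize the other" trade-off.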

Incidentally, I'll be using the new aforementioned technique on my new for-the-public data mining website ( https://yourdatamine.com ). I've spent a lot of time cleaning up my code and moving it into a wrapper so it's easily portable to the website. Cleaning it up was good too; I got rid of LOTS of old code that either did nothing, wasn't used, or was being used but shouldn't have been. I also sped up the load and made it so you can do new cross validations willy-nilly. Previously I set that up in the database. Now I can just pick a number of CV folds and the code will make new ones on the fly (or it can still use the DB). Having a consistent cross validation does give a reproducible result, but sometimes the random selection happens to be weird, so mixing it up can be good.
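On-the-fly fold assignment like that can be sketched in a few lines (a generic sketch, not the site's actual code):

```python
import random

def make_folds(n_rows, n_folds, seed=None):
    """Randomly assign each row to one of n_folds cross-validation folds,
    keeping the folds as evenly sized as possible."""
    rng = random.Random(seed)
    assignments = [i % n_folds for i in range(n_rows)]
    rng.shuffle(assignments)
    return assignments

folds = make_folds(100, 5, seed=42)
```

Passing a seed reproduces a split; omitting it gives a fresh random split each run, which is the "mix it up" behavior described above.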

The website is going to get a new page that lets you know what kind of results you can expect from a training set, by doing analysis on it using the on-the-fly cross validation and giving you expected error margins.

I made a little effort to get all that to run in memory. The processing was always done that way, but the data source used to come from a database (as I was mentioning). When I put the code into a package, I also wrote a class that directly imports the data into the structures it needs. This has 2 upshots: 1, people can rest assured I don't keep or even care about their data (I set up HTTPS for that very reason too). And 2, the website has no database to maintain. Its hard drive footprint never increases (though memory varies by current usage). If the site becomes popular, I'll make it so the requests go to new instances/farms/application servers. Those will be able to be spun up easy peasy! So it's a super scalable solution. (Or should be.)


new website on data mining and a neat video on heat death

Hello folks! Things keep moving forward (time is like that). I've been busy today. I've started to put together a website for data mining: https://www.yourdatamine.com. This is a site that allows anyone with a few CSV files or tab-delimited files to do some data mining. I might monetize it later (make it pay per use), but till I gauge the interest I'm going to make it free. I need to finish it too, of course. But I think I can have a basic version ready by end of day tomorrow. I only started on it today, so that's something.

I also have continued to update projects on http://project106.com. I need to update it tomorrow as well (just to bring the weight loss section up to date and add a project for yourdatamine.com). But really, things are going well there, even if updated sporadically.

Finally, I wanted to throw out a link to a video done by PBS Space Time on the blog's namesake... the heat death of the universe. What? You didn't realize that's what Dim Red Glow referred to? I... I don't know you any more. *grin* Seriously, the name Dim Red Glow is a reference to the idea that at the very end of the universe all that will really be left are photons zipping around the universe, slowly getting stretched out more and more as they grow fainter and fainter due to dark energy spreading them out over an ever increasing area. Eventually leaving nothing but a dim red glow. Anyway, here you go, it's a good one :) enjoy.

Still MORE genetic algorithm stuff, running and feathercoin

The short version is I streamlined the mutation process. I did 2 things. First, I made it very simple: either you breed and take 1 of the 2 parent genes, or you insert or remove a gene. The chance of doing the insert and remove is totally based on stagnation (as is the chance of a bad copy when taking the parent genes). Stagnation is how many generations we've had since a new best cell was found. Since this constantly increases, it will pass through whatever sweet spot it needs to maximize the odds of a new best... till it finds a new best, then it resets. Also, by starting out baby-stepping it, I give the code a really good chance to improve the results.
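That stagnation-scaled mutation chance might look something like this (a sketch with made-up constants, not the actual code):

```python
import random

def mutation_rate(stagnation, base=0.01, step=0.005, cap=0.5):
    """Odds of a disruptive mutation grow with each generation since
    the last new best cell, then reset when a new best is found."""
    return min(cap, base + step * stagnation)

def mutate_gene_count(genes, stagnation, rng):
    """With stagnation-scaled odds, insert a new gene or remove one."""
    if rng.random() < mutation_rate(stagnation):
        if len(genes) > 1 and rng.random() < 0.5:
            return genes[:-1]            # remove a gene
        return genes + [rng.random()]    # insert a random new gene
    return genes

# Early on mutations are rare; after long stagnation they are common.
rates = [mutation_rate(s) for s in (0, 20, 200)]
```

The sweep from `base` up to `cap` is the "pass through the sweet spot" behavior: whatever rate the population needs, stagnation eventually reaches it.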

My old code was far more off the cuff. This new method seems really, really focused. Unfortunately, the new method seems to improve a little slower at first, I think because you don't get the super crazy mutations the old one did. I'm hoping it just doesn't stop, though.

The other improvement was I made it so the number of channels and layers don't mutate right now. I think they should be allowed to enter in, but the actual process of that is probably in the same vein as gene mutation: super low odds slowly increasing, and even if you do add one, it should be a small channel (1 or 2 genes) or a layer that comes before the final result that can start being used.

I've started looking into GPUing the code. It looks like AMD GPU code choices are kinda limited; there is a C++ library that gets wrapped in various C# libraries that I'll probably end up using. The trick will be in finding the one that is easiest to use. But... I'm not in a serious hurry there.

The program is currently doing 2 layers with 3 channels each and is at .22454 Gini (which is terrible). It's been running for about a day, so hopefully this time tomorrow it's up to between .235-.255. If it gets over .255 we'll be in a good spot. The contest has a week or so to go. I think it's my best chance to win, though I might go back and fiddle with the normal data mining program over the long weekend. I really want to see if I can hone the GBM using that least squares stuff I did to the gene results to improve them.

Unrelated to any of that... marathon training. I told you I'm going to run in a marathon, right? Or at least that's the plan. I've been having a real tough time losing weight this time around, and getting the miles in on a treadmill is... hard. I stopped dieting (which was working well) this last week so I could do/enjoy Thanksgiving, but the weight is trying to come back.

In better news though, I think I've finally fixed my stride, as my plantar fasciitis isn't getting worse despite the regular longer runs. Which brings me to tonight: tonight I do 18 miles. This is kind of a rubber-meets-the-road moment. Since I started training, all attempts to do extended distances (over 13 miles) on the treadmill or otherwise have been cut short by me, usually due to concerns about damaged feet/ligaments or just not being in shape enough. So if this doesn't stick tonight, I might put a pin in the planned marathon and do one further out where I can get in better shape/have more time. We will see.

One other thing I'll mention: I've been watching Feathercoin do its thing lately. I haven't talked about cryptocurrency since, I think, before I reset the blog... it's been a while. I'm a huge supporter of Feathercoin and their team and want to see them do well, not to mention the coin itself. The point is, it's nice to see it start to take off. And I mean really take off. It was 10 cents a coin a week ago; it's toying between 25 and 30 cents now. It could be some sort of artificial spike, but I can't find a source of the news/pumping, so I think it might just be a case of some big buyers wanting to hold reserves. I'm hopeful at any rate.

The genetic algorithm delights don’t stop

And I’m back, and so quickly! So the improvements have started....

I tried the least squares weighting of answers... this works, but I think it isn't worth doing while training. I'll do it on the final submission but train without it.

I've just gotten the layers implemented, and that seems to work, but it's too early to know if the gains are really there. I think, if anything, this will allow the code to improve to much greater extents faster instead of getting bogged down at a lower score. (Though code without layers may eventually get there too.)

The final improvement is essentially heuristic modeling applied to the odds of any given thing happening. I did this a little a while back and have rethought what I said since last post. I think the big thing is to just balance the odds of feature/channel selection and function/method/mechanism selection. This should increase speed and accuracy.

I'm still amused by the idea of giving a result to a Kaggle contest without giving the technology. I mean, I'd give the algorithm it generated but not how you get there. It would be delightful to win a few contests in a row without actually giving the technology away. It would turn Kaggle on its head (especially since it's not the sort of thing that translates into a kernel).

Once everything is working, the last step is to migrate it to video-card-runnable code so I can scale it massively.

more genetic code thoughts...

So I've been thinking about deep neural networks and genetic algorithms and b-trees. First, let me say that I made some simplifications in the tree export (both in concept and in genes) and got the exported size down some. It should be in the neighborhood of 1/3 the size. I say "should" as I only exported a 4-depth, 4-stack tree, and that isn't anywhere near as big as a 6-depth, 16-stack tree. The whole exercise was, I think, academic at this point.

At the time I was still hopeful I could have the genetic program optimize it. It turns out that tree-based data mining, while systematic in approach, isn't very efficient. There are almost always far better ways to get the scores you are looking for, and the genetic programs tend to find them and throw out your tree entirely. The reason the tree is used is really a case of methodology. It's a generic way to produce a systematic narrowing of results to statistically model the data. The genetic mechanism tends to find mathy ways to model the data. They could be in any form, tree-based or otherwise.

This leads me to some serious thoughts on what is going on in deep neural networks. They tend to have a number of layers; each layer has access to previous layers or the original data... possibly both (depending on the nature of the convolution). It's a strong strategy for figuring things out that require a group of features to work together and be evaluated.

It turns out this is kind of what the introduction of channels is doing for me... it's also (in one way of looking at it) what stacking results does in GBM. Each channel or tree has its own concern. This made me realize that by limiting the channels to a fixed number, I was trying to shoehorn what it actually needs to describe the data into two ideas that get merged. Because of the strong adaptability of the channels this can work, but it isn't ideal. Ideally, you let it go as wide in channels as it needs to. In fact, you really should let channels stack too.

I implemented the idea of random channel creation (or removal) and reworked the way genes are merged/created with that in mind. The results have not disappointed. It hasn't existed long, so I can't tell you how far it will go, but it does tend to get to a result faster than before.

I think there are 3 more major improvements to make. Right now, I'm still just taking the sum of the channels to get my final output. I think this can be improved by doing a least squares fit of each channel's output against the expected result to find a coefficient for the channel. This isn't needed per se, but it will help get me where I'm going faster.
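A sketch of that channel weighting (assuming the channel outputs are stacked as columns of a numpy array): solve for the per-channel coefficients that best reproduce the expected result, instead of a plain sum.

```python
import numpy as np

# Each column is one channel's output across the training rows (toy data).
channel_outputs = np.array([[1.0, 0.0],
                            [0.0, 1.0],
                            [1.0, 1.0],
                            [2.0, 1.0]])
expected = np.array([2.0, 1.0, 3.0, 5.0])

# Least squares: find coefficients w minimizing
# ||channel_outputs @ w - expected||^2.
w, *_ = np.linalg.lstsq(channel_outputs, expected, rcond=None)
weighted_prediction = channel_outputs @ w
```

The plain sum is the special case where every coefficient is 1; the fit just lets a strong channel count for more and a weak one for less.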

The 2nd improvement is to make it so there can be layers... Layers get access to the previous channels in addition to the features and whatnot. Layers could be randomly added or removed like channels. If a layer references a previous channel that doesn't exist due to mutation, I would just return 0.

The 3rd improvement is to add some system of reinforcement. Right now I do breed better scorers more often. But I think that isn't enough; I think some system needs to be devised that eliminates underperforming genes when breeding. This is really tricky of course, because who can say there isn't some other impact. Essentially, which genes are the good ones? I think some sort of heuristic needs to be added to a gene to track how well it has worked. Basically a weight and a score. If a gene is copied unmodified, the weight goes up by 1 and the score folds in the average score of the whole cell. If a copy is made and some change happens to the gene, or if the gene is new, the data is set to the score and weight of just that particular cell (no average history). When deciding which gene to take when breeding two cells, the odds would be reflected by the two average scores, or possibly by just the current score. I don't know how well this will really work in practice, but if not this... something.
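That bookkeeping could look roughly like this (a sketch of the heuristic exactly as described above; the names are made up):

```python
from dataclasses import dataclass

@dataclass
class GeneStats:
    score: float   # running average of the cell scores this gene appeared in
    weight: float  # how many unmodified copies contributed to the average

    def copied_unmodified(self, cell_score):
        """Fold the new cell's score into the running average."""
        self.score = (self.score * self.weight + cell_score) / (self.weight + 1)
        self.weight += 1

    def mutated_or_new(self, cell_score):
        """Reset history: the stats now reflect only this cell."""
        self.score = cell_score
        self.weight = 1

g = GeneStats(score=0.5, weight=1)
g.copied_unmodified(0.7)  # running average of 0.5 and 0.7
```

During breeding, the two competing genes' `score` values could then set the selection odds, which is the "reflected by the two average scores" idea.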



Behold my genetic algorithm

I recently started a new genetic algorithm and I've been really pleased with how well it turned out. So let me give you the rundown of how it went.

I was working on the Zillow contest (which I won't go on about; needless to say, you were trying to improve on their predictions) and thought a genetic algorithm might do well. I didn't even have my old code, so I started from scratch (which is fine, it was time to do this again from scratch).

About 1 and a half to 2 days later I had my first results, and I was so happy with how well it worked. I should say what it was doing: it was trying to predict the score of the Zillow data using the training features. All missing data was filled with average values, and I only used real number data (no categories; if I wanted them, they would be encoded into new yes/no features for each category... i.e., one-hot encoded features).

The actual predictions were ranked by how well they correlated to the actual result. I chose this because I use the same mechanism in my prediction engine for data mining, and it is historically a really good and fast way to know how well your data moves like the actual data moves. I found that it was really good at getting a semi-decent correlation coefficient (say 0.07... which sounds terrible, but against that data it was good) pretty quickly. The problem was it was over-fitting. I could take out half the data at random and the coefficient would drop to crap.

Skipping a long bit ahead, I came to the conclusion that there were bugs in the scoring mechanism, though apparent over-fitting was there too. I tried doing some folding of the scoring data to help eliminate this but found it didn't really help; fixing the bugs was the big thing. This happened around the end of the Zillow contest, and by then I was improving my core algorithm to deal with Zillow's data. The contest ended and I set everything aside.

Skipping ahead about 2 weeks, I had started on a new contest for Porto (still going on), and after running through the normal suite of things to try, I went back to the genetic algorithm to see what it could do.

I left everything alone and got going with it. Pretty quickly, some things occurred to me. Part of the problem with the genetic program is it is fairly limited in what it can do with 1 prediction. That is, if it kicks out 1 feature to use/compare with the actual score, this limits you in doing further analysis and makes evolution equally 1-dimensional. So I introduced the idea of 1 cell (1 genetic "thing") producing multiple answers: 1 for each dimension you care about. Essentially, it would have 2 chains of commands to make two different predictions that in conjunction give you a "score".

But how to evaluate said predictions? There may be some good form of correlation coefficient off of 2 features, but whatever it is, I never found it. Instead I decided to do a nearest neighbor comparison.

So basically each point is scored by where it is put in dimensional space vs all the other points. The points are scored only against those they don't match (Porto is a binary contest, so this works well; I would probably split on a median value if I was using this on normal data, or the runtime gets ridiculous as it's O(n^2)).

In essence, I don't care where training data with the same classifier sits relative to other points with the same classifier; points only see the data points of the opposite type, which seriously cuts down on run-time. In those cases I want as much separation as possible. Also, I don't want scale to matter; artificially inflating numbers but keeping everything the same relative distance apart should produce the same score. So I scale all dimensions to be between 0 and 1 before scoring. Then I just use 1/distance to get a value (you can use 1/distance^2 if you like).

This worked really well, but when I started looking at results I saw some odd things happening: things getting pushed to the 4 corners. I didn't want that specifically; in a lot of ways I just want separation, not a case of extremes. To combat this, I added a qualifier for scoring: if you are over a certain distance, that is the same as a 0 score (the best score). This way you get regional groupings without the genetics pushing things to the corners just to get the tiniest improvement.

Lately I've left the region setting at 3: if two points are 1/3 of the map apart, they don't interact. Also, I cap my sample size at 500 of the positive and 500 of the negative. I could do more, but again runtime gets bad quickly, and this is about getting an idea how good the score for a cell is.
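Putting those pieces together, the scoring might be sketched like this (a simplified sketch, not the actual code: 0-1 scaling, opposite-class pairs only, 1/distance penalty, and a cutoff at 1/3 of the map; lower is better):

```python
import numpy as np

def cell_score(points, labels, cutoff=1/3):
    """Sum of 1/distance over opposite-class pairs, ignoring pairs
    farther apart than the cutoff. Lower means better separation."""
    pts = np.asarray(points, dtype=float)
    # Scale every dimension to [0, 1] so absolute scale doesn't matter.
    span = pts.max(axis=0) - pts.min(axis=0)
    span[span == 0] = 1.0
    pts = (pts - pts.min(axis=0)) / span
    labels = np.asarray(labels)
    pos, neg = pts[labels == 1], pts[labels == 0]
    score = 0.0
    for p in pos:
        d = np.linalg.norm(neg - p, axis=1)
        d = d[(d > 0) & (d < cutoff)]  # far-apart pairs contribute 0
        score += np.sum(1.0 / d)
    return score
```

Two well-separated clusters score 0 (every opposite-class pair is past the cutoff), while interleaved classes rack up large 1/distance penalties.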

There are lots of details beyond that, but that's the gist of it. The whole thing has come together nicely and is neat to watch run. Things do grow into corners sometimes. Since I only breed the strong cells (lately I've started throwing in a few weak ones to mix it up), I kind of think I should have extinction events to help keep evolution from becoming stagnant in an attempt to get the best score possible. I'll have a follow-up post with some pictures and some more specifics.


Some animations from the cancer research

I ran the data through (well, my layered version of t-SNE) and got some pretty good results. Good at least visually. I'm not sure if these are going to pan out well, but I'm hopeful. Let me say 2 things before I get to them.

First, the more I use the new log version of the image, the more I think that is fundamentally the best way to look at the image data. In fact, I think it might be the best way to deal with just about any real-world "thing" that can't be modeled in discrete values. It just does everything you want and in a way that makes good sense. It is, essentially, a wave form of the whole object broken down into 1 number. It might actually be a great way to deal with sounds, 3d models, electrical signals, pictures... you name it. I think there is room for improvement in the details, but the idea itself seems good.

The 2nd thing I want to say is the files are huge, so I'm going to share still images of the final result and then have links to the .gif animations. I might upload them to YouTube, but I'm pretty sure GIF format is the most efficient way to send them to anyone who wants to see. They're lossless, which is great, and smaller than corresponding mp4/mpeg/flv files etc. This is because normal video compressors don't handle 1000s of moving pixels nearly as efficiently as a GIF file with its simple difference layers does.

In these images, a purple dot represents a CAT scan image from a patient with no cancer. The green dots represent CAT scan images where they had cancer. And in the final image, yellow is test data we don't have solutions for. The goal here is to identify which slices/images actually have cancer-telling info in them. So I would hope most of the images fall into a mix of purple and green dots.

Okay, this one is a mess, and your initial reaction might be: how is that useful? Well, there is an important thing to know here. The fields I fed in included the indexed location of the slices after they were sorted. While this is fairly useful for data mining, it is all but useless for visualization (a linear set of numbers is not something that has a meaningful standard deviation or localized average for use in grouping). I knew this going into the processing, but I wanted to see the result all the same.

So now we remove the indexing features and try again. (Remember, each pixel is actually a bunch of features created from multiple grids of images.)

Okay, that looks a LOT better. In the bottom right you see a group of green pixels. That is some very nice auto-grouping. I went ahead and ran this once more, this time with the test data in there as well; it is in yellow.

If anything, that's even better. Adding more data does tend to help things. The top group is really good. The ones on the left might be something too. It's hard to know, but you don't have to! That's what the data miner tools will figure out for you.

I want to do even more runs and see if I can build a better picture. A few of my features are based on index distances, and they should probably be based on my log number instead. Either way, it's fun to share! The rubber will really meet the road when I see if this actually gives me good results when I make a submission. (Probably a day away at least.)

Here are the videos. They are 69, 47 and 77 megs each, so... it'll probably take a while to download.





First steps to image processing improvement

Where to begin... First, I started looking at just how big 32x32 is on the 512x512 images. It seemed too small to make good, clear identification of abnormalities. So I increased my grid segments to 128... this might be too big. I'll probably drop them down to 64 next run.

I found a post by a radiologist in Kaggle's forums that further reinforced this as a good size (solely based on his identification of a 33x32 lesion; given that, the arbitrary location 64x64 would be perfect). So that's on the short list of changes to make.

Also, I have fixed a bug in my sorting (it's really bad this existed :( ) and added a log version of the number used for sorting. (That's actually how I caught the sorting bug.) I don't generally need both, but it is useful in some contexts to use 1 and in others to use the other. The log number alone lacks specificity, since a 512x512 image is awfully big in number form, so rounding/lack of precision is a problem.

Unfortunately, even with me fixing my garbage inputs from before, my score didn't improve. This leads me to the real problem I need to solve.

I need to be able to identify which image actually has cancer. The patient-level identification is just not enough. That's why my score didn't improve when I fixed the inputs... but how?

I hope t-SNE can help here. I feed in all images with their respective grid data and let it self-organize. That's running right now. Hopefully the output puts the images into groups based on abnormalities, ideally grouping images with cancer. Since I don't actually know which do have cancer, I'll be looking for a grouping of images all marked 1 (from a cancer patient) and hope that gets it. There will also hopefully be groups made of a mishmash from non-cancer images.

I'm building one of those animations from before for fun. I'll share when it's finished.

If this works, I'll either feed that t-SNE data (x,y) in as a new feature or use k-means to group up the groups and use that. We will see!

cat scan analysis and first submissions

So things have moved forward and I've made a few submissions. Things have gone okay, but I'm looking for... more.

It took the better part of a week to generate the data, though I didn't use suffix arrays. I transformed each CAT scan image into a series of images constructed of averages. The first image was 1 byte that was an average of the whole thing. Then the next 4 were for each quadrant. Repeat in this manner till you have the level of detail you want. The point is, once it is sorted, that level is done. So the data might look something like this.

Image#        1 2 3 4 5 6 7 8
------------------------------ Data below here
sort this --> 4 2 1 1 0 2 1 4   <-- most significant byte (average of the whole image)
then this --> 4 1 1 2 3 4 5 6   <-- next most significant byte (top left quadrant average)
then this --> 0 2 1 0 0 0 0 2   <-- etc.
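
That coarse-to-fine key can be sketched as follows (a toy version using grayscale arrays and integer averages; the real pipeline used bigger tiles and more levels):

```python
import numpy as np

def average_key(img, levels=3):
    """Build a most-significant-first key: whole-image average,
    then the 4 quadrant averages, then sub-quadrant averages, etc."""
    key = []
    tiles = [np.asarray(img, dtype=float)]
    for _ in range(levels):
        key.extend(int(t.mean()) for t in tiles)
        next_tiles = []
        for t in tiles:
            h, w = t.shape[0] // 2, t.shape[1] // 2
            if h and w:  # split each tile into its 4 quadrants
                next_tiles += [t[:h, :w], t[:h, w:], t[h:, :w], t[h:, w:]]
        tiles = next_tiles
    return tuple(key)

# Sorting by this key puts visually similar images next to each other.
images = [np.full((4, 4), v) for v in (9, 1, 5)]
order = sorted(range(len(images)), key=lambda i: average_key(images[i]))
```

Because the coarse byte leads, the sort settles the big picture first and only uses the finer quadrant bytes to break ties, which is the levelwise sort in the table above.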


I broke each image into lots of little pieces, and each of those pieces got sorted into 1 gigantic index using the method above. Once I had that data, I built 3 different values for the grid elements: index location, the average of the cancer content in nearby indexed cells (excluding grid segments from the same person), and finally how far away the nearest grid of the same position is (left and right in the indexed sort). This last one is supposed to give me an idea of how out of place the index is. You would think that similar images would cluster and anomalies would be indicative of something special.

On the last method, I'm not certain the data sample size is large enough for that (or self-similar enough). It might be, but without a visual way to check it out (I haven't written one), it can be hard to be certain. It's the sort of thing I would expect to work really well with maps; I just don't know about body features. I do this to try and solve a simple problem: I know that there is cancer in the patient, I just don't know which slice/grid part it's in. I would argue that's the real challenge for this contest. (More on that later.)

Once I have these statistics, I build a few others that are relevant to the whole image, then take all the grid data and the few other statistics and try to make a cancer / no cancer prediction for each image... again, tons of false positives, since each image is one part of the whole and cancer may only be on 1 CAT scan slice.

Once I have the results (using a new form of my GBM), I take the 9-fold cross validation result of the train data and the results from the test data and send it all into another GBM. This one takes the whole body (the slices have been organized in order by a label) and produces a uniform 1024 slices broken out by percentage (I take the results from the nearest image; that becomes my feature for that percentage location). Then I build 512, 256, 128... down to 1. These features don't use the nearest but the average of the 2, 4, 8... etc. elements that went into the 1024. I send all that data into a GBM, get predictions, and... Bob's your uncle.
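The resampling-plus-averaging step might be sketched like this (a toy version with a small pyramid; in the real pipeline the per-slice values are GBM outputs and the top level is 1024):

```python
import numpy as np

def pyramid_features(slice_scores, top=8):
    """Resample per-slice scores to `top` uniform positions by nearest
    neighbor, then halve repeatedly by averaging pairs (top, top/2, ..., 1)."""
    scores = np.asarray(slice_scores, dtype=float)
    # Nearest-slice value at each uniform percentage position.
    idx = np.round(np.linspace(0, len(scores) - 1, top)).astype(int)
    level = scores[idx]
    features = [level]
    while len(level) > 1:
        level = level.reshape(-1, 2).mean(axis=1)  # average adjacent pairs
        features.append(level)
    return np.concatenate(features)

feats = pyramid_features([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], top=4)
```

Every body ends up with the same feature count regardless of how many slices it had, which is what lets the second-level GBM see whole patients uniformly.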

The accuracy... is okay. The problem I've had is bias is super easy to introduce. Local testing puts the results at around .55 log loss, but my submissions were more in the .6 neighborhood, which makes me think that: 1, I got nothing special, and 2, all the false positives are screwing things up. Btw, getting to that point took over 3 weeks beginning to end, with many long hours of my computer doing things. I have since radically improved the speed of the whole process and can probably get from beginning to end in about 2-3 days now. So that's good. The biggest/most important improvement of late was threading the tree generation internal to the tree itself. I've had the trees themselves be threaded for years, but never bothered to thread out the node work. It is now, and it really helps make as much use of the CPU as I can. That change, if nothing else, will be great for the future.

So let's talk about false positives real quick, and knowing exactly which slice of CAT scan image has the cancer in it. I think most people make a 3d model and handle the data as a whole (getting rid of the problem); that may be my solution. That is essentially what I did with the 2nd level of the GBM. I was trying to solve the problem in an image-by-image way previously, but I think there is just too much noise unless I add some insight that is missing. With regards to that, there is maybe 1 possible way to add some insight. Consider this:

No Cancer          Cancer
00100000100        00107000100
00010100100        00900100100



If each 0 or 1 is a whole image, the trick is to realize that 7 and 9 are unique to the cancer side. But how do you get to that point? Right now I take all the numbers by themselves and say cancer or no cancer, so 0 has a percentage chance... etc. Even this is a simplification, because I don't have a clear picture of a "7" or a "9"; it might actually just be a special pattern of 1s and 0s (that is to say, it all looks normal, and the particular oddness of the slices, in the order they show, is what makes it ID as cancer).

So I'm noodling on this a little to see if I can find a good way to get the insight. If I can get the image analysis to indicate a 7 or 9 is present, I'm in there. But more likely I'll make data out of the whole and rethink how to make the predictions there, thereby adding more data to each prediction but removing the false positives.





suffix arrays maybe

I've been hard at work in my free time on this cancer image processing. I got my image load and processing down so it takes 7 minutes to load the images (this includes the white-black color balancing that makes "black" consistent on all images). The problem I'm having now is finding the best way to match a 32x32 block from the 512x512 images against all the other images to find similar entities.

I tried a brute force approach that measured differences between blocks. That wasn't gonna work; the runtime would have been centuries and the results were meh at best. So just to speed it up (before I improve the meh results), I then tried making an index out of the data from some averages and whatnot from each cell, storing them in 4 bytes as an index. I associated all rows that tied to that index in a hashmap. The problem with this is there are too many similar cells. That is to say, the averaging functions weren't doing enough to distinguish some very similar-looking cells. So the runtime went to like 7 years... with still meh results.

I tried short-circuiting the results to just give me "something" after a few tries, but even this was proving to take a loooong time to run. Why is this taking so long? Well, there are 250,000 base images, I divide each one into 32x32 segments, and then I do a few versions of the grid where I offset by 16. I end up with 16*16+15*16+16*15+15*15 cells for each of the 250,000 images I want to process. Since I really don't want to process every single possible cell because of runtime concerns... Or do I?

So sitting here thinking about it, I remembered suffix arrays, which I learned about years ago in some bioinformatics class I took (a story for another time). I've implemented one on 3 separate occasions, but it's been a while. What are they? They are a clever way of using pointers into your original array, then sorting those pointers by what they point at. If you had the string "ABA", your initial pointers would be 0,1,2; after they are sorted they would be 0,2,1 (assuming you sort "end of line" after "B"). Why is this useful? Well, you can quickly see where similar things in the string occur by scooting over one to the right or one to the left.
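A minimal suffix array in a few lines (using a '~' sentinel so end-of-string sorts after the letters, matching the 0,2,1 ordering above; real implementations use smarter sorts than comparing whole suffixes):

```python
def suffix_array(s):
    """Indices into s, sorted by the suffix starting at each index."""
    s = s + "~"  # sentinel that sorts after 'A'-'Z'
    # One pointer per position in the original string, sorted by suffix.
    return sorted(range(len(s) - 1), key=lambda i: s[i:])

sa = suffix_array("ABA")  # "ABA~" < "A~" < "BA~"
```

Once sorted, neighboring entries point at similar stretches of the array, which is the "scoot one left or right" lookup.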

I could then put all the images' grid segments into a giant array of bytes and sort them (not a fast process, but really not that slow either). I want to keep the little images as they represent the features I'll feed into the database; I could make smaller images, but 961 features seems more than enough for now. The first real problem here is the array of images is about 250,000,000 long and each of those has 1k of bytes in it, so you need a lot of memory (I'm okay there; the indexing I was doing before had the same problem). The 2nd thing you need to think about is the image segment. It's really not rendered in a way that is useful.

When I worked on the indexing, I thought about that some and tried making averages for the whole grid. This kinda worked, but really it dances around the issue, because I was only making 4 bytes of averages from a 1k image. I think to do this right, you need to translate the images into something more... JPEG-y. You want 1 byte that represents some average details, then subsequent bytes that refine the average, so that as you move from the left to the right of the byte array you get more and more detailed information. This will put similar images next to each other once sorted. You can certainly translate an image into another form that does something like that. I'll have to.

Once I've done that, and once I have the data loaded into a suffix array, it's a simple matter of looking at the nearest cells in the suffix array and seeing if they are in a cancer patient or not (skipping cells that reference the original image for that grid, and cells in test data). Once you've done all that, you can get an average value for cancer/no-cancer and save it for that feature.

This method leaves the original images unstretched and untouched in any way (other than the color balance), which means a small person may not line up with a large person. Also, there are times when the image is zoomed in too much in the CAT scan, so there is clipping, etc. These problems can be resolved if there are enough images (so you can find someone similar to yourself), a bit like sampling enough voices will eventually give you a person who has the same accent as yourself. The question is: is there enough data?

Hopefully I can get all this done by Monday and have a crack at actually making some predictions with the results using a normal GBM.