Dim Red Glow

A blog about data mining, games, stocks and adventures.

Some animations from the cancer research

I ran the data through t-SNE (well, my layered version of it) and got some pretty good results, good at least visually. I'm not sure if these are going to pan out, but I'm hopeful. Let me say 2 things before I get to them.

First, the more I use the new log version of the image, the more I think it is fundamentally the best way to look at the image data. In fact, I think it might be the best way to deal with just about any real-world "thing" that can't be modeled in discrete values. It just does everything you want, and in a way that makes good sense. It is essentially a waveform of the whole object broken down into 1 number. It might actually be a great way to deal with sounds, 3D models, electrical signals, pictures... you name it. I think there is room for improvement in the details, but the idea itself seems good.

The 2nd thing I want to say is that the files are huge, so I'm going to share still images of the final result and then link to the .gif animations. I might upload them to YouTube, but I'm pretty sure GIF is the most efficient way to send them to anyone who wants to see them. They're lossless (within GIF's 256-color palette), which is great, and smaller than the corresponding mp4/mpeg/fla files etc. This is because normal video compressors don't handle 1000s of moving pixels nearly as efficiently as a GIF with its simple difference layers does.

In these images, purple dots represent cat scan images from a patient with no cancer. The green dots represent cat scan images from a patient who had cancer, and in the final image yellow is test data we don't have solutions for. The goal here is to identify which slices/images actually have cancer-telling info in them. So I would hope most of the images fall into a mix of purple and green dots.

Okay, this one is a mess, and your initial reaction might be: how is that useful? Well, there is an important thing to know here. The fields I fed in included the indexed location of the slices after they were sorted. While this is fairly useful for data mining, it is all but useless for visualization (a linear set of numbers is not something that has a meaningful standard deviation or localized average for use in grouping). I knew this going into the processing, but I wanted to see the result all the same.

So now we remove the indexing features and try again. (Remember, each pixel is actually a bunch of features created from multiple grids of images.)

Okay, that looks a LOT better. In the bottom right you see a group of green pixels. That is some very nice auto-grouping. I went ahead and ran this once more, this time with the test data in there as well. It is in yellow.

If anything, that's even better. Adding more data does tend to help things. The top group is really good. The ones on the left might be something too. It's hard to know, but you don't have to! That's what the data mining tools will figure out for you.

I want to do even more runs and see if I can build a better picture. A few of my features are based on index distances, and they should probably be based on my log number instead. Either way, it's fun to share! The rubber will really meet the road when I see if this actually gives me good results when I make a submission (probably a day away at least).

Here are the videos. They are 69, 47 and 77 MB each, so... they'll probably take a while to download.

http://dimredglow.com/images/animation1.gif

http://dimredglow.com/images/animation2.gif

http://dimredglow.com/images/animation3.gif

 

First steps to image processing improvement

Where to begin... First I started looking at just how big 32x32 is on the 512x512 images. It seemed too small to make a good, clear identification of abnormalities. So I increased my grid segments to 128... this might be too big. I'll probably drop them down to 64 next run.
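To put those grid sizes in perspective, here is a small sketch that counts how many segments each size gives on a 512x512 slice. It assumes the segments are non-overlapping squares on a regular grid (the post doesn't say how they are laid out), and `tile_boxes` is just an illustrative name.

```python
def tile_boxes(image_size, tile_size):
    """Return (left, top, right, bottom) boxes for non-overlapping square tiles."""
    return [
        (x, y, x + tile_size, y + tile_size)
        for y in range(0, image_size, tile_size)
        for x in range(0, image_size, tile_size)
    ]


for tile_size in (32, 64, 128):
    boxes = tile_boxes(512, tile_size)
    print(tile_size, len(boxes))  # 32 -> 256 tiles, 64 -> 64 tiles, 128 -> 16 tiles
```

Under that assumption, going from 32 to 128 cuts the number of segments per slice from 256 to 16, and 64 sits in between with 64 segments.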

I found a post by a radiologist in Kaggle's forums that further reinforced 64 as a good size (solely based on his identification of a 33x32 lesion; given the arbitrary location, 64x64 would be perfect). So that's on the short list of changes to make.

Also, I have fixed a bug in my sorting (it's really bad this existed :( ) and added a log version of the number used for sorting. (That's actually how I caught the sorting bug.) I don't generally need both, but it is useful in some contexts to use one and in others to use the other. The log number alone lacks specificity, since a 512x512 image is awfully big in number form. So rounding/lack of precision is a problem.

Unfortunately, even with me fixing my garbage inputs from before, my score didn't improve. This leads me to the real problem I need to solve.

I need to be able to identify which image actually has cancer. The patient-level identification is just not enough. That's why my score didn't improve when I fixed the inputs... but how?

I hope t-SNE can help here. I feed in all images with their respective grid data and let it self-organize. That's running right now. Hopefully the output puts the images into groups based on abnormalities, ideally grouping images with cancer. Since I don't actually know which do have cancer, I'll be looking for a grouping of images all marked 1 (from a cancer patient) and hope that gets it. There will also hopefully be groups made of a mishmash from non-cancer images.

I'm building one of those animations from before for fun. I'll share it when it's finished.

If this works, I either feed that t-SNE data (x,y) in as a new feature or use k-means to group up the groups and use that. We will see!
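The k-means option could look roughly like this. It is a plain-Python sketch of standard k-means (Lloyd's algorithm) run on the 2D (x,y) output, not the code used for the contest. The names `kmeans` and `cluster_cancer_share`, the evenly spaced seeding and the fixed iteration count are all assumptions made for illustration.

```python
def kmeans(points, k, iterations=20):
    """Group 2D points into k clusters; returns (assignment, centers)."""
    # Naive deterministic seeding: pick k evenly spaced points from the list.
    centers = [points[i * len(points) // k] for i in range(k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest center (squared Euclidean distance).
        assignment = [
            min(range(k), key=lambda c: (p[0] - centers[c][0]) ** 2 + (p[1] - centers[c][1]) ** 2)
            for p in points
        ]
        # Move each center to the mean of its members; empty clusters stay put.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return assignment, centers


def cluster_cancer_share(assignment, labels, k):
    """Fraction of each cluster's images that come from a cancer patient (label 1)."""
    shares = []
    for c in range(k):
        members = [label for a, label in zip(assignment, labels) if a == c]
        shares.append(sum(members) / len(members) if members else None)
    return shares
```

A cluster whose share is close to 1 would be the "grouping of images all marked 1" described above, and the cluster id (or its share) could then be fed in as a new feature.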

cat scan analysis and first submissions

So things have moved forward and I've made a few submissions. Things have gone okay, but I'm looking for... more.

It took the best part of a week to generate the data, though I didn't use suffix arrays. I transformed each cat scan image into a series of images constructed of averages. The first image was 1 byte that was an average of the whole thing. Then the next 4 were for each quadrant. Repeat in this manner till you have the level of detail you want. The point is, once a level is sorted, that level is done. So the data might look something like this.

Image#        1 2 3 4 5 6 7 8

---------------------------------- data below here

sort this --> 4 2 1 1 0 2 1 4   <--- most significant byte (average of the whole image)

then this --> 4 1 1 2 3 4 5 6   <--- next most significant byte (top left quadrant average)

then this --> 0 2 1 0 0 0 0 2   <--- etc.

repeat...
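The sort described above can be sketched in a few lines of Python. This is an illustration under assumptions, not the original code: images are square lists of lists with a power-of-two side, each average is floored to an integer, and the function names (`block_means`, `pyramid_key`, `sort_images`) are made up for the example.

```python
def block_means(image, blocks_per_side):
    """Average each cell of a blocks_per_side x blocks_per_side grid laid over the image."""
    step = len(image) // blocks_per_side
    means = []
    for by in range(blocks_per_side):
        for bx in range(blocks_per_side):
            total = 0
            for y in range(by * step, (by + 1) * step):
                for x in range(bx * step, (bx + 1) * step):
                    total += image[y][x]
            means.append(total // (step * step))  # floored average (an assumption)
    return means


def pyramid_key(image, levels):
    """Coarse-to-fine key: whole-image average, then the 4 quadrants, then 16 blocks, ..."""
    key = []
    for level in range(levels):
        key.extend(block_means(image, 2 ** level))
    return tuple(key)


def sort_images(images, levels):
    """Return image indices ordered by comparing keys most significant value first."""
    return sorted(range(len(images)), key=lambda i: pyramid_key(images[i], levels))
```

Python compares tuples left to right, so the whole-image average decides the order first and the finer averages only break ties, which matches the "sort this, then this" rows in the illustration.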

I broke each image into lots of little pieces, and each of those pieces got sorted into 1 gigantic index using the method above. Once I had that data, I built 3 different values for the grid elements: index location, average of the cancer content in nearby indexed cells (excluding grid segments from the same person), and finally how far away the nearest grid of the same position is (left and right on the indexed sort). This last one is supposed to give me an idea of how out of place the index is. You would think that similar images would cluster and anomalies would be indicative of something special.
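Here is a rough sketch of the second and third values, assuming the sorted index is a list of dicts with `patient`, `grid_pos` and `label` keys. Those keys, the `window` parameter and the function names are hypothetical, and the real code may weight or search neighbors differently.

```python
def neighbor_cancer_rate(entries, i, window):
    """Mean cancer label of entries within `window` index steps of i, skipping the same patient."""
    lo = max(0, i - window)
    hi = min(len(entries), i + window + 1)
    labels = [
        entries[j]["label"]
        for j in range(lo, hi)
        if j != i and entries[j]["patient"] != entries[i]["patient"]
    ]
    return sum(labels) / len(labels) if labels else None


def nearest_same_position(entries, i):
    """Index distance to the closest entry, left or right, with the same grid position."""
    best = None
    for j, entry in enumerate(entries):
        if j != i and entry["grid_pos"] == entries[i]["grid_pos"]:
            distance = abs(j - i)
            if best is None or distance < best:
                best = distance
    return best
```

The first value (index location) is simply `i` itself. A large `nearest_same_position` would mean the segment landed far from other segments taken from the same spot in the image, which is the "out of place" signal described above.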

On the last method, I'm not certain the data sample size is large enough for that (or self-similar enough). It might be, but without a visual way to check it out (I haven't written one) it can be hard to be certain. It's the sort of thing I would expect to work real well with maps; I just don't know about body features. I do this to try and solve the simple problem: I know that there is cancer in the patient, I just don't know which slice/grid part it's in. I would argue that's the real challenge for this contest. (More on that later.)

Once I have these statistics, I build a few others that are relevant to the whole image, and then take all the grid data and the few other statistics and try to make a cancer / no cancer prediction for each image... again, tons of false positives, since each image is one part of the whole and cancer may only be on 1 cat scan.

Once I have the results (using a new form of my GBM), I take the 9-fold cross-validation result of the train data and the results from the test data and send it all into another GBM. This one takes the whole body (the slices have been organized in order by a label) and produces a uniform 1024 slices broken out by percentage (I take the results from the nearest image; that becomes my feature for that percentage location). Then I build 512, 256, 128... down to 1. These features don't use the nearest image but the average of the 2, 4, 8, etc. elements that went into the 1024. I send all that data into a GBM, get predictions and... Bob's your uncle.
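The second-level features can be sketched like this. It is a minimal version under assumptions: the per-slice predictions arrive as a list already in body order, "nearest" is taken to mean the slice covering the center of each percentage bucket, and the function names are illustrative.

```python
def resample_nearest(values, n):
    """Stretch or shrink a per-slice list to n entries by taking the nearest slice."""
    count = len(values)
    return [values[min(int((i + 0.5) / n * count), count - 1)] for i in range(n)]


def halving_pyramid(base):
    """From a power-of-two-length list, build averaged levels of half the length down to 1."""
    levels = [base]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([(prev[i] + prev[i + 1]) / 2 for i in range(0, len(prev), 2)])
    return levels


def body_features(slice_predictions, width=1024):
    """Flatten the 1024, 512, 256, ... 1 levels into one feature row per patient."""
    base = resample_nearest(slice_predictions, width)
    return [value for level in halving_pyramid(base) for value in level]
```

With `width=1024` this yields 1024 + 512 + ... + 1 = 2047 features per patient regardless of how many slices the scan has. Averaging pairs level by level gives the same numbers as averaging the 2, 4, 8, etc. original elements directly, because every group has the same size.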

The accuracy... is okay. The problem I've had is that bias is super easy to introduce. Local testing puts the results at around .55 log loss, but my submissions were more in the .6 neighborhood, which makes me think that: 1, I've got nothing special, and 2, all the false positives are screwing things up. Btw, getting to that point took over 3 weeks beginning to end, with many long hours of my computer doing things. Since I started, I have radically improved the speed of the whole process and can probably get from beginning to end in about 2-3 days now. So that's good. The biggest/most important improvement of late was threading the node work inside each tree's generation. I've had the trees themselves be threaded for years, but never bothered to thread out the node work. It is now, and it really helps make as much use of the CPU as I can. That change, if nothing else, will be great for the future.
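For context on those log loss numbers, here is the standard binary log loss written out as a short sketch. The clipping constant `eps` is a common convention added here as an assumption; the contest's exact clipping may differ.

```python
import math


def log_loss(labels, probs, eps=1e-15):
    """Mean negative log-likelihood of 0/1 labels under the predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # keep log() away from 0 and 1
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(labels)


# Predicting 0.5 for every patient scores ln(2), about 0.693, whatever the labels are.
print(round(log_loss([0, 1, 1, 0], [0.5, 0.5, 0.5, 0.5]), 3))
```

So a constant 0.5 guess lands at about 0.693, which is the baseline that .55 and .6 should be read against.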

So let's talk about false positives real quick, and about knowing exactly which slice of the cat scan has the cancer in it. I think most people make a 3D model and handle the data as a whole (getting rid of the problem). That may be my solution; it is essentially what I did with the 2nd level of the GBM. I was trying to solve the problem in an image-by-image way previously, but I think there is just too much noise unless I add some insight that is missing. With regards to that, there is maybe 1 possible way to add some insight. Consider this:

No Cancer                   Cancer

00100000100             00107000100             

00010100100             00900100100             

00100011100 

 

If each digit is a whole image, the trick is to realize that 7 and 9 are unique to the cancer side. But how do you get to that point? Right now I take all the numbers by themselves and say cancer or no cancer. So 0 has a percentage chance, etc. Even this is a simplification, because I don't have a clear picture of "7" or "9". It might actually just be a special pattern of 1s and 0s (that is to say, it all looks normal and the particular oddness of the slices in the order they show is what makes the ID as cancer).
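Using the toy rows above, the difference between the two views can be sketched like this. It assumes each image has already been reduced to a single symbol, which is exactly the hard part, and the function names are illustrative.

```python
def per_symbol_cancer_rate(cancer_patients, healthy_patients):
    """Current approach: share of each symbol's occurrences that come from cancer patients."""
    counts = {}
    for label, group in ((1, cancer_patients), (0, healthy_patients)):
        for patient in group:
            for symbol in patient:
                hits, total = counts.get(symbol, (0, 0))
                counts[symbol] = (hits + label, total + 1)
    return {symbol: hits / total for symbol, (hits, total) in counts.items()}


def symbols_only_in_cancer(cancer_patients, healthy_patients):
    """The insight being looked for: symbols that never show up in a healthy patient."""
    healthy_symbols = {symbol for patient in healthy_patients for symbol in patient}
    cancer_symbols = {symbol for patient in cancer_patients for symbol in patient}
    return cancer_symbols - healthy_symbols


healthy = ["00100000100", "00010100100", "00100011100"]
cancer = ["00107000100", "00900100100"]
print(sorted(symbols_only_in_cancer(cancer, healthy)))  # ['7', '9']
```

The per-symbol rates for 0 and 1 come out somewhere in the middle because both appear on both sides, which is where the false positives come from when every image inherits its patient's label.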

So I'm noodling on this a little to see if I can find a good way to get the insight. If I can get the image analysis to indicate that a 7 or 9 is present, I'm in there, but more likely I'll make data out of the whole and rethink how to make the predictions there, adding more data to each prediction but removing the false positives.