Dim Red Glow

A blog about data mining, games, stocks and adventures.

suffix arrays maybe

I've been hard at work in my free time on this cancer image processing. I got my image load and processing down so it takes 7 minutes to load the images (this includes the white-black color balancing that makes "black" consistent across all images). The problem I'm having now is finding the best way to match a 32x32 block from the 512x512 images against all the other images to find similar entities.
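The balancing itself is nothing fancy. Something along these lines (just a sketch in Python; the percentile cutoff is a stand-in, not my exact numbers):

```python
import numpy as np

def balance_black(img, black_pct=1.0):
    """Shift/scale pixel values so the darkest pixels map to a consistent black.

    A sketch only: the percentile cutoff is a placeholder, not the exact
    balancing I run during the 7-minute load.
    """
    img = img.astype(np.float32)
    black = np.percentile(img, black_pct)        # estimate this image's "black" level
    white = img.max()
    out = (img - black) / max(white - black, 1e-6)
    return np.clip(out, 0.0, 1.0)
```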

I tried a brute force approach that measured differences between blocks. That wasn't going to work. The runtime would have been centuries and the results were meh at best. So just to speed it up (before I improve the meh results), I then tried making an index out of the data from some averages and whatnot for each cell, storing them in 4 bytes as an index. I associated all rows that tied to that index in a hashmap for that index. The problem with this is there are too many similar cells. That is to say, the averaging functions weren't doing enough to distinguish some very similar looking cells. So the runtime went to like 7 years... with still meh results.
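For the curious, the indexing idea was basically this (a sketch; quadrant means stand in for whatever averaging functions you like):

```python
import numpy as np
from collections import defaultdict

def block_key(block):
    """Pack a few coarse averages of a 32x32 block into a 4-byte index key.

    Quadrant means are just an example; the point is the key is only 4 bytes,
    so lots of different-looking blocks still end up with the same key.
    """
    h, w = block.shape
    quads = [block[:h//2, :w//2], block[:h//2, w//2:],
             block[h//2:, :w//2], block[h//2:, w//2:]]
    return bytes(int(q.mean()) for q in quads)

index = defaultdict(list)
# build: for each image_id, row, col -> index[block_key(block)].append((image_id, row, col))
# lookup: candidates = index[block_key(query_block)]   # still far too many near-duplicates
```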

I tried short circuiting the search to just give me "something" after a few tries, but even this was proving to take a long time to run. Why is this taking so long? Well, there are 250,000 base images, I divide each one into 32x32 segments, and then I do a few versions of the grid where I offset by 16. I end up with 16*16 + 15*16 + 16*15 + 15*15 = 961 cells for each of the 250,000 images I want to process, and I really don't want to process every single possible cell because of runtime concerns. Or do I?
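The cell count works out like this:

```python
# 32x32 cells in a 512x512 image, for each of the four grid offsets:
aligned   = 16 * 16   # 512/32 = 16 cells per side
offset_x  = 15 * 16   # shifted 16 pixels in x: only 15 full cells fit across
offset_y  = 16 * 15   # shifted 16 pixels in y
offset_xy = 15 * 15   # shifted 16 pixels in both
cells_per_image = aligned + offset_x + offset_y + offset_xy   # = 961
total_cells = cells_per_image * 250_000                       # = 240,250,000 (~250 million)
```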

So I'm sitting here thinking about it. I learned about suffix arrays years ago for some bioinformatics class I took (a story for another time). I've implemented one on 3 separate occasions, but it's been a while. What are they? They are a clever way of using pointers into your original array, then sorting those pointers by what they point at. If you had the string "ABA", your initial pointers would be 0,1,2, and after they are sorted they would be 0,2,1 (assuming you sort "end of line" after "B"). Why is this useful? Well, you can quickly see where similar parts of the string occur by scooting over one to the right or one to the left.
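A bare-bones version of the idea (the naive construction here is just to show the concept; real suffix array builds are much smarter about the sort):

```python
def suffix_array(s):
    """Return the start positions of all suffixes of s, sorted by the suffix text."""
    return sorted(range(len(s)), key=lambda i: s[i:])

print(suffix_array("ABA"))   # -> [2, 0, 1] under plain lexicographic order;
                             # I get 0,2,1 above because I sort "end of line" after "B".
```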

I could then put all the images' grid segments into a giant array of bytes and sort them (not a fast process, but really not that slow either). I want to keep the little images as they represent the features I'll feed into the database; I could make smaller images, but 961 features seems more than enough for now. The first real problem here is that the array of segments is about 250,000,000 entries long and each of those is 1k of bytes, so you need a lot of memory (I'm okay there, the indexing I was doing before had the same problem). The second thing you need to think about is the image segment itself: it's really not encoded in a way that is useful for sorting.
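In rough numbers, with the blocks laid out as one flat table of fixed-length records (the file name is a placeholder, and the one-line sort is only conceptual; at this scale you'd sort in chunks):

```python
import numpy as np

BLOCK_BYTES = 1024            # one 32x32 block, one byte per pixel
N_BLOCKS = 250_000_000        # ~961 cells * 250,000 images

# Every block as a row in one flat byte array, memory-mapped so it doesn't all
# have to sit in RAM at once. "blocks.dat" is just a placeholder filename.
blocks = np.memmap("blocks.dat", dtype=np.uint8, mode="r",
                   shape=(N_BLOCKS, BLOCK_BYTES))

# The sorted index is just block numbers ordered by block content.
order = sorted(range(N_BLOCKS), key=lambda i: blocks[i].tobytes())
```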

When I worked on the indexing I thought about that some and tried making averages for the whole grid. This kinda worked, but really it dances around the issue because I was only making 4 bytes of averages from a 1k image. I think to do this right, you need to translate the images into something more... jpeg-y. You want 1 byte that represents some average detail, then subsequent bytes that refine the average, so that as you move from the left to the right of the byte array you get more and more detailed information. This will put similar images next to each other once sorted. You can certainly translate an image into another form that does something like that. I'll have to.
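One way to get that coarse-to-fine ordering (not saying this is the exact transform I'll use, just the shape of the idea): write the whole-block average first, then the 2x2 quadrant averages, then 4x4, and so on down to the raw pixels.

```python
import numpy as np

def coarse_to_fine(block):
    """Re-encode a 32x32 block so coarse information comes first.

    Byte 0 is the whole-block mean, then the four 2x2 quadrant means, then the
    sixteen 4x4 means, and so on down to the raw pixels. Sorting these byte
    strings puts blocks with similar overall structure next to each other.
    (A sketch of the idea; the record ends up a bit longer than the raw 1k.)
    """
    out = []
    tiles_per_side = 1
    while tiles_per_side <= block.shape[0]:
        step = block.shape[0] // tiles_per_side
        for r in range(tiles_per_side):
            for c in range(tiles_per_side):
                tile = block[r*step:(r+1)*step, c*step:(c+1)*step]
                out.append(int(tile.mean()))
        tiles_per_side *= 2
    return bytes(out)
```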

Once I've done that, and once I have the data loaded into a suffix array, it's a simple matter of looking at the nearest cells in the suffix array and seeing if they are in a cancer patient or not (skipping cells that reference the original image for that grid, and cells in test data). Once you've done all that, you can get an average value for cancer/no-cancer and save it for that feature.
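Roughly like this (the names and the window size are placeholders, not the real code):

```python
def cancer_score(order, block_meta, labels, pos, window=10):
    """Average the labels of the nearest neighbours of order[pos] in the sorted order.

    order      : block indices sorted by their coarse-to-fine encoding
    block_meta : block index -> (image_id, is_test)
    labels     : image_id -> 1 (cancer) or 0 (no cancer)
    """
    image_id, _ = block_meta[order[pos]]
    votes = []
    step = 1
    while len(votes) < window and (pos - step >= 0 or pos + step < len(order)):
        for p in (pos - step, pos + step):            # scoot left and right
            if 0 <= p < len(order):
                nid, is_test = block_meta[order[p]]
                if nid != image_id and not is_test:   # skip my own image and test data
                    votes.append(labels[nid])
        step += 1
    return sum(votes) / len(votes) if votes else 0.5  # 0.5 = no usable neighbours
```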

This method leaves the original images unstretched and untouched in any way (other than the color balance), which means a small person may not line up with a large person. Also, there are times when the image is zoomed in too much in the CT scan so there is clipping, etc. These problems can be resolved if there are enough images (so you can find someone similar to yourself), a bit like how sampling enough voices will eventually give you a person who has the same accent as yourself. The question is, is there enough data?

Hopefully I can get all this done by Monday and have a crack at actually making some predictions with the results using a normal GBM.