Dim Red Glow

A blog about data mining, games, stocks and adventures.

Oh image processing... how difficult you are

I've decided to compete in https://www.kaggle.com/c/data-science-bowl-2017, a contest where you process patient CAT scan images and try to identify people with stage 1 cancer. There are something like 1500 patients, and each has 100+ CAT scan slices. The number of slices is not consistent, and the order of the slices is seemingly random (they have GUIDs as names).

Until now I haven't done image processing contests. In fact, in most of my contests I don't bother with feature creation at all if I can avoid it; it's a different kind of work from what I enjoy. It can't be avoided with image processing, which is a whole thing unto itself. So why try one now? I had a close friend die from lung cancer. It is unlikely this technology would have helped him, since by the time he went in to get his cough checked it was already clear from the CAT scans that he had cancer. But it still seems like a good pursuit, and it hits close to home. The prize money (which is huge) is also nice, but so few get it that it can't be a real draw.

So what will I be doing, and what have I done so far? I've loaded the images using some trial software and a simple C# program (they are DICOM medical images). It took a long while to realize that was the way to go. I tried some open source packages first, but the images kept coming out a little off and I didn't know why.

I've normalized the images for clarity by removing the gray backgrounds and re-balancing the image brightness with that in mind. This really makes the details clear and gives all the images the same light levels, so they can be compared with each other. I did not try maximizing the light levels; presumably they already are, but I might need to implement that just in case.
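Here is a minimal sketch of that kind of normalization, in Python rather than the C# the real loader uses. It is not the actual code: it assumes a slice is a list of rows of integer brightness values, that the flat gray background is simply the most common value in the slice, and that anything darker than the background can be clipped to 0. The function name normalize_slice and the stretch flag are made up for illustration; stretch=True is the optional "maximize the light levels" step.

```python
from collections import Counter


def normalize_slice(pixels, stretch=False, out_max=255):
    """Shift the assumed background level to 0; optionally stretch to out_max."""
    flat = [value for row in pixels for value in row]
    # Assumption: the most frequent value is the flat gray background.
    background = Counter(flat).most_common(1)[0][0]
    # Re-balance brightness so the background sits at 0.
    # Assumption: anything darker than the background is clipped to 0.
    shifted = [[max(value - background, 0) for value in row] for row in pixels]
    if not stretch:
        return shifted
    peak = max(max(row) for row in shifted)
    if peak == 0:
        return shifted  # blank slice, nothing to stretch
    # Optional step: scale so the brightest pixel lands on out_max.
    return [[round(value * out_max / peak) for value in row] for row in shifted]


if __name__ == "__main__":
    demo = [[100, 100, 100], [100, 140, 200], [100, 100, 50]]
    print(normalize_slice(demo))
    print(normalize_slice(demo, stretch=True))
```

Picking the background as the most common value is only one guess at what "gray background" means; a real scan may need a threshold or a per-scanner constant instead.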

I tried removing non-lung artifacts, things like the clothing and the table the patient is lying on, but the results were sketchy. I didn't want to lose anything important in the image by accident, so after many attempts I undid the work and decided to come back to it later.

I set up 2 data mining databases for my data: 1 to load image results into and 1 to produce the actual final prediction. The image results will go into a normal data mining database, in which each image slice will be a row with a predicted value of cancer or no-cancer (based on the person it was taken from, not on whether that particular image shows cancer). The images will be split into a grid of small cells, plus a 2nd grid that is offset by half a cell so corner regions are not ignored. I probably should do 2 other grids as well (I have not yet), offset by half a width or half a height respectively (not just the 2nd one, which is offset by both). These tiles will be compared with all other people's images to find the closest match.
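The grid layout described above can be sketched roughly like this. It is an illustration in Python, not the real code: the function names (tile_origins, all_grids, cut_tile) and the grid labels are made up, only full cells are kept, and the cell size is whatever you choose to pass in.

```python
def tile_origins(width, height, cell, dx=0, dy=0):
    """Top-left corners of every full cell in a grid shifted by (dx, dy)."""
    return [
        (x, y)
        for y in range(dy, height - cell + 1, cell)
        for x in range(dx, width - cell + 1, cell)
    ]


def all_grids(width, height, cell):
    """The base grid plus the three half-cell offset grids."""
    half = cell // 2
    offsets = {
        "base": (0, 0),
        "half_both": (half, half),   # the 2nd grid, offset both ways
        "half_width": (half, 0),     # possible extra grid, shifted sideways
        "half_height": (0, half),    # possible extra grid, shifted down
    }
    return {
        name: tile_origins(width, height, cell, dx, dy)
        for name, (dx, dy) in offsets.items()
    }


def cut_tile(pixels, x, y, cell):
    """Copy one cell-by-cell tile out of a slice stored as a list of rows."""
    return [row[x:x + cell] for row in pixels[y:y + cell]]


if __name__ == "__main__":
    for name, origins in all_grids(8, 8, 4).items():
        print(name, origins)
```

On an 8x8 slice with 4x4 cells, the base grid gives four tiles and the "half_both" grid gives one tile centered on the spot where those four corners meet, which is exactly the region the base grid splits apart.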

Mapping the data from the image database to the real/result database is a little bit of a mystery. I have a solution for it, but I'm not sure it's the best solution. Normalizing the feature count can be done a number of different ways. We'll just have to see what works best.
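As one example of what "normalizing the feature count" could look like (not necessarily the solution mentioned above, which isn't spelled out here), the per-slice results for a patient can be collapsed into a fixed set of summary values, so every patient ends up with the same number of columns no matter how many slices they have. The function name patient_features and the choice of min/mean/max are assumptions for illustration.

```python
from statistics import mean


def patient_features(slice_scores):
    """Collapse any number of per-slice scores into a fixed-size feature row."""
    if not slice_scores:
        raise ValueError("a patient needs at least one slice score")
    ordered = sorted(slice_scores)
    return {
        "min": ordered[0],
        "mean": mean(ordered),
        "max": ordered[-1],
        "slices": len(ordered),  # keep the raw slice count as a feature too
    }


if __name__ == "__main__":
    # Two hypothetical patients with different slice counts, same feature count.
    print(patient_features([0.25, 0.75, 0.5]))
    print(patient_features([0.1, 0.2, 0.9, 0.4, 0.3]))
```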

That last part, "compare with all other people's images to find the closest match", is the hard part. That little statement right there is what humans do so well and computers do not. That is where I hope to really add something and stand out in this competition. Until I get everything working, I'm just going to do a simple difference measure in light levels and take the closest matching tile as the winner. But later, once everything is working well, I've got plans to really improve the matching algorithm: everything from doing a 2-D DTW (which has about the most abhorrent run time ever), to building a data miner for the tiles, to just doing fuzzy matching of images, to looking at the best match for each pixel in the entire square, to ???
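The simple first-pass matcher could look something like the sketch below. It assumes "difference in light levels" means the sum of absolute per-pixel differences between two same-sized tiles, which is one plausible reading and not a confirmed detail; tile_difference and closest_tile are made-up names.

```python
def tile_difference(a, b):
    """Sum of absolute brightness differences between two same-sized tiles."""
    return sum(
        abs(pa - pb)
        for row_a, row_b in zip(a, b)
        for pa, pb in zip(row_a, row_b)
    )


def closest_tile(query, candidates):
    """Return (index, difference) of the candidate tile nearest to query."""
    best = min(
        range(len(candidates)),
        key=lambda i: tile_difference(query, candidates[i]),
    )
    return best, tile_difference(query, candidates[best])


if __name__ == "__main__":
    query = [[0, 10], [20, 30]]
    others = [
        [[5, 5], [5, 5]],
        [[0, 12], [19, 30]],
        [[255, 255], [255, 255]],
    ]
    print(closest_tile(query, others))
```

This is a brute-force scan, so comparing every tile against every other patient's tiles grows quickly with the number of tiles. That is fine for getting things running, and the fancier matchers can later replace tile_difference without touching closest_tile.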

Clearly I've got lots of ideas/things I want to try. But the first step is just to get the whole jalopy running.