Dim Red Glow

A blog about data mining, games, stocks and adventures.

One Punch Man exercise craziness

So lately I've been trying to lose weight, partly because I like the way I look in the mirror and partly because it's just healthier for me. I had a goal to lose some weight by June 12th, but with the rate the weight loss is going, I don't think I'm going to come close to my goal.

The main reasons are my age (I'm 41 and my metabolism has slowed down) and my lack of movement most of the day. I run, but it's not enough. I generally do 4 miles, 3 times a week, but my constant sitting doesn't burn a lot of calories, so it's kind of a wash. Also, my muscles seem conditioned to do that 4-mile run quite well without much effort. I tried cutting carbs and that helped, but it doesn't do much without enough exercise to burn the fat.

So I'm going to do something that will really, and I mean really, make me lose weight (in a good way): I'm going to do Saitama's exercise routine (from One Punch Man). I wouldn't attempt it if it weren't possible for me. The "hard" part is the running, and that's the one part I won't sweat. So what is it?

  • 100 Push-Ups
  • 100 Sit-Ups
  • 100 Squats
  • 10KM Running (that’s 6.2 miles)

every single day.

Okay, really? Every single day? No, that wouldn't make sense; even for maintenance that seems like overkill unless there were some real need. So I'll probably do it 4 days a week. 6.2 miles a day is another 2.2 miles on top of my normal run, so it will be a little harder at first. The 100 squats, sit-ups, and push-ups won't be bad as long as I break them up into sets. I'll probably do 2 workouts a day with 3 sets each, so 51 push-ups, sit-ups, and squats (3 sets of 17) twice a day.

But that's CHEATING! Actually, he never says it was all in one go. In fact, I probably should break up the run too, but honestly, I don't really want to, because it's going to eat up enough of my day as it is (quite the time sink). When I get in better shape I'll probably merge the two workouts into one, so it'll be 3 sets of 34 each, but that won't be for a month or so.

Am I really that out of shape? This does seem drastic, but it's more just something to amuse me. If I absolutely can't find the time or hate it, I'll stop. But barring incident, I'll start tomorrow (though on my first day I might do half and really go 100% Wednesday). I'm abandoning my dieting at least till I get going, because having no carbs while exercising is more brutal than I want to be. Right now I weigh around 195; I'd really like to be 180. 175 is my perfect super-in-shape weight that I really haven't been at... er, ever (well, high school, but that was a not-yet-adult form of me). Even when I was in great shape around 28 (13 years ago, ugh) I was still around 178 (splitting hairs, I know).

Anyway, we'll see how it goes; I'll keep you all updated. I may or may not post before and after pictures, you'll just have to wait and see, hah! Really, I don't think there will be THAT much difference from the front; my face will look a little thinner. The main thing is I won't have quite as thick a core (torso seen from the side), which is really my goal.

another mathy thing ... p = 4 * k + 1

So I was looking for an old Numberphile video (one about a particular kind of prime) and re-saw this:

https://www.youtube.com/watch?v=yGsIw8LHXM8 (two square theorem)

or, if you like, the original video that one comes from:

https://www.youtube.com/watch?v=SyJlRUBoVp0 (The Prime Problem with a One Sentence Proof)

Anyway, I found myself rewatching it and decided to see if I could make a simpler version of the proof. Not shorter, mind you; simpler as in more straightforward. I think I've done that here (it's probably not new, most things in math aren't), but regardless I submit it for your entertainment/utility.

 

Also, it's worth saying that you can then figure out exactly what form the x and y (k, or what have you) need to take by just unraveling all of that. Here is that little bit of extra formula spelled out.
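
If you want to sanity-check the statement itself, a brute-force search works fine for small primes. Here's a minimal sketch in C# (just an illustration of the theorem, not the proof or the formulas above; the names are mine):

```csharp
using System;

static class TwoSquares
{
    // For a prime p of the form 4k + 1, Fermat's two-square theorem says there
    // are integers x, y with x^2 + y^2 = p. This just searches for them.
    static (long x, long y)? Decompose(long p)
    {
        for (long x = 1; x * x <= p; x++)
        {
            long rest = p - x * x;
            long y = (long)Math.Sqrt(rest);
            if (y > 0 && y * y == rest) return (x, y);
        }
        return null; // nothing found (p isn't a prime of the form 4k + 1)
    }

    static void Main()
    {
        foreach (long p in new long[] { 5, 13, 17, 29, 97, 101 }) // all 4k + 1 primes
        {
            var d = Decompose(p);
            if (d.HasValue)
                Console.WriteLine($"{p} = {d.Value.x}^2 + {d.Value.y}^2");
        }
    }
}
```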


a need for better time warping

Well, weeks and weeks of working on the cancer contest have brought me back to where I started. I want to use dynamic time warping (DTW) to match images. Once the images are matched, I think I'd look at the difference between the original and the target image to see what is left over. That is probably the best place to start looking for cancer.

So why don't I do this? Because the runtime is abhorrent. For a single comparison of two images, I think the naive big-O is something like N^4, where N is the number of pixels in the image. Linear (1-D) DTW is N^2, but you can't just add a dimension and go up by one power; if I understand it right, you have to add two to properly do 2-D matching. So really it's, yeah, bad.

Maybe there is a way it can be done in N^3 and I'm missing something, but really it needs to be something like linear, or at least N*log(N), to be practical. So that's where I'm leaving it.
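
For reference, the 1-D case looks like this (a minimal sketch in C#, not the 2-D version I'd actually need). The two nested loops are where the N^2 comes from; adding image dimensions is what blows it up from there.

```csharp
using System;

static class Dtw
{
    // Classic 1-D dynamic time warping: cost[i, j] is the best alignment
    // cost of a[0..i] against b[0..j]. The double loop is the N^2.
    static double Distance(double[] a, double[] b)
    {
        int n = a.Length, m = b.Length;
        var cost = new double[n + 1, m + 1];
        for (int i = 0; i <= n; i++)
            for (int j = 0; j <= m; j++)
                cost[i, j] = double.PositiveInfinity;
        cost[0, 0] = 0;

        for (int i = 1; i <= n; i++)
        {
            for (int j = 1; j <= m; j++)
            {
                double d = Math.Abs(a[i - 1] - b[j - 1]);
                cost[i, j] = d + Math.Min(cost[i - 1, j],        // insertion
                                 Math.Min(cost[i, j - 1],        // deletion
                                          cost[i - 1, j - 1]));  // match
            }
        }
        return cost[n, m];
    }

    static void Main()
    {
        var a = new double[] { 1, 2, 3, 4 };
        var b = new double[] { 1, 1, 2, 3, 4 };
        Console.WriteLine(Distance(a, b)); // 0: b is just a stretched-out a
    }
}
```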

There is a cervical cancer contest out there that is very similar, except the photos are from some sort of normal optical camera. Maybe it could be done the same way, but it has the same problem. I think if we solve this problem, the world will have much, much better analysis systems (in general).

It's worth mentioning that I think most people do their analysis using deep neural networks. Quite honestly, I'm not sure how they do such a good job processing 2-D image data, but apparently it works. I've got 3 weeks before the contest is over. If I can come up with a good way to do the DTW, I will; otherwise I'm throwing in the towel on this one :( .

Some animations from the cancer research

I ran the data through t-SNE (well, my layered version of t-SNE) and got some pretty good results, good at least visually. I'm not sure if these are going to pan out, but I'm hopeful. Let me say two things before I get to them.

First, the more I use the new log version of the image, the more I think that is fundamentally the best way to look at the image data. In fact, I think it might be the best way to deal with just about any real-world "thing" that can't be modeled in discrete values. It just does everything you want, and in a way that makes good sense. It is essentially a waveform of the whole object broken down into one number. It might actually be a great way to deal with sounds, 3-D models, electrical signals, pictures... you name it. I think there is room for improvement in the details, but the idea itself seems good.

The second thing I want to say is the files are huge, so I'm going to share still images of the final result and then have links to the .gif animations. I might upload them to YouTube, but I'm pretty sure GIF is the most efficient way to send them to anyone who wants to see them. They're lossless, which is great, and smaller than the corresponding mp4/mpeg/flv files, because normal video compressors don't handle thousands of moving pixels nearly as efficiently as a GIF with its simple difference frames does.

In these images, a purple dot represents a CAT scan image from a patient with no cancer. The green dots represent CAT scan images from patients who had cancer, and in the final image, yellow is test data we don't have solutions for. The goal here is to identify which slices/images actually have cancer-telling info in them. So I would expect most of the images to fall in a mix of purple and green dots.

Okay, this one is a mess, and your initial reaction might be: how is that useful? Well, there is an important thing to know here. The fields I fed in included the indexed location of the slices after they were sorted. While this is fairly useful for data mining, it is all but useless for visualization (a linear set of numbers is not something that has a meaningful standard deviation or localized average for use in grouping). I knew this going into the processing, but I wanted to see the result all the same.

So now we remove the indexing features and try again (remember, each pixel is actually a bunch of features created from multiple grids of images).

Okay, that looks a LOT better. In the bottom right you see a group of green pixels; that is some very nice auto-grouping. I went ahead and ran this once more, this time with the test data in there as well. It is in yellow.

If anything, that's even better. Adding more data does tend to help things. The top group is really good. The ones on the left might be something too. It's hard to know, but you don't have to! That's what the data mining tools will figure out for you.

I want to do even more runs and see if I can build a better picture. A few of my features are based on index distances, and they should probably be based on my log number instead. Either way, it's fun to share! The rubber will really meet the road when I see if this actually gives me good results when I make a submission (probably at least a day away).

Here are the videos. They are 69, 47 and 77 MB each, so... they'll probably take a while to download.

http://dimredglow.com/images/animation1.gif

http://dimredglow.com/images/animation2.gif

http://dimredglow.com/images/animation3.gif

 

First steps to image processing improvement

Where to begin... First, I started looking at just how big 32x32 is on the 512x512 images. It seemed too small to make a good, clear identification of abnormalities. So I increased my grid segments to 128... this might be too big. I'll probably drop them down to 64 next run.

I found a post by a radiologist in Kaggle's forums that further reinforced this as a good size (solely based on his identification of a 33x32 lesion; given that the location within a cell is arbitrary, 64x64 would be perfect). So that's on the short list of changes to make.

Also, I have fixed a bug in my sorting (it's really bad this existed :( ) and added a log version of the number used for sorting (that's actually how I caught the sorting bug). I don't generally need both, but it is useful in some contexts to use one and in others the other. The log number alone lacks specificity, since a 512x512 image is awfully big in number form, so rounding/lack of precision is a problem.

Unfortunately, even with my garbage inputs from before fixed, my score didn't improve. This leads me to the real problem I need to solve.

I need to be able to identify which image actually has cancer. Patient-level identification is just not enough; that's why my score didn't improve when I fixed the inputs... but how?

I hope t-SNE can help here. I feed in all the images with their respective grid data and let it self-organize. That's running right now. Hopefully the output puts the images into groups based on abnormalities, ideally grouping images with cancer. Since I don't actually know which ones have cancer, I'll be looking for a grouping of images all marked 1 (from a cancer patient) and hope that does it. There will also hopefully be groups made of a mishmash of non-cancer images.

I'm building one of those animations from before for fun. I'll share it when it's finished.

If this works, I'll either feed that t-SNE data (x, y) in as a new feature or use k-means to cluster the groups and use that. We will see!
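
For the k-means option, something as simple as this would do (a rough sketch in C# of plain Lloyd's algorithm on the 2-D t-SNE coordinates; the names are illustrative, not my actual mining code). The cluster id, or the distance to the nearest centroid, would then just become one more column per image.

```csharp
using System;
using System.Linq;

static class KMeans2D
{
    // Cluster the 2-D t-SNE coordinates; the assignment (or distance to the
    // nearest centroid) then becomes one more feature per image.
    static int[] Cluster(double[][] points, int k, int iterations = 50, int seed = 1)
    {
        var rng = new Random(seed);
        // start from k random points as the initial centroids
        var centroids = points.OrderBy(_ => rng.Next()).Take(k)
                              .Select(p => (double[])p.Clone()).ToArray();
        var assignment = new int[points.Length];

        for (int it = 0; it < iterations; it++)
        {
            // assign each point to its nearest centroid
            for (int i = 0; i < points.Length; i++)
                assignment[i] = Enumerable.Range(0, k)
                    .OrderBy(c => Dist2(points[i], centroids[c]))
                    .First();

            // move each centroid to the mean of its members
            for (int c = 0; c < k; c++)
            {
                var members = points.Where((_, i) => assignment[i] == c).ToArray();
                if (members.Length == 0) continue;
                centroids[c][0] = members.Average(p => p[0]);
                centroids[c][1] = members.Average(p => p[1]);
            }
        }
        return assignment;
    }

    static double Dist2(double[] a, double[] b)
        => (a[0] - b[0]) * (a[0] - b[0]) + (a[1] - b[1]) * (a[1] - b[1]);

    static void Main()
    {
        var pts = new[] { new[] { 0.0, 0.1 }, new[] { 0.2, 0.0 },
                          new[] { 5.0, 5.1 }, new[] { 5.2, 4.9 } };
        Console.WriteLine(string.Join(",", Cluster(pts, 2))); // two clear clusters
    }
}
```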

cat scan analysis and first submissions

So things have moved forward and I've made a few submissions. Things have gone okay, but I'm looking for... more.

It took the better part of a week to generate the data, though I didn't use suffix arrays. I transformed each CAT scan image into a series of averages: the first value was 1 byte that was the average of the whole thing, then the next 4 were one for each quadrant, and you repeat in this manner till you have the level of detail you want. The point is, once a level is sorted, that level is done. So the data might look something like this:

Image#        1 2 3 4 5 6 7 8
------------------------------- data below here
sort this --> 4 2 1 1 0 2 1 4   <-- most significant byte (average of the whole image)
then this --> 4 1 1 2 3 4 5 6   <-- next most significant byte (top-left quadrant average)
then this --> 0 2 1 0 0 0 0 2   <-- etc.
repeat...

I broke each image into lots of little pieces, and each of those pieces got sorted into one gigantic index using the method above. Once I had that data, I built 3 different values for each grid element: index location, the average of the cancer content in nearby indexed cells (excluding grid segments from the same person), and finally how far away the nearest grid of the same position is (left and right in the indexed sort). That last one is supposed to give me an idea of how out of place the index is; you would think that similar images would cluster and anomalies would be indicative of something special.
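
To make the sorting scheme concrete, here's a rough sketch in C# of how a tile's "series of averages" key could be built and how tiles would be ordered by it (an illustration of the idea only; the names and depth are mine, not my actual code). A depth of 21 gives the whole-tile average, the 4 quadrant averages, and the 16 sub-quadrant averages.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class PyramidKey
{
    // Build the "series of averages" key for one square tile: first the average
    // of the whole tile, then the four quadrant averages, then their quadrants,
    // and so on. Sorting tiles by this byte sequence (most significant byte
    // first) puts visually similar tiles next to each other.
    static List<byte> BuildKey(byte[,] tile, int depth)
    {
        var key = new List<byte>();
        var queue = new Queue<(int r, int c, int size)>();
        queue.Enqueue((0, 0, tile.GetLength(0)));

        while (queue.Count > 0 && key.Count < depth)
        {
            var (r, c, size) = queue.Dequeue();
            key.Add(Average(tile, r, c, size));
            int half = size / 2;
            if (half > 0)
            {
                queue.Enqueue((r, c, half));               // top-left
                queue.Enqueue((r, c + half, half));        // top-right
                queue.Enqueue((r + half, c, half));        // bottom-left
                queue.Enqueue((r + half, c + half, half)); // bottom-right
            }
        }
        return key;
    }

    static byte Average(byte[,] tile, int r, int c, int size)
    {
        long sum = 0;
        for (int i = r; i < r + size; i++)
            for (int j = c; j < c + size; j++)
                sum += tile[i, j];
        return (byte)(sum / (size * size));
    }

    // Sort tile indices by their keys, byte by byte (most significant first).
    static int[] SortByKey(List<byte>[] keys)
        => Enumerable.Range(0, keys.Length)
                     .OrderBy(i => keys[i], new LexicographicComparer())
                     .ToArray();

    class LexicographicComparer : IComparer<List<byte>>
    {
        public int Compare(List<byte> a, List<byte> b)
        {
            for (int i = 0; i < Math.Min(a.Count, b.Count); i++)
                if (a[i] != b[i]) return a[i].CompareTo(b[i]);
            return a.Count.CompareTo(b.Count);
        }
    }
}
```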

On the last method, I'm not certain the data sample size is large enough for that (or self-similar enough). It might be, but without a visual way to check it out (I haven't written one), it can be hard to be certain. It's the sort of thing I would expect to work really well with maps; I just don't know about body features. I do this to try to solve a simple problem: I know there is cancer in the patient, I just don't know which slice/grid part it's in. I would argue that's the real challenge of this contest (more on that later).

Once I have these statistics, I build a few others that are relevant to the whole image, then take all the grid data and those few other statistics and try to make a cancer / no-cancer prediction for each image... again, tons of false positives, since each image is one part of the whole and the cancer may only be on one CAT scan slice.

Once I have the results (using a new form of my GBM), I take the 9-fold cross-validation result of the train data and the results from the test data and send it all into another GBM. This one takes the whole body (the slices have been organized in order by a label) and produces a uniform 1024 slices broken out by percentage (I take the result from the nearest image, and that becomes my feature for that percentage location). Then I build 512, 256, 128... down to 1; these features don't use the nearest but the average of the 2, 4, 8... etc. elements that went into the 1024. I send all that data into a GBM, get predictions, and... Bob's your uncle.
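
The resampling step, roughly, looks like this (a sketch in C# of the idea, assuming the per-slice predictions are already in body order; the helper name is mine, not my actual code):

```csharp
using System;
using System.Collections.Generic;

static class BodyFeatures
{
    // Resample one patient's per-slice predictions (already in body order) to a
    // fixed 1024 positions by nearest slice, then keep halving by averaging
    // adjacent pairs (512, 256, ..., 1) to get a fixed-length feature vector.
    static List<double> Build(double[] slicePredictions)
    {
        var features = new List<double>();

        // 1024 uniform percentage positions, nearest-slice value at each
        var level = new double[1024];
        for (int i = 0; i < 1024; i++)
        {
            int nearest = (int)Math.Round(i / 1023.0 * (slicePredictions.Length - 1));
            level[i] = slicePredictions[nearest];
        }
        features.AddRange(level);

        // 512, 256, ..., 1 by averaging pairs of the previous level
        while (level.Length > 1)
        {
            var next = new double[level.Length / 2];
            for (int i = 0; i < next.Length; i++)
                next[i] = (level[2 * i] + level[2 * i + 1]) / 2.0;
            features.AddRange(next);
            level = next;
        }
        return features;
    }
}
```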

The accuracy... is okay. The problem I've had is that bias is super easy to introduce. Local testing puts the results at around .55 log loss, but my submissions were more in the .6 neighborhood, which makes me think that: 1, I've got nothing special, and 2, all the false positives are screwing things up. By the way, getting to that point took over 3 weeks beginning to end, with many long hours of my computer doing things. Since I started, I have radically improved the speed of the whole process and can probably get from beginning to end in about 2-3 days now, so that's good. The biggest/most important improvement of late was threading the tree generation internal to the tree itself. I've had the trees themselves be threaded for years, but never bothered to thread out the node work. It is now, and it really helps make as much use of the CPU as I can. That change, if nothing else, will be great for the future.

So let's talk about false positives real quick, and knowing exactly which CAT scan slice has the cancer in it. I think most people make a 3-D model and handle the data as a whole (getting rid of the problem); that may be my solution, and it is essentially what I did with the 2nd level of the GBM. I was trying to solve the problem image by image previously, but I think there is just too much noise unless I add some insight that is missing. With regard to that, there is maybe one possible way to add some insight. Consider this:

No Cancer                   Cancer

00100000100             00107000100             

00010100100             00900100100             

00100011100 

 

If each 0 or 1 is a whole image, the trick is to realize that 7 and 9 are unique to the cancer side. But how do you get to that point? Right now I take all the numbers by themselves and say cancer or no cancer, so 0 has a percentage chance... etc. Even this is a simplification, because I don't have a clear picture of "3" or "2"; it might actually just be a special pattern of 1s and 0s (that is to say, it all looks normal, and the particular oddness of the slices in the order they show up is what makes them identifiable as cancer).

So I'm noodling on this a little to see if I can find a good way to get the insight. If I can get the image analysis to indicate a 3 or 2 is present, I'm in there; but more likely I'll make data out of the whole and rethink how to make the predictions there, adding more data to each prediction but removing the false positives.


suffix arrays maybe

I've been hard at work in my free time on this cancer image processing. I got my image load and processing down so it takes 7 minutes to load the images (this includes the white-black color balancing that makes "black" consistent across all images). The problem I'm having now is finding the best way to match a 32x32 block from the 512x512 images against all the other images to find similar entities.

I tried a brute-force approach that measured differences between blocks. That wasn't going to work: the runtime would have been centuries and the results were meh at best. So just to speed it up (before I improve the meh results), I then tried making an index out of the data from some averages and whatnot from each cell, storing them in 4 bytes as an index. I associated all rows that tied to that index in a hashmap. The problem with this is that there are too many similar cells. That is to say, the averaging functions weren't doing enough to distinguish some very similar-looking cells. So the runtime went to something like 7 years... still with meh results.

I tried short-circuiting the results to just give me "something" after a few tries, but even this was proving to take a loooong time to run. Why is this taking so long? Well, there are 250,000 base images, and I divide each one into 32x32 segments and then do a few versions of the grid where I offset by 16. I end up with 16*16 + 15*16 + 16*15 + 15*15 = 961 cells for each of the 250,000 images I want to process, and I really don't want to process every single possible cell because of runtime concerns. Or do I?
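
If you want to check that cell count, here's a tiny sketch in C# that enumerates the base grid plus the three half-cell-offset grids for a 512x512 image and prints 961 (illustration only, the names are mine):

```csharp
using System;
using System.Collections.Generic;

static class GridCells
{
    // Enumerate the top-left corners of every 32x32 cell: the base grid plus the
    // three grids offset by 16 in x, in y, and in both. For a 512x512 image that
    // is 16*16 + 15*16 + 16*15 + 15*15 = 961 cells.
    static List<(int x, int y)> Corners(int imageSize = 512, int cell = 32, int offset = 16)
    {
        var corners = new List<(int x, int y)>();
        foreach (int ox in new[] { 0, offset })
            foreach (int oy in new[] { 0, offset })
                for (int x = ox; x + cell <= imageSize; x += cell)
                    for (int y = oy; y + cell <= imageSize; y += cell)
                        corners.Add((x, y));
        return corners;
    }

    static void Main() => Console.WriteLine(Corners().Count); // prints 961
}
```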

So I'm sitting here thinking about it. I learned about suffix arrays years ago for some bioinformatics class I took (a story for another time). I've implemented one on 3 separate occasions, but it's been a while. What are they? They are a clever way of using pointers into your original array and then sorting those pointers by what they point at. If you had the string "ABA", your initial pointers would be 0,1,2; after they are sorted they would be 0,2,1 (assuming you sort "end of line" after "B"). Why is this useful? Well, you can quickly see where similar things occur in the string by scooting over one to the right or one to the left.
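A naive version is only a few lines (a sketch in C#; the comparison-based sort shown here is far too slow for 250 million entries, it's just to show the idea and reproduce the ABA example):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class SuffixArraySketch
{
    // Naive suffix array: sort the starting positions by the suffix they point at.
    static int[] Build(byte[] data)
    {
        return Enumerable.Range(0, data.Length)
                         .OrderBy(i => i, Comparer<int>.Create((a, b) => CompareSuffixes(data, a, b)))
                         .ToArray();
    }

    static int CompareSuffixes(byte[] data, int a, int b)
    {
        if (a == b) return 0;
        while (a < data.Length && b < data.Length)
        {
            if (data[a] != data[b]) return data[a].CompareTo(data[b]);
            a++; b++;
        }
        // one suffix ran out of data; treat "end of line" as sorting after every
        // byte, which is what gives ABA -> 0,2,1
        return a == data.Length ? 1 : -1;
    }

    static void Main()
    {
        var text = new[] { (byte)'A', (byte)'B', (byte)'A' };
        Console.WriteLine(string.Join(",", Build(text))); // 0,2,1
    }
}
```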

I could then put all the images' grid segments into a giant array of bytes and sort them (not a fast process, but really not that slow either). I want to keep the little images, as they represent the features I'll feed into the database; I could make smaller images, but 961 features seems more than enough for now. The first real problem here is that the array of images is about 250,000,000 long and each entry has 1K of bytes in it, so you need a lot of memory (I'm okay there; the indexing I was doing before had the same problem). The second thing you need to think about is the image segment itself: it's really not rendered in a way that is useful.

When I worked on the indexing, I thought about that some and tried making averages for the whole grid. This kind of worked, but really it dances around the issue, because I was only making 4 bytes of averages from a 1K image. I think to do this right, you need to translate the images into something more... JPEG-y. You want 1 byte that represents some average detail, then subsequent bytes that refine the average, so that as you move from the left to the right of the byte array you get more and more detailed information. This will put similar images next to each other once sorted. You can certainly translate an image into another form that does something like that. I'll have to.

Once I've done that, and once I have the data loaded into a suffix array, it's a simple matter of looking at the nearest cells in the suffix array and seeing if they are in a cancer patient or not (skipping cells that reference the original image for that grid, and cells in test data). Once you've done all that, you can get an average value for cancer/no-cancer and save it for that feature.

This method leaves the original images unstretched and untouched in any way (other than the color balance), which means a small person may not line up with a large person. Also, there are times when the image is zoomed in too much in the CAT scan, so there is clipping, etc. These problems can be resolved if there are enough images (so you can find someone similar to yourself), a bit like how sampling enough voices will eventually give you a person who has the same accent as you. The question is: is there enough data?

Hopefully I can get all this done by Monday and have a crack at actually making some predictions with the results using a normal GBM.

oh image processing... how difficult you are

I've decided to compete in https://www.kaggle.com/c/data-science-bowl-2017, which is a contest where you process patients' CAT scan images and try to identify people with stage 1 cancer. There are something like 1500 patients, and each has 100+ CAT scan slices. The number of slices is not consistent, and the order of the slices is seemingly random (they have GUIDs as names).

Until now I haven't done image processing contests. In fact, in most of my contests I don't bother with feature creation at all if I can avoid it; that's a different thing than what I enjoy working on. It can't be avoided with image processing; it is a whole thing unto itself. So why try one now? I had a close friend die from lung cancer. It is unlikely this technology would have helped him, since by the time he went in to get his cough checked, it was clear from the CAT scans that he had cancer. But it still seems like a good pursuit, and it hits close to home. The prize money (which is huge) is also nice, but so few get that that it can't be the real draw.

So what will I be doing / what have I done so far? I've loaded the images using some trial software and a simple C# program (they are DICOM medical images). It took a long while to realize that was the way to go. I tried using some open-source packages and such, but the images kept coming out a little off and I didn't know why.

I've normalized the images for clarity by removing gray backgrounds and re-balancing the image brightness with that in mind. This really makes the details clear and gives all the images the same light levels to compare against each other. I did not try maximizing the light levels; presumably they already are maximized, but I might need to implement that just in case.

I tried removing non-lung artifacts, things like the clothing and the table they are lying on, but the results were sketchy. I didn't want to lose anything important in the image by accident, so after many attempts I undid the work and decided to come back to it later.

I set up 2 data mining databases for my data: 1 to load image results into and 1 to produce the actual final prediction. The image results will go into a normal data mining database. In that, each image slice will be a row with a predicted value of cancer or no-cancer (based on the person it was taken from, not on whether that particular image shows cancer). The images will be split into a grid of small cells, plus a 2nd grid that is offset by half a cell so corner regions are not ignored. I probably should do 2 other grids as well (I have not yet) that are offset by half a width or half a height respectively (not just a 2nd grid with both). These tiles will be used to compare against all other people's images to see/find the closest match.

Mapping the data from the image database to the real/result database is a little bit of a mystery. I have a solution to do it, but I'm not sure it's the best one. Normalizing the feature count can be done a number of different ways. We'll just have to see what works best.

That last part, "compare against all other people's images to see/find the closest match," is the hard part. That little statement right there is what humans do so well and computers do not. That is where I hope to really add something and stand out in this competition. Till I get everything working I'm just going to do a simple difference measure in light levels and take the closest matching tile as the winner... but later, once I get everything working well, I've got plans to really improve the matching algorithm: everything from doing a 2-D DTW (which has about the most abhorrent runtime ever)... to building a data miner for the tiles... to just doing fuzzy matching of images... to looking at the best match for each pixel in the entire square... to ???
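
That "simple difference measure" baseline is basically a sum of absolute differences, smallest total wins. A rough sketch in C# (illustration only; the tile types and names here are placeholders, not my actual code):

```csharp
using System;

static class TileMatch
{
    // Sum of absolute pixel differences between two same-sized tiles.
    static int SumAbsDiff(byte[,] a, byte[,] b)
    {
        int total = 0;
        for (int i = 0; i < a.GetLength(0); i++)
            for (int j = 0; j < a.GetLength(1); j++)
                total += Math.Abs(a[i, j] - b[i, j]);
        return total;
    }

    // Return the index of the candidate tile closest to the query tile.
    static int ClosestTile(byte[,] query, byte[][,] candidates)
    {
        int best = -1, bestScore = int.MaxValue;
        for (int k = 0; k < candidates.Length; k++)
        {
            int score = SumAbsDiff(query, candidates[k]);
            if (score < bestScore) { bestScore = score; best = k; }
        }
        return best;
    }
}
```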

Clearly I've got lots of ideas/things I want to try, but the first step is just to get the whole jalopy running.

 

The Legacy Grand Prix in Louisville, KY

I took a deck I've been working on to the Legacy Grand Prix this weekend. It is definitely still tier 2. I had a lot of near misses (losses in 3 games, 1 match win in 4 rounds) and ended up dropping from the main event and testing the deck a lot in side events this weekend. I called it Aether Vise. I'm not done with the deck; I've gotten a lot of great information to improve it. The deck is as follows (copy-paste from http://tappedout.net):

So my notes go like this:

A little more tweaking and I think the deck will sing. Right now it feels like an 8-cylinder car with 1 dead plug. I think the amount of control is right where I want it; it was the win cons that gave me problems (with the occasional exception of too much or not enough mana, which I'm chalking up to normal game play).

So many near misses, no blowouts in either direction (except me vs. Elves... poor Elves guy. The 1 game I lost, I misplayed Winter Orb over Sphere of Resistance on turn 2; the games weren't much of a challenge. Oh! And me vs. Dead Guy Ale... that deck just seems too slow, or he had absolutely rotten luck both games).

I feel like my deck tries a little too hard to do certain things and not hard enough at others. I need it to be a smooth toolbox deck; almost everything needs to be a 1-of.

So, changes: I think the win cons should all be 1-ofs (except Ghirapur Aether Grid, which should be a 2-of since its value is huge and it shuts off Winter Orb), which means I have to cut a Black Vise. I'm going to put Ajani Vengeant in its place. I was also pleased with adding a Chandra main; that is, I ran it with a Chandra in it in side events (instead of 1 of the 3 Sun Droplets). I think she is ideal as a 1-of. Sun Droplet as a 2-of is fine... 3 was good too, but 1 would leave me looking for it too often for Spiteful Visions or as another mechanism to slow the opponent's win.

Previously I tried too hard to make all the cards tutorable, but that's not the best thinking. There is no reason I can't run a few win cons that aren't tutorable; Ajani and Chandra are both good enough to just "show up". I considered Koth as well, but he doesn't do enough unless you are playing a Blood Moon-heavy deck.

The Ensnaring Bridges should probably be a 2-of main: instead of 3 Ghostly and 1 Ensnaring, it should be 2 and 2. Ever since I migrated away from the 4 Howling Mines main (down to 1 at this point) and moved away from the Black Vise-heavy build, Ensnaring Bridge has gotten stronger and stronger.

I also think the Wastelands can go to a 1-of from 3... that leaves a spot for Blinkmoth Well and Academy Ruins (only usable with a Mox). Why? It's rare that destroying a land sets me up for later. Unlike Blood Moon wins, Wasteland almost always has to combo off of the Crucible to be all that good, or it just gives me a little early-game time walk. Since I'm not running creatures to abuse the advantage, it's not all that valuable most of the time. Also, as for the combo, 1 Ghost Quarter does the same thing, and it's in there too. All in all, Sphere has the same net effect and is way better overall.

As for Blinkmoth Well, having a non-counterable way to tap Winter Orb (against Miracles) is good, and that gives me 4 ways to tap the 4 orbs, so hopefully that runs a little smoother (1 Relic Barrier, 2 Grid and 1 Well). I considered a man-land too, but I don't think one is necessary with the addition of both planeswalkers, since Miracles has a tough time with 4-drops.

The Academy Ruins would be an experiment. More than once I wanted a way to get back an artifact, and it makes my Expedition Map better. It also gives me an out vs. slow mill.

Sideboard thoughts: the Porphyry Nodes sideboard is great; I might want another or a different creature-hate card in place of the Bridge I'm moving to the main deck. The Nevermore is great, I just need to get a lot better at picking things to name (Wear // Tear is a good one!). I went to 1 from 2 earlier; I can see arguments for going back to 2. The Leylines of Sanctity are great at 3. The same is true for Chalice: great at 3. 2 Wear // Tears was good; it seems like the right amount when I want them, though there might be better choices out there. The 2nd Rest in Peace was less impactful than I expected; I need to play more to see if it stays. The extra Pithing Needle in the side was fine, same with Blood Moon. I have no idea if Sun Droplet needs to stay in the sideboard as a 1-of. Two main seems like enough, except against Burn, and there 3 might not be enough anyway.

The current version of the build can be found here (I renamed it):

http://tappedout.net/mtg-decks/ghirapur-prison-deck-of-many-things/


Minecraft and data mining (no relation) and stocks (some relation)

It's the 2nd day of 2017, and soon it'll be 2018, or so it always seems when you look back at years. I've dawdled for too long on some old work I wanted to do, and the year turning over has put that in sharp relief.

First, I started a YouTube channel, https://www.youtube.com/channel/UCoTP8WbdsCW_6FSLz0pUSsA (Hardcore in a Hurry), for my video game exploits. I don't play a lot of video games these days, but that being said, I like playing games on the hardest setting possible. I'm not a fan of cheat modes or easy walkthrough settings (or AIs that cheat, for that matter. I mean, seriously? Couldn't you write a better AI?). That being said, Minecraft is the only thing on there now. I don't have plans to add any other types of videos right now, but things change.

Next, let me say I've spent some more time on my layered gradient boosting stuff. I learned a little bit about what t-SNE can do for me. Feeding t-SNE in initially can help if the groups do self-organize out of the original data, but normally I think it is something you want to do after a few iterations have passed; it seems the further down the gradient you are, the better the effect. The effects aren't remarkable, but they aren't terrible either. The unfortunate takeaway is that it is very time-intensive. Running t-SNE on the current state of the gradient descent and the source data just takes... well, a while. Maybe I can cherry-pick features to use and speed it up, but as it stands right now, unless the data set is pretty small it's not a good option for eking out more performance.

For a couple of years now I've wanted to implement a stock analysis program so I can do personal investing. I did this once a long time ago with a friend, using far less sophisticated methods; that's a story for another time. I've never been happy enough with my data mining software to start the code/project for investing. I think I'm happy enough now. The real work will be in getting the data into a form that is easily updated daily and makes measurable, testable predictions. My recent work on https://www.kaggle.com/c/santander-product-recommendation got me to rewrite parts of my code to handle time series in a different way, which is key for the training and evaluation (a contest I never got anywhere with; I've never implemented the MAP@<x> evaluation, so it's hard to train for such a thing).
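
For reference, the metric itself isn't much code. Here's a sketch in C# of the usual Kaggle-style MAP@k definition as I understand it (average precision at k per row, then the mean over rows; the names are mine):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class MapAtK
{
    // Average precision at k for one row: walk the ranked predictions and, each
    // time a relevant item appears (counting it only once), add the precision so
    // far; divide by min(k, number of relevant items).
    static double AveragePrecision(IList<string> predicted, ISet<string> actual, int k)
    {
        if (actual.Count == 0) return 0.0;
        double hits = 0, score = 0;
        for (int i = 0; i < Math.Min(k, predicted.Count); i++)
        {
            if (actual.Contains(predicted[i]) && !predicted.Take(i).Contains(predicted[i]))
            {
                hits++;
                score += hits / (i + 1);
            }
        }
        return score / Math.Min(k, actual.Count);
    }

    // MAP@k is just the mean of the per-row average precisions.
    static double MeanAveragePrecision(IList<IList<string>> predicted, IList<ISet<string>> actual, int k)
        => predicted.Select((p, i) => AveragePrecision(p, actual[i], k)).Average();
}
```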