Dim Red Glow

A blog about data mining, games, stocks and adventures.

The genetic algorithm delights don’t stop

And I’m back, and so quickly! So the improvements have started....

I tried the least squares weighting of answers... this works, but I don't think it's worth doing while training. I'll do it on the final submission but train without it.

I've just gotten the layers implemented, and that seems to work, but it's too early to know if the gains are really there. I think, if anything, this will allow the code to improve to much greater extents faster instead of getting bogged down at a lower score. (Though code without layers may eventually get there too.)

The final improvement is essentially heuristic modeling applied to the odds of any given thing happening. I did a little of this a while back and have rethought what I said since the last post. I think the big thing is to just balance the odds of feature/channel selection and function/method/mechanism selection. This should increase speed and accuracy.
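Something like this is what I have in mind (a minimal Python sketch; the weight values and names here are made up for illustration, not what the program actually stores): every feature/channel and every function gets a weight, and selection during mutation is proportional to it.

```python
import random

def weighted_pick(weights, rng=random):
    """Pick a key with probability proportional to its weight."""
    total = sum(weights.values())
    r = rng.random() * total
    for key, w in weights.items():
        r -= w
        if r <= 0:
            return key
    return key  # guard against floating-point rounding

# Hypothetical odds: bias mutation toward historically useful choices.
function_odds = {"add": 4.0, "multiply": 4.0, "sigmoid": 1.0, "logit": 1.0}
feature_odds = {"feature_1": 2.0, "feature_2": 1.0, "channel_0": 1.0}
```

Tuning those weights (by hand or from history) is the "heuristic modeling" part.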

I'm still amused by the idea of giving a result to a Kaggle contest without giving away the technology. I mean, I'd give the algorithm it generated but not how you get there. It would be delightful to win a few contests in a row without actually giving the technology away. It would turn Kaggle on its head (especially since it's not the sort of thing that translates into a kernel).

Once everything is working, the last step is to migrate it to code that can run on a video card so I can scale it massively.

more genetic code thoughts...

So I've been thinking about deep neural networks, genetic algorithms, and b-trees. First, let me say that I made some simplifications in the tree export (both in concept and in genes) and got the exported size down some. It should be in the neighborhood of 1/3 the size. I say "should" as I only exported a 4-depth, 4-stack tree, and that isn't anywhere near as big as a 6-depth, 16-stack tree. The whole exercise was, I think, academic at this point.

At the time I was still hopeful I could have the genetic program optimize it. It turns out that tree-based data mining, while systematic in approach, isn't very efficient. There are almost always far better ways to get the scores you are looking for, and the genetic programs tend to find them and throw out your tree entirely. The reason the tree is used is really a case of methodology: it's a generic way to produce a systematic narrowing of results to statistically model the data. The genetic mechanism tends to find mathy ways to model the data, and they could be in any form, tree-based or otherwise.

This leads me to some serious thoughts on what is going on in deep neural networks. They tend to have a number of layers, and each layer has access to previous layers or the original data... possibly both (depending on the nature of the convolution). It's a strong strategy for figuring things out that require a group of features to work together and be evaluated.

It turns out this is kind of what the introduction of channels is doing for me... it's also (in one way of looking at it) what stacking results does in GBM. Each channel or tree has its own concern. This made me realize that by limiting the channels to a fixed number, I was trying to shoehorn whatever it actually needs to describe the data into two ideas that get merged. Because of the strong adaptability of the channels this can work, but it isn't ideal. Ideally you let it go as wide in channels as it needs to. In fact, you really should let channels stack too.

I implemented the idea of random channel creation (or removal) and reworked the way genes are merged/created with that in mind. The results have not disappointed. It hasn't existed long, so I can't tell you how far it will go, but it does tend to get to a result faster than before.

I think there are 3 more major improvements to make. Right now, I'm still just taking the sum of the channels to get my final output. I think this can be improved by doing a least squares fit of each channel's output against the expected result to find a coefficient for the channel. This isn't needed per se, but it will help get me to where I'm going faster.
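Concretely, the idea is ordinary least squares over the channel outputs (a sketch in plain Python, assuming the outputs are collected per training sample into rows; this is my illustration, not the program's actual code):

```python
def least_squares(outputs, target):
    """Fit one coefficient per channel so the weighted sum of channel
    outputs best matches the target (normal equations, solved by
    Gaussian elimination with partial pivoting)."""
    k = len(outputs[0])  # number of channels
    # Build A = X^T X and b = X^T y.
    A = [[sum(row[i] * row[j] for row in outputs) for j in range(k)]
         for i in range(k)]
    b = [sum(row[i] * t for row, t in zip(outputs, target)) for i in range(k)]
    # Forward elimination.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    coeffs = [0.0] * k
    for r in range(k - 1, -1, -1):
        s = sum(A[r][c] * coeffs[c] for c in range(r + 1, k))
        coeffs[r] = (b[r] - s) / A[r][r]
    return coeffs
```

The final output then becomes the coefficient-weighted sum of the channels instead of the plain sum.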

The 2nd improvement is to make it so there can be layers... layers get access to the previous channels in addition to the features and whatnot. Layers could be randomly added or removed like channels. If a layer references a previous channel that doesn't exist due to mutation, I would just return 0.
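Roughly what I mean, as a Python sketch (the function names and the channel-callable shape here are just for illustration): each channel reads the raw features plus earlier channels' outputs, and a dangling reference simply reads as 0.

```python
def evaluate(layers, features):
    """layers: a list of layers; each layer is a list of channel functions.
    A channel gets the raw features plus a lookup into earlier channels'
    outputs. A reference to a channel that doesn't exist (say, removed by
    mutation) just returns 0."""
    outputs = {}
    for li, layer in enumerate(layers):
        snapshot = dict(outputs)  # channels only see earlier layers

        def lookup(ref, _seen=snapshot):
            return _seen.get(ref, 0.0)

        for ci, channel in enumerate(layer):
            outputs[(li, ci)] = channel(features, lookup)
    # Final prediction: sum of the last layer's channels.
    last = len(layers) - 1
    return sum(v for (li, _), v in outputs.items() if li == last)
```

Adding or removing a layer is then just editing the `layers` list, the same way channels are added or removed today.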

The 3rd improvement is to add some system of reinforcement. Right now I do breed better scorers more often, but I think that isn't enough. I think some system needs to be devised that eliminates underperforming genes when breeding. This is really tricky of course, because who can say there isn't some other impact; essentially, which genes are the good ones? I think some sort of heuristic needs to be added to a gene to track how well it has worked: basically a weight and a score. If a gene is copied unmodified, the weight goes up by 1 and the score averages in the score of the whole cell. If a copy is made and some change happens to the gene, or if the gene is new, the score and weight are set from just that particular cell (no average history). When deciding which gene to take when breeding two cells, the odds would reflect the two average scores, or possibly just the current scores. I don't know how well this will really work in practice, but if not this... something.
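The bookkeeping could look something like this (a sketch; the breeding odds here use the two average scores, which is one of the two options mentioned):

```python
import random

class GeneStats:
    """Tracks how well a gene has worked across the cells that carried it."""
    def __init__(self, cell_score):
        # New or just-modified gene: score/weight reflect only this cell.
        self.weight = 1
        self.score = cell_score

    def copied_unmodified(self, cell_score):
        # Unmodified copy: fold the new cell's score into the running average.
        self.score = (self.score * self.weight + cell_score) / (self.weight + 1)
        self.weight += 1

def breed_pick(gene_a, gene_b, rng=random):
    """Pick which parent's gene survives, odds weighted by average score."""
    total = gene_a.score + gene_b.score
    return gene_a if rng.random() < gene_a.score / total else gene_b
```

A gene that keeps landing in badly scoring cells drags its own average down and gets picked less and less, which is the elimination pressure I'm after.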



One possible future of kaggle contests...

First... yes, 2 posts in one day. Crazy! I wanted to add a thought that the data miners of the world may not like, and definitely something Kaggle would be dismayed at. If the genetic algorithm stuff I've been working on works... like really works... you won't be able to use their kernel system to compete, at least not as it stands.

Why? Because the results the genetic algorithm produces are based on processing power and time, not on some particular mechanism you put in place. Now, you might be able to get to a good result faster by tweaking some code, but it's not like you would run that code indefinitely on their servers. Or if you find a good gene, sharing it is an option, but basically you are giving away a winning solution by doing so.

The hard part of this, of course, is turning any particular problem into one that the gene mechanism can make short work of. There is definitely skill there, at least until most problems can be distilled via known mechanisms. Unfortunately, this isn't much different than a loading/preprocessing script.

I have to admit I'm anxious to try my tool on other problems. So much so that I'm feeling greedy as to its true value, and I kind of don't want to share the actual code, lest the world catch up in a hurry. (It will likely get improved a ton more as time goes on.) This leads me to wonder: if I just gave them an order of operations that predicts a great solution (the math actions of the genes/variables stored internally), would that be a good solution to their contest without me giving them anything in particular? I mean, they could use it to get what they wanted without knowing the first thing about how I got there.

There is a function that generates a value using the genes and the raw data. It's a single function, and essentially that plus the string of genes comprises the whole program needed to claim the result (or it should be enough). See https://www.kaggle.com/c/porto-seguro-safe-driver-prediction/rules. As long as I can generate the result and give them the mechanisms to do that (and the gene), they have all they need. I could probably distill it down to something even simpler: remove unused available functions, move any external calls internal, change the database model to something simpler, write a custom loader for the new database model, etc. Essentially, make a one-off standalone program that does the most exact version of what they need.

I am of course getting way ahead of myself; I haven't even come close to the top of the leaderboard, let alone beaten my OWN previous score, which is meh at best. All of this is, I guess, just food for thought.

the "evolution" of my genetic algorithm

I thought about it a little and I wanted to say what a genetic algorithm is, at least as I understand it, so everyone is on the same page: a schema-driven algorithm that has a final result that can be evaluated. The schema changes randomly or by merging good prior results. If a more desirable result is found (in whatever manner), it is kept for future merging.

So it has been a few weeks and things have really moved forward. I think a full history via bullet points will explain everything.

  1. I get a wild hair to retry writing a GA (genetic algorithm) about a month ago. The first version was pretty amazing but produced nothing immediately useful: it produced features that had high correlation coefficients.
  2. Turns out there were some bugs in scoring. I fixed them, and things were nowhere near as good as I thought, but they still had higher-than-normal feature scores. Unfortunately, the data mining tools didn't make particularly good use of them (they added nothing overall). I can't figure out if it is overfitting or the results just aren't all that useful; probably both.
  3. I decide to make the GA 2+d (multi-dimensional). In things like t-SNE this is the way to go, as doing 1-dimensional work is way too limiting. The way this works is there are basically two (or more) sets of results that work separately to produce an X and a Y. (They don't interact, though sometimes I flip them when loading, since they are interchangeable and that is exactly the sort of cross-breeding I want to see.) Initially, distance from the origin point was used to score correlation.
  4. The correlation scoring using distance from the origin is poor. So I create a new mechanism that scores based on distance to a mismatched result in 2-d space. The mini-universe's X and Y points are scaled to between 0 and 1, and a distance of 1/region size is considered perfect separation. This works, but the results don't seem to translate well to being useful for data mining. Also, because every point interacts with every other point, I am forced to make the scoring work on a subset of the data, which is less than ideal.
  5. I decide to abandon the scoring I created and instead make the 0,0 corner of the universe a 0 value and the 1,1 corner a 1 value. To find any point's actual value, sum the x and y (and divide by 2 if you want it between 0 and 1). This means the diagonals have the same value, but both channels/outputs of the GA can potentially dictate the result of any given point, though usually they need to work together.
  6. I make many improvements to the actual internals of the GA. This is an ongoing process, but at this point I thought I'd call it out. I'm not to my current version yet, but I've been playing with things internally: things like how often a copy is merged with another copy without error, and the introduction of the last few math functions I'll use. I decided to include the sigmoid (logistic function) and logit (its inverse) in the actual list of gene functions that are available. Even though their functionality could be reproduced with 2 or 3 other functions, it is so very useful from a mutation standpoint to make that feature available by just specifying it. This highlights the possibility for other such functions in the future.
  7. For reasons of speed, I was using a subset of the data to test and train against. Initial results look amazing. The upshot of my new technique is you don't have to mine on it; you can directly translate the output to a final result (x + y = binary classifier). I let it run for a few days and get a result of .31 (exceeding the current contest's best score of .29), only to realize that I have severely over-fit to the 10% training data I was using (which wasn't randomly selected either; it was the first 10% of the data). Two things come out of this: the idea of scaled-accuracy and a better way to train in general. Also, it turns out certain sections of the train data are very strongly organized; a competitor at Kaggle pointed this out. The data as a whole is much harder to work with.
  8. I figure if my final score is good with a small sample and with the entire training data, then it would be good overall. The logic is that if you get a self-similar score at two scales, then no matter the scale you are working at, it will be the same (scaled-accuracy). The more divisions you do, the less chance of error you have; currently I'm only doing 2 (which is poor): 1 at 15% and the entire training set. You can probably compute an ideal split based on the amount of time you want to spend testing your result. I would think 2 or 3 at 15% (averaged), 1 at 32%, and the full train data would be far better. So far, I submitted a locally tested .244 result to the Kaggle website and got back a .241. I also did a .195 and got back a .190. So for now I'll leave it at 15%/100% and just expect a small drop when I submit.
  9. It turns out there is enough logic to export a mining tree into genes. I wrote an export feature into my data miner to see what the results are. This works because GBM (how I do data mining these days) is the addition of lots of tree results. Each tree node can be turned into a series of logistic functions multiplied together. These results are then multiplied by the tree node's output, and that by the tree weight, which is added to the running total (seeded with an average). You can split up trees into the various dimensions, since final scores are just the sum of the dimensions anyway. The resulting GA schema is huge (21,000+ genes) for a 6-depth, 16-tree stack. The logic does seem to work, and I can generate the output for it (see below). I'm sure this can be optimized some with the current mechanisms in my GA; it seems unlikely to be useful in this instance, though. That schema is too large to process quickly enough to optimize in the time remaining.
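To make item 9 concrete, here's my own illustration (not the exported genes themselves) of how one tree path becomes a product of logistic functions: each split test (feature < threshold) is replaced by a steep sigmoid, the product along the path approximates the 0/1 "reached this leaf" indicator, and the leaf value times the tree weight scales it.

```python
import math

def soft_step(x, threshold, sharpness=50.0):
    """Smooth stand-in for the split test (x < threshold): close to 1 well
    below the threshold, close to 0 well above it."""
    return 1.0 / (1.0 + math.exp(sharpness * (x - threshold)))

def leaf_contribution(sample, path, leaf_value, tree_weight):
    """path: list of (feature_index, threshold, goes_left) splits from the
    root down to one leaf. The product of logistic steps approximates the
    indicator that this sample lands in this leaf."""
    indicator = 1.0
    for f, t, goes_left in path:
        step = soft_step(sample[f], t)
        indicator *= step if goes_left else (1.0 - step)
    return leaf_value * tree_weight * indicator
```

Summing `leaf_contribution` over every leaf of every tree (plus the seed average) reproduces the GBM output, which is why the exported schema balloons into thousands of genes.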

As of this writing, my genetic algorithm is running (and has been running for a few days); it is currently up to .256 on the training data. If testing holds true, I expect to see around a 2% drop in score when I submit (though the drop will get worse the better my prediction gets, as the genetic code isolates scores), which means I can expect a .316 to turn into around a .31. That, btw, is my goal. I doubt I will mess with the program too much more in this contest (unless it runs for 3-4 hours without an improvement), since the gains have been very steady. Yesterday around now it was at .244... so .012 in a day isn't bad. I'm sure it will slow, but there are 2 weeks left in the contest. This puts me about 2 days out from tying my old score from normal data mining techniques (assuming linear progression, which I'm sure will slow down), and about 7 days from my target.
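The scaled-accuracy check from item 8 boils down to something like this (a simplified sketch: score on a small slice and on the full set, and only trust results where the scales agree; the 2% tolerance mirrors the drop I see on submission):

```python
def scaled_accuracy_ok(score_fn, data, fractions=(0.15, 1.0), tolerance=0.02):
    """Accept a candidate only if its score is roughly self-similar across
    the tested scales (here a 15% slice and the full training set)."""
    scores = [score_fn(data[: max(1, int(len(data) * f))]) for f in fractions]
    return max(scores) - min(scores) <= tolerance
```

Adding more fractions (say 15%, 32%, 100%) tightens the check at the cost of more scoring time, as noted above.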

Here are some results. This is for this contest (Kaggle's Porto Seguro). The scoring is normalized Gini, which is the same as 2 * AUC (area under the receiver operating curve) - 1. This works out to a range of 1 to -1: 1 is perfect, -1 is the inverse of perfect, and 0 is the worst. The red dots are 1s, the blue dots are 0s in the training data. The top left corner is 0; the bottom right is 1.
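For reference, the metric can be computed straight from rank order; a small pair-counting version (fine for small n; real implementations sort instead of comparing all pairs):

```python
def normalized_gini(y_true, y_pred):
    """Normalized Gini = 2 * AUC - 1: +1 means the predictions rank every 1
    above every 0, -1 is the exact inverse, 0 carries no ranking signal."""
    pos = [p for t, p in zip(y_true, y_pred) if t == 1]
    neg = [p for t, p in zip(y_true, y_pred) if t == 0]
    wins = ties = 0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1
            elif p == q:
                ties += 1
    auc = (wins + 0.5 * ties) / (len(pos) * len(neg))
    return 2 * auc - 1
```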

This scored a .241 on the test data, .244 on the training.

Here is 1 GBM tree mentioned above turned into a gene and graphed. I didn't test it against the test data, but the training data said it was around .256 (iirc).

Behold my genetic algorithm

I recently started a new genetic algorithm and I've been really pleased with how well it turned out. So let me give you the rundown of how it went.

I was working on the Zillow contest (which I won't go on about; needless to say, you were trying to improve on their predictions) and thought a genetic algorithm might do well. I didn't even have my old code, so I started from scratch (which is fine; it was time to do this again from scratch).

About a day and a half to two days later I had my first results, and I was so happy with how well it worked. I should say what it was doing: it was trying to predict the score of the Zillow data using the training features. All missing data was filled with average values, and I only used real-number data (no categories; if I wanted them, they would be encoded into new yes/no features for each category, i.e. one-hot encoded features).

The actual predictions were ranked by how well they correlated to the actual result. I chose this because I use the same mechanism in my prediction engine for data mining, and it is historically a really good and fast way to know how well your data moves like the actual data moves. I found that it was really good at getting a semi-decent correlation coefficient (say 0.07... which sounds terrible, but against that data it was good) pretty quickly. The problem was it was overfitting: I could take out half the data at random and the coefficient would drop to crap.
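That ranking is just the Pearson correlation coefficient between a cell's prediction and the actual result; for completeness, a tiny version:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences:
    covariance normalized by the two standard deviations, so the result
    lands in [-1, 1]."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5
```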

Skipping a long bit ahead, I came to the conclusion that there were bugs in the scoring mechanism, though apparent overfitting was there too. I tried doing some folding of the scoring data to help eliminate this but found it didn't really help; fixing the bugs was the big thing. This happened around the end of the Zillow contest, and by then I was improving my core algorithm to deal with Zillow's data. The contest ended and I set everything aside.

Skipping ahead about 2 weeks, I had started on a new contest for Porto (still going on), and after running through the normal suite of things to try, I went back to the genetic algorithm to see what it could do.

I left everything alone and got going with it. Pretty quickly some things occurred to me. Part of the problem with the genetic program is it is fairly limited in what it can do with 1 prediction. That is, if it kicks out 1 feature to use/compare with the actual score, this limits you in doing further analysis and makes evolution equally 1-dimensional. So I introduced the idea of 1 cell (1 genetic "thing") producing multiple answers, that is, 1 for each dimension you care about. Essentially it would have 2 chains of commands to make two different predictions that in conjunction give you a "score".

But how to evaluate said predictions? There may be some good form of correlation coefficient over 2 features, but whatever it is, I never found it. Instead I decided to do a nearest-neighbor comparison.

So basically each test point is scored by where it is put in dimensional space vs all the other test points. The tests are scored only against those they don't match (Porto is a binary contest, so this works well; I would probably do a median value if I was using this on normal data, or the runtime gets ridiculous as it's O(n^2)).

In essence, I don't care where training data with the same classifier sits relative to other points with the same classifier; points only see the data points of the opposite type, and this seriously cuts down on run-time. In those cases I want as much separation as possible. Also, I don't want scale to matter: artificially inflating numbers but keeping everything the same relative distance apart should produce the same score. So I scale all dimensions to be between 0 and 1 before scoring. Then I just use 1/distance to get a value (you can use 1/distance^2 if you like).

This worked really well, but when I started looking at results I saw some odd things happening: things getting pushed to the 4 corners. I didn't want that specifically, and in a lot of ways I just want separation, not a case of extremes. To combat this I added a qualifier for scoring: if you are over a certain distance, that is the same as a 0 score (the best score). This way you get regional groupings without the genetics pushing things to the corners just to get the tiniest improvement.

Lately I've left the region setting at 3... if two points are 1/3 of the map apart, they don't interact. Also, I cap my sample size at 500 of the positive and 500 of the negative. I could do more, but again, runtime gets bad quickly, and this is about getting an idea how good the score for a cell is.
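Putting the last few paragraphs together, the scoring sketch looks roughly like this (simplified; the real run also caps samples at 500 per class as noted):

```python
import math

def scale_unit(points):
    """Scale each dimension to [0, 1] so absolute scale can't game the score."""
    xs, ys = [p[0] for p in points], [p[1] for p in points]
    def norm(v, lo, hi):
        return (v - lo) / (hi - lo) if hi > lo else 0.0
    return [(norm(x, min(xs), max(xs)), norm(y, min(ys), max(ys)))
            for x, y in points]

def separation_penalty(positives, negatives, region=3):
    """Sum 1/distance over opposite-class pairs, ignoring any pair farther
    apart than 1/region (so corners aren't rewarded). Lower is better."""
    cutoff = 1.0 / region
    penalty = 0.0
    for px, py in positives:
        for qx, qy in negatives:
            d = math.hypot(px - qx, py - qy)
            if d < cutoff:
                penalty += 1.0 / max(d, 1e-9)
    return penalty
```

Same-class pairs never enter the double loop, which is where the run-time saving comes from.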

There are lots of details beyond that, but that's the gist of it. The whole thing has come together nicely and is neat to watch run. Things do grow into corners sometimes, since I only breed the strong cells (lately I've started throwing in a few weak ones to mix it up). I kind of think I should have extinction events to help keep evolution from becoming stagnant in an attempt to get the best score possible. I'll have a follow-up post with some pictures and some more specifics.


been a long time since I rock and roll..

Wow, so I did it again: I stopped writing blog entries. I like to keep ideas going in blogs (a running story if you will), but I guess things get busy and my focuses change, and the next blog doesn't seem relevant, thus maybe not something I want to write. Then the whole blog-writing idea gets out of my mind entirely... and then months have passed. Let's set the excuses aside and give some updates.

So first, I was going to start the One Punch Man thing. I started to get into it, then not so much. That is to say, every time I started to make headway into a committed exercise routine, I'd get injured or sick or busy for like 5 days in a row. So, long story made short, it never happened. I ~am~ trying to train for another marathon right now, but I'm finding that old foot injuries are getting aggravated and then not healing as fast as I need them to, and I'm loath to run on injured feet. I'm working on the problem by doing less heel striking when I do run, but right now I'm not keeping pace with the training I should be doing because of the longer breaks between runs.

Losing some weight would help all this. I'm not happy with my weight right now, and I'm heavier than I have been in years and years (200-205; usually I'm more like 190-195, and I like being 180-185). The running hasn't really helped. It usually does some; this time, not at all. I think it's because I program/sit all the time. I guess I need a diet on top of it. The extra weight isn't doing my body any favors healing from running, either, and it's harder on the body when I do run. Try running around with 20-25 pounds of topsoil... it's harder!

Other things that have happened: I went to Gen Con, probably for the last time. I mainly went this year because it was #50, and we humans do love celebrating multiples of 10. It was a bit of a meh experience; too much of it was stuff I either didn't care to do or wasn't all that. I played no board games at all there (which is odd; it's a board game convention). I did play some Magic, but Magic there is not like it is at a Grand Prix, where it is the main focus, and it's a lackluster experience because of that. I did see They Might Be Giants (which was pretty great), play some True Dungeon (also great), and have dinner with a giant group of people at Fogo de Chão, a Brazilian steakhouse. So there were definitely some good times to be had.

I've more or less quit Magic. I might do 1 or 2 more events, but I've just not got the interest anymore. I do tend to have a little more fun when I play with long breaks in between, but I really think it is time for new hobbies.

A few things I'm doing right now for hobbies: an automatic trading card sorter (seems ironic to do it now, but I have more time, so... yeah), more data mining improvements/competitions, and maybe a Feathercoin payment program/plugin for eCommerce. I'll talk more on all that at a later time.

little exercise update

So after I published that last post I got sick... like the very next day. So 2 weeks later I start running again, and a week after that I start doing 6.2-mile runs instead of 4. That was a week ago. It's going well! ... but other than a few random sets of sit-ups or push-ups, I haven't started working the rest of the routine. It's not that it's hard to find time as much as getting started. So yeah, I run Saturday, and I think Sunday I'll try and do my first set of 51... I won't go all out at first. After a few days I'll try the full 102 (assuming I can, you know, still sit up).

One Punch Man exercise craziness

So lately I've been trying to lose weight, partly because I like the way I look when I look in the mirror and partly because it's just healthier for me. I had a goal to lose some weight by June 12th, but with the rate weight loss is working out to be, I don't think I'm gonna really come close to my goal.

The main reasons are my age (I'm 41, and metabolism slows down) and my lack of movement most of the day. I run, but it's not enough. I generally do 4 miles, 3 times a week, but my constant sitting doesn't burn a lot of calories, so it's kind of a wash. Also, my muscles seem conditioned to do said 4-mile run quite well without much training. I tried cutting carbs and that helped, but that's not doing much without enough exercise to burn the fat.

So I'm gonna do something that will really, and I mean really, make me lose weight (in a good way): I'm gonna do Saitama's exercise routine (from One Punch Man). I wouldn't do it if it wasn't possible for me. The "hard" part is the running, and that's the one part I won't sweat. So what is it?

  • 100 Push-Ups
  • 100 Sit-Ups
  • 100 Squats
  • 10KM Running (that’s 6.2 miles)

every single day.

Okay, really? Every single day? No, that wouldn't make sense; even when maintaining yourself that seems overkill unless there was some real need. So I'll probably do it 4 days a week. 6.2 miles a day is another 2.2 miles on my normal run, so it will be a little harder at first. The 100 squats and sit-ups and push-ups won't be bad as long as I break them up into sets. I'll probably do 2 workouts a day with 3 sets each: so 51 push-ups, sit-ups, and squats (3 sets of 17) twice a day.

But that's CHEATING! Actually, he doesn't really say it was all in one go. In fact, I probably should break up the run too, but honestly I don't really want to, because it's going to eat up enough of my day as is (quite the time sink). And when I get in better shape I'll probably merge the 2 other workouts into one, so it'll be 3 sets of 34 each. But that won't be for a month or so.

Am I really that out of shape? This does seem drastic, but it's more just something to amuse me. If I absolutely can't find time or hate it, I'll stop. But barring incident, I'll start tomorrow (though my first day I might do 1/2 and really go 100% Wednesday). I'm abandoning my dieting at least till I get going, because not having carbs while exercising is more brutal than I want to be. Right now I weigh around 195; I'd really like to be 180. 175 is like my perfect super-in-shape weight that I really haven't been at... er, ever (well, high school, but that was a not-yet-adult form of me). Even when I was in great shape around 28 (13 years ago, ugh) I was still like 178 (splitting hairs, I know).

Anyway, we'll see how it goes. I'll keep ya all updated. I may or may not post before and after pictures; you'll just have to wait and see, hah! Really, I'm thinking from the front there won't be THAT much difference; my face will look a little thinner. The main thing is I won't have quite the thick core (torso seen from the side), which is really my goal.

another mathy thing ... p = 4 * k + 1

So I was looking for an old video on Numberphile (one about a particular kind of prime) and re-saw this:

https://www.youtube.com/watch?v=yGsIw8LHXM8 (two square theorem)

or if you like the original video that one comes from

https://www.youtube.com/watch?v=SyJlRUBoVp0 (The Prime Problem with a One Sentence Proof)

Anyway, I found myself rewatching it and decided to see if I could make a simpler version of the proof. Not less space, mind you; simpler as in more straightforward. I think I've done that here (it's probably not new; most things in math aren't), but regardless, I submit it for your entertainment/utility.
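The theorem itself (every prime of the form 4k + 1 is a sum of two squares, and no prime of the form 4k + 3 is) is easy to sanity-check by brute force; a quick script (my illustration, not the proof):

```python
def is_prime(n):
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def two_squares(p):
    """Return (x, y) with x*x + y*y == p, or None if no pair exists."""
    x = 0
    while x * x <= p:
        y2 = p - x * x
        y = int(round(y2 ** 0.5))
        if y * y == y2:
            return (x, y)
        x += 1
    return None

# Every odd prime p = 4k + 1 decomposes; every prime p = 4k + 3 does not.
checks = [(p, two_squares(p)) for p in range(3, 200) if is_prime(p)]
```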


Also, it's worth saying that you can then figure out exactly what form the X and Y (K or what have you) need to have by just unraveling all that. Here is that little bit of extra formula stuff spelled out.




a need for better time warping

Well, weeks and weeks of working on the cancer contest have brought me back to where I started from. I want to use dynamic time warping (DTW) to match images. Once the images are matched, I think maybe look at the difference between the original and the target images to see what is left. This is probably your best place to start looking for cancer.

So why don't I do this? Because the run time is abhorrent. For a single comparison of two images, I think naively the big-O is something like N^4, where N is the number of pixels in the image. N^2 is linear DTW, but you can't just add a dimension and go up by 1 power; if I understand it right, you have to add 2 to properly do 2-d matching. So really, it's bad.

Maybe there is a way it can be done in N^3 and I'm missing something, but really it needs to be done in something like linear or at least N*log(N) time to really work. So that's where I'm leaving it.
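For reference, this is the classic 1-d DTW that already costs O(N^2) per comparison; the 2-d image version is the part that blows up:

```python
def dtw_distance(a, b):
    """Classic 1-d dynamic time warping distance between two sequences,
    O(len(a) * len(b)) time and space."""
    INF = float("inf")
    n, m = len(a), len(b)
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],       # a[i-1] repeats
                                 cost[i][j - 1],       # b[j-1] repeats
                                 cost[i - 1][j - 1])   # advance both
    return cost[n][m]
```

The warping is why two sequences that differ only by a stretched segment still come out at distance 0, which is exactly the alignment behavior I'd want between images.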

There is a cervical cancer contest out there that is very similar, except the photos are from some sort of normal optical camera. While maybe it could be done the same way, it has the same problem. I think if we solve this problem, the world will have much, much better analysis systems (in general).

It's worth mentioning I think most people do their analysis using deep neural networks. Quite honestly, I'm not sure how they would do a good job processing 2-d image data, but apparently it does work. I've got 3 weeks before the contest is over. If I can come up with a good way to do the DTW, I will; otherwise I'm throwing in the towel on this one :( .