Dim Red Glow

A blog about data mining, games, stocks and adventures.

Behold my genetic algorithm

I recently started a new genetic algorithm and I've been really pleased how well it turned out. So let me give you the run down of how it went.

I was working on the zillow contest (which i won't go on about, needless to say you were trying to improve on their predictions) and thought a genetic algorithm might do well. I didnt even have my old code so i started from scratch (which is fine. it was time to do this again from scratch).

about 1 and half to 2 days later i had my first results and i was so happy with how well it worked. i should say what it was doing. it was trying to predict the score of the zillow data using the training features. all missing data was filled with average values and i only used real number data (no categories. if i wanted them they would be encoded in to new yes/no features for each category.. ie hot encoded features).

The actual predictions were ranked by how well they correlated to the actual result. i choose this cause i use the same mechanism in my prediction engine for data mining and it is historically a really good and fast way to know how well your data moves like the actual data moves. I found that it was really good at getting a semi decent correlation coefficient (say 0.07... which sounds terrible but against that data it was good) pretty quickly. the problem was it was over fitting. I could take out half the data at random and the coefficient would drop to crap.

Skipping a long bit ahead, i came to the conclusion that there were bugs in the scoring mechanism though apparent over fitting was there too. i tried doing some folding of the scoring data to help eliminate this but found it didn't really help. fixing the bugs was the big thing. This happened around the end of the zillow contest and by then i was improving my core algorithm to deal with zillows data. the contest ended and i set everything aside.

Skipping ahead about 2 weeks later i had started on a new contest for porto (still going on) and after running through the normal suite of things to try i went back to the genetic algorithm to see what it could do.

I left everything alone and went off going with it. pretty quickly some things occured to me. part of the problem with the genetic program is it is fairly limited in what it can do with 1 prediction. That is if it kicks out 1 feature to use/compare with the actual score this limits you in doing further analysis and makes evolution equally 1 dimensional. So I introduced the idea of 1 cell (1 genetic "thing") producing multiple answers. that is 1 for each dimension you care about. Essentially it would have 2 chains of commands to make two different predictions that in conjunction give you a "score".

But how to evaluate said predictions? there may be some good form of correlation coefficient off of 2 features, but whatever it is I never found it. Instead I decide to do a nearest neighbor comparison.

So basically each test is scored by where it is put in dimensional space vs all the other test points. The test are scored only against those they dont match (porto is a binary contest so this works well, i would probably do a median value if i was using this in normal data or the runtime gets ridiculous as its O^2).

In essence i don't care were training data with the same classifier does seen any of the others with the same classifier. they only see the data points of the opposite type, this seriously cuts down on run-time. In those cases I want as much separation as possible. Also I don't want scale to matter, artificially inflating numbers but keeping everything the same relative distance apart should produce the same score. So I scale all dimensions to be between 0 and 1 before scoring. Then I just use 1/distance  to get a value (you can use 1/distance^2 if you like).

this worked really well, but when is started looking at results i saw some odd things happening. things getting pushed to the 4 corners. I didnt want that specifically and in a lot of ways I just want separation not a case of extremes. To combat this I added a qualifier for scoring, if you are over a certain distances that is the same as a 0 score (the best score). this way you get regional groupings without the genetics pushing things to the corners just to get the tiniest improvement.

Lately i've left the region setting to 3 .. if two points are 1/3 of the map apart, they dont interact. also i cap my sample size to 500 of the positive and 500 of the negative. i could do more but again runtime gets bad quickly and this is about getting an idea how good the score for a cell is.

There are lots of details beyond that but thats the gist of it. The whole thing has come together nicely and is neat to watch run. things do grow in to corners sometimes since i only breed the strong cells (lately i've started throwing in a few week ones to mix it up) I kind of think i should have extinction events to help keep evolution from becoming stagnant in an attempt to get the best score possible. I'll have a follow up post with some pictures and some more specifics.