I thought about it a little and wanted to say what a genetic algorithm is, at least as I understand it, so everyone is on the same page: a schema-driven algorithm whose final result can be evaluated. The schema changes randomly or by merging good prior results, and if a more desirable result is found (by whatever measure), it is kept for future merging.
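That loop can be sketched in a few lines. This is a toy version, not my actual program; the pool size, mutation rate, and the all-ones bit-string fitness are just placeholders to show the keep-the-better-result cycle:

```python
import random

def evolve(fitness, random_schema, mutate, crossover, generations=300):
    # Keep a small pool of schemas, best first.
    pool = sorted((random_schema() for _ in range(8)), key=fitness, reverse=True)
    for _ in range(generations):
        a, b = random.sample(pool[:4], 2)       # merge good prior results
        child = mutate(crossover(a, b))         # plus a random change
        if fitness(child) > fitness(pool[-1]):  # more desirable? keep it
            pool[-1] = child
            pool.sort(key=fitness, reverse=True)
    return pool[0]

# Toy run: evolve a 16-bit string toward all ones.
best = evolve(
    fitness=sum,
    random_schema=lambda: [random.randint(0, 1) for _ in range(16)],
    crossover=lambda a, b: [random.choice(pair) for pair in zip(a, b)],
    mutate=lambda bits: [b ^ (random.random() < 0.05) for b in bits],
)
```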
So it has been a few weeks and things have really moved forward. I think a full history via bullet points will explain everything.
- I got a wild hair to retry writing a GA (genetic algorithm) about a month ago. The first version was pretty amazing but produced nothing immediately useful: it generated features with high correlation coefficients.
- Turns out there were some bugs in scoring. I fixed them, and the results were nowhere near as good as I thought, though the features still scored higher than normal. Unfortunately the data mining tools didn't make particularly good use of them (they added nothing overall). I can't figure out whether it is overfitting or the results just aren't all that useful; probably both.
- I decided to make the GA 2+d (multi-dimensional). For things like t-SNE this is the way to go, as working in one dimension is far too limiting. The way this works is that there are two (or more) sets of results that work separately to produce an X and a Y. They don't interact, though sometimes I flip them when loading; since they are interchangeable, that is exactly the sort of cross-breeding I want to see. Initially, distance from the origin was used to score correlation.
- Scoring correlation by distance from the origin turned out to be poor, so I created a new mechanism that scores based on the distance to the nearest mismatched result in 2-d space. The mini-universe's X and Y points are scaled to between 0 and 1, and a distance of 1/region-size is considered perfect separation. This works, but the results don't seem to translate into anything useful for data mining. Also, because every point interacts with every other point, I am forced to run the scoring on a subset of the data, which is less than ideal.
- I decided to abandon that scoring and instead make the (0, 0) corner of the universe a 0 value and the (1, 1) corner a 1 value. To find any point's actual value, sum its X and Y (and divide by 2 if you want it between 0 and 1). This means points on the same diagonal have the same value, but either channel/output of the GA can potentially dictate the result of any given point; usually they need to work together.
- I made many improvements to the internals of the GA. This is an ongoing process, but at this point I thought I'd call it out. I'm not to my current version yet, but I've been playing with things internally, like how often a copy is merged with another copy without error, and the introduction of the last few math functions I'll use. I decided to include the sigmoid (logistic function) and the logit (its inverse) in the actual list of gene functions that are available. Even though their functionality could be reproduced with 2 or 3 other functions, it is so very useful from a mutation standpoint to make the feature available by just specifying it. This highlights the possibility for other such functions in the future.
- For reasons of speed I was using a subset of the data to test and train against, and initial results looked amazing. The upshot of my new technique is that you don't have to mine on the output; you can directly translate it to a final result (x + y = binary classifier). I let it run for a few days and got a result of .31 (exceeding the contest's current best score of .29), only to realize that I had severely overfit to the 10% training sample I was using (which wasn't randomly selected either; it was the first 10% of the data). Two things came out of this: the idea of scaled accuracy, and a better way to train in general. It also turns out certain sections of the training data are very strongly organized, as a competitor at Kaggle pointed out; the data as a whole is much harder to work with.
- I figure that if my final score is good both with a small sample and with the entire training data, then it should be good overall. The logic is that if you get a self-similar score at two scales, then no matter what scale you are working at, the score will be the same (scaled accuracy). The more divisions you do, the less chance of error you have. Currently I'm only doing 2 (which is poor): one at 15% and the entire training set. You could probably compute an ideal split based on the amount of time you want to spend testing your result; I would think 2 or 3 runs at 15% (averaged), 1 at 32%, and the full training data would be far better. So far, I submitted a locally tested .244 result to the Kaggle website and got back a .241; I also did a .195 and got back a .190. So for now I'll leave it at 15%/100% and just expect a small drop when I submit.
- It turns out there is enough logic to export a mining tree into genes, so I wrote an export feature into my data miner to see what the results look like. This works because GBM (how I do data mining these days) is the addition of lots of tree results. Each tree node can be turned into a series of logistic functions multiplied together; that product is multiplied by the tree node's output, and then by the tree weight, and added to a running total (which is seeded with an average). You can split up the trees across the various dimensions, since final scores are just the sum of the dimensions anyway. The resulting GA schema is huge (21,000+ genes) for a 16-tree, depth-6 stack. The logic does seem to work, and I can generate the output for it (see below). I'm sure this could be optimized some with the current mechanisms in my GA, but it seems unlikely to be useful in this instance: that schema is too large to process quickly enough to optimize in the time remaining.
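For the curious, the mismatch-distance scoring I abandoned can be sketched roughly like this. This is a toy NumPy version, not my actual code; the function name and the region-size default are made up for illustration:

```python
import numpy as np

def mismatch_distance_score(xy, labels, region_size=32):
    """Score a 2-d layout by each point's distance to the nearest point
    with a *different* label, capped so that 1/region_size counts as
    perfect separation.  O(n^2), hence the forced subsetting."""
    xy = np.asarray(xy, dtype=float)
    labels = np.asarray(labels)
    # Scale the mini-universe's X and Y into [0, 1].
    xy = (xy - xy.min(axis=0)) / (np.ptp(xy, axis=0) + 1e-12)
    perfect = 1.0 / region_size
    per_point = []
    for point, label in zip(xy, labels):
        others = xy[labels != label]
        nearest = np.sqrt(((others - point) ** 2).sum(axis=1)).min()
        per_point.append(min(nearest, perfect) / perfect)
    return float(np.mean(per_point))
```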
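The corner-value scoring that replaced it is much simpler; the whole mapping is one line:

```python
def point_value(x, y):
    """Map a point in the unit square to a value: the (0, 0) corner is 0,
    the (1, 1) corner is 1, and every diagonal x + y = const shares the
    same value.  Dividing by 2 keeps the result in [0, 1]."""
    return (x + y) / 2
```

Either channel can push a point on its own, but a point only reaches 1.0 when both X and Y cooperate.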
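The sigmoid/logit gene pair I added is just the standard logistic function and its inverse:

```python
import math

def sigmoid(x):
    """Logistic function: squashes the real line into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    """Inverse of the sigmoid: maps (0, 1) back onto the real line."""
    return math.log(p / (1.0 - p))
```

You could build these out of exp, log, and division genes, but having each one reachable by a single mutation is the point.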
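The scaled-accuracy check amounts to scoring the same candidate at several data scales and watching the spread. A sketch, with `score_fn` as a hypothetical stand-in for whatever scores a candidate on a set of rows (the shuffle matters because, as noted above, parts of the raw training file are strongly organized):

```python
import random

def scaled_accuracy(score_fn, rows, fractions=(0.15, 1.0), seed=0):
    """Score one candidate at several data scales.  Self-similar scores
    across scales suggest the result holds at any scale; a big spread
    is a warning sign of over-fitting."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # don't use the first N rows as-is
    scores = [score_fn(shuffled[: max(1, int(len(shuffled) * f))])
              for f in fractions]
    return scores, max(scores) - min(scores)
```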
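And the tree-to-genes export rests on one trick: each split on a leaf's path becomes a steep sigmoid that is ~1 when the split agrees with the input and ~0 when it doesn't, so multiplying them gates the leaf on or off. A sketch of the idea (the function name and the steepness constant `k` are mine; the real exporter emits these as GA genes):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def soft_leaf(x, splits, leaf_value, tree_weight, k=50.0):
    """One GBM leaf as a product of logistic functions.  Each split is
    (feature, threshold, went_left); the leaf only contributes when the
    whole path matches, and as k -> infinity this reproduces the hard
    tree exactly.  Sum soft_leaf over every leaf of every tree, seeded
    with the average, to get the GBM output."""
    gate = 1.0
    for feature, threshold, went_left in splits:
        z = k * (threshold - x[feature])          # > 0 means "x goes left"
        gate *= sigmoid(z) if went_left else sigmoid(-z)
    return tree_weight * leaf_value * gate
```

A depth-6 tree has up to 64 leaves with 6 sigmoids each, times 16 trees, times the extra genes for the arithmetic, which is how the schema balloons into the thousands.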
As of this writing my genetic algorithm is running (and has been for a few days); it is currently up to .256 on the training data. If testing holds true I expect around a 2% drop in score when I submit (though the drop will get worse as my prediction improves, since the genetic code isolates scores), which means I can expect a .316 to turn into around a .31. That, by the way, is my goal. I doubt I will mess with the program much more in this contest (unless it runs for 3-4 hours without an improvement), since the gains have been very steady. Around this time yesterday it was at .244, so .012 in a day isn't bad. I'm sure it will slow, but there are 2 weeks left in the contest. This puts me about 2 days out from tying my old score from normal data mining techniques (assuming linear progression, which I'm sure will slow down), and about 7 days from my target.
Here are some results. This is for the current contest (Kaggle's Porto Seguro). The scoring is normalized Gini, which is the same as 2 * AUC (area under the receiver operating characteristic curve) - 1. This works out to a range of -1 to 1: 1 is perfect, -1 is the inverse of perfect, and 0 is the worst (no information at all). The red dots are 1s and the blue dots are 0s in the training data. The top-left corner is 0, the bottom-right is 1.
This scored a .241 on the test data and a .244 on the training data.
Here is one of the GBM trees mentioned above, turned into a gene and graphed. I didn't test it against the test data, but on the training data it was around .256 (IIRC).
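For reference, the normalized Gini used above can be computed directly from pair orderings. A minimal sketch (ties in scores are ignored for simplicity; not the contest's official scorer):

```python
def normalized_gini(y_true, y_score):
    """Normalized Gini = 2 * AUC - 1.  1 is a perfect ranking, -1 is
    the exact inverse, 0 carries no information."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    # AUC = fraction of (negative, positive) pairs ranked correctly,
    # i.e. with the positive example scored higher.
    seen_neg, correct = 0, 0
    for _, label in sorted(zip(y_score, y_true)):
        if label == 0:
            seen_neg += 1
        else:
            correct += seen_neg
    auc = correct / (n_pos * n_neg)
    return 2 * auc - 1
```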