Dim Red Glow

A blog about data mining, games, stocks and adventures.

bias bias bias...

Work continues on the client/server version of the genetic algorithm, but the pressing issue still remains: bias. The results I get locally at the end of a run are usually far better than the number 1 spot on the leaderboard... but my submissions are never that good. I think I found two things that might be causing it.

The first was some code I had in place from when I was trying to optimize RMSE instead of the correlation coefficient. I was calculating an average value to be used as "filler". The genetic algorithm has two major things going on: one is the prediction itself, and the other is deciding when the prediction is used. When it's not used, it falls back to the average value (it's actually not a hard-and-fast switch but a weighting based on a second prediction running alongside). The point is that the value being used wasn't the average for that stack (it's a series of stacks, each one adjusting for the error of the last) but the average across all stacks. The difference is that the average of the current stack is usually 0 (for the first stack it is not), while the average across all stacks works out to roughly the training set's overall average. I don't know how much that matters, because the linear algebra at the end adjusts the correlated values to actual values... but it certainly didn't help.
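Here's a tiny sketch of the difference (the numbers and names are made up for illustration, not my actual code):

// Minimal sketch: the "filler" should be the mean of the CURRENT stack's
// targets, not the mean across all stacks. After the first stack, each stack
// fits the previous stack's residuals, whose mean is roughly 0, so reusing
// the global training-set average as filler is the wrong value.
using System;
using System.Linq;

class FillerSketch
{
    static void Main()
    {
        double[] target = { 3.0, 5.0, 4.0, 6.0 };

        // Stack 0 predicts the raw target; its filler is the training-set mean.
        double stack0Filler = target.Average();            // ~4.5

        // Pretend stack 0 produced these predictions.
        double[] stack0Pred = { 3.2, 4.8, 4.1, 5.9 };

        // Stack 1 fits the residuals left over from stack 0.
        double[] residuals = target.Zip(stack0Pred, (t, p) => t - p).ToArray();

        // Correct filler for stack 1: the mean of ITS targets (the residuals), ~0.
        double stack1Filler = residuals.Average();

        // The bug: reusing the global average (~4.5) as filler for a residual stack.
        double buggyFiller = stack0Filler;

        Console.WriteLine($"stack 1 filler (correct): {stack1Filler:F3}");
        Console.WriteLine($"stack 1 filler (buggy):   {buggyFiller:F3}");
    }
}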

The second was the realization that even with the self-similar approach I'm still fitting to my training data... and all of it... so bias is unavoidable. I might have a way to fix that. Basically, I'm going to treat the entire stack as one fitted answer built on 63.2% of the data (that's right, this is bagging), using the same exact training data portion over and over again as I fit my genetics to it (it's still self-similar too). At the end, when I'm done improving, I can take the remaining held-out results and use those to figure out how I should adjust my correlated predictions. In short, the holdout data becomes the unbiased mechanism I use to scale the correlated values. I could also use it to get an accuracy estimate if I wanted. I might, but right now I'm just using it to do the scaling in an unbiased way.
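Roughly, the plan looks something like this (an illustrative sketch with made-up names and placeholder data, not the real client/server code):

// Sketch: sample rows with replacement so the distinct in-bag rows cover
// ~63.2% of the data, train the whole stack on those rows only, then use the
// out-of-bag rows to fit an unbiased linear scaling of the correlated
// predictions.
using System;
using System.Collections.Generic;
using System.Linq;

class BaggingScaleSketch
{
    static void Main()
    {
        var rng = new Random(42);
        int n = 1000;

        // Bootstrap sample: distinct indices cover about 63.2% of the rows.
        var inBag = new HashSet<int>(Enumerable.Range(0, n).Select(_ => rng.Next(n)));
        var outOfBag = Enumerable.Range(0, n).Where(i => !inBag.Contains(i)).ToArray();
        Console.WriteLine($"in-bag fraction: {(double)inBag.Count / n:P1}");

        // ... train the entire stack on the in-bag rows only ...

        // On the out-of-bag rows, fit actual ≈ a * prediction + b by least
        // squares; those rows were never fit, so the scaling isn't biased by
        // the training. (Placeholder data stands in for real predictions.)
        double[] oobPred = outOfBag.Select(_ => rng.NextDouble()).ToArray();
        double[] oobActual = oobPred.Select(p => 2.0 * p + 1.0 + 0.1 * rng.NextDouble()).ToArray();

        double meanX = oobPred.Average(), meanY = oobActual.Average();
        double a = oobPred.Zip(oobActual, (x, y) => (x - meanX) * (y - meanY)).Sum()
                 / oobPred.Sum(x => (x - meanX) * (x - meanX));
        double b = meanY - a * meanX;

        Console.WriteLine($"scale: prediction * {a:F2} + {b:F2}");
    }
}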

So those two should help! There is a third thing I thought of too, but it's unrelated to bias; I'll try it later when the client/server code is working. Instead of using least squares to fit the correlated values, I think fitting by the evaluation metric might work better: sqrt(sum((log(prediction + 1) - log(actual + 1))^2) / n), i.e. RMSLE (see https://www.kaggle.com/c/nomad2018-predict-transparent-conductors#evaluation). The thing is, to do that I have to implement it myself; I couldn't find any Accord.NET code to do it.
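A hand-rolled version is only a few lines anyway, something like this (my own sketch, not Accord.NET code):

// RMSLE, the competition's metric:
// sqrt( sum( (log(prediction + 1) - log(actual + 1))^2 ) / n )
using System;

static class Rmsle
{
    public static double Compute(double[] predictions, double[] actuals)
    {
        if (predictions.Length != actuals.Length)
            throw new ArgumentException("predictions and actuals must be the same length");

        double sum = 0.0;
        for (int i = 0; i < predictions.Length; i++)
        {
            double diff = Math.Log(predictions[i] + 1.0) - Math.Log(actuals[i] + 1.0);
            sum += diff * diff;
        }
        return Math.Sqrt(sum / predictions.Length);
    }

    static void Main()
    {
        double[] pred = { 1.2, 0.5, 3.3 };
        double[] actual = { 1.0, 0.7, 3.0 };
        Console.WriteLine(Compute(pred, actual)); // lower is better
    }
}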