Dim Red Glow

A blog about data mining, games, stocks and adventures.

i guess things are looking up

Work still continues on the client/server code. I had some real issues automating the firewall access, but I think I've got that sorted now. I set it aside after that (that was Sunday); I'll probably look at it some more tonight.

I've been trying some slower gradients in the boosting part of the algorithm. I think the idea here is less about picking up other signals in the data and more about dampening the noise I introduce. Instead of putting in 70% of the answer, I'm doing 40% now, and this is working pretty well. I get the feeling 20% might work better still. The idea is that if you are introducing noise, you want your noise to be at the same "level" as the background noise, so on the whole you aren't adding any (since noise ideally hovers around 0 and the various noises cancel). The signal, while only at 20%, is still signal and does what it's supposed to do (identify the answers). The 40% run is still going; I'll probably start on the 20% run before I get the client/server version done.
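
A minimal sketch of the "put in only part of the answer" idea, for readers who want to play with it. It is not the genetic algorithm from this post: the weak learner here is a plain one-split stump, and the names `fit_stump`, `boost`, `rate` and `rounds` are made up for illustration. The only point is the `rate` factor, which stands in for the 70% / 40% / 20% above.

```python
import numpy as np

def fit_stump(x, residual):
    """Fit a one-split stump to the residual (hypothetical weak learner)."""
    best = None
    # Try every split point that leaves at least one sample on each side.
    for t in np.unique(x)[:-1]:
        left = residual[x <= t].mean()
        right = residual[x > t].mean()
        err = np.sum((np.where(x <= t, left, right) - residual) ** 2)
        if best is None or err < best[0]:
            best = (err, t, left, right)
    _, t, left, right = best
    return lambda q: np.where(q <= t, left, right)

def boost(x, y, rate, rounds):
    """Boosting where each round adds only `rate` of the fitted correction."""
    pred = np.full_like(y, y.mean(), dtype=float)
    for _ in range(rounds):
        stump = fit_stump(x, y - pred)   # fit the error left by earlier rounds
        pred += rate * stump(x)          # e.g. rate=0.4 puts in 40% of it
    return pred
```

Calling `boost(x, y, 0.4, 50)` and `boost(x, y, 0.2, 50)` on the same data is a cheap way to compare a 40% run against a 20% run; the smaller rate usually needs more rounds to reach the same training fit.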

I started looking at the data it's using (just for fun). Turns out it's using a lot of atom position data... which is weird. I read that there is a standard "best" form for describing crystalline structures and that these have already been optimized for that. So maybe, conceptually, certain atom positions (atom 20's X position, for example) are telling as to the details of the crystal. But this seems wrong... or at least a really terrible way to get at the data you want. I might do my 20% run with the XYZ data turned off, just to see if it (1) produces better results and (2) has results that make more sense.




bias bias bias...

Work continues on the client/server version of the genetic algorithm, but the pressing issue still remains: bias. The results I get locally at the end of a run are usually far better than the number 1 spot on the leaderboard... but my submissions are never that good. I think I found 2 things that might be causing it.

The first was some code I had in place from when I was trying to do RMSE instead of correlation coefficient. I was calculating an average value to be used as "filler". The genetic algorithm has 2 major things going on: one is the prediction, and the other is deciding when the prediction is used. When it's not used, it uses the average value (actually it's not a hard-and-fast switch but a weighting based on a 2nd prediction going on). The point is that the value being used wasn't the average for that stack (it's a series of stacks, each adjusting for the error of the last) but the average for all stacks. The difference is that the average of the current stack is usually 0 (for the first stack it is not), while the average for all stacks is the training set's average value. I don't know how much that matters, because the linear algebra at the end adjusts the correlation values to actual values... but it certainly didn't help.
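
A tiny sketch of why the wrong filler matters, under loud assumptions: the weighting is modelled as a simple linear blend, and stack 1's target is taken to be the residual left if stack 0 predicted the mean. The names `blend`, `use_weight` and the toy numbers are all hypothetical, not taken from the real code.

```python
import numpy as np

def blend(prediction, use_weight, filler):
    """Soft switch: weight 1 uses the prediction, weight 0 uses the filler."""
    return use_weight * prediction + (1.0 - use_weight) * filler

y = np.array([3.0, 5.0, 7.0, 9.0])           # toy training targets

# Stack 0 fits the raw target; stack 1 fits the error left behind
# (assumed here to be the target minus its mean).
stack_targets = [y, y - y.mean()]

global_filler = y.mean()                                # 6.0 for every stack
per_stack_fillers = [t.mean() for t in stack_targets]   # 6.0, then about 0.0

# In stack 1, an unused prediction should fall back to about 0, not 6.0.
wrong = blend(prediction=0.5, use_weight=0.0, filler=global_filler)
right = blend(prediction=0.5, use_weight=0.0, filler=per_stack_fillers[1])
```

With the global filler, every later stack gets pulled toward the training average again instead of toward zero correction.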

The second was the idea that even with the self-similar approach I'm still fitting to my training data... all of it... so bias is unavoidable. I might have a way to fix that. Basically, I'm going to treat the entire stack as one fitted answer on 63.2% of the data (that's right, this is bagging), using the exact same portion of the training data over and over again as I fit my genetics to it (it's still self-similar too). At the end, when I'm done improving, I can take the remaining held-out rows and use those to figure out how I should adjust my correlated predictions. In short, the hold-out data becomes the unbiased mechanism I use to scale the correlated values. I could also use this to get an accuracy if I wanted. I might, but right now I'm just using it to do the scaling in an unbiased way.
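
A rough sketch of that hold-out scaling idea, assuming a standard bootstrap draw (which is where the roughly 63.2% comes from) and a plain least-squares line for the scaling. `bootstrap_split` and `fit_scale` are illustrative names, and the genetic fitting itself is left out.

```python
import numpy as np

def bootstrap_split(n, rng):
    """Draw n rows with replacement; return (in-bag rows, held-out rows)."""
    drawn = rng.integers(0, n, size=n)
    held_out = np.ones(n, dtype=bool)
    held_out[drawn] = False
    # About 63.2% of rows end up in the bag, the rest are never drawn.
    return np.unique(drawn), np.flatnonzero(held_out)

def fit_scale(raw_scores, actual):
    """Least-squares slope and intercept mapping correlated scores to values."""
    design = np.column_stack([raw_scores, np.ones_like(raw_scores)])
    coef, *_ = np.linalg.lstsq(design, actual, rcond=None)
    return coef  # (slope, intercept)
```

The split would be drawn once and reused for the whole run; the model is fitted only on the in-bag rows, and `fit_scale` is then called on the model's scores for the held-out rows only, so the scaling never sees data the fit was tuned on.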

So those 2 should help! There is a 3rd thing I thought of too, but it's unrelated to bias. I'll try it later when the client/server code is working. Instead of using least squares to fit the correlated values, I think fitting by the evaluation metric might work better: sqrt(sum((log(prediction + 1) - log(actual + 1))^2) / n). See here: https://www.kaggle.com/c/nomad2018-predict-transparent-conductors#evaluation . The thing is, to do that I have to implement it myself. I couldn't find any Accord.NET code to do it.
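
For reference, a small Python sketch of that metric (the post's own code is not in Python), plus one possible way to fit toward it. The fitting half is an assumption, not the method described above: it regresses log(actual + 1) on the raw scores, which is ordinary least squares in log space and so minimizes this metric for predictions of the form exp(a * score + b) - 1. `rmsle` and `fit_log_scale` are illustrative names.

```python
import numpy as np

def rmsle(prediction, actual):
    """sqrt(sum((log(prediction + 1) - log(actual + 1))^2) / n)"""
    diff = np.log1p(prediction) - np.log1p(actual)
    return float(np.sqrt(np.mean(diff ** 2)))

def fit_log_scale(raw_scores, actual):
    """Least squares in log space: log(actual + 1) ~ a * score + b."""
    design = np.column_stack([raw_scores, np.ones_like(raw_scores)])
    coef, *_ = np.linalg.lstsq(design, np.log1p(actual), rcond=None)
    a, b = coef
    # Map new scores back to the original scale.
    return lambda scores: np.expm1(a * scores + b)
```

This swaps the linear scaling for an exponential one, so it is a different model shape, not just a different loss. Keeping a straight-line scaling while minimizing the metric directly would need a general-purpose numerical optimizer instead.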

i just can't stop working on genetic algorithm stuff

Hello hello hello. I don't even know where to begin. I guess some short stories: the improvements on the genetic algorithm have been steady and successful. I do still struggle with bias some. I also struggle with speed. Now a little more on each.

For bias, I'm pretty certain my strategy of scalable scoring (or whatever I'm calling it) is the way to go. That is, regardless of the size of the sample, the sample scores the same. The more samples of varied size you score, the more accurate the score is (with more certainty). Basically, you use the worst score from the group of samples. You should always include a full 100% sample as well as the various sub-samples, but I run that one last and only if the sub-samples indicate the result is still qualifying. In fact, I quit any time the score drops below a cutoff. This actually makes it faster on the whole too.
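
A minimal sketch of that worst-of-the-samples scoring, with assumptions flagged: the score is taken to be a Pearson correlation where higher is better, sub-samples are drawn at random without replacement, and `worst_case_score`, `fractions` and `cutoff` are made-up names.

```python
import numpy as np

def pearson(a, b):
    """Correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])

def worst_case_score(pred, actual, fractions, cutoff, rng, score=pearson):
    """Score several sub-samples, then the full set; keep the worst score.

    Returns None as soon as any sample falls below the cutoff.
    """
    n = len(actual)
    worst = np.inf
    # Sub-samples first; the full 100% sample runs last.
    for frac in list(fractions) + [1.0]:
        if frac < 1.0:
            rows = rng.choice(n, size=max(2, int(frac * n)), replace=False)
        else:
            rows = np.arange(n)
        worst = min(worst, score(pred[rows], actual[rows]))
        if worst < cutoff:
            return None  # quit early, this candidate no longer qualifies
    return worst
```

Because the cheap sub-samples run first, most bad candidates are thrown out before the expensive full-sample score is ever computed.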

For speed, I've managed to get some sample GPU code to work, which is great! But alas, I haven't found time to write the client/server code and implement a distributed version of the genetic stuff. I will, I just need more weeks/weekends. This will hopefully give me something like a 50-1000x boost in processing power.

All this work has been on https://www.kaggle.com/c/nomad2018-predict-transparent-conductors which is rather ideal for my purposes here. You can read the in-depth goings-on here: https://www.kaggle.com/c/nomad2018-predict-transparent-conductors/discussion/46239 . I'm really hoping I end up in the top 10 before it's all over.

There also happens to be a new contest, https://www.kaggle.com/c/data-science-bowl-2018 , which needs image analysis to be completed... however! I think this might also be a contender for the genetic algorithm, though maybe a different version. I could certainly load the images into the database and let the genetic algorithm figure out what is what... but I think there might be a better, more fun way: design a creature (yes, creature) that can move over the board and adjust its shape/size, and when it thinks it's found a cell, it sends the mask back for a yes/no. After, I dunno, 1000 tries it is scored and we make a new creature that does the same thing... breed winners, etc. etc. Or we could go the reinforcement route: whenever it sends back a mistake we tell it "bad", and when it sends back a success we tell it "good". That way there would be only 1 organism, and the learning would be at a logic level inside of it instead of having new versions of itself over and over again. I haven't decided which I'll do, but I think it's probably something I'd get a kick out of writing.