Dim Red Glow

A blog about data mining, games, stocks and adventures.

wow talk about leaving people hanging

hah, sorry about that. 17 hours to go and I disappear! I did not win *grin*. The contest really went all topsy-turvy when they released the results. Everyone's score got worse and the people got rearranged. More than a few people (myself included) said their best result was not selected. I fell to sub-100. The winner was someone who moved up a lot, and I heard from at least 1 person in the forums that they had a better score than the winner but it wasn't selected.

Why did this happen? Well, probably due to the size of the data; it was a really small sample. It was also possibly due to some deliberate sampling the contest runners did (where certain things weren't expressed in the training data), but really, sometimes it's just bad luck too. That still leaves me in a great position. I've not rested in the time between then and now. I've been applying the stuff I've learned, honed the code, and improved the "fitness" test.

Ah yes, so I experimented with all kinds of ways to score results... ways to test the results, fit the results, etc. Right now the thing I think is best is to bag the data: train on 63% of the data (and do the linear algebra for fitting on that), then score on the remaining 37%. Use that score to evaluate the accuracy of the stack and to weight the stack as it gets added to the final answer. I also score (fitness test) it by the same method I mentioned previously (self-similar scoring), as well as a final modification I made to the correlation coefficient's calculation internally. This last one is the big one, and since it does no harm to share, I'll do that.

The normal Pearson correlation coefficient (https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) basically takes the sum of all (x - avg(x))*(y - avg(y)), where x is the prediction and y is the answer. It then divides by the number of elements and divides again by the standard deviations of x and y multiplied together. My change is to leave that alone, then also do a 2nd calculation on the square of (x - avg(x))*(y - avg(y)): divide that by the number of elements, then divide by the variances of x and y multiplied together (aka the standard deviations squared). I take that result and multiply it by the original value. This new value is my "true" coefficient, or at least close to the truth.

I actually left out a step: I don't allow either value to be over 1, so if it does go over 1, I invert it. Don't ask me why, but this seems to be best... that is, a .5 coefficient is roughly equal to 2 overall (in terms of value for analysis), .33 = 3, etc. So when I multiply the two numbers together, they are already in a <= 1 state. I also tend to take the absolute value of the original coefficient, since negative or positive will get adjusted by the linear algebra (the squared version is always positive).
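To make the "true" coefficient concrete, here's a small sketch in Python (my actual code is C#; the function name and the exact handling of edge cases here are just for illustration):

```python
import numpy as np

def true_coefficient(x, y):
    """Sketch of the modified Pearson coefficient described above:
    plain Pearson (absolute value) times a second pass on the squared
    products, with anything over 1 inverted so both factors are <= 1."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    dx, dy = x - x.mean(), y - y.mean()
    sx, sy = x.std(), y.std()

    # plain Pearson correlation, absolute value
    c1 = abs(np.sum(dx * dy) / n / (sx * sy))

    # 2nd calculation on the squared products, divided by the variances
    c2 = np.sum((dx * dy) ** 2) / n / (sx ** 2 * sy ** 2)

    # invert any value over 1 (.5 ~ 2, .33 ~ 3, etc.)
    if c1 > 1:
        c1 = 1 / c1
    if c2 > 1:
        c2 = 1 / c2
    return c1 * c2
```
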

So that's the skinny. That works best, at least so far. When I don't do that, I get random overfittings to the training data... spikes where one value here or there correlates and makes the result look better than it actually is. The squaring deals with that. Using cubes and higher powers might help, but after trying cubes in there, I saw no real value-add.

what am i using this stuff for? stock analysis :)


17 hours left...

There are 17 hours left in the transparent-materials contest. Despite lots of changes, I haven't managed to move my score in a few weeks. I have, though, made my code more efficient (added a few mechanisms that help it get to the best score faster). And I added a mechanism I think will help a lot when there is a larger data set: I can flag train-data features to get passed forward to the linear algebra that fits the correlated value to the actual value. With this tiny dataset, though, all it does is seriously overfit, so I turned it off. I also tweaked the scoring mechanism to work very nearly like the final adjustments (only 2 differences: the final run splits the data in half and uses 2 different linear algebra computations, and it also dampens the results by merging them with the average for each half)... basically I use the linear algebra in there too.

So where does that leave me? I'm doing one final run right now. It'll either be done or be close to done by 6 tomorrow (I can submit it at any point along the stage of evolving to an answer). So maybe it'll give me a better score, but probably not. We'll see! Regardless, it looks like I'll have a top-10% finish, which is nice; it's been, gosh, 4-ish years since I had one.





i guess things are looking up

Work still continues on the client/server code. I had some real issues automating the firewall access, but I think I've got that sorted now. I set it aside after that (that was Sunday); I'll probably look at it some more tonight.

I've been trying some slow gradients in the boosting part of the algorithm. I think the idea here is less about picking up other signals in the data and more about dampening the noise I introduce. Instead of putting 70% of the answer in, I'm doing 40% now. This is working pretty well. I get the feeling 20% might work better still. The idea being: if you are introducing noise, you want your noise to be the same "level" as the background noise, so on the whole you aren't adding any (since noise ideally hovers around 0 and the various noises cancel). The signal, while only at 20%, is still signal and does what it's supposed to do (identify the answers). The 40% run is still going; I'll probably start on the 20% run before I get the client/server version done.
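For the curious, the slow-gradient idea is just the shrinkage knob in a boosting loop. Here's a toy sketch (not my actual code; the one-split stump and all the names are made up for illustration):

```python
import numpy as np

def boost(x, y, lr=0.4, n_stages=50):
    """Toy gradient-boosting loop to show the shrinkage idea: each
    stage fits a one-split stump to the residuals, but only a fraction
    (lr) of its output is added in, dampening the noise each stage
    introduces. Assumes x is sorted."""
    pred = np.zeros_like(y, dtype=float)
    for _ in range(n_stages):
        resid = y - pred
        # find the best single split on x by trying every midpoint
        best = None
        for t in (x[:-1] + x[1:]) / 2:
            left, right = resid[x <= t], resid[x > t]
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if best is None or sse < best[0]:
                best = (sse, t, left.mean(), right.mean())
        _, t, lmean, rmean = best
        stump = np.where(x <= t, lmean, rmean)
        pred += lr * stump  # shrinkage: only 40% of the correction goes in
    return pred
```

With a lower lr each stage contributes less, so bad corrections wash out over many stages instead of sticking.
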

I started looking at the data it's using (just for fun); turns out it's using a lot of atom-position data... which is weird. I read that there is a standard "best" form for describing crystalline structures, and these have already been optimized for that. So maybe conceptually certain atom positions (atom 20's X position, for example) are telling about the details of the crystal, but this seems wrong... at least a really terrible way to get at the data you want. I might do my 20% run without the XYZ data turned on just to see if it: 1, produces better results and 2, has results that make more sense.




bias bias bias...

Work continues on the client/server version of the genetic algorithm, but the pressing issue still remains: bias. The results I get locally at the end of a run are usually far better than the number 1 spot on the leaderboard... but my submissions are never that good. I think I found 2 things that might be causing it.

The first was some code I had in place from when I was trying to do RMSE instead of correlation coefficient. I was calculating an average value to be used for "filler". The genetic algorithm has 2 major things going on: 1 is the prediction, and the other is when the prediction is used... when it's not used, it uses the average value (actually it's not a hard-and-fast switch but a weighting based on a 2nd prediction going on). The point is, the value being used wasn't the average for that stack (it's a series of stacks, each adjusting for the error of the last) but the average for all stacks. The difference is that the average of the current stack is usually 0 (for the first stack it is not), while the average for all stacks is the training set's average. I don't know how much that matters, because the linear algebra at the end adjusts the correlation values to actual values... but it certainly didn't help.
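To show the filler bug in miniature (toy numbers, not my real stacks): the first stack's filler should be the training average, but every later stack fits the previous error, so its filler should be roughly 0.

```python
import numpy as np

# Sketch: in a residual-stacking setup, the "filler" used when a
# stack's prediction is skipped should be that stack's own average,
# not the global training average.
y = np.array([3.0, 5.0, 7.0, 9.0])

stack1_target = y                 # first stack fits the raw values
filler1 = stack1_target.mean()    # 6.0: the training set's average

residual = y - filler1            # pretend stack 1 predicted its mean
stack2_target = residual          # later stacks fit the error of the last
filler2 = stack2_target.mean()    # 0.0 for every stack after the first

assert filler1 == 6.0 and filler2 == 0.0
```

Using 6.0 as filler inside stack 2 (the bug) would shift every skipped prediction by the whole training average instead of by nothing.
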

The second was the idea that even with the self-similar approach, I'm still fitting to my training data... and all of it... so bias is unavoidable. I might have a way to fix that. Basically, I'm going to treat the entire stack as one fitted answer on 63.2% of the data (that's right, this is bagging) by using the same exact training-data portion over and over again as I fit my genetics to it (it's still self-similar too). At the end, when I'm done improving, I can take the remaining results and use those to figure out how I should adjust my correlated predictions. In short, the hold-out data becomes the unbiased mechanism I use to scale the correlated values. I could also use this to get an accuracy if I wanted. I might, but right now I'm just using it to do the scaling in an unbiased way.
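A tiny sketch of the idea in Python (synthetic data; the simple linear fit stands in for my linear algebra step, and all the names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Fit on a ~63.2% portion, then use ONLY the held-out rows to scale
# correlated predictions back to actual values in an unbiased way.
n = 200
actual = rng.normal(size=n)
# a "prediction" that correlates with the answer but on a different scale
correlated = 2.0 * actual + 1.0 + rng.normal(scale=0.1, size=n)

mask = rng.random(n) < 0.632            # the repeated training portion
hold_x, hold_y = correlated[~mask], actual[~mask]

# least-squares scaling fitted on the hold-out rows only
slope, intercept = np.polyfit(hold_x, hold_y, 1)
calibrated = slope * correlated + intercept
```

Because the scaling never sees the 63.2% the genetics were fit on, it can't inherit that fit's optimism.
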

So those 2 should help! There is a 3rd thing I thought of too, but it's unrelated to bias; I'll try it later when the client/server code is working. Instead of using least squares to fit the correlated values, I think fitting by the evaluation metric might work better: sqrt(sum((log(prediction + 1) - log(actual + 1))^2) / n). See here: https://www.kaggle.com/c/nomad2018-predict-transparent-conductors#evaluation . The thing is, to do that I have to implement it myself; I couldn't find any Accord.NET code to do it.
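The metric itself (RMSLE) is trivial to implement by hand; here it is in Python for reference (my version would be C#):

```python
import numpy as np

def rmsle(prediction, actual):
    """The contest's evaluation metric:
    sqrt(sum((log(p + 1) - log(a + 1))^2) / n)."""
    p = np.asarray(prediction, dtype=float)
    a = np.asarray(actual, dtype=float)
    return np.sqrt(np.mean((np.log(p + 1) - np.log(a + 1)) ** 2))
```
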

i just can't stop working on genetic algorithm stuff

Hello hello hello. I don't even know where to begin. I guess some short stories: the improvements on the genetic algorithm have been steady and successful. I do still struggle with bias some. I also struggle with speed. Now a little more on each.

For bias, I'm pretty certain my strategy of scale-able scoring (or whatever I'm calling it) is the way to go. That is, regardless of the size of the sample, the sample scores the same. The more samples of varied size, the more accurate the score is (with more certainty). Basically, you use the worst score from the group of samples. You should always include a full 100% sample as well as the various sub-samples, but I run that one last and only if the sub-samples indicate the result is still qualifying. In fact, I quit any time the score drops below a cutoff; this actually makes it faster on the whole too.
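In pseudocode-ish Python, the scheme looks something like this (the sample sizes and score function are placeholders, not my actual values):

```python
import random

def scaleable_score(score_fn, data, cutoff, sample_sizes=(0.15, 0.15, 0.32)):
    """Sketch of scale-able scoring: score several sub-samples, keep
    the WORST, and bail out the moment any sample drops below the
    cutoff. The full 100% sample runs last, and only if the
    sub-samples still qualify."""
    worst = float("inf")
    for frac in sample_sizes:
        sample = random.sample(data, max(1, int(len(data) * frac)))
        worst = min(worst, score_fn(sample))
        if worst < cutoff:
            return worst                   # quit early: cheaper on the whole
    return min(worst, score_fn(data))      # full sample runs last
```
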

For speed, I've managed to get some sample GPU code to work, which is great! But alas, I haven't found time to write client/server code and implement a distributed version of the genetic stuff. I will, I just need more weeks/weekends. This will hopefully give me something like a 50-1000x boost in processing power.

All this work has been on https://www.kaggle.com/c/nomad2018-predict-transparent-conductors which is rather ideal for my purposes here. You can read the in-depth goings-on here: https://www.kaggle.com/c/nomad2018-predict-transparent-conductors/discussion/46239 . I'm really hoping I end up in the top 10 before it's all over.

There also happens to be a new contest, https://www.kaggle.com/c/data-science-bowl-2018 , which needs image analysis to be completed... however! I think this might also be a contender for the genetic algorithm, though maybe a different version. I could certainly load the images into the database and let the genetic algorithm figure out what is what... but I think there might be a better way. I think it might be better/more fun to design a creature (yes, creature) that can move over the board and adjust its shape/size, and when it thinks it's found a cell, it sends the mask back for a yes/no. After, I dunno, 1000 tries it is scored, and we make a new creature that does the same thing... breed winners, etc. etc. Or we could go the reinforcement route: whenever it sends back a mistake we tell it "bad", and when it sends back a success we send back "good". In that way there would be only 1 organism, and the learning would be at a logic level inside of it instead of it having new versions of itself over and over again. I haven't decided which I'll do, but I think it's probably something I'd get a kick out of writing.

Still MORE genetic algorithm stuff, running and feathercoin

The short version is I streamlined the mutation process. I did 2 things. First, I made it very simple: when you breed, you either take 1 of the 2 parent genes, or you insert or remove a gene. The chance of doing the insert and remove is totally based on stagnation (as is the chance of a bad copy when taking the parent genes). Stagnation is how many generations we've had since a new best cell was found. Since this constantly increases, it will pass through whatever sweet spot it needs to maximize the odds for a new best... till it finds a new best, then it resets. Also, by starting out baby-stepping it, I give the code a really good chance to improve the results.
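Roughly, in Python (the base rate and the linear ramp are illustrative guesses, not my actual numbers, and the gene representation here is just floats):

```python
import random

def mutation_odds(stagnation, base=0.001):
    """Sketch of the stagnation idea: the chance of an insert/remove
    (and of a bad copy) grows with generations since the last new
    best, so the search baby-steps at first and widens over time."""
    return min(1.0, base * stagnation)

def breed(parent_a, parent_b, stagnation):
    child = []
    for ga, gb in zip(parent_a, parent_b):
        gene = random.choice((ga, gb))            # take 1 of the 2 parent genes
        if random.random() < mutation_odds(stagnation):
            gene = random.random()                # bad copy
        child.append(gene)
    if random.random() < mutation_odds(stagnation):
        child.append(random.random())             # insert a gene
    elif len(child) > 1 and random.random() < mutation_odds(stagnation):
        child.pop(random.randrange(len(child)))   # remove a gene
    return child
```

When a new best cell is found, stagnation resets to 0 and the odds drop back to near nothing.
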

My old code was far more off the cuff. This new method seems really, really focused. Unfortunately, the new method seems to improve a little slower at first, I think because you don't get the super crazy mutations the old one did. I'm hoping it just doesn't stop, though.

The other improvement was I made it so the number of channels and layers don't mutate right now. I think they should be allowed to enter in, but the actual process of that is probably in the same vein as gene mutation: super low odds slowly increasing, and even if you do add one, it should be a small channel (1 or 2 genes) or a layer that comes before the final result that can start being used.

I've started looking into GPUing the code. It looks like AMD GPU code choices are kinda limited; there is a C++ library that gets wrapped in various C# libraries that I'll probably end up using. The trick will be in finding the one that is easiest to use. But... I'm not in a serious hurry there.

The program is currently doing 2 layers with 3 channels each and is at .22454 Gini (which is terrible). It's been running for about a day, so hopefully this time tomorrow it's up to between .235 and .255; if it gets over .255, we'll be in a good spot. The contest has a week or so to go. I think it's my best chance to win, though I might go back and fiddle with the normal data mining program over the long weekend. I really want to see if I can maybe hone the GBM using that least-squares stuff I did to the gene results to improve them.

Unrelated to any of that... marathon training. I told you I'm going to run in a marathon, right? Or at least that's the plan. I've been having a real tough time losing weight this time around, and getting the miles in on a treadmill is... hard. I stopped dieting (which was working well) this last week so I could do/enjoy Thanksgiving, but the weight is trying to come back.

In better news though, I think I've finally fixed my stride, as my plantar fasciitis isn't getting worse despite the regular longer runs. Which brings me to tonight: tonight I do 18 miles. This is kind of a rubber-meets-the-road moment. Since I started training, all attempts to do extended distances (over 13 miles), on the treadmill or otherwise, have been cut short by me, usually due to concerns about damaged feet/ligaments or just not being in shape enough. So if this doesn't stick tonight, I might put a pin in the planned marathon and do one further out, where I can get in better shape/have more time. We will see.

One other thing I'll mention: I've been watching Feathercoin do its thing lately. I haven't talked about cryptocurrency since, I think, before I reset the log... it's been a while. I'm a huge supporter of Feathercoin and their team and want to see them do well, not to mention the coin itself. The point is, it's nice to see it start to take off, and I mean really take off. It was 10 cents a coin a week ago; it's toying between 25 and 30 cents now. It could be some sort of artificial spike, but I can't find a source of the news/pumping, so I think it might just be a case of some big buyers wanting to hold reserves. I'm hopeful at any rate.

The genetic algorithm delights don’t stop

And I’m back, and so quickly! So the improvements have started....

I tried the least-squares weighting of answers... this works, but I think it isn't worth doing while training. I'll do it on the final submission but train without it.

I've just gotten the layers implemented, and that seems to work, but it's too early to know if gains are really there. I think, if anything, this will allow the code to improve to much greater extents faster instead of getting bogged down at a lower score. (Though code without layers may eventually get there too.)

The final improvement is essentially heuristic modeling applied to the odds of any given thing happening. I did this a little a while back and have rethought what I said since last post. I think the big thing is to just balance the odds of feature/channel selection and function/method/mechanism selection. This should increase speed and accuracy.

I'm still amused by the idea of giving a result to a Kaggle contest without giving away the technology. I mean, I give them the algorithm it generated but not how you get there. It would be delightful to win a few contests in a row without actually giving the technology away. It would turn Kaggle on its head (especially since it's not the sort of thing that translates into a kernel).

Once everything is working, the last step is to migrate it to video-card-runnable code so I can scale it massively.

more genetic code thoughts...

So I've been thinking about deep neural networks and genetic algorithms and b-trees. First, let me say that I made some simplifications in the tree export (both in concept and in genes) and got the exported size down some. It should be in the neighborhood of 1/3 the size. I say "should" as I only exported a 4-depth, 4-stack tree, and that isn't anywhere near as big as a 6-depth, 16-stack tree. The whole exercise was, I think, academic at this point.

At the time I was still hopeful I could have the genetic program optimize it. It turns out that tree-based data mining, while systematic in approach, isn't very efficient. There are almost always far better ways to get the scores you are looking for, and the genetic programs tend to find them and throw out your tree entirely. The reason the tree is used is really a case of methodology: it's a generic way to produce a systematic narrowing of results to statistically model the data. The genetic mechanism tends to find mathy ways to model the data; they could be in any form, tree-based or otherwise.

This leads me to some serious thoughts on what is going on in deep neural networks. They tend to have a number of layers; each layer has access to previous layers or the original data... possibly both (depending on the nature of the convolution). It's a strong strategy for figuring things out that require a group of features to work together and be evaluated.

It turns out this is kind of what the introduction of channels is doing for me... it's also (in one way of looking at it) what stacking results does in GBM. Each channel or tree has its own concern. This made me realize that by limiting the channels to a fixed number, I was trying to shoehorn whatever it actually needs to describe the data into two ideas that get merged. Because of the strong adaptability of the channels this can work, but it isn't ideal. Ideally you let it go as wide in channels as it needs to; in fact, you really should let channels stack too.

I implemented the idea of random channel creation (or removal) and reworked the way genes are merged/created with that in mind. The results have not disappointed. It hasn't existed long, so I can't tell you how far it will go, but it does tend to get to a result faster than before.

I think there are 3 more major improvements to make. Right now, I'm still just taking the sum of the channels to get my final output. I think this can be improved by doing a least squares on each channel's output against the expected result to find a coefficient for the channel. This isn't needed per se, but it will help get me to where I'm going faster.
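The per-channel least squares would look something like this (synthetic data; numpy stands in for whatever linear algebra library I end up using):

```python
import numpy as np

# Instead of summing channels raw, solve for one coefficient per
# channel so the weighted sum best fits the expected result.
rng = np.random.default_rng(1)
channels = rng.normal(size=(100, 3))      # one column per channel's output
coeffs_true = np.array([0.5, 2.0, -1.0])  # made-up ground truth for the demo
expected = channels @ coeffs_true

coeffs, *_ = np.linalg.lstsq(channels, expected, rcond=None)
final_output = channels @ coeffs          # replaces the plain sum
```
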

The 2nd improvement is to make it so there can be layers... layers get access to the previous channels in addition to the features and whatnot. Layers could be randomly added or removed, like channels. If a layer references a previous channel and it doesn't exist due to mutation, I would just return 0.

The 3rd improvement is to add some system of reinforcement. Right now I do breed better scorers more often, but I think that isn't enough. I think some system needs to be devised that eliminates under-performing genes when breeding. This is really tricky, of course, because who can say there isn't some other impact; essentially, which genes are the good ones? I think some sort of heuristic needs to be added to a gene to track how well it has worked: basically a weight and a score. If a gene is copied unmodified, the weight goes up by 1 and the score becomes the running average of the whole cell's scores. If a copy is made and some change happens to the gene, or if the gene is new, the score and weight are set from just that particular cell (no average history). When deciding which gene to take when breeding two cells, the odds would be reflected by the two average scores, or possibly by just the current score. I don't know how well this will really work in practice, but if not this... something.
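The weight/score bookkeeping might look like this (all names illustrative; this is just the heuristic, not the breeding logic):

```python
# Each gene keeps a weight and a running score: an unmodified copy
# folds in the whole cell's score, while a changed or new gene resets
# to just that cell's score (no average history).
class Gene:
    def __init__(self, cell_score):
        self.weight = 1
        self.score = cell_score          # new gene: no history

    def copied_unmodified(self, cell_score):
        # weight goes up by 1; score becomes the running average
        self.score = (self.score * self.weight + cell_score) / (self.weight + 1)
        self.weight += 1

    def mutated(self, cell_score):
        self.weight = 1                  # history reset on any change
        self.score = cell_score
```

When breeding, the gene with the higher average score would be picked proportionally more often.
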



One possible future of kaggle contests...

First... yes, 2 posts in one day. Crazy! I wanted to add a thought that the data miners of the world may not like, and definitely something Kaggle would be dismayed at. If the genetic algorithm stuff I've been working on works... like really works... you won't be able to use their kernel system to compete, at least not as it stands.

Why? Because the results the genetic algorithm produces are based on processing power and time, not on some particular mechanism you put in place. Now, you might be able to get to a good result faster by tweaking some code, but it's not like you would run that code indefinitely on their servers. Or if you find a good gene, sharing it is an option, but basically you are giving away a winning solution by doing so.

The hard part of this, of course, is turning any particular problem into one that the gene mechanism can make short work of. There is definitely skill there, at least until most problems can be distilled via known mechanisms. Unfortunately, this isn't much different than a loading/preprocessing script.

I have to admit I'm anxious to try my tool on other problems. So much so that I'm feeling greedy as to its true value, and I kind of don't want to share the actual code, lest the world catch up in a hurry. (It will likely get improved a ton more as time goes on.) This leads me to... I wonder if I just gave them an order of operations to predict a great solution (the math actions of the genes/variables stored internally), would that just be a good solution to their contest without me giving them anything in particular? I mean, they could use it to get what they wanted without knowing the first thing about how I got there.

There is a function that generates a value using the genes and the raw data. It's a single function, and essentially that and the string of genes comprise the whole program needed to claim the result (or it should be enough): https://www.kaggle.com/c/porto-seguro-safe-driver-prediction/rules . As long as I can generate the result and give them the mechanisms to do that (and the gene), they have all they need. I could probably distill it down to something even simpler (remove unused available functions, move any external calls internal, change the database model to something simpler, write a custom loader for the new database model, etc.; essentially make a one-off standalone program that does the most exact version of what they need).

I am of course getting way ahead of myself; I haven't even come close to the top of the leaderboard, let alone beaten my OWN previous score, which is meh at best. All of this is, I guess, just food for thought.

the "evolution" of my genetic algorithm

I thought about it a little, and I wanted to say what a genetic algorithm is, at least as I understand it, so everyone is on the same page: a schema-driven algorithm that has a final result that can be evaluated. The schema changes randomly or by merging good prior results. If a more desirable result is found (in whatever manner), it is kept for future merging.

So it has been a few weeks and things have really moved forward. I think a full history via bullet points will explain everything:

  1. I get a wild hair to retry writing a GA (genetic algorithm) about a month ago. The first version was pretty amazing but produced nothing immediately useful. It produced features that had a high correlation coefficient.
  2. Turns out there were some bugs in scoring. I fixed them, and things were nowhere near as good as I thought, but they still had higher-than-normal feature scores. Unfortunately, the data mining tools didn't make particularly good use of them (they added nothing overall). I can't figure out if it is overfitting or the results just aren't all that useful; probably both.
  3. I decide to make the GA 2+d (multi-dimensional). In things like t-SNE this is the way to go, as doing 1-dimensional work is way too limiting. The way this works is there are basically two (or more) sets of results that work separately to produce an X and a Y. (They don't interact, though sometimes I flip them when loading, since they are interchangeable and that is exactly the sort of cross-breeding I want to see.) Initially, distance from the origin point is used to score correlation.
  4. The correlation scoring using distance from the origin is poor. So, I create a new mechanism that scores based on distance to a mismatched result in 2-d space. The mini-universe's X and Y points are scaled to between 0 and 1, and a distance of 1/region size is considered perfect separation. This works, but the results don't seem to translate well to being useful for data mining. Also, due to every point interacting with every other point, I am forced to make the scoring work on a subset of the data, which is less than ideal.
  5. I decide to abandon the scoring I created and instead make the 0,0 corner of the universe a 0 value and the 1,1 corner a 1 value. To find any point's actual value, sum the x and y (and divide by 2 if you want it between 0 and 1). This means the diagonals have the same value, but both channels/outputs of the GA can potentially dictate the result of any given point, though usually they need to work together.
  6. I make many improvements to the actual internals of the GA. This is an ongoing process, but at this point I thought I'd call it out. I'm not to my current version yet, but I've been playing with things internally: things like how often a copy is merged with another copy without error, and the introduction of the last few math functions I'll use. I decided to include the sigmoid (logistic function) and logit (its inverse) in the actual list of gene functions that are available. Even though their functionality could be reproduced with 2 or 3 other functions, it is so very useful from a mutation standpoint to make that feature available by just specifying it. This highlights the possibility for other such functions in the future.
  7. For reasons of speed, I was using a subset of the data to test and train against. Initial results looked amazing. The upshot of my new technique is you don't have to mine on it; you can directly translate the output to a final result (x + y = binary classifier). I let it run for a few days and got a result of .31 (exceeding the current contest's best score of .29), only to realize that I had severely over-fit to the 10% training data I was using (which wasn't randomly selected either; it was the first 10% of the data). 2 things come out of this: the idea of scaled accuracy, and a better way to train in general. Also, it turns out certain sections of the train data are very strongly organized; a competitor at Kaggle pointed this out. The data as a whole is much harder to work with.
  8. I figure if my final score is good with a small sample and with the entire training data, then it would be good overall. The logic is that if you get a self-similar score at two scales, then no matter the scale you are working at, it will be the same (scaled accuracy). The more divisions you do, the less chance of error you have. Currently I'm only doing 2 (which is poor): 1 at 15% and the entire training set. You can probably compute an ideal split based on the amount of time you want to spend testing your result. I would think 2 or 3 at 15% (averaged), 1 at 32%, and the full train data would be far better. So far, I submitted a locally tested .244 result to the Kaggle website and got back a .241. I also did a .195 and got back a .190. So for now I'll leave it at 15%/100% and just expect a small drop when I submit.
  9. It turns out there is enough logic to export a mining tree into genes. I wrote an export feature into my data miner to see what the results are. This works because GBM (how I do data mining these days) is the addition of lots of tree results. Each tree node can be turned into a series of logistic functions multiplied together. These results are then multiplied by the tree node's output, and this by the tree weight. That is added to the running total (which is seeded with an average). You can split up trees into the various dimensions, since final scores are just the sum of the dimensions anyway. The resulting GA schema is huge (21000+ genes) for a 6-depth, 16-tree stack. The logic does seem to work, and I can generate the output for it (see below). I'm sure this can be optimized some with the current mechanisms in my GA; it seems unlikely to be useful in this instance though, as that schema is too large to process quickly enough to optimize in the time remaining.
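To illustrate item 9: a hard tree split like "x < t" becomes a steep logistic, and a leaf becomes the product of the logistics along its path times the leaf output. A sketch (the steepness k and the thresholds are arbitrary choices for the demo):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def soft_split(x, t, k=50.0):
    """Approximate the indicator x < t with a steep logistic:
    ~1 when x is below t, ~0 when above."""
    return sigmoid(k * (t - x))

def leaf_value(x1, x2, out=3.0):
    """A leaf reached when x1 < 0.5 AND x2 < 0.2, expressed as a
    product of logistics times the leaf's output."""
    return out * soft_split(x1, 0.5) * soft_split(x2, 0.2)
```

Summing these leaf products over every leaf of every tree (times the tree weights, plus the seed average) reproduces the GBM output as one long gene string.
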

As of this writing, my genetic algorithm is running (and has been running for a few days). It is currently up to .256 on the training data. If testing holds true, I expect to see around a 2% drop in score when I submit (though it will get worse the better my prediction gets, as the genetic code isolates scores), which means I can expect a .316 to turn into around a .31. That, btw, is my goal. I doubt I will mess with the program too much more in this contest (unless it runs for 3-4 hours without an improvement), since the gains have been very steady. Yesterday around now it was at the .244... so .012 in a day isn't bad. I'm sure it will slow, but there are 2 weeks left in the contest. This puts me about 2 days out from tying my old score from normal data mining techniques (assuming linear progression, which I'm sure will slow down) and about 7 days from my target.

Here are some results. This is for this contest (Kaggle's Porto Seguro). The scoring is normalized Gini, which is the same as 2 * AUC (area under the receiver operating curve) - 1. This works out to be between 1 and -1: 1 is perfect, -1 is the inverse of perfect, and 0 is no better than random. The red dots are 1s, the blue dots are 0s in the training data. The top-left corner is 0, the bottom-right is 1.
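For reference, normalized Gini can be computed from rank sums like so (ties in the scores are ignored in this sketch):

```python
import numpy as np

def normalized_gini(y_true, y_score):
    """Normalized Gini = 2 * AUC - 1, with AUC computed from the rank
    sum of the positive class (no tie handling)."""
    y_true = np.asarray(y_true)
    order = np.argsort(y_score)
    ranks = np.empty(len(y_score))
    ranks[order] = np.arange(1, len(y_score) + 1)
    n_pos = y_true.sum()
    n_neg = len(y_true) - n_pos
    auc = (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
    return 2 * auc - 1
```
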

This scored a .241 on the test data, .244 on the training.

Here is 1 GBM tree mentioned above turned into a gene and graphed. I didn't test it against the test data, but the training data said it was around .256 (iirc).