Dim Red Glow

A blog about data mining, games, stocks and adventures.

wow talk about leaving people hanging

Hah, sorry about that. 17 hours to go and I disappear! I did not win *grin*. The contest went all topsy-turvy when they released the final results: everyone's score got worse and the rankings got shuffled. More than a few people (myself included) said their best submission was not the one selected. I fell to sub-100. The winner was someone who moved up a lot, and I heard from at least one person on the forums that they had a better score than the winner but it wasn't selected.

Why did this happen? Probably the size of the data: it was a really small sample. It was also possibly due to some deliberate sampling the contest runners did (where certain things weren't expressed in the training data), but sometimes it's just bad luck too. That still leaves me in a great position, though. I haven't rested in the time between then and now. I've been applying what I learned, honing the code, and improving the "fitness" test.

Ah yes, so I experimented with all kinds of ways to score results: ways to test them, fit them, etc. Right now the approach I think is best is to bag the data: train on 63% of the data (and do the linear algebra for the fitting on that), then score on the remaining 37%. That score evaluates the accuracy of the stack and sets the weight the stack gets when it is added to the final answer. I also fitness-test it by the same method I mentioned previously (self-similar scoring), plus a final modification I made to the correlation coefficient's calculation internally. That last one is the big change, and since it does no harm to share, I'll do that.
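In rough Python terms, the bagging loop looks something like this. It's a simplified sketch, not the real code: a plain least-squares fit stands in for the actual stack, and score_fn stands in for whatever fitness test you use (for example the modified coefficient described below).

```python
import numpy as np

def bagged_predictions(X, y, X_new, score_fn, n_bags=50, seed=0):
    """Toy bagging loop: each bag fits on a ~63% sample and is weighted
    by its score on the held-out ~37% when added to the final answer."""
    rng = np.random.default_rng(seed)
    n = len(y)
    total, weight_sum = np.zeros(len(X_new)), 0.0
    for _ in range(n_bags):
        mask = rng.random(n) < 0.63                      # ~63% in-bag, rest held out
        Xi = np.column_stack([X[mask], np.ones(mask.sum())])
        coef, *_ = np.linalg.lstsq(Xi, y[mask], rcond=None)   # the "linear algebra" fit
        oob_pred = np.column_stack([X[~mask], np.ones((~mask).sum())]) @ coef
        w = score_fn(oob_pred, y[~mask])                 # score on the held-out slice
        total += w * (np.column_stack([X_new, np.ones(len(X_new))]) @ coef)
        weight_sum += w
    return total / weight_sum                            # weighted blend of the bags
```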

The normal Pearson correlation coefficient (https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) basically takes the sum of all (x - avg(x))*(y - avg(y)), where x is the prediction and y is the answer, then divides by the number of elements and divides again by the standard deviations of x and y multiplied together. My change is to leave that alone, but also do a second calculation on the square of (x - avg(x))*(y - avg(y)): divide that by the number of elements and then by the variances of x and y multiplied together (variance being the standard deviation squared). I take that result and multiply it by the original value. This new value is my "true" coefficient, or at least close to the truth.
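Written out in symbols (just the paragraph above, with x the predictions, y the answers, and n the number of elements; the clipping and absolute-value adjustments come in the next paragraph):

\[
r = \frac{\tfrac{1}{n}\sum_i (x_i-\bar{x})(y_i-\bar{y})}{\sigma_x\,\sigma_y},
\qquad
r_2 = \frac{\tfrac{1}{n}\sum_i \big[(x_i-\bar{x})(y_i-\bar{y})\big]^2}{\sigma_x^2\,\sigma_y^2},
\qquad
\text{score} = r \cdot r_2
\]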

I actually left out a step: I don't allow either value to be over 1, so if it does go over 1 I invert it. Don't ask me why, but this seems to work best... the idea being that a coefficient of .5 is roughly as useful for analysis as 2, .33 as useful as 3, etc. So when I multiply the two numbers together they are both already in a <= 1 state. I also take the absolute value of the original coefficient, since negative vs. positive will get adjusted by the linear algebra anyway (the squared term is always positive).
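If you want to play with it, here's a bare-bones Python version of the last two paragraphs (it uses numpy's population std/var, i.e. dividing by n):

```python
import numpy as np

def modified_coefficient(x, y):
    """Adjusted correlation: plain Pearson times a second, squared-products
    term, with the >1 inversion and the absolute value applied."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    dx, dy = x - x.mean(), y - y.mean()

    # standard Pearson (divide by n, then by std_x * std_y), absolute value taken
    r = abs(np.sum(dx * dy) / n / (x.std() * y.std()))
    if r > 1:            # keep the "nothing over 1" rule uniform
        r = 1 / r

    # second pass: squared products, divided by n and by var_x * var_y
    r2 = np.sum((dx * dy) ** 2) / n / (x.var() * y.var())
    if r2 > 1:
        r2 = 1 / r2

    return r * r2        # the "true" coefficient
```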

So that's the skinny. That works best, at least so far. When I don't do that, I get random overfitting to the training data: a spike here or there that happens to correlate and makes the fit look better than it actually is. The squaring deals with that. Cubes and higher powers might help too, but after trying cubes in there I saw no real value-add.

What am I using this stuff for? Stock analysis :)