Hi again faithful readers (you know who you are). So the marathon is back off the table. Basically, I can't afford to spend the extra cash on a trip right now. Or rather, I'm not going to borrow the money for a trip like that. I overspent when I thought the trip was off, and upon reflection, when it came time to book, it just didn't make sense. So... maybe next year.
So for the last month I've been toying with my data mining code base. I removed tons of old code that wasn't being used or tested and honed it down to just GBM. Then I spent some time seeing if I could make a version that boosts accuracy over log loss (normal GBM produces a very balanced approach). I was successful. I did it by taking the output from one GBM and sending it into another, in essence overfitting.
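A rough sketch of that "one GBM feeding another" idea, in plain Python. This is not the code base described above: the stump learner, the toy data, and every name here (`fit_stump`, `fit_gbm`, `predict`, `log_loss`) are assumptions made up for illustration, and the leaf values are simple mean residuals rather than anything a production GBM would use.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_stump(rows, residuals):
    """Pick the one-feature split whose two leaf means best fit the residuals."""
    best = None
    for j in range(len(rows[0])):
        for t in sorted({r[j] for r in rows}):
            left = [res for r, res in zip(rows, residuals) if r[j] <= t]
            right = [res for r, res in zip(rows, residuals) if r[j] > t]
            if not left or not right:
                continue  # a split must leave rows on both sides
            lm = sum(left) / len(left)
            rm = sum(right) / len(right)
            err = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
            if best is None or err < best[0]:
                best = (err, j, t, lm, rm)
    if best is None:
        raise ValueError("no feature has two distinct values to split on")
    return best[1:]  # (feature index, threshold, left mean, right mean)

def leaf(stump, row):
    j, t, left_value, right_value = stump
    return left_value if row[j] <= t else right_value

def fit_gbm(rows, labels, rounds=20, lr=0.5):
    """Boost stumps on the negative gradient of log loss (label minus probability)."""
    stumps = []
    scores = [0.0] * len(rows)
    for _ in range(rounds):
        residuals = [y - sigmoid(s) for y, s in zip(labels, scores)]
        j, t, lm, rm = fit_stump(rows, residuals)
        stump = (j, t, lr * lm, lr * rm)  # shrink each leaf by the learning rate
        stumps.append(stump)
        scores = [s + leaf(stump, r) for r, s in zip(rows, scores)]
    return stumps

def predict(stumps, row):
    return sigmoid(sum(leaf(s, row) for s in stumps))

def log_loss(stumps, rows, labels):
    total = 0.0
    for r, y in zip(rows, labels):
        p = predict(stumps, r)
        total -= math.log(p) if y == 1 else math.log(1.0 - p)
    return total / len(rows)

# Toy data (made up): two features, binary labels.
rows = [[0.1, 1.0], [0.4, 0.0], [0.35, 1.0], [0.8, 0.0],
        [0.7, 1.0], [0.9, 1.0], [0.2, 0.0], [0.6, 0.0]]
labels = [0, 0, 0, 1, 1, 1, 0, 1]

first = fit_gbm(rows, labels)
# Stage two sees the original features plus the first model's probability.
stacked_rows = [r + [predict(first, r)] for r in rows]
second = fit_gbm(stacked_rows, labels)

print("first  log loss:", log_loss(first, rows, labels))
print("second log loss:", log_loss(second, stacked_rows, labels))
```

Because the second model trains on predictions the first model made for the same rows, anything it gains is measured on data it has already seen. Predicting the stage-one feature on held-out folds is the usual way to check whether the gain is real.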
Why would I want to do that? Well, there is a contest ( https://www.kaggle.com/c/santander-product-recommendation ). I tried to work on it a little, but the positives are few and far between. The predictions generally put any given positive at less than a 1% chance of being right, so I tried bumping the numbers, since right now the model returns all negatives. The results are actually running right now. I still expect all negatives, but the potential positives should have higher percentages, and I can pick a cut off point for rounding to a positive that is a little higher than the .0005% I would have had to use before. All this so I can send in a result other than all negatives (which scores you at the bottom of the leader board). Will this give me a better score than all negatives? No idea :)
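One way to pick that cut off point, as a minimal sketch: decide roughly what share of rows should come out positive and take the probability at that rank. The function `pick_cutoff`, the target rate, and the probabilities below are all made up for illustration and are not from the contest data.

```python
def pick_cutoff(probabilities, target_positive_rate):
    """Return a cutoff so that about target_positive_rate of rows score at or above it."""
    ranked = sorted(probabilities, reverse=True)
    keep = max(1, round(len(ranked) * target_positive_rate))
    return ranked[keep - 1]

# Made-up predicted probabilities, all far below 0.5 as with rare positives.
probs = [0.0004, 0.0031, 0.0002, 0.0090, 0.0007, 0.0001, 0.0052, 0.0003]

cutoff = pick_cutoff(probs, 0.25)  # aim for about a quarter of rows positive
predicted = [1 if p >= cutoff else 0 for p in probs]

print(cutoff)     # 0.0052
print(predicted)  # [0, 0, 0, 1, 0, 0, 1, 0]
```

Ties at the cutoff value would mark a few more rows positive than the target rate asks for, which is usually harmless at this scale.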
I tried making a variation on the GBM tree I was using that worked like some stuff I did years ago. It wasn't bad, but still not as good as the current GBM implementation. I also modified the tree to handle time series data: it can lock one row to another and put the columns in time-sensitive order. This allows me to process multi-month data really well. It also gave me a place to feed in fake data. If I want to stack one GBM on top of another, I can send in the previous GBM's results as new features to train on (along with the normal training data).
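The post doesn't say how the row locking works inside the tree, so here is only a guess at the general shape, done as a preprocessing step instead: group rows by customer, sort by month, and glue each month's columns onto the previous month's so the columns end up in time order. The record layout `(customer_id, month, features)` and the name `lock_rows` are assumptions for this sketch.

```python
from collections import defaultdict

def lock_rows(records, lags=1):
    """Build wide rows keyed by (customer_id, month): oldest lag first, current month last.

    Assumes each customer has one record per month with no gaps.
    """
    by_customer = defaultdict(list)
    for customer_id, month, features in records:
        by_customer[customer_id].append((month, features))

    wide = {}
    for customer_id, history in by_customer.items():
        history.sort(key=lambda item: item[0])  # time order
        for i in range(lags, len(history)):
            row = []
            for _, features in history[i - lags : i + 1]:
                row.extend(features)
            wide[(customer_id, history[i][0])] = row
    return wide

# Made-up monthly records, deliberately out of order.
records = [
    ("a", 2, [5.0, 1.0]),
    ("a", 1, [4.0, 0.0]),
    ("b", 1, [9.0, 1.0]),
    ("a", 3, [6.0, 1.0]),
    ("b", 2, [8.0, 0.0]),
]

wide = lock_rows(records, lags=1)
for key in sorted(wide):
    print(key, wide[key])
```

A previous model's prediction for a month could be appended to that month's `features` list before locking, which is one way the "fake data" slot might be filled.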
This leads me to where I think the next evolution of this will be. I'm slowly building a multi-layer GBM... which essentially is a form of neural net. The thing I need to work out is how best to subdivide the things each layer should predict. That is, I could make it so the GBM makes 2 or 1000 different groups, predicts each row's result for each group, and feeds those in for the next prediction, etc., till we get to the final prediction. The division into groups is something that can probably be done using a form of multivariate analysis that makes groups out of variables that change together. Figuring out how to divide it into multiple layers is a different problem altogether.
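For the "variables that change together" part, plain Pearson correlation with a greedy grouping pass is about the simplest stand-in. It is swapped in here purely for illustration; the post doesn't name a method, and the column names, values, and 0.8 threshold are invented.

```python
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column correlates with nothing
    return cov / (sx * sy)

def group_columns(columns, threshold=0.8):
    """Greedy pass: a column joins the first group whose seed column it tracks closely."""
    groups = []
    for name, values in columns.items():
        for group in groups:
            seed = columns[group[0]]
            if abs(pearson(seed, values)) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])  # nothing matched, so start a new group
    return groups

# Hypothetical columns: two pairs that move together.
columns = {
    "income": [1.0, 2.0, 3.0, 4.0, 5.0],
    "balance": [2.1, 3.9, 6.2, 8.0, 9.9],
    "age": [50.0, 20.0, 40.0, 30.0, 35.0],
    "tenure": [49.0, 22.0, 41.0, 28.0, 36.0],
}

groups = group_columns(columns)
print(groups)  # [['income', 'balance'], ['age', 'tenure']]
```

Each group could then get its own intermediate target for a layer to predict. Greedy grouping depends on column order, so a proper clustering method would be the next thing to try.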
Do you want an AI? Cause this is how you get AIs! Heh, seriously, that's what it turns into. Once you have a program that takes in data, builds a great answer in layers by solving little problems, and assembles them into a final answer that is super great, well, you pretty much have an AI.
Incidentally, t-SNE might also help here. I might just feed in the t-SNE results for that layer's training data (fake data included if we are a level or two down) to give the system a better picture of how things group statistically.
In other news, I started using Blue Apron. This is my first time trying a service like this and so far I'm really enjoying it. I'm pretty bad about going to the grocery store. And going every week... well, that ain't gonna happen. This is my way to do that, without doing that :) . I'm sure most people have similar thoughts when they sign up, even if the selling point is supposed to be the dishes you are making. Honestly, I've just been eating too much takeout. I don't mind cooking, and the dishes they send you to prepare are for the most part really good.