I've been working on a new version of my data miner (seems like it's been a while). The last one I worked on and got working used the logistic function to build curves that represented the left and right side of each splitter in the trees I built. This worked really well. I'm not sure whether it was state of the art, but when combined with either random forests or gradient boosting the results were better than my previous techniques.
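To give a rough idea of what a logistic-curve splitter can look like, here is a minimal sketch. It is a generic "soft split" and only a guess at the general shape, not the actual code from the data miner; the names `soft_split`, `soft_predict`, `threshold` and `steepness` are made up for illustration.

```python
import math

def logistic(z):
    """Logistic function 1 / (1 + e^-z), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def soft_split(x, threshold, steepness):
    """Return (left_weight, right_weight) for a value x.

    Assumption: the right side follows a logistic curve centered on the
    threshold and the left side is its mirror image, so the weights sum to 1.
    """
    right = logistic(steepness * (x - threshold))
    return 1.0 - right, right

def soft_predict(x, threshold, steepness, left_value, right_value):
    """Blend the two leaf values by how strongly x belongs to each side."""
    left, right = soft_split(x, threshold, steepness)
    return left * left_value + right * right_value
```

A larger `steepness` makes the curve approach a hard yes/no split, while a smaller one lets rows near the threshold contribute to both sides.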
So what's the new one do? Well, that would be telling :). Maybe a better thing to say is how much better is it? Not much, but it does seem to be better. Some run-of-the-mill tests on the last Kaggle contest I worked on (Home Credit Default Risk) show it's producing around .74 AUC compared to .73 with the old technique. "That's terrible!" No, it's not :) I'm just using 1 table they gave us (the main one), and I'm doing nothing special to it. I only created 1 feature. My results aren't going to be all that great with so little work. To me that seems pretty decent considering what it's working with. You can read the winner's solution, which scored .8057 AUC on the leaderboard, here: https://www.kaggle.com/c/home-credit-default-risk/discussion/64821
With both Home Credit Default Risk and Santander Value Prediction I didn't really put in the effort I should have to get the training data set up, especially with Santander. That one was starting to get interesting, then they discovered the leak. I tried naively to implement it with basically 0 success and said, "Meh. Even if I get this working right, I'm pretty far behind the curve. This has stopped being a contest I really want to do."
Another thing I've come to realize: the genetics can do the same thing my normal data mining does, but a better way to use the genetics (at least for the run time) seems to be to just make new features. I'm still working on mechanisms (in my head) for deciding whether a particular genetic output is ideal. Correlation seems ideal, but even better is something that correlates highly with the solution but not with any of the existing features. That is, maximize one and minimize the other. I'm just not positive on the best fitness test to do that. Once you've done it once, you do it again and again. Since each feature is independent they ~should~ add.
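One possible way to write down that "maximize one, minimize the other" idea is sketched below. This is an assumption, not a settled fitness test: it scores a candidate feature by its absolute Pearson correlation with the target, minus a penalty for its strongest correlation with any existing feature. The `fitness` function and its `penalty` weight are hypothetical names.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def fitness(candidate, target, existing_features, penalty=1.0):
    """Reward correlation with the target, punish overlap with known features.

    Assumption: "overlap" is measured as the largest absolute correlation
    between the candidate and any single existing feature.
    """
    relevance = abs(pearson(candidate, target))
    redundancy = max(
        (abs(pearson(candidate, feature)) for feature in existing_features),
        default=0.0,
    )
    return relevance - penalty * redundancy
```

After a winning candidate is kept, it would be appended to `existing_features` before the next round, so later candidates get penalized for repeating it.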
Incidentally, I'll be using the aforementioned new technique on my new for-the-public data mining website ( Https://yourdatamine.com ). I've spent a lot of time cleaning up my code and moving it into a wrapper so it's easily portable to the website. Cleaning it up was good too; I got rid of LOTS of old code that either did nothing, wasn't used, or was being used but shouldn't have been. I also sped up the load and made it so you can do new cross validations willy-nilly. Previously I set that up in the database. Now I can just pick a number of CV folds to do and the code will make new ones on the fly (or it can still use the DB). Having a consistent cross validation does give a reproducible result, but sometimes the random selection happens to be weird, so mixing it up can be good.
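Making folds on the fly can be as simple as shuffling row indices and dealing them out. The sketch below is an illustration under that assumption, not the wrapper's real code; `make_folds` and `train_test_splits` are hypothetical names. Passing a `seed` gives back the reproducible behavior, and leaving it out mixes things up on every run.

```python
import random

def make_folds(n_rows, n_folds, seed=None):
    """Shuffle row indices and deal them into n_folds roughly equal folds."""
    rng = random.Random(seed)  # seed=None draws a fresh shuffle every call
    indices = list(range(n_rows))
    rng.shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]

def train_test_splits(n_rows, n_folds, seed=None):
    """Yield (train_indices, test_indices), using each fold once as the test set."""
    folds = make_folds(n_rows, n_folds, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```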
The website is going to get a new page that lets you know what kind of results you can expect from a training set by doing analysis on it using the on-the-fly cross validation and giving you expected error margins.
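One simple way such a margin could be computed is from the spread of the per-fold scores. The sketch below assumes a normal-approximation interval (mean plus or minus `z` standard errors); the site may well end up using something different, and `summarize_scores` is a hypothetical name. Fold scores are not truly independent, so treat the margin as a rough guide.

```python
import math
import statistics

def summarize_scores(fold_scores, z=1.96):
    """Return (mean, margin) for a list of per-fold scores.

    Assumption: margin = z * standard error of the mean, a rough
    normal-approximation interval (z=1.96 is roughly 95%).
    """
    if len(fold_scores) < 2:
        raise ValueError("need at least two fold scores to estimate a margin")
    mean = statistics.mean(fold_scores)
    std_err = statistics.stdev(fold_scores) / math.sqrt(len(fold_scores))
    return mean, z * std_err
```

Repeating the cross validation with several different random fold assignments and pooling the scores would give the margin more data to work with.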
I made a little effort to get all that to run in memory. The processing was always done that way, but the data source used to come from a database (as I was mentioning). When I put the code into a package I also wrote a class that directly imports the data into the structures it needs. This has 2 upshots: 1, people can rest assured I don't keep or even care about their data (I set up HTTPS for that very reason too). And 2, the website has no database to maintain. Its hard drive footprint never increases (though memory varies by current usage). If the site becomes popular, I'll make it so the requests go to new instances/farms/application servers. Those will be easy peasy to spin up! So it's a super scalable solution (or should be).