Dim Red Glow

A blog about data mining, games, stocks and adventures.

The genetic algorithm delights don’t stop

And I’m back, and so quickly! So the improvements have started....

I tried the least squares weighting of answers... this works, but I don't think it's worth doing while training. I'll do it on the final submission but train without it.

I've just gotten the layers implemented, and that seems to work, but it's too early to know if the gains are really there. I think, if anything, this will allow the code to improve to much greater extents faster instead of getting bogged down at a lower score. (Though code without layers may eventually get there too.)

The final improvement is essentially heuristic modeling applied to the odds of any given thing happening. I did a little of this a while back and have rethought what I said since the last post. I think the big thing is to just balance the odds of feature/channel selection and function/method/mechanism selection. This should increase speed and accuracy.
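Something like this is what I have in mind (a minimal Python sketch; the weight values and names here are made up for illustration, not what the program actually stores): every feature/channel and every function gets a weight, and selection during mutation is proportional to it.

```python
import random

def weighted_pick(weights, rng=random):
    """Pick a key with probability proportional to its weight."""
    total = sum(weights.values())
    r = rng.random() * total
    for key, w in weights.items():
        r -= w
        if r <= 0:
            return key
    return key  # guard against floating-point rounding

# Hypothetical odds: bias mutation toward historically useful choices.
function_odds = {"add": 4.0, "multiply": 4.0, "sigmoid": 1.0, "logit": 1.0}
feature_odds = {"feature_1": 2.0, "feature_2": 1.0, "channel_0": 1.0}
```

Tuning those weights (by hand or from history) is the "heuristic modeling" part.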

I'm still amused by the idea of giving a result to a Kaggle contest without giving away the technology. I mean, I'd give the algorithm it generated but not how you get there. It would be delightful to win a few contests in a row without actually giving the technology away. It would turn Kaggle on its head (especially since it's not the sort of thing that translates into a kernel).

Once everything is working, the last step is to migrate it to code that can run on a video card so I can scale it massively.

more genetic code thoughts...

So I've been thinking about deep neural networks, genetic algorithms, and b-trees. First, let me say that I made some simplifications in the tree export (both in concept and in genes) and got the exported size down some. It should be in the neighborhood of 1/3 the size. I say "should" as I only exported a 4-depth, 4-stack tree, and that isn't anywhere near as big as a 6-depth, 16-stack tree. The whole exercise was, I think, academic at this point.

At the time I was still hopeful I could have the genetic program optimize it. It turns out that tree-based data mining, while systematic in approach, isn't very efficient. There are almost always far better ways to get the scores you are looking for, and the genetic programs tend to find them and throw out your tree entirely. The reason the tree is used is really a case of methodology: it's a generic way to produce a systematic narrowing of results to statistically model the data. The genetic mechanism tends to find mathy ways to model the data, and they could be in any form, tree-based or otherwise.

This leads me to some serious thoughts on what is going on in deep neural networks. They tend to have a number of layers, and each layer has access to previous layers or the original data... possibly both (depending on the nature of the convolution). It's a strong strategy for figuring things out that require a group of features to work together and be evaluated.

It turns out this is kind of what the introduction of channels is doing for me... it's also (in one way of looking at it) what stacking results does in GBM. Each channel or tree has its own concern. This made me realize that by limiting the channels to a fixed number, I was trying to shoehorn whatever it actually needs to describe the data into two ideas that get merged. Because of the strong adaptability of the channels this can work, but it isn't ideal. Ideally you let it go as wide in channels as it needs to. In fact, you really should let channels stack too.

I implemented the idea of random channel creation (or removal) and reworked the way genes are merged/created with that in mind. The results have not disappointed. It hasn't existed long, so I can't tell you how far it will go, but it does tend to get to a result faster than before.

I think there are 3 more major improvements to make. Right now, I'm still just taking the sum of the channels to get my final output. I think this can be improved by doing a least squares fit of each channel's output against the expected result to find a coefficient for the channel. This isn't needed per se, but it will help get me to where I'm going faster.
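Concretely, the idea is ordinary least squares over the channel outputs (a sketch in plain Python, assuming the outputs are collected per training sample into rows; this is my illustration, not the program's actual code):

```python
def least_squares(outputs, target):
    """Fit one coefficient per channel so the weighted sum of channel
    outputs best matches the target (normal equations, solved by
    Gaussian elimination with partial pivoting)."""
    k = len(outputs[0])  # number of channels
    # Build A = X^T X and b = X^T y.
    A = [[sum(row[i] * row[j] for row in outputs) for j in range(k)]
         for i in range(k)]
    b = [sum(row[i] * t for row, t in zip(outputs, target)) for i in range(k)]
    # Forward elimination.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    coeffs = [0.0] * k
    for r in range(k - 1, -1, -1):
        s = sum(A[r][c] * coeffs[c] for c in range(r + 1, k))
        coeffs[r] = (b[r] - s) / A[r][r]
    return coeffs
```

The final output then becomes the coefficient-weighted sum of the channels instead of the plain sum.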

The 2nd improvement is to make it so there can be layers... layers get access to the previous channels in addition to the features and whatnot. Layers could be randomly added or removed like channels. If a layer references a previous channel that doesn't exist due to mutation, I would just return 0.
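Roughly what I mean, as a Python sketch (the function names and the channel-callable shape here are just for illustration): each channel reads the raw features plus earlier channels' outputs, and a dangling reference simply reads as 0.

```python
def evaluate(layers, features):
    """layers: a list of layers; each layer is a list of channel functions.
    A channel gets the raw features plus a lookup into earlier channels'
    outputs. A reference to a channel that doesn't exist (say, removed by
    mutation) just returns 0."""
    outputs = {}
    for li, layer in enumerate(layers):
        snapshot = dict(outputs)  # channels only see earlier layers

        def lookup(ref, _seen=snapshot):
            return _seen.get(ref, 0.0)

        for ci, channel in enumerate(layer):
            outputs[(li, ci)] = channel(features, lookup)
    # Final prediction: sum of the last layer's channels.
    last = len(layers) - 1
    return sum(v for (li, _), v in outputs.items() if li == last)
```

Adding or removing a layer is then just editing the `layers` list, the same way channels are added or removed today.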

The 3rd improvement is to add some system of reinforcement. Right now I do breed better scorers more often, but I think that isn't enough. I think some system needs to be devised that eliminates underperforming genes when breeding. This is really tricky of course, because who can say there isn't some other impact; essentially, which genes are the good ones? I think some sort of heuristic needs to be added to a gene to track how well it has worked: basically a weight and a score. If a gene is copied unmodified, the weight goes up by 1 and the score averages in the score of the whole cell. If a copy is made and some change happens to the gene, or if the gene is new, the score and weight are set from just that particular cell (no average history). When deciding which gene to take when breeding two cells, the odds would reflect the two average scores, or possibly just the current scores. I don't know how well this will really work in practice, but if not this... something.
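The bookkeeping could look something like this (a sketch; the breeding odds here use the two average scores, which is one of the two options mentioned):

```python
import random

class GeneStats:
    """Tracks how well a gene has worked across the cells that carried it."""
    def __init__(self, cell_score):
        # New or just-modified gene: score/weight reflect only this cell.
        self.weight = 1
        self.score = cell_score

    def copied_unmodified(self, cell_score):
        # Unmodified copy: fold the new cell's score into the running average.
        self.score = (self.score * self.weight + cell_score) / (self.weight + 1)
        self.weight += 1

def breed_pick(gene_a, gene_b, rng=random):
    """Pick which parent's gene survives, odds weighted by average score."""
    total = gene_a.score + gene_b.score
    return gene_a if rng.random() < gene_a.score / total else gene_b
```

A gene that keeps landing in badly scoring cells drags its own average down and gets picked less and less, which is the elimination pressure I'm after.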



One possible future of kaggle contests...

First... yes, 2 posts in one day. Crazy! I wanted to add a thought that the data miners of the world may not like, and definitely something Kaggle would be dismayed at. If the genetic algorithm stuff I've been working on works... like really works... you won't be able to use their kernel system to compete, at least not as it stands.

Why? Because the results the genetic algorithm produces are based on processing power and time, not on some particular mechanism you put in place. Now, you might be able to get to a good result faster by tweaking some code, but it's not like you would run that code indefinitely on their servers. Or if you find a good gene, sharing it is an option, but basically you are giving away a winning solution by doing so.

The hard part of this, of course, is turning any particular problem into one that the gene mechanism can make short work of. There is definitely skill there, at least until most problems can be distilled via known mechanisms. Unfortunately, this isn't much different than a loading/preprocessing script.

I have to admit I'm anxious to try my tool on other problems. So much so that I'm feeling greedy as to its true value, and I kind of don't want to share the actual code, lest the world catch up in a hurry. (It will likely get improved a ton more as time goes on.) This leads me to wonder: if I just gave them an order of operations that predicts a great solution (the math actions of the genes/variables stored internally), would that be a good solution to their contest without me giving them anything in particular? I mean, they could use it to get what they wanted without knowing the first thing about how I got there.

There is a function that generates a value using the genes and the raw data. It's a single function, and essentially that plus the string of genes comprises the whole program needed to claim the result (or it should be enough). See https://www.kaggle.com/c/porto-seguro-safe-driver-prediction/rules. As long as I can generate the result and give them the mechanisms to do that (and the gene), they have all they need. I could probably distill it down to something even simpler: remove unused available functions, move any external calls internal, change the database model to something simpler, write a custom loader for the new database model, etc. Essentially, make a one-off standalone program that does the most exact version of what they need.

I am of course getting way ahead of myself; I haven't even come close to the top of the leaderboard, let alone beaten my OWN previous score, which is meh at best. All of this is, I guess, just food for thought.

the "evolution" of my genetic algorithm

I thought about it a little and I wanted to say what a genetic algorithm is, at least as I understand it, so everyone is on the same page: a schema-driven algorithm that has a final result that can be evaluated. The schema changes randomly or by merging good prior results. If a more desirable result is found (in whatever manner), it is kept for future merging.

So it has been a few weeks and things have really moved forward. I think a full history via bullet points will explain everything.

  1. I get a wild hair to retry writing a GA (genetic algorithm) about a month ago. The first version was pretty amazing but produced nothing immediately useful: it produced features that had high correlation coefficients.
  2. Turns out there were some bugs in scoring. I fixed them, and things were nowhere near as good as I thought, but they still had higher-than-normal feature scores. Unfortunately, the data mining tools didn't make particularly good use of them (they added nothing overall). I can't figure out if it is overfitting or the results just aren't all that useful; probably both.
  3. I decide to make the GA 2+d (multi-dimensional). In things like t-SNE this is the way to go, as doing 1-dimensional work is way too limiting. The way this works is there are basically two (or more) sets of results that work separately to produce an X and a Y. (They don't interact, though sometimes I flip them when loading, since they are interchangeable and that is exactly the sort of cross-breeding I want to see.) Initially, distance from the origin point was used to score correlation.
  4. The correlation scoring using distance from the origin is poor. So I create a new mechanism that scores based on distance to a mismatched result in 2-d space. The mini-universe's X and Y points are scaled to between 0 and 1, and a distance of 1/region size is considered perfect separation. This works, but the results don't seem to translate well to being useful for data mining. Also, because every point interacts with every other point, I am forced to make the scoring work on a subset of the data, which is less than ideal.
  5. I decide to abandon the scoring I created and instead make the 0,0 corner of the universe a 0 value and the 1,1 corner a 1 value. To find any point's actual value, sum the x and y (and divide by 2 if you want it between 0 and 1). This means the diagonals have the same value, but both channels/outputs of the GA can potentially dictate the result of any given point, though usually they need to work together.
  6. I make many improvements to the actual internals of the GA. This is an ongoing process, but at this point I thought I'd call it out. I'm not to my current version yet, but I've been playing with things internally: things like how often a copy is merged with another copy without error, and the introduction of the last few math functions I'll use. I decided to include the sigmoid (logistic function) and logit (its inverse) in the actual list of gene functions that are available. Even though their functionality could be reproduced with 2 or 3 other functions, it is so very useful from a mutation standpoint to make that feature available by just specifying it. This highlights the possibility for other such functions in the future.
  7. For reasons of speed, I was using a subset of the data to test and train against. Initial results look amazing. The upshot of my new technique is you don't have to mine on it; you can directly translate the output to a final result (x + y = binary classifier). I let it run for a few days and get a result of .31 (exceeding the current contest's best score of .29), only to realize that I have severely over-fit to the 10% training data I was using (which wasn't randomly selected either; it was the first 10% of the data). Two things come out of this: the idea of scaled-accuracy and a better way to train in general. Also, it turns out certain sections of the train data are very strongly organized; a competitor at Kaggle pointed this out. The data as a whole is much harder to work with.
  8. I figure if my final score is good with a small sample and with the entire training data, then it would be good overall. The logic is that if you get a self-similar score at two scales, then no matter the scale you are working at, it will be the same (scaled-accuracy). The more divisions you do, the less chance of error you have; currently I'm only doing 2 (which is poor): 1 at 15% and the entire training set. You can probably compute an ideal split based on the amount of time you want to spend testing your result. I would think 2 or 3 at 15% (averaged), 1 at 32%, and the full train data would be far better. So far, I submitted a locally tested .244 result to the Kaggle website and got back a .241. I also did a .195 and got back a .190. So for now I'll leave it at 15%/100% and just expect a small drop when I submit.
  9. It turns out there is enough logic to export a mining tree into genes. I wrote an export feature into my data miner to see what the results are. This works because GBM (how I do data mining these days) is the addition of lots of tree results. Each tree node can be turned into a series of logistic functions multiplied together. These results are then multiplied by the tree node's output, and that by the tree weight, which is added to the running total (seeded with an average). You can split up trees into the various dimensions, since final scores are just the sum of the dimensions anyway. The resulting GA schema is huge (21,000+ genes) for a 6-depth, 16-tree stack. The logic does seem to work, and I can generate the output for it (see below). I'm sure this can be optimized some with the current mechanisms in my GA; it seems unlikely to be useful in this instance, though. That schema is too large to process quickly enough to optimize in the time remaining.
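To make item 9 concrete, here's my own illustration (not the exported genes themselves) of how one tree path becomes a product of logistic functions: each split test (feature < threshold) is replaced by a steep sigmoid, the product along the path approximates the 0/1 "reached this leaf" indicator, and the leaf value times the tree weight scales it.

```python
import math

def soft_step(x, threshold, sharpness=50.0):
    """Smooth stand-in for the split test (x < threshold): close to 1 well
    below the threshold, close to 0 well above it."""
    return 1.0 / (1.0 + math.exp(sharpness * (x - threshold)))

def leaf_contribution(sample, path, leaf_value, tree_weight):
    """path: list of (feature_index, threshold, goes_left) splits from the
    root down to one leaf. The product of logistic steps approximates the
    indicator that this sample lands in this leaf."""
    indicator = 1.0
    for f, t, goes_left in path:
        step = soft_step(sample[f], t)
        indicator *= step if goes_left else (1.0 - step)
    return leaf_value * tree_weight * indicator
```

Summing `leaf_contribution` over every leaf of every tree (plus the seed average) reproduces the GBM output, which is why the exported schema balloons into thousands of genes.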

As of this writing, my genetic algorithm is running (and has been running for a few days); it is currently up to .256 on the training data. If testing holds true, I expect to see around a 2% drop in score when I submit (though the drop will get worse the better my prediction gets, as the genetic code isolates scores), which means I can expect a .316 to turn into around a .31. That, btw, is my goal. I doubt I will mess with the program too much more in this contest (unless it runs for 3-4 hours without an improvement), since the gains have been very steady. Yesterday around now it was at .244... so .012 in a day isn't bad. I'm sure it will slow, but there are 2 weeks left in the contest. This puts me about 2 days out from tying my old score from normal data mining techniques (assuming linear progression, which I'm sure will slow down), and about 7 days from my target.
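The scaled-accuracy check from item 8 boils down to something like this (a simplified sketch: score on a small slice and on the full set, and only trust results where the scales agree; the 2% tolerance mirrors the drop I see on submission):

```python
def scaled_accuracy_ok(score_fn, data, fractions=(0.15, 1.0), tolerance=0.02):
    """Accept a candidate only if its score is roughly self-similar across
    the tested scales (here a 15% slice and the full training set)."""
    scores = [score_fn(data[: max(1, int(len(data) * f))]) for f in fractions]
    return max(scores) - min(scores) <= tolerance
```

Adding more fractions (say 15%, 32%, 100%) tightens the check at the cost of more scoring time, as noted above.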

Here are some results. This is for this contest (Kaggle's Porto Seguro). The scoring is normalized Gini, which is the same as 2 * AUC (area under the receiver operating curve) - 1. This works out to a range of 1 to -1: 1 is perfect, -1 is the inverse of perfect, and 0 is the worst. The red dots are 1s, the blue dots are 0s in the training data. The top left corner is 0; the bottom right is 1.
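For reference, the metric can be computed straight from rank order; a small pair-counting version (fine for small n; real implementations sort instead of comparing all pairs):

```python
def normalized_gini(y_true, y_pred):
    """Normalized Gini = 2 * AUC - 1: +1 means the predictions rank every 1
    above every 0, -1 is the exact inverse, 0 carries no ranking signal."""
    pos = [p for t, p in zip(y_true, y_pred) if t == 1]
    neg = [p for t, p in zip(y_true, y_pred) if t == 0]
    wins = ties = 0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1
            elif p == q:
                ties += 1
    auc = (wins + 0.5 * ties) / (len(pos) * len(neg))
    return 2 * auc - 1
```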

This scored a .241 on the test data, .244 on the training.

Here is 1 GBM tree mentioned above turned into a gene and graphed. I didn't test it against the test data, but the training data said it was around .256 (iirc).

Behold my genetic algorithm

I recently started a new genetic algorithm and I've been really pleased with how well it turned out. So let me give you the rundown of how it went.

I was working on the Zillow contest (which I won't go on about; needless to say, you were trying to improve on their predictions) and thought a genetic algorithm might do well. I didn't even have my old code, so I started from scratch (which is fine; it was time to do this again from scratch).

About a day and a half to two days later I had my first results, and I was so happy with how well it worked. I should say what it was doing: it was trying to predict the score of the Zillow data using the training features. All missing data was filled with average values, and I only used real-number data (no categories; if I wanted them, they would be encoded into new yes/no features for each category, i.e. one-hot encoded features).

The actual predictions were ranked by how well they correlated to the actual result. I chose this because I use the same mechanism in my prediction engine for data mining, and it is historically a really good and fast way to know how well your data moves like the actual data moves. I found that it was really good at getting a semi-decent correlation coefficient (say 0.07... which sounds terrible, but against that data it was good) pretty quickly. The problem was it was overfitting: I could take out half the data at random and the coefficient would drop to crap.
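That ranking is just the Pearson correlation coefficient between a cell's prediction and the actual result; for completeness, a tiny version:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences:
    covariance normalized by the two standard deviations, so the result
    lands in [-1, 1]."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5
```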

Skipping a long bit ahead, I came to the conclusion that there were bugs in the scoring mechanism, though apparent overfitting was there too. I tried doing some folding of the scoring data to help eliminate this but found it didn't really help; fixing the bugs was the big thing. This happened around the end of the Zillow contest, and by then I was improving my core algorithm to deal with Zillow's data. The contest ended and I set everything aside.

Skipping ahead about 2 weeks, I had started on a new contest for Porto (still going on), and after running through the normal suite of things to try, I went back to the genetic algorithm to see what it could do.

I left everything alone and got going with it. Pretty quickly some things occurred to me. Part of the problem with the genetic program is it is fairly limited in what it can do with 1 prediction. That is, if it kicks out 1 feature to use/compare with the actual score, this limits you in doing further analysis and makes evolution equally 1-dimensional. So I introduced the idea of 1 cell (1 genetic "thing") producing multiple answers, that is, 1 for each dimension you care about. Essentially it would have 2 chains of commands to make two different predictions that in conjunction give you a "score".

But how to evaluate said predictions? There may be some good form of correlation coefficient over 2 features, but whatever it is, I never found it. Instead I decided to do a nearest-neighbor comparison.

So basically each test point is scored by where it is put in dimensional space vs all the other test points. The tests are scored only against those they don't match (Porto is a binary contest, so this works well; I would probably do a median value if I was using this on normal data, or the runtime gets ridiculous as it's O(n^2)).

In essence, I don't care where training data with the same classifier sits relative to other points with the same classifier; points only see the data points of the opposite type, and this seriously cuts down on run-time. In those cases I want as much separation as possible. Also, I don't want scale to matter: artificially inflating numbers but keeping everything the same relative distance apart should produce the same score. So I scale all dimensions to be between 0 and 1 before scoring. Then I just use 1/distance to get a value (you can use 1/distance^2 if you like).

This worked really well, but when I started looking at results I saw some odd things happening: things getting pushed to the 4 corners. I didn't want that specifically, and in a lot of ways I just want separation, not a case of extremes. To combat this I added a qualifier for scoring: if you are over a certain distance, that is the same as a 0 score (the best score). This way you get regional groupings without the genetics pushing things to the corners just to get the tiniest improvement.

Lately I've left the region setting at 3... if two points are 1/3 of the map apart, they don't interact. Also, I cap my sample size at 500 of the positive and 500 of the negative. I could do more, but again, runtime gets bad quickly, and this is about getting an idea how good the score for a cell is.
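Putting the last few paragraphs together, the scoring sketch looks roughly like this (simplified; the real run also caps samples at 500 per class as noted):

```python
import math

def scale_unit(points):
    """Scale each dimension to [0, 1] so absolute scale can't game the score."""
    xs, ys = [p[0] for p in points], [p[1] for p in points]
    def norm(v, lo, hi):
        return (v - lo) / (hi - lo) if hi > lo else 0.0
    return [(norm(x, min(xs), max(xs)), norm(y, min(ys), max(ys)))
            for x, y in points]

def separation_penalty(positives, negatives, region=3):
    """Sum 1/distance over opposite-class pairs, ignoring any pair farther
    apart than 1/region (so corners aren't rewarded). Lower is better."""
    cutoff = 1.0 / region
    penalty = 0.0
    for px, py in positives:
        for qx, qy in negatives:
            d = math.hypot(px - qx, py - qy)
            if d < cutoff:
                penalty += 1.0 / max(d, 1e-9)
    return penalty
```

Same-class pairs never enter the double loop, which is where the run-time saving comes from.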

There are lots of details beyond that, but that's the gist of it. The whole thing has come together nicely and is neat to watch run. Things do grow into corners sometimes, since I only breed the strong cells (lately I've started throwing in a few weak ones to mix it up). I kind of think I should have extinction events to help keep evolution from becoming stagnant in an attempt to get the best score possible. I'll have a follow-up post with some pictures and some more specifics.


been a long time since I rock and roll..

Wow, so I did it again: I stopped writing blog entries. I like to keep ideas going in blogs (a running story if you will), but I guess things get busy and my focuses change, and the next blog doesn't seem relevant, thus maybe not something I want to write. Then the whole blog-writing idea gets out of my mind entirely... and then months have passed. Let's set the excuses aside and give some updates.

So first, I was going to start the One Punch Man thing. I started to get into it, then not so much. That is to say, every time I started to make headway into a committed exercise routine, I'd get injured or sick or busy for like 5 days in a row. So, long story made short, it never happened. I ~am~ trying to train for another marathon right now, but I'm finding that old foot injuries are getting aggravated and then not healing as fast as I need them to, and I'm loath to run on injured feet. I'm working on the problem by doing less heel striking when I do run, but right now I'm not keeping pace with the training I should be doing because of the longer breaks between runs.

Losing some weight would help all this. I'm not happy with my weight right now, and I'm heavier than I have been in years and years (200-205; usually I'm more like 190-195, and I like being 180-185). The running hasn't really helped. It usually does some; this time, not at all. I think it's because I program/sit all the time. I guess I need a diet on top of it. The extra weight isn't doing my body any favors healing from running, either, and it's harder on the body when I do run. Try running around with 20-25 pounds of topsoil... it's harder!

Other things that have happened: I went to Gen Con, probably for the last time. I mainly went this year because it was #50, and we humans do love celebrating multiples of 10. It was a bit of a meh experience; too much of it was stuff I either didn't care to do or wasn't all that. I played no board games at all there (which is odd; it's a board game convention). I did play some Magic, but Magic there is not like it is at a Grand Prix, where it is the main focus, and it's a lackluster experience because of that. I did see They Might Be Giants (which was pretty great), play some True Dungeon (also great), and have dinner with a giant group of people at Fogo de Chão, a Brazilian steakhouse. So there were definitely some good times to be had.

I've more or less quit Magic. I might do 1 or 2 more events, but I've just not got the interest anymore. I do tend to have a little more fun when I play with long breaks in between, but I really think it is time for new hobbies.

A few things I'm doing right now for hobbies: an automatic trading card sorter (seems ironic to do it now, but I have more time, so... yeah), more data mining improvements/competitions, and maybe a Feathercoin payment program/plugin for eCommerce. I'll talk more on all that at a later time.

little exercise update

So after I published that last post I got sick... like the very next day. So 2 weeks later I start running again, and a week after that I start doing 6.2-mile runs instead of 4. That was a week ago. It's going well! ... but other than a few random sets of sit-ups or push-ups, I haven't started working the rest of the routine. It's not that it's hard to find time as much as getting started. So yeah, I run Saturday, and I think Sunday I'll try and do my first set of 51... I won't go all out at first. After a few days I'll try the full 102 (assuming I can, you know, still sit up).

One Punch Man exercise craziness

So lately I've been trying to lose weight, partly because I like the way I look when I look in the mirror and partly because it's just healthier for me. I had a goal to lose some weight by June 12th, but with the rate weight loss is working out to be, I don't think I'm gonna really come close to my goal.

The main reasons are my age (I'm 41, and metabolism slows down) and my lack of movement most of the day. I run, but it's not enough. I generally do 4 miles, 3 times a week, but my constant sitting doesn't burn a lot of calories, so it's kind of a wash. Also, my muscles seem conditioned to do said 4-mile run quite well without much training. I tried cutting carbs and that helped, but that's not doing much without enough exercise to burn the fat.

So I'm gonna do something that will really, and I mean really, make me lose weight (in a good way): I'm gonna do Saitama's exercise routine (from One Punch Man). I wouldn't do it if it wasn't possible for me. The "hard" part is the running, and that's the one part I won't sweat. So what is it?

  • 100 Push-Ups
  • 100 Sit-Ups
  • 100 Squats
  • 10KM Running (that’s 6.2 miles)

every single day.

Okay, really? Every single day? No, that wouldn't make sense; even when maintaining yourself that seems overkill unless there was some real need. So I'll probably do it 4 days a week. 6.2 miles a day is another 2.2 miles on my normal run, so it will be a little harder at first. The 100 squats and sit-ups and push-ups won't be bad as long as I break them up into sets. I'll probably do 2 workouts a day with 3 sets each: so 51 push-ups, sit-ups, and squats (3 sets of 17) twice a day.

But that's CHEATING! Actually, he doesn't really say it was all in one go. In fact, I probably should break up the run too, but honestly I don't really want to, because it's going to eat up enough of my day as is (quite the time sink). And when I get in better shape I'll probably merge the 2 other workouts into one, so it'll be 3 sets of 34 each. But that won't be for a month or so.

Am I really that out of shape? This does seem drastic, but it's more just something to amuse me. If I absolutely can't find time or hate it, I'll stop. But barring incident, I'll start tomorrow (though my first day I might do 1/2 and really go 100% Wednesday). I'm abandoning my dieting at least till I get going, because not having carbs while exercising is more brutal than I want to be. Right now I weigh around 195; I'd really like to be 180. 175 is like my perfect super-in-shape weight that I really haven't been at... er, ever (well, high school, but that was a not-yet-adult form of me). Even when I was in great shape around 28 (13 years ago, ugh) I was still like 178 (splitting hairs, I know).

Anyway, we'll see how it goes. I'll keep ya all updated. I may or may not post before and after pictures; you'll just have to wait and see, hah! Really, I'm thinking from the front there won't be THAT much difference; my face will look a little thinner. The main thing is I won't have quite the thick core (torso seen from the side), which is really my goal.

another mathy thing ... p = 4 * k + 1

So I was looking for an old video on Numberphile (one about a particular kind of prime) and re-saw this:

https://www.youtube.com/watch?v=yGsIw8LHXM8 (two square theorem)

or if you like the original video that one comes from

https://www.youtube.com/watch?v=SyJlRUBoVp0 (The Prime Problem with a One Sentence Proof)

Anyway, I found myself rewatching it and decided to see if I could make a simpler version of the proof. Not less space, mind you; simpler as in more straightforward. I think I've done that here (it's probably not new; most things in math aren't), but regardless, I submit it for your entertainment/utility.
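The theorem itself (every prime of the form 4k + 1 is a sum of two squares, and no prime of the form 4k + 3 is) is easy to sanity-check by brute force; a quick script (my illustration, not the proof):

```python
def is_prime(n):
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def two_squares(p):
    """Return (x, y) with x*x + y*y == p, or None if no pair exists."""
    x = 0
    while x * x <= p:
        y2 = p - x * x
        y = int(round(y2 ** 0.5))
        if y * y == y2:
            return (x, y)
        x += 1
    return None

# Every odd prime p = 4k + 1 decomposes; every prime p = 4k + 3 does not.
checks = [(p, two_squares(p)) for p in range(3, 200) if is_prime(p)]
```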


Also, it's worth saying that you can then figure out exactly what form the X and Y (K or what have you) need to have by just unraveling all that. Here is that little bit of extra formula stuff spelled out.




a need for better time warping

Well, weeks and weeks of working on the cancer contest have brought me back to where I started from. I want to use dynamic time warping (DTW) to match images. Once the images are matched, I think maybe look at the difference between the original and the target images to see what is left. This is probably your best place to start looking for cancer.

So why don't I do this? Because the run time is abhorrent. For a single comparison of two images, I think naively the big-O is something like N^4, where N is the number of pixels in the image. N^2 is linear DTW, but you can't just add a dimension and go up by 1 power; if I understand it right, you have to add 2 to properly do 2-d matching. So really, it's bad.

Maybe there is a way it can be done in N^3 and I'm missing something, but really it needs to be done in something like linear or at least N*log(N) time to really work. So that's where I'm leaving it.
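For reference, this is the classic 1-d DTW that already costs O(N^2) per comparison; the 2-d image version is the part that blows up:

```python
def dtw_distance(a, b):
    """Classic 1-d dynamic time warping distance between two sequences,
    O(len(a) * len(b)) time and space."""
    INF = float("inf")
    n, m = len(a), len(b)
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],       # a[i-1] repeats
                                 cost[i][j - 1],       # b[j-1] repeats
                                 cost[i - 1][j - 1])   # advance both
    return cost[n][m]
```

The warping is why two sequences that differ only by a stretched segment still come out at distance 0, which is exactly the alignment behavior I'd want between images.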

There is a cervical cancer contest out there that is very similar, except the photos are from some sort of normal optical camera. While maybe it could be done the same way, it has the same problem. I think if we solve this problem, the world will have much, much better analysis systems (in general).

It's worth mentioning I think most people do their analysis using deep neural networks. Quite honestly, I'm not sure how they would do a good job processing 2-d image data, but apparently it does work. I've got 3 weeks before the contest is over. If I can come up with a good way to do the DTW, I will; otherwise I'm throwing in the towel on this one :( .