Spring has attacked, and with it has come the monotony that is yard work. I really need to invest in a yard service (not a fan of mowing the yard, pulling dandelions, etc.). But alas, you're too lazy to pick up the phone, which makes no sense! So there I am doing that stuff, and the mind wanders and wonders.
Get to it! What are you thinking about?!? Well, it's data mining again. Geeze. I'll come back when you're gonna talk about beer. I believe I need a different set of rules for those cases that fall on the line between being out of a group and in a group. That is, each item I test against the model either falls into a classification or it doesn't, and there are multiple classifications. Beer! Tease. The thing is, trying to handle a single classification by making multiple separate subgroups of the same group seems like a process doomed to fail. Each time you subdivide your training data to make a new subgroup, you get that much closer to isolating all the elements individually. Eventually, they become data sets of size 1, and matching test data to those groups is just the nearest neighbor algorithm. Which is a totally valid mechanism to use, just not what I'm trying to do. What I am trying to do is make a model that represents the data via groups of items that share similar traits.
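For anyone unfamiliar with it, that degenerate size-1 case really is just nearest neighbor. A minimal sketch, with made-up two-attribute records (the data and names here are invented for illustration, not from my actual program):

```python
from math import dist  # Euclidean distance between attribute tuples

def nearest_neighbor(item, training):
    """Classify `item` with the group of its closest training record."""
    attrs, group = min(training, key=lambda rec: dist(item, rec[0]))
    return group

# Hypothetical records: (attributes, group)
training = [((1.0, 1.0), "A"), ((1.2, 0.9), "A"), ((5.0, 5.0), "B")]
print(nearest_neighbor((1.1, 1.0), training))  # A
print(nearest_neighbor((4.8, 5.2), training))  # B
```

Simple, and it works fine as far as it goes; it just doesn't give you the shared-trait groups I'm after.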
How will I determine the border cases that need to be handled with a new method? When I test the training data against the model created from that same data, I find that some of the cases are sorted incorrectly. I can build an average model for those cases: I separate out the false positives and false negatives into their own models and use those to identify problem records. Got it? Good.
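To make that concrete, here's a toy sketch of the idea, assuming a simple centroid-style "average model" (the two-group setup and all the numbers are invented for illustration; my real models are fuzzier than this):

```python
from math import dist

def centroid(rows):
    """The 'average model': per-attribute mean of a set of records."""
    return tuple(sum(r[i] for r in rows) / len(rows) for i in range(len(rows[0])))

# Hypothetical training records: (attributes, true group).
# The last one deliberately sits closer to group A's territory.
training = [((1.0, 1.0), "A"), ((1.1, 0.9), "A"),
            ((5.0, 5.0), "B"), ((2.0, 2.1), "B")]

models = {g: centroid([r for r, lab in training if lab == g]) for g in ("A", "B")}

def predict(item):
    return min(models, key=lambda g: dist(item, models[g]))

# Records the model sorts incorrectly on its own training data...
errors = [r for r, lab in training if predict(r) != lab]
# ...get an average model of their own, used to flag problem records.
error_model = centroid(errors)

def is_borderline(item):
    return dist(item, error_model) < min(dist(item, m) for m in models.values())

print(errors)  # [(2.0, 2.1)]
```

Anything flagged as borderline would then get routed to whatever new method I end up building for the middle cases.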
Wait, what happened to the grouping method you were trying before? The brute force and learning systems stuff. Glad you asked. In short, it never improved much. The whole approach of testing a model with the same data you built it from is kind of flawed, or at least not ideal. The training data's results against its own model are always skewed to "better than it should be." Heck (you mean hell), nearest neighbor would score 100% if you tested the training data against itself. So in my case I tried to optimize some with sub-groupings but quickly found there were no real gains to be had. I mean, it ran for a while and did improve on sorting the training data. I got to a 0.5++ result, which should put me near 1st, but when I scored it against the real data I still sat at 0.42++. A case of fitting the model too perfectly to the test data and not adding any real value. For now I've back-burnered the idea. I will continue forward with what little time there is left (5 days) by working on a new method for just the middle cases. Right after you go mow the yard. Maybe then I'll have a beer. Hooray!
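That nearest-neighbor point is easy to demonstrate: score the training data against itself and every record finds itself at distance zero. A tiny sketch with invented data:

```python
from math import dist

# Hypothetical labelled records -- any distinct points will do.
training = [((0.0, 0.0), "A"), ((1.0, 2.0), "A"),
            ((4.0, 4.0), "B"), ((5.0, 3.0), "B")]

def nn_predict(item, data):
    return min(data, key=lambda rec: dist(item, rec[0]))[1]

# Each training record's nearest neighbor is itself (distance 0),
# so "accuracy" on the training set is a meaningless 100%.
accuracy = sum(nn_predict(r, training) == lab for r, lab in training) / len(training)
print(accuracy)  # 1.0
```

Which is exactly why a great score on your own training data tells you almost nothing about the real data.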
This weekend I made my first gains in a long time on my data mining program. Hooray! With only just over a week left, I'm more hopeful than reason should allow. I actually saw two jumps: one was 0.005 and the second was another 0.003. The cool thing about the second jump is that it came from a completely new addition to my classic program, as opposed to the first, which was just me throwing CPU cycles at my base technique in order to squeeze a little more accuracy out of it. The two techniques should stack, but since the first one took two days to run, it is unlikely I will try stacking them before this contest is over. I need to focus on further improvements, especially considering how fast the second improvement ran.
So just how long did the second take? Just over 45 minutes. And the second mechanism also happens to be the sort of thing that can be improved quite a bit. Well, we'll see if that's true. That is to say, there are no guarantees; what I did was simple and naive, and not very thorough. I'm working on an iterative process right now that should be more thorough and far less naive. That doesn't mean the new version will work better. It's kind of like finding a vein of gold when you're a gold miner: you just don't know if it's gonna open up to more, or if that was it.
I should add, I've approached this "new technique" once before. I did it a different way over a month ago and saw no improvement after over 5 days of continuous processing (one run to produce testable results). There are a lot of ways to solve the problem I'm trying to solve, and back then I thought I had chosen the best one. What happened to make me do it differently? I figured out which data was giving me problems and separated it out. In retrospect, I suspect my previous, poor solution was actually designed in a way that produces no improvement except in special cases (this is not one of them). I won't go on about it till I can explain what both the old and new approaches are in detail. And I won't do that till I've either moved on to better grounds or am done with the idea and ready for the world to know. No hints? Nah, not right now :)
So the future of this thing is...? The way I see it, there are two paths I can take to make what I'm doing better. There is the brute force/naive way, and there is the machine learning way, both with countless variations on the themes. So you are doing the brute force way right now? :D Yep :). Will it be enough to get up to 0.526 (a winning score right now)? Maybe; I don't know what the limits are going to be. I have reason to believe it might be that good, but the fact that there are 50 people above me, all stuck in the 0.40s to low 0.50s, makes me think I probably won't get over 0.45. If you're lucky! Likely they are all fighting some version of what I'm doing. At the least, I think I can expect some small gains. Remember, this is all about the bigger picture. Win the war, not the battle. Well, win both if ya can ;)
So, it's hard to frame this moment correctly (but I'll try). I've been trying different aspects of the same mechanism for my current data mining tool for around two and a half months. Each test, each code change, each moment of clarity has slowly led me to this point. Literally months of having all the cores of my processor running at 96% (I like to leave a little room, in case I want to do something while stuff is running). With each test run, an idea is shot down or I find some new insight into the way the sample data is distributed. But finally, I'm at the point where everything I've wanted to test is tested. And I haven't seen any gains over my original idea in about a month and a half.
Where does this leave me? There are less than 2 weeks left in the contest, and I have a decent idea about why I'm not making improvements. Why you fail! Muhahahaha! I need one last big jump, and there isn't really much time to get it right. I'm struggling with the model I made from the training data. By its nature, it's fuzzy. The boundary cases are ~probably~ what are causing all the problems: those cases halfway between being in a group and not being in it. Trying to model those cases doesn't generally give any more resolution; they don't represent a subset so much as extreme cases of normal data. What are you going to do? I need to make better use of the data I'm provided. I'm throwing away any information I'm not using. That's where the solution lies. But what is it I need to do? For now, I just need to think. My computer sits quietly, waiting for my next move.
Oh, and on an unrelated note, Diablo 3 is coming out May 15th.
So, last week my home desktop machine sat and churned for 5 days on some data mining stuff, only to produce a result that was lackluster at best (0.356; more on that later). I had moderately high hopes for this new version, but so far it has produced around the same kind of result, just in minutes instead of days. Allow me to back up a little...
I've been working on this contest over at http://tunedit.org/ for the last few months.
I should say I love a good puzzle that no one has solved. The technically challenging kind. If I had all the time in the world, I'd probably be working on the Riemann hypothesis. But, alas, I do not... Some day. Maybe when I'm independently wealthy. Hah, ya right. Well, I'd work on that along with everything else that interests me if I had the time and money.
I have a long-term plan of developing a data mining tool that can run over any set of data that is in the right form: data where there are attributes and groups for the test and training items. That particular contest has ideal data at the moment; in this contest you are predicting groups, and all the attributes are filled out. Eventually, I'll have a version that predicts either groups or attributes. It would be nice to get the program to a point where I could pick up a set of data, spend some time loading it into the right form, and just get an answer after X hours of processing.
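For the curious, the "right form" I have in mind is nothing fancy. Something like this sketch (the names are made up here, not my tool's actual types):

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Record:
    attributes: Tuple[float, ...]  # in this contest, always fully filled out
    group: Optional[str] = None    # known for training records, None for test records

train_rec = Record(attributes=(0.3, 1.7, 4.2), group="A")
test_rec = Record(attributes=(0.4, 1.5, 4.0))  # the group is what gets predicted
```

Load any data set into that shape and, in theory, the same program can chew on it.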
So far I'm doing pretty poorly compared to others in this particular contest. My best score is 0.43, and last I looked the number 1 team had 0.525. I have no idea what mechanism they are using. I'm not too concerned about it; anything I learn in this contest will be applied to the next as the program continues to evolve. The biggest thing is that I continue to innovate, improving on old ideas while developing new ones. I want to develop an algorithm that is to data mining as quicksort is to sorting: not always the best in every case, but darn good. A person's gotta have goals. Trying to achieve that is really where the fun is for me. All will bask in the glory of what I have created!!! Liar :( In the end, it's likely the world won't care a lick, and I'll just entertain myself :) I'll of course try any idea that's new to me, not just ones I develop, to make sure I know as much as I can. There always seems to be some new implementation or some new theory of whatever. And I'm not trying to reinvent the wheel, just make the wheel as good as it can be.
There is a larger goal in the end: the 3 million dollar contest on http://kaggle.com doing this sort of thing. I'll talk more about that when I get there, but know that it is the overarching goal. If things go well, maybe I can be a contender. That contest doesn't finish for a couple of years though, so there is time.
I updated the blog's code a little: calendar, rudimentary search, and whatnot. I still plan on adding an RSS feed, but that can wait for another day. I'll probably move it into an MVC 3.0 framework too at some point. But, ya know... baby steps.
Why am I bothering to write my own code for my blog? I don't have a really good reason. It's a hobby to begin with, so there is that. And it's always nice to have complete control over everything you are using. I mean, sure, you could build your own car engine and use it, but people don't (well, most people). This is no car engine though; the time involved just wasn't as bad as all that.
For good or bad, I've decided to start a blog. Basically, I'm looking for an outlet to share stuff in what I hope is an entertaining fashion. Well, entertaining to me at least. If a larger audience gets a kick out of it as well, all the better. What sort of stuff? I like breaking things! Are you gonna break stuff?! Oh, you know. Whatever. Anything I've either been pondering, found really interesting, or have been working on. I'm not going to jump right in and share anything just yet. I think a little context is necessary first.
My name is Jamie. Well, actually, it's James, but nicknames die hard. Probably the best way to give you a picture of who I am is by describing the topics I figure I'd write about. For starters, I'll probably write about data mining a lot, at least for the time being, as that's a strong focus of mine right now. I used to be really focused on data compression and, to a lesser degree, encryption, but these days it's all about finding the needle in the haystack (as it were).
I'm sure at times I'll regale you with a story or two about whatever iPhone development I'm doing. Other topics that come to mind: human nature, chemistry, physics, food, algorithms, game theory/design, drinks, learning systems, astrophysics, music, and finance. I might even go on about movies, art, or some neat-o gadget I read about. Really, this isn't about setting up boundaries. I'm just trying to draw a mental picture for you all of the topics I'm guessing I will bring up. In truth, I'll probably end up writing about whatever has been interesting since I last wrote. I don't expect the blog to be daily. Maybe weekly, maybe not. We'll have to see how it goes!