Not too much to say in terms of interesting news. I've been quietly working on improving the DMT (data mining tool). Acronyms for the win! And for the lazy. Thus far I have nothing to show for it. I could go on about what I'm doing exactly, but really it would be far more easily understood with pictures. So at some point I'll throw them up for you all to see. Just not right now.
I did have one idea that is worth sharing. Once I get this thing working well, I think I'm going to apply it to financial analysis. How? First, I'll grab a list of stocks, probably everything from AMEX, NYSE, and NASDAQ, and their daily stock prices for a couple of years. Then I'll set up attributes for each stock representing key financial statistics. I think those statistics all have to be relative to stock price or possibly market capitalization. This is done so that there is some reference point day to day for how good that attribute is. I'll probably also have to organize the data by industry. For example, I won't want to compare agricultural to pharmaceutical. They may be similar, but one industry works off of a different set of ideals in valuation than the other. I can try mushing it all together to see if it performs similarly, but in case it doesn't, they will be stored separately. Finally, I'll set up some sort of criteria for grouping stocks into buy and sell (hold will be everything else) based on how a stock ended up performing in the following months. Once that's all set up, I run the tool over the data and, presto, if you did everything right: stock picks.
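To make that labeling step concrete, here's a rough sketch of how it might look. The thresholds, the 90-day horizon, and the use of pandas are all my own placeholders, not a settled design:

```python
import pandas as pd

# Hypothetical thresholds: call it "buy" if the stock gained more than 10%
# over the following months, "sell" if it lost more than 10%, else "hold".
BUY_THRESHOLD = 0.10
SELL_THRESHOLD = -0.10

def label_stock(prices: pd.Series, horizon_days: int = 90) -> pd.Series:
    """Label each day by the stock's forward return over `horizon_days`."""
    forward_return = prices.shift(-horizon_days) / prices - 1.0
    labels = pd.Series("hold", index=prices.index)
    labels[forward_return > BUY_THRESHOLD] = "buy"
    labels[forward_return < SELL_THRESHOLD] = "sell"
    return labels
```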
Aren't other people doing that sort of thing? Yep! It'll be my take on it. I'll just be one of many... many people doing that sort of thing. I need to get my algorithm in the same league as the top performers in the last contest first, though. And it'll never be perfect, because the world is full of random events that make the best analysis fail. Hurricanes, early freezes, earthquakes, war... all of those things can throw a monkey wrench into the predictions. Regardless, once everything is performing really well, I'll attack that stock data and see what I can do.
I put together an RSS feed for the site. No idea if it works. I'm not much of an RSS guy. I've had to work with feeds before, but I don't actively use RSS for notifications or reading articles. So, if there are any problems with it, sorry in advance. If not, well, great! :D
So, I was in Vegas for 4 days. And it was glorious! I went for a wedding, and that would have been awesome by itself, but ya know, I always have a good time there, and that made it that much more fun. I have to admit, though, I could do with about a 10-year break from the place. Blasphemy! Give it some time to change and not be so fresh in the memory. I've been 3 times in the last 10 years now, and while I always have a blast, there is less and less to do that is truly new and exciting. I've actually never been there for a convention, so I suppose I'm due for that. If that happens, though, it'll be for work, and really I won't be there for "fun".
Other than that, not too much to share. I did find out that I will be getting the test data results for the competition I just worked on. Hooray! With those I can continue my work on improving the methods I was working on, using the exact same framework and scoring I previously had. I won't get them till April 30th, but that's OK. I could probably use the separation from it anyway. Also, with regards to my next attack on the data: I'm not entirely sure how I'm going to go about making my decisions on which data goes where in the new modeling I have planned. Eh? I'll talk more about it another time.
The bugs are fixed. You can now search and look at old entries (what few there are) to your heart's content! Hooray!
So, what news? The most recent data mining contest ended. I didn't do well at all: 79th out of 126. I ended up slicing the data quite a few different ways, but I always seemed to end up sitting with a score between 0.419 and 0.430. My best score of 0.439 came from refining a particular result more than from finding some great revelation. I did figure a few things out. I'm not sure if it's remarkable or not, but I made my best result off only slicing the data into two subgroups. Eh? Each group you are trying to solve has two sets of data: one set that is in-group and one set that is out. So for each of the 83 groups I make two subgroups out of the whole thing with respect to any one group. I then used that to identify where the groups appear. Through the course of events I also tried some variations of nearest neighbor and a few other techniques that involved making many groups, but they always produced nearly the same result and were slower. Typically, they scored slightly worse. In the end my two-subgroups method seemed the best. I suppose I learned/figured out quite a bit about that kind of data manipulation (aggregates), which in itself is something.
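For the curious, here's roughly what that two-subgroup split looks like in code. The record structure and names are just my illustration, not the actual contest format:

```python
# Sketch of the two-subgroup (one-vs-rest) split described above.
# `records` is assumed to be a list of (features, group_labels) pairs;
# the shapes and names here are my own stand-ins.

def split_for_group(records, group_id):
    """Partition all training records into in-group and out-group sets
    with respect to a single group."""
    in_group = [feats for feats, labels in records if group_id in labels]
    out_group = [feats for feats, labels in records if group_id not in labels]
    return in_group, out_group

# One binary split per group; with 83 groups that's 83 separate splits:
# splits = {g: split_for_group(records, g) for g in range(83)}
```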
So now that it's over, you have all kinds of new ideas, right? You bet! *grin* So even though the contest is over, I'll probably explore those ideas anyway. Hopefully, the guys running the old contest will release the test data answers and I can test with that data. If not, I'll just mock up something that isn't quite as good from the data they did provide. If it all works out, I'll use the ideas next time I participate in a contest.
Also in the news, I managed to upgrade this site's code to MVC 3. Hopefully, you don't even notice. There are still a few issues: clicking on an entry doesn't take you to the entry, the search still goes to the old page, and the calendar still seems to be using the old URLs. Also, unfortunately, I still haven't set up an RSS feed. I'll fix most of that Tuesday night, but I'll probably spend another week or two getting the RSS finished.
Spring has attacked, and with it has come the monotony that is yard work. I really need to invest in a yard service. (Not a fan of mowing the yard, pulling dandelions, etc. But alas, you're too lazy to pick up the phone, which makes no sense!) So there I am doing that stuff, and the mind wanders and wonders.
Get to it! What are you thinking about?!!? Well, it's data mining again. Geeze. I'll come back when you're gonna talk about beer. I believe I need a different set of rules for those cases that fall on the line between out of a group and in a group. That is, each item I test against the model either falls into a classification or not. There are multiple classifications. Beer! Tease. The thing is, trying to make multiple separate ways to make a single classification (subgroups of the same group) seems like a process doomed to fail. Each time you subdivide your training data to make a new subgroup, you get that much closer to isolating all the elements individually. Eventually, they become data sets of size 1, and matching test data to those groups is just the nearest neighbor algorithm. Which is a totally valid mechanism to use, just not what I'm trying to do. What I am trying to do is make a model that represents the data via groups of items that share similar traits.
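A toy sketch of why that degenerates, assuming a simple centroid-matching scheme (the scheme and names are mine, just to make the point): once every group holds a single record, each centroid is that record, and matching collapses into plain nearest neighbor.

```python
import math

def centroid(group):
    """Average the feature vectors in a group into one representative."""
    n = len(group)
    return [sum(v[i] for v in group) / n for i in range(len(group[0]))]

def classify(item, groups):
    """Assign `item` to whichever group's centroid is closest (Euclidean)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(groups, key=lambda g: dist(item, centroid(groups[g])))

# If every group in `groups` holds exactly one record, each centroid *is*
# that record, and classify() is exactly 1-nearest-neighbor.
```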
How will I determine the border cases that need to be handled using a new method? When I test the training data against the model created from said data, I find that some of the cases are sorted incorrectly. I can build the average model for those cases: I separate out the false positives and false negatives into their own models and use those to identify problem records. Got it? Good.
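Here's a minimal sketch of that idea. The data shapes and the `predict` classifier are stand-ins of my own; the point is just averaging the misclassified records into their own models:

```python
def average_model(vectors):
    """Average a list of feature vectors into one representative vector."""
    if not vectors:
        return None
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def border_models(records, predict):
    """Build separate average models from the false positives and false
    negatives the existing classifier makes on its own training data.
    `records` is a list of (features, actual) pairs; `predict` is the
    current in-group/out-group classifier (both assumptions of mine)."""
    false_pos = [f for f, actual in records if predict(f) and not actual]
    false_neg = [f for f, actual in records if actual and not predict(f)]
    return average_model(false_pos), average_model(false_neg)

# Records landing near either model are the border cases that get the
# second set of rules.
```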
Wait, what happened to the grouping method you were trying before? The brute force and learning systems stuff. Glad you asked. In short, it never improved much. The whole problem of trying to test a model with the data you built the model from is kind of flawed, or at least not ideal. The training data's results against its own model are always skewed to "better than it should be". Heck (you mean hell), nearest neighbor would score 100% if you tested the training data against itself. So in my case I tried to optimize some with sub-groupings but quickly found there were no real gains to be had. I mean, it ran for a while and did improve on sorting the training data. I got to a 0.5++ result, which should put me near 1st, but when I scored it against the real data, I still sat at 0.42++. A case of fitting the model to the training data too perfectly and not adding any real value. For now I've back-burnered the idea. I will continue forward with what little time there is left (5 days) by working on a new method for just the middle cases. Right after you go mow the yard. Maybe then I'll have a beer. Hooray!
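To see that 100% point in action, here's a tiny toy demonstration (the data is made up, purely to illustrate the skew):

```python
# 1-nearest-neighbor is perfect on its own training data: every record's
# nearest neighbor is itself, at distance zero.

def nearest(item, training):
    return min(training,
               key=lambda rec: sum((a - b) ** 2 for a, b in zip(rec[0], item)))

training = [((0.0, 0.0), "out"), ((1.0, 1.0), "in"), ((0.9, 0.2), "out")]
correct = sum(1 for feats, label in training
              if nearest(feats, training)[1] == label)
print(correct / len(training))  # always 1.0 when scoring the training set
```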