Dim Red Glow

A blog about data mining, games, stocks and adventures.

Eldrazi and Modern and Vintage Masters

I've talked about it before and it's high time I wrote a little about it. I play Magic: The Gathering and have for years. I'm going to skip a lot of the basic explanation a layman might need and assume that if you keep reading you have some knowledge of how the game works and what sets are currently coming out. I'll also assume you know what the different tournament structures are and know at least a little something about deck building and technical game play (triggers, the stack, layers, etc.).

That being said, 2 interesting things are happening right now. First, Modern has really turned into a mess with the introduction of the new Eldrazi cards from Oath of the Gatewatch. Mess might be the wrong word. Let's say that all forms of the Eldrazi deck seem to be dominant. You only have to watch the video coverage of the end of the Pro Tour to see that. I find this article's first line about them the most apt: "Can Eldrazi decks be beaten?" (to save you from clicking: 'No. See you next week!').

What happened, and how this came to be, is that they made the Eldrazi in the latest set strong, on par with colored creatures, and they treat the colorless "waste" mana symbol like it's a regular colored mana symbol. This would be all well and good except they had previously printed some lands (Eldrazi Temple and Eye of Ugin) that make the really high-cost Eldrazi easier to cast by reducing the mana cost or providing extra mana, lands that have basically no other function. When you combine the two... it's... it's just not fair.

There was another rather major change in Modern: they banned Splinter Twin, which changes up the control scene. Not for reasons of being too powerful (it kind of was) but just to change things up. I like the idea, but I think control players will have a tough time finding a good, reliable replacement. Not that this really directly impacts the Eldrazi addition. It's just another thing.

What decks have I been playing in Modern? I've been throwing around a Naya tokens deck for a while. I like it, but it's not quite consistent enough. I've tried to put together a blue-white control deck as well. It's good, but Eldrazi will walk all over it. I've even toyed with beating Eldrazi using Painter's Servant (which adds a color to everything, screwing up their accelerator lands), but it also seems inconsistent. This all leads me to: if you can't beat 'em, join 'em. Without further ado I give you my Eldrazi brew, black Eldrazi Tron. It's a work in progress. I like the use of the Whip and the Oblivion Stones, not to mention leveraging Expedition Map. But it may all change after some testing.

There is a Modern StarCity Open in Kentucky this weekend. I talked myself out of going. That's a bit of a shame, as it would have been a great place to test my deck, but a 4-hour drive each way (and there are no direct flights) just stinks. So I'll just try it out at my local shop (Realms of Gaming).

Oh, and on one final note, Wizards of the Coast decided to reprint a number of older cards in a set called Vintage Masters. I understand the motivation (money) and the reasoning (some of those cards are hard to come by), but most of the really good cards in the vintage tournament scene are forbidden from being reprinted by the reserve list, and a lot of the ones that should be reprinted to make Legacy more viable for new players (read: dual lands) are also on the list. So I question what effect it will really have on those tournaments in general, and whether this is just a for-fun set or... or what. I guess in the end any reprint is good for those formats, even if the $500-and-up cards never see the printing press again.

Gradient boosting series preamble

Hello folks! I'm going to do another series of blog posts, this time on gradient boosting. The series will be ongoing with no definite timeline. I expect it will only take 1, maybe 2 posts to cover the basics of gradient boosting. After that I want to spend some time exploring possible ways to improve it. I may find none; we'll see when we get there. Allow me to catch you up from last week and explain some of my motivation in doing this series.

Last time I wrote, I had finished one Kaggle contest and was moving back to another I had previously started. That contest is now over, having ended some 4 hours ago. If you go and look at the leaderboard you'll actually see I fell some 300 spots after the results got posted. It seems I forgot to move my selected submission to the more recent submission I made *grin*. Well, it didn't matter much anyway. I didn't get much of a chance to work on the contest this week, and while the gradient boosting version of my code worked much better, nothing beat my initial submission with it. And that submission is still a seriously far cry from the top of the leaderboard.

The time I did spend on my code for the contest was split between trying to improve my code and trying to find optimal values for the contest, neither of which was much of a success. The time I spent away from the computer just thinking about boosting in general has led me back here. I find that the way gradient boosting has been explained to me over and over again was possibly part of the reason I was complacent about getting it working. I want to present it in another way that perhaps is new to some people, maybe most people. A change of perspective, if you will. I'll get on about that when I do my first post in the series.

I'm hoping this all leads to something better. I'm optimistic, but with all things there is a functional best version. You don't see people really improving hash sort or bucket sort, and quicksort is really about as good as it gets for a comparison sort. Compression only goes so far too. I remember back in the 80s and 90s when there always seemed to be a new compression algorithm with better space savings. At some point that stopped, at least when it comes to lossless compression. (I'm still waiting on someone to figure out lossless fractal compression.) It seems possible, but unlikely, that GBM is the end of the road for general data mining.

 

A contest in a week (part 5)

It's over! Time for the epilogue. The winner scored 0.97024; I came in at 0.96477 on the private leaderboard, with 1056 people between us. :) My GBM submission this morning moved me up a ton. I also did a follow-up submission later which moved me up a little. Oh, if only I had another week. :) But that's okay! I'm sure I would have improved my score, but getting past 0.97024, well, that would have been something.

I did find there were some more tweaks that could bring real improvement, but I ran out of time. Specifically, the feature selection I did on each round; I hadn't honed that very well. Also, while making a forest of the results does improve things in general, adding rounds to the GBM was my path to most success. If I had more time, I think I would have at least figured out the breaking point where GBM stops making gains. That is, unlike a random forest, when you add too many rounds the noise starts dominating.

My next challenge is one I was actually already working on, https://www.kaggle.com/c/prudential-life-insurance-assessment . I'll take what I've created here and apply it there. I'm a long way from the top there as well, but hey, I didn't have GBM. :) 'Sides, Prudential pays more... ;)

 

A contest in a week (part 4)

Wouldn't it be the case that the one time I try to do a contest in a week (instead of doing it over months), I actually get a big breakthrough with about 17 hours left to go. And now I don't have time to hone the result. Hehehe, oh well. Better that than no result. But I'm getting ahead of myself; let me catch you up.

Here's a rundown of how my weekend went. Friday evening I went out and played cards (Magic: The Gathering, another hobby of mine). Then Friday night I worked on the contest till the wee hours of the morning. Usually this means I coded some, then watched TV while things ran. Wash, rinse, repeat till I finally call it a night. You can read the last post for details on how that went.

Then Saturday I skipped Mardi Gras (St. Louis has a rather large celebration, something I rather enjoy, and great weather for it this year) only to work more on the program. My day involved programming and then playing Civilization 4 (any ole video game will do, I just happen to like that one :) ) or watching TV while things ran. I spent all day honing the Platt scaling and trying a few variations in my trees. I also added a few features, namely month, day, quarter, year and day of week. Usually these are super strong indicators for sales and retail; I wasn't sure they would be useful here. They get used, but they don't change my score much.

In the end I came to the conclusion that my best version of Platt scaling was only compensating for the noise in the results. Which is to say it was useless; I was just overfitting my local results. Saturday night I ran my program for about 8 hours doing 1380 trees (3 times my normal run of 460) and submitted it to the leaderboard. I moved up 0.00001, yep, the smallest amount you can register. How disappointing.

That just means my forest had really given all it could give. So Sunday was pretty much the same as Saturday except for one major thing. Having spent all day Saturday cleaning/checking data and trying to hone my results, I was out of things to do! I couldn't even submit my program with a huge set of trees and expect a better result; I had already done that. So that was it, I was out of things to do.... well, everything except one thing, the elephant in the room. I could still try to work on my GBM model. For the uninitiated (are there any of those out there still?), GBM is a gradient boosting machine.

It's been a "thing" for me for a long time. That is I've tried time and time again to implement it only to get poor results back. I could give you my best guesses as to why... I suppose part of it is I'm always trying to do too much at once (improve while implementing). I don’t have a test case to build to, to see it work right at first. I probably implemented parts of it wrong. Perhaps sometimes I don’t stay true to the core idea. Or maybe my trees don't work like conventional trees (they don't sometimes, in the mathematical sense). But you know, those are really guesses on my part. What matters is I decided to try yet again because it’s all I had left to do!

I put together a simple framework (no frills) that calls my standard tree with logistic splits, and set about trying to get some settings right. It didn't take long before I got some that actually showed an improvement in my local tests! And I mean a sizable improvement. The rubber will meet the road when I actually make a submission, of course, but right now things look really good. Granted, it didn't improve accuracy, but that's not important for AUC. Correlation coefficient and Gini are all I really care about (well, I'd just use AUC if I ever bothered to fix the calculation *grin*). Incidentally, I can actually get accuracy way up there if I use Platt scaling, but if I do, the other metrics go to crap. That just goes to show accuracy isn't what you care about this time.
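To make the idea concrete, here is a minimal sketch of the kind of boosting loop I mean, in Python. The fit_tree callback is a hypothetical stand-in for my logistic-split tree (and I've used plain squared-error residuals here for simplicity); it's the shape of the thing, not my actual framework.

import numpy as np

def gradient_boost(X, y, fit_tree, n_rounds=15, eta=0.75, feature_frac=0.68, seed=0):
    # start from the global mean and keep a running prediction
    rng = np.random.default_rng(seed)
    pred = np.full(len(y), y.mean())
    model = []
    for _ in range(n_rounds):
        residual = y - pred                                   # what the next tree should explain
        cols = rng.choice(X.shape[1], int(feature_frac * X.shape[1]), replace=False)
        tree = fit_tree(X[:, cols], residual)                 # stand-in for a logistic-split tree
        pred += eta * tree.predict(X[:, cols])                # shrink each tree's contribution by eta
        model.append((cols, tree))
    return model, pred

Each round fits a tree to the current residuals on a random subset of the features, then adds a shrunken copy of its predictions to the running model. The eta, n_rounds and feature_frac defaults are just the settings I talk about below.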

previous best (46 trees in a random forest - 3 fold CV) - leaderboard score 0.96333
gini-score:0.91988301504356
CorCoefRR:0.635294182255424
Accuracy:0.887165645525564
LogLoss:NaN
AUC:1
RMSE:1.18455684615341
MAE:0.529971023064858
Rmsle:0.295349461579467
Calibration:0.0801169849564404

current best with GBM (15 trees in a GBM forest - don't judge me, I love forests - 3 fold CV)
gini-score:0.925589843599255
CorCoefRR:0.64803931053647
Accuracy:0.86189710899676
LogLoss:NaN
AUC:1
RMSE:1.15794907138431
MAE:0.588354750026592
Rmsle:0.303147411676309
Calibration:0.074410156400745

You are probably curious about the specifics of my GBM settings. Oddly, or perhaps not, much of it matches what people have already shared on the forums. I'm doing depth-7 trees and I'm taking only 68% of the features for each GBM iteration. I tried using more or less depth, and I tried sub-selecting rows for each iteration, but it all made the score worse. I also tried knocking the feature selection down to like 33% just to see, but that hurt the results as well. Normally I would hone that too, but I'm out of time.

The other settings I'm using don't translate to the GBMs other people are using. For instance, my "eta" is 0.75, way more than some of the settings I saw. I tested it; that's the sweet spot. My nrounds is like 15 in the result above. Doing 1800 or whatever wouldn't be feasible with the way my tree works. It would take weeks to run, and the improvements would diminish so fast as to not make any sense. Okay, well, maybe it would make sense if you had your eta at 0.01, but again... it would take days to run. Also, the difference between 10 nrounds and 15 is pretty small, so clearly the benefit of ramping that up is diminishing quickly.

So that's been my weekend (throw in some cold pizza, 2 pots of coffee, a bowl of oatmeal and 2 trips to Taco Bell and you have the full experience). I'll be wrapping it up here in the next half hour or so, as I have to work in the morning. Just as soon as I get a few more results back, so I know what I can scale up for an overnight run to maximize my submission results. Once I've got that I'll hit the hay, and tomorrow.... oh tomorrow, tomorrow we see if it's all for naught. :)

 

A contest in a week (part 3)

A few more days have passed and I've moved my score up only a tiny bit. I now sit at 0.96333. I think at this point I might actually be moving down the leaderboard as people pass me. I've spent quite a bit of time mulling over possible changes to the bagging process and trying a few ideas. I don't think there is anything for me to do there right now. In short, what I have now works just fine and there are no obvious improvements.

The Platt scaling I mentioned before might still produce some beneficial results. I tried using the scaling I had in place for a different contest, but it doesn't seem to translate. It gave me higher accuracy but destroyed the ordering, which is all AUC cares about. I'm going to give it another crack by saving off cross-validation results and seeing if I can improve them by running the sigmoid function over them with different coefficients. I also store the total weight (the accuracy from each tree from bagging, used to weight each tree's vote in the final results), so I actually have 2 inputs and can do something more interesting than a straight translation.
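For reference, here is roughly what fitting sigmoid coefficients to saved CV results looks like, sketched with scikit-learn (which I don't use in my own pipeline); the numpy arrays of raw scores and 0/1 labels are assumed to be the saved cross-validation output.

import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_platt(raw_scores, labels):
    # fit A and B so that 1 / (1 + exp(-(A*s + B))) maps raw scores to calibrated probabilities
    lr = LogisticRegression(C=1e6)                      # effectively unregularized
    lr.fit(raw_scores.reshape(-1, 1), labels)
    A, B = lr.coef_[0, 0], lr.intercept_[0]
    return lambda s: 1.0 / (1.0 + np.exp(-(A * s + B)))

With the total weight as a second input, the same fit just takes a two-column matrix instead of a single column of scores.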

When I do the work on that Platt scaling I will likely want to graph the formula I create. Years ago I found a really nice online graphing tool and thought I'd share. You can check it out here: https://rechneronline.de/function-graphs/

I wrote a little pivot SQL to help me look at the data. Normally I wouldn't share this kind of thing, as it tends to be specific to your data storage. However, I will share it, as it has applications anywhere you have a key-value/name-value pair table you want to pivot and your column names go 0,1,2,3,4,5, etc. (I have an attribute table with the labels.) So for any Google searchers looking for this sort of thing, here you go. I tried to make it generic enough to understand and reuse.

 

-- pivots a key-value table (RowNumber, ColumnNumber, value) into one row per RowNumber
-- with columns [0] through [@columnCount - 1]
declare @sql as nvarchar(max)
declare @nullSql as nvarchar(max)
declare @columnCount as int
declare @n as int

select @columnCount = 1000
select @n = 1
select @sql = '[0]'
select @nullSql = 'isnull([0],0) as [0]'

-- build the column lists: [0],[1],[2],... and isnull([0],0) as [0],...
while (@n < @columnCount)
begin
	select @sql = @sql + ',['+cast(@n as varchar(20))+']'
	select @nullSql = @nullSql + ',isnull(['+cast(@n as varchar(20))+'],0) as ['+cast(@n as varchar(20))+']'
	select @n = @n + 1
end

select @sql = N'SELECT RowNumber,' + @nullSql + ' FROM (
    SELECT
        RowNumber, ColumnNumber, value
    FROM KeyValueTable
) as sel
PIVOT
(
    SUM(value)
    FOR ColumnNumber IN (' + @sql + ')
) AS pvt
'

EXECUTE sp_executesql @sql

Oh, one thing I noticed when I went and used it: I had imported my date field wrong! There was a bug in my code, and it turns out my date field was pretty much garbage. I fixed it, but this just goes to show you should always double-check your inputs to make sure they are good. This isn't the first time this sort of thing has happened; I've wasted weeks and weeks before on bad data. I did spot-check my data, I just missed this. That particular column was special, being a date and all. The date format was yyyy-mm-dd, my loader had only ever dealt with dates in yyyy/mm/dd format, and the difference is what made it not work right.

I took a look at the forums to see if there were any obvious insights people have shared that I needed to implement. I already mentioned I don't spend a lot of time trying to learn the data and hand massage it into exactly what I need. To really excel at data mining competitions you should do that. In the business world you never touch the algorithms. That's what R&D and PhDs do (and me, apparently). You just buy the tools and use them. Which of course means the stronger my toolkit gets, the more I do that sort of thing, 'cause that's where the gains are.

I like to think I get some deep understanding of the nature of data interactions by taking the long road, but I'm probably fooling myself. :) There was at least one obvious thing I found in the forums (there may be more, I need to go back and look). I needed to create a column tallying how many pieces of missing data there are in each row. That's exactly the sort of thing my algorithm has a hard time determining on its own, and it's what gave me my tiny bump in score. It's also the sort of thing a genetic algorithm might find on its own... but that's something I'll save for another series. (I have such delights to share with you all!)
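If you're wondering what that looks like in practice, it's basically a one-liner with pandas; the file name here is just for illustration.

import pandas as pd

train = pd.read_csv("train.csv")                      # hypothetical path to the contest data
train["missing_count"] = train.isnull().sum(axis=1)   # how many cells are missing in each row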

I also tried some t-SNE transforms on the data. There is a nice thread about this on the forums: https://www.kaggle.com/c/homesite-quote-conversion/forums/t/18554/visualization-of-observations . One guy in particular managed to get the output to look really nice. Unfortunately, the few tests I did produced the stringy-looking results you can see in that thread as well. He mentioned he replaced the categorical values with the average score for each category. This makes a lot of sense, as raw whole numbers representing a category are meaningless. It is also probably a far better technique than one-hot encoding for this process: t-SNE wants related data to be in one feature so it can figure out the connection to other features, and separating it out messes this up because it doesn't know two features are actually one. As I only have a week and we are down to 2+ days, I won't be revisiting my t-SNE work for this contest. However, it's definitely something to remember: feeding the t-SNE results into your model could very well produce a winning score.
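A rough sketch of that replacement trick, in case it isn't obvious. I'm assuming the Homesite target column name here, and scikit-learn's TSNE stands in for whatever implementation you happen to use.

import pandas as pd
from sklearn.manifold import TSNE

def target_mean_encode(df, target, cat_cols):
    # replace each categorical value with the average target for that category
    out = df.copy()
    for col in cat_cols:
        out[col] = df[col].map(df.groupby(col)[target].mean())
    return out

# encoded = target_mean_encode(train, "QuoteConversion_Flag", cat_cols)
# points = TSNE(n_components=2).fit_transform(encoded.drop(columns=["QuoteConversion_Flag"]).values)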

Incidentally, I do the same category-to-real-number transformation when calculating correlation coefficients on categorical values while figuring splits down the tree. I don't always do one-hot encoding; in fact, I only do it afterwards when looking for improvements.

So my next steps are looking at making a transformation using Platt scaling from my results, and looking at the data for obvious things I might try to improve the score. More reviewing the forums to see if there are other "you need to do this" posts. And just general noodling on what might work.

 

 

A contest in a week (part 2)

Here we are a few days later. If you look at the leaderboard https://www.kaggle.com/c/homesite-quote-conversion/leaderboard the current best is 0.97062. I am still way below that at 0.96305. That doesn't sound like much, but evidently this is a pretty straightforward set of data and the value is in eking out that last little bit of score. My score IS higher than last time I posted; it was improved 2 different times.

It first went up when I realized the data I had imported had features that had been flagged as categorical. My importer does the best it can to figure that sort of thing out, but it's very, very far from perfect. Just because values are whole numbers doesn't mean a feature is necessarily categorical, and even if it DOES hold whole numbers, that doesn't mean it should be treated as a linear progression. In short, I changed all the features to be considered real numbers and this gave me a large gain.

I should mention I handle categorical features differently than real-number features. Basically I make bags that represent some of the feature values and try to make the bags the same size. In cases where I need to evaluate a feature as a number and it hasn't been changed to a yes or a no, I look at the training data and find an average value for that category. It's not perfect, but in some cases it's way faster than splitting out 10000 categories, especially when you want to see if a particular feature has some sort of correlation with the scores.

The 2nd, much smaller gain came from flagging any features with 10 or fewer values and telling the system to treat them as categorical fields (chosen mainly as a cutoff to keep the total feature count to a minimum). Then I went ahead and did a "one-hot encoding" style split on those features to break them into their component parts. That is, you take all the possible values for a feature and give each one its own feature; 5 different values means 5 different features. Each new feature then holds either a 0 or a 1 indicating whether that value is present. I flag all those new features as real-number features rather than categorical features, and I turn off the original feature. This gave me a small gain.
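In pandas terms the whole step amounts to something like the snippet below. My code doesn't use pandas; this is just to show the transformation.

import pandas as pd

def one_hot_low_cardinality(df, max_values=10):
    # any feature with max_values or fewer distinct values gets split into
    # one 0/1 column per value; get_dummies drops the original column
    low_card = [c for c in df.columns if df[c].nunique() <= max_values]
    return pd.get_dummies(df, columns=low_card)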

My current internal testing result looks like this:


rfx3 - 460 trees - new cat setup - 0.96305
gini-score:0.919898981627662
CorCoefRR:0.634927047984772
Accuracy:0.886946585874578
LogLoss:NaN
AUC:1
RMSE:1.18522649708472
MAE:0.530871232940142
Rmsle:0.295421422778189
Calibration:0.0801010183723376

I mentioned before that I should probably fix AUC, as that is really what ROC is. I did take a glance at it and didn't see anything obviously wrong, but before this is over I'll almost definitely have to get in there and fix it. I've continued using Gini 'cause it seems pretty close.

My next steps are to figure out if there is a way I can reduce noise and/or increase the weight of correct values. In a different contest I wrote a mechanism that attempts to do a version of Platt scaling https://en.wikipedia.org/wiki/Platt_scaling (which, really, is all my sigmoid splitters are) on the result, to better pin whole-number answers to an exact value. I didn't do a standard implementation, which is very me. It worked really well. This contest does not use whole values though, so I'll have to go take a look at the code to see if I can make it work for that. To be clear, I intend to get the appropriate weight for any given result from the Platt scaling.

This doesn't directly handle another problem I'd like to fix: noise in the data. Any given sample is fuzzy in nature. The values may be exact and correct, but the underlying truth of what they mean is bell shaped. Ideally we would get at that truth and have empirical flags (or real values) come out of each feature that always gave the right answer when used correctly. I don't have a magic way to do that... yet. :) In the meantime, the best thing we can do is find the features that do more harm than good. Unfortunately, the only way I have to do that at this point is to brute-force test removing each feature and see if there are gains or not. Terrible, I know. It's been running for about 24 hours and has found only 1 feature it thinks is worth turning off, and even that improvement was right on the line of what could be considered variance in my results. This is why you want higher cross validation, so you can say with much more certainty whether you should make a change or not. I might stop it and pick up the search later (I can always resume where I left off). It's a good way to fill time productively when you are working or just not interested in programming.
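The brute-force search itself is nothing fancy; something like the sketch below, where score_without is a hypothetical callback that runs a full cross-validated test with one feature turned off, and baseline is the current CV score with everything on.

def brute_force_ablation(features, score_without, baseline):
    # test each feature in turn; flag it for removal only if the CV score actually improves
    disabled = []
    for f in features:
        score = score_without(f)        # full CV run with feature f turned off
        if score > baseline:
            disabled.append(f)
    return disabled

Since each call is a full CV run, this is exactly the kind of thing you leave grinding for a day while you do something else.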

The only other way I can think of offhand to improve my overall score is to increase the size of the data set I train on. Currently I'm doing my training using a bagged training set. This means I hold out about 1/3 (1.0/e) of the data and use it to score the other 2/3rds (1.0 - 1.0/e) I build the tree on. This then becomes my measure of accuracy for that tree. If I can find a good way to increase that, I should in principle have a better, more accurate model. Ideally the ensemble nature of the random forest takes care of that, and combined with the Platt scaling I'm looking at adding, there may be no real gains. I would think, though, that if you can minimize the statistics and maximize the use of the data, you would get to a precise answer faster. We will see what I come up with.
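The 1/e figure comes straight from sampling with replacement: if you draw N rows from N with replacement, about 36.8% of the rows are never drawn and become the holdout. A quick check (numpy assumed):

import numpy as np

rng = np.random.default_rng(0)
n = 100_000
in_bag = rng.integers(0, n, size=n)       # bootstrap sample, with replacement
held_out = n - len(np.unique(in_bag))     # rows that were never drawn
print(held_out / n)                       # ~0.368, i.e. roughly 1/e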

 

 

A contest in a week (part 1)

I thought I'd give some examples of how things generally go for me. I imported the data from the https://www.kaggle.com/c/homesite-quote-conversion contest. I had to do this 2 times as I messed it up the first time, which is par for the course. I have some stock data importers I wrote that handle most spreadsheet-style data with very little tweaking.

Next I ran a few out-of-the-box tests. I get my results back in a standard form. I believe this scoreboard is being calculated using ROC (which is essentially AUC, and to a lesser extent the Gini score). You may notice I have an AUC of 1 and a LogLoss of NaN in all these tests. Well, my AUC is probably being calculated wrong (like maybe it has some rounding hardcoded in it or expects values 0 - 1 or some such); I just haven't gone and looked yet. My LogLoss is NaN because that one specifically wants values 0 - 1.

Generally speaking, I use whichever test I have that is an exact match or is the closest, because they all approach a perfect score, and spending the time to implement yet another scoring mechanism really is not very high on my list of things to do. Basically, I don't usually worry about it too much unless it's radically different from what I already have.

As for getting the actual scores, I use cross validation like pretty much the rest of the world. I generally only do 3-fold cross validation because significant gains are very apparent even at 3. Increasing it to 4, 5, 6... 9, 10, etc. just ends up making the tests take longer for little extra information. Though if you are looking for accuracy, go with 10; that seems to be the sweet spot. I've toyed with going back to 5 or even 9, 'cause all too often I do get into the weeds with this stuff and finding little gains becomes important if you want to win.
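For completeness, the 3-fold loop is nothing exotic. A sketch, with a hypothetical train_and_score callback standing in for my forest code:

import numpy as np

def cross_validate(X, y, train_and_score, folds=3, seed=0):
    # shuffle once, split into equal folds, train on the rest and score the held-out fold
    idx = np.random.default_rng(seed).permutation(len(y))
    scores = []
    for test_idx in np.array_split(idx, folds):
        train_idx = np.setdiff1d(idx, test_idx)
        scores.append(train_and_score(X[train_idx], y[train_idx], X[test_idx], y[test_idx]))
    return np.mean(scores)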

Legend

rfx1 = random forest experiment 1 ... basic normal random forest with my custom splitters

rfx2 = random forest experiment 2 ... I tested this but the results were really bad

rfx3 = random forest experiment 3 ... basic normal random forest with my logistic/sigmoid splitters. The number behind the logistic represents how many features are used in each split. I should note I do not randomly select features with this mechanism; I use those that correlate the most with the final score.

rfx4 = random forest experiment 4 ... another logistic splitter; this one uses an additional mechanism that figures parent node scores and accuracies into the final score for the tree's leaves. It didn't improve results, but I thought I'd show it here just to give an example of the kind of things I try.

Results
rfx1 - 46 trees
gini-score:0.796266356063114
CorCoefRR:0.378157688020071
Accuracy:0.7456163354227
LogLoss:NaN
AUC:1
RMSE:1.5631840915599
MAE:1.07671401193999
Rmsle:0.405897189976455
Calibration:0.2543836645773

rfx3 - 46 trees (logistic 1) Leaderboard score with 46 trees 0.90187
gini-score:0.795058723660037
CorCoefRR:0.409256922331928
Accuracy:0.780350470167233
LogLoss:NaN
AUC:1
RMSE:1.52284323987427
MAE:0.979367384563889
Rmsle:0.389736465098101
Calibration:0.219649529832767

rfx3 - 46 trees (logistic 2)
gini-score:0.900834289021121
CorCoefRR:0.592472621731979
Accuracy:0.854328645381246
LogLoss:NaN
AUC:1
RMSE:1.25910166514491
MAE:0.658794207857713
Rmsle:0.316163290273993
Calibration:0.145671354618754

rfx3 - 46 trees (logistic 3) Leaderboard score with 460 trees 0.95695
gini-score:0.905913368972696
CorCoefRR:0.600941784768692
Accuracy:0.859935155804304
LogLoss:NaN
AUC:1
RMSE:1.24723196082576
MAE:0.640500763913664
Rmsle:0.311818233193152
Calibration:0.140064844195696

rfx3 - 46 trees (logistic 4)
gini-score:0.90578470857155
CorCoefRR:0.599962582265556
Accuracy:0.857539596865357
LogLoss:NaN
AUC:1
RMSE:1.25525929882265
MAE:0.657799102194174
Rmsle:0.312383492790915
Calibration:0.142460403134643

rfx4 - 46 trees (logistic 3)
gini-score:0.905049421077021
CorCoefRR:0.599288488034863
Accuracy:0.83152333785067
LogLoss:NaN
AUC:1
RMSE:1.24604271464082
MAE:0.714368432524074
Rmsle:0.326817108831641
Calibration:0.16847666214933

 

Calibration is something I change ad hoc. I have iterative processes that can test things while I'm not at the computer, and I use the calibration field to decide what is working and what is not. It is currently set to 1 - accuracy.

I should talk about accuracy. How do you determine how accurate something is? It's easy when the values are 0 to 100 and scores are evenly distributed; that's rarely the case. Here's how I do it: I build a normal distribution of the expected results, then I see where the final result fell compared to where the expected result fell on the normal curve. Then I find the percentile for each of the two values, and 1 minus the difference is my accuracy.

An example might go something like this: the test value is -1 and the train value is 2. The average value is 0 and let's say the standard deviation is 1. Well, 50% is left and right of the mean, so -1 standard deviation is at 15.9% and +2 standard deviations is at 97.7%, so my accuracy is 1 - (0.977 - 0.159), or 18.2% (which is terrible!), but you see how it works. I do this for all the accuracies I need, whether it is bagging, final scores or splitting.
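Written out in code (a sketch of the description above, not my actual implementation), that accuracy measure looks something like this:

from math import erf, sqrt

def normal_cdf(x, mu, sigma):
    # where x falls on the normal curve, as a percentile
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

def accuracy(predicted, actual, mu, sigma):
    # 1 minus the gap between the two percentiles
    return 1.0 - abs(normal_cdf(predicted, mu, sigma) - normal_cdf(actual, mu, sigma))

print(accuracy(-1.0, 2.0, 0.0, 1.0))   # about 0.18, the 18.2% from the example above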

That's enough for today, I’ll follow up in a few days with how things are going.

 

Current state of my datamining

Since this blog is starting over, I think I'll give a little background on my exploits. I never use toolkits when it comes to the primary algorithm I use for data mining. So things like R, scikit-learn, xgboost and Matlab, where the code is already implemented, I don't use. In my mind, half the fun IS developing algorithms and trying to improve on what has been made. The latter is generally the better approach, as truly new algorithms seem to fall short.

Now, there are times when I do use libraries or someone else's code: I use them for data transformation. So while I might write my own distance function or my own correlation coefficient, things like t-SNE or SVD would just be exercises in implementation, and really, I don't want to do that, as I'm not trying to speed them up or somehow improve them. (Though I did make my own version of t-SNE, I still started with someone else's code.)

When it comes to actually data mining a new thing (read: Kaggle contest), I generally throw some stock stuff I've written at it and then try new variations on how it is implemented. Most of the variations are new ways to improve the mechanism in general and have very little to do with that specific data. Usually I then move to a new contest, find that the variations weren't all that good in general, and backtrack some.

The 2 tools I have at the moment that are the best I have to offer are variations on random forests. It's worth mentioning that every time I try to implement gradient boosting the results are lackluster. Someday, maybe, I'll get a good version of that. It is a bit disappointing 'cause I have never been in the running to win a contest at https://www.kaggle.com/. And lately xgboost https://www.youtube.com/watch?v=X47SGnTMZIU , which is a very, very good implementation of gradient boosting, has dominated.

Yeah, despite thousands of man-hours, I've had 2 top-10% finishes. That's really the difference between developing something new and using out-of-the-box implementations built on standard, well-tested, well-thought-out, proven techniques. So, you know, if your thing is winning, stick with known mechanisms. Most people who had spent the time I've spent, using those, would have had at least one top-10 finish or even an in-the-money finish by now. Well, I would hope so, anyway.

So what's different about my 2 random forest implementations? It's the splitters. One uses a splitter that does the best I can come up with to get a pure left and right at each tree node. It picks the purity based on property-sorted percentages that weight the accuracy of the correct or incorrect split using normal curves.

It is also written so it can use multiple features at once, which can consolidate down to either a point or a line. It tries all 4 possibilities: 2 points, 1 point and 1 line (in either order), or 2 lines. The distance from said point/line dictates the side a sample falls on (closer is better).

The other version uses logistic (sigmoid) functions based on normal distributions of the values at that point in the tree. It was an "AH HA!" moment about a month and a half ago when I realized there was a direct translation between normal curves and the sigmoid cumulative distribution. You don't need to do gradient descent to get optimal values; you just need a normal curve, and it translates perfectly. You can see here https://en.wikipedia.org/wiki/Logistic_distribution that the average and standard deviation are known from the normal curve; you just need to do some simple math to figure out the values you need. This works amazingly well. There are more details to it in terms of which side is the "high side" (left or right). The upshot of doing it this way is you can make a multi-dimensional splitter (at any level of dimensionality) ridiculously fast. You can make a sigmoid curve that is either based on the sum of features or has the features multiplied out (since combining standard deviations and averages is super simple). I find the sums version works better, which makes sense if your features are all independent. You definitely could mix and match features, combining some and making sums out of others; I haven't tried to figure out if there are any gains doing that yet.
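Here's the translation as I read it off the Wikipedia page, sketched for a single feature (this is my reading of it, not a dump of my splitter code): a logistic distribution with the same mean and variance as the fitted normal has location mu and scale sigma * sqrt(3) / pi, and its CDF is the sigmoid you split on.

from math import exp, pi, sqrt

def sigmoid_split_from_normal(mu, sigma):
    # logistic CDF with the same mean and variance as N(mu, sigma)
    s = sigma * sqrt(3.0) / pi
    return lambda x: 1.0 / (1.0 + exp(-(x - mu) / s))

split = sigmoid_split_from_normal(0.0, 1.0)
print(split(0.0))   # 0.5 right at the mean, rising toward 1 on the "high side"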

So beyond those things and playing around with GBM, that's where I currently am. It's been a long, long road to get here and I'm really hoping for that breakthrough that puts me over the top. I've been looking at using indexes back into my tree to use data from further down the tree to change parent splitters, but the problem is tricky. Unlike xgboost, I want a fully populated tree, so doing that is problematic from a runtime perspective, and really, it's hard to know exactly what makes the best sense to do.

Tomorrow I'll give some examples of how well these work using a contest ( https://www.kaggle.com/c/homesite-quote-conversion ) I joined today that is ending in a week. Just to give people who are curious an idea of how well things work.

 

Time for a reboot

I've decided to start my blog over with actual blog software I didn't write. I did this mainly because I wanted some additional features and didn't want to spend time writing them. Also, I have had 2 posts in 1 year, which means everything was stagnant anyway, and it seemed like a good point to start over with a clean slate. Hopefully this new site will give me the gumption to write a little more.