Dim Red Glow

A blog about data mining, games, stocks and adventures.

A Standard eldrazi deck and a PPTQ

A PPTQ is a Preliminary Pro Tour Qualifier (for Magic: The Gathering). I've never won one, but this last weekend I came as close as I ever have, playing this deck: http://tappedout.net/mtg-decks/24-03-16-4-color-eldrazi/ . It was a 33-person tournament (6 rounds), and I ended up losing the semifinals to a Rally deck. Rally the Ancestors has pretty much dominated Standard tournaments, at least locally, though based on mtgtop8.com it seems to be a thing in other places too.

It uses either Zulaport Cutthroat triggers or an unblocked Nantuko Husk to win; usually the former, in my experience. I tried to prepare for it with Hallowed Moonlight (which is also good against other things) and a single Cranial Archive, which would force them to reshuffle their graveyard (making the Rally worthless), but it wasn't enough. In the end I neglected to always leave my 2 mana open to trigger Cranial Archive and lost to a top-decked Rally. Game one they won, as they tend to do, without my sideboard.

This leads me to a possibly better solution: I think I'll try Tainted Remedy next time. For the Rally matchup it should work wonders, as they don't run enchantment removal. Most of the time when they start trying to win with the Cutthroat they have less life than I do, so in short it would kill them. I'll have it as a 2-of in the sideboard somewhere. It's also worth mentioning that it would help against any decks running normal lifegain creatures like Seeker of the Way or Soulfire Grand Master. Not that we see a lot of those these days.

On a final note, concerning the two planeswalkers I ran: the unsung hero of my deck was Sorin, Solemn Visitor. He pulled his weight more than you would expect; I got paired against two prowess decks and he got me through both matches. As for Gideon, Ally of Zendikar, he's been underwhelming as of late. I think that has more to do with the prevalence of dragons and Rally in the format. Both decks try to win in a way that makes him more or less a non-issue.

Gradient Boosting (Part 2)

This time around let's start from scratch. We have some training data. Each row of the data has various features with real-number values and a score attached to it. We want to build a system that produces those scores from the training data. If we forget for a second "how" and think about it generically, we can say that there is some magical box that gives us a result. The result isn't perfect, but it's better than a random guess.

The results we get back from this black box are a good first step, but we need to improve them further. To do that, we take the training data and, instead of using the scores we were given, assign new scores to the rows. Each new score is how far off the black box's prediction was from the original score.

Now we send this data into a new black box and get back a result. This result isn't perfect either, but it's still better than a random guess. The two results combined give us a better answer than the first black box call alone. We repeat this process over and over, making a series of black box calls that slowly gets the sum of all the results to match the original scores. If we then send a single test row (whose score we don't know) into each of these black boxes that have been trained on the training data, the results can be added up to produce the score for the test row.

In math this looks something like this:

f(x) = g(x) + g'(x) + g''(x) + g'''(x) + g''''(x) .... (etc)

Where f(x) is our score and g(x) is our first black box call. g'(x) is the 2nd black box call, trained with the adjusted scores; g''(x) is the 3rd black box call, with the scores still further adjusted; etc.
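The whole loop can be sketched in a few lines of Python. This is an illustrative toy, not my actual implementation: the "black box" here is a one-split decision stump (`fit_stump` is a made-up helper standing in for a real depth-limited tree).

```python
# Toy gradient boosting: each black box fits what's left over
# (the residual) after all the previous black boxes.

def fit_stump(xs, ys):
    """Fit a single threshold split on one feature, predicting the
    mean of each side. Returns a predict(x) function."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        lm = sum(left) / len(left)
        rm = sum(right) / len(right)
        err = sum((y - (lm if x <= t else rm)) ** 2 for x, y in zip(xs, ys))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, rounds):
    """f(x) = g(x) + g'(x) + ... : each stump fits the leftover scores."""
    residuals = list(ys)
    stumps = []
    for _ in range(rounds):
        g = fit_stump(xs, residuals)
        stumps.append(g)
        residuals = [r - g(x) for x, r in zip(xs, residuals)]
    return lambda x: sum(g(x) for g in stumps)

xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9]
f = boost(xs, ys, rounds=5)
```

After a few rounds the summed stumps track the training scores closely, which is exactly the behavior described above.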

A few questions should arise. First, how many subsequent calls do we do? And second, what exactly is the black box?

I'll answer the second question first. The black box (at least in my case) is the lowly decision tree. Specifically, it is a tree that has been limited so that it terminates before it gets to a singular answer. In fact, it is generally stopped while large fractions of the training data are still grouped together. The terminal nodes just give averages of the scores of the group at that node. It is important to limit the tree, because you want to build an answer that is good in lots of cases.

Why? Because if you build specific answers and the answer is wrong, correcting the result is nearly impossible. Was it wrong because of a lack of data? Was it wrong because you split your decision tree on the wrong feature 3 nodes down? Was it wrong because this row is an outlier? Any one of these things could be true, so to eliminate them all as possibilities you stop building your tree relatively quickly and get an answer that is never exactly right but at least puts the result into a group. If you go too far down you start introducing noise into the results: noise that creeps in because your answers are too specific. It adds static to the output.

How far down should you go? It depends on how much data you have, how varied the answers are, and how the data is distributed. In general I use a depth of ((Math.Log(trainRows.Length) / Math.Log(2)) / 2.0), but it varies from data set to data set. This is just where I start, and I adjust it if need be. In short, I go halfway down to a specific answer.
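For reference, the C# expression above is just half of log2(n). A Python equivalent (a sketch with a hypothetical function name) would be:

```python
import math

def starting_depth(n_rows):
    """Half of log2(n_rows): e.g. 4096 rows gives a depth of 6."""
    return (math.log(n_rows) / math.log(2)) / 2.0
```

Whether you round the result up or down is a tuning choice, like the depth itself.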

Now, if you limit the depth uniformly (or nearly uniformly), the results from each terminal node will have similar output. That is, the results will fall between the minimum and maximum score, just like any prediction you might make, but probably somewhere in the middle (that is important). Each result will also be derived from the same number of decisions, so on average the answers' information content should be about the same. Because of this, the next iteration's newly calculated target values will, in the worst case, span a range identical to the previous range. In any other case the range decreases, since the low and high ends get smoothed into average values. So each iteration is probably shrinking the range of scores, and at worst leaving the range the same.

Also, the score input for the next black box call will still have most of the information in it, since all we have done is adjust scores based on one result, and we did it in the same way to a large number of rows; rows that, per the previous tree's decisions, share certain traits. Doing this allows us to slowly tease out qualities of similar rows of data in each new tree. And since we start over with each new tree, the groupings end up different each time. In this way, subtle shared qualities of rows can be expressed together once their remaining scores (original score minus all the black box calls) line up.

This brings us to the first question: how many calls should I do? To answer that accurately, it's important to know that the result that is returned is usually not added to the row's final score unmodified. Usually it is decreased by some fixed percentage. Why? This further reduces the influence any one decision tree has on the final prediction. In fact, I've seen people reduce the result to 1/100th of its returned value. Taking these tiny baby steps can help, but sometimes it just adds noise to the system, since each generation of a tree may have bias in it or might too strongly express a feature. In any case, it depends on your decision trees and your data.

The same goes for how many iterations to do. At some point you have gotten as accurate as you can get, and further iterations overfit the training data and make your results worse. Or, worse yet, they just add noise as the program attempts to fit the noise in the data to the scores. In my case, I test the predictions after each tree is generated to see if it is still adding accuracy. If I get 2 trees in a row that fail to produce an improvement, I quit (throwing away the final trees). Most programs just have hard iteration cutoff points. That works pretty well, but leads to a guessing game based on the parameters you have set up.
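The shrinkage-plus-patience loop described above can be sketched like this. `fit_next_tree` and `validation_score` are hypothetical stand-ins for your own tree builder and accuracy check; `eta` is the fixed percentage each tree's contribution gets reduced by.

```python
def boost_until_stale(fit_next_tree, validation_score, max_rounds,
                      eta=0.1, patience=2):
    """Keep adding shrunken trees; quit after `patience` consecutive
    rounds with no improvement, discarding the trees that didn't help."""
    kept, best, stale = [], None, 0
    for _ in range(max_rounds):
        tree = fit_next_tree(kept, eta)
        candidate = kept + [tree]
        score = validation_score(candidate, eta)
        if best is None or score > best:
            kept, best, stale = candidate, score, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return kept
```

A hard cutoff is just `patience = max_rounds`: you always run the full budget and hope you guessed it right.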

 

 

 

Detroit Magic the Gathering Grand Prix

I went to the Magic: The Gathering Grand Prix event in Detroit. It was interesting, but maybe less fun than past events, mainly because I've done it too many times before. The novelty has finally worn off.

I made day two playing an Eldrazi Tron deck I put together. My record was 7-2 going into day 2 (I had 2 byes). I think this had less to do with my magical brewing skills and more to do with eldrazi being so easy to play and overly powerful. I fully expect Eye of Ugin to be banned, and possibly Eldrazi Mimic. Though, to be honest, banning Eldrazi Temple and Mimic would work as well. Why Mimic? Because it's a 2-mana creature that enables the super fast wins.

If you curve perfectly with nothing more than Eye of Ugin and then Wastes, Wastes, Wastes... it's a 3/2 on turn 2, a 4/4 on turn 3, a 5/5 on turn 4. Granted, it might die along the way, but essentially you are paying 2 mana for a second copy of whatever your biggest creature is. And if it dies, the real creature lives. It gets worse if you get up to Ulamog (which my Tron deck of course ran). It is also abusive in other ways, since any colorless creature you put into play will trigger its ability. Just be happy Phyrexian Dreadnought isn't available in Modern. :)

Don't get me wrong: without the rest of the eldrazi it's a 'meh' card. If they leave it alone, it'll be because they want the eldrazi deck to remain a 'thing' in Modern. It still could be, with just Temple and Mimic. I would think it'll still be too strong/consistent even then, though. The consistency is what really makes it a good deck. Most games I didn't have an Eye of Ugin opening hand. Losing Eye would slow my deck down in maybe half the games and make the end game more difficult, but regardless, Eldrazi Tron would probably remain a viable deck.

I should add I didn't play against much eldrazi. I think this was just luck, and part of what got me to day 2. I did play about everything else under the sun, though. Day one went something like this: infect (lost 1-2), white-blue planeswalkers (really surprised to play this! won 2-0), mardu tokens/goodstuff (won 2-0), black-white tokens (won 2-1), red-green eldrazi (lost 1-2), an abzan Chord of Calling deck (won 2-1) and affinity (won 2-1). Day two went: storm (lost 1-2), merfolk (won 2-1), eldrazi tron (mirror match! won 2-0), living end (lost 1-2, drop).

My losses could have gone either way most of the time, but that's the luck of the draw. My deck could have used more testing; there were cards I cut almost every game. But that just shows you how good eldrazi is, that it could carry itself with maybe 4 so-so cards being run with it.

I'm not going to abandon eldrazi post-ban. In fact, I've long since put together my Legacy eldrazi deck and have been tweaking it. I don't see it going anywhere anytime soon :) it's really competitive.

 

Gradient Boosting (Part 1)

Okay! So you want to learn about gradient boosting. Well, first let me point you to the obvious source https://en.wikipedia.org/wiki/Gradient_boosting I'll wait for you to go read it and come back.

Back? Think you understand it? Good! Then you don't need to read further.... probably. I should warn you now in the strictest sense this post is entirely backstory on getting to the point of implementing gradient boosting. You might want to skip to part 2 if you want more explanation of what gradient boosting is doing.

When I first tackled gradient boosting, I tried it and it didn't work. What I mean to say is, I got worse results than with Random Forest https://en.wikipedia.org/wiki/Random_forest. Perhaps I'm getting ahead of myself; let me back up a little more and explain my perspective.

Most people at https://www.kaggle.com/ use tool kits or languages with libraries and built in implementations of all the core functionality they need. That is, the tool kits that they use have everything written for them. They make API calls that perform the bulk of the work. They just assemble the pieces and configure the settings for the calls. 

I write my own code for pretty much everything when it comes to data mining. I don't reinvent things I have no plans on improving, but there aren't too many things like that. I didn't write my own version of PCA https://en.wikipedia.org/wiki/Principal_component_analysis; I use one from a library on the rare occasion I want to use it. And while I've got my own version of TSNE https://lvdmaaten.github.io/tsne/, it was a rewrite of a javascript implementation someone else had written. Granted, I've tweaked the code a lot for speed and to do some interesting things with it, but I didn't sit down with a blank class file and just write it. Everything else, though, I've written all by myself.

So why does that make a difference (toolkit vs handwritten)? Well, I try stuff and have to figure things out. And because of that, my version of a technology might work in a fundamentally different way. Or perhaps what I settle on isn't as good (though it probably seems better at the time). Then, when I try to leverage that code for an implementation of gradient boosting, it doesn't work like it should.

The core of both gradient boosting and random forests is the decision tree https://en.wikipedia.org/wiki/Decision_tree_learning. When it comes to random forests, I have been very pleased with the tree algorithm I've designed. However, the trees just didn't seem to work well stubbed out for gradient boosting. I can only think of three explanations for this:

  1. My stubbed out trees tended to be biased a certain way.
  2. They have a lot of noise in the output.
  3. My gradient boosting algorithm got mucked up due to poor implementation.

That at least is the best I can figure.

Recently I made a giant leap forward in my decision tree technology. I had an 'AH HA!' moment and changed my tree implementation to use my new method. When I got it all right, my scores went up like 10% (ball-parking here), and all of a sudden, when I tried gradient boosting, it worked as well. The results I got with that were fully another 15-20% better still! All at once I felt like a contender. The rest of the world had long since moved on to XGBoost https://github.com/dmlc/xgboost, which was leaving me in the dust. Not so much anymore, but I still haven't won any contests or, like, made a million on the stock market :) .

What changed in my trees that made all the difference? I started using the sigmoid https://en.wikipedia.org/wiki/Logistic_function to make my splitters. I had tried that as well, many years ago, but the specific implementation was key. The ah-ha moment was when I realized how to directly translate my data into a perfect splitter without any messing around "honing in" on the best split. I think this technology not only gives me a great tree based on a better statistical model than using accuracy alone (accuracy is how my old tree worked), but the results are also more "smooth", so noise is less of an issue.
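I haven't spelled out the splitter itself, so here is only a generic illustration (my generic example, not the actual implementation) of what a sigmoid-based "soft" split looks like: instead of a hard yes/no at a threshold, each row gets a smooth 0-to-1 weight.

```python
import math

def sigmoid(z):
    """The logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def soft_split_weight(value, threshold, steepness=4.0):
    """Smooth membership on the 'right' side of a split: close to 0
    well below the threshold, close to 1 well above, 0.5 exactly at it."""
    return sigmoid(steepness * (value - threshold))
```

The smoothness is what dampens noise: a row sitting right at a split boundary contributes a little to both sides instead of flipping groups entirely on a tiny change in the feature.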

 

 

Legacy Eldrazi

Just a quick post about Legacy. I tried an eldrazi deck I built last night at my local shop's Legacy FNM. The short version is the deck did remarkably well and went 4-0. It might just have been luck, as any decent deck can fight its way through the variance every once in a while, but this was fairly one-sided. The only game I lost was in the 4th round, and it was with a hand I probably shouldn't have kept.

I was thinking I was going to try a green-black deck at the Legacy Grand Prix in June, but I think I might try this instead if it continues to impress. I didn't get a real test against any combo decks, but it beat 2 shardless BUG decks, a merfolk deck and miracles. I also played against death and taxes afterwards, and it seemed to hold its own just fine. Anyways, here's the deck:

Eldrazi Post

I should warn you I might modify the deck the link goes to slightly. This was just a first go at the deck, and as I figure out where the holes are I'll modify the sideboard and tweak the main board.

Eldrazi and Modern and Vintage Masters

I've talked about it before, and it's high time I wrote a little about it. I play Magic: The Gathering and have for years. I'm going to skip a lot of the basic explanation a layman might need and assume that if you keep reading you have some knowledge of how the game works and what sets are currently coming out. I'll also assume you know what the different tournament structures are and know at least a little something about deck building and technical game play (triggers, the stack, layers, etc.).

That being said, 2 interesting things are happening right now. First, Modern has really turned into a mess with the introduction of the new eldrazi cards from Oath of the Gatewatch. Mess might be the wrong word; let's say that all forms of the eldrazi deck seem to be dominant. You only have to watch the video coverage of the end of the pro tour to see that. I find this article's first line about them the most apt: Can Eldrazi decks be beaten? (to save you from clicking: 'No. See you next week!').

What happened, and how this came to be, is that they made the eldrazi in the latest set strong, on par with colored creatures. They treat the colorless "waste" mana symbol like it's a regular colored mana symbol. This would be all well and good, except they had previously printed some lands (Eldrazi Temple and Eye of Ugin) that make the really high mana cost eldrazi easier to cast by reducing the mana cost or providing extra mana, lands that have basically no other function. When you combine the two... it's... it's just not fair.

There was another rather major change in Modern: they banned Splinter Twin, which changes up the control scene. Not for reasons of being too powerful (it kind of was), but just to change things up. I like the idea, but I think control players will have a tough time finding a good, reliable replacement. Not that this directly impacts the eldrazi addition; it's just another thing.

What decks have I been playing in Modern? I've been throwing around a naya tokens deck for a while. I like it, but it's not quite consistent enough. I've tried to put together a blue-white control deck as well; it's good, but eldrazi will walk over it. I've even toyed with beating eldrazi using Painter's Servant (which adds color to everything, screwing up their accelerator lands), but that also seems inconsistent. This all leads me to: if you can't beat 'em, join 'em. Without further ado, I give you my eldrazi brew, black eldrazi tron. It's a work in progress. I like the use of the Whip and Oblivion Stones, not to mention leveraging Expedition Map. But it may all change after some testing.

There is a Modern StarCityGames open in Kentucky this weekend. I talked myself out of going, which is a bit of a shame, as it would have been a great place to test my deck. But a 4-hour drive each way (there are no direct flights) just stinks. So I'll just try it out at my local shop (Realms of Gaming).

Oh, and on one final note: Wizards of the Coast decided to reprint a number of older cards in a set called Vintage Masters. I understand the motivation (money) and the reasoning (some of those cards are hard to come by). But most of the really good cards in the vintage tournament scene are forbidden from being reprinted by the reserved list, and a lot of the ones that should be reprinted to make Legacy more viable for new players are also on the list (read: dual lands). So I question what effect it will really have on those tournaments in general, and whether this is just a for-fun set or... or what. I guess in the end, any reprint is a good thing for those formats, even if the $500-and-up cards never see the printing press again.

Gradient boosting series preamble

Hello folks! I'm going to do another series of blogs, this time on gradient boosting. This series will be ongoing, with no definite timeline. I expect it will only take 1, maybe 2, posts to cover the basics of gradient boosting. After that I want to spend some time exploring possible ways to improve it. I may find none; we'll see when we get there. Allow me to catch you up from last week and explain some of my motivation for doing a series.

Last time I wrote, I had finished one kaggle contest and was moving back to another I had previously started. That contest is now over, having ended some 4 hours ago. If you go and look at the leaderboard, you'll actually see I fell some 300 spots after results got posted. It seems I forgot to move my selected submission to the more recent submission I made *grin*. Well, it didn't matter much anyway. I didn't get much of a chance to work on the contest this week, and while the gradient boosting version of my code worked much better, nothing beat my initial submission with it. And that submission is still a seriously far cry from the top of the leaderboard.

The time I did spend on my code for the contest was split between half trying to improve my code and half trying to find optimal values for the contest. Neither was much of a success. The time I spent away from the computer just thinking about boosting in general has led me back here. I find that the way gradient boosting has been explained to me, over and over again, was possibly part of the reason I was complacent about getting it working. I want to present it in another way that is perhaps new to some people, maybe most people. A change of perspective, if you will. I'll get on about that when I do my first post in the series.

I'm hoping this all leads to something better. I'm optimistic, but with all things there is a functional best version. You don't see people really improving hash sort or bucket sort, and quicksort is really about as good as it gets for a comparison sort. Compression only goes so far, too. I remember back in the 80s and 90s, when there always seemed to be a new compression algorithm with better space savings. At some point that stopped, at least when it comes to lossless compression. (I'm still waiting on someone to figure out lossless fractal compression.) It seems possible, but unlikely, that GBM is the end of the road for general data mining.

 

A contest in a week (part 5)

It's over! Time for the epilogue. The winner scored 0.97024; I came in at 0.96477 on the private leaderboard, with 1056 people between us. :) My GBM submission this morning moved me up a ton. I also did a follow-up submission later, which moved me up a little. Oh, if only I had another week. :) But that's okay! I'm sure I would have improved my score, but getting past 0.97024, well, that would have been something.

I did find some more tweaks that could bring real improvement, but I ran out of time. Specifically, the feature selection I did on each round; I hadn't honed that very well. Also, while making a forest of the results does improve things in general, adding rounds to the GBM was my path to most success. If I had more time, I think I would have at least figured out the breaking point where GBM stops making gains. That is, unlike with a random forest, when you make too many trees the noise starts dominating.

My next challenge is one I was actually already working on: https://www.kaggle.com/c/prudential-life-insurance-assessment . I'll take what I've created here and go apply it there. I'm a long way from the top there as well, but, well, I didn't have GBM. :) 'Sides, prudential pays more... ;)

 

A contest in a week (part 4)

Wouldn't it be the case that the one time I try to do a contest in a week (instead of doing it over months), I actually get a big breakthrough with about 17 hours left to go. And now I don't have time to hone the result. Hehehe, oh well. Better that than no result. I'm getting ahead of myself; let me catch you up.

Here's a rundown of how my weekend went. Friday evening I went out and played cards (Magic: The Gathering, another hobby of mine). Then Friday night I worked on the contest till the wee hours of the morning. Usually this means I coded some, then watched TV while things ran. Wash, rinse, repeat till I finally call it a night. You can read the last post for details on how that went.

Then Saturday I skipped Mardi Gras (St. Louis has a rather large celebration, something I rather enjoy, and there was great weather for it this year) only to work more on the program. My day involved programming, then playing Civilization 4 (any ole video game will do, I just happen to like that one :) ) or watching TV while things ran. I spent all day honing the Platt scaling and trying a few variations in my trees. I also added a few features, namely month, day, quarter, year and day of week. Usually these are super strong indicators for sales and retail; I wasn't sure they would be useful here. They get used, but they don't change my score much.

In the end I came to the conclusion that my best version of Platt scaling was only compensating for the noise in the results. Which is to say, it was useless; I was just overfitting my local results. Saturday night I ran my program for about 8 hours doing 1380 trees (3 times my normal run of 460) and submitted it to the leaderboard. I moved up .00001, yep, the smallest amount you can register. How disappointing.

That just means my forest had really given all it can give. So Sunday was pretty much the same as Saturday, except for one major thing. Having spent all day Saturday cleaning/checking data and trying to hone my results, I was out of things to do! I couldn't even submit my program with a huge set of trees and expect a better result; I had already done that. So that was it, I was out of things to do... well, everything except one thing, the elephant in the room. I could still try to work on my GBM model. For the uninitiated (are there any of those out there still?), GBM is a Gradient Boosting Machine.

It's been a "thing" for me for a long time. That is, I've tried time and time again to implement it, only to get poor results back. I could give you my best guesses as to why... I suppose part of it is that I'm always trying to do too much at once (improve while implementing). I don't have a test case to build to, to see it work right at first. I probably implemented parts of it wrong. Perhaps sometimes I don't stay true to the core idea. Or maybe my trees don't work like conventional trees (they don't sometimes, in the mathematical sense). But you know, those are really guesses on my part. What matters is I decided to try yet again, because it's all I had left to do!

I put together a simple framework (no frills) that calls my standard tree, which uses logistic splits, and set to trying to get some settings right. It didn't take long before I got some that actually showed an improvement in my local tests! And I mean a sizable improvement. The rubber will meet the road when I actually make a submission, of course, but right now things look really good. Granted, it didn't improve accuracy, but that's not important for AUC. Correlation coefficient and Gini are all I really care about (well, I'd just use AUC if I ever bothered to fix the calculation of it *grin*). Incidentally, I can actually get accuracy way up there if I use Platt scaling, but if I do, the other metrics go to crap. That just goes to show it's not what you care about this time.

previous best (46 trees in a random forest - 3 fold CV)
gini-score:0.91988301504356 - 0.96333
CorCoefRR:0.635294182255424
Accuracy:0.887165645525564
LogLoss:NaN
AUC:1
RMSE:1.18455684615341
MAE:0.529971023064858
Rmsle:0.295349461579467
Calibration:0.0801169849564404

current best with GBM (15 trees in a GBM forest - don't judge me, I love forests - 3 fold CV)
gini-score:0.925589843599255
CorCoefRR:0.64803931053647
Accuracy:0.86189710899676
LogLoss:NaN
AUC:1
RMSE:1.15794907138431
MAE:0.588354750026592
Rmsle:0.303147411676309
Calibration:0.074410156400745

You are probably curious about the specifics of my GBM settings. Oddly, or perhaps not, much of it people have already shared on the forums. I'm doing depth-7 trees, and I'm taking only 68% of the features for each GBM iteration. I tried using more or less depth, and I tried sub-selecting rows for each iteration, but it all made the score worse. I also tried knocking the feature selection down to like 33% just to see, but it hurt the results as well. Normally I would hone that too, but I'm out of time.

The other settings I'm using don't translate to the GBMs other people are using. For instance, my "eta" is .75, way more than some of the settings I saw. I tested it; that's the sweet spot. My nrounds is like 15 in that result above. Doing 1800 or whatever wouldn't be feasible with the way my tree works. It would take weeks to run, and the improvements would diminish so fast as to not make any sense. Okay, well, maybe it would make sense if you had your eta at .01, but again... it would take days to run. Also, the difference between 10 nrounds and 15 is pretty small, so clearly the benefit of ramping that up is diminishing quickly.

So that's been my weekend. (Throw in some cold pizza, 2 pots of coffee, a bowl of oatmeal and 2 trips to Taco Bell and you now have the full experience.) I'll be wrapping it up here in the next half hour or so, as I have to work in the morning, just as soon as I get a few more results back, so I know what I can scale up for an overnight run to maximize a submission's results. Once I've got that, I'll hit the hay, and tomorrow... oh tomorrow, tomorrow we see if it's all for naught. :)

 

A contest in a week (part 3)

A few more days have passed, and I've moved my score up only a tiny bit. I now sit at 0.96333. I think at this point I might actually be moving down the leaderboard as people pass me. I've spent quite a bit of time mulling over possible changes to the bagging process and trying a few ideas. I don't think there is anything for me to do there right now. In short, what I have now works just fine, and there are no obvious improvements.

The Platt scaling I mentioned before might still produce some beneficial results. I tried using the scaling I had in place from a different contest, but it doesn't seem to translate. It gave me higher accuracy but destroyed the ordering, which is all AUC cares about. I'm going to give it another crack by saving off cross-validation results and seeing if I can improve them by running the sigmoid function over them with different coefficients. I store the total weight (the accuracy from each tree from bagging, used to weight each tree's vote in the final results), so I actually have 2 inputs and can do something more interesting than a straight translation.
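Platt scaling itself is just a two-coefficient sigmoid over the raw score. A minimal sketch (fitting the coefficients, e.g. by searching over the saved CV results, is left out):

```python
import math

def platt(raw_score, a, b):
    """Platt scaling: squash a raw model output into a probability.
    a and b are coefficients fit on held-out predictions; a is
    negative so that larger raw scores map to larger probabilities."""
    return 1.0 / (1.0 + math.exp(a * raw_score + b))
```

Note that with a single input this is a monotonic transform, so it changes calibration (and accuracy at a cutoff) but not the ranking; with a second input, like the total tree weight mentioned above, the ordering can actually change, for better or worse.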

When I do the work on that Platt scaling, I will likely want to graph the formula I create. Years ago I found a really nice online graphing tool and thought I'd share. You can check it out here: https://rechneronline.de/function-graphs/

I wrote a little pivot SQL to help me look at the data. Normally I wouldn't share this kind of thing, as it tends to be specific to your data storage. However, I will share it, as it applies anywhere you have a key-value/name-value pair table you want to pivot where your column names go 0,1,2,3,4,5, etc. (I have an attribute table with the labels.) So, for any google searchers looking for this sort of thing, here you go. I tried making it generic enough to understand and be reusable.

 

declare @sql as nvarchar(max)
declare @nullSql as nvarchar(max)
declare @columnCount as int
declare @n as int

select @columnCount = 1000
select @n = 1
select @sql = '[0]'
select @nullSql = 'isnull([0],0) as [0]'

while (@n < @columnCount)
begin
	select @sql = @sql + ',['+cast(@n as varchar(20))+']'
	select @nullSql = @nullSql + ',isnull(['+cast(@n as varchar(20))+'],0) as ['+cast(@n as varchar(20))+']'
	select @n = @n + 1
end

select  @sql = N'SELECT RowNumber,' + @nullSql + ' FROM (
    SELECT 
        RowNumber,ColumnNumber, value
    FROM KeyValueTable
) as sel
PIVOT
(
		SUM(value)
    FOR ColumnNumber IN (' + @sql + ')
) AS pvt 
'

EXECUTE sp_executesql @SQL

Oh, one thing I noticed when I went and used it: I had imported my date field wrong! There was a bug in my code, and it turns out my date field was pretty much garbage. I fixed it, but this just goes to show you should always double check your inputs to make sure they are good. This isn't the first time this sort of thing has happened; I've wasted weeks and weeks before on bad data. I did spot check my data, I just missed this. That particular column was special, being a date and all. The date format was yyyy-mm-dd; my loader had only ever dealt with dates in yyyy/mm/dd format, and the difference is what made it not work right.

I took a look at the forums to see if there were any obvious insights people had shared that I needed to implement. I already mentioned I don't spend a lot of time trying to learn the data and hand-massage it to be exactly what I need. To really excel at data mining competitions, you should do that. In the business world you never touch the algorithms; that's what R&D and PhDs do (and me, apparently). You just buy the tools and use them. Which of course means the stronger my toolkit gets, the more I do that sort of thing, because that's where the gains are.

I like to think I get some deep understanding of the nature of data interactions by taking the long road, but I'm probably fooling myself. :) There was at least one obvious thing I found in the forums (there may be more, I need to go back and look). I needed to create a column tallying how many pieces of missing data there are in each row. That's exactly the sort of thing my algorithm has a hard time determining on its own, and it is what gave me my tiny bump in score. It's also the sort of thing a genetic algorithm might find on its own... but that's something I'll save for another series. (I have such delights to share with you all!)
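That tally is a one-liner per row. A sketch, assuming rows are lists and missing values come in as None:

```python
MISSING = None  # or whatever sentinel your loader uses

def add_missing_count(rows):
    """Append a count-of-missing feature to every row."""
    return [row + [sum(1 for v in row if v is MISSING)] for row in rows]

rows = [[1, None, 3], [None, None, 6], [7, 8, 9]]
augmented = add_missing_count(rows)
```

The new column turns "how incomplete is this row" into something a tree can split on directly.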

I also tried some TSNE transforms on the data. There is a nice thread about this on the forums: https://www.kaggle.com/c/homesite-quote-conversion/forums/t/18554/visualization-of-observations . One guy in particular managed to get the output to look really nice. Unfortunately, the few tests I did produced the stringy-looking results you can see in that thread as well. He mentioned he did a replacement of the categorical values with the average score for each category. This makes a lot of sense, as raw whole numbers representing a category are meaningless. It is also probably a far better technique than one-hot encoding for this process: TSNE wants related data to be in one feature so it can figure out the connection to other features, and separating a category out into many columns messes this up, as TSNE doesn't know the features are actually one. As I only have a week, and we are down to 2+ days, I won't be revisiting my TSNE work for this contest. However, it's definitely something to remember: feeding the TSNE results into your model could very well produce a winning score.

Incidentally, I do the same category-to-real-number transformation when calculating correlation coefficients on categorical values while figuring splits down the tree. I don't always do one-hot encoding; in fact, I only do it afterwards, when looking for improvements.
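The category-to-average-score substitution (often called mean or target encoding) looks like this; a sketch with hypothetical names:

```python
from collections import defaultdict

def mean_encode(categories, scores):
    """Replace each category label with the average score observed
    for that category (the substitution described above)."""
    totals = defaultdict(lambda: [0.0, 0])
    for c, s in zip(categories, scores):
        totals[c][0] += s
        totals[c][1] += 1
    means = {c: t / n for c, (t, n) in totals.items()}
    return [means[c] for c in categories]
```

One caveat worth knowing: computed naively on training data like this, the encoding leaks the target into the feature, so for modeling (as opposed to TSNE visualization) it is usually computed out-of-fold.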

So my next steps are looking at making a transformation using Platt scaling from my results, and looking at the data for obvious things I might try to improve the score. More reviewing the forums to see if there are other "you need to do this" posts. And just general noodling on what might work.