Dim Red Glow

A blog about data mining, games, stocks and adventures.

A contest in a week (part 3)

A few more days have passed and I've moved my score up only a tiny bit. I now sit at 0.96333. I think at this point I might actually be moving down the leaderboard as people pass me. I've spent quite a bit of time mulling over possible changes to the bagging process and trying a few ideas. I don't think there is anything for me to do there right now; in short, what I have now works just fine and there are no obvious improvements.

The Platt scaling I mentioned before might still produce some beneficial results. I tried using the scaling I had in place for a different contest, but it doesn't seem to translate: it gave me higher accuracy but destroyed the ordering, which is all AUC cares about. I'm going to give it another crack by saving off cross-validation results and seeing if I can improve them by running the sigmoid function over them with different coefficients. Since I also store the total weight (the accuracy of each tree from bagging, used to weight each tree's vote in the final results), I actually have two inputs and can do something more interesting than a straight translation.
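The fit itself is small enough to sketch. Here is a minimal version in Python (numpy only, plain gradient descent on log-loss; the function names are my own, not from any library), fitting Platt's sigmoid p = 1 / (1 + exp(A*s + B)) to saved cross-validation scores. A second input like the tree weight could be folded in as another term in the exponent.

```python
import numpy as np

def fit_platt(scores, labels, lr=0.5, n_iter=5000):
    """Fit Platt's sigmoid p = 1 / (1 + exp(A*s + B)) to held-out
    (score, label) pairs by gradient descent on mean log-loss."""
    s = np.asarray(scores, dtype=float)
    y = np.asarray(labels, dtype=float)
    A, B = -1.0, 0.0  # negative A: higher raw score -> higher probability
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(A * s + B))
        # gradients of mean log-loss w.r.t. A and B for this parameterization
        gA = np.mean((p - y) * -s)
        gB = np.mean(-(p - y))
        A -= lr * gA
        B -= lr * gB
    return A, B

def platt_apply(scores, A, B):
    """Map raw scores through the fitted sigmoid."""
    return 1.0 / (1.0 + np.exp(A * np.asarray(scores, dtype=float) + B))
```

Because the sigmoid is monotonic, this calibrates the probabilities without disturbing the ordering, so AUC is safe.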

When I do the work on that Platt scaling I will likely want to graph the formula I create. Years ago I found a really nice online graphing tool and thought I'd share it. You can check it out here: https://rechneronline.de/function-graphs/

I wrote a little pivot SQL to help me look at the data. Normally I wouldn't share this kind of thing, as it tends to be specific to your data storage. However, I'll share it anyway because it has applications anywhere you have a key-value/name-value pair table you want to pivot, where the column names go 0,1,2,3,4,5, etc. (I have an attribute table with the labels.) So for any Google searchers looking for this sort of thing, here you go. I tried to make it generic enough to understand and reuse.


declare @sql as nvarchar(max)
declare @nullSql as nvarchar(max)
declare @columnCount as int
declare @n as int

select @columnCount = 1000
select @n = 1
select @sql = '[0]'
select @nullSql = 'isnull([0],0) as [0]'

-- build the column lists [0],[1],...,[N-1] for the pivot
while (@n < @columnCount)
begin
	select @sql = @sql + ',['+cast(@n as varchar(20))+']'
	select @nullSql = @nullSql + ',isnull(['+cast(@n as varchar(20))+'],0) as ['+cast(@n as varchar(20))+']'
	select @n = @n + 1
end

select @sql = N'SELECT RowNumber,' + @nullSql + ' FROM (
    SELECT RowNumber, ColumnNumber, value
    FROM KeyValueTable
) as sel
PIVOT (
    MAX(value)
    FOR ColumnNumber IN (' + @sql + ')
) AS pvt'

EXECUTE sp_executesql @sql

Oh, one thing I noticed when I went and used it: I had imported my date field wrong! There was a bug in my code, and it turns out my date field was pretty much garbage. I fixed it, but this just goes to show you should always double check your inputs to make sure they are good. This isn't the first time this sort of thing has happened; I've wasted weeks and weeks before on bad data. I did spot check my data, I just missed this one. That particular column was special, being a date and all. The date format was yyyy-mm-dd, but my loader had only ever dealt with dates in yyyy/mm/dd format, and the difference is what made it not work right.
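The defensive fix is cheap. A sketch of a loader helper (hypothetical names, not my actual loader) that tries each known format and fails loudly instead of silently producing garbage:

```python
from datetime import datetime

# Formats the loader knows about; extend this list as new sources appear.
KNOWN_FORMATS = ("%Y-%m-%d", "%Y/%m/%d")

def parse_date(text):
    """Try each known date format; raise rather than return garbage."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError("unrecognized date format: " + repr(text))
```

The point is the final raise: a loader that errors out on an unknown separator would have caught this on day one.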

I took a look at the forums to see if there were any obvious insights people had shared that I needed to implement. I already mentioned I don't spend a lot of time trying to learn the data and hand-massage it into exactly what I need. To really excel at data mining competitions you should do that. In the business world you never touch the algorithms; that's what R&D and PhDs do (and me, apparently). You just buy the tools and use them. Which of course means the stronger my tool kit gets, the more I do that sort of thing, because that's where the gains are.

I like to think I get some deep understanding of the nature of data interactions by taking the long road, but I'm probably fooling myself. :) There was at least one obvious thing I found in the forums (there may be more; I need to go back and look). I needed to create a column tallying how many pieces of missing data there are in each row. That's exactly the sort of thing my algorithm has a hard time determining on its own, and it is what gave me my tiny bump in score. It's also the sort of thing a genetic algorithm might find on its own... but that's something I'll save for another series. (I have such delights to share with you all!)
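For anyone who wants to try the same trick, the missing-count column is a one-liner with pandas (my actual pipeline isn't pandas, so treat this as an illustrative sketch):

```python
import pandas as pd

def add_missing_count(df):
    """Return a copy of df with an extra feature counting the
    missing (NaN) values in each row."""
    out = df.copy()
    out["missing_count"] = df.isnull().sum(axis=1)
    return out
```

Tree models can then split directly on `missing_count`, something they can't easily reconstruct from the individual columns.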

I also tried some t-SNE transforms on the data. There is a nice thread about this on the forums: https://www.kaggle.com/c/homesite-quote-conversion/forums/t/18554/visualization-of-observations . One person in particular managed to get the output to look really nice. Unfortunately, the few tests I did produced the stringy-looking results you can see in that thread as well. He mentioned he replaced the categorical values with the average score for each category. This makes a lot of sense, as raw whole numbers representing a category are meaningless. It is also probably a far better technique than one-hot encoding for this process: t-SNE wants related data to be in one feature so it can figure out the connection to other features, and separating a category out into many columns messes this up because it doesn't know the features are actually one. As I only have a week, and we are down to two-plus days, I won't be revisiting my t-SNE work for this contest. However, it's definitely something to remember: feeding the t-SNE results into your model could very well produce a winning score.
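If you want to try the "feed the embedding to the model" idea, a sketch using scikit-learn's t-SNE (parameters here are illustrative, not tuned; this is not the setup from the thread):

```python
import numpy as np
from sklearn.manifold import TSNE

def tsne_embedding(X, perplexity=10, random_state=0):
    """Project a numeric feature matrix to 2-D with t-SNE. The two
    embedding columns can be appended to the training data as
    extra features for a downstream model."""
    tsne = TSNE(n_components=2, perplexity=perplexity,
                random_state=random_state)
    return tsne.fit_transform(X)
```

Remember to apply the category-to-average-score replacement first, since t-SNE only makes sense on genuinely numeric inputs.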

Incidentally, I do the same category-to-real-number transformation when calculating correlation coefficients on categorical values while figuring splits down the tree. I don't always do one-hot encoding; in fact, I only do it afterwards when looking for improvements.
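That category-to-real-number transformation is just mean-target encoding. A minimal pandas sketch (my tree code does this internally; the function here is only for illustration):

```python
import pandas as pd

def mean_encode(df, cat_col, target_col):
    """Replace each category with the mean of the target for that
    category, turning a nominal column into one ordered numeric
    feature instead of many one-hot columns."""
    means = df.groupby(cat_col)[target_col].mean()
    return df[cat_col].map(means)
```

One caveat worth hedging on: computed naively over the full training set like this, it can leak the target, so in practice you would fit the means on a fold the row isn't in.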

So my next steps are building a Platt-scaling transformation from my results, and looking at the data for obvious things I might try to improve the score. More reviewing the forums to see if there are other "you need to do this" posts. And just general noodling on what might work.