Dim Red Glow

A blog about data mining, games, stocks and adventures.

Some animations from the cancer research

I ran the data through t-sne (well, my layered version of t-sne) and got some pretty good results, good at least visually. I'm not sure if these are going to pan out well, but I'm hopeful. Let me say 2 things before I get to them.

First, the more I use the new log version of the image, the more I think that is fundamentally the best way to look at the image data. In fact, I think it might be the best way to deal with just about any real world "thing" that can't be modeled in discrete values. It just does everything you want, and in a way that makes good sense. It is, essentially, a waveform of the whole object broken down in to 1 number. It might actually be a great way to deal with sounds, 3d models, electrical signals, pictures... you name it. I think there is room for improvement in the details, but the idea itself seems good.

The 2nd thing I want to say is the files are huge, so I'm going to share still images of the final result and then have links to the .gif animations. I might upload them to youtube, but I'm pretty sure gif format is the most efficient way to send them to anyone who wants to see. They are lossless, which is great, and smaller than the corresponding mp4/mpeg/fla files etc. I think this is because normal video compressors don't handle 1000s of moving pixels nearly as efficiently as a gif file with its simple difference layers does.

In these images, purple dots represent cat scan images from a patient with no cancer. The green dots represent cat scan images where they had cancer. And in the final image, yellow is test data we don't have solutions for. The goal here is to identify which slices/images actually have cancer-telling info in them. So I would hope most of the images fall in a mix of purple and green dots.

Okay, this one is a mess, and your initial reaction might be: how is that useful? Well, there is an important thing to know here. The fields I fed in included the indexed location of the slices after they were sorted. While this is fairly useful for data mining, it is all but useless for visualization (a linear set of numbers is not something that has a meaningful standard deviation or localized average for use in grouping). I knew this going in to the processing but I wanted to see the result all the same.

So now we remove the indexing features and try again. (remember each pixel is actually a bunch of features created from multiple grids of images)

Okay, that looks a LOT better. In the bottom right you see a group of green pixels. That is some very nice auto-grouping. I went ahead and ran this once more, this time with the test data in there as well. It is in yellow.

If anything that's even better. Adding more data does tend to help things. The top group is really good. The ones on the left might be something too. It's hard to know, but you don't have to! That's what the data miner tools will figure out for you.

I want to do even more runs and see if I can build a better picture. A few of my features are based on index distances and they should probably be based on my log number instead. Either way it's fun to share! The rubber will really meet the road when I see if this actually gives me good results when I make a submission (probably a day away at least).

Here are the videos. They are 69, 47 and 77 meg each so... it'll probably take a while to download.





First steps to image processing improvement

Where to begin... First I started looking at just how big 32x32 is on the 512x512 images. It seemed too small to make good, clear identification of abnormalities. So I increased my grid segments to 128... this might be too big. I'll probably drop them down to 64 next run.

I found a post by a radiologist in kaggle's forums that further reinforced this as a good size (solely based on his identification of a 33x32 lesion; given that its location is arbitrary, 64x64 would be perfect). So that's on the short list of changes to make.

Also, I have fixed a bug in my sorting (it's really bad this existed :( ) and added a log version of the number used for sorting. (That's actually how I caught the sorting bug.) I don't generally need both, but it is useful in some contexts to use one and in others to use the other. The log number alone lacks specificity, since a 512x512 image is awfully big in number form. So rounding/lack of precision is a problem.

Unfortunately, even with me fixing my garbage inputs from before, my score didn't improve. This leads me to the real problem I need to solve.

I need to be able to identify which image actually has cancer. The patient level identification is just not enough. That's why my score didn't improve when I fixed the inputs... but how?

I hope t-sne can help here. I feed in all images with their respective grid data and let it self-organize. That's running right now. Hopefully the output puts the images into groups based on abnormalities, ideally grouping images with cancer. Since I don't actually know which do have cancer, I'll be looking for a grouping of images all marked 1 (from a cancer patient) and hope that gets it. There will also hopefully be groups made of a mishmash of non-cancer images.

I'm building one of those animations from before for fun. I'll share when it's finished.

If this works, I either feed that tsne data (x,y) in as a new feature or use kmeans to group up the groups and use that. We will see!

cat scan analysis and first submissions

So things have moved forward and I've made a few submissions. Things have gone okay, but i'm looking for ... more.

It took the best part of a week to generate the data, though I didn't use suffix arrays. I transformed each cat scan image in to a series of images constructed of averages. The first image was 1 byte that was an average of the whole thing. Then the next 4 were for each quadrant. Repeat in this manner till you have the level of detail you want. The point is, once it is sorted, that level is done. So the data might look something like this.

Image#       1 2 3 4 5 6 7 8

----------------------------------Data below here

sort this --> 4 2 1 1 0 2 1 4   <--- most significant byte (average of the whole image)

then this  -> 4 1 1 2 3 4 5 6   <---- next most significant byte (top left quadrant average)

then this  -> 0 2 1 0 0 0 0 2    <---- etc
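A minimal Python sketch of that averaging scheme, for illustration only. The name `pyramid_key` and the toy 4x4 images are made up here, it assumes square images whose side is a power of 2, and it lists each level's blocks in plain row order, which may not match the ordering actually used.

```python
def pyramid_key(image, levels):
    """Return block averages, coarsest level first.

    image: square list of lists of ints (side must be a power of 2).
    Level 0 is the average of the whole image, level 1 the four
    quadrants (top-left, top-right, bottom-left, bottom-right), etc.
    """
    size = len(image)
    key = []
    for level in range(levels):
        blocks = 2 ** level      # blocks per side at this level
        step = size // blocks    # pixels per block side
        for by in range(blocks):
            for bx in range(blocks):
                total = 0
                for y in range(by * step, (by + 1) * step):
                    for x in range(bx * step, (bx + 1) * step):
                        total += image[y][x]
                key.append(total // (step * step))
    return key


# Toy data: three 4x4 "images" (hypothetical values).
images = {
    "a": [[0, 0, 9, 9], [0, 0, 9, 9], [0, 0, 0, 0], [0, 0, 0, 0]],
    "b": [[9, 9, 0, 0], [9, 9, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]],
    "c": [[9, 9, 9, 9], [9, 9, 9, 9], [9, 9, 9, 9], [9, 9, 9, 9]],
}
keys = {name: pyramid_key(img, levels=2) for name, img in images.items()}

# Sorting by the key compares the whole-image average first, then quadrants.
order = sorted(keys, key=keys.get)
```

Images "a" and "b" tie on the whole-image average, so the quadrant bytes decide their order, which is the same most-significant-byte-first behaviour as the table above.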


I broke each image in to lots of little pieces, and each of those pieces got sorted in to 1 gigantic index using the method above. Once I had that data I built 3 different values for the grid elements: index location, average of the cancer content in nearby indexed cells (excluding grid segments from the same person) and finally how far away the nearest grid of the same position is (left and right on the indexed sort). This last one is supposed to give me an idea of how out of place the index is. You would think that similar images would cluster and anomalies would be indicative of something special.

On the last method, I'm not certain the data sample size is large enough for that (or self-similar enough). It might be, but without a visual way to check it out (I haven't written one) it can be hard to be certain. It's the sort of thing I would expect to work real well with maps; I just don't know about body features. I do this to try and solve the simple problem: I know that there is cancer in the patient, I just don't know which slice/grid part it's in. I would argue that's the real challenge for this contest (more on that later).

Once I have these statistics, I build a few others that are relevant to the whole image and then take all the grid data and the few other statistics and try to make a cancer / no cancer prediction for each image... again, tons of false positives, since each image is one part of the whole and cancer may only be on 1 cat scan.

Once I have the results (using a new form of my GBM) I take the 9 fold cross validation result of the train data and the results from test data and send it all in to another GBM. This one takes the whole body (the slices have been organized in order by a label) and produces a uniform 1024 slices broken out by percentage (I take the results from the nearest image; that becomes my feature for that percentage location). Then I build 512, 256, 128... down to 1. These features don't use the nearest but the average of the 2, 4, 8... etc. elements that went in to the 1024. I send all that data in to a GBM, get predictions and... bob's your uncle.
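The 1024-then-halve step can be sketched like this. It is a rough illustration under assumptions: `resample_nearest` and `average_pyramid` are hypothetical names, and the exact way a percentage location maps to the "nearest image" is a guess.

```python
def resample_nearest(scores, n=1024):
    """Stretch or shrink per-slice scores to n positions using the nearest slice."""
    m = len(scores)
    out = []
    for i in range(n):
        # Centre of position i as a fraction of the body, mapped to a slice index.
        pos = (i + 0.5) / n * m - 0.5
        idx = min(m - 1, max(0, round(pos)))
        out.append(scores[idx])
    return out


def average_pyramid(values):
    """Halve repeatedly by averaging neighbours: 1024 -> 512 -> ... -> 1."""
    levels = [values]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([(prev[i] + prev[i + 1]) / 2 for i in range(0, len(prev), 2)])
    return levels


# Toy patient with only two slices (hypothetical first-level predictions).
levels = average_pyramid(resample_nearest([0.0, 1.0], n=1024))
feature_count = sum(len(level) for level in levels)
```

Averaging pairs over and over gives the same numbers as averaging the 2, 4, 8... original entries directly, because every group is the same size. All levels together come to 2047 features per patient.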

The accuracy... is okay. The problem I've had is that bias is super easy to introduce. Local testing puts the results at around .55 log loss, but my submissions were more in the .6 neighborhood, which makes me think that: 1, I got nothing special, and 2, all the false positives are screwing things up. Btw, getting to that point took over 3 weeks beginning to end, with many long hours of my computer doing things. Since I started, I have radically improved the speed of the whole process and can probably get from beginning to end in about 2-3 days now. So that's good. The biggest/most important improvement of late was threading the tree generation internal to the tree itself. I've had the trees themselves be threaded for years, but never bothered to thread out the node work. It is now, and it really helps make as much use of the cpu as I can. That change, if nothing else, will be great for the future.
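For context on those numbers, here is a small sketch of binary log loss (the function name and the clipping value `eps` are choices made for this example). Always guessing 0.5 scores ln 2, about 0.693, so .55 to .6 is better than a coin flip but not by a lot.

```python
import math


def log_loss(labels, probs, eps=1e-15):
    """Mean binary log loss; probabilities are clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(1 - eps, max(eps, p))
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)


# Guessing 0.5 for everyone, whatever the true labels are.
coin_flip = log_loss([1, 0, 0, 1], [0.5, 0.5, 0.5, 0.5])
```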

So let's talk about false positives real quick, and about knowing exactly which slice of the cat scan has the cancer in it. I think most people make a 3d model and handle the data as a whole (getting rid of the problem). That may be my solution. That is essentially what I did with the 2nd level of the GBM. I was trying to solve the problem in an image by image way previously, but I think there is just too much noise unless I add some insight that is missing. So with regards to that, there is maybe 1 possible way to add some insight. Consider this:

No Cancer                   Cancer

00100000100             00107000100             

00010100100             00900100100             



If each 0 or 1 (or other digit) is a whole image, the trick is to realize that 7 and 9 are unique to the cancer side. But how do you get to that point? Right now I take all the numbers by themselves and say cancer or no cancer, so 0 has a percentage chance... etc. Even this is a simplification, because I don't have a clear picture of a "7" or a "9"; it might actually just be a special pattern of 1s and 0s (that is to say, it all looks normal and the particular oddness of the slices in the order they show is what makes it ID as cancer).
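In the toy example the "insight" is just a set difference, as this small sketch shows (the helper name `symbols` is made up). The hard part, of course, is that real slices don't come labeled with a digit.

```python
# Each string is one patient; each character stands for one whole image.
no_cancer = ["00100000100", "00010100100"]
cancer = ["00107000100", "00900100100"]


def symbols(patients):
    """All distinct image 'types' seen across a group of patients."""
    return set("".join(patients))


# Image types that only ever show up in cancer patients.
unique_to_cancer = symbols(cancer) - symbols(no_cancer)
```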

So I'm noodling on this a little to see if I can find a good way to get the insight. If I can get the image analysis to indicate a 7 or a 9 is present, I'm in there. But more likely I'll make data out of the whole and rethink how to make the predictions there, adding more data to each prediction but removing the false positives.





suffix arrays maybe

I've been hard at work in my free time on this cancer image processing. I got my image load and processing down so it takes 7 minutes to load the images (this includes the white-black color balancing that makes "black" consistent on all images). The problem I'm having now is finding the best way to match a 32x32 block from the 512x512 images against all the other images to find similar entities.

I tried a brute force approach that measured differences between blocks. That wasn't gonna work. The runtime would have been centuries and the results were meh at best. So just to speed it up (before I improve the meh results) I then tried making an index out of the data from some averages and whatnot from each cell, storing them in 4 bytes as an index. I associated all rows that tied to that index in a hashmap for that index. The problem with this is, there are too many similar cells. That is to say, the averaging functions weren't doing enough to distinguish some very similar looking cells. So the runtime went to like 7 years... with still meh results.

I tried short circuiting the results to just give me "something" after a few tries, but even this was proving to take a loooong time to run. Why is this taking so long? Well, there are 250,000 base images and I divide each one in to 32x32 segments, and then I do a few versions of the grid where I offset by 16. I end up with 16*16+15*16+16*15+15*15 cells for each of the 250,000 images I want to process. I really don't want to process every single possible cell because of runtime concerns. Or do I?
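The cell arithmetic works out as follows (a quick check; the constant and function names are just for this example):

```python
IMAGE_SIDE = 512
CELL = 32
OFFSET = 16


def cells_per_side(offset):
    """How many whole CELL-sized blocks fit when the grid starts at `offset`."""
    return (IMAGE_SIDE - offset) // CELL


plain = cells_per_side(0)          # grid starting at the edge
shifted = cells_per_side(OFFSET)   # grid shifted by half a cell

# Four grids: no shift, shifted in x, shifted in y, shifted in both.
cells_per_image = (plain * plain + shifted * plain
                   + plain * shifted + shifted * shifted)
total_cells = cells_per_image * 250_000
```

That is 961 cells per image and a bit over 240 million cells in total, which is why every per-cell comparison has to be cheap.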

So, sitting here thinking about it: I learned about suffix arrays years ago for some bioinformatics class I took (a story for another time). I've implemented one on 3 separate occasions, but it's been a while. What are they? They are a clever way of using pointers in to your original array, then sorting those pointers by what they point at. If you had the string "ABA" your initial pointers would be 0,1,2, then after they are sorted they would be 0,2,1 (assuming you sort "end of line" after "B"). Why is this useful? Well, you can quickly see where similar things in the string occur by scooting over one to the right or one to the left.
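Here is the "ABA" example as a naive Python sketch. The sentinel `END` stands in for "end of line sorts after B", and sorting whole suffix strings like this is far too slow for real data; it is only meant to show the idea.

```python
END = "\uffff"  # sentinel that sorts after every ordinary character


def suffix_array(text):
    """Sort the start positions of `text` by the suffix each one points at."""
    return sorted(range(len(text)), key=lambda i: text[i:] + END)


sa = suffix_array("ABA")

# Neighbours in the sorted order share a prefix, e.g. positions 0 and 2
# both start with "A".
neighbours = [("ABA"[a:], "ABA"[b:]) for a, b in zip(sa, sa[1:])]
```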

I could then put all the image grid segments in to a giant array of bytes and sort them (not a fast process, but really not that slow either). I want to keep the little images as they represent my features I'll feed in to the database. I could make smaller images, but 961 features seems more than enough for now. The first real problem here is the array of images is about 250,000,000 long and each of those has 1k of bytes in it, so you need a lot of memory (I'm okay there; the indexing I was doing before had the same problem). The 2nd thing you need to think about is the image segment. It's really not rendered in a way that is useful.

When I worked on the indexing I thought about that some and tried making averages for the whole grid. This kinda worked, but really it dances around the issue, cause I was only making 4 bytes of averages from a 1k image. I think to do this right, you need to translate the images in to something more... jpeg-y. You want 1 byte that represents some average details, then subsequent bytes to refine the average, so that as you move from the left to the right of the byte array you get more and more detailed information. This will put similar images next to each other once sorted. You can certainly translate an image in to another form that does something like that. I'll have to.

Once I've done that, and once I have the data loaded in to a suffix array, it's a simple matter of looking at the nearest cells in the suffix array and seeing if they are in a cancer patient or not (skipping cells that reference the original image for that grid, and cells in test data). Once you've done all that you can get an average value for cancer/no cancer and save it for that feature.
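A rough sketch of that neighbour lookup, assuming the cells are already in sorted order. The function name, the `(patient_id, label)` layout and the choice of `k` are all made up for illustration, and it skips every cell from the same patient rather than only the same image.

```python
def neighbour_cancer_rate(sorted_cells, pos, k=4):
    """Average the cancer label of the k nearest usable neighbours.

    sorted_cells: list of (patient_id, label) in sorted-index order, where
    label is 1 (cancer), 0 (no cancer) or None (test data, no answer).
    """
    own_patient = sorted_cells[pos][0]
    labels = []
    left, right = pos - 1, pos + 1
    # Walk outwards one step at a time, to the left and to the right.
    while len(labels) < k and (left >= 0 or right < len(sorted_cells)):
        for idx in (left, right):
            if 0 <= idx < len(sorted_cells) and len(labels) < k:
                patient, label = sorted_cells[idx]
                if patient != own_patient and label is not None:
                    labels.append(label)
        left -= 1
        right += 1
    return sum(labels) / len(labels) if labels else None


cells = [("p1", 1), ("p2", 0), ("p1", 1), ("t1", None), ("p3", 1), ("p4", 0)]
rate = neighbour_cancer_rate(cells, pos=2, k=2)
```

For the cell at position 2 the test cell and the other "p1" cell are skipped, so the two neighbours used are "p2" (0) and "p3" (1).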

This method leaves the original images unstretched and untouched in any way (other than the color balance), which means a small person may not line up with a large person. Also, there are times when the image is zoomed in too much in the cat scan, so there is clipping etc... These problems can be resolved if there are enough images (so you can find someone similar to yourself), a bit like how sampling enough voices will eventually give you a person who has the same accent as yourself. The question is, is there enough data?

Hopefully I can get all this done by Monday and have a crack at actually making some predictions with the results using a normal gbm.

Minecraft and data mining (no relation) and stocks (some relation)

It's the 2nd day of 2017 and soon to be 2018, or so it always seems when you look back at years. I've dawdled for too long on some old work I wanted to do, and the year turning over has put that in sharp relief.

First, I started a youtube channel https://www.youtube.com/channel/UCoTP8WbdsCW_6FSLz0pUSsA (Hardcore in a Hurry) for my video game exploits. I don't play a lot of video games these days, but that being said, I like playing games on the hardest setting possible. I'm not a fan of cheat modes or easy walk through settings (or AIs that cheat for that matter, I mean seriously? couldn't you write a better AI?). That being said, minecraft is the only thing on there now. I don't have plans to add any other types of videos right now, but things change.

Next, let me say I've spent some more time on my layered gradient boosting stuff. I learned a little bit about what t-sne can do for me. Feeding t-sne in initially can help if the groups do self organize out of the original data, but normally I think it is something you want to do after a few iterations have passed. It seems the further down the gradient you are, the better the effect. The effects aren't remarkable, but they aren't terrible either. The unfortunate takeaway is that it is very time intensive. Running t-sne processing on the current state of the gradient descent and the source data just takes... well, a while. Maybe I can cherry pick features to use and speed it up, but as it stands right now, unless the data set is pretty small it's not a good option for eking out more performance.

For a couple years now I've wanted to implement a stock analysis program so I can do personal investment. I did this once a long time ago with a friend using far less sophisticated methods. That's a story for another time. I've never been happy enough with my software for datamining to start the code/project for investing. I think I'm happy enough now. The real work will be in getting the data in to a form that is easily updated daily and makes measurable, testable predictions. My recent work on https://www.kaggle.com/c/santander-product-recommendation got me to rewrite parts of my code to handle time series in a different way, which is key for the training and evaluation. (That's a contest I never got anywhere with. I've never implemented the map@<x> evaluation, so it makes it hard to train for such a thing.)


adjusting the layers to be tsne driven

Hello again. I've slowly turned over the ideas in my head that I wrote about last time, and I think there are 2 big flaws in my idea. First, fitting a gbm to data you already have in the training and test data adds nothing (that's huge). This is with regards to me selecting a few features to make something that has a correlation coefficient equal to some fixed amount of the final result. And 2nd, making groups in any fashion that is not some form of transform will likely never give me any new information to work with.

I have a fix for both (of course :) ). I mentioned using tsne as well. I think that's where I need to get my groups from. The features I send in to tsne are how I get the various new groups. Everything else I've said is still relevant, though. So it is no longer layers of gbm as much as layers of tsne. Once I have relevant results from that, I send it all in to gbm and let it do its work.

The groups I make out of tsne (2-d btw) will utilize the linear tsne I've already written (as it's super fast and does what I need to do pretty well). And while I will send the results from the tsne right in to the gbm, I'll do more than that. I will also isolate out each region in the results and build groups from each of those.

To do that, I make a gravity-well map for every point on the map... like they are all little planets. The resulting map is evenly subdivided (like it's a giant square map of X by X... I've been using 50 for each side) and has a 0 or larger value in every square (inverse squared distance: 1/(1 + (sourceX - x0)^2 + (sourceY - y0)^2)). I've been limiting the influence to 5 squares from each point just to keep the runtime speedy (it's only a 50x50 map anyway, so that's pretty good reach, and at 6 away the influence is 1/(1+36+36), which is only about 0.014, so that's a decent cutoff anyways). Once I have the map, I look for any positive points that have lower or equal points in all directions. Those points are then used to make groups. So there may only be 1... but likely there will be bunches.
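A minimal sketch of the map and the peak search, assuming the t-sne output has already been scaled in to the 0..50 grid. The names `gravity_map` and `find_peaks` are hypothetical, and with "lower or equal" a flat plateau would report every one of its squares as a peak.

```python
def gravity_map(points, size=50, reach=5):
    """Sum 1/(1 + dx^2 + dy^2) from every point over the squares near it."""
    grid = [[0.0] * size for _ in range(size)]
    for px, py in points:
        cx, cy = int(px), int(py)
        for y in range(max(0, cy - reach), min(size, cy + reach + 1)):
            for x in range(max(0, cx - reach), min(size, cx + reach + 1)):
                grid[y][x] += 1.0 / (1.0 + (px - x) ** 2 + (py - y) ** 2)
    return grid


def find_peaks(grid):
    """Positive squares whose 8 neighbours are all lower or equal."""
    size = len(grid)
    peaks = []
    for y in range(size):
        for x in range(size):
            value = grid[y][x]
            if value <= 0:
                continue
            neighbours = [grid[ny][nx]
                          for ny in range(max(0, y - 1), min(size, y + 2))
                          for nx in range(max(0, x - 1), min(size, x + 2))
                          if (ny, nx) != (y, x)]
            if all(n <= value for n in neighbours):
                peaks.append((x, y))
    return peaks


# Two well separated toy clusters (made-up coordinates).
grid = gravity_map([(10, 10), (10, 10), (40, 40)])
peaks = find_peaks(grid)
```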

The groups are based off of distance from the various centers of the local low spots in the gravity well map (the points found above). If you were to take the differential of the map, these would be 0 points. So we just find the distance of every point from those centers. This has the added effect of possibly making multiple groups in one pass (which is great), not to mention the data is made via a transformation that works completely statistically and has no direct 1 to 1 correlation with the data you feed it. So you are actually adding something for the gbm mechanism to work with.
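Turning the centers in to features is then just a distance table, one column per center. A small sketch (the function name and the toy coordinates are made up):

```python
import math


def distance_features(points, centers):
    """One row per point: its Euclidean distance to every center."""
    return [[math.hypot(px - cx, py - cy) for cx, cy in centers]
            for px, py in points]


# Two toy centers and two points sitting exactly on them.
features = distance_features([(0, 0), (3, 4)], centers=[(0, 0), (3, 4)])
```

Each column could then be handed to the gbm as a new feature, or thresholded to make a hard group.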

At that point the resulting groups should probably be filtered to see if they add any real value using the correlation mechanism i already mentioned in the last post. ideally gbm would just ignore them if they are noisy.

Thoughts on the AI layers

So, I've given my layered GBM a little thought. I'll explain what i want to do by how i came up with it. My goals seem pretty straight forward I think I can describe them with these 2 ideas/rules:

1. I want each layer that is created to contribute to the whole/final result in a unique way. So there is as little redundancy (ie wasted potential) as possible.

2. The number of layers should be arbitrary. If the data drives us to a one layer system, that's all you create. That is, if the data just points directly to a final result, you just figure out the final result.  If the data drives us to 100 layer system... well there is that then too.

To do the first one, each layer should have access to previous layers to make its results, so it will have more options to make different groups in that layer, with the training data available to everything. Each layer will also know what previous layers produced, to keep from making "similar" things. There will likely be some controlling variable for how different a group needs to be.

To do the second one, we need to have a way to move towards our goal of predicting the final result. We could just make groups until we happen to make one that fits our results really well, but that sort of brute force approach might take forever (depending on the algorithm). If, however, we have a deterministic way of measuring the predicted groups' similarity to each other and to the final result, we can use that to throw out any groups that are too similar to previous groups, and we will have a way of slowly building a target result that improves on the previous prediction.

Each layer then will probably (nothing is written) produce a guess for the final answer as well as additional useful groups if it can find any. If at any point the guess fails to be an improvement on the previous result, we stop and take the best guess. We could continue making layers if for some reason we think we might eventually make a better guess. In this way consecutive failures might be our stopping spot, or perhaps we will make layers until no new groups can be found, or finally at least make sure a minimum number of layers is created if new groups are available but the answers aren't improving in general.

So how do we go about picking these things/groups? That at least isn't too hard using one of my favorite statistical measurements for data mining, the correlation coefficient. We look at the data and measure each feature's and/or group of features' movement against the final results and against any other possible groups we've made. The features for now will be selected randomly, though I'll probably find a good, mathematically grounded way to limit their selection.

There are two types of groups we will make: the type that contributes and the type that is a possible answer. A contributing group wants to have a unique correlation coefficient, that is, a number we haven't seen that is at least X away from any other group. The possible answer group is always as close to 1 as possible.
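The "at least X away" check could look something like this sketch. It uses the Pearson correlation against the final result; `is_new_group` and the default `min_gap` of 0.1 are placeholders for whatever X turns out to be.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either side is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def is_new_group(candidate, target, kept_groups, min_gap=0.1):
    """Keep a contributing group only if its correlation with the target is
    at least min_gap away from that of every group kept so far."""
    r = pearson(candidate, target)
    return all(abs(r - pearson(g, target)) >= min_gap for g in kept_groups)


target = [0, 0, 1, 1]
kept = [[0, 0, 1, 1]]                                  # correlation 1 with target
redundant = is_new_group([0, 0, 2, 2], target, kept)   # also correlation 1
fresh = is_new_group([1, 0, 1, 0], target, kept)       # correlation 0
```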

Since I don't really have a way to make multiple groups at once, what I will end up doing is making any old group I can... and again checking against previous groups, then generating a result and seeing if said result is actually still valid. If it is, it goes in to the results. I will probably either alternate between this and an actual prediction or make a fixed number of groups and then make a prediction (and wash, rinse, repeat until I'm done in whatever capacity).

In truth, I doubt a created group will ever be very comparable to a result we are looking for. If data mining were that easy we wouldn't spend much time on it. What I expect is for a bunch of rather generic groups to be made and for them to act like created features to be used in subsequent feature generation... etc. till a really useful one is generated that the final answer uses.

The only thing I would add is I would probably throw in a TSNE feature pair as well at each level. This actually would act like a 2nd and 3rd group. Ideally any groups the TSNE features find will be a possible grouping for the final result, since if those two features are paired together and used to build a new feature to predict on, then we have something that is a statistically relevant group.

What do I mean by grouping multiple features? Basically use the distance equation... figuring the distance of each row's position from the center (avg) values of the two features. Said another way: f(x,y) = Sqrt( (x - avg(x))^2 + (y - avg(y))^2 )
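As a quick sketch of that equation over two feature columns (the function name and toy values are made up):

```python
import math


def center_distance(xs, ys):
    """f(x, y) = sqrt((x - avg(x))^2 + (y - avg(y))^2) for every row."""
    avg_x = sum(xs) / len(xs)
    avg_y = sum(ys) / len(ys)
    return [math.sqrt((x - avg_x) ** 2 + (y - avg_y) ** 2)
            for x, y in zip(xs, ys)]


# Two rows; the center is (3, 4), so both rows are 5 away from it.
combined = center_distance([0, 6], [0, 8])
```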



another long overdue blog

Hi again faithful readers (you know who you are). So the marathon is back off the plate. Basically, i can't afford to spend the extra cash right now on a trip. Or rather I'm not going to borrow the money for a trip like that. I over spent when i thought the trip was off and upon reflection when it came time to book the trip, it just didn't make sense. So... maybe next year.

So for the last month I've been toying with things in my data mining code base. I removed tons of old code that wasn't being used/tested. I honed it down to just GBM. Then I spent some time seeing if I could make a version that boosts accuracy over log loss (normal gbm produces a very balanced approach). I was successful. I did it by taking the output from one gbm and sending it in to another, in essence overfitting.

Why would I want to do that? Well, there is a contest ( https://www.kaggle.com/c/santander-product-recommendation ). I tried to work on it a little, but the positives are few and far between. The predictions generally put any given positive at less than a 1% chance of being right, so I tried bumping the number, since right now it returns all negatives. The results are actually running right now. I still expect all negatives, but the potential positives should have higher percentages and I can pick a cut off point to round to a positive that is a little higher than the .0005% I would have had to use before. All this so I can send in a result other than all negatives (which scores you at the bottom of the leader board). Will this give me a better score than all negatives... no idea :)

I tried making a variation on the GBM tree I was using that worked like some stuff I did years ago. It wasn't bad, but still not as good as the current gbm implementation. I also modified the tree to be able to handle time series data, in that it can lock 1 row to another and put the columns in time sensitive order. This allows me to process multi month data really well. It also gave me a place to feed in fake data: if I want to stack 1 gbm on top of another, I can send in the previous gbm's results as new features to train on (along with the normal training data).

This leads me to where I think the next evolution of this will be. I'm slowly building a multi layer GBM... which essentially is a form of neural net. The thing I need to work out is how best to subdivide the things each layer should predict. That is, I could make it so the GBM makes 2 or 1000 different groups and predicts row results for each and feeds those in for the next prediction... etc. till we get to the final prediction. The division of the groups is something that can probably be done using a form of multi variable analysis that makes groups out of variables that change together. Figuring out how to divide it in to multiple layers is a different problem altogether.

Do you want an AI? Cause this is how you get AIs! Heh, seriously, that's what it turns in to. Once you have a program that takes in data, builds a great answer in layers by solving little problems, and assembles them in to a final answer that is super great, well, you pretty much have an AI.

Incidentally, TSNE might also help here, as I might just feed in the tsne results for that layer's training data (fake data included if we are a level or two down) to give the system a better picture of how things group statistically.

In other news, I started using blue apron. This is my first time trying a service like this and so far I'm really enjoying it. I'm pretty bad about going to the grocery store. And going every week ... well that aint gonna happen. This is my way to do that, without doing that :) . I'm sure most people have similar thoughts when they sign up, even if the selling point is supposed to be the dishes you are making. Honestly, I've just been eating too much take out. I don't mind cooking and the dishes they send you to prepare are for the most part really good.



Stopping before I start then starting again

I feel like I've started many things and then abruptly stopped before I really got in to the thick of it. I'd like to share some of the highlights. Doing so will give anyone reading a good picture of where everything is with the stuff I normally (it's been over two months) blog about.


Data Mining: Almost 6 months ago I was working on a GBM in a GBM model, and there it has sat since then. It's not that I think it's merit-less, it is just that I doubt it'll produce amazing results. So, I'm not really inspired to finish it. Also, I don't have a kaggle contest to work on right now, and that is what helps drive my interest in the algorithm. Truth be told, I'm beginning to think data mining is getting near the end of its "big gains" period. The human analysis part may very well be improvable, but that's never interested me much.


Running: I started running, then I stopped, then I started again. To expand on this: it was fine, then I overdid it, then I got motivated and eased back in. I've got a marathon in mind I want to go to. It is still 11 weeks away so I'm training up for that. I'll talk more about it when I commit to it fully (basically in about 4 weeks).


Math: I spent a lot of time trying to solve the 3 cubes problem. My most recent attempt sent me down a rabbit hole of general factoring. I actually thought I had a method for doing that, only to realize that my solution to the problem was such a tiny corner case as to be unusable. The net-net is I got nothing I'm pursuing here right now.


Magic: Things continue to wind down. There is a grand prix tourney in Dallas in 2 weeks but I don't think I'm gonna go. I still have fun playing most weeks, but I'm definitely not feeling the drive to brew decks like I did. And, let's be honest, I've never felt the competitive spirit this time around. That is to say, I want to win, sure, but deck building always took precedence, as it is so much more interesting than playing a known decent deck, which just makes it hard to be competitive.


EM Drive: Have you seen this thing? It's about 2 years old from a "oh hey something new!" perspective, but the last few days I've been reading up on the science and watching youtubes about it. I have to admit I'd like to better understand how it works. I totally get what virtual particles are, but I don't get how they can get the transfer of momentum to them (they are incredibly hard to interact with). Just something I've been messing with and thought I'd include as a bullet point. I doubt I'll build anything in my garage, but crazier things have happened.



Slow going on the GBM in GBM idea

I've been thinking about this for weeks. I'm normally a jump in there and do it kind of guy but with no contests around to motivate me I've been stewing on it more than writing it. I did implement one version of the GBM in GBM mechanism but it underperforms compared to my current normal tree. There are probably more things to do to it to hone it and make better but this is where I stopped and started stewing.

I've been thinking i'm approaching this the wrong way. I think that trees are great for emulating the process of making decisions based on a current known group of data, but don't we know more? Or rather can't we do better? We actually have the answers so isn’t there a way we can tease out the exact splits we want or get the precise score we want? I think there might be.

I've been looking at building a tree from the bottom up. I'm still thinking about the details, but the short version is you start out with all the terminal nodes. You then take them in pairs of 2 and construct the tree. Any odd node sits out and gets put in next go. The "take them in pairs of 2" is the part I've been really thinking hard about. Conventionally, going down a tree, your splits are done through some system of finding features that cut the data in to pieces you are happy with. I'm going to be doing the same thing but in reverse. I want the 2 data pieces paired together to be as close to each other as possible from a Euclidean distance perspective, at least with regards to whatever features I use. But (and this is one of the things I debate on) I also want the scores to be far apart.

When you think about what I’m trying to accomplish, putting together two items with really far apart scores makes sense. You want to figure out shared qualities in otherwise disjointed items. Similar items could be joined as well. If we approach it that way, the idea is you are building a tree that hones the answer really quickly and exactly. This however wouldn’t do a good job of producing scores we can further boost... we wouldn’t be finding a gradient. Instead we would be finding 1 point, and after 1 or 2 iterations we'd be done.

By taking extreme values, the separation would ideally be the difference between the maximum value and the minimum value. If we did that, though, it would only work for 2 of our data points. The rest would have to be closer to each other (unless they all shared those 2 extremes). I think it would be best to match items in the center of the distribution with items on the far extreme, giving all pairs a similar distance of (max-min)/2 and likely a value that actually is 1/2 the max or the min since it would average to the middle.
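One possible reading of that pairing rule, sketched on scores alone: sort by score and pair each item in the lower half with the item half the list further up, so every pair is roughly half the score range apart. This is only a guess at the idea; it ignores the feature distance side completely, and `pair_by_half_range` is a made-up name.

```python
def pair_by_half_range(scores):
    """Pair item i (by score rank) with the item half the list further up.

    Returns (pairs, leftover): pairs of original indexes, plus any odd
    item that sits out and would be put in on the next go.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    half = len(order) // 2
    pairs = [(order[i], order[i + half]) for i in range(half)]
    leftover = order[2 * half:]
    return pairs, leftover


scores = [0, 1, 2, 3, 4, 5, 6, 7]
pairs, leftover = pair_by_half_range(scores)
gaps = [scores[b] - scores[a] for a, b in pairs]
```

With these evenly spread toy scores every pair ends up 4 apart, close to (max-min)/2 = 3.5.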

In this way we merge up items till we get to a fixed depth from the top (top being a root node). We could keep merging till then and try to climb back down the tree. I might try that at some point, but since the splitters you would make won’t work well, I think the better way is to then introduce the test data at the closest terminal node (much like how nodes were merged together) and follow it up the tree till you get to a stopping spot. The average answer there is the score you return.

Again I still haven’t implemented it, I’ve been stewing on it. The final piece of the puzzle is exactly how I want to do feature selection for the merging. There has to be some system for maximizing score and minimizing distance so it isn’t all ad-hoc.