So, I've given my layered GBM a little thought. I'll explain what I want to do by walking through how I came up with it. My goals seem pretty straightforward; I think I can describe them with these two ideas/rules:
1. I want each layer that is created to contribute to the whole/final result in a unique way, so there is as little redundancy (i.e. wasted potential) as possible.
2. The number of layers should be arbitrary. If the data drives us to a one-layer system, that's all you create. That is, if the data just points directly to a final result, you just figure out the final result. If the data drives us to a 100-layer system... well, there is that then too.
To do the first one, each layer should have access to previous layers when making its results, so it will have more options for making different groups in that layer, with the training data available to everything. Each layer will also know what previous layers produced, to keep from making "similar" things. There will likely be some controlling variable for how different a group needs to be.
To do the second one, we need a way to move towards our goal of predicting the final result. We could just make groups until we happen to make one that fits our results really well, but that sort of brute-force approach might take forever (depending on the algorithm). If, however, we have a deterministic way of measuring the predicted groups' similarity to each other and to the final result, we can use it to throw out any groups that are too similar to previous groups, and we will have a way of slowly building a target result that improves on the previous prediction.
Each layer then will probably (nothing is written yet) produce a guess for the final answer as well as additional useful groups if it can find any. If at any point the guess fails to be an improvement on the previous result, we stop and take the best guess. We could continue making layers if for some reason we think we might eventually make a better guess. In that case the stopping spot might be a run of consecutive failures, or we might make layers until no new groups can be found, or at the very least we could make sure a minimum number of layers is created when new groups are available but the answers aren't improving in general.
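To make the stopping idea concrete, here is a rough sketch of one way it could work. Nothing here is settled: the function name `run_layers`, the `patience` and `min_layers` parameters, and the assumption that each layer's guess can be boiled down to a single "higher is better" score are all placeholders of mine for illustration.

```python
def run_layers(scores, patience=2, min_layers=1):
    """Walk through per-layer guess scores and decide where to stop.

    scores:     iterable of floats, one per layer, higher is better
                (assumption: each layer's guess reduces to one score).
    patience:   how many consecutive non-improving layers we tolerate.
    min_layers: build at least this many layers if they are available.

    Returns (best_index, best_score, layers_built).
    """
    best_score = float("-inf")
    best_index = -1
    failures = 0
    built = 0
    for i, score in enumerate(scores):
        built += 1
        if score > best_score:
            # This layer's guess improves on the best so far.
            best_score, best_index = score, i
            failures = 0
        else:
            failures += 1
        # Stop only after enough consecutive failures AND enough layers.
        if failures >= patience and built >= min_layers:
            break
    return best_index, best_score, built
```

With `patience=1` this is the "stop as soon as a guess fails to improve" rule; a larger `patience` gives later layers a chance to recover, and `min_layers` forces a minimum depth.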
So how do we go about picking these things/groups? That at least isn't too hard using one of my favorite statistical measurements for data mining: the correlation coefficient. We look at the data and measure each feature's and/or group of features' movement against the final results and against any other possible groups we've made. The features for now will be selected randomly, though I'll probably find a good, mathematically grounded way to limit their selection.
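For reference, a minimal sketch of the measurement itself, assuming the plain Pearson correlation coefficient is what gets used. The helper names `pearson` and `score_features` are made up for this example, and returning 0.0 for a constant column is my own convention, not something decided above.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant column does not move at all; treat it as uncorrelated
        # (an assumption for this sketch).
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def score_features(features, target):
    """Map each feature name to its correlation with the target column."""
    return {name: pearson(column, target) for name, column in features.items()}
```

The same `pearson` call works for comparing a new group against previously made groups, since a group is just another column of numbers.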
There are two types of groups we will make: the type that contributes and the type that is a possible answer. A contributing group wants to have a unique correlation coefficient, that is, a number we haven't seen that is at least X away from any other group's. The possible answer group's coefficient is always as close to 1 as possible.
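Here is a small sketch of how those two checks might look, assuming each candidate group has already been reduced to a single correlation coefficient against the final result. `min_gap` stands in for the X above, and the function names are hypothetical.

```python
def is_unique(coef, seen_coefs, min_gap):
    """True if coef is at least min_gap away from every coefficient seen so far."""
    return all(abs(coef - seen) >= min_gap for seen in seen_coefs)

def accept_contributing(candidates, min_gap):
    """Keep candidates (name, coef) whose coefficient is new enough.

    Candidates are checked in order, so earlier groups win ties.
    """
    accepted = []
    seen = []
    for name, coef in candidates:
        if is_unique(coef, seen, min_gap):
            accepted.append(name)
            seen.append(coef)
    return accepted

def best_answer(candidates):
    """Pick the candidate whose coefficient is closest to 1."""
    name, _ = min(candidates, key=lambda item: abs(1.0 - item[1]))
    return name
```

For example, with `min_gap=0.2` the candidates `[("g1", 0.10), ("g2", 0.15), ("g3", 0.40)]` keep `g1` and `g3`, while `g2` is dropped for sitting too close to `g1`.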
Since I don't really have a way to make multiple groups at once, what I will end up doing is making any old group I can... and again checking it against previous groups, then generating a result and seeing if said result is actually still valid. If it is, it goes into the results. I will probably either alternate between this and an actual prediction, or make a fixed number of groups and then make a prediction (and wash, rinse, repeat until I'm done in whatever capacity).
In truth I doubt a created group will ever be very comparable to a result we are looking for. If data mining were that easy we wouldn't spend much time on it. What I expect is for a bunch of rather generic groups to be made and for them to act like created features to be used in subsequent feature generation... etc. till a really useful one is generated that the final answer uses.
The only thing I would add is that I would probably throw in a t-SNE feature pair as well at each level. This would actually act like a 2nd and 3rd group. Ideally any groups the t-SNE features find will be possible groupings for the final result, since if those two features are paired together and used to build a new feature to predict on, then we have something that is a statistically relevant group.
What do I mean by grouping multiple features? Basically, use the distance equation: figure the distance of each row's position from the center (avg) values of the two features. Said another way, f(x, y) = sqrt((x - avg(x))^2 + (y - avg(y))^2)
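As a quick illustration of that formula, here is a minimal sketch that turns two feature columns into one combined "distance from center" feature. The name `distance_feature` is just a placeholder, and it assumes both columns are plain lists of numbers of equal length.

```python
import math

def distance_feature(xs, ys):
    """Distance of each row's (x, y) from the center (avg(x), avg(y))."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("need two non-empty columns of equal length")
    avg_x = sum(xs) / n
    avg_y = sum(ys) / n
    # math.hypot(a, b) computes sqrt(a^2 + b^2).
    return [math.hypot(x - avg_x, y - avg_y) for x, y in zip(xs, ys)]
```

For the columns `[1, 4, 7]` and `[2, 6, 10]` the center is (4, 6), so the new feature comes out as 5, 0 and 5.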