So things have moved forward and I've made a few submissions. Things have gone okay, but I'm looking for ... more.
It took the best part of a week to generate the data, though I didn't use suffix arrays. I transformed each CAT scan image into a series of images constructed of averages. The first image was 1 byte that was an average of the whole thing. Then the next 4 were for each quadrant. Repeat in this manner till you have the level of detail you want. The point is, once a level is sorted, that level is done. So the data might look something like this:
Image#          1 2 3 4 5 6 7 8
---------------------------------- Data below here
sort this -->   4 2 1 1 0 2 1 4  <--- most significant byte (average of the whole image)
then this -->   4 1 1 2 3 4 5 6  <--- next most significant byte (top left quadrant average)
then this -->   0 2 1 0 0 0 0 2  <--- etc.
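Here is a rough sketch of the averaging-and-sorting idea in Python. It is not the code I actually ran: the function names are made up for illustration, it assumes the image is a list of rows whose height and width divide evenly at every level, and it assumes the quadrants within a level are stored in row-major order.

```python
def quadrant_pyramid(image, levels):
    """Return block averages coarse-to-fine: 1 value, then 4, then 16, ...

    Assumption: image height and width are divisible by 2**(levels - 1).
    """
    h, w = len(image), len(image[0])
    features = []
    for level in range(levels):
        n = 2 ** level                  # n x n blocks at this level
        bh, bw = h // n, w // n         # block height and width
        for bi in range(n):
            for bj in range(n):
                total = sum(image[r][c]
                            for r in range(bi * bh, (bi + 1) * bh)
                            for c in range(bj * bw, (bj + 1) * bw))
                features.append(total / (bh * bw))
    return features


def sort_order(feature_rows):
    """Indices that sort the rows with the first feature most significant.

    Python compares lists element by element, which is exactly the
    'sort this, then this, then this' order in the table above.
    """
    return sorted(range(len(feature_rows)), key=lambda i: feature_rows[i])


# The 8 example images from the table, one row of 3 bytes per image.
table = [[4, 4, 0], [2, 1, 2], [1, 1, 1], [1, 2, 0],
         [0, 3, 0], [2, 4, 0], [1, 5, 0], [4, 6, 2]]
order = sort_order(table)  # 0-based image indices in sorted order
```

Because the coarsest average comes first, images that look alike overall end up next to each other before any finer detail is considered.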
I broke each image into lots of little pieces, and each of those pieces got sorted into 1 gigantic index using the method above. Once I had that data I built 3 different values for the grid elements: the index location, the average of the cancer content in nearby indexed cells (excluding grid segments from the same person), and finally how far away the nearest grid of the same position is (left and right on the indexed sort). This last one is supposed to give me an idea of how out of place the index is. You would think that similar images would cluster and anomalies would be indicative of something special.
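To make the last two values concrete, here is a small sketch. The tuple layout `(patient, position, cancer)`, the `window` parameter and both function names are assumptions for illustration only, and the real index is far larger than this toy list.

```python
def neighbor_cancer_rate(entries, i, window):
    """Average cancer label of entries within `window` places of i in the
    sorted index, skipping entry i itself and anything from the same patient.
    Returns None if no other patient falls inside the window."""
    patient = entries[i][0]
    lo = max(0, i - window)
    hi = min(len(entries), i + window + 1)
    labels = [entries[j][2] for j in range(lo, hi)
              if j != i and entries[j][0] != patient]
    return sum(labels) / len(labels) if labels else None


def nearest_same_position(entries, i):
    """How many places left or right in the sorted index you have to go
    to find another entry with the same grid position (None if there is none)."""
    position = entries[i][1]
    for d in range(1, len(entries)):
        left, right = i - d, i + d
        if left >= 0 and entries[left][1] == position:
            return d
        if right < len(entries) and entries[right][1] == position:
            return d
    return None


# Toy index, already in sorted order: (patient, grid position, cancer label)
index = [("a", 0, 1), ("b", 1, 0), ("a", 1, 1), ("c", 0, 0), ("b", 0, 1)]
rate = neighbor_cancer_rate(index, 2, 1)
gap = nearest_same_position(index, 0)
```

A large gap means a grid cell landed far from the other cells at its own position, which is the "out of place" signal described above.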
On the last method, I'm not certain the data sample size is large enough for that (or self-similar enough). It might be, but without a visual way to check it out (I haven't written one) it can be hard to be certain. It's the sort of thing I would expect to work really well with maps; I just don't know about body features. I do this to try and solve the simple problem: I know that there is cancer in the patient, I just don't know which slice/grid part it's in. I would argue that's the real challenge for this contest (more on that later).
Once I have these statistics I build a few others that are relevant to the whole image, and then take all the grid data and the few other statistics and try to make a cancer / no cancer prediction for each image ... again tons of false positives, since each image is one part of the whole and the cancer may only be on 1 CAT scan.
Once I have the results (using a new form of my GBM) I take the 9-fold cross-validation result of the train data and the results from the test data and send it all into another GBM. This one takes the whole body (the slices have been organized in order by a label) and produces a uniform 1024 slices broken out by percentage (I take the result from the nearest image, and that becomes my feature for that percentage location). Then I build 512, 256, 128 ... down to 1. These features don't use the nearest image but the average of the 2, 4, 8, etc. elements that went into the 1024. I send all that data into a GBM, get predictions and ... Bob's your uncle.
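A minimal sketch of that second-level feature build, assuming the per-slice predictions are already in body order. The function names are hypothetical, and taking the nearest slice at each bin centre is one guess at what "nearest" means here.

```python
def resample_nearest(values, size):
    """Stretch or shrink a list of per-slice predictions to `size` entries
    by taking the nearest slice at each percentage location."""
    n = len(values)
    out = []
    for k in range(size):
        frac = (k + 0.5) / size              # centre of bin k, as a fraction of the body
        out.append(values[min(n - 1, int(frac * n))])
    return out


def average_pyramid(base):
    """Build the 512, 256, ... 1 levels by averaging neighbouring pairs.

    Assumption: len(base) is a power of two. Averaging equal-sized pairs
    repeatedly is the same as averaging the 2, 4, 8, ... base elements.
    """
    levels = [list(base)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([(prev[i] + prev[i + 1]) / 2
                       for i in range(0, len(prev), 2)])
    return levels


# Hypothetical per-slice predictions for one patient with 3 slices.
slice_preds = [0.2, 0.8, 0.4]
levels = average_pyramid(resample_nearest(slice_preds, 1024))
features = [v for level in levels for v in level]  # 1024 + 512 + ... + 1 values
```

Every patient ends up with the same number of features no matter how many slices the scan had, which is what lets a single GBM take the whole body at once.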
The accuracy ... is okay. The problem I've had is that bias is super easy to introduce. Local testing puts the results at around .55 log loss, but my submissions were more in the .6 neighborhood, which makes me think that: 1, I've got nothing special, and 2, all the false positives are screwing things up. Btw, getting to that point took over 3 weeks beginning to end, with many long hours of my computer doing things. Since I started I have radically improved the speed of the whole process and can probably get from beginning to end in about 2-3 days now. So that's good. The biggest/most important improvement of late was threading the tree generation internal to the tree itself. I've had the trees themselves be threaded for years, but never bothered to thread out the node work. It is now, and it really helps make as much use of the CPU as I can. That change, if nothing else, will be great for the future.
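For a sense of scale on those log loss numbers, here is a small sketch of binary log loss, assuming the scores are computed in the usual way. The function name and the `eps` clipping value are illustrative, not taken from the contest code.

```python
import math


def log_loss(labels, probs, eps=1e-15):
    """Mean binary log loss; probabilities are clipped away from 0 and 1
    so a confident wrong answer gives a large but finite penalty."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(labels)


# Guessing 0.5 for everyone scores ln(2), about 0.693, whatever the labels are.
coin_flip = log_loss([0, 1, 1, 0], [0.5, 0.5, 0.5, 0.5])
```

So .55 to .6 is better than a coin flip, but not by a lot, and each confident false positive costs far more than a hedged guess would.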
So let's talk about false positives real quick, and about knowing exactly which slice of the CAT scan has the cancer in it. I think most people make a 3D model and handle the data as a whole (getting rid of the problem). That may be my solution. That is essentially what I did with the 2nd level of the GBM. I was trying to solve the problem in an image-by-image way previously, but I think there is just too much noise unless I add some insight that is missing. So with regards to that, there is maybe 1 possible way to add some insight. Consider this:
No Cancer Cancer
If each 0 or 1 is a whole image, the trick is to realize that 7 and 9 are unique to the cancer side. But how do you get to that point? Right now I take all the numbers by themselves and say cancer or no cancer, so 0 has a percentage chance, etc. Even this is a simplification, because I don't have a clear picture of "3" or "2"; it might actually just be a special pattern of 1s and 0s (that is to say, it all looks normal and the particular oddness of the slices in the order they show is what makes the ID as cancer).
So I'm noodling on this a little to see if I can find a good way to get the insight. If I can get the image analysis to indicate that 3 or 2 is present, I'm in there. But more likely I'll make data out of the whole and rethink how to make the predictions there, adding more data to each prediction but removing the false positives.