Hi again faithful readers (you know who you are). So the marathon is back off the plate. Basically, I can't afford to spend the extra cash right now on a trip. Or rather, I'm not going to borrow the money for a trip like that. I overspent when I thought the trip was off, and upon reflection, when it came time to book the trip, it just didn't make sense. So... maybe next year.
So for the last month I've been toying with things in my data mining code base. I removed tons of old code that wasn't being used/tested. I honed it down to just GBM. Then I spent some time seeing if I could make a version that boosts accuracy over log loss (normal GBM produces a very balanced approach). I was successful. I did it by taking the output from one GBM and sending it in to another, in essence overfitting.
Why would I want to do that? Well, there is a contest ( https://www.kaggle.com/c/santander-product-recommendation ). I tried to work on it a little, but the positives are few and far between. The predictions generally put any given positive at less than a 1% chance of being right, so I tried bumping the number, since right now it returns all negatives. The results are actually running right now. I still expect all negatives, but the potential positives should have higher percentages, and I can pick a cut off point to round to a positive that is a little higher than the .0005% I would have had to use before. All this so I can send in a result other than all negatives (which scores you at the bottom of the leader board). Will this give me a better score than all negatives... no idea :)
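Here is a minimal sketch of the cut off idea, assuming the model hands back one small probability per row. The function names and the "top fraction" rule are made up for illustration; the real cut off would be picked by looking at the actual prediction distribution.

```python
def cutoff_for_top_fraction(probs, fraction):
    """Return the smallest probability among the top `fraction` of rows.

    Hypothetical helper: instead of rounding at 0.5 (which gives all
    negatives when every prediction is tiny), flag the highest-scoring rows.
    """
    k = max(1, int(len(probs) * fraction))  # always flag at least one row
    return sorted(probs, reverse=True)[k - 1]


def apply_cutoff(probs, cutoff):
    """Round each probability to 1 (positive) or 0 (negative) at `cutoff`."""
    return [1 if p >= cutoff else 0 for p in probs]


# Made-up predictions, all far below 0.5.
probs = [0.001, 0.004, 0.0002, 0.009, 0.0005]
all_negative = apply_cutoff(probs, 0.5)
cutoff = cutoff_for_top_fraction(probs, 0.4)
labels = apply_cutoff(probs, cutoff)
```

With the usual 0.5 cut off every row comes back negative; with the lowered cut off the two highest-scoring rows become positives. Whether that scores better than all negatives still depends entirely on the contest metric.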
I tried making a variation on the GBM tree I was using that worked like some stuff I did years ago. It wasn't bad, but still not as good as the current GBM implementation. I also modified the tree to be able to handle time series data, in that it can lock 1 row to another and put the columns in time sensitive order. This allows me to process multi month data really well. It also gave me a place to feed in fake data: if I want to stack 1 GBM on top of another, I can send in the previous GBM's results as new features to train on (along with the normal training data).
This leads me to where I think the next evolution of this will be. I'm slowly building a multi layer GBM... which essentially is a form of neural net. The thing I need to work out is how best to subdivide the things each layer should predict. That is, I could make it so the GBM makes 2 or 1000 different groups, predicts row results for each, and feeds those in for the next prediction, etc., till we get to the final prediction. The division of the groups is something that can probably be done using a form of multivariable analysis that makes groups out of variables that change together. Figuring out how to divide it in to multiple layers is a different problem altogether.
Do you want an AI? 'Cause this is how you get AIs! Heh, seriously, that's what it turns in to. Once you have a program that takes in data and builds a great answer in layers, solving little problems and assembling them in to a final answer that is super great, well, you pretty much have an AI.
Incidentally, t-SNE might also help here, as I might just feed the t-SNE results for that layer's training data (fake data included if we are a level or two down) to give the system a better picture of how things group statistically.
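The layering described above boils down to "each layer's predictions become extra columns for the next layer". Here is a rough sketch of just that plumbing, assuming each trained model can be treated as a plain function from a row to a number. The name `stack_layers` and the stand-in lambdas are hypothetical; a real version would put trained GBMs in their place.

```python
def stack_layers(rows, layers):
    """Append each layer's predictions to every row as new features.

    `layers` is a list of layers; each layer is a list of predict functions
    (stand-ins for trained models). Every model in a layer sees the same
    input row, and their outputs are appended before the next layer runs.
    """
    current = [list(row) for row in rows]
    for layer in layers:
        current = [row + [predict(row) for predict in layer] for row in current]
    return current


# Stand-in "models": the first layer predicts two group scores,
# the second layer combines them in to one final feature.
layer_one = [lambda r: r[0] + r[1], lambda r: r[0] - r[1]]
layer_two = [lambda r: 0.5 * (r[-2] + r[-1])]
stacked = stack_layers([[1.0, 2.0]], [layer_one, layer_two])
```

This says nothing about how to choose the groups each layer predicts, which is the open question in the post; it only shows where the "fake data" gets fed in.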
In other news, I started using Blue Apron. This is my first time trying a service like this and so far I'm really enjoying it. I'm pretty bad about going to the grocery store. And going every week... well, that ain't gonna happen. This is my way to do that, without doing that :) . I'm sure most people have similar thoughts when they sign up, even if the selling point is supposed to be the dishes you are making. Honestly, I've just been eating too much take out. I don't mind cooking, and the dishes they send you to prepare are for the most part really good.
I feel like I've started many things and then abruptly stopped before I really got in to the thick of it. I'd like to share some of the highlights. Doing so will give anyone reading a good picture of where everything is with the stuff I normally (it's been over two months) blog about.
Data Mining: Almost 6 months ago I was working on a GBM in a GBM model and there it has sat since then. It's not that I think it's merit-less, it is just that I doubt it'll produce amazing results. So, I'm not really inspired to finish it. Also, I don't have a Kaggle contest to work on right now, and that helps drive my interest in the algorithm. Truth be told, I'm beginning to think data mining is getting near the end of its "big gains" period. The human analysis part may very well be improvable, but that's never interested me much.
Running: I started running, then I stopped, then I started again. To expand on this, it was fine, then I overdid it, then I got motivated and eased back in. I've got a marathon in mind I want to go to. It is still 11 weeks away, so I'm training up for that. I'll talk more about it when I commit to it fully (basically in about 4 weeks).
Math: I spent a lot of time trying to solve the 3 cubes problem. My most recent attempt sent me down a rabbit hole of general factoring. I actually thought I had a method for doing that, only to realize that my solution to the problem was such a tiny corner case as to be unusable. The net-net is I got nothing I'm pursuing here right now.
Magic: Things continue to wind down. There is a grand prix tourney in Dallas in 2 weeks, but I don't think I'm gonna go. I still have fun playing most weeks, but I'm definitely not feeling the drive to brew decks like I did. And, let's be honest, I've never felt the competitive spirit this time around. That is to say, I want to win, sure, but deck building always took precedence, as it is so much more interesting than playing a known decent deck, which just makes it hard to be competitive.
EM Drive: Have you seen this thing? It's about 2 years old from an "oh hey, something new!" perspective, but the last few days I've been reading up on the science and watching YouTube videos about it. I have to admit I'd like to better understand how it works. I totally get what virtual particles are, but I don't get how they can get the transfer of momentum to them (they are incredibly hard to interact with). Just something I've been messing with and thought I'd include as a bullet point. I doubt I'll build anything in my garage, but crazier things have happened.
If you are reading this you might be searching for a solution, saw my comment, or just be browsing the internet. So here's the thing, I saw a Numberphile and was like "I gotta try that" https://www.youtube.com/watch?v=Y30VF3cSIYQ . I do recreational math, so you know... I like a challenge. Anyway, here's the solution I came up with (I circled the "magic step"). Basically the realization is that the left and right sides are in the same form. And since I made "I" up, I can choose it to be equal to B, which forces the other side to be A, which then of course dictates its value.
This was my 2nd attempt. I spent a few minutes going down another path to see if I could turn it in to a quadratic and just get solutions... it was a mess, so I went back and looked for, well, what you see. Oh, and above, the J substitution isn't necessary. I was just doing it to see if anything made sense as a next step beyond the obvious. It didn't help; you could remove it. I just didn't want to rewrite the work and I did it in ink, so... there it stands.
I haven't had a blog in such a long time. The last thing I wrote about was the sum of 3 cubes, which, while I'm still working on it, is now in a pot with a bunch of other things I'm working on. This post though is all about Gen Con and Magic. Hopefully I'll get another up in the next day or so on some other stuff going on.
First, Magic. I went to Gen Con again this year, and played a lot of Magic among other things there. It was great in general, but the Magic part of it was pretty lackluster. Not much to say here other than my Magic days are winding down. It's not that I don't enjoy the game, but the discovery and problem solving aspect of it is almost played out for me. It may be counterintuitive, but when you get to a point where you can't sabotage your deck with random crazy ideas ('cause you feel like you've tried them all) you start to do better, but it becomes a lot more boring. That's a long winded way of saying winning with the expected decks is boring. And really, winning in general is kind of boring. The struggle in anything is what makes it exciting.
I don't really plan on doing any more MTG grand prix events, though I might squeeze in 1 or 2 next year if the event looks like it'll be a ton of fun. I expect this time next year I'll either be done with it or have like 1 major event.
Also as a note, Gen Con next year is the 50th anniversary, so if you aren't planning on going, change your plans! It'll be one not to miss. Assuming I go (you never know for sure) I won't be playing nearly as much Magic, and I'll definitely be getting more sleep. Being tired the entire con makes it less fun and less memorable. Also it goes by much quicker, which stinks!
Besides Magic, I played True Dungeon (1st dungeon was meh, 2nd was pretty good). I did AEG big game night (fun when you have others with you). Did the keg tapping on Wednesday for the 20 sided rye ale. I did the Orc Stomp 5k (probably won't do this next year, 'cause the 6:30am start time, i.e. a 5:45 wake up, ruins a lot of the con 'cause of the lack of sleep I get). And I got a chance to demo a friend's game for an hour and a half. https://www.kickstarter.com/projects/1410499285/dragonstone-mine-a-family-friendly-board-game
Back to Magic, what am I playing right now? In standard, bant humans. In modern, the rock (or my version of it). In legacy, eldrazi-post (12 post with eldrazi) and painter. I'm working on a zur's weirding/life gain deck in modern. And I'm working on a black/white control/life gain deck in standard as well.
So I was watching/catching up on Numberphile and watched this: 74 is cracked - Numberphile. The short version is that of all the numbers less than 100, all but 3 (now 2) have been shown to be expressible as the sum of 3 cubes, except for a list of a few which are proven to have no solutions. I had watched the original video this one references before, and while being entertained at the time, I promptly forgot about it after watching it.
I don't think many people realize or truly care that I've spent long years of my life thinking about Diophantine equations. I did it back when I was looking for a new method to methodically factor numbers in to primes (read 17-32 off and on). While I came up with some interesting stuff (to me at least), I never really got anything that good. It was always a hobby/fun problem for me. The point is, solving the 3 cubes problem is something the "work" I did could probably do relatively quickly if there aren't tons of residue classes in the various modulos of 33 and 42 (the two remaining numbers). Just saying that makes me think there probably are, but if there aren't, merging them in to larger equations isn't very hard.
Even if there are bunches, you can still get a new set of equations that dramatically cuts down your search time. Getting past 10^16 is pretty remarkably easy; heck, getting past 10^32 might be trivial. It just depends on the numbers. I think I'm gonna spend a little time on solving 33 and 42 just for fun and see how hard the problem really is. It would be neat to have my name tied to something in mathematics, even if it's just a simple solution to some rather mundane puzzle. And I need another hobby/would like to return to the old hobby for a while.
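For anyone who wants to poke at the problem, here is a naive brute force sketch of x^3 + y^3 + z^3 = n. It is not the residue class approach described above, just the simplest possible search plus the one filter that is certain: every cube is 0, 1 or 8 mod 9, so a sum of three cubes can never be 4 or 5 mod 9 (that is where the "proven to have no solutions" list comes from). The function names are made up, and a search like this gets nowhere near 10^16.

```python
def has_no_solution(n):
    """True when n is 4 or 5 mod 9, which rules out any sum of three cubes."""
    return n % 9 in (4, 5)


def integer_cube_root(m):
    """Return r with r**3 == m, or None if m is not a perfect cube.

    Uses a float estimate plus a small correction, which is fine for the
    small values searched here but not for very large m.
    """
    if m < 0:
        r = integer_cube_root(-m)
        return None if r is None else -r
    estimate = round(m ** (1.0 / 3.0))
    for candidate in (estimate - 1, estimate, estimate + 1):
        if candidate ** 3 == m:
            return candidate
    return None


def search_three_cubes(n, bound):
    """Try every x <= y in [-bound, bound] and solve for z directly."""
    if has_no_solution(n):
        return None
    for x in range(-bound, bound + 1):
        for y in range(x, bound + 1):
            z = integer_cube_root(n - x ** 3 - y ** 3)
            if z is not None:
                return (x, y, z)
    return None


small = search_three_cubes(29, 5)   # 29 has small solutions, e.g. 3, 1, 1
blocked = search_three_cubes(4, 10)  # 4 is 4 mod 9, so None
```

Solving for z from x and y turns a triple loop in to a double loop, which is about the only trick in here. Anything smarter means cutting the (x, y) candidates down with extra congruence conditions first.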
I've been thinking about this for weeks. I'm normally a jump in there and do it kind of guy, but with no contests around to motivate me I've been stewing on it more than writing it. I did implement one version of the GBM in GBM mechanism, but it underperforms compared to my current normal tree. There are probably more things to do to hone it and make it better, but this is where I stopped and started stewing.
I've been thinking I'm approaching this the wrong way. I think that trees are great for emulating the process of making decisions based on a current known group of data, but don't we know more? Or rather, can't we do better? We actually have the answers, so isn’t there a way we can tease out the exact splits we want or get the precise score we want? I think there might be.
I've been looking at building a tree from the bottom up. I'm still thinking about the details, but the short version is you start out with all the terminal nodes. You then take them in pairs and construct the tree. Any odd node sits out and gets put in next go. The "take them in pairs" is the part I've been really thinking hard about. Conventionally, going down a tree, your splits are done through some system of finding features that cut the data in to pieces you are happy with. I'm going to be doing the same thing but in reverse. I want the 2 data pieces paired together to be as close to each other as possible from a Euclidean distance perspective, at least with regards to whatever features I use. But (and this is one of the things I debate on) I also want the scores to be far apart.
When you think about what I’m trying to accomplish, pairing two items with really far apart scores makes sense. You want to figure out shared qualities in otherwise disjointed items. Similar items could be joined as well; if we approach it that way, the idea is you are building a tree that hones the answer really quickly and exactly. This, however, wouldn’t do a good job of producing scores we can further boost... we wouldn’t be finding a gradient. Instead we would be finding 1 point, and after 1 or 2 iterations we'd be done.
By taking extreme values, the separation would ideally be the difference between the maximum value and the minimum value. If we did that though, it would only work for 2 of our data points. The rest would have to be closer to each other (unless they all shared those 2 extremes). I think it would be best to match items in the center of the distribution with items on the far extreme, giving all pairs a similar distance of (max-min)/2 and likely a value that actually is 1/2 the max or the min, since it would average to the middle.
In this way we merge up items till we get to a fixed depth from the top (top being a root node). We could keep merging till then and try to climb back down the tree. I might try that at some point, but since the splitters you would make won’t work well, I think the better way is to then introduce the test data at the closest terminal node (much like how nodes were merged together) and follow it up the tree till you get to a stopping spot. The average answer there is the score you return.
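One simple way to get that "every pair is about half the range apart" behavior is to sort by score and pair each item in the lower half with the item half a list away. This is a sketch of the score side only: it ignores the Euclidean distance requirement entirely, and `pair_center_with_extremes` is a made-up name.

```python
def pair_center_with_extremes(scores):
    """Pair indices so each pair's scores sit roughly half the range apart.

    Sorts by score, then pairs position i with position i + n // 2, so the
    lowest item meets a middle item and a middle item meets the highest.
    With an odd count, the last sorted item sits out for the next round.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    half = len(order) // 2
    pairs = [(order[i], order[i + half]) for i in range(half)]
    leftover = order[2 * half:]
    return pairs, leftover


scores = [5, 0, 3, 1, 4, 2]
pairs, leftover = pair_center_with_extremes(scores)
gaps = [scores[b] - scores[a] for a, b in pairs]
```

For evenly spread scores every gap comes out the same, close to (max-min)/2. A fuller version would have to trade that gap off against feature distance when choosing partners, which is exactly the part that is still undecided.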
Again I still haven’t implemented it, I’ve been stewing on it. The final piece of the puzzle is exactly how I want to do feature selection for the merging. There has to be some system for maximizing score and minimizing distance so it isn’t all ad-hoc.
Things are going okay with my GBM in GBM. I've got the boosting in a tree now, but the implementation I've got in there is not quite what I want. It's a little bit too much all or nothing. That is, since each tree level is a binary splitter and my method for evaluating splits is also binary, boosting the binary splitter doesn't work real well. There is no gradient to boost per se. Everything is a yes or a no.
You can change the mechanism to turn it in to a logic gate, and essentially each iteration tries to make the final result all zeros (or ones), but this turns out to be so-so at best. You are missing the real advantage of easing in to a solution and picking up various feature to feature information that is weakly expressed. Don't get me wrong, it DOES work. It's just that it's no better than a normal classifying tree.
When I do try to ease in to it by turning the shrinkage way up (the amount you multiply the result by before adding it in to your constructed result), it ends up producing essentially the same result over and over again till you finally cross the threshold and the side the result wants to go on to changes (for right or wrong). The point is, the function in my series of functions that add up to a final answer is producing nearly the same result each time it is called.
I can change this by making it randomly select features each call, but it turns out not to make much difference in my particular implementation. What I really need is a way to weight the various responses so the influence varies based on probabilities. I thought it might be good to quantize these and make it more tiered as opposed to trying to build a standard accuracy model of sorts. So instead of 0-100%, maybe 10%, 20%, 30%, etc., or some other measured amount. The idea is that like valued rows will group together. Though I don't know if that's really any better than a smooth accuracy distribution.
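The "same result over and over" effect is easy to reproduce with a toy booster. This is a minimal sketch, not the GBM in GBM code: one numeric feature, single-split stumps fit to residuals under squared error, and `shrinkage` as the multiplier applied before each stump's output is added in. It assumes a small multiplier is what easing in means here, and that the feature has at least two distinct values.

```python
def fit_stump(xs, residuals):
    """Find the one threshold on xs that best splits residuals (squared error)."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        err = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, left_mean, right_mean)
    return best[1], best[2], best[3]


def boost(xs, ys, rounds, shrinkage):
    """Add `rounds` stumps, each scaled by `shrinkage`, to a running prediction."""
    preds = [0.0] * len(xs)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((t, left_mean, right_mean))
        preds = [p + shrinkage * (left_mean if x <= t else right_mean)
                 for x, p in zip(xs, preds)]
    return stumps, preds


# Binary target: every round picks the same split and just creeps toward 1.
stumps, preds = boost([1, 2, 3, 4], [0, 0, 1, 1], rounds=10, shrinkage=0.1)
```

On a yes/no target like this, all ten stumps split at the same threshold and the prediction for the positive rows only inches up by a fixed fraction of what is left each round. That matches the behavior described above: nothing new is learned until a threshold is crossed.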
First let me share this 'cause you know, ridiculous.
That's pretty much what I'm working on right now in my data mining. It's about half done.
Also I haven't had any marathon training updates. The short version is I've decided to hold off and only give updates when there is something meaningful to share. I never wanted this blog to be work, not to mention I want it to be enjoyable to read. Doing a daily or weekly update for the sake of the update doesn't in my mind fit that. I'm still running, still putting in around 16 miles a week. The weight hasn't really come off. I'm hovering around 197. I figure in a few more weeks if something doesn't give I'll make a more concerted effort to change my diet.
Magic: The Gathering continues to be a fixture in my life, legacy play especially. I've been contemplating a couple of decks in legacy. One I called fraken-mud, which was a tool kit mud deck that used birthing pod and eldrazi instead of the usual suspects (like metalworker and forgemaster). It's been inconsistent. Sometimes it's really, really good, but last week I went 0 for 4... yeah, talk about disappointing. I'm also looking at a black-white cloudpost deck I want to try that just runs lots of my favorite things (stoneforge, wurmcoil, deathrite, bitter blossom, ugin, etc.). Not sure if it'll work or not. I'll try it this Friday. I've also got a white painter's deck in the wings I'm gonna try in a few weeks as well. I think going the all red or red white or red blue route on painter grindstone is fine, but I want to try an all white version.
In standard I've been trying to find something interesting to play, but eh... it all seems like rehash. I'm probably going to play green white little creatures/good-stuff next time I play. And as for modern, well, I have my abzan eldrazi deck still together. I think that's all you get till I get a wild hair and build... I dunno, a new version of affinity. I'd kind of like to try souping up affinity, making it more mid range/stable. But we'll see.
I've been trying to find my next logical step in data mining development. I think I've settled on it. I'm going to make a tree that uses boosting in its fundamental splitting nature. It would be nice if the final results from the tree were better in general (than my current trees), but if it can remove all trace of bias and move towards a good uniformly evaluated answer, it will be a better tree to use in subsequent ensembles.
Currently my tree building mechanism tries to build trees to get the most exact answer possible by grouping like-scoring training rows. The algorithm looks for features to use as splitters that would move rows down the left hand of the tree when they score closer to negative infinity and down the right side of the tree when they score closer to positive infinity. This allows me to do feature selection using the correlation coefficient instead of random selection.
This turns out to be a great way to select features. Not only does the feature selected work towards getting a good separation, but it also has the next step(s) split built-in to the feature separation. That is it correlates with final score as a whole. Subdividing the training set so part goes left and part goes right might be just as accurate using a different feature that doesn't correlate as well, but the mismatches should be closer to correct in the version with a higher correlation coefficient.
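A bare-bones sketch of picking a splitter by correlation coefficient, assuming numeric features and a numeric score per row. The helper names are made up, and this uses plain Pearson correlation; the actual tree code may measure it differently.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either side is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def best_feature(rows, scores):
    """Return (column index, correlation) for the feature most correlated with score."""
    n_features = len(rows[0])
    corrs = [pearson([row[j] for row in rows], scores) for j in range(n_features)]
    best = max(range(n_features), key=lambda j: abs(corrs[j]))
    return best, corrs[best]


rows = [[1, 5], [2, 3], [3, 9], [4, 1]]
scores = [10, 20, 30, 40]
index, corr = best_feature(rows, scores)
```

The sign of the correlation tells you which way to send rows: positive means high feature values go right (toward positive infinity), negative means they go left. Taking the absolute value when ranking keeps strongly negative features in the running.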
I've been making my trees using some form of this method for years. I do lose the ability to discern good secondary features unless they show up due to bagging. This is something I would like to try and address in the new version. This is also what I was talking about in terms of bias in the opening paragraph. My bias is to the highly correlated features.
I also much more recently moved to the "logistic function" (sigmoid) style splitter. So I build a curve for each feature that represents the feature's distribution. It then gets combined with the 2nd, 3rd, etc. features I choose to use as well (depending on the number of features or type of data) to determine which path down the tree the data is set to go. This works wonderfully. It gave me my last big boost in Random Forest before I moved on to gradient boosting, which also makes good use of those stub trees.
The new tree I'm going to be making will endeavor to take the best of all worlds, as it were. I'll be using every feature and building a splitter from the net total sum of where the data says it should go. I'll be weighting each opinion by its overall correlation to the final score (this might change, but you should have some system of setting importance, otherwise it's noise), and finally I'll be doing the process iteratively, where like in GBM each pass will attempt to adjust the result of the previous to hone in on the actual correct score.
Once a tree is finished I'll return it out to a normal GBM tree ensemble mechanism, which will make a new tree and repeat the process. In the case of a binary classifier as your final score, the first tree will more likely than not be a single depth tree because it will have a binary classifier to deal with. But the trees after that will start to get some depth to them as it finds more and more resolution. I'll need to set a maximum depth allowed, much like a normal tree building exercise.
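To make the "net total sum of where the data says it should go" idea concrete, here is one guess at how the pieces could fit together. Each feature gets a sigmoid of its standardized value as its opinion, and the opinions are averaged using the correlation weights. The standardization, the flip for negatively correlated features, and all the names are assumptions for illustration, and the iterative GBM-style pass is left out.

```python
import math


def sigmoid(z):
    """Logistic function, with z clamped so math.exp cannot overflow."""
    z = max(-50.0, min(50.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def weighted_vote(row, stats, weights):
    """Combine every feature's opinion in to one number between 0 and 1.

    stats[j] is an assumed (mean, std) for feature j; weights[j] is that
    feature's correlation with the final score. Above 0.5 means "go right".
    """
    total = sum(abs(w) for w in weights)
    if total == 0:
        return 0.5  # nobody has an opinion
    vote = 0.0
    for x, (mean, std), w in zip(row, stats, weights):
        z = (x - mean) / std if std > 0 else 0.0
        opinion = sigmoid(z)
        if w < 0:
            opinion = 1.0 - opinion  # negative correlation votes the other way
        vote += abs(w) * opinion
    return vote / total


def go_right(row, stats, weights):
    return weighted_vote(row, stats, weights) > 0.5


stats = [(0.0, 1.0), (0.0, 1.0)]
weights = [0.9, -0.1]
high = weighted_vote([2.0, 2.0], stats, weights)
low = weighted_vote([-2.0, -2.0], stats, weights)
```

With these weights the strongly correlated first feature dominates, but the weak second feature still nudges the result, which is the point of using every feature instead of only the top one.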
Give me a few days and we'll see how it goes.
Just a small post today. I think I'm going to do updates on the marathon training once a week. I'll give mileage since the last update and the most recent timed run data I have. I don't time myself most runs, as wearing the watch is a bit annoying, but I'll try and do it the day before or day of the post. Originally, I was going to do an update every time I ran, but that's just too much/boring.
But while I'm here I'll say, I did run today and I put in another 4 miles. The way the weather is shaping up, I'll probably run some more this weekend. I also weighed myself (I usually do in the morning) and my weight was down to 196.2. I only mention this 'cause it's important to know that weight tends to vary +-2 pounds around some actual center weight. That's just another reason not to obsess about it. If you are seeing your own weight vary a lot from day to day (well, a few pounds anyway), that's just food and water moving through you.