Friday, 17 April 2015

Can I out-predict a bunch of yahoos with statistics? NHL Playoff Edition

The answer is probably not. More on that later.

There comes a time in every male data analyst's life where their testosterone builds to frightening levels and they have to apply their skills to over-analyze sports. For me, today is that day. The NHL playoffs began last night, and for a Canadian econometrician who is male (most of them are) the playoffs are like Christmas to a kid or new-trailer-release day to a Star Wars fan. There's data and probability and forecasting aplenty. If you've heard of a statistics term, it can probably be applied in some way to sports.

More importantly, econometric forecasting in the playoffs can be used to make money and, even more importantly, to take other people's money away from them. And really, what else is there to life other than getting obscenely wealthy at someone else's expense? So at the last minute I managed to finagle my way into a group of guys who were in a pool for the playoffs. The buy-in was $20. Each guy drafted twelve NHL players and whoever gets the most aggregate points (i.e. goals and assists for all of their players) wins the pot. Second place gets their money back.

I signed up for this draft on Monday afternoon and the draft itself was at 8PM that night. I dropped everything and rushed home to run regressions with great abandon. What follows is a sketch of my process and playoff drafting rules of thumb, but I did this all in about three hours and so there are probably going to be some major problems with them. I didn't look at the current hockey forecasting literature, I didn't use any fancy stats, I didn't backtest the statistical model. I cut corners all over the place.

Now that I've effectively exonerated myself in case my team sucks, let's talk about what I did. The basis for my draft is a statistical model that predicts how many points a player will get depending on several factors. It's a multivariate ordinary least squares model that uses player and team data from Hockey-Reference.com for the years 2012 to 2015. In addition, I pulled one other variable from a Las Vegas betting website: historical team odds of winning the Stanley Cup.

The general way I predicted how many points a player would get was to use variables from the regular season in my regression to see how well they predicted play-off performance in that year. The model would spit out an estimate for how much each variable predicted play-off point performance for the years of 2012-2014. I then used these coefficients to predict how players would do in the 2015 post-season using regular season data from 2015. This was the easiest way to make any predictions for this year as (obviously) we only have regular season statistics for 2015. I realize that previous play-off performance for each player would be useful but again, I was under the gun to get this done and I didn't have time. 
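The fit-on-past-seasons, predict-this-season workflow described above can be sketched in a few lines. This is only an illustration under stated assumptions: the data are synthetic, the helper names `fit_ols` and `predict` are made up for this example, and the original analysis was run in Stata rather than Python.

```python
import numpy as np

def fit_ols(X, y):
    """Estimate OLS coefficients (intercept first) for y ~ X."""
    X1 = np.column_stack([np.ones(len(X)), X])  # add an intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta

def predict(beta, X):
    """Apply previously estimated coefficients to new regressors."""
    X1 = np.column_stack([np.ones(len(X)), X])
    return X1 @ beta

# Synthetic stand-in for "regular season stats -> playoff points" in past years.
rng = np.random.default_rng(0)
true_beta = np.array([1.0, 0.5, -0.2, 0.1])   # hypothetical values
X_past = rng.normal(size=(200, 3))            # past regular-season variables
y_past = true_beta[0] + X_past @ true_beta[1:] + rng.normal(scale=0.01, size=200)

beta_hat = fit_ols(X_past, y_past)            # step 1: estimate on past seasons

X_now = rng.normal(size=(5, 3))               # this year's regular-season variables
pred = predict(beta_hat, X_now)               # step 2: predict this post-season
ranking = np.argsort(-pred)                   # indices from highest to lowest prediction
```

The key design point is that the coefficients come only from seasons where the playoff outcome is already known, and the current season contributes regressors but no outcome.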

Onto model selection. When I was thinking about what would accurately predict how a player would do, I basically classified the variables into three major categories. First is the individual player's skill. Sidney Crosby is likely going to get a fair amount of points because he's not the average NHL hockey player. He's a superstar and, like many other superstar players, skill likely plays a large role in the play-offs even if your team is bad and you aren't expected to go far. A guy who gets 10 points in the first round but who gets eliminated from the play-offs is probably not a bad pick depending on when you get him. So for individual skill I used regular season points and time on ice. I've also included dummy variables for what position the player plays (i.e. LW vs. RW vs. C vs. D).

The second group of variables that I included were team-related variables. Sidney Crosby is also on a team with a number of very good players who can score and pass him the puck so he can score. A player's team is also important for how deep your drafted players will go in a play-off run. A team that can go deep into the play-offs will likely have more chances to get more points for their players. I assumed that these factors were in some way predicted by team offense and defense statistics. For these variables I've included number of regular season wins, goals for, goals against and average save percentage of the team's goalie for the team each player is on.

The final group of variables that I wanted to include were related to the opponents that a team would be facing. Sidney Crosby and his band of merry hockey players are very good but they're also playing the Rangers, who went to the finals last year and who have had a very good regular season track record. This means that even though the Penguins are good they might not go very far. To control for this I added the Las Vegas betting odds that a team would win the Stanley Cup. Not only should the betting markets build into their price that a team is skilled and has a shot at winning the cup (which I've already attempted to control for with the second group of variables) but they should also incorporate information on the opposition that the team faces in reaching the Stanley Cup.

In mathematical terms the model is basically this:


Post-season points for player i in year y are a function of all of the variables that I outlined above as well as an error term that is distributed iid. The betas are estimated using 2012 to 2014 data and then they are incorporated into an equation that uses 2015 data to predict a player's 2015 post-season point total.
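One way to write this out, using the variable names that appear in the regression output (the notation here is an assumption, not necessarily the exact form used; the team-level variables w, gf, ga, svper and odds refer to the team that player i is on in year y):

```latex
\text{playpts}_{iy} = \beta_0
  + \beta_1\,\text{pts}_{iy}
  + \sum_{k \in \{D,\,LW,\,RW\}} \gamma_k\,\mathbf{1}[\text{poscode}_{i} = k]
  + \beta_2\,\text{toi}_{iy}
  + \beta_3\,\text{w}_{iy}
  + \beta_4\,\text{gf}_{iy}
  + \beta_5\,\text{ga}_{iy}
  + \beta_6\,\text{svper}_{iy}
  + \beta_7\,\text{odds}_{iy}
  + \varepsilon_{iy}
```

The 2015 prediction then drops the error term and plugs 2015 regular season values into the right-hand side with the estimated coefficients.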

Regression results here:

      Source |       SS       df       MS              Number of obs =     648
-------------+------------------------------           F( 10,   637) =   30.18
       Model |   4028.8757    10   402.88757           Prob > F      =  0.0000
    Residual |  8504.40208   637  13.3507097           R-squared     =  0.3215
-------------+------------------------------           Adj R-squared =  0.3108
       Total |  12533.2778   647  19.3713721           Root MSE      =  3.6539

------------------------------------------------------------------------------
     playpts |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         pts |   .1181479   .0140399     8.42   0.000     .0905777     .145718
             |
     poscode |
          D  |  -.6354497   .4383468    -1.45   0.148    -1.496229    .2253298
         LW  |   .0316336   .4398774     0.07   0.943    -.8321514    .8954186
         RW  |  -.4466971   .4306716    -1.04   0.300    -1.292405    .3990106
             |
         toi |   .0006188   .0005999     1.03   0.303    -.0005592    .0017969
           w |   .0261791   .0601923     0.43   0.664    -.0920203    .1443784
          gf |  -.0043203   .0164696    -0.26   0.793    -.0366616     .028021
          ga |  -.0147689   .0120157    -1.23   0.219    -.0383641    .0088262
       svper |    73.9586   23.63515     3.13   0.002     27.54636    120.3708
        odds |  -.0322589   .0166924    -1.93   0.054    -.0650378    .0005199
       _cons |  -64.19289   21.57956    -2.97   0.003    -106.5686   -21.81722
------------------------------------------------------------------------------
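To make the table concrete, here is a small sketch that plugs the coefficients above into a prediction for a single player. The coefficients are copied from the output. Everything else is an assumption for illustration: centre (C) is taken as the omitted position category because only D, LW and RW appear in the table, save percentage is taken to be a fraction such as 0.915, and the units of `toi` and `odds` are guesses. The example player is hypothetical.

```python
# Coefficients copied from the regression output above.
COEF = {
    "pts": 0.1181479,
    "toi": 0.0006188,
    "w": 0.0261791,
    "gf": -0.0043203,
    "ga": -0.0147689,
    "svper": 73.9586,
    "odds": -0.0322589,
    "_cons": -64.19289,
}

# Position effects relative to the omitted category (assumed to be C).
POS = {"C": 0.0, "D": -0.6354497, "LW": 0.0316336, "RW": -0.4466971}

def predict_playoff_points(pts, pos, toi, w, gf, ga, svper, odds):
    """Linear prediction: intercept plus coefficient times value for each regressor."""
    return (
        COEF["_cons"]
        + COEF["pts"] * pts
        + POS[pos]
        + COEF["toi"] * toi
        + COEF["w"] * w
        + COEF["gf"] * gf
        + COEF["ga"] * ga
        + COEF["svper"] * svper
        + COEF["odds"] * odds
    )

# Hypothetical player and team values, chosen only to illustrate the arithmetic.
example = predict_playoff_points(
    pts=70, pos="C", toi=1500, w=50, gf=240, ga=200, svper=0.915, odds=10
)
print(round(example, 2))
```

Note how much of the prediction hinges on the save percentage term and the constant nearly cancelling each other out, which is one reason small differences in team goaltending move the predictions noticeably.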


The predictive model kicks out this list of players in ranked order from most expected points to least expected points. This was the backbone for how I chose my twelve players. A couple of things about this list. First, the top twenty or thirty players on this list kind of make sense. This is reassuring: the model is making semi-decent picks. But anyone can pick a lot of these players off of a list of top performers in the playoffs. The real utility in the model will be in the middle picks (a Lars Eller or Brent Seabrook type pick) where it might take a more discerning choice to make or break a fantasy hockey team. There's a total of 144 players that were picked by the pool so not all of them can be a Sidney Crosby or a Steven Stamkos where point production is assured.

Second, this model was not designed to pick a Stanley Cup winner but it does show some interesting things at the team level. If you graph the average rank of the players on the list by team (which in turn should rank the average number of points that each team should expect per player) this bar chart pops out. This shows that Montreal players have the highest average predicted points among all teams in the playoffs this year. The next tier of teams is St. Louis, NYR, Anaheim and Nashville. I would expect these five teams to be the favourites for the cup. At the other end are Calgary, Ottawa, and Winnipeg. Now this does not mean that these teams won't advance. I can think of scenarios where all the games they play are defensive battles without a lot of scoring. But especially since the model predicts high point totals from Montreal and Anaheim I would expect both Winnipeg and Ottawa to be out fairly soon. That being said, it's the playoffs and anything can happen. Also my model might be awful. Don't send me hate mail Winnipeg.




So this ranking list was the basic tool I used to make draft picks but in addition to this I also used two heuristics that I stole from finance. Full disclosure, I have never taken a real finance course in my life but I have taken a couple of macroeconomics courses and a financial risk analysis course (which was basically a primer on how not to tank the macroeconomy). I did however learn three things about finance from these courses. First, finance is so boring that you have to pay people a truck load of money to get them to do it. This is the least applicable lesson to this blog post (but maybe the most applicable lesson to life in general). Second, groups of people usually make more accurate decisions than individuals (a weak version of the efficient market hypothesis). Third, diversify, diversify, diversify. I use the analogy of a stock portfolio here because it's relevant. If you only invest in one stock, as opposed to several, you are likely to have returns that are highly volatile. They may be very high but they also may be very low. Similarly, if you pick players from one team, the team may go far but it may also flame out in the first round. Diversifying picks, just like stocks, is the least risky strategy.

These latter two lessons are the basis for my heuristics. First, where the list identified two players who had similar predicted points, I picked the player on a team that had better odds of progressing past the first round. This was according to the betting odds that I got from the Las Vegas bookie website.

Second, where I had two similar players, I would pick a player who was not on a team of a player that I had already drafted. In essence I made sure that I didn't have more than three players from the same team. This was to avoid tanking my team in one fell swoop if an NHL team with a whole bunch of my players exited the playoffs.
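The two heuristics can be combined into a simple pick rule. The sketch below is an assumed formalization, not the procedure actually used on draft night: the `tolerance` that defines "similar" players, the ordering of the tie-breaks, the convention that a lower `team_odds` value means a more favoured team, and all player and team names are hypothetical.

```python
from collections import Counter

def choose_pick(available, roster, team_odds, tolerance=0.5, max_per_team=3):
    """Pick from (name, team, predicted_points) tuples.

    Among players within `tolerance` points of the best available prediction,
    prefer a team with fewer players already on the roster, then the team
    with better odds (lower number assumed better), then higher prediction.
    Teams already at `max_per_team` on the roster are skipped entirely.
    """
    counts = Counter(team for _, team, _ in roster)
    eligible = [p for p in available if counts[p[1]] < max_per_team]
    if not eligible:
        return None
    best = max(p[2] for p in eligible)
    close = [p for p in eligible if best - p[2] <= tolerance]
    return min(close, key=lambda p: (counts[p[1]], team_odds[p[1]], -p[2]))

# Hypothetical draft state.
team_odds = {"T1": 5.0, "T2": 8.0, "T3": 20.0}
roster = [("Player X", "T1", 12.0)]
available = [
    ("Player A", "T1", 10.0),
    ("Player B", "T2", 9.8),
    ("Player C", "T3", 7.0),
]
pick = choose_pick(available, roster, team_odds)
print(pick)
```

Here Player A and Player B are treated as similar, and Player B is chosen because the roster already holds someone from T1. Player C is never considered because the prediction is too far below the best available.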

This resulted in the list of drafted players here. It's a solid list and includes three out of the top five ranked players and five out of the top 15. The model did, however, suggest some weird players like Mike Ribeiro.

It was interesting to watch the other members of the pool pick players because it was clear that most of them were also following their own little rules of thumb. Most of them followed a similar heuristic to my first, i.e. pick the players on the consensus favourite teams. This has the obvious benefit of allowing your players to go deep and will likely get you more points.

The heuristic that they didn't follow, though, was my second one. Most of them picked a team they liked and thought would go far and loaded up on players from that team. Eight of the twelve members have at least 50% of their players on two NHL teams. This is a high risk, high reward strategy, and it's probably the reason why I won't win the overall pool. Out of the group of yahoos, on average one of them is going to pick the Stanley Cup-winning team. Their players will go far in the playoffs and earn a lot of points. The rest will flame out in spectacular fashion once their team of choice gets knocked out of the playoffs.

For what it's worth, though, here is the average rank of the players on each team by predicted points. To protect identities (and in honour of Alcohol Awareness Month) I have coded the names of the pool members except for me.




If you believe my little model then I'm ahead both on the average rank of player chosen and the predicted number of points. But not by that much in the case of team Rum and team Rye. Interestingly (although I'm not sure if this was their strategy) team Rye and team Absinthe also diversified their picks as much as I did.

So I'll try and give updates on the pool rankings over the four rounds of play-offs not only to see how the model is doing but to publicly brag or mourn depending on results.

Also if the Jets want to give me a sweet seven figure salary, shoot me a message and I can tell you guys how badly you'll lose in a private venue.




Tuesday, 14 April 2015

Where is the most desirable place to do a medical residency?

This may seem like a trivial anecdote but it is one of my favourite examples of competitive forces at work. The University of Manitoba - Bannatyne campus is essentially the medical arm of the University of Manitoba. It's separate from the main campus and attached to the Health Sciences Centre which is the main hospital complex in Manitoba. Unfortunately because of contracts signed by the University of Manitoba all food services at the Bannatyne campus including coffee are provided by a certain soulless catering conglomerate. Now, I'm not entirely against greedy evil corporations like this one. They have a job to do which is to make money. But this company had a monopoly on all food services on campus which meant that you either ate their food and drank their coffee or shut up. They didn't have to compete and this made for poor service.

I remember in first and second year waiting in line at the campus Tim Horton's run by this multinational tool of a company. The line was never that long but it still took ages to get through. Once you got to the counter you were met by a coffee person (does barista apply at Tim Horton's?) with a glum look on their face who would brusquely take your order and shove a cup of hot coffee in your hands. The major kicker to all this was that, despite working in a building where you could literally take a picture of the inside of a person's brain, this Tim Horton's was a cash only affair. No debit or credit accepted. As a disorganized medical student who would regularly forget what day of the week it was, going to the bank for cash was way too advanced for me. So I would usually trudge by the Tim Horton's on the way to class disgruntled and coffee-less, which was only slightly worse than standing in line for 20 minutes seething about the poor service.

Then, in second year, a hotel for families of patients opened up on the Health Sciences campus and inside that hotel was a Starbucks. This Starbucks, like any modern self-respecting business, had the ability to accept credit and debit purchases and so I began to spend all my money on Starbucks never to return to the Tim Horton's again except in desperation. But whenever I did go back I noticed, at least anecdotally, that the service was faster (if not nicer) and more importantly they had installed a debit machine. Moreover, the same company later set up their own Starbucks kiosk (with a debit machine) right across from the Tim Horton's in an attempt to capture some of the people who were buying coffee from the renegade hotel Starbucks.

This may seem like a small victory but it's an important example of how competition works to make people better off. The hotel Starbucks forced the evil catering monopoly to change their business practices to make Tim Horton's more convenient for their customers. No longer did I or my disorganized cashless compatriots have to sit in a lecture without caffeine. We could get it whenever we wanted on demand.

This idea of voting with one's feet is an important part of competitive forces. People went to Starbucks instead of Tim Horton's and this forced Tim Horton's to change. But in a static sense competition also reveals desirability. Starbucks got more business than Tim Horton's because it provided a more desirable product.

This same idea can be used to (roughly) measure which university residency programs are the most desirable. Universities that are desirable should attract medical students to them and undesirable universities should drive medical students away. Medical students, like coffee customers, should vote with their feet. This seems simple in theory but it's actually much harder to discern desirability in the residency match than in my coffee example. The problem with residency is that the number of spots that each program has is limited. At Starbucks you can line up out the door - there's no cap on demand. At a medical residency program though there are only so many spots and if you don't get one then, tough luck buddy, hit the bricks.

Because of this there are really two ways to get into a desirable residency. The first is to be a savant. On average desirable universities will have higher quality applicants and these applicants will get in ahead of their peers. The second way to get into a desirable program is luck. Village idiots get into desirable programs because all the stars align and they manage to fool their residency interviewers. This luck is helped when there are many of them. Like a swarm of mosquitoes, you can swat most of them down but, just on sheer numbers, one is going to get by you. 

So desirable universities should see high volumes of applications (the village idiot effect) from medical students at other schools and they should see high quality applicants (the savant effect) from other schools. Both will combine to result in a high number of medical students from other universities getting into desirable universities.

Besides this, though, a truly desirable program should be widely desired. If Manitoba is a desirable place to go for residency (this is obviously an example) there should be a high percentage of med students from UBC and McGill and Toronto and Calgary etc. who all get into the University of Manitoba. It shouldn't just be that all of the outside students are from Saskatchewan. Less desirable schools may see high numbers of med students from nearby schools come to them because of geography or whatever but they won't see large numbers of medical students from distant schools. Their popularity won't be widespread.

So the metric I use to gauge desirability is the average percentage of each medical school class that goes to do a residency at a particular university. This does not include those medical students that stay at their home school. So this metric for desirability at university x is then


where r is the number of incoming residents who go to university x for residency from university i, m is the number of medical students in each graduating class, and j is the number of medical schools. This is just a fancy way of saying the average percent of all other medical classes that go to university x is the proxy that I use for desirability.
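Using those definitions, one way to write the metric is the following. This is a reconstruction under two assumptions: the class size m is indexed by the sending school i, and the average is taken over the j - 1 schools other than x, since students who stay at their home school are excluded.

```latex
D_x = \frac{1}{j - 1} \sum_{\substack{i = 1 \\ i \neq x}}^{j} \frac{r_{ix}}{m_i}
```

For example, with three schools, if 10% of one outside class and 20% of the other outside class match to university x, then D_x = (0.10 + 0.20) / 2 = 0.15.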

I took the data for this from the published match reports on the Canadian Resident Matching Service (CaRMS) website. Although CaRMS publishes a report every year, most of their data is not in user friendly form and I was lazy. I only examine two years of data - 2004 and 2014.
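Given a table of where each school's graduates matched, the metric is a short calculation. The sketch below is illustrative only: the school names and counts are made up rather than taken from the match reports, the function name `desirability` is hypothetical, and it assumes the average is taken over every school other than the destination.

```python
def desirability(matches, class_sizes):
    """Average share of each *other* school's class that matched to school x.

    matches[i][x] is the number of graduates of school i who matched to
    a residency at school x; class_sizes[i] is the size of school i's class.
    """
    schools = list(class_sizes)
    scores = {}
    for x in schools:
        shares = [
            matches[i].get(x, 0) / class_sizes[i]
            for i in schools
            if i != x  # students staying at their home school are excluded
        ]
        scores[x] = sum(shares) / len(shares)
    return scores

# Made-up example with three schools.
class_sizes = {"A": 100, "B": 200, "C": 50}
matches = {
    "A": {"A": 60, "B": 30, "C": 10},
    "B": {"A": 20, "B": 170, "C": 10},
    "C": {"A": 5, "B": 10, "C": 35},
}
scores = desirability(matches, class_sizes)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Because each share is divided by the sending school's class size, a big sending school does not count for more than a small one, but nothing here adjusts for how many residency spots the destination has.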

Below is the ranking for 2014. This shows that on average, medical schools (excluding the University of Toronto) saw about 9% of their class go to the University of Toronto for residency. This places UofT as the top school by this metric. Conversely, just under 1% of any given medical school class (excluding Saskatchewan) went to the University of Saskatchewan for residency.




In everything that's preceded this, I deliberately avoided defining what a desirable residency program is to avoid speculation about what desirable attributes are. There are a lot of individual decisions that medical students make when they choose a residency program. But you can pick out at least partly why some of these places are low on the rankings and some are high on the rankings. The bottom four schools, including my own soon-to-be alma mater, are in "undesirable" locations. The top five on the other hand are in "nice" cities or near nice cities (I'm looking at you McMaster). These top five universities also have anecdotal reputations for being good teaching programs which probably drives desirability as well. One thing I am surprised about is that Calgary isn't higher. It has a reputation for being a difficult place to get a residency position but this may just be a sign that this metric isn't that good of an indicator for desirability.

Universities in Quebec are also fairly far down the rankings. I suspect this is for two major reasons. First, you not only have to speak French in these residency programs but medical French. Medical school is basically a four year language program in Latin-sounding medical terminology, so to do this all over again in a different language is probably not enticing for English-speaking medical students. Second, pay for residents in Quebec is the lowest in the country. By quite a bit.

Next interesting question - how have these rankings changed in the last 10 years? I had to exclude the Quebec universities (except McGill) and NOSM because there was no data on the Quebec universities and because NOSM didn't exist in 2004. But for the remaining universities below is a linkplot of the rankings in 2004 and 2014 by school. 




Both Toronto and Saskatchewan retain their places as top and bottom school respectively. There are however a couple of major moves that occurred over the decade. McGill's ranking plummeted five spots largely, I suspect, because of the pay issue that I mentioned above. Quebec just didn't keep up with the wage bumps that other provinces gave to their residents. Western's ranking dropped as well, and I really don't have any explanation for that.

Calgary jumped a couple spots probably due to its growth as an oil town and all around hip place to be in the last ten years. Alberta has also thrown a lot of money at their medical system which probably partially drove this rise in desirability. Calgary also has a reputation (perhaps undeserved) of being a school with a good "work-life balance". Depending on interpretation this either means it's a program that's rife with lay-about residents or it's a program where they don't abuse their residents.

Queen's and McMaster also both jumped three spots and I also really couldn't say why. Maybe they bled off some of the medical student demand that had been going to Western, but I'm not sure. Both just finished building new medical school buildings so that might be part of it as well. Feel free to speculate wildly.

Now, before you send me an angry email about how your program is the best and I'm wrong and this ranking is completely incorrect blah blah blah let me tell you what this data does not say. As with most statistics it does not say anything about desirability of a residency program for an individual. My (failed) attempt to be agnostic about what desirable means was because desirability is really something that individual medical students determine. This is an average desirability for the average medical student. I know many people who have stayed in Manitoba because that fit best for them even though I rank it here as an "undesirable" place. Kids, condos, and familiarity make a place desirable for many people. Reflecting this, a decent chunk of people in medical school stay at their university for their residency (except Queen's which only retained a shocking 17% of its class last year). Similarly, the match isn't just about where you go but which specialty program you go to. Someone who wants to do plastic surgery no matter what doesn't really care if they do it at the UofT or Memorial.

Moreover I suspect the actual training that one receives at all of these universities is essentially the same. There are certainly places that specialize in certain things but for the vast majority of medical students who are going into family medicine they receive about the same training at UofT or UofS. I would say this ranking of desirability is much more about location (Toronto, Vancouver, Calgary etc.) than reflective of the quality of university program.

The metric I use here is also of questionable quality. It's definitely subject to problems because it doesn't control for size of the accepting program. A larger program can absorb more students from other smaller schools and this would bump up its desirability metric. This problem may be reflected in why Toronto is first in both 2004 and 2014. It likely had the largest residency class in both years. But I'm too lazy to investigate this further. Now, if CaRMS had released ranking data on location of program (i.e. number of people who ranked UofT first) then this whole logical exercise would be useless. But they don't and so we have to deal with a proxy of what we're actually trying to measure. As with the ranking of Calgary, it gives some questionable results so I'm not sure how much to trust it.

This also says nothing about how good of a medical student someone is based upon the program they got into. While there will likely be a higher number of "savants" at desirable programs that doesn't mean a whole lot. I didn't define the term savant on purpose because residency programs are often quirky about what makes a medical student competitive. Besides this, there is a pretty decent chunk of luck built into getting a residency. Very smart people get into the University of Saskatchewan and village idiots like myself get into the University of Toronto. It's better to be lucky than good.