Friday, 17 April 2015

Can I out-predict a bunch of yahoos with statistics? NHL Playoff Edition

The answer is probably not. More on that later.

There comes a time in every male data analyst's life when their testosterone builds to frightening levels and they have to apply their skills to over-analyzing sports. For me, today is that day. The NHL playoffs began last night, and for a Canadian econometrician who is male (most of them are) the playoffs are like Christmas to a kid or new-trailer-release day to a Star Wars fan. There's data and probability and forecasting aplenty. If you've heard of a statistical term, it can probably be applied in some way to sports.

More importantly, econometric forecasting in the playoffs can be used to make money and, even more importantly, to take other people's money away from them. And really, what else is there to life other than getting obscenely wealthy at someone else's expense? So at the last minute I managed to finagle my way into a group of guys who were in a pool for the playoffs. The buy-in was $20. Each guy drafted twelve NHL players, and whoever gets the most aggregate points (i.e. goals and assists for all of their players) wins the pot. Second place gets their money back.

I signed up for this draft on Monday afternoon and the draft itself was at 8PM that night. I dropped everything and rushed home to run regressions with great abandon. What follows is a sketch of my process and my playoff drafting rules of thumb, but I did this all in about three hours, so there are probably going to be some major problems with them. I didn't look at the current hockey forecasting literature, I didn't use any fancy stats, and I didn't backtest the statistical model. I cut corners all over the place.

Now that I've effectively exonerated myself in case my team sucks, let's talk about what I did. The basis for my draft is a statistical model that predicts how many points a player will get depending on several factors. It's a multivariate ordinary least squares model that uses player and team data from Hockey-Reference.com for the years 2012 to 2015. In addition, I pulled one other variable from a Las Vegas betting website: historical team odds for winning the Stanley Cup.

The general way I predicted how many points a player would get was to use variables from the regular season in my regression to see how well they predicted playoff performance in that year. The model would spit out an estimate of how much each variable predicted playoff point performance for the years 2012-2014. I then used these coefficients to predict how players would do in the 2015 post-season using regular season data from 2015. This was the easiest way to make any predictions for this year as (obviously) we only have regular season statistics for 2015. I realize that previous playoff performance for each player would be useful but again, I was under the gun to get this done and I didn't have time.
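The fit-on-past-seasons, predict-this-season step can be sketched in a few lines of Python. This is only an illustration with made-up numbers and two regressors, not the actual estimation. The helper names `fit_ols` and `predict` are hypothetical.

```python
import numpy as np

def fit_ols(X, y):
    """Return OLS coefficients (intercept first) for y regressed on the columns of X."""
    X1 = np.column_stack([np.ones(len(X)), X])  # prepend a constant column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta

def predict(beta, X):
    """Apply fitted coefficients to new regular-season data."""
    X1 = np.column_stack([np.ones(len(X)), X])
    return X1 @ beta

# Made-up training rows standing in for 2012-2014:
# columns are regular-season points and team wins.
train_X = np.array([[80, 50], [60, 45], [40, 48],
                    [20, 42], [70, 51], [30, 44]], dtype=float)
train_y = np.array([12, 6, 5, 1, 10, 2], dtype=float)  # playoff points (made up)

# Made-up rows standing in for 2015, where only regular-season data exists.
new_X = np.array([[75, 49], [25, 43]], dtype=float)

beta = fit_ols(train_X, train_y)
predicted = predict(beta, new_X)
print(predicted)
```

The point of the split is that the coefficients never see the season they are asked to predict, which is the only option here since 2015 playoff points don't exist yet.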

On to model selection. When I was thinking about what would accurately predict how a player would do, I basically classified the variables into three major categories. First is the individual player's skill. Sidney Crosby is likely going to get a fair number of points because he's not the average NHL hockey player. He's a superstar and, as with many other superstar players, his skill likely plays a large role in the playoffs even if his team is bad and isn't expected to go far. A guy who gets 10 points in the first round but who gets eliminated from the playoffs is probably not a bad pick depending on when you get him. So for individual skill I used regular season points and time on ice. I've also included dummy variables for what position the player plays (i.e. LW vs. RW vs. C vs. D).

The second group of variables that I included were team-related variables. Sidney Crosby is also on a team with a number of very good players who can score and pass him the puck so he can score. A player's team is also important for how deep your drafted players will go in a playoff run. A team that can go deep into the playoffs will likely give its players more chances to get points. I assumed that these factors were in some way predicted by team offense and defense statistics. For these variables I've included the number of regular season wins, goals for, goals against and the average save percentage of the team's goalie, all for the team each player is on.

The final group of variables that I wanted to include were related to the opponents that a team would be facing. Sidney Crosby and his band of merry hockey players are very good, but they're also playing the Rangers, who went to the finals last year and who have had a very good regular season track record. This means that even though the Penguins are good they might not go very far. To control for this I added the Las Vegas betting odds that a team would win the Stanley Cup. Not only should the betting markets build into their price that a team is skilled and has a shot at winning the Cup (which I've already attempted to control for with the second group of variables), but they should also incorporate information on the opposition that the team faces on the way to the Stanley Cup.

In mathematical terms the model is basically this:


Post-season points for player i in year y are a function of all of the variables that I outlined above as well as an error term that is assumed to be i.i.d. The betas are estimated using 2012 to 2014 data and then they are plugged into an equation that uses 2015 data to predict a player's 2015 post-season point total.
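Spelled out with the variable names from the regression output below, the specification would look roughly like this. The notation is my own shorthand: t(i) stands for the team player i is on, and centre is taken to be the omitted position because the output only lists D, LW and RW.

```latex
\begin{aligned}
\text{playpts}_{i,y} ={}& \beta_0
  + \beta_1\,\text{pts}_{i,y}
  + \beta_2\,\text{toi}_{i,y}
  + \sum_{p \in \{D,\,LW,\,RW\}} \gamma_p\,\mathbf{1}[\text{pos}_i = p] \\
 &+ \beta_3\,\text{w}_{t(i),y}
  + \beta_4\,\text{gf}_{t(i),y}
  + \beta_5\,\text{ga}_{t(i),y}
  + \beta_6\,\text{svper}_{t(i),y}
  + \beta_7\,\text{odds}_{t(i),y}
  + \varepsilon_{i,y}
\end{aligned}
```

Here pts and toi are the player's regular season points and time on ice, w, gf, ga and svper are the team's wins, goals for, goals against and save percentage, and odds is the betting-market variable.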

Regression results here:

      Source |       SS       df       MS              Number of obs =     648
-------------+------------------------------           F( 10,   637) =   30.18
       Model |   4028.8757    10   402.88757           Prob > F      =  0.0000
    Residual |  8504.40208   637  13.3507097           R-squared     =  0.3215
-------------+------------------------------           Adj R-squared =  0.3108
       Total |  12533.2778   647  19.3713721           Root MSE      =  3.6539

------------------------------------------------------------------------------
     playpts |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         pts |   .1181479   .0140399     8.42   0.000     .0905777     .145718
             |
     poscode |
          D  |  -.6354497   .4383468    -1.45   0.148    -1.496229    .2253298
         LW  |   .0316336   .4398774     0.07   0.943    -.8321514    .8954186
         RW  |  -.4466971   .4306716    -1.04   0.300    -1.292405    .3990106
             |
         toi |   .0006188   .0005999     1.03   0.303    -.0005592    .0017969
           w |   .0261791   .0601923     0.43   0.664    -.0920203    .1443784
          gf |  -.0043203   .0164696    -0.26   0.793    -.0366616     .028021
          ga |  -.0147689   .0120157    -1.23   0.219    -.0383641    .0088262
       svper |    73.9586   23.63515     3.13   0.002     27.54636    120.3708
        odds |  -.0322589   .0166924    -1.93   0.054    -.0650378    .0005199
       _cons |  -64.19289   21.57956    -2.97   0.003    -106.5686   -21.81722
------------------------------------------------------------------------------
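To make the table concrete, here is a hedged worked example that plugs one hypothetical player into the reported coefficients. The coefficients are copied from the output above; everything about the player is made up. The units are also assumptions on my part: total minutes for toi, a fraction for svper, and "X-to-1" for odds. Centre is treated as the base position since it's the one missing from the output.

```python
# Coefficients copied from the regression output above.
COEF = {
    "pts": 0.1181479, "toi": 0.0006188, "w": 0.0261791, "gf": -0.0043203,
    "ga": -0.0147689, "svper": 73.9586, "odds": -0.0322589, "_cons": -64.19289,
}
# Position effects relative to centre (assumed to be the omitted category).
POSITION = {"C": 0.0, "D": -0.6354497, "LW": 0.0316336, "RW": -0.4466971}

def predict_playoff_points(player):
    """Linear prediction: constant + position effect + sum of coef * value."""
    total = COEF["_cons"] + POSITION[player["pos"]]
    for name in ("pts", "toi", "w", "gf", "ga", "svper", "odds"):
        total += COEF[name] * player[name]
    return total

# A made-up player on a made-up team; none of these numbers are real.
hypothetical = {
    "pos": "C", "pts": 80, "toi": 1500, "w": 50,
    "gf": 220, "ga": 190, "svper": 0.915, "odds": 10,
}

print(round(predict_playoff_points(hypothetical), 2))
```

Under these invented inputs the prediction comes out at about 11.1 playoff points. Swapping the position to D lowers it by the defenceman coefficient, about 0.64.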


The predictive model kicks out this list of players in ranked order from most expected points to least expected points. This was the backbone for how I chose my twelve players. A couple of things about this list. First, the top twenty or thirty players on this list kind of make sense, which is reassuring: the model is making semi-decent picks. But anyone can pick a lot of these players off of a list of top performers in the playoffs. The real utility in the model will be in the middle picks (a Lars Eller or Brent Seabrook type pick), where a more discerning choice might make or break a fantasy hockey team. A total of 144 players were picked by the pool, so not all of them can be a Sidney Crosby or a Steven Stamkos where point production is assured.

Second, this model was not designed to pick a Stanley Cup winner, but it does show some interesting things at the team level. If you graph the average rank of the players on the list by team (which in turn should rank the average number of points that each team should expect per player), this bar chart pops out. This shows that Montreal players have the highest average predicted points among all teams in the playoffs this year. The next tier of teams is St. Louis, NYR, Anaheim and Nashville. I would expect these five teams to be the favourites for the Cup. At the other end are Calgary, Ottawa, and Winnipeg. Now this does not mean that these teams won't advance. I can think of scenarios where all the games they play are defensive battles without a lot of scoring. But especially since the model predicts high point totals from Montreal and Anaheim, I would expect both Winnipeg and Ottawa to be out fairly soon. That being said, it's the playoffs and anything can happen. Also my model might be awful. Don't send me hate mail, Winnipeg.
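The team-level summary is just a group-by on the ranked list. Here's a minimal sketch, assuming the list is ordered best-first. The team codes and player labels are placeholders, not real results, and `average_rank_by_team` is a hypothetical helper name.

```python
from collections import defaultdict

def average_rank_by_team(ranked):
    """ranked: list of (player, team) pairs ordered best-first.
    Returns {team: mean rank of its players}, where rank 1 is the top player."""
    ranks = defaultdict(list)
    for rank, (_player, team) in enumerate(ranked, start=1):
        ranks[team].append(rank)
    return {team: sum(r) / len(r) for team, r in ranks.items()}

# Placeholder list: six players across three made-up teams.
ranked = [("p1", "AAA"), ("p2", "BBB"), ("p3", "AAA"),
          ("p4", "CCC"), ("p5", "BBB"), ("p6", "CCC")]

avg = average_rank_by_team(ranked)
ordered = sorted(avg, key=avg.get)  # lowest average rank (best) first
print(ordered)
```

A lower average rank means more predicted points per player, so the team at the front of `ordered` is the one the model likes most.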




So this ranking list was the basic tool I used to make draft picks, but in addition to this I also used two heuristics that I stole from finance. Full disclosure, I have never taken a real finance course in my life, but I have taken a couple of macroeconomics courses and a financial risk analysis course (which was basically a primer on how not to tank the macroeconomy). I did, however, learn three things about finance from these courses. First, finance is so boring that you have to pay people a truckload of money to get them to do it. This is the least applicable lesson to this blog post (but maybe the most applicable lesson to life in general). Second, groups of people usually make more accurate decisions than individuals (a weak version of the efficient market hypothesis). Third, diversify, diversify, diversify. I use the analogy of a stock portfolio here because it's relevant. If you only invest in one stock, as opposed to several, you are likely to have returns that are highly volatile. They may be very high but they also may be very low. Similarly, if you pick players from one team, the team may go far but it may also flame out in the first round. Diversifying picks, just like stocks, is the least risky strategy.

These latter two lessons are the basis for my heuristics. First, where the list identified two players who had similar predicted points, I picked the player on the team that had better odds of progressing past the first round. This was according to the betting odds that I got from the Las Vegas bookie website.

Second, where I had two similar players, I would pick a player who was not on a team of a player that I had already drafted. In essence I made sure that I didn't have more than three players from the same team. This was to avoid tanking my team in one fell swoop if an NHL team with a whole bunch of my players exited the playoffs.
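The two heuristics can be written down as a small pick rule. This is a sketch under several assumptions of mine. "Similar" is taken to mean within `tol` predicted points of the best available player. A lower `odds` number is taken to mean a better chance. Among similar players, the new-team preference is applied before the odds preference. None of these thresholds or orderings come from the actual draft.

```python
from collections import Counter

def next_pick(available, my_team, tol=0.5, cap=3):
    """Choose the next player from `available` given the players already drafted.

    Each player is a dict with keys "name", "team", "pred" (predicted points)
    and "odds" (lower assumed to mean a better chance of advancing).
    """
    counts = Counter(p["team"] for p in my_team)
    # Heuristic 2 (hard limit): never exceed `cap` players from one NHL team.
    eligible = [p for p in available if counts[p["team"]] < cap]
    if not eligible:
        return None
    best = max(p["pred"] for p in eligible)
    similar = [p for p in eligible if best - p["pred"] <= tol]
    # Among similar players: prefer a team not yet on my roster,
    # then better odds (heuristic 1), then the higher prediction.
    return min(similar, key=lambda p: (counts[p["team"]] > 0, p["odds"], -p["pred"]))

# Made-up players on made-up teams.
available = [
    {"name": "A", "team": "X", "pred": 10.0, "odds": 8},
    {"name": "B", "team": "Y", "pred": 9.8, "odds": 5},
    {"name": "C", "team": "Z", "pred": 7.0, "odds": 3},
]

print(next_pick(available, my_team=[])["name"])
```

With an empty roster, A and B count as similar and B wins on odds. C has the best odds of all but is too far behind on predicted points to be considered.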

This resulted in the list of drafted players here. It's a solid list and includes three out of the top five ranked players and five out of the top 15. The model did, however, suggest some weird players like Mike Ribeiro.

It was interesting to watch the other members of the pool pick players because it was clear that most of them were also following their own little rules of thumb. Most of them followed a similar heuristic to my first, i.e. pick the players on the consensus favourite teams. This has the obvious benefit of allowing your players to go deep and will likely get you more points.

The heuristic that they didn't follow, though, was my second one. Most of them picked a team they liked and thought would go far and loaded up on players from that team. Eight of the twelve members have at least 50% of their players on two NHL teams. This is a high-risk, high-reward strategy, and it's probably the reason why I won't win the overall pool. Out of the group of yahoos, odds are that one of them is going to pick the Stanley Cup-winning team. Their players will go far in the playoffs and earn a lot of points. The rest will flame out in spectacular fashion once their team of choice gets knocked out of the playoffs.

For what it's worth, though, here is the average rank of the players on each team by predicted points. To protect identities (and in honour of Alcohol Awareness Month) I have coded the names of the pool members except for me.




If you believe my little model then I'm ahead both on the average rank of players chosen and on the predicted number of points, but not by that much in the case of team Rum and team Rye. Interestingly (although I'm not sure if this was their strategy), team Rye and team Absinthe also diversified their picks as much as I did.

So I'll try to give updates on the pool rankings over the four rounds of the playoffs, not only to see how the model is doing but to publicly brag or mourn depending on results.

Also if the Jets want to give me a sweet seven figure salary, shoot me a message and I can tell you guys how badly you'll lose in a private venue.



