Update: College Football Pickems Decision Model
In last week's post I discussed my trouble picking college football games against the spread. After stating the problem (I was dead last in the standings), I walked through a very basic exploratory and statistical analysis, using probabilities from the first two weeks to develop a decision-making model for selecting my picks. And the results were...
As a recap, here is the model:
|Matchup|Point spread|Pick|
|---|---|---|
|BCS versus non-BCS|1–9|Favorite|
|BCS versus non-BCS|9–15|Underdog|
|BCS versus non-BCS|18–27|Favorite|
|BCS versus non-BCS|>27|Underdog|
|BCS versus BCS|1–3|Favorite|
|BCS versus BCS|3–12|Underdog|
|BCS versus BCS|12–24|Favorite|
|BCS versus BCS|>24|Underdog|
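For anyone who wants to apply the table mechanically, here is a minimal sketch of it as a lookup in Python. The names `BANDS` and `pick` are my own for illustration. Two things are assumptions rather than part of the model: a spread that sits exactly on a shared boundary (such as 9) goes to the lower band, and a spread that no band covers (such as 16 in a BCS versus non-BCS game) returns no pick at all.

```python
from typing import Optional

# (low, high, pick) bands copied from the table; ">27" and ">24" are open-ended.
BANDS = {
    "BCS versus non-BCS": [(1, 9, "Favorite"), (9, 15, "Underdog"),
                           (18, 27, "Favorite"), (27, None, "Underdog")],
    "BCS versus BCS": [(1, 3, "Favorite"), (3, 12, "Underdog"),
                       (12, 24, "Favorite"), (24, None, "Underdog")],
}


def pick(matchup: str, spread: float) -> Optional[str]:
    """Return "Favorite", "Underdog", or None when no band covers the spread."""
    for low, high, choice in BANDS[matchup]:
        # Assumption: bounds are inclusive and bands are checked in table order,
        # so a spread on a shared boundary falls into the lower band.
        if low <= spread and (high is None or spread <= high):
            return choice
    return None  # the table does not cover this spread


print(pick("BCS versus BCS", 7))       # Underdog
print(pick("BCS versus non-BCS", 16))  # None
```

Returning `None` instead of guessing keeps the uncovered spreads visible, which matters when counting how many games the model actually applies to.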
Using this framework, I correctly picked 15 of the 27 applicable games, a 55.6% winning percentage. Considering that favorites won 50% of the games during the week, I am definitely pleased with the improvement, especially the move from last place to the middle of the standings. However, a lot of my competitors had a great week as well, so I did not gain any ground on the league leaders. If the model were a good one, I should have won over 80% of the games, so there is definitely room for improvement. As a data analyst consultant, I strive for data-aided perfection, or as close to it as possible.
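As a quick sanity check on those numbers, the sketch below recomputes the winning percentage and asks a rough follow-up question: if every pick were a coin flip, how often would 15 or more of 27 come up correct? The coin-flip baseline and the helper name `prob_at_least` are my assumptions for illustration, not part of the original analysis.

```python
from math import comb

wins, games = 15, 27
win_pct = 100 * wins / games


def prob_at_least(k: int, n: int, p: float = 0.5) -> float:
    """Chance of k or more correct picks out of n when each is right with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


print(f"{win_pct:.1f}%")           # 55.6%
print(prob_at_least(wins, games))  # share of coin-flip weeks that do at least this well
```

The larger that second number, the harder it is to tell one good week from luck.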
Where to go from here
The framework summarized in the table above is a structured way of using past data to make decisions. It uses historical data to make a prediction; however, it is an overly simplified approach that does not take into account the relative team strengths, team weaknesses, and competitive nuances of each game. What I need is a more robust model that can quantify these as predictive indicators.
To develop an enhanced predictive model, we first need to define the objective, or output. In this case I want to be able to predict the point spread for each game. Each college football game has dozens of variables that can affect the point spread. I like to think of them in terms of macro and micro: macro variables are those not related to the actual performance of the teams, and micro variables are those that are. We can construct this as follows:
point spread = f(macro) + g(micro)
So what are some macro indicators, things that neither team can control during a game but may impact the results? I am thinking home field advantage, weather, and time of day may be a good start. I also think schedule strength may be important so I want to look at that as well. Additionally, I would like to find a way to factor in wear and tear and change in competitiveness throughout the season. For example, as players suffer injuries and teams play more conference games with more similarly situated competition the point spreads may go down as the season progresses.
What about micro indicators? This is where it can get very tricky and may require a lot of model training. However, it can be summarized as the net of the strengths and weaknesses of each team. The goal is to find a handful of offensive and/or defensive statistics that can be used to predict the scoring ability of each team. For example, the model will predict that Texas will score x points based on its offensive and defensive performance to date. By subtracting the predicted score of the opponent and netting the result against the macro factors for the game, we will be able to offer a prediction of the point spread that we can compare to the line to make our picks.
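To make the last step concrete, here is a minimal sketch in Python of how the pieces might combine once a model exists. Everything in it is hypothetical: the `Game` fields, the sign convention (positive numbers favor the favorite), and the example values are made up for illustration, and no such model has been trained yet.

```python
from dataclasses import dataclass


@dataclass
class Game:
    favorite_points: float   # hypothetical micro output: predicted points for the favorite
    underdog_points: float   # hypothetical micro output: predicted points for the underdog
    macro_adjustment: float  # hypothetical net of macro factors in points (positive helps the favorite)
    line: float              # published spread: points the favorite is expected to win by


def predicted_spread(game: Game) -> float:
    """Net the two predicted scores, then add the macro adjustment."""
    return (game.favorite_points - game.underdog_points) + game.macro_adjustment


def pick_against_line(game: Game) -> str:
    """Take the favorite only if the predicted margin beats the line."""
    # Assumption: a predicted margin exactly equal to the line goes to the underdog.
    return "Favorite" if predicted_spread(game) > game.line else "Underdog"


# Made-up numbers, purely to show the mechanics.
example = Game(favorite_points=31, underdog_points=20, macro_adjustment=2.5, line=10)
print(predicted_spread(example))   # 13.5
print(pick_against_line(example))  # Favorite
```

Keeping the macro adjustment as a separate term means factors like home field or weather can be tuned without touching the scoring predictions.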
Since time is running short before the deadline for my picks, I will not be able to train a model for this week. I did, however, stumble upon a college football data source that I will use for this analysis. With data going back several years, I am hopeful that we can design a good model. Time will tell.