Generating Odds
Most of the statistics in the scores table for each pool are based on simulations run using the Monte Carlo method.
The process:
- Pull the latest values from kenpom.com for each team's adjusted efficiency margin and adjusted tempo.
- For each actual or hypothetical matchup in the tournament, estimate team A's probability of defeating team B, written P(A,B), using the method described here. (As I understand it, my numbers may differ slightly from the ones kenpom.com publishes if the site decides that certain venues give teams partial home-court advantages, e.g. when UNC plays in Charlotte.)
- Before the real tournament starts and each time a new result comes in, simulate a large number (several thousand) of hypothetical tournaments using the P(A,B) probabilities to determine random outcomes for each of the games that hasn't yet been played. For each hypothetical tournament, score all of the entries in each pool and record how they performed (in total points and relative to one another).
- Aggregate the results from all of the thousands of runs to generate the statistics shown in the scores table.
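The process above can be sketched in a few lines of Python. This is only an illustration, not the site's actual code: `win_prob` is a placeholder logistic curve standing in for the linked P(A,B) method, and the bracket layout, the `known` results map, the scoring rule (fixed points per correct pick in each round), and every function name here are assumptions. It also assumes the number of teams is a power of two.

```python
import random

def win_prob(rating_a, rating_b, scale=10.0):
    """Placeholder P(A,B): logistic in the rating gap (an assumption, not the linked method)."""
    return 1.0 / (1.0 + 10 ** (-(rating_a - rating_b) / scale))

def simulate_tournament(teams, ratings, rng, known=None):
    """Play out one single-elimination bracket.

    teams  -- team names in bracket order (adjacent pairs meet in round 0)
    known  -- {(round, slot): winner} for games that have already been played
    Returns a list of rounds, each a list of that round's winners by slot.
    """
    known = known or {}
    rounds, alive, rnd = [], list(teams), 0
    while len(alive) > 1:
        winners = []
        for slot, (a, b) in enumerate(zip(alive[0::2], alive[1::2])):
            if (rnd, slot) in known:
                # Real result already in: keep it fixed in every run.
                winners.append(known[(rnd, slot)])
            else:
                # Unplayed game: draw a random outcome from P(A,B).
                p = win_prob(ratings[a], ratings[b])
                winners.append(a if rng.random() < p else b)
        rounds.append(winners)
        alive = winners
        rnd += 1
    return rounds

def score_entry(picks, results, points_per_round):
    """Score one entry; picks and results share the same rounds-of-winners shape."""
    total = 0
    for rnd, (picked, actual) in enumerate(zip(picks, results)):
        correct = sum(p == w for p, w in zip(picked, actual))
        total += points_per_round[rnd] * correct
    return total

def run_simulations(teams, ratings, entries, points_per_round,
                    n_runs=5000, seed=0, known=None):
    """Simulate n_runs tournaments and score every entry in each one."""
    rng = random.Random(seed)
    runs = []
    for _ in range(n_runs):
        results = simulate_tournament(teams, ratings, rng, known)
        runs.append({name: score_entry(picks, results, points_per_round)
                     for name, picks in entries.items()})
    return runs
```

Each element of the returned list holds one hypothetical tournament's score for every entry, which is the raw material for the aggregation step.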
Explanation of individual statistics:
- Best possible score (theoretical): This statistic doesn't require the Monte Carlo simulations. It's simply the score you could achieve if every remaining game went in your favor.
- Best possible score (observed): The maximum score you achieved across all of the simulated tournaments. At the beginning of the tournament this will typically be significantly lower than the theoretical maximum; as the tournament progresses, the two numbers will converge. Caveat: you may occasionally see an observed best score that exceeds the corresponding theoretical best score. This happens because the theoretical score updates almost immediately after a new result comes in, whereas the observed score only updates after the prediction model runs again (which typically takes 10-20 minutes).
- Projected score: In all of the simulated tournaments, your average (mean) score.
- Best possible finish: In all of the simulated tournaments, the single best ranking you achieved.
- Worst possible finish: In all of the simulated tournaments, the single worst ranking you achieved.
- Odds of winning: The number of runs in which you finished first divided by the total number of runs. Note that these numbers, summed across all entries in a pool, will exceed 100% if there are possible scenarios in which two or more entries tie for first place, because each tied entry is credited with a first-place finish.
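As a rough sketch of how the simulation-based statistics above could be computed from the recorded runs (the data layout, the function names, and the tie rule are assumptions on my part, not the site's code): each run is a dict mapping an entry's name to its total score, and an entry's rank is one plus the number of entries that scored strictly higher, so tied entries share a rank.

```python
from statistics import mean

def rank_of(name, scores):
    """Rank within one run: 1 + number of entries with a strictly higher score."""
    return 1 + sum(s > scores[name] for s in scores.values())

def summarize(runs):
    """runs: list of dicts mapping entry name -> score in one simulated tournament."""
    summary = {}
    for name in runs[0]:
        scores = [run[name] for run in runs]
        ranks = [rank_of(name, run) for run in runs]
        summary[name] = {
            "best_observed": max(scores),     # Best possible score (observed)
            "projected": mean(scores),        # Projected score
            "best_finish": min(ranks),        # Best possible finish
            "worst_finish": max(ranks),       # Worst possible finish
            "odds_of_winning": sum(r == 1 for r in ranks) / len(runs),
        }
    return summary
```

With two runs in which entries "a" and "b" tie at 10 points and then score 5 and 8, "a" gets odds of 0.5 and "b" gets odds of 1.0. The total is 150%, which is the tie effect described under Odds of winning.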
Known issues/limitations:
- Even if the prior probabilities are perfect, the Monte Carlo method is only an approximation. (In its defense, though, its results should converge to the true values if the number of runs is sufficiently high.) But why not just calculate the exact numbers? Once the tournament is down to 16 or 8 teams, that's fine. But when 64 teams remain in the field, 63 games are still to be played, and the number of possible outcomes (2^63, roughly 9.2 × 10^18) is so large that even the fastest computer in the world couldn't iterate through all of them. I suspect you could make the computation tractable by taking clever shortcuts to avoid considering most of the cases, but I'm not getting paid enough to work on that problem.
- If you have a tiny but non-zero probability of winning, it's possible that you won't finish first in a single run of the simulation, in which case the site will erroneously list your probability of winning as zero.
- As noted above, the probabilities I generate for individual matchups fail to account for partial home-court advantages.
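The arithmetic behind the first two limitations is easy to check. In the sketch below, the function names are mine, and the 1-in-10,000 win probability and 5,000-run count are illustrative assumptions only (the text above says just "several thousand" runs). A field of n teams has n - 1 games left, each with two possible outcomes, and an entry whose true chance of winning is p finishes first in none of N independent runs with probability (1 - p)^N.

```python
def possible_outcomes(teams_remaining):
    """Number of distinct ways the remaining single-elimination games can go."""
    return 2 ** (teams_remaining - 1)

def prob_never_first(p, n_runs):
    """Chance that an entry with true win probability p wins none of n_runs runs."""
    return (1 - p) ** n_runs

print(possible_outcomes(64))  # 9223372036854775808
print(possible_outcomes(16))  # 32768
print(possible_outcomes(8))   # 128

# Illustrative numbers only: a 1-in-10,000 long shot over 5,000 runs.
print(prob_never_first(0.0001, 5000))
```

Under those illustrative numbers, the long shot is more likely than not to be reported as having zero chance, while exhaustive enumeration is trivially cheap once 16 or fewer teams remain.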
Potential improvements in future years:
- Allow the user to view details of the simulation's runs so (s)he can identify the specific scenarios that will yield a victory in the pool.
- Provide an interface for the user to construct a hypothetical bracket (i.e. fill in winners for the remaining games) and then score a pool based on these results.
- (not related to the simulation) Show how many teams each entry has alive in upcoming rounds.
- Other ideas? Let me know.