Thursday, March 5, 2020

Predicting MLB Attendance: Modeling

The previous post explored the effect of certain factors on attendance. Models that predict attendance can show, more precisely than a simple exploration, which factors matter most for the attendance of a given game. This post fits five different model types and then examines the best one to reveal the importance of each factor. The five models were a linear model, a generalized linear model (GLM), an extreme gradient boosting (XGBoost) linear model, a random forest model, and an XGBoost tree model. The linear model is the simplest and the easiest to interpret, yet it is not as accurate as the more complex models, and some of its base assumptions do not hold for this data. The GLM and the XGBoost linear models are more complex versions of the linear model, while the random forest and XGBoost tree models use an ensemble of decision trees to generate predictions. The code used to tune and find the optimal parameters for each model can be found here.

The performance of each model was evaluated using the Root Mean Square Error (RMSE), which is the square root of the mean of the squared model errors, or residuals. The RMSE is preferred because it penalizes larger errors more heavily than smaller ones. The two XGBoost models performed the best, with the tree model outperforming the linear model.
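To make the metric concrete, here is a minimal sketch of an RMSE calculation in plain Python. It is not the code from this analysis; the function name `rmse` and the attendance figures are made up for illustration only.

```python
import math


def rmse(actual, predicted):
    """Root Mean Square Error: square root of the mean squared residual."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one observation is required")
    squared_residuals = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_residuals) / len(squared_residuals))


# Hypothetical attendance figures, not taken from the real data set.
actual = [30000, 25000, 40000]
small_misses = [31000, 24000, 41000]  # three misses of 1,000 each
one_big_miss = [30000, 25000, 43000]  # one miss of 3,000

print(rmse(actual, small_misses))
print(rmse(actual, one_big_miss))
```

Both sets of predictions are off by 3,000 in total, but the second set scores a higher RMSE because squaring the residuals weights the single large miss more heavily.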



While the XGBoost Tree model performed best according to the RMSE metric, it tended to underestimate attendance for games where it predicted lower attendance and to overestimate attendance for games where it predicted higher attendance. Overall, though, it does a fairly good job.


The two teams that played in each game were by far the most important variables for the XGBoost Tree model. Combined, they accounted for approximately 97% of the variable importance. However, if MLB would like to increase attendance, it cannot simply make the most popular teams play each other more often, so the focus should be on factors that MLB can control.


The Series Number and Day of Week were the two most important variables besides the teams that played in the game. The Series Number variable was used as a proxy for the time of year that the game was played, while the Day of Week variable is simply the day of the week that the game was played. These two variables are clearly the most important factors that MLB can actually control in order to increase attendance.



Average attendance is approximately 5,000 higher from late June to mid-August than during the rest of the season; however, attendance does seem to pick up towards the end of the season as well. The spike at the beginning of the season is largely a result of opening day, and given that a team can only have one opening day, there is not much MLB can do in that sense.


Attendance is noticeably higher on weekends than during the week. Saturdays see the highest average attendance, followed closely by Fridays and Sundays. Saturday attendance averages approximately 7,000 higher than weekday attendance, while Fridays and Sundays average roughly 5,000 higher than weekdays.
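A comparison like this one can be sketched in a few lines of Python. This is only an illustration, not the code behind the figures above: the function names are hypothetical, Monday through Thursday are assumed to be the weekdays, and the toy game list is invented so that it roughly mirrors the differences described.

```python
from collections import defaultdict
from statistics import mean

# Assumption: Friday counts as part of the weekend for this comparison.
WEEKEND_DAYS = {"Fri", "Sat", "Sun"}


def average_by_day(games):
    """Average attendance for each day of the week."""
    by_day = defaultdict(list)
    for day, attendance in games:
        by_day[day].append(attendance)
    return {day: mean(values) for day, values in by_day.items()}


def lift_over_weekdays(games):
    """How far each weekend day's average sits above the weekday average."""
    weekday_avg = mean(att for day, att in games if day not in WEEKEND_DAYS)
    return {
        day: avg - weekday_avg
        for day, avg in average_by_day(games).items()
        if day in WEEKEND_DAYS
    }


# Invented (day, attendance) pairs, not real MLB data.
games = [
    ("Tue", 26000), ("Wed", 24000),
    ("Fri", 30000),
    ("Sat", 33000), ("Sat", 31000),
    ("Sun", 30000),
]

lifts = lift_over_weekdays(games)
print(lifts)
```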


Given that attendance is higher on weekends and from late June to mid-August, MLB should consider scheduling more games on weekends and during the summer. While almost all of these days already have games scheduled, MLB should consider adding doubleheaders (two games on one day) on Saturdays and Sundays during the summer. MLB has not had a scheduled doubleheader since 2011, yet doubleheaders on summer weekends could lead to an increase in attendance and a shorter season, which some in baseball prefer. That said, more doubleheaders could mean fewer games played in prime time (weekday evenings), which is when most people watch baseball on TV, and that would hurt TV ratings. However, MLB teams frequently play the last game of a series during the day in order to give the visiting team time to travel to its next series. MLB should consider replacing weekday day games with a day off and moving those games to Saturdays as scheduled doubleheaders. TV networks would have an incentive to agree because they would likely get higher ratings for weekend games than for weekday day games, and players would probably enjoy more days off during the season. If MLB is able to get TV networks and players to buy in, it would be able to increase attendance, which would lead to more exposure and revenue.

The code used for this analysis can be accessed here: https://github.com/pfmccull/MLB-Attendance/tree/master/Modeling

