Tuesday, February 8, 2022

Best First Wordle Guess


In Wordle there are 2,315 words that are valid answers to the puzzle and 10,657 words that are valid guesses but not potential answers. As a result, there are 12,972 valid guess words. The most common letter in the set of 2,315 answer words is 'e', which occurs in 1,056 of the answer words, or about 46% of potential answers. Unsurprisingly, half of the ten most common letters are vowels, while 'r' is by far the most common consonant.

The most common first letter is 's', which is the first letter in 366 of the words. 'a' is the most common second and third letter, with counts of 304 and 307, respectively. Likewise, 'e' is the most common fourth and fifth letter, occurring as the fourth letter 318 times and as the fifth letter 424 times. The only letters that rank in the top five by position but are not among the ten most frequent letters are in the first position ('c', 'b', 'p') and in the fifth position ('y').
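The counts above can be reproduced with a few lines of Python. This is a minimal sketch, not the code behind the post: the function name `letter_stats` and the five-word sample list are made up for illustration, and the real input would be the full list of 2,315 answer words.

```python
from collections import Counter

def letter_stats(words):
    """Return (number of words containing each letter, letter counts per position)."""
    containing = Counter()
    by_position = [Counter() for _ in range(5)]
    for word in words:
        # set() counts a letter once per word, even if it repeats (e.g. 'eerie')
        containing.update(set(word))
        for i, letter in enumerate(word):
            by_position[i][letter] += 1
    return containing, by_position

# Tiny illustrative sample, NOT the real answer list.
sample = ["stare", "crane", "slate", "about", "eerie"]
containing, by_position = letter_stats(sample)

print(containing.most_common(3))      # most common letters overall
print(by_position[0].most_common(1))  # most common first letter
```

Counting each letter once per word matches the way the post reports 'e' as occurring "in 1,056 of the answer words" rather than 1,056 times in total.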


If the goal of the first guess is to get as many matches as possible, both letters contained in the word (yellow tiles) and letters in the correct location (green tiles), then it would appear that a word containing 'e', 'a', 'r', 'o', and 't' would be best, potentially with an 'o' in the second position, an 'a' in the third position, and an 'e' in the fifth position. However, merely maximizing matches on the first turn is not necessarily the best strategy for solving the puzzle in the fewest turns.

The most accurate way to determine the best first guess would be to play the game to completion with each answer word and each valid guess word. However, this is extremely difficult computationally: iterating over each of the 2,315 answer words and 12,972 valid guess words for the first guess alone results in 30,030,180 combinations. Iterating over a second guess would result in roughly 390 billion possible permutations. The table for the 30 million combinations is below, which shows that 'roate' is the first guess that eliminates the most possible answers after one guess. On average, the number of potential answers is reduced from 2,315 to 60.4, a 97% reduction. 'roate' is a valid guess word, but it is not a potential answer. If one wished to use a valid answer word for the first guess, then 'raise' is the best choice: a first guess of 'raise' leaves 61.0 words remaining on average.
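The one-guess calculation can be sketched as follows. This is an illustration under assumptions rather than the code used for the table: the function names are hypothetical, and it assumes that "average words remaining" means the expected size of the set of answers consistent with the feedback, averaged over every possible answer.

```python
from collections import Counter

def feedback(guess, answer):
    """Wordle-style feedback: 'g' = green, 'y' = yellow, '-' = gray."""
    result = ["-"] * len(guess)
    unmatched = Counter()
    # First pass: mark greens and tally answer letters that were not matched.
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            result[i] = "g"
        else:
            unmatched[a] += 1
    # Second pass: mark yellows, using each unmatched answer letter at most once.
    for i, g in enumerate(guess):
        if result[i] == "-" and unmatched[g] > 0:
            result[i] = "y"
            unmatched[g] -= 1
    return "".join(result)

def average_remaining(guess, answers):
    """Expected number of answers still possible after playing `guess` first."""
    buckets = Counter(feedback(guess, answer) for answer in answers)
    # Each of the n answers in a bucket leaves n candidates, hence the n * n.
    return sum(n * n for n in buckets.values()) / len(answers)

def best_first_guess(guesses, answers):
    """Guess with the fewest expected remaining answers."""
    return min(guesses, key=lambda g: average_remaining(g, answers))

# Tiny illustrative lists, NOT the real Wordle word lists.
answers = ["stare", "crane", "slate"]
print(best_first_guess(["zzzzz", "stare"], answers))
```

Run against the full lists, `best_first_guess` performs the 30,030,180 feedback evaluations described above, which is slow in pure Python but feasible.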


The ten best first guess words all contain the most frequent letters without repeating any letter, suggesting that choosing words based on letter frequency is important. Likewise, of the two words that contain each of the five most common letters, 'roate' and 'orate', 'roate' results in fewer remaining words. This would indicate that letter positioning is important as well, given that 'o' is the second most common second letter and 'r' is the third most common second letter. Additionally, 'r' is the 11th most common first letter whereas 'o' is the 17th most common first letter.

Thursday, March 5, 2020

Predicting MLB Attendance: Modeling

The previous post explored the effect of certain factors on attendance; however, utilizing models to predict attendance can yield insight as to which factors are most important when it comes to the attendance of a given game in a more precise way than a simple exploration can. Using five different model types and then examining the best model will reveal the importance of each factor in attendance. The five models used were a linear model, a generalized linear model (GLM), an extreme gradient boosting (XGBoost) linear model, a random forest model, and an XGBoost tree model. The linear model is the simplest model and the easiest to interpret, yet it is not as accurate as the more complex models and some of the base assumptions of the linear model are not upheld with this data. The GLM and the XGBoost Linear models are more complex versions of the linear model, while the random forest and XGBoost tree models use an ensemble of decision trees to generate predictions. The code used to tune and find the optimal parameters for each model can be found here.

The performance of each model was evaluated using the Root Mean Square Error (RMSE), which is the square root of the mean of the squared model errors, or residuals. The RMSE is preferred because it penalizes larger errors more heavily than smaller ones. The two XGBoost models performed the best, with the tree model outperforming the linear model.
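As a small illustration of why RMSE punishes large misses, the sketch below compares two sets of predictions with the same total absolute error. The attendance figures are made up for the example and are not taken from the models in this post.

```python
import math

def rmse(actual, predicted):
    """Root mean square error: sqrt of the mean squared residual."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical attendance figures for four games.
actual = [30_000, 30_000, 30_000, 30_000]
evenly_off = [31_000, 31_000, 31_000, 31_000]  # four misses of 1,000
one_big_miss = [30_000, 30_000, 30_000, 34_000]  # one miss of 4,000

print(rmse(actual, evenly_off))
print(rmse(actual, one_big_miss))
```

Both prediction sets are off by 4,000 fans in total, yet the set with a single large miss has the higher RMSE.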



While the XGBoost Tree model performed best according to the RMSE metric, it tended to underestimate attendance for games where it predicted lower attendance and to overestimate attendance where it predicted higher attendance. Overall, though, it does a fairly good job.


The two teams that played in each game were by far the most important variables for the XGBoost Tree model. Combined, they accounted for approximately 97% of the importance of the variables. However, if MLB would like to increase attendance they cannot simply make the most popular teams play each other more often, so the focus should be on factors that MLB can control in order to increase attendance.


The Series Number and Day of Week were the two most important variables besides the teams that played in the game. The Series Number variable was used as a proxy for the time of year that the game was played, while the Day of Week variable is simply the day of the week that the game was played. These two variables are clearly the most important factors that MLB can actually control in order to increase attendance.



Average attendance is approximately 5,000 higher from late June to mid-August than during the rest of the season; however, attendance does seem to pick up towards the end of the season as well. The spike at the beginning of the season is largely a result of opening day, and given that a team can only have one opening day, there is not much MLB can do in that sense.


Attendance is noticeably higher on weekends than it is during the week. Saturdays see the highest average attendance, followed closely by Fridays and Sundays. Saturdays have an average attendance approximately 7,000 higher than weekdays, and Fridays and Sundays have an average attendance roughly 5,000 higher than weekdays.


Given that attendance is higher on weekends and from late June to mid-August, MLB should consider scheduling more games on weekends and during the summer. While almost all of these days already have games scheduled, MLB should consider adding doubleheaders (two games on one day) on Saturdays and Sundays during the summer. MLB has not had a scheduled doubleheader since 2011, yet doubleheaders on summer weekends could lead to an increase in attendance and a shorter season, which many in baseball would prefer. That said, an increase in doubleheaders could lead to fewer games played in prime time (weekday evenings), which is when most people watch baseball on TV, and would hurt TV ratings. However, MLB teams frequently play the last game of a series during the day in order to allow the visiting team time to travel to their next series. MLB should consider replacing weekday day games with a day off and moving those games to Saturdays for scheduled doubleheaders. TV networks would be incentivized because they would likely get higher ratings for weekend games than for weekday day games, and players would probably enjoy more days off during the season. If MLB is able to get TV networks and players to buy in, it would be able to increase attendance, which would lead to more exposure and revenue.

The code used for this analysis can be accessed here: https://github.com/pfmccull/MLB-Attendance/tree/master/Modeling

Predicting MLB Attendance: Exploration

Attendance at Major League Baseball games has decreased for 12 straight years. Theories for the decline are as numerous as they are varied, such as an increase in game time, an aging fan base, less competition among teams, higher costs associated with attending games, changes in gameplay, more entertainment options, and even the weather. Building a model to predict attendance can give insights into what actually matters when it comes to attendance.

Game day attendance figures from the past ten MLB seasons (2010-2019) provide a wealth of data to examine the importance of individual factors in a given game's attendance. Theoretically, there should be 24,300 (81 home games, 30 teams, 10 years) individual regular season games to work with over that time span, plus a few one-game playoffs. However, for predictive purposes, rescheduled games, one-game playoffs, games moved due to one-off circumstances (such as hurricanes, G20 Summits, riots, etc.), and games held in special venues (College World Series, Little League World Series, Mexico, Puerto Rico, Japan, etc.) will be excluded, reducing the number of games to examine to 23,814.

To get a general idea of how attendance is influenced by different variables, exploration will be limited to the team with the largest ballpark seating capacity: the Los Angeles Dodgers. Dodger Stadium has a maximum seating capacity of 56,000. The average attendance at Dodger Stadium has been relatively consistent since 2013, which is the year after their notoriously bad owner, Frank McCourt, sold the team. Using the seasons from 2013-2019 should remove confounding variables that might be present for other teams, as the Dodgers have won the division in each of those seasons and have not had any major changes to their payroll, stadium, or ownership situations over that time period.

It is not surprising to see that the Dodgers have a higher average attendance on weekends than they do on weekdays. With an average attendance of 50,028, Saturdays are by far the most popular day of the week to attend games, drawing 3,214 more fans than the average game. Monday and Wednesday games see the largest drops in attendance, with decreases of 2,963 and 2,036 relative to average, respectively. However, it does seem surprising that Tuesdays do not see a more substantial decrease in attendance.



Attendance appears to be greatest when the temperature is just below 80 °F, and it tends to decrease when the temperature is in the 60s or in the high 80s or above. Perhaps temperature is a proxy for time of year, given that April through June are usually the colder months of the baseball season in Los Angeles, and individuals may be less likely to go to games when the season is still young and families still have children in school.



There appears to be high interest in games at the very beginning of the season followed by a quick drop. This may be caused by interest in opening day, as the first home game of the season has an average attendance of 53,459, while all other games have an average attendance of 46,775. Attendance increases in the beginning and middle of June, which would correspond with the school year ending. There appears to be a modest drop in attendance beginning in the middle of August, which is when the school year begins again; however, this drop-off is not as substantial as the increase in June, so perhaps weather and the time of year are both factors in attendance.

It does appear that both temperature and time of year impact attendance. The increase in attendance in June occurs before the increase in temperature. Likewise, the attendance decreases once the school year begins while temperature stays relatively high; however, the attendance is still higher than it is at the beginning of the season.


It goes without saying that certain teams have larger fan bases and more of a nationwide following. It is not surprising that the New York Yankees produce the highest change in attendance when they are the away team, but it is surprising how much more attendance increases when they are the away team compared to the second most popular team: the Boston Red Sox. On average, when the Yankees were the away team the attendance of the game increased by 21%, while the Red Sox only saw an increase of 11%. It is surprising to see that only a third (ten) of the teams saw an increase in attendance as the away team, and it would be fair to say that popular teams are more of a draw than less popular teams are a deterrent. Every team except the Angels that saw an increase in attendance as the road team has played in the World Series since 2009. That said, it does not necessarily mean that team success is indicative of higher attendance for away games, as teams with larger followings naturally have more resources.

The Atlanta Braves rank thirtieth out of the thirty teams in change in attendance as the road team, with an average change of -1.5%. They started the decade as a good team and made the playoffs in 2010, 2012, and 2013. In the middle of the decade they "tanked" and posted losing records in four straight seasons, and to close out the decade they won their division in 2018 and 2019. As a result of their varying success this decade, they might be worth examining to get an idea of the effect of team quality on road attendance. As the road team they had a positive effect on attendance through 2014, even though they posted a losing record that year. For the remaining five years of the decade they had a negative effect on attendance. It appears that there may be a lagging effect of team record on attendance, as in 2018 they posted a .556 winning percentage, yet had their largest negative effect on attendance at -12.7%. Overall, it does appear that recent success has some impact on attendance as the road team.

While this exploration is not conclusive as to the exact factors of attendance at Major League Baseball games, it does provide a good starting point before creating any models to predict attendance. It appears that the day of the week, temperature, time of year, and the away team all have an impact on the attendance of any given game. 


The code used for this analysis can be accessed here: https://github.com/pfmccull/MLB-Attendance/tree/master/Exploration

Monday, February 10, 2020

What Should XFL Teams do on Point(s) After Touchdown Attempts?

The new football league, the XFL, started play this past weekend. In an effort to create a more exciting style of play compared to the NFL, the XFL has implemented some unique rules. One of the more notable rules is that after a touchdown, the team that scored must run an offensive play from the 2, 5, or 10-yard line, and a successful play is worth 1, 2, or 3 points, respectively. What is the optimal strategy for the point(s) after attempts? Using available NFL play-by-play data from 2013 to 2019, the value of attempts from the 2, 5, and 10-yard lines can be estimated. XFL players certainly are not as talented as NFL players; however, the NFL serves as a better proxy for these values than NCAA football would, due to the larger talent differential between the XFL and the NCAA.

Although the majority of two-point conversion attempts in the NFL occur at the 1 or 2-yard line, fourth and goal attempts can be used as a stand-in for attempts at other distances. All in, there were 1,025 fourth and goal or two-point conversion attempts in the seven seasons spanning 2013-2019, 1,018 of which were from the 15-yard line or closer. Conversions from the 2-yard line succeed at a rate of roughly 50%, which in the XFL would have an expected value of half a point.



It gets trickier to estimate the success rate from the 5 and 10-yard lines due to the significantly smaller number of attempts. Given that a 5-yard conversion is worth two points, the 5-yard conversion rate would only have to be higher than 25% to yield more expected points than the 2-yard conversion. Considering that the rates at the 4, 5, 6, and 7-yard lines are all over 25% and that the combined success rate between 4 and 7 yards is 39%, going for the two-point attempt from the 5-yard line appears to be a superior strategy to going for the one-point attempt at the 2-yard line. Attempting the two-point conversion would yield 0.78 points on average, compared to 0.5 points for the one-point conversion.

Estimating the true success rate for the three-point conversion from the 10-yard line is significantly more difficult given the much lower number of attempts. In order for the three-point conversion to be worth more points than the two-point conversion, it must have a success rate greater than 26%, which would yield 0.78 expected points. Using the 33% success rate at the 10-yard line would not be an optimal way to infer the overall success rate, given that there were only six attempts. There were twelve combined attempts from the 9 to 11-yard lines, which were converted at a combined rate of 25%. Likewise, using the local polynomial regression line (a generalization of moving average and polynomial regression) in the plot below, the estimated success rate is a bit under 25%. All things being equal, it would appear that the optimal strategy for point(s) after attempts in the XFL is to attempt the two-point conversion from the 5-yard line.
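The arithmetic behind these comparisons is simply success rate times point value. The sketch below uses the rates quoted in the text (50% from the 2, 39% for the pooled 4 to 7-yard attempts, and 25% for the pooled 9 to 11-yard attempts); the function names are made up for illustration.

```python
def expected_points(success_rate, points):
    """Expected points of an attempt worth `points` when it succeeds."""
    return success_rate * points

def break_even_rate(points, target_expected_points):
    """Success rate an attempt worth `points` needs to match a target value."""
    return target_expected_points / points

one_point = expected_points(0.50, 1)    # from the 2-yard line
two_point = expected_points(0.39, 2)    # pooled 4 to 7-yard rate as a stand-in for the 5
three_point = expected_points(0.25, 3)  # pooled 9 to 11-yard rate as a stand-in for the 10

print(one_point, two_point, three_point)
print(break_even_rate(2, one_point))    # rate the 5-yard try needs to beat the 2-yard try
print(break_even_rate(3, two_point))    # rate the 10-yard try needs to beat the 5-yard try
```

With these inputs the two-point try has the highest expected value, and the three-point try falls just short of its 26% break-even rate.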



One obvious thought that comes to mind when looking at NFL teams that attempted a two-point conversion or went for it on fourth and goal is that the team may have been forced to because they were losing. If a team was losing, then that team is more likely to be inferior to the team that they are playing. To see if there was a selection bias in the plays included in the sample, the efficiency rating for each offense and defense in the sample can be used. The efficiency rating, or DVOA, is calculated by Football Outsiders to determine the quality of each offense, defense, and team overall. A positive difference in DVOA means that the offense was better and a negative difference means that the defense was better, so a 0.98% difference in DVOA means that the offense was 0.98% better than the defense in that play. It is not a surprise that teams that had attempts at the one or two-yard line had a positive differential, because these teams might go for it when they are not forced to because of confidence in their offense. Likewise, teams that have attempts of greater yardage tend to be worse, likely as a result of the team being more likely to be in a desperate situation. While the difference in DVOA does not easily allow for any sort of handicapping of success rate, it can be used to give context to the success rate.


The success rates and expected point totals at the 2, 5, and 10-yard lines indicate that XFL teams should avoid attempting one-point conversions from the 2-yard line and should instead attempt either two-point conversions from the 5-yard line or three-point conversions from the 10-yard line. Although the true success rates are not known, it is likely that two-point conversion attempts from the 5-yard line will produce the highest number of expected points in a given situation in the XFL.




The code and files used for this analysis can be accessed here: https://github.com/pfmccull/Football-Point-s-After-Attempts

Saturday, February 8, 2020

Anticipating Chicago's Demographic Trends

The narrative surrounding the 2020 US Census is that the Midwest and Northeast will experience a decline in population relative to the Sun Belt and the West. Chicago, and the Chicagoland area more generally, is often cited as a Midwestern city that will experience a relative population decline in the 2020 Census, affecting the state of Illinois' representation in the federal government. Estimates provided by the US Census Bureau through the American Community Survey (ACS) can provide insight into the population change in Chicago. The 2018 5-year ACS provides population estimates at varying levels in the United States over the 2014-2018 time period. The ACS provides data below the county level, whereas the Bureau's most detailed published estimates for each individual year stop at the county level. However, the ACS does not contain a fixed date for its estimates; rather, the estimates fall within the 5-year span. That said, the ACS can provide insights within a smaller geographic footprint by estimating data at the tract level rather than limiting the data to the county level. The Census Bureau defines tracts as small subdivisions within each county that have between 1,200 and 8,000 inhabitants, averaging 4,000 inhabitants per tract.

The ACS does not provide data at a city level per se, but rather in tracts that are separated by county. To use the ACS tract-level data, the Cook County tracts must be limited to Chicago itself. As shown below, this can be done by using spatial vector functions in R. This does not allow the data to be limited strictly to Chicago, as some Cook County tracts straddle both Chicago and other areas that lie within the county, but for this analysis it will suffice.


The 801 census tracts that make up Chicago are too overwhelming to make sense of when discussing subsets of Chicago. Using data available on The Chicago Data Portal, the tracts can be combined into defined Community Areas with further use of spatial functions in R. The 77 Community Areas are much easier to understand visually than the 801 tract groups. The US Census Bureau's published boundaries do not exactly overlap with the city or its Community areas as shown below, but again, it is appropriate for analysis. 




Most of the population growth occurred in and around the Loop as well as the northern part of the city, while most of the population decreases occurred in the southern and western parts of the city. With the previously stated caveats regarding city boundaries and the fact that the 2018 5-year ACS figures are estimates, the overall population of the city increased slightly, by 0.8%.






Four community areas declined by roughly 20%, whereas the rest of the community areas with population decreases contributed more modestly. Only one community area increased by more than 20%, which was the Loop at 28.6%; however, the change in population in community areas with an increased population was spread more evenly than in areas with population decreases.





Restricting the plot to only show areas with population changes of greater than ±10% shows that the most significant community area population decreases were indeed in the South Side (Burnside, Fuller Park) or the Southwest Side (Englewood, West Englewood). The areas with the most significant population growth were in the Central (Loop, Near North, Near South), adjacent to the Central (Douglas, Near West Side, Oakland), the North Side (North Center), or on the very edges of the city (Clearing, Riverdale). A breakdown of which community areas fall within each side can be viewed here.


The North Side and Central areas that have experienced the most growth also tend to have higher per capita incomes. The Near North Side ($96,661), Near South Side ($85,146), Loop ($79,779), and North Center ($70,550) areas had four of the five highest per capita incomes in Chicago and are well above Chicago's per capita income of $34,750. These areas also had population increases of greater than 10%. Higher per capita incomes suggest that many of the people who live in these areas are young professionals, given that children and the elderly contribute less to per capita income as they are less likely to be employed.
The opposite cannot be said for the areas with the lowest per capita income. Curiously, Riverdale had the lowest per capita income at $12,215 even with its 12% increase in population. However, Englewood ($14,410), New City ($14,410), West Englewood ($15,865), and Fuller Park ($16,134) had the 7th, 8th, 11th, and 12th lowest per capita incomes, respectively. The fact that these areas had lower per capita incomes does support the idea that areas with increasing populations tend to have a higher per capita income and areas with decreasing populations tend to have a lower per capita income.


Many of the community areas have per capita incomes that are on the lower end of the distribution. The areas with higher per capita incomes are the ones that stand out, and those areas do tend to have higher growth. The trendline is significantly influenced by those areas with higher per capita incomes. It would be safe to say that the areas with higher per capita income tend to have a population increase, as all but one of the community areas with an above-average per capita income had an increase in population.


It appears that the community areas that have a higher increase in population and a higher per capita income tend to have a higher proportion of people who identify as White. At the very least, the areas that had the largest population decreases tend to have a much lower percentage of White individuals. 

The community areas that had the largest decreases in population, those in the Southwest and South Sides, tend to have a much higher proportion of Black or African American individuals than the rest of the city. Riverdale is a notable exception, in that it had a large increase in population (12%), yet it has a very high proportion of Black or African American individuals at 96%. Sadly, Riverdale also stood out for having a lower per capita income despite its increase in population. Likewise, Oakland (92%) and Douglas (68%) have a high proportion of Black or African American individuals and high increases in population, but lower per capita incomes.


There do not appear to be any community areas with a high proportion of Hispanic or Latino individuals that saw a large decrease in population, as the majority of Hispanics or Latinos are located in the western parts of the city. Most areas with a large proportion of Hispanic or Latino individuals saw either moderate increases or decreases in population; however, the Clearing community area is 57.7% Hispanic or Latino and had a population increase of 11.2%. Although Clearing did not see the largest increase in population, nor is it one of the areas with the highest proportion of Hispanic or Latino individuals, it is the most notable combination of both. Likewise, two other areas with large Hispanic or Latino populations that had sizable population increases were West Elsdon (6.8%) and East Side (5.2%), with 81.4% and 81.7% of individuals identifying as Hispanic or Latino, respectively.


Given that the community areas of Chicago that have seen an increase in population tend to be more White and have a higher per capita income, it would not be a surprise to see that the city is both more White and has a higher per capita income in the 2020 US Census. The areas that have seen a decrease in population are predominately Black or African American indicating that the relative Black or African American population of Chicago will also decrease in the next census. Additionally, there may be an uptick in the relative population of Hispanics or Latinos as many areas with a large proportion of Hispanics or Latinos saw modest increases in their populations.




The code used for this analysis can be found here.

Monday, February 3, 2020

Examining California Population Changes

There has been speculation that California may lose a congressional seat once the 2020 Census is completed. The data provided by the US Census Bureau can be used to analyze the extent of population changes throughout the state by using data from the 2010 US Census and the Bureau's published 2018 population estimates over the approximately 8-year span.1 The published estimates are limited as to the data provided, and the smallest geographic area provided is at the county level. The American Community Survey (ACS) provides more granular estimates with respect to both data and geographic boundaries; however, the ACS estimates are not pinpointed to a certain time, but fall within a range of one or five years. The 1-year ACS estimates are not provided for all counties because some do not meet the required population threshold. The 5-year ACS estimates would provide more detailed data, but the uncertainty within the time interval would result in less certain conclusions. For the sake of this analysis, more recent results at the county level are more appropriate to get an idea of the population trends at a statewide level.

The population in the state increased by 5.8% over the time span, but the population did not change uniformly. The population has declined in the northern and eastern parts of the state; Lassen County had the largest population decline at 11.7%. The population in the southern and western parts of the state, especially around the Bay Area, has grown. Four counties saw population growth over 10%: Placer (12.8%), Riverside (11.9%), San Benito (11.3%), and Alameda (10.4%).


The counties that saw the largest population declines tend to have higher median ages as well; this would suggest that the counties that have had a population decline will continue to experience a decrease in population. The map of median age by county looks almost like a mirror image of the map of population change, with the counties with population growth skewing younger and vice versa. Sierra County had the oldest median age at 55 years, while Yolo, Tulare, and Merced Counties were the youngest at 31 years.



The population of California got older from 2010 to 2018, with the median age increasing from 35.2 years to 37.0 years. This increase is not surprising, as the median age of the United States population increased by 0.8 years over the same time span, from 37.2 years to 38 years. As the map below shows, there is no obvious trend in the change in median age relative to population change. That said, the northeastern part of the state had the largest increase in median age. Only two of the fifty-eight counties got younger over the time span: San Francisco by 0.5 years (38.5 to 38) and Butte by 0.2 years (37.2 to 37). Alpine County had the largest increase in median age at 4.3 years (46.7 to 51).



1 The 2010 Census is for 4/1/2010, whereas the 2018 estimates are for 7/1/2018.

The code used to pull the data and create the visualizations can be accessed at https://github.com/pfmccull/California-Population-Changes

Wednesday, March 13, 2019

Modeling Arbitration Contracts: Batters

Please read the introduction to understand the purpose and background of the salary arbitration process in Major League Baseball.

There have been 650 batters who exited the arbitration process with one-year contracts between the 2012 and 2019 seasons. As shown below, most players earn below $5 million and the more service time that a player has, the more they are likely to earn.


And the distribution of arbitration raises: 

There are 7 cases where a player’s salary decreased through the arbitration process, which constitutes just over 1% of the sample of batters. None of these cases required an arbitration hearing, which suggests that these players would have been non-tendered (effectively released) had they not agreed to a reduction in salary. We can see that the vast majority of batters received raises of less than $2.5M. Both the distribution of raises and of contracts are right-skewed due to the fact that prior to arbitration most players earn either the minimum salary or close to it. Likewise, only truly exceptional players receive very large contracts through the arbitration process. The fact that the data is skewed means that a linear model may not be the best model for prediction; however, we will use a linear model for this analysis because it is one of the most interpretable models that can be used.

To start, a first model is fitted to predict the absolute dollar raise for each player. In addition to traditional statistics, the model also attempts to account for exceptional players by including the number of awards that a player has won, as defined by the Lahman Database. These awards include MVP awards, Cy Young awards, World Series/NLCS/ALCS MVP awards, Gold Glove awards, Silver Slugger awards, and All-Star starts.
The initial model includes as many traditional stats as possible: PA, AB, R, HR, RBI, SB, AVG, SLG, OBP, and OPS. Then, by using the Akaike Information Criterion (AIC) to select the variables that best balance fit and simplicity, the predictors of the model are reduced to total awards, PA, R, HR, RBI, and OBP. The model yields a coefficient of determination (R2) of .72, indicating that 72% of the variance in the salary raise is accounted for by the included variables.
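As a rough illustration of what the R2 statistic measures, the sketch below computes it as one minus the ratio of the residual sum of squares to the total sum of squares. The function name and the dollar figures are made up for the example and are not taken from the arbitration data.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    # Squared error left over after the model's predictions.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Squared error of simply predicting the mean for everyone.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical raises in dollars, for illustration only.
actual_raises = [500_000, 1_200_000, 2_000_000, 4_500_000]
predicted_raises = [700_000, 1_000_000, 2_300_000, 3_800_000]

print(round(r_squared(actual_raises, predicted_raises), 3))
```

A value of 1 means the predictions match the actual raises exactly, while a value of 0 means the model does no better than predicting the average raise for every player.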

Batter Arbitration Absolute Raise
Term           Value
Intercept      -$642,584
Award Total    $626,283
PA             -$2,246
R              $29,028
HR             $34,340
RBI            $16,224
OBP            $2,092,643
R2             0.72

The linear model fits the form:
predicted raise = β_0 + x_awards·β_awards + x_PA·β_PA + x_R·β_R + x_HR·β_HR + x_RBI·β_RBI + x_OBP·β_OBP
where β_0 is the intercept, x_awards is the number of awards that a given player has won, β_awards is the model slope of the awards variable (listed in the table above), x_PA is the number of plate appearances for a given player, β_PA is the model slope for plate appearances, and so on. This model indicates that for each additional PA, with all other variables held constant, a batter’s raise will decrease by $2,246; for each R his raise will increase by $29,028; and so on. Likewise, if a player does not accrue any playing time and thus has 0 for awards, PA, R, HR, RBI, and OBP, his salary will decrease by $642,584, as indicated by the intercept term.
All variable terms are positive except for PA, which seems correct, because the more PA that a batter has without increasing any of the other stats, the lower his performance is. Since OBP is the only rate statistic, this model would predict that a player with a small number of plate appearances but a high OBP receives a higher raise than he would likely get in the actual arbitration process. As a result, it would be better to have the OBP statistic interact with PA (OBP*PA) to create a value indicating the number of times that the player got on base, or ‘successful’ plate appearances.
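To make the fitted form concrete, here is a minimal sketch that plugs the coefficients from the first table into the linear equation. The function name and both stat lines are hypothetical and were invented for illustration; they do not describe real players.

```python
# Coefficients copied from the "Batter Arbitration Absolute Raise" table.
INTERCEPT = -642_584
COEFFICIENTS = {
    "awards": 626_283,
    "PA": -2_246,
    "R": 29_028,
    "HR": 34_340,
    "RBI": 16_224,
    "OBP": 2_092_643,
}


def predict_raise(stats):
    """Return the predicted raise in dollars; missing stats count as 0."""
    return INTERCEPT + sum(
        coef * stats.get(name, 0) for name, coef in COEFFICIENTS.items()
    )


# Hypothetical stat lines (assumptions, not real players).
regular = {"awards": 0, "PA": 600, "R": 80, "HR": 20, "RBI": 75, "OBP": 0.340}
part_timer = {"awards": 0, "PA": 50, "R": 5, "HR": 1, "RBI": 4, "OBP": 0.450}

print(f"regular:    ${predict_raise(regular):,.0f}")
print(f"part-timer: ${predict_raise(part_timer):,.0f}")
```

In this sketch the OBP term is worth the same amount per point of OBP whether it was earned over 50 or 600 plate appearances, which is the weakness of treating OBP as a standalone rate statistic.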

Batter Arbitration Absolute Raise with Times on Base
Term           Value
Intercept      -$45,879
Award Total    $559,735
PA             -$6,879
R              $18,898
HR             $42,913
RBI            $11,958
OBP*PA         $18,389
R2             0.73

This model is a slight improvement according to the R2 statistic, and the coefficients for the model variables are all still similar, except that the intercept has shrunk in magnitude by more than a factor of 10. Plate appearances that are unsuccessful, meaning that they do not end with the player getting on base, now carry a larger penalty of $6,879 and, assuming the number of plate appearances stays constant, each successful plate appearance increases a player’s raise by $18,389. This means that without any additional awards, R, HR, and RBI, a player must have an OBP over .374 (the break-even point, $6,879 / $18,389) for additional plate appearances to increase his raise. This may seem daunting given that the average OBP in 2018 was .318; however, for each successful plate appearance, the expected number of R increases by .378, the expected number of HR increases by .103, and the expected number of RBI increases by .354. These statistics are all inevitably correlated, because the more times a player gets on base, the more R, HR, and RBI he is expected to accumulate, so it might be best to remove some of them from the model. This problem is most obvious in the case of HR, because for each HR, the number of R increases by 1 and the number of RBI increases by at least 1. Both the PA and successful plate appearance statistics add significant multicollinearity to the model.
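The break-even OBP can be checked with a few lines of arithmetic. The sketch below uses the PA and OBP*PA coefficients from the table above; the function and variable names are assumptions made for the example.

```python
PA_COEF = -6_879       # dollars per plate appearance
ON_BASE_COEF = 18_389  # dollars per time on base (OBP*PA)


def marginal_value_per_pa(obp):
    """Expected change in predicted raise from one more PA at a given OBP,
    holding awards, R, HR, and RBI fixed."""
    return PA_COEF + ON_BASE_COEF * obp


# OBP at which one more plate appearance neither helps nor hurts.
break_even_obp = -PA_COEF / ON_BASE_COEF

print(f"break-even OBP: {break_even_obp:.3f}")  # prints 0.374
print(f"one PA at a .318 OBP: ${marginal_value_per_pa(0.318):,.0f}")
```

At the 2018 league-average OBP the value of an extra plate appearance comes out negative in this sketch, which holds only while R, HR, and RBI are kept fixed.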

Batter Arbitration Absolute Raise with No PA
Term           Value
Intercept      -$158,152
Award Total    $714,809
R              $19,069
HR             $47,226
RBI            $7,452
R2             0.70





The model without PA and successful plate appearances does not have significant multicollinearity, and its R2 is only slightly lower, which would imply that this model is more robust than the previous model. The remaining variable slopes, other than RBI, have larger values, but are not much larger. The intercept term is much larger in magnitude (more negative), suggesting that the PA and successful plate appearance terms were primarily driving player salaries down.
Now, let’s see how the raises predicted by the model compare to the actual raises.


The actual raises are plotted on the x-axis and the predicted raises are plotted on the y-axis. The plotted line has a slope of one and marks where the predicted raise is equal to the actual raise. From the line, we can see that the model severely underestimates larger raises. Most raises are relatively small, which is likely affecting the ability of the model to predict larger raises. Since the raise is only part of what we are interested in, which is the final salary of the player, let’s see if a different model that includes previous salary does a better job of predicting a player’s raise.

Batter Arbitration Raise with Previous Salary
Term               Value
Intercept          -$191,900
Award Total        $709,500
R                  $19,050
HR                 $46,630
RBI                $6,621
Previous Salary    0.00395
R2                 0.70




There is a slight increase in the R2 with the new model, although it is not visible at two decimal places. The intercept and the slopes for all of the other predictors have decreased to compensate for the inclusion of the previous salary term. The previous salary term does not appear to add much to the new model: a coefficient of 0.00395 means that each additional $1 million of previous salary adds only about $3,950 to the predicted raise. Let’s see if the predicted values on the higher end of salaries are more appropriate.


The new model has a slightly better fit at higher raises, although the difference is barely noticeable on the plot. Given that the new model has a slightly better fit and a better AIC, and that previous salary does not have a strong correlation with any of the other predictor variables, it is best to keep the new model. Now, let’s check the assumptions of the linear model:


The first assumption is that there is constant error variance. This plot shows the predicted values (fitted values) plotted against the residuals (error between predicted and actual values). We can see that the model does not have constant error variance, as the errors tend to be smaller at the lower projected salaries and larger at the higher projected salaries. A plot that shows constant error variance will show a random scattering of points, roughly filling out the shape of a circle. However, this plot is cone-shaped which is a telltale sign of nonconstant error variance.
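One rough way to put a number on the cone shape is to compare the spread of the residuals among the smallest fitted values with the spread among the largest. The sketch below is only an informal check in the spirit of a Goldfeld-Quandt comparison, not a formal test, and it runs on synthetic data generated for the example rather than on the arbitration model.

```python
from statistics import pstdev


def spread_ratio(fitted, residuals):
    """Ratio of residual spread in the top third of fitted values
    to the spread in the bottom third."""
    pairs = sorted(zip(fitted, residuals))  # order residuals by fitted value
    third = len(pairs) // 3
    low = [r for _, r in pairs[:third]]
    high = [r for _, r in pairs[-third:]]
    return pstdev(high) / pstdev(low)


# Synthetic cone-shaped example: residual size grows with the fitted value.
fitted = [float(i) for i in range(1, 31)]
cone_residuals = [f * (0.1 if i % 2 == 0 else -0.1) for i, f in enumerate(fitted)]

# Synthetic constant-variance example for comparison.
flat_residuals = [0.5 if i % 2 == 0 else -0.5 for i in range(30)]

print(spread_ratio(fitted, cone_residuals))
print(spread_ratio(fitted, flat_residuals))
```

A ratio near 1 is consistent with constant error variance, while a ratio well above 1 matches the cone shape described above.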


Another assumption of the linear model is that the residuals are normally distributed, which we can easily check by looking at a Q-Q plot. The points on a Q-Q plot should fall along a straight line if the residuals are normally distributed. Here the tails curve away from the line, indicating that the assumption of normally distributed residuals is not upheld. Likewise, we can see that the residuals do not follow a normal distribution by plotting a histogram of them. The residuals are strongly right skewed, which is not a surprise given that the model underestimates the raises of players whose actual raises are higher.
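The skew seen in the histogram can also be summarized as a single number. This sketch computes the moment-based sample skewness, which is 0 for a symmetric sample and positive for a right-skewed one. The residual values are made up for the example and are not the model’s actual residuals.

```python
def sample_skewness(values):
    """Moment-based skewness: third central moment divided by the
    second central moment raised to the 1.5 power."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical residuals (in $ millions) with one large underestimate.
residuals = [-1.0, -1.0, -1.0, 0.0, 0.0, 1.0, 6.0]

print(sample_skewness(residuals))
```

A clearly positive value would agree with the long right tail seen in the histogram.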


While the linear model is not the most accurate method to predict the raise of arbitration-eligible batters in this situation, it is the most interpretable model due to its simple nature. Other methods, such as a linear model fitted to a transformed value of the raise (for example, log(raise)) or a polynomial model, may prove to be a more accurate way to predict arbitration raises.









