Wednesday, March 13, 2019

Modeling Arbitration Contracts: Batters

Please read the introduction to understand the purpose and background of the salary arbitration process in Major League Baseball.

There have been 650 batters who exited the arbitration process with one-year contracts between the 2012 and 2019 seasons. As shown below, most players earn below $5 million and the more service time that a player has, the more they are likely to earn.


And the distribution of arbitration raises: 

There are 7 cases where a player’s salary decreased through the arbitration process, which constitutes just over 1% of the sample of batters. None of these cases required an arbitration hearing, which suggests that these players would have been non-tendered (effectively released) had they not agreed to a reduction in salary. We can see that the vast majority of batters received raises of less than $2.5M. Both the distribution of raises and the distribution of contracts are right skewed because, prior to arbitration, most players earn either the minimum salary or close to it, while only truly exceptional players receive very large contracts through the arbitration process. Because the data is skewed, a linear model may not be the best model for prediction; however, we will use a linear model for this analysis because it is one of the most interpretable models available.

To start, the first model is fitted to predict the absolute dollar raise for each player. In addition to traditional statistics, the model also attempts to account for exceptional players by including the number of awards that a player has won, as defined by the Lahman Database. These awards include MVP awards, Cy Young awards, World Series/NLCS/ALCS MVP awards, Gold Glove awards, Silver Slugger awards, and All-Star starters.
The initial model includes as many traditional stats as possible: PA, AB, R, HR, RBI, SB, AVG, SLG, OBP, and OPS. Then, by using the Akaike Information Criterion (AIC) to select variables for the best model in terms of fit and simplicity, the predictors of the model are reduced to total awards, PA, AB, R, HR, RBI, AVG, SLG, and OBP. The model yields a coefficient of determination (R²) of .72, indicating that 72% of the variance in the salary raise is accounted for by the included variables.
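To make the AIC trade-off concrete, here is a minimal sketch of how two least-squares fits could be compared. The formula used is the standard Gaussian AIC up to an additive constant; the residual sums of squares and parameter counts are hypothetical numbers chosen for illustration and do not come from the analysis above. Only the sample size of 650 is taken from the post.

```python
import math

def gaussian_aic(n_obs, rss, n_params):
    """AIC of a least-squares fit, up to a constant shared by models fit to the same data."""
    return n_obs * math.log(rss / n_obs) + 2 * n_params

N_BATTERS = 650  # sample size quoted in the post

# Hypothetical fits: the reduced model drops two predictors and fits slightly worse.
full_aic = gaussian_aic(N_BATTERS, rss=1.000e15, n_params=12)
reduced_aic = gaussian_aic(N_BATTERS, rss=1.002e15, n_params=10)

print(f"full: {full_aic:.1f}  reduced: {reduced_aic:.1f}")
print("prefer reduced model" if reduced_aic < full_aic else "prefer full model")
```

With these made-up inputs the reduced model has the lower AIC: the penalty saved by dropping two parameters outweighs the small loss of fit, which is the kind of trade-off AIC-based selection is meant to capture.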

Batter Arbitration Absolute Raise

Term           Coefficient
Intercept      -$642,584
Award Total    $626,283
PA             -$2,246
R              $29,028
HR             $34,340
RBI            $16,224
OBP            $2,092,643
R²             0.72

The linear model fits the form:
\[ \text{predicted raise} = \beta_0 + x_{\text{awards}}\beta_{\text{awards}} + x_{\text{PA}}\beta_{\text{PA}} + x_{\text{R}}\beta_{\text{R}} + x_{\text{HR}}\beta_{\text{HR}} + x_{\text{RBI}}\beta_{\text{RBI}} + x_{\text{OBP}}\beta_{\text{OBP}} \]
Here \(\beta_0\) is the intercept, \(x_{\text{awards}}\) is the number of awards that a given player has won, \(\beta_{\text{awards}}\) is the model slope of the awards variable (listed in the table above), \(x_{\text{PA}}\) is the number of plate appearances for a given player, \(\beta_{\text{PA}}\) is the model slope for plate appearances, and so on. This model indicates that for each additional PA a batter has, with all other variables held constant, his raise will decrease by $2,246; for each additional R his raise will increase by $29,028; and so on. Likewise, if a player does not accrue any playing time and thus has 0 for awards, PA, R, HR, RBI, and OBP, his salary will decrease by $642,584, as indicated by the intercept term.
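As a worked example of the formula, the sketch below plugs the coefficients from the first table into a small function. The stat line for the player is hypothetical and chosen only for illustration; the function and variable names are not from the original analysis.

```python
# Coefficients from the "Batter Arbitration Absolute Raise" table.
INTERCEPT = -642_584
COEFFICIENTS = {
    "awards": 626_283,
    "PA": -2_246,
    "R": 29_028,
    "HR": 34_340,
    "RBI": 16_224,
    "OBP": 2_092_643,
}

def predicted_raise(stats):
    """Intercept plus the sum of each stat multiplied by its slope."""
    return INTERCEPT + sum(COEFFICIENTS[name] * stats[name] for name in COEFFICIENTS)

# Hypothetical season: one award, 600 PA, 80 R, 20 HR, 75 RBI, .340 OBP.
player = {"awards": 1, "PA": 600, "R": 80, "HR": 20, "RBI": 75, "OBP": 0.340}
print(f"Predicted raise: ${predicted_raise(player):,.0f}")
```

For this made-up stat line the model predicts a raise of roughly $3.57M. Setting every stat to zero returns the intercept, matching the no-playing-time case described above.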
All variable terms are positive except for PA, which seems correct: the more PA a batter has without increasing any of the other stats, the lower his performance is. Since OBP is the only rate statistic, this model would predict that a player with a small number of plate appearances but a high OBP earns a higher raise than he would likely get in the actual arbitration process. As a result, it would be better to force the OBP statistic to interact with PA, creating a value that indicates the number of times the player got on base, or ‘successful’ plate appearances.

Batter Arbitration Absolute Raise with Times on Base

Term           Coefficient
Intercept      -$45,879
Award Total    $559,735
PA             -$6,879
R              $18,898
HR             $42,913
RBI            $11,958
OBP*PA         $18,389
R²             0.73

This model is a slight improvement according to the R² statistic, and the coefficients for the model variables are all still similar, except that the intercept has shrunk in magnitude by more than a factor of 10. Plate appearances that are unsuccessful, in that they do not end with the player getting on base, now carry a higher penalty of -$6,879 and, assuming the number of plate appearances stays constant, each successful plate appearance increases a player’s raise by $18,389. This means that without any additional awards, R, HR, or RBI, a player must have an OBP over .374 for additional plate appearances to increase his raise. This may seem daunting given that the average OBP in 2018 was .318; however, for each successful plate appearance, the expected number of R increases by .378, the expected number of HR increases by .103, and the expected number of RBI increases by .354. These statistics are all inevitably correlated, because the more times a player gets on base, the more R, HR, and RBI he is expected to accumulate, so it might be best to remove some of them from the model. This problem is most obvious in the case of HR, because each HR increases the number of R by 1 and the number of RBI by at least 1. Both the PA and successful plate appearance statistics add significant multicollinearity to the model.
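The .374 threshold follows from setting the net value of an extra plate appearance to zero: the PA penalty must be offset by OBP times the value of a time on base. The sketch below reproduces that arithmetic. It also folds in the expected R, HR, and RBI per time on base quoted above, which is an illustrative extension that takes those averages at face value rather than a calculation from the original analysis.

```python
# Slopes from the "with Times on Base" table.
PA_SLOPE = -6_879
TIMES_ON_BASE_SLOPE = 18_389
R_SLOPE, HR_SLOPE, RBI_SLOPE = 18_898, 42_913, 11_958

# Expected R, HR, and RBI gained per time on base, as quoted in the text.
R_PER_TOB, HR_PER_TOB, RBI_PER_TOB = 0.378, 0.103, 0.354

# Break-even OBP when R, HR, and RBI are held fixed:
# PA_SLOPE + OBP * TIMES_ON_BASE_SLOPE = 0
break_even_obp = -PA_SLOPE / TIMES_ON_BASE_SLOPE

# Assumption: each time on base also brings the average R, HR, and RBI with it.
value_per_time_on_base = (
    TIMES_ON_BASE_SLOPE
    + R_PER_TOB * R_SLOPE
    + HR_PER_TOB * HR_SLOPE
    + RBI_PER_TOB * RBI_SLOPE
)
adjusted_break_even_obp = -PA_SLOPE / value_per_time_on_base

print(f"Break-even OBP, other stats fixed:    {break_even_obp:.3f}")
print(f"Break-even OBP, with expected R/HR/RBI: {adjusted_break_even_obp:.3f}")
```

Under that added assumption the break-even OBP drops to about .20, well below the 2018 league average, which is why the .374 figure is less daunting than it first appears.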

Batter Arbitration Absolute Raise with No PA

Term           Coefficient
Intercept      -$158,152
Award Total    $714,809
R              $19,069
HR             $47,226
RBI            $7,452
R²             0.70





The model without PA and successful plate appearances does not have significant multicollinearity, and its R² is only slightly lower, which suggests that this model is more robust than the previous one. The remaining slopes, other than RBI, are larger, but not by much. The intercept term is much larger in magnitude (more negative), suggesting that the PA and successful plate appearance terms were primarily driving player salaries down.
Now, let’s see how the model’s predicted raises compare to the actual raises.


The actual raises are plotted on the x-axis and the predicted raises are plotted on the y-axis. The plotted line has a slope of one and marks where the predicted raise equals the actual raise. From the line, we can see that the model severely underestimates larger raises. Most raises are relatively small, which is likely affecting the ability of the model to predict larger raises. Since the raise is only one part of what we are interested in, which is the final salary of the player, let’s see if a different model that includes previous salary does a better job of predicting a player’s raise.

Batter Arbitration Raise with Previous Salary

Term              Coefficient
Intercept         -$191,900
Award Total       $709,500
R                 $19,050
HR                $46,630
RBI               $6,621
Previous Salary   0.00395
R²                0.70
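To get a feel for the size of the previous salary slope in the table above, the sketch below computes how many dollars of predicted raise that term contributes at a few salary levels. The salary levels are hypothetical round numbers, not figures from the data set.

```python
# Slopes from the "Batter Arbitration Raise with Previous Salary" table.
PREVIOUS_SALARY_SLOPE = 0.00395  # dollars of raise per dollar of previous salary
HR_SLOPE = 46_630                # dollars of raise per home run, for comparison

def previous_salary_contribution(previous_salary):
    """Dollars added to the predicted raise by the previous salary term alone."""
    return PREVIOUS_SALARY_SLOPE * previous_salary

# Hypothetical previous salaries, chosen only for illustration.
for salary in (500_000, 2_000_000, 5_000_000):
    contribution = previous_salary_contribution(salary)
    print(f"${salary:>9,} previous salary adds ${contribution:>7,.0f} to the raise")
```

Even at a hypothetical $5M previous salary, the term adds less than $20,000, which is smaller than the value the same model assigns to a single home run.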




There is a slight increase in the R² with the new model. The intercept and the slopes for all of the other predictors have decreased to compensate for the inclusion of the previous salary term. The previous salary term does not appear to add much to the new model, but let’s see if the predicted values on the higher end of salaries are more appropriate.


The new model has a slightly better fit at higher raises, although the improvement is barely noticeable on the plot. Given that the new model has a slightly better fit and a better AIC, and that previous salary is not strongly correlated with any of the other predictor variables, it is best to keep the new model. Now, let’s check the assumptions of the linear model:


The first assumption is that there is constant error variance. This plot shows the predicted values (fitted values) plotted against the residuals (the errors between predicted and actual values). We can see that the model does not have constant error variance, as the errors tend to be smaller at the lower predicted raises and larger at the higher predicted raises. A plot that shows constant error variance will show a random scattering of points, roughly filling out the shape of a circle. This plot, however, is cone-shaped, which is a telltale sign of nonconstant error variance.
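The cone shape can also be checked numerically. The sketch below is a crude split-sample comparison rather than a formal test: it sorts observations by fitted value and compares the residual spread in the upper half to the lower half. The fitted values and residuals are synthetic, constructed so that the error grows with the fitted value; they are not the model's actual output.

```python
import statistics

def spread_ratio(fitted, residuals):
    """Residual std dev in the upper half of fitted values divided by the lower half."""
    pairs = sorted(zip(fitted, residuals))
    half = len(pairs) // 2
    low = [resid for _, resid in pairs[:half]]
    high = [resid for _, resid in pairs[-half:]]
    return statistics.pstdev(high) / statistics.pstdev(low)

# Synthetic example: residuals are +/-20% of the fitted value, alternating in sign.
fitted = [250_000 * i for i in range(1, 21)]
signs = [1 if i % 2 == 0 else -1 for i in range(len(fitted))]
cone_residuals = [0.2 * f * s for f, s in zip(fitted, signs)]
flat_residuals = [100_000 * s for s in signs]

print(f"cone-shaped residuals: ratio = {spread_ratio(fitted, cone_residuals):.2f}")
print(f"constant-variance residuals: ratio = {spread_ratio(fitted, flat_residuals):.2f}")
```

A ratio near 1 is what constant error variance would look like; a ratio well above 1 corresponds to the widening cone described above.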


Another assumption of the linear model is that the residuals are normally distributed, which we can easily check by looking at a Q-Q plot. The Q-Q plot should show a straight line if the residuals are normally distributed. Here the tails of the line are curved, indicating that the assumption of normally distributed residuals is not upheld. Likewise, we can see that the residuals do not follow a normal distribution by plotting a histogram of them. The residuals are strongly right skewed, which is not a surprise given that the model underestimates the raises of players whose actual raises are higher.
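For readers who want to see what goes into a Q-Q plot, the sketch below pairs standardized, sorted residuals with the matching quantiles of a standard normal distribution and also computes sample skewness. The residuals are a small hypothetical right-skewed set (in millions of dollars), not the model's residuals, and the plotting positions (i + 0.5) / n are one common convention among several.

```python
from statistics import NormalDist, mean, pstdev

def qq_points(residuals):
    """Return (theoretical normal quantile, standardized residual) pairs."""
    n = len(residuals)
    mu, sigma = mean(residuals), pstdev(residuals)
    ordered = sorted((r - mu) / sigma for r in residuals)
    theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))

def skewness(values):
    """Population skewness: the mean cubed z-score (0 for symmetric data)."""
    mu, sigma = mean(values), pstdev(values)
    return sum(((v - mu) / sigma) ** 3 for v in values) / len(values)

# Hypothetical residuals in $M: many small misses and one large underestimate.
residuals = [-0.4, -0.3, -0.3, -0.2, -0.2, -0.1, -0.1, 0.0, 0.1, 0.2, 0.5, 2.8]

points = qq_points(residuals)
print(f"skewness: {skewness(residuals):.2f}")
print(f"largest residual: theoretical {points[-1][0]:.2f}, observed {points[-1][1]:.2f}")
```

In this made-up sample the largest standardized residual sits far above its theoretical quantile, which is the upward-curving right tail that a right-skewed Q-Q plot shows.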


While the linear model is not the most accurate method to predict the raise of arbitration-eligible batters in this situation, it is the most interpretable model due to its simple nature. Other methods, such as a linear model fitted to a transformed value of the raise like log(raise), or a polynomial model, may prove to be more accurate ways to predict arbitration raises.
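As a rough illustration of the log(raise) idea, the sketch below fits a one-predictor least-squares line to the logarithm of the raise and back-transforms the prediction. The home run and raise values are hypothetical, and using a single predictor is a simplification for brevity. Note that a log transform is undefined for raises of zero or less, so the handful of salary decreases mentioned earlier would need separate handling.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical data: home runs and raises in dollars (all raises positive).
home_runs = [5, 10, 15, 20, 25, 30, 35, 40]
raises = [300_000, 450_000, 700_000, 1_000_000,
          1_600_000, 2_400_000, 3_700_000, 5_500_000]

intercept, slope = fit_line(home_runs, [math.log(r) for r in raises])

def predict_raise(hr):
    """Back-transform the fitted log(raise) to dollars."""
    return math.exp(intercept + slope * hr)

print(f"each extra HR multiplies the predicted raise by {math.exp(slope):.3f}")
print(f"predicted raise at 30 HR: ${predict_raise(30):,.0f}")
```

The trade-off is interpretability: in a log model each slope describes a percentage change in the raise rather than a fixed dollar amount, so the coefficients no longer read as directly as those in the tables above.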









