Please read the
introduction to understand the purpose and background of the salary arbitration process in Major League Baseball.
There have been 650 batters who exited the arbitration
process with one-year contracts between the 2012 and 2019 seasons. As shown
below, most players earn below $5 million and the more service time that a
player has, the more they are likely to earn.
And the distribution of arbitration raises:
There are 7 cases where a player’s salary decreased through
the arbitration process, which constitutes just over 1% of the sample of
batters. None of these cases required an arbitration hearing, which suggests
that these players would have been non-tendered (effectively released) had they
not agreed to a reduction in salary. We can see that the vast majority of batters
received raises of less than $2.5M. Both the distribution of raises and the distribution of
contracts are right-skewed because, prior to arbitration, most players
earn either the minimum salary or close to it, while only truly exceptional
players receive very large contracts through the arbitration process. Because
the data are skewed, a linear model may not be the best model for
prediction; however, we will use a linear model for this analysis because it is
one of the most interpretable models available.
To start, the first model is fitted to predict the absolute
dollar raise for each player. In addition to traditional statistics, the model
also attempts to account for exceptional players by including the number of
awards that a player has won, as defined by the
Lahman Database.
These awards include MVP awards, Cy Young awards, World Series/NLCS/ALCS MVP
awards, Gold Glove awards, Silver Slugger awards, and All-Star starter selections.
The initial model includes as many traditional stats as
possible: PA, AB, R, HR, RBI, SB, AVG, SLG, OBP, and OPS. Then, using the
Akaike Information Criterion (AIC) to select the variables that best balance
fit and simplicity, the predictors of the model are reduced to total
awards, PA, R, HR, RBI, and OBP. The model yields a coefficient of
determination (R²) of .72, indicating that 72% of the variance in the
salary raise is accounted for by the included variables.
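To make the AIC trade-off concrete, here is a minimal sketch. It assumes the common least-squares form of AIC, n·ln(RSS/n) + 2k up to an additive constant, where RSS is the residual sum of squares and k is the number of fitted coefficients. The function name and the RSS and parameter counts below are hypothetical and are not taken from the actual arbitration models; only the sample size of 650 comes from the data described above.

```python
import math

def aic_least_squares(rss: float, n_obs: int, n_params: int) -> float:
    """AIC for a least-squares fit, up to an additive constant.

    rss      -- residual sum of squares of the fitted model
    n_obs    -- number of observations
    n_params -- number of fitted coefficients, including the intercept
    """
    return n_obs * math.log(rss / n_obs) + 2 * n_params

# Hypothetical values for illustration only (not the article's results).
n = 650
full = aic_least_squares(rss=1000.0, n_obs=n, n_params=12)
reduced = aic_least_squares(rss=1001.0, n_obs=n, n_params=7)

# Lower AIC is preferred: the reduced model fits almost as well
# while using fewer coefficients.
print(reduced < full)
```

In this toy comparison the smaller model wins because dropping five coefficients saves 10 AIC points while the fit barely changes.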
Batter Arbitration Absolute Raise

| Intercept | Award Total | PA | R | HR | RBI | OBP | R² |
|---|---|---|---|---|---|---|---|
| -$642,584 | $626,283 | -$2,246 | $29,028 | $34,340 | $16,224 | $2,092,643 | 0.72 |
The linear model takes the form:

\[
\text{predicted raise} = \beta_0 + x_{\text{awards}}\beta_{\text{awards}} + x_{\text{PA}}\beta_{\text{PA}} + x_{\text{R}}\beta_{\text{R}} + x_{\text{HR}}\beta_{\text{HR}} + x_{\text{RBI}}\beta_{\text{RBI}} + x_{\text{OBP}}\beta_{\text{OBP}}
\]

where \(\beta_0\) is the intercept, \(x_{\text{awards}}\) is the
number of awards that a given player has won, \(\beta_{\text{awards}}\) is the model
slope of the awards variable, listed in the table above, \(x_{\text{PA}}\) is the
number of plate appearances for a given player, \(\beta_{\text{PA}}\) is the model
slope for plate appearances, and so on. This model indicates that, with all
other variables held constant, each additional PA decreases a batter's predicted
raise by $2,246, each additional R increases it by $29,028, and so
on. Likewise, if a player does not accrue any playing time and thus has 0 for
awards, PA, R, HR, RBI, and OBP, the model predicts that his salary will decrease
by $642,584, as indicated by the intercept term.
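As a worked example, the fitted equation can be evaluated directly. The sketch below copies the coefficients from the table above; the function name and the example stat line are hypothetical and chosen only for illustration.

```python
# Coefficients copied from the "Batter Arbitration Absolute Raise" table.
INTERCEPT = -642_584
COEFFS = {
    "awards": 626_283,
    "PA": -2_246,
    "R": 29_028,
    "HR": 34_340,
    "RBI": 16_224,
    "OBP": 2_092_643,
}

def predicted_raise(stats: dict) -> float:
    """Intercept plus the sum of (slope * stat) over every predictor.

    Any stat missing from `stats` is treated as 0.
    """
    return INTERCEPT + sum(COEFFS[name] * stats.get(name, 0) for name in COEFFS)

# Hypothetical stat line, not a real player.
example = {"awards": 0, "PA": 600, "R": 80, "HR": 20, "RBI": 75, "OBP": 0.340}
print(f"${predicted_raise(example):,.0f}")

# One extra plate appearance with nothing else changing lowers the prediction
# by the PA slope.
one_more_pa = dict(example, PA=example["PA"] + 1)
print(round(predicted_raise(one_more_pa) - predicted_raise(example)))
```

A player with zeros everywhere gets exactly the intercept, which matches the interpretation given above.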
All of the coefficients are positive
except for PA, which seems reasonable: the more PA that a batter has
without increasing any of the other stats, the worse his performance is. Since
OBP is the only rate statistic, this model would predict that a player with a small
number of plate appearances but a high OBP receives a higher raise than he
would likely get in the actual arbitration process. As a result, it would be
better to multiply OBP by PA to create a value indicating
the number of times that the player got on base, or ‘successful’ plate
appearances.
Batter Arbitration Absolute Raise with Times on Base

| Intercept | Award Total | PA | R | HR | RBI | OBP*PA | R² |
|---|---|---|---|---|---|---|---|
| -$45,879 | $559,735 | -$6,879 | $18,898 | $42,913 | $11,958 | $18,389 | 0.73 |
This model is a slight improvement according to the R²
statistic, and the coefficients for the model variables are all still similar,
except that the intercept has shrunk in magnitude by more than a factor of 10. Plate appearances that are unsuccessful, meaning that
they do not end with the player getting on base, now carry a higher penalty of
$6,879 and, assuming the number of plate appearances stays constant, each
successful plate appearance increases a player’s raise by $18,389. This means
that, without any additional awards, R, HR, or RBI, a player must have an OBP
over .374 for additional plate appearances to increase his raise. This may
seem daunting given that the average OBP in 2018 was .318; however, for each
successful plate appearance, the expected number of R increases by .378, the
expected number of HR increases by .103, and the expected number of RBI
increases by .354. These statistics are all inevitably correlated,
because the more times that a player gets on base, the more
R, HR, and RBI he is expected to accumulate, so it might be best to remove
some of them from the model. This problem is most obvious in the case of HR, because
each HR increases the number of R by 1 and the number of RBI by
at least 1. Both the PA and successful plate appearance variables add significant
multicollinearity to the model.
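The .374 threshold follows from the two PA-related slopes: each plate appearance contributes -$6,879 plus $18,389 times the chance of reaching base, so the net value per PA is positive only when OBP exceeds 6,879 / 18,389. The sketch below reproduces that arithmetic. The second calculation is an assumption of ours rather than something the model states: it folds in the expected R, HR, and RBI per time on base quoted above, as if those averages applied to every player.

```python
# Slopes copied from the "Times on Base" table.
BETA_PA = -6_879
BETA_TIMES_ON_BASE = 18_389
BETA_R, BETA_HR, BETA_RBI = 18_898, 42_913, 11_958

def marginal_value_per_pa(obp: float) -> float:
    """Expected change in predicted raise from one more PA, ignoring R, HR, RBI."""
    return BETA_PA + BETA_TIMES_ON_BASE * obp

break_even_obp = -BETA_PA / BETA_TIMES_ON_BASE
print(round(break_even_obp, 3))  # 0.374

# Assumption: every time on base also brings the average expected gains in
# R (.378), HR (.103), and RBI (.354) quoted in the text.
value_per_time_on_base = (
    BETA_TIMES_ON_BASE + 0.378 * BETA_R + 0.103 * BETA_HR + 0.354 * BETA_RBI
)
adjusted_break_even_obp = -BETA_PA / value_per_time_on_base
print(round(adjusted_break_even_obp, 3))
```

Under that rough assumption the break-even OBP falls well below the 2018 league average of .318, which is why the .374 figure is less daunting than it first appears.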
Batter Arbitration Absolute Raise with No PA

| Intercept | Award Total | R | HR | RBI | R² |
|---|---|---|---|---|---|
| -$158,152 | $714,809 | $19,069 | $47,226 | $7,452 | 0.70 |
The model without PA and successful plate appearances does
not have significant multicollinearity, and it has only a slightly lower
R², which suggests that this model is more robust than
the previous model. The remaining variable slopes, other than RBI, are
somewhat larger. The intercept term is much larger in magnitude (more negative), suggesting
that the PA and successful plate appearance terms were primarily responsible for driving
predicted raises down in the previous model.
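The article does not say which diagnostic was used to judge multicollinearity. One common choice is the variance inflation factor (VIF); for a pair of predictors it reduces to 1 / (1 - r²), where r is their Pearson correlation. The sketch below implements that two-predictor case with made-up numbers, purely to show why PA and times on base are problematic together.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(xs, ys):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(xs, ys)
    return 1.0 / (1.0 - r * r)

# Made-up season lines (not from the data set): times on base tracks PA closely.
pa = [250, 400, 520, 610, 680]
times_on_base = [75, 130, 160, 205, 230]

vif = vif_two_predictors(pa, times_on_base)
print(vif > 10)  # values above roughly 5 to 10 are usually treated as a warning sign
```

With more than two predictors, the VIF for each variable comes from regressing it on all of the others, which is easier to do with a statistics library than by hand.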
Now, let’s see how the model’s predicted raises compare with
the actual raises.
The actual raises are plotted on the x-axis and the
predicted raises are plotted on the y-axis. The plotted line has a slope of
one and marks where the predicted raise equals the actual raise. From the
line, we can see that the model severely underestimates larger raises. Most
raises are relatively small, which is likely affecting the ability of the model to
predict larger raises. Since the raise is only part of what we are ultimately interested in,
namely the final salary of the player, let’s see if a different model that
includes previous salary does a better job of predicting a player’s raise.
Batter Arbitration Raise with Previous Salary

| Intercept | Award Total | R | HR | RBI | Previous Salary | R² |
|---|---|---|---|---|---|---|
| -$191,900 | $709,500 | $19,050 | $46,630 | $6,621 | 0.00395 | 0.70 |
There is a slight increase in R² with the new
model, although it is too small to show at two decimal places. The intercept and the slopes for all of the other predictors have
decreased to compensate for the inclusion of the previous salary term. The
previous salary term does not appear to add much to the new model, so let’s
see if the predicted values on the higher end of salaries are more appropriate.
The new model has a slightly better fit at higher raises,
although the difference is barely noticeable on the plot. Given that the new model has a
slightly better fit and a better AIC, and that previous salary is not strongly
correlated with any of the other predictor variables, it is best to keep the
new model. Now, let’s check the assumptions of the linear model:
The first assumption is that there is constant error
variance. This plot shows the predicted values (fitted values) plotted against
the residuals (error between predicted and actual values). We can see that the
model does not have constant error variance, as the errors tend to be smaller at
the lower projected salaries and larger at the higher projected salaries. A
plot that shows constant error variance will show a random scattering of points,
roughly filling out the shape of a circle. However, this plot is cone-shaped,
which is a telltale sign of nonconstant error variance.
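A rough numeric companion to the residual plot is to compare how spread out the residuals are for low versus high fitted values. The sketch below is a simplified check in the spirit of a Goldfeld-Quandt comparison, not a formal test; the function name and the toy data are hypothetical.

```python
import statistics

def spread_ratio(fitted, residuals):
    """Residual std. dev. in the top half of fitted values divided by the bottom half.

    A ratio near 1 is consistent with constant error variance; a ratio well
    above 1 matches the cone shape described above.
    """
    pairs = sorted(zip(fitted, residuals))  # order observations by fitted value
    half = len(pairs) // 2
    low = [res for _, res in pairs[:half]]
    high = [res for _, res in pairs[-half:]]
    return statistics.pstdev(high) / statistics.pstdev(low)

# Toy data (not from the article): errors grow with the fitted value.
fitted = [1, 2, 3, 4, 5, 6, 7, 8]
residuals = [0.1, -0.1, 0.2, -0.2, 1.0, -1.0, 2.0, -2.0]

print(round(spread_ratio(fitted, residuals), 2))  # 10.0
```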
Another assumption of the linear model is that the residuals
are normally distributed; we can easily check this by looking at a Q-Q plot. The
Q-Q plot should show a straight line if the residuals are normally distributed.
We can see that the residuals are not normally distributed because the tails of
the line are curved, so the assumption of normally distributed
residuals is not upheld. Likewise, we can see that the residuals do not follow
a normal distribution by plotting a histogram of them. The residuals are
strongly right-skewed, which is not a surprise given that the model underestimates
the raise of players whose actual raise is higher.
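The skew seen in the histogram can also be summarized with a single number. Here is a minimal sketch of the moment-based sample skewness (third central moment divided by the 1.5 power of the second); the toy residuals are hypothetical and are not the model's actual residuals.

```python
def sample_skewness(values):
    """Moment-based skewness: m3 / m2**1.5.

    Roughly 0 for symmetric data, positive when the right tail is longer.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Toy residuals: many small negative errors and one large positive error,
# the pattern produced when a model underestimates the biggest raises.
toy_residuals = [-1.0, -1.0, -1.0, -1.0, 4.0]
print(sample_skewness(toy_residuals))  # 1.5
```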
While the linear model is not the most accurate method to
predict the raise of arbitration-eligible batters in this situation, it is the most
interpretable model due to its simple nature. Other methods such as a linear
model with a transformed value of raise, such as log(raise), or a polynomial
model may prove to be a more accurate way to predict arbitration raises.
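One practical wrinkle with the log(raise) idea is that the logarithm is undefined for the handful of players whose salary decreased. The sketch below shows one possible way to handle that, by flagging non-positive raises so they can be set aside or modeled separately, together with the back-transformation from the log scale to dollars. Both helper names are hypothetical.

```python
import math

def log_raise(raise_dollars: float):
    """Return log(raise), or None when the raise is zero or negative."""
    if raise_dollars <= 0:
        return None  # log is undefined here; handle these cases separately
    return math.log(raise_dollars)

def back_transform(predicted_log_raise: float) -> float:
    """Convert a prediction made on the log scale back to dollars."""
    return math.exp(predicted_log_raise)

print(log_raise(-50_000))                       # None
print(round(back_transform(log_raise(1_000_000))))
```

Another option under the same idea would be to model the log of the final salary rather than the raise, since salaries are always positive.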