Using Variables to Explain
And Predict Winning Percentage,
1903-2003
Kristopher Kuzera
Int. Quantitative Methods
Geography 347
Fall 2003
Summer is an important time for the city of
When analyzing the baseball performance of the Cubs, there are many parts of the game worth examining in depth. For example, how many runs does the team accumulate annually? How many does it allow the opposing team? How about the performance of the Cubs’ pitching? Does it matter if they perform well, or is it a better indicator if they are consistently poor? Since there are so many variables worth investigating, it is hard to know which will have the most explanatory power in predicting the winning percentage.
To begin, a number of these variables have been selected based on the belief that they will have significance in explaining the annual winning percentage. These variables include the annual totals of strike outs, home runs, hits, and runs for both batting and pitching, bases on balls allowed and team Earned Run Average (ERA) for pitching, team batting average, pitching shutouts, complete games, and attendance per game. In order to eliminate some of the possible redundancy in the variables and expose some underlying factors in the data, a principal components analysis has been performed. This type of analysis allows the variables to form factors which represent the common hidden dimensions of the information, compacted in order to give the variables more uniqueness in their explanatory power. Later sections of the paper will describe, in depth, the process by which the principal components analysis was performed, and how the resulting factors were regressed against the winning percentage to illustrate how the data relates to the dependent variable. First, however, the data itself will be examined in detail.
DEVELOPMENT OF THE DATA
Baseball is often considered to be a game of numbers and match-ups. Whether counting the four bases, the three strikes, the four balls, the three outs, or even the nine innings, there are numbers involved in every aspect of the sport. However, at the end of the game, there are really only two numbers that matter in determining if the game is a victory. These crucial numbers are the total runs of one team versus the total runs of the other. Whoever has the higher of the two numbers will be considered victorious. But, in reality, are these two the only truly important factors that decide which will be the winning team? Other aspects of the game contribute to the manufacturing of runs for either or both teams that must be considered when determining a team’s winning percentage.
In order to accurately assess the winning performance, a large sample of data exclusively pertaining to the Chicago Cubs has been acquired from baseball-reference.com, an informational website containing statistics for all of Major League Baseball. The sample contains 100 seasons of baseball, from 1903 to 2002, with each season representing one unit of observation. Each observation will have one value per variable because the statistics are accumulated and maintained at a yearly scale. The data has also been gathered for the latest season, 2003, which will be used as a predictor for that season’s winning percentage, based on the regression lines developed from the 100-year data.
The following baseball components were considered important in interpreting the winning percentage for the Chicago Cubs. All of the variables below were analyzed in a correlation matrix to see how they relate to and interact with one another. These correlations help in estimating how the variables will behave when regressed against the dependent variable.
First, it is necessary to understand each of the variables being used as input for the study. The first variable to examine is the dependent variable, winning percentage. Winning percentage was calculated as the ratio of the total number of wins to the total number of games per season. Over the past century, the Chicago Cubs have experienced quite a fluctuation in their annual winning percentages. They have ranged from as high as 0.748 (116 wins out of 155 total) to as low as 0.358 (38 wins out of 106), while maintaining an average of 0.503 for the 100-year span. This will be the variable to which all the other variables will be compared in the study.
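The ratio described above can be checked with a few lines of Python. This is only an illustrative sketch; the function name `winning_percentage` is not part of the original study, and the two win totals are the extremes quoted in the text.

```python
def winning_percentage(wins: int, games: int) -> float:
    """Wins divided by the total number of games played in the season."""
    if games <= 0:
        raise ValueError("games must be positive")
    return wins / games


# The best and worst seasons quoted in the text.
best = winning_percentage(116, 155)
worst = winning_percentage(38, 106)
print(f"best = {best:.3f}, worst = {worst:.3f}")
```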
The statistic Earned Run Average (ERA) is used as an independent variable. ERA is a pitching statistic calculated as the average number of earned runs a pitcher allows the opposing team per nine innings. The average ERA for Cubs pitching over the past 100 years has been 3.70, meaning that the pitching staff typically gives up that many earned runs per game. It is believed that ERA has a negative relationship with the dependent variable, meaning that the higher this number is, the lower the winning percentage will be for that season.
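As a brief illustration of how ERA is computed, the sketch below uses the standard definition (nine times earned runs divided by innings pitched). The season totals are hypothetical and were chosen only so that the result lands near the 3.70 average quoted above; they are not taken from the study's data.

```python
def earned_run_average(earned_runs: float, innings_pitched: float) -> float:
    """Earned runs allowed per nine innings pitched."""
    if innings_pitched <= 0:
        raise ValueError("innings_pitched must be positive")
    return 9.0 * earned_runs / innings_pitched


# Hypothetical staff totals: 600 earned runs over 1,458 innings (162 games x 9).
era = earned_run_average(600, 1458)
print(f"ERA = {era:.2f}")
```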
Related to the ERA statistic are the total runs per season. This is an important variable because it is the main factor which determines whether a game is considered a win or a loss. Total runs have been included as both runs for the Cubs and as runs against them. Runs for are accumulated offensively by the batters on the team, while runs against (or allowed) are accumulated defensively against the pitchers. It is speculated that the total runs against the Cubs is highly collinear with the ERA statistic because they essentially measure the same factor. However, it was included in the study as a counterpart to the variable tallying runs in the Cubs’ favor. The seasonal average of runs for the Cubs is 677, while the average against the Cubs is 670. Because of the similarity between the two averages, it is no surprise that the average winning percentage is around 0.500.
Another component that is measured both for and against the Cubs is total hits. A hit is essentially when a batter reaches base safely after making contact with the ball. Hits can vary from reaching one base (single) to circling all the bases (home run). Large numbers of hits can often result in the manufacturing of runs, leading to an increased winning percentage. The average number of hits for the Cubs over the 100-year period is 1,384, while the average against the Cubs is 1,392. Though there are slightly fewer hits in favor of the team, the Cubs have maintained a slightly better than average winning percentage.
A third component that is measured both for and against the Cubs is home runs. A home run occurs when a batter hits the ball out of the playing field in fair territory. The batter is rewarded with an automatic trip around the bases which results in a run. The number of runs given to the batting team depends on how many runners were on base at the time of the hit. Anywhere from 1 to 4 runs can be accumulated as the result of a home run, but this amount is not tabulated in this statistic. Therefore, it is uncertain how effective a predictor of winning percentage the home run total is. Over the time period, home run totals have ranged from 9 to 212 with a mean of 103.12 for the Cubs, and from 6 to 231 with a mean of 99.9 against the Cubs. Early in the period, home run totals were much lower than in later years. This is due to the development and dominance of power hitters in recent years throughout baseball, including the Cubs’ very own Sammy Sosa, who has consistently and single-handedly hit more than 50 home runs per year over the past few seasons. Changes in the sport such as this are also likely to alter the predictability of these variables.
A fourth variable which is tallied for both pitching and batting is strike outs. A strike out is recorded when a pitcher gets a batter out on the third strike in the at-bat. Batting strike outs are accumulated when a Cubs batter is struck out by the opposing pitcher, while pitching strike outs are accumulated by the Cubs’ pitchers striking out the opposing batters. Batting strike outs for the Cubs have ranged from 343 to 1269 per season with a mean value of 743.72, while pitching strike outs have ranged from 402 to 1344 with a mean value of 722.6. It can be speculated that the higher the total strike outs by Cubs batters, the more likely it is that the team will have fewer victories. However, all innings are completed after three outs, regardless of whether the outs were generated from strike outs, ground outs, fly outs, or any combination of these. Therefore, it could prove to be an unpredictable indicator for winning percentage.
A final component that is measured both for and against the Cubs is bases on balls. A base on balls, or walk, is the result of a pitcher giving up four balls to a batter, allowing him free passage to first base. It is commonly believed that excessive walks by a pitching staff can result in large accumulations of runs for the opposing team, hence fewer victories for the Cubs. Cubs pitching has issued as few as 294 and as many as 658 walks in a season, with an average of 489.55 for the century. For this study, only bases on balls against the Cubs (issued by the Cubs’ pitching) were considered because the statistic in favor of the batters fluctuated too much over the course of the century.
There are two related pitching statistics regarding single game pitching performances that were considered for this analysis: complete games and shut outs. The first, complete games, measures the feat of a single pitcher remaining in the entire game without being relieved by the bullpen. This is usually a managerial decision, and often occurs only when the team is nearly guaranteed a victory because of the pitcher’s strong performance. The Cubs have had as many as 32 and as few as 1 in a single season. The second statistic, shut outs, occurs when a team prevents the opponent from scoring in the game, which amounts to an automatic victory for the team. The team has experienced as many as 139 shut outs in one season and as few as 5 in another. Though it can be suspected that the more frequently both of these occur in a season, the higher the winning percentage is likely to be, this is not guaranteed because bad pitching performances are not documented in these statistics. For both, the totals were generally much higher earlier in the time frame.
There is a batting statistic, similar to hits for the Cubs per season, which was included in this study. This is the batting average. The batting average is a ratio calculated for the entire team by dividing the total hits by the total at-bats. It is suspected that this component is highly collinear with hits for that reason. However, it is included in the analysis because of the variation of at-bats from season to season. The Cubs typically experience a team batting average around 0.261 per season.
The last statistic considered for this study is the attendance per game. This is only calculated for Cubs home games at Wrigley Field, which do not account for road game attendance when the Cubs are the visiting team. A season typically consists of half the games being at home while the other half are away. The Cubs originally played in a small stadium west of the city before moving to
For each of the variables, the data generally resembles a normal distribution. Some exceptions are pitching strike outs, pitching home runs, and attendance per game. These variables either had a slight skew towards one end of the curve (as was the case with attendance) or had two peaks along the curve. Because these still roughly resembled a normal distribution, they were included in the study to test their performance. No transformations were performed on any of the variables.
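One simple way to quantify the kind of skew mentioned above is the moment-based sample skewness. The sketch below is an illustration rather than the procedure used in the study, and the two small data sets are invented. Note that skewness alone would not reveal a two-peaked shape; a histogram is still needed for that.

```python
from statistics import fmean


def sample_skewness(values):
    """Moment-based skewness: third central moment over variance**1.5."""
    m = fmean(values)
    m2 = fmean([(v - m) ** 2 for v in values])
    m3 = fmean([(v - m) ** 3 for v in values])
    return m3 / m2 ** 1.5


# Invented data: one symmetric set and one with a long right tail.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]
print(sample_skewness(symmetric), sample_skewness(right_skewed))
```

A value near zero indicates a roughly symmetric distribution, while a clearly positive or negative value indicates a tail towards the high or low end.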
TECHNIQUES
To determine the amount of multicollinearity between the variables, an initial regression was run with all the independent variables against the dependent variable. Figure 1 gives the collinearity statistics as outputs of tolerance and VIF for all of the variables. Since neither approaches the value of 1, it is clear that there are large amounts of multicollinearity between the variables.

Figure 1: This table shows the results of the initial regression to determine amounts of collinearity.
Due to the nature of the data, many of the variables are collinear in their explanation. For example, ERA and total runs allowed describe nearly the same aspect of the game. Because these two represent similar statistics, the unique explanatory power is reduced for each variable. Various other variables also suffer from multicollinearity. Therefore, Principal Components Analysis, a method of factor analysis, was chosen as a technique to reduce the multicollinearity of the variables and increase the explanatory power of the underlying factors within the data.
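The tolerance and VIF statistics discussed above can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions: the data are synthetic, with an ERA-like column deliberately built as a noisy rescaling of a runs-allowed column, and the function name `tolerance_and_vif` is hypothetical. It does not reproduce the output in Figure 1.

```python
import numpy as np


def tolerance_and_vif(X):
    """For each column, regress it on the other columns (with an intercept).

    Returns a list of (tolerance, VIF) pairs, where tolerance = 1 - R^2
    and VIF = 1 / tolerance. Values near 1 indicate little collinearity.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    results = []
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        ss_res = float(resid @ resid)
        ss_tot = float(((y - y.mean()) ** 2).sum())
        tol = ss_res / ss_tot  # equals 1 - R^2
        results.append((tol, 1.0 / tol))
    return results


# Synthetic stand-ins for 100 seasons (not the study's data).
rng = np.random.default_rng(0)
runs_allowed = rng.normal(670, 80, size=100)
era_like = runs_allowed / 160 + rng.normal(0, 0.1, size=100)  # nearly redundant
hits_for = rng.normal(1384, 90, size=100)                     # unrelated column
stats = tolerance_and_vif(np.column_stack([runs_allowed, era_like, hits_for]))
for name, (tol, vif) in zip(["runs_allowed", "era_like", "hits_for"], stats):
    print(f"{name}: tolerance = {tol:.3f}, VIF = {vif:.1f}")
```

In this synthetic case the two redundant columns receive large VIF values while the unrelated column stays close to 1, which is the pattern the collinearity statistics are meant to expose.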
To begin, each of the variables, excluding the dependent variable winning percentage, were included in the factor analysis. The purpose of this was to see how each affected the outcome of the components. The results began with the KMO and
Further results of this factor analysis produced high degrees of communality and a total of three components having eigenvalues greater than 1 being extracted from the data. However, when the components were rotated, three variables appeared to be highly correlated on more than one factor. These three variables were complete games, total home runs, and total home runs allowed. Figure 2 (below) shows the rotated component matrix with the three noted variables.

Figure 2: This is the Rotated Component Matrix from the original factor analysis.
In order for the three extracted components to have more uniqueness, these three variables were removed in hopes that the new factors would be clear in their representation of the data.
With these three variables removed, the factor analysis was then repeated. Although the results of the KMO and Bartlett’s test were not as strong as previously (0.684 and 1294), they still indicated that the data were well suited to factor analysis and, this time, produced factors that uniquely described different aspects of the game. With the resulting three components, a multiple regression analysis was performed to test the relationship with the dependent variable. The analysis of this multiple regression, as well as the findings in the factor analysis, will be discussed in the next section.
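For readers unfamiliar with the two appropriateness measures, the sketch below computes an overall KMO value and Bartlett's chi-square statistic from a correlation matrix, using the standard textbook formulas. It is an illustration on synthetic data (four variables sharing one latent factor), and the function names are hypothetical; it does not reproduce the 0.684 and 1294 reported above.

```python
import numpy as np


def bartlett_sphericity(R, n):
    """Chi-square statistic and degrees of freedom for the hypothesis that
    the correlation matrix R (from n observations) is an identity matrix."""
    p = R.shape[0]
    _, logdet = np.linalg.slogdet(R)
    chi2 = -(n - 1 - (2 * p + 5) / 6.0) * logdet
    dof = p * (p - 1) // 2
    return chi2, dof


def kmo(R):
    """Overall Kaiser-Meyer-Olkin measure: squared correlations relative to
    squared correlations plus squared partial correlations (off-diagonal)."""
    inv = np.linalg.inv(R)
    scale = np.sqrt(np.outer(np.diag(inv), np.diag(inv)))
    partial = -inv / scale
    off = ~np.eye(R.shape[0], dtype=bool)
    r2 = (R[off] ** 2).sum()
    q2 = (partial[off] ** 2).sum()
    return r2 / (r2 + q2)


# Synthetic data: four variables driven by one shared latent factor.
rng = np.random.default_rng(42)
latent = rng.standard_normal((100, 1))
X = latent + 0.5 * rng.standard_normal((100, 4))
R = np.corrcoef(X, rowvar=False)
chi2, dof = bartlett_sphericity(R, n=100)
overall_kmo = kmo(R)
print(f"KMO = {overall_kmo:.3f}, Bartlett chi-square = {chi2:.1f} (df = {dof})")
```

A KMO value closer to 1 and a large Bartlett chi-square both suggest that the variables share enough common variance for a factor analysis to be worthwhile.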
ANALYSIS OF THE RESULTS
With the factor analysis, a table of communalities was produced which describes the total amount of variance each variable shares with the other variables in the principal components analysis. Each variable shared at least half of the total variance with all of the other variables in the study. Because there are such high shared variance levels among the variables, factor analysis is an important technique that can summarize this variance into unique new components.
The results of the factor analysis produced three new components (or factors), accounting for a cumulative 84.2% of the variance of all the data, with each factor describing a different aspect of baseball. Figure 3 (next page) shows the output of the rotated component matrix, which groups the variables into their relevant factors.

Figure 3: This is the Rotated Component Matrix from the second run of factor analysis. Three variables have been excluded.
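The extraction step described above (components with eigenvalues greater than 1, their cumulative share of variance, and the communalities) can be sketched as follows. This is a simplified illustration on synthetic data with two built-in latent dimensions. It works on the correlation matrix and returns unrotated loadings, so it omits the rotation used to produce Figure 3, and the function name is hypothetical.

```python
import numpy as np


def principal_components(X):
    """PCA on the correlation matrix, keeping eigenvalues greater than 1."""
    X = np.asarray(X, dtype=float)
    R = np.corrcoef(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]          # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > 1.0
    loadings = eigvecs[:, keep] * np.sqrt(eigvals[keep])  # unrotated loadings
    explained = eigvals[keep].sum() / eigvals.sum()       # cumulative share
    communalities = (loadings ** 2).sum(axis=1)
    return eigvals, loadings, explained, communalities


# Synthetic data: six variables, three per latent dimension.
rng = np.random.default_rng(7)
f = rng.standard_normal((200, 2))
X = np.column_stack([
    f[:, [0]] + 0.4 * rng.standard_normal((200, 3)),
    f[:, [1]] + 0.4 * rng.standard_normal((200, 3)),
])
eigvals, loadings, explained, communalities = principal_components(X)
print("eigenvalues:", np.round(eigvals, 2))
print(f"components kept: {loadings.shape[1]}, variance explained: {explained:.1%}")
```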
After analyzing the relationship between the components and the corresponding variables, it became clear what each of the components represented. Factor 1 correlated extremely highly with the independent variables representing the pitching aspects of the game. ERA, total runs allowed, total hits allowed, and total base on balls allowed had a strong positive correlation with the component, while shutouts had a strong negative correlation. This shows that the positive correlating variables represent when the Cubs pitching does poorly, while the negative correlating variable represents the pitching doing well. Therefore, Factor 1 represents poor pitching by the Cubs.
Factor 2 involves the other major aspect of the game, batting. Here, each of the three variables representing total hits, total runs, and batting average has a strong positive correlation with component 2. It states that large numbers of hits and runs, as well as a good team batting average are strongly related to each other. Therefore, Factor 2 represents strong hitting by the Cubs.
Factor 3 is not as straightforward as the first two. Each of the three variables representing pitching strike outs, batting strike outs, and game attendance has a strong positive correlation with Factor 3. This could be interpreted to mean that when good power pitchers are starting for either the Cubs or their opponents, attendance at Wrigley Field is higher. Conversely, poor-performing pitchers would be associated with lower attendance at each game. The attendance at Wrigley Field could also be compounded when ace pitchers are starting for both the Cubs and the opponent, and negatively compounded when both starting pitchers are generally poor. Therefore, Factor 3 represents strong power pitching and attendance.
Before regressing the three new factors against the dependent variable winning percentage, dummy variables were created to separate the 100 observations into three distinct groups. It is believed that the game of baseball has changed character over the past century. Early in the study period, the game appeared to be focused upon strong starting pitching, while more recent years have reflected numbers showing a focus on power hitting. These differences are evident in the data, with more occurrences of complete games and shutouts present at the turn of the century, and more home runs and strike outs happening today. Because of these apparent changes, two dummy variables were used to divide the observations into the following groups: Early (1903-1936), Middle (1937-1967), and Recent (1968-2002). With the introduction of these dummy variables, three different regression lines resulted to predict the winning percentage for any year, depending on the era in which it falls.
An additional dummy variable was created to separate the 100-year sample into two groups: one for testing, and one for validating. The sample group was divided by whether the season was an odd year or an even year, with the odd years being used for testing and the even years being used for validation. By doing this, it was believed that the randomness from season to season will allow for two good sample sets to test the prediction and explanation of the dependent variable.
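The era dummy variables and the odd/even split described above can be sketched in a few lines of Python. The function names are hypothetical; the year boundaries are those stated in the text, with the Recent era serving as the reference group (both dummies equal to zero).

```python
def era_dummies(year: int) -> tuple[int, int]:
    """Return (early, middle) dummy values; Recent seasons get (0, 0)."""
    early = 1 if 1903 <= year <= 1936 else 0
    middle = 1 if 1937 <= year <= 1967 else 0
    return early, middle


def is_test_season(year: int) -> bool:
    """Odd years form the testing group; even years are held for validation."""
    return year % 2 == 1


seasons = range(1903, 2003)  # the 100 seasons from 1903 through 2002
test_years = [y for y in seasons if is_test_season(y)]
validation_years = [y for y in seasons if not is_test_season(y)]
print(len(test_years), len(validation_years))
print(era_dummies(1908), era_dummies(1945), era_dummies(1998))
```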
The final step of the analysis involved a multiple regression of the three newly created components along with the two dummy variables against winning percentage, using the odd year dummy variable to separate the sample into groups. At first, a regression was done using the stepwise method for entering variables. Each of the three components was significant at the 0.025 level; however, the dummy variables were rejected from the model because their significance values (p-values) were too high. Figure 4 shows the results of the stepwise model, highlighting the levels of significance.

Figure 4: This table shows the significance levels of the three components using the stepwise method for entering variables. The dummy variables remained excluded.
In order to separate the regression into the three time periods and include the dummy variables in the model, as well as separate the sample into testing and validation groups, the enter method of including variables was used. With the dummy variables included, the third factor was no longer significant at the minimum 0.05 level. The addition of the new variables raised the significance value of Factor 3 above that threshold, likely because the beta coefficient for Factor 3 is extremely low, which can cause fluctuation when new variables are introduced. However, since the dummy variables are included just to separate the regression into three lines, the third factor was retained because of its previously low significance value. Figure 5 shows the results of the enter method of including variables.

Figure 5: This table highlights the significance levels of the components and dummy variables when using the enter method for including variables.
With each of the new variables successfully included in the model, an R square of 0.873 was attained. This means that 87.3% of the variance in winning percentage can be explained by the combination of the three factors. That is a remarkably high number. In addition, the R values for both the testing sample and validation sample were nearly equal. It can, therefore, be interpreted that the combination of poor pitching, good batting, and power pitching can be used to confidently calculate the winning percentage of the Cubs for any given year. Figure 6 shows the generated output of the model summary. Figure 7 shows the residual plot for predicted Y-values against the actual Y-values. The plot appears to violate none of the assumptions of linear regression.

Figure 6: This table features the R square value for the regression model, including the R values for the test sample and the validation sample.

Figure 7: The chart above is the residual plot for the regression model. No assumptions appear to be violated when examining this residual plot.
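The fit-then-validate procedure described in this section can be illustrated with ordinary least squares in NumPy. This is a sketch under loud assumptions: the factor scores and winning percentages below are synthetic, the coefficients used to generate them are chosen purely for illustration, and plain least squares stands in for the statistical package's enter method. It does not reproduce Figures 5 through 7.

```python
import numpy as np

rng = np.random.default_rng(1)
years = np.arange(1903, 2003)
factors = rng.standard_normal((100, 3))  # synthetic stand-in factor scores
early = ((years >= 1903) & (years <= 1936)).astype(float)
middle = ((years >= 1937) & (years <= 1967)).astype(float)

# Synthetic response (illustrative coefficients plus random noise).
win_pct = (0.5 - 0.06 * factors[:, 0] + 0.036 * factors[:, 1]
           + rng.normal(0, 0.02, size=100))

# Design matrix: intercept, three factors, two era dummies.
A = np.column_stack([np.ones(100), factors, early, middle])
odd = years % 2 == 1  # odd years are used to fit the model

coef, *_ = np.linalg.lstsq(A[odd], win_pct[odd], rcond=None)


def r_squared(y, y_hat):
    """Share of the variance in y accounted for by the predictions y_hat."""
    ss_res = ((y - y_hat) ** 2).sum()
    ss_tot = ((y - y.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot


r2_test = r_squared(win_pct[odd], A[odd] @ coef)
r2_valid = r_squared(win_pct[~odd], A[~odd] @ coef)
print(f"R^2 (odd years, fitted) = {r2_test:.3f}")
print(f"R^2 (even years, held out) = {r2_valid:.3f}")
```

When the two values are close, as the study reports for its own samples, the model is unlikely to be merely memorizing the seasons it was fitted on.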
With the model successfully completed, three regression lines have been formed to calculate the winning percentage for each year within its particular era. Below are the three regression lines developed from the unstandardized beta coefficients in Figure 5.
Early = 0.495 - 0.060*(Factor 1) + 0.036*(Factor 2) - 0.001*(Factor 3) + 0.011*(Early Dummy)
Middle = 0.495 - 0.060*(Factor 1) + 0.036*(Factor 2) - 0.001*(Factor 3) + 0.017*(Middle Dummy)
Recent = 0.495 - 0.060*(Factor 1) + 0.036*(Factor 2) - 0.001*(Factor 3)
Each of these regression lines has the same slopes, with only the intercept changing value due to the introduction of the era dummy variables. The regression line for the Recent era can be interpreted in the following way. The intercept, 0.495, is the point at which the regression line crosses the Y-axis, that is, the predicted winning percentage when all three factor scores are zero. The prediction for each year starts at a winning percentage of 0.495 and is then adjusted either positively or negatively by the performance of the three factors.
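The three regression lines can be written as one small function, which makes it clear that the era dummy only shifts the intercept. The coefficients come from the equations above; the dictionary and function names are illustrative.

```python
COEFFICIENTS = {"intercept": 0.495, "factor1": -0.060,
                "factor2": 0.036, "factor3": -0.001}
ERA_SHIFT = {"early": 0.011, "middle": 0.017, "recent": 0.0}


def predict_win_pct(f1: float, f2: float, f3: float, era: str) -> float:
    """Predicted winning percentage from three factor scores and an era."""
    return (COEFFICIENTS["intercept"] + ERA_SHIFT[era]
            + COEFFICIENTS["factor1"] * f1
            + COEFFICIENTS["factor2"] * f2
            + COEFFICIENTS["factor3"] * f3)


# With all factor scores at zero, the prediction equals the era's intercept.
for era in ("early", "middle", "recent"):
    print(era, round(predict_win_pct(0.0, 0.0, 0.0, era), 3))
```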
The coefficient for the factor describing poor pitching (Factor 1) is negative at -0.060. This means that the higher the poor pitching score for the Cubs in a certain year, the lower the winning percentage will be. Because this coefficient is the largest in magnitude of the three factors, and its beta coefficient is also high, it can be assumed that poor pitching performance is the strongest contributor in determining how many games the Cubs win in a year.
The coefficient for the strong batting factor (Factor 2) is positive with a value of 0.036. This means that when the Cubs have a stronger year batting, their winning percentage will likely be improved. This coefficient, however, is smaller than that for the poor pitching, meaning that strong batting has less of an effect than poor pitching.
The factor describing strong power pitching regarding strikeouts and attendance (Factor 3) has a coefficient slightly more negative than zero at -0.001. This number is so small, partially because the number for attendance per game is so high, and partially because it is not as strong in determining the winning percentage as the other two factors with an extremely low beta coefficient.
The regression lines for the Early and Middle eras have intercepts that are slightly higher than that of the Recent era. This can be interpreted by looking at Figure 8 (next page). The table describes the average values for each of the variables over the three eras. Columns in light grey correlate highly with Factor 1, those in white correlate highly with Factor 2, and those in dark grey with Factor 3. After examining the average values, variables that are highly correlated with Factors 1 and 3 appear to increase and decrease in a somewhat linear fashion, meaning that the Early and Recent eras contain the extreme values while the Middle era has the middle value. Base on balls is the only exception; however, this value was the least correlated of the variables with the poor pitching factor. In contrast, variables which are highly correlated with Factor 2 have their extreme values in the Early and Middle eras, with the Recent era containing the middle value. This is similar to the behavior of the average winning percentages for the three eras. Because of this, it is possible that Factor 2 could have more predictive power for winning percentage than previously determined. With regard to the intercepts of the three regression lines, lower numbers in the variables correlated with Factor 2 for the Middle era may be compensated for by the higher intercept value. The high intercept for the Early era is likely due to the already higher average winning percentage, compounded with stronger pitching.

Figure 8: This table shows the average values for each independent variable at each era.
USING THE REGRESSION EQUATION FOR PREDICTION
Now that the regression equations have been created to predict and explain winning percentage of the Chicago Cubs over the past 100 years, the equation for the most recent era has been used to predict the winning percentage for the latest season in Cubs baseball, 2003, and validate the equation itself. In order to get the factor scores for 2003, the values for each independent variable were multiplied by the component score coefficients for each factor and summed (see Figure 9). Because the Component Score Coefficient Matrix is standardized, z-scores of each of the independent variable values were used in the calculation.

Figure 9: This table contains the standardized values for each variable to be used when calculating each component.
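The factor score calculation described above amounts to standardizing each variable and then taking a weighted sum with the component score coefficients. The sketch below shows the mechanics only: the three variables, their standard deviations, the season values, and the coefficient matrix are all hypothetical placeholders, not the values in Figure 9.

```python
# Hypothetical inputs (ERA, runs scored, batting average).
# Only the means echo averages quoted earlier in the paper; the standard
# deviations, season values, and score coefficients are invented.
means = [3.70, 677.0, 0.261]
stds = [0.50, 80.0, 0.012]
season_values = [4.20, 757.0, 0.273]

# One row per variable, one column per component (hypothetical weights).
score_coefficients = [
    [0.9, 0.0],
    [0.0, 0.6],
    [0.1, 0.5],
]

# Step 1: convert each raw value to a z-score.
z_scores = [(x - m) / s for x, m, s in zip(season_values, means, stds)]

# Step 2: each factor score is the sum of z-score times coefficient.
n_components = len(score_coefficients[0])
factor_scores = [
    sum(z * row[k] for z, row in zip(z_scores, score_coefficients))
    for k in range(n_components)
]
print([round(z, 3) for z in z_scores])
print([round(s, 3) for s in factor_scores])
```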
The computation of the 2003 Cubs factor scores resulted in a score of -0.792 for Factor 1, 0.515 for Factor 2, and 3.21 for Factor 3. Plugging these into the regression equation for the Recent era, the result was a predicted winning percentage of 0.558 (90 wins out of 162 games). With the conclusion of the 2003 season, the actual winning percentage for the Chicago Cubs was 0.543 (88 wins out of 162 games). The difference of two wins between the predicted and actual 2003 season winning percentage is likely because of close games between the Cubs and their opponents, where the outcome could have gone either way.
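The arithmetic behind this prediction can be checked directly. The short Python sketch below plugs the reported 2003 factor scores into the Recent era equation; the variable names are illustrative only.

```python
# Recent era coefficients and the 2003 factor scores reported in the text.
intercept = 0.495
coefficients = (-0.060, 0.036, -0.001)
scores_2003 = (-0.792, 0.515, 3.21)

predicted = intercept + sum(c * s for c, s in zip(coefficients, scores_2003))
predicted_wins = round(predicted * 162)
actual = 88 / 162

print(f"predicted = {predicted:.3f} ({predicted_wins} wins of 162)")
print(f"actual    = {actual:.3f} (88 wins of 162)")
```

The two negative terms multiplied together (a negative coefficient and a negative poor pitching score) make the largest positive contribution, which matches the interpretation that good pitching drove the 2003 prediction above 0.500.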
SUMMARY AND CONCLUSIONS
Due to the resulting strong R square value, it was found that independent variables about specified aspects of baseball can be used to explain and predict winning percentage for the Chicago Cubs between 1903 and 2003. Using Principal Components Analysis to expose the underlying components of the data, these factors can be regressed against the dependent variable to confidently state winning percentage for any given year within the time period. By separating the data into groups, analysis was performed on the odd year seasons and checked against the results of the even year seasons. This proved to be a successful form of validation of the regression. The data was also divided into three eras, each signifying dominant changes throughout the sport of baseball. Using the regression from the most recent era, the equation was able to predict a winning percentage comparable to the actual percentage for the 2003 season. For the Chicago Cubs, it can be concluded that the underlying components of the game of baseball are significant indicators of seasonal winning percentage. Play ball!
RESOURCES
All data used in this study was obtained from the following website:
Forman, Sean. Baseball-Reference.com. 2003. Web Resource:
http://www.baseball-reference.com
APPENDIX

Factor Analysis
Regression
RAW DATA



