Home Runs in Major League Baseball
Mathematics Teacher: Tim Allega, email: allegat@bellsouth.net
Concepts
One-variable data
Matched pairs data
Measures of central tendency
Box plots
Histograms
Normal probability plot
Hypothesis testing
Materials
TI-83/84
Internet access
Technology Goals
Entering data into a list
Arithmetic with lists
Setting a graph window
Use of STAT TESTS T-Test
Overview
This activity provides an opportunity to collect data from the internet and analyze it using the capabilities of the TI-83/84.
Introduction
Midway through the 2006 major league baseball season, a USA Today cover story described what appeared to be a significant increase in the number of home runs over the 2005 season. Player use of performance enhancing drugs was a controversial explanation. This lesson tasks students with applying basic statistics skills to the data and leads to discussion of the various explanations for the increase outlined in the newspaper article.
USA Today Cover Story July 11, 2006
Why so many home runs?
Fans blame drug use: players offer different theories.
By Bob Nightengale
"We're at a stage now that if there's any variation in power totals and strikeout totals, there has to be a reason. If home runs are up, it's because players are taking drugs. If the numbers are down, it's because they used to take drugs. There are no more normal variances. I just don't know what to do anymore."
Don Fehr, players' union chief
Player Reasons for Increase in Home Runs from 2005 to 2006
Normal variances
Decline in the quality of the pitching
A change in the way baseballs are made (rubber core instead of a cork core)
Not enough good pitching to go around
Hitters' use of elbow padding causing pitchers to avoid inside pitches
Data as reported in USA Today July 11, 2006 used with permission.
MLB's big boppers
Thirteen players have 24 or more home runs at the All-Star break as opposed to three last season.
Player, team                             2005 HRs       2006 HRs (7/11/06)
David Ortiz, Boston Red Sox              47             31
Jim Thome, Chicago White Sox             7 (injured)    30
Albert Pujols, St. Louis Cardinals       41             29
Ryan Howard, Philadelphia Phillies       22             28
Adam Dunn, Cincinnati Reds               40             27
Alfonso Soriano, Washington Nationals    36             27
Jason Giambi, New York Yankees           32             27
Carlos Lee, Milwaukee Brewers            32             26
Carlos Beltran, New York Mets            16             25
Travis Hafner, Cleveland Indians         33             25
Jermaine Dye, Chicago White Sox          31             25
Lance Berkman, Houston Astros            24             24
Manny Ramirez, Boston Red Sox            45             24
USA Today (7/11/06) Gallup Poll
Results of a USA TODAY/Gallup Poll of 594 baseball fans and survey of 476 Major League Baseball players conducted by USA TODAY and the Sports Xchange.
What do you think is the most likely cause for the increase in home runs this season?
Hitters continuing to use performance-enhancing drugs:
Players 3%
Fans 47%
An increase in the quality of hitting:
Players 28%
Fans 27%
A decline in the quality of pitching:
Players 16%
Fans 9%
A change in the way baseballs are made:
Players 10%
Fans 8%
Nothing:
Players 29%
Fans 1%
Other/No opinion:
Players 14%
Fans 7%
Results from the players survey
What percentage of players do you believe have used amphetamines this season?
None 21%
1% - 10% 52%
11% - 25% 17%
26% - 50% 3%
Over 50% 3%
No opinion 3%
What percentage of players do you believe have used another performance-enhancing drug this season?
None 24%
1% - 10% 54%
11% - 25% 12%
26% - 50% 2%
Over 50% 1%
No opinion 6%
Baseball fans were questioned June 9-11 and the margin of error is +/- 4 percentage points. The players were questioned June 24-July 9.
Data Analysis
I. Examine the data presented in the USA Today article.
A. Obtain data from the MLB web site to update the home run data for the entire 2006 season.
B. Look for anomalies in the data set (e.g., Jim Thome was injured in 2005). Propose adjustments to the data set.
II. Use TI-83/84 for the data analysis. Do the data support a conclusion that there was an increase in the number of home runs in 2006 over 2005?
A. Use L5 for 2005 HR and L6 for the 2006 HR for the players in the article.
B. Sketch a scatter-plot of (L5,L6) ordered pairs for comparison with the identity function Y1 = x. Interpret the graph.
C. Name and fill a new list D: L6 - L5.
D. Record the 1-Var Stats results, and sketch a histogram and box plot for List D. Interpret the results.
III. Extension Activity
A. Determine if the criteria are met for use of a matched pairs hypothesis test (T-Test on List D).
B. State the hypotheses. The null hypothesis is that the mean difference in the number of home runs from 2005 to 2006 is zero. Ho: μ_D = 0, where μ_D is the mean of the differences in List D. The alternate hypothesis is that the mean difference is greater than zero, that is, there was an increase in the number of home runs in 2006 over 2005. Ha: μ_D > 0. Is the increase due to normal variance or not?
C. Write a reaction paper (critique) of the referenced USA Today article that is supported by your analysis. Comment on the strengths and weaknesses of the Gallup Poll and Survey.
Teacher Notes. Part I of the Data Analysis might be used as a homework assignment or accomplished in a computer lab. Part II is designed as a classroom activity. Part III, the Extension Activity, could be used as a project assignment for an advanced math or AP Statistics class. Part III could also be used as an Introduction to Hypothesis Testing presentation for an upper division math class.
I.A. Google a favorite Major League Baseball team and link to the team's official site, then the Stats page. Or go directly to:
http://atlanta.braves.mlb.com/NASApp/mlb/index.jsp?c_id=atl
There is a link to players on other teams from the Stats page.
I.B. Answers will vary; plenty of room for individual judgment.
1. I chose to use Jim Thome's 2004 HR stats, where he had approximately the same number of AB (at bats) as in 2006.
2. I also chose to use the 2004 HR stats for Carlos Beltran, since 2005 was his first year with the Mets.
3. Ryan Howard was removed from my analysis, as an anomaly, due to the large differences in number of AB.
2004 2 HR in 39 AB
2005 22 HR in 312 AB
2006 58 HR in 581 AB.
4. There were two additional players I chose to leave in even though there were differences noted in AB season to season:
Lance Berkman
2005 24 HR in 468 AB
2006 45 HR in 536 AB
Manny Ramirez
2005 45 HR in 554 AB
2006 35 HR in 449 AB
NOTE: One might consider converting all of the data to a HR/AB ratio.
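The HR/AB conversion suggested in the note can be sketched in a few lines of Python. This is only an illustration, using the HR and AB figures quoted above for Ryan Howard and Manny Ramirez; the names `seasons` and `hr_rate` are made up for this example.

```python
# (HR, AB) by season, copied from the teacher notes above.
seasons = {
    "Ryan Howard": {2004: (2, 39), 2005: (22, 312), 2006: (58, 581)},
    "Manny Ramirez": {2005: (45, 554), 2006: (35, 449)},
}


def hr_rate(hr, ab):
    """Home runs per at bat."""
    return hr / ab


# Convert every (HR, AB) pair to a rate so seasons with different AB compare fairly.
rates = {
    player: {year: hr_rate(hr, ab) for year, (hr, ab) in by_year.items()}
    for player, by_year in seasons.items()
}

for player, by_year in rates.items():
    for year, rate in by_year.items():
        print(f"{player} {year}: {rate:.3f} HR/AB")
```

On a rate basis, Ramirez's 2005 and 2006 seasons are nearly identical (about 0.081 versus 0.078 HR/AB) even though his raw total dropped by 10, which is the kind of distortion the note warns about.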
Player, team                             2005 HRs       2006 HRs
David Ortiz, Boston Red Sox              47             54
Jim Thome, Chicago White Sox             42*            42
Albert Pujols, St. Louis Cardinals       41             49
Adam Dunn, Cincinnati Reds               40             40
Alfonso Soriano, Washington Nationals    36             46
Jason Giambi, New York Yankees           32             37
Carlos Lee, Milwaukee Brewers            32             28
Carlos Beltran, New York Mets            38*            41
Travis Hafner, Cleveland Indians         33             42
Jermaine Dye, Chicago White Sox          31             44
Lance Berkman, Houston Astros            24             45
Manny Ramirez, Boston Red Sox            45             35
*2004 home runs
II.B. Note that Y1 = x, the identity function, is the diagonal. A data point above the line indicates an increase in 2006 over 2005, a point on the line indicates no change, and a point below the line indicates a decrease.
Increase 8
No change 2
Decrease 2.
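The counts above can be checked without a graph by comparing each (2005, 2006) pair with the line y = x. A minimal sketch in Python, assuming the adjusted 12-player data set from the teacher's table (2004 values substituted for Thome and Beltran, Howard removed); the variable names are illustrative only.

```python
# Adjusted data, in table order: Ortiz, Thome, Pujols, Dunn, Soriano, Giambi,
# Lee, Beltran, Hafner, Dye, Berkman, Ramirez.
hr_2005 = [47, 42, 41, 40, 36, 32, 32, 38, 33, 31, 24, 45]
hr_2006 = [54, 42, 49, 40, 46, 37, 28, 41, 42, 44, 45, 35]

counts = {"increase": 0, "no change": 0, "decrease": 0}
for x, y in zip(hr_2005, hr_2006):
    if y > x:        # point lies above the line y = x
        counts["increase"] += 1
    elif y == x:     # point lies on the line
        counts["no change"] += 1
    else:            # point lies below the line
        counts["decrease"] += 1

print(counts)
```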
II.D Main points:
1-Var Stats:
Min -10
Mean 5.17
Median 6
Max 21
Histogram
Approximately symmetrical
Gap +13 to +21
Box plot
Five number summary
Min/Q1/Median/Q3/Max
-10/0/6/9.5/21
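For teachers who want to verify the calculator output, the List D statistics can be reproduced in Python. This sketch assumes the adjusted 12-player data set from the teacher's table and computes the quartiles as the medians of the lower and upper halves of the sorted data, which is how the values above were obtained; other software may use a different quartile rule and report a slightly different Q3.

```python
from statistics import mean, median

# Adjusted data, in table order (Howard removed; 2004 values for Thome and Beltran).
hr_2005 = [47, 42, 41, 40, 36, 32, 32, 38, 33, 31, 24, 45]
hr_2006 = [54, 42, 49, 40, 46, 37, 28, 41, 42, 44, 45, 35]

# List D = L6 - L5
diffs = [y - x for x, y in zip(hr_2005, hr_2006)]


def five_number_summary(values):
    """Min, Q1, median, Q3, max, with quartiles taken as medians of the halves.

    Assumes at least two values; for an odd count the middle value is
    excluded from both halves.
    """
    s = sorted(values)
    half = len(s) // 2
    return min(s), median(s[:half]), median(s), median(s[-half:]), max(s)


print("List D:", diffs)
print("Mean:", round(mean(diffs), 2))
print("Five number summary:", five_number_summary(diffs))
```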
III.A. A matched pairs hypothesis test is appropriate when: (1) the sample data are from a simple random sample (SRS) and (2) the population of differences has the characteristics of a normal distribution.
The 13 players in the article (12 after the adjustments in I.B) were not selected at random from all MLB players. However, unless there is some reason to suspect a bias due to the selection method, the sample could be assumed representative. The selection criteria must be described in the analysis report.
Characteristics of a normal distribution include symmetry with no gaps or outliers. The sample distribution of differences (List D) is fairly symmetrical. There is a gap from +13 to +21, but the +21 doesn't meet the generally accepted criteria for an outlier. The normal probability plot (percentile plot) is approximately linear, supporting an assumption of a normal distribution for the population.
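The outlier check can be made explicit. The notes do not name the rule, so the sketch below assumes that the "generally accepted criteria" is the common 1.5 x IQR rule, and it uses Q1 = 0 and Q3 = 9.5 from the five number summary in II.D.

```python
# List D (2006 HR minus 2005 HR) for the 12 players in the adjusted data set.
diffs = [7, 0, 8, 0, 10, 5, -4, 3, 9, 13, 21, -10]

# Quartiles from the five number summary in II.D.
q1, q3 = 0, 9.5
iqr = q3 - q1

# Assumed rule: a value is an outlier if it lies more than 1.5 * IQR
# below Q1 or above Q3.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [d for d in diffs if d < lower_fence or d > upper_fence]

print("Fences:", lower_fence, upper_fence)
print("Outliers:", outliers)
```

Under this rule the fences are -14.25 and 23.75, so neither -10 nor +21 is flagged.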
The results are statistically significant at the α = 0.05 level, since the p-value (about 0.025) is less than 0.05. One can reject the null hypothesis and accept the alternate that the average increase in the number of home runs is greater than zero. It is very unlikely that a sample average of 5.17 or greater would occur if the population mean were zero (i.e., no increase in the number of home runs); this would occur only about 25 times out of 1,000 in repeated sampling.
One can thus conclude that the increase in the number of home runs is NOT due to normal variance as suggested by Mr. Fehr, the players' union chief.
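The T-Test result on List D can be cross-checked away from the calculator. The sketch below is one way to do it with only the Python standard library: it computes the one-sample t statistic for the 12 differences and approximates the one-sided p-value by integrating the Student t density with Simpson's rule. The helper names `t_pdf` and `upper_tail_p` are made up for this example, and the numerical integration is a stand-in for the calculator's built-in routine, not a description of it.

```python
import math
from statistics import mean, stdev

# List D (2006 HR minus 2005 HR) for the 12 players in the adjusted data set.
diffs = [7, 0, 8, 0, 10, 5, -4, 3, 9, 13, 21, -10]

n = len(diffs)
df = n - 1
# t = (sample mean - 0) / (sample standard deviation / sqrt(n))
t_stat = mean(diffs) / (stdev(diffs) / math.sqrt(n))


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def upper_tail_p(t, df, steps=2000):
    """P(T > t) for t >= 0, via Simpson's rule on [0, t] (steps must be even)."""
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


p_value = upper_tail_p(t_stat, df)
print("t =", round(t_stat, 2), " df =", df, " one-sided p =", round(p_value, 4))
```

The p-value falls below 0.05, which agrees with the decision to reject Ho above.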
III.C: Gallup Poll and Player Survey
Strengths:
Gallup is a reputable firm and USA Today is a respected publication
The Gallup Poll is sufficiently large and the margin of error +/- 4% is reasonable for the fan opinions.
Weaknesses:
A "fan" is not defined, nor are the selection criteria identified in the article. Results of voluntary response polls are typically unreliable. Also, the non-response rate is not addressed.
The player survey is closer to a census than a sample poll. The survey procedures were not addressed in the article. Were all of the players required to respond? If not, who did not respond and why? Were the responses anonymous?
For the player survey, it would be hard to answer "none," since that implies that not even one player used amphetamines or other drugs. For 500 players, 1% would be 5. The lowest category might instead have read "less than 1%."
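The reported margin of error for the fan poll can be sanity-checked from the sample size. The article does not describe how the margin was computed, so this sketch assumes a simple random sample, a 95% confidence level (z = 1.96), and the worst-case proportion p = 0.5.

```python
import math

n_fans = 594   # sample size reported for the fan poll
z = 1.96       # assumed: 95% confidence level
p = 0.5        # assumed: worst-case proportion, which maximizes the margin

# Margin of error for a proportion: z * sqrt(p * (1 - p) / n)
margin = z * math.sqrt(p * (1 - p) / n_fans)

print(f"Margin of error: +/- {margin * 100:.1f} percentage points")
```

Under these assumptions the result rounds to 4 percentage points, consistent with the figure reported in the article.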
STUDENT WORKSHEET
Player, team                             2005 HRs       2006 HRs
David Ortiz, Boston Red Sox              47             ___________
Jim Thome, Chicago White Sox             7 (injured)    ___________
Albert Pujols, St. Louis Cardinals       41             ___________
Ryan Howard, Philadelphia Phillies       22             ___________
Adam Dunn, Cincinnati Reds               40             ___________
Alfonso Soriano, Washington Nationals    36             ___________
Jason Giambi, New York Yankees           32             ___________
Carlos Lee, Milwaukee Brewers            32             ___________
Carlos Beltran, New York Mets            16             ___________
Travis Hafner, Cleveland Indians         33             ___________
Jermaine Dye, Chicago White Sox          31             ___________
Lance Berkman, Houston Astros            24             ___________
Manny Ramirez, Boston Red Sox            45             ___________
Data Analysis
I. Examine the data presented in the USA Today article.
A. Obtain data from the MLB web site to update the home run data for the entire 2006 season.
B. Look for anomalies in the data set (e.g., Jim Thome was injured in 2005). Propose adjustments to the data set.
II. Use TI-83/84 for the data analysis. Do the data support a conclusion that there was an increase in the number of home runs in 2006 over 2005?
A. Use L5 for 2005 HR and L6 for the 2006 HR for the players in the article.
B. Sketch a scatter-plot of (L5,L6) ordered pairs for comparison with the identity function Y1 = x. Interpret the graph.
C. Name and fill a new list D: L6 - L5.
D. Record the 1-Var Stats results, and sketch a histogram and box plot for List D. Interpret the results.
III. Extension Activity
A. Determine if the criteria are met for use of a matched pairs hypothesis test (T-Test on List D).
B. State the hypotheses. The null hypothesis is that the mean difference in the number of home runs from 2005 to 2006 is zero. Ho: μ_D = 0, where μ_D is the mean of the differences in List D. The alternate hypothesis is that the mean difference is greater than zero, that is, there was an increase in the number of home runs in 2006 over 2005. Ha: μ_D > 0. Is the increase due to normal variance or not?
C. Write a reaction paper (critique) of the referenced USA Today article, supported by your analysis. Comment on the strengths and weaknesses of the Gallup Poll and Survey.