CS249 HW4: Baseball Modeling solution


For this problem please install the Lahman package, a comprehensive package about Baseball statistics, and use it to answer a few questions.

Important information:

project home page (with links to impressive graphics): http://lahman.r-forge.r-project.org/
package documentation (html): http://lahman.r-forge.r-project.org/doc/
The documentation includes descriptions of the many tables in this package, such as the Salaries table: http://lahman.r-forge.r-project.org/doc/Salaries.html

The Goal
There are two problems for you to solve:

Problem 1: construct a model that predicts a player’s salary based on his baseball statistics. Your model should have better performance (higher R-squared) than the baseline model given.
Problem 2: construct a model that predicts whether a player will be inducted into the Hall of Fame. Your model should have better performance (higher Hall-of-Fame-Accuracy-Rate) than the baseline model given.
Here, Hall-of-Fame-Accuracy-Rate is a weighted percentage of correct predictions for players in the Hall of Fame: a correct prediction for a player in the Hall of Fame is worth 100 times more than one for a player who is not in the Hall of Fame.

Then, as in HW3, upload a .csv file containing your models to CCLE.

## Step 1: build the models

Using the ‘RelevantInformation’ table, one model should predict a player’s maximum salary, the other should predict whether or not they will get into the Hall of Fame.

YOU CAN USE ANY MODEL YOU LIKE. The baseline models are a linear regression model and a logistic regression model, but you can choose any model. Please produce the most accurate models you can: more accurate models will get a higher score.

## Step 2: generate a CSV file “HW4_Baseball_Models.csv” including your 2 models

If these were your two models, then to complete the assignment you would create a CSV file HW4_Baseball_Models.csv containing two lines:

0.8999,"lm( log10(max_salary) ~ AB+R+H+X2B+X3B+HR+RBI+SB, data = RelevantInformation)"
0.7888,"glm( HallOfFame ~ AB+R+H+X2B+X3B+HR+SlugPct, data = RelevantInformation, family=binomial)"

Each line gives the accuracy of a model, as well as the exact command you used to generate the model. There is no length restriction on the lines.
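
The two-line file described above could be produced from R roughly as follows. This is only a sketch: the accuracy values and commands are the placeholder examples from the text, the data frame name `models` is made up for illustration, and the file is written to a temporary directory so the snippet runs anywhere.

```r
# Placeholder accuracies and model commands, copied from the example above.
models <- data.frame(
  accuracy = c(0.8999, 0.7888),
  command  = c(
    "lm( log10(max_salary) ~ AB+R+H+X2B+X3B+HR+RBI+SB, data = RelevantInformation)",
    "glm( HallOfFame ~ AB+R+H+X2B+X3B+HR+SlugPct, data = RelevantInformation, family=binomial)"
  ),
  stringsAsFactors = FALSE
)

# Written to a temp directory here; for submission the target would be
# "HW4_Baseball_Models.csv" in your working directory.
out_file <- file.path(tempdir(), "HW4_Baseball_Models.csv")

write.table(models, file = out_file, sep = ",",
            quote = TRUE,        # quotes character columns only, not the numbers
            qmethod = "double",  # doubles any embedded double quotes, CSV style
            row.names = FALSE, col.names = FALSE)

# Show the file contents to confirm the layout.
writeLines(readLines(out_file))
```

Writing the commands as strings keeps commas inside the formula from being read as field separators, since each command is wrapped in double quotes.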

## Step 3: upload your CSV file and notebook to CCLE

Finally, go to CCLE and upload:

your output CSV file HW4_Baseball_Models.csv
your notebook file HW4_Baseball_Modeling.ipynb
We are not planning to run any of the uploaded notebooks. However, your notebook should have the commands you used in developing your models — in order to show your work. As announced, all assignment grading in this course will be automated, and the notebook is needed in order to check results of the grading program.

Get the Lahman package for R — a database of Baseball Statistics

The safe way to install it, so it will work with Jupyter — execute the command:

sudo conda install -c https://conda.anaconda.org/asmeurer r-lahman

(The ‘sudo’ is not necessary if your conda installation is not write-protected.)

Another way to install the Lahman package (if this works from within your Jupyter session):
In [403]:
if (!(is.element("Lahman", installed.packages()))) install.packages("Lahman", repos="http://cran.us.r-project.org")

Load the Lahman baseball data
In [404]:
library(Lahman)

Another way to get the data, if you cannot load the Lahman package:
The files

PlayersAndStats.csv and

PlayersAndStatsAndSalary.csv are distributed with the homework assignment, and are used in the notebook below.

You can use these files rather than recompute the tables using the Lahman package.
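
If you take the CSV route, loading a file might look like the sketch below. The helper name `load_homework_table` is hypothetical, and the path assumes the file sits in your working directory.

```r
# Read one of the distributed CSV files into a data frame.
# `path` is wherever you saved the file; the helper name is hypothetical.
load_homework_table <- function(path) {
  if (!file.exists(path)) {
    stop("File not found: ", path)
  }
  read.csv(path, stringsAsFactors = FALSE)
}

# Usage, assuming the file is in the working directory:
if (file.exists("PlayersAndStatsAndSalary.csv")) {
  PlayersAndStatsAndSalary <- load_homework_table("PlayersAndStatsAndSalary.csv")
  str(PlayersAndStatsAndSalary)
}
```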

Extract Tables of Relevant Information for your Models

Player information — from the Master table
http://lahman.r-forge.r-project.org/doc/Master.html
In [405]:
SelectedColumns = c("playerID","nameFirst","nameLast","birthYear", "weight","height","bats","throws")
Players = na.omit( Master[, SelectedColumns] )
summary(Players)

Out[405]:
   playerID          nameFirst          nameLast           birthYear
 Length:17071       Length:17071       Length:17071       Min.   :1835
 Class :character   Class :character   Class :character   1st Qu.:1902
 Mode  :character   Mode  :character   Mode  :character   Median :1941
                                                          Mean   :1935
                                                          3rd Qu.:1969
                                                          Max.   :1994
     weight          height        bats       throws
 Min.   : 65.0   Min.   :43.00   B: 1131   L: 3430
 1st Qu.:170.0   1st Qu.:71.00   L: 4721   R:13641
 Median :185.0   Median :72.00   R:11219
 Mean   :186.2   Mean   :72.34
 3rd Qu.:200.0   3rd Qu.:74.00
 Max.   :320.0   Max.   :83.00

Player Maximum Salary — from the Salaries table
http://lahman.r-forge.r-project.org/doc/Salaries.html
In [406]:
summary(Salaries)

# example(Salaries) # see demos of results from the Salaries table

PlayerMaxSalary = aggregate( salary ~ playerID, Salaries, max )
colnames(PlayerMaxSalary) = gsub( "salary", "max_salary", colnames(PlayerMaxSalary) )

head(PlayerMaxSalary)

Out[406]:
     yearID          teamID        lgID         playerID
 Min.   :1985   CLE    :  893   AL:12123   Length:24758
 1st Qu.:1993   LAN    :  893   NL:12635   Class :character
 Median :2000   PHI    :  893              Mode  :character
 Mean   :2000   SLN    :  886
 3rd Qu.:2007   BAL    :  883
 Max.   :2014   BOS    :  883
                (Other):19427
     salary
 Min.   :       0
 1st Qu.:  260000
 Median :  525000
 Mean   : 1932905
 3rd Qu.: 2199643
 Max.   :33000000

Out[406]:

  playerID   max_salary
1 aardsda01     4500000
2 aasedo01       675000
3 abadan01       327000
4 abadfe01       525900
5 abbotje01      300000
6 abbotji01     2775000
In [407]:
PlayerStartYear = aggregate( yearID ~ playerID, Salaries, min )
colnames(PlayerStartYear) = gsub( "yearID", "startYear", colnames(PlayerStartYear) )

PlayerEndYear = aggregate( yearID ~ playerID, Salaries, max )
colnames(PlayerEndYear) = gsub( "yearID", "endYear", colnames(PlayerEndYear) )

head(PlayerStartYear)

Out[407]:

  playerID   startYear
1 aardsda01       2004
2 aasedo01        1986
3 abadan01        2006
4 abadfe01        2011
5 abbotje01       1998
6 abbotji01       1989

Batting Statistics — from the BattingStats table
http://lahman.r-forge.r-project.org/doc/battingStats.html

(See also the Batting table: http://lahman.r-forge.r-project.org/doc/Batting.html )

A glossary for Baseball Statistics Acronyms is in http://en.wikipedia.org/wiki/Baseball_statistics
In [408]:
BattingStats = battingStats()

Aggregate Batting Stats — cumulative, over a player’s career
In [409]:
TotalBattingCounts = aggregate( cbind(AB,R,H,X2B,X3B,HR,RBI,SB,CS,BB,BA,PA,TB) ~ playerID,
BattingStats, sum)
nrow(TotalBattingCounts)
MaxBattingPcts = aggregate( cbind(SlugPct,OBP,OPS,BABIP) ~ playerID,
BattingStats, max )
nrow(MaxBattingPcts)

AggregateBattingStats = merge(TotalBattingCounts,MaxBattingPcts, by="playerID")
summary(AggregateBattingStats)
nrow(AggregateBattingStats)

Out[409]:
11933
Out[409]:
16037
Out[409]:
playerID AB R H
Length:11532 Min. : 1.0 Min. : 0.0 Min. : 0.0
Class :character 1st Qu.: 19.0 1st Qu.: 1.0 1st Qu.: 3.0
Mode :character Median : 136.5 Median : 12.0 Median : 25.0
Mean : 896.7 Mean : 117.6 Mean : 234.8
3rd Qu.: 834.5 3rd Qu.: 95.0 3rd Qu.: 199.0
Max. :14053.0 Max. :2295.0 Max. :4256.0
X2B X3B HR RBI
Min. : 0.00 Min. : 0.000 Min. : 0.0 Min. : 0.0
1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 0.0 1st Qu.: 1.0
Median : 4.00 Median : 0.000 Median : 1.0 Median : 10.0
Mean : 41.29 Mean : 6.723 Mean : 21.4 Mean : 109.6
3rd Qu.: 33.00 3rd Qu.: 5.000 3rd Qu.: 10.0 3rd Qu.: 85.0
Max. :746.00 Max. :173.000 Max. :762.0 Max. :2297.0
SB CS BB BA
Min. : 0.00 Min. : 0.000 Min. : 0.00 Min. :0.000
1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 1.00 1st Qu.:0.230
Median : 0.00 Median : 0.000 Median : 8.00 Median :0.628
Mean : 15.59 Mean : 8.255 Mean : 86.33 Mean :1.116
3rd Qu.: 7.00 3rd Qu.: 6.000 3rd Qu.: 64.00 3rd Qu.:1.596
Max. :1406.00 Max. :335.000 Max. :2558.00 Max. :7.804
PA TB SlugPct OBP
Min. : 1.0 Min. : 0.0 Min. :0.000 Min. :0.0000
1st Qu.: 21.0 1st Qu.: 3.0 1st Qu.:0.250 1st Qu.:0.2500
Median : 153.0 Median : 33.0 Median :0.388 Median :0.3360
Mean : 1007.6 Mean : 353.8 Mean :0.393 Mean :0.3365
3rd Qu.: 926.5 3rd Qu.: 282.2 3rd Qu.:0.500 3rd Qu.:0.4000
Max. :15861.0 Max. :6856.0 Max. :3.000 Max. :1.0000
OPS BABIP
Min. :0.0000 Min. :0.0000
1st Qu.:0.5000 1st Qu.:0.2590
Median :0.7210 Median :0.3330
Mean :0.7185 Mean :0.3541
3rd Qu.:0.8880 3rd Qu.:0.4000
Max. :4.0000 Max. :1.0000
Out[409]:
11532
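
The three row counts above differ (11933, 16037, and 11532 after the merge). One likely contributor is that the formula interface of `aggregate` drops rows containing `NA` before applying the function, because its `na.action` defaults to `na.omit`. The toy data below is made up purely to illustrate that behavior; it is not taken from the Lahman tables.

```r
# Toy data: player "b" has a missing value in column x.
toy <- data.frame(
  playerID = c("a", "a", "b"),
  x = c(1, 2, NA),
  y = c(1, 1, 1)
)

# Formula interface: na.action defaults to na.omit, so the row with the NA
# is removed before summing and player "b" vanishes from the result.
totals <- aggregate(cbind(x, y) ~ playerID, toy, sum)
print(totals)

# Alternative: keep NA rows (na.pass) and let sum() skip the NA values.
totals_keep <- aggregate(cbind(x, y) ~ playerID, toy, sum,
                         na.rm = TRUE, na.action = na.pass)
print(totals_keep)
```

Whether to keep such players is a modeling decision; the second call only shows that the option exists.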

Inducted into the Hall of Fame? — from the HallOfFame table
http://lahman.r-forge.r-project.org/doc/HallOfFame.html
In [410]:
data(HallOfFame)
head(HallOfFame)

InductedIntoHallOfFame = subset(HallOfFame, inducted == 'Y')[ , 1:2]

head(InductedIntoHallOfFame)
nrow(InductedIntoHallOfFame)

Out[410]:

  playerID  yearID votedBy ballots needed votes inducted category needed_note
1 cobbty01    1936   BBWAA     226    170   222        Y   Player          NA
2 ruthba01    1936   BBWAA     226    170   215        Y   Player          NA
3 wagneho01   1936   BBWAA     226    170   215        Y   Player          NA
4 mathech01   1936   BBWAA     226    170   205        Y   Player          NA
5 johnswa01   1936   BBWAA     226    170   189        Y   Player          NA
6 lajoina01   1936   BBWAA     226    170   146        N   Player          NA
Out[410]:

    playerID  yearID
1   cobbty01    1936
2   ruthba01    1936
3   wagneho01   1936
4   mathech01   1936
5   johnswa01   1936
111 lajoina01   1937
Out[410]:
310

Include HallOfFame information in the Players table
In [411]:
PlayersWithHallOfFame = transform( merge( Players, InductedIntoHallOfFame, all.x=TRUE, by="playerID"),
HallOfFame = ifelse( is.na(yearID), 0, 1 ),
yearID = ifelse( is.na(yearID), 0, yearID )
)
colnames(PlayersWithHallOfFame) = gsub( "yearID", "HallOfFameYear", colnames(PlayersWithHallOfFame) )
head(PlayersWithHallOfFame, 20)

Out[411]:

   playerID  nameFirst nameLast    birthYear weight height bats throws HallOfFameYear HallOfFame
1  aardsda01 David     Aardsma          1981    205     75 R    R                   0          0
2  aaronha01 Hank      Aaron            1934    180     72 R    R                1982          1
3  aaronto01 Tommie    Aaron            1939    190     75 R    R                   0          0
4  aasedo01  Don       Aase             1954    190     75 R    R                   0          0
5  abadan01  Andy      Abad             1972    184     73 L    L                   0          0
6  abadfe01  Fernando  Abad             1985    220     73 L    L                   0          0
7  abadijo01 John      Abadie           1854    192     72 R    R                   0          0
8  abbated01 Ed        Abbaticchio      1877    170     71 R    R                   0          0
9  abbeybe01 Bert      Abbey            1869    175     71 R    R                   0          0
10 abbeych01 Charlie   Abbey            1866    169     68 L    L                   0          0
11 abbotda01 Dan       Abbott           1862    190     71 R    R                   0          0
12 abbotfr01 Fred      Abbott           1874    180     70 R    R                   0          0
13 abbotgl01 Glenn     Abbott           1951    200     78 R    R                   0          0
14 abbotje01 Jeff      Abbott           1972    190     74 R    L                   0          0
15 abbotji01 Jim       Abbott           1967    200     75 L    L                   0          0
16 abbotku01 Kurt      Abbott           1969    180     71 R    R                   0          0
17 abbotky01 Kyle      Abbott           1968    200     76 L    L                   0          0
18 abbotod01 Ody       Abbott           1888    180     74 R    R                   0          0
19 abbotpa01 Paul      Abbott           1967    185     75 R    R                   0          0
20 aberal01  Al        Aber             1927    195     74 L    L                   0          0
In [412]:
nrow(PlayersWithHallOfFame)
nrow(subset(PlayersWithHallOfFame, HallOfFame == 1))

Out[412]:
17071
Out[412]:
277
In [413]:
PlayersAndStats = merge( PlayersWithHallOfFame, AggregateBattingStats )

nrow(PlayersAndStats)
nrow(subset(PlayersAndStats, HallOfFame == 1))

# write.csv(PlayersAndStats, file="PlayersAndStats.csv", quote=FALSE, row.names=FALSE)

Out[413]:
11299
Out[413]:
194

Join Information for your Baseball Salary model into one Table

Merge Aggregate Batting Statistics and Maximum Salary into the Relevant Information table
In [414]:
PlayersAndStatsAndSalary = transform(
merge( merge( merge( PlayersAndStats, PlayerMaxSalary ), PlayerStartYear), PlayerEndYear ),
totalYears = endYear - startYear + 1
)
head(PlayersAndStatsAndSalary)
nrow(PlayersAndStatsAndSalary)

# write.csv(PlayersAndStatsAndSalary, file="PlayersAndStatsAndSalary.csv", quote=FALSE, row.names=FALSE)

Out[414]:

  playerID  nameFirst nameLast birthYear weight height bats throws HallOfFameYear HallOfFame ...
1 aardsda01 David     Aardsma       1981    205     75 R    R                   0          0 ...
2 aasedo01  Don       Aase          1954    190     75 R    R                   0          0 ...
3 abadan01  Andy      Abad          1972    184     73 L    L                   0          0 ...
4 abadfe01  Fernando  Abad          1985    220     73 L    L                   0          0 ...
5 abbotje01 Jeff      Abbott        1972    190     74 R    L                   0          0 ...
6 abbotji01 Jim       Abbott        1967    200     75 L    L                   0          0 ...
   PA  TB SlugPct   OBP   OPS BABIP max_salary startYear endYear totalYears
1   4   0   0     0     0     0        4500000      2004    2012          9
2   5   0   0     0     0     0         675000      1986    1989          4
3  25   2   0.118 0.4   0.4   0.167     327000      2006    2006          1
4   8   1   0.143 0.143 0.286 0.25      525900      2011    2014          4
5 649 248   0.492 0.343 0.79  0.32      300000      1998    2001          4
6  24   2   0.095 0.095 0.19  0.182    2775000      1989    1999         11
Out[414]:
4090

Problem 1: construct a model with better performance (higher R-squared) than this Baseline Salary Model

For this salary model, consider only those players who started playing in 2000 or later:
In [415]:
RecentPlayersAndStatsAndSalary = subset( PlayersAndStatsAndSalary, startYear >= 2000 )
nrow(RecentPlayersAndStatsAndSalary)

Out[415]:
1720
In [416]:
summary(PlayersAndStatsAndSalary)

Out[416]:
playerID nameFirst nameLast birthYear
Length:4090 Length:4090 Length:4090 Min. :1925
Class :character Class :character Class :character 1st Qu.:1964
Mode :character Mode :character Mode :character Median :1972
Mean :1972
3rd Qu.:1980
Max. :1993
weight height bats throws HallOfFameYear
Min. :140.0 Min. :66.0 B: 397 L: 830 Min. : 0.00
1st Qu.:182.2 1st Qu.:72.0 L:1158 R:3260 1st Qu.: 0.00
Median :195.0 Median :73.0 R:2535 Median : 0.00
Mean :197.7 Mean :73.4 Mean : 18.62
3rd Qu.:210.0 3rd Qu.:75.0 3rd Qu.: 0.00
Max. :295.0 Max. :83.0 Max. :2015.00
HallOfFame AB R H
Min. :0.000000 Min. : 1.0 Min. : 0.0 Min. : 0.0
1st Qu.:0.000000 1st Qu.: 32.0 1st Qu.: 2.0 1st Qu.: 4.0
Median :0.000000 Median : 306.5 Median : 28.0 Median : 59.5
Mean :0.009291 Mean : 1361.8 Mean : 181.9 Mean : 358.6
3rd Qu.:0.000000 3rd Qu.: 1846.2 3rd Qu.: 225.0 3rd Qu.: 463.8
Max. :1.000000 Max. :14053.0 Max. :2295.0 Max. :4256.0
X2B X3B HR RBI
Min. : 0.00 Min. : 0.000 Min. : 0.00 Min. : 0.0
1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 1.0
Median : 11.00 Median : 1.000 Median : 3.00 Median : 25.0
Mean : 67.54 Mean : 8.248 Mean : 38.09 Mean : 171.7
3rd Qu.: 88.00 3rd Qu.: 9.000 3rd Qu.: 35.00 3rd Qu.: 209.0
Max. :746.00 Max. :147.000 Max. :762.00 Max. :1996.0
SB CS BB BA
Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. :0.0000
1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 1.0 1st Qu.:0.4052
Median : 1.00 Median : 1.00 Median : 19.0 Median :1.0580
Mean : 27.36 Mean : 12.14 Mean : 131.8 Mean :1.5236
3rd Qu.: 19.00 3rd Qu.: 12.00 3rd Qu.: 156.0 3rd Qu.:2.3415
Max. :1406.00 Max. :335.00 Max. :2558.0 Max. :7.8040
PA TB SlugPct OBP
Min. : 1.0 Min. : 0.0 Min. :0.0000 Min. :0.0000
1st Qu.: 36.0 1st Qu.: 5.0 1st Qu.:0.2920 1st Qu.:0.2730
Median : 348.5 Median : 86.0 Median :0.4385 Median :0.3530
Mean : 1530.3 Mean : 556.9 Mean :0.4347 Mean :0.3515
3rd Qu.: 2066.5 3rd Qu.: 706.0 3rd Qu.:0.5317 3rd Qu.:0.4070
Max. :15861.0 Max. :5976.0 Max. :3.0000 Max. :1.0000
OPS BABIP max_salary startYear
Min. :0.0000 Min. :0.0000 Min. : 60000 Min. :1985
1st Qu.:0.5643 1st Qu.:0.3000 1st Qu.: 342000 1st Qu.:1989
Median :0.7895 Median :0.3490 Median : 700000 Median :1997
Mean :0.7737 Mean :0.3863 Mean : 2497992 Mean :1998
3rd Qu.:0.9350 3rd Qu.:0.4368 3rd Qu.: 3000000 3rd Qu.:2006
Max. :4.0000 Max. :1.0000 Max. :33000000 Max. :2014
endYear totalYears
Min. :1985 Min. : 1.000
1st Qu.:1996 1st Qu.: 2.000
Median :2004 Median : 5.000
Mean :2003 Mean : 6.106
3rd Qu.:2012 3rd Qu.: 9.000
Max. :2014 Max. :27.000
In [417]:
BaselineSalaryModel = lm( log10(max_salary) ~
AB+R+H+X2B+X3B+HR+RBI+SB+CS+BB+BA+PA+SlugPct+OBP+BABIP + startYear + totalYears,
data = PlayersAndStatsAndSalary)
summary(BaselineSalaryModel)

Out[417]:
Call:
lm(formula = log10(max_salary) ~ AB + R + H + X2B + X3B + HR +
RBI + SB + CS + BB + BA + PA + SlugPct + OBP + BABIP + startYear +
totalYears, data = PlayersAndStatsAndSalary)

Residuals:
Min 1Q Median 3Q Max
-1.9378 -0.2139 -0.0706 0.2238 1.5111

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.752e+01 1.247e+00 -38.122 < 2e-16 ***
AB -2.826e-03 2.860e-04 -9.881 < 2e-16 ***
R -1.949e-03 2.633e-04 -7.400 1.64e-13 ***
H 4.363e-04 1.886e-04 2.314 0.020727 *
X2B 5.930e-04 3.455e-04 1.716 0.086173 .
X3B 3.191e-03 8.454e-04 3.774 0.000163 ***
HR 2.807e-03 5.168e-04 5.432 5.89e-08 ***
RBI -5.176e-04 2.389e-04 -2.167 0.030328 *
SB 6.342e-04 2.307e-04 2.749 0.006005 **
CS -6.031e-04 7.514e-04 -0.803 0.422225
BB -2.548e-03 2.635e-04 -9.672 < 2e-16 ***
BA -9.005e-02 1.061e-02 -8.486 < 2e-16 ***
PA 2.924e-03 2.680e-04 10.908 < 2e-16 ***
SlugPct -1.775e-02 3.243e-02 -0.547 0.584078
OBP -9.577e-02 5.729e-02 -1.672 0.094671 .
BABIP 2.555e-01 3.618e-02 7.062 1.92e-12 ***
startYear 2.642e-02 6.224e-04 42.451 < 2e-16 ***
totalYears 9.980e-02 1.423e-03 70.126 < 2e-16 ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1

Residual standard error: 0.3229 on 4072 degrees of freedom
Multiple R-squared: 0.7198, Adjusted R-squared: 0.7186
F-statistic: 615.2 on 17 and 4072 DF, p-value: < 2.2e-16
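
To compare a candidate model against the baseline, the R-squared can be read from the model summary rather than copied by eye. The sketch below uses synthetic data (not the Lahman tables) and made-up column names, purely to show the mechanics; the helper `r2` is hypothetical.

```r
set.seed(1)

# Synthetic stand-in for the salary table (NOT Lahman data).
n <- 200
toy <- data.frame(
  hits  = runif(n, 0, 2000),
  years = sample(1:20, n, replace = TRUE)
)
toy$log_salary <- 5 + 0.0003 * toy$hits + 0.08 * toy$years + rnorm(n, sd = 0.3)

baseline  <- lm(log_salary ~ hits, data = toy)
candidate <- lm(log_salary ~ hits + years, data = toy)

# R-squared and adjusted R-squared are stored in the model summary.
r2 <- function(model) {
  s <- summary(model)
  c(r.squared = s$r.squared, adj.r.squared = s$adj.r.squared)
}

print(rbind(baseline = r2(baseline), candidate = r2(candidate)))
```

Because in-sample R-squared cannot decrease when predictors are added to a linear model, the adjusted value is worth checking alongside it.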

In [418]:
# Create your model here …

Problem 2: construct a model with better performance (higher accuracy) than this Baseline Hall of Fame Model

Hall of Fame election rules:
A. A baseball player must have been active as a player in the Major Leagues at some time during a period beginning fifteen (15) years before and ending five (5) years prior to election.

B. Player must have played in each of ten (10) Major League championship seasons, some part of which must have been within the period described in 3(A).

C. Player shall have ceased to be an active player in the Major Leagues at least five (5) calendar years preceding the election but may be otherwise connected with baseball.

Consequently: only consider players born before 1970
(They must start around 20 years of age, play at least 10 years, have stopped playing at least 5 years earlier, and take perhaps 10 years to win the ballot — so born at least 45 years ago.)
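
The arithmetic behind the 1970 cutoff can be spelled out as below. The reference year of 2015 is an assumption made for this sketch; the text does not state it.

```r
# Timeline estimates taken from the paragraph above.
debut_age       <- 20  # "start around 20 years of age"
seasons_played  <- 10  # rule B: at least ten seasons
retirement_wait <- 5   # rule C: retired at least five years
ballot_years    <- 10  # "perhaps 10 years to win the ballot"

min_age_at_induction <- debut_age + seasons_played + retirement_wait + ballot_years

# Assumption: 2015 is the reference year (not stated in the assignment).
reference_year    <- 2015
latest_birth_year <- reference_year - min_age_at_induction

print(min_age_at_induction)  # 45
print(latest_birth_year)     # 1970
```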
In [419]:
HallOfFameContenders = subset( PlayersAndStats, birthYear < 1970 )
head(HallOfFameContenders)
nrow(HallOfFameContenders)

Out[419]:

   playerID  nameFirst nameLast birthYear weight height bats throws HallOfFameYear HallOfFame ...
2  aaronha01 Hank      Aaron         1934    180     72 R    R                1982          1 ...
3  aaronto01 Tommie    Aaron         1939    190     75 R    R                   0          0 ...
4  aasedo01  Don       Aase          1954    190     75 R    R                   0          0 ...
7  abadijo01 John      Abadie        1854    192     72 R    R                   0          0 ...
9  abbotji01 Jim       Abbott        1967    200     75 L    L                   0          0 ...
10 abbotku01 Kurt      Abbott        1969    180     71 R    R                   0          0 ...
    SB CS   BB    BA    PA   TB SlugPct   OBP   OPS BABIP
2  240 73 1402 6.927 13940 6856   0.669 0.41  1.079 0.338
3    9  8   86 1.545  1045  309   0.374 0.318 0.686 0.276
4    0  0    0 0         5    0   0     0     0     0
7    1  0    0 0.472    49   11   0.25  0.25  0.5   0.25
9    0  0    0 0.095    24    2   0.095 0.095 0.19  0.182
10  22 11  133 2.511  2227  864   0.465 0.326 0.77  0.354
Out[419]:
8111
In [420]:
BaselineHallOfFameModel = glm( HallOfFame ~ AB+R+H+X2B+X3B+HR+RBI+SB+CS+BB+BA+PA+SlugPct+OBP+BABIP,
data = HallOfFameContenders, family=binomial)

summary(BaselineHallOfFameModel)

Out[420]:
Call:
glm(formula = HallOfFame ~ AB + R + H + X2B + X3B + HR + RBI +
SB + CS + BB + BA + PA + SlugPct + OBP + BABIP, family = binomial,
data = HallOfFameContenders)

Deviance Residuals:
Min 1Q Median 3Q Max
-2.0236 -0.1609 -0.1354 -0.1225 3.2096

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.177921 0.245464 -21.094 < 2e-16 ***
AB -0.015193 0.002452 -6.195 5.82e-10 ***
R 0.004717 0.001993 2.366 0.01797 *
H 0.004591 0.001633 2.812 0.00493 **
X2B -0.017811 0.003509 -5.076 3.86e-07 ***
X3B 0.021059 0.006559 3.210 0.00133 **
HR -0.007845 0.003394 -2.312 0.02080 *
RBI 0.006136 0.001578 3.890 0.00010 ***
SB 0.005979 0.002021 2.958 0.00309 **
CS -0.034386 0.007499 -4.586 4.53e-06 ***
BB -0.013913 0.002313 -6.015 1.80e-09 ***
BA 0.065975 0.135318 0.488 0.62587
PA 0.013597 0.002289 5.941 2.83e-09 ***
SlugPct 0.539446 0.509461 1.059 0.28966
OBP 0.570971 1.114045 0.513 0.60829
BABIP 0.035612 0.804150 0.044 0.96468

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 1824.3 on 8110 degrees of freedom
Residual deviance: 1295.1 on 8095 degrees of freedom
AIC: 1327.1

Number of Fisher Scoring iterations: 7

In [421]:
confusionMatrix = table( round(predict(BaselineHallOfFameModel, type="response")), HallOfFameContenders$HallOfFame )
confusionMatrix
# terrible prediction accuracy: only 38 of the 193 Hall-of-Fame players were identified correctly:

Out[421]:

       0    1
  0 7899  155
  1   19   38
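
Using `round()` on the predicted probabilities amounts to a cutoff of 0.5, which is rarely reached when positives are scarce. One common adjustment, shown here only as a sketch on synthetic data (not the Lahman tables), is to lower the cutoff. The helper name `confusion_at` and the cutoff 0.05 are arbitrary choices for illustration.

```r
set.seed(42)

# Synthetic imbalanced data (NOT Lahman data): only a few percent positives.
n <- 5000
score <- rnorm(n)
y <- rbinom(n, 1, plogis(-4.5 + 1.5 * score))
toy <- data.frame(y = y, score = score)

fit <- glm(y ~ score, data = toy, family = binomial)
p <- predict(fit, type = "response")

# Confusion matrix at a given probability cutoff
# (rows: predicted class, columns: actual class).
confusion_at <- function(p, actual, cutoff) {
  table(predicted = factor(as.integer(p > cutoff), levels = 0:1),
        actual    = factor(actual, levels = 0:1))
}

print(confusion_at(p, toy$y, 0.5))   # the cutoff implied by round(p)
print(confusion_at(p, toy$y, 0.05))  # lower cutoff flags more cases as positive
```

A lower cutoff trades more false positives for more true positives, which matters when correct positive predictions are weighted heavily.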

Warning! This dataset is severely imbalanced. Read Ch.16 of [APM]
Only about 2% of these players are inducted into the Hall of Fame:
In [422]:
( FameTally = table( HallOfFameContenders$HallOfFame ) )

Out[422]:
   0    1
7918  193
In [423]:
data.frame( percentageOfHallOfFamers = FameTally[2] / sum(FameTally) )

Out[423]:

  percentageOfHallOfFamers
1               0.02379485

The measure of accuracy will heavily emphasize correct prediction of Hall-of-Fame players
(i.e., the measurement of accuracy will focus on correct prediction of Hall-of-Fame players)

Even though classifying everybody as a NON-Hall-of-Fame player is right for about 98% of the players, predictions for Hall-of-Fame players will be weighted heavily in this assignment. Ignoring these players will get a very low score on this assignment.

Specifically, your model will be evaluated by its Hall-of-Fame-Accuracy-Rate:

This rate is a weighted percentage of correct predictions for players in the Hall of Fame: correct prediction for players in the Hall of Fame is worth 100 times more than for players who are not in the Hall of Fame.
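
One way to read this definition is sketched below. The exact formula used by the grading program is not given, so the normalization (dividing by the total available weight) is an assumption, and the function name `weighted_accuracy` is hypothetical. The label vectors are rebuilt from the counts in the baseline confusion matrix shown earlier.

```r
# Weighted accuracy: a correct call on a Hall-of-Fame player counts `weight`
# times as much as a correct call on anyone else.
# Assumption: the score is normalized by the total available weight.
weighted_accuracy <- function(predicted, actual, weight = 100) {
  w <- ifelse(actual == 1, weight, 1)
  sum(w * (predicted == actual)) / sum(w)
}

# Counts from the baseline confusion matrix:
# 7899 true negatives, 19 false positives, 155 false negatives, 38 true positives.
actual    <- c(rep(0, 7899), rep(0, 19), rep(1, 155), rep(1, 38))
predicted <- c(rep(0, 7899), rep(1, 19), rep(0, 155), rep(1, 38))

print(weighted_accuracy(predicted, actual))               # baseline, about 0.43
print(weighted_accuracy(rep(0, length(actual)), actual))  # predict nobody, about 0.29
```

Under this reading, the rule that classifies nobody as a Hall-of-Fame player scores about 0.29 even though it is right for roughly 98% of the players.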
In [424]:
# Create your model here …
