SABR
GAMES and SIMULATIONS
COMMITTEE

Generating Opening Day Rosters from Retrosheet Event Files Using R

I wanted to use the Opening Day MLB rosters for various season replays in my computer-based simulations, but I wasn’t able to locate this information in the usual places (Baseball-Reference, Retrosheet, etc.). I belatedly found out that a number of volunteers in the sabermetric community have already done this work, scouring box scores and transaction pages to cobble together this data. Additionally, some current computer baseball simulations such as Out of the Park, Action! PC and Digital Diamond Baseball include built-in functionality or utilities that let users incorporate Opening Day Rosters, As-Played Lineups and Real-Life Transactions into their replays. I also discovered the ATMgrforBBW@groups.io group, which supports the Automatic Transaction Manager (ATMgr) developed by Gary Leven for APBA’s Baseball for Windows (BBW).

However, since I went to the trouble of writing an R script to facilitate my needs, I figured that I’d share the results with the community. The script essentially imports one or more years’ worth of Retrosheet event files and iterates through the data to extract the starters and substitutes from every game in the file. The code will also gather the game dates and relevant team information. Once the data is extracted, the resulting data table can be further sorted and manipulated to remove all but the first game in a given season for every player. You can export the results and take the first 25 players for each team as their Opening Day Roster. Individuals who do not wish to go through the entire process outlined below can still acquire the data as I’ve shared it via Google Sheets here ->

https://docs.google.com/spreadsheets/d/e/2PACX-1vRrSOoOTDuTYYrd9jfwNj7Tt4WLWrJBIwu-ASxDCuYc9035HOW2cB7V-91dhu_YWmbPeuzH55kYHR9P/pubhtml

What is R?

The definition of the R programming language, as it relates to baseball data, was articulated in the book “Analyzing Baseball Data with R” by Max Marchi and Jim Albert. Quoting from the Preface of the first edition: “R is a system for statistical computation and graphics, and it is a computer language designed for typical and possibly specialized statistical and graphical applications... The public availability of baseball data and the open-source R software is an attractive marriage. R provides a large range of tools for importing, arranging and organizing large datasets. By the use of built-in functions and collections of packages from the R user-community, one can perform various data and graphical analyses, and communicate this work easily to other baseball enthusiasts.”

Requirements:

You’ll need to download and install R 3.3.0+ and RStudio Desktop for Windows from the following site:

https://posit.co/download/rstudio-desktop/

The installation process for R and RStudio is beyond the scope of this article.

You will need to download the Retrosheet event files and extract them into a folder.

https://www.retrosheet.org/game.htm

You can choose to download the event files for individual seasons or, if you scroll down on the page, look for the section “Regular season event files by decade”. If you have ample available disk space, I suggest downloading the decade event files.

I extract the event files into C:\Retrosheet\Events on my computer. You can change the folder/subfolder location if you wish, but you’ll need to update any references to that folder in the R script.

The resulting files are exported to a folder called C:\OpeningDayRosters. Again, you can export to any folder that you choose, but you’ll need to modify that line of code in the R script.

Optional – merge the event files. Open a command prompt in the folder that contains the extracted event files and execute the following commands:

       copy 191*.ev* events_1910.csv
       copy 192*.ev* events_1920.csv
       copy 193*.ev* events_1930.csv
       copy 194*.ev* events_1940.csv
       copy 195*.ev* events_1950.csv
       copy 196*.ev* events_1960.csv
       copy 197*.ev* events_1970.csv
       copy 198*.ev* events_1980.csv
       copy 199*.ev* events_1990.csv
       copy 200*.ev* events_2000.csv
       copy 201*.ev* events_2010.csv
       copy 202*.ev* events_2020.csv
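If you would rather stay inside R than use the command prompt, the same merge can be sketched in base R. The function name merge_event_files and its arguments are hypothetical helpers for illustration and are not part of the original script. The demonstration runs against two tiny stand-in files in a temporary folder rather than real Retrosheet downloads.

```r
# Hypothetical helper: concatenate every event file whose name starts with
# `prefix` (for example "194" for the 1940s) into a single output file.
merge_event_files <- function(event_dir, prefix, out_file) {
  files <- list.files(event_dir,
                      pattern = paste0("^", prefix, ".*\\.ev"),
                      ignore.case = TRUE, full.names = TRUE)
  merged <- unlist(lapply(files, readLines, warn = FALSE))
  writeLines(merged, out_file)
  invisible(length(files))   # number of files that were merged
}

# Demonstration with two tiny stand-in files (not real Retrosheet data)
demo_dir <- file.path(tempdir(), "events_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("id,BRO194704150", "version,1"),
           file.path(demo_dir, "1947BRO.EVN"))
writeLines(c("id,BOS194704150", "version,1"),
           file.path(demo_dir, "1947BOS.EVA"))

merged_file <- file.path(demo_dir, "events_1940.csv")
n_files <- merge_event_files(demo_dir, "194", merged_file)
merged_lines <- readLines(merged_file)
```

To use it on real data you would point event_dir at your own events folder and call the function once per decade prefix.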

You’ll need the R script – extract the .zip file into a C:\R_Scripts folder, or into an alternate directory if you’d prefer to house your R script files elsewhere.

A closer look at Retrosheet Event Files

Please check out the following link for a detailed description of the Retrosheet event file contents and scoring system. https://www.retrosheet.org/eventfile.htm

Each game in an event file consists of multiple record types including:

id, version, info, start, sub, badj, padj, ladj, radj, presadj, data, com

We will use Jackie Robinson’s MLB debut for the Brooklyn Dodgers on April 15, 1947 as our sample event file.  As you’ll see in my explanation of the script that I wrote to determine the Opening Day Rosters, I drop all of the record types except for id, info, start and sub very early in the process. However, you may wish to examine the Event files for play-by-play or other information. In that case, you’ll need to determine which records are pertinent to your project.

The ‘id’ row is fairly self-explanatory. The entry consists of a three-letter abbreviation for the home team followed by the year, month, day and game number (single game (0), first game (1) or second game (2) if a double-header was played on that date).

id,BRO194704150
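As a quick illustration of that layout, the 12-character game ID can be split with substr. This is only a sketch; parse_game_id is a hypothetical helper name and is not used by the Opening Day Rosters script.

```r
# Hypothetical helper: split a Retrosheet game ID into its three parts
parse_game_id <- function(game_id) {
  list(
    homeTeam = substr(game_id, 1, 3),                        # home team code
    date     = as.Date(substr(game_id, 4, 11), "%Y%m%d"),    # year, month, day
    number   = as.integer(substr(game_id, 12, 12))           # 0, 1 or 2
  )
}

game <- parse_game_id("BRO194704150")
```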

You may safely ignore the ‘version’ row.

version,1

The ‘info’ records encompass approximately 30 rows of data but only a handful of rows are germane to the Opening Day Rosters script. I’m only interested in visteam, hometeam, date and number – the remaining rows will be discarded.

info,inputprogvers,"version 7RS(19) of 07/07/92"

info,visteam,BSN

info,hometeam,BRO

info,date,1947/04/15

info,site,NYC15

info,number,0

info,starttime,0:00PM

info,daynight,day

info,usedh,false

info,umphome,pineb101

info,ump1b,barla901

info,ump2b,(none)

info,ump3b,gorea901

info,scorer,"27,32"

info,translator,"Smith"

info,inputter,"Smith"

info,inputtime,1993/05/15 8:54PM

info,howscored,unknown

info,pitches,none

info,temp,0

info,winddir,unknown

info,windspeed,-1

info,fieldcond,unknown

info,precip,unknown

info,sky,unknown

info,timeofgame,146

info,attendance,26623

info,wp,gregh102

info,lp,sainj101

info,save,caseh101

info,gwrbi,

The ‘start’ rows consist of 18 to 20 lines of data depending on whether the designated hitter is present in the lineup for the particular game. Each row contains the player’s unique Retrosheet ID, full name, team designation (either ‘0’ for the visiting team or ‘1’ for the home team), batting order position and fielding position. Substitutions or ‘sub’ entries consist of the same fields and appear chronologically within the play-by-play rows.

start,culld101,"Dick Culler",0,1,6

start,hoppj102,"Johnny Hopp",0,2,8

start,mccom101,"Mike McCormick",0,3,9

start,ellib103,"Bob Elliott",0,4,5

start,litwd101,"Danny Litwhiler",0,5,7

start,torge101,"Earl Torgeson",0,6,3

start,masip101,"Phil Masi",0,7,2

start,ryanc102,"Connie Ryan",0,8,4

start,sainj101,"Johnny Sain",0,9,1

start,stane101,"Eddie Stanky",1,1,4

start,robij103,"Jackie Robinson",1,2,3

start,reisp101,"Pete Reiser",1,3,8

start,walkd101,"Dixie Walker",1,4,9

start,hermg101,"Gene Hermanski",1,5,7

start,edwab101,"Bruce Edwards",1,6,2

start,jorgs101,"Spider Jorgensen",1,7,5

start,reesp101,"Peewee Reese",1,8,6

start,hattj101,"Joe Hatten",1,9,1
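To make the field layout concrete, here is a small sketch that reads three of the ‘start’ rows above into a dataframe. The column names match the ones used later in the script; the position_codes lookup and the team and position columns are additions for illustration only. Codes 1 through 9 are the standard fielding positions, and Retrosheet uses 10, 11 and 12 for the designated hitter, a pinch hitter and a pinch runner.

```r
start_lines <- c(
  'start,stane101,"Eddie Stanky",1,1,4',
  'start,robij103,"Jackie Robinson",1,2,3',
  'start,sainj101,"Johnny Sain",0,9,1'
)

starters <- read.csv(text = start_lines, header = FALSE,
                     col.names = c("startsub", "retroID", "playerName",
                                   "visitorHome", "lineupPos", "fieldingPos"),
                     stringsAsFactors = FALSE)

# Illustrative lookup: index = Retrosheet fielding position code
position_codes <- c("P", "C", "1B", "2B", "3B", "SS",
                    "LF", "CF", "RF", "DH", "PH", "PR")

starters$team     <- ifelse(starters$visitorHome == 0, "visitor", "home")
starters$position <- position_codes[starters$fieldingPos]
```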

Noteworthy information regarding the game is registered via the ‘com’ (comment) field.

com,"$Dodgers manager Leo Durocher suspended for the 1947 season for associating"

com,"with known gamblers; coach Clyde Sukeforth managed the team for the first"

com,"two games; debut for Jackie Robinson; debut for umpire Artie Gore"

Play-by-play events are registered in the order in which they occur during the contest.  This data features the inning, team (visitor or home), Retrosheet player ID, the ball-strike count when the event occurred, pitch-by-pitch description (when available) and the play/event record such as a ‘K’ for a strikeout, ‘W’ for a walk or a ‘S7’ for a single to left field.

play,1,0,culld101,??,,53

play,1,0,hoppj102,??,,K

play,1,0,mccom101,??,,S8

play,1,0,ellib103,??,,WP.1-2

play,1,0,ellib103,??,,W

play,1,0,litwd101,??,,8/F8D

play,1,1,stane101,??,,43

play,1,1,robij103,??,,53

play,1,1,reisp101,??,,W

play,1,1,walkd101,??,,13

play,2,0,torge101,??,,W

play,2,0,masip101,??,,6

play,2,0,ryanc102,??,,46(1)3/GDP

play,2,1,hermg101,??,,4/P

play,2,1,edwab101,??,,8/F8D

play,2,1,jorgs101,??,,W

play,2,1,reesp101,??,,9/P

play,3,0,sainj101,??,,53

play,3,0,culld101,??,,63

play,3,0,hoppj102,??,,43

play,3,1,hattj101,??,,K

play,3,1,stane101,??,,43

play,3,1,robij103,??,,7

play,4,0,mccom101,??,,S7

play,4,0,ellib103,??,,S8.1-3

play,4,0,litwd101,??,,FC1.3XH(1);1-2;B-1

play,4,0,torge101,??,,K/C

play,4,0,masip101,??,,7/L78

play,4,1,reisp101,??,,W

play,4,1,walkd101,??,,43.1-2

play,4,1,hermg101,??,,S8.2-3

play,4,1,edwab101,??,,54(1)/FO.3-H

play,4,1,jorgs101,??,,43

play,5,0,ryanc102,??,,S8

play,5,0,sainj101,??,,3/SH.1-2

play,5,0,culld101,??,,53/SH.2-3

play,5,0,hoppj102,??,,S7.3-H

play,5,0,mccom101,??,,W.1-2

play,5,0,ellib103,??,,8/P

play,5,1,reesp101,??,,D7

play,5,1,hattj101,??,,S/BG.2-3

play,5,1,stane101,??,,4/L

play,5,1,robij103,??,,64(1)3/GDP

play,6,0,litwd101,??,,HP

play,6,0,torge101,??,,E2/TH/BG.1-2

play,6,0,masip101,??,,53/SH.1-2;2-3

play,6,0,ryanc102,??,,S7.3-H(UR);2-H(UR)

play,6,0,sainj101,??,,13/SH.1-2

play,6,0,culld101,??,,3/FL

play,6,1,reisp101,??,,S7

play,6,1,walkd101,??,,S9.1-3

play,6,1,hermg101,??,,NP

sub,tatut101,"Tommy Tatum",1,4,12

play,6,1,hermg101,??,,9

play,6,1,edwab101,??,,HP.1-2

play,6,1,jorgs101,??,,NP

sub,rackm101,"Marv Rackley",1,6,12

play,6,1,jorgs101,??,,43.1-2;2-3;3-H

play,6,1,reesp101,??,,IW

play,6,1,hattj101,??,,NP

sub,steve101,"Ed Stevens",1,9,11

play,6,1,steve101,??,,K

play,7,0,hoppj102,??,,NP

sub,tatut101,"Tommy Tatum",1,4,9

play,7,0,hoppj102,??,,NP

sub,bragb101,"Bobby Bragan",1,6,2

play,7,0,hoppj102,??,,NP

sub,gregh102,"Hal Gregg",1,9,1

play,7,0,hoppj102,??,,7

play,7,0,mccom101,??,,41

play,7,0,ellib103,??,,W

play,7,0,litwd101,??,,K

play,7,1,stane101,??,,NP

sub,roweb101,"Bama Rowell",0,5,7

play,7,1,stane101,??,,W

play,7,1,robij103,??,,E3/TH1/SH.1-3;B-2

com,"$the throw hit Jackie Robinson and caromed into RF"

play,7,1,reisp101,??,,D9.3-H;2-H(UR)

play,7,1,tatut101,??,,NP

sub,coopm101,"Mort Cooper",0,9,1

play,7,1,tatut101,??,,NP

sub,vauga101,"Arky Vaughan",1,4,11

play,7,1,vauga101,??,,13.2-3

play,7,1,hermg101,??,,8.3-H(UR)

play,7,1,bragb101,??,,63

play,8,0,torge101,??,,NP

sub,furic101,"Carl Furillo",1,4,9

play,8,0,torge101,??,,K

play,8,0,masip101,??,,3/P

play,8,0,ryanc102,??,,S7

play,8,0,coopm101,??,,NP

sub,neilt101,"Tommy Neill",0,9,11

play,8,0,neilt101,??,,HP.1-2

play,8,0,culld101,??,,NP

sub,holmt101,"Tommy Holmes",0,1,11

play,8,0,holmt101,??,,7

play,8,1,jorgs101,??,,NP

sub,sists101,"Sibby Sisti",0,1,6

play,8,1,jorgs101,??,,NP

sub,lanfw101,"Walt Lanfranconi",0,9,1

play,8,1,jorgs101,??,,43

play,8,1,reesp101,??,,K/C

play,8,1,gregh102,??,,K

play,9,0,hoppj102,??,,NP

sub,schuh101,"Howie Schultz",1,2,3

play,9,0,hoppj102,??,,6/L

play,9,0,mccom101,??,,S9

play,9,0,ellib103,??,,W.1-2

play,9,0,roweb101,??,,NP

sub,caseh101,"Hugh Casey",1,9,1

play,9,0,roweb101,??,,3/FL

play,9,0,torge101,??,,K
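The Opening Day Rosters script discards the ‘play’ rows, but if you want to explore them, the sketch below reads a few of the rows above into a dataframe. The column names are my own labels for the seven fields. Rows whose event is ‘NP’ (no play) only mark the point where a substitution occurs, so they are filtered out before counting anything.

```r
play_lines <- c(
  "play,6,1,hermg101,??,,NP",
  "play,6,1,hermg101,??,,9",
  "play,6,1,edwab101,??,,HP.1-2",
  "play,6,1,steve101,??,,K"
)

# Read every field as text so that events such as "9" stay character
plays <- read.csv(text = play_lines, header = FALSE,
                  col.names = c("type", "inning", "team", "retroID",
                                "count", "pitches", "event"),
                  colClasses = "character", stringsAsFactors = FALSE)

# Drop the "NP" placeholder rows that accompany substitutions
actual_plays <- subset(plays, event != "NP")

# Events that begin with "K" are strikeouts
strikeouts <- sum(substr(actual_plays$event, 1, 1) == "K")
```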

The ‘data’ record is currently used to track earned runs allowed for each pitcher in the game. These rows are discarded by the Opening Day Roster script.

data,er,sainj101,3

data,er,coopm101,0

data,er,lanfw101,0

data,er,hattj101,1

data,er,gregh102,0

data,er,caseh101,0

Walking through the process

Here’s the code with some commentary along the way:

We need to utilize two frequently-used packages containing commands that are not included in the base-R language – dplyr and sqldf.

install.packages("dplyr")
install.packages("sqldf")
library(dplyr)
library(sqldf)

Next, we use setwd to change the working directory within RStudio to the location where you extracted the Retrosheet event files and merged them into .csv files.

setwd("C:/retrosheet/events")

# change the events_####.csv to reference the decade event file

# that you wish to import

The read.csv command imports the designated .csv file into a dataframe named “lineups_subs”.

lineups_subs <- read.csv("events_2020.csv", header = FALSE, sep = ",",
                         col.names = c("startsub", "retroID",
                                       "playerName", "visitorHome",
                                       "lineupPos", "fieldingPos",
                                       "unused1"),
                         fill = TRUE, quote = "",
                         stringsAsFactors = FALSE)
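To see what those arguments do, here is a small sketch that applies the same read.csv settings to four sample lines instead of a file. The variable names sample_lines and sample_events are illustrative only. Because fill = TRUE pads short rows, every record type fits the same seven columns, and because quote = "" turns off quote handling, the player names keep their double quotes until they are stripped later with gsub.

```r
sample_lines <- c(
  'id,BRO194704150',
  'info,visteam,BSN',
  'start,robij103,"Jackie Robinson",1,2,3',
  'play,1,1,robij103,??,,53'
)

sample_events <- read.csv(text = sample_lines, header = FALSE, sep = ",",
                          col.names = c("startsub", "retroID",
                                        "playerName", "visitorHome",
                                        "lineupPos", "fieldingPos",
                                        "unused1"),
                          fill = TRUE, quote = "",
                          stringsAsFactors = FALSE)

# The name still carries its quotes at this point
raw_name   <- sample_events$playerName[3]
clean_name <- gsub('["]', '', raw_name)
```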

As I noted earlier, the Retrosheet event files include play-by-play and comments along with several other data points that are superfluous to this particular task. The first subset command keeps only the ‘id’, ‘info’, ‘start’ and ‘sub’ records, which removes the ‘play’, ‘com’, ‘version’ and ‘data’ rows along with the other record types. Then we perform another subset to remove the ‘info’ rows that reference umpires, weather, etc.

lineups_subs_filter2 <- subset(lineups_subs, startsub == 'info' |
                                startsub == 'start' | startsub == 'sub' |
                                startsub == 'id')

lineups_subs_filter <- subset(lineups_subs_filter2,
                                retroID != 'starttime' &
                                retroID != 'daynight' &
                                retroID != 'usedh' &
                                retroID != 'innings' &
                                retroID != 'tiebreaker' &
                                retroID != 'umphome' &
                                retroID != 'ump1b' &
                                retroID != 'ump2b' &
                                retroID != 'ump3b' &
                                retroID != 'umplf' &
                                retroID != 'umprf' &
                                retroID != 'inputtime' &
                                retroID != 'howscored' &
                                retroID != 'pitches' &
                                retroID != 'oscorer' &
                                retroID != 'temp' &
                                retroID != 'winddir' &
                                retroID != 'windspeed' &
                                retroID != 'fieldcond' &
                                retroID != 'precip' &
                                retroID != 'sky' &
                                retroID != 'timeofgame' &
                                retroID != 'attendance' &
                                retroID != 'wp' &
                                retroID != 'lp' &
                                retroID != 'save' )
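An alternative worth considering is an allow-list: rather than naming every ‘info’ row to discard, name the four that the script actually reads. The sketch below is not the original method, and the names keep_records and keep_info are hypothetical. It also behaves slightly differently, since the exclusion list above leaves a few unused ‘info’ rows (site, scorer and so on) in place, whereas this version drops them.

```r
# Small stand-in for the imported event data (first two columns only)
events <- data.frame(
  startsub = c("id", "version", "info", "info", "info",
               "start", "play", "sub", "data"),
  retroID  = c("BRO194704150", "1", "visteam", "umphome", "date",
               "robij103", "1", "tatut101", "er"),
  stringsAsFactors = FALSE
)

keep_records <- c("id", "start", "sub")
keep_info    <- c("visteam", "hometeam", "date", "number")

# Keep id/start/sub rows plus only the info rows the script uses
filtered <- subset(events,
                   startsub %in% keep_records |
                     (startsub == "info" & retroID %in% keep_info))
```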

We’re going to iterate through the ‘lineups_subs_filter’ dataframe, but first we need to do some housekeeping and create several new variables to help us keep track of everything.

Let’s create a brand new dataframe called ‘lineups_subs_temp’ while retaining the same structure as ‘lineups_subs_filter’. Note: we’re renaming the ‘unused1’ column to ‘gameID’ and adding 5 new columns to the dataframe – visTeam, HomeTeam, TeamYear, date and number. The last line empties the dataframe so that only the column structure remains.

lineups_subs_temp <- lineups_subs_filter
names(lineups_subs_temp)[names(lineups_subs_temp) == 'unused1'] <- 'gameID'
lineups_subs_temp[,'visTeam'] = NA
lineups_subs_temp[,'HomeTeam'] = NA
lineups_subs_temp[,'TeamYear'] = NA
lineups_subs_temp[,'date'] = NA
lineups_subs_temp[,'number'] = NA
lineups_subs_temp <- lineups_subs_temp[0, ]

Here are the new variables for tracking information as the for-loop processes. We will track the number of rows in the new dataframe using ‘tempcounter’. The other variables are updated when the loop processes a row in ‘lineups_subs_filter’ that matches exactly on the corresponding information.

tempcounter = 1
currentID <- ""
currentVisTeam <- ""
currentHomeTeam <- ""
currentTeamYear <- ""
currentDate <- ""
currentNumber <- "0"      # game number: '0' for a single game,
                          # '1' for the first game of a double-header,
                          # '2' for the second game of a double-header

This is where the majority of the magic happens! The variable ‘i’ counts the rows as we step through the lineups_subs_filter dataframe. Then we scan the contents of the current row (based on the value of ‘i’) using several ifelse statements to populate the currentID, currentVisTeam, currentHomeTeam, currentDate and currentNumber variables.

for(i in 1:nrow(lineups_subs_filter)) {       # for-loop over rows
      currentID <- ifelse(lineups_subs_filter[i,1] == "id", lineups_subs_filter[i,2], currentID)
      currentVisTeam <- ifelse(lineups_subs_filter[i,1] == "info" & lineups_subs_filter[i,2] == "visteam", lineups_subs_filter[i,3], currentVisTeam)
      currentHomeTeam <- ifelse(lineups_subs_filter[i,1] == "info" & lineups_subs_filter[i,2] == "hometeam", lineups_subs_filter[i,3], currentHomeTeam)
      currentDate <- ifelse(lineups_subs_filter[i,1] == "info" & lineups_subs_filter[i,2] == "date", lineups_subs_filter[i,3], currentDate)
      currentNumber <- ifelse(lineups_subs_filter[i,1] == "info" & lineups_subs_filter[i,2] == "number", lineups_subs_filter[i,3], currentNumber)

The next set of statements fires only if the value of the first column in the current row is equal to “start” or “sub”. When the condition is true, we create a record in the new dataframe (lineups_subs_temp) containing the following fields:

startsub, retroID, playerName, visitorHome, lineupPos, fieldingPos, gameID, visTeam, HomeTeam, TeamYear (the player’s team, chosen by the status of the visitorHome field – 0 or 1 – combined with the year), date, number

      if(lineups_subs_filter[i,1] == "start" | lineups_subs_filter[i,1] == "sub") {
        lineups_subs_temp[tempcounter , 1] <- lineups_subs_filter[i,1]
        lineups_subs_temp[tempcounter , 2] <- lineups_subs_filter[i,2]
        lineups_subs_temp[tempcounter , 3] <- gsub('["]', '', lineups_subs_filter[i,3])
        lineups_subs_temp[tempcounter , 4] <- lineups_subs_filter[i,4]
        lineups_subs_temp[tempcounter , 5] <- lineups_subs_filter[i,5]
        lineups_subs_temp[tempcounter , 6] <- lineups_subs_filter[i,6]
        lineups_subs_temp[tempcounter , 7] <- currentID
        lineups_subs_temp[tempcounter , 8] <- currentVisTeam
        lineups_subs_temp[tempcounter , 9] <- currentHomeTeam

We check whether the current player is a member of the visiting or home team and assign the “TeamYear” accordingly (e.g. CAL_1983, SEA_2001). The substr command extracts the year from the currentDate variable.

        currentTeamYear <- ifelse(lineups_subs_filter[i,4] == "0",
                                  paste(currentVisTeam, substr(currentDate,1,4), sep="_"),
                                  paste(currentHomeTeam, substr(currentDate,1,4), sep="_"))
        lineups_subs_temp[tempcounter , 10] <- currentTeamYear
        lineups_subs_temp[tempcounter , 11] <- currentDate
        lineups_subs_temp[tempcounter , 12] <- currentNumber
        tempcounter <- tempcounter + 1
      }
} # end of for..loop
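Growing a dataframe one row at a time is a large part of why the loop is slow. If the run time becomes a problem, a vectorized approach is one possible alternative. The sketch below is not the original method: it numbers the games with cumsum, looks up each game’s ‘info’ values once, and then attaches them to all of the player rows in a single step. The names events, info_value and players are hypothetical, and the sample data is a three-player stand-in for lineups_subs_filter. One behavioral difference: a game that lacks an ‘info’ row gets NA here, whereas the loop carries the previous game’s value forward.

```r
# Small stand-in for lineups_subs_filter (one game, three players)
events <- data.frame(
  startsub    = c("id", "info", "info", "info", "info",
                  "start", "start", "sub"),
  retroID     = c("BRO194704150", "visteam", "hometeam", "date", "number",
                  "culld101", "stane101", "tatut101"),
  playerName  = c("", "BSN", "BRO", "1947/04/15", "0",
                  "\"Dick Culler\"", "\"Eddie Stanky\"", "\"Tommy Tatum\""),
  visitorHome = c("", "", "", "", "", "0", "1", "1"),
  lineupPos   = c("", "", "", "", "", "1", "1", "4"),
  fieldingPos = c("", "", "", "", "", "6", "4", "12"),
  stringsAsFactors = FALSE
)

# Number the games: each 'id' row starts a new game
# (assumes the data begins with an 'id' row)
game_index <- cumsum(events$startsub == "id")
n_games    <- max(game_index)

# Look up one 'info' value per game (NA if the game lacks that row)
info_value <- function(key) {
  rows   <- events$startsub == "info" & events$retroID == key
  values <- rep(NA_character_, n_games)
  values[game_index[rows]] <- events$playerName[rows]
  values
}

game_ids  <- events$retroID[events$startsub == "id"]
vis_team  <- info_value("visteam")
home_team <- info_value("hometeam")
game_date <- info_value("date")
game_num  <- info_value("number")

# Keep the player rows and attach the game-level values in one step
is_player <- events$startsub %in% c("start", "sub")
g         <- game_index[is_player]
players   <- events[is_player, ]

players$playerName <- gsub('"', "", players$playerName, fixed = TRUE)
players$gameID     <- game_ids[g]
players$visTeam    <- vis_team[g]
players$HomeTeam   <- home_team[g]
players$TeamYear   <- paste(ifelse(players$visitorHome == "0",
                                   vis_team[g], home_team[g]),
                            substr(game_date[g], 1, 4), sep = "_")
players$date       <- game_date[g]
players$number     <- game_num[g]
```

The resulting players dataframe has the same twelve columns, in the same order, as lineups_subs_temp.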

Once the loop completes (this can take several hours if you combined multiple event files as I did), we execute a sqldf statement to generate a new dataframe called OpeningDayRosters_temp. This dataframe will be sorted in ascending order by TeamYear, retroID, date, number.

OpeningDayRosters_temp <- sqldf("SELECT * FROM lineups_subs_temp
                                ORDER BY TeamYear ASC, retroID ASC,
                                date ASC, number ASC")

Next, we create new columns for PlayerTeamYear and year to assist with the sorting process.

OpeningDayRosters_temp$PlayerTeamYear <-
  paste(OpeningDayRosters_temp$retroID,
        OpeningDayRosters_temp$TeamYear, sep="_")

OpeningDayRosters_temp$year <- substr(OpeningDayRosters_temp$gameID,4,7)

This is where the dplyr library comes into play. We’re utilizing the group_by, mutate and slice commands. First, we group the rows by PlayerTeamYear. Then we use the mutate command to generate a new variable ‘gameNumber’ that contains each row’s rank within its group based on the date. Slice with the 1:1 parameter keeps only the first row in each group; because the rows were already sorted by date and game number, that row is the player’s earliest game. The resulting rows are placed into the OpeningDayRosters dataframe. In essence, we’re eliminating every game that a player participated in besides their first game of a given season with a given team.

OpeningDayRosters <- OpeningDayRosters_temp %>%
  group_by(PlayerTeamYear) %>%
  mutate(gameNumber = rank(date)) %>%
  slice(1:1)
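For readers who prefer not to depend on dplyr, the same "first game per player" selection can be sketched in base R with order and duplicated. The dataframe below is a made-up four-row stand-in (the later dates are invented for the example), and the names appearances and first_games are hypothetical.

```r
# Illustrative stand-in for OpeningDayRosters_temp (dates partly made up)
appearances <- data.frame(
  PlayerTeamYear = c("robij103_BRO_1947", "robij103_BRO_1947",
                     "gregh102_BRO_1947", "gregh102_BRO_1947"),
  date           = c("1947/04/17", "1947/04/15",
                     "1947/04/15", "1947/04/18"),
  number         = c("0", "0", "0", "0"),
  stringsAsFactors = FALSE
)

# Sort each player's games chronologically ...
sorted <- appearances[order(appearances$PlayerTeamYear,
                            appearances$date,
                            appearances$number), ]

# ... then keep only the first row seen for each PlayerTeamYear
first_games <- sorted[!duplicated(sorted$PlayerTeamYear), ]
```

Sorting the text dates works because Retrosheet writes them as yyyy/mm/dd with leading zeros.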

Here we use sqldf again, creating a new dataframe called OpeningDayRosters25 which will be sorted by year, TeamYear and date.

OpeningDayRosters25 <- sqldf("SELECT * FROM OpeningDayRosters
                                ORDER BY year ASC, TeamYear ASC, date ASC")

Let’s change the working directory in RStudio to the C:\OpeningDayRosters folder. The final step is to write the dataframe out to a .csv (comma-separated values) file that you can view in a text editor or edit in a spreadsheet or database application such as Microsoft Excel, Microsoft Access, etc.

setwd("C:/OpeningDayRosters")

# uncomment the next line if you want to export all of the start/sub data for
# a given season or decade
# write.csv(OpeningDayRosters_temp,"OpeningDayRosters_All_2020.csv")

write.csv(OpeningDayRosters25,"OpeningDayRosters_first25_2020.csv")

Results

Here’s an example of the resulting output using the 1983 California Angels. The top 25 rows (from Juan Beniquez through Joe Ferguson) would be considered the Opening Day roster.

TeamYear  playerName        date       startsub  retroID   visHome  lineup  fieldPos
CAL_1983  Juan Beniquez     4/5/1983   sub       benij001  1        1       7
CAL_1983  Bob Boone         4/5/1983   start     boonb001  1        9       2
CAL_1983  Bob Clark         4/5/1983   start     clarb002  1        2       9
CAL_1983  Doug DeCinces     4/5/1983   start     decid001  1        6       5
CAL_1983  Brian Downing     4/5/1983   start     downb001  1        1       7
CAL_1983  Tim Foli          4/5/1983   start     folit001  1        8       6
CAL_1983  Bobby Grich       4/5/1983   start     gricb001  1        7       4
CAL_1983  Andy Hassler      4/5/1983   sub       hassa001  1        0       1
CAL_1983  Reggie Jackson    4/5/1983   start     jackr001  1        4       10
CAL_1983  Bruce Kison       4/5/1983   start     kisob001  1        0       1
CAL_1983  Fred Lynn         4/5/1983   start     lynnf001  1        5       8
CAL_1983  Luis Sanchez      4/5/1983   sub       sancl001  1        0       1
CAL_1983  Daryl Sconiers    4/5/1983   start     scond001  1        3       3
CAL_1983  Ron Jackson       4/6/1983   start     jackr002  1        6       3
CAL_1983  Tommy John        4/6/1983   start     johnt001  1        0       1
CAL_1983  Ricky Adams       4/7/1983   sub       adamr001  1        4       5
CAL_1983  Rod Carew         4/7/1983   sub       carer001  1        8       11
CAL_1983  Dave Goltz        4/7/1983   sub       goltd101  1        0       1
CAL_1983  Rob Wilfong       4/7/1983   sub       wilfr001  1        8       12
CAL_1983  Mike Witt         4/7/1983   start     wittm001  1        0       1
CAL_1983  Doug Corbett      4/8/1983   sub       corbd001  0        0       1
CAL_1983  Geoff Zahn        4/8/1983   start     zahng001  0        0       1
CAL_1983  Ken Forsch        4/9/1983   start     forsk001  0        0       1
CAL_1983  Jack Curtis       4/10/1983  sub       curtj001  0        0       1
CAL_1983  Joe Ferguson      4/11/1983  start     fergj101  1        8       2
CAL_1983  Ellis Valentine   5/6/1983   sub       valee001  0        6       9
CAL_1983  Bill Travers      5/10/1983  start     travb101  0        0       1
CAL_1983  Byron McLaughlin  6/7/1983   sub       mclab102  1        0       1
CAL_1983  Curt Brown        6/10/1983  sub       browc001  1        0       1
CAL_1983  Mike O’Berry      6/22/1983  start     oberm001  0        9       2
CAL_1983  Rick Burleson     6/30/1983  start     burlr001  1        1       6
CAL_1983  Steve Lubratich   7/20/1983  start     lubrs101  1        2       5
CAL_1983  Mike Brown        7/21/1983  start     browm002  1        7       9
CAL_1983  Rick Steirer      7/22/1983  sub       steir001  1        0       1
CAL_1983  Steve Brown       8/1/1983   start     brows001  0        0       1
CAL_1983  Bob Lacey         9/8/1983   sub       laceb001  0        0       1
CAL_1983  Jerry Narron      9/8/1983   sub       narrj001  0        6       11
CAL_1983  Gary Pettis       9/8/1983   start     pettg001  0        1       9
CAL_1983  Dick Schofield    9/8/1983   sub       schod001  0        9       6
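The script itself does not cut the list at 25; that step is left to the reader. If you want R to do it, the sketch below shows one way using base R. The function first_n_per_team is a hypothetical helper, and the demonstration uses n = 2 on a handful of rows so that the effect is easy to see. It assumes the dates are still in Retrosheet’s yyyy/mm/dd text format, which sorts correctly as text; dates reformatted by a spreadsheet (such as 4/5/1983) would not.

```r
# Hypothetical helper: keep the first n rows, by date, for each TeamYear
first_n_per_team <- function(rosters, n = 25) {
  ordered <- rosters[order(rosters$TeamYear, rosters$date), ]
  # Running position of each row within its team-season
  position_in_team <- ave(seq_len(nrow(ordered)), ordered$TeamYear,
                          FUN = seq_along)
  ordered[position_in_team <= n, ]
}

# Small demonstration using a few players from the examples in this article
demo_rosters <- data.frame(
  TeamYear = c("CAL_1983", "CAL_1983", "CAL_1983", "BRO_1947", "BRO_1947"),
  retroID  = c("valee001", "boonb001", "johnt001", "robij103", "furic101"),
  date     = c("1983/05/06", "1983/04/05", "1983/04/06",
               "1947/04/15", "1947/04/15"),
  stringsAsFactors = FALSE
)

opening_two <- first_n_per_team(demo_rosters, n = 2)
```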

Again, the entire data set is available here:

https://docs.google.com/spreadsheets/d/e/2PACX-1vRrSOoOTDuTYYrd9jfwNj7Tt4WLWrJBIwu-ASxDCuYc9035HOW2cB7V-91dhu_YWmbPeuzH55kYHR9P/pubhtml

I would encourage you to venture forward with R or the programming language of your choice. You may wish to utilize this script in its original form or as a basis to mine the event files further as you seek answers within the play-by-play and other records contained in this amazing data set!

References and Resources

Adler, Joseph. Baseball Hacks: Tips & Tools for Analyzing and Winning with Statistics. Sebastopol, CA: O’Reilly Media, 2006. Print.

Baseball-Reference. Web. < http://www.baseball-reference.com >.

Marchi, Max and Albert, Jim. Analyzing Baseball Data with R. Boca Raton, FL: CRC Press, 2014. Print.

Retrosheet. Web. < http://www.retrosheet.org >.

The information used here was obtained free of charge from and is copyrighted by Retrosheet.  Interested parties may contact Retrosheet at “www.retrosheet.org”.

About the Author

I am a New Jersey native with a passion for baseball, statistics, computers and video games who enjoys spending quality time with his family.

“Hardball Architects – Volume 1 (American League)”, published in July 2020, and “Hardball Architects – Volume 2 (National League)”, published in April 2022, examine the trades, free agent acquisitions, draft picks and other transactions for the 30 Major League Baseball franchises, divided into a 2-volume set. Both books are available in paperback and digital (Kindle) format at Amazon.com. All key moves are scrutinized for every team and Sabermetric principles are applied to the roster construction throughout the lifetime of the organization to encapsulate the hits and misses by front office executives. Team performances are analyzed based on transaction type with graphs depicting the WAR (Wins Above Replacement) in every decade. Individual results for each player-transaction are charted over the duration of their stint with the franchise. Every team chapter includes All-Time Rosters and Single-Season Leaders based on transaction type. The Team Trade Record chronicles the WAR and WS (Win Shares) accumulated by players acquired in comparison to those traded to opposing teams. The opening chapter is devoted to the Evolution of the General Manager and incorporates a discussion with former Dodgers GM Fred Claire (along with former Angels and Red Sox GM Mike Port and current Reds GM Nick Krall in Volume 2) on a variety of front office topics.

“Hardball Retroactive”, published in June 2018, is available in paperback and digital (Kindle) format at Amazon.com. Hardball Retroactive is a modest collection of selected articles that I have written for Seamheads.com along with my Baseball Analytics blog since 2010. Exclusive content includes the chapter on “Minors vs. Majors” which assesses every franchise’s minor league successes and failures in relation to their major league operations.

“Hardball Retrospective” is available in paperback and digital (Kindle) format at Amazon.com. Supplemental Statistics, Charts and Graphs along with a discussion forum are offered at TuataraSoftware.com. In Hardball Retrospective, I placed every ballplayer in the modern era (from 1901-present) on their original teams. Using a variety of advanced statistics and methods, I generated revised standings for each season based entirely on the performance of each team’s “original” players. I discuss every team’s “original” players and seasons at length along with organizational performance with respect to the Amateur Draft (or First-Year Player Draft), amateur free agent signings and other methods of player acquisition. Season standings, WAR and Win Shares totals for the “original” teams are compared against the real-time or “actual” team results to assess each franchise’s scouting, development and general management skills.

Don Daglow (Intellivision World Series Major League Baseball, Earl Weaver Baseball, Tony LaRussa Baseball) contributed the foreword for Hardball Retrospective. The foreword and preview of my book are accessible here

“Hardball Retrospective – Addendum 2014 to 2016” supplements my research for Hardball Retrospective, providing retroactive standings based on Wins Above Replacement (WAR) and Win Shares (WS) for each “original” team over the past three seasons (2014-2016). Team totals from 2010 – 2013 are included for reference purposes. “Addendum” is available in paperback and digital (Kindle) format at Amazon.com. 


1 Comment
Tim Blaker
11 months ago

Derek,

This reminds me of what I did for finding opening day batting orders. Teams use so many different orders during any one season, I wondered which one would be the team’s #1 choice. I figured out how to find the order that was used most often, but even that is used much less often than one might expect.

I decided that the order used on Opening Day would, more often than not, be the one that a team would like to use as often as possible throughout the season.

The lineups are available in Retrosheet game logs (separate from the event files). I store all Retrosheet data in MySQL tables so I just had to write a query to find the orders used in the first game of each season. I used the opening day order to compare against the simulated run production from all other possible combinations.

As I store each year in a separate table, I then had to write a PERL script to loop through the seasons.

This (and last night’s meeting) has given me a couple of ideas about simulations that I am going to start working on. If I can find a break in golf season, I could whip up a post.

Cheers,
-Tim