Channel: SB Nation - Toronto Blue Jays

Methods section: Creating your own park factors

A guide to the theory and practice of creating park factors, using ISO as an example.

Park factors are really just the worst.

I don’t mean in principle; the concept of adjusting player stats, either retrospectively for the sake of comparison or prospectively for the sake of projection, to account for the ballpark in which they play is a good idea. The problem lies in the math behind them. I’ve spent the better part of my free time for the last two or three weeks reading everything I could find written about park factors and adjustments, all the way from a post at Patriot’s old Tripod site to an article in an academic journal, and I’ve come to a few conclusions. First, it’s impossible to get park factors exactly correct. Second, it’s important to consider the tradeoffs you have to make if and when you decide to use a particular version. Third, there might be something wrong with me for spending so much time on this.

The basic idea behind calculating a park factor is easy enough, right? Usually in the context of runs per game (though sometimes by components like singles, HRs, etc.), it’s a team’s production at home over its production away. Boom. Done.

Well, no, not really. That’s affected pretty heavily by the skill level of the team. Better include the opponents, too. So, for a given team, it’s that team’s production at home PLUS its opponents’ production in those games, divided by that team’s production on the road PLUS its opponents’ production in *those* games. There. That makes sense. Done.
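As a minimal sketch of that "home plus opponents over road plus opponents" ratio (the function name, argument names, and run totals are all made up for illustration, and this is the naive version, not the method developed later in the article):

```python
def naive_park_factor(home_team, home_opp, road_team, road_opp,
                      home_games, road_games):
    """Both teams' production per game at home over the same on the road."""
    home_rate = (home_team + home_opp) / home_games
    road_rate = (road_team + road_opp) / road_games
    return home_rate / road_rate


# Made-up run totals for one team-season, purely for illustration.
pf = naive_park_factor(home_team=400, home_opp=380,
                       road_team=350, road_opp=340,
                       home_games=81, road_games=81)
print(round(pf, 3))
```

A value above 1 would suggest the home park inflates scoring relative to the parks the team visited; everything that follows is about why that simple reading is not quite good enough.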

Well, no, wait. What about interleague games? Adding/removing the DH from lineups will change the hitting skill level. We should probably exclude those. And what about pitchers batting? Given their lack of skill, will they be affected by a park the same way? Maybe we should leave their PAs out. And what about right- versus left-handers? It’s not like parks are symmetric across a line in center field; that probably makes a difference. And what about ground ball hitters versus fly ball hitters? They’ll be affected differently, too… but also, maybe the GB/FB ratio is affected by parks, too… You see my point, hopefully. This gets very complicated, very quickly, and all of the above is focused mostly on sample issues, not even really getting into problems with the formulas people use.

Still, for whatever foolish reasons drive any of us to dive headfirst into inconsequential things, I’ve decided on a method that I think makes sense, and will report on that here (and walk you through creating park factors for isolated power (ISO), the project that spurred this whole thing). As any thorough researcher would, I relied heavily on the work of those who came before me to teach me how to do this, and to give me a starting point from which to branch out. If you want to read more than I discuss here about the theories and calculations behind all this stuff, please see the following sites:

Patriot’s Park Factors
Baseball Reference Park Adjustments
FanGraphs Library - Park Factors
Park Factor Thoughts by TangoTiger
High Boskage - Baseball Data Normalization
Park Effects by Jim Furtado
The Philosophy of Park Factors by Colin Wyers

Okay. Bearing all that in mind, and probably also some resources I forgot to mention, here's what I did to create park factors for isolated power. There's a LOT of methodological detail ahead, which I think some of you might want to see, but if you don't, I respect that - just skip to the results.

Using MySQL to query the Events table of my Retrosheet database (complete years only, so 1974-2013), I created a spreadsheet of year, home team, away team, batting team, league, at bats, handedness, and ISO. Using that information, for each home team I found the ISO (separated by batter handedness) of that team and that team's opponents. To each of these I applied a regression term, which I'll explain in the next paragraph. After making an adjustment to the opponents' number (that will be described later) I combined the two, proportionally weighting the opponents' figure by how many opponents there were - so, say for Atlanta in 1974, the number is 1/12th Atlanta ISO, 11/12ths opponent ISO.
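The grouping step can be sketched in a few lines, using the fact that ISO is extra bases per at bat (a double adds 1, a triple 2, a home run 3). The row layout below is a simplified stand-in invented for illustration; it does not mirror the real Retrosheet Events table columns, and the function name is hypothetical.

```python
from collections import defaultdict


def iso_by_group(events):
    """Return {(year, park_team, side, hand): (ISO, at_bats)}.

    Each event is a hypothetical tuple:
    (year, home_team, batting_team, bat_hand, is_at_bat, total_bases)
    """
    at_bats = defaultdict(int)
    extra_bases = defaultdict(int)
    for year, home, batting, hand, is_at_bat, bases in events:
        if not is_at_bat:
            continue  # walks, sacrifices, etc. do not count toward ISO
        side = "home" if batting == home else "opp"
        key = (year, home, side, hand)
        at_bats[key] += 1
        extra_bases[key] += max(bases - 1, 0)  # singles and outs add nothing
    return {key: (extra_bases[key] / at_bats[key], at_bats[key])
            for key in at_bats}
```

Splitting on "home" versus "opp" within each home park is what allows the two figures to be regressed and weighted separately in the steps that follow.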

Regression, both in the above procedure and anywhere else I mention it, was based on the idea of reliability of statistics measured by Cronbach's alpha, which I was introduced to via Russell Carleton. He helped me out a bit when I was trying to figure it out, which I'm very grateful for. His articles on the subject can explain it much better than I ever could, so I direct you that way if you'd like to know more about it. I measured alpha separately for home and away ISO, using each season as an individual test subject while excluding the two strike years in the sample (as well as interleague and pitcher ABs). I truncated the seasonal data where necessary in order to make every year's data lines equal length, and used the 'psy' package in R to actually calculate the alpha. For home teams, it came out to 0.668 in 1092 at bats for righties and 0.642 in 721 at bats for lefties; for away teams, it was 0.486 in 696 at bats and 0.467 in 405 at bats, respectively.
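For anyone who would rather see the statistic than take it on faith, here is a minimal pure-Python sketch of the standard Cronbach's alpha formula. It is a stand-in for illustration, not the 'psy' package used above, and the layout (one row per subject, one column per item, every row truncated to the same length) is an assumption about how the data would be arranged.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a table with one row per subject.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(row totals)),
    where k is the number of items (columns).
    """
    k = len(rows[0])
    item_variances = [variance(column) for column in zip(*rows)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)
```

With perfectly consistent subjects, for example `cronbach_alpha([[1, 1, 1], [2, 2, 2], [3, 3, 3]])`, alpha comes out to 1; the noisier the items are relative to the differences between subjects, the lower it falls.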

Since Cronbach's alpha is effectively a split-half correlation coefficient, I was able to use the value I found to determine how much regression should be included in my calculations based on the formula R = AB/(AB + X), where R is the alpha I found, AB is the number of at bats (per season) corresponding to the alpha, and X is the number of at bats to use in regression. Note that AB is half of the actual number of at bats used because of the split-half nature of Cronbach's alpha; I could have used the Spearman-Brown prophecy formula to get a predicted alpha for the entire set of at bats, but the math works out identically either way. Bottom line, for home right handers 271 ABs of league average ISO were added, for home lefties 200 ABs, for away righties 368 ABs, and for away lefties 231 ABs.
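The arithmetic can be checked with a short sketch. Solving R = AB/(AB + X) for X gives X = AB(1 - R)/R, with AB taken as half the at bats behind each alpha. The function names are illustrative, and the blending function is the usual "add X at bats of league average" form, which is an assumption about how the regression was applied.

```python
def regression_at_bats(alpha, at_bats):
    """Solve R = AB / (AB + X) for X, with AB set to half the sample."""
    half = at_bats / 2
    return half * (1 - alpha) / alpha


def spearman_brown(alpha, factor=2):
    """Predicted reliability when the sample is `factor` times as long."""
    return factor * alpha / (1 + (factor - 1) * alpha)


def regress(iso, at_bats, league_iso, x):
    """Blend an observed ISO with x at bats of league-average ISO."""
    return (iso * at_bats + league_iso * x) / (at_bats + x)


# The four alpha / at-bat pairs quoted in the text above.
samples = {
    "home RH": (0.668, 1092),
    "home LH": (0.642, 721),
    "away RH": (0.486, 696),
    "away LH": (0.467, 405),
}
for label, (alpha, at_bats) in samples.items():
    print(label, round(regression_at_bats(alpha, at_bats), 1))
```

Three of the four values round to the figures quoted (271, 368, and 231); the home-lefty case lands nearer 201 than 200, which may simply reflect alpha being rounded to three decimals. Running the full at-bat count through `spearman_brown` first and then solving for X gives the same answer, which is the "works out identically" point made above.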

The adjustment to opponents' ISO I'd mentioned attempts to account for a sample difference across the different home teams. Weighting by quantity of opponents means that the batters contributing to the measured ISO will be evenly distributed across all league teams (or close to it, though not exactly even because of the unbalanced schedule); pitchers’ contributions, however, will then be coming disproportionately from the home team’s pitchers. To fix this problem, I decided to multiply the opponents’ ISO term by a term defined as the league average ISO allowed divided by the team’s pitchers’ regressed ISO allowed. There might be better ways to solve this, and I’d love to hear them if there are, but this is what I went with for the results below.
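Put together with the 1/n weighting described earlier, the numerator might be sketched like this. The function names and every number are hypothetical, and the inputs are assumed to be already regressed.

```python
def adjust_opponents(opp_iso, league_iso_allowed, team_iso_allowed):
    """Scale opponents' ISO by league ISO allowed over the home staff's
    (regressed) ISO allowed, so a hittable staff does not inflate the park."""
    return opp_iso * league_iso_allowed / team_iso_allowed


def combine(home_iso, adjusted_opp_iso, n_teams):
    """Weight by team count: 1/n home team, (n - 1)/n opponents."""
    return home_iso / n_teams + adjusted_opp_iso * (n_teams - 1) / n_teams


# Made-up inputs for a 12-team league, purely for illustration.
adjusted = adjust_opponents(opp_iso=0.150,
                            league_iso_allowed=0.145,
                            team_iso_allowed=0.150)
numerator = combine(home_iso=0.160, adjusted_opp_iso=adjusted, n_teams=12)
print(round(numerator, 5))
```

In this example the home pitchers allow more power than the league, so the opponents' 0.150 is pulled down before it receives its 11/12 weight.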

That just about wraps up the home team term; now, on to the denominator of the equation. While many if not most park factors compare home production to away production, unless you have a very specific and more obscure goal for your park factors, this isn’t the correct way to do things. If a park factor is meant to remove any park effects and place a player in a theoretical league-average context, the point of comparison needs to be league average production, not away production. Now, the closer a park factor is to neutral the less this matters, since the distance of the road production from league average production must be roughly 1/(n − 1) of the distance of the home production from average, and in the opposite direction, where n is the number of parks (because park factors must average out to neutral). If it were difficult to get the league average version, you could justify using away figures instead, but since it's very much *not* difficult, I used the league average. No further adjustments were needed; since regression is towards league average, none was included here, and using league average eliminated any over-representation from a single team in the sample.
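To put a number on how far apart the two baselines are, here is a short derivation under two idealizing assumptions: a balanced schedule, and n parks whose factors average to exactly 1. The symbols (PF_h for the home park's factor, the bar for the road-park average) are introduced here for illustration.

```latex
\frac{1}{n}\sum_{i=1}^{n} PF_i = 1
\quad\Longrightarrow\quad
\overline{PF}_{\text{road}}
  = \frac{1}{n-1}\sum_{i \neq h} PF_i
  = \frac{n - PF_h}{n-1}
  = 1 - \frac{PF_h - 1}{n-1}
```

So with n = 15 and a home factor of 1.10, the road parks average about 0.993, and a home-over-road comparison would report roughly 1.108 instead of 1.100. The gap shrinks toward zero as the park approaches neutral.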

All of that gives you a raw park factor number. In theory, if you do this for all teams in a given year, they should average to 1. I found that this generally doesn't happen; I assume it's due to the regression and adjustments, but I can't say for sure. As the last step in the process I artificially and linearly adjust each factor to force the average in each league to be 1. The final equation comes out to the following (which looks even worse in Excel, trust me), where TOI is team of interest, OPP is that team’s opponents, and POI is park of interest:

[Equation image: ISO park factor formula]
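Written out from the steps described above, the calculation might look like the following. This is a sketch assembled from the prose, not the original formula: the hat notation for regressed values, the subscript lg for league average, X for the regression at bats, and n for the number of teams in the league are all assumptions, as is reading the final "linear" adjustment as division by the league mean of the raw factors (a constant shift would be the other reading). Everything is computed at the park of interest for one batter handedness.

```latex
\begin{align*}
\widehat{ISO}_{g} &= \frac{ISO_{g}\cdot AB_{g} + ISO_{lg}\cdot X_{g}}{AB_{g} + X_{g}},
  \qquad g \in \{TOI,\ OPP\} \\[6pt]
PF^{raw}_{POI} &= \frac{\dfrac{1}{n}\,\widehat{ISO}_{TOI}
  + \dfrac{n-1}{n}\,\widehat{ISO}_{OPP}\cdot
    \dfrac{ISO^{allowed}_{lg}}{\widehat{ISO}^{allowed}_{TOI}}}{ISO_{lg}} \\[6pt]
PF_{POI} &= \frac{PF^{raw}_{POI}}{\dfrac{1}{n}\displaystyle\sum_{p=1}^{n} PF^{raw}_{p}}
\end{align*}
```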

At this linked spreadsheet, you can find single-year, three-year average, and five-year average park factors, both halved and unhalved, split by handedness for all teams and years since 1974. The averages are "surrounding"-year averages; that is to say, the year in question is the central point of the time period being averaged. Averages are interrupted by teams moving to new parks, but not by any configuration changes to existing parks.
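A "surrounding"-year average can be sketched as a centered window that only uses seasons played in the same park. Two assumptions are baked in and may not match the spreadsheet: windows simply shrink at a park change or at the edge of the sample, and "halved" carries its usual meaning of moving a factor halfway to 1 because a player bats only half the time at home. The names and numbers below are made up.

```python
def centered_average(factors, parks, year, window):
    """Mean of single-year factors in a window centered on `year`,
    restricted to seasons played in the same park as `year`."""
    half = window // 2
    park = parks[year]
    values = [factors[y]
              for y in range(year - half, year + half + 1)
              if y in factors and parks.get(y) == park]
    return sum(values) / len(values)


def halve(park_factor):
    """Move a factor halfway to neutral (assumed meaning of 'halved')."""
    return (park_factor + 1) / 2


# Made-up single-year factors; the team moves to park "B" in 2003.
factors = {2000: 1.10, 2001: 1.00, 2002: 0.90, 2003: 1.20}
parks = {2000: "A", 2001: "A", 2002: "A", 2003: "B"}
print(round(centered_average(factors, parks, 2002, 3), 3))
```

Here the three-year figure for 2002 uses only 2001 and 2002, since 2003 belongs to the new park.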

I personally find the most value in the single-year numbers, but there are good arguments to be made for using averages. Single-year numbers certainly appear to be noisier, but this is to be expected, and it's closer to being a feature than a bug. Part of that noise is due to a park "feature" that absolutely has an impact on the game, but gets lost if averaged factors are used: weather. Over the long term, the *climate* of a given city will be relatively stable, with changes happening over the course of many years; the *weather*, however, is much more variable season-to-season, and has a huge impact on batted balls, pitch movement, etc. Any park factor that's going to be applied to past data should account for that; hence, a single-year factor is best. Further, since the baseline is league average (and is hence affected by changes, in weather or anything else, in all league parks), it's to be expected that yearly numbers vary a bit.

Multi-year numbers definitely have their place as well, though. Anything forward-looking - say, a projection system - that wants to account for park effects would be better served by using multi-year park factors to estimate the adjustment that should be used. I didn't have the time to get the data on that, but it can be inferred from the following graphs, which show single-year, three-year, and five-year average park factors for Wrigley Field.

[Figures: Wrigley Field ISO park factors for right-handed and left-handed batters]

Throughout the above, ISO was my example; this is because wanting to create ISO+ (that is, league- and park-adjusted ISO) drove me to all of this in the first place. Not wanting to leave that idea hanging, below you can find both ISO and ISO+ for qualified batters in 2013. A quick glance through the data shows that the Pirates are helped out a lot, in terms of overall rank, by this method, with Andrew McCutchen and Neil Walker each jumping 17 spots. The Blue Jays are hurt (again, by ranking) a bit, with Jose Bautista and Adam Lind falling 7 and 8 spots, respectively. This is the most superficial of analyses, but maybe someone can find something more interesting.

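| Name | Team | League | ISO | ISO+ |
|---|---|---|---|---|
| Chris Davis | BAL | AL | 0.347 | 217 |
| Miguel Cabrera | DET | AL | 0.288 | 195 |
| Brandon Moss | OAK | AL | 0.267 | 186 |
| Pedro Alvarez | PIT | NL | 0.240 | 183 |
| Paul Goldschmidt | ARI | NL | 0.249 | 176 |
| David Ortiz | BOS | AL | 0.255 | 171 |
| Edwin Encarnacion | TOR | AL | 0.262 | 166 |
| Mike Trout | ANA | AL | 0.234 | 163 |
| Evan Longoria | TBA | AL | 0.229 | 162 |
| Alfonso Soriano | - - - | - - - | 0.235 | 160 |
| Giancarlo Stanton | MIA | NL | 0.231 | 160 |
| Troy Tulowitzki | COL | NL | 0.229 | 157 |
| Mike Napoli | BOS | AL | 0.223 | 155 |
| Mark Trumbo | ANA | AL | 0.219 | 152 |
| Jose Bautista | TOR | AL | 0.239 | 152 |
| Marlon Byrd | - - - | NL | 0.220 | 150 |
| Domonic Brown | PHI | NL | 0.222 | 150 |
| Nate Schierholtz | CHN | NL | 0.218 | 149 |
| Carlos Gomez | MIL | NL | 0.222 | 148 |
| Jayson Werth | WAS | NL | 0.214 | 148 |
| Chris Carter | HOU | AL | 0.227 | 148 |
| Will Venable | SDN | NL | 0.216 | 147 |
| Andrew McCutchen | PIT | NL | 0.190 | 146 |
| Adam Dunn | CHA | AL | 0.223 | 145 |
| Jedd Gyorko | SDN | NL | 0.196 | 144 |
| Jay Bruce | CIN | NL | 0.216 | 144 |
| Carlos Beltran | SLN | NL | 0.195 | 142 |
| Hunter Pence | SFN | NL | 0.200 | 142 |
| Justin Upton | ATL | NL | 0.201 | 141 |
| Adam Jones | BAL | AL | 0.208 | 139 |
| Robinson Cano | NYA | AL | 0.202 | 139 |
| Yoenis Cespedes | OAK | AL | 0.203 | 139 |
| Matt Holliday | SLN | NL | 0.190 | 137 |
| Michael Cuddyer | COL | NL | 0.198 | 137 |
| Mitch Moreland | TEX | AL | 0.205 | 136 |
| Josh Donaldson | OAK | AL | 0.198 | 136 |
| Adrian Beltre | TEX | AL | 0.193 | 134 |
| Adam Lind | TOR | AL | 0.208 | 134 |
| Brandon Belt | SFN | NL | 0.192 | 134 |
| Ryan Zimmerman | WAS | NL | 0.191 | 132 |
| Chase Utley | PHI | NL | 0.191 | 129 |
| Freddie Freeman | ATL | NL | 0.182 | 129 |
| Dan Uggla | ATL | NL | 0.183 | 128 |
| Anthony Rizzo | CHN | NL | 0.187 | 128 |
| Coco Crisp | OAK | AL | 0.183 | 127 |
| Neil Walker | PIT | NL | 0.167 | 127 |
| Joey Votto | CIN | NL | 0.186 | 124 |
| Carlos Santana | CLE | AL | 0.186 | 124 |
| Starling Marte | PIT | NL | 0.161 | 123 |
| Josh Hamilton | ANA | AL | 0.182 | 123 |
| Adrian Gonzalez | LAN | NL | 0.168 | 122 |
| Matt Carpenter | SLN | NL | 0.163 | 120 |
| Adam LaRoche | WAS | NL | 0.166 | 120 |
| Ian Desmond | WAS | NL | 0.173 | 120 |
| Justin Smoak | SEA | AL | 0.174 | 119 |
| Shin-Soo Choo | CIN | NL | 0.178 | 119 |
| Nick Swisher | CLE | AL | 0.176 | 118 |
| Brian Dozier | MIN | AL | 0.170 | 118 |
| Jonathan Lucroy | MIL | NL | 0.175 | 117 |
| Matt Wieters | BAL | AL | 0.181 | 117 |
| Kendrys Morales | SEA | AL | 0.171 | 117 |
| Todd Frazier | CIN | NL | 0.173 | 116 |
| Mark Reynolds | - - - | NL | 0.172 | 115 |
| Russell Martin | PIT | NL | 0.151 | 115 |
| Desmond Jennings | TBA | AL | 0.162 | 114 |
| Yadier Molina | SLN | NL | 0.159 | 114 |
| J.J. Hardy | BAL | AL | 0.169 | 113 |
| Prince Fielder | DET | AL | 0.178 | 113 |
| Kyle Seager | SEA | AL | 0.166 | 111 |
| Jason Kipnis | CLE | AL | 0.169 | 110 |
| Andre Ethier | LAN | NL | 0.151 | 110 |
| Buster Posey | SFN | NL | 0.156 | 110 |
| Torii Hunter | DET | AL | 0.162 | 109 |
| Shane Victorino | BOS | AL | 0.157 | 108 |
| Jed Lowrie | OAK | AL | 0.156 | 108 |
| Asdrubal Cabrera | CLE | AL | 0.160 | 106 |
| Matt Dominguez | HOU | AL | 0.162 | 105 |
| Alex Rios | - - - | AL | 0.154 | 105 |
| Chase Headley | SDN | NL | 0.150 | 105 |
| Alex Gordon | KCA | AL | 0.157 | 104 |
| Andrelton Simmons | ATL | NL | 0.148 | 104 |
| Allen Craig | SLN | NL | 0.142 | 102 |
| A.J. Pierzynski | TEX | AL | 0.153 | 101 |
| Joe Mauer | MIN | AL | 0.153 | 100 |
| Justin Morneau | - - - | - - - | 0.151 | 100 |
| Manny Machado | BAL | AL | 0.148 | 99 |
| Ryan Doumit | MIN | AL | 0.149 | 99 |
| Brett Gardner | NYA | AL | 0.143 | 99 |
| Howie Kendrick | ANA | AL | 0.142 | 99 |
| Salvador Perez | KCA | AL | 0.141 | 98 |
| Austin Jackson | DET | AL | 0.145 | 98 |
| Eric Hosmer | KCA | AL | 0.146 | 97 |
| Pablo Sandoval | SFN | NL | 0.139 | 96 |
| Trevor Plouffe | MIN | AL | 0.139 | 96 |
| Daniel Nava | BOS | AL | 0.142 | 96 |
| Chris Johnson | ATL | NL | 0.136 | 96 |
| Martin Prado | ARI | NL | 0.134 | 95 |
| Nolan Arenado | COL | NL | 0.138 | 95 |
| Ian Kinsler | TEX | AL | 0.136 | 95 |
| Gerardo Parra | ARI | NL | 0.135 | 94 |
| Alejandro De Aza | CHA | AL | 0.142 | 92 |
| Brandon Phillips | CIN | NL | 0.136 | 91 |
| Daniel Murphy | NYN | NL | 0.129 | 90 |
| Nate McLouth | BAL | AL | 0.141 | 88 |
| Mike Moustakas | KCA | AL | 0.131 | 87 |
| Jean Segura | MIL | NL | 0.129 | 86 |
| Billy Butler | KCA | AL | 0.124 | 86 |
| Chris Denorfia | SDN | NL | 0.117 | 86 |
| Jacoby Ellsbury | BOS | AL | 0.128 | 86 |
| James Loney | TBA | AL | 0.131 | 85 |
| David Freese | SLN | NL | 0.119 | 85 |
| Ben Zobrist | TBA | AL | 0.128 | 85 |
| Zack Cozart | CIN | NL | 0.127 | 85 |
| Victor Martinez | DET | AL | 0.129 | 84 |
| Leonys Martin | TEX | AL | 0.125 | 82 |
| Brandon Crawford | SFN | NL | 0.114 | 80 |
| Dustin Pedroia | BOS | AL | 0.114 | 79 |
| Michael Young | - - - | NL | 0.116 | 78 |
| Yunel Escobar | TBA | AL | 0.110 | 78 |
| Alberto Callaspo | OAK | AL | 0.110 | 76 |
| Erick Aybar | ANA | AL | 0.111 | 76 |
| Paul Konerko | CHA | AL | 0.111 | 75 |
| Denard Span | WAS | NL | 0.102 | 73 |
| Michael Brantley | CLE | AL | 0.112 | 73 |
| Starlin Castro | CHN | NL | 0.102 | 73 |
| Jon Jay | SLN | NL | 0.095 | 70 |
| Darwin Barney | CHN | NL | 0.096 | 68 |
| Jimmy Rollins | PHI | NL | 0.097 | 65 |
| Alexei Ramirez | CHA | AL | 0.096 | 65 |
| Michael Bourn | CLE | AL | 0.097 | 64 |
| Eric Young | - - - | NL | 0.087 | 60 |
| Gregor Blanco | SFN | NL | 0.084 | 59 |
| Norichika Aoki | MIL | NL | 0.084 | 57 |
| Ichiro Suzuki | NYA | AL | 0.081 | 56 |
| Nick Markakis | BAL | AL | 0.085 | 53 |
| Jose Altuve | HOU | AL | 0.080 | 52 |
| Marco Scutaro | SFN | NL | 0.072 | 51 |
| Adeiny Hechavarria | MIA | NL | 0.071 | 50 |
| Alcides Escobar | KCA | AL | 0.066 | 46 |
| Elvis Andrus | TEX | AL | 0.060 | 42 |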
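The article does not spell out the ISO+ formula, so the following is only a guess at its general shape, modeled on other "plus" stats: divide out the park factor, compare to league average, and scale so that 100 is average. Which park factor goes in (halved or not, single-year or averaged, split by handedness) is left open here, and the function name and numbers are made up.

```python
def iso_plus(iso, league_iso, park_factor):
    """Park- and league-adjusted ISO on a scale where 100 is league average.

    Assumed form: 100 * (ISO / park factor) / league ISO.
    """
    return 100 * iso / (league_iso * park_factor)


# Made-up inputs, not taken from the table above.
print(round(iso_plus(iso=0.200, league_iso=0.150, park_factor=1.05)))
```

Under this form, two hitters with the same raw ISO separate as soon as their parks differ, which is the kind of reshuffling in rank described for the Pirates and Blue Jays.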

Anyway, I hope someone out there found this all useful. I’d love any feedback or questions you might have, since I’m planning on doing this same process to establish (better) park factors for a bunch of different stats as prep work for a series of cross-era comparison articles coming somewhere down the line. For example, I thought about trying to account for schedule imbalance when I weighted opponents' ISO in the numerator, but it was difficult to accomplish in my spreadsheet and I guessed that the increase in accuracy wasn't worth the effort. If there's anything you notice, please let me know.

. . .

The information used here was obtained free of charge from and is copyrighted by Retrosheet. Interested parties may contact Retrosheet at www.retrosheet.org. Some other statistics courtesy of FanGraphs and Baseball-Reference.

John Choiniere is a researcher and featured (occasional) writer at Beyond the Box Score. You can follow him on Twitter at @johnchoiniere.

