
Lookout Landing

Sabermetrics 101: Sample Sizes

Now seemed as good a time as any.

Prerequisites for understanding: Regression, correlation.

Prerequisites for derivation: Data, regression, correlation.


Sample Sizes

We're familiar with regression and correlation, so let's go a little more in depth on the nature of sample sizes. If we have a certain set of data, how can we assess its reliability? What is it actually telling us? We unpacked a little of this in a previous post, but didn't touch on how it applies to the data we wish to analyse. There are many, many resources that describe this problem and the solutions in minute, tortuous (to some) detail. We don't need to rehash them here - these posts are intended to be more overview than encyclopedia. So - an overview:

The idea is that different skills and stats have different thresholds for sample size tolerance. We know that we must regress our measurements towards a mean, and we've thought a little bit about which means we should be using. What we haven't really discussed is how far we should be regressing our given values. This is governed by our sample size and the stability of the statistic - larger samples and higher stability mean less regression. The important thing to point out is that the amount of regression we apply should be a continuum, rather than a step - meaning that for every sample size there is a certain amount of associated information. The smallest sample (i.e. zero) tells you nothing, and we slowly work our way up the ladder until we reach the largest samples, which still don't tell you everything.
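
One common way to make that continuum concrete is to blend the observed rate with the mean, giving the observation more weight as its sample grows. What follows is only a minimal sketch, not the method of any particular projection system. The function name `regress_to_mean` and the constant `k` (the sample size at which the observation and the mean get equal weight) are assumptions introduced here, and the numbers are made up for illustration.

```python
def regress_to_mean(observed, n, mean, k):
    """Shrink an observed rate toward a mean.

    n is the sample size behind the observation; k is a hypothetical
    constant: the sample size at which observation and mean get equal weight.
    """
    if n < 0 or k <= 0:
        raise ValueError("n must be non-negative and k must be positive")
    weight = n / (n + k)  # 0 when n == 0, approaches (but never reaches) 1
    return weight * observed + (1 - weight) * mean


# Illustrative numbers only: a .400 OBP against a .330 mean, with k = 300.
for n in (0, 50, 300, 1200):
    print(n, round(regress_to_mean(0.400, n, 0.330, 300), 3))
```

With a sample of zero the estimate is simply the mean, and no finite sample ever pushes the weight on the observation all the way to one, which matches the ladder described above.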

Some Rules of Thumb

The way we determine the information associated with a given sample size for a given statistic is to look at the stability across the MLB population while taking into account the relative persistence of the statistic year by year in individuals. Needless to say, this can be a daunting task. In lieu of pursuing some intensive mathematics, here are some rules of thumb:

  • Using sample size on a per plate appearance basis can be misleading. Remember that if a statistic is based around pitches seen, the actual sample is going to be three to four times as large as the number of plate appearances. This makes statistics like swing% appear far more stable than they actually are (although we can treat them as highly stable since we have a large sample anyway), and statistics like home run per fly ball become artificially destabilised, since our sample isn't really plate appearances but plate appearances that end in a fly ball.
  • It should be noted that even if you consider total fly balls as your sample, pitcher home run per fly is still highly unstable.
  • In general, the more players involved in our measurement of a statistic, the less stable. Batting average for both pitchers and batters is determined by the pitcher, the batter, and the defence. Strikeouts cut out the defence, and they're far more stable on both ends than average is.
  • By and large, batting statistics are more reliable than pitching statistics, which are more reliable than defensive statistics. Batting statistics, namely on-base percentage and slugging percentage, tend to stabilise with two-thirds or so of a season (~400 PA). Pitchers only see strikeout, line drive, and groundball rates stabilise early, with walks coming late to the party. Nowhere to be seen is ERA. Fielding measurements such as UZR require something like three years of data to be comfortable with. Putting up great fielding numbers for a season is like hitting well for April and half of May.

What Follows

Projection systems, understanding splits.


Comments

Finding out that Batter vs. Pitcher stats are massively subject to small sample sizes was one of the most upsetting things to realize.
Doesn't ERA tend to stabilize over the course of an entire career?
Over 6-7 years I would say an adjusted ERA is better than DIPS
Than our current implementations of DIPS, sure

Eventually we’ll get the more interesting information encoded in ERA into our defence-independent statistics

Yes

SIERA is a start to that, although it’s going about it the wrong way IMO.

If you play for a bunch of different teams, probably

But if you pitch in the same park, in front of similar defenses, then it may not.

Warning - Math content - If you don't want to read, please ignore

Here is a good example of a case where the concept of variance (or more importantly its square root, the standard deviation) could help explain what is going on.

When we calculate a rate, there is a “true rate” which our sample is approximating. Every rate exhibits variability. Think about flipping a coin 10 times: on any set of 10 flips you will see lots of different results. When you observe all the results, the mean would be 5 heads, and the standard deviation measures (not exactly, but close) the average difference from that mean for a sample of size 10. The standard deviation of the rate that you calculate from your sample represents the expected variability that you may see in that rate even though the true rate is .5. With larger samples, the standard deviation will be smaller: it decreases in proportion to one over the square root of the sample size.

What does this mean? An OBP calculated based on 400 PA exhibits half the variability of an OBP based on 100 PA. To get half the variability of an OBP based on 400 PA, you’d have to go to 1600 PA. Sorry for the interruption :)
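
The halving described here can be checked with the usual standard deviation of a proportion, sqrt(p(1 - p) / n). The sketch below treats every plate appearance as an independent trial with the same probability, which is a simplification. The name `rate_standard_error` and the .330 true OBP are assumptions chosen purely for illustration.

```python
import math


def rate_standard_error(p, n):
    """Standard deviation of an observed rate from n independent trials
    that each succeed with true probability p (binomial model)."""
    if n <= 0:
        raise ValueError("n must be positive")
    return math.sqrt(p * (1 - p) / n)


# Assumed true OBP of .330; the value is illustrative only.
for pa in (100, 400, 1600):
    print(pa, round(rate_standard_error(0.330, pa), 4))
```

Each fourfold increase in plate appearances cuts the printed value in half, whatever true rate is assumed.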

I think there are at least two separate issues here

which it may be enlightening to distinguish. The primary question is: What prevents us from estimating something precisely? In terms of baseball statistics, there may be two different challenges:

1) The quantity you are trying to measure is inherently variable. For example, due to “randomness” (really a bunch of small unobservable factors) a player’s defensive performance may actually fluctuate quite a lot from game to game, week to week, or even year to year. So we need large sample sizes to estimate the true mean (i.e. talent level) precisely.

2) Even if the quantity of interest doesn’t necessarily have high variability, the nature of baseball restricts the sample sizes available to estimate it. For example, if a batter is platooned, then he may have only a small number of at-bats against same-handed pitchers, and it will obviously be hard to estimate his performance against them.

A subtler point related to 2) is that the “events” (swings, at-bats, pitches, etc.) we look at are dependent (i.e. correlated) to varying degrees. Swings, for example, are grouped into at-bats, and four or five consecutive at-bats often take place against the same pitcher, inducing dependence. Generally speaking, the larger the dependence of the events, the larger sample size you need to estimate your quantity accurately.
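
One rough way to see the cost of that dependence is the design-effect approximation from survey sampling, which converts a count of clustered events into an approximate count of independent ones. This is a sketch under strong assumptions (equal-sized clusters, a single within-cluster correlation). The function name `effective_sample_size` and all of the numbers are hypothetical, not measured baseball values.

```python
def effective_sample_size(n, cluster_size, rho):
    """Approximate number of independent observations in n clustered events.

    Uses the design-effect approximation 1 + (cluster_size - 1) * rho, where
    rho is the correlation between events in the same cluster.
    """
    if n < 0 or cluster_size < 1 or not 0 <= rho <= 1:
        raise ValueError("invalid input")
    return n / (1 + (cluster_size - 1) * rho)


# Hypothetical: 1500 swings grouped into at-bats of about 4 swings each.
for rho in (0.0, 0.1, 0.5):
    print(rho, round(effective_sample_size(1500, 4, rho)))
```

With no correlation the full 1500 swings count. As the correlation rises, the same 1500 swings behave like a noticeably smaller independent sample.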

Here's what confuses me about sample size and sports:

the idea that we learn “nothing” from small sample sizes.

Perhaps this is just a language issue, and when someone says that we learn nothing from that stat, because of sample size, they really mean that we learn very very little.

I think of it like this. If I see a batter come to the plate and I know nothing about him (except that he is on a major league baseball team), and then he hits a homerun in that single at bat. That has low value. But is it “zero”? I mean, isn’t it more likely than not that he is a power hitter? The odds are higher than they were when he came up to bat (and I could only expect league averages from him), even if that increase in odds is just 1% or .1%, it is something, right?

This rolls around in my brain most often when it comes to batter matchups vs. pitchers. If Batter A is 9 for 12 lifetime against Pitcher B, then isn’t there an increased chance that he is better against that pitcher than if he were 1 for 12? It isn’t a large enough sample to draw a statistically significant conclusion, but aren’t the odds somewhat higher?

Sorry to make this so long, but I use analogy to articulate what I mean. Let’s say I have a random quarter I found on the ground and I haven’t looked at it yet. The odds that this quarter is a two-headed coin are, let’s just say, .001% (if 1 out of every 100,000 quarters on the ground is a two-headed trick coin).

Now I start flipping that coin. I get heads three times in a row. Three is a tiny sample size and for all practical purposes it is worthless. But aren’t the odds that my coin is two-headed now higher? Maybe they are .002%. Is there something I’m missing here, or is this kind of differentiation between ‘zero’ and ‘really tiny’ only valuable as a curious mental exercise?
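
The coin question can be worked through with Bayes' rule. The sketch below assumes only two kinds of quarter exist, fair ones and trick ones that always land heads, and takes the 1-in-100,000 prior from the comment. The function name `posterior_two_headed` is introduced here for illustration. Under those assumptions, three heads in a row multiplies the odds by about eight, to roughly .008%: a real change, but still a very small number.

```python
def posterior_two_headed(prior, heads_in_a_row):
    """Probability the coin is a trick coin after seeing only heads.

    A trick coin shows heads every time; a fair coin shows
    heads_in_a_row straight heads with probability 0.5 ** heads_in_a_row.
    """
    if not 0 <= prior <= 1 or heads_in_a_row < 0:
        raise ValueError("invalid input")
    trick = prior * 1.0
    fair = (1 - prior) * 0.5 ** heads_in_a_row
    return trick / (trick + fair)


prior = 1 / 100_000  # the 1-in-100,000 figure from the comment
for flips in (0, 3, 10, 20):
    print(flips, f"{posterior_two_headed(prior, flips):.4%}")
```

Every additional head roughly doubles the odds, so a small sample moves the estimate a little and a long streak moves it a lot.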

Jordan Schafer homered in his first big league at bat then proceeded to be very very bad offensively for the rest of his time with the Braves.

One at bat tells you that, yes, Jordan Schafer has the physical ability to hit a home run. You knew this because he had homered at least once in high school and the minors. Even if something has a tiny tiny chance of happening, the single occurrence in one observation of that something doesn’t tell you anything about the likelihood of it happening again.

Also, Kenji Johjima hit a home run to right field in his first game as a Mariner.

Yes that actually happened, but it did not mean he was a hitter that would routinely show power the opposite way.

That is to say that it tells you that the event isn't impossible, but it's not much more instructive than that.
And most of the time, we already knew that.
Wellllll

A sample size of one has to tell you something, because otherwise a sample size of ten thousand couldn’t tell you anything. It just tells you not very much at all, and our brains are really bad at handling Bayesian Inferences. All in all, you’re better off thinking of it as worth nothing rather than being worth some small figure. We’re just too prone to vastly overestimating what the very small number is to bother with it.

I think everyone should play a few dozen freeroll poker tournaments...

Once you get over the anger and frustration, they really are amazing for realizing how bad your tendency to extrapolate from insufficient sample sizes is. That’s assuming you actually pay attention to the odds, what actually happens, and your natural response to what actually happens.

We generally treat all events as equally informative

So the first at-bat is no more or less informative than the 17th (taken in isolation). There is a statistical quantity called “information” which increases in relation to the inverse of the variance.

For the record, I don’t like the notion that there is some point at which sample sizes become “reliable”. Everything’s on a continuum; if you tell me how “reliable” you want your estimate to be, I’ll come back with a sample size which will accomplish that. People often seem to use R=0.5 as a benchmark for reliability, but that’s a somewhat arbitrary choice.
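
The "tell me the reliability, I'll give you the sample size" idea can be sketched with a simple model in which reliability grows as n / (n + k), where k is the sample size at which R = 0.5. Both that model and the value of k below are assumptions made for illustration, as is the function name `sample_size_for_reliability`. Real statistics each have their own k, and it has to be estimated from data.

```python
def sample_size_for_reliability(target_r, k):
    """Sample size n at which n / (n + k) equals target_r.

    k is a hypothetical constant: the sample size where reliability is 0.5.
    """
    if not 0 <= target_r < 1 or k <= 0:
        raise ValueError("target_r must be in [0, 1) and k must be positive")
    return k * target_r / (1 - target_r)


# Illustrative k of 300 plate appearances.
for r in (0.5, 0.7, 0.9):
    print(r, round(sample_size_for_reliability(r, 300)))
```

Nothing special happens at R = 0.5 in this model. It is just one point on a smooth curve, and higher targets cost disproportionately more sample.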

Also, thanks for these awesome articles

This is a ton of work for you. I took statistics in college and I’ve never been turned off by the sabermetrics world, finding it approachable enough to add to my understanding of baseball. But these articles are a fantastic introduction and explanation.
