I've decided to start writing a sabermetric 'course' akin to what Fett42 gave us in the FanPosts but easier to link to. My hope is that eventually every major concept gets covered and we can autotag this into our posts whenever it comes up. The goal is to give people who haven't encountered this before a quick reference tool where they can look up any unfamiliar concept. With that in mind, I should start from what I think is the top: Game State.
Prerequisites for understanding: None.
Prerequisites for derivation: Database.

The What
Baseball is without a doubt the easiest North American sport to analyse. A major reason that this is the case is that there are discrete states for a game. There's the score and inning, obviously, but there's also outs and baserunners to take into account. With those four pieces of information, you can describe any baseball game at any time. We call the combination of score, inning, baserunners, and outs the game state. The game state matrix is typically considered to reflect the number of runs that an average MLB team will score in an inning given any combination of baserunners and outs (run expectancy), but a similar concept can be applied to a team's chances of winning any given game (win expectancy).
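To make the baserunner/outs half of the game state concrete, here is a minimal sketch that enumerates every combination. The names BaseOutState and all_base_out_states are made up for illustration and do not come from any particular database or library. Three bases that are each either occupied or empty give 8 baserunner arrangements, and with 0, 1, or 2 outs that makes 24 states.

```python
from itertools import product
from typing import NamedTuple


class BaseOutState(NamedTuple):
    """Hypothetical container for the baserunner/outs part of the game state."""
    first: bool   # runner on first?
    second: bool  # runner on second?
    third: bool   # runner on third?
    outs: int     # 0, 1, or 2


def all_base_out_states():
    """Return every baserunner/outs combination that can occur during an inning."""
    return [
        BaseOutState(first, second, third, outs)
        for outs in range(3)
        for first, second, third in product((False, True), repeat=3)
    ]


states = all_base_out_states()
print(len(states))  # 8 baserunner arrangements x 3 out counts
```

A run expectancy matrix has one cell per entry in this list. Win expectancy would additionally need the inning, the half-inning, and the score difference.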
The Why
Why is this important? Every game state has been experienced in the major leagues thousands of times, leaving analysts with the ability to determine the odds of a team winning a game at any given point, or even how many runs might be expected to score in an inning. If you have the average number of runs expected to score in an inning after any game state, you can figure out how many runs a stolen base is worth, or a triple, or a strikeout. The game state essentially allows us to relate everything that happens on the diamond back to the major currencies of baseball: winning and runs. Without it, there would be no apples to apples comparison between pitching and hitting, walks and doubles, you name it. The game state is the key concept behind linear weights, and therefore understanding what it means is vital to achieving a good grasp on how most modern statistics work.
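As a rough sketch of how an event gets a run value: take the run expectancy of the state after the play, subtract the run expectancy of the state before it, and add any runs that scored on the play itself. The function name run_value, the state encoding, and the numbers in toy_table are all invented for illustration; they are not Tango's figures or anyone's real data.

```python
def run_value(re_table, before, after, runs_scored=0):
    """Run value of one play: RE(after) - RE(before) + runs scored on the play.

    re_table maps (bases, outs) to expected runs for the rest of the inning.
    Pass after=None when the play ends the inning (nothing more can score).
    """
    re_after = 0.0 if after is None else re_table[after]
    return re_after - re_table[before] + runs_scored


# Placeholder values only, chosen to keep the arithmetic easy to follow.
toy_table = {
    ("1--", 0): 1.0,   # runner on first, nobody out
    ("-2-", 0): 1.25,  # runner on second, nobody out
    ("---", 1): 0.25,  # bases empty, one out
    ("1--", 2): 0.25,  # runner on first, two out
}

steal = run_value(toy_table, ("1--", 0), ("-2-", 0))
caught = run_value(toy_table, ("1--", 0), ("---", 1))
print(steal, caught)
```

With a real table in place of toy_table, averaging these values over every occurrence of an event (every triple, every strikeout) is the idea behind linear weights.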
There are a few major points to keep in mind:
The How
To determine how many runs score on average from a specific baserunner/outs situation, one simply needs to find every time that situation occurred, tally the total runs that scored between each instance and the close of the inning, and divide that total by the number of instances. Win expectancy is derived in much the same manner. The only difficult part of it is data gathering and knowing which seasons to look at, as different run environments naturally yield different results. Ensure that the game state in use is appropriate for the run environment - don't use the 1970s to model happenings in the late 1990s, for example.
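The tallying described above can be sketched in a few lines. This assumes the play-by-play data has already been boiled down to one record per play, in order, holding a half-inning identifier, the bases and outs before the play, and the runs that scored on it. That record layout and the name run_expectancy are assumptions for the example, not the schema of Retrosheet or any other database, and the plays listed are invented.

```python
from collections import defaultdict


def run_expectancy(plays):
    """Average runs scored from each (bases, outs) state to the end of the inning.

    plays: iterable of (half_inning_id, bases, outs, runs_on_play) in game order,
    where bases and outs describe the situation before the play.
    """
    by_inning = defaultdict(list)
    for half_inning_id, bases, outs, runs_on_play in plays:
        by_inning[half_inning_id].append(((bases, outs), runs_on_play))

    totals = defaultdict(float)
    counts = defaultdict(int)
    for events in by_inning.values():
        runs_to_end = 0
        # Walk backwards so runs_to_end holds the runs from this play onward.
        for state, runs_on_play in reversed(events):
            runs_to_end += runs_on_play
            totals[state] += runs_to_end
            counts[state] += 1
    return {state: totals[state] / counts[state] for state in counts}


# Two invented half-innings, purely to show the bookkeeping.
toy_plays = [
    ("g1-top1", "---", 0, 0),  # single
    ("g1-top1", "1--", 0, 0),  # strikeout
    ("g1-top1", "1--", 1, 2),  # two-run homer
    ("g1-top1", "---", 1, 0),  # out
    ("g1-top1", "---", 2, 0),  # out
    ("g1-bot1", "---", 0, 0),  # out
    ("g1-bot1", "---", 1, 0),  # out
    ("g1-bot1", "---", 2, 0),  # out
]
re_table = run_expectancy(toy_plays)
print(re_table[("---", 0)])
```

A real derivation would also have to decide how to treat innings that end early, such as walk-offs, and which seasons to include. This sketch ignores both.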
Example
Tom Tango has derived this matrix for run expectancy from 1999-2002.
What Follows
WPA, Linear Weights.
17 recs | 38 comments
This is a really good idea. I am very grateful that you are doing this.
That is all.
Griffin Cooper - February 14, 2010
This is a great idea, what with all the new people coming to the site.
OlSalty - February 14, 2010
I demand 42% of the profits
Fett42 - February 14, 2010
You should do another FanPost on this, call it the "Graham State"
Janic - February 14, 2010
That's basically the French Revolution, only the monarchy puts the proletariat to the guillotine.
Kermit. - February 14, 2010
I'll just link to these in my link collection
Fett42 - February 14, 2010
Then when they come back to this post it will be one vicious circle.
CapSea - February 14, 2010
Awesome idea. Looking forward to further installments.
stupidquestions - February 14, 2010
When you say Prerequisite for derivation: database ...
do you mean a play-by-play database of all games?
And just out of curiosity, is this database (whatever it is) freely available?
rickpo - February 14, 2010
Retrosheet is freely available
And it contains play by play data for most years since around 1955. You can get all of that data by following the steps here:
http://www.hardballtimes.com/main/blog_article/building-a-retrosheet-database-the-short-form/
Or, if you don’t want to go through that hassle, just import some of these SQL ZIP files:
http://www.wantlinux.net/2009/04/retrosheet-baseball-mysql-database-download/
vivaelpujols - February 15, 2010
That's where all of the run expectancy data comes from BTW
vivaelpujols - February 15, 2010
Do you foresee a time in the not too distant future where differences in skill level will be integrated to Win Probability?
My thought is no, for the following reasons:
- Good teams will have ~60% win percentage, bad teams ~40% with some variation, which is close enough to 50/50
- True talent can never be truly measured. It is always an estimate based on outcomes, age, and probability.
- It would be too difficult and a waste of energy to add those adjustments for each game state, especially since they would be inexact.
But I would like to hear your thoughts. And if you do expect it to be added, why?
CapSea - February 14, 2010
There's no real reason to
Doing such a thing would contextualise WPA and make it useless for player valuation (as opposed to useless for measuring talent level, which it currently is). The only reason I can see to implement such a system is for in-game gambling, which seems like a pretty small niche market to me.
Graham MacAree - February 15, 2010
Could be a niche market, but would be a niche market with a huge amount of disposable income
if you could make something like that, no serious gambler could afford to be without it when betting in-game
seattlebruin - February 15, 2010
Do people bet in-game very often?
Graham MacAree - February 15, 2010
I would guess no, but when they do, it's a significant amount of money
seattlebruin - February 15, 2010
Yeah, people probably do in the playoffs
And that’s where the big money gets thrown around
OlSalty - February 15, 2010
Well I have a summer project then.
Graham MacAree - February 15, 2010
Hell, you could probably make a lot of money doing this, as long as the casinos never figured out what tool you're using
essentially when the bookmaker doesn’t have a lot of time to put up odds for an event, a smart player will always be at an advantage
seattlebruin - February 15, 2010
See I would rather just make the tool and sell it to the casinos
Graham MacAree - February 15, 2010
I expect royalties.
CapSea - February 15, 2010
Isn't it nice to have expectations?
ToddK - February 15, 2010
.
Graham MacAree - February 15, 2010
I expect my royalties to be less incestuous.
CapSea - February 15, 2010
I've done this a few times.
Bodog.com
This is probably a little different than what you guys are talking about though, this is literally betting on the outcome of every pitch.
hcoguy - February 15, 2010
Great idea Graham.
I think this will help everyone understand the smaller aspects that go into stats, like linear weights and such. You’ve been doing some great work on the site lately too.
Kirk - February 15, 2010
What is a good range of years for the data to take?
I mean, if I were evaluating a player for 2010, should I take the past 10 years? 7? Is there any standard dataset size when talking about this stuff?
88fingerslukee - February 15, 2010
For run expectancy you only need a few years or even one year of data, because the run/out states happen so much
For win expectancy, you need more data, which is why current win expectancy charts are moving away from empirical data towards more theoretical measures. I’ll explain that in more detail in a later post.
Graham MacAree - February 15, 2010
Is there much variance on run expectancy?
Seeing that the data was from ‘99-’02 and is still being used, I am guessing that the variance is minimal but am curious to know what the range is.
ToddK - February 15, 2010
Yes, there's a decent amount of variance.
The ‘99-’02 data from Tango was just an example. I like single year data for the leagues I’m looking at for run expectancy.
Graham MacAree - February 15, 2010
How far back do you go for your projection? 1 year only?
ToddK - February 15, 2010
One year is fine for run expectancy.
Graham MacAree - February 15, 2010
Cool. Thanks.
ToddK - February 15, 2010
I am really, really excited for this.
Thanks so much, Graham.
Torrid - February 15, 2010
I really like WPA and WE stuff
It's the thing that first got me interested in Lookout Landing.
I think one of its biggest shortfalls when the time comes to use it is that it assumes everything is average. It doesn’t know you have the top of the order coming up in the bottom of the ninth or that Mariano Rivera is probably the best closer ever. It is blissfully ignorant of all of this. I don’t mind that it assumes everything is average. It is what it is. I still like it a lot but it makes it harder to confidently use at times.
Edgar for Pres - February 15, 2010
Awesome
Thanks, Graham. I’ve been wishing that I had a better understanding of the terms you guys throw around.
Cablinasian - February 15, 2010