Clustering Applied to Quarterback Play in the NFL


In this blog post I want to talk a bit about unsupervised learning. As some of you that know me may know, I am relatively new to data science and machine learning, having my formal educational training in applied mathematics/mathematical biology. My interest in machine learning came not through mathematical biology or ecology, but through studying football.

Using ProFootballFocus data (I am a data scientist for PFF) we can study the quality of quarterback play through the process of grading players on every play of every game of every season.  To do so, it’s the most efficient to “cluster” quarterback seasons into buckets of similar seasons.  The best way to do this (do date) is through k-means clustering.

While there are many references on k-means clustering in the literature and on the web, I’ll briefly summarize the idea in this blog.  K-means clustering is an unsupervised learning algorithm that aims to partition a data set of n observations into k clusters where each observation belongs to one and only one cluster with the nearest mean.  Visually, one can think of a cluster as a collection of objects in m-dimensional space that are “close” to each other.  Below is an example of clustering quarterbacks from the 2016 season by their proportions of positively-graded and negatively-graded throws.  Different clusters are visualized with different colors:

As a part of our in-depth study of quarterback play at PFF, we clustered quarterbacks on the composition of their play-by-play grades in various settings (when under pressure, when kept clean, with using play action).  This gave us a tier-based system in which to evaluate the position throughout the PFF era (2006-present).  In 2016 the only quarterback that was in our top cluster on all throws, throws when from a clean pocket, throws when under pressure, and throws on third and long was New England Patriots’ star Tom Brady.

Stay tuned for more of an in-depth look at the quarterback position by visiting both in-season and during the offseason.



Random versus fixed effects


Wrapping my head around random versus fixed effects took me a while in graduate school. In part, this is because multiple definitions exist. Within ecology, the general definition I see is that a fixed effect is estimated by itself whereas a random effect comes from a higher distribution. Two examples drilled this home for me and helped it click.

First, the question: “Do we care about the specific group or only that the groups might be having an impact?” helped me see the difference between fixed and random effects. For example, if we were interested in air quality data as a function of temperature across cities, city could be either a fixed or random effect. If city was a fixed effect, then we would be interested the air quality at that specific city (e.g., the air quality in New York, Los Angles, and Chicago). Conversely, if city as a random effect, then we would not care about a specific city, only that a city might impact the results due to city specific conditions.

Second, an example in one of Marc Kerry’s book on WinBugs drilled home the point. Although he used WinBugs, the R package lme4 can be used to demonstrate this. Additionally, although his example was something about snakes, a generic regression will work. (I mostly remember the figure and had to recreate it from memory. It was about ~5 or 6 years ago and I have not been able to find the example in his book to recreate it, hence I coded this from memory). Here’s the code


population = rep(c(“a”, “b”, “c”), each = 3)
intercept = rep( c(1, 5, 6), each = 3)
slope = 4
sd = 2.0

dat = data.frame(
population = population,
interceptKnown = intercept,
slopeKnown = slope,
sdKnown = sd,
predictor = rep(1:3, times = 3))


dat$response = with(dat,
rnorm(n = nrow(dat), mean = interceptKnown, sd = sdKnown) +
predictor * slopeKnown

## Run models
lmOut <- lm(response ~ predictor + population, data = dat)
lmerOut <- lmer( response ~ predictor + (1 | population), data = dat)

## Create prediction dataFrame
dat$lm <- predict(lmOut, newData = dat)
dat$lmer <- predict(lmerOut, newData = dat)

ggplot(dat, aes(x = predictor, y = response, color = population)) +
geom_point(size = 2) +
scale_color_manual(values = c(“red”, “blue”, “black”)) +
theme_minimal() +
geom_line(aes(x = predictor, y = lm)) +
geom_line(aes(x = predictor, y = lmer), linetype = 2)

Which produces this figure:

Example of a fixed-effect intercept (solid line) compared to a random-effect (dashed line) regression analysis.

Play around with the code if you want to explore this more. At first, I could not figure out how to make the dashed lines be farther apart from the solid lines. Change the simulated standard deviation to see what happens. Hint, my initial guess of decreasing did not help.