In this blog post I want to talk a bit about unsupervised learning. As those of you who know me are aware, I am relatively new to data science and machine learning; my formal training is in applied mathematics and mathematical biology. My interest in machine learning came not through mathematical biology or ecology, but through studying football.

Using ProFootballFocus data (I am a data scientist for PFF), we can study the quality of quarterback play by grading players on every play of every game of every season. To do so efficiently, it helps to “cluster” quarterback seasons into buckets of similar seasons. The best way we have found to do this (to date) is k-means clustering.

While there are many references on k-means clustering in the literature and on the web, I’ll briefly summarize the idea in this post. K-means clustering is an unsupervised learning algorithm that aims to partition a data set of n observations into k clusters, where each observation belongs to exactly one cluster: the one with the nearest mean. Visually, one can think of a cluster as a collection of objects in m-dimensional space that are “close” to each other, where m is the number of features measured for each observation. Below is an example of clustering quarterbacks from the 2016 season by their proportions of positively-graded and negatively-graded throws. Different clusters are visualized with different colors:

As a part of our in-depth study of quarterback play at PFF, we clustered quarterbacks on the composition of their play-by-play grades in various settings (when under pressure, when kept clean, when using play action). This gave us a tier-based system with which to evaluate the position throughout the PFF era (2006-present). In 2016, the only quarterback in our top cluster on all throws, throws from a clean pocket, throws under pressure, and throws on third and long was New England Patriots star Tom Brady.
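
For readers who want to see the k-means idea described earlier in action, here is a minimal sketch in plain Python. To be clear, this is not PFF's code or data: the `kmeans` and `squared_distance` functions are my own illustrative helpers, the "seasons" are randomly generated pairs of (positive-grade rate, negative-grade rate), and the three tier centers are made-up numbers chosen only so the clusters are easy to see. It uses the standard Lloyd's algorithm, which alternates between assigning each point to the nearest mean and recomputing the means.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd's algorithm. Returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen observations
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest mean.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the means will too
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties out
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels


# Synthetic stand-in data (NOT real PFF grades): 10 hypothetical seasons
# scattered around each of three made-up (positive rate, negative rate) tiers.
data_rng = random.Random(42)
tier_centers = [(0.30, 0.08), (0.24, 0.12), (0.18, 0.17)]
seasons = [
    (data_rng.gauss(cx, 0.01), data_rng.gauss(cy, 0.01))
    for cx, cy in tier_centers
    for _ in range(10)
]

centroids, labels = kmeans(seasons, k=3)
for j, (pos, neg) in enumerate(centroids):
    print(f"cluster {j}: {labels.count(j)} seasons, "
          f"mean positive rate {pos:.3f}, mean negative rate {neg:.3f}")
```

One caveat worth knowing: Lloyd's algorithm can settle into a local optimum depending on where the centroids start, so it is common to rerun it with several seeds and keep the run with the smallest total within-cluster squared distance.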

Stay tuned for a more in-depth look at the quarterback position by visiting profootballfocus.com both in-season and during the offseason.