When the power of data.table clicked


Historically, R has been limited by memory (for example, see this post). Although the program has gotten better over time, R still faces limitations when working with “big data”. In my experience, data.frames and read.csv become too slow when reading files larger than about 0.5 GB, and data.frames start to clog up the system once they get bigger than about 1.0 GB. However, some recent packages have been developed to work with “big data”. The R High Performance Computing page describes some of these packages and workarounds.

My personal favorite package for working with big data in R is data.table. The package was designed by people in quantitative finance for working with big data (e.g., files > 100 GB). I discovered the package while trying to optimize a population model. Now, I use it as my default method for reading data into R and manipulating data in R.

Besides being great for large data, the package also uses a slick syntax for manipulating data. As described in the vignette on the topic, the package has some cool methods for merging and sorting data. The package maintainers describe the syntax as similar to SQL, although I do not know SQL, so I cannot comment on the analogy.
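To give a flavor of the syntax (with small made-up tables rather than my real data), sorting and merging with data.table looks something like this:

library(data.table)

## For real files I use data.table's fast reader, e.g., fread("bigFile.csv");
## here two small tables are built directly so the example runs on its own.
fish <- data.table(Lake = c("A", "A", "B", "B"),
                   Year = c(2015, 2016, 2015, 2016),
                   Catch = c(10, 12, 4, 7))
lakeInfo <- data.table(Lake = c("A", "B"),
                       Area = c(120, 45))

## setkey() sorts the table by the key columns (in place)
setkey(fish, Lake, Year)

## Merge the lake attributes onto the catch data
fishMerged <- merge(fish, lakeInfo, by = "Lake")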

After taking the DataCamp course on data.table, I learned how to use the package much better and was soon able to apply that knowledge to my work. One call in particular blows my mind and impresses me:

lakeTotalPop <- lakeScenariosData[ , .(Population = sum(Population)), by = .(Year, Scenario, Stocking, Lifespan, PlotLife, Sex == "Male" | Sex == "Female")]

This code allowed me to aggregate data by a condition, something that requires multiple steps of clunky code in base R. Even if I learned nothing else, this one line of code would make my entire DataCamp experience worthwhile!
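To show the idea with data anyone can run, here is a stripped-down toy version of the same grouping-by-a-condition trick (all columns and numbers invented), with a rough base R equivalent for comparison:

library(data.table)

## Toy stand-in for the real data
toy <- data.table(Year = rep(2020:2021, each = 4),
                  Sex = rep(c("Male", "Female", "Unknown", "Male"), times = 2),
                  Population = c(10, 12, 3, 5, 11, 14, 2, 6))

## The logical condition becomes a grouping column alongside Year
toy[ , .(Population = sum(Population)),
     by = .(Year, Sexed = Sex == "Male" | Sex == "Female")]

## Roughly the same thing in base R needs the grouping variable built first
toyDf <- as.data.frame(toy)
toyDf$Sexed <- toyDf$Sex == "Male" | toyDf$Sex == "Female"
aggregate(Population ~ Year + Sexed, data = toyDf, FUN = sum)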

Knitr and R Markdown


I’m late to the game, but I have recently begun using R Markdown. I was motivated because my employer now has an open data/open code requirement for all data we generate. My specific problem was that I often use R code but need to document what I am doing so that I can share my code. Hence, R Markdown was a perfect solution for me. As an added bonus, I have also switched my R teaching materials over to R Markdown and am now using it to develop an online course on mixed-effects models with R.

Previously, I used Sweave. Although powerful, Sweave offers similar functionality to R Markdown but requires the file to be compiled multiple times. Thus, Sweave offers me no benefit compared to R Markdown.

I usually use RStudio as my editor and love how it works. RStudio is easy to use and R Markdown is well documented. I was able to learn the program easily and get up to speed because of three factors. First, I previously used Sweave. Second, I am familiar with Markdown from StackOverflow. Third, I am good with R. My only regret is that I did not start using it earlier.

As for time, learning R Markdown only required a couple of hours on a Monday afternoon, and I am now fully up to speed. The tutorials built into RStudio were fabulous! In summary, I would recommend R Markdown to everybody wanting to create documents with R code embedded within them!
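For anyone wondering what that looks like in practice, here is a minimal sketch (the file name and contents are invented) that writes a tiny R Markdown document and renders it to HTML; it assumes the rmarkdown package and pandoc are installed:

library(rmarkdown)

## A tiny example document: YAML header, Markdown text, and one embedded R chunk
rmdLines <- c(
  "---",
  "title: \"A minimal example\"",
  "output: html_document",
  "---",
  "",
  "Some explanatory text, followed by embedded R code:",
  "",
  "```{r}",
  "summary(cars)",
  "```")

rmdFile <- file.path(tempdir(), "example.Rmd")
writeLines(rmdLines, rmdFile)

## Knit the R code and render the whole document to HTML
render(rmdFile)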


Review of Modeling for Insight


During graduate school, I attended a Joint Mathematics Meeting session on “quantitative literacy” in higher education. During that session, the book Modeling for Insight: A Master Class for Business Analysts was recommended to me, and I purchased the book shortly after the meeting.

The book was written by two professors: Stephen Powell, currently at Dartmouth College, and Bob Batt, currently at UW-Madison. They wrote the book for their graduate-level business program while both were at Dartmouth College. The target audience is people who need to do quantitative modeling and analysis but lack programming skills (e.g., MBA students). The book teaches both the modeling process necessary to make quantitative decisions and how to do so using Microsoft Excel (with some plugins).

The book’s overview of how to make quantitative decisions makes it a worthwhile purchase in and of itself. For example, the book describes “spreadsheet engineering” as four steps:

  1. Design
  2. Build
  3. Test
  4. Analyze

Although these are basic steps, they are important. Many scientists I know build models using more complicated tools and do not use all of these steps (for example, many ecological modelers do not “test” their code). Also, many builders of simple scientific models lack any formal process! Furthermore, even though the book targets business applications, many of the lessons apply to scientists as well. Indirectly, the authors do a good job of teaching the modeling process. Directly, much of modern science involves running large projects and planning those projects, and the case studies in the book can be adapted to scientific projects just as easily as to traditional business projects.

I have recommended this book to friends who work in business and to the admin team at my center for their own business modeling needs. Additionally, I have recommended the book to ecologists who teach modeling classes with Excel because of its valuable Excel modeling content. More broadly, this book is good for anyone who wants to learn spreadsheet modeling or wants an introduction to modeling in general.

That being said, there were a few things I disliked about the book. First, I do not like that they require Excel plugins. I understand why the authors use them, but I could not justify buying them. For example, Oracle Crystal Ball currently costs just slightly less than $1,000! However, this makes a good case for learning open-source programs such as R or Python! Second, some of their terms, like “spreadsheet engineering”, seem gimmicky to me. However, I suspect jargon like that is often used in the business world. Last, I am a coding snob and think everyone should learn programming. But, alas, we live in an imperfect world…

Overall, I give this book 5/5 stars! I recommend it for business people who need to learn to model but do not want to code; ecologists who use spreadsheet models (although I shake my head at you while doing so); and professors teaching undergraduates spreadsheet modeling in either ecology or business classes.

Data.gov, a great source of teaching data


Recently, I’ve been finding datasets for an R course I’m teaching. In the past, I’ve used my own data or the default R datasets. But I wanted to find datasets that would appeal to a broader audience than the ecologists and environmental scientists I usually interact with. Enter Data.gov.

Although the site is quirky, I’ve found several interesting datasets for students to explore. These range from economic data to medical data to crime data. I only have three criticisms. First, the page took a little while to figure out how to navigate, but I quickly got around this. Second, many cool datasets only have metadata because the original data cannot be shared. Alas, all good things have limits. Third, several pages had dead links. However, I was often able to find the original dataset with a quick Google search.

In conclusion, Data.gov rocks. It is an integrated data warehouse for data from around the US, ranging from local governments up to state and federal datasets. If it were a commercial product, I would give it 3 stars, but as a government product I give it 4 stars because of all the agencies it bridges.

My Favorite Topic to Teach


In this post I’m going to discuss a topic that I’m currently covering in UW – La Crosse’s MTH 353 Differential Equations course: Laplace transforms. While (like many math topics) I didn’t appreciate transforms as much as I should have when I learned them for the first time, they have become my favorite topic to teach in the undergraduate curriculum.

First, Pierre-Simon Laplace was an absolute crusher in the mathematical sciences. In addition to the transform method that bears his name, he’s responsible for a lot of the theoretical underpinnings of Bayesian statistics (one of Richard’s favorite topics), tidal flow, spherical harmonics, potential theory and Laplace’s equation, among many other things.

The Laplace transform, in its simplest application, transforms a linear, generally inhomogeneous, constant-coefficient ordinary differential equation in time t into an algebraic equation in a (complex) frequency variable s.
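For concreteness, here is the standard definition along with a small worked example of my own (with \(Y(s)\) denoting the transform of \(y(t)\)):

\[
\mathcal{L}\{f(t)\}(s) = F(s) = \int_0^\infty e^{-st} f(t)\,dt,
\qquad
\mathcal{L}\{y'\}(s) = sY(s) - y(0).
\]

Applying this to the initial-value problem \(y' + 2y = 1,\ y(0) = 0\) gives

\[
sY(s) + 2Y(s) = \frac{1}{s}
\quad\Longrightarrow\quad
Y(s) = \frac{1}{s(s+2)} = \frac{1/2}{s} - \frac{1/2}{s+2},
\]

and inverting term by term after the partial fraction decomposition recovers \(y(t) = \tfrac{1}{2}\left(1 - e^{-2t}\right)\).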

What I love the most about Laplace transforms as a topic in the mathematics curriculum is that it requires students to apply techniques from earlier in their training. For example:

– Completing the square (elementary algebra)

– Horizontal translation of functions (elementary algebra)

– Improper integration (second-semester calculus)

– Partial fraction decomposition (second-semester calculus)

– Linear transformations (linear algebra/functional analysis)

– Elementary theory of linear ODEs (elementary differential equations)

Another nice thing about the Laplace transform is that it can handle discontinuous inhomogeneous (forcing) data like Heaviside step functions and Dirac delta functions, as well as the more familiar forcing terms like polynomials, sines, cosines and exponential functions. When viewing the solution to a differential equation in this setting, in the space of functions of the variable s, one can clearly see how initial data and forcing data are propagated in time back in the original solution space.
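For reference, the standard transforms of these discontinuous inputs (for \(c \ge 0\), with \(F(s)\) the transform of \(f(t)\)) are

\[
\mathcal{L}\{u_c(t)\}(s) = \frac{e^{-cs}}{s},
\qquad
\mathcal{L}\{\delta(t-c)\}(s) = e^{-cs},
\qquad
\mathcal{L}\{u_c(t)\,f(t-c)\}(s) = e^{-cs}F(s),
\]

so a forcing term switched on at time \(t = c\) shows up in the \(s\)-domain simply as a factor of \(e^{-cs}\).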

If you’re interested in Laplace transforms, I’ve created some videos for my MTH 353 course, and they can be found here on my YouTube page!


Multiple inheritance in Python


The major reason I switched my population modeling from R to Python is that Python is a nicer language (obligatory xkcd reference here).

Today, I was trying to figure out how multiple inheritance works. Multiple inheritance allows a Python class to inherit attributes (e.g., methods) from more than one parent class. In my specific context, I am trying to build a complex spatial population model. I want to be able to merge spatial and demographic attributes to create new population classes. However, I was getting tripped up about how to code it in Python.

I found an example on StackOverflow, but I knew I would need to recode it myself to remember it. Plus, it’s a physics example, and I like biology story problems better. Here’s the example I created:


class apples:
    def myName(self):
        print("apples")

class oranges:
    def myName(self):
        print("oranges")

# The order of the parent classes determines which myName() is inherited
class fruit1(apples, oranges):
    pass

class fruit2(oranges, apples):
    pass

f1 = fruit1()
f2 = fruit2()
f1.myName()  # prints "apples"
f2.myName()  # prints "oranges"

Note that if you run the code, each fruit object produces a different name. This demonstrates the order of multiple inheritance in Python as well as the concept itself. Also, note how each class simply inherits myName() from the first parent class listed, which demonstrates inheritance.

Clustering Applied to Quarterback Play in the NFL


In this blog post I want to talk a bit about unsupervised learning. As some of you who know me may know, I am relatively new to data science and machine learning, having my formal educational training in applied mathematics/mathematical biology. My interest in machine learning came not through mathematical biology or ecology, but through studying football.

Using ProFootballFocus data (I am a data scientist for PFF), we can study the quality of quarterback play through the process of grading players on every play of every game of every season. To do so, it is most efficient to “cluster” quarterback seasons into buckets of similar seasons. The best way to do this (to date) is through k-means clustering.

While there are many references on k-means clustering in the literature and on the web, I’ll briefly summarize the idea in this blog. K-means clustering is an unsupervised learning algorithm that aims to partition a data set of n observations into k clusters, where each observation belongs to the one cluster with the nearest mean. Visually, one can think of a cluster as a collection of objects in m-dimensional space that are “close” to each other. Below is an example of clustering quarterbacks from the 2016 season by their proportions of positively-graded and negatively-graded throws. Different clusters are visualized with different colors:
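Purely as an illustration of the technique (using simulated grade proportions rather than PFF data), k-means in R boils down to a few lines:

set.seed(1)

## Simulated quarterback seasons: proportions of positively- and negatively-graded throws
qb <- data.frame(posGrade = c(rnorm(20, 0.20, 0.02), rnorm(20, 0.14, 0.02)),
                 negGrade = c(rnorm(20, 0.08, 0.01), rnorm(20, 0.13, 0.01)))

## Partition the seasons into k = 2 clusters; nstart restarts guard against bad initial means
km <- kmeans(qb, centers = 2, nstart = 25)

## Color each season by its assigned cluster
plot(qb$posGrade, qb$negGrade, col = km$cluster, pch = 19,
     xlab = "Proportion positively graded", ylab = "Proportion negatively graded")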

As a part of our in-depth study of quarterback play at PFF, we clustered quarterbacks on the composition of their play-by-play grades in various settings (when under pressure, when kept clean, when using play action). This gave us a tier-based system in which to evaluate the position throughout the PFF era (2006-present). In 2016, the only quarterback that was in our top cluster on all throws, throws from a clean pocket, throws under pressure, and throws on third and long was the New England Patriots’ star Tom Brady.

Stay tuned for more of an in-depth look at the quarterback position by visiting profootballfocus.com both in-season and during the offseason.


Random versus fixed effects


Wrapping my head around random versus fixed effects took me a while in graduate school. In part, this is because multiple definitions exist. Within ecology, the general definition I see is that a fixed effect is estimated by itself, whereas a random effect is drawn from a higher-level distribution. Two examples drilled this home for me and helped it click.

First, the question “Do we care about the specific group, or only that the groups might be having an impact?” helped me see the difference between fixed and random effects. For example, if we were interested in air quality as a function of temperature across cities, city could be either a fixed or a random effect. If city were a fixed effect, then we would be interested in the air quality in each specific city (e.g., the air quality in New York, Los Angeles, and Chicago). Conversely, if city were a random effect, then we would not care about any specific city, only that a city might impact the results due to city-specific conditions.

Second, an example in one of Marc Kéry’s books on WinBUGS drilled home the point. Although he used WinBUGS, the R package lme4 can be used to demonstrate this. Additionally, although his example was something about snakes, a generic regression will work. (I mostly remember the figure; it was about 5 or 6 years ago and I have not been able to find the example in his book, so I recreated the code from memory.) Here’s the code:

library(ggplot2)
library(lme4)

## Simulate three populations with known intercepts and a common slope
population <- rep(c("a", "b", "c"), each = 3)
intercept  <- rep(c(1, 5, 6), each = 3)
slope <- 4
sd    <- 2.0

dat <- data.frame(
  population     = population,
  interceptKnown = intercept,
  slopeKnown     = slope,
  sdKnown        = sd,
  predictor      = rep(1:3, times = 3))

dat$response <- with(dat,
  rnorm(n = nrow(dat), mean = interceptKnown, sd = sdKnown) +
    predictor * slopeKnown)

## Run models: fixed-effect intercepts (lm) versus random intercepts (lmer)
lmOut   <- lm(response ~ predictor + population, data = dat)
lmerOut <- lmer(response ~ predictor + (1 | population), data = dat)

## Add predictions to the data frame
dat$lm   <- predict(lmOut, newdata = dat)
dat$lmer <- predict(lmerOut, newdata = dat)

ggplot(dat, aes(x = predictor, y = response, color = population)) +
  geom_point(size = 2) +
  scale_color_manual(values = c("red", "blue", "black")) +
  theme_minimal() +
  geom_line(aes(x = predictor, y = lm)) +
  geom_line(aes(x = predictor, y = lmer), linetype = 2)

Which produces this figure:

Example of a fixed-effect intercept (solid line) compared to a random-effect (dashed line) regression analysis.

Play around with the code if you want to explore this more. At first, I could not figure out how to make the dashed lines farther apart from the solid lines. Change the simulated standard deviation to see what happens. Hint: my initial guess of decreasing it did not help.

Test Driven Development


Recently at work, I’ve been building a complex, spatially explicit population model. The model is complex enough that I started programming it in Python because R did not easily allow me to build it. Initially, while developing the model, I used informal “testing” to make sure it produced the correct results. For example, I would write a test script to plot simple results and make sure the outputs looked okay. However, this approach was not satisfactory, and it was suggested to me that I use Test Driven Development (TDD).

With TDD, I write a small unit test and then program a function or a few lines of code to pass the test. The test is written in a second script file. After writing the new model code, I run the test script and make sure the test passes. As part of Python’s “batteries included” philosophy, base Python even comes with a module for unit testing built in.
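My model code is Python, but the same write-the-test-first loop can be sketched in R (the language used elsewhere on this blog) with the testthat package; the little growth function here is just a made-up illustration, not code from my model:

library(testthat)

## Hypothetical model function under development
grow_population <- function(n, rate) {
  n * rate
}

## The test lives in a separate script and is re-run after every change;
## if a later "improvement" alters the behavior, the test fails.
test_that("population grows by the expected rate", {
  expect_equal(grow_population(100, 1.5), 150)
  expect_equal(grow_population(0, 1.5), 0)
})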

This seems simple enough, but I now love TDD! With TDD, I know my code does what I think it is doing (something that is not always easy with complicated models or code). Also, I know I can change my code without changing its behavior! For example, if I try to improve a function, I now simply re-run the test script to make sure I didn’t change or break anything.

Although seemingly overkill for simple ecological models, TDD improves the quality and reproducibility of our models. Also, using TDD makes me a better and more confident programmer. My only regret is that I did not start using TDD earlier! For anyone wanting to learn TDD, I found this to be a helpful introduction as well as the Python documentation on unit testing.

Teaching Mathematical Biology at the College Level


As another semester at the University of Wisconsin – La Crosse reaches its halfway point, it’s time to start preparing for my spring class – MTH 265: Mathematical Models in Biology. This is a course that only has a first-semester calculus prerequisite, meaning that it is unlike many of the mathematical biology courses around the world (which often require differential equations and/or linear algebra as a prerequisite).

My thought process when teaching this course is that the students likely do not have the mathematical background to fully appreciate the breadth and depth that mathematical biology has to offer. Whether it’s the global and/or asymptotic stability of equilibria of difference equations, principal component analysis applied to multivariate data, or Markov processes applied to allele frequencies, research-level mathematical biology requires mathematical flexibility and maturity. However, most of the students in my MTH 265 class are not mathematics majors. Many will be researchers or practitioners of the life sciences, though, meaning that they will have to interact in a meaningful way with mathematicians, statisticians and computer scientists at some point during their careers. Thus, my goal for the course eventually became to give a survey of many different topics pertaining to mathematical biology during the 15-week course. This way, the students will know that a solution (possibly) exists to their quantitative problems (even if they may not be able to come up with it themselves). Simply knowing such a solution exists allows one to approach the right people for collaborations, and keeps the math-biology interface a fruitful one.

Survey courses are fairly common in graduate work, but students in their second semester of mathematics are pretty new to reading mathematics. Thus, to cover the material in 15 weeks, I created a collection of videos as a part of an inverted, or “flipped” classroom.  Videos appear to be a medium that reaches current students better than (or in conjunction with) traditional textbooks. Students were asked to view these videos prior to class, while during class they were assigned groups in which they worked on “case studies” that took the duration of the hour. I provided assistance with the case studies, as well as any homework questions the students had.

The term “flipped” comes from the way the course is structured relative to a traditional course, where lectures occur during the regular class period (where the professor is present but the student engagement is low) and homework/case studies occur outside of the classroom space (where demands on the student are high, but direct help from the professor is not immediately available).

This course has been a great success. Some of the things we’ve learned from flipping the course can be found in this paper, and were used in a section of Grand Valley State professor Robert Talbert’s new book on flipped learning in the college classroom. I owe a great deal of my ideas to the Mathematical Association of America, especially their Project NExT program. The progress we’ve made as educators, even in the short time (six years) I’ve been a mathematics professor, has me excited for what is to come.