Knitr and R Markdown


I’m late to the game, but I have recently begun using R Markdown. I was motivated because my employer now has an open data/open code requirement for all data we generate. My specific problem was that I often use R, but need to document what I am doing so that I may share my code. Hence, R Markdown was a perfect solution for me. As an added bonus, I have also switched my R teaching materials over to R Markdown and am now using it to develop an online course on mixed-effect models with R.

Previously, I used Sweave. Although powerful, Sweave offers similar functionality to R Markdown but requires the file to be compiled multiple times. Thus, Sweave offers me no benefit compared to R Markdown.

I usually use RStudio as my editor and love how it works. RStudio is easy to use and R Markdown is well documented. I was able to learn the program easily and get up to speed because of three factors. First, I previously used Sweave. Second, I am familiar with Markdown from StackOverflow. Third, I am good with R. My only regret is that I did not start using it earlier.

As for time, learning R Markdown only required a couple of hours on a Monday afternoon, and I am now fully up to speed. The tutorials built into RStudio were fabulous! In summary, I would recommend R Markdown to everybody wanting to create documents with R code embedded within them.


Review of Modeling for Insight


During graduate school, I attended a Joint Mathematics Meeting, including a session on “quantitative literacy” in higher education. During that session, the book Modeling for Insight: A Master Class for Business Analysts was recommended to me, and I purchased the book shortly after the meeting.

The book was written by two professors: Stephen Powell, currently at Dartmouth College, and Bob Batt, currently at UW-Madison. They wrote the book for their graduate-level business program while both were at Dartmouth College. The target audience is people who need to do quantitative modeling and analysis but lack programming skills (e.g., MBA students). The book teaches both the modeling process necessary to make quantitative decisions and how to do so using Microsoft Excel (with some plugins).

The book’s overview of how to make quantitative decisions makes it a worthwhile purchase in and of itself. For example, the book describes “spreadsheet engineering” as four steps:

  1. Design
  2. Build
  3. Test
  4. Analyze

Although these steps are basic, they are important. Many scientists I know build models using more complicated tools and do not follow all of these steps (for example, many ecological modelers do not “test” their code). Also, many builders of simple scientific models lack any formal process! Furthermore, even though the book targets business applications, many of the lessons apply to scientists as well. Indirectly, the authors do a good job of teaching the modeling process. Directly, much of modern science involves planning and running large projects, and the case studies in the book can be adapted to scientific projects just as easily as traditional business projects.

I have recommended this book to friends who work in business and to the admin team at my center for their own business modeling needs. Additionally, I have recommended the book to ecologists who teach modeling classes with Excel because of its valuable Excel modeling content. More broadly, this book is good for anyone who wants to learn spreadsheet modeling or wants an introduction to modeling in general.

That being said, there were a few things I disliked about the book. First, I do not like that it requires Excel plugins. I understand why the authors use them, but I could not justify buying them. For example, Oracle Crystal Ball currently costs just slightly less than $1,000! However, this makes a good case for learning open-source languages such as R or Python! Second, some of their terms, like “spreadsheet engineering,” seem gimmicky to me. However, I suspect jargon like that is often used in the business world. Last, I am a coding snob and think everyone should learn programming. But, alas, we live in an imperfect world…

Overall, I give this book 5/5 stars! I recommend it for business people who need to learn to model but do not want to code; ecologists who use spreadsheet models (although I shake my head at you while doing so); and professors teaching spreadsheet modeling to undergraduates in either ecology or business classes.

Data.gov, a great source of teaching data


Recently, I’ve been finding datasets for an R course I’m teaching. In the past, I’ve used my own data or the default R datasets. But I wanted to find datasets that would appeal to a broader audience than the ecologists and environmental scientists I usually interact with. Enter Data.gov.

Although the site is quirky, I’ve found several interesting datasets for students to explore. These range from economic data to medical data to crime data. I only have three criticisms. First, it took me a little while to figure out how to navigate the site, although I quickly got the hang of it. Second, many cool datasets only have metadata because the original data cannot be shared. Alas, all good things have limits. Third, several pages had dead links. However, I was often able to find the original dataset with a quick Google search.

In conclusion, Data.gov rocks. It is an integrated data warehouse for data from around the US, ranging from local governments up to state and federal datasets. If it were a commercial product, I would give it 3 stars, but as a government product I give it 4 stars because of all the agencies it bridges.

Multiple inheritance in Python


One of the major reasons I switched my population modeling from R to Python is that Python is a nicer language (obligatory xkcd reference here).

Today, I was trying to figure out how multiple inheritance works. Multiple inheritance allows a Python class to inherit attributes (e.g., methods) from more than one parent class. In my specific context, I am trying to build a complex spatial population model, and I want to be able to merge spatial and demographic attributes to create new population classes. However, I was getting tripped up by how to code this in Python.

I found an example on StackOverflow, but I knew I would need to recode it myself to remember it. Plus, it was a physics example, and I like biology story problems better. Here’s the example I created:


class apples:
    def myName(self):
        print("apples")

class oranges:
    def myName(self):
        print("oranges")

# fruit1 lists apples first, so apples wins the method lookup
class fruit1(apples, oranges):
    pass

# fruit2 reverses the order, so oranges wins
class fruit2(oranges, apples):
    pass

f1 = fruit1()
f2 = fruit2()
f1.myName()  # prints "apples"
f2.myName()  # prints "oranges"

Note that if you run the code, each fruit object produces a different name. This demonstrates both the concept of multiple inheritance in Python and the role of inheritance order. Also, note that myName() is defined in both parent classes; which version a subclass inherits is determined by the order of the parents.
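
To see the lookup order Python actually uses, you can inspect each class’s method resolution order (MRO); this small snippet just extends the example above:

print(fruit1.__mro__)
# (<class '__main__.fruit1'>, <class '__main__.apples'>,
#  <class '__main__.oranges'>, <class 'object'>)

print(fruit2.__mro__)
# (<class '__main__.fruit2'>, <class '__main__.oranges'>,
#  <class '__main__.apples'>, <class 'object'>)

The first parent listed after the subclass itself is where Python finds myName(), which explains why f1 and f2 print different names.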

Random versus fixed effects


Wrapping my head around random versus fixed effects took me a while in graduate school. In part, this is because multiple definitions exist. Within ecology, the general definition I see is that a fixed effect is estimated on its own, whereas a random effect is assumed to come from a higher-level distribution. Two examples drilled this home for me and helped it click.

First, the question “Do we care about the specific group, or only that the groups might be having an impact?” helped me see the difference between fixed and random effects. For example, if we were interested in air quality as a function of temperature across cities, city could be either a fixed or random effect. If city were a fixed effect, then we would be interested in the air quality of those specific cities (e.g., the air quality in New York, Los Angeles, and Chicago). Conversely, if city were a random effect, then we would not care about any specific city, only that city might impact the results due to city-specific conditions.

Second, an example in one of Marc Kéry’s books on WinBUGS drilled home the point. Although he used WinBUGS, the R package lme4 can be used to demonstrate this. Additionally, although his example was something about snakes, a generic regression will work. (I saw the example about 5 or 6 years ago and have not been able to find it in his book again, so I recreated the figure from memory.) Here’s the code:

library(ggplot2)
library(lme4)

## Simulate three populations with known intercepts and a shared slope
set.seed(123)
population = rep(c("a", "b", "c"), each = 3)
intercept = rep(c(1, 5, 6), each = 3)
slope = 4
sd = 2.0

dat = data.frame(
  population = population,
  interceptKnown = intercept,
  slopeKnown = slope,
  sdKnown = sd,
  predictor = rep(1:3, times = 3))

dat$response = with(dat,
  rnorm(n = nrow(dat), mean = interceptKnown, sd = sdKnown) +
    predictor * slopeKnown
)

## Run models: population as a fixed effect (lm) vs. a random intercept (lmer)
lmOut <- lm(response ~ predictor + population, data = dat)
lmerOut <- lmer(response ~ predictor + (1 | population), data = dat)

## Add each model's predictions to the data frame
dat$lm <- predict(lmOut, newdata = dat)
dat$lmer <- predict(lmerOut, newdata = dat)

## Plot: solid lines are the fixed-effect fit, dashed lines the random-effect fit
ggplot(dat, aes(x = predictor, y = response, color = population)) +
  geom_point(size = 2) +
  scale_color_manual(values = c("red", "blue", "black")) +
  theme_minimal() +
  geom_line(aes(x = predictor, y = lm)) +
  geom_line(aes(x = predictor, y = lmer), linetype = 2)

Which produces this figure:

Example of a fixed-effect intercept (solid line) compared to a random-effect (dashed line) regression analysis.

Play around with the code if you want to explore this more. At first, I could not figure out how to make the dashed lines sit farther from the solid lines. Change the simulated standard deviation to see what happens. Hint: my initial guess of decreasing it did not help.

Test Driven Development


Recently at work, I’ve been building a complex, spatially-explicit population model. The model is complex enough that I started using Python to program it because the model was difficult to program in R. Initially, while developing the model, I used informal “testing” to make sure it produced the correct results. For example, I would write a test script to plot simple results and make sure the outputs looked okay. However, this approach was not satisfactory, and it was suggested to me that I use Test-Driven Development (TDD).

With TDD, I write a small unit test and then program a function or a few lines of code to pass the test. The test lives in a second script file. After writing the new model code, I run the test file and make sure the test passes. As part of Python’s “batteries included” philosophy, base Python even comes with a module for unit testing built in.
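
To make this concrete, here is a minimal sketch of what one of these test files can look like using the built-in unittest module; the grow() function is a made-up stand-in, not something from my actual model:

import unittest

# Hypothetical function under test; in a real project this would be
# imported from the model's module instead of defined here
def grow(population, rate):
    """Return the population size after one time step."""
    return population * rate

class TestGrow(unittest.TestCase):
    def test_growth(self):
        # A population of 100 growing at rate 1.5 should become 150
        self.assertAlmostEqual(grow(100, 1.5), 150.0)

    def test_no_growth(self):
        # A rate of 1 should leave the population unchanged
        self.assertAlmostEqual(grow(100, 1.0), 100.0)

if __name__ == "__main__":
    unittest.main()

Running the file executes every test and reports any failures, which is the feedback loop that makes TDD work.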

This seems simple enough, but I now love TDD! With TDD, I know my code does what I think it is doing (something that is not always easy with complicated models or code). Also, I can change my code without changing its behavior! For example, if I try to improve a function, I simply re-run the test script to make sure I didn’t change or break anything.

Although seemingly overkill for simple ecological models, TDD improves the quality and reproducibility of our models. Also, using TDD makes me a better and more confident programmer. My only regret is that I did not start using TDD earlier! For anyone wanting to learn TDD, I found this introduction helpful, as well as the Python documentation on unit testing.

Continuous time, discrete event models


Recently, I’ve been exposed to situations where I am trying to model discrete, binary events (i.e., 0 or 1, like heads-or-tails). My knee-jerk response has been to use a logistic regression or another model with a binomial outcome. The jack-of-all-trades generalized linear model usually serves me well in these situations. However, my recent problems have involved events occurring in continuous time. Although a Cox proportional hazards model can be used if the event is something like survival, that did not seem appropriate for my situation because I had multiple events occurring per individual. Enter continuous-time, discrete-event models.

A Poisson regression is similar to a binomial model if the probability of an event occurring is small enough. Enter the Poisson regression as a method for modeling animal behavior. I first saw this in a mathematical statistics paper describing models of animal movement, but found another paper by some of the co-authors that was more accessible. From this, I learned I needed to use the following version of the Poisson regression:

y ~ Poisson(μ)

μ = τ exp(x′β),

where τ is the length of the observation period (the exposure) and x′β is the usual linear predictor.

I was able to program this in Stan by adapting code I found online. This model can also be modified to treat individual as a random effect (and prevent pseudo-replication) if the data allow or require it.
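
My implementation was in Stan, but as a rough sketch of the same idea, the model can also be fit with Python’s statsmodels; the variable names and parameter values below are invented for illustration:

import numpy as np
import statsmodels.api as sm

# Simulate events counted over observation periods of varying length
rng = np.random.default_rng(42)
n = 200
tau = rng.uniform(0.5, 2.0, size=n)      # observation time (the exposure)
x = rng.normal(size=n)                   # a single behavioral predictor
mu = tau * np.exp(-1.0 + 0.5 * x)        # mu = tau * exp(x'beta)
y = rng.poisson(mu)                      # observed event counts

# Fit the Poisson regression; `exposure` supplies tau on the log scale
X = sm.add_constant(x)
fit = sm.GLM(y, X, family=sm.families.Poisson(), exposure=tau).fit()
print(fit.summary())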

Trend analysis from aggregate data


Often, people collect data through time with replication. For example, the LTRM program collects fish, aquatic vegetation, and water quality data through time, with multiple samples collected each year. However, these observations are not independent, and failing to account for this would be pseudoreplication. Aggregating (e.g., taking the mean of) the data within each year is one method to prevent pseudoreplication. But aggregating comes with a trade-off: information in the raw data is lost. State-space models may be a method to recover this information.

State-space models describe a true, but unknown and unmeasurable, “state” (e.g., the “true” population of catfish in the Upper Mississippi River) along with the observation error associated with collecting the data. Kalman filters, such as those implemented in the MARSS package in R, can be used to fit these models.
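
As an illustration of the idea (this is not the MARSS analysis itself), a simple local-level state-space model can be fit with the Kalman filter in Python’s statsmodels:

import numpy as np
import statsmodels.api as sm

# Simulate an unobserved trend (random walk) observed with error
rng = np.random.default_rng(1)
n = 50
state = np.cumsum(rng.normal(scale=0.5, size=n))  # the "true" state
y = state + rng.normal(scale=1.0, size=n)         # noisy observations

# Local-level model: a latent random walk plus observation error,
# estimated with the Kalman filter
model = sm.tsa.UnobservedComponents(y, level="local level")
result = model.fit()
print(result.summary())

The fitted model separates the variance of the latent trend from the observation-error variance, which is exactly the information lost by simple aggregation.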

We were interested in comparing state-space models from the MARSS package to other methods, such as simple linear regression and auto-regressive models (publication here). Using simulated data and observed data from the LTRM, we found that the simpler models performed better than the state-space models, likely because the LTRM time series were not long enough for the state-space models.

Integral projection models


Matrix population models describe populations using discrete life, size, or age stages. Scientists apply these models to understand population ecology and guide conservation. However, some species have continuous life histories. For example, thistles grow continuously, as presented in this paper.

Fish also grow continuously. We sought to understand how different management approaches could be used to control grass carp, a species that impacts native ecosystems by out-competing native fish. Managers were interested in evaluating the use of YY-males to control populations. YY-males work because they spawn and produce only male offspring. Thus, it is possible in theory to cause a population to crash by biasing the sex ratio.

We constructed an integral projection model for grass carp and compared different YY-male release strategies. We found that the life history of grass carp does not work well with the YY-male strategy because the species is long-lived and females produce many offspring.
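
To give a flavor of the mechanics, here is a toy sketch of how an IPM kernel gets discretized into a large projection matrix with the midpoint rule; the kernel and all parameters below are invented for illustration and are not our grass carp model:

import numpy as np
from scipy.stats import norm

# Discretize the continuous size range with the midpoint rule
n_mesh = 100
lower, upper = 0.0, 10.0
h = (upper - lower) / n_mesh                   # bin width
mesh = lower + h * (np.arange(n_mesh) + 0.5)   # bin midpoints

# Toy kernel K(y, x): survival * growth + fecundity (made-up parameters)
def kernel(y, x):
    survival = 0.8
    growth = norm.pdf(y, loc=0.9 * x + 1.0, scale=0.5)       # grow from x to y
    fecundity = 0.05 * x * norm.pdf(y, loc=1.0, scale=0.3)   # offspring sizes
    return survival * growth + fecundity

# Build the projection matrix: K[i, j] maps size x_j to size y_i
K = h * kernel(mesh[:, None], mesh[None, :])

# As with matrix models, the dominant eigenvalue is the population growth rate
lam = np.max(np.abs(np.linalg.eigvals(K)))
print(f"lambda = {lam:.3f}")

Because the discretized kernel behaves like a very fine-grained matrix model, the familiar matrix-model toolbox (eigenvalues, sensitivities) carries over directly.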


Review of Deep Work


I recently read the book Deep Work by Cal Newport. In a short sentence, the book is anti-multitasking. More specifically, the book discusses the importance of focusing and concentrating in order to work deeply on one topic and do it well.

I liked the book, probably because it matches my preconceived notions about how to be effective. However, the book did provide new ideas for me, so I decided to implement some of its suggestions at work. Specifically, I scheduled my time and focused on one project at a time. By doing this, I got a great deal done. I suspect this was for three reasons:

  1. I was being intentional about being productive;
  2. I created mini-deadlines, which made me more productive; and
  3. I avoided multi-tasking.

I plan on adding this book to my re-read-every-couple-of-years shelf, and I highly recommend it. I’ve also become aware of Cal’s blog because of the book and recommend that as well (Cal runs a life-hacks blog on working more efficiently). I particularly like his workflow re-engineering post. Maybe I’ll post an update on how that works for me in a future post.