When the power of data.table clicked

Standard

Historically, R has been limited by memory (for example, see this post). Although the program has gotten better through time, R still faces data limitations when working with “big data”. In my experience, data.frames and read.csv become too slow when reading files >0.5GB or so and data.frames start to clog up the system when the get bigger than >1.0 GB. However, some recent packages have been developed work with “big data”. The R High Performance Computing page describes some packages and work around.

My personal favorite package for working with big data in R is data.table. The package was designed by quantitative finance people for working with big data (e.g., files > 100GB). I discovered the package while trying to optimize a population model. Now, I use the package as my default methods for reading in data to R and manipulating data in R.

Besides being great for large data, the package also uses a slick syntax for manipulating data. As described by the vignette on the topic, the package has some cool methods for merging and sorting data. The package maintainers describe it as similar to SQL, although I do not know SQL, so I cannot comment on the analogy.

After taking the DataCamp Course on data.table, I better learned how to use the package. I was also soon able to improve my work by this knowledge. One call in particular blows my mind and impresses me:

 lakeTotalPop <-  lakeScenariosData[ , .('Population' = sum(Population)), by = .(Year,  Scenario, Stocking, Lifespan, PlotLife, Sex == "Male" | Sex == "Female")] 

This code allowed me to aggregate data by a condition. Something that requires multiple steps of clunky code in base R. Even if I learned nothing else, this one line of code would make my entire DataCamp experience worth while!

Multiple inheritance in Python

Standard

The major reasons I switched my population modeling from R to Python is that Python is a nicer language (obligatory xkcd reference here).

Today, I was trying to figure out how multiple inheritance works. Multiple inheritance allows a Python class to inherit the attributes (e.g., functions) from other classes. In my specific context, I am trying to build a complex spatial population model. I want to be able to merge spatial and demographic attributes to create new population classes. However, I was getting tripped up about how to code it in Python.

I found an example on StackOverflow, but I knew I would need to recode it myself to remember. Plus, it’s a physics example, and I like biology story problems better. Here’s the example  I created:


class apples:
def myName(self):
print "apples"

class oranges:
def myName(self):
print “oranges”

class fruit1(apples, oranges):
pass

class fruit2( oranges, apples):
pass

f1 = fruit1()
f2 = fruit2()
f1.myName()
f2.myName()

Note that if you run the code, each fruit object produces different a different name. This demonstrates the order of multiple inheritance in Python as well as the concept. Also, note how the function myName() is overloaded (which simply demonstrates inheritance).

Random versus fixed effects

Image

Wrapping my head around random versus fixed effects took me a while in graduate school. In part, this is because multiple definitions exist. Within ecology, the general definition I see is that a fixed effect is estimated by itself whereas a random effect comes from a higher distribution. Two examples drilled this home for me and helped it click.

First, the question: “Do we care about the specific group or only that the groups might be having an impact?” helped me see the difference between fixed and random effects. For example, if we were interested in air quality data as a function of temperature across cities, city could be either a fixed or random effect. If city was a fixed effect, then we would be interested the air quality at that specific city (e.g., the air quality in New York, Los Angles, and Chicago). Conversely, if city as a random effect, then we would not care about a specific city, only that a city might impact the results due to city specific conditions.

Second, an example in one of Marc Kerry’s book on WinBugs drilled home the point. Although he used WinBugs, the R package lme4 can be used to demonstrate this. Additionally, although his example was something about snakes, a generic regression will work. (I mostly remember the figure and had to recreate it from memory. It was about ~5 or 6 years ago and I have not been able to find the example in his book to recreate it, hence I coded this from memory). Here’s the code

library(ggplot2)
library(lme4)

population = rep(c(“a”, “b”, “c”), each = 3)
intercept = rep( c(1, 5, 6), each = 3)
slope = 4
sd = 2.0

dat = data.frame(
population = population,
interceptKnown = intercept,
slopeKnown = slope,
sdKnown = sd,
predictor = rep(1:3, times = 3))

 

dat$response = with(dat,
rnorm(n = nrow(dat), mean = interceptKnown, sd = sdKnown) +
predictor * slopeKnown
)

## Run models
lmOut <- lm(response ~ predictor + population, data = dat)
lmerOut <- lmer( response ~ predictor + (1 | population), data = dat)

## Create prediction dataFrame
dat$lm <- predict(lmOut, newData = dat)
dat$lmer <- predict(lmerOut, newData = dat)

ggplot(dat, aes(x = predictor, y = response, color = population)) +
geom_point(size = 2) +
scale_color_manual(values = c(“red”, “blue”, “black”)) +
theme_minimal() +
geom_line(aes(x = predictor, y = lm)) +
geom_line(aes(x = predictor, y = lmer), linetype = 2)

Which produces this figure:

Example of a fixed-effect intercept (solid line) compared to a random-effect (dashed line) regression analysis.

Play around with the code if you want to explore this more. At first, I could not figure out how to make the dashed lines be farther apart from the solid lines. Change the simulated standard deviation to see what happens. Hint, my initial guess of decreasing did not help.

tikz in LaTeX and Structural Equation Modeling

Standard

During grad school, I attended an ESA Workshop on Structural Equation Modeling (SEM) let by Jim Grace. The approach allows for multivariate analysis with multiple predictors, multiple response variables, and latent variables. Up until now, my research never required using the method and I never bought the software he recommended at the time because the GUI program recommended by Grace was too expensive for my limited needs.

Recently, I had a need to use SEM at work. We had two response variables: environmental DNA (eDNA) and the ash-free dry weight of an aquatic organism (AFDW). Both were predicted by multiple environmental variables and AFDW predicted eDNA. A perfect problem for SEM.

To refresh myself of SEM, I revisited Grace’s work. I discovered that he maintains an excellent tutorial about SEM. The pages provide a nice introduction, as does his (slightly outdated) book, his classic book, and a recent Ecoshephere article.

However, I did not have a nice way to plot my results. I did not want to use a WYSIWYG tool like Inkscape or Power Point. But I remembered the tikz package in LaTeX. Here’s the figure I created:

Example of an SEM plot.

Example SEM plot.

I created the figure using this LaTeX code:


\documentclass{article}

\usepackage[paperheight =11.3cm, paperwidth =9.5cm, margin = 0.1cm]{geometry}

\usepackage{tikz}
\usetikzlibrary{arrows}
\usetikzlibrary{positioning}

\begin{document}

\pagenumbering{gobble}

\begin{tikzpicture}[ -> , >=stealth',auto,node distance=3.5cm,
thick,main node/.style={rectangle,draw, font=\sffamily}]

\node[main node] (1) {Lake};
\node[main node] (2) [below of=1] {Depth};
\node[main node] (3) [below of=2] {Non-habitat};
\node[main node] (4) [below of=3] {Habitat};

\node[main node] (6) [below right of=2, align = center] {AFDW\\ \(r^2 = 0.223\)};
\node[main node] (7) [right of=6, align = center] {eDNA\\ \(r^2 = 0.384\)};

\path[every node/.style={font=\sffamily\small}]
(1) edge node [above = 40pt] {\textbf{0.497}} (6)
(2) edge node [left = 10pt] {\textbf{-0.370}} (6)
(3) edge node [above] {0.094} (6)
(4) edge node [left = 10pt] {0.116} (6)

(1) edge[bend left] node [above = 10 pt] {\textbf{0.385}} (7)
(2) edge[bend left] node [above = 5pt ] {0.197} (7)
(3) edge[bend right] node [above = 0pt] {-0.298} (7)
(4) edge[bend right] node [below = 5pt] {0.204} (7)

(6) edge node [ ] {-0.180} (7);

\end{tikzpicture}

\end{document}

Using RcppArmadillo for a matrix population model

Standard

I’ve had a busy winter and spring with training for the Birkie and then recovering from the Birkie, hence the few posts. One of the things I’ve been doing is teaching myself the Rcpp package in R. This package lets people easily use C++ code within R. In this post, I demonstrate how I used Rcpp and specifically the RcppArmadillo package to create a population model.

Matrix models are popular in ecology. These models are series of difference equations (i.e., discrete time). I was interested in coding a simple example for a two life-stage species with the projection matrix R.

To code this up in R, I use the following code

popModelR A = matrix(c(0, 2, 0.25, 0.5), nrow = 2, byrow = TRUE),
P0 = c(0, 50)){
P = matrix(0, nrow = dim(A)[1], ncol = nYears + 1)
P[, 1] = P0
for( t in 1:nYears){
P[ , t + 1] = A  %*% P[, t]
}
return(P)
}

nYears = 10
A = matrix(c(0, 2, 0.25, 0.5),
nrow = 2, byrow = TRUE)
P0 = matrix(c(0, 50), nrow = 2)
P0
popModelROut <- popModelR(nYears = 10,
A = A,
P0 = P0)

Obviously, this simple model runs fast, but how would one code this with Rcpp? In order to get matrix operators, I needed to use the RcppArmadillo package, so my code looks like this:

library(“inline”)
library(“Rcpp”)
library(“RcppArmadillo”)

src1 <- ‘
int nYearsX = Rcpp::as<int>(nYears);
arma::mat P0X = Rcpp::as<arma::mat>(P0);
arma::mat AX  = Rcpp::as<arma::mat>(A);
arma::mat PX(AX.n_cols, nYearsX + 1);
PX.col(0) = P0X;

for(int t = 0; t < nYearsX; t++) {
PX.col(t + 1) =  AX * PX.col(t);
}

return Rcpp::wrap(PX);

popModelRcpp <- cxxfunction(signature(nYears = “integer”,
A = “matrix”,
P0 = “matrix”),
src1, plugin = “RcppArmadillo”)

popModelRcpp(nYears, A, P0)

Now, to compare the two functions, I use the benchmark package and run the model for 100 simulated years:

library(rbenchmark)
nYears = 100
res popModelR(nYears, A, P0),
columns = c(“test”, “replications”, “elapsed”,
“relative”, “user.self”, “sys.self”),
order = “relative”,
replications = 1000)
print(res)


test replications elapsed relative user.self sys.self
1 popModelRcpp(nYears, A, P0) 1000 0.02 1.00 0.00 0.00
2 popModelR(nYears, A, P0) 1000 0.53 29.61 0.64 0.00

The Rcpp code is almost 30 times quicker than the base code in R!