When the power of data.table clicked


Historically, R has been limited by memory (for example, see this post). Although the program has improved over time, R still faces data limitations when working with “big data”. In my experience, data.frames and read.csv become too slow when reading files larger than roughly 0.5 GB, and data.frames start to clog up the system when they get bigger than about 1 GB. However, some recent packages have been developed to work with “big data”. The R High Performance Computing page describes some of these packages and workarounds.

My personal favorite package for working with big data in R is data.table. The package was designed by quantitative finance people for working with big data (e.g., files > 100 GB). I discovered the package while trying to optimize a population model. Now, I use it as my default method for reading data into R and manipulating data within R.
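The package's file reader is fread(). Here is a minimal sketch of how it is typically used (the file name is just a placeholder):

library(data.table)

## fread() reads delimited files quickly and auto-detects
## the separator and column types
bigData <- fread("bigFile.csv")

## the result is a data.table, which is also a data.frame
class(bigData)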

Besides being great for large data, the package also uses a slick syntax for manipulating data. As described in the package's vignettes, it has some nice methods for merging and sorting data. The package maintainers describe the syntax as similar to SQL, although I do not know SQL, so I cannot comment on the analogy.
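As a rough illustration of that syntax, here is a small sketch of a keyed merge and a sort (the table and column names are made up for this example):

library(data.table)

## two small example tables
fishCounts <- data.table(lake = c("A", "B", "C"), count = c(10, 25, 3))
lakeInfo <- data.table(lake = c("A", "B", "C"), area = c(1.2, 3.4, 0.8))

## join the two tables on their shared key (lake)
setkey(fishCounts, lake)
setkey(lakeInfo, lake)
merged <- fishCounts[lakeInfo]

## sort by count, largest first
merged[order(-count)]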

After taking the DataCamp course on data.table, I learned how to use the package much better and was soon able to apply this knowledge to improve my work. One call in particular still blows my mind:

lakeTotalPop <- lakeScenariosData[, .('Population' = sum(Population)), by = .(Year, Scenario, Stocking, Lifespan, PlotLife, Sex == "Male" | Sex == "Female")]

This code allowed me to aggregate data by a condition, something that requires multiple steps of clunky code in base R. Even if I had learned nothing else, this one line of code would have made my entire DataCamp experience worthwhile!
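For comparison, here is a rough sketch of what the same aggregation might look like in base R (the column names follow the call above; the data frame itself is not shown here, and sexGroup is a helper column I add just for the example):

## build the grouping condition as its own column,
## then sum the population within each group
lakeScenariosData$sexGroup <- with(lakeScenariosData,
  Sex == "Male" | Sex == "Female")

lakeTotalPopBase <- aggregate(
  Population ~ Year + Scenario + Stocking + Lifespan + PlotLife + sexGroup,
  data = lakeScenariosData, FUN = sum)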

Knitr and R Markdown


I’m late to the game, but I have recently begun using R Markdown. I was motivated because my employer now has an open data/open code requirement for all data we generate. My specific problem was that I often use R code but need to document what I am doing so that I can share it. Hence, R Markdown was a perfect solution for me. As an added bonus, I have also switched my R teaching materials over to R Markdown and am now using it to develop an online course on mixed-effect models with R.

Previously, I used Sweave. Although powerful, Sweave offers similar functionality to R Markdown but requires the file to be compiled multiple times. Thus, Sweave offers me no benefit compared to R Markdown.

I usually use RStudio as my editor and love how it works. RStudio is easy to use and R Markdown is well documented. I was able to learn the tool quickly and get up to speed because of three factors. First, I previously used Sweave. Second, I am familiar with Markdown from StackOverflow. Third, I am good with R. My only regret is that I did not start using R Markdown earlier.

As for time, learning R Markdown only required a couple of hours on a Monday afternoon, and I am now fully up to speed. The tutorials built into RStudio were fabulous! In summary, I would recommend R Markdown to everybody wanting to create documents with R code embedded within them.
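For anyone who has not seen one, here is a minimal sketch of an R Markdown file (the title and chunk are placeholders); RStudio renders it with the Knit button, or you can call rmarkdown::render() on the file:

---
title: "A minimal example"
output: html_document
---

Some explanatory text, written in plain Markdown.

```{r cars-summary}
# code chunks run when the document is knit, and their
# output appears in the rendered document
summary(cars)
```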

 

Random versus fixed effects


Wrapping my head around random versus fixed effects took me a while in graduate school. In part, this is because multiple definitions exist. Within ecology, the general definition I see is that a fixed effect is estimated on its own, whereas a random effect comes from a higher-level distribution. Two examples drilled this home for me and helped it click.

First, the question “Do we care about the specific group, or only that the groups might be having an impact?” helped me see the difference between fixed and random effects. For example, if we were interested in air quality as a function of temperature across cities, city could be either a fixed or a random effect. If city were a fixed effect, then we would be interested in the air quality of those specific cities (e.g., the air quality in New York, Los Angeles, and Chicago). Conversely, if city were a random effect, then we would not care about any specific city, only that cities might affect the results due to city-specific conditions.
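In R model syntax, the difference comes down to how city enters the formula. Here is a minimal sketch, assuming a hypothetical cityData data frame with airQuality, temperature, and city columns:

library(lme4)

## city as a fixed effect: a coefficient is estimated for each named city
fixedFit <- lm(airQuality ~ temperature + city, data = cityData)

## city as a random effect: city intercepts are treated as draws
## from a shared normal distribution
randomFit <- lmer(airQuality ~ temperature + (1 | city), data = cityData)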

Second, an example in one of Marc Kéry’s books on WinBUGS drilled the point home. Although he used WinBUGS, the R package lme4 can be used to demonstrate the same idea, and although his example involved snakes, a generic regression works just as well. (I mostly remember the figure and had to recreate it from memory; it was about 5 or 6 years ago and I have not been able to find the example in his book, hence the code below is from memory.) Here’s the code:

library(ggplot2)
library(lme4)

set.seed(123)  # make the simulated data reproducible

## known parameters for three populations
population <- rep(c("a", "b", "c"), each = 3)
intercept <- rep(c(1, 5, 6), each = 3)
slope <- 4
sd <- 2.0

dat <- data.frame(
  population = population,
  interceptKnown = intercept,
  slopeKnown = slope,
  sdKnown = sd,
  predictor = rep(1:3, times = 3))

## simulate a response with population-specific intercepts plus noise
dat$response <- with(dat,
  rnorm(n = nrow(dat), mean = interceptKnown, sd = sdKnown) +
    predictor * slopeKnown
)

## Run models: population as a fixed effect (lm) and as a random effect (lmer)
lmOut <- lm(response ~ predictor + population, data = dat)
lmerOut <- lmer(response ~ predictor + (1 | population), data = dat)

## Add predictions from each model to the data
dat$lm <- predict(lmOut, newdata = dat)
dat$lmer <- predict(lmerOut, newdata = dat)

## Plot: solid lines are the fixed-effect fits, dashed lines the random-effect fits
ggplot(dat, aes(x = predictor, y = response, color = population)) +
  geom_point(size = 2) +
  scale_color_manual(values = c("red", "blue", "black")) +
  theme_minimal() +
  geom_line(aes(x = predictor, y = lm)) +
  geom_line(aes(x = predictor, y = lmer), linetype = 2)

Which produces this figure:

Example of a fixed-effect intercept (solid line) compared to a random-effect (dashed line) regression analysis.

Play around with the code if you want to explore this more. At first, I could not figure out how to make the dashed lines fall farther from the solid lines. Change the simulated standard deviation to see what happens. Hint: my initial guess of decreasing it did not help.

Trend analysis from aggregate data


Often, people collect data through time with replication. For example, the LTRM program collects fish, aquatic vegetation, and water quality data through time, with multiple samples collected each year. However, these within-year observations are not independent, and failing to account for this would be pseudoreplication. Aggregating (e.g., taking the mean of) the data within a year is one way to avoid pseudoreplication, but aggregation comes with a trade-off: information in the raw data is lost. State-space models may be one way to recover this information.
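As a small sketch of that aggregation step, using data.table as in the earlier post (fishData, cpue, and Year are hypothetical names, with one row per sample):

library(data.table)

## collapse the replicate samples within each year to a single yearly mean
annualMeans <- fishData[, .(meanCPUE = mean(cpue)), by = Year]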

State-space models describe a true, but unknown and unmeasurable, “state” (e.g., the “true” population of catfish in the Upper Mississippi River) along with the observation error associated with collecting the data. These models can be fit with Kalman filters, for example using the MARSS package in R.
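As a sketch of what fitting a simple univariate state-space model with MARSS can look like (this is a generic random-walk-with-drift specification applied to the hypothetical yearly means from above, not the exact model from our paper):

library(MARSS)

## observations as a 1 x T matrix (MARSS expects time across the columns)
y <- matrix(annualMeans$meanCPUE, nrow = 1)

## state: a random walk with drift "u" and process error variance "q";
## observations: the state plus observation error with variance "r"
stateSpaceFit <- MARSS(y, model = list(
  B = matrix(1), U = matrix("u"), Q = matrix("q"),
  Z = matrix(1), A = matrix(0), R = matrix("r")))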

We were interested in comparing state-space models from the MARSS package to other methods such as simple linear regression and auto-regressive models (publication here). Using simulated data and observed data from the LTRM, we found that the simpler models performed better than the state-space models, likely because the LTRM time series were not long enough for the state-space models.

tikz in LaTeX and Structural Equation Modeling


During grad school, I attended an ESA workshop on Structural Equation Modeling (SEM) led by Jim Grace. The approach allows for multivariate analysis with multiple predictors, multiple response variables, and latent variables. Until recently, my research never required the method, and I never bought the GUI software Grace recommended at the time because it was too expensive for my limited needs.

Recently, I needed SEM at work. We had two response variables: environmental DNA (eDNA) and the ash-free dry weight (AFDW) of an aquatic organism. Both were predicted by multiple environmental variables, and AFDW in turn predicted eDNA. A perfect problem for SEM.
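As a sketch of how this kind of path model can be specified in R (I use the lavaan package here purely for illustration, not because it is what we ultimately used; semData and the variable names are placeholders that echo the figure below):

library(lavaan)

## AFDW is predicted by the environmental variables, and eDNA is
## predicted by AFDW plus the same environmental variables
semModel <- '
  AFDW ~ Lake + Depth + NonHabitat + Habitat
  eDNA ~ AFDW + Lake + Depth + NonHabitat + Habitat
'

semFit <- sem(semModel, data = semData)
summary(semFit, standardized = TRUE, rsquare = TRUE)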

To refresh myself on SEM, I revisited Grace’s work. I discovered that he maintains an excellent online tutorial about SEM. The pages provide a nice introduction, as does his (slightly outdated) book, his classic book, and a recent Ecosphere article.

However, I did not have a nice way to plot my results. I did not want to use a WYSIWYG tool like Inkscape or PowerPoint, but then I remembered the tikz package in LaTeX. Here’s the figure I created:

Example SEM plot.

I created the figure using this LaTeX code:


\documentclass{article}

\usepackage[paperheight =11.3cm, paperwidth =9.5cm, margin = 0.1cm]{geometry}

\usepackage{tikz}
\usetikzlibrary{arrows}
\usetikzlibrary{positioning}

\begin{document}

\pagenumbering{gobble}

\begin{tikzpicture}[ -> , >=stealth',auto,node distance=3.5cm,
thick,main node/.style={rectangle,draw, font=\sffamily}]

% predictor variables, stacked down the left side
\node[main node] (1) {Lake};
\node[main node] (2) [below of=1] {Depth};
\node[main node] (3) [below of=2] {Non-habitat};
\node[main node] (4) [below of=3] {Habitat};

% response variables, labeled with their r-squared values
\node[main node] (6) [below right of=2, align = center] {AFDW\\ \(r^2 = 0.223\)};
\node[main node] (7) [right of=6, align = center] {eDNA\\ \(r^2 = 0.384\)};

% arrows from each predictor to AFDW and eDNA, labeled with path coefficients
\path[every node/.style={font=\sffamily\small}]
(1) edge node [above = 40pt] {\textbf{0.497}} (6)
(2) edge node [left = 10pt] {\textbf{-0.370}} (6)
(3) edge node [above] {0.094} (6)
(4) edge node [left = 10pt] {0.116} (6)

(1) edge[bend left] node [above = 10 pt] {\textbf{0.385}} (7)
(2) edge[bend left] node [above = 5pt ] {0.197} (7)
(3) edge[bend right] node [above = 0pt] {-0.298} (7)
(4) edge[bend right] node [below = 5pt] {0.204} (7)

(6) edge node [ ] {-0.180} (7);

\end{tikzpicture}

\end{document}