# Review of Deep Work


I recently read the book Deep Work by Cal Newport. In a sentence, the book is anti-multitasking. More specifically, it discusses the importance of focusing and concentrating in order to work deeply on one topic and do it well.

I like the book, probably because it matches my preconceived notions about how to be effective. However, the book did provide new ideas, so I decided to implement some of its suggestions at work. Specifically, I scheduled my time and focused on one project at a time. By doing this, I got a great deal done. I suspect this was for 3 reasons:

1. I was being intentional about being productive;

I plan on adding this book to my re-read-every-couple-of-years shelf, and I highly recommend it. I've also become aware of Cal's blog because of the book and recommend that as well (Cal runs a life-hacks blog on working more efficiently). I particularly like his workflow re-engineering post. Maybe I'll post an update on how that works for me in a future post.

# tikz in LaTeX and Structural Equation Modeling


During grad school, I attended an ESA workshop on Structural Equation Modeling (SEM) led by Jim Grace. The approach allows for multivariate analysis with multiple predictors, multiple response variables, and latent variables. Until now, my research never required the method, and I never bought the GUI program Grace recommended at the time because it was too expensive for my limited needs.

Recently, I had a need to use SEM at work. We had two response variables: environmental DNA (eDNA) and the ash-free dry weight of an aquatic organism (AFDW). Both were predicted by multiple environmental variables, and AFDW in turn predicted eDNA: a perfect problem for SEM.

To refresh myself on SEM, I revisited Grace's work. I discovered that he maintains an excellent tutorial on SEM. The pages provide a nice introduction, as does his (slightly outdated) book, his classic book, and a recent Ecosphere article.

However, I did not have a nice way to plot my results. I did not want to use a WYSIWYG tool like Inkscape or PowerPoint. But I remembered the tikz package in LaTeX. Here's the figure I created:

Example SEM plot.

I created the figure using this LaTeX code:

 \documentclass{article}
 \usepackage[paperheight=11.3cm, paperwidth=9.5cm, margin=0.1cm]{geometry}
 \usepackage{tikz}
 \usetikzlibrary{arrows}
 \usetikzlibrary{positioning}

 \begin{document}
 \pagenumbering{gobble}
 \begin{tikzpicture}[->, >=stealth', auto, node distance=3.5cm, thick,
                     main node/.style={rectangle, draw, font=\sffamily}]
   \node[main node] (1) {Lake};
   \node[main node] (2) [below of=1] {Depth};
   \node[main node] (3) [below of=2] {Non-habitat};
   \node[main node] (4) [below of=3] {Habitat};
   \node[main node] (6) [below right of=2, align=center] {AFDW\\ $r^2 = 0.223$};
   \node[main node] (7) [right of=6, align=center] {eDNA\\ $r^2 = 0.384$};
   \path[every node/.style={font=\sffamily\small}]
     (1) edge node [above=40pt] {\textbf{0.497}} (6)
     (2) edge node [left=10pt] {\textbf{-0.370}} (6)
     (3) edge node [above] {0.094} (6)
     (4) edge node [left=10pt] {0.116} (6)
     (1) edge[bend left] node [above=10pt] {\textbf{0.385}} (7)
     (2) edge[bend left] node [above=5pt] {0.197} (7)
     (3) edge[bend right] node [above=0pt] {-0.298} (7)
     (4) edge[bend right] node [below=5pt] {0.204} (7)
     (6) edge node {-0.180} (7);
 \end{tikzpicture}

 \end{document}

# Using RcppArmadillo for a matrix population model


I’ve had a busy winter and spring with training for the Birkie and then recovering from the Birkie, hence the few posts. One of the things I’ve been doing is teaching myself the Rcpp package in R. This package lets people easily use C++ code within R. In this post, I demonstrate how I used Rcpp and specifically the RcppArmadillo package to create a population model.

Matrix models are popular in ecology. These models are series of difference equations (i.e., discrete time). I was interested in coding a simple example for a two life-stage species with the projection matrix A.
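Written out, the model iterates one matrix multiplication per year. For the two-stage example used in the code below, the difference equation is:

\[
P_{t+1} = A P_t, \qquad
A = \begin{pmatrix} 0 & 2 \\ 0.25 & 0.5 \end{pmatrix}, \qquad
P_0 = \begin{pmatrix} 0 \\ 50 \end{pmatrix}
\]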

To code this up in R, I used the following code:

popModelR <- function(nYears,
                      A = matrix(c(0, 2, 0.25, 0.5), nrow = 2, byrow = TRUE),
                      P0 = c(0, 50)){
    P = matrix(0, nrow = dim(A)[1], ncol = nYears + 1)
    P[, 1] = P0
    for(t in 1:nYears){
        P[, t + 1] = A %*% P[, t]
    }
    return(P)
}

nYears = 10
A = matrix(c(0, 2, 0.25, 0.5),
nrow = 2, byrow = TRUE)
P0 = matrix(c(0, 50), nrow = 2)
P0
popModelROut <- popModelR(nYears = 10,
A = A,
P0 = P0)

Obviously, this simple model runs fast, but how would one code this with Rcpp? In order to get matrix operators, I needed to use the RcppArmadillo package, so my code looks like this:

library("inline")
library("Rcpp")

src1 <- '
int nYearsX = Rcpp::as<int>(nYears);
arma::mat P0X = Rcpp::as<arma::mat>(P0);
arma::mat AX  = Rcpp::as<arma::mat>(A);
arma::mat PX(AX.n_cols, nYearsX + 1);
PX.col(0) = P0X;

for(int t = 0; t < nYearsX; t++) {
    PX.col(t + 1) = AX * PX.col(t);
}

return Rcpp::wrap(PX);
'

popModelRcpp <- cxxfunction(signature(nYears = "integer",
                                      A = "matrix",
                                      P0 = "matrix"),
                            body = src1,
                            plugin = "RcppArmadillo")

popModelRcpp(nYears, A, P0)

Now, to compare the two functions, I use the rbenchmark package and run the model for 100 simulated years:

library(rbenchmark)
nYears = 100
res <- benchmark(popModelRcpp(nYears, A, P0),
                 popModelR(nYears, A, P0),
                 columns = c("test", "replications", "elapsed",
                             "relative", "user.self", "sys.self"),
                 order = "relative",
                 replications = 1000)
print(res)

test replications elapsed relative user.self sys.self
1 popModelRcpp(nYears, A, P0) 1000 0.02 1.00 0.00 0.00
2 popModelR(nYears, A, P0) 1000 0.53 29.61 0.64 0.00

The Rcpp code is almost 30 times quicker than the base code in R!

# Review of "How Not to be Wrong: The Power of Mathematical Thinking"

I just got done reading "How Not to be Wrong: The Power of Mathematical Thinking" by Jordan Ellenberg. My younger brother lent me his copy. In a sentence, the book can be summed up with this phrase from it: "Mathematics is the extension of common sense by other means." The book does a great job of explaining how mathematics and statistics can be used to understand the world around us. It is filled with many good examples, such as Wald's WWII work on where to place extra armor on planes. Wald was given data on where planes got shot and was asked where extra armor should be placed (answer: the places without the holes!). The book includes many other interesting examples as well.

The only downsides are that the book can become long and drawn out at times and that I was already familiar with many of the examples. Finally, I would add that the book is "math lite," although that is a strength for many potential readers.

# 6 tips for a new LaTeX user

Recently a coworker started using LaTeX and asked for some tips. Here are my 6 tips for starting to use LaTeX:
1. Finish what you start (i.e., close your environments or else you will get errors or weird bugs); for example, an opening $$ needs a closing $$
2. Every document needs 3 things: \documentclass{<class>}, \begin{document}, and \end{document}
3. For equations, use $ $ for inline math and \begin{eqnarray} \end{eqnarray} for displayed equations. \\ creates a new line. Use \\ \nonumber to continue an equation and &=& to align multiple equations, for example:
<code>
\begin{eqnarray}
a &=& b +c \\
a & =& b \\ \nonumber
& & c
\end{eqnarray}
</code>
Gives you something like:
a = b +c   (1)
a = b

c        (2)

4. Bib files are your friend for citations. Use Google Scholar to populate new citations.
5. \textit{My italics text} and \textbf{my bold text} should get most of your formatting done. Do NOT use the deprecated {\bf bold} or {\it italics} styles (see http://tex.stackexchange.com/questions/41681/correct-way-to-bold-italicize-text for more details on this second point)
6. {} can be very helpful, especially for complicated math, when order of operations is important. For example, \sigma_{\pi^2}^{2}
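Tips 2 and 3 combine into a minimal compilable document (just a sketch; the equation body is the example from tip 3):
<code>
\documentclass{article}

\begin{document}

\begin{eqnarray}
a &=& b + c \\
a &=& b \\ \nonumber
& & c
\end{eqnarray}

\end{document}
</code>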

# ArcPy tips and observations


When I started using ArcGIS a few weeks ago, I quickly became frustrated when operations failed after pointing and clicking several times. This motivated me to start scripting in ArcPy. Now, I only need to load a script file once to make it crash. Thus, I am crashing more efficiently :). Overall, my experience with ArcPy has been alright. I discovered two blog posts, one about hating ArcGIS 10 and one about ArcGIS sucking, that are well worth the read. Despite the titles, the author actually likes ArcGIS, and I understand why: ArcPy and ArcGIS are powerful. Plus, the documentation is usually well done and easy to find online (although including data and reproducible examples, as R's help files do, would be a great addition). However, ArcPy does have some problems and idiosyncrasies. Here's what I have discovered:

1) Non-basic error messages are often cryptic, and some require hours of Googling. The archives on GIS Stack Exchange have become a new favorite page of mine. As an example, I got this error message (I removed and shortened my file paths in the code snippet):
 Layer (Folder) exists, please delete output if you wish to recreate layer
 done with extraction for us_asp
 0:00:36.046000
 Could not create tile. name: z:\my documents\...\slopeclipped, adepth: 32,
 type: 1, iomode: 3, version: 3, compression: 0, eval case: 1 (GetOpenTileChannel)
 Traceback (most recent call last):
   File "", line 1, in
   File "z:/My Documents/.../PythonFile.py", line 77, in
     mF.RasterClipper(Slope, grid, SlopeClipped, arcpy.env.workspace)
   File "myFunctions.py", line 37, in RasterClipper
     arcpy.Clip_management(raster, clipExt, out, "", "ClippingGeometry")
   File "C:\ArcGIS\Desktop10.1\arcpy\arcpy\management.py", line 12015, in Clip
     raise e
 arcgisscripting.ExecuteError: ERROR 999999: Error executing function.
 Failed to copy raster dataset
 Failed to execute (Clip).

This error leads me to my second tip.

2) The workspace path should not include a space. Sometimes, but not all the time, this causes problems. This took me a while to figure out, and I only found the answer by chance online while looking up another problem. For example, this code is okay:

arcpy.env.workspace = "C:/Folder/"

but this one caused my example problem:

arcpy.env.workspace = "C:/My Documents/"

3) Be prepared to use libraries such as NumPy. ArcPy does not always play well with USA-wide data. This seems to be the case for ArcPy's wrapper functions more so than the base functions (my guess is that the best-performing functions are GDAL functions). I plan on posting my workaround function once I have written and tested it.

4) “Python Scripting for ArcGIS” is still a great reference!
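As a postscript to tip 2, a small guard can fail loudly when a workspace path contains a space, before ArcPy produces a cryptic error. This is a hypothetical helper of my own, not part of ArcPy:

```python
def check_workspace(path):
    """Raise an error early if a workspace path contains a space,
    which can cause intermittent, cryptic ArcPy failures (see tip 2)."""
    if " " in path:
        raise ValueError("Workspace path contains a space: %r. "
                         "Move the data to a space-free path." % path)
    return path

# Intended usage (commented out because it requires ArcGIS):
# arcpy.env.workspace = check_workspace("C:/Folder/")
```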

I plan to post more of my project as I work through it.

# Learning Python


Over the past year, I have been teaching myself Python. I am writing this blog post so I remember how I wish I had learned Python, in case I ever need to teach anybody else. My desire to learn Python has been driven by several factors: 1) my laziness and disdain for GUIs (i.e., I hate pointing and clicking); 2) my need to solve complex problems that GUIs could not solve; and 3) my curiosity to learn new things.

Python is a powerful computer language. I want to use Python for scripting and scientific computing, but many other uses of Python exist. A great way to learn Python and some computer science theory is with the book "Think Python: How to Think Like a Computer Scientist". The book was originally written by the author for his college class using the Java language. However, he published the book under a Creative Commons license, and a high school teacher adapted it for Python! Score one for open source freedom.

After working through this book, I went to the "Invent with Python" books, which are fun (but not necessarily that useful for my goal…). The first book, "Invent Your Own Computer Games with Python", covers basic game designs that are terminal based. The second book, "Making Games with Python and Pygame", includes graphics. These books are good because they are simple and fun. If I were learning Python simply for fun, I would start with these books rather than "Think Python". These books also gave me more experience coding (it's been said the first 10,000 lines of code are learning, so at least these gave me a fun couple of thousand), so perhaps the sugar coating of game development worked on me as well.

After these books, dive into subject-area books. For me, these were GIS books. "Python Scripting for ArcGIS" does a great job introducing and covering ESRI's flagship product, ArcGIS. The only downside is that the software is closed source. I decided to stop trying to read source code when a .py file included a warning about being a trade secret. "Learning Geospatial Analysis with Python" is an introduction to advanced (but open source) libraries such as Shapely and GDAL. However, the book is a bit pricey for what you get, although it is easier to read than the author's blog, which contains most of the same material. Also note that the ebook kills Python's tab spacing (Python uses white space such as tabs to define functions and code). I was also initially overwhelmed because I did not know GIS or Python my first time reading it. My co-worker's copy of "The Geospatial Desktop" provided a great refresher for my GIS skills!

Finally, "Learning Python", 5th ed., is a tome, but very thorough. My own course of study began with the games books. Then I tried "Learning Geospatial Analysis with Python". Along the way, I bought "Learning Python" and then discovered "Think Python". I discovered the ArcGIS book only when I needed to script ArcPy for work. Not an optimal course of study, but needing to script ArcGIS has been a great crash course in Python for me. As an added bonus, Python has also made me a better R programmer! Hooray for learning.

# Maxent, a tool for ecologists and other spatial modelers


What is Maxent?

Maxent is a tool that uses a "maximum-entropy approach for species habitat modeling". Maximum entropy is a statistical concept that may be used as a method for fitting distributions. A research group from Princeton University applied this approach to habitat modeling for ecologists. Specifically, Maxent is designed to model presence-only data, which are traditionally difficult for ecologists. Such data occur commonly in ecology because we know where organisms have been found, but do not have good observations of where they are absent.

How does Maxent work?

That’s a good question. The underlying theory is based upon computer science and machine learning research from the 1950s (as a side note, applied statisticians across different disciplines are often discovering and rediscovering each other's work; the proliferation of science has only made the lack of cross-communication worse because of the volume of scientific output produced). The Maxent authors provide three articles explaining the software on the program's homepage. However, as far as I can tell, the program is closed source, which means looking at the code is difficult if not impossible (for this reason, a colleague of mine passionately dislikes Maxent). Luckily for us, Renner and Warton dove into the weeds and explained how Maxent works. They discovered that the underlying model is simply a Poisson point process model (a generalized linear model [GLM] with a Poisson error distribution). What makes Maxent special is how the GLM is parameterized: Maxent uses a lasso approach. As an interesting side note, Warton also shows how the Poisson point process model converges to logistic regression under certain conditions.
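To make the Renner and Warton result concrete, here is a minimal sketch of a lasso-penalized Poisson regression fit by proximal gradient descent on simulated data. Everything here, from the simulated data to the penalty weight, is my own illustrative assumption; it is not Maxent's actual implementation:

```python
import numpy as np

# Simulated count data with a log-linear mean (a Poisson GLM setup)
rng = np.random.default_rng(1)
n, p = 500, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, 0.0, -0.5])   # one coefficient is truly zero
y = rng.poisson(np.exp(X @ beta_true))

def poisson_lasso(X, y, lam=0.01, lr=0.05, iters=5000):
    """Poisson GLM with an L1 (lasso) penalty, fit by proximal gradient descent."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = np.exp(X @ beta)              # Poisson mean under the log link
        grad = X.T @ (mu - y) / len(y)     # gradient of the negative log-likelihood
        beta = beta - lr * grad
        # Soft-thresholding: the proximal step that implements the L1 penalty
        beta = np.sign(beta) * np.maximum(np.abs(beta) - lr * lam, 0.0)
    return beta

beta_hat = poisson_lasso(X, y)
```

The lasso step shrinks small coefficients exactly to zero, which is the feature-selection behavior that distinguishes Maxent's fitting from a plain GLM.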

What does this mean for the casual user?

Ecologists are known for having messy data and needing powerful statistical tools. That being said, ecologists have also been criticized for not knowing enough statistics. For the non-statistician ecologist, Warton's research means we have expanded our theory of how Maxent works. At the same time, ecologists often become polarized and even dogmatic in their viewpoints. Some people hate Maxent while others seem to worship it. Personally, I am skeptical of it because the program feels black box/closed source and some users over-hype it. When Warton made some of these points on the Methods in Ecology and Evolution blog (while posting as an Associate Editor), he elicited some colorful comments. Additionally, some of Warton's discussion arose because Maxent was initially presented as being free of some assumptions of a GLM. Not surprisingly (if you've hung out around ecologists), one of Warton's critics soon dove into a God/religion comparison. Ironically, I would argue that his or her view supporting Maxent is dogmatic. Rather than embracing a black box, programming a transparent approach seems less dogmatic, but I digress (Ellison and Dennis make this point in their pathways to statistical fluency article when criticizing Program Mark users for lacking control over model assumptions).

What does this mean for the “DIY” crowd/advanced users?

The DIY (do-it-yourself) crowd that is comfortable programming would not need Maxent. The glm function and a stepwise function in R, combined with a GIS program such as GRASS or ESRI's ArcGIS, should get the job done for a basic user (some R users might be able to do everything in R, but I have not had the best luck using GIS layers in R). A more advanced user could use a Bayesian program such as JAGS or Stan for model parameterization and selection.

Conclusion:

Maxent fills a useful role for ecologists in that it improves their toolbox for modeling habitat using presence-only data. For a basic user (such as the group of people who use Program Mark), this canned program can do an adequate job (this is why another colleague of mine likes Maxent and uses it for her own research). An advanced user who is comfortable with statistical programming would likely be better off fitting his or her data with a Poisson point process model. All users need to be aware of the assumptions of their model (be it in Maxent, a GLM, or any other program). My own statistical viewpoint has been shaped by Gelman (a future blog post at some point), so I will start off trying to use Stan to parameterize my habitat model with a Bayesian approach.

# How much will my 401k be in 30 years?


A couple of months ago, I discovered Mr. Money Mustache (MMM). He's somebody who "retired" at the age of 30 by saving ~3/4 of his income and living a frugal lifestyle: a unique and enthralling way of living the American Dream. I have been reading through his different posts and like what I read. As you could likely guess, many of MMM's posts deal with investing. I like his posts because they encourage me to think about topics and ideas that I may otherwise not think about.

One of these topics is the "safety factor" when dealing with investing. Safety factors are commonly used in engineering, and MMM was trained as an engineer. In a nutshell, safety factors often take the worst case scenario and then include a multiplier (e.g., 10 times that value) as a "safety factor". As an example, perhaps a type of chair breaks at 1,000 lbs during a test. A safety factor of 2 would limit the chair to a 500-lb maximum in case the wood deteriorates or other stressors are added or not accounted for.

In contrast, I was trained as an ecotoxicologist with a healthy mix of statistics tossed into my education. We usually deal with messy data, and safety factors are often discouraged because they often do not perform well. One alternative approach is probabilistic risk assessment. This approach assigns a probability distribution to inputs (e.g., market returns) and, conditional on the quality of the data, allows one to estimate the probability of an event occurring. In this post, I apply a probabilistic approach to explore MMM's investment assumptions.

MMM usually assumes a 5% return rate on the stock market (or more specifically, on broad index funds) after inflation and uses a 4% withdrawal rate as an added layer of safety. He uses this to define the “safe withdrawal rate” as

“the maximum rate at which you can spend your retirement savings, such that you don’t run out in your lifetime.”

Besides applying this to retirement, he also applies the idea to understanding how investments grow. Overall, I like his approach. It is a nice approximation that is easy to understand. That being said, I use probabilistic tools at work and wanted to explore this investment concept more.

Motivating my own case study is my pending switch from a contractor to a Federal Term employee. I will have ~$3,000 in my 401k when I leave and am wondering how much money will be there in 30 years when I can access it. I'll do my simulations in R because I am familiar with the program, and it is free, so anybody can "follow along at home" if they would like (note that you'll need to not copy the ">" into R). To start off, I define two variables:

> nYear = 30    # number of years
> growth = 1.05 # annual growth after inflation

and then use these to calculate how much money I will have in 30 years. This is a difference equation that MMM likes to use, and the math is simple:

> growth^nYear
[1] 4.321942

Based upon this worst case scenario, I should expect to have 4.3 times my starting investment. Therefore, my $3,000 will become about $13,000 when I can access it in 30 years.

However, what about the variability and uncertainty in this simulation? This year, I might get 1.05 growth, but next year could be 1.03 and the year after 1.05. We could simply take the product of 30 years, 1.05 * 1.04 * 1.03 * 1.06 …, but this is tedious, error prone, and boring. Programming and simulations come to the rescue in situations such as this. Instead of generating the numbers by hand, we can assume that they come from a normal distribution. This requires me to guess a standard deviation. I'll guess 0.05 because it seems reasonable and I do not know a great deal about quantitative finance. In R, this looks like:

> rnorm(n = nYear, mean = 1.05, sd = 0.05)
 [1] 1.0460267 1.0851230 0.9941926 1.0551473 1.0233291 1.0695875 1.0287381 1.1444693
 [9] 1.0428234 1.0557862 1.0947897 1.0367700 1.0036091 1.0797985 0.9861632 1.0154970
[17] 0.9705547 1.1681693 1.0083213 1.1077836 1.0220802 1.0551286 0.9268657 0.9814797
[25] 0.9921165 1.1120307 1.0640827 1.0439697 1.0824330 1.0698607

and this gives us 30 different growth rates.
We then take the product of these rates and save it as x:

> x = prod(rnorm(n = nYear, mean = 1.05, sd = 0.05))
> x
[1] 3.318441

In this example, my investment becomes 3.32 times larger. But this is just one possible trajectory. To check out more, I created a function that also plots a histogram and tells me where 95% of the simulations fall:

marketReturn <- function(nYear, growth, sd, reps = 10000, title = "Histogram"){
    x <- numeric(reps)
    for(i in 1:reps){
        x[i] <- prod(rnorm(nYear, growth, sd))
    }
    hist(x, main = title)
    return(list(min = min(x),
                max = max(x),
                ci95 = quantile(x, c(0.025, 0.975))))
}

I can now use this function to run 10,000 simulations:

> sd = 0.05
> marketReturn(nYear = nYear, growth = growth, sd = sd, reps = 10^4,
+              title = paste("sd =", sd))
$min
[1] 2.070814

$max
[1] 9.194219

$ci95
    2.5%    97.5%
2.996903 6.019339

In case you are unfamiliar, a histogram shows the number of counts on the y-axis and the observations on the x-axis. In this case, the x-axis shows how many times my money has grown by the end of 30 years (e.g., 4 times).

Thus, assuming a worst case scenario, I can expect my money to grow to between 3 and 6 times its current value. However, if we dig around on the web, perhaps we can do better with the guess for the standard deviation:

> sd = 0.034
> marketReturn(nYear = nYear, growth = growth, sd = sd, reps = 10^4)
$min
[1] 2.360319

$max
[1] 7.885858

$ci95
    2.5%    97.5%
2.989520 6.035744

which yields a similar result. Another site uses 0.15, but this seems high:

> sd = 0.15
> marketReturn(nYear = nYear, growth = growth, sd = sd, reps = 10^4,
+              title = paste("sd =", sd))
$min
[1] 2.191817

$max
[1] 7.982334

$ci95
    2.5%    97.5%
2.977771 6.008814

Regardless, the 5% annual growth after inflation is a worst case scenario from MMM. A more optimistic return might be 7%:

> growth = 1.07
> sd = 0.05
> marketReturn(nYear = nYear, growth = growth, sd = sd, reps = 10^4,
+              title = paste("Mean growth of 7% with sd =", sd))
$min
[1] 2.830627

$max
[1] 18.28947

$ci95
     2.5%     97.5%
 4.491611 12.106446

Thus, if I am lucky, or at least average, I will most likely have 6-7x my initial amount when I retire, and my $3,000 will likely be $18,000 to $21,000 in today's dollars! If I am unlucky, I will only have ~3x my current amount (actually, my worst case was ~2.8, which is more pessimistic than MMM's worst case). And if the market does very well, I could have 12-18x my current amount!
Einstein is often quoted as saying:

“The power of compound interest is the most powerful force in the universe.”

Although this quote may or may not be from him, these simulations demonstrate the power of compound interest.

Although past performance is no indicator of future gains, the future does look promising! If you'd like to explore different scenarios, please feel free to adapt and share this code. If you find anything cool or any bugs in my code, please share it in the comments.
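For readers who would rather follow along in Python than R, the same Monte Carlo simulation can be sketched with NumPy. This is my own illustrative port; the function name and defaults simply mirror the R version above:

```python
import numpy as np

def market_return(n_year=30, growth=1.05, sd=0.05, reps=10_000, seed=42):
    """Draw annual growth rates from a normal distribution, compound them
    over n_year years, and repeat reps times (a Monte Carlo simulation)."""
    rng = np.random.default_rng(seed)
    rates = rng.normal(loc=growth, scale=sd, size=(reps, n_year))
    totals = rates.prod(axis=1)   # total growth multiple for each simulation
    lo, hi = np.quantile(totals, [0.025, 0.975])
    return {"min": totals.min(), "max": totals.max(), "ci95": (lo, hi)}

res = market_return()
```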

# Midwest Mathematical Biology Conference


Two weekends ago, I was lucky enough to attend the Midwest Mathematical Biology Conference.

This meeting was hosted by the University of Wisconsin-La Crosse and was the first time the event had been held. Overall, the meeting was great, and I enjoyed the keynote speakers. Meetings like this inspire me because I enjoy seeing how other people solve problems. I also enjoy the networking and collaborations formed between researchers.

The keynote address that stood out most to me was the one by Carlos Castillo-Chavez. Carlos provided a great overview of epidemiology modeling and of using multiple scales of modeling together. He summarized a project that used an individual-based model to examine how disease spread at the city level and then scaled up the results to a national level. These results were also compared with traditional SIR models. The research does a great job illuminating where people catch infectious diseases during outbreaks. Surprisingly, most people get sick at home, because people have the most contact with others when they are at home. However, when school is not in session, outbreaks are less likely to occur.
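For readers unfamiliar with the traditional SIR models mentioned above, here is a minimal sketch of one. The parameter values are arbitrary illustrations of mine, not numbers from Castillo-Chavez's talk:

```python
def sir(S0=999, I0=1, R0=0, beta=0.3, gamma=0.1, steps=200):
    """Iterate the classic discrete-time SIR (susceptible-infected-recovered)
    difference equations. beta: transmission rate; gamma: recovery rate."""
    N = S0 + I0 + R0
    S, I, R = float(S0), float(I0), float(R0)
    history = [(S, I, R)]
    for _ in range(steps):
        new_infections = beta * S * I / N
        new_recoveries = gamma * I
        S -= new_infections
        I += new_infections - new_recoveries
        R += new_recoveries
        history.append((S, I, R))
    return history

out = sir()  # with beta/gamma = 3, a sizable epidemic runs its course
```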

Besides the other great talks, my postdoc mentor had a great observation. He realized the USGS has very few mathematicians, and all of them are doing geology (or at least non-biology) research. We wished we had more mathematical support. Furthermore, the Midwest Mathematical Biology Conference expanded his view of biomath. Even if I had gained nothing else from the meeting, having my postdoc mentor appreciate math more would have made it all worthwhile! However, I did gain more, so the meeting was a great opportunity!