ArcPy tips and observations


When I started using ArcGIS a few weeks ago, I quickly became frustrated when operations failed after several rounds of pointing and clicking. This motivated me to start scripting in ArcPy. Now I only need to load a script file once to make it crash, so I am crashing more efficiently :). Overall, my experience with ArcPy has been alright. I discovered two blog posts, one about hating ArcGIS 10 and one about ArcGIS sucking, that are well worth the read. Despite the titles, the author actually likes ArcGIS, and I understand why: ArcPy and ArcGIS are powerful. Plus, the documentation is usually well done and easy to find online (although including data and reproducible examples, like R’s help files do, would be a great addition). However, ArcPy does have some problems and idiosyncrasies. Here’s what I have discovered:

1) Non-basic error messages are often cryptic, and some require hours of Googling. The archives on GIS Stack Exchange have become a new favorite page of mine. As an example, I got this error message (I removed and shortened the file paths in the snippet):

Layer (Folder) exists, please delete output if you wish to recreate layer
done with extraction for us_asp
0:00:36.046000
Could not create tile. name: z:\my documents\...\slopeclipped, adepth: 32, type: 1, iomode: 3, version: 3, compression: 0, eval case: 1 (GetOpenTileChannel)
Traceback (most recent call last):
File "", line 1, in
File "z:/My Documents/.../PythonFile.py", line 77, in
mF.RasterClipper(Slope, grid, SlopeClipped, arcpy.env.workspace)
File "myFunctions.py", line 37, in RasterClipper
arcpy.Clip_management(raster, clipExt, out, "", "ClippingGeometry")
File "C:\ArcGIS\Desktop10.1\arcpy\arcpy\management.py", line 12015, in Clip
raise e
arcgisscripting.ExecuteError: ERROR 999999: Error executing function.
Failed to copy raster dataset
Failed to execute (Clip).

This error leads me to my second tip.

2) The workspace path should not include a space. Sometimes, but not always, this causes problems. It took me a while to figure out, and I only found the answer by chance while looking up another problem online. For example, this workspace is fine:

arcpy.env.workspace = "C:/Folder/"

but this one

arcpy.env.workspace = "C:/My Documents/"

caused the error above.

3) Be prepared to use libraries such as NumPy. ArcPy does not always play well with US-wide data. This seems to be the case for ArcPy’s wrapper functions more so than the base functions (my guess is that the best performing functions are GDAL functions). I plan on posting this function once I have written and tested it.

4) “Python Scripting for ArcGIS” is still a great reference!

I plan to post more of my project as I work through it.

Learning Python


Over the past year, I have been teaching myself Python. I am writing this blog post to record how I wish I had learned Python, in case I ever need to teach anybody else. My desire to learn Python has been driven by several factors: 1) my laziness and disdain for GUIs (i.e., I hate pointing and clicking); 2) my need to solve complex problems that GUIs could not solve; and 3) my curiosity to learn new things.

Python is a powerful computer language. I want to use Python for scripting and scientific computing, but many other uses of Python exist. A great way to learn Python and some computer science theory is with the book “Think Python: How to Think Like a Computer Scientist”. The book was originally written by the author for his college class using the Java language. However, he published the book under a Creative Commons license, and a high school teacher adapted it for Python! Score one for open source freedom.

After working through this book, I went to the “Invent with Python” books, which are fun (but not necessarily that useful for my goal…). The first book, “Invent Your Own Computer Games with Python”, covers basic game designs that are terminal based. The second book, “Making Games with Python and Pygame”, includes graphics. These books are good because they are simple and fun. If I were simply learning Python for fun, I would start with these books rather than “Think Python”. These books also gave me more experience coding (it’s been said the first 10,000 lines of code are learning, so at least these gave me a fun couple of thousand), so perhaps the sugar coating of game development worked on me as well.

After these books, dive into subject area books. For me, these were GIS books. “Python Scripting for ArcGIS” does a great job introducing and covering ESRI’s flagship product, ArcGIS. The only downside is that ArcPy itself is closed source; I decided to stop trying to read the source code when a .py file included a warning about being a trade secret. “Learning Geospatial Analysis with Python” is an introduction to advanced (but open source) libraries such as Shapely and GDAL. However, the book is a bit pricey for what you get, although it is easier to read than the author’s blog, which contains most of the same material. Also note that the ebook kills Python’s tab spacing (Python uses white space such as tabs to define functions and code). I was also initially overwhelmed because I did not know GIS or Python the first time I read it. My co-worker’s copy of “The Geospatial Desktop” provided a great refresher for my GIS skills!

Finally, “Learning Python”, 5th ed., is a tome, but very thorough. My own course of study began with the games books. Then I tried “Learning Geospatial Analysis with Python”. Along the way, I bought “Learning Python” and then discovered “Think Python”. I discovered the ArcGIS book only when I needed to script ArcPy for work. Not an optimal course of study, but needing to script ArcGIS has been a great crash course in Python for me. As an added bonus, Python has also made me a better R programmer! Hooray for learning.


Maxent, a tool for ecologists and other spatial modelers


What is Maxent?

Maxent is a tool that uses a “maximum-entropy approach for species habitat modeling”. Maximum entropy is a statistical concept that may be used as a method for fitting distributions. A research group from Princeton University applied this approach to habitat modeling for ecologists. Specifically, Maxent is designed to model presence-only data, which is traditionally difficult for ecologists. This kind of data is common in ecology because we know where organisms are found, but we do not have good observations of where they are not located.

How does Maxent work?

That’s a good question. The underlying theory is based upon computer science and machine learning research from the 1950s (as a side note, applied statisticians across different disciplines are often discovering and rediscovering each other’s work, and the proliferation of science has only made the lack of cross-communication worse because of the volume of scientific output produced). The Maxent authors provide three articles explaining the software on the program’s homepage. However, as far as I can tell, the program is closed source, which means looking at the code is difficult if not impossible (for this reason, a colleague of mine passionately dislikes Maxent). Luckily for us, Renner and Warton dove into the weeds and explained how Maxent works. They discovered that the underlying model is simply a Poisson point process model (a generalized linear model [GLM] with a Poisson error distribution). What makes Maxent special is how the GLM is parametrized: Maxent uses a lasso approach (a small illustration follows below). As an interesting side note, Warton also shows how the Poisson point process model converges to logistic regression under certain conditions.
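To make “lasso-penalized Poisson GLM” a little more concrete, here is a minimal R sketch. This is not the Maxent algorithm itself, and the data frame pts and its covariate columns (elevation, slope, precip) are hypothetical placeholders; it only shows how a lasso penalty can be applied to a Poisson regression using the glmnet package.

library(glmnet)

# Hypothetical data frame 'pts': one row per presence (1) or background (0)
# point, with environmental covariates already extracted at each location.
X <- model.matrix(~ elevation + slope + precip, data = pts)[, -1]
y <- pts$presence

# Lasso-penalized Poisson regression (alpha = 1 is the lasso penalty);
# cv.glmnet chooses the penalty strength by cross-validation.
fit <- cv.glmnet(X, y, family = "poisson", alpha = 1)

# Coefficients at the cross-validated penalty; many may be shrunk to zero.
coef(fit, s = "lambda.min")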

What does this mean for the casual user?

Ecologists are known for having messy data and needing powerful statistical tools. That being said, ecologists have also been criticized for not knowing enough statistics. For the non-statistician ecologist, Warton’s research means we have expanded our theory of how Maxent works. At the same time, ecologists often become polarized and even dogmatic in their viewpoints. Some people hate Maxent while others seem to worship it. Personally, I am skeptical of it because the program feels black box/closed source and some users over-hype it. When Warton made some of these points on the Methods in Ecology and Evolution blog (while posting as an Associate Editor), he elicited some colorful comments. Additionally, some of Warton’s discussion arose because Maxent was initially presented as being free of some assumptions of a GLM. Not surprisingly (if you’ve hung out around ecologists), one of Warton’s critics soon dove into a God/religion comparison. Ironically, I would argue that his or her view supporting Maxent is dogmatic. Rather than embracing a black box, programming a transparent approach seems less dogmatic, but I digress (Ellison and Dennis make this point in their pathways to statistical fluency article when they criticize Program Mark users for lacking control over model assumptions).

What does this mean for the “DIY” crowd/advanced users?

The DIY (do it yourself) crowd that is comfortable programming would not need Maxent. The glm function and a stepwise selection function in R, combined with a GIS program such as GRASS or ESRI’s ArcGIS, should be able to get the job done for a basic user (some R users might be able to do everything in R, but I have not had the best luck using GIS layers in R); a rough sketch follows below. For a more advanced user, a Bayesian program such as JAGS or Stan could be used for model parametrization and selection.
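Here is a rough sketch of that glm-plus-stepwise route. The data frame pts and its columns are again hypothetical, and a real point-process fit would weight the background points (quadrature weights), which I gloss over here; this only shows the mechanics of glm() and step() in R.

# Hypothetical data frame 'pts' with a 0/1 presence column and environmental
# covariates already extracted in a GIS.
full <- glm(presence ~ elevation + slope + precip,
            family = poisson(link = "log"), data = pts)

# Stepwise selection by AIC as a quick-and-dirty form of model selection.
best <- step(full, direction = "both")
summary(best)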

Conclusion:

Maxent fills a useful role for ecologists in that it improves their toolbox for modeling habitat using presence-only data. For a basic user (such as the group of people who use Program Mark), this canned program can do an adequate job (this is why another colleague of mine likes Maxent and uses it for her own research). An advanced user who is comfortable with statistical programming would likely be better suited fitting his or her data with a Poisson point process model. All users also need to be aware of the assumptions of their model (be it Maxent or a GLM or any program). My own statistical viewpoint has been shaped by Gelman (a future blog post at some point), so I will start off trying to use Stan to parametrize my habitat model with a Bayesian approach.


How much will my 401k be in 30 years?


A couple of months ago, I discovered Mr. Money Mustache (MMM). He’s somebody who “retired” at the age of 30 by saving ~3/4 of his income and living a frugal lifestyle, a unique and enthralling way of living the American Dream. I have been reading through his different posts and like what I read. As you could likely guess, many of MMM’s posts deal with investing. I like his posts because they encourage me to think about topics and ideas that I may otherwise not think about.

One of these topics is the “safety factor” when dealing with investing. Safety factors are commonly used in engineering, and MMM was trained as an engineer. In a nutshell, a safety factor takes a worst case value and applies a multiplier (e.g., 10 times that value) as a buffer. As an example, perhaps a type of chair breaks at 1,000 lbs during a test. A safety factor of 2 would limit the chair to a 500 lb maximum in case the wood deteriorates or other stressors are added or not accounted for.
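In R, the chair example is just one line of arithmetic, but writing it out makes the logic explicit:

test_load     <- 1000            # load (lbs) at which the chair broke in testing
safety_factor <- 2
test_load / safety_factor        # rated maximum load: 500 lbs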

In contrast, I was trained as an ecotoxicologist with a healthy mix of statistics tossed into my education. We usually deal with messy data, and safety factors are often discouraged because they often do not perform well. One alternative approach is probabilistic risk assessment. This approach assigns a probability distribution to inputs (e.g., market returns) and, conditional on the quality of the data, allows one to estimate the probability of an event occurring. In this post, I apply a probabilistic approach to explore MMM’s investment assumptions.

MMM usually assumes a 5% return rate on the stock market (or more specifically, on broad index funds) after inflation and uses a 4% withdrawal rate as an added layer of safety. He uses this to define the “safe withdrawal rate” as

“the maximum rate at which you can spend your retirement savings, such that you don’t run out in your lifetime.”

Besides applying this to retirement, he also applies this idea to understanding how investments grow. Overall, I like his approach. It is a nice approximation that is easy to understand. That being said, I use probabilistic tools at work and I want to explore this investment concept more.
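As a quick aside, the 4% rule itself is easy to check in R. The $40,000 of annual spending below is a made-up number of mine, not MMM’s:

annual_spending <- 40000
withdrawal_rate <- 0.04
annual_spending / withdrawal_rate    # nest egg needed: $1,000,000, i.e., 25x annual spending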

Motivating my own case study is my pending switch from a contractor to a Federal Term employee. I will have ~$3,000 in my 401k when I leave and am wondering how much money will be there in 30 years when I can access it. I’ll do my simulations in R because I am familiar with the program and it is free, so anybody can “follow along at home” if they would like (note that the “>” is the R prompt and should not be copied into R).

To start off, I define two variables:

> nYear = 30 # number of years
> growth = 1.05 # annual growth after inflation

and then use this to calculate how much money I will have in 30 years. This is a difference equation that MMM likes to use and the math is simple:

> growth^nYear
[1] 4.321942

Based upon this worst case scenario, I should expect to have 4.3 times my starting investment. Therefore, my $3,000 will become about $13,000 when I can access it in 30 years. However, what about the variability and uncertainty in this simulation? This year, I might get 1.05 growth, but next year could be 1.03 and the year after 1.05. We could simply take the product of 30 years:
1.05 * 1.04 * 1.03 * 1.06 …
but this is tedious, error prone, and boring. Programming and simulations come to the rescue for situations such as this. Instead of generating the numbers by hand, we can assume that they come from a normal distribution. This requires me to guess a standard deviation. I’ll guess 0.05 because it seems reasonable and I do not know a great deal about quantitative finance. In R, this looks like:

> rnorm(n = nYear, mean = 1.05, sd = 0.05)
 [1] 1.0460267 1.0851230 0.9941926 1.0551473 1.0233291 1.0695875 1.0287381 1.1444693
 [9] 1.0428234 1.0557862 1.0947897 1.0367700 1.0036091 1.0797985 0.9861632 1.0154970
[17] 0.9705547 1.1681693 1.0083213 1.1077836 1.0220802 1.0551286 0.9268657 0.9814797
[25] 0.9921165 1.1120307 1.0640827 1.0439697 1.0824330 1.0698607
and this gives us 30 different growth rates. We then take the product of this and save it as x:
> x = prod(rnorm(n = nYear, mean = 1.05, sd = 0.05))
> 
> x
[1] 3.318441

In this example, my investment becomes about 3.3 times larger. But this is just one possible trajectory. To explore more of them, I created a function that also plots a histogram and tells me where 95% of the simulations fall:

marketReturn <- function(nYear, growth, sd, reps = 10000, title = "Histogram"){
    # simulate 'reps' possible end-of-period growth factors
    x <- numeric(reps)
    for(i in 1:reps){
        x[i] <- prod(rnorm(nYear, growth, sd))
    }

    hist(x, main = title)
    return(list(
        min  = min(x),
        max  = max(x),
        ci95 = quantile(x, c(0.025, 0.975))
    ))
}

I can now use this function to run 10,000 simulations:

> sd = 0.05
> marketReturn(nYear = nYear, growth = growth, sd = sd, reps = 10^4, title = paste("sd =", sd))
$min
[1] 2.070814

$max
[1] 9.194219

$ci95
    2.5%    97.5% 
2.996903 6.019339

[Histogram of the simulated 30-year growth factors with sd = 0.05]

In case you are unfamiliar, a histogram shows counts on the y-axis and the observed values on the x-axis. In this case, the x-axis is how many times my money has grown by the end of 30 years (e.g., 4 times).

Thus, assuming a worst case scenario, I can expect my money to grow to between 3 and 6 times its current value. However, if we dig around on the web, perhaps we can do better than my guess for the standard deviation. This page lists 0.0346 as a standard deviation:

> sd = 0.034
> marketReturn(nYear = nYear, growth = growth, sd = sd, reps = 10^4)
$min
[1] 2.360319

$max
[1] 7.885858

$ci95
    2.5%    97.5% 
2.989520 6.035744

[Histogram of the simulated 30-year growth factors with sd = 0.034]

This yields a similar result. Another site uses 0.15, but this seems high:

> sd = 0.15
> marketReturn(nYear = nYear, growth = growth, sd = sd, reps = 10^4, title = paste("sd =", sd))
$min
[1] 2.191817

$max
[1] 7.982334

$ci95
    2.5%    97.5% 
2.977771 6.008814

[Histogram of the simulated 30-year growth factors with sd = 0.15]

Regardless, the 5% annual growth after inflation is a worst case scenario from MMM. A more optimistic return might be 7%:

> growth = 1.07
> sd = 0.05
> 
> marketReturn(nYear = nYear, growth = growth, sd = sd, reps = 10^4, title = paste("Mean growth of 7% with sd =", sd))
$min
[1] 2.830627

$max
[1] 18.28947

$ci95
     2.5%     97.5% 
 4.491611 12.106446

[Histogram of the simulated 30-year growth factors with mean growth of 7% and sd = 0.05]

Thus, if I am lucky, or at least average, I will most likely have 6-7x my initial amount when I retire, so my $3,000 will likely be $18,000 to $21,000 in today’s dollars! If I am unlucky, I will only have ~3x my current amount (actually my worst case was ~2.8, which is more pessimistic than MMM’s worst case). And if the market does very well, I could have 12-18x my current amount!

Einstein is often quoted as saying:

“The power of compound interest is the most powerful force in the universe.”

Although this quote may or may not be from him, these simulations demonstrate the power of compound interest.

Although past performance is no indicator of future gains, the future does look promising! If you’d like to explore different scenarios, please feel free to adapt and share this code. If you find anything cool, or any bugs in my code, please share it in the comments.

Midwest Mathematical Biology Conference


Two weekends ago, I was lucky enough to attend the Midwest Mathematical Biology Conference.

This meeting was hosted by the University of Wisconsin-La Crosse and was the first time the event has been held. Overall, the meeting was great, and I enjoyed the keynote speakers. Meetings like this help inspire me because I enjoy seeing how other people solve problems. I also enjoy the networking and collaborations formed between researchers.

The keynote address that stood out most to me was the one by Carlos Castillo-Chavez. Carlos provided a great overview of epidemiological modeling and of using multiple scales of modeling together. He summarized a project that used an individual-based model to examine how disease spread at the city level and then scaled the results up to a national level. These results were also compared to traditional SIR models. The research does a great job illuminating where people catch infectious diseases during outbreaks. Surprisingly, most people get sick at home. This happens because people have the most contact with others when they are at home. However, when school is not in session, outbreaks are less likely to occur.

Besides the other great talks, my postdoc mentor had a great observation. He realized the USGS has very few mathematicians, and all of them are doing geology (or at least non-biology) research. We wished we had more mathematical support. Furthermore, the Midwest Mathematical Biology Conference expanded his view of biomath. Even if I gained nothing else from the meeting, having my postdoc mentor appreciate math more made it all worthwhile! However, I did gain more, so the meeting was a great opportunity!

Meet a Quantitative Ecologist


“So, what does a quantitative ecologist do?”

This is the first question most people ask me when I rattle off my job title. Basically, I use math to study and solve ecological problems. I am living a giant math story problem and enjoying it! I use ecological models, computational mathematics, and advanced statistics to not only solve problems, but also define them. For those of you familiar with the term, I consider myself a quant. Unlike most quants, I primarily study the environment. On any given day, I might be modeling where a species occurs or studying how chemicals affect wildlife. In addition to the environment, I also dabble in medical biostatistics and quantifying risk for business. Statistics and mathematics allow me to “play” in many different fields’ backyards.

How might one become a quantitative ecologist, you ask?

Growing up, I always enjoyed the outdoors. As a family, my parents, brother and I would go on hikes and bike rides often. Through scouting, I went on many a camping trip ranging from backyards and pastures to wilderness areas. This led me to restore a prairie for my Eagle Scout project with a wildlife biologist. My project motivated me to study wildlife ecology  for my undergraduate major. However, the wildlife job market is very tough! So, I chose to go to grad school for environmental toxicology.

Okay, so I see why you’re an ecologist, but where does the “quantitative” part fit in?

Growing up, I always liked to play computer games and enjoyed math until it became hard in high school (I still hate calculus to this day!). I did not link the two interests formally until grad school, when I developed a mosquito/dengue model for my MS research. One of my MS committee members, and future PhD co-adviser, told me that I needed to learn mathematics if I wanted to model. Begrudgingly and almost whimsically, I followed this advice and eventually even took it to heart. Two years of coursework later, I completed a doctoral minor in mathematics. This coursework helped me complete my doctoral research, where I studied the effects of pesticides on Daphnia.

I saw a posting on the TWS Biometrics Working Group email list advertising a postdoctoral position to study the effects of wind energy on cave bats. I applied for the position, and am now a “Quantitative Ecologist”.

Introduction


“Nothing in the world of living things is permanently fixed.”
Hans Zinsser — Rats, Lice and History, 1935

Our world is ever-changing. Stock prices go up and down. Seas and temperatures are rising. New chemicals emerge from the lab that may be a deadly poison, a modern miracle, or possibly both. Search engines must scan a constantly changing web. Even the American pastime of baseball uses sabermetrics to choose and evaluate players.

These situations are all dynamic. “Dynamic” simply means things are changing rather than staying constant or “static.” Understanding these situations requires the use of models. More specifically, quantitative models. Hence the title of this blog: “Quantitative Dynamics”.

This blog is about quantitative dynamics. We’re going to cover topics ranging from cave bats to health care. I’ll also talk about other things that strike my fancy. Mainly these topics will be about environmental science and ecology, but I have a broad interest in how models may be applied across our lives (e.g., FiveThirtyEight.com). I am writing this blog for several reasons: to share my research with non-scientists, to write down ideas and techniques so that I do not forget them, and to learn about blogging. Last, I enjoy teaching and will use this blog to share what I am learning both on and off the job.