When the power of data.table clicked


Historically, R has been limited by memory (for example, see this post). Although the program has improved over time, R still runs into limits when working with “big data”. In my experience, data.frames and read.csv become too slow when reading files larger than about 0.5 GB, and data.frames start to clog up the system when they get bigger than about 1.0 GB. However, some recent packages have been developed to work with “big data”. The R High Performance Computing page describes some of these packages and workarounds.

My personal favorite package for working with big data in R is data.table. The package was designed by quantitative finance people for working with big data (e.g., files > 100GB). I discovered the package while trying to optimize a population model. Now, I use the package as my default method for reading data into R and manipulating it.
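As a rough illustration of what "default method for reading data" looks like in practice, here is a minimal sketch comparing fread() with read.csv(). The toy table and the temporary file are made up for the example and stand in for a real (much larger) data file.

```r
library(data.table)

# Write a small throwaway CSV so the example is self-contained.
# The columns are hypothetical and only loosely mimic population data.
tmp <- tempfile(fileext = ".csv")
toy <- data.table(
  Year       = rep(1:3, each = 2),
  Sex        = rep(c("Male", "Female"), times = 3),
  Population = c(10, 12, 11, 13, 9, 14)
)
fwrite(toy, tmp)

# fread() returns a data.table; read.csv() returns a plain data.frame.
dt <- fread(tmp)
df <- read.csv(tmp)

class(dt)
class(df)
```

On a file this small the two are indistinguishable in speed; the difference I describe above only shows up once files get into the hundreds of megabytes.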

Besides being great for large data, the package also uses a slick syntax for manipulating data. As described by the vignette on the topic, the package has some cool methods for merging and sorting data. The package maintainers describe it as similar to SQL, although I do not know SQL, so I cannot comment on the analogy.
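To give a flavor of the sorting and merging syntax, here is a small sketch. The two tables and their column names are invented for illustration and are not from my model.

```r
library(data.table)

# Hypothetical toy tables; the names are illustrative only.
counts <- data.table(
  Lake       = c("B", "A", "C", "A"),
  Year       = c(2, 1, 1, 2),
  Population = c(40, 10, 25, 15)
)
lakes <- data.table(
  Lake     = c("A", "B"),
  Stocking = c("High", "Low")
)

# Sort in place (no copy) by Lake, then by descending Year.
setorder(counts, Lake, -Year)

# Join: for each row of counts, look up the matching row in lakes.
# Lake "C" has no match in lakes, so its Stocking is NA.
joined <- lakes[counts, on = "Lake"]

# merge() also works on data.tables; the default is an inner join,
# so the unmatched Lake "C" row is dropped.
inner <- merge(counts, lakes, by = "Lake")
```

The X[Y, on = ...] form is the part that people compare to SQL joins; merge() is there if you prefer the base R style.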

After taking the DataCamp course on data.table, I understood the package much better and was soon able to apply that knowledge to my own work. One call in particular blows my mind:

 lakeTotalPop <- lakeScenariosData[, .(Population = sum(Population)), by = .(Year, Scenario, Stocking, Lifespan, PlotLife, Sex == "Male" | Sex == "Female")]

This code allowed me to aggregate data by a condition, something that requires multiple steps of clunky code in base R. Even if I had learned nothing else, this one line of code would make my entire DataCamp experience worthwhile!
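Here is a stripped-down sketch of the same idea on made-up data, next to one possible base R version. The toy table, the "Unknown" level, and the KnownSex name are assumptions for illustration; my real data has more grouping columns.

```r
library(data.table)

# Hypothetical toy data; the "Unknown" level is an assumption so that
# the condition actually splits the rows into two groups.
toy <- data.table(
  Year       = c(1, 1, 1, 2, 2, 2),
  Sex        = c("Male", "Female", "Unknown", "Male", "Female", "Unknown"),
  Population = c(10, 12, 3, 11, 13, 4)
)

# data.table: group by Year and by a logical expression in one call.
# Naming the expression (KnownSex) makes the output column explicit.
totals <- toy[, .(Population = sum(Population)),
              by = .(Year, KnownSex = Sex == "Male" | Sex == "Female")]

# Base R: the condition has to be stored as a column first,
# then passed to aggregate().
toy_df <- as.data.frame(toy)
toy_df$KnownSex <- toy_df$Sex %in% c("Male", "Female")
totals_base <- aggregate(Population ~ Year + KnownSex,
                         data = toy_df, FUN = sum)
```

Both give one row per Year and condition value, but the data.table version never needs the temporary column.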
